Steven Van Ingelgem created TIKA-2843:
-----------------------------------------
Summary: Question about strange characters in the output
Key: TIKA-2843
URL: https://issues.apache.org/jira/browse/TIKA-2843
Project: Tika
Issue Type: Bug
Affects Versions: 1.20
Reporter: Steven Van Ingelgem
I started my server like this: "java -jar tika-server-1.20.jar -server"
I was extracting text from a RAR file and noticed that a lot of unreadable,
binary-looking output was included.
The archive contains binaries (.so), .class and .java files (as well as
some additional resources like png, html & certificates).
The problem is that I can consistently reproduce it with that rar file, but I
cannot share it.
So I tried to reproduce it with a file I can share, the Tika server jar itself:
"curl -T tika-server-1.20.jar localhost:9998/tika --header "Accept: text/plain"
> out.txt"
This shows the same problem, though not as badly as with the RAR file:
{code}
schemaorg_apache_xmlbeans/system/sD023D6490046BA0250A839A9AD24C443/agautoformatattributegroup.xsb
Úzº¾�����������9[http://schemas.openxmlformats.org/spreadsheetml/2006/main]�^MAG_AutoFormat��unqualified�8<xsd:attributeGroup
name="AG_AutoFormat"
xmlns="[http://schemas.openxmlformats.org/spreadsheetml/2006/main]"
xmlns:xsd="[http://www.w3.org/2001/XMLSchema]">
<xsd:attribute name="autoFormatId" type="xsd:unsignedInt">
<xsd:annotation>
{code}
My first question is: why is this output at all?
Some tests with the RAR file (not the Tika jar) showed me that each file is
extracted properly when submitted separately (meaning: I get proper text).
In addition, when I delete files from the RAR, some files that were not
extracted properly before are extracted properly.
Furthermore, I noticed a distinctive byte pattern: EF BF BD (which appears to
be the UTF-8 encoding of U+FFFD, the Unicode replacement character).
But it does not happen with every RAR: as a test I downloaded the "br" dump of
Wikimedia, packed it into a RAR, and ran "/rmeta/text" on it, and that
extracted properly.
So I'm guessing some kind of buffer overflowing into the next text-extraction?
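
To illustrate the EF BF BD pattern mentioned above, here is a minimal Python
sketch. It is not Tika code, and the sample bytes are made up. It only shows
that decoding arbitrary binary data as UTF-8 with replacement yields U+FFFD,
which is written out again as the bytes EF BF BD. That would be consistent
with binary content being treated as text somewhere, but it does not prove
where that happens.

```python
# Made-up binary bytes that are not valid UTF-8 (an assumption for
# illustration, not taken from the actual archive).
binary = b"\xda\x7a\xba\xbe\xff\xfe"

# Decoding with errors="replace" substitutes U+FFFD for each invalid byte.
text = binary.decode("utf-8", errors="replace")

# Writing the text back out as UTF-8 turns each U+FFFD into EF BF BD.
encoded = text.encode("utf-8")

print(text.count("\ufffd"), "replacement characters")
print(encoded.hex(" "))
```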
What could I do to debug this more in-depth and/or provide you with some more
info so you could tackle this bug?
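
One thing I could do to narrow it down is count the replacement characters
per embedded file in the "/rmeta/text" output. The sketch below is a rough
idea, not a definitive tool. It assumes the response is a JSON list of
metadata objects and that the keys "X-TIKA:content" and
"X-TIKA:embedded_resource_path" are the ones the server emits (please correct
me if they differ). The function name, the threshold and the sample data are
hypothetical.

```python
import json

REPLACEMENT = "\ufffd"  # U+FFFD, encoded in UTF-8 as EF BF BD


def suspicious_entries(rmeta_json, threshold=0.01):
    """Return (path, count, ratio) for entries with many U+FFFD characters.

    Assumes rmeta_json is a JSON list of objects using the key names
    "X-TIKA:content" and "X-TIKA:embedded_resource_path" (an assumption).
    """
    results = []
    for entry in json.loads(rmeta_json):
        content = entry.get("X-TIKA:content") or ""
        # The container itself usually has no embedded path.
        path = entry.get("X-TIKA:embedded_resource_path", "(container)")
        count = content.count(REPLACEMENT)
        ratio = count / len(content) if content else 0.0
        if ratio > threshold:
            results.append((path, count, ratio))
    return results


# Made-up sample standing in for a real /rmeta/text response.
sample = json.dumps([
    {"X-TIKA:content": "plain readable text"},
    {"X-TIKA:embedded_resource_path": "/a.xsb",
     "X-TIKA:content": "\ufffd\ufffd\ufffdAG_AutoFormat"},
])
report = suspicious_entries(sample)
for path, count, ratio in report:
    print(path, count, round(ratio, 3))
```

Running this against the real response for the problematic RAR should show
whether the garbage is tied to specific embedded files or spread across
neighbouring entries.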