[
https://issues.apache.org/jira/browse/TIKA-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14330017#comment-14330017
]
Tyler Palsulich commented on TIKA-1437:
---------------------------------------
[~Lukeliush], can you make a couple of updates to make this easier to test?
First, come up with a small (few-line) file that reproduces this problem. That
way, we can be sure we can legally include the file within Tika. Also, can you
reformat your testing script as a Tika JUnit TestCase? You can see an example
[here|https://github.com/apache/tika/blob/trunk/tika-parsers/src/test/java/org/apache/tika/parser/txt/CharsetDetectorTest.java].
The file you have might just be corrupted, which would explain the different
results. And, as Tim mentioned, no detector will be perfect, so different
detectors will give different results. But the above changes will help us
narrow it down. Thanks!
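
A minimal sketch of how such a few-line file could be generated, using only the
Java standard library. The class name, temp-file prefix, and row contents are
made up for illustration. Whether a file this small actually triggers the
misdetection reported here is not known and would need to be checked against
the detectors.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Builds a tiny TSV fixture encoded as ISO-8859-1 (hypothetical content). */
final class Iso88591Fixture {

    // Made-up rows. U+00E9 (e-acute) and U+00EF (i-diaeresis) are written as
    // escapes so the source compiles regardless of the source-file encoding.
    static final String CONTENT =
            "id\ttitle\n"
            + "1\tcaf\u00e9\n"
            + "2\tna\u00efve\n";

    /** Encodes the rows as ISO-8859-1: one byte per character. */
    static byte[] fixtureBytes() {
        return CONTENT.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) throws IOException {
        // Write the fixture to a temp file so it can be fed to a detector.
        Path out = Files.createTempFile("tika-1437-", ".tsv");
        Files.write(out, fixtureBytes());
        System.out.println("Wrote " + Files.size(out) + " bytes to " + out);
    }
}
```

Because the content is written by hand rather than copied from the original
data, a file like this should be safe to attach and redistribute.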
> encoding issue in AutoDetectReader
> ----------------------------------
>
> Key: TIKA-1437
> URL: https://issues.apache.org/jira/browse/TIKA-1437
> Project: Tika
> Issue Type: Bug
> Components: detector, parser
> Affects Versions: 1.6
> Environment: Windows 8
> Reporter: Luke sh
> Priority: Critical
> Attachments: EncodingProblem.java, computrabajo-ar-20121108.tsv,
> e9.jpg, ef.jpg
>
>
> We are having an encoding problem with Tika's AutoDetectReader.
> We use AutoDetectReader to read a stream and extract string values by
> calling AutoDetectReader.readLine(). We find that the encoding problem
> occurs in UniversalEncodingDetector, which AutoDetectReader calls when
> reading the input stream passed as one of the arguments to our
> TSVParser’s parse method.
> We use AutoDetectReader in our parser and believed it could auto-detect the
> correct encoding of the input stream passed to it, but we are seeing several
> garbled characters in the converted files that our parser outputs. We found
> that the problem occurs in UniversalEncodingDetector, which returns UTF-8,
> so AutoDetectReader reads the stream as UTF-8. That encoding is incorrect;
> the correct encoding is ISO-8859-1.
> I am attaching screenshots of the character differences we see between the
> input TSV file and the converted/output file. They are e9.jpg and ef.jpg;
> please read the description for details.
> The problem is that AutoDetectReader is decoding and reading the characters
> with an incorrect encoding.
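
The symptom described above can be reproduced in isolation with plain Java,
without Tika. In this sketch the class and method names are illustrative. The
bytes 0xE9 and 0xEF are the ISO-8859-1 encodings of é and ï; that the e9.jpg
and ef.jpg attachment names refer to these bytes is an assumption.

```java
import java.nio.charset.StandardCharsets;

/** Shows how ISO-8859-1 bytes turn into garbage when decoded as UTF-8. */
final class MisdecodeDemo {

    static String decodeAsUtf8(byte[] bytes) {
        // new String(...) silently replaces malformed input with U+FFFD.
        return new String(bytes, StandardCharsets.UTF_8);
    }

    static String decodeAsLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "cafe naive" with accents, encoded as ISO-8859-1 (0xE9 and 0xEF).
        byte[] latin1 = "caf\u00e9 na\u00efve".getBytes(StandardCharsets.ISO_8859_1);

        System.out.println("as ISO-8859-1: " + decodeAsLatin1(latin1));
        System.out.println("as UTF-8:      " + decodeAsUtf8(latin1));
    }
}
```

A lone 0xE9 or 0xEF byte is the lead byte of a multi-byte UTF-8 sequence, so
when an ASCII byte follows it the decoder treats it as malformed and emits the
replacement character instead of the accented letter.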
> BTW, we were able to work around this problem with CharsetDetector, which
> for the moment seems to produce a valid encoding that we can use to read
> the TSV file properly.
> However, the problem is that we cannot use AutoDetectReader; we have to
> create our own TSVAutoDetectReader that uses CharsetDetector in its detect
> method. The AutoDetectReader class is not flexible enough for us to extend:
> many of its methods are private, so we cannot manually set the encoding or
> override the existing encoding-detection implementation.
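
As an illustration of one way a custom reader could choose a charset, the
sketch below validates the bytes as strict UTF-8 and falls back to ISO-8859-1
when validation fails. This is not what Tika's CharsetDetector or
UniversalEncodingDetector do; it is a standard-library-only heuristic with an
assumed fallback charset, and the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Picks UTF-8 only if the bytes are well-formed UTF-8; otherwise ISO-8859-1. */
final class StrictUtf8Fallback {

    static Charset chooseCharset(byte[] bytes) {
        try {
            // REPORT makes the decoder throw on malformed input instead of
            // silently substituting U+FFFD.
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return StandardCharsets.UTF_8;
        } catch (CharacterCodingException e) {
            // Assumed fallback: every byte sequence is valid ISO-8859-1.
            return StandardCharsets.ISO_8859_1;
        }
    }

    static String decode(byte[] bytes) {
        return new String(bytes, chooseCharset(bytes));
    }

    public static void main(String[] args) {
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        System.out.println(chooseCharset(latin1) + " / " + chooseCharset(utf8));
    }
}
```

The fallback never fails, since ISO-8859-1 maps every byte to a character, but
it will be wrong for files in other legacy encodings, so this only covers the
UTF-8 versus ISO-8859-1 case discussed in this issue.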
> In addition, I am not confident in CharsetDetector either, as I am seeing
> different encodings produced by CharsetDetector and AutoDetectReader for
> different TSV files. But for now we can live with CharsetDetector, as it
> solves the current encoding problem.
> Finally, I am also attaching my test program (PFA: EncodingProblem.java),
> which reads an input directory of TSV files and displays, for each file,
> the encodings produced by AutoDetectReader, UniversalEncodingDetector (which
> is called by AutoDetectReader), and CharsetDetector, so you can see the
> difference: they produce different encodings for some TSV files.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)