[
https://issues.apache.org/jira/browse/TIKA-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Shuai Liu updated TIKA-1437:
----------------------------
Attachment: computrabajo-ar-20121108.tsv
This is the TSV file that triggers the encoding problem.
Please run the attached EncodingProblem.java to see the different encodings
produced by the different Tika encoding detection implementations.
> encoding issue in AutoDetectReader
> ----------------------------------
>
> Key: TIKA-1437
> URL: https://issues.apache.org/jira/browse/TIKA-1437
> Project: Tika
> Issue Type: Bug
> Components: detector, parser
> Affects Versions: 1.6
> Environment: Windows 8
> Reporter: Shuai Liu
> Priority: Critical
> Attachments: EncodingProblem.java, computrabajo-ar-20121108.tsv
>
>
> We are having an encoding problem with Tika's AutoDetectReader.
> We use AutoDetectReader to read a stream and extract string values by
> calling AutoDetectReader.readLine(). The problem occurs in
> UniversalEncodingDetector, which AutoDetectReader calls when reading the
> input stream passed as one of the arguments to our TSVParser’s parse
> method.
> We use AutoDetectReader in our parser and believed it could automatically
> detect the correct encoding of the input stream passed to it, but we are
> seeing several garbled characters in the converted output files from our
> parser. We found that the encoding problem occurs in
> UniversalEncodingDetector, which returns UTF-8; AutoDetectReader then
> reads the stream as UTF-8, which is incorrect. The correct encoding is
> ISO-8859-1.
> I am attaching a screenshot of what I am talking about. It shows the raw
> TSV file; you can see that the byte with hex code E9 appears as a character
> between "M" and "xico". I believe it is an accented 'e' (0xE9 is 'é' in
> ISO-8859-1).
> The problem is that AutoDetectReader decodes these characters with the
> incorrect encoding.
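
To illustrate the symptom described above, here is a minimal JDK-only sketch
(it does not use Tika). The class name EncodingDemo and the sample bytes for
"México" are assumptions made up for illustration; they are not taken from
the attached file. It decodes the same bytes, including the single byte 0xE9,
once as ISO-8859-1 and once as UTF-8:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingDemo {
    // Hypothetical sample: 'M', the single byte 0xE9, then "xico".
    static final byte[] SAMPLE = {0x4D, (byte) 0xE9, 0x78, 0x69, 0x63, 0x6F};

    // Decode the bytes with an explicitly chosen charset.
    // new String(byte[], Charset) replaces malformed input instead of throwing.
    static String decode(byte[] data, Charset charset) {
        return new String(data, charset);
    }

    public static void main(String[] args) {
        // Same bytes, two interpretations.
        System.out.println(decode(SAMPLE, StandardCharsets.ISO_8859_1));
        System.out.println(decode(SAMPLE, StandardCharsets.UTF_8));
    }
}
```

A lone 0xE9 followed by an ASCII byte is not a valid UTF-8 sequence, so the
UTF-8 decode substitutes the replacement character U+FFFD. That would be
consistent with the garbled characters described in the report.
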
> BTW, we were able to work around this problem with CharsetDetector, which
> for the moment seems to produce a valid encoding that we can use to read
> the TSV file properly.
> However, this means we cannot use AutoDetectReader; we have to create our
> own TSVAutoDetectReader that uses CharsetDetector in its detect method.
> The AutoDetectReader class is hard for us to extend: many of its methods
> are private, so we cannot manually set the encoding or override the
> existing implementation for detecting the encoding.
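
As a point of comparison for the workaround above, one JDK-only alternative
is to attempt a strict UTF-8 decode and fall back to a fixed charset when it
fails. This is only a sketch; it is not the algorithm used by Tika's
AutoDetectReader or by CharsetDetector. The class FallbackDecoder, its method
names, and the choice of ISO-8859-1 as the fallback are assumptions for
illustration:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class FallbackDecoder {
    // Return UTF-8 if the bytes are well-formed UTF-8, otherwise ISO-8859-1.
    // The ISO-8859-1 fallback is an assumption; it accepts any byte sequence.
    static Charset pickCharset(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return StandardCharsets.UTF_8;
        } catch (CharacterCodingException e) {
            // Strict decoding failed, so the input is not valid UTF-8.
            return StandardCharsets.ISO_8859_1;
        }
    }

    // Decode with whichever charset pickCharset selected.
    static String decode(byte[] data) {
        return new String(data, pickCharset(data));
    }

    public static void main(String[] args) {
        byte[] latin1 = {0x4D, (byte) 0xE9, 0x78, 0x69, 0x63, 0x6F};
        System.out.println(pickCharset(latin1) + ": " + decode(latin1));
    }
}
```

This approach needs the whole input in memory and cannot distinguish
ISO-8859-1 from other single-byte encodings, so it is only suitable when the
set of possible encodings is known in advance.
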
> In addition, I am not confident about CharsetDetector either, as I am
> seeing different encodings produced by CharsetDetector and AutoDetectReader
> for different TSV files. For now we can live with CharsetDetector, as it
> solves the current encoding problem.
> Finally, I am attaching my test program (EncodingProblem.java), which reads
> a directory of TSV files and displays, for each file in the directory, the
> encodings produced by AutoDetectReader, UniversalEncodingDetector (which is
> called by AutoDetectReader), and CharsetDetector, so you can see the
> differences: they produce different encodings for some TSV files.