[jira] [Resolved] (TIKA-2475) discrepancy between CharsetDetector APIs

Tim Allison (JIRA) Wed, 11 Oct 2017 06:22:26 -0700

     [ 
https://issues.apache.org/jira/browse/TIKA-2475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Tim Allison resolved TIKA-2475.
-------------------------------
    Resolution: Fixed

Thank you!  

Note: If you want to be the second person in the world to try tika-eval, you 
can easily identify diffs in output between different versions of Tika.  See: 
https://wiki.apache.org/tika/TikaEval

> discrepancy between CharsetDetector APIs
> ----------------------------------------
>
>                 Key: TIKA-2475
>                 URL: https://issues.apache.org/jira/browse/TIKA-2475
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.14, 1.15, 1.16
>         Environment: Mac OSX 10.12.6, Java 1.8.0_111
>            Reporter: Sean Story
>         Attachments: multi-language.txt
>
>
> h3. Problem
> I ran into this while using CharsetDetector to detect the charset of email 
> attachments when the mail client doesn't specify one. This worked for us in 
> Tika 1.10, but the behavior changed after our recent upgrade to 1.14. The 
> attached sample file (multi-language.txt), a mix of English, Portuguese, and 
> Spanish text, is encoded as ISO-8859-1 and was detected as such with Tika 
> 1.10. After the upgrade, its content came out as mostly junk Chinese 
> characters. On inspection, the cause was that, with the newer Tika 
> dependency, our code detected the file as UTF-16LE instead of ISO-8859-1.
> Below is a Spock test that demonstrates the issue:
> {noformat}
>     def "test charset detection on multilingual file"(){
>         setup:
>         def file = new File("src/test/resources/data/multi-language.txt")
>         when: "using the InputStream api"
>         def detector = new CharsetDetector()
>         detector.setText(file.newInputStream())
>         def fileCharSet = detector.detect()
>         then: "successfully detects the charset"
>         fileCharSet.name.startsWith("ISO")
>         when: "using the byte[] api, and munging the input"
>         detector = new CharsetDetector()
>         detector.setText(file.newInputStream().bytes)
>         detector.MungeInput()
>         fileCharSet = detector.detect()
>         then: "successfully detects the charset"
>         fileCharSet.name.startsWith("ISO")
>         when: "using the byte[] api alone"
>         detector = new CharsetDetector()
>         detector.setText(file.newInputStream().bytes)
>         fileCharSet = detector.detect()
>         then: "this will fail - detects UTF-16LE instead"
>         fileCharSet.name.startsWith("ISO")
>     }
> {noformat}
> As the test above shows, I believe the issue is that CharsetDetector's 
> {{setText()}} overloads do not delegate to one another: the InputStream 
> overload calls {{MungeInput()}} and the byte[] overload does not.
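
One way to avoid this kind of divergence between overloads can be sketched in Java as follows. This is only an illustration under stated assumptions: DelegatingDetectorSketch is a hypothetical class, not Tika's CharsetDetector, and this is not necessarily how TIKA-2475 was fixed. The prepare() method merely stands in for MungeInput() by computing a byte histogram, and the 8192-byte read limit is an assumed value. The point is that the InputStream overload reads its bytes and then delegates to the byte[] overload, so both entry points share the same preprocessing step.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch; not Tika's CharsetDetector.
public class DelegatingDetectorSketch {
    // Assumed cap on how many bytes are examined.
    private static final int MAX_BYTES = 8192;

    private byte[] raw = new byte[0];
    private int[] byteStats = new int[256];
    private boolean prepared = false;

    // Single place where input is stored and preprocessed.
    public DelegatingDetectorSketch setText(byte[] in) {
        raw = Arrays.copyOf(in, Math.min(in.length, MAX_BYTES));
        prepare();
        return this;
    }

    // Reads up to MAX_BYTES, then delegates to the byte[] overload
    // so that both overloads run the same preprocessing.
    public DelegatingDetectorSketch setText(InputStream in) throws IOException {
        byte[] buf = new byte[MAX_BYTES];
        int total = 0;
        int n;
        while (total < MAX_BYTES
                && (n = in.read(buf, total, MAX_BYTES - total)) > 0) {
            total += n;
        }
        return setText(Arrays.copyOf(buf, total));
    }

    // Stand-in for MungeInput(): counts occurrences of each byte value.
    private void prepare() {
        byteStats = new int[256];
        for (byte b : raw) {
            byteStats[b & 0xFF]++;
        }
        prepared = true;
    }

    public boolean isPrepared() {
        return prepared;
    }

    public int[] byteStats() {
        return byteStats.clone();
    }

    public static void main(String[] args) throws IOException {
        byte[] sample = "Ol\u00e1, se\u00f1or".getBytes(StandardCharsets.ISO_8859_1);
        DelegatingDetectorSketch viaBytes =
                new DelegatingDetectorSketch().setText(sample);
        DelegatingDetectorSketch viaStream =
                new DelegatingDetectorSketch().setText(new ByteArrayInputStream(sample));
        // Both entry points should yield identical preprocessing results.
        System.out.println(Arrays.equals(viaBytes.byteStats(), viaStream.byteStats()));
    }
}
```

With this shape, a caller gets the same statistics whether it passes a byte[] or an InputStream, which is the property the third case in the Spock test above checks for in the real detector.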



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
