[ https://issues.apache.org/jira/browse/TIKA-2475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Allison resolved TIKA-2475.
-------------------------------
Resolution: Fixed
Thank you!
Note: If you want to be the second person in the world to try tika-eval, you
can easily identify diffs in output between different versions of Tika. See:
https://wiki.apache.org/tika/TikaEval
> discrepancy between CharsetDetector APIs
> ----------------------------------------
>
> Key: TIKA-2475
> URL: https://issues.apache.org/jira/browse/TIKA-2475
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.14, 1.15, 1.16
> Environment: Mac OSX 10.12.6, Java 1.8.0_111
> Reporter: Sean Story
> Attachments: multi-language.txt
>
>
> h3. Problem
> I ran into this while using CharsetDetector to detect the charset of email
> attachments when the mail client doesn't specify one. This worked for us in
> Tika 1.10, but the behavior changed after a recent upgrade to 1.14. The
> attached sample file (multi-language.txt) is ISO-8859-1 and was detected as
> such with Tika 1.10. After we updated our Tika dependency, this sample data
> (a mix of English, Portuguese, and Spanish) came out as a lot of junk
> Chinese characters. On inspection, this turned out to be because our usage
> of the newer Tika version detected the file as UTF-16LE instead of
> ISO-8859-1.
> Below is a Spock test that demonstrates the issue:
> {noformat}
> def "test charset detection on multilingual file"() {
>     setup:
>     def file = new File("src/test/resources/data/multi-language.txt")
>
>     when: "using the InputStream api"
>     def detector = new CharsetDetector()
>     detector.setText(file.newInputStream())
>     def fileCharSet = detector.detect()
>
>     then: "successfully detects the charset"
>     fileCharSet.name.startsWith("ISO")
>
>     when: "using the byte[] api, and munging the input"
>     detector = new CharsetDetector()
>     detector.setText(file.newInputStream().bytes)
>     detector.MungeInput()
>     fileCharSet = detector.detect()
>
>     then: "successfully detects the charset"
>     fileCharSet.name.startsWith("ISO")
>
>     when: "using the byte[] api alone"
>     detector = new CharsetDetector()
>     detector.setText(file.newInputStream().bytes)
>     fileCharSet = detector.detect()
>
>     then: "this will fail - detects UTF-16LE instead"
>     fileCharSet.name.startsWith("ISO")
> }
> {noformat}
> As the test above shows, I believe the issue is that CharsetDetector's
> {{setText()}} overloads do not delegate to one another: the InputStream
> overload calls {{MungeInput()}}, and the byte[] overload does not.
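
The delegation idea described in the report can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Tika's actual code or the fix that was applied: the class DelegatingDetectorSketch, its prepare() step, and the MAX_PREPARED limit are all hypothetical stand-ins. The only point is that when the InputStream overload delegates to the byte[] overload, both entry points run the same preprocessing.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

/** Hypothetical stand-in for a detector; NOT Tika's CharsetDetector. */
class DelegatingDetectorSketch {
    // Hypothetical cap on how many bytes the preprocessing step keeps.
    private static final int MAX_PREPARED = 4096;

    private byte[] prepared = new byte[0];

    /** Single entry point: every overload ends up here. */
    DelegatingDetectorSketch setText(byte[] in) {
        prepared = prepare(in);
        return this;
    }

    /** Reads the stream fully, then delegates to the byte[] overload. */
    DelegatingDetectorSketch setText(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[1024];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return setText(buffer.toByteArray());
    }

    /** Placeholder preprocessing; it does not reproduce MungeInput(). */
    private static byte[] prepare(byte[] raw) {
        return Arrays.copyOf(raw, Math.min(raw.length, MAX_PREPARED));
    }

    byte[] prepared() {
        return prepared;
    }

    public static void main(String[] args) throws IOException {
        // "olá" encoded as ISO-8859-1 bytes.
        byte[] data = {'o', 'l', (byte) 0xE1};
        byte[] viaBytes = new DelegatingDetectorSketch().setText(data).prepared();
        byte[] viaStream = new DelegatingDetectorSketch()
                .setText(new ByteArrayInputStream(data)).prepared();
        System.out.println("same preparation: " + Arrays.equals(viaBytes, viaStream));
    }
}
```

With this shape, a caller cannot skip the preprocessing step by choosing one overload over the other, which is the property the third case in the test above expects.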