[jira] [Commented] (TIKA-2475) discrepancy between CharsetDetector APIs

Hudson (JIRA) Wed, 11 Oct 2017 08:11:36 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-2475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16200426#comment-16200426
 ]


Hudson commented on TIKA-2475:
------------------------------

FAILURE: Integrated in Jenkins build Tika-trunk #1378 (See 
[https://builds.apache.org/job/Tika-trunk/1378/])
fix for TIKA-2475 contributed by seanstory (sean.story: 
[https://github.com/apache/tika/commit/1f38be359b2735de94e9ae5850a3622e5393f77b])
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/txt/CharsetDetectorTest.java
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/txt/CharsetDetector.java
* (add) tika-parsers/src/test/resources/test-documents/multi-language.txt
TIKA-2475 mods and some new tests/cleanup for CharsetDetector. This (tallison: 
[https://github.com/apache/tika/commit/94850f2e7c7d3df6a06a924fc6d643c0f6181643])
* (edit) 
tika-parsers/src/test/java/org/apache/tika/parser/txt/CharsetDetectorTest.java
* (edit) 
tika-parsers/src/main/java/org/apache/tika/parser/txt/CharsetDetector.java
* (edit) CHANGES.txt


> discrepancy between CharsetDetector APIs
> ----------------------------------------
>
>                 Key: TIKA-2475
>                 URL: https://issues.apache.org/jira/browse/TIKA-2475
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.14, 1.15, 1.16
>         Environment: Mac OSX 10.12.6, Java 1.8.0_111
>            Reporter: Sean Story
>         Attachments: multi-language.txt
>
>
> h3. Problem
> I ran into this while using CharsetDetector to detect the charsets of 
> email attachments when the mail client doesn't specify one. This worked for 
> us in Tika 1.10, but after a recent upgrade to 1.14 the behavior changed. 
> The attached sample file (multi-language.txt), a mix of English, Portuguese, 
> and Spanish text encoded as ISO-8859-1, was detected as such with Tika 1.10. 
> After we updated our Tika dependency, the same data came out as a run of 
> junk Chinese characters. On inspection, this was because our usage of the 
> newer Tika version detected the file as UTF-16LE instead of ISO-8859-1.
> Below is a Spock test that demonstrates the issue:
> {noformat}
>     def "test charset detection on multilingual file"(){
>         setup:
>         def file = new File("src/test/resources/data/multi-language.txt")
>         when: "using the InputStream api"
>         def detector = new CharsetDetector()
>         detector.setText(file.newInputStream())
>         def fileCharSet = detector.detect()
>         then: "successfully detects the charset"
>         fileCharSet.name.startsWith("ISO")
>         when: "using the byte[] api, and munging the input"
>         detector = new CharsetDetector()
>         detector.setText(file.newInputStream().bytes)
>         detector.MungeInput()
>         fileCharSet = detector.detect()
>         then: "successfully detects the charset"
>         fileCharSet.name.startsWith("ISO")
>         when: "using the byte[] api alone"
>         detector = new CharsetDetector()
>         detector.setText(file.newInputStream().bytes)
>         fileCharSet = detector.detect()
>         then: "this will fail - detects UTF-16LE instead"
>         fileCharSet.name.startsWith("ISO")
>     }
> {noformat}
> As the test above shows, I believe the issue is that CharsetDetector's 
> {{setText()}} overloads do not delegate to one another: the 
> {{InputStream}} overload calls {{MungeInput()}}, while the {{byte[]}} 
> overload does not.
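
The delegation idea described in the report can be sketched as follows. This is a minimal illustration only, not Tika's CharsetDetector and not the committed fix: the class names, the prepare() step, and the high-byte counter are all hypothetical stand-ins for whatever preprocessing the real detector performs. The point is that when one overload is the single entry point, the preprocessing step cannot be skipped by picking a different overload.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a detector with two setText() overloads.
// NOT Tika's CharsetDetector; names and logic are illustrative assumptions.
class DelegationSketch {

    static class Detector {
        private byte[] raw = new byte[0];
        private int highByteCount;   // stand-in for statistics gathered during preprocessing
        private boolean prepared;

        // Single entry point: every overload ends up here, so prepare() always runs.
        Detector setText(byte[] in) {
            raw = in.clone();
            prepare();
            return this;
        }

        // Reads the stream fully, then delegates to the byte[] overload.
        Detector setText(InputStream in) {
            try {
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                byte[] chunk = new byte[4096];
                int n;
                while ((n = in.read(chunk)) != -1) {
                    buf.write(chunk, 0, n);
                }
                return setText(buf.toByteArray());
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }

        // Hypothetical preprocessing: count bytes outside the 7-bit ASCII range.
        private void prepare() {
            highByteCount = 0;
            for (byte b : raw) {
                if ((b & 0xFF) >= 0x80) {
                    highByteCount++;
                }
            }
            prepared = true;
        }

        boolean isPrepared() { return prepared; }
        int highByteCount() { return highByteCount; }
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file ASCII-only.
        byte[] sample = "a\u00e7\u00e3o se\u00f1or".getBytes(StandardCharsets.ISO_8859_1);

        Detector fromBytes = new Detector().setText(sample);
        Detector fromStream = new Detector().setText(new ByteArrayInputStream(sample));

        // Both paths ran the same preprocessing, so their state agrees.
        System.out.println(fromBytes.highByteCount() == fromStream.highByteCount());
    }
}
```

With this shape, a caller using either overload gets identically prepared input, which is the property the Spock test above checks for the real API.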



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
