Sean Story created TIKA-2475:
--------------------------------

             Summary: discrepancy between CharsetDetector APIs
                 Key: TIKA-2475
                 URL: https://issues.apache.org/jira/browse/TIKA-2475
             Project: Tika
          Issue Type: Bug
    Affects Versions: 1.16, 1.15, 1.14
         Environment: Mac OSX 10.12.6, Java 1.8.0_111
            Reporter: Sean Story


h3. Problem
I ran into this while using CharsetDetector to detect the charset of email 
attachments when the mail client doesn't specify one. This worked for us in 
Tika 1.10, but the behavior changed after a recent upgrade to 1.14. 
The attached sample file (multi-language.txt) is encoded as ISO-8859-1 and was 
detected as such with Tika 1.10. After we updated our Tika dependency, this 
sample data (a mix of English, Portuguese, and Spanish) came out as a lot of 
junk Chinese characters. On inspection, this turned out to be because our usage 
of the newer Tika version detected the file as UTF-16LE instead of ISO-8859-1.
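
The junk output is consistent with single-byte text being decoded as UTF-16LE: each pair of Latin bytes is read as one 16-bit code unit, and pairs of ASCII letters tend to land in the CJK range. The following is a minimal, stdlib-only sketch of that effect. The class name {{MisdecodeDemo}}, its helper methods, and the sample string are illustrative assumptions; they are not taken from Tika or from the attached file.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class (not part of Tika): shows what happens when
// ISO-8859-1 bytes are decoded with the wrong charset.
public class MisdecodeDemo {

    // Decode raw bytes using the given charset.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    // Count code points that Java classifies as Han (Chinese) script.
    static long countHan(String s) {
        return s.codePoints()
                .filter(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN)
                .count();
    }

    public static void main(String[] args) {
        // Illustrative sample text; \u00f1 is the Spanish "n with tilde".
        byte[] raw = "Bom dia, se\u00f1or".getBytes(StandardCharsets.ISO_8859_1);

        String right = decode(raw, StandardCharsets.ISO_8859_1);
        String wrong = decode(raw, StandardCharsets.UTF_16LE);

        System.out.println("as ISO-8859-1: " + right);
        System.out.println("as UTF-16LE:   " + wrong);
        System.out.println("Han code points in wrong decoding: " + countHan(wrong));
    }
}
```

Decoding with the correct charset round-trips the text, while the UTF-16LE decoding halves the character count and turns several byte pairs into Han characters.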

Below is a Spock test that demonstrates the issue:
{noformat}
    def "test charset detection on multilingual file"(){
        setup:
        def file = new File("src/test/resources/data/multi-language.txt")

        when: "using the InputStream api"
        def detector = new CharsetDetector()
        detector.setText(file.newInputStream())
        def fileCharSet = detector.detect()

        then: "successfully detects the charset"
        fileCharSet.name.startsWith("ISO")

        when: "using the byte[] api, and munging the input"
        detector = new CharsetDetector()
        detector.setText(file.newInputStream().bytes)
        detector.MungeInput()
        fileCharSet = detector.detect()

        then: "successfully detects the charset"
        fileCharSet.name.startsWith("ISO")

        when: "using the byte[] api alone"
        detector = new CharsetDetector()
        detector.setText(file.newInputStream().bytes)
        fileCharSet = detector.detect()

        then: "this will fail - detects UTF-16LE instead"
        fileCharSet.name.startsWith("ISO")
    }
{noformat}

As the test above shows, I believe the issue is that CharsetDetector's 
{{setText()}} overloads do not delegate to one another: the 
{{setText(InputStream)}} path appears to call {{MungeInput()}}, while the 
{{setText(byte[])}} path does not.
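
To illustrate the kind of delegation described above, here is a minimal sketch in which the stream overload funnels into the byte[] overload so that both run the same preprocessing step. {{DelegatingDetectorSketch}}, {{preprocess()}}, and {{highBitCount()}} are hypothetical names for a toy class; this is not Tika's CharsetDetector, and the preprocessing shown is only a stand-in for whatever {{MungeInput()}} actually does.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical toy class, NOT Tika's CharsetDetector. It only shows two
// setText overloads sharing a single code path.
public class DelegatingDetectorSketch {
    private byte[] input = new byte[0];
    private int highBitCount = 0;

    // Single entry point: every overload ends up here, so the
    // preprocessing step can never be skipped by accident.
    public DelegatingDetectorSketch setText(byte[] raw) {
        input = raw.clone();
        preprocess();
        return this;
    }

    // The stream overload only reads the bytes, then delegates.
    public DelegatingDetectorSketch setText(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return setText(buffer.toByteArray());
    }

    // Stand-in preprocessing (an assumption for illustration): count
    // bytes with the high bit set, i.e. non-ASCII bytes.
    private void preprocess() {
        highBitCount = 0;
        for (byte b : input) {
            if ((b & 0x80) != 0) {
                highBitCount++;
            }
        }
    }

    public int highBitCount() {
        return highBitCount;
    }
}
```

With this shape, feeding the same bytes through either overload leaves the object in the same state, which is the property the failing third case in the test above is missing.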



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
