[ 
https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16698497#comment-16698497
 ] 

Hans Brende commented on TIKA-2038:
-----------------------------------

As a sanity check on my part, I wanted to make sure that every possible 
4-byte sequence returns the same result whether run through jchardet's full 
nsDetector or through jchardet's nsUTF8Verifier in isolation. So I created 
the following test case:

{code:java}
@Test
public void testJchardet() {
    long valid = 0;
    long invalid = 0;
    nsUTF8Verifier v = new nsUTF8Verifier();

    // exhaustively enumerate all 2^32 possible 4-byte sequences
    for (int b0 = 0; b0 < 256; b0++) {
        for (int b1 = 0; b1 < 256; b1++) {
            for (int b2 = 0; b2 < 256; b2++) {
                for (int b3 = 0; b3 < 256; b3++) {
                    byte[] bytes = {(byte)b0, (byte)b1, (byte)b2, (byte)b3};
                    boolean validExpected = isValidNsDetector(bytes);

                    org.junit.Assert.assertEquals(validExpected, isValidNsUtf8Verifier(bytes, v));

                    if (validExpected) {
                        valid++;
                    } else {
                        invalid++;
                    }
                }
            }
        }
    }

    System.out.println("Success! valid: " + valid + "; invalid: " + invalid);
}

private static boolean isValidNsUtf8Verifier(byte[] bytes, nsUTF8Verifier v) {
    byte state = 0;
    for (byte b : bytes) {
        state = nsUTF8Verifier.getNextState(v, b, state);
    }
    return state != 1; // state 1 is nsVerifier's error state (eError)
}

private static boolean isValidNsDetector(byte[] bytes) {
    nsDetector det = new nsDetector(nsDetector.ALL);
    det.DoIt(bytes, bytes.length, false);
    det.DataEnd();
    return "UTF-8".equalsIgnoreCase(det.getProbableCharsets()[0]);
}
{code}

I wouldn't recommend running this test, as it took 4 hours and 13 minutes to 
complete on my machine. But here are the results:

{noformat}
Success! valid: 544958257; invalid: 3750009039

Process finished with exit code 0
{noformat}

Obviously, this doesn't prove my earlier claim in general, since real byte 
sequences will almost always be longer than 4 bytes, but it does add a sanity 
check of sorts (and it does prove the claim for all 4-byte sequences, at 
least ;)). Reassuringly, the two counts sum to 4,294,967,296 = 2^32, so every 
possible 4-byte sequence was in fact covered.
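Not part of any patch, just for reference: a standalone UTF-8 well-formedness check per RFC 3629 could serve as a third oracle, independent of jchardet, in a test like the one above. This is my own sketch; the class and method names (Utf8Check, isValidUtf8) are hypothetical and not part of jchardet:

{code:java}
public class Utf8Check {

    /** Returns true iff bytes is well-formed UTF-8 per RFC 3629. */
    static boolean isValidUtf8(byte[] bytes) {
        int i = 0;
        while (i < bytes.length) {
            int b = bytes[i] & 0xFF;
            int len;
            if (b < 0x80) { i++; continue; }   // ASCII, single byte
            else if (b < 0xC2) return false;   // continuation byte or overlong lead (C0/C1)
            else if (b < 0xE0) len = 2;        // U+0080..U+07FF
            else if (b < 0xF0) len = 3;        // U+0800..U+FFFF
            else if (b < 0xF5) len = 4;        // U+10000..U+10FFFF
            else return false;                 // F5..FF are never valid leads
            if (i + len > bytes.length) return false; // truncated sequence

            // decode the code point while checking continuation bytes (10xxxxxx)
            int cp = b & (0x3F >> (len - 1));
            for (int j = 1; j < len; j++) {
                int c = bytes[i + j] & 0xFF;
                if ((c & 0xC0) != 0x80) return false;
                cp = (cp << 6) | (c & 0x3F);
            }

            // reject overlong forms, UTF-16 surrogates, and values above U+10FFFF
            if (len == 3 && cp < 0x800) return false;
            if (len == 4 && (cp < 0x10000 || cp > 0x10FFFF)) return false;
            if (cp >= 0xD800 && cp <= 0xDFFF) return false;
            i += len;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] ok = "héllo".getBytes(java.nio.charset.StandardCharsets.UTF_8);
        System.out.println(isValidUtf8(ok));                              // prints true
        System.out.println(isValidUtf8(new byte[]{(byte)0xC0, (byte)0xAF})); // overlong '/': prints false
    }
}
{code}

An assertion comparing this against both jchardet paths inside the inner loop above would turn the pairwise check into a three-way one, at the cost of a somewhat longer run.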

> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
>                 Key: TIKA-2038
>                 URL: https://issues.apache.org/jira/browse/TIKA-2038
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, detector
>            Reporter: Shabanali Faghani
>            Priority: Minor
>         Attachments: comparisons_20160803b.xlsx, comparisons_20160804.xlsx, 
> iust_encodings.zip, lang-wise-eval_results.zip, lang-wise-eval_runnable.zip, 
> lang-wise-eval_source_code.zip, proposedTLDSampling.csv, 
> tika_1_14-SNAPSHOT_encoding_detector.zip, tld_text_html.xlsx, 
> tld_text_html_plus_H_column.xlsx
>
>
> Currently, Tika uses icu4j for detecting the charset encoding of HTML 
> documents as well as other natural-text documents. But the accuracy of 
> encoding detector tools, including icu4j, on HTML documents is meaningfully 
> lower than on other text documents. Hence, in our project I developed a 
> library that works pretty well for HTML documents, which is available here: 
> https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within other Apache projects such as 
> Nutch, Lucene, Solr, etc., and these projects deal heavily with HTML 
> documents, it seems that having such a facility in Tika would also help 
> them become more accurate.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
