[ 
https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16692478#comment-16692478
 ] 

Hans Brende commented on TIKA-2038:
-----------------------------------

Oh, and one small detail I forgot to mention: jchardet also counted the 
following control characters as "illegal" for the purposes of UTF-8 detection: 
SHIFT OUT (0x0E), SHIFT IN (0x0F), and ESC (0x1B). (Although Tika counts 0x1B 
as a "safe control", so I'm not sure why that discrepancy exists. 
[[email protected]] can you shed any light on this?)

To retain exactly the same behavior as jchardet with regard to control 
characters, it would suffice to construct the {{Utf8Statistics}} instance as 
follows:

{code:java}
Utf8Statistics stats = new Utf8Statistics() {
    @Override
    public void handleCodePoint(int codePoint) {
        // Match jchardet: treat SHIFT OUT (0x0E), SHIFT IN (0x0F),
        // and ESC (0x1B) as decoding errors rather than valid input
        if (codePoint == 0x0E || codePoint == 0x0F || codePoint == 0x1B) {
            handleError();
        } else {
            super.handleCodePoint(codePoint);
        }
    }
};
{code}
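As a standalone illustration (independent of the jchardet and Tika classes, whose APIs are not shown here), the extra control-character rule amounts to counting those three bytes as errors while scanning the input. The class and method names below are hypothetical, chosen only for this sketch:

{code:java}
// Hypothetical standalone sketch of jchardet's extra rule: SHIFT OUT (0x0E),
// SHIFT IN (0x0F), and ESC (0x1B) count as "illegal" for UTF-8 detection.
public class ControlCharCheck {

    // True if this byte value is one jchardet rejects as an illegal control
    static boolean isJchardetIllegalControl(int b) {
        return b == 0x0E || b == 0x0F || b == 0x1B;
    }

    // Count how many such bytes occur in the input
    static int countIllegalControls(byte[] input) {
        int errors = 0;
        for (byte b : input) {
            if (isJchardetIllegalControl(b & 0xFF)) {
                errors++;
            }
        }
        return errors;
    }

    public static void main(String[] args) {
        byte[] sample = {0x41, 0x0E, 0x42, 0x1B, 0x43}; // 'A' SO 'B' ESC 'C'
        System.out.println(countIllegalControls(sample)); // prints 2
    }
}
{code}

A detector would then fold this error count into its scoring, exactly as the {{handleError()}} call does in the override above.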

> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
>                 Key: TIKA-2038
>                 URL: https://issues.apache.org/jira/browse/TIKA-2038
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, detector
>            Reporter: Shabanali Faghani
>            Priority: Minor
>         Attachments: comparisons_20160803b.xlsx, comparisons_20160804.xlsx, 
> iust_encodings.zip, lang-wise-eval_results.zip, lang-wise-eval_runnable.zip, 
> lang-wise-eval_source_code.zip, proposedTLDSampling.csv, 
> tika_1_14-SNAPSHOT_encoding_detector.zip, tld_text_html.xlsx, 
> tld_text_html_plus_H_column.xlsx
>
>
> Currently, Tika uses icu4j for detecting the charset encoding of HTML 
> documents as well as other plain text documents. But the accuracy of encoding 
> detector tools, including icu4j, on HTML documents is meaningfully lower than 
> on other text documents. Hence, in our project I developed a library that 
> works pretty well for HTML documents, which is available here: 
> https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within other Apache projects such as 
> Nutch, Lucene, Solr, etc., and these projects deal heavily with HTML 
> documents, it seems that having such a facility in Tika would help them 
> become more accurate as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
