[jira] [Commented] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents

Hans Brende (JIRA) Sun, 16 Dec 2018 10:22:32 -0800


    [ 
https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16722543#comment-16722543
 ]


Hans Brende commented on TIKA-2038:
-----------------------------------

[~faghani] Glad to hear that my hypothesis was correct, and that F8 is serving 
its intended purpose well! Note: I just released an improved version, 1.1-RC1, 
which should net you additional speed improvements. My own benchmarks show that 
the latest version is more than 100x faster than the way you were using 
nsDetector.

Also, the new utility method that you'll want to use is:

{code:java}
boolean valid = Utf8.isValidUpToTruncation(inputStream);
{code}
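
To illustrate what "valid up to truncation" means, here is a minimal sketch that uses only the JDK's CharsetDecoder. It is not F8's implementation and says nothing about F8's internals or speed; the class name Utf8PrefixCheck and the method isValidUtf8Prefix are hypothetical names chosen for this example. The idea is that a buffer cut off in the middle of a multi-byte sequence (as happens when only the first N bytes of a page are read) should still count as valid, while any malformed sequence should not.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of F8 or Tika.
final class Utf8PrefixCheck {

    // Returns true if the bytes are well-formed UTF-8, tolerating an
    // incomplete multi-byte sequence at the very end of the array.
    static boolean isValidUtf8Prefix(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(bytes);
        CharBuffer out = CharBuffer.allocate(1024); // scratch space, contents discarded

        while (true) {
            // endOfInput = false: a trailing partial sequence is reported as
            // UNDERFLOW (more input needed) rather than as malformed input.
            CoderResult result = decoder.decode(in, out, false);
            if (result.isError()) {
                return false; // malformed byte sequence somewhere in the input
            }
            if (result.isUnderflow()) {
                return true;  // everything decodable so far was valid
            }
            out.clear();      // OVERFLOW: scratch buffer full, reuse it and continue
        }
    }
}
```

With this sketch, the two bytes 0xE2 0x82 (the euro sign with its last byte cut off) are accepted, whereas 0xE2 0x82 followed by an ASCII byte, or the overlong sequence 0xC0 0x80, are rejected.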

> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
>                 Key: TIKA-2038
>                 URL: https://issues.apache.org/jira/browse/TIKA-2038
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, detector
>            Reporter: Shabanali Faghani
>            Priority: Minor
>         Attachments: comparisons_20160803b.xlsx, comparisons_20160804.xlsx, 
> iust_encodings.zip, lang-wise-eval_results.zip, lang-wise-eval_runnable.zip, 
> lang-wise-eval_source_code.zip, proposedTLDSampling.csv, 
> tika_1_14-SNAPSHOT_encoding_detector.zip, tld_text_html.xlsx, 
> tld_text_html_plus_H_column.xlsx
>
>
> Currently, Tika uses icu4j to detect the charset encoding of HTML documents 
> as well as other naturally text-based documents. But the accuracy of encoding 
> detection tools, including icu4j, is meaningfully lower for HTML documents 
> than for other text documents. Hence, in our 
> project I developed a library that works pretty well for HTML documents, 
> which is available here: https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within other Apache projects such as 
> Nutch, Lucene, Solr, etc., and these projects deal heavily with 
> HTML documents, it seems that having such a facility in Tika would also 
> help them become more accurate.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
