[jira] [Comment Edited] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents

Hans Brende (JIRA) Wed, 21 Nov 2018 09:04:54 -0800


    [ 
https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16694940#comment-16694940
 ]


Hans Brende edited comment on TIKA-2038 at 11/21/18 5:03 PM:
-------------------------------------------------------------

Alternatively, you could use Guava's 
{{com.google.common.base.Utf8.isWellFormed(byte[])}} method, which does the 
same thing as the jchardet implementation, except that it does not count 0x0E, 
0x0F, and 0x1B as illegal and does not have the two bugs I mentioned. This is 
definitely the most performant option, although it returns only a boolean, so 
you would lose the more detailed text statistics about the number of 
valid/invalid/ASCII sequences.
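
To illustrate the trade-off, here is a minimal sketch of a strict UTF-8 scan 
that also keeps the per-sequence counts that a boolean check cannot report. 
It is not the jchardet code, not Guava's implementation, and not anything 
that exists in Tika: the class name {{Utf8Stats}} and its fields are 
hypothetical, and the choice to count each offending byte as one invalid 
sequence and then resume at the next byte is an assumption. It uses only the 
JDK and treats 0x0E, 0x0F, and 0x1B as ordinary ASCII.

```java
/** Hypothetical helper for illustration only; not part of Tika, Guava, or jchardet. */
final class Utf8Stats {
    final int ascii;      // single-byte sequences 0x00..0x7F
    final int multiByte;  // well-formed 2-, 3-, and 4-byte sequences
    final int invalid;    // bytes that do not start a well-formed sequence

    private Utf8Stats(int ascii, int multiByte, int invalid) {
        this.ascii = ascii;
        this.multiByte = multiByte;
        this.invalid = invalid;
    }

    /** True when no ill-formed byte was seen. */
    boolean isWellFormed() {
        return invalid == 0;
    }

    /** Scans the whole array and counts ASCII, multi-byte, and invalid sequences. */
    static Utf8Stats scan(byte[] bytes) {
        int ascii = 0, multi = 0, invalid = 0;
        int i = 0;
        while (i < bytes.length) {
            int b = bytes[i] & 0xFF;
            if (b < 0x80) {
                ascii++;
                i++;
                continue;
            }
            int len = sequenceLength(bytes, i);
            if (len > 0) {
                multi++;
                i += len;
            } else {
                // Assumption: count one invalid byte and resynchronize at the next byte.
                invalid++;
                i++;
            }
        }
        return new Utf8Stats(ascii, multi, invalid);
    }

    /** Returns the length of the well-formed multi-byte sequence at i, or 0 if there is none. */
    private static int sequenceLength(byte[] s, int i) {
        int b0 = s[i] & 0xFF;
        int need;                  // number of continuation bytes required
        int lo = 0x80, hi = 0xBF;  // allowed range for the first continuation byte
        if (b0 >= 0xC2 && b0 <= 0xDF) {
            need = 1;
        } else if (b0 >= 0xE0 && b0 <= 0xEF) {
            need = 2;
            if (b0 == 0xE0) lo = 0xA0;  // rejects overlong 3-byte forms
            if (b0 == 0xED) hi = 0x9F;  // rejects UTF-16 surrogates
        } else if (b0 >= 0xF0 && b0 <= 0xF4) {
            need = 3;
            if (b0 == 0xF0) lo = 0x90;  // rejects overlong 4-byte forms
            if (b0 == 0xF4) hi = 0x8F;  // rejects code points above U+10FFFF
        } else {
            return 0;  // 0x80..0xC1 and 0xF5..0xFF can never start a sequence
        }
        if (i + need >= s.length) {
            return 0;  // truncated sequence at end of input
        }
        int b1 = s[i + 1] & 0xFF;
        if (b1 < lo || b1 > hi) {
            return 0;
        }
        for (int k = 2; k <= need; k++) {
            int bk = s[i + k] & 0xFF;
            if (bk < 0x80 || bk > 0xBF) {
                return 0;
            }
        }
        return need + 1;
    }
}
```

Calling {{Utf8Stats.scan(bytes).isWellFormed()}} should agree with a strict 
well-formedness check, while the {{ascii}}, {{multiByte}}, and {{invalid}} 
fields carry the extra statistics. If only the yes/no answer is needed, the 
Guava call is the simpler choice.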


was (Author: hansbrende):
Alternatively, you could use guava's 
{{com.google.common.base.Utf8.isWellFormed(byte[])}} method, which will do 
exactly the same thing as the jchardet implementation (minus counting 0x0E, 
0x0F, and 0x1B as illegal). This is definitely the most performant option, 
although you'd lack more detailed text statistics about the number of 
valid/invalid/ascii sequences.

> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
>                 Key: TIKA-2038
>                 URL: https://issues.apache.org/jira/browse/TIKA-2038
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, detector
>            Reporter: Shabanali Faghani
>            Priority: Minor
>         Attachments: comparisons_20160803b.xlsx, comparisons_20160804.xlsx, 
> iust_encodings.zip, lang-wise-eval_results.zip, lang-wise-eval_runnable.zip, 
> lang-wise-eval_source_code.zip, proposedTLDSampling.csv, 
> tika_1_14-SNAPSHOT_encoding_detector.zip, tld_text_html.xlsx, 
> tld_text_html_plus_H_column.xlsx
>
>
> Currently, Tika uses icu4j to detect the charset encoding of HTML documents 
> as well as of other text-based documents. But the accuracy of encoding 
> detection tools, including icu4j, is meaningfully lower on HTML documents 
> than on other text documents. Hence, in our project I developed a library 
> that works pretty well for HTML documents, which is available here: 
> https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within other Apache projects such as 
> Nutch, Lucene, Solr, etc., and these projects deal heavily with HTML 
> documents, it seems that having such a facility in Tika would also help 
> them become more accurate.



