[ https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16698305#comment-16698305 ]
Hans Brende edited comment on TIKA-2038 at 11/26/18 4:54 AM:
-------------------------------------------------------------

[~faghani] Thanks for the response! If my understanding of the jchardet & IUST source code is correct, splitting off the UTF-8 detector should be possible, because the method {code:java}getProbableCharsets(){code} does not return the charsets in "best match first" order (as icu4j does), but rather in "first tested first" order (and UTF-8 is *always* at index 0 in this ordering if it was not detected to be invalid).

> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
>                 Key: TIKA-2038
>                 URL: https://issues.apache.org/jira/browse/TIKA-2038
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, detector
>            Reporter: Shabanali Faghani
>            Priority: Minor
>         Attachments: comparisons_20160803b.xlsx, comparisons_20160804.xlsx, iust_encodings.zip, lang-wise-eval_results.zip, lang-wise-eval_runnable.zip, lang-wise-eval_source_code.zip, proposedTLDSampling.csv, tika_1_14-SNAPSHOT_encoding_detector.zip, tld_text_html.xlsx, tld_text_html_plus_H_column.xlsx
>
>
> Currently, Tika uses icu4j for detecting the charset encoding of HTML documents as well as other plain-text documents. But the accuracy of encoding detection tools, including icu4j, on HTML documents is considerably lower than on other kinds of text documents.
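The reason a UTF-8 detector can be split off at all is that UTF-8 validity is a purely structural check on the byte stream, independent of the statistical probers used for other charsets. A minimal standalone sketch of that idea in Java, using the JDK's {code:java}CharsetDecoder{code} with strict error reporting rather than jchardet's internal prober (this is an illustrative stand-in, not jchardet's actual code):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {

    // Returns true iff the bytes form a structurally valid UTF-8 sequence.
    // A strict decoder (REPORT instead of the default REPLACE) throws on
    // any malformed or unmappable input instead of substituting U+FFFD.
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Valid multi-byte UTF-8:
        System.out.println(isValidUtf8("héllo".getBytes(StandardCharsets.UTF_8))); // true
        // 0xC3 opens a 2-byte sequence, but 0x28 is not a continuation byte:
        System.out.println(isValidUtf8(new byte[]{(byte) 0xC3, (byte) 0x28}));     // false
    }
}
```

This mirrors what the "index 0" observation implies: jchardet only ever demotes UTF-8 from the front of {code:java}getProbableCharsets(){code} once it has seen an invalid sequence, so a standalone validity check of this shape captures the same decision.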
> Hence, in our project I developed a library that works quite well for HTML documents, which is available here: https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within other Apache projects such as Nutch, Lucene, Solr, etc., and these projects deal heavily with HTML documents, having such a facility in Tika should help them become more accurate as well.

-- 
This message was sent by Atlassian JIRA (v7.6.3#76005)