[
https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16698513#comment-16698513
]
Hans Brende commented on TIKA-2038:
-----------------------------------
Here's a more rigorous demonstration of my claim (by counterexample): if
jchardet ordered {{getProbableCharsets()}} by "best match first", then the
first element of {{getProbableCharsets()}} would always match the charset
reported to the {{nsICharsetDetectionObserver}}. However, that is not the
case, as the following unit test demonstrates:
{code:java}
import java.util.ArrayList;

import org.junit.Assert;
import org.junit.Test;
import org.mozilla.intl.chardet.nsDetector;
import org.mozilla.intl.chardet.nsICharsetDetectionObserver;

public class JchardetOrderingTest {

    @Test
    public void checkReportedMatchesFirstProbable() {
        final byte[] testBytes = {
            0x40, 0x32, 0x2A, 0x3E, 0x13, 0x2D, 0x61, 0x35,
            0x72, 0x12, 0x1C, 0x1A, 0x2B, 0x0B, 0x6A, 0x08,
            0x55, 0x7C, 0x1F, 0x6E, 0x56, 0x7D, 0x7E, 0x7B,
            0x05, 0x32, 0x7E, 0x7D, 0x73
        };
        ArrayList<String> reportedCharsets = new ArrayList<>();
        // Notify(String) is the interface's sole abstract method,
        // so a method reference satisfies it
        nsICharsetDetectionObserver observer = reportedCharsets::add;
        nsDetector det = new nsDetector(nsDetector.ALL);
        det.Init(observer);
        det.DoIt(testBytes, testBytes.length, false);
        det.DataEnd();
        Assert.assertEquals(reportedCharsets.get(0),
            det.getProbableCharsets()[0]);
    }
}
{code}
Running this test results in a FAILURE:
{noformat}
org.junit.ComparisonFailure:
Expected :HZ-GB-2312
Actual :UTF-8
Process finished with exit code 255
{noformat}
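The practical takeaway for callers is to prefer the charset passed to the observer's {{Notify}} callback over {{getProbableCharsets()[0]}}. Here is a minimal sketch of that capture-and-prefer pattern; note that {{CharsetObserver}}, {{PreferReportedCharset}}, and the fallback value are hypothetical stand-ins so the snippet runs without jchardet on the classpath:
{code:java}
import java.util.ArrayList;
import java.util.List;

public class PreferReportedCharset {

    // Hypothetical stand-in for jchardet's single-method observer interface.
    interface CharsetObserver {
        void notify(String charset);
    }

    public static void main(String[] args) {
        List<String> reported = new ArrayList<>();
        // The method reference satisfies the single abstract method, just as
        // reportedCharsets::add satisfies nsICharsetDetectionObserver.
        CharsetObserver observer = reported::add;

        // A detector invokes the callback only once it is confident:
        observer.notify("HZ-GB-2312");

        // Prefer the observer-reported charset; fall back (here, to a
        // hypothetical default) only if the detector never reported one.
        String best = reported.isEmpty() ? "UTF-8" : reported.get(0);
        System.out.println(best); // prints "HZ-GB-2312"
    }
}
{code}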
> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
> Key: TIKA-2038
> URL: https://issues.apache.org/jira/browse/TIKA-2038
> Project: Tika
> Issue Type: Improvement
> Components: core, detector
> Reporter: Shabanali Faghani
> Priority: Minor
> Attachments: comparisons_20160803b.xlsx, comparisons_20160804.xlsx,
> iust_encodings.zip, lang-wise-eval_results.zip, lang-wise-eval_runnable.zip,
> lang-wise-eval_source_code.zip, proposedTLDSampling.csv,
> tika_1_14-SNAPSHOT_encoding_detector.zip, tld_text_html.xlsx,
> tld_text_html_plus_H_column.xlsx
>
>
> Currently, Tika uses icu4j for detecting the charset encoding of HTML
> documents as well as other plain-text documents. However, the accuracy of
> encoding detection tools, including icu4j, on HTML documents is meaningfully
> lower than on other text documents. Hence, in our project I developed a
> library that works pretty well for HTML documents, which is available here:
> https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within other Apache projects such as
> Nutch, Lucene, and Solr, and these projects deal heavily with HTML
> documents, having such a facility in Tika would help them become more
> accurate as well.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)