[
https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16692323#comment-16692323
]
Hans Brende edited comment on TIKA-2038 at 11/19/18 10:28 PM:
--------------------------------------------------------------
[~faghani]
[[email protected]]
This issue inspired me to look into how jchardet implements UTF-8 detection,
since that library appears to be the key to much greater accuracy. It turns out
to be rather simple: jchardet uses a UTF-8 state machine that enters an error
state if any invalid UTF-8 byte sequence is detected; otherwise, it keeps UTF-8
at index 0 of its "probable charsets". Unfortunately, I did find that jchardet
v1.1 has two bugs: (1) legal code points in the Supplementary Multilingual
Plane are counted as errors, and (2) illegal code points past U+10FFFF are
counted as legal.
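To make the two bug classes concrete, here is a small self-contained check using the JDK's strict UTF-8 decoder as a reference for correct behavior (the class and method names are mine for illustration, not part of jchardet or f8):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8BugDemo {

    // Returns true if the bytes form strictly valid UTF-8, using the
    // JDK's decoder (with REPORT error actions) as a reference validator.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Bug (1): 0xF0 0x9F 0x98 0x80 encodes U+1F600, a legal SMP code
        // point, but jchardet 1.1 flags this 4-byte sequence as an error.
        byte[] smp = {(byte) 0xF0, (byte) 0x9F, (byte) 0x98, (byte) 0x80};
        System.out.println(isValidUtf8(smp));       // true

        // Bug (2): 0xF5 0x80 0x80 0x80 would encode U+140000, which lies
        // beyond U+10FFFF and is illegal, but jchardet 1.1 accepts it.
        byte[] beyondMax = {(byte) 0xF5, (byte) 0x80, (byte) 0x80, (byte) 0x80};
        System.out.println(isValidUtf8(beyondMax)); // false
    }
}
```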
To fix these two bugs and narrow the scope of what is needed from jchardet to
UTF-8 detection alone, I ended up implementing an improved UTF-8 state machine,
which you might find useful here: https://github.com/HansBrende/f8. I have also
made it available on Maven Central as org.rypt:f8:1.0.
Peering into the source code of the IUST project, I see that the following
lines:
{code:java}
charset = HTMLCharsetDetector.mozillaJCharDet(rawHtmlByteSequence);
if (charset.equalsIgnoreCase("UTF-8")) {
    return Charsets.normalize(charset);
}

private static String mozillaJCharDet(byte[] bytes) {
    nsDetector det = new nsDetector(nsDetector.ALL);
    det.DoIt(bytes, bytes.length, false);
    det.DataEnd();
    return det.getProbableCharsets()[0];
}
{code}
could be replaced with:
{code:java}
org.rypt.f8.Utf8Statistics stats = new org.rypt.f8.Utf8Statistics();
stats.write(rawHtmlByteSequence);
if (stats.countInvalid() == 0) {
    return "UTF-8";
}
{code}
without loss of accuracy (in fact, with greater accuracy, thanks to the two
bug fixes).
Furthermore, by taking a hint from ICU4J (which counts an InputStream as valid
UTF-8 as long as the number of valid UTF-8 multi-byte sequences is at least an
*order of magnitude* greater than the number of invalid UTF-8 sequences, to
allow for possibly corrupted UTF-8 data), this method could be improved
further:
{code:java}
org.rypt.f8.Utf8Statistics stats = new org.rypt.f8.Utf8Statistics();
stats.write(rawHtmlByteSequence);
if (stats.looksLikeUtf8()) { // implemented as: countValid() > countInvalidIgnoringTruncation() * 10
    return "UTF-8";
}
{code}
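In case it helps to see the ratio heuristic spelled out without the f8 dependency, here is a simplified, hypothetical sketch in plain Java. It performs only a structural lead/continuation check (skipping the overlong and surrogate validation a real validator like f8 performs), and pure-ASCII input returns false here and would need separate handling:

```java
public class Utf8RatioSketch {

    // Simplified sketch of the order-of-magnitude heuristic: accept as
    // UTF-8 when valid multi-byte sequences outnumber invalid ones 10:1.
    static boolean looksLikeUtf8(byte[] bytes) {
        int validMultiByte = 0, invalid = 0;
        for (int i = 0; i < bytes.length; ) {
            int b = bytes[i] & 0xFF;
            if (b < 0x80) { i++; continue; }   // ASCII: counts toward neither
            // Expected sequence length from the lead byte; 0 marks a byte
            // that can never start a sequence (stray continuation, C0/C1, F5+).
            int len = b < 0xC2 ? 0 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : b < 0xF5 ? 4 : 0;
            if (len == 0) { invalid++; i++; continue; }
            if (i + len > bytes.length) break; // truncated tail: ignore
            boolean ok = true;
            for (int j = 1; j < len; j++) {
                int c = bytes[i + j] & 0xFF;
                if (c < 0x80 || c > 0xBF) { ok = false; break; } // not a continuation
            }
            if (ok) { validMultiByte++; i += len; }
            else { invalid++; i++; }
        }
        return validMultiByte > invalid * 10;
    }

    public static void main(String[] args) throws Exception {
        byte[] utf8 = "héllo wörld".getBytes(java.nio.charset.StandardCharsets.UTF_8);
        byte[] latin1 = "héllo".getBytes(java.nio.charset.StandardCharsets.ISO_8859_1);
        System.out.println(looksLikeUtf8(utf8));   // true
        System.out.println(looksLikeUtf8(latin1)); // false
    }
}
```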
Please let me know your thoughts!
> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
> Key: TIKA-2038
> URL: https://issues.apache.org/jira/browse/TIKA-2038
> Project: Tika
> Issue Type: Improvement
> Components: core, detector
> Reporter: Shabanali Faghani
> Priority: Minor
> Attachments: comparisons_20160803b.xlsx, comparisons_20160804.xlsx,
> iust_encodings.zip, lang-wise-eval_results.zip, lang-wise-eval_runnable.zip,
> lang-wise-eval_source_code.zip, proposedTLDSampling.csv,
> tika_1_14-SNAPSHOT_encoding_detector.zip, tld_text_html.xlsx,
> tld_text_html_plus_H_column.xlsx
>
>
> Currently, Tika uses icu4j for detecting the charset encoding of HTML
> documents as well as other natural-text documents. But the accuracy of
> encoding detector tools, including icu4j, is meaningfully lower for HTML
> documents than for other text documents. Hence, in our project I developed a
> library that works pretty well for HTML documents, which is available here:
> https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within other Apache projects such as
> Nutch, Lucene, Solr, etc., and these projects deal heavily with HTML
> documents, it seems that having such a facility in Tika will also help them
> become more accurate.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)