[
https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15410241#comment-15410241
]
Shabanali Faghani commented on TIKA-2038:
-----------------------------------------
Ok, to wrap up this stage, I’d say there is no need for more tests on the first
corpus. We know that among these algorithms *HTMLEncodingDetector + ICU4J with
added stripping* is the best choice. In effect, you have endorsed the “Markup
Elimination” phase of my paper. Now I want to show that the “Ensemble
Classification” phase is also needed. In other words, I claim that
*HTMLEncodingDetector + Mozilla JCharDet + ICU4J with added stripping* is the
best configuration, and I expect your proposed algorithm will not be as strong
as mine under real-world conditions, though it will still be more accurate than
Tika’s legacy algorithm.
To test this claim, let’s concentrate on the second test scenario of my paper,
i.e. the language-wise one. Before that, I want to make a point about the
relation between languages and encodings. Unlike many encodings such as
GBK (Chinese), Shift_JIS (Japanese), Windows-1251 (Russian) and
Windows-1256 (Arabic) that are used for just one language, some encodings such
as the Unicode family (UTF-8, UTF-16, ...) can be used for all languages and
scripts (Unicode even supports Cuneiform!). Hence, for technical reasons, the
accuracy of a typical encoding detector may vary greatly across documents that
share the same encoding but are written in different languages. For example,
from experience I know that JCharDet is ~100% accurate for Persian HTML pages
encoded in UTF-8, but it is much less accurate for UTF-8 pages in European
languages.
If I had wanted to fully cover UTF-8 in my first corpus, I would have had to
collect a lot of UTF-8 pages in various languages, which is exactly what the
second test scenario covers. Furthermore, in that scenario all encodings that
support a given language are tested together.
As I have seen before, Mozilla JCharDet is strongly biased toward UTF-8 (both
its true-positive and its true-negative rates for UTF-8 are very high). From
the evaluation results you know that none of the other algorithms, not even
ICU4JStripped, is perfect for UTF-8 (cell “I3” in your last evaluation
results). On the other hand, we know ~87% of websites use UTF-8, so we should
be as accurate as possible at detecting it.
I recommend that you test Tika’s current algorithm, your proposed algorithm,
and my algorithm against the URLs of my second test scenario, with these
considerations in mind:
# Use the charset in the HTTP header as the ground truth
# Check the validity of the encodings that appear in the HTTP header and then
normalize them. You may use [my
code|https://github.com/shabanali-faghani/IUST-HTMLCharDet/blob/master/src/main/java/ir/ac/iust/htmlchardet/Charsets.java]
for both the validity check and the normalization (see the first sketch after
this list)
# Persist (save) all pages, because we may have to re-run the evaluation. I
think it is better to organize the pages in this layout:
…/Lang-Corpus/Language (German, for example)/Encoding (UTF-8, for example)
# To test my algorithm, use htmlchardet-1.0.1.jar
# Turn Meta detection off for all three algorithms
# Run your proposed algorithm twice, once using your markup stripper and once
using an HTML parser such as Jsoup or TagSoup, which repairs the HTML document
before extracting its text. Use the ISO-8859-1 decoding/encoding trick to feed
the raw bytes through the HTML parser (see the Jsoup sketch after this list). I
think your markup stripper will fail (with false negatives, false positives and
also with exceptions) for many pages, because I have experienced this before.
You may have a look at [TagSoup’s
homepage|http://home.ccil.org/~cowan/tagsoup/] and notice the word “brutish”
there! I think that, in addition to a sense of humor, the author had a
real/practical sense of HTML!!
I recommend counting the exceptions and the false negatives/positives. To
compute false negatives/positives you can compare the sizes of the texts
extracted by the two approaches (i.e. by your markup stripper and by an HTML
parser).
# For fetching and also saving the pages, don’t use libraries that modify the
pages and corrupt their byte sequences. You can simply use Apache Commons IO as
follows (a fuller fetch-and-save sketch appears after this list):
fetching: {{byte[] htmlByteSequence = IOUtils.toByteArray(new URL(urlString));}}
saving: {{FileUtils.writeByteArrayToFile(file, htmlByteSequence);}}
Excuse me for putting this code here, but preserving the integrity of the bytes
is very important.
# Since the results will be large and scattered, in addition to a detailed
chart please aggregate them and provide a table and a diagram just like [these
ones|https://github.com/shabanali-faghani/IUST-HTMLCharDet#language-wise-evaluation],
so that other community members can understand them easily.
# Since none of ICU4J, JCharDet and JUniversalCharDet has GBK in its charset
list, and given [the evolution of this
charset|https://en.wikipedia.org/wiki/GBK], i.e. GB2312 (1981) -> GBK (1993)
-> GB18030 (2000), please count GBK -> GB18030 and GB2312 -> GB18030 as true
detections, where the first charset is the “Charset in HTTP header” and the
second is the “Detected Charset” (see the scoring sketch after this list).
# And finally, before you start writing the code, you may have a look at [my
code|https://github.com/shabanali-faghani/IUST-HTMLCharDet/tree/master/src/test/java/languagewise]
that I wrote for the language-wise evaluation.
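Below is a minimal sketch of what I mean by the validity check and
normalization in item 2. It is not the Charsets class from my repository, just
an illustration built on the JDK’s java.nio.charset API; the alias map and the
class/method names are only examples.
{code:java}
import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a tiny stand-in for the validation/normalization step,
// not the Charsets.java from IUST-HTMLCharDet.
public class CharsetNormalizer {

    // A few common aliases mapped to canonical names; extend as needed.
    private static final Map<String, String> ALIASES = new HashMap<>();
    static {
        ALIASES.put("UTF8", "UTF-8");
        ALIASES.put("LATIN1", "ISO-8859-1");
        ALIASES.put("WIN-1251", "WINDOWS-1251");
    }

    /** Returns a canonical charset name, or null if the HTTP header value is invalid. */
    public static String normalize(String label) {
        if (label == null) {
            return null;
        }
        String key = label.trim().toUpperCase();
        key = ALIASES.getOrDefault(key, key);
        try {
            return Charset.forName(key).name();   // throws if the name is unknown to the JVM
        } catch (IllegalArgumentException e) {
            return null;                          // invalid/unsupported label -> skip the page
        }
    }
}
{code}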
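And here is a sketch of the fetch-and-persist step from items 3 and 7, built
around exactly the two Commons IO calls quoted above. The corpus root and the
per-page file naming are my assumptions; only the IOUtils/FileUtils calls are
prescribed.
{code:java}
import java.io.File;
import java.net.URL;

import org.apache.commons.io.FileUtils;
import org.apache.commons.io.IOUtils;

// Sketch: fetch the raw bytes of a page and store them unchanged in the
// .../Lang-Corpus/<Language>/<Charset>/ layout. The file name (an index) is
// just an assumption for illustration.
public class PageFetcher {

    private static final File CORPUS_ROOT = new File("Lang-Corpus");

    public static void fetchAndSave(String urlString, String language,
                                    String normalizedCharset, int index) throws Exception {
        // Fetch the page as a raw byte sequence, with no re-encoding or DOM rewriting.
        byte[] htmlByteSequence = IOUtils.toByteArray(new URL(urlString));

        // e.g. Lang-Corpus/German/UTF-8/42.html
        File dir = new File(new File(CORPUS_ROOT, language), normalizedCharset);
        FileUtils.writeByteArrayToFile(new File(dir, index + ".html"), htmlByteSequence);
    }
}
{code}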
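The ISO-8859-1 trick in item 6 looks roughly like the sketch below (here with
Jsoup and ICU4J; the class name and the wiring are illustrative, not your
actual stripper): decode the raw bytes with ISO-8859-1, let the parser repair
the markup and extract the visible text, then re-encode that text with
ISO-8859-1 so ICU4J sees the original bytes of the text only.
{code:java}
import java.nio.charset.StandardCharsets;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

// Sketch of "strip markup with a lenient HTML parser, then detect with ICU4J".
public class ParserStrippedIcu4j {

    public static String detect(byte[] rawHtml) {
        // ISO-8859-1 maps every byte 0x00-0xFF to one char, so decoding and
        // re-encoding with it is lossless and preserves the byte sequence.
        String pseudoHtml = new String(rawHtml, StandardCharsets.ISO_8859_1);

        // Jsoup repairs broken markup while parsing; text() returns the visible
        // text (script and style contents are data nodes and are not included).
        Document doc = Jsoup.parse(pseudoHtml);
        byte[] textBytes = doc.text().getBytes(StandardCharsets.ISO_8859_1);

        CharsetDetector detector = new CharsetDetector();
        detector.setText(textBytes);
        CharsetMatch match = detector.detect();
        return match == null ? null : match.getName();
    }
}
{code}
Comparing the length of {{textBytes}} here with the length of the text produced
by your markup stripper for the same page gives a rough way to count the false
negatives/positives mentioned in item 6.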
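Finally, the GB* scoring rule from item 9 as a sketch (the class and method
names are arbitrary):
{code:java}
// Sketch of the scoring rule: a GB2312 or GBK ground truth detected as GB18030
// counts as a true detection, since GB2312 (1981) -> GBK (1993) -> GB18030 (2000)
// are successive supersets.
public class DetectionScorer {

    public static boolean isTrueDetection(String headerCharset, String detectedCharset) {
        if (headerCharset == null || detectedCharset == null) {
            return false;
        }
        String truth = headerCharset.trim().toUpperCase();
        String detected = detectedCharset.trim().toUpperCase();
        if (truth.equals(detected)) {
            return true;
        }
        // GBK -> GB18030 and GB2312 -> GB18030 are accepted as correct.
        return ("GBK".equals(truth) || "GB2312".equals(truth)) && "GB18030".equals(detected);
    }
}
{code}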
Oh, sorry. This has become a very long comment!
> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
> Key: TIKA-2038
> URL: https://issues.apache.org/jira/browse/TIKA-2038
> Project: Tika
> Issue Type: Improvement
> Components: core, detector
> Reporter: Shabanali Faghani
> Priority: Minor
> Attachments: comparisons_20160803b.xlsx, comparisons_20160804.xlsx,
> iust_encodings.zip, tika_1_14-SNAPSHOT_encoding_detector.zip
>
>
> Currently, Tika uses icu4j for detecting the charset encoding of HTML documents
> as well as of other plain text documents. But the accuracy of encoding
> detector tools, including icu4j, on HTML documents is
> meaningfully lower than on other text documents. Hence, in our
> project I developed a library that works pretty well for HTML documents,
> which is available here: https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within other Apache projects such as
> Nutch, Lucene, Solr, etc., and these projects deal heavily with
> HTML documents, it seems that having such a facility in Tika will also
> help them become more accurate.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)