[ 
https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15410241#comment-15410241
 ] 

Shabanali Faghani commented on TIKA-2038:
-----------------------------------------

OK, to wrap up this stage, I’d say there is no need to run more tests on the 
first corpus. We know that among these algorithms 
*HTMLEncodingDetector + ICU4J with added stripping* is the best choice; in 
effect, you have endorsed the “Markup Elimination” phase of my paper. Now I 
want to show that the “Ensemble Classification” phase is also needed. In other 
words, I claim that 
*HTMLEncodingDetector + Mozilla JCharDet + ICU4J with added stripping* is the 
best configuration, and I expect your proposed algorithm won’t be as strong as 
mine under real-world conditions, though it will still be more accurate than 
Tika’s legacy algorithm.

To test this claim, let’s concentrate on the second test scenario of my paper, 
i.e. the language-wise evaluation. But first, a point about the relation 
between languages and encodings. Unlike many encodings such as GBK (Chinese), 
Shift_JIS (Japanese), Windows-1251 (Russian) and Windows-1256 (Arabic) that are 
used for just one language, the Unicode encodings (UTF-8, UTF-16, ...) can be 
used for all languages and scripts (Unicode even supports Cuneiform!). Hence, 
for technical reasons, the accuracy of a typical encoding detector may vary 
greatly across documents in different languages that share the same encoding. 
For example, from experience I know that the accuracy of JCharDet is ~100% for 
Persian HTML pages encoded in UTF-8, but it is not very accurate for UTF-8 
pages in European languages.

If I had wanted to fully cover UTF-8 in my first corpus, I would have had to 
collect a large number of pages in various languages encoded in UTF-8. That is 
exactly what the second test scenario covers. Furthermore, in that scenario all 
encodings that support a given language are tested together.

As I’ve observed before, Mozilla JCharDet is strongly biased toward UTF-8 (both 
its true-positive and its true-negative rates are very high for UTF-8). From 
the evaluation results you know that none of the other algorithms, not even 
ICU4JStripped, is perfect for UTF-8 (cell “I3” in your last evaluation 
results). On the other hand, we know ~87% of websites use UTF-8, so we should 
be as accurate as possible at detecting UTF-8.
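
To make the idea concrete, here is a rough, hypothetical sketch of how such an 
ensemble could be wired. It is not the exact logic of my library: the 
{{detectWithJCharDet}} and {{stripHtmlMarkup}} helpers are placeholders, and 
only the ICU4J calls are the real API.

{code:java}
import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

public class EnsembleSketch {

    // Hypothetical helper: run Mozilla JCharDet over the raw bytes and
    // return its best guess, or null if it is not confident.
    static String detectWithJCharDet(byte[] rawHtml) { /* ... */ return null; }

    // Hypothetical helper: strip tags/scripts/styles and return only the
    // visible text bytes (see the Jsoup sketch further below).
    static byte[] stripHtmlMarkup(byte[] rawHtml) { /* ... */ return rawHtml; }

    public static String detect(byte[] rawHtml) {
        // (In the full configuration Tika's HTMLEncodingDetector would run
        //  first on the meta/charset declarations; it is omitted here.)

        // 1) JCharDet on the raw bytes: trust it when it says UTF-8,
        //    because that is where it is extremely reliable.
        String jchardet = detectWithJCharDet(rawHtml);
        if ("UTF-8".equalsIgnoreCase(jchardet)) {
            return "UTF-8";
        }

        // 2) Otherwise, strip the markup and let ICU4J decide on the
        //    remaining visible text (the "Markup Elimination" phase).
        CharsetDetector icu = new CharsetDetector();
        icu.setText(stripHtmlMarkup(rawHtml));
        CharsetMatch match = icu.detect();
        return match != null ? match.getName() : null;
    }
}
{code}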

I recommend that you test Tika’s current algorithm, your proposed algorithm, 
and my proposed algorithm on the URLs of my second test scenario, with these 
considerations:

# Use the charset in the HTTP header as the ground truth
# Check that the encodings appearing in the HTTP header are valid and then 
normalize them. You may use [my 
code|https://github.com/shabanali-faghani/IUST-HTMLCharDet/blob/master/src/main/java/ir/ac/iust/htmlchardet/Charsets.java]
 for both the validity check and the normalization (see the normalization 
sketch after this list)
# Persist (save) all pages, because we may be forced to re-run the evaluation. 
I think it is better to organize the pages in this hierarchy: 
…/Lang-Corpus/Language (German, for example)/Encoding (UTF-8, for example); 
see the fetch/save sketch after this list
# To test my algorithm use htmlchardet-1.0.1.jar
# Turn Meta detection off for all three algorithms
# Run your proposed algorithm twice, once using your markup stripper and once 
using a proper HTML parser such as Jsoup or TagSoup, which repairs the HTML 
document before extracting its text. Use the ISO-8859-1 decode/re-encode trick 
so that an HTML parser can extract the text without damaging the byte sequence 
(see the Jsoup sketch after this list). I think your markup stripper will fail 
(with false negatives, false positives and also exceptions) for many pages, 
because I have experienced this before. You may have a look at the [TagSoup 
homepage|http://home.ccil.org/~cowan/tagsoup/] and notice the word “brutish” 
there; I think, in addition to a sense of humor, the author had a 
real/practical sense of HTML!
I recommend counting the exceptions and the false negatives/positives. To 
estimate the false negatives/positives, you can compare the sizes of the texts 
extracted by the two approaches (i.e. by your markup stripper and by an HTML 
parser).
# For fetching and saving the pages, don’t use libraries that modify the pages 
and clutter their byte sequences. You can simply use Apache Commons IO as 
follows (see also the fetch/save sketch after this list):
fetching: {{byte[] htmlByteSequence = IOUtils.toByteArray(new URL(urlString));}}
saving: {{FileUtils.writeByteArrayToFile(file, htmlByteSequence);}}
Excuse me for putting this code here, but preserving the integrity of the bytes 
is very important. 
# Since the results are very large and scattered, in addition to a detailed 
chart please aggregate them and provide a table and a diagram just like [these 
ones|https://github.com/shabanali-faghani/IUST-HTMLCharDet#language-wise-evaluation].
 That way other community members can easily understand them.
# Since none of ICU4J, JCharDet and JUniversalCharDet has GBK in its charset 
list, and given [the evolution of this 
charset|https://en.wikipedia.org/wiki/GBK], i.e. GB2312 (1981) -> GBK (1993) 
-> GB18030 (2000), please count GBK -> GB18030 and GB2312 -> GB18030 as 
correct detections (see the GB-family sketch after this list). In each pair the 
first charset is the one in the HTTP header and the second is the detected 
charset.
# And finally, before you start writing the code, you may have a look at [my 
code|https://github.com/shabanali-faghani/IUST-HTMLCharDet/tree/master/src/test/java/languagewise]
 that I wrote for the language-wise evaluation. 
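
As promised above, here is a minimal sketch of what I mean by validating and 
normalizing the charset label taken from the HTTP header. It does not use my 
Charsets class; it only relies on the standard {{java.nio.charset.Charset}} 
API, and the small alias map is just an illustration.

{code:java}
import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Map;

public class CharsetNormalizer {

    // Illustrative alias map; the real list in my Charsets class is longer.
    private static final Map<String, String> ALIASES = new HashMap<>();
    static {
        ALIASES.put("UTF8", "UTF-8");
        ALIASES.put("ISO8859-1", "ISO-8859-1");
        ALIASES.put("WIN-1256", "WINDOWS-1256");
    }

    /**
     * Returns the canonical name of the charset declared in the HTTP header,
     * or null if the label is missing, unknown or unsupported by the JVM.
     */
    public static String normalize(String httpCharset) {
        if (httpCharset == null || httpCharset.trim().isEmpty()) {
            return null;
        }
        String label = httpCharset.trim().toUpperCase();
        label = ALIASES.getOrDefault(label, label);
        try {
            return Charset.forName(label).name();  // canonical JVM name
        } catch (Exception e) {                     // invalid or unsupported label
            return null;
        }
    }
}
{code}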
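
This is the fetch/save sketch referred to in items 3 and 7: it keeps the raw 
byte sequence untouched and files each page under the 
Lang-Corpus/Language/Encoding hierarchy. The corpus root path and the file 
naming are just placeholders.

{code:java}
import java.io.File;
import java.net.URL;
import org.apache.commons.io.FileUtils;
import org.apache.commons.io.IOUtils;

public class CorpusBuilder {

    // Placeholder corpus root; use whatever path you like.
    private static final File CORPUS_ROOT = new File("Lang-Corpus");

    /**
     * Fetches the raw bytes of a page and stores them, untouched, under
     * Lang-Corpus/<language>/<encoding>/<fileName>.html
     */
    public static void fetchAndSave(String urlString, String language,
                                    String encoding, String fileName) throws Exception {
        // Fetch the exact byte sequence; no decoding, no re-encoding.
        byte[] htmlByteSequence = IOUtils.toByteArray(new URL(urlString));

        // Save it under the language/encoding hierarchy.
        File target = new File(new File(new File(CORPUS_ROOT, language), encoding),
                               fileName + ".html");
        FileUtils.writeByteArrayToFile(target, htmlByteSequence);  // creates parent dirs
    }
}
{code}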
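
The Jsoup sketch mentioned in item 6: decode the raw bytes as ISO-8859-1 (a 
one-to-one mapping from bytes 0x00-0xFF to characters), let the parser repair 
the document and extract its visible text, then re-encode that text with 
ISO-8859-1 to get back the original byte values before handing them to the 
detector. The class and method names are my own placeholders, not part of any 
existing API.

{code:java}
import java.nio.charset.StandardCharsets;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class MarkupStripper {

    /**
     * Strips markup with Jsoup while preserving the original byte values of
     * the visible text, using the ISO-8859-1 decode/re-encode trick.
     */
    public static byte[] stripWithJsoup(byte[] rawHtml) {
        // ISO-8859-1 maps every byte to exactly one char, so this round trip
        // cannot corrupt the byte sequence of the visible text.
        String pseudoDecoded = new String(rawHtml, StandardCharsets.ISO_8859_1);

        // Jsoup repairs broken/"brutish" HTML before extracting the text.
        Document doc = Jsoup.parse(pseudoDecoded);
        String visibleText = doc.text();

        // Re-encode with ISO-8859-1 to recover the original bytes of the text.
        return visibleText.getBytes(StandardCharsets.ISO_8859_1);
    }
}
{code}

Comparing the length of the byte array returned here with the length of the 
text produced by your own stripper gives a rough way to count the false 
negatives/positives mentioned in item 6.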
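
Finally, the GB-family sketch from item 8: a tiny, illustrative scoring helper 
that counts GBK -> GB18030 and GB2312 -> GB18030 as correct detections. The 
method name is just a placeholder.

{code:java}
public class GbFamilyScoring {

    /**
     * Returns true if the detected charset should be counted as a correct
     * detection of the charset declared in the HTTP header, treating the
     * GB family (GB2312 -> GBK -> GB18030) as compatible supersets.
     */
    public static boolean isTrueDetection(String httpCharset, String detectedCharset) {
        if (httpCharset == null || detectedCharset == null) {
            return false;
        }
        String truth = httpCharset.trim().toUpperCase();
        String detected = detectedCharset.trim().toUpperCase();

        // GBK and GB2312 pages detected as GB18030 are counted as correct,
        // because none of the detectors has GBK in its charset list.
        if (("GBK".equals(truth) || "GB2312".equals(truth)) && "GB18030".equals(detected)) {
            return true;
        }
        return truth.equals(detected);
    }
}
{code}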

Oh, sorry, this has become a very long comment!

> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
>                 Key: TIKA-2038
>                 URL: https://issues.apache.org/jira/browse/TIKA-2038
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, detector
>            Reporter: Shabanali Faghani
>            Priority: Minor
>         Attachments: comparisons_20160803b.xlsx, comparisons_20160804.xlsx, 
> iust_encodings.zip, tika_1_14-SNAPSHOT_encoding_detector.zip
>
>
> Currently, Tika uses icu4j for detecting the charset encoding of HTML documents 
> as well as other plain-text documents. But the accuracy of encoding detector 
> tools, including icu4j, on HTML documents is meaningfully lower than on other 
> text documents. Hence, in our project I developed a library that works pretty 
> well for HTML documents, which is available here: 
> https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within other Apache projects such as 
> Nutch, Lucene, Solr, etc., and these projects deal heavily with HTML 
> documents, it seems that having such a facility in Tika would also help them 
> become more accurate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
