[
https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15410111#comment-15410111
]
Shabanali Faghani commented on TIKA-2038:
-----------------------------------------
Well, now we can have a fair comparison between Tika and IUST. Note that this
comparison covers only the half of the cases in which no charset information is
available in the meta tags (or the user does not trust the meta information at
all). For the other half, IUST and Tika are on par, though Tika fails in some
cases (see TIKA-2050); IUST failed on just two documents.
In my paper I completely ignored the charsets in meta tags and did not include
them in my computations. But if you want to treat the meta information as
ground truth, you need to adjust the results, because as they stand your test
results cannot be read as the real-world behavior of some of these algorithms;
the corpus does not properly represent real-world conditions.
So, for each algorithm that looks for the charset inside the meta tags (i.e. 1,
2, and 3 in your list above), you should first turn its meta detection off,
compute its accuracy, divide that accuracy (which will be <= 1) by 2, and
finally add 0.5 (i.e. 50%) to the result. If you do that, the accuracy of
"1. Tika's default detection algorithm" and "3. HTMLEncodingDetector" will
drop. But I think the accuracy of "2. The proposed detection algorithm" won't
change (considering GB18030 an accepted detection for GBK).
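To make the adjustment concrete, here is a tiny sketch of the arithmetic; the
accuracy value used below is only an illustrative placeholder, not one of the
real measurements:
{code:java}
// Sketch of the adjustment described above (illustrative numbers, not real results).
// Half of the corpus is assumed to carry a correct charset in its meta tags, so an
// algorithm that trusts meta gets that half "for free"; the other half is weighted
// by its accuracy measured with meta detection turned off.
public class AdjustedAccuracy {

    /** accuracyWithoutMeta is in [0, 1], measured with meta detection disabled. */
    static double adjust(double accuracyWithoutMeta) {
        return 0.5 + accuracyWithoutMeta / 2;
    }

    public static void main(String[] args) {
        double measured = 0.80;               // hypothetical accuracy without meta
        System.out.println(adjust(measured)); // 0.90 -> expected real-world accuracy
    }
}
{code}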
The behavior of the last four algorithms (4, 5, 6, and 7) in the real world
would be just like the results you've attached, because they don't look for
the charset in meta tags even when it is there.
p.s. In the early stages of my work I tested the existing tools against just
two encodings, UTF-8 and Windows-1256. Since JUniversalCharDet failed
completely on Windows-1256 and was not perfect on UTF-8, I assumed it was a
poor release of JCharDet and threw it away at the very first steps of my
work... and later on I didn't test it with other encodings. But now it looks
great for Windows-1251, GBK, and Shift_JIS. Nevertheless, in detecting UTF-8
it is weaker than what I had seen before.
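For reference, this is roughly how JUniversalCharDet can be run over a raw
byte stream; it is only a minimal sketch, and the sample file name is a
hypothetical placeholder:
{code:java}
import org.mozilla.universalchardet.UniversalDetector;

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class UniversalCharDetSample {
    public static void main(String[] args) throws IOException {
        // Hypothetical sample document; any raw HTML byte stream is handled the same way.
        try (InputStream in = new FileInputStream("sample-windows-1256.html")) {
            UniversalDetector detector = new UniversalDetector(null);
            byte[] buf = new byte[4096];
            int read;
            while ((read = in.read(buf)) > 0 && !detector.isDone()) {
                detector.handleData(buf, 0, read);
            }
            detector.dataEnd();
            // May print null when the detector cannot decide on a charset.
            System.out.println("Detected charset: " + detector.getDetectedCharset());
            detector.reset();
        }
    }
}
{code}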
> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
> Key: TIKA-2038
> URL: https://issues.apache.org/jira/browse/TIKA-2038
> Project: Tika
> Issue Type: Improvement
> Components: core, detector
> Reporter: Shabanali Faghani
> Priority: Minor
> Attachments: comparisons_20160803b.xlsx, comparisons_20160804.xlsx,
> iust_encodings.zip, tika_1_14-SNAPSHOT_encoding_detector.zip
>
>
> Currently, Tika uses icu4j for detecting the charset encoding of HTML documents
> as well as of other plain-text documents. But the accuracy of encoding detector
> tools, including icu4j, on HTML documents is meaningfully lower than on other
> text documents. Hence, in our project I developed a library that works pretty
> well for HTML documents, which is available here:
> https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within other Apache projects such as Nutch,
> Lucene, and Solr, and these projects deal heavily with HTML documents, it seems
> that having such a facility in Tika would help them become more accurate as
> well.
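For readers unfamiliar with the current mechanism, the following is a minimal
sketch of icu4j-based detection over a raw byte buffer, which is essentially
what the issue refers to; the input bytes here are only a placeholder:
{code:java}
import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

import java.nio.charset.StandardCharsets;

public class Icu4jDetectSketch {
    public static void main(String[] args) {
        // Placeholder HTML; in Tika the bytes come from the document stream.
        byte[] html = "<html><body>Hello, world</body></html>".getBytes(StandardCharsets.UTF_8);

        CharsetDetector detector = new CharsetDetector();
        detector.setText(html);
        CharsetMatch match = detector.detect(); // best statistical guess over the raw bytes
        System.out.println(match.getName() + " (confidence " + match.getConfidence() + ")");
    }
}
{code}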
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)