[
https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15830525#comment-15830525
]
Shabanali Faghani commented on TIKA-2038:
-----------------------------------------
After doing some work in my spare time, I’ve come back with a bunch of results.
Since I didn’t have access to the URLs of Common Crawl, I inevitably did the
evaluation using the URLs of the Alexa top sites. I’ve attached the source code (as
an Eclipse project), a runnable jar (with all required resources) and the
results of the evaluation to this issue. The results are provided at fine-,
mid- and coarse-grained granularity levels. The fine-grained results are
stored in a SQLite table, so many further analyses can be produced simply by
querying it.
For a quick view, I’ve presented the results at the coarse-grained level below.
In addition to Tika, IUST and ICU4JStripped, I also tested JUniversalCharDet,
IUSTWithJUniversalCharDet and IUSTWithoutMarkupElimination. Details of these
algorithms are as follows:
# *icu4jStripped:* HTML markup, styles, CSS, … are removed and then the encoding is detected using ICU4J (a sketch follows this list)
# *jUniCharDet:* JUniversalCharDet
# *tika:* Tika’s legacy algorithm without Meta detection, i.e. JUniversalCharDet + ICU4J
# *iustWithJUniChar:* The IUST algorithm (with Meta detection off) but using JUniversalCharDet instead of JCharDet, i.e. JUniversalCharDet (for UTF-8) + ICU4J (with stripped input)
# *iustWithoutMarkupElim:* The IUST algorithm (with Meta detection off) without the Markup Elimination phase, i.e. JCharDet (for UTF-8) + ICU4J (without stripping the input)
# *iust:* The IUST algorithm (with Meta detection off), i.e. JCharDet (for UTF-8) + ICU4J (with stripped input)
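To make the *icu4jStripped* variant concrete, here is a minimal Java sketch of the idea, assuming the ISO-8859-1 decoding-encoding trick (mentioned in the pseudocode further down) is used for stripping. The regexes and the wrapper class are illustrative only, not the code from the attached project:
{code:java|title=icu4jStripped sketch (illustrative)|borderStyle=solid}
import java.nio.charset.StandardCharsets;

import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

public class Icu4jStrippedSketch {

    public static String detect(byte[] rawHtml) {
        // ISO-8859-1 maps every byte to a char losslessly, so markup can be
        // stripped as text and the remaining bytes recovered unchanged.
        String html = new String(rawHtml, StandardCharsets.ISO_8859_1);
        String visibleText = html
                .replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", " ") // scripts & styles
                .replaceAll("(?s)<!--.*?-->", " ")                      // comments
                .replaceAll("(?s)<[^>]+>", " ");                        // remaining tags
        byte[] stripped = visibleText.getBytes(StandardCharsets.ISO_8859_1);

        // Let ICU4J guess the encoding from the markup-free byte sequence.
        CharsetMatch match = new CharsetDetector().setText(stripped).detect();
        return match != null ? match.getName() : null;
    }
}
{code}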
… and the results (the underscores in the header row are only there to widen the corresponding columns):
||____Language(s)____||#URLs||#with valid charset in HTTP header||#fetched successfully||icu4jStripped||__jUniCharDet__||_____tika_____||iustWithJUniChar||iustWithoutMarkupElim||_____iust______||
|*Arabic (.iq, …)*|1904|1175|1168|603 = 52%|806 = 69%|743 = 64%|931 = 80%|1107 = 95%|*1136 = 97%*|
|*Korean (.kr)*|1604|736|735|622 = 85%|616 = 84%|632 = 86%|655 = 89%|*704 = 96%*|700 = 95%|
|*Turkish (.tr)*|2545|1569|1561|467 = 30%|1209 = 78%|1282 = 82%|1350 = 87%|1466 = 94%|*1498 = 96%*|
|*Vietnamese (.vn)*|2721|1467|1463|968 = 66%|1233 = 84%|1265 = 87%|1305 = 89%|1446 = 99%|*1448 = 99%*|
|*Chinese (.cn)*|10086|3870|3855|1612 = 42%|3411 = 89%|3452 = 90%|3555 = 92%|*3748 = 97%*|3725 = 97%|
|*French (.fr)*|13156|8730|8713|1889 = 22%|6555 = 75%|7039 = 81%|7197 = 83%|*8266 = 95%*|8076 = 93%|
|*Japanese (.jp)*|20339|7747|7737|3483 = 45%|6455 = 83%|6753 = 87%|6771 = 88%|*7623 = 99%*|7619 = 99%|
|*Spanish (.es)*|10208|6304|6287|1571 = 25%|4745 = 76%|4914 = 78%|5308 = 84%|*6033 = 96%*|5956 = 95%|
|*Italian (.it)*|11512|7398|7372|1540 = 21%|5646 = 77%|4690 = 64%|6201 = 84%|*7074 = 96%*|6970 = 95%|
|*Indian, English (.in)*|12236|6174|6157|995 = 16%|3190 = 52%|1607 = 26%|3559 = 58%|*5987 = 97%*|5973 = 97%|
|*Persian (.ir)*|7396|4014|4000|3752 = 94%|3738 = 94%|3878 = 97%|3899 = 98%|3975 = 99%|*3982 = 100%*|
|*English (.uk, .us)*|21565|13610|13583|2653 = 20%|7551 = 56%|4870 = 36%|8886 = 65%|*13167 = 97%*|13025 = 96%|
|*Russian (.ru)*|39453|28289|28229|22406 = 79%|24796 = 88%|25449 = 90%|25638 = 91%|22813 = 81%|*27563 = 98%*|
|*German (.de)*|35318|24914|24891|5618 = 23%|15655 = 63%|17261 = 69%|18321 = 74%|*23552 = 95%*|23095 = 93%|
|*All Above (.all!)*|190043|115997|115751|48179 = 42%|85606 = 74%|83835 = 72%|93576 = 81%|106961 = 92%|*110493 = 96%*|
Some notable observations:
* Overall, IUST is 24% more accurate than Tika.
* ICU4JStripped is not a good choice at all; in fact, it’s the worst choice!
* For 9 of the 14 languages, the accuracy of IUST with Markup Elimination is actually lower than that of IUST without Markup Elimination!
* Overall, the accuracy of Tika (72%) is lower than the accuracy of JUniversalCharDet alone (74%)!! This is odd, because JUniversalCharDet is a sub-component of Tika. I think it is due to the way JUniversalCharDet is used in Tika, i.e. a kind of early termination while feeding data … [{{listener.handleData(b, 0, m);}}|https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/txt/UniversalEncodingDetector.java#L48] In contrast, in this comparison I used a feed-all approach, i.e. {{detector.handleData(rawHtmlByteSequence, 0, rawHtmlByteSequence.length);}} (see the sketch right after this list).
* JCharDet cannot be substituted by JUniversalCharDet in IUST; see the iustWithJUniChar column.
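For reference, the feed-all usage looks roughly like this (a minimal sketch around juniversalchardet’s UniversalDetector; the wrapper method is illustrative, not the exact code of the attached project):
{code:java|title=Feed-all usage of JUniversalCharDet (illustrative)|borderStyle=solid}
import org.mozilla.universalchardet.UniversalDetector;

public class FeedAllSketch {

    public static String detect(byte[] rawHtmlByteSequence) {
        UniversalDetector detector = new UniversalDetector(null);
        // Feed the whole raw byte sequence in a single call ...
        detector.handleData(rawHtmlByteSequence, 0, rawHtmlByteSequence.length);
        // ... and only then signal that the input is finished.
        detector.dataEnd();
        String charset = detector.getDetectedCharset(); // may be null if undecided
        detector.reset();
        return charset;
    }
}
{code}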
Just before starting the evaluation, i.e. when I had completed the code
and configured my VPS, I decided to also test IUST without Markup Elimination,
to see how it behaves without this phase … and as you can see from the table
above, the result was surprising. For most languages (at least in this test),
the accuracy of IUST with Markup Elimination is lower than without it.
Nevertheless, IUST with Markup Elimination is more accurate overall.
I think this is a great finding, because we can use it to improve the
efficiency of IUST: we can run the Markup Elimination phase only for HTML
documents in specific languages, i.e. those for which skipping this phase causes
a tangible drop in accuracy, e.g. documents in Russian, Turkish or Arabic.
See the pseudocode below:
{code:none|title=The proposed algorithm (modified IUST)|borderStyle=solid}
FUNCTION String detect(byte[] htmlByteSequence, Optional Boolean lookInMeta ← True, Optional Language language ← None)
BEGIN
    IF (lookInMeta = True) THEN
        charset ← look for a potential charset inside the Meta tags of htmlByteSequence {by using a Regex}
        IF (charset is valid) THEN
            RETURN charset
        END IF
    END IF
    charset ← detect charset of htmlByteSequence by using Mozilla JCharDet {Ensemble Classification, step 1}
    IF (charset = UTF-8) THEN
        RETURN UTF-8
    END IF
    IF (language is in the [Arabic, Turkish, Russian, …] list) THEN {Markup Elimination}
        htmlByteSequence ← strip HTML markup, styles, CSS, … out of htmlByteSequence
        {stripping can be done either by a proper direct method or by the ISO-8859-1 decoding-encoding trick}
    END IF
    charset ← detect charset of htmlByteSequence by using IBM ICU4J {Ensemble Classification, step 2}
    RETURN charset
END
{OMG, a composite syntax from Pascal, VB, Java and even Smalltalk!}
{code}
Note that, in addition to a great improvement in efficiency, since this way we
get the best of both variants (IUST with and without Markup Elimination), we
also get a small improvement in accuracy, e.g. 1% in the table above.
The downside of this method is that the end programmer must know the language
of the HTML documents and pass it to the code.
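For illustration only, a rough Java sketch of the pseudocode above could look like the following. The ICU4J calls are real; the meta-tag regex, the Language enum, the language list and the UTF-8 check (a plain strict decode standing in for the JCharDet step) are all assumptions, not the actual IUST-HTMLCharDet code:
{code:java|title=Modified IUST sketch (illustrative)|borderStyle=solid}
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.EnumSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

public class ModifiedIustSketch {

    enum Language { ARABIC, TURKISH, RUSSIAN, OTHER } // illustrative

    // Languages for which skipping Markup Elimination costs noticeable accuracy
    // (only the ones named in the comment; the "…" of the pseudocode is left open).
    private static final Set<Language> NEEDS_MARKUP_ELIMINATION =
            EnumSet.of(Language.ARABIC, Language.TURKISH, Language.RUSSIAN);

    private static final Pattern META_CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([\\w-]+)", Pattern.CASE_INSENSITIVE);

    public static String detect(byte[] html, boolean lookInMeta, Language language) {
        if (lookInMeta) {
            Matcher m = META_CHARSET.matcher(new String(html, StandardCharsets.ISO_8859_1));
            if (m.find() && Charset.isSupported(m.group(1))) {
                return m.group(1);
            }
        }
        if (looksLikeUtf8(html)) {                              // ensemble step 1
            return "UTF-8";
        }
        if (language != null && NEEDS_MARKUP_ELIMINATION.contains(language)) {
            html = stripMarkup(html);                           // Markup Elimination
        }
        CharsetMatch match = new CharsetDetector().setText(html).detect(); // ensemble step 2
        return match != null ? match.getName() : null;
    }

    // Stand-in for the JCharDet-based UTF-8 check: a strict UTF-8 decode.
    private static boolean looksLikeUtf8(byte[] html) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(html));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // ISO-8859-1 decode/re-encode trick, as in the earlier icu4jStripped sketch.
    private static byte[] stripMarkup(byte[] html) {
        String text = new String(html, StandardCharsets.ISO_8859_1)
                .replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", " ")
                .replaceAll("(?s)<[^>]+>", " ");
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }
}
{code}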
To redo this evaluation, or even to do any other test, I can also share the
corpus (14 languages, 190,043 URLs, 115,751 docs, stored in SQLite tables,
compressed, total compressed size 1.33 GB, estimated decompressed size ~6 GB).
Also, if you’d like to redo this evaluation using another set of URLs, I’ve
attached a ready-to-run bundle/project (already built) to this issue. Just replace
the Alexa URLs in the ./corpus/urls/ path of the project with your own URLs and run the code.
Let me know your thoughts … and, of course, feel free to ask any questions about
this evaluation :)
p.s. Thanks to [@DigitalOcean| https://digitalocean.com/] for their great VPS
service.
> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
> Key: TIKA-2038
> URL: https://issues.apache.org/jira/browse/TIKA-2038
> Project: Tika
> Issue Type: Improvement
> Components: core, detector
> Reporter: Shabanali Faghani
> Priority: Minor
> Attachments: comparisons_20160803b.xlsx, comparisons_20160804.xlsx,
> iust_encodings.zip, lang-wise-eval_results.zip, lang-wise-eval_runnable.zip,
> lang-wise-eval_source_code.zip, tika_1_14-SNAPSHOT_encoding_detector.zip
>
>
> Currently, Tika uses icu4j for detecting the charset encoding of HTML documents
> as well as of other natural-text documents. But the accuracy of encoding
> detection tools, including icu4j, on HTML documents is meaningfully lower than
> on other text documents. Hence, in our project I developed a library that works
> pretty well for HTML documents, which is available here:
> https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within other Apache projects such as Nutch,
> Lucene, Solr, etc., and these projects deal heavily with HTML documents, it
> seems that having such a facility in Tika would also help them become more
> accurate.