[ 
https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15830525#comment-15830525
 ] 

Shabanali Faghani commented on TIKA-2038:
-----------------------------------------

After doing some work in my spare time, I’ve come back with a bunch of results. 
Since I didn’t have access to the URLs of Common Crawl, I inevitably did the 
evaluation using the URLs of the Alexa top sites. I’ve attached the source code 
(as an Eclipse project), a runnable jar (with all required resources) and the 
results of the evaluation to this issue. The results are prepared at fine-, 
mid- and coarse-grained granularity levels. The fine-grained results are stored 
in a SQLite table, so many further analyses can be generated simply by querying 
it. 

For a quick view, I’ve presented the coarse-grained results below. 
In addition to Tika, IUST and ICU4JStripped, I also tested JUniversalCharDet, 
IUSTWithJUniversalCharDet and IUSTWithoutMarkupElimination. Details of these 
algorithms are as follows (a short ICU4J code sketch follows this list):
# *icu4jStripped:* HTML markup, styles, CSS, … are stripped out and then the encoding is detected using ICU4J
# *jUniCharDet:* JUniversalCharDet
# *tika:* Tika’s legacy algorithm without Meta detection, i.e. JUniversalCharDet + ICU4J
# *iustWithJUniChar:* The IUST algorithm (with Meta detection off) but using JUniversalCharDet instead of JCharDet, i.e. JUniversalCharDet (for UTF-8) + ICU4J (with stripped input)
# *iustWithoutMarkupElim:* The IUST algorithm (with Meta detection off) without the Markup Elimination phase, i.e. JCharDet (for UTF-8) + ICU4J (without stripping the input)
# *iust:* The IUST algorithm (with Meta detection off), i.e. JCharDet (for UTF-8) + ICU4J (with stripped input)
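
For reference, here is a minimal sketch of the ICU4J call that all of the variants above end with. The wrapper class is only illustrative, not the evaluation code attached to this issue; the JCharDet/JUniversalCharDet steps and the markup stripping are sketched further below.

{code:java|title=ICU4J step (sketch)|borderStyle=solid}
import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

public class Icu4jStep {

    // Detect the charset of a byte sequence (stripped or raw, depending on
    // the variant) with ICU4J.
    static String detectWithIcu4j(byte[] byteSequence) {
        CharsetDetector detector = new CharsetDetector();
        // ICU4J also has a rough built-in filter that ignores text between
        // '<' and '>'; the stripped variants above instead remove markup,
        // styles, CSS, ... themselves before calling the detector.
        // detector.enableInputFilter(true);
        detector.setText(byteSequence);
        CharsetMatch match = detector.detect();
        return match == null ? null : match.getName();
    }
}
{code}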

… and the results (the underscore characters are only used to pad the corresponding column headers):
||____Language(s)____||#URLs||#have a valid charset in the HTTP header||#fetched successfully||icu4jStripped||__jUniCharDet__||_____tika_____||iustWithJUniChar||iustWithoutMarkupElim||_____iust______||
|*Arabic (.iq, …)*|1904|1175|1168|603 = %52|806 = %69|743 = %64|931 = %80|1107 = %95|*1136 = %97*|
|*Korean (.kr)*|1604|736|735|622 = %85|616 = %84|632 = %86|655 = %89|*704 = %96*|700 = %95|
|*Turkish (.tr)*|2545|1569|1561|467 = %30|1209 = %78|1282 = %82|1350 = %87|1466 = %94|*1498 = %96*|
|*Vietnamese (.vn)*|2721|1467|1463|968 = %66|1233 = %84|1265 = %87|1305 = %89|1446 = %99|*1448 = %99*|
|*Chinese (.cn)*|10086|3870|3855|1612 = %42|3411 = %89|3452 = %90|3555 = %92|*3748 = %97*|3725 = %97|
|*French (.fr)*|13156|8730|8713|1889 = %22|6555 = %75|7039 = %81|7197 = %83|*8266 = %95*|8076 = %93|
|*Japanese (.jp)*|20339|7747|7737|3483 = %45|6455 = %83|6753 = %87|6771 = %88|*7623 = %99*|7619 = %99|
|*Spanish (.es)*|10208|6304|6287|1571 = %25|4745 = %76|4914 = %78|5308 = %84|*6033 = %96*|5956 = %95|
|*Italian (.it)*|11512|7398|7372|1540 = %21|5646 = %77|4690 = %64|6201 = %84|*7074 = %96*|6970 = %95|
|*Indian, English (.in)*|12236|6174|6157|995 = %16|3190 = %52|1607 = %26|3559 = %58|*5987 = %97*|5973 = %97|
|*Persian (.ir)*|7396|4014|4000|3752 = %94|3738 = %94|3878 = %97|3899 = %98|3975 = %99|*3982 = %100*|
|*English (.uk, .us)*|21565|13610|13583|2653 = %20|7551 = %56|4870 = %36|8886 = %65|*13167 = %97*|13025 = %96|
|*Russian (.ru)*|39453|28289|28229|22406 = %79|24796 = %88|25449 = %90|25638 = %91|22813 = %81|*27563 = %98*|
|*German (.de)*|35318|24914|24891|5618 = %23|15655 = %63|17261 = %69|18321 = %74|*23552 = %95*|23095 = %93|
|*All Above (.all!)*|190043|115997|115751|48179 = %42|85606 = %74|83835 = %72|93576 = %81|106961 = %92|*110493 = %96*|

There are some amazing observations:
* Overall, IUST is 24 percentage points more accurate than Tika (96% vs. 72%)
* ICU4JStripped is not a good choice at all; in fact, it’s the worst choice!
* For 9 of all 14 languages, the accuracy of the IUST algorithm *with* Markup Elimination is less than that of IUST *without* Markup Elimination!
* Overall, the accuracy of Tika, i.e. 72%, is less than the accuracy of JUniversalCharDet, which is 74%!! This is an odd phenomenon, because JUniversalCharDet is a sub-component of Tika. I think it is due to the way JUniversalCharDet is used in Tika, i.e. a kind of early termination in data feeding … [{{listener.handleData(b, 0, m);}}|https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/txt/UniversalEncodingDetector.java#L48] In contrast, in this comparison I used a feed-all approach (see the sketch after this list), as follows … {{detector.handleData(rawHtmlByteSequence, 0, rawHtmlByteSequence.length);}}
* JCharDet cannot simply be replaced by JUniversalCharDet in IUST; see the iustWithJUniChar column
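
To make the difference between the two feeding strategies concrete, here is a minimal sketch with juniversalchardet’s {{UniversalDetector}}. It only illustrates the idea and is not a copy of Tika’s {{UniversalEncodingDetector}}; in particular, the chunk size and the stopping condition are assumptions on my side.

{code:java|title=Feed-all vs. early termination (sketch)|borderStyle=solid}
import org.mozilla.universalchardet.UniversalDetector;

public class FeedingStrategies {

    // Feed-all: hand the complete byte sequence to the detector in one call.
    static String feedAll(byte[] rawHtmlByteSequence) {
        UniversalDetector detector = new UniversalDetector(null);
        detector.handleData(rawHtmlByteSequence, 0, rawHtmlByteSequence.length);
        detector.dataEnd();
        return detector.getDetectedCharset();   // may be null if nothing was detected
    }

    // Early termination: feed fixed-size chunks and stop as soon as the
    // detector reports it is done (the chunk size here is illustrative).
    static String feedWithEarlyTermination(byte[] rawHtmlByteSequence) {
        UniversalDetector detector = new UniversalDetector(null);
        int chunk = 4096;
        for (int offset = 0;
             offset < rawHtmlByteSequence.length && !detector.isDone();
             offset += chunk) {
            int len = Math.min(chunk, rawHtmlByteSequence.length - offset);
            detector.handleData(rawHtmlByteSequence, offset, len);
        }
        detector.dataEnd();
        return detector.getDetectedCharset();
    }
}
{code}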

Just before I started to run the evaluation, i.e. when I had completed the code 
and configured my VPS, I decided to also test IUST without Markup Elimination 
to see how it behaves without this phase … and as you can see from the table 
above, the result was amazing. For most languages (at least in this test), the 
accuracy of IUST with Markup Elimination is lower than its accuracy when this 
phase isn’t done. However, IUST with Markup Elimination is more accurate 
overall.
I think this is a great finding because we can use it to improve the efficiency 
of IUST: we can do the Markup Elimination phase only for HTML docs in specific 
languages, namely those for which skipping this phase would cause a tangible 
drop in accuracy, e.g. HTML docs in Russian, Turkish and Arabic. See the pseudo 
code below:

{code:none|title=The proposed algorithm (modified IUST)|borderStyle=solid}
FUNCTION String detect(byte[] htmlByteSequence, Optional Boolean lookInMeta ← True, Optional Language language ← None)
BEGIN
        IF (lookInMeta = True) THEN
                charset ← look for a potential charset inside the Meta tags of htmlByteSequence  {by using a Regex}
                IF (charset is valid) THEN
                        RETURN charset
        END IF
        charset ← detect charset of htmlByteSequence by using Mozilla JCharDet  {Ensemble Classification, step 1}
        IF (charset = UTF-8) THEN
                RETURN UTF-8
        IF (language is in [Arabic, Turkish, Russian, …] list) THEN  {Markup Elimination}
                htmlByteSequence ← strip out HTML Markups, Styles, CSSs, … from htmlByteSequence
                {stripping can be done either by using a perfect direct method or the ISO-8859-1 decoding-encoding trick}
        END IF
        charset ← detect charset of htmlByteSequence by using IBM ICU4J  {Ensemble Classification, step 2}
        RETURN charset
END
{OMG, a composite syntax from Pascal, VB, Java and even Smalltalk!}
{code}
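
And, for what it’s worth, here is how I imagine the pseudo code above could look in Java. This is only a sketch, not the implementation attached to this issue: the {{Language}} enum, the set of languages that get the Markup Elimination phase, and the regex-based {{stripMarkup}} (using the ISO-8859-1 decoding-encoding trick) are illustrative placeholders; only the JCharDet ({{nsDetector}}) and ICU4J ({{CharsetDetector}}) calls are the libraries’ public APIs.

{code:java|title=Modified IUST in Java (sketch)|borderStyle=solid}
import java.nio.charset.StandardCharsets;
import java.util.EnumSet;
import org.mozilla.intl.chardet.nsDetector;
import org.mozilla.intl.chardet.nsPSMDetector;
import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

public class ModifiedIustSketch {

    enum Language { ARABIC, TURKISH, RUSSIAN, OTHER }

    // Illustrative: languages for which skipping Markup Elimination costs
    // accuracy (my examples above: Russian, Turkish, Arabic); a fuller list
    // would come from a table like the one in this comment.
    static final EnumSet<Language> STRIP_LANGUAGES =
            EnumSet.of(Language.ARABIC, Language.TURKISH, Language.RUSSIAN);

    static String detect(byte[] htmlByteSequence, Language language) {
        // Meta detection (a regex over the <meta> tags) is omitted here,
        // as it was off in the evaluation.

        // Ensemble classification, step 1: JCharDet, used only to recognize UTF-8.
        final String[] jchardetResult = new String[1];
        nsDetector jchardet = new nsDetector(nsPSMDetector.ALL);
        jchardet.Init(charset -> jchardetResult[0] = charset);
        if (!jchardet.isAscii(htmlByteSequence, htmlByteSequence.length)) {
            jchardet.DoIt(htmlByteSequence, htmlByteSequence.length, false);
        }
        jchardet.DataEnd();
        if ("UTF-8".equals(jchardetResult[0])) {
            return "UTF-8";
        }

        // Markup Elimination, but only for languages where it pays off.
        if (language != null && STRIP_LANGUAGES.contains(language)) {
            htmlByteSequence = stripMarkup(htmlByteSequence);
        }

        // Ensemble classification, step 2: ICU4J.
        CharsetDetector icu4j = new CharsetDetector();
        icu4j.setText(htmlByteSequence);
        CharsetMatch match = icu4j.detect();
        return match == null ? null : match.getName();
    }

    // The ISO-8859-1 decoding-encoding trick: decode losslessly, strip tags as
    // text, re-encode. The regexes are a crude stand-in for a real markup stripper.
    static byte[] stripMarkup(byte[] htmlByteSequence) {
        String latin1 = new String(htmlByteSequence, StandardCharsets.ISO_8859_1);
        String stripped = latin1
                .replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", " ")
                .replaceAll("(?s)<[^>]*>", " ");
        return stripped.getBytes(StandardCharsets.ISO_8859_1);
    }
}
{code}

The only API difference from the pseudo code is that the optional language hint is an explicit parameter, which is exactly the downside discussed next.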

Note that, in addition to a great improvement in efficiency, since this way we 
get the best of both worlds (IUST with and without Markup Elimination), we also 
get a slight improvement in accuracy, e.g. 1% in the table above. However, the 
downside of this method is that the end-programmer must know the language of 
the HTML docs and pass it to the code.

To redo this evaluation, or for any other test, I can also share the corpus 
(14 languages, 190,043 URLs, 115,751 docs, stored in SQLite tables, compressed, 
total compressed size 1.33 GB, estimated decompressed size ~6 GB). Also, if 
you’d like to redo this evaluation with another set of URLs, I’ve attached a 
ready-to-run (built) bundle/project to this issue. Just replace the Alexa URLs 
under the ./corpus/urls/ path of the project with your URLs and run the code.

Let me know your thoughts …, and of course, feel free to ask any question about 
this evaluation :)

p.s. Thanks to [@DigitalOcean| https://digitalocean.com/] for their great VPS 
service.

> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
>                 Key: TIKA-2038
>                 URL: https://issues.apache.org/jira/browse/TIKA-2038
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, detector
>            Reporter: Shabanali Faghani
>            Priority: Minor
>         Attachments: comparisons_20160803b.xlsx, comparisons_20160804.xlsx, 
> iust_encodings.zip, lang-wise-eval_results.zip, lang-wise-eval_runnable.zip, 
> lang-wise-eval_source_code.zip, tika_1_14-SNAPSHOT_encoding_detector.zip
>
>
> Currently, Tika uses icu4j for detecting the charset encoding of HTML documents 
> as well as of other natural-text documents. But the accuracy of encoding 
> detector tools, including icu4j, on HTML documents is meaningfully lower than 
> on other text documents. Hence, in our project I developed a library that 
> works pretty well for HTML documents, which is available here: 
> https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within other Apache projects such as 
> Nutch, Lucene, Solr, etc., and these projects deal heavily with HTML 
> documents, it seems that having such a facility in Tika would also help them 
> become more accurate.


