[
https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15401715#comment-15401715
]
Shabanali Faghani commented on TIKA-2038:
-----------------------------------------
OK, so to give more details about my library to this community and also in
response to your concerns, I would like to say:
1) You are right, my repo on GitHub is fairly new (less than a year old), but its
algorithm is not. I developed this library 4 years ago for use in a large-scale
project… and it has worked well from then until now, under a load of ~1.2 billion
pages at peak time. The bug that I fixed last week was just a tiny mistake that
crept in while refactoring the code before the first release.
2) Since accuracy was much more important to us than performance, I haven’t done
a thorough performance test. Nevertheless, below are the results of a small test
run on my laptop (Intel Core i3, Java 6, Xmx: default (don’t care)):
||Subdirectory||#docs||Total Size (KB)||Average Size (KB)||Detection Time (ms)||Average Time per Doc (ms)||
|UTF-8|657|32,216|49|26658|40|
|Windows-1251|314|30,941|99|4423|14|
|GBK|419|43,374|104|20317|48|
|Windows-1256|645|66,592|103|9451|14|
|Shift_JIS|640|25,973|41|7617|11|
Let’s take a slightly closer look at these results. Due to the logic of my
algorithm, for the first row of this table, i.e. UTF-8, only Mozilla JCharDet
was used (no JSoup and no ICU4J). But as you can see from this table, the
required time is greater than in three of the four other cases, for which the
documents were parsed using JSoup and both JCharDet and ICU4J were involved in
the detection process. This means that if the encoding of a page is UTF-8, the
time required for a positive response from Mozilla JCharDet is often greater
than the time required to …
* get a negative response from Mozilla JCharDet +
* decode the input byte array using “ISO-8859-1” +
* parse that document and create a DOM tree +
* extract the text from the DOM tree +
* encode the extracted text using “ISO-8859-1” +
* and detect its encoding using ICU4J
… when the encoding of the page is not UTF-8!! In brief, 40 > 14, 11, … in the
above table.
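To make that fallback path concrete, here is a rough sketch (my own illustration, not the actual IUST-HTMLCharDet code) of the JSoup + ICU4J part of the flow described above; the JCharDet step is abstracted into a hypothetical helper:
{code:java}
import java.nio.charset.Charset;

import org.jsoup.Jsoup;

import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

public class HtmlCharsetSketch {

    private static final Charset ISO_8859_1 = Charset.forName("ISO-8859-1");

    /** Rough sketch of the detection flow described above. */
    public static String detect(byte[] rawHtml) {
        // Step 1: ask Mozilla JCharDet; on a positive UTF-8 answer we are done.
        // (jchardetGuess is a hypothetical wrapper around org.mozilla.intl.chardet.)
        String jchardetResult = jchardetGuess(rawHtml);
        if ("UTF-8".equalsIgnoreCase(jchardetResult)) {
            return "UTF-8";
        }

        // Step 2: decode the raw bytes with ISO-8859-1 (a lossless byte-to-char
        // mapping), parse with JSoup and extract the visible text from the DOM tree.
        String decoded = new String(rawHtml, ISO_8859_1);
        String visibleText = Jsoup.parse(decoded).text();

        // Step 3: re-encode the markup-free text with ISO-8859-1 and let ICU4J
        // detect the charset of the remaining content.
        CharsetDetector icu = new CharsetDetector();
        icu.setText(visibleText.getBytes(ISO_8859_1));
        CharsetMatch match = icu.detect();
        return match != null ? match.getName() : "UTF-8";
    }

    // Placeholder for the JCharDet call (an assumption, outside this sketch's scope).
    private static String jchardetGuess(byte[] bytes) {
        return null;
    }
}
{code}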
Now let’s have a look at the [distribution of character encodings for websites|https://w3techs.com/technologies/history_overview/character_encoding]. Since
~87% of all websites use UTF-8, if we compute a weighted average time for
detecting the encoding of a typical HTML document, I think we would get a
similar estimate for both IUST-HTMLCharDet and Tika-EncodingDetector, because
that estimate is strongly dominated by Mozilla JCharDet, and as we know this
tool is used in both algorithms in a similar way.
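As a back-of-envelope check of that claim (my own illustration, using the per-document averages from the table above and treating the four non-UTF-8 rows as roughly equally likely within the remaining ~13%):
{code:java}
public class WeightedAverageSketch {
    public static void main(String[] args) {
        double utf8Share = 0.87;                          // ~87% of websites use UTF-8
        double utf8AvgMs = 40.0;                          // UTF-8 row: JCharDet only
        double nonUtf8AvgMs = (14 + 48 + 14 + 11) / 4.0;  // ~21.75 ms: fallback path
        double weightedAvgMs = utf8Share * utf8AvgMs + (1 - utf8Share) * nonUtf8AvgMs;
        System.out.printf("weighted average ~ %.1f ms per document%n", weightedAvgMs);
        // prints ~37.6 ms, i.e. the estimate is dominated by the JCharDet/UTF-8 term
    }
}
{code}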
Nevertheless, for performance optimization I will run some tests on …
* using a regex, instead of navigating the DOM tree, to look for charsets in
meta tags (a rough sketch follows this list)
* stripping HTML markup, scripts and embedded CSS directly instead of using an
HTML parser
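For the first of these, this is the kind of regex I have in mind (just a sketch, not yet tested against the many meta-tag variants found in the wild):
{code:java}
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaCharsetRegexSketch {

    // Matches both <meta charset="..."> and
    // <meta http-equiv="Content-Type" content="text/html; charset=...">
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]+charset\\s*=\\s*[\"']?\\s*([\\w.:-]+)",
            Pattern.CASE_INSENSITIVE);

    /** Returns the declared charset from the (prefix of the) HTML, or null. */
    public static String findDeclaredCharset(String html) {
        Matcher m = META_CHARSET.matcher(html);
        return m.find() ? m.group(1) : null;
    }
}
{code}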
3) Regarding the accuracy of Tika’s legacy method, I’ve added a comment below
your current evaluation results. As I explained there, the results of your
current evaluation can’t be compared directly with my evaluation.
bq. Perhaps we could add some code to do that?
Of course, but from experience, when I use open-source projects in my own code,
I don’t move my code into them unless there is no other suitable option, because
of versioning and update considerations. Pulling other code into my projects,
however, is *another story*! :)
> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
> Key: TIKA-2038
> URL: https://issues.apache.org/jira/browse/TIKA-2038
> Project: Tika
> Issue Type: Improvement
> Components: core, detector
> Reporter: Shabanali Faghani
> Priority: Minor
> Attachments: tika_1_14-SNAPSHOT_encoding_detector.zip
>
>
> Currently, Tika uses ICU4J for detecting the charset encoding of HTML documents
> as well as other plain-text documents. But the accuracy of encoding detector
> tools, including ICU4J, is meaningfully lower for HTML documents than for
> other text documents. Hence, in our project I developed a library that works
> pretty well for HTML documents, which is available here:
> https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within some other Apache projects such as
> Nutch, Lucene, Solr, etc., and these projects deal heavily with HTML
> documents, it seems that having such a facility in Tika will also help them
> become more accurate.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)