[
https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15401715#comment-15401715
]
Shabanali Faghani commented on TIKA-2038:
-----------------------------------------
OK, so to give more details about my library to this community and also in
response to your concerns, I would like to say:
1) You are right, my repo on GitHub is fairly new (less than a year old), but its
algorithm is not. I developed this library 4 years ago for use in a large-scale
project… and it has worked well from then until now, under a load of ~1.2 billion
pages at peak time. The bug that I fixed last week was just a tiny mistake that
crept in while refactoring the code before the first release.
2) Since accuracy was much more important to us than performance, I haven’t done
a thorough performance test. Nevertheless, below are the results of a small test
run on my laptop (Intel Core i3, Java 6, Xmx: default (don’t care)):
||Subdirectory||#docs||Total Size (KB)||Average Size (KB)||Detection Time (ms)||Average Time per Doc (ms)||
|UTF-8|657|32,216|49|26658|40|
|Windows-1251|314|30,941|99|4423|14|
|GBK|419|43,374|104|20317|48|
|Windows-1256|645|66,592|103|9451|14|
|Shift_JIS|640|25,973|41|7617|11|
Let’s take a slightly closer look at these results. Due to the logic of my
algorithm, for the first row of this table, i.e. UTF-8, only Mozilla JCharDet
was used (no JSoup and no ICU4J). But as you can see from this table, the
required time is greater than in three of the four other cases, for which the
documents were parsed using JSoup and both JCharDet and ICU4J were involved in
the detection process. This means that if the encoding of a page is UTF-8, the
time required for a positive response from Mozilla JCharDet is often greater
than the time required to …
* get a negative response from Mozilla JCharDet +
* decode the input byte array using “ISO-8859-1” +
* parse that document and create a DOM tree +
* extract the text from the DOM tree +
* encode the extracted text using “ISO-8859-1” +
* and detect its encoding using ICU4J
… when the encoding of the page is not UTF-8!! In brief, 40 > 14, 11, … in the
above table.
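To make that fallback path concrete, here is a rough sketch (my own illustration, not the actual IUST-HTMLCharDet code) of the JSoup + ICU4J part of the flow described above; the JCharDet step is abstracted into a hypothetical helper:
{code:java}
import java.nio.charset.Charset;

import org.jsoup.Jsoup;

import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

public class HtmlCharsetSketch {

    private static final Charset ISO_8859_1 = Charset.forName("ISO-8859-1");

    /** Rough sketch of the detection flow described above. */
    public static String detect(byte[] rawHtml) {
        // Step 1: ask Mozilla JCharDet; on a positive UTF-8 answer we are done.
        // (jchardetGuess is a hypothetical wrapper around org.mozilla.intl.chardet.)
        String jchardetResult = jchardetGuess(rawHtml);
        if ("UTF-8".equalsIgnoreCase(jchardetResult)) {
            return "UTF-8";
        }

        // Step 2: decode the raw bytes with ISO-8859-1 (a lossless byte-to-char
        // mapping), parse with JSoup and extract the visible text from the DOM tree.
        String decoded = new String(rawHtml, ISO_8859_1);
        String visibleText = Jsoup.parse(decoded).text();

        // Step 3: re-encode the markup-free text with ISO-8859-1 and let ICU4J
        // detect the charset of the remaining content.
        CharsetDetector icu = new CharsetDetector();
        icu.setText(visibleText.getBytes(ISO_8859_1));
        CharsetMatch match = icu.detect();
        return match != null ? match.getName() : "UTF-8";
    }

    // Placeholder for the JCharDet call (an assumption, outside this sketch's scope).
    private static String jchardetGuess(byte[] bytes) {
        return null;
    }
}
{code}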
Now let’s have a look at the [distribution of character encodings for websites|https://w3techs.com/technologies/history_overview/character_encoding]. Since
~87% of all websites use UTF-8, if we compute a weighted average time for
detecting the encoding of a typical HTML document, I think we would get a
similar estimate for both IUST-HTMLCharDet and Tika-EncodingDetector, because
that estimate is strongly dominated by Mozilla JCharDet, and as we know this
tool is used in both algorithms in a similar way.
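As a back-of-envelope check of that claim (my own illustration, using the per-document averages from the table above and treating the four non-UTF-8 rows as roughly equally likely within the remaining ~13%):
{code:java}
public class WeightedAverageSketch {
    public static void main(String[] args) {
        double utf8Share = 0.87;                          // ~87% of websites use UTF-8
        double utf8AvgMs = 40.0;                          // UTF-8 row: JCharDet only
        double nonUtf8AvgMs = (14 + 48 + 14 + 11) / 4.0;  // ~21.75 ms: fallback path
        double weightedAvgMs = utf8Share * utf8AvgMs + (1 - utf8Share) * nonUtf8AvgMs;
        System.out.printf("weighted average ~ %.1f ms per document%n", weightedAvgMs);
        // prints ~37.6 ms, i.e. the estimate is dominated by the JCharDet/UTF-8 term
    }
}
{code}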
Nevertheless, for performance optimization I will run some tests on …
* using a regex, instead of navigating the DOM tree, to look for charsets in
meta tags (a rough sketch follows this list)
* stripping HTML markup, scripts and embedded CSS directly instead of using an
HTML parser
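For the first of these, this is the kind of regex I have in mind (just a sketch, not yet tested against the many meta-tag variants found in the wild):
{code:java}
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaCharsetRegexSketch {

    // Matches both <meta charset="..."> and
    // <meta http-equiv="Content-Type" content="text/html; charset=...">
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]+charset\\s*=\\s*[\"']?\\s*([\\w.:-]+)",
            Pattern.CASE_INSENSITIVE);

    /** Returns the declared charset from the (prefix of the) HTML, or null. */
    public static String findDeclaredCharset(String html) {
        Matcher m = META_CHARSET.matcher(html);
        return m.find() ? m.group(1) : null;
    }
}
{code}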
3) Regarding the accuracy of Tika’s legacy method, I’ve added a comment below
your current evaluation results. As I explained there, the results of your
current evaluation can’t be compared directly with my evaluation.
bq. Perhaps we could add some code to do that?
Of course, but from experience, when I use open-source projects in my own code,
I don’t move my code into them unless there is no other suitable option, because
of versioning and update considerations. Pulling other code into my projects,
however, is *another story*! :)
> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
> Key: TIKA-2038
> URL: https://issues.apache.org/jira/browse/TIKA-2038
> Project: Tika
> Issue Type: Improvement
> Components: core, detector
> Reporter: Shabanali Faghani
> Priority: Minor
> Attachments: tika_1_14-SNAPSHOT_encoding_detector.zip
>
>
> Currently, Tika uses ICU4J for detecting the charset encoding of HTML documents
> as well as other plain-text documents. But the accuracy of encoding detector
> tools, including ICU4J, is meaningfully lower for HTML documents than for
> other text documents. Hence, in our project I developed a library that works
> pretty well for HTML documents, which is available here:
> https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within some other Apache projects such as
> Nutch, Lucene, Solr, etc., and these projects deal heavily with HTML
> documents, it seems that having such a facility in Tika will also help them
> become more accurate.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)