[ 
https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15857043#comment-15857043
 ] 

Shabanali Faghani edited comment on TIKA-2038 at 2/8/17 9:11 AM:
-----------------------------------------------------------------

bq. I recognize that the mime types returned by the server are not necessarily 
correct, but this data might be useful.
Years ago, when I was a novice Java developer, I worked with mime types for a 
while, and I know they are unreliable. Hence, I’m very reluctant to use them 
for separating html documents. In this regard I suggest an “arrow with two 
targets” (a Persian proverb) solution! Since it seems that in this test the 
potential charset in the meta headers is the only thing available to use as 
“ground truth”, if we use Tika’s 
[HtmlEncodingDetector|https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlEncodingDetector.java]
 class (with its [META_TAG_BUFFER_SIZE 
|https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlEncodingDetector.java#L42]
 field set to Integer.MAX_VALUE), then in addition to extracting potential 
charsets from the meta headers, it will implicitly act as an html filter.
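
For illustration, a minimal sketch of this setup. Note that 
META_TAG_BUFFER_SIZE is a private constant in Tika, so raising it to 
Integer.MAX_VALUE assumes a small source patch; the class and method names 
below are hypothetical, not part of Tika.

{code:java}
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.html.HtmlEncodingDetector;

public class MetaCharsetGroundTruth {

    /**
     * Returns the charset declared in the document's meta headers, or null
     * if none was found. Documents that yield null are dropped from the
     * corpus, so the detector implicitly doubles as an html filter.
     */
    static Charset metaCharset(Path file) throws IOException {
        // BufferedInputStream supports mark/reset, which the detector needs
        try (InputStream in = new BufferedInputStream(Files.newInputStream(file))) {
            return new HtmlEncodingDetector().detect(in, new Metadata());
        }
    }
}
{code}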

I think we must also throw away documents with multiple charsets in their meta 
headers (see TIKA-2050). This way we can also get rid of rss/feed documents 
whose mime types were set to html (we had some trouble with these documents in 
our project years ago). 
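
A rough sketch of such a filter, assuming a naive regex scan over the raw 
bytes is acceptable for corpus cleanup (the class name and pattern below are 
illustrative, not taken from TIKA-2050):

{code:java}
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MultiCharsetFilter {

    // Matches both <meta charset="..."> and
    // <meta http-equiv="Content-Type" content="text/html; charset=...">
    private static final Pattern META_CHARSET =
            Pattern.compile("<meta[^>]*charset\\s*=", Pattern.CASE_INSENSITIVE);

    /** True if the document declares more than one charset in its meta tags. */
    static boolean hasMultipleCharsets(byte[] html) {
        // ISO-8859-1 maps every byte to a char, so the markup survives
        // even though the real encoding is still unknown at this point
        Matcher m = META_CHARSET.matcher(new String(html, StandardCharsets.ISO_8859_1));
        int count = 0;
        while (m.find()) {
            if (++count > 1) {
                return true;
            }
        }
        return false;
    }
}
{code}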

bq. If the goal is to get ~30k per tld, let's sample to obtain 50k on the 
theory that there are duplicates and other reasons for failure.
I think it would be better to use the idea in [this 
post|https://issues.apache.org/jira/browse/TIKA-2038?focusedCommentId=15422448&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15422448]
 for sampling. I will try to describe the idea in detail in the next few days.

bq. Any other tlds or mime defs we should add?
I suggest adding *.mx* (Mexico), *.co* (Colombia), and *.ar* (Argentina) in 
addition to *.es* for Spanish (the 2nd-ranked language by [native 
speakers|https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers]). 
There is also no tld in your list for Portuguese, so I suggest adding *.br* 
(Brazil) and *.pt* (Portugal). *.id* (Indonesia), *.my* (Malaysia), *.nl* 
(Netherlands), … are some other important tlds.


> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
>                 Key: TIKA-2038
>                 URL: https://issues.apache.org/jira/browse/TIKA-2038
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, detector
>            Reporter: Shabanali Faghani
>            Priority: Minor
>         Attachments: comparisons_20160803b.xlsx, comparisons_20160804.xlsx, 
> iust_encodings.zip, lang-wise-eval_results.zip, lang-wise-eval_runnable.zip, 
> lang-wise-eval_source_code.zip, proposedTLDSampling.csv, 
> tika_1_14-SNAPSHOT_encoding_detector.zip
>
>
> Currently, Tika uses icu4j for detecting the charset encoding of HTML 
> documents as well as of other plain-text documents. But the accuracy of 
> encoding detection tools, including icu4j, on HTML documents is meaningfully 
> lower than on other text documents. Hence, in our project I developed a 
> library that works pretty well for HTML documents, which is available here: 
> https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within other Apache projects such as 
> Nutch, Lucene, Solr, etc., and these projects deal heavily with HTML 
> documents, it seems that having such a facility in Tika will also help them 
> become more accurate.


