[ 
https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15854930#comment-15854930
 ] 

Shabanali Faghani commented on TIKA-2038:
-----------------------------------------

You’re welcome, Tim.
bq. If you have a complete list of the TLDs we should sample from, that'd be 
useful.

I don’t have a complete list of the TLDs at hand, but the method I used to find 
the TLDs corresponding to some languages may be helpful. IIRC, I did a kind of 
SQL inner join between [List of official 
languages|https://en.wikipedia.org/wiki/List_of_official_languages] and 
[Country code top-level 
domains|https://en.wikipedia.org/wiki/List_of_Internet_top-level_domains#Country_code_top-level_domains]
 to extract the ccTLDs for each language. Just now I also found [another page 
on 
Wikipedia|https://en.wikipedia.org/wiki/List_of_languages_by_the_number_of_countries_in_which_they_are_recognized_as_an_official_language]
 that is more succinct and very handy. I think an inner join between [Official 
languages by 
country|https://en.wikipedia.org/wiki/List_of_languages_by_the_number_of_countries_in_which_they_are_recognized_as_an_official_language]
 and [Country code top-level 
domains|https://en.wikipedia.org/wiki/List_of_Internet_top-level_domains#Country_code_top-level_domains]
 is a reliable way to put each TLD in the right language list, but with some 
caveats. For example, the [*.ch*|https://en.wikipedia.org/wiki/.ch] ccTLD 
probably should not be counted for German, because Italian is also an official 
language of Switzerland. It might also be useful to include some multi-language 
cases in our test, like *Indian, English* in the table above, which corresponds 
to the *.in* TLD.
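For illustration, here is a minimal Java sketch of that join. The table data is 
just a hand-picked excerpt, not the full Wikipedia lists, and the class and 
variable names are only for demonstration. Countries with more than one 
official language are skipped, which covers the *.ch* consideration above:

{code:java}
import java.util.*;

/**
 * Sketch of the "inner join" between the official-languages-by-country
 * table and the ccTLD table. Sample data is illustrative only.
 */
public class LangToTldJoin {

    public static void main(String[] args) {
        // country -> official language(s)
        Map<String, List<String>> officialLanguages = new HashMap<>();
        officialLanguages.put("Germany", Arrays.asList("German"));
        officialLanguages.put("Switzerland",
                Arrays.asList("German", "French", "Italian", "Romansh"));
        officialLanguages.put("India", Arrays.asList("Hindi", "English"));
        officialLanguages.put("Japan", Arrays.asList("Japanese"));

        // country -> country code TLD
        Map<String, String> ccTld = new HashMap<>();
        ccTld.put("Germany", ".de");
        ccTld.put("Switzerland", ".ch");
        ccTld.put("India", ".in");
        ccTld.put("Japan", ".jp");

        // inner join on country; multilingual countries are dropped so
        // that, e.g., .ch is not attributed to German alone
        Map<String, Set<String>> langToTlds = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : officialLanguages.entrySet()) {
            String tld = ccTld.get(e.getKey());
            if (tld == null || e.getValue().size() > 1) {
                continue;
            }
            langToTlds.computeIfAbsent(e.getValue().get(0),
                    k -> new TreeSet<>()).add(tld);
        }

        // prints {German=[.de], Japanese=[.jp]}
        System.out.println(langToTlds);
    }
}
{code}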


> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
>                 Key: TIKA-2038
>                 URL: https://issues.apache.org/jira/browse/TIKA-2038
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, detector
>            Reporter: Shabanali Faghani
>            Priority: Minor
>         Attachments: comparisons_20160803b.xlsx, comparisons_20160804.xlsx, 
> iust_encodings.zip, lang-wise-eval_results.zip, lang-wise-eval_runnable.zip, 
> lang-wise-eval_source_code.zip, tika_1_14-SNAPSHOT_encoding_detector.zip
>
>
> Currently, Tika uses icu4j for detecting the charset encoding of HTML 
> documents as well as other plain-text documents. But the accuracy of encoding 
> detection tools, including icu4j, on HTML documents is meaningfully lower 
> than on other text documents. Hence, in our project I developed a library 
> that works pretty well for HTML documents, which is available here: 
> https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within other Apache projects such as 
> Nutch, Lucene, Solr, etc., and these projects deal heavily with HTML 
> documents, it seems that having such a facility in Tika would help them 
> become more accurate as well.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
