[
https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15422448#comment-15422448
]
Shabanali Faghani commented on TIKA-2038:
-----------------------------------------
Sampling URLs from Common Crawl is a good idea. I was a bit worried about
selecting test URLs from Alexa top sites, because they don't properly reflect
real-world conditions.
bq. I can take care of this in a few weeks.
A couple of weeks ago, [~wastl-nagel] wrote to me that if we need more test and
evaluation data he can provide us with test data from Common Crawl. It would be
great if he could help us!
As I've seen on [Common Crawl's website|http://commoncrawl.org/], they support
40+ languages and their crawl archive for July 2016 contains ~1.73 billion web
pages. One idea is to sample URLs for each language/country-code TLD in
proportion to its share in Common Crawl's archive. That is, if German or .de
pages, for example, account for 5% of the archive, then we populate 5% of our
sample URLs (with a total size of up to 300,000) with German or .de URLs, as
sketched below. Also, I think up to 15 languages are sufficient for our test.
> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
> Key: TIKA-2038
> URL: https://issues.apache.org/jira/browse/TIKA-2038
> Project: Tika
> Issue Type: Improvement
> Components: core, detector
> Reporter: Shabanali Faghani
> Priority: Minor
> Attachments: comparisons_20160803b.xlsx, comparisons_20160804.xlsx,
> iust_encodings.zip, tika_1_14-SNAPSHOT_encoding_detector.zip
>
>
> Currently, Tika uses icu4j for detecting the charset encoding of HTML documents
> as well as other text documents. But the accuracy of encoding detector tools,
> including icu4j, on HTML documents is meaningfully lower than on other text
> documents. Hence, in our project I developed a library that works pretty well
> for HTML documents, which is available here:
> https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within some other Apache projects such as
> Nutch, Lucene, Solr, etc., and these projects deal heavily with HTML documents,
> it seems that having such a facility in Tika will also help them become more
> accurate.
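For context, a minimal sketch of what charset detection with icu4j's
CharsetDetector looks like; the wrapper class and the source of the HTML bytes
here are illustrative, not Tika's actual code path:

{code:java}
import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

public class Icu4jDetectionSketch {
    // Returns icu4j's best-guess charset name for the raw HTML bytes,
    // or null if no match is found.
    public static String detectCharset(byte[] htmlBytes) {
        CharsetDetector detector = new CharsetDetector();
        detector.setText(htmlBytes);
        CharsetMatch match = detector.detect();
        return match != null ? match.getName() : null;
    }
}
{code}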