[jira] [Comment Edited] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents

Tim Allison (JIRA) Fri, 12 Aug 2016 11:40:50 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15418855#comment-15418855
 ]


Tim Allison edited comment on TIKA-2038 at 8/12/16 6:40 PM:
------------------------------------------------------------

bq.  But since I haven’t access to a broadband Internet connection

Oh, ok. I've been thinking about this a bit more.  I think I'd like to sample 
URLs from Common Crawl based on the country codes in the URLs.  I can take care 
of this in a few weeks.

bq. Please send me your markup stripper so I can use it in my code to evaluate 
your both stripper and proposed algorithm.
I'll post that today...if I have time.


bq. BTW, what is tika-eval code?

The code [here|https://github.com/tballison/tika/tree/TIKA-1302] still needs some 
work, but it evaluates the output of two runs of Tika and reports on 
differences in the number of exceptions, MIME detection diffs, content diffs, 
etc.  I was hoping to have time to get this ready for 1.14, but 1.15 is looking 
more likely.

You can see an example of the output of the comparison code 
[here|https://github.com/tballison/share/blob/master/poi_comparisons/reports_poi_3_15-beta3_reports.zip?raw=true].
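
As a rough illustration of the kind of comparison described above, here is a minimal Python sketch. It is not the tika-eval code: the `ExtractResult` record, its fields, and the `compare_runs` function are hypothetical names made up for this example, and it assumes each run has already been reduced to one record per file.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ExtractResult:
    """Hypothetical per-file record from one extraction run."""
    mime: str                        # detected MIME type
    content: str                     # extracted text
    exception: Optional[str] = None  # exception name, if parsing failed


def compare_runs(run_a: Dict[str, ExtractResult],
                 run_b: Dict[str, ExtractResult]) -> dict:
    """Summarize differences between two runs, keyed by file name."""
    # Only files present in both runs can be compared field by field.
    shared = sorted(run_a.keys() & run_b.keys())
    return {
        "exceptions_a": sum(1 for r in run_a.values() if r.exception),
        "exceptions_b": sum(1 for r in run_b.values() if r.exception),
        "mime_diffs": [(f, run_a[f].mime, run_b[f].mime)
                       for f in shared if run_a[f].mime != run_b[f].mime],
        "content_diffs": [f for f in shared
                          if run_a[f].content != run_b[f].content],
    }
```

A real comparison would presumably report much more than this (the linked report is the place to look for that); the sketch only shows the basic shape of diffing two runs file by file.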


was (Author: [email protected]):
bq.  But since I haven’t access to a broadband Internet connection

Oh, ok. I've been thinking about this a bit more.  I think I'd like to sample 
URLs from Common Crawl based on the country codes in the URLs.  I can take care 
of this in a few weeks.

bq. Please send me your markup stripper so I can use it in my code to evaluate 
your both stripper and proposed algorithm.
I'll post that today.


bq. BTW, what is tika-eval code?

The code [here|https://github.com/tballison/tika/tree/TIKA-1302] still needs some 
work, but it evaluates the output of two runs of Tika and reports on 
differences in the number of exceptions, MIME detection diffs, content diffs, 
etc.  I was hoping to have time to get this ready for 1.14, but 1.15 is looking 
more likely.


> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
>                 Key: TIKA-2038
>                 URL: https://issues.apache.org/jira/browse/TIKA-2038
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, detector
>            Reporter: Shabanali Faghani
>            Priority: Minor
>         Attachments: comparisons_20160803b.xlsx, comparisons_20160804.xlsx, 
> iust_encodings.zip, tika_1_14-SNAPSHOT_encoding_detector.zip
>
>
> Currently, Tika uses icu4j to detect the charset encoding of HTML documents 
> as well as other text-based documents. But the accuracy of encoding 
> detection tools, including icu4j, is meaningfully lower for HTML documents 
> than for other text documents. Hence, in our 
> project I developed a library that works pretty well for HTML documents, 
> which is available here: https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within other Apache projects such as 
> Nutch, Lucene, Solr, etc., and these projects deal heavily with 
> HTML documents, having such a facility in Tika should also 
> help them become more accurate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
