[
https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15402016#comment-15402016
]
Tim Allison commented on TIKA-2038:
-----------------------------------
bq. 1) You are right, my repo on GitHub is fairly new (less than a year old), but
its algorithm is not new. I developed this library 4 years ago for use in a
large-scale project… and it has worked well from that time till now. At peak it
was under a load of ~1.2 billion pages. The bug I fixed last week was just a
tiny mistake introduced while refactoring the code before the first release.
Yes, we all make mistakes. Given the importance of this component to Tika, I'm
now thinking that I'd much prefer to incorporate any modifications directly
into Tika, so that we'll be able to make changes quickly in the future whether
or not you are still active on your project. If you'd like to create a pull
request or a patch, that would be great.
On 2) above, your point is well taken: we'd only do the double parsing if the
page does not contain a meta header and UniversalChardet doesn't identify
"UTF-8", per your algorithm.
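That decision rule can be sketched roughly as follows. This is a hypothetical illustration of the fallback condition described above, not Tika's or IUST-HTMLCharDet's actual API; the class and method names (`CharsetDecision`, `metaCharset`, `shouldDoubleParse`) are invented for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharsetDecision {

    // Matches both <meta charset="..."> and
    // <meta http-equiv="Content-Type" content="text/html; charset=...">
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]+charset\\s*=\\s*[\"']?([\\w-]+)",
            Pattern.CASE_INSENSITIVE);

    /** Returns the charset declared in a meta tag, or null if none is present. */
    public static String metaCharset(String html) {
        Matcher m = META_CHARSET.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    /**
     * Fall back to the expensive double parse only when the page declares no
     * meta charset AND the statistical detector did not report UTF-8.
     */
    public static boolean shouldDoubleParse(String metaCharset, String detectedCharset) {
        return metaCharset == null && !"UTF-8".equalsIgnoreCase(detectedCharset);
    }
}
```

So a page with an explicit meta charset, or one the detector already labels UTF-8, would skip the second parse entirely; only the remaining pages pay the double-parsing cost.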
Some other thoughts:
We're using UniversalChardet, not jchardet. Have you evaluated any differences
between those?
We're using TagSoup, not jsoup. Have you evaluated any differences between those?
(See TIKA-1599)
Did you evaluate your algorithm on a held-out set, or are you testing on the
training data?
Anyway, many thanks for raising this issue.
> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
> Key: TIKA-2038
> URL: https://issues.apache.org/jira/browse/TIKA-2038
> Project: Tika
> Issue Type: Improvement
> Components: core, detector
> Reporter: Shabanali Faghani
> Priority: Minor
> Attachments: iust_encodings.zip,
> tika_1_14-SNAPSHOT_encoding_detector.zip
>
>
> Currently, Tika uses icu4j for detecting the charset encoding of HTML documents,
> as well as of other natural-text documents. But the accuracy of encoding
> detection tools, including icu4j, on HTML documents is meaningfully lower
> than on other text documents. Hence, for our project I developed a library
> that works pretty well for HTML documents, which is available here:
> https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within other Apache projects such as
> Nutch, Lucene, Solr, etc., and since these projects deal heavily with HTML
> documents, it seems that having such a facility in Tika would also help
> them become more accurate.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)