[ https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15407489#comment-15407489 ]

Shabanali Faghani commented on TIKA-2038:
-----------------------------------------

bq. we're using UniversalChardet, not jchardet. Have you evaluated any 
differences between those?

Yes, I evaluated almost all of the tools that existed in this context at that 
time, about 4 years ago. In my tests, when the encoding of a page was UTF-8, 
jchardet was ~11% more accurate than juniversalchardet, i.e. 100% vs. 89% 
(these percentages depend heavily on the language of the pages being tested; I 
will explain this in more detail later). I know jchardet is available in Maven 
Central, but its source code isn't hosted there. I've read in [this 
paper|http://cs229.stanford.edu/proj2007/KimPark-AutomaticDetectionOfCharacterEncodingAndLanguages.pdf]
 that Mozilla CharDet is open source; given the dates of jchardet (2003), the 
paper (2007) and juniversalchardet (2011), I think they mean jchardet.
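
For reference, comparing the two detectors comes down to calls like these (a 
minimal sketch, not the exact harness I used; the no-op observer and the 
single-buffer feed are simplifications):

{code:java}
import org.mozilla.intl.chardet.nsDetector;
import org.mozilla.universalchardet.UniversalDetector;

public class DetectorComparison {

    // jchardet: feed the raw bytes once, then ask for the most probable charsets
    static String detectWithJchardet(byte[] html) {
        nsDetector detector = new nsDetector();
        detector.Init(charset -> { /* callback not needed in this sketch */ });
        detector.DoIt(html, html.length, false);
        detector.DataEnd();
        String[] probable = detector.getProbableCharsets();
        return probable.length > 0 ? probable[0] : null;  // may be "nomatch" if nothing fits
    }

    // juniversalchardet: same bytes, single best guess (may be null)
    static String detectWithJuniversalchardet(byte[] html) {
        UniversalDetector detector = new UniversalDetector(null);
        detector.handleData(html, 0, html.length);
        detector.dataEnd();
        return detector.getDetectedCharset();
    }
}
{code}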

bq. We're using tagsoup, not jsoup. Have you evaluated any diffs between those?

No, but I wonder why you don't use Jsoup! Its selectors are very handy, and as 
far as I know neither TagSoup nor NekoHTML has such a facility. Also, its 
straightforward API results in very clean code… and it is most likely faster 
than the others.
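
Just to illustrate the kind of convenience I mean (this is only a sketch, not 
code from my library), pulling the declared charset hints and the visible text 
out of a page is a couple of selector calls with Jsoup:

{code:java}
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupSketch {

    static void inspect(String html) {
        Document doc = Jsoup.parse(html);

        // declared charset, either <meta charset=...> or the old http-equiv form
        for (Element meta : doc.select("meta[charset], meta[http-equiv=Content-Type]")) {
            String hint = meta.hasAttr("charset")
                    ? meta.attr("charset")
                    : meta.attr("content");   // e.g. "text/html; charset=UTF-8"
            System.out.println("declared charset hint: " + hint);
        }

        // visible text only, with scripts and styles stripped out
        doc.select("script, style").remove();
        System.out.println(doc.body().text());
    }
}
{code}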

bq. Did you evaluate your algorithm on a held-out set, or are you testing on 
training?

I'm not sure what exactly you mean by "held-out set", but if you are referring 
to something like [these commented 
lines|https://github.com/shabanali-faghani/IUST-HTMLCharDet/blob/master/src/test/java/encodingwise/Evaluation.java#L31]
 I'd say yes, although I don't have any ready evaluation results at hand. To 
recreate these sets you can comment/uncomment [these method calls and their 
corresponding charset 
lines|https://github.com/shabanali-faghani/IUST-HTMLCharDet/blob/master/src/test/java/encodingwise/corpus/SeedsCrawler.java#L44]
 and run the code.
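
If by "held-out set" you simply mean keeping part of the corpus aside for 
evaluation and never tuning on it, recreating such a split would amount to 
something like this (a generic sketch with a hypothetical corpus directory, not 
the code in the repository):

{code:java}
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

public class HeldOutSplit {

    public static void main(String[] args) throws IOException {
        // collect the corpus pages; the "corpus" directory layout is hypothetical
        List<Path> pages = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(Paths.get("corpus"), "*.html")) {
            stream.forEach(pages::add);
        }

        // fixed seed so the split is reproducible across runs
        Collections.shuffle(pages, new Random(42));

        int cut = (int) (pages.size() * 0.8);
        List<Path> training = pages.subList(0, cut);              // used while tuning
        List<Path> heldOut  = pages.subList(cut, pages.size());   // evaluated only once, at the end

        System.out.println("training: " + training.size() + ", held-out: " + heldOut.size());
    }
}
{code}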

If there is any specific objective, I can go further and check whether the 
evaluation is effectively testing on training data.

> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
>                 Key: TIKA-2038
>                 URL: https://issues.apache.org/jira/browse/TIKA-2038
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, detector
>            Reporter: Shabanali Faghani
>            Priority: Minor
>         Attachments: comparisons_20160803b.xlsx, iust_encodings.zip, 
> tika_1_14-SNAPSHOT_encoding_detector.zip
>
>
> Currently, Tika uses icu4j for detecting the charset encoding of HTML documents 
> as well as other plain text documents. But the accuracy of encoding 
> detector tools, including icu4j, on HTML documents is 
> meaningfully lower than on other text documents. Hence, in our 
> project I developed a library that works pretty well for HTML documents, 
> which is available here: https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within other Apache projects such as 
> Nutch, Lucene, Solr, etc., and these projects deal heavily with 
> HTML documents, it seems that having such a facility in Tika would also 
> help them become more accurate.


