[
https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15885670#comment-15885670
]
Tim Allison commented on TIKA-2038:
-----------------------------------
bq. because, the meta headers are removed after stripping a html document!
The current version of the stripper leaves in <meta > headers if they also
include "charset". I was toying with the notion of stripping only once, rather
than for each of the charset detectors. I included the output of the stripped
HTMLMeta detector as a sanity check to make sure that my stripper didn't
take too much out.
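The stripping behavior described above (drop markup, but keep any <meta > tag that declares a charset) could be sketched roughly like this. This is not the actual stripper code, just a minimal regex-based illustration; class and method names are placeholders:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HtmlStripper {

    // Matches any tag; meta tags mentioning "charset" are preserved below.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    public static String strip(String html) {
        Matcher m = TAG.matcher(html);
        StringBuilder sb = new StringBuilder();
        int last = 0;
        while (m.find()) {
            sb.append(html, last, m.start());
            String tag = m.group().toLowerCase();
            // Leave in <meta ...> headers that declare a charset
            if (tag.startsWith("<meta") && tag.contains("charset")) {
                sb.append(m.group());
            }
            last = m.end();
        }
        sb.append(html.substring(last));
        return sb.toString();
    }
}
```

Stripping once up front (rather than per detector) would mean running something like this a single time and handing the same stripped bytes to each charset detector.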
bq. I couldn’t exactly understand what you mean from training
This was a preliminary run, and I figure that we'll be modifying the stripper
and possibly IUST. I want to leave a held-out set for the final run/eval. I
don't want to report "testing on training".
bq. Also I can’t perceive why you didn’t involve Tika and IUST in your
comparison!
Tika can be computed from the results algorithmically (if html is null, use
Universal; if Universal is null, use ICU4j). I didn't use IUST because this
was a preliminary run, and I wasn't sure which version I should use: the one
on GitHub, the proposed modification
[above|https://issues.apache.org/jira/browse/TIKA-2038?focusedCommentId=15830525&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15830525
], or both? Let me know which code you'd like me to run.
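The fallback cascade used to derive Tika's result from the per-detector columns amounts to this (a sketch of the derivation, not Tika's actual API; parameter names are placeholders for the detector outputs):

```java
public class DetectorCascade {

    /**
     * Derives the "Tika" answer from the individual detector results:
     * prefer the HTML meta result; if it is null, fall back to
     * UniversalChardet; if that is also null, fall back to ICU4j.
     */
    public static String tikaResult(String htmlMeta, String universal, String icu4j) {
        if (htmlMeta != null) {
            return htmlMeta;
        }
        if (universal != null) {
            return universal;
        }
        return icu4j;
    }
}
```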
bq. It maybe would be nice to do also efficiency/speed test for the algorithms.
I want to focus on accuracy first. We still have to settle on an eval method.
But, yes, I do want to look at this.
bq. If the http header is available for all documents
It is. I didn't have time to join the two tables. I will. Not all headers
included encoding information, of course.
bq. (with or without medias?)
Media is included only if it is inlined in the HTML. I did not pull references
to images, etc.
bq. just as a subsidiary note about TIKA-2273,
Yes. I'm hoping that making it configurable and adding documentation will help.
There are still some required improvements in TIKA-2273.
bq. I suggest doing a peripheral study about the number of html documents that
have charset in http and meta header.
As another study, I calculated how many times a page has a meta header for
charset _and_ how far into the page that meta header appears. I'll share
that shortly.
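Measuring "how far into the page" the charset declaration appears could look something like the following. A hypothetical sketch, not the measurement code used for the study; it reports the character offset of the first meta charset declaration, or -1 if none is present:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaOffset {

    // Crude match for a meta tag that mentions "charset" anywhere
    // inside it (covers both <meta charset=...> and
    // <meta http-equiv="Content-Type" content="...; charset=...">).
    private static final Pattern META_CHARSET =
            Pattern.compile("<meta[^>]*charset[^>]*>", Pattern.CASE_INSENSITIVE);

    public static int charsetOffset(String html) {
        Matcher m = META_CHARSET.matcher(html);
        return m.find() ? m.start() : -1;
    }
}
```

The interesting distribution is how often that offset exceeds the number of bytes a detector typically reads before giving up on finding a declaration.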
> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
> Key: TIKA-2038
> URL: https://issues.apache.org/jira/browse/TIKA-2038
> Project: Tika
> Issue Type: Improvement
> Components: core, detector
> Reporter: Shabanali Faghani
> Priority: Minor
> Attachments: comparisons_20160803b.xlsx, comparisons_20160804.xlsx,
> iust_encodings.zip, lang-wise-eval_results.zip, lang-wise-eval_runnable.zip,
> lang-wise-eval_source_code.zip, proposedTLDSampling.csv,
> tika_1_14-SNAPSHOT_encoding_detector.zip, tld_text_html_plus_H_column.xlsx,
> tld_text_html.xlsx
>
>
> Currently, Tika uses icu4j for detecting the charset encoding of HTML
> documents as well as other natural-text documents. But the accuracy of
> encoding detector tools, including icu4j, on HTML documents is
> meaningfully lower than on other text documents. Hence, in our
> project I developed a library that works pretty well for HTML documents,
> which is available here: https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within other Apache projects such as
> Nutch, Lucene, Solr, etc., and these projects deal heavily with
> HTML documents, it seems that having such a facility in Tika would also
> help them become more accurate.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)