[
https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15885670#comment-15885670
]
Tim Allison commented on TIKA-2038:
-----------------------------------
bq. because, the meta headers are removed after stripping a html document!
The current version of the stripper leaves in <meta > headers if they also
include "charset". I was toying with the notion of stripping only once, rather
than for each of the charset detectors. I included the output of the stripped
HTMLMeta detector as a sanity check to make sure that my stripper didn't
take too much out.
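The stripping behavior described above (drop markup, but keep any <meta > tag that declares a charset) could be sketched roughly like this. This is not the actual stripper code, just a minimal regex-based illustration; class and method names are placeholders:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HtmlStripper {

    // Matches any tag; meta tags mentioning "charset" are preserved below.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    public static String strip(String html) {
        Matcher m = TAG.matcher(html);
        StringBuilder sb = new StringBuilder();
        int last = 0;
        while (m.find()) {
            sb.append(html, last, m.start());
            String tag = m.group().toLowerCase();
            // Leave in <meta ...> headers that declare a charset
            if (tag.startsWith("<meta") && tag.contains("charset")) {
                sb.append(m.group());
            }
            last = m.end();
        }
        sb.append(html.substring(last));
        return sb.toString();
    }
}
```

Stripping once up front (rather than per detector) would mean running something like this a single time and handing the same stripped bytes to each charset detector.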
bq. I couldn’t exactly understand what you mean from training
This was a preliminary run, and I figure that we'll be modifying the stripper
and possibly IUST. I want to leave a held-out set for the final run/eval. I
don't want to report "testing on training".
bq. Also I can’t perceive why you didn’t involve Tika and IUST in your
comparison!
Tika can be computed from the results algorithmically (if html is null, use
Universal; if Universal is null, use ICU4j). I didn't use IUST because this
was a preliminary run, and I wasn't sure which version I should use: the one
on GitHub, the proposed modification
[above|https://issues.apache.org/jira/browse/TIKA-2038?focusedCommentId=15830525&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15830525
], or both? Let me know which code you'd like me to run.
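The fallback cascade used to derive Tika's result from the per-detector columns amounts to this (a sketch of the derivation, not Tika's actual API; parameter names are placeholders for the detector outputs):

```java
public class DetectorCascade {

    /**
     * Derives the "Tika" answer from the individual detector results:
     * prefer the HTML meta result; if it is null, fall back to
     * UniversalChardet; if that is also null, fall back to ICU4j.
     */
    public static String tikaResult(String htmlMeta, String universal, String icu4j) {
        if (htmlMeta != null) {
            return htmlMeta;
        }
        if (universal != null) {
            return universal;
        }
        return icu4j;
    }
}
```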
bq. It maybe would be nice to do also efficiency/speed test for the algorithms.
I want to focus on accuracy first. We still have to settle on an eval method.
But, yes, I do want to look at this.
bq. If the http header is available for all documents
It is. I didn't have time to join the two tables. I will. Not all headers
included encoding information, of course.
bq. (with or without medias?)
Media is included only if it is inlined in the HTML. I did not pull references
to images, etc.
bq. just as a subsidiary note about TIKA-2273,
Yes. I'm hoping that making it configurable and adding documentation will help.
There are still some required improvements in TIKA-2273.
bq. I suggest doing a peripheral study about the number of html documents that
have charset in http and meta header.
As another study, I calculated how many times a page has a meta header for
charset _and_ how far into the page that meta header appears. I'll share
that shortly.
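Measuring "how far into the page" the charset declaration appears could look something like the following. A hypothetical sketch, not the measurement code used for the study; it reports the character offset of the first meta charset declaration, or -1 if none is present:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaOffset {

    // Crude match for a meta tag that mentions "charset" anywhere
    // inside it (covers both <meta charset=...> and
    // <meta http-equiv="Content-Type" content="...; charset=...">).
    private static final Pattern META_CHARSET =
            Pattern.compile("<meta[^>]*charset[^>]*>", Pattern.CASE_INSENSITIVE);

    public static int charsetOffset(String html) {
        Matcher m = META_CHARSET.matcher(html);
        return m.find() ? m.start() : -1;
    }
}
```

The interesting distribution is how often that offset exceeds the number of bytes a detector typically reads before giving up on finding a declaration.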
> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
> Key: TIKA-2038
> URL: https://issues.apache.org/jira/browse/TIKA-2038
> Project: Tika
> Issue Type: Improvement
> Components: core, detector
> Reporter: Shabanali Faghani
> Priority: Minor
> Attachments: comparisons_20160803b.xlsx, comparisons_20160804.xlsx,
> iust_encodings.zip, lang-wise-eval_results.zip, lang-wise-eval_runnable.zip,
> lang-wise-eval_source_code.zip, proposedTLDSampling.csv,
> tika_1_14-SNAPSHOT_encoding_detector.zip, tld_text_html_plus_H_column.xlsx,
> tld_text_html.xlsx
>
>
> Currently, Tika uses icu4j for detecting the charset encoding of HTML
> documents as well as other natural-text documents. But the accuracy of
> encoding detector tools, including icu4j, on HTML documents is
> meaningfully lower than on other text documents. Hence, in our
> project I developed a library that works pretty well for HTML documents,
> which is available here: https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within other Apache projects such as
> Nutch, Lucene, Solr, etc., and these projects deal heavily with
> HTML documents, it seems that having such a facility in Tika would also
> help them become more accurate.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)