[
https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15887801#comment-15887801
]
Shabanali Faghani edited comment on TIKA-2038 at 3/4/17 6:31 PM:
-----------------------------------------------------------------
Perfect reply, [[email protected]]. Thank you!
bq. The current version of the stripper leaves in <meta > headers if they also
include "charset". … I included the output of the stripped HTMLMeta detector as
a sanity check … (/)
bq. I figure that we'll be modifying the stripper …
We might need the stripper to work like a SAX parser, i.e. the input should be an
_InputStream_. This is required if we decide to be conservative about OOM errors
or to avoid wasting resources on big HTML files. I know that writing a perfect
_HTML stream stripper_ with minimal faults (false negatives/positives,
exceptions, …) is very hard. As a SAX parser, TagSoup should be able to do so,
but there are two problems: _chicken and egg_ and _performance_. The former can
be solved by the _ISO-8859-1 encoding-decoding_ trick (see the sketch below),
but there is no solution for the latter.
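To make the trick concrete, here is a rough sketch (the class name and the regex-based stripping are only illustrative; a real stripper would be a streaming, SAX-style one). Because ISO-8859-1 maps every byte 0x00–0xFF to exactly one char, decoding with it is lossless, so the original bytes of the visible text can be recovered after stripping and handed to the detectors:
{code:java}
import java.nio.charset.StandardCharsets;

public class Iso88591RoundTrip {

    /** Strips markup from raw HTML bytes without knowing their charset. */
    public static byte[] stripMarkup(byte[] rawHtml) {
        // Decode losslessly: ISO-8859-1 maps each byte to exactly one char
        String pseudoText = new String(rawHtml, StandardCharsets.ISO_8859_1);

        // Strip tags as plain text (regex for illustration only; a real
        // implementation would strip while streaming)
        String visibleText = pseudoText.replaceAll("(?s)<[^>]*>", " ");

        // Re-encode losslessly: each char maps back to the original byte
        return visibleText.getBytes(StandardCharsets.ISO_8859_1);
    }
}
{code}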
For a lightweight SAX-style stripper I think we can ask [Jonathan Hedley|
https://jhy.io/], the author of Jsoup, or someone else on Jsoup’s mailing list
whether they have ever done something like this or could help us. We may also
suggest/introduce IUST (the standalone version) to them. IIRC, in Jsoup 1.6.1-3
(and most likely still now) the charset of a page was assumed to be UTF-8 if the
HTTP header didn’t contain any charset and none was specified in the input.
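For reference, that fallback is visible in Jsoup’s stream-parsing entry point: with a null charset name it falls back to its own detection (BOM / meta) and ultimately to UTF-8 (the base URI here is just a placeholder):
{code:java}
import java.io.IOException;
import java.io.InputStream;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupFallback {
    public static Document parse(InputStream in) throws IOException {
        // charsetName = null -> Jsoup detects from BOM/meta, else assumes UTF-8
        return Jsoup.parse(in, null, "http://example.com/");
    }
}
{code}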
bq. … and possibly IUST.
The current version of IUST, i.e. htmlchardet-1.0.1, uses _early-termination_
with neither JCharDet nor ICU4j! So, we would have to write a custom version of
IUST to do so. Nevertheless, I think we can ignore this for the first version,
because I don’t think it would have a meaningful effect on the algorithm. In
fact, I think calling the detection methods of JCharDet and ICU4j with an
InputStream input would slightly increase efficiency at the cost of a slight
decrease in accuracy. A sketch of what early termination with JCharDet could
look like is below.
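With JCharDet, {{nsDetector.DoIt}} returns true once the detector is confident, so the stream can be abandoned early instead of buffering the whole page (the buffer size and class name here are illustrative):
{code:java}
import java.io.IOException;
import java.io.InputStream;
import org.mozilla.intl.chardet.nsDetector;

public class EarlyTermination {
    public static String[] detect(InputStream in) throws IOException {
        nsDetector det = new nsDetector();
        byte[] buf = new byte[1024];
        int n;
        boolean done = false;
        // DoIt() returns true once the detector is confident enough,
        // so we can stop reading early
        while (!done && (n = in.read(buf)) > 0) {
            done = det.DoIt(buf, n, false);
        }
        det.DataEnd();
        return det.getProbableCharsets(); // {"nomatch"} if nothing was found
    }
}
{code}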
bq. I didn't use IUST because this was a preliminary run, and I wasn't sure
which version I should use. The one on github or the proposed modification
above or both? Let me know which code you'd like me to run.
The _modified IUST_ isn’t complete yet. To complete it, we must prepare a
thorough list of languages for which the stripping shouldn’t be done. These
languages/tlds are determined by comparing the results of IUST with and without
stripping. So, you should run both _htmlchardet-1.0.1.jar_ (IUST with stripping)
with _lookInMeta=false_ and the class _IUSTWithoutMarkupElimination_ (IUST
without stripping) from the [lang-wise-eval source code|
https://issues.apache.org/jira/secure/attachment/12848364/lang-wise-eval_source_code.zip].
The accuracy of the _modified IUST_ (the pseudo code above) can then be computed
algorithmically by selecting the better of the two for each language/tld, as
sketched below.
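That selection step is trivial once the per-tld accuracies of both runs are at hand (the maps and their contents are illustrative; the real numbers come from the two evaluation runs above):
{code:java}
import java.util.HashMap;
import java.util.Map;

public class BestOfTwo {
    /** Picks, per language/TLD, the better accuracy of the two runs. */
    public static Map<String, Double> combine(Map<String, Double> withStripping,
                                              Map<String, Double> withoutStripping) {
        Map<String, Double> best = new HashMap<>();
        for (Map.Entry<String, Double> e : withStripping.entrySet()) {
            double other = withoutStripping.getOrDefault(e.getKey(), 0.0);
            best.put(e.getKey(), Math.max(e.getValue(), other));
        }
        return best;
    }
}
{code}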
bq. I want to focus on accuracy first. We still have to settle on an eval
method. But, yes, I do want to look at this. (/)
> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
> Key: TIKA-2038
> URL: https://issues.apache.org/jira/browse/TIKA-2038
> Project: Tika
> Issue Type: Improvement
> Components: core, detector
> Reporter: Shabanali Faghani
> Priority: Minor
> Attachments: comparisons_20160803b.xlsx, comparisons_20160804.xlsx,
> iust_encodings.zip, lang-wise-eval_results.zip, lang-wise-eval_runnable.zip,
> lang-wise-eval_source_code.zip, proposedTLDSampling.csv,
> tika_1_14-SNAPSHOT_encoding_detector.zip, tld_text_html_plus_H_column.xlsx,
> tld_text_html.xlsx
>
>
> Currently, Tika uses icu4j for detecting the charset encoding of HTML documents
> as well as other natural-text documents. But the accuracy of encoding detector
> tools, including icu4j, on HTML documents is meaningfully lower than on other
> text documents. Hence, in our project I developed a library that works pretty
> well for HTML documents, which is available here:
> https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within other Apache projects such as Nutch,
> Lucene, Solr, etc., and these projects deal heavily with HTML documents, it
> seems that having such a facility in Tika would also help them become more
> accurate.