[
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16541307#comment-16541307
]
Gerard Bouchar edited comment on TIKA-2673 at 7/13/18 9:28 AM:
---------------------------------------------------------------
[[email protected]] : Great, thank you very much! Of course I agree to it
being merged. I'm sorry for forgetting the license header in the first place.
I have done more work on this over the last few days, and I am going to open a
pull request with my latest changes.
We ran an internal test of this and saw great results. We selected a random
subset of ~100 000 URLs from a Nutch segment, fetched them once with Nutch, and
parsed them using different strategies. We fetched the same URLs using
puppeteer (headless Chrome) and compared the detected charsets. Here are the
results:
{{                  correct  similar  wrong}}
{{standard            99.4%     0.0%   0.6%}}
{{standard_noparse    94.7%     4.6%   0.6%}}
{{default             85.9%    11.5%   2.6%}}
{{icu                 79.1%    13.9%   7.0%}}
!image-2018-07-13-11-28-16-657.png!
standard_noparse is a composite detector: a version of my detector that only
takes the BOM and HTTP headers into account, chained with the existing
HtmlEncodingDetector, chained with Icu4JEncodingDetector.
standard is a composite detector: the latest version of my detector, chained
with Icu4JEncodingDetector.
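To illustrate what "chained" means here, below is a minimal, self-contained sketch of a detector chain in which the first detector that reaches a decision wins. All names in it (DetectorChainSketch, Detector, fromBom, fromHttpHeader) are hypothetical and are not Tika's API or the code from the attachment. Checking the BOM before the HTTP header is also an assumption of this sketch.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Optional;

// Hypothetical names throughout: this is an illustration, not Tika's API.
class DetectorChainSketch {

    /** One link in the chain: returns a charset, or empty if it cannot decide. */
    interface Detector {
        Optional<Charset> detect(byte[] content, String httpContentType);
    }

    /** Looks only at a leading byte order mark. */
    static Optional<Charset> fromBom(byte[] b, String ignoredContentType) {
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB
                && (b[2] & 0xFF) == 0xBF) {
            return Optional.of(StandardCharsets.UTF_8);
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return Optional.of(StandardCharsets.UTF_16BE);
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return Optional.of(StandardCharsets.UTF_16LE);
        }
        return Optional.empty();
    }

    /** Looks only at the charset parameter of an HTTP Content-Type value. */
    static Optional<Charset> fromHttpHeader(byte[] ignoredContent, String contentType) {
        if (contentType == null) {
            return Optional.empty();
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Optional.of(Charset.forName(name));
                } catch (IllegalArgumentException e) {
                    // Malformed or unsupported label: let the next detector try.
                    return Optional.empty();
                }
            }
        }
        return Optional.empty();
    }

    /** Runs the detectors in order and returns the first decision, else the fallback. */
    static Charset detect(List<Detector> chain, byte[] content,
                          String contentType, Charset fallback) {
        for (Detector d : chain) {
            Optional<Charset> result = d.detect(content, contentType);
            if (result.isPresent()) {
                return result.get();
            }
        }
        return fallback;
    }
}
```

With a chain of `List.of(DetectorChainSketch::fromBom, DetectorChainSketch::fromHttpHeader)`, a page that starts with a UTF-8 BOM is reported as UTF-8 even if the header declares another charset. A statistical detector would sit at the end of such a chain as the last resort.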
Pages labeled "correct" are those for which Chrome and Tika detected the same
charset. "similar" means that, although incorrect, the detected charset is
close to the one detected by Chrome (ISO-8859-1 instead of WINDOWS-1254, for
instance). "wrong" means that the detected charset was not close to the one
detected by Chrome.
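The bucketing described above can be sketched as follows. This is only an illustration: the class and method names are hypothetical, and the notion of two charsets being "close" is passed in as a predicate because its exact definition is not spelled out in this comment.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiPredicate;

// Hypothetical sketch of the correct / similar / wrong bucketing.
class CharsetComparisonSketch {

    enum Verdict { CORRECT, SIMILAR, WRONG }

    /** Compares the detected charset name against the reference (browser) one. */
    static Verdict classify(String reference, String detected,
                            BiPredicate<String, String> close) {
        if (reference.equalsIgnoreCase(detected)) {
            return Verdict.CORRECT;
        }
        return close.test(reference, detected) ? Verdict.SIMILAR : Verdict.WRONG;
    }

    /** Percentage of pages per verdict; each pair is {reference, detected}. */
    static Map<Verdict, Double> tally(List<String[]> pairs,
                                      BiPredicate<String, String> close) {
        Map<Verdict, Double> share = new EnumMap<>(Verdict.class);
        for (Verdict v : Verdict.values()) {
            share.put(v, 0.0);
        }
        for (String[] p : pairs) {
            share.merge(classify(p[0], p[1], close), 1.0, Double::sum);
        }
        if (!pairs.isEmpty()) {
            // Turn raw counts into percentages of the total.
            share.replaceAll((verdict, count) -> 100.0 * count / pairs.size());
        }
        return share;
    }
}
```

The caller supplies the closeness rule, for example a predicate that accepts the ISO-8859-1 / WINDOWS-1254 pair mentioned above, so the sketch does not commit to any particular list of near-equivalent charsets.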
> HtmlEncodingDetector doesn't follow the specification
> -----------------------------------------------------
>
> Key: TIKA-2673
> URL: https://issues.apache.org/jira/browse/TIKA-2673
> Project: Tika
> Issue Type: Bug
> Reporter: Gerard Bouchar
> Priority: Major
> Attachments: HtmlEncodingDetectorTest.java,
> StrictHtmlEncodingDetector.tar.gz, image-2018-07-13-11-28-16-657.png
>
>
> This bug is linked to TIKA-2671, but it concerns the byte-based detection
> itself rather than metadata.
> While reading the specification, I collected a list of sample cases where
> HtmlEncodingDetector deviates from the specification, and thus fails to
> detect the right charset.
> I am attaching the test cases to this issue:
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)