[jira] [Comment Edited] (TIKA-2673) HtmlEncodingDetector doesn't follow the specification

Gerard Bouchar (JIRA) Fri, 13 Jul 2018 02:29:43 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16541307#comment-16541307
 ]


Gerard Bouchar edited comment on TIKA-2673 at 7/13/18 9:28 AM:
---------------------------------------------------------------

[[email protected]]: great, thank you very much! Of course I agree to it 
being merged. I'm sorry for forgetting the license header in the first place.

I have done more work on this over the last few days. I am going to open a 
pull request that includes my latest changes.

We ran an internal test of this and saw great results. We selected a random 
subset of ~100,000 URLs from a Nutch segment, fetched them once in Nutch, and 
parsed them using different strategies. We fetched the same URLs using 
Puppeteer (headless Chrome) and compared the detected charsets. Here are the 
results:

 

{{detector         correct similar wrong}}
{{standard           99.4%    0.0%  0.6%}}
{{standard_noparse   94.7%    4.6%  0.6%}}
{{default            85.9%   11.5%  2.6%}}
{{icu                79.1%   13.9%  7.0%}}

!image-2018-07-13-11-28-16-657.png!

standard_noparse is a composite detector made of a version of my detector 
that only takes the BOM and HTTP headers into account, chained with the 
existing HtmlEncodingDetector, chained with Icu4JEncodingDetector.

standard is a composite detector made of the latest version of my detector, 
chained with Icu4JEncodingDetector.
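
To make the chaining concrete, here is a minimal, self-contained sketch of the idea: each detector either returns a charset or declines, and the first answer wins. It does not use Tika's classes. The CharsetGuess interface, the ChainSketch class, and the two toy detectors are hypothetical names made up for this illustration, and checking the BOM before the HTTP header is an assumption based on the order used by the HTML standard's encoding sniffing algorithm.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

/** Hypothetical stand-in for a detector; this is NOT Tika's EncodingDetector. */
interface CharsetGuess {
    // An empty result means "no opinion", so the next detector in the chain is asked.
    Optional<Charset> detect(byte[] head, String httpContentType);
}

final class ChainSketch {
    /** Looks only at a byte order mark at the start of the content. */
    static Optional<Charset> fromBom(byte[] b, String ignored) {
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB
                && (b[2] & 0xFF) == 0xBF) {
            return Optional.of(StandardCharsets.UTF_8);
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return Optional.of(StandardCharsets.UTF_16BE);
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return Optional.of(StandardCharsets.UTF_16LE);
        }
        return Optional.empty();
    }

    /** Looks only at the charset parameter of an HTTP Content-Type value. */
    static Optional<Charset> fromHttpHeader(byte[] ignored, String contentType) {
        if (contentType == null) {
            return Optional.empty();
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Optional.of(Charset.forName(name));
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported name: decline and let the chain continue.
                    return Optional.empty();
                }
            }
        }
        return Optional.empty();
    }

    // ASSUMPTION: BOM first, then HTTP header. A real chain would append
    // markup-based and statistical detectors after these two.
    static final List<CharsetGuess> HEADERS_ONLY =
            Arrays.asList(ChainSketch::fromBom, ChainSketch::fromHttpHeader);

    /** Asks each detector in order and returns the first answer, else the fallback. */
    static Charset detect(List<CharsetGuess> chain, byte[] head,
                          String contentType, Charset fallback) {
        for (CharsetGuess detector : chain) {
            Optional<Charset> guess = detector.detect(head, contentType);
            if (guess.isPresent()) {
                return guess.get();
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        byte[] page = {'<', 'p', '>'};
        System.out.println(detect(HEADERS_ONLY, page,
                "text/html; charset=ISO-8859-1", StandardCharsets.UTF_8));
    }
}
```

In this sketch the only difference between two composite configurations would be the contents of the list passed to detect, which is what makes it cheap to compare several strategies on the same set of pages.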

Pages labeled "correct" are those for which Tika detected the same charset as 
Chrome. "similar" means that, although incorrect, the detected charset is 
close to the one detected by Chrome (ISO-8859-1 instead of WINDOWS-1254, for 
instance). "wrong" means that the detected charset was not close to the one 
detected by Chrome.
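
The three labels can be sketched as a small comparison function. The comment above does not say how "close" was decided, so the FAMILIES table below is purely an assumption for illustration, as are the names VerdictSketch, Verdict and classify. Only the ISO-8859-1 / WINDOWS-1254 pairing comes from the example given above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class VerdictSketch {
    enum Verdict { CORRECT, SIMILAR, WRONG }

    // ASSUMPTION: hand-written groups of charsets treated as "close" to each other.
    // The actual closeness rule used in the test is not described.
    static final List<Set<String>> FAMILIES = Arrays.asList(
            new HashSet<String>(Arrays.asList(
                    "ISO-8859-1", "WINDOWS-1252", "ISO-8859-9", "WINDOWS-1254")),
            new HashSet<String>(Arrays.asList("GB2312", "GBK", "GB18030")));

    /** Compares the reference charset (Chrome) with the detected one (Tika). */
    static Verdict classify(String reference, String detected) {
        String ref = reference.trim().toUpperCase(Locale.ROOT);
        String det = detected.trim().toUpperCase(Locale.ROOT);
        if (ref.equals(det)) {
            return Verdict.CORRECT;
        }
        for (Set<String> family : FAMILIES) {
            if (family.contains(ref) && family.contains(det)) {
                return Verdict.SIMILAR;
            }
        }
        return Verdict.WRONG;
    }

    public static void main(String[] args) {
        System.out.println(classify("WINDOWS-1254", "ISO-8859-1"));
        System.out.println(classify("UTF-8", "utf-8"));
        System.out.println(classify("UTF-8", "ISO-8859-1"));
    }
}
```

Comparing upper-cased names is a simplification: it does not resolve charset aliases, so a real comparison would first map both names to a canonical label.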


> HtmlEncodingDetector doesn't follow the specification
> -----------------------------------------------------
>
>                 Key: TIKA-2673
>                 URL: https://issues.apache.org/jira/browse/TIKA-2673
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Gerard Bouchar
>            Priority: Major
>         Attachments: HtmlEncodingDetectorTest.java, 
> StrictHtmlEncodingDetector.tar.gz, image-2018-07-13-11-28-16-657.png
>
>
> This bug is linked to TIKA-2671. It does not concern metadata, however, but 
> rather the byte-based detection itself.
> While reading the specification, I collected a list of sample cases where 
> HtmlEncodingDetector differs from the specification, and thus fails to 
> detect the right charset.
> I am attaching the test cases to this issue: 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
