[ 
https://issues.apache.org/jira/browse/TIKA-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16571207#comment-16571207
 ] 

Gerard Bouchar edited comment on TIKA-2673 at 8/7/18 7:26 AM:
--------------------------------------------------------------

Yes, the pages for which fetching failed are not included in the non-chrome 
files. The analysis is based on the pages that were successfully fetched and 
parsed with all the strategies. When an error was thrown while fetching in 
Chrome, the charset is marked as "unknown", but the URL is still included.

If you want to redo the experiment yourself, I would advise taking the 200k 
URLs and keeping only the ones for which fetching and parsing succeeded and 
the resulting document was actual HTML.
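
As a rough illustration of that filtering step, here is a minimal sketch in 
Java. The FetchResult class, its fields, and the content-type check are all 
assumptions made for this example; they are not taken from the attached data 
files or from Tika. In particular, "actual HTML" is approximated here by the 
reported content type, which may not match how the original experiment decided 
it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class CrawlFilter {

    /** Hypothetical per-URL outcome; the field names are assumptions. */
    public static final class FetchResult {
        final String url;
        final boolean fetched;     // the fetch completed without an error
        final boolean parsed;      // the parser produced a document
        final String contentType;  // e.g. "text/html; charset=utf-8", may be null

        public FetchResult(String url, boolean fetched, boolean parsed, String contentType) {
            this.url = url;
            this.fetched = fetched;
            this.parsed = parsed;
            this.contentType = contentType;
        }
    }

    /** Assumption: a document counts as HTML if its content type says so. */
    static boolean isHtml(String contentType) {
        if (contentType == null) {
            return false;
        }
        String ct = contentType.trim().toLowerCase(Locale.ROOT);
        return ct.startsWith("text/html") || ct.startsWith("application/xhtml+xml");
    }

    /** Keeps only the URLs that were fetched, parsed, and look like HTML. */
    public static List<String> usableUrls(List<FetchResult> results) {
        List<String> kept = new ArrayList<>();
        for (FetchResult r : results) {
            if (r.fetched && r.parsed && isHtml(r.contentType)) {
                kept.add(r.url);
            }
        }
        return kept;
    }
}
```

To compare strategies on equal footing, the same filter would presumably be 
applied to each strategy's results, keeping only the URLs that pass for all of 
them.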


> HtmlEncodingDetector doesn't follow the specification
> -----------------------------------------------------
>
>                 Key: TIKA-2673
>                 URL: https://issues.apache.org/jira/browse/TIKA-2673
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Gerard Bouchar
>            Assignee: Tim Allison
>            Priority: Major
>             Fix For: 1.19, 2.0.0
>
>         Attachments: HtmlEncodingDetectorTest.java, 
> StrictHtmlEncodingDetector.tar.gz, image-2018-07-13-11-28-16-657.png
>
>
> This bug is linked to TIKA-2671, but it concerns the byte-based detection 
> itself rather than metadata.
> While reading the specification, I collected a list of sample cases where 
> HtmlEncodingDetector differs from the specification, and thus fails to 
> detect the right charset.
> I am attaching the test cases to this issue: 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
