[ 
https://issues.apache.org/jira/browse/TIKA-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15414216#comment-15414216
 ] 

Shabanali Faghani edited comment on TIKA-2050 at 8/10/16 9:05 AM:
------------------------------------------------------------------

I tested it again. All of the charset information in these docs appears at indices greater 
than 8192, so the regex isn’t at fault. Since most of these indices are greater 
than 15,000, I don’t think increasing the buffer size would be a good idea; however, 
that is a trade-off between accuracy and efficiency.
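To make the buffer-size issue concrete, here is a minimal sketch (this is *not* Tika’s actual HtmlEncodingDetector, and the regex is far simpler than the real one) showing how a meta charset declared past the scanned window goes undetected:

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BufferLimitDemo {
    // Deliberately simplified stand-in for the kind of pattern the detector uses.
    static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([\\w-]+)", Pattern.CASE_INSENSITIVE);

    // Scan only the first bufferSize bytes for a charset declaration.
    static String detect(byte[] html, int bufferSize) {
        int len = Math.min(bufferSize, html.length);
        String head = new String(html, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = CHARSET.matcher(head);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Place the meta tag past the 8192-byte mark, as in the attached docs.
        StringBuilder sb = new StringBuilder("<html><head>");
        while (sb.length() < 15000) sb.append("<!-- padding -->");
        sb.append("<meta http-equiv=\"Content-Type\" content=\"text/html; charset=GBK\">");
        sb.append("</head><body></body></html>");
        byte[] html = sb.toString().getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(detect(html, 8192));        // null: meta tag lies outside the buffer
        System.out.println(detect(html, html.length)); // GBK
    }
}
```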
 
For those GBK docs that have more than one charset in their Meta tags, I’ve tested 
them with Chrome and Firefox. I found that Chrome has a high level of 
self-confidence because it doesn’t use the charset information at all, but its 
confidence, at least in these cases, doesn’t help it: it detects these docs 
as Western (Windows-1252). Firefox, on the other hand, extracts and uses the 
first charset appearing in the Meta tags to decode the pages. Hence, it seems 
that selecting the first charset is a kind of best practice in this context. 
Nevertheless, we know that this method fails in some cases, such as the 
attached GBK docs. Maybe extracting all charsets from the Meta tags and then 
selecting the one with the least popularity/usage would be a better solution.
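The “select the least popular charset” idea could be sketched roughly as below. The extraction regex, the list of “common” charsets, and the ranking are all illustrative assumptions, not Tika code:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetaCharsetPicker {
    static final Pattern CHARSET =
            Pattern.compile("<meta[^>]*charset\\s*=\\s*[\"']?([\\w-]+)", Pattern.CASE_INSENSITIVE);

    // Collect every charset declared in a meta tag, in document order.
    static List<String> extractAll(String html) {
        List<String> charsets = new ArrayList<>();
        Matcher m = CHARSET.matcher(html);
        while (m.find()) charsets.add(m.group(1).toUpperCase());
        return charsets;
    }

    // Firefox-like strategy: take the first declared charset.
    static String first(List<String> charsets) {
        return charsets.isEmpty() ? null : charsets.get(0);
    }

    // Alternative suggested above: prefer a rarer charset, on the theory that a
    // declaration like GBK is less likely to be copy-pasted boilerplate than
    // UTF-8 or Windows-1252. The "common" list here is a made-up placeholder.
    static String leastCommon(List<String> charsets) {
        List<String> common = List.of("UTF-8", "ISO-8859-1", "WINDOWS-1252");
        return charsets.stream()
                .min(Comparator.comparing(common::contains)) // false (rare) sorts before true
                .orElse(null);
    }

    public static void main(String[] args) {
        String html = "<head>"
                + "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">"
                + "<meta charset=\"GBK\">"
                + "</head>";
        List<String> all = extractAll(html);
        System.out.println(all);              // [UTF-8, GBK]
        System.out.println(first(all));       // UTF-8
        System.out.println(leastCommon(all)); // GBK
    }
}
```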

bq. Do you see any other problems besides our buffer length?
I also suspected that the charsets appearing in Script tags might be 
false positives of this class, but when I checked its regex I found that 
isn’t the case. I don’t see any other problem; I would just say that your 
regex approach in this class is ~18x faster than my DOM-tree navigation 
approach! 

I’m not sure, but our approach in TIKA-2038 is probably even more accurate 
than Meta detection, and also more accurate than the algorithms of some browsers! 
(To test this, remove the charset information from the meta tags of some docs, 
e.g. Windows-1256, GBK, …, and then open them in some browsers.) So, I think 
even if the HTMLEncodingDetector class couldn’t extract the existing charsets 
from Meta tags, we shouldn’t worry; it isn’t that important.



> HTMLEncodingDetector class fails on some HTML documents
> -------------------------------------------------------
>
>                 Key: TIKA-2050
>                 URL: https://issues.apache.org/jira/browse/TIKA-2050
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Shabanali Faghani
>            Priority: Minor
>         Attachments: false-negative-responce-from-HTMLEncodingDetector.zip
>
>
> When [[email protected]] and I were working on 
> [TIKA-2038|https://issues.apache.org/jira/browse/TIKA-2038] I found out that 
> HTMLEncodingDetector class cannot extract charsets from some HTML documents. 
> I’ve attached the HTML documents that HTMLEncodingDetector fails on. It 
> seems that its regex should be corrected to cover these cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
