[jira] [Resolved] (TIKA-2050) HTMLEncodingDetector class fails on some HTML documents

Tim Allison (JIRA) Thu, 11 Aug 2016 05:34:36 -0700

     [ https://issues.apache.org/jira/browse/TIKA-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Tim Allison resolved TIKA-2050.
-------------------------------
    Resolution: Won't Fix

If we want to increase the buffer in the future or make it configurable, we can 
reopen this issue.
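
For anyone reading this later, here is a rough sketch of why a fixed read buffer can miss a charset declaration. It is not Tika's actual code: the class name MetaCharsetSniffer, the regex, and the bufferSize parameter are all hypothetical, and it assumes the failures come from the meta tag lying beyond the bytes the detector inspects.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the Tika implementation.
final class MetaCharsetSniffer {

    // Assumed pattern: captures the value of charset=... inside a <meta ...> tag.
    // Case-insensitive, and '.' is not needed to span lines because [^>] already does.
    private static final Pattern META_CHARSET = Pattern.compile(
            "(?is)<meta\\s[^>]*?charset\\s*=\\s*[\"']?([a-z0-9_:.\\-]+)");

    /**
     * Looks for a charset declaration in the first bufferSize bytes only.
     * Returns the declared charset name, or null if none is found in that window.
     */
    static String sniff(byte[] html, int bufferSize) {
        int length = Math.min(html.length, bufferSize);
        // ISO-8859-1 maps every byte to one char, so ASCII markup survives
        // regardless of the document's real encoding.
        String head = new String(html, 0, length, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? m.group(1) : null;
    }
}
```

With a sketch like this, a document whose head starts with a large inline script or style block returns null for a small bufferSize and the declared charset for a larger one, which is the trade-off a configurable buffer would expose.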

> HTMLEncodingDetector class fails on some HTML documents
> -------------------------------------------------------
>
>                 Key: TIKA-2050
>                 URL: https://issues.apache.org/jira/browse/TIKA-2050
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Shabanali Faghani
>            Priority: Minor
>         Attachments: false-negative-responce-from-HTMLEncodingDetector.zip
>
>
> While [[email protected]] and I were working on 
> [TIKA-2038|https://issues.apache.org/jira/browse/TIKA-2038], I found that 
> the HTMLEncodingDetector class cannot extract charsets from some HTML 
> documents. I've attached the HTML documents on which HTMLEncodingDetector 
> fails. It seems that its regex should be corrected to cover these cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
