[jira] [Created] (TIKA-2485) HTMLEncodingDetector content limit to be configurable

Markus Jelsma (JIRA) Fri, 27 Oct 2017 06:37:52 -0700

Markus Jelsma created TIKA-2485:
-----------------------------------

             Summary: HTMLEncodingDetector content limit to be configurable
                 Key: TIKA-2485
                 URL: https://issues.apache.org/jira/browse/TIKA-2485
             Project: Tika
          Issue Type: Improvement
          Components: detector
    Affects Versions: 1.16
            Reporter: Markus Jelsma
            Priority: Minor
             Fix For: 1.17



Tim's response to my question:

-----Original message-----
> From:Allison, Timothy B. <[email protected]>
> Sent: Friday 27th October 2017 14:53
> To: [email protected]
> Subject: RE: Incorrect encoding detected
> 
> Hi Markus,
>   
> My guess is that the ~32,000 characters of mostly ascii-ish <script/> are 
> what is actually being used for encoding detection.  The HTMLEncodingDetector 
> only looks in the first 8,192 characters, and the other encoding detectors 
> have similar (but longer?) restrictions.
>  
> At some point, I had a dev version of a stripper that removed contents of 
> <script/> and <style/> before trying to detect the encoding[0]...perhaps it 
> is time to resurrect that code and integrate it?
> 
> Or, given that HTML has been, um, blossoming, perhaps, more simply, we should 
> expand how far we look into a stream for detection?
> 
> Cheers,
> 
>                Tim
> 
> [0] https://issues.apache.org/jira/browse/TIKA-2038
>    
> 
> -----Original Message-----
> From: Markus Jelsma [mailto:[email protected]] 
> Sent: Friday, October 27, 2017 8:39 AM
> To: [email protected]
> Subject: Incorrect encoding detected
> 
> Hello,
> 
> We have a problem with Tika, encoding and pages on this website: 
> https://www.aarstiderne.com/frugt-groent-og-mere/mixkasser
> 
> Using Nutch and Tika 1.12, but also using Tika 1.16, we found out that the 
> regular HTML parser does a fine job, but our TikaParser has a tough job 
> dealing with this HTML. For some reason Tika reports 
> Content-Encoding=windows-1252 for this webpage, even though the page 
> identifies itself properly as UTF-8.
> 
> Of all websites we index, this is so far the only one giving trouble indexing 
> accents, getting fÃ¥ instead of a regular få.
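
The fÃ¥ symptom quoted above is what the UTF-8 bytes of få look like when they are decoded as windows-1252. A minimal Java sketch that reproduces it follows; the class and method names are made up for illustration and none of this is Tika code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Tika or Nutch.
class MojibakeDemo {

    // Encode the text as UTF-8, then decode those bytes with the wrong charset.
    static String misdecode(String text, Charset wrong) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, wrong);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        // "f\u00e5" is "få"; å becomes the two bytes C3 A5 in UTF-8,
        // which windows-1252 reads as two separate characters.
        System.out.println(misdecode("f\u00e5", cp1252));
    }
}
```

What the console shows depends on the terminal's own encoding, so comparing the returned string is more reliable than reading the printed output.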

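To illustrate why a fixed look-ahead matters, here is a rough sketch of a scanner that searches only the first `limit` bytes for a `<meta charset>` declaration. It is not Tika's HTMLEncodingDetector: the class, the regular expression and the sample page are assumptions, and the limit is counted in bytes here. The sample page puts 32,000 characters of script in front of the declaration, which is one way a page could push the declaration out of an 8,192 window.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not Tika's implementation.
class MetaCharsetSketch {

    // Simplified pattern for <meta charset="..."> and <meta ... content="...; charset=...">.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9_\\-]+)",
            Pattern.CASE_INSENSITIVE);

    // Returns the charset name declared within the first `limit` bytes, or null.
    static String findDeclaredCharset(byte[] html, int limit) {
        int n = Math.min(limit, html.length);
        // ISO-8859-1 maps every byte to exactly one char, so ASCII markup survives.
        String head = new String(html, 0, n, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? m.group(1) : null;
    }

    // Builds an invented page with `scriptChars` characters of script before the meta tag.
    static byte[] samplePage(int scriptChars) {
        StringBuilder sb = new StringBuilder("<html><head><script>");
        for (int i = 0; i < scriptChars; i++) {
            sb.append('x');
        }
        sb.append("</script><meta charset=\"UTF-8\"></head><body>f\u00e5</body></html>");
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] page = samplePage(32_000);
        System.out.println(findDeclaredCharset(page, 8_192));   // declaration lies beyond the limit
        System.out.println(findDeclaredCharset(page, 65_536));  // declaration lies within the limit
    }
}
```

Passing the limit as a parameter is the point of the sketch: a caller that knows its pages carry large inline scripts can raise it without touching the scanning code.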

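The other option Tim mentions, removing the contents of `<script/>` and `<style/>` before detection, could look roughly like the sketch below. This is not the code from TIKA-2038; the class name and the regular expression are assumptions, and the input is assumed to have been decoded one byte per char (for example as ISO-8859-1) so that it can be handled as a String before the real encoding is known.

```java
import java.util.regex.Pattern;

// Hypothetical sketch, not the stripper referenced in TIKA-2038.
class ScriptStyleStripper {

    // Matches <script ...>...</script> and <style ...>...</style>, across line breaks.
    private static final Pattern SCRIPT_OR_STYLE = Pattern.compile(
            "<(script|style)\\b[^>]*>.*?</\\1\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Removes script and style elements, leaving the remaining markup untouched.
    static String strip(String html) {
        return SCRIPT_OR_STYLE.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<script>var a = 1;</script><p>f\u00e5</p><style>p{}</style>";
        System.out.println(strip(html));
    }
}
```

A regular expression is a blunt tool for HTML: it does not handle unclosed elements or a closing tag that appears inside a script string, so a real stripper would probably need a proper tokenizer.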

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
