[ 
https://issues.apache.org/jira/browse/TIKA-539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13445351#comment-13445351
 ] 

Ken Krugler commented on TIKA-539:
----------------------------------

I'm looking at refactoring the detector code in Tika so that metadata is no 
longer parsed by various bits of code, and so that detectors can be ordered: 
for example, if the BOM detector finds a BOM, we can stop processing the text.

This should allow us to treat the meta tag as more of a hint, which now seems 
like the way to go (unless people can come up with cases where the latest 
charset detection code returns an invalid result).
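
To make the ordering idea concrete, here is a rough sketch. It is not Tika's 
actual API: DetectorChainSketch, Step, and the three steps are hypothetical 
names made up for illustration. It assumes a chain where the BOM step runs 
first, a content-based check runs second (stood in for here by a strict UTF-8 
validity test), and the charset declared in the meta tag or HTTP header is 
consulted only as a last-resort hint.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch; none of these types are part of Tika.
final class DetectorChainSketch {

    /** One step in the chain; returns a charset only when it has an answer. */
    interface Step {
        Optional<Charset> detect(byte[] input, Charset hint);
    }

    /** True if the input begins with the given unsigned byte values. */
    static boolean startsWith(byte[] input, int... prefix) {
        if (input.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((input[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    /** A BOM is treated as definitive. Simplification: UTF-32 BOMs are not handled. */
    static final Step BOM = (input, hint) -> {
        if (startsWith(input, 0xEF, 0xBB, 0xBF)) {
            return Optional.of(StandardCharsets.UTF_8);
        }
        if (startsWith(input, 0xFE, 0xFF)) {
            return Optional.of(StandardCharsets.UTF_16BE);
        }
        if (startsWith(input, 0xFF, 0xFE)) {
            return Optional.of(StandardCharsets.UTF_16LE);
        }
        return Optional.empty();
    };

    /** Stand-in for content-based detection: non-ASCII bytes that decode as strict UTF-8. */
    static final Step VALID_UTF8 = (input, hint) -> {
        boolean nonAscii = false;
        for (byte b : input) {
            if (b < 0) {
                nonAscii = true;
                break;
            }
        }
        if (!nonAscii) {
            return Optional.empty(); // pure ASCII tells us nothing
        }
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(input));
            return Optional.of(StandardCharsets.UTF_8);
        } catch (CharacterCodingException e) {
            return Optional.empty();
        }
    };

    /** The declared charset (meta tag or HTTP header) is only a hint, so it goes last. */
    static final Step HINT = (input, hint) -> Optional.ofNullable(hint);

    static final List<Step> CHAIN = Arrays.asList(BOM, VALID_UTF8, HINT);

    /** Runs the steps in order and stops at the first one that returns a charset. */
    static Optional<Charset> detect(byte[] input, Charset hint) {
        for (Step step : CHAIN) {
            Optional<Charset> result = step.detect(input, hint);
            if (result.isPresent()) {
                return result;
            }
        }
        return Optional.empty();
    }
}
```

With this ordering, the UTF-8 bytes of the test page from the issue, passed 
with an ISO-8859-1 hint, come back as UTF-8, while pure ASCII input falls 
through to the hint.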
                
> Encoding detection is too biased by encoding in meta tag
> --------------------------------------------------------
>
>                 Key: TIKA-539
>                 URL: https://issues.apache.org/jira/browse/TIKA-539
>             Project: Tika
>          Issue Type: Bug
>          Components: metadata, parser
>    Affects Versions: 0.8, 0.9, 0.10
>            Reporter: Reinhard Schwab
>            Assignee: Ken Krugler
>             Fix For: 1.3
>
>         Attachments: TIKA-539_2.patch, TIKA-539.patch
>
>
> If the encoding declared in the meta tag is wrong, that encoding is still
> detected, even if the correct encoding was already set in the metadata (for
> example, from the HTTP response header).
> Test code to reproduce:
> static String content = "<html><head>\n"
>                       + "<meta http-equiv=\"content-type\" "
>                       + "content=\"application/xhtml+xml; charset=iso-8859-1\" />"
>                       + "</head><body>Über den Wolken\n</body></html>";
>       /**
>        * @param args
>        * @throws IOException
>        * @throws TikaException
>        * @throws SAXException
>        */
>       public static void main(String[] args) throws IOException, SAXException,
>                       TikaException {
>               Metadata metadata = new Metadata();
>               metadata.set(Metadata.CONTENT_TYPE, "text/html");
>               metadata.set(Metadata.CONTENT_ENCODING, "UTF-8");
>               System.out.println(metadata.get(Metadata.CONTENT_ENCODING));
>               InputStream in = new ByteArrayInputStream(content.getBytes("UTF-8"));
>               AutoDetectParser parser = new AutoDetectParser();
>               BodyContentHandler h = new BodyContentHandler(10000);
>               parser.parse(in, h, metadata, new ParseContext());
>               System.out.print(h.toString());
>               System.out.println(metadata.get(Metadata.CONTENT_ENCODING));
>       }

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
