Tim Allison created TIKA-1514:
---------------------------------

             Summary: http-equiv content-type extraction should pick first 
parseable content value 
                 Key: TIKA-1514
                 URL: https://issues.apache.org/jira/browse/TIKA-1514
             Project: Tika
          Issue Type: Bug
    Affects Versions: 1.6
            Reporter: Tim Allison
            Priority: Trivial
             Fix For: 1.8


In a handful of files from govdocs1, there are some creative http-equiv 
content-type headers, including: 
{noformat}
<meta http-equiv="content-type" content="text/html; charset=iso-8859-1" 
name="keywords" content="DNRC, division of nutrition">
{noformat}

The content type that is going into the metadata for this file is "DNRC, 
division of nutrition".

Let's modify our HTML meta-header charset detector to pick the first parseable 
charset value.
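
A minimal sketch of the "first parseable value" idea, not Tika's actual detector code. The class name FirstParseableCharset, the method firstParseableCharset, and both regular expressions are hypothetical and exist only for illustration. It assumes the raw meta tag is available as a string and that "parseable" means the value carries a charset parameter that java.nio.charset.Charset can resolve.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of Tika.
public class FirstParseableCharset {

    // Matches every content="..." (or content='...') attribute in a tag.
    private static final Pattern CONTENT_ATTR = Pattern.compile(
            "\\bcontent\\s*=\\s*([\"'])(.*?)\\1", Pattern.CASE_INSENSITIVE);

    // Pulls the charset parameter out of a value such as
    // "text/html; charset=iso-8859-1".
    private static final Pattern CHARSET_PARAM = Pattern.compile(
            "charset\\s*=\\s*[\"']?([^\\s;\"']+)", Pattern.CASE_INSENSITIVE);

    /** Returns the first charset the JVM can resolve, or null if there is none. */
    public static Charset firstParseableCharset(String metaTag) {
        Matcher content = CONTENT_ATTR.matcher(metaTag);
        while (content.find()) {
            Matcher charset = CHARSET_PARAM.matcher(content.group(2));
            if (!charset.find()) {
                continue; // e.g. "DNRC, division of nutrition" has no charset
            }
            try {
                return Charset.forName(charset.group(1));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: try the next content value.
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String tag = "<meta http-equiv=\"content-type\" "
                + "content=\"text/html; charset=iso-8859-1\" "
                + "name=\"keywords\" content=\"DNRC, division of nutrition\">";
        System.out.println(firstParseableCharset(tag)); // prints ISO-8859-1
    }
}
```

With the example tag above, the second content value is skipped because it has no charset parameter, so the keywords string no longer wins. A real fix would work on the attributes delivered by the HTML parser rather than on a regex over raw markup.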



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
