Tim Allison created TIKA-1514:
---------------------------------
Summary: http-equiv content-type extraction should pick first
parseable content value
Key: TIKA-1514
URL: https://issues.apache.org/jira/browse/TIKA-1514
Project: Tika
Issue Type: Bug
Affects Versions: 1.6
Reporter: Tim Allison
Priority: Trivial
Fix For: 1.8
In a handful of files from govdocs1, there are some creative http-equiv
content-type headers, including:
{noformat}
<meta http-equiv="content-type" content="text/html; charset=iso-8859-1"
name="keywords" content="DNRC, division of nutrition">
{noformat}
The content type that ends up in the metadata for this file is "DNRC,
division of nutrition", i.e. the value of the second content attribute.
Let's modify our HTML meta-header charset detector to pick the first parseable
charset value.
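
A minimal sketch of the proposed behavior, not Tika's actual detector code. The class name MetaContentTypePicker, its methods, and the regular expressions are hypothetical and exist only for illustration. It assumes a value is "parseable" when it looks like type/subtype and any charset parameter names a charset the JVM supports.

```java
import java.nio.charset.Charset;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: illustrates "pick the first parseable content value".
final class MetaContentTypePicker {
    // Every content="..." or content='...' attribute inside one meta tag.
    private static final Pattern CONTENT_ATTR =
            Pattern.compile("(?i)\\bcontent\\s*=\\s*([\"'])(.*?)\\1");
    // Loose type/subtype check, optionally followed by parameters.
    private static final Pattern MEDIA_TYPE =
            Pattern.compile("[\\w.+-]+/[\\w.+-]+\\s*(;.*)?");
    // The charset parameter, with or without quotes around its value.
    private static final Pattern CHARSET_PARAM =
            Pattern.compile("(?i)charset\\s*=\\s*[\"']?([^\\s;\"']+)");

    // Returns the first content value that parses, in document order.
    static Optional<String> firstParseableContentType(String metaTag) {
        Matcher m = CONTENT_ATTR.matcher(metaTag);
        while (m.find()) {
            String value = m.group(2).trim();
            if (isParseable(value)) {
                return Optional.of(value);
            }
        }
        return Optional.empty();
    }

    // Assumption: parseable = media-type shape plus a supported charset, if any.
    static boolean isParseable(String value) {
        if (!MEDIA_TYPE.matcher(value).matches()) {
            return false;
        }
        Matcher cs = CHARSET_PARAM.matcher(value);
        if (!cs.find()) {
            return true; // no charset parameter; the media type alone is fine
        }
        try {
            return Charset.isSupported(cs.group(1));
        } catch (IllegalArgumentException e) {
            return false; // syntactically illegal charset name
        }
    }

    public static void main(String[] args) {
        String tag = "<meta http-equiv=\"content-type\" "
                + "content=\"text/html; charset=iso-8859-1\"\n"
                + "name=\"keywords\" content=\"DNRC, division of nutrition\">";
        System.out.println(firstParseableContentType(tag).orElse("(none)"));
    }
}
```

For the tag quoted above, this sketch skips "DNRC, division of nutrition" because it has no type/subtype form, and it would do so regardless of the order in which the two content attributes appear.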