[ https://issues.apache.org/jira/browse/TIKA-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277503#comment-14277503 ]
Tim Allison commented on TIKA-1514:
-----------------------------------
I dug into this a bit. Fixing this particular problem would take more effort
than it is worth. The encoding extractor already chooses the correct value
when there is more than one.
However, I found that our HtmlParser is setting the "Content-Type" to whatever
the last value of "content" is in an http-equiv header.
So, in this case:
{noformat}
<meta http-equiv=Content-Type content="text/html; charset=iso-8859-6"
content="application/pdf">
{noformat}
The metadata is:
{noformat}
Content-Encoding:ISO-8859-6
X-Parsed-By:org.apache.tika.parser.DefaultParser
X-Parsed-By:org.apache.tika.parser.html.HtmlParser
Content-Type:application/pdf
{noformat}
Or in this case:
{noformat}
<meta http-equiv=Content-Type content="text/html; charset=iso-8859-6"
content="blah de blah blah">
{noformat}
The metadata is:
{noformat}
Content-Encoding:ISO-8859-6
X-Parsed-By:org.apache.tika.parser.DefaultParser
X-Parsed-By:org.apache.tika.parser.html.HtmlParser
Content-Type:blah de blah blah
{noformat}
Shouldn't we be setting the Content-Type to "text/html; charset=ISO-8859-6" so
that malformed or incorrect HTML won't yield incorrect Content-Type data? We
can record what the Content-Type meta http-equiv header alleges under a
different key ("Content-Type-Meta-HTTP-Equiv"?), but I'd prefer "Content-Type"
to mean the content type that Tika detected, not whatever the HTML happened to
include.
What do others think?
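To make the proposal concrete, here is a minimal sketch of the metadata I have in mind. It is not a patch against HtmlParser: it uses a plain Map rather than Tika's Metadata class, the class and method names are made up for illustration, "Content-Type-Meta-HTTP-Equiv" is only the key name floated above, and recording the last "content" value as the alleged one is an arbitrary choice.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Tika.
final class ContentTypePolicy {

    // Key name floated in this comment thread; an assumption, not an existing Tika property.
    static final String ALLEGED_KEY = "Content-Type-Meta-HTTP-Equiv";

    // Builds metadata in which "Content-Type" reflects what was detected,
    // while the value alleged by the meta http-equiv header is kept separately.
    static Map<String, String> buildMetadata(String detectedType,
                                             String detectedCharset,
                                             List<String> allegedContentValues) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("Content-Encoding", detectedCharset);
        metadata.put("Content-Type", detectedType + "; charset=" + detectedCharset);
        if (!allegedContentValues.isEmpty()) {
            // Assumption: record the last "content" value, i.e. the one that
            // currently ends up in "Content-Type".
            String alleged = allegedContentValues.get(allegedContentValues.size() - 1);
            metadata.put(ALLEGED_KEY, alleged);
        }
        return metadata;
    }
}
```

With the first example above, this would leave "text/html; charset=ISO-8859-6" in Content-Type and move "application/pdf" to the separate key.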
> http-equiv content-type extraction should pick first parseable content value
> -----------------------------------------------------------------------------
>
> Key: TIKA-1514
> URL: https://issues.apache.org/jira/browse/TIKA-1514
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.6
> Reporter: Tim Allison
> Priority: Trivial
> Fix For: 1.8
>
>
> In a handful of files from govdocs1, there are some creative http-equiv
> content-type headers, including:
> {noformat}
> <meta http-equiv="content-type" content="text/html; charset=iso-8859-1"
> name="keywords" content="DNRC, division of nutrition">
> {noformat}
> The content type that is going into the metadata for this file is "DNRC,
> division of nutrition".
> Let's modify our html metaheader charset detector to pick the first parseable
> charset value.
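For reference, the "first parseable charset value" idea from the issue description could be sketched roughly as below. This is a standalone illustration using only the JDK, not the code in Tika's charset detector. The class name, method name, and regular expression are assumptions, and "parseable" is taken here to mean that the value contains a charset parameter naming a charset the JVM supports.

```java
import java.nio.charset.Charset;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Tika.
final class FirstParseableCharset {

    // Matches charset=<name>, with optional quotes around the name.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([^\\s;\"']+)", Pattern.CASE_INSENSITIVE);

    // Returns the charset from the first "content" value that carries a
    // supported charset parameter; later values are ignored.
    static Optional<Charset> firstParseable(List<String> contentValues) {
        for (String value : contentValues) {
            Matcher m = CHARSET.matcher(value);
            if (!m.find()) {
                continue; // no charset parameter, e.g. "DNRC, division of nutrition"
            }
            String name = m.group(1);
            try {
                if (Charset.isSupported(name)) {
                    return Optional.of(Charset.forName(name));
                }
            } catch (IllegalArgumentException e) {
                // Illegal charset name: treat as unparseable and keep looking.
            }
        }
        return Optional.empty();
    }
}
```

Given the two "content" values from the govdocs1 example in the description, this would pick ISO-8859-1 from the first value and never look at the keywords string.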
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)