[
https://issues.apache.org/jira/browse/TIKA-1519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14283891#comment-14283891
]
Tim Allison commented on TIKA-1519:
-----------------------------------
[~lfcnassif], yes, I completely agree. I think we may want three categories of
Content-Type:
# Detected content-type: whatever type the detector finally decided on for the
file.
# Content-type hint: a hint that is passed to the detector before the parse.
The detector can do whatever it wants with this information. The hint could
come from the HTTP header sent by the server from which a document was
retrieved, an http-equiv meta header, or some other hint that a user wants to
pass into the parser before the parse. This would be multivalued, and we might
eventually consider adding priors (a la TIKA-1517).
# Override content-type: I think we should add an OverrideDetector that is
called before any of the other detectors. If the client specifies this key,
it means: trust this content-type absolutely and do not run any detection on
the file. In the PSTParser, for example, if I understand it correctly, we're
sending UTF-8 encoded bytes into an EmbeddedDocumentParser and hoping that the
content-type will be correctly identified. It would be better if we could set
a key that would prevent potentially incorrect detection.
Eventually (Tika 2.0?), I propose that "Content-Type" should only be used for
the content-type identified via detection. For now, so that we don't break
backward compatibility, let's leave the current ambiguity of "Content-Type" as
it is and add other keys for types 2) and 3).
What do people think of "Content-Type-Hint" and "Content-Type-Override"?
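
To make the precedence concrete, here is a minimal sketch of how the three categories could interact. It is plain Java and deliberately does not use Tika's Detector or Metadata classes: the class ContentTypeSketch, the resolve method and the map-based metadata are all hypothetical stand-ins, and a real detector would also look at the document bytes. Only the two key names come from the proposal above.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: plain-Java stand-ins, not Tika's Detector or Metadata classes.
final class ContentTypeSketch {
    // Key names as proposed in this comment; assumed here, not an existing API.
    static final String HINT = "Content-Type-Hint";
    static final String OVERRIDE = "Content-Type-Override";
    static final String DETECTED = "Content-Type";

    /**
     * Resolves the content type for a document's metadata.
     * If an override is present it is trusted and the fallback detector is
     * never called; otherwise the fallback detector runs and is free to
     * consult (or ignore) the multivalued hints.
     */
    static String resolve(Map<String, List<String>> metadata,
                          Function<Map<String, List<String>>, String> fallbackDetector) {
        List<String> override = metadata.get(OVERRIDE);
        String detected = (override != null && !override.isEmpty())
                ? override.get(0)
                : fallbackDetector.apply(metadata);
        // Record the final decision under the "detected" key.
        metadata.put(DETECTED, List.of(detected));
        return detected;
    }

    public static void main(String[] args) {
        // Toy detector: uses the first hint if there is one, else a generic type.
        Function<Map<String, List<String>>, String> toyDetector = md -> {
            List<String> hints = md.getOrDefault(HINT, List.of());
            return hints.isEmpty() ? "application/octet-stream" : hints.get(0);
        };

        Map<String, List<String>> hinted = new HashMap<>();
        hinted.put(HINT, List.of("text/html", "application/xhtml+xml"));
        System.out.println(resolve(hinted, toyDetector));

        Map<String, List<String>> overridden = new HashMap<>(hinted);
        overridden.put(OVERRIDE, List.of("text/plain; charset=UTF-8"));
        System.out.println(resolve(overridden, toyDetector));
    }
}
```

In this sketch the override short-circuits everything else, which is the behaviour a parser like the PSTParser would want when it already knows what it is handing to the embedded parser. The hints stay advisory.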
> Don't allow whatever is in http-equiv Content-Type to overwrite actual
> Content-Type in HtmlParser
> -------------------------------------------------------------------------------------------------
>
> Key: TIKA-1519
> URL: https://issues.apache.org/jira/browse/TIKA-1519
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.6
> Reporter: Tim Allison
> Priority: Trivial
> Fix For: 1.8
>
>
> The HtmlParser will overwrite the value of Content-Type in Metadata with any
> value of content in an http-equiv=Content-Type header, e.g.
> {noformat}
> <meta http-equiv=Content-Type content="blah de blah blah">
> {noformat}
> or even worse, perhaps:
> {noformat}
> <meta http-equiv=Content-Type content="application/pdf">
> {noformat}
> Let's capture the content type alleged by the html file in a different key
> from Content-Type; I'd prefer to reserve Content-Type for "text/html;
> charset=X".
> Candidate key/Property: Content-Type-Meta-HTTP-Equiv?
> See TIKA-1514 for example output.