[ 
https://issues.apache.org/jira/browse/TIKA-1519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14283891#comment-14283891
 ] 

Tim Allison commented on TIKA-1519:
-----------------------------------

[~lfcnassif], yes, I completely agree.  I think we may want three categories of 
Content-Type:

# Detected content-type: whatever type the detector finally decided on.
# Content-type hint: a hint that is sent into the detector before the 
parse.  The detector can do whatever it wants with this information.  The hint 
could come from the HTTP header sent by the server from which a document was 
retrieved, an http-equiv meta header, or some other hint that a user wants to 
pass into the parser before the parse.  This key would be multivalued, and we 
might eventually consider adding priors (a la TIKA-1517).
# Override content-type: I think we should add an OverrideDetector that is 
called before any of the other detectors.  If the client specifies this key, 
it means: trust this content-type absolutely, and do not run any detection on 
the file.  In the PSTParser, for example, if I understand it correctly, we're 
sending UTF-8 encoded bytes into an EmbeddedDocumentParser, and I think that 
we're hoping that the content-type will be correctly identified.  It would be 
better if we could set a key that would prevent potentially incorrect detection.

Eventually (Tika 2.0?), I propose that "Content-Type" should only be used for 
the content-type identified via detection.  For now, so that we don't break 
backward compatibility, let's leave the current ambiguity of "Content-Type" as 
it is and add other keys for categories 2 and 3 in the list above.

What do people think of "Content-Type-Hint" and "Content-Type-Override"?
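
To make the three categories concrete, here is a rough sketch in plain Java. It is only an illustration: the class ContentTypeKeysSketch, the SimpleDetector interface, and the map standing in for Metadata are all hypothetical stand-ins, not the real Tika API, and the two key names are just the candidates proposed above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Sketch only: plain-Java stand-ins, not the real Tika Metadata/Detector classes. */
class ContentTypeKeysSketch {

    // Reserved (eventually) for the type chosen by detection.
    static final String CONTENT_TYPE = "Content-Type";
    // Candidate key names from this comment; nothing here is agreed on yet.
    static final String HINT = "Content-Type-Hint";
    static final String OVERRIDE = "Content-Type-Override";

    /** Hypothetical stand-in for a detector that sees the bytes and the metadata. */
    interface SimpleDetector {
        String detect(byte[] prefix, Map<String, List<String>> metadata);
    }

    /** Adds a hint without replacing earlier ones, since hints are multivalued. */
    static void addHint(Map<String, List<String>> metadata, String type) {
        metadata.computeIfAbsent(HINT, k -> new ArrayList<>()).add(type);
    }

    /**
     * If an override is present, trust it and skip detection entirely.
     * Otherwise delegate to the fallback detector, which is free to
     * consult (or ignore) any hints stored in the metadata.
     */
    static String detect(byte[] prefix, Map<String, List<String>> metadata,
                         SimpleDetector fallback) {
        List<String> override = metadata.get(OVERRIDE);
        if (override != null && !override.isEmpty()) {
            return override.get(0);
        }
        return fallback.detect(prefix, metadata);
    }
}
```

In this sketch the override check runs before any other detection, so a caller such as the PSTParser could set the override key on an embedded document and the fallback detector would never be invoked for it.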


> Don't allow whatever is in http-equiv Content-Type to overwrite actual 
> Content-Type in HtmlParser
> -------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-1519
>                 URL: https://issues.apache.org/jira/browse/TIKA-1519
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.6
>            Reporter: Tim Allison
>            Priority: Trivial
>             Fix For: 1.8
>
>
> The HtmlParser will overwrite the value of Content-Type in Metadata with any 
> value of content in an http-equiv=Content-Type header, e.g.
> {noformat}
> <meta http-equiv=Content-Type content="blah de blah blah">
> {noformat}
> or even worse, perhaps:
> {noformat}
> <meta http-equiv=Content-Type content="application/pdf">
> {noformat}
> Let's capture the content type alleged by the html file in a different key 
> from Content-Type; I'd prefer to reserve Content-Type for "text/html; 
> charset=X".
> Candidate key/Property: Content-Type-Meta-HTTP-Equiv?
> See TIKA-1514 for example output.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
