no objections, +1 from me.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: "Tim Allison   (JIRA)" <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Friday, April 10, 2015 at 2:30 PM
To: "[email protected]" <[email protected]>
Subject: [jira] [Commented] (TIKA-1519) Don't allow whatever is in
http-equiv Content-Type to overwrite actual Content-Type in HtmlParser

>
>    [ https://issues.apache.org/jira/browse/TIKA-1519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14490076#comment-14490076 ]
>
>Tim Allison commented on TIKA-1519:
>-----------------------------------
>
>With the initial TIKA-1519 change, we went from 217 unique mime types to
>129 unique mime types in govdocs1.  Some of this is due to the collapse
>of the various charsets in {{text/html; charset=XXX}} to the charset-less
>{{application/xhtml+xml}}.  However, quite a few of the decreases are
>welcome because they likely represent a correct normalization.  For
>example, 49 different values in Tika 1.7 map to the single value
>{{text/html; charset=ISO-8859-1}} in Tika 1.8-rc1.  The top few include:
>
>||mime||doc count||
>|text/html; charset=ISO-8859-1|49039|
>|text/html; charset=iso-8859-1|36373|
>|text/html|243|
>|text/html; charset=windows-1252|234|
>|text/html; charset=utf-8|71|
>|text/html; charset=Windows-1252|49|
>|text/html; iso-8859-1=|38|
>|text/html; charset=iso8859-1|25|
>|text/html; charset=macintosh|25|
>|application/xml|22|
>|text/html; charset=iso_8859_1|19|
>
>Bottom line:
>After reading through TIKA-431, I think we might consider adding {{;
>charset=xyz}} to {{application/xhtml+xml}}.  However, as stated above, I
>have very little knowledge of the standards.
>
>Any objections?
>
>
>> Don't allow whatever is in http-equiv Content-Type to overwrite actual
>> Content-Type in HtmlParser
>> ----------------------------------------------------------------------
>>
>>                 Key: TIKA-1519
>>                 URL: https://issues.apache.org/jira/browse/TIKA-1519
>>             Project: Tika
>>          Issue Type: Bug
>>    Affects Versions: 1.6
>>            Reporter: Tim Allison
>>            Priority: Trivial
>>             Fix For: 1.8
>>
>>         Attachments: TIKA-1519.patch
>>
>>
>> The HtmlParser will overwrite the value of Content-Type in Metadata
>> with any value of content in an http-equiv=Content-Type header, e.g.
>> {noformat}
>> <meta http-equiv=Content-Type content="blah de blah blah">
>> {noformat}
>> or, even worse, perhaps:
>> <meta http-equiv=Content-Type content="application/pdf">
>> Let's capture the content type alleged by the HTML file in a different
>> key from Content-Type; I'd prefer to reserve Content-Type for
>> "text/html; charset=X".
>> Candidate key/Property: Content-Type-Meta-HTTP-Equiv?
>> See TIKA-1514 for example output.
>
>
>
>--
>This message was sent by Atlassian JIRA
>(v6.3.4#6332)
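
A side note on the table in the quoted comment: many of the 1.7 values differ only in the case or spelling of the charset name. Below is a minimal sketch of that kind of normalization, assuming plain Java and the JDK's own charset alias table. It is not Tika's actual code, and the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.util.Locale;

// Hypothetical helper, NOT part of Tika: collapses spelling variants of the
// charset parameter in a Content-Type value to one canonical form.
final class MimeNormalizerSketch {

    static String normalize(String contentType) {
        // Split "type/subtype; param" into the base type and the rest.
        String[] parts = contentType.split(";", 2);
        String base = parts[0].trim().toLowerCase(Locale.ROOT);
        if (parts.length < 2) {
            return base; // no parameters at all
        }
        String param = parts[1].trim();
        int eq = param.indexOf('=');
        if (eq < 0 || !param.substring(0, eq).trim().equalsIgnoreCase("charset")) {
            return base; // malformed or non-charset parameter: drop it
        }
        String name = param.substring(eq + 1).trim();
        try {
            // Charset.forName resolves aliases case-insensitively;
            // name() returns the canonical name.
            return base + "; charset=" + Charset.forName(name).name();
        } catch (IllegalArgumentException e) {
            // Illegal or unsupported charset name: keep only the base type.
            return base;
        }
    }

    public static void main(String[] args) {
        System.out.println(normalize("text/html; charset=iso8859-1"));
    }
}
```

With this sketch, the inputs text/html; charset=iso-8859-1, text/html; charset=iso8859-1 and text/html; charset=iso_8859_1 all come back as text/html; charset=ISO-8859-1, while the malformed text/html; iso-8859-1= falls back to text/html. Quoted charset values and multiple parameters are deliberately not handled here.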

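The issue description in the quoted message proposes keeping whatever the page claims in its http-equiv header under a separate key, so it cannot overwrite the detected Content-Type. Here is a rough sketch of that idea, assuming a plain Map stands in for Tika's Metadata object. The class and method are hypothetical, and the key name is only the candidate floated in the issue, not a confirmed Tika property.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration, NOT Tika's HtmlParser: a Map stands in for Metadata.
final class HttpEquivSketch {

    static final String CONTENT_TYPE = "Content-Type";
    // Candidate key name taken from the issue text; an assumption, not a real constant.
    static final String META_KEY = "Content-Type-Meta-HTTP-Equiv";

    // Records the page's own claim under META_KEY and leaves CONTENT_TYPE untouched.
    static void recordMeta(Map<String, String> metadata, String httpEquiv, String content) {
        if (httpEquiv != null && content != null
                && httpEquiv.trim().equalsIgnoreCase(CONTENT_TYPE)) {
            metadata.put(META_KEY, content.trim());
        }
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put(CONTENT_TYPE, "text/html; charset=ISO-8859-1");
        recordMeta(metadata, "Content-Type", "application/pdf");
        System.out.println(metadata);
    }
}
```

The point of the design is that a bogus value such as application/pdf ends up only under the separate key, and the Content-Type entry set by the parser stays as it was.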