no objections, +1 from me.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: "Tim Allison (JIRA)" <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Friday, April 10, 2015 at 2:30 PM
To: "[email protected]" <[email protected]>
Subject: [jira] [Commented] (TIKA-1519) Don't allow whatever is in http-equiv Content-Type to overwrite actual Content-Type in HtmlParser

>
>    [ https://issues.apache.org/jira/browse/TIKA-1519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14490076#comment-14490076 ]
>
>Tim Allison commented on TIKA-1519:
>-----------------------------------
>
>With the initial TIKA-1519 change, we went from 217 unique mime types to
>129 unique mime types in govdocs1. Some of this is due to the collapse
>of the various charsets in {{text/html; charset=XXX}} to the charset-less
>{{application/xhtml+xml}}. However, quite a few of the decreases are
>great because they represent a likely correct normalization. For
>example, there are 49 different values in 1.7 for the single value in
>Tika 1.8-rc1's {{text/html; charset=ISO-8859-1}}. The top few include:
>
>||mime||doc count||
>|text/html; charset=ISO-8859-1|49039|
>|text/html; charset=iso-8859-1|36373|
>|text/html|243|
>|text/html; charset=windows-1252|234|
>|text/html; charset=utf-8|71|
>|text/html; charset=Windows-1252|49|
>|text/html; iso-8859-1=|38|
>|text/html; charset=iso8859-1|25|
>|text/html; charset=macintosh|25|
>|application/xml|22|
>|text/html; charset=iso_8859_1|19|
>
>Bottom line last.
>After reading through TIKA-431, I think we might consider adding {{;
>charset=xyz}} to {{application/xhtml+xml}}. However, as stated above, I
>have very little knowledge of the standards.
>
>Any objections?
>
>
>> Don't allow whatever is in http-equiv Content-Type to overwrite actual
>>Content-Type in HtmlParser
>>
>>-------------------------------------------------------------------------
>>------------------------
>>
>> Key: TIKA-1519
>> URL: https://issues.apache.org/jira/browse/TIKA-1519
>> Project: Tika
>> Issue Type: Bug
>> Affects Versions: 1.6
>> Reporter: Tim Allison
>> Priority: Trivial
>> Fix For: 1.8
>>
>> Attachments: TIKA-1519.patch
>>
>>
>> The HtmlParser will overwrite the value of Content-Type in Metadata
>>with any value of content in an http-equiv=Content-Type header, e.g.
>> {noformat}
>> <meta http-equiv=Content-Type content="blah de blah blah">{noformat}.
>> or even worse, perhaps:
>> <meta http-equiv=Content-Type content="application/pdf">
>> Let's capture the content type alleged by the html file in a different
>>key from Content-Type; I'd prefer to reserve Content-Type for
>>"text/html; charset=X".
>> Candidate key/Property: Content-Type-Meta-HTTP-Equiv?
>> See TIKA-1514 for example output.
>
>
>
>--
>This message was sent by Atlassian JIRA
>(v6.3.4#6332)
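
For readers skimming the thread, the behaviour proposed in the issue (leave Content-Type alone and record the value alleged by the http-equiv meta tag under a separate key) can be sketched roughly as follows. This is an illustration only, not the TIKA-1519 patch and not Tika's HtmlParser code: the class name HttpEquivSketch, the record method, the plain Map standing in for Tika's Metadata, and the simplified regex are all assumptions made for the example. The key name is the candidate floated in the issue text.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; hypothetical names, not Tika's actual implementation.
final class HttpEquivSketch {

    // Candidate key name proposed in the issue description.
    static final String META_KEY = "Content-Type-Meta-HTTP-Equiv";

    // Deliberately simple pattern: assumes http-equiv comes before content
    // and that the content value is double-quoted, as in the issue's examples.
    private static final Pattern META = Pattern.compile(
            "<meta\\s+http-equiv\\s*=\\s*[\"']?Content-Type[\"']?\\s+content\\s*=\\s*\"([^\"]*)\"",
            Pattern.CASE_INSENSITIVE);

    // Stores the detected type under Content-Type and, if a meta tag is found,
    // stores its alleged type under the separate key instead of overwriting.
    static Map<String, String> record(String html, String detectedContentType) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("Content-Type", detectedContentType);
        Matcher m = META.matcher(html);
        if (m.find()) {
            metadata.put(META_KEY, m.group(1));
        }
        return metadata;
    }

    public static void main(String[] args) {
        String html = "<html><head>"
                + "<meta http-equiv=Content-Type content=\"application/pdf\">"
                + "</head><body></body></html>";
        System.out.println(record(html, "text/html; charset=ISO-8859-1"));
    }
}
```

With the "even worse" example from the issue, Content-Type stays at the detected text/html value and application/pdf ends up only under the separate key. A real parser would work from the parsed meta element rather than a regex over raw markup.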
