Hi Chris,
Thanks for the ref to SpellCheckedMetadata.
Based on the previous decision, I'll go ahead and add some checks in
the HtmlHandler code to fix up capitalization issues, since Tika
itself is the "client" in this case (it consumes the content type
information).
https://issues.apache.org/jira/browse/TIKA-497 tracks this.
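For illustration only, a minimal sketch of the kind of capitalization fix-up described above. It assumes the goal is to turn an http-equiv name such as "content-type" into the usual header spelling "Content-Type". The class and method names (HeaderNames, canonicalize) are hypothetical; this is not the actual TIKA-497 change, and it uses no Tika API.

```java
// Hypothetical helper, not part of Tika: normalizes the capitalization of an
// http-equiv name so that it matches the usual HTTP header spelling.
class HeaderNames {

    // Upper-cases the first character of each hyphen-separated token and
    // lower-cases the rest, e.g. "content-type" becomes "Content-Type".
    static String canonicalize(String name) {
        StringBuilder out = new StringBuilder(name.length());
        boolean startOfToken = true;
        for (char c : name.trim().toCharArray()) {
            if (c == '-') {
                out.append(c);
                startOfToken = true;
            } else if (startOfToken) {
                out.append(Character.toUpperCase(c));
                startOfToken = false;
            } else {
                out.append(Character.toLowerCase(c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(canonicalize("content-type"));     // prints "Content-Type"
        System.out.println(canonicalize("CONTENT-LANGUAGE")); // prints "Content-Language"
    }
}
```

A handler could apply something like this to each http-equiv name before storing the value, so that the HTML-derived entry and the response-header entry end up under the same key.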
-- Ken
On Aug 23, 2010, at 10:13am, Mattmann, Chris A (388J) wrote:
Hey Ken,
RE: #1, see SpellCheckedMetadata [1]. Jerome and Sami and I worked
on it a long time ago, and it handles exactly the case you are
talking about. RE: #2, ehh...not sure! :) Jukka took out [1] in
r780895 [2], because he felt it would best be handled in client code.
Cheers,
Chris
[1] http://s.apache.org/eo
[2] http://svn.apache.org/viewvc/?rev=780895&view=rev
On 8/23/10 10:01 AM, "Ken Krugler" <[email protected]>
wrote:
I ran into an issue recently, where the metadata after a parse had two
versions of the same data.
One was from the HTTP response headers, and was called "Content-Type".
The other had been derived from a <meta http-equiv="content-type">
element in the HTML content.
That brings up two questions:
1. Should Tika's Metadata ensure that keys are unique when compared
case-insensitively?
2. For the above case, who wins? Based on HTML5's approach to charset
detection (see
http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html),
I think it's the response header, but based on experience, I think
it should be what's in the HTML.
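To make the two questions concrete, here is a rough sketch using only java.util, not Tika's Metadata class. It assumes keys are compared with String.CASE_INSENSITIVE_ORDER and that precedence is a simple flag. The class name, the merge method, and the charset values are made up for illustration.

```java
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: a map whose keys are unique ignoring case, plus a
// switch that decides whether the HTML value or the response header wins.
class CaseInsensitiveKeys {

    static Map<String, String> merge(Map<String, String> headers,
                                     Map<String, String> html,
                                     boolean htmlWins) {
        // "Content-Type" and "content-type" map to the same entry here.
        Map<String, String> merged = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        merged.putAll(headers);
        for (Map.Entry<String, String> e : html.entrySet()) {
            if (htmlWins) {
                merged.put(e.getKey(), e.getValue());         // HTML value overrides
            } else {
                merged.putIfAbsent(e.getKey(), e.getValue()); // header value is kept
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Illustrative values only.
        Map<String, String> headers =
            Collections.singletonMap("Content-Type", "text/html; charset=UTF-8");
        Map<String, String> html =
            Collections.singletonMap("content-type", "text/html; charset=ISO-8859-1");

        System.out.println(merge(headers, html, true).size());              // prints 1
        System.out.println(merge(headers, html, true).get("CONTENT-TYPE")); // HTML value
        System.out.println(merge(headers, html, false).get("content-type")); // header value
    }
}
```

With a map like this, question 1 is answered by the comparator, and question 2 reduces to choosing between put and putIfAbsent.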
-- Ken
--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++