Hey Ken,

RE: #1, see SpellCheckedMetadata [1]. Jerome, Sami, and I worked on it a long
time ago, and it handles exactly the case you are talking about. Jukka took it
out in r780895 [2], though, because he felt this would best be handled in
client code. RE: #2, ehh...not sure! :)
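
For the client-code route on #1, here is a minimal sketch of one way to fold
keys that differ only by case. It assumes the metadata has already been
flattened into name/value pairs. The class name, the method name, and the
charset values are made up for illustration, and none of this is Tika API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Tika: groups values whose names differ
// only by case under a single key.
class MetadataKeyMerge {

    // Each entry is a {name, value} pair. The first spelling seen for a
    // name is the one kept as the map key; values keep their arrival order.
    static Map<String, List<String>> mergeIgnoringCase(List<String[]> entries) {
        Map<String, List<String>> merged =
                new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        for (String[] entry : entries) {
            merged.computeIfAbsent(entry[0], k -> new ArrayList<>())
                  .add(entry[1]);
        }
        return merged;
    }

    public static void main(String[] args) {
        // Illustrative values only: one pair from the HTTP response header,
        // one from the <meta http-equiv> element.
        List<String[]> entries = Arrays.asList(
                new String[] {"Content-Type", "text/html; charset=ISO-8859-1"},
                new String[] {"content-type", "text/html; charset=UTF-8"});

        Map<String, List<String>> merged = mergeIgnoringCase(entries);
        System.out.println(merged.keySet());
        System.out.println(merged.get("CONTENT-TYPE"));
    }
}
```

Both values survive under one key, in the order they were added, so the
"who wins" policy from #2 stays a separate decision for the caller (first
value, last value, or something smarter).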

Cheers,
Chris

[1] http://s.apache.org/eo
[2] http://svn.apache.org/viewvc/?rev=780895&view=rev

On 8/23/10 10:01 AM, "Ken Krugler" <[email protected]> wrote:

I ran into an issue recently, where the metadata after a parse had two
versions of the same data.

One was from the HTTP response headers, and was called "Content-Type".

The other had been derived from a <meta http-equiv="content-type">
element in the HTML content.

That brings up two questions:

1. Should Tika's Metadata ensure that keys are unique ignoring case?

2. For the above case, who wins? Based on HTML5's approach to charset
detection (see
http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html),
I think it's the response header, but based on experience, I think
it should be what's in the HTML.

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
