Hey Ken,

RE: #1, see SpellCheckedMetadata [1]. Jerome, Sami, and I worked on it a long time ago, and it handles exactly the case you are talking about.

RE: #2, ehh... not sure! :) Jukka took out [1] in r780895 [2] because he felt it would be best handled in client code.
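To illustrate what handling this in client code could look like, here is a minimal sketch of a metadata holder with case-insensitive keys. It is not the SpellCheckedMetadata code and does not use Tika's Metadata API; the class name CaseInsensitiveMetadata and all of its methods are hypothetical, made up for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration only; not part of Tika. */
class CaseInsensitiveMetadata {

    // A TreeMap ordered by String.CASE_INSENSITIVE_ORDER treats
    // "Content-Type" and "content-type" as the same key.
    private final Map<String, List<String>> values =
            new TreeMap<>(String.CASE_INSENSITIVE_ORDER);

    /** Appends a value under the name, ignoring case differences in the name. */
    void add(String name, String value) {
        values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    /** Replaces every existing value for the name with a single value. */
    void set(String name, String value) {
        List<String> single = new ArrayList<>();
        single.add(value);
        values.put(name, single);
    }

    /** Returns the first value stored for the name, or null if there is none. */
    String get(String name) {
        List<String> list = values.get(name);
        return list == null ? null : list.get(0);
    }

    /** Returns all values stored for the name, in insertion order. */
    List<String> getValues(String name) {
        List<String> list = values.get(name);
        return list == null
                ? Collections.<String>emptyList()
                : Collections.unmodifiableList(list);
    }

    /** Number of distinct names, compared case-insensitively. */
    int size() {
        return values.size();
    }
}
```

With a holder like this, the "who wins" policy from #2 stays with the caller: add() keeps both the header value and the meta-element value under one key, while set() lets whichever source the client trusts overwrite the other.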
Cheers,
Chris

[1] http://s.apache.org/eo
[2] http://svn.apache.org/viewvc/?rev=780895&view=rev

On 8/23/10 10:01 AM, "Ken Krugler" <[email protected]> wrote:

I ran into an issue recently, where the metadata after a parse had two versions of the same data.

One was from the HTTP response headers, and was called "Content-Type".

The other had been derived from a <meta http-equiv="content-type"> element in the HTML content.

That brings up two questions:

1. Should Tika's Metadata ensure that keys are case-insensitive unique?

2. For the above case, who wins? Based on HTML5's approach to charset detection (see http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html), I think it's the response header, but based on experience, I think it should be what's in the HTML.

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
