I recently ran into an issue where the metadata produced by a parse
contained two versions of the same data.
One came from the HTTP response headers, under the key "Content-Type".
The other had been derived from a <meta http-equiv="content-type">
element in the HTML content.
That brings up two questions:
1. Should Tika's Metadata ensure that keys are unique when compared
case-insensitively?
2. For the above case, which value wins? Based on HTML5's approach to
charset detection (see http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html),
I think it's the response header, but based on experience, I think
it should be what's in the HTML.
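
To make the two questions concrete, here is a minimal sketch of what case-insensitive keys plus an explicit precedence rule could look like. CaseInsensitiveMetadata and its methods are hypothetical names made up for illustration; this is not Tika's Metadata class, and the charset values are invented sample data.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical stand-in for a metadata container; NOT Tika's Metadata class. */
final class CaseInsensitiveMetadata {
    // With CASE_INSENSITIVE_ORDER, "Content-Type" and "content-type" are the same key.
    private final Map<String, String> values =
            new TreeMap<>(String.CASE_INSENSITIVE_ORDER);

    /** Sets a value, replacing any entry whose name differs only by case. */
    void set(String name, String value) {
        // Remove first so the stored key spelling follows the latest writer.
        values.remove(name);
        values.put(name, value);
    }

    /** Sets a value only if no entry with this name (ignoring case) exists yet. */
    void setIfAbsent(String name, String value) {
        values.putIfAbsent(name, value);
    }

    String get(String name) {
        return values.get(name);
    }

    int size() {
        return values.size();
    }

    public static void main(String[] args) {
        // Policy A: the response header wins; the <meta> value is only a fallback.
        CaseInsensitiveMetadata headerWins = new CaseInsensitiveMetadata();
        headerWins.set("Content-Type", "text/html; charset=ISO-8859-1");
        headerWins.setIfAbsent("content-type", "text/html; charset=UTF-8");
        System.out.println(headerWins.get("Content-Type"));

        // Policy B: the HTML wins; the <meta> value overwrites the header value.
        CaseInsensitiveMetadata htmlWins = new CaseInsensitiveMetadata();
        htmlWins.set("Content-Type", "text/html; charset=ISO-8859-1");
        htmlWins.set("content-type", "text/html; charset=UTF-8");
        System.out.println(htmlWins.get("Content-Type"));
    }
}
```

Either way there is only ever one entry, so question 2 reduces to whether the HTML parser calls something like set() or setIfAbsent() when it finds the <meta> element.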
-- Ken
--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g