I recently ran into an issue where the metadata after a parse contained two versions of the same data.

One came from the HTTP response headers, under the key "Content-Type".

The other had been derived from a <meta http-equiv="content-type"> element in the HTML content.

That brings up two questions:

1. Should Tika's Metadata ensure that keys are unique regardless of case?

2. For the above case, who wins? Based on HTML5's approach to charset detection (see http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html), I think it's the response header, but based on experience, I think it should be what's in the HTML.
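
To make both questions concrete, here is a minimal sketch in plain Java. It is not Tika's Metadata class; the class name, the merge() helper and the Precedence enum are all hypothetical names made up for illustration. It assumes one value per key, and it uses a TreeMap ordered by String.CASE_INSENSITIVE_ORDER so that "Content-Type" and "content-type" collapse into one entry, with a switch for which source wins.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch, not part of Tika: case-insensitive metadata keys
// with an explicit rule for which source wins on a collision.
final class CaseInsensitiveMetadataSketch {

    /** Hypothetical policy switch for question 2. */
    enum Precedence { RESPONSE_HEADER, HTML_META }

    static Map<String, String> merge(Map<String, String> responseHeaders,
                                     Map<String, String> htmlMeta,
                                     Precedence precedence) {
        // Keys that differ only in case compare as equal in this map.
        Map<String, String> merged = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);

        boolean headerWins = precedence == Precedence.RESPONSE_HEADER;
        Map<String, String> loser = headerWins ? htmlMeta : responseHeaders;
        Map<String, String> winner = headerWins ? responseHeaders : htmlMeta;

        // Insert the losing source first so the winning source overwrites it.
        // Note: TreeMap keeps the spelling of the key that was inserted first
        // and only replaces the value.
        merged.putAll(loser);
        merged.putAll(winner);
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> headers = Map.of("Content-Type", "text/html; charset=ISO-8859-1");
        Map<String, String> meta = Map.of("content-type", "text/html; charset=UTF-8");

        Map<String, String> merged = merge(headers, meta, Precedence.HTML_META);
        System.out.println(merged.size());              // one entry, not two
        System.out.println(merged.get("CONTENT-TYPE")); // lookup ignores case
    }
}
```

With Precedence.HTML_META the value from the <meta> element survives; with Precedence.RESPONSE_HEADER the header value does. Either way only one entry remains, which is the behavior question 1 is asking about.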

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g



