On Sep 7, 2010, at 5:58am, Staffan wrote:
On Tue, Sep 7, 2010 at 10:43 AM, Nick Burch wrote:
On Mon, 6 Sep 2010, Ken Krugler wrote:
I recently updated the Bixo project to use Tika 0.8-SNAPSHOT, and a number
of documents now fail during parsing that previously passed.
Any chance you could create a new JIRA issue and upload one of the problem
documents?
Did the Tika-0.7 image parsers (JPEG, GIF, PNG) not extract metadata, and
thus not run into these types of issues?
The image metadata stuff has changed dramatically since 0.7, and we're now
processing a lot more of the files in search of useful metadata than we used
to.
The exception is thrown before we start to extract the metadata. It
looks like the file is auto-detected as a JPEG, but the EXIF parser
(the same version that Tika has used for a long time) says it is not a
JPEG. Please attach one of the failing files to the issue.
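
A quick way to sanity-check a suspect file by hand is to look at its first two
bytes: a JPEG stream starts with the SOI marker 0xFF 0xD8. The sketch below
does only that. It is not what Tika's detector or the EXIF parser does
internally, and the class and method names are made up for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Tika or of the EXIF parser.
public class JpegSignatureCheck {

    /** Returns true if the data begins with the JPEG SOI marker (0xFF 0xD8). */
    public static boolean startsWithJpegSoi(byte[] data) {
        return data != null
                && data.length >= 2
                && (data[0] & 0xFF) == 0xFF
                && (data[1] & 0xFF) == 0xD8;
    }

    /** Reads the first two bytes of a file and applies the same check. */
    public static boolean startsWithJpegSoi(Path file) throws IOException {
        byte[] head = new byte[2];
        int filled = 0;
        try (InputStream in = Files.newInputStream(file)) {
            // read() may return fewer bytes than requested, so loop until
            // both bytes are read or the stream ends.
            while (filled < head.length) {
                int n = in.read(head, filled, head.length - filled);
                if (n < 0) {
                    break;
                }
                filled += n;
            }
        }
        return filled == head.length && startsWithJpegSoi(head);
    }
}
```

A file that fails this check was probably mislabelled before it reached the
parser (for example by its extension or a Content-Type header). A file that
passes can still be truncated or corrupt further in, so passing proves little.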
I'm extracting these from a .arc web archive file (from the Heritrix
project). So I'll have to write some code to save these as individual
files - hopefully next week.
-- Ken
--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g