On Tue, 24 Jun 2014, Andrew Skiba wrote:
I started with org.apache.tika.parser.microsoft.WordExtractor and immediately saw that it already makes a recursive call to the org.apache.tika.parser.image.ImageParser. But ImageParser currently only enriches metadata, and does not create the <img> element itself. This is done in the WordExtractor and in the respective handlers for types other than MS Word.

I would've thought it would only trigger ImageParser if you had set the AutoDetectParser on the parse context. Did you?

My idea was that you'd have a content handler + recursing parser class pair. The handler would rewrite the img tag when it came through, and the recursing parser would capture the image when it is triggered, so that you have the image data needed for the rewrite. (This is largely what the Alfresco class I suggested you look at does.)
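
Very roughly, the handler half of that pair could look like the sketch below. It uses only the plain SAX classes from the JDK, so nothing in it is Tika API. The class name ImgRewritingHandler, the map of captured images and the "embedded:" prefix on the src attribute are all assumptions made for illustration; check what the parsers actually emit before relying on any of it. The recursing parser half, which would fill the map as each embedded image is parsed, is not shown.

```java
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical handler: forwards every SAX event unchanged, except that
// <img src="embedded:NAME"> gets its src replaced when NAME was captured.
class ImgRewritingHandler extends XMLFilterImpl {
    // Assumed convention for references to embedded resources.
    static final String PREFIX = "embedded:";

    // Image name -> replacement src (for example a data: URI). Assumed to be
    // filled by the recursing parser half of the pair, which is not shown.
    private final Map<String, String> capturedImages;

    ImgRewritingHandler(ContentHandler downstream, Map<String, String> capturedImages) {
        setContentHandler(downstream);
        this.capturedImages = capturedImages;
    }

    // Returns the replacement for src, or src itself if nothing was captured.
    static String rewriteSrc(String src, Map<String, String> images) {
        if (src == null || !src.startsWith(PREFIX)) {
            return src;
        }
        String replacement = images.get(src.substring(PREFIX.length()));
        return replacement != null ? replacement : src;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if ("img".equals(localName)) {
            int i = atts.getIndex("src"); // lookup by qualified name
            if (i >= 0) {
                // Copy first: the incoming Attributes object must not be modified.
                AttributesImpl copy = new AttributesImpl(atts);
                copy.setValue(i, rewriteSrc(atts.getValue(i), capturedImages));
                atts = copy;
            }
        }
        super.startElement(uri, localName, qName, atts);
    }

    public static void main(String[] args) throws SAXException {
        Map<String, String> images = new HashMap<>();
        images.put("image1.png", "data:image/png;base64,AAAA"); // placeholder data

        // Stand-in for the real output handler: prints each start tag's src.
        ContentHandler printer = new DefaultHandler() {
            @Override
            public void startElement(String u, String l, String q, Attributes a) {
                System.out.println(q + " src=" + a.getValue("src"));
            }
        };

        ImgRewritingHandler handler = new ImgRewritingHandler(printer, images);
        AttributesImpl atts = new AttributesImpl();
        atts.addAttribute("", "src", "src", "CDATA", "embedded:image1.png");
        handler.startElement("http://www.w3.org/1999/xhtml", "img", "img", atts);
        handler.endElement("http://www.w3.org/1999/xhtml", "img", "img");
    }
}
```

Because the handler only looks at the img element and its src attribute, it does not care which parser produced the events, which is the point: it sits in front of your real output handler and works the same for Word, PDF or anything else that reports embedded images the same way.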

You shouldn't be changing anything in the Word parser itself. You want to be writing something that applies equally to all parsers.

(You might find that one parser is non-standard in how it reports embedded images, in which case you'll need to fix it to follow the others, but ideally you shouldn't be touching the built-in parsers beyond that.)

Nick
