Re: Patch: self-contained HTML using Data URI

Nick Burch Tue, 24 Jun 2014 09:47:33 -0700

On Tue, 24 Jun 2014, Andrew Skiba wrote:

I started with org.apache.tika.parser.microsoft.WordExtractor and immediately saw that it already makes a recursive call to the org.apache.tika.parser.image.ImageParser. But ImageParser currently only enriches metadata, and does not create the <img> element itself. This is done in the WordExtractor and the respective handlers for types other than MS Word.

I would've thought it would only trigger ImageParser if you set the AutoDetectParser on the parse context, did you?

My idea was that you'd have a content handler + recursing parser class pair: the handler would re-write the img tag when it came through, and the recursing parser would capture the image when it is triggered, to get the image data needed for the re-write. (This is largely what the Alfresco class does that I suggested you look at)
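
A minimal sketch of the handler half of that pair, using only the JDK's SAX classes so it runs without Tika on the classpath. The class name DataUriImgHandler, the addImage method and the "embedded:" src prefix are assumptions made for illustration, not confirmed Tika API. The sketch also assumes the recursing parser has already registered an image before its img tag comes through; if the order is the other way round, the handler would need to buffer.

```java
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical handler: rewrites img src="embedded:NAME" into a data: URI. */
class DataUriImgHandler extends XMLFilterImpl {
    // Assumed convention for how a parser refers to an embedded image.
    private static final String PREFIX = "embedded:";

    // Image name -> ready-made data URI, filled in by the recursing parser side.
    private final Map<String, String> dataUris = new HashMap<>();

    DataUriImgHandler(ContentHandler downstream) {
        setContentHandler(downstream);
    }

    /** Called by the recursing parser once it has captured an image. */
    void addImage(String name, String mimeType, byte[] data) {
        String base64 = Base64.getEncoder().encodeToString(data);
        dataUris.put(name, "data:" + mimeType + ";base64," + base64);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        boolean isImg = "img".equals(localName) || "img".equals(qName);
        int srcIndex = atts.getIndex("src");
        if (isImg && srcIndex >= 0 && atts.getValue(srcIndex).startsWith(PREFIX)) {
            String name = atts.getValue(srcIndex).substring(PREFIX.length());
            String dataUri = dataUris.get(name);
            if (dataUri != null) {
                // Copy the attributes and swap only the src value.
                AttributesImpl rewritten = new AttributesImpl(atts);
                rewritten.setValue(srcIndex, dataUri);
                atts = rewritten;
            }
        }
        // Everything else passes through to the downstream handler unchanged.
        super.startElement(uri, localName, qName, atts);
    }

    public static void main(String[] args) throws SAXException {
        // Downstream handler that just prints the src it receives.
        DefaultHandler printer = new DefaultHandler() {
            @Override
            public void startElement(String u, String l, String q, Attributes a) {
                System.out.println(q + " src=" + a.getValue("src"));
            }
        };
        DataUriImgHandler handler = new DataUriImgHandler(printer);
        handler.addImage("image1.png", "image/png", new byte[] {1, 2, 3});

        AttributesImpl atts = new AttributesImpl();
        atts.addAttribute("", "src", "src", "CDATA", "embedded:image1.png");
        handler.startElement("", "img", "img", atts);
    }
}
```

Because the rewrite lives in a wrapping handler and not in any one parser, the same class would apply to whatever parser emits the img tag, provided they all report embedded images the same way.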

You shouldn't be changing anything in the Word Parser itself; you want to be writing something that applies equally to all parsers.

(It might be that you find that one parser is being non-standard about how it reports embedded images, in which case you'll need to fix that to follow the others, but ideally you shouldn't be touching the built-in parsers beyond that)


Nick
