Hello again. I created an issue, https://issues.apache.org/jira/browse/TIKA-1344, for this patch and was advised to implement it in a content handler. So I read up on the idea behind RecursiveMetadata and started looking at how to move my change into a handler, as Nick advised.
I started with org.apache.tika.parser.microsoft.WordExtractor and immediately saw that it already makes a recursive call to org.apache.tika.parser.image.ImageParser. But ImageParser currently only enriches the metadata and does not create the <img> element itself. That is done in WordExtractor, and in the respective handlers for types other than MS Word.

So my question is: do I have to move the creation of <img> into ImageParser and remove it from WordExtractor?

Thank you.
Andrew.

On Wed, Jun 18, 2014 at 5:16 PM, Andrew Skiba <[email protected]> wrote:
> Hi,
>
> In the current code, the images from Word documents are referenced by
> "embedded:xxx" links in the generated HTML. This causes the browsers
> display "x" icon instead of the image.
>
> The proposed patch encodes the images using Data URI, if there is
> -Dtika.parsers.urlimages system property.
>
> http://en.wikipedia.org/wiki/Data_URI_scheme
>
> So the default behavior is the same, but users of the library can
> optionally generate self-contained HTML with correct images.
>
> Thank you,
>
> Andrew.
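
For reference, here is a minimal sketch of the Data URI encoding the quoted message refers to. It is not the code from the patch: the class name DataUriSketch and the method toDataUri are hypothetical, and it uses only java.util.Base64 from the JDK (Java 8 or later), no Tika classes. Whichever component ends up emitting the <img> element, the value of its src attribute could be built roughly like this instead of an "embedded:xxx" link.

```java
import java.util.Base64;

// Hypothetical helper, not part of Tika: builds a Data URI from raw image bytes.
public class DataUriSketch {

    // mimeType is the image content type, e.g. "image/png";
    // data is the raw content of the embedded image.
    public static String toDataUri(String mimeType, byte[] data) {
        // Standard (not URL-safe) Base64, on one line, as used by the data: scheme.
        String encoded = Base64.getEncoder().encodeToString(data);
        return "data:" + mimeType + ";base64," + encoded;
    }

    public static void main(String[] args) {
        // Three dummy bytes standing in for real image data.
        byte[] fakeImage = {1, 2, 3};
        System.out.println(toDataUri("image/png", fakeImage));
        // prints "data:image/png;base64,AQID"
    }
}
```

The resulting string can be large for big images, which is one reason to keep it optional behind a switch such as the system property mentioned above.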
