On Tue, 31 May 2011, sgraessle wrote:
1. Can anyone point me in the direction of where I should look within
Tika to modify/create code to not only extract the metadata for an image
but also extract its relative position in a document (for example,
between word A and word B) and then save this information?
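You'll need to look at the HTML version of the parent file that Tika
produces, and watch for the img tags in it. Where each img tag falls
relative to the surrounding text gives you the image's position.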
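
As a rough sketch of what "watching the img tags" could look like: the
handler below records, for each img element, the word seen just before it
and the word seen just after it. It uses only the JDK's SAX classes. The
class name ImagePositionHandler, the ImageRef record and the scan() helper
are made up for illustration; with Tika you would instead pass an instance
as the ContentHandler argument of Parser.parse(stream, handler, metadata,
context). It assumes the parent document's parser actually emits img
elements for its embedded images, which not every format does.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: records where each img tag falls in the text stream.
class ImagePositionHandler extends DefaultHandler {

    /** One entry per img tag. */
    static final class ImageRef {
        final String src;
        final String wordBefore; // null if the image precedes all text
        String wordAfter;        // null if the image follows all text
        ImageRef(String src, String wordBefore) {
            this.src = src;
            this.wordBefore = wordBefore;
        }
    }

    private final List<ImageRef> images = new ArrayList<>();
    private final List<ImageRef> waiting = new ArrayList<>(); // images with no following word yet
    private final StringBuilder word = new StringBuilder();   // word currently being read
    private String lastWord;

    List<ImageRef> getImages() {
        return images;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // SAX may split text arbitrarily, so words are buffered across calls.
        for (int i = start; i < start + length; i++) {
            if (Character.isWhitespace(ch[i])) {
                flushWord();
            } else {
                word.append(ch[i]);
            }
        }
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        flushWord(); // simplification: any tag boundary ends the current word
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        if ("img".equalsIgnoreCase(name)) {
            ImageRef ref = new ImageRef(atts.getValue("src"), lastWord);
            images.add(ref);
            waiting.add(ref);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        flushWord();
    }

    private void flushWord() {
        if (word.length() == 0) {
            return;
        }
        lastWord = word.toString();
        word.setLength(0);
        for (ImageRef ref : waiting) {
            ref.wordAfter = lastWord;
        }
        waiting.clear();
    }

    /** Stand-alone demo entry point: parses an XHTML string with the JDK SAX parser. */
    static List<ImageRef> scan(String xhtml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        ImagePositionHandler handler = new ImagePositionHandler();
        factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getImages();
    }
}
```

Word-level neighbours are just one way to express "position"; a character
offset into the extracted text could be tracked in the same characters()
method if that suits you better.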
2. I need to be able to extract the images within the parsed documents
and save them as well. Would the best place to do this be to create my
own ImageParser and add a few lines in the parse method?
You'll want your own parser, registered for the image types, and then add
that to the ParseContext you pass in when parsing the parent document, so
that it gets called for the embedded images.
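
The saving part of such a parser can be quite small. Below is a sketch of
just that piece, using only the JDK: a helper that copies an image stream
into an output directory under a sanitised, unique file name. The class
EmbeddedImageSaver and its save() method are hypothetical names, not part
of Tika. The idea would be to call save() from your parser's
parse(InputStream, ContentHandler, Metadata, ParseContext) method, passing
the stream you are given plus whatever resource name the Metadata carries,
and to register the parser with context.set(Parser.class, yourParser).
This assumes your Tika version hands embedded resources to the Parser it
finds in the ParseContext.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: writes embedded image streams out to a directory.
class EmbeddedImageSaver {
    private final Path outputDir;
    private int counter = 0; // keeps file names unique within one run

    EmbeddedImageSaver(Path outputDir) {
        this.outputDir = outputDir;
    }

    /**
     * Copies the stream to a new file in the output directory and returns
     * its path. resourceName may be null if the container gave no name.
     */
    Path save(InputStream stream, String resourceName) throws IOException {
        Files.createDirectories(outputDir);

        // Keep only the last path segment, so a name like "../x.png"
        // cannot escape the output directory.
        String name = resourceName == null ? "" : resourceName;
        name = name.substring(Math.max(name.lastIndexOf('/'), name.lastIndexOf('\\')) + 1);
        name = name.replaceAll("[^A-Za-z0-9._-]", "_");
        if (name.isEmpty()) {
            name = "image";
        }

        Path target = outputDir.resolve((counter++) + "-" + name);
        Files.copy(stream, target, StandardCopyOption.REPLACE_EXISTING);
        return target;
    }
}
```

The returned path is what you would later write into the matching img tag
of the parent document's HTML.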
You may find this class from Alfresco worth a look:
http://svn.alfresco.com/repos/alfresco-open-mirror/alfresco/HEAD/root/projects/repository/source/java/org/alfresco/repo/rendition/executer/HTMLRenderingEngine.java
It handles saving the embedded images out, and tweaking the <img> tags to
point at them.
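
For the tag-tweaking side, here is a minimal sketch of one way to do it.
This is not the Alfresco code, just an illustration using the JDK's
XMLFilterImpl as a ContentHandler decorator. It rewrites the src attribute
of img elements before passing the event on to the real handler. The class
name ImgSrcRewriter, the imageBaseUrl parameter and the "embedded:" prefix
on src values are all assumptions; check what your documents' img tags
actually contain before relying on the prefix.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical decorator: points img src attributes at the saved copies.
class ImgSrcRewriter extends XMLFilterImpl {
    // Assumed prefix marking a reference to an embedded resource.
    static final String EMBEDDED_PREFIX = "embedded:";

    private final String imageBaseUrl;

    ImgSrcRewriter(ContentHandler downstream, String imageBaseUrl) {
        setContentHandler(downstream); // all events are forwarded here
        this.imageBaseUrl = imageBaseUrl;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if ("img".equals(localName) || "img".equals(qName)) {
            int i = atts.getIndex("src");
            String src = i >= 0 ? atts.getValue(i) : null;
            if (src != null && src.startsWith(EMBEDDED_PREFIX)) {
                // Attributes may be read-only, so modify a copy.
                AttributesImpl copy = new AttributesImpl(atts);
                copy.setValue(i, imageBaseUrl + src.substring(EMBEDDED_PREFIX.length()));
                atts = copy;
            }
        }
        super.startElement(uri, localName, qName, atts);
    }
}
```

You would wrap your output handler in this and pass the wrapper to the
parser, so the HTML that comes out already refers to the saved files.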
Nick