On Tue, 31 May 2011, sgraessle wrote:
1. Can anyone point me to where I should look within Tika to modify or create code that not only extracts the metadata for an image but also extracts its relative position in a document (for example, between word A and word B), and then saves this information?

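You'll need to look at the HTML version of the parent file (the XHTML that Tika produces when it parses the document), and watch for the img tags. Where each img tag falls relative to the surrounding text tells you the image's position.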
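
As a rough sketch of that idea (not anything that ships with Tika): Tika parsers write their XHTML as SAX events to whatever ContentHandler you pass to parse(), so a handler along these lines could note where each img tag turns up in the text. The class name ImagePositionHandler, the ImageRef holder and the scan() helper are all made up for illustration, and scan() only feeds a hand-written XHTML string through the JDK's SAX parser so the example runs without Tika on the classpath. Whether a given Tika parser emits an img tag for an embedded image, and what its src looks like, depends on the parser and version.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: records each img tag and its offset in the text. */
class ImagePositionHandler extends DefaultHandler {

    /** One image reference; offset is the number of text characters seen before it. */
    static final class ImageRef {
        final String src;
        final int offset;
        ImageRef(String src, int offset) {
            this.src = src;
            this.offset = offset;
        }
    }

    private final StringBuilder text = new StringBuilder();
    private final List<ImageRef> images = new ArrayList<>();

    @Override
    public void characters(char[] ch, int start, int length) {
        // SAX may deliver text in arbitrary chunks, so just accumulate it all
        text.append(ch, start, length);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Namespace-aware sources fill in localName; others only give qName
        String name = localName.isEmpty() ? qName : localName;
        if ("img".equalsIgnoreCase(name)) {
            String src = atts.getValue("src");
            images.add(new ImageRef(src == null ? "" : src, text.length()));
        }
    }

    List<ImageRef> getImages() {
        return images;
    }

    /** Last whitespace-delimited word before the image, or "" if there is none. */
    String wordBefore(ImageRef ref) {
        String[] words = text.substring(0, ref.offset).trim().split("\\s+");
        return words[words.length - 1];
    }

    /** First whitespace-delimited word after the image, or "" if there is none. */
    String wordAfter(ImageRef ref) {
        return text.substring(ref.offset).trim().split("\\s+")[0];
    }

    /** Demo helper (an assumption for this sketch): parse an XHTML string with the JDK. */
    static ImagePositionHandler scan(String xhtml) {
        ImagePositionHandler handler = new ImagePositionHandler();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the sample XHTML", e);
        }
        return handler;
    }
}
```

With Tika you would skip scan() and hand an instance of the handler to the parser's parse() call in place of your usual content handler, then read getImages() once parsing has finished. Recording a character offset, rather than trying to remember the "last word" as text arrives, avoids trouble when SAX splits a word across two characters() calls.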

2. I need to be able to extract the images within the parsed documents and save them as well. Would the best place to do this be to create my own ImageParser and add a few lines in the Parse method?

You'll want your own parser, registered for the image types, and then add that to the parse context so that it gets used for the embedded images.

You may find this class from Alfresco worth a look:
   
http://svn.alfresco.com/repos/alfresco-open-mirror/alfresco/HEAD/root/projects/repository/source/java/org/alfresco/repo/rendition/executer/HTMLRenderingEngine.java
It handles saving the embedded images out, and tweaking the <img> tags to point at them.

Nick
