Re: Image Extraction

Nick Burch Tue, 31 May 2011 09:14:48 -0700

On Tue, 31 May 2011, sgraessle wrote:

1. Can anyone point me in the direction of where I should look within Tika to modify/create code to not only extract the metadata for an image but also extract its relative position in a document (for example: between word A and word B) and then save this information?

You'll need to look at the HTML (XHTML) output for the parent file, and watch where the img tags appear in it.
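
As a rough sketch of what "watching the img tags" could look like: Tika's parsers write their XHTML output as SAX events to a ContentHandler, so a handler can count the text it has seen so far and note the offset at which each img element turns up. The class name ImgPositionHandler, the ImgRef holder and the scan helper below are made up for illustration, and the "embedded:image1.png" src value in the usage note is only an assumed example. The helper feeds the handler from the JDK's own SAX parser so the sketch runs without Tika; with Tika you would pass the handler to the parser's parse call instead.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: records where each img element occurs in the text flow. */
class ImgPositionHandler extends DefaultHandler {

    /** One image reference: its src attribute and the text offset where it appeared. */
    static final class ImgRef {
        final String src;
        final int textOffset;

        ImgRef(String src, int textOffset) {
            this.src = src;
            this.textOffset = textOffset;
        }
    }

    private final StringBuilder text = new StringBuilder();
    private final List<ImgRef> images = new ArrayList<>();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Accumulate all character data so offsets are relative to the plain text.
        text.append(ch, start, length);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // localName is empty when the event source is not namespace-aware.
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        if ("img".equalsIgnoreCase(name)) {
            // The image sits after everything collected in 'text' so far.
            images.add(new ImgRef(atts.getValue("src"), text.length()));
        }
    }

    String getText() {
        return text.toString();
    }

    List<ImgRef> getImages() {
        return images;
    }

    /** Demo helper (not part of Tika): runs the handler over an XHTML string. */
    static ImgPositionHandler scan(String xhtml) {
        try {
            ImgPositionHandler handler = new ImgPositionHandler();
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), handler);
            return handler;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XHTML", e);
        }
    }
}
```

For example, scanning `<p>word A <img src="embedded:image1.png"/> word B</p>` records one image at text offset 7, i.e. right after "word A ". Saving the src together with that offset (or with the surrounding words) gives you the relative position.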

2. I need to be able to extract the images within the parsed documents and save them as well. Would the best place to do this be to create my own ImageParser and add a few lines in the Parse method?

You'll want your own parser, registered for the image types, and then add that to the parse context.


You may find this class from Alfresco worth a look:
   
http://svn.alfresco.com/repos/alfresco-open-mirror/alfresco/HEAD/root/projects/repository/source/java/org/alfresco/repo/rendition/executer/HTMLRenderingEngine.java

It handles saving embedded images out, and tweaking the <img> tags for them.


Nick
