Hello!

01.09.2010 13:54, Nick Burch wrote:
I've been thinking about extracting files from container formats (e.g. images in a .docx, PDFs in a zip file, etc.). Given the number of recent queries about embedded files and Tika, I was wondering if people thought this might be something worth adding as another part of Tika?

My idea is that you'd pass a container file to this "service". You'd also say whether you wanted recursion, and which MIME types interest you. The result would be, say, an iterator of input streams, which would probably also let you get the filenames and MIME types where the container supports them.
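
To make the proposed interface concrete, here is a minimal sketch in Java. It is only an illustration under stated assumptions, not an existing Tika API: the names `ContainerExtractorSketch`, `EmbeddedFile`, and `extract` are hypothetical, it handles only zip containers via the JDK's `ZipInputStream`, it guesses MIME types from file names with `URLConnection.guessContentTypeFromName`, and it buffers each entry in memory and returns a list rather than a lazy iterator of streams.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical sketch of the proposed service; none of these names come from Tika. */
public class ContainerExtractorSketch {

    /** One embedded file: its name, a guessed MIME type (may be null), and its bytes. */
    public static final class EmbeddedFile {
        public final String name;
        public final String mimeType;
        private final byte[] data;

        EmbeddedFile(String name, String mimeType, byte[] data) {
            this.name = name;
            this.mimeType = mimeType;
            this.data = data;
        }

        /** Returns a fresh stream over the buffered content. */
        public InputStream openStream() {
            return new ByteArrayInputStream(data);
        }
    }

    /**
     * Lists the files inside a zip container.
     *
     * @param recurse     whether to descend into nested .zip entries
     * @param wantedTypes MIME types to keep, or null to keep everything
     */
    public static List<EmbeddedFile> extract(InputStream container, boolean recurse,
                                             Set<String> wantedTypes) throws IOException {
        List<EmbeddedFile> found = new ArrayList<>();
        ZipInputStream zip = new ZipInputStream(container);
        ZipEntry entry;
        while ((entry = zip.getNextEntry()) != null) {
            if (entry.isDirectory()) {
                continue;
            }
            String name = entry.getName();
            byte[] data = readFully(zip);
            // Assumption: guess the type from the file name only.
            // A real detector would also inspect the content.
            String type = URLConnection.guessContentTypeFromName(name);
            if (wantedTypes == null || (type != null && wantedTypes.contains(type))) {
                found.add(new EmbeddedFile(name, type, data));
            }
            // Nested containers are recognised by extension here, for simplicity.
            if (recurse && name.toLowerCase(Locale.ROOT).endsWith(".zip")) {
                found.addAll(extract(new ByteArrayInputStream(data), true, wantedTypes));
            }
        }
        return found;
    }

    /** Reads the current zip entry to its end. */
    private static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }
}
```

A caller would pass the container stream, the recursion flag, and the set of wanted MIME types, then loop over the returned list and call `openStream()` on each item. Other container formats (for example OLE2 or OOXML documents) would need their own readers behind the same kind of interface.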

I think it is a good idea. I already have a POI-based part extractor for Office file formats. I can contribute some code once the API is defined.

Best wishes, Max
