Hello!
01.09.2010 13:54, Nick Burch wrote:
I've been thinking about extracting files from container formats (e.g. images in a
.docx, PDFs in a zip file, etc.). Given the number of recent queries about embedded
files and Tika, I was wondering if people thought this might be something
worth adding as another part of Tika?
My idea is that you'd pass a container file to this "service". You'd also say whether
you wanted recursion, and which mime types interest you. The result would be, say,
an iterator of input streams, which would probably also let you get the filenames
and mime types where supported by the container.
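To make the proposal above concrete, here is a rough sketch of what such a service could look like for zip containers only. Every name in it (ContainerExtractorSketch, EmbeddedFile, extract, guessMimeType) is hypothetical and not part of Tika. It guesses mime types from a tiny file-extension table instead of doing real detection, and it buffers each entry in memory to keep the example short.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical sketch only: none of these names exist in Tika.
final class ContainerExtractorSketch {

    /** One embedded file: its name, a guessed mime type, and its content. */
    static final class EmbeddedFile {
        final String name;
        final String mimeType;
        private final byte[] data;

        EmbeddedFile(String name, String mimeType, byte[] data) {
            this.name = name;
            this.mimeType = mimeType;
            this.data = data;
        }

        /** Returns a fresh stream over the buffered content. */
        InputStream openStream() {
            return new ByteArrayInputStream(data);
        }
    }

    // Assumption: a tiny extension table stands in for real mime detection.
    private static final Map<String, String> MIME_BY_EXTENSION = Map.of(
            "pdf", "application/pdf",
            "png", "image/png",
            "txt", "text/plain",
            "zip", "application/zip");

    static String guessMimeType(String name) {
        int dot = name.lastIndexOf('.');
        String ext = dot < 0 ? "" : name.substring(dot + 1).toLowerCase(Locale.ROOT);
        return MIME_BY_EXTENSION.getOrDefault(ext, "application/octet-stream");
    }

    /**
     * Lists the files embedded in a zip container.
     *
     * @param recurse         whether to descend into nested zip files
     * @param wantedMimeTypes mime types to return; an empty set means "all"
     */
    static List<EmbeddedFile> extract(InputStream container, boolean recurse,
                                      Set<String> wantedMimeTypes) throws IOException {
        List<EmbeddedFile> found = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(container)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                byte[] data = zip.readAllBytes(); // reads to the end of this entry only
                String mime = guessMimeType(entry.getName());
                if (wantedMimeTypes.isEmpty() || wantedMimeTypes.contains(mime)) {
                    found.add(new EmbeddedFile(entry.getName(), mime, data));
                }
                // Nested containers are searched even when they are filtered out.
                if (recurse && "application/zip".equals(mime)) {
                    found.addAll(extract(new ByteArrayInputStream(data), true, wantedMimeTypes));
                }
            }
        }
        return found;
    }
}
```

A caller would write something like extract(stream, true, Set.of("application/pdf")) and then walk the returned list with its iterator(), reading each file through openStream(). A real implementation would more likely stream entries lazily and dispatch on the container type, which this sketch does not attempt.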
I think it is a good idea. I already have a POI-based part extractor for office file
formats. I can contribute some code once the API is settled.
Best wishes, Max