On 2010-09-01 11:54, Nick Burch wrote:
Hi All
I've been thinking about extracting files from container formats (e.g.
images in a .docx, PDFs in a zip file, etc.). Given the number of recent
queries about embedded files and Tika, I was wondering if people
thought this might be something worth adding as another part of Tika?
This would be very useful. We contemplated implementing something like
this in Nutch, to handle archives (jar/tar/zip/...), but having it in
Tika would be much better.
Example uses would be:
* .doc file, non-recursive, request image/png and image/jpeg
gives you all the images in the Word document
* .ppt file, recursive, request excel
gives you Excel files embedded in the PowerPoint, and Excel files embedded
in the Word documents embedded in the PowerPoint
* .docx file, non-recursive, request image/png
treated as an OOXML file, not a plain zip file, and all PNG images
from the magic embedded directory are returned.
* .zip file, recursive, request pdf
gives you all pdf files anywhere in the zip
Does recursive here mean that it would look into embedded zip files too?
Or that it would process all paths (since there is really no hierarchy
in zip files)?
* .ogg file, non-recursive, request audio
gives you the 3 different audio streams in your video file
You could pass the resultant input streams into the regular tika parser
if you wanted to process them, or even just save them into a directory
if all you wanted was an extractor.
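
To make the .zip case and the question about "recursive" concrete, here is a minimal sketch using only java.util.zip. It is not a proposed Tika API: the class name ZipExtractSketch, the Extracted type, and the extract() and zipOf() methods are all made up for illustration, and matching is done on a file-name suffix rather than a detected media type. It assumes one possible reading of the flag: non-recursive returns matches at any path inside the archive, while recursive additionally descends into nested .zip entries.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipExtractSketch {

    /** Hypothetical result type: where the entry was found, plus its bytes. */
    public static final class Extracted {
        public final String path;
        public final byte[] data;

        Extracted(String path, byte[] data) {
            this.path = path;
            this.data = data;
        }
    }

    /** Returns every entry whose name ends with the given suffix (e.g. ".pdf"). */
    public static List<Extracted> extract(InputStream in, String suffix, boolean recursive)
            throws IOException {
        List<Extracted> out = new ArrayList<>();
        walk(in, "", suffix.toLowerCase(Locale.ROOT), recursive, out);
        return out;
    }

    private static void walk(InputStream in, String prefix, String suffix,
                             boolean recursive, List<Extracted> out) throws IOException {
        // The stream is not closed here; the caller owns the top-level stream.
        ZipInputStream zip = new ZipInputStream(in);
        ZipEntry entry;
        while ((entry = zip.getNextEntry()) != null) {
            if (entry.isDirectory()) {
                continue;
            }
            String path = prefix + entry.getName();
            String lower = path.toLowerCase(Locale.ROOT);
            byte[] data = readEntry(zip);
            if (lower.endsWith(suffix)) {
                out.add(new Extracted(path, data));
            }
            // "Recursive" in this sketch means: also look inside nested zip files.
            if (recursive && lower.endsWith(".zip")) {
                walk(new ByteArrayInputStream(data), path + "!/", suffix, true, out);
            }
        }
    }

    /** Reads the current entry fully; read() returns -1 at the end of the entry. */
    private static byte[] readEntry(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buf.write(chunk, 0, n);
        }
        return buf.toByteArray();
    }

    /** Helper for trying the sketch out: builds an in-memory zip from names and contents. */
    public static byte[] zipOf(String[] names, byte[][] contents) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(buf)) {
            for (int i = 0; i < names.length; i++) {
                zos.putNextEntry(new ZipEntry(names[i]));
                zos.write(contents[i]);
                zos.closeEntry();
            }
        }
        return buf.toByteArray();
    }
}
```

Each Extracted.data could then be wrapped in a ByteArrayInputStream and handed to a parser, or simply written to a directory. Buffering whole entries in memory keeps the sketch short; a real implementation would likely stream instead.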
What do people think? Is this useful? Is this appropriate for Tika? If
yes to these two, does the rough method signature sound sane?
+1.
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com