Hi Nick,
Potentially interesting thread from about a year ago:
http://lucene.472066.n3.nabble.com/Multiple-documents-per-input-stream-td647159.html#a647159
I ran into a similar issue when trying to figure out how best to
handle mbox files.
-- Ken
On Sep 1, 2010, at 2:54am, Nick Burch wrote:
Hi All
I've been thinking about extracting files from container formats (e.g.
images in a .docx, PDFs in a zip file, etc.). Given the number of recent
queries about embedded files and Tika, I was wondering if people
thought this might be something worth adding as another part of Tika?
My idea is that you'd pass a container file to this "service". You'd
also say whether you wanted recursion, and which mime types interest
you. The result would be, say, an iterator of input streams, which
would probably also let you get the filenames and mime types where
supported by the container.
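
A minimal sketch of how the method signature described above might look
in Java. Everything here is hypothetical: the names EmbeddedResource and
ContainerExtractor are made up for illustration and are not an existing
Tika API.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Iterator;
import java.util.Set;

/** Hypothetical: one embedded resource found inside a container file. */
interface EmbeddedResource {
    /** Filename inside the container, or null if the format has none. */
    String getName();

    /** Mime type of the resource, or null if it could not be determined. */
    String getMimeType();

    /** Opens a stream over the resource's bytes. */
    InputStream openStream() throws IOException;
}

/** Hypothetical: the extraction "service" itself. */
interface ContainerExtractor {
    /**
     * @param container the container file (e.g. a .docx or .zip)
     * @param recursive whether to descend into embedded containers
     * @param mimeTypes the mime types the caller is interested in
     * @return the matching embedded resources
     */
    Iterator<EmbeddedResource> extract(InputStream container,
                                       boolean recursive,
                                       Set<String> mimeTypes) throws IOException;
}
```

Returning a small resource object rather than a bare InputStream is one
way to carry the optional filename and mime type along with each stream.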
Example uses would be:
* .doc file, non-recursive, request image/png and image/jpg
  gives you all the images in the Word document
* .ppt file, recursive, request excel
  gives you Excel files embedded in the PowerPoint, and Excel files
  embedded in the Word documents embedded in the PowerPoint
* .docx file, non-recursive, request image/png
  treated as an OOXML file, not a plain zip file, and all png images
  from the magic embedded directory are returned
* .zip file, recursive, request pdf
  gives you all pdf files anywhere in the zip
* .ogg file, non-recursive, request audio
  gives you the 3 different audio streams in your video file
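
As an illustration of the ".zip file, recursive, request pdf" case, here
is a rough sketch using only the JDK's java.util.zip classes. It is not
Tika code: the class name ZipPdfExtractor is made up, matching is done
on the filename suffix rather than on a detected mime type, and nested
archives are recognised purely by a ".zip" suffix. All of those are
simplifying assumptions.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical helper: pulls entries with a given suffix out of a zip. */
class ZipPdfExtractor {

    /**
     * Returns entry name -> bytes for every entry whose name ends with
     * the suffix. If recursive is true, nested .zip entries are searched
     * too and their entries are reported as "outer.zip!/inner-name".
     */
    static Map<String, byte[]> extract(InputStream container, String suffix,
                                       boolean recursive) throws IOException {
        Map<String, byte[]> found = new LinkedHashMap<>();
        collect(container, "", suffix.toLowerCase(Locale.ROOT), recursive, found);
        return found;
    }

    private static void collect(InputStream in, String prefix, String suffix,
                                boolean recursive, Map<String, byte[]> found)
            throws IOException {
        // The caller owns the underlying stream, so it is not closed here.
        ZipInputStream zip = new ZipInputStream(in);
        ZipEntry entry;
        while ((entry = zip.getNextEntry()) != null) {
            if (entry.isDirectory()) {
                continue;
            }
            String name = prefix + entry.getName();
            String lower = name.toLowerCase(Locale.ROOT);
            // Reading the zip stream stops at the end of the current entry.
            byte[] data = readAll(zip);
            if (lower.endsWith(suffix)) {
                found.put(name, data);
            } else if (recursive && lower.endsWith(".zip")) {
                collect(new ByteArrayInputStream(data), name + "!/",
                        suffix, recursive, found);
            }
        }
    }

    private static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }
}
```

Calling ZipPdfExtractor.extract(stream, ".pdf", true) would return the
matching entries as byte arrays. Buffering each entry in memory keeps
the sketch short; a real implementation would probably stream instead.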
You could pass the resulting input streams into the regular Tika
parser if you wanted to process them, or just save them into a
directory if all you wanted was an extractor.
What do people think? Is this useful? Is this appropriate for Tika?
If yes to these two, does the rough method signature sound sane?
Nick
PS I'm willing to do most of the coding on this if it's deemed
suitable for Tika, but probably not for a few weeks, until Alfresco
3.4 is done.
--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g