On Wed, 1 Sep 2010, Nick Burch wrote:
I've been thinking about extracting files from container formats (e.g.
images in a .docx, PDFs in a zip file, etc.).
I've been pondering the various feedback over the weekend, and hopefully
now have a more detailed idea.
Firstly, the new service needs to work both for people who have the
container file locally and for those streaming it remotely. Some container
parsers may work better with input streams, some with files, so making the
input contract a TikaInputStream would seem to be the right way around
this?
Next, how to control which child elements are returned. The container will
usually know the embedded file name, but not always, and will often know
its path (e.g. /foo/bar.txt in a zip file). It may sometimes know the mime
type. This seems to me too difficult to represent easily as a wish-list
filter. So, I now think that probably the only way to make it work is to
offer all the details of every file to the consumer, and let them decide
if they're interested or not. Ideally, the amount of work done by the
container parser before the consumer decides they want a file and asks for
its contents will be minimised. (A filter wrapper can always be put around
it as required.)
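As a rough sketch of the "offer everything, open lazily, filter on top"
idea: the class and method names below (EmbeddedEntry, EmbeddedFilter) are
made up for illustration and are not existing Tika classes. It also uses
java.util.function, so it assumes a modern JDK. The point is only that the
metadata is cheap to hand over, and that the stream is not opened until the
consumer asks for it.

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.function.Supplier;

// Hypothetical descriptor: metadata is available up front, while the
// stream is only opened when the consumer actually asks for it.
class EmbeddedEntry {
    final String filename;   // may be null if the container does not know it
    final String mimeType;   // may be null if the container does not know it
    private final Supplier<InputStream> opener;

    EmbeddedEntry(String filename, String mimeType, Supplier<InputStream> opener) {
        this.filename = filename;
        this.mimeType = mimeType;
        this.opener = opener;
    }

    // Opens the content; this is where the expensive work would happen.
    InputStream getInputStream() {
        return opener.get();
    }
}

// Hypothetical filter wrapper layered on top of "give me everything".
class EmbeddedFilter {
    // Keeps the entries the predicate accepts. No stream is opened here.
    static List<EmbeddedEntry> filter(Iterable<EmbeddedEntry> all,
                                      Predicate<EmbeddedEntry> wanted) {
        List<EmbeddedEntry> kept = new ArrayList<>();
        for (EmbeddedEntry entry : all) {
            if (wanted.test(entry)) {
                kept.add(entry);
            }
        }
        return kept;
    }
}
```

A consumer that only wants PDFs would then pass something like
e -> "application/pdf".equals(e.mimeType) as the predicate.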
Nested embedded files: do we have a boolean flag for descend / don't
descend, or do we pass that choice back to the consumer on a per-file
basis, similar to above? I worry that the latter would make things too
complicated and heavy-weight, so I'm leaning towards the simple boolean
flag.
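To make the boolean-flag option concrete, here is a minimal sketch using
only java.util.zip. ZipWalker and its list method are hypothetical names,
and the ".zip" suffix check is just a stand-in for real type detection. It
lists the entry paths of a zip stream and, only when the flag is set, opens
nested zips and lists their children too.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper showing a single descend / don't descend flag.
class ZipWalker {
    // Returns the paths of all file entries; nested zips are only opened
    // when descend is true.
    static List<String> list(InputStream in, boolean descend) throws IOException {
        List<String> names = new ArrayList<>();
        walk(new ZipInputStream(in), "", descend, names);
        return names;
    }

    private static void walk(ZipInputStream zip, String prefix, boolean descend,
                             List<String> names) throws IOException {
        ZipEntry entry;
        while ((entry = zip.getNextEntry()) != null) {
            if (entry.isDirectory()) {
                continue;
            }
            String path = prefix + "/" + entry.getName();
            names.add(path);
            // Assumption: a ".zip" suffix marks a nested container.
            if (descend && path.endsWith(".zip")) {
                // Reading from the outer stream yields the current entry's
                // bytes. The inner stream is deliberately not closed, as
                // that would close the outer one too.
                walk(new ZipInputStream(zip), path + "!", descend, names);
            }
        }
    }
}
```

With the flag off, a nested zip just shows up as one more entry; with it
on, its children appear under a "parent.zip!/child" style path.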
Finally, pull vs push for the consumer. The two forms would probably look
something like:
====
// Pull style: the consumer walks the entries and picks what it wants
Iterable<Embeded> embeded = containerExtractor.extract(inp, false);
for (Embeded details : embeded) {
    if ("application/pdf".equals(details.getMimeType()) ||
            "pdf".equals(details.getSuffix())) {
        handlePDF(details.getInputStream());
    }
    if ("/README.txt".equals(details.getFilename())) {
        handleREADME(details.getInputStream());
    }
}
====
// Push style: the extractor calls back for every embedded file it finds
containerExtractor.extract(inp, false, new EmbededHandler() {
    public void handle(String filename, String mimetype,
                       InputStreamSource futureInputStream) {
        if ("application/pdf".equals(mimetype) ||
                (filename != null && filename.endsWith("pdf"))) {
            handlePDF(futureInputStream.getInputStream());
        }
    }
});
====
I think the former would be a little bit more work for us, but is likely
to lead to cleaner and simpler code for consumers. What do people think?
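
One thing in favour of the pull form is that a push-style API can be
layered on top of it in a few lines, as the sketch below tries to show.
Item, ItemHandler and PullToPush are hypothetical names used only for
illustration, and the lambda in the usage note assumes a modern JDK.

```java
import java.util.Iterator;

// Hypothetical pull-style item: just the metadata the container knows.
class Item {
    final String filename;   // may be null
    final String mimeType;   // may be null

    Item(String filename, String mimeType) {
        this.filename = filename;
        this.mimeType = mimeType;
    }
}

// Hypothetical push-style callback.
interface ItemHandler {
    void handle(Item item);
}

class PullToPush {
    // Drives a pull-style iterator and forwards each item to the handler.
    // Returns how many items were pushed.
    static int pushAll(Iterator<Item> items, ItemHandler handler) {
        int count = 0;
        while (items.hasNext()) {
            handler.handle(items.next());
            count++;
        }
        return count;
    }
}
```

A caller would pass something like item -> handlePDF(item) as the handler.
Going the other way, from push to pull, generally needs buffering or a
second thread, which is part of why the pull form is the more flexible
base.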
Nick