On Wed, 1 Sep 2010, Jukka Zitting wrote:
The main complexity I see here is what the return values of such a
service would look like, especially if you need to support cases where
the container document is only available as an InputStream (i.e. no
random access). Then you'd either need to use temporary files (or
in-memory buffers) or a callback interface like this one:

   public interface ComponentDocumentHandler {
       void handleComponentDocument(
           InputStream stream, Metadata metadata)
           throws IOException, TikaException;
   }

The issue is that for some file formats, we'll have to process the whole container anyway to do something useful. Even zip is problematic - we'll want to know whether it's a plain .zip file, a .docx file, or a Keynote file. That potentially means looking at all of the zip file's entries before we know whether we should expose every entry in the zip, or only the ones in certain special places. For the .docx case, we'll also need to look at the content types and rels entries to figure out the mime types, and potentially the real file names.
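
To make the zip point concrete, here is a rough sketch of the kind of full scan I mean. It is only an illustration using plain java.util.zip, not Tika code: the class name, the Kind enum and the choice of marker entries are all assumptions of mine ("[Content_Types].xml" for the OOXML family, "index.apxl" for Keynote). Telling .docx apart from other OOXML types would additionally mean reading the content types entry itself.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of Tika.
final class ZipContainerSniffer {
    enum Kind { OOXML, KEYNOTE, PLAIN_ZIP }

    // Decide the container kind from the complete set of entry names.
    // The marker names are assumptions chosen for illustration.
    static Kind classify(Set<String> names) {
        if (names.contains("[Content_Types].xml")) {
            return Kind.OOXML;
        }
        if (names.contains("index.apxl")) {
            return Kind.KEYNOTE;
        }
        return Kind.PLAIN_ZIP;
    }

    // With only an InputStream there is no random access, so every entry
    // has to be walked before the kind is known. The marker entry may be
    // the very last one in the stream.
    static Kind sniff(InputStream in) throws IOException {
        Set<String> names = new LinkedHashSet<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return classify(names);
    }
}
```

Note that sniff() consumes the stream, so the caller would need a buffered copy to go back and actually hand out the entries afterwards.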

So, I think that if someone wants to use this service, they'll need to either have the file locally, or put up with buffering the whole thing in memory. Alas, I don't see this as being a light-weight call.
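
For the "have the file locally" route, something along these lines is what I have in mind: spool the stream to a temporary file, then open it with random access. Again this is a sketch with plain JDK classes; SpooledContainer and listEntries are made-up names, not an existing Tika API.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper, not part of Tika.
final class SpooledContainer {
    // Copy the whole stream to a temp file so the zip can be opened with
    // random access, list its entries, then remove the temp file.
    static List<String> listEntries(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("container-", ".tmp");
        try {
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            try (ZipFile zip = new ZipFile(tmp.toFile())) {
                List<String> names = new ArrayList<>();
                for (ZipEntry entry : Collections.list(zip.entries())) {
                    names.add(entry.getName());
                }
                return names;
            }
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

The cost is a full copy of the container before the first entry can be returned, which is why I don't think this can be made cheap.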


In terms of linking it up with the tika parser, I'm happy to go with whatever you suggest :)

Nick
