On Tue, 7 Sep 2010, Jukka Zitting wrote:
On Mon, Sep 6, 2010 at 1:19 PM, Nick Burch <[email protected]> wrote:
Finally, pull vs push for the consumer.
[...]
I think the former would be a little bit more work for us, but is likely to
lead to cleaner and simpler code for consumers. What do people think?

I'd start with a push mechanism as that supports streaming and is
better in line with the current design of Tika.

OK, that seems sensible to me. We'll go for a push option where you specify a callback helper that'll be triggered for each embedded file. It'd then be up to you to decide whether you want the contents, based on the filename and/or MIME type.
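As a rough illustration, the consumer side of such a push API might look like the sketch below. The interface and method names here are just my own placeholders, not an actual Tika API; the point is that the consumer gets asked first, from the name/type alone, whether it wants the contents at all.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical push-style callback (names are illustrative, not Tika code).
interface EmbeddedFileHandler {
    // Decide from the metadata alone whether the contents are wanted.
    boolean shouldHandle(String fileName, String mimeType);

    // Fired with the contents of each embedded file the consumer accepted.
    void handle(String fileName, String mimeType, InputStream contents)
        throws IOException;
}

// Example consumer that only cares about embedded images.
class ImageCollector implements EmbeddedFileHandler {
    final List<String> names = new ArrayList<>();

    public boolean shouldHandle(String fileName, String mimeType) {
        return mimeType != null && mimeType.startsWith("image/");
    }

    public void handle(String fileName, String mimeType, InputStream contents) {
        names.add(fileName);  // a real consumer would read the stream here
    }
}
```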

In terms of a fully streaming approach, though, I'm not sure how easy it'll be. Reviewing the different container formats, the extent to which they're streamable vs need buffering is:
* Tar (+compressed) - can be streamed
* Ogg / Avi / etc - different parts of the file are interleaved. If we
   support streaming, the callbacks would need to handle being run
   in parallel, which might add too much complexity for users?
* OLE2 - can't be streamed, we're going to have to buffer the whole file,
   load it into POIFS, and only then start returning things
* Zip - we'll need to do (at least) two passes. On the first pass we'll
   look at what files it contains, and use that to figure out if it's
   .docx, Keynote, OpenOffice etc, or just a plain zip. If it's a plain
   zip, the second pass will return each file in turn. If it's a zip-based
   document format, filetype-specific code will identify the embedded
   media for that format, and return each in turn.
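To make the zip case concrete, here's a small sketch of the two-pass idea over a buffered file, using plain java.util.zip rather than anything in Tika. The entry-name markers used for detection (`[Content_Types].xml` for OOXML, `mimetype` for OpenDocument-style packages) are my own assumptions about how detection might work, not Tika's actual detection logic.

```java
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Two-pass zip handling sketch (illustrative only, not Tika code).
class ZipTwoPass {

    // Pass 1: inspect entry names to guess the container's real type.
    static String detectFormat(ZipFile zip) {
        if (zip.getEntry("[Content_Types].xml") != null) return "ooxml";  // .docx/.xlsx/.pptx
        if (zip.getEntry("mimetype") != null) return "odf";               // OpenDocument etc
        return "plain-zip";
    }

    // Pass 2 (plain zip case): return each contained file in turn.
    static List<String> embeddedFiles(ZipFile zip) {
        List<String> names = new ArrayList<>();
        for (Enumeration<? extends ZipEntry> e = zip.entries(); e.hasMoreElements(); ) {
            ZipEntry entry = e.nextElement();
            if (!entry.isDirectory()) names.add(entry.getName());
        }
        return names;
    }
}
```

For a zip-based document format, pass 2 would instead hand off to filetype-specific code that knows where that format keeps its embedded media.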

I'd see this as meaning that you pass a TikaInputStream and a callback handler to the service. If the container format supports it, we'll stream through the file, firing the callback handler as we go. In most cases, though, the file will be buffered (to disk or memory as appropriate), the appropriate bits identified, and the callback handler then fired for each part.
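The buffering fallback amounts to spooling the incoming stream somewhere seekable before the format-specific code runs. TikaInputStream does this sort of spooling internally; the standalone sketch below just shows the idea in miniature, with names of my own choosing.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Sketch of the buffering fallback: when the container can't be
// streamed (OLE2, zip), spool the stream to a temp file first so
// format-specific code can do random access. Illustrative only.
class SpoolToDisk {
    static File spool(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("container", ".buf");
        tmp.toFile().deleteOnExit();
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        return tmp.toFile();
    }
}
```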

Nick
