On Tue, 7 Sep 2010, Jukka Zitting wrote:
> On Mon, Sep 6, 2010 at 1:19 PM, Nick Burch <[email protected]> wrote:
>> Finally, pull vs push for the consumer.
>> [...]
>> I think the former would be a little bit more work for us, but is likely to
>> lead to cleaner and simpler code for consumers. What do people think?
>
> I'd start with a push mechanism as that supports streaming and is
> better in line with the current design of Tika.
OK, that seems sensible to me, we'll go for a push option where you
specify a callback helper that'll be triggered for each embedded file. It'd
then be up to you to decide if you wanted the contents or not, based on
the filename and/or mime type.
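A minimal sketch of what such a callback helper might look like, assuming a
name/MIME-type filter plus a processing hook (the interface and method names
here are purely illustrative, not a committed API):

```java
import java.io.InputStream;

// Illustrative sketch only: a design suggestion, not real Tika API.
interface EmbeddedFileCallback {
    /**
     * Decide whether the contents of this embedded file are wanted,
     * based on its name and/or detected MIME type.
     */
    boolean wanted(String fileName, String mimeType);

    /** Fired with the contents of each embedded file you said you wanted. */
    void process(String fileName, String mimeType, InputStream contents);
}
```

A consumer that only cares about, say, embedded images would return true from
wanted() for image/* types and ignore everything else.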
In terms of a fully streaming approach though, I'm not sure how easy it'll
be. Reviewing the different container formats, the extent to which they're
streamable vs needing buffering is:
* Tar (+compressed) - can be streamed
* Ogg / Avi / etc - different parts of the file are interleaved. If we
support streaming, the callbacks would need to handle being run
in parallel, which might add too much complexity for users?
* OLE2 - can't be streamed, we're going to have to buffer the whole file,
load it into POIFS, and only then start returning things
* Zip - we'll need to do (at least) two passes. The first pass we'll look
at what files it contains, and use that to figure out if it's
.docx, keynote, open office etc, or just plain zip. If it's a plain
zip, 2nd pass will return each file in turn. If it's a zip-based
document format, filetype-specific code will identify the embedded
media for that format, and return each in turn.
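For the zip case, that first pass could be as simple as scanning the entry
names for known format markers. A purely name-based sketch (the class and the
exact marker list are illustrative; OOXML packages do carry a
[Content_Types].xml entry and OpenDocument files a mimetype entry, but real
detection would likely also look inside the entries):

```java
import java.util.List;

// Sketch of a first pass over a zip: guess the real container type from
// its entry names. In practice the names would come from iterating the
// buffered zip's central directory.
class ZipFormatSniffer {
    static String detectFormat(List<String> entryNames) {
        for (String name : entryNames) {
            if (name.equals("[Content_Types].xml")) return "ooxml";        // .docx / .xlsx / .pptx
            if (name.equals("mimetype"))            return "opendocument"; // ODF stores its type here
            if (name.endsWith(".apxl"))             return "keynote";      // older Keynote bundles
        }
        return "plain-zip"; // no marker entry: treat it as an ordinary zip
    }
}
```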
I'd see this as meaning that you pass a TikaInputStream and a callback
handler in to the service. If streaming is supported for the container
format, it will stream through the file, firing the callback handler as it
goes. In most cases though, the file will be buffered (to disk or memory as
appropriate), the appropriate parts identified, and then the callback
handler fired for each part.
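The buffering half of that could look roughly like this; a sketch only, using
a simple size threshold to pick memory vs disk (the helper name and the
threshold are made up for illustration):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Sketch of the "buffered (to disk or memory as appropriate)" step only;
// this helper and its threshold are illustrative, not a proposed Tika API.
class StreamBuffer {
    static final int MEMORY_THRESHOLD = 1 << 20; // assumed cut-off: 1 MiB

    /**
     * Copies the stream somewhere re-readable: small streams stay in memory,
     * larger ones are spooled to a temp file. Either way the caller gets a
     * fresh InputStream it can hand to POIFS, a zip reader, etc.
     */
    static InputStream buffer(InputStream in) throws IOException {
        ByteArrayOutputStream memory = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            memory.write(chunk, 0, n);
            if (memory.size() > MEMORY_THRESHOLD) {
                // Too big for memory: dump what we have, plus the rest, to disk
                File spool = File.createTempFile("container-buffer", ".tmp");
                spool.deleteOnExit();
                try (FileOutputStream out = new FileOutputStream(spool)) {
                    memory.writeTo(out);
                    while ((n = in.read(chunk)) != -1) {
                        out.write(chunk, 0, n);
                    }
                }
                return new FileInputStream(spool);
            }
        }
        return new ByteArrayInputStream(memory.toByteArray());
    }
}
```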
Nick