Hey guys,

I've been following this discussion, and one thing I'd like to add is that scientific data formats exhibit most of the same properties as the container formats. For instance, NetCDF does not support random access, and the existing Java APIs for those files require the full file to be available on disk before any information can be extracted from it. HDF is similar. So I'm going to follow this discussion a bit more closely now, as I see it coming closer to a concrete idea! ;) I've been watching the TikaInputStream work that Jukka has been doing, and I think that's a good starting point for addressing some of these issues.
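To make the "full file on disk" problem concrete, here is a rough, hypothetical sketch of the buffer-to-disk idea behind TikaInputStream: wrap a plain stream, and only spool it to a temp file when a consumer (such as a NetCDF or HDF reader) insists on a real file. The class name `SpoolingStream` is invented for illustration; this is not Tika's actual API.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch: lazily spool a stream to a temp file so that
// libraries needing random access to a java.io.File can still work.
class SpoolingStream {
    private final InputStream in;
    private Path spooled; // created only if a File is actually requested

    SpoolingStream(InputStream in) {
        this.in = in;
    }

    // Returns a File backing the stream, spooling to disk on first request.
    File getFile() throws IOException {
        if (spooled == null) {
            spooled = Files.createTempFile("spool-", ".tmp");
            Files.copy(in, spooled, StandardCopyOption.REPLACE_EXISTING);
        }
        return spooled.toFile();
    }
}
```

Consumers that can work on the raw stream never pay the disk cost; only those that demand a file trigger the copy.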
Cheers,
Chris

On 9/7/10 3:39 AM, "Nick Burch" <[email protected]> wrote:

On Tue, 7 Sep 2010, Jukka Zitting wrote:
> On Mon, Sep 6, 2010 at 1:19 PM, Nick Burch <[email protected]> wrote:
>> Finally, pull vs push for the consumer.
>> [...]
>> I think the former would be a little bit more work for us, but is likely
>> to lead to cleaner and simpler code for consumers. What do people think?
>
> I'd start with a push mechanism, as that supports streaming and is
> better in line with the current design of Tika.

OK, that seems sensible to me. We'll go for a push option where you specify a callback helper that'll be triggered for each embedded file. It'd then be up to you to decide whether you want the contents or not, based on the filename and/or MIME type.

In terms of a fully streaming approach, though, I'm not sure how easy it'll be. Reviewing the different container formats, the extent to which they're streamable vs. needing buffering is:

* Tar (+compressed) - can be streamed
* Ogg / AVI / etc. - different parts of the file are interleaved. If we support streaming, the callbacks would need to handle being run in parallel, which might add too much complexity for users?
* OLE2 - can't be streamed; we're going to have to buffer the whole file, load it into POIFS, and only then start returning things
* Zip - we'll need to do (at least) two passes. On the first pass we'll look at what files it contains, and use that to figure out if it's .docx, Keynote, OpenOffice, etc., or just a plain zip. If it's a plain zip, the second pass will return each file in turn. If it's a zip-based document format, filetype-specific code will identify the embedded media for that format and return each in turn.

I'd see this as meaning that you pass a TikaInputStream and a callback handler to the service. Where the container supports it, the service will stream through the file, firing the callback handler as it goes.
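The push option being discussed might look something like the following sketch. The names `EmbeddedFileHandler` and `ContainerWalker` are hypothetical, invented purely to illustrate the shape of the callback contract; this is not an actual Tika interface.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical callback contract: the caller decides, per entry, whether
// it wants the contents, based on the name and/or MIME type.
interface EmbeddedFileHandler {
    boolean wanted(String name, String mimeType);
    void handle(String name, InputStream contents) throws IOException;
}

// Stand-in for a container parser: pushes each embedded entry through the
// registered handler, skipping entries the handler declines.
class ContainerWalker {
    private final Map<String, byte[]> entries = new LinkedHashMap<>();

    void add(String name, byte[] data) {
        entries.put(name, data);
    }

    void walk(EmbeddedFileHandler handler) throws IOException {
        for (Map.Entry<String, byte[]> e : entries.entrySet()) {
            // Crude type guess, just for the sketch.
            String mime = e.getKey().endsWith(".txt")
                    ? "text/plain" : "application/octet-stream";
            if (handler.wanted(e.getKey(), mime)) {
                handler.handle(e.getKey(), new ByteArrayInputStream(e.getValue()));
            }
        }
    }
}
```

The `wanted` check before `handle` is what lets consumers skip large embedded media they don't care about without the walker having to materialise them.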
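The two-pass zip idea can be sketched with `java.util.zip`. Pass 1 only lists entry names to classify the container; a later pass would re-read the buffered bytes and return each entry. The marker files used here are the standard ones (OOXML ships `[Content_Types].xml`, OpenDocument ships a `mimetype` entry), but the classification labels and the `TwoPassZip` class are illustrative, not Tika code.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Illustrative first pass over a buffered zip: collect entry names, then
// use well-known marker files to decide what kind of container this is.
class TwoPassZip {
    static String classify(byte[] zipBytes) throws IOException {
        Set<String> names = new HashSet<>();
        ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(zipBytes));
        for (ZipEntry e; (e = zin.getNextEntry()) != null; ) {
            names.add(e.getName());
        }
        if (names.contains("[Content_Types].xml")) return "ooxml";
        if (names.contains("mimetype")) return "odf";
        return "plain-zip";
    }
}
```

Because `ZipInputStream` can't be rewound, the bytes have to be buffered (in memory or on disk) for the second pass to re-read them, which is exactly why zip falls into the "needs buffering" column above.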
For most cases, the file will be buffered (to disk or memory as appropriate), the appropriate bits identified, and then the callback handler fired for each part.

Nick

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory, Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
Phone: +1 (818) 354-8810
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
