Hey guys,

I've been following this discussion, and one thing I'd like to add is that scientific data formats exhibit most of the same properties as the container formats. For instance, NetCDF does not support random access, and the existing Java APIs for those files require the full file to be available on disk before any information can be extracted from it. HDF is similar. So I'm going to follow this discussion a bit more closely now, as I see it coming closer to a concrete idea! ;) I've been watching the TikaInputStream work that Jukka has been doing, and I think that's a good starting point for addressing some of these issues.
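To make the "full file on disk" problem concrete, here is a rough, hypothetical sketch of the buffer-to-disk idea behind TikaInputStream: wrap a plain stream, and only spool it to a temp file when a consumer (such as a NetCDF or HDF reader) insists on a real file. The class name `SpoolingStream` is invented for illustration; this is not Tika's actual API.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch: lazily spool a stream to a temp file so that
// libraries needing random access to a java.io.File can still work.
class SpoolingStream {
    private final InputStream in;
    private Path spooled; // created only if a File is actually requested

    SpoolingStream(InputStream in) {
        this.in = in;
    }

    // Returns a File backing the stream, spooling to disk on first request.
    File getFile() throws IOException {
        if (spooled == null) {
            spooled = Files.createTempFile("spool-", ".tmp");
            Files.copy(in, spooled, StandardCopyOption.REPLACE_EXISTING);
        }
        return spooled.toFile();
    }
}
```

Consumers that can work on the raw stream never pay the disk cost; only those that demand a file trigger the copy.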
Cheers,
Chris

On 9/7/10 3:39 AM, "Nick Burch" <[email protected]> wrote:

On Tue, 7 Sep 2010, Jukka Zitting wrote:
> On Mon, Sep 6, 2010 at 1:19 PM, Nick Burch <[email protected]> wrote:
>> Finally, pull vs push for the consumer.
>> [...]
>> I think the former would be a little bit more work for us, but is likely
>> to lead to cleaner and simpler code for consumers. What do people think?
>
> I'd start with a push mechanism, as that supports streaming and is
> better in line with the current design of Tika.

OK, that seems sensible to me. We'll go for a push option where you specify a callback helper that'll be triggered for each embedded file. It'd then be up to you to decide whether you want the contents or not, based on the filename and/or MIME type.

In terms of a fully streaming approach, though, I'm not sure how easy it'll be. Reviewing the different container formats, the extent to which they're streamable vs. needing buffering is:

* Tar (+compressed) - can be streamed
* Ogg / AVI / etc. - different parts of the file are interleaved. If we support streaming, the callbacks would need to handle being run in parallel, which might add too much complexity for users?
* OLE2 - can't be streamed; we're going to have to buffer the whole file, load it into POIFS, and only then start returning things
* Zip - we'll need to do (at least) two passes. On the first pass we'll look at what files it contains, and use that to figure out if it's .docx, Keynote, OpenOffice, etc., or just a plain zip. If it's a plain zip, the second pass will return each file in turn. If it's a zip-based document format, filetype-specific code will identify the embedded media for that format and return each in turn.

I'd see this as meaning that you pass a TikaInputStream and a callback handler to the service. Where the container supports it, the service will stream through the file, firing the callback handler as it goes.
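The push option being discussed might look something like the following sketch. The names `EmbeddedFileHandler` and `ContainerWalker` are hypothetical, invented purely to illustrate the shape of the callback contract; this is not an actual Tika interface.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical callback contract: the caller decides, per entry, whether
// it wants the contents, based on the name and/or MIME type.
interface EmbeddedFileHandler {
    boolean wanted(String name, String mimeType);
    void handle(String name, InputStream contents) throws IOException;
}

// Stand-in for a container parser: pushes each embedded entry through the
// registered handler, skipping entries the handler declines.
class ContainerWalker {
    private final Map<String, byte[]> entries = new LinkedHashMap<>();

    void add(String name, byte[] data) {
        entries.put(name, data);
    }

    void walk(EmbeddedFileHandler handler) throws IOException {
        for (Map.Entry<String, byte[]> e : entries.entrySet()) {
            // Crude type guess, just for the sketch.
            String mime = e.getKey().endsWith(".txt")
                    ? "text/plain" : "application/octet-stream";
            if (handler.wanted(e.getKey(), mime)) {
                handler.handle(e.getKey(), new ByteArrayInputStream(e.getValue()));
            }
        }
    }
}
```

The `wanted` check before `handle` is what lets consumers skip large embedded media they don't care about without the walker having to materialise them.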
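The two-pass zip idea can be sketched with `java.util.zip`. Pass 1 only lists entry names to classify the container; a later pass would re-read the buffered bytes and return each entry. The marker files used here are the standard ones (OOXML ships `[Content_Types].xml`, OpenDocument ships a `mimetype` entry), but the classification labels and the `TwoPassZip` class are illustrative, not Tika code.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Illustrative first pass over a buffered zip: collect entry names, then
// use well-known marker files to decide what kind of container this is.
class TwoPassZip {
    static String classify(byte[] zipBytes) throws IOException {
        Set<String> names = new HashSet<>();
        ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(zipBytes));
        for (ZipEntry e; (e = zin.getNextEntry()) != null; ) {
            names.add(e.getName());
        }
        if (names.contains("[Content_Types].xml")) return "ooxml";
        if (names.contains("mimetype")) return "odf";
        return "plain-zip";
    }
}
```

Because `ZipInputStream` can't be rewound, the bytes have to be buffered (in memory or on disk) for the second pass to re-read them, which is exactly why zip falls into the "needs buffering" column above.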
For most cases, the file will be buffered (to disk or memory as appropriate), the appropriate bits identified, and then the callback handler fired for each part.

Nick

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory, Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
Phone: +1 (818) 354-8810
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
