On 2010-09-01 12:25, Nick Burch wrote:
On Wed, 1 Sep 2010, Andrzej Bialecki wrote:
This would be very useful. We contemplated implementing something like
this in Nutch, to handle archives (jar/tar/zip/...), but having it in
Tika would be much better.

I'd forgotten about tar, that's another one to handle... :)

Does recursive here mean that it would look into embedded zip files
too? Or that it would process all paths (since there is really no
hierarchy in zip files)?

I was thinking recursive could mean different things. For zip files, tar
files etc, it would probably just mean root directory vs descend into
all directories.

There are no directories in these formats: it's a flat namespace that happens to follow filesystem path conventions. The Java APIs for these containers also provide only simple iterators over the entries. So I'm not sure there's any benefit to this distinction here... maybe provide a FilenameFilter to control which path names to process?
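
To illustrate the flat iteration, here is a rough sketch using only java.util.zip. The class FlatZipWalk and its methods are hypothetical names, not anything that exists in Tika or Nutch, and a Predicate<String> on the full entry path stands in for the FilenameFilter, on the assumption that there is no real directory File to hand to FilenameFilter.accept().

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, for illustration only.
public class FlatZipWalk {

    /** Returns the entry names accepted by the filter, in archive order. */
    public static List<String> select(InputStream in, Predicate<String> pathFilter)
            throws IOException {
        List<String> accepted = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            // The API is a plain iterator: one entry after another, no tree to descend.
            while ((entry = zip.getNextEntry()) != null) {
                if (pathFilter.test(entry.getName())) {
                    accepted.add(entry.getName());
                }
            }
        }
        return accepted;
    }

    /** Builds a small in-memory zip so the example needs no files on disk. */
    public static byte[] sampleZip(String... names) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes)) {
            for (String name : names) {
                out.putNextEntry(new ZipEntry(name));
                out.write(name.getBytes(StandardCharsets.UTF_8));
                out.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] zip = sampleZip("a.txt", "docs/b.txt", "docs/c.xml");
        // Keep only the .txt paths, whatever "directory" they appear to be in.
        System.out.println(select(new ByteArrayInputStream(zip), p -> p.endsWith(".txt")));
    }
}
```

A "root only" mode would then just be a filter that rejects any path containing '/'.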

On the other hand I see a benefit in having an option to automatically descend into embedded archives.

For OLE2, it would mean checking embedded documents of
embedded documents (normally but not always by means of descending into
child directories). Maybe there's a clearer name for this sort of thing?

OLE2 is nothing special; it's the same with other archive types, since you can always have embedded archives within archives. I think the following could be helpful:

* a FilenameFilter to decide what paths to process
* a boolean "recursive" to specify that we want to descend into embedded archives, maybe with a list of interesting archive types?
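
The two options above could be sketched roughly as follows. This is only an illustration for zip, using java.util.zip: the class NestedArchiveWalk, its constructor arguments and the "!/" separator in reported paths are all made up for the example and are not an existing Tika API. It also assumes that archives are recognised by file extension, and that the path filter applies to ordinary entries while archives of an "interesting" type are always descended into.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Predicate;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, for illustration only.
public class NestedArchiveWalk {
    private final Predicate<String> pathFilter;   // which paths to process
    private final boolean recursive;              // descend into embedded archives?
    private final Set<String> archiveExtensions;  // "interesting" archive types, e.g. "zip"

    public NestedArchiveWalk(Predicate<String> pathFilter, boolean recursive,
                             Set<String> archiveExtensions) {
        this.pathFilter = pathFilter;
        this.recursive = recursive;
        this.archiveExtensions = archiveExtensions;
    }

    /** Lists accepted paths; the caller remains responsible for closing the stream. */
    public List<String> walk(InputStream in) throws IOException {
        List<String> found = new ArrayList<>();
        walk(in, "", found);
        return found;
    }

    private void walk(InputStream in, String prefix, List<String> found) throws IOException {
        // Deliberately not closed here: closing it would also close the outer stream.
        ZipInputStream zip = new ZipInputStream(in);
        ZipEntry entry;
        while ((entry = zip.getNextEntry()) != null) {
            String path = prefix + entry.getName();
            if (recursive && isArchive(entry.getName())) {
                // While positioned on an entry, the stream yields that entry's bytes,
                // so it can be handed straight to a nested reader.
                walk(zip, path + "!/", found);
            } else if (pathFilter.test(path)) {
                found.add(path);
            }
        }
    }

    private boolean isArchive(String name) {
        int dot = name.lastIndexOf('.');
        return dot >= 0
                && archiveExtensions.contains(name.substring(dot + 1).toLowerCase(Locale.ROOT));
    }
}
```

With recursive set to false, an embedded archive is treated like any other entry and simply passed through the path filter.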

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
