On Wed, 1 Sep 2010, Ken Krugler wrote:
http://lucene.472066.n3.nabble.com/Multiple-documents-per-input-stream-td647159.html#a647159
I ran into a similar issue when trying to figure out how best to handle mbox
files.
Yeah, I guess we could optionally treat mbox (and .pst + similar)
mailboxes as containers too.
My current thinking is that Tika should do roughly the right thing if
people just throw it a document, but should allow finer-grained access to
embedded resources for people who need control. Just FYI, at the moment my
use case is to extract the embedded images out of uploaded .doc and .docx
files, but I can see future requirements for parsing the metadata and text
out of embedded documents via Tika too, so I want it to work for both :)
Nick