On Wed, 1 Sep 2010, Ken Krugler wrote:
http://lucene.472066.n3.nabble.com/Multiple-documents-per-input-stream-td647159.html#a647159

I ran into a similar issue when trying to figure out how best to handle mbox files.

Yeah, I guess we could optionally treat mbox (and .pst + similar) mailboxes as containers too.

My current thinking is that Tika should do roughly the right thing if people just throw a document at it, but should allow finer-grained access to embedded resources for people who need control. Just FYI, at the moment my use case is to extract the embedded images out of uploaded .doc and .docx files, but I can see future requirements for parsing the metadata and text out of embedded documents via Tika too, so I want it to work for both :)
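
To make the "document as a container" idea concrete, here is a rough sketch of the image case for .docx only. It is not Tika code and says nothing about what the Tika API should look like: it just relies on a .docx being a zip package whose images normally live under word/media/. The function name iter_embedded_images is made up for illustration, and the older binary .doc format would need a different approach entirely.

```python
import io
import zipfile


def iter_embedded_images(docx_bytes):
    """Yield (entry_name, raw_bytes) for each file under word/media/.

    Assumption: the input is a .docx package (a zip archive), and its
    embedded images are stored under word/media/ as is conventional.
    """
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        for info in package.infolist():
            # Skip directory entries; keep only real files in the media folder.
            if info.filename.startswith("word/media/") and not info.is_dir():
                yield info.filename, package.read(info)
```

A caller who just wants the text can ignore this entirely, while one who needs the images can loop over the results and save each one, which is the sort of optional finer-grained access I have in mind.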

Nick
