On Wed, 14 Jul 2010, Paul Jakubik wrote:
I created a wiki page for this discussion
(http://wiki.apache.org/tika/MetadataDiscussion). I don't know if that is
what you were thinking of.

Looks good to me!

Having looked through your proposed solutions, I can't see easy ways to implement these use cases:
* enumerate all the Metadata objects at a given depth
  e.g. the top level has one Metadata object (for the parent file), and
  one level down there may be 3 Metadata objects, one for each of the
  3 child documents
* get the Metadata for a specific embedded document
  e.g. I know my zip file has "/foo/bar.doc" in it, give me the metadata
  for that
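
To make those two use cases concrete, here is a minimal sketch in Java. The MetadataNode class and its atDepth and find methods are hypothetical names made up for illustration; they are not part of Tika or of any of the proposals on the wiki page, and the tree layout is only one assumption about how nested metadata might be held.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical type for illustration only; not part of the Tika API.
class MetadataNode {
    final String path; // e.g. "/foo/bar.doc"; "/" is assumed for the parent file
    final Map<String, String> properties = new LinkedHashMap<>();
    final List<MetadataNode> children = new ArrayList<>();

    MetadataNode(String path) {
        this.path = path;
    }

    // Attaches an embedded document's metadata and returns it for chaining.
    MetadataNode addChild(MetadataNode child) {
        children.add(child);
        return child;
    }

    // Use case 1: every metadata object at the given depth (0 = this node).
    List<MetadataNode> atDepth(int depth) {
        List<MetadataNode> result = new ArrayList<>();
        collect(this, depth, result);
        return result;
    }

    private static void collect(MetadataNode node, int remaining, List<MetadataNode> out) {
        if (remaining == 0) {
            out.add(node);
            return;
        }
        for (MetadataNode child : node.children) {
            collect(child, remaining - 1, out);
        }
    }

    // Use case 2: metadata for one embedded document, found by its path
    // with a depth-first search.
    Optional<MetadataNode> find(String wantedPath) {
        if (path.equals(wantedPath)) {
            return Optional.of(this);
        }
        for (MetadataNode child : children) {
            Optional<MetadataNode> hit = child.find(wantedPath);
            if (hit.isPresent()) {
                return hit;
            }
        }
        return Optional.empty();
    }
}
```

With something of this shape, root.atDepth(1) would list the metadata of the direct children, and root.find("/foo/bar.doc") would return the metadata of that one entry if it exists.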

There should probably be some mention of how users can opt in to or out of nested metadata extraction. Some people won't want anything from embedded documents, so they'll set the context appropriately and the parser won't touch the embedded files. Some may want the text content but not care about the metadata (I think someone on the list raised this use case). Some may want both text and metadata.
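
As a rough illustration of those three choices, the option could be modelled as a small enum. EmbeddedHandling and its constants are hypothetical names invented for this sketch, not an existing Tika setting; how the value would actually be passed to a parser through the context is left open here.

```java
// Hypothetical enum for illustration only; not part of the Tika API.
enum EmbeddedHandling {
    SKIP(false, false),             // parser does not touch embedded files
    TEXT_ONLY(true, false),         // embedded text content, no embedded metadata
    TEXT_AND_METADATA(true, true);  // both text and metadata of embedded files

    private final boolean text;
    private final boolean metadata;

    EmbeddedHandling(boolean text, boolean metadata) {
        this.text = text;
        this.metadata = metadata;
    }

    // True if text of embedded documents should be extracted.
    boolean extractsText() {
        return text;
    }

    // True if metadata of embedded documents should be collected.
    boolean extractsMetadata() {
        return metadata;
    }
}
```

A parser could then check extractsText() and extractsMetadata() before descending into an embedded document, so each of the three groups of users gets only what they asked for.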

Nick
