On Wed, 14 Jul 2010, Paul Jakubik wrote:
> I created a wiki page for this discussion
> (http://wiki.apache.org/tika/MetadataDiscussion). I don't know if that is
> what you were thinking of.
Looks good to me!
Having looked through your proposed solutions, I can't see easy ways to
implement these use cases:
* enumerate all the Metadata objects at a given depth
  e.g. the top level has one Metadata object (for the parent file), 1 level
  down may have 3 Metadata objects, one for each of the 3 child documents
* get the Metadata for a specific embedded document
  e.g. I know my zip file has "/foo/bar.doc" in it, give me the metadata
  for that
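To make those two use cases concrete, here is a rough sketch of the kind of access I mean. It assumes the nested metadata ends up in a simple tree keyed by embedded path. MetadataNode and its methods are made-up names for illustration only, not existing Tika classes, and a plain String map stands in for the real Metadata object.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only: MetadataNode is NOT an existing Tika class.
class MetadataNode {
    final String path;  // e.g. "/foo/bar.doc"; "" for the parent file
    final Map<String, String> properties = new LinkedHashMap<>();  // stand-in for Metadata
    final List<MetadataNode> children = new ArrayList<>();

    MetadataNode(String path) {
        this.path = path;
    }

    // Attach the metadata node for one embedded document and return it.
    MetadataNode addChild(String childPath) {
        MetadataNode child = new MetadataNode(childPath);
        children.add(child);
        return child;
    }

    // Use case 1: every node exactly `depth` levels below this one (0 = this node).
    List<MetadataNode> atDepth(int depth) {
        List<MetadataNode> result = new ArrayList<>();
        collect(this, depth, result);
        return result;
    }

    private static void collect(MetadataNode node, int remaining, List<MetadataNode> out) {
        if (remaining == 0) {
            out.add(node);
            return;
        }
        for (MetadataNode child : node.children) {
            collect(child, remaining - 1, out);
        }
    }

    // Use case 2: depth-first lookup of a known embedded path, or null if absent.
    MetadataNode find(String wantedPath) {
        if (path.equals(wantedPath)) {
            return this;
        }
        for (MetadataNode child : children) {
            MetadataNode hit = child.find(wantedPath);
            if (hit != null) {
                return hit;
            }
        }
        return null;
    }
}
```

With a zip holding three documents, atDepth(0) on the root would give the single parent node, atDepth(1) the three children, and find("/foo/bar.doc") the node for that one entry. Whether the proposals on the wiki can support lookups like these cheaply is the open question.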
There should probably be some mention of how users can opt in or out of
the nested metadata extraction. Some people won't want anything from
embedded documents, so they'll set the context appropriately, and the
parser won't touch the embedded files. Some may want text content, but not
care about the metadata (I think someone on the list raised this use
case). Some may want both text and metadata.
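As a sketch of how those three choices might be expressed, assuming the caller puts a single setting into the context before parsing: EmbeddedPolicy below is a made-up name for illustration, not an existing Tika type, and how it would actually be wired into the context is left open.

```java
// Hypothetical sketch only: EmbeddedPolicy is NOT an existing Tika type.
enum EmbeddedPolicy {
    SKIP(false, false),              // parser won't touch embedded files
    TEXT_ONLY(true, false),          // text content, no nested metadata
    TEXT_AND_METADATA(true, true);   // both text and nested metadata

    private final boolean text;
    private final boolean metadata;

    EmbeddedPolicy(boolean text, boolean metadata) {
        this.text = text;
        this.metadata = metadata;
    }

    // Whether the parser needs to open embedded documents at all.
    boolean parseEmbedded() {
        return text || metadata;
    }

    boolean wantsText() {
        return text;
    }

    boolean wantsMetadata() {
        return metadata;
    }
}
```

A parser could check parseEmbedded() once before recursing, then consult wantsText() and wantsMetadata() to decide what to emit for each child.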
Nick