Dear Wiki user, You have subscribed to a wiki page or wiki category on "Tika Wiki" for change notification.
The "MetadataDiscussion" page has been changed by PaulJakubik.
http://wiki.apache.org/tika/MetadataDiscussion?action=diff&rev1=1&rev2=2

--------------------------------------------------

  This page has been created to host a discussion on how Tika returns metadata for different kinds of documents. The goal is to make sure that Tika users have a chance to get to all of the metadata created and/or extracted by Tika.

  == Original Problem ==
- The original inspiration for this page was a Tika user who wanted to get access to the metadata for every document in an archive (e.g. zip, tar.gz, etc.). While the AutoDetectParser allows users to add a Parser to the ParseContext to use when recursively parsing documents in an archive, there is no equivalent for getting access to the recursive metadata.
+ The original inspiration for this page was a Tika user who wanted to get access to the metadata for every document in an archive (e.g. zip, tar.gz, etc.). A way to get recursive metadata is described in the RecursiveMetadata article.

  == Goals for this Page ==
  The goals for this page are bigger than the original problem. This page should hold a discussion about how to better meet different metadata needs for the different kinds of documents supported by Tika, and for the different kinds of users supported by Tika.

@@ -142, +142 @@

  Hopefully we can find some solutions that actually work, and work for many kinds of users. It doesn't look like there is a way to represent metadata for nested sections or nested documents in XHTML, but there may be other ways to make nested metadata available to some users.

  == Metadata for ContentHandler Implementors: Metadata Stack in ParseContext ==
+ If you are going to the effort of implementing a ContentHandler, the RecursiveMetadata page describes how you can get access to recursive metadata.
- This whole nested metadata problem mainly comes up when using the AutoDetectParser or CompositeParser to parse a container. If the user is going to recursively parse the contents of a container, the user has to add a parser to the ParseContext that Tika can use for those nested documents.
-
- Similarly, a user could add a Metadata stack to the ParseContext. Tika could then follow the rule that every time a new {{{<div>}}} section is started, a new Metadata object is pushed onto the stack, and every time a {{{<div>}}} section ends, a Metadata object is popped off the stack. Most ContentHandler implementors could then peek at the top of the stack to see the metadata for their current document. This same solution would work with structured documents with nested subsections.
-
- Each Metadata object would only contain metadata for a specific {{{<div>}}} section, and any user who wants to see the full metadata context at any point in the parsing could walk through all of the Metadata objects on the stack and examine their contents.

  == A Solution for Users Who Don't Implement ContentHandler ==
  If XHTML doesn't offer a legal way to associate arbitrary name-value pairs with a {{{<div>}}} section, then there don't seem to be options for providing full metadata in a single XHTML document. There are at least a couple of possibilities for providing a better-than-nothing solution for users who want all of the metadata without having to write their own ContentHandler.
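
For reference, the Metadata stack idea from the removed text above (push on every {{{<div>}}} start, pop on every {{{<div>}}} end, peek for the current section, walk the stack for the full context) could look roughly like the sketch below. This is not Tika code. It assumes a plain Map as a stand-in for Tika's Metadata class, and the class name MetadataStackSketch and the data-name attribute are made up for illustration. The handler is driven by the JDK's own SAX parser rather than by a Tika parser.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: a Map stands in for Tika's Metadata class, and the
// made-up "data-name" attribute stands in for real per-document metadata.
class MetadataStackSketch extends DefaultHandler {
    // One map per open <div>; the head of the deque is the current section.
    private final Deque<Map<String, String>> stack = new ArrayDeque<>();
    // The full context (outermost first) recorded each time a <div> opens.
    final List<String> seen = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (!"div".equals(qName)) {
            return;
        }
        // A new section starts: push a fresh metadata object for it.
        Map<String, String> metadata = new LinkedHashMap<>();
        for (int i = 0; i < atts.getLength(); i++) {
            metadata.put(atts.getQName(i), atts.getValue(i));
        }
        stack.push(metadata);
        seen.add(contextPath());
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("div".equals(qName)) {
            // The section ends: its metadata goes out of scope.
            stack.pop();
        }
    }

    // Metadata for the innermost open section only (null outside any <div>).
    Map<String, String> current() {
        return stack.peek();
    }

    // Walks the whole stack, outermost section first, to build the full context.
    String contextPath() {
        List<String> names = new ArrayList<>();
        Iterator<Map<String, String>> it = stack.descendingIterator();
        while (it.hasNext()) {
            names.add(it.next().getOrDefault("data-name", "?"));
        }
        return String.join("/", names);
    }

    // Parses an XHTML fragment and returns the context seen at each <div> start.
    static List<String> parse(String xhtml) {
        try {
            MetadataStackSketch handler = new MetadataStackSketch();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.seen;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<div data-name=\"archive.zip\">"
                + "<div data-name=\"a.txt\"><p>A</p></div>"
                + "<div data-name=\"b.txt\"><p>B</p></div>"
                + "</div>";
        parse(xhtml).forEach(System.out::println);
    }
}
```

In this sketch each map only holds the attributes of its own {{{<div>}}}, which mirrors the proposal that each Metadata object describes a single section. Anyone who needs the enclosing container's metadata has to walk the stack, as contextPath() does.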
