Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tika Wiki" for change 
notification.

The "MetadataDiscussion" page has been changed by PaulJakubik.
http://wiki.apache.org/tika/MetadataDiscussion?action=diff&rev1=1&rev2=2

--------------------------------------------------

  This page has been created to host a discussion on how Tika returns metadata 
for different kinds of documents. The goal is to make sure that Tika users have 
a chance to get to all of the metadata created and/or extracted by Tika.
  
  == Original Problem ==
- The original inspiration for this page was a Tika user who wanted to get 
access to the metadata for every document in an archive (e.g. zip, tar.gz, 
etc.). While the AutoDetectParser allows users to add a Parser to the 
ParseContext to use when recursively parsing documents in an archive, there is 
no equivalent for getting access to the recursive metadata.
+ The original inspiration for this page was a Tika user who wanted to get 
access to the metadata for every document in an archive (e.g. zip, tar.gz, 
etc.). A way to get recursive metadata is described in the RecursiveMetadata 
article.
  
  == Goals for this Page ==
  The goals for this page are bigger than the original problem. This page 
should hold a discussion about how to better meet different metadata needs for 
the different kinds of documents supported by Tika, and for the different kinds 
of users supported by Tika.
@@ -142, +142 @@

  Hopefully we can find some solutions that actually work, and work for many 
kinds of users. It doesn't look like there is a way to represent metadata for 
nested sections or nested documents in XHTML, but there may be other ways to 
make nested metadata available to some users.
  
  == Metadata for ContentHandler Implementors: Metadata Stack in ParseContext ==
+ If you are going to the effort of implementing a ContentHandler, the 
RecursiveMetadata page describes how you can get access to recursive metadata.
- This whole nested metadata problem mainly comes up when using the 
AutoDetectParser or CompositeParser to parse a container. If the user is going 
to recursively parse the contents of a container, the user has to add a parser 
to the ParseContext that Tika can use for those nested documents.
- 
- Similarly, a user could add a Metadata stack to the ParseContext. Tika could 
then follow the rule that every time a new {{{<div>}}} section is started, a 
new Metadata object is pushed onto the stack, and every time a {{{<div>}}} 
section ends, a Metadata object is popped off the stack. Most ContentHandler 
implementors could then peek at the top of the stack to see the metadata for 
their current document. This same solution would work with structured documents 
with nested subsections.
- 
- Each Metadata object would only contain metadata for a specific {{{<div>}}} 
section, and any user who wants to see the full metadata context at any point 
in the parsing could walk through all of the Metadata objects on the stack and 
examine their contents.
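
The push/pop rule in the removed text can be sketched in a few lines. This is only an illustration of the idea, not Tika code: the MetadataStack class and its method names are hypothetical, and plain maps stand in for Tika's Metadata objects so the sketch runs without Tika on the classpath.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical helper, not part of Tika. Each map stands in for one
 * Metadata object covering a single <div> section.
 */
class MetadataStack {
    private final Deque<Map<String, String>> stack = new ArrayDeque<>();

    /** Call when a <div> section starts: push a fresh, empty metadata map. */
    void startDiv() {
        stack.push(new LinkedHashMap<>());
    }

    /** Call when a <div> section ends: pop and return that section's metadata. */
    Map<String, String> endDiv() {
        return stack.pop();
    }

    /** Metadata for the section being parsed, or null outside any section. */
    Map<String, String> current() {
        return stack.peek();
    }

    /** Full metadata context as a copy of the stack, innermost section first. */
    List<Map<String, String>> context() {
        return new ArrayList<>(stack);
    }
}
```

A ContentHandler implementor would call startDiv() and endDiv() from startElement and endElement when the element is a div, read current() for the document being handled, and walk context() when the enclosing containers matter.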
  
  == A Solution for Users Who Don't Implement ContentHandler ==
  If XHTML doesn't offer a legal way to associate arbitrary name-value pairs 
with a {{{<div>}}} section, then there don't seem to be options for providing 
full metadata in a single XHTML document. There are at least a couple of 
possibilities for providing a better-than-nothing solution for users who want 
all of the metadata without having to write their own ContentHandler.
