[Tika Wiki] Update of "RecursiveMetadata" by PaulJakubik

Apache Wiki Thu, 12 Aug 2010 13:25:10 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tika Wiki" for change 
notification.


The "RecursiveMetadata" page has been changed by PaulJakubik.
http://wiki.apache.org/tika/RecursiveMetadata?action=diff&rev1=1&rev2=2

--------------------------------------------------

  <<TableOfContents(2)>>
  
  = Introduction =
- After the MetadataDiscussion page was created, Jukka Zitting offered an 
example of how to get to recursive metadata when parsing with an 
AutoDetectParser. In addition to sharing Jukka's example, this page also offers 
some additional details on how, if you are willing to write your own 
ContentHandler, you can capture both text and metadata for each recursive 
document.
+ After the MetadataDiscussion page was created, Jukka Zitting offered an 
example of how to get to recursive metadata when parsing with an 
AutoDetectParser, and later updated that example with how to get both text and 
metadata for nested documents using the AutoDetectParser.
  
- NOTE - This discussion of recursive metadata is from the point of view of 
what might be an oddball use case. The assumption of this page is NOT that you 
would want to take a container file, maybe a zip file, and extract all of the 
text and metadata into a single mega-representation of all of the text and 
metadata found in that container. Instead, this page assumes that what you 
really want to do is to extract the text for each document in the container, 
and be able to see each of these nested documents as a separate entity with its 
own text and metadata.  
+ If you parse an archive (zip, tar, etc.) the parsed document contains other 
documents, and any of those documents could also be archives containing other 
documents, and so on. The example on this page shows you how to do the 
following:
+ 
+  * Set up the parse context so nested documents will be parsed.
+  * Wrap the AutoDetectParser so you can get the text and metadata for each 
nested document.
  
  = Jukka's Example =
- Here is the full source for Jukka's example for how to get access to nested 
metadata. This example writes the metadata for each nested document to standard 
output. More details about how Jukka's example works are available in 
subsections below.
+ Here is the full source for Jukka's example for how to get access to nested 
metadata and document body text. This example writes the metadata and body text 
for each nested document to standard output. More details about how Jukka's 
example works are in subsections below.
  
  {{{
    public static void main(String[] args) throws Exception {
@@ -41, +44 @@

  
         @Override
         public void parse(
-                InputStream stream, ContentHandler handler,
+                InputStream stream, ContentHandler ignore,
                 Metadata metadata, ParseContext context)
                 throws IOException, SAXException, TikaException {
+            ContentHandler content = new BodyContentHandler();
-            super.parse(stream, handler, metadata, context);
+            super.parse(stream, content, metadata, context);
  
             System.out.println("----");
             System.out.println(metadata);
+            System.out.println("----");
+            System.out.println(content.toString());
         }
- 
     }
  }}}
  
@@ -102, +107 @@

  {{{
         @Override
         public void parse(
-                InputStream stream, ContentHandler handler,
+                InputStream stream, ContentHandler ignore,
                 Metadata metadata, ParseContext context)
                 throws IOException, SAXException, TikaException {
+            ContentHandler content = new BodyContentHandler();
-            super.parse(stream, handler, metadata, context);
+            super.parse(stream, content, metadata, context);
  
             System.out.println("----");
             System.out.println(metadata);
+            System.out.println("----");
+            System.out.println(content.toString());
         }
  
     }
  }}}
  
- The parse method is where you get access to the metadata. When the parser set 
in ParseContext is used to parse a nested document, a new Metadata object is 
created and passed to the parse method. Since the example put a 
RecursiveMetadataParser in the ParseContext, RecursiveMetadataParser's parse 
method is called. Before calling {{{super.parse}}}, the metadata object is 
empty. After {{{super.parse}}} returns, the metadata object contains all of the 
metadata the decorated parser found  and {{{System.out.println(metadata)}}} 
prints all of the metadata to standard output.
+ The parse method is where you get access to the metadata and the body text. 
When the parser set in ParseContext is used to parse a nested document, a new 
Metadata object is created and passed to the parse method. Since the example 
put a RecursiveMetadataParser in the ParseContext, RecursiveMetadataParser's 
parse method is called. Before calling {{{super.parse}}}, the metadata object 
is empty. After {{{super.parse}}} returns, the metadata object contains all of 
the metadata the decorated parser found and {{{System.out.println(metadata)}}} 
prints all of the metadata to standard output.
  
+ By creating a new BodyContentHandler and passing that to {{{super.parse}}}, 
the text for each document is captured without mixing it with text from other 
documents.
- = What's Missing from Jukka's Example? =
- Jukka's example shows how you can get metadata for a nested document, but it 
doesn't show how you can get that metadata along with the text for that nested 
document.
  
- If you only need the metadata, then this example is great. If instead you 
want to extract complete documents from containers including both text and 
metadata, then you need to do more.
+ = Surprise! Zips Have Text Too! =
+ The great thing about AutoDetectParser is that it can parse and extract text 
from almost anything. In particular, it can parse zip, tar, tar.bz2, and other 
archives that contain documents. If you have a zip file with 100 text files in 
it, using Jukka's example code you can get the text and metadata for each file 
nested inside of the zip file. What you might not expect is that you also get 
metadata and body text for the zip file itself.
  
+ Maybe this doesn't surprise you at all. My first reaction when I saw both 
metadata AND text for the zip file itself was "What text could a zip file 
possibly have?" My naive assumption was that a zip file wouldn't contain any 
text, and my assumption was wrong.
- == Extracting Text is an Exercise for the Reader ==
- A way to match up the metadata for a document with its text requires you to 
write your own ContentHandler that is able to identify text for individual 
nested documents. Since this page is called RecursiveMetadata and not 
HowToGetASeparateTextBodyForEachNestedDocument, no details are offered for how 
to implement that ContentHandler. While I was hoping there would be help for 
this in Tika's library, after quickly scanning all the handlers I could find in 
http://tika.apache.org/0.7/api/ I didn't see any that offered easy ways to get 
to the text for each contained document as a separate set of text.
  
- Until someone writes a page on how to get the text for each separate document 
in a container as a separate body of text, writing this ContentHandler is an 
exercise left to the reader. I have written a ContentHandler that does this for 
the kinds of files and containers I have tested with, and if no one comes 
forward with an easy way to write this kind of ContentHandler, my experiences 
might become the start of yet another wiki page.
+ I was thinking that a zip, tar, or other archive file was simply a container 
for other files, and so didn't have any text of its own. Tika looks at archives 
differently; Tika sees an archive as being like a directory in a file system, 
and the text for an archive is a list of the contents of the archive.
  
+ If you have a zip file that contains 100 text files, after using the code on 
this page to get the text and metadata for each file, you will get the text and 
metadata for 101 files: 100 text files, and 1 zip file. The text for the zip 
file will list the names for each of the 100 text files it contains.
- == How to get Metadata with Text ==
- Assuming that you have written your own ContentHandler, and that 
ContentHandler can be used to get the text for individual documents in a 
container, how can you associate the metadata for a document with that 
document's text?
  
- The solution I currently use is to create a RecursiveMetadataParser class 
that is constructed with a RecursiveParserListener. The listener is notified 
just before and just after each parse call, and my ContentHandler can implement 
both the ContentHandler and the RecursiveParserListener interfaces. Here is a 
rough example:
+ If you aren't interested in seeing text and metadata for the zip file itself, 
you'll want to take a look at {{{metadata.get(Metadata.CONTENT_TYPE)}}} for 
each file Tika parses so you can skip the archives themselves. For a zip file, 
the content type is "application/zip".
  
- 
- {{{
- public interface RecursiveParserListener {
-     void startSubDocument(Metadata metadata);
-     void endSubDocument();
- }
- 
- public class RecursiveMetadataParser extends ParserDecorator {
-     private final RecursiveParserListener listener;
- 
-     public RecursiveMetadataParser(Parser parser, RecursiveParserListener 
listener) {
-         super(parser);
-         this.listener = listener;
-     }
- 
-     public void parse(InputStream stream, ContentHandler handler, Metadata 
metadata,
-                       ParseContext context) throws IOException, SAXException, 
TikaException {
-         listener.startSubDocument(metadata);
-         super.parse(stream, handler, metadata, context);
-         listener.endSubDocument();
-     }
- }
- 
- class TikaContentHandler implements ContentHandler, RecursiveParserListener {
-     //...
-     public void startSubDocument(Metadata metadata) {stack.push(metadata);}
-     public void endSubDocument() {stack.pop();}
-     //...
-     public void endElement(String uri, String localName, String qName) throws 
SAXException {
-         //...
-         // if this end element means a document is ending
-         Metadata metadata = stack.peek();
-         // do something with metadata and document text
-     }
- 
- }
- 
- }}}
- 
- The basic idea is that if you have gone to the trouble of implementing a 
ContentHandler capable of identifying text for each individual nested document, 
then if you can also get notifications for when a subdocument with separate 
metadata starts and ends, you can keep track of this metadata and associate it 
with the text you extract.
- 
- Hopefully this example offers an idea of what you would have to do to get 
both the text and metadata for a nested document. 
- 
- = A Possibly Misplaced or Inappropriate Wish for Tika =
- While it is possible to get the text for each nested document in a container 
using Tika, and it is possible to get the metadata for each nested document, it 
would be nice if Tika offered an easy way to get both the text and the metadata 
for a nested document together as a single entity. 
- 
- Tika seems to want to turn any file you give it into a single XHTML document, 
or the stream of ContentHandler events you would get if you were parsing that 
single XHTML document. Containers that aren't logically a single document 
(containers that are logically single documents include OLE2 and .xlsx) don't 
live comfortably inside this single document model. Because Tika does a great 
job of identifying and parsing a wide variety of container types, and because 
Tika is being extended to identify when a container is logically a single 
document and when a container is logically many separate documents, it would be 
nice if there was a better way for Tika to return the metadata and text for 
containers that are logically many separate documents.
-
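
The advice in the added text to check {{{metadata.get(Metadata.CONTENT_TYPE)}}} and skip the archives themselves could be sketched roughly as follows. This is a minimal sketch that does not depend on Tika: the class name ArchiveFilter, the method isArchive, and every media type in the set other than "application/zip" are assumptions made for illustration and are not taken from the wiki page, so check what your Tika version actually reports. The idea is to pass in the string returned by metadata.get(Metadata.CONTENT_TYPE) once super.parse has returned.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration only; it is not part of Tika.
class ArchiveFilter {

    // Only "application/zip" is confirmed by the wiki page.
    // The other two entries are assumed examples of archive media types.
    private static final Set<String> ARCHIVE_TYPES = Set.of(
            "application/zip",
            "application/x-tar",
            "application/x-bzip2");

    // Returns true when the content type names one of the archive types above.
    static boolean isArchive(String contentType) {
        if (contentType == null) {
            return false;
        }
        // Drop any parameters such as "; charset=UTF-8" before comparing.
        int semicolon = contentType.indexOf(';');
        String base = semicolon >= 0
                ? contentType.substring(0, semicolon)
                : contentType;
        return ARCHIVE_TYPES.contains(base.trim().toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        // Stand-ins for values a parser might report for nested documents.
        String[] reported = {"application/zip", "text/plain; charset=UTF-8", null};
        for (String type : reported) {
            System.out.println(type + " -> " + (isArchive(type) ? "skip" : "keep"));
        }
    }
}
```

Inside RecursiveMetadataParser's parse method, a check like this would wrap the two println calls, so that only the 100 text files from the zip example are printed and the listing for the zip file itself is dropped.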
