[Tika Wiki] Update of "RecursiveMetadata" by PaulJakubik

Apache Wiki Mon, 02 Aug 2010 13:10:04 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tika Wiki" for change 
notification.


The "RecursiveMetadata" page has been changed by PaulJakubik.
http://wiki.apache.org/tika/RecursiveMetadata

--------------------------------------------------

New page:
#format wiki
#language en
#pragma section-numbers off

'''Index'''

<<TableOfContents(2)>>

= Introduction =
After the MetadataDiscussion page was created, Jukka Zitting offered an example 
of how to get to recursive metadata when parsing with an AutoDetectParser. In 
addition to sharing Jukka's example, this page also offers some additional 
details on how, if you are willing to write your own ContentHandler, you can 
capture both text and metadata for each recursive document.

NOTE - This discussion of recursive metadata is from the point of view of what 
might be an oddball use case. The assumption of this page is NOT that you would 
want to take a container file, maybe a zip file, and extract all of the text 
and metadata into a single mega-representation of all of the text and metadata 
found in that container. Instead, this page assumes that what you really want 
to do is to extract the text for each document in the container, and be able to 
see each of these nested documents as a separate entity with its own text and 
metadata.  

= Jukka's Example =
Here is the full source for Jukka's example of how to get access to nested 
metadata. This example writes the metadata for each nested document to standard 
output. More details about how Jukka's example works are available in 
subsections below.

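{{{
import java.io.File;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.ParserDecorator;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// The enclosing class name is arbitrary; the original snippet did not show one.
public class RecursiveMetadataExample {

    public static void main(String[] args) throws Exception {
        // The same decorated parser handles the root document and, through
        // the ParseContext, every nested document.
        Parser parser = new RecursiveMetadataParser(new AutoDetectParser());
        ParseContext context = new ParseContext();
        context.set(Parser.class, parser);

        // DefaultHandler discards all text; only the metadata is of interest.
        ContentHandler handler = new DefaultHandler();
        Metadata metadata = new Metadata();

        InputStream stream = TikaInputStream.get(new File(args[0]));
        try {
            parser.parse(stream, handler, metadata, context);
        } finally {
            stream.close();
        }
    }

    private static class RecursiveMetadataParser extends ParserDecorator {

        public RecursiveMetadataParser(Parser parser) {
            super(parser);
        }

        @Override
        public void parse(
                InputStream stream, ContentHandler handler,
                Metadata metadata, ParseContext context)
                throws IOException, SAXException, TikaException {
            // Let the decorated parser fill in the metadata first.
            super.parse(stream, handler, metadata, context);

            System.out.println("----");
            System.out.println(metadata);
        }

    }
}
}}}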

== Main from Jukka's Example ==

=== Setting up Recursive Parsing ===
{{{
  public static void main(String[] args) throws Exception {
       Parser parser = new RecursiveMetadataParser(new AutoDetectParser());
       ParseContext context = new ParseContext();
       context.set(Parser.class, parser);
}}}

The example starts by setting up recursive parsing. If you are parsing text 
files, Word documents, etc., then you'll never notice whether recursive parsing 
is enabled or not. If you are parsing containers like zip files and tar.gz files, 
the only way to get the text for the files contained by the containers is to 
enable recursive parsing.

The way to enable recursive parsing is to create a ParseContext and add a 
parser to it as shown on the line {{{context.set(Parser.class, parser)}}}. This 
is the parser that will be used to parse any nested documents. 

In this case the parser is a RecursiveMetadataParser that is a wrapper around 
an AutoDetectParser. The RecursiveMetadataParser class is part of Jukka's example 
and more details are given below.
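
To make the lookup idea concrete, here is a minimal sketch that uses no Tika classes at all. {{{MiniContext}}} and {{{MiniParser}}} are hypothetical stand-ins I made up for illustration, and the sketch assumes that a container parser asks the context for a parser by type and falls back to one that does nothing when none was registered. It is not how Tika is implemented, just the shape of the idea.

```java
import java.util.HashMap;
import java.util.Map;

class ContextLookupSketch {
    // Stand-in for a nested-document parser; NOT Tika's Parser interface.
    interface MiniParser {
        String describe();
    }

    // Stand-in for a type-keyed context; NOT Tika's ParseContext.
    static class MiniContext {
        private final Map<Class<?>, Object> values = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            values.put(key, value);
        }

        <T> T get(Class<T> key, T defaultValue) {
            Object value = values.get(key);
            return value == null ? defaultValue : key.cast(value);
        }
    }

    // What a container parser might do when it reaches a nested entry
    // (assumed behavior, for illustration only).
    static String nestedParserFor(MiniContext context) {
        MiniParser skipEverything = () -> "no parser set: nested entries are skipped";
        return context.get(MiniParser.class, skipEverything).describe();
    }

    public static void main(String[] args) {
        MiniContext context = new MiniContext();
        System.out.println(nestedParserFor(context));

        context.set(MiniParser.class, () -> "registered parser: nested entries are parsed");
        System.out.println(nestedParserFor(context));
    }
}
```

The point of keying by type is that the container parser never needs to know which concrete parser you registered, which is what lets a decorator slip in unnoticed.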

=== Parsing a File ===
{{{
       ContentHandler handler = new DefaultHandler();
       Metadata metadata = new Metadata();

       InputStream stream = TikaInputStream.get(new File(args[0]));
       try {
           parser.parse(stream, handler, metadata, context);
       } finally {
           stream.close();
       }
}}}

The rest of the main function parses a file. The parser used to parse the root 
document is the same parser that was added to the ParseContext as the parser to 
use for nested documents. 

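Looking at the Tika API (http://tika.apache.org/0.7/api/), I don't see a 
DefaultHandler class or a TikaInputStream. DefaultHandler is not a Tika class: 
it is org.xml.sax.helpers.DefaultHandler from the standard SAX API, and it 
ignores every event it receives. TikaInputStream appears to be newer than the 
0.7 release. If you want the text, you could use BodyContentHandler in place of 
DefaultHandler, and with 0.7 you could use FileInputStream in place of 
TikaInputStream. 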
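
As a rough illustration of what the choice of handler means, the following sketch uses only the JDK's SAX classes and no Tika at all. A plain org.xml.sax.helpers.DefaultHandler ignores every event, so no text is kept; a subclass that overrides {{{characters}}} keeps it. The class names and the hand-written XHTML string are hypothetical; the string only stands in for the events a parser would send.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class HandlerSketch {
    // Parses a hand-written XHTML string (a stand-in for parser output)
    // and returns the character data the handler collected.
    static String collect(String xhtml) throws Exception {
        TextCollectingHandler handler = new TextCollectingHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getText();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(
                collect("<html><body><p>Hello from a document</p></body></html>"));
    }
}

// Hypothetical handler: keeps the character data DefaultHandler would ignore.
class TextCollectingHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    String getText() {
        return text.toString();
    }
}
```

Swapping {{{TextCollectingHandler}}} for a bare {{{DefaultHandler}}} would parse the same input without error and simply throw the text away, which is all Jukka's example needs since it only prints metadata.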

== Jukka's RecursiveMetadataParser ==
=== RecursiveMetadataParser Constructor ===
{{{
   private static class RecursiveMetadataParser extends ParserDecorator {

       public RecursiveMetadataParser(Parser parser) {
           super(parser);
       }
}}}

The RecursiveMetadataParser extends ParserDecorator. All the constructor has to 
do is let the ParserDecorator superclass know which parser object is being 
decorated.

=== RecursiveMetadataParser parse ===
{{{
       @Override
       public void parse(
               InputStream stream, ContentHandler handler,
               Metadata metadata, ParseContext context)
               throws IOException, SAXException, TikaException {
           super.parse(stream, handler, metadata, context);

           System.out.println("----");
           System.out.println(metadata);
       }

   }
}}}

The parse method is where you get access to the metadata. When the parser set 
in ParseContext is used to parse a nested document, a new Metadata object is 
created and passed to the parse method. Since the example put a 
RecursiveMetadataParser in the ParseContext, RecursiveMetadataParser's parse 
method is called. Before calling {{{super.parse}}}, the metadata object is 
empty. After {{{super.parse}}} returns, the metadata object contains all of the 
metadata the decorated parser found, and {{{System.out.println(metadata)}}} 
prints all of the metadata to standard output.
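
One consequence of printing after {{{super.parse}}} returns is the order of the output: nested documents are reported before the container that holds them, because the container's own parse call does not return until all of its entries have been parsed. The sketch below shows that ordering with made-up stand-in types ({{{MiniParser}}}, {{{FakeContainerParser}}} and {{{RecordingParser}}} are hypothetical and are not Tika classes); it records names in a list rather than printing them.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DecoratorOrderSketch {
    // Stand-in for a parser; NOT Tika's Parser interface.
    interface MiniParser {
        void parse(String name, Map<String, String> metadata, MiniParser nestedParser);
    }

    // Fake container parser: fills in metadata, then hands every entry it
    // "contains" to the nested parser with a fresh metadata map.
    static class FakeContainerParser implements MiniParser {
        private final Map<String, List<String>> entries;

        FakeContainerParser(Map<String, List<String>> entries) {
            this.entries = entries;
        }

        @Override
        public void parse(String name, Map<String, String> metadata, MiniParser nestedParser) {
            metadata.put("name", name);
            for (String child : entries.getOrDefault(name, List.of())) {
                nestedParser.parse(child, new LinkedHashMap<>(), nestedParser);
            }
        }
    }

    // Decorator: delegate first, then record the now-populated metadata.
    static class RecordingParser implements MiniParser {
        private final MiniParser delegate;
        final List<String> seen = new ArrayList<>();

        RecordingParser(MiniParser delegate) {
            this.delegate = delegate;
        }

        @Override
        public void parse(String name, Map<String, String> metadata, MiniParser nestedParser) {
            delegate.parse(name, metadata, nestedParser);
            seen.add(metadata.get("name"));
        }
    }

    static List<String> run() {
        Map<String, List<String>> entries = Map.of("archive.zip", List.of("a.txt", "b.txt"));
        RecordingParser parser = new RecordingParser(new FakeContainerParser(entries));
        parser.parse("archive.zip", new LinkedHashMap<>(), parser);
        return parser.seen;
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints [a.txt, b.txt, archive.zip]
    }
}
```

If you need the container's metadata first, you would have to act before delegating, at which point the metadata has not been filled in yet.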

= What's Missing from Jukka's Example? =
Jukka's example shows how you can get metadata for a nested document, but it 
doesn't show how you can get that metadata along with the text for that nested 
document.

If you only need the metadata, then this example is great. If instead you want 
to extract complete documents from containers including both text and metadata, 
then you need to do more.

== Extracting Text is an Exercise for the Reader ==
Matching up the metadata for a document with its text requires you to write 
your own ContentHandler that is able to identify text for individual nested 
documents. Since this page is called RecursiveMetadata and not 
HowToGetASeparateTextBodyForEachNestedDocument, no details are offered for how 
to implement that ContentHandler. While I was hoping there would be help for 
this in Tika's library, after quickly scanning all the handlers I could find in 
http://tika.apache.org/0.7/api/ I didn't see any that offered easy ways to get 
to the text for each contained document as a separate set of text.

Until someone writes a page on how to get the text for each separate document 
in a container as a separate body of text, writing this ContentHandler is an 
exercise left to the reader. I have written a ContentHandler that does this for 
the kinds of files and containers I have tested with, and if no one comes 
forward with an easy way to write this kind of ContentHandler, my experiences 
might become the start of yet another wiki page.

== How to get Metadata with Text ==
Assuming that you have written your own ContentHandler, and that ContentHandler 
can be used to get the text for individual documents in a container, how can 
you associate the metadata for a document with that document's text?

The solution I currently use is to create a RecursiveMetadataParser class that 
is constructed with a RecursiveParserListener. The listener is notified just 
before and just after each parse call, and my ContentHandler can implement both 
the ContentHandler and the RecursiveParserListener interfaces. Here is a rough 
example:


{{{
public interface RecursiveParserListener {
    void startSubDocument(Metadata metadata);
    void endSubDocument();
}

public class RecursiveMetadataParser extends ParserDecorator {
    private final RecursiveParserListener listener;

    public RecursiveMetadataParser(Parser parser,
                                   RecursiveParserListener listener) {
        super(parser);
        this.listener = listener;
    }

    @Override
    public void parse(InputStream stream, ContentHandler handler,
                      Metadata metadata, ParseContext context)
            throws IOException, SAXException, TikaException {
        listener.startSubDocument(metadata);
        super.parse(stream, handler, metadata, context);
        listener.endSubDocument();
    }
}

class TikaContentHandler implements ContentHandler, RecursiveParserListener {
    //...
    public void startSubDocument(Metadata metadata) { stack.push(metadata); }
    public void endSubDocument() { stack.pop(); }
    //...
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        //...
        // if this end element means a document is ending
        Metadata metadata = stack.peek();
        // do something with metadata and document text
    }
}
}}}

The basic idea is that if you have gone to the trouble of implementing a 
ContentHandler capable of identifying text for each individual nested document, 
then if you can also get notifications for when a subdocument with separate 
metadata starts and ends, you can keep track of this metadata and associate it 
with the text you extract.

Hopefully this example offers an idea of what you would have to do to get both 
the text and metadata for a nested document. 
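
For anyone who wants to experiment with the stack idea without Tika on the classpath, here is a small self-contained sketch. It is a variation on the approach above, not my actual handler: each stack frame holds a text buffer next to the metadata, a plain {{{Map}}} stands in for Tika's Metadata, and the events are fired by hand instead of coming from a parser. All names in it are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import org.xml.sax.helpers.DefaultHandler;

class SubDocumentSketch {
    // Same shape as the listener idea, with Map standing in for Metadata.
    interface SubDocumentListener {
        void startSubDocument(Map<String, String> metadata);
        void endSubDocument();
    }

    // One frame per open document: its metadata plus the text seen so far.
    static class Frame {
        final Map<String, String> metadata;
        final StringBuilder text = new StringBuilder();

        Frame(Map<String, String> metadata) {
            this.metadata = metadata;
        }
    }

    static class PairingHandler extends DefaultHandler implements SubDocumentListener {
        private final Deque<Frame> stack = new ArrayDeque<>();
        final List<String> finished = new ArrayList<>();

        @Override
        public void startSubDocument(Map<String, String> metadata) {
            stack.push(new Frame(metadata));
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Text always belongs to the innermost open document.
            if (!stack.isEmpty()) {
                stack.peek().text.append(ch, start, length);
            }
        }

        @Override
        public void endSubDocument() {
            Frame done = stack.pop();
            finished.add(done.metadata.get("name") + " -> " + done.text.toString().trim());
        }
    }

    private static void emit(PairingHandler handler, String text) {
        char[] chars = text.toCharArray();
        handler.characters(chars, 0, chars.length);
    }

    // Fires the events a decorated parser might fire for a zip with one entry.
    static List<String> demo() {
        PairingHandler handler = new PairingHandler();
        handler.startSubDocument(Map.of("name", "outer.zip"));
        emit(handler, "listing ");
        handler.startSubDocument(Map.of("name", "inner.txt"));
        emit(handler, "hello");
        handler.endSubDocument();
        handler.endSubDocument();
        return handler.finished;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [inner.txt -> hello, outer.zip -> listing]
    }
}
```

Keeping the text buffer on the same stack as the metadata means the pairing happens in {{{endSubDocument}}} and the handler does not have to guess from element names where a document ends. Whether that fits depends on how your handler already separates nested text.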

= A Possibly Misplaced or Inappropriate Wish for Tika =
While it is possible to get the text for each nested document in a container 
using Tika, and it is possible to get the metadata for each nested document, it 
would be nice if Tika offered an easy way to get both the text and the metadata 
for a nested document together as a single entity. 

Tika seems to want to turn any file you give it into a single XHTML document, 
or the stream of ContentHandler events you would get if you were parsing that 
single XHTML document. Containers that aren't logically a single document 
(containers that are logically single documents include OLE2 and .xlsx) don't 
live comfortably inside this single document model. Because Tika does a great 
job of identifying and parsing a wide variety of container types, and because 
Tika is being extended to identify when a container is logically a single 
document and when a container is logically many separate documents, it would be 
nice if there was a better way for Tika to return the metadata and text for 
containers that are logically many separate documents.
