Hi Guys,

There is a great discussion going on around TIKA-1445 right now
that I wanted to bring to the dev list:

http://issues.apache.org/jira/browse/TIKA-1445

What we are seeing from OCR and GDAL lately is that there may be a
use case to have multiple parsers called for the same MediaType.
In this fashion, each parser contributes *more* metadata and content
handling, rather than simply replacing it, or being the only Parser
selected to contribute to it. Tim brought up the following questions
that I wanted to respond to here on list:

{quote} How will we handle: 1) Two parsers both "set" a value in
the Metadata object? Will the second overwrite the value of the
first?  2) Content: How will we know when a document ends?
AutoDetectParser would wrap the handler in an
EndDocumentShieldingContentHandler and then call endDocument when
done?  3) Will the user be able to parse the output from the handler
to figure out which parser is responsible for which content? Let's
say a user wants to pull the electronic text out of a PDF and render
the page as an image and then run it through OCR, would we have
something like <div parser="o.a.t.p.PDFParser"> or similar?  If we
go this route, we'd want to make sure we don't have literally
duplicate parsers (as we do now).  This sounds more complicated
than having parent parsers know which children they control and how
to control them, but, it might make sense.  Aside from OCR
{quote}

Here are my replies:

#1 We will use a default policy of "append", which lets the Metadata
object append values to the same key rather than replace them.
We could also couple this with X-Parsed-By, an ordered list of the
Parsers that ran, so that we can reconstruct which Parser contributed
which field. If a key is multi-valued, we can also add fields for
offsets, etc. An alternative would be to prefix metadata keys in this
CompositeParser with the X-Parsed-By parser name, to avoid conflicts.
Users would be able to switch the policy from "append" to "overwrite",
in which case conflicts are not a problem: the last parser to write
to a conflicting key simply takes precedence. One option with
overwrite would be to let the policy specify a precedence order of
Parsers (e.g., the current service list could be a precedence order).
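
To make the two policies in #1 concrete, here is a rough sketch in plain
Java. It deliberately does not use Tika's Metadata class; MergedMetadata,
Policy and contribute() are hypothetical names made up for illustration,
and the parsedBy list only stands in for the X-Parsed-By idea.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the append/overwrite policies; not Tika API. */
final class MergedMetadata {
    enum Policy { APPEND, OVERWRITE }

    private final Policy policy;
    private final Map<String, List<String>> values = new LinkedHashMap<>();
    // Ordered list of contributing parsers, standing in for X-Parsed-By.
    private final List<String> parsedBy = new ArrayList<>();

    MergedMetadata(Policy policy) {
        this.policy = policy;
    }

    /** Records one parser's fields; call once per parser, in call order. */
    void contribute(String parserName, Map<String, String> fields) {
        parsedBy.add(parserName);
        for (Map.Entry<String, String> field : fields.entrySet()) {
            List<String> existing =
                values.computeIfAbsent(field.getKey(), k -> new ArrayList<>());
            if (policy == Policy.OVERWRITE) {
                // Last parser to write a conflicting key wins.
                existing.clear();
            }
            existing.add(field.getValue());
        }
    }

    List<String> get(String key) {
        List<String> found = values.get(key);
        return found == null
            ? Collections.<String>emptyList()
            : Collections.unmodifiableList(found);
    }

    List<String> parsedBy() {
        return Collections.unmodifiableList(parsedBy);
    }
}
```

With APPEND, two parsers that both set "title" leave two values under that
key, in call order. With OVERWRITE only the later value survives, so the
order in which parsers are called acts as the precedence order.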

That said, how sure are we that this is a *real* problem? It may be
that parsers handling the same MediaType mostly contribute vastly
different, non-overlapping keys to the Metadata object.

#2 I like your suggestion. The alternative, as I suggested, would be
to reset the stream to the beginning after each parser, or to keep a
copy of the original stream and clone it for each Parser that gets
called.
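
The "keep a copy and clone it" option in #2 could look roughly like the
sketch below. It uses only java.io; ReplayableInput and StreamConsumer are
hypothetical names, and StreamConsumer merely stands in for a Parser call.
Buffering everything in memory is an assumption that would not hold for
large files, where spooling to a temporary file would be needed instead.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.List;

/** Hypothetical sketch: buffer the stream once, replay it per parser. */
final class ReplayableInput {
    /** Stand-in for one parser invocation; not a Tika interface. */
    interface StreamConsumer {
        void consume(InputStream in) throws IOException;
    }

    private final byte[] bytes;

    /** Reads the source stream to its end and keeps a private copy. */
    ReplayableInput(InputStream source) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int read;
        while ((read = source.read(chunk)) != -1) {
            buffer.write(chunk, 0, read);
        }
        this.bytes = buffer.toByteArray();
    }

    /** Returns a new stream positioned at the start of the document. */
    InputStream freshStream() {
        return new ByteArrayInputStream(bytes);
    }

    /** Gives every consumer its own stream over the same bytes. */
    void feedAll(List<StreamConsumer> consumers) throws IOException {
        for (StreamConsumer consumer : consumers) {
            try (InputStream in = freshStream()) {
                consumer.consume(in);
            }
        }
    }
}
```

Each consumer sees the document from the first byte regardless of how much
the previous one read, which is the property we need when several parsers
handle the same MediaType.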

#3 I like your idea about wrapping content provided by handlers
with the parser attribute. Very neat, let's try that!
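
A minimal sketch of how #2 and #3 might fit together, using only the SAX
classes that ship with the JDK. ParserTaggingHandler is a hypothetical
name, not an existing Tika class: it turns each child parser's
startDocument/endDocument into an opening and closing
<div parser="..."> on the real handler, so the caller can fire the real
startDocument/endDocument exactly once around all parsers.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch: tags one parser's output with its name. */
final class ParserTaggingHandler extends DefaultHandler {
    private static final String XHTML = "http://www.w3.org/1999/xhtml";

    private final ContentHandler delegate;
    private final String parserName;

    ParserTaggingHandler(ContentHandler delegate, String parserName) {
        this.delegate = delegate;
        this.parserName = parserName;
    }

    @Override
    public void startDocument() throws SAXException {
        // Shield the real handler: open a wrapper div instead.
        AttributesImpl atts = new AttributesImpl();
        atts.addAttribute("", "parser", "parser", "CDATA", parserName);
        delegate.startElement(XHTML, "div", "div", atts);
    }

    @Override
    public void endDocument() throws SAXException {
        // Close the wrapper div; the caller ends the document later.
        delegate.endElement(XHTML, "div", "div");
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) throws SAXException {
        delegate.startElement(uri, localName, qName, atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        delegate.endElement(uri, localName, qName);
    }

    @Override
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        delegate.characters(ch, start, length);
    }
    // Other SAX events (prefix mappings, ignorable whitespace, ...) are
    // dropped here to keep the sketch short.
}
```

The composite would wrap the user's handler in a new ParserTaggingHandler
for each parser it calls, which also lets a user tell afterwards which
parser produced which part of the content.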

OK, thanks. I will add this to the JIRA issue too, but I think this
is a good thing to have on the dev@ list.

Cheers, 
Chris



++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++



