RE: TIKA-1445 and having multiple Parsers (as many as needed) work on the same MediaType

Allison, Timothy B. Tue, 18 Nov 2014 07:53:49 -0800

Chris,
  Thank you for moving this to the dev list.  This would be a fairly large 
change, and the discussion is valuable.

-----Original Message-----
From: Mattmann, Chris A (3980) [mailto:[email protected]] 
Sent: Monday, November 17, 2014 5:25 PM
To: [email protected]
Subject: TIKA-1445 and having multiple Parsers (as many as needed) work on the 
same MediaType

Hi Guys,

There is a great discussion going on around TIKA-1445 right now
that I wanted to bring to the dev list:

http://issues.apache.org/jira/browse/TIKA-1445

What we are seeing from OCR and GDAL lately is that there may be a
use case to have multiple parsers called for the same MediaType.
In this fashion, each parser contributes *more* metadata and content
handling, rather than simply replacing it, or being the only Parser
selected to contribute to it. Tim brought up the following questions
that I wanted to respond to here on list:

{quote} How will we handle: 1) Two parsers both "set" a value in
the Metadata object? Will the second overwrite the value of the
first?  2) Content: How will we know when a document ends?
AutoDetectParser would wrap the handler in an
EndDocumentShieldingContentHandler and then call endDocument when
done?  3) Will the user be able to parse the output from the handler
to figure out which parser is responsible for which content? Let's
say a user wants to pull the electronic text out of a PDF and render
the page as an image and then run it through OCR, would we have
something like <div parser="o.a.t.p.PDFParser"> or similar?  If we
go this route, we'd want to make sure we don't have literally
duplicate parsers (as we do now).  This sounds more complicated
than having parent parsers know which children they control and how
to control them, but, it might make sense.  Aside from OCR
{quote}

Here are my replies:

#1 We will use a default policy of "append", which allows the Metadata
object to append values to the same key rather than replace them.
We could also couple this with X-Parsed-By, which is an ordered
list of which Parsers ran, so that we can reconstruct which
Parser contributed which field. If it's multi-valued, we can also
add fields for Offsets, etc.  An alternative here would be to
prefix metadata keys in this CompositeParser with the X-Parsed-By
parser name, to avoid conflicts. Users would be able to switch the
policy from "append" to "overwrite", in which case this isn't a
problem: we simply let the last parser to write to a conflicting key
take precedence. One option with overwrite would be to let the policy
accept a precedence order of Parsers (e.g., the current service list
could be a precedence order).

That said, how sure are we that this is a *real* problem? Don't some
parsers parse the same MediaType but contribute vastly different,
non-overlapping keys to the metadata object?

>>I agree that different parsers contribute vastly different metadata keys, 
>>and, frankly, in the current use case, the tesseract parser should add nearly 
>>zero metadata, so this won't be an issue.  However, if we're going to change 
>>the way we've been doing things generally, I wanted us to think of the 
>>implications.  The root of my initial concern with this is that the child 
>>parsers choose whether or not to add or set.  

>>Oh, but wait, ok, so what we'd actually do is send in a new metadata object 
>>for each parser and then at the CompositeParser level, we'd make the decision 
>>on whether to append or overwrite the data that we got from each Metadata 
>>object.  But wait, aren't there some Properties that only allow one value 
>>(e.g. TikaCoreProperties.TITLE)?  Ok, so, when we merge the Metadata objects, 
>>we just get String(s) as keys, so we lose the Property restrictions.  Will 
>>this wreck XMP or lead to a bad day for people expecting these restrictions?
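
The merge step described above (one metadata object per parser, combined at the CompositeParser level under an "append" or "overwrite" policy) could be sketched roughly as follows. This is only an illustration under assumptions: plain Java maps stand in for Tika's Metadata object, and the names MetadataMergeSketch, Policy, and merge are hypothetical, not part of Tika.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: plain maps stand in for Tika's Metadata object.
final class MetadataMergeSketch {

    enum Policy { APPEND, OVERWRITE }

    /**
     * Merges per-parser metadata in the order the parsers were called.
     * APPEND keeps every value under its key; OVERWRITE lets the last parser win.
     */
    static Map<String, List<String>> merge(
            List<Map<String, List<String>>> perParser, Policy policy) {
        Map<String, List<String>> merged = new LinkedHashMap<>();
        for (Map<String, List<String>> one : perParser) {
            for (Map.Entry<String, List<String>> e : one.entrySet()) {
                if (policy == Policy.OVERWRITE) {
                    // Replace whatever an earlier parser stored under this key.
                    merged.put(e.getKey(), new ArrayList<>(e.getValue()));
                } else {
                    // Keep earlier values and add the new ones after them.
                    merged.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                          .addAll(e.getValue());
                }
            }
        }
        return merged;
    }
}
```

Note that under APPEND, two parsers that each set a title leave two values under that key, which is exactly the single-value Property concern raised above; the sketch does nothing to enforce such restrictions.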

#2 I like your suggestion. The alternative I suggested would be to
reset the stream to the beginning after each parser, or to keep a
copy of the original stream and clone it for each Parser that is
called?

>>I think we're talking about different things.  Yes, we'll definitely need to 
>>reset or spool the stream depending on its length.  My concern was more with 
>>the handlers.  If the first parser calls endDocument() and we don't shield 
>>that, then if someone uses the BodyContentHandler, then they might not see 
>>contents from the second/third parser because the initial parser "ended" the 
>>document.  I need to test this concern, but I think that this was the root of 
>>TIKA-1124.
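
The handler concern above can be illustrated with a small SAX filter that swallows each child parser's endDocument() and lets the composite end the document exactly once. This is a sketch under assumptions, built only on the JDK's org.xml.sax classes; it is not Tika's EndDocumentShieldingContentHandler, and the names EndDocumentShield and reallyEndDocument are hypothetical.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical sketch: forwards all SAX events except repeated
// startDocument() calls and every child-issued endDocument().
final class EndDocumentShield extends XMLFilterImpl {
    private boolean started = false;

    EndDocumentShield(ContentHandler downstream) {
        setContentHandler(downstream);
    }

    @Override
    public void startDocument() throws SAXException {
        // Only the first parser's startDocument() reaches the real handler.
        if (!started) {
            started = true;
            super.startDocument();
        }
    }

    @Override
    public void endDocument() {
        // Swallowed: a child parser must not close the shared document.
    }

    /** Called once by the composite after the last child parser has run. */
    void reallyEndDocument() throws SAXException {
        super.endDocument();
    }
}
```

Each child parser would be handed the shield instead of the user's handler, and the composite would call reallyEndDocument() after the last child returns.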

#3 I like your idea about wrapping content provided by handlers
with the parser attribute. Very neat, let's try that!
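
One way the per-parser wrapping might look at the SAX level is sketched below: the composite emits a div carrying a parser attribute around whatever each child parser writes. This is a sketch under assumptions using only the JDK's org.xml.sax classes; ParserDivSketch, Body, and wrap are hypothetical names, and the attribute name "parser" simply follows the example in the quoted question.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical sketch: wraps one parser's output in <div parser="...">.
final class ParserDivSketch {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    /** Stand-in for "run this child parser against the handler". */
    interface Body {
        void emit(ContentHandler handler) throws SAXException;
    }

    static void wrap(ContentHandler handler, String parserName, Body body)
            throws SAXException {
        AttributesImpl attrs = new AttributesImpl();
        // Record which parser produced the enclosed content.
        attrs.addAttribute("", "parser", "parser", "CDATA", parserName);
        handler.startElement(XHTML, "div", "div", attrs);
        body.emit(handler);
        handler.endElement(XHTML, "div", "div");
    }
}
```

A consumer could then split the output by the parser attribute to tell, for example, extracted electronic text from OCR text.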
