Chris,

Thank you for moving this to the dev list. This would be a fairly large change, and the discussion is valuable.

-----Original Message-----
From: Mattmann, Chris A (3980) [mailto:[email protected]]
Sent: Monday, November 17, 2014 5:25 PM
To: [email protected]
Subject: TIKA-1445 and having multiple Parsers (as many as needed) work on the same MediaType

Hi Guys,

There is a great discussion going on around TIKA-1445 right now that I wanted to bring to the dev list:

http://issues.apache.org/jira/browse/TIKA-1445

What we are seeing from OCR and GDAL lately is that there may be a use case to have multiple parsers called for the same MediaType. In this fashion, each parser contributes *more* metadata and content handling, rather than simply replacing it, or being the only Parser selected to contribute to it.

Tim brought up the following questions that I wanted to respond to here on list:

{quote}
How will we handle:

1) Two parsers both "set" a value in the Metadata object? Will the second overwrite the value of the first?
2) Content: How will we know when a document ends? AutoDetectParser would wrap the handler in an EndDocumentShieldingContentHandler and then call endDocument when done?
3) Will the user be able to parse the output from the handler to figure out which parser is responsible for which content? Let's say a user wants to pull the electronic text out of a PDF and render the page as an image and then run it through OCR, would we have something like <div parser="o.a.t.p.PDFParser"> or similar?

If we go this route, we'd want to make sure we don't have literally duplicate parsers (as we do now). This sounds more complicated than having parent parsers know which children they control and how to control them, but, it might make sense. Aside from OCR
{quote}

Here are my replies:

#1 We will use a default policy of "append" which allows the Metadata object to append values to the same key, rather than replace them. We could also couple this with X-Parsed-By, which is an ordered list of what Parser parsed what so that we can reconstruct what Parser contributed what field.
If it's multi-valued, we can also add fields for Offsets, etc. An alternative here would also be to prefix metadata keys in this CompositeParser by the X-Parsed-By parser name, to avoid conflicts.

Users would be able to switch the policy from "append" to "overwrite" in which this isn't a problem, and we simply allow the last parser to input into a conflicting key to be the one that takes precedence. One option with overwrite would be to allow in this policy for providing a precedence order of Parsers (e.g., the current service list could be a precedence order).

That said, how sure are we that this is a *real* problem? Some parsers parse the same MediaType but contribute vastly different and non overlapping keys to the metadata object?

>>I agree that different parsers contribute vastly different metadata keys,
>>and, frankly, in the current use case, the tesseract parser should add nearly
>>zero metadata, so this won't be an issue. However, if we're going to change
>>the way we've been doing things generally, I wanted us to think of the
>>implications. The root of my initial concern with this is that the child
>>parsers choose whether or not to add or set.
>>Oh, but wait, ok, so what we'd actually do is send in a new metadata object
>>for each parser and then at the CompositeParser level, we'd make the decision
>>on whether to append or overwrite the data that we got from each Metadata
>>object. But wait, aren't there some Properties that only allow one value
>>(e.g. TikaCoreProperties.TITLE)? Ok, so, when we merge the Metadata objects,
>>we just get String(s) as keys, so we lose the Property restrictions. Will
>>this wreck XMP or lead to a bad day for people expecting these restrictions?

#2 I like your suggestion - or the alternative as I suggested would be to reset the stream to the beginning after each parser, or alternatively keep a clone of the original stream as a copy, and then clone it for each called Parser attempt?

>>I think we're talking about different things. Yes, we'll definitely need to
>>reset or spool the stream depending on its length. My concern was more with
>>the handlers. If the first parser calls endDocument() and we don't shield
>>that, then if someone uses the BodyContentHandler, then they might not see
>>contents from the second/third parser because the initial parser "ended" the
>>document. I need to test this concern, but I think that this was the root of
>>TIKA-1124.

#3 I like your idea about wrapping content provided by handlers with the parser attribute. Very neat, let's try that!
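
To make the #1 discussion concrete, here is a rough sketch of merging per-parser metadata at the composite level under an "append" or "overwrite" policy, with X-Parsed-By kept as an ordered list. It is only an illustration: plain maps stand in for Tika's Metadata objects, and the names MetadataMergeSketch, Policy, and merge are made up for this example, not existing Tika API. It also does not enforce single-valued Properties such as TikaCoreProperties.TITLE, which is exactly the open question above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: plain maps stand in for Tika's Metadata objects.
class MetadataMergeSketch {

    enum Policy { APPEND, OVERWRITE }

    static final String PARSED_BY = "X-Parsed-By";

    // parserOutputs maps parser name -> the metadata that parser produced,
    // in the order the parsers were called (use an ordered map).
    static Map<String, List<String>> merge(
            Map<String, Map<String, List<String>>> parserOutputs, Policy policy) {
        Map<String, List<String>> merged = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, List<String>>> parser : parserOutputs.entrySet()) {
            // Record who ran, in order, so contributions can be traced later.
            merged.computeIfAbsent(PARSED_BY, k -> new ArrayList<>()).add(parser.getKey());
            for (Map.Entry<String, List<String>> field : parser.getValue().entrySet()) {
                if (field.getKey().equals(PARSED_BY)) {
                    continue; // the composite owns this key
                }
                if (policy == Policy.OVERWRITE) {
                    // Last parser to write a key wins.
                    merged.put(field.getKey(), new ArrayList<>(field.getValue()));
                } else {
                    // Append: keep every parser's values for the key.
                    merged.computeIfAbsent(field.getKey(), k -> new ArrayList<>())
                          .addAll(field.getValue());
                }
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Map<String, List<String>>> outputs = new LinkedHashMap<>();
        outputs.put("PDFParser", Map.of("title", List.of("Report")));
        outputs.put("TesseractOCRParser", Map.of("title", List.of("Scanned Report")));
        System.out.println(merge(outputs, Policy.APPEND));
        System.out.println(merge(outputs, Policy.OVERWRITE));
    }
}
```

With "overwrite", the iteration order of the input map plays the role of the precedence order mentioned above; a real implementation would presumably take that order from the service list.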

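
For #2 and #3, the handler concern can be sketched with the JDK's SAX classes alone. The idea: the composite calls startDocument()/endDocument() once itself, hands each child parser a wrapper that swallows those two events, and surrounds each child's output with a div carrying a parser attribute. Everything named here (ShieldedHandlerSketch, DocumentShield, FakeParser, parseAll, demo) is hypothetical and written for this example; it is not Tika's EndDocumentShieldingContentHandler, and FakeParser is a stand-in for a real Parser.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

class ShieldedHandlerSketch {

    // Forwards every SAX event except startDocument()/endDocument().
    static class DocumentShield extends XMLFilterImpl {
        DocumentShield(ContentHandler delegate) { setContentHandler(delegate); }
        @Override public void startDocument() { /* swallowed */ }
        @Override public void endDocument() { /* swallowed */ }
    }

    // Hypothetical stand-in for a parser: it just writes SAX events.
    interface FakeParser {
        String name();
        void parse(ContentHandler handler) throws SAXException;
    }

    static FakeParser textParser(String name, String text) {
        return new FakeParser() {
            public String name() { return name; }
            public void parse(ContentHandler h) throws SAXException {
                h.startDocument();              // each child "opens" the document
                h.characters(text.toCharArray(), 0, text.length());
                h.endDocument();                // and "ends" it; the shield hides both
            }
        };
    }

    // The composite owns the document boundaries and tags each child's output.
    static void parseAll(List<FakeParser> parsers, ContentHandler out) throws SAXException {
        ContentHandler shield = new DocumentShield(out);
        out.startDocument();
        for (FakeParser p : parsers) {
            AttributesImpl attrs = new AttributesImpl();
            attrs.addAttribute("", "parser", "parser", "CDATA", p.name());
            out.startElement("", "div", "div", attrs);
            p.parse(shield);
            out.endElement("", "div", "div");
        }
        out.endDocument();
    }

    // Records events as strings so the resulting order is easy to inspect.
    static class Recorder extends DefaultHandler {
        final List<String> events = new ArrayList<>();
        @Override public void startDocument() { events.add("startDocument"); }
        @Override public void endDocument() { events.add("endDocument"); }
        @Override public void startElement(String uri, String local, String qName, Attributes atts) {
            events.add("<" + qName + " parser=" + atts.getValue("parser") + ">");
        }
        @Override public void endElement(String uri, String local, String qName) {
            events.add("</" + qName + ">");
        }
        @Override public void characters(char[] ch, int start, int length) {
            events.add(new String(ch, start, length));
        }
    }

    static List<String> demo() {
        Recorder recorder = new Recorder();
        try {
            parseAll(List.of(textParser("pdf", "electronic text"),
                             textParser("ocr", "ocr text")), recorder);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return recorder.events;
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

In this sketch the downstream handler sees exactly one endDocument(), after the last child, so content from the second parser is not cut off by the first parser ending the document.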