Hi Guys,

There is a great discussion going on around TIKA-1445 right now that I wanted to bring to the dev list:
http://issues.apache.org/jira/browse/TIKA-1445

What we are seeing from OCR and GDAL lately is that there may be a use case for having multiple parsers called for the same MediaType. In this fashion, each parser contributes *more* metadata and content handling, rather than simply replacing it or being the only Parser selected to contribute to it.

Tim brought up the following questions that I wanted to respond to here on list:

{quote}
How will we handle:

1) Two parsers both "set" a value in the Metadata object? Will the second overwrite the value of the first?

2) Content: How will we know when a document ends? AutoDetectParser would wrap the handler in an EndDocumentShieldingContentHandler and then call endDocument when done?

3) Will the user be able to parse the output from the handler to figure out which parser is responsible for which content? Let's say a user wants to pull the electronic text out of a PDF and render the page as an image and then run it through OCR, would we have something like <div parser="o.a.t.p.PDFParser"> or similar? If we go this route, we'd want to make sure we don't have literally duplicate parsers (as we do now).

This sounds more complicated than having parent parsers know which children they control and how to control them, but, it might make sense. Aside from OCR
{quote}

Here are my replies:

#1 We will use a default policy of "append", which lets the Metadata object append values to the same key rather than replace them. We could also couple this with X-Parsed-By, an ordered list of the Parsers that ran, so that we can reconstruct which Parser contributed which field. If a key is multi-valued, we can also add fields for Offsets, etc. An alternative here would be to prefix the metadata keys in this CompositeParser with the X-Parsed-By parser name, to avoid conflicts.
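To make the "append" idea concrete, here is a rough sketch in plain Java. It does not use the Tika API at all: the class `MetadataMergeSketch`, its methods, the key `title`, and the parser names passed in are all made up for illustration, and a plain map of lists stands in for the Metadata object. It only shows the two options above, appending values while recording X-Parsed-By in call order, and prefixing keys with the parser name.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a multi-valued metadata object; not Tika's Metadata class.
public class MetadataMergeSketch {
    public static final String PARSED_BY = "X-Parsed-By";

    private final Map<String, List<String>> values = new LinkedHashMap<>();

    // "append" policy: every parser adds its values to the key, nothing is replaced.
    public void contribute(String parserName, Map<String, String> fields) {
        add(PARSED_BY, parserName); // ordered record of which parsers ran
        for (Map.Entry<String, String> e : fields.entrySet()) {
            add(e.getKey(), e.getValue());
        }
    }

    // Alternative: prefix each key with the parser name so keys can never collide.
    public void contributePrefixed(String parserName, Map<String, String> fields) {
        add(PARSED_BY, parserName);
        for (Map.Entry<String, String> e : fields.entrySet()) {
            add(parserName + ":" + e.getKey(), e.getValue());
        }
    }

    public List<String> get(String key) {
        List<String> found = values.get(key);
        return found == null ? Collections.<String>emptyList() : found;
    }

    private void add(String key, String value) {
        values.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    public static void main(String[] args) {
        MetadataMergeSketch md = new MetadataMergeSketch();
        // Parser names and field values here are illustrative only.
        md.contribute("PDFParser", Collections.singletonMap("title", "from embedded text"));
        md.contribute("OCRParser", Collections.singletonMap("title", "from OCR"));
        System.out.println(md.get("title"));
        System.out.println(md.get(PARSED_BY));
    }
}
```

With append, the i-th value under a key lines up with the i-th X-Parsed-By entry only if every parser writes that key, which is one reason the prefixing alternative (or extra offset fields) might be worth having.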
Users would be able to switch the policy from "append" to "overwrite", in which case this isn't a problem: we simply let the last parser to write to a conflicting key take precedence. One option with overwrite would be to let the policy specify a precedence order of Parsers (e.g., the current service list could be the precedence order).

That said, how sure are we that this is a *real* problem? Parsers that handle the same MediaType may well contribute vastly different, non-overlapping keys to the metadata object.

#2 I like your suggestion. The alternative, as I suggested, would be to reset the stream to the beginning after each parser, or to keep a copy of the original stream and clone it for each Parser that gets called.

#3 I like your idea about wrapping content provided by handlers with the parser attribute. Very neat, let's try that!

OK, thanks. I will add this to the JIRA issue too, but I think this is a good thing to have on the dev@ list.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
