The "CompositeParserDiscussion" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/CompositeParserDiscussion

New page:

=Composite Parser Discussion=
A given mime type may be supported by several parsers. Work on TIKA-1445 (adding metadata back into OCR'd text) raised the prominence of this issue. Currently, the CompositeParser picks the first parser that supports a given mime type. In the discussion on TIKA-1445, other potential use cases were identified.

The purpose of this page is to track a unified vision of the strategies that we'll implement in Tika. The JIRA issue for this is [[https://issues.apache.org/jira/browse/TIKA-1509|TIKA-1509]].

'''This page is just a start. Please contribute.'''

=Strategies=
==Classic==
Sort the parsers by non-Tika vs. Tika and then alphabetically by class name. Pick the first parser that will handle a given mime type.

==Supplementary/Additive==
Concatenate the results (metadata and content) from several parsers.

We need a better name for this!

==Back-off==
Try one parser and, if the output doesn't meet some criterion, apply another. One use case for this might be: if a file is identified as XML, try the XMLParser and, if that throws an exception, try the HTMLParser.

==Pick the Best Output==
One use case for this: the charset detector identifies two equally likely charsets. Apply both and use the wished-for junk detector (TIKA-1443) to determine which output is less likely to be junk.
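
To make the Classic and Back-off strategies above a little more concrete, here is a minimal, self-contained Java sketch offered only as a discussion aid. It does not use Tika's real API: `SketchParser`, `ParserStrategies`, `classic` and `backOff` are hypothetical names invented for illustration. It also assumes that "non-Tika vs. Tika" means non-Tika parsers sort first, and that the back-off criterion is simply "the parser threw an exception".

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Hypothetical stand-in for a parser; NOT org.apache.tika.parser.Parser.
final class SketchParser {
    // The parsing logic, allowed to fail with any exception.
    interface Body {
        String apply(String input) throws Exception;
    }

    private final String name;        // stands in for the class name
    private final boolean tika;       // false for third-party parsers
    private final Set<String> types;  // mime types this parser claims
    private final Body body;

    SketchParser(String name, boolean tika, Set<String> types, Body body) {
        this.name = name;
        this.tika = tika;
        this.types = types;
        this.body = body;
    }

    String name() { return name; }
    boolean isTika() { return tika; }
    boolean supports(String mimeType) { return types.contains(mimeType); }
    String parse(String input) throws Exception { return body.apply(input); }
}

final class ParserStrategies {
    private ParserStrategies() {}

    // Classic: non-Tika parsers first (assumption), then alphabetical by
    // name; return the first parser that supports the mime type.
    static Optional<SketchParser> classic(List<SketchParser> parsers, String mimeType) {
        return parsers.stream()
                .filter(p -> p.supports(mimeType))
                .sorted(Comparator.comparing(SketchParser::isTika)   // false sorts before true
                        .thenComparing(SketchParser::name))
                .findFirst();
    }

    // Back-off: try each candidate in order; if one throws, move on to the
    // next. Fails only when every candidate has failed (or none were given).
    static String backOff(List<SketchParser> candidates, String input) {
        Exception last = null;
        for (SketchParser p : candidates) {
            try {
                return p.parse(input);
            } catch (Exception e) {
                last = e;  // remember the failure and fall through
            }
        }
        throw new IllegalStateException("all candidate parsers failed", last);
    }
}
```

With this sketch, the XML-then-HTML use case would be a call such as `ParserStrategies.backOff(List.of(xmlParser, htmlParser), input)`. A criterion richer than "threw an exception" (for example, a junk score on the output) would need a predicate parameter, which is left out here.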
