[Tika Wiki] Update of "CompositeParserDiscussion" by TimothyAllison

Apache Wiki Fri, 09 Jan 2015 13:14:20 -0800

The "CompositeParserDiscussion" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/CompositeParserDiscussion

New page:
=Composite Parser Discussion=
A given mime type may be supported by several parsers.  Work on TIKA-1445 
(adding metadata back into OCR'd text) raised the prominence of this issue.  
Currently, the CompositeParser picks the first parser that supports a given 
mime type.  In discussion on TIKA-1445 other potential use cases were 
identified.

The purpose of this page is to track a unified vision of the strategies that 
we'll implement in Tika.

The JIRA issue for this is 
[[https://issues.apache.org/jira/browse/TIKA-1509|TIKA-1509]].

'''This page is just a start.  Please contribute.'''
=Strategies=
==Classic==
Sort the parsers by non-Tika vs. Tika and then alphabetically by class name.  
Pick the first parser that handles a given mime type.
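
A minimal sketch of this selection rule follows. It does not use the real Tika API: `ParserInfo`, `pick`, and `com.example.CustomXmlParser` are hypothetical names, a parser counts as "Tika" when its class name starts with `org.apache.tika.`, and the sketch assumes non-Tika parsers sort first so that third-party parsers can take precedence over the bundled ones.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

class ClassicSelection {
    /** Hypothetical stand-in for a parser: a class name plus the mime types it supports. */
    static final class ParserInfo {
        final String className;
        final Set<String> mimeTypes;

        ParserInfo(String className, String... mimeTypes) {
            this.className = className;
            this.mimeTypes = new HashSet<>(Arrays.asList(mimeTypes));
        }

        // Assumption: the package prefix decides whether a parser is a Tika parser.
        boolean isTika() {
            return className.startsWith("org.apache.tika.");
        }

        String className() {
            return className;
        }
    }

    /** Sorts non-Tika before Tika, then by class name, and returns the first match. */
    static Optional<ParserInfo> pick(List<ParserInfo> parsers, String mimeType) {
        Comparator<ParserInfo> order = Comparator
                .<ParserInfo, Boolean>comparing(ParserInfo::isTika) // false sorts before true
                .thenComparing(ParserInfo::className);
        return parsers.stream()
                .sorted(order)
                .filter(p -> p.mimeTypes.contains(mimeType))
                .findFirst();
    }
}
```

With a Tika XML parser and `com.example.CustomXmlParser` both registered for `application/xml`, `pick` would return the custom parser under these assumptions.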

==Supplementary/Additive==
Concatenate the results (metadata and content) from several parsers.

We need a better name for this!
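
One way the concatenation could look, as a rough sketch only: `Result`, `SimpleParser`, and `parseAll` are hypothetical stand-ins rather than Tika classes, metadata is modeled as a multi-valued map, and content from successive parsers is joined with a newline.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class AdditiveSketch {
    /** Hypothetical parse result: multi-valued metadata plus extracted text. */
    static final class Result {
        final Map<String, List<String>> metadata = new LinkedHashMap<>();
        final StringBuilder content = new StringBuilder();

        Result put(String key, String value) {
            metadata.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
            return this;
        }

        Result text(String s) {
            content.append(s);
            return this;
        }
    }

    /** Hypothetical parser interface, much simpler than Tika's Parser. */
    interface SimpleParser {
        Result parse(byte[] input);
    }

    /** Runs every parser on the same input and merges metadata and content in order. */
    static Result parseAll(List<SimpleParser> parsers, byte[] input) {
        Result merged = new Result();
        for (SimpleParser parser : parsers) {
            Result r = parser.parse(input);
            // Keep every value; a repeated key accumulates values instead of overwriting.
            r.metadata.forEach((key, values) ->
                    values.forEach(v -> merged.put(key, v)));
            // Separate non-empty content chunks with a newline.
            if (merged.content.length() > 0 && r.content.length() > 0) {
                merged.content.append('\n');
            }
            merged.content.append(r.content);
        }
        return merged;
    }
}
```

For the TIKA-1445 case, an image parser contributing only metadata and an OCR parser contributing text would both end up in the merged result. How to handle conflicting metadata values is left open here.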

==Back-off==
Try one parser and if the output doesn't meet some criterion, apply another.  
One use case for this might be: if a file is identified as XML, try the 
XMLParser and if that throws an exception, try the HTMLParser. 
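
The back-off idea might be sketched as below. `SimpleParser`, `parse`, and the `acceptable` predicate are hypothetical, not Tika API; the predicate stands in for "some criterion", and a thrown exception is treated the same as unacceptable output.

```java
import java.util.List;
import java.util.function.Predicate;

class BackOffSketch {
    /** Hypothetical parser interface that returns extracted text. */
    interface SimpleParser {
        String parse(String input);
    }

    /**
     * Tries the parsers in order and returns the first output that passes
     * the criterion. A parser that throws is skipped.
     */
    static String parse(List<SimpleParser> parsers, String input,
                        Predicate<String> acceptable) {
        RuntimeException lastFailure = null;
        for (SimpleParser parser : parsers) {
            try {
                String output = parser.parse(input);
                if (acceptable.test(output)) {
                    return output;
                }
            } catch (RuntimeException e) {
                lastFailure = e; // remember the failure, then back off to the next parser
            }
        }
        throw new IllegalStateException(
                "no parser produced acceptable output", lastFailure);
    }
}
```

In the XML example, a strict XML parser would come first in the list and a lenient HTML parser second, so malformed XML falls through to the HTML parser.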

==Pick the Best Output==
One use case for this: the charset detector identifies two equally likely 
charsets.  Decode with both and use the wished-for junk detector (TIKA-1443) to 
determine which output is less likely to be junk.
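
A sketch of the charset case, under heavy assumptions: the junk detector does not exist yet, so `junkScore` below is only a placeholder that counts replacement and control characters, and `decodeBest` is a hypothetical helper rather than anything in Tika.

```java
import java.nio.charset.Charset;
import java.util.List;
import java.util.function.ToDoubleFunction;

class BestOutputSketch {
    /**
     * Placeholder junk score (not the TIKA-1443 detector): the fraction of
     * characters that are U+FFFD or non-whitespace control characters.
     */
    static double junkScore(String text) {
        if (text.isEmpty()) {
            return 0.0;
        }
        long junk = text.chars()
                .filter(c -> c == '\uFFFD'
                        || (Character.isISOControl(c) && !Character.isWhitespace(c)))
                .count();
        return (double) junk / text.length();
    }

    /** Decodes with each candidate charset and keeps the least junky output. */
    static String decodeBest(byte[] bytes, List<Charset> candidates,
                             ToDoubleFunction<String> junkDetector) {
        String best = null;
        double bestScore = Double.POSITIVE_INFINITY;
        for (Charset charset : candidates) {
            // new String(...) substitutes U+FFFD for malformed byte sequences.
            String decoded = new String(bytes, charset);
            double score = junkDetector.applyAsDouble(decoded);
            if (score < bestScore) { // on a tie, the earlier candidate wins
                best = decoded;
                bestScore = score;
            }
        }
        if (best == null) {
            throw new IllegalArgumentException("no candidate charsets");
        }
        return best;
    }
}
```

Passing the detector as a function keeps the selection logic independent of whatever scoring TIKA-1443 eventually provides.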
