Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tika Wiki" for change 
notification.

The "CompositeParserDiscussion" page has been changed by NickBurch:
https://wiki.apache.org/tika/CompositeParserDiscussion?action=diff&rev1=1&rev2=2

Comment:
Formatting and expand

- =Composite Parser Discussion=
+ = Composite Parser Discussion =
- A given mime type may be supported by several parsers.  Work on TIKA-1445 
(adding metadata back into OCR'd text) raised the prominence of this issue.  
Currently, the CompositeParser picks the first parser that supports a given 
mime type.  In discussion on TIKA-1445 other potential use cases were 
identified.
+ 
+ A given mime type may be supported by several parsers.  Work on 
[[https://issues.apache.org/jira/browse/TIKA-1445|TIKA-1445]] (adding metadata 
back into OCR'd text) raised the prominence of this issue.  Currently, the 
CompositeParser picks the first parser that supports a given mime type.  In 
discussion on TIKA-1445 other potential use cases were identified.
  
  The purpose of this page is to track a unified vision of the strategies that 
we'll implement in Tika.
  
  The JIRA issue for this is 
[[https://issues.apache.org/jira/browse/TIKA-1509|TIKA-1509]].
  
  '''This page is just a start.  Please contribute'''
+ 
- =Strategies=
+ = Strategies =
- ==Classic==
+ == Classic ==
  Sort the parsers by non-Tika vs Tika and then alphabetically by class name.  
Pick the first parser that will handle a given mime type.
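
As a rough illustration of the Classic ordering, here is a minimal, self-contained sketch. It does not use Tika's real classes: `ParserInfo`, `isTika` and `pick` are hypothetical names, and it assumes that "non-Tika vs Tika" means non-Tika parsers sort first and that a Tika parser is one whose class name starts with `org.apache.tika.`.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.Set;

final class ClassicSelectionSketch {

    /** Hypothetical stand-in for a parser: its class name and the mime types it supports. */
    static final class ParserInfo {
        final String className;
        final Set<String> mimeTypes;

        ParserInfo(String className, Set<String> mimeTypes) {
            this.className = className;
            this.mimeTypes = mimeTypes;
        }
    }

    /** Assumption: a parser counts as "Tika" if its class lives under org.apache.tika. */
    static boolean isTika(ParserInfo p) {
        return p.className.startsWith("org.apache.tika.");
    }

    /** Sorts a copy of the list, then returns the first parser that supports the mime type. */
    static Optional<ParserInfo> pick(List<ParserInfo> parsers, String mimeType) {
        // Boolean.FALSE sorts before Boolean.TRUE, so non-Tika parsers come first;
        // ties are broken alphabetically by class name.
        Comparator<ParserInfo> order = Comparator
                .comparing((ParserInfo p) -> isTika(p))
                .thenComparing(p -> p.className);
        List<ParserInfo> sorted = new ArrayList<>(parsers);
        sorted.sort(order);
        return sorted.stream()
                .filter(p -> p.mimeTypes.contains(mimeType))
                .findFirst();
    }
}
```

With this ordering, a third-party parser registered for a mime type would win over the bundled Tika parser for the same type, whatever order the parsers were loaded in.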
  
- ==Supplementary/Additive==
+ == Supplementary/Additive ==
  Concatenate the results (metadata and content) from several parsers.
  
  We need a better name for this!
  
- ==Back-off==
+ == Back-off ==
  Try one parser and if the output doesn't meet some criterion, apply another.  
One use case for this might be: if a file is identified as XML, try the 
XMLParser and if that throws an exception, try the HTMLParser. 
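
The back-off idea could be sketched roughly as below. This is only an illustration under stated assumptions: `SketchParser` is a hypothetical one-method interface rather than Tika's `Parser`, the "criterion" is modelled as a plain `Predicate<String>` over the extracted text, and a thrown exception is treated the same as unacceptable output.

```java
import java.util.List;
import java.util.function.Predicate;

final class BackOffSketch {

    /** Hypothetical minimal parser: turns input into extracted text, or throws. */
    interface SketchParser {
        String parse(String input) throws Exception;
    }

    /**
     * Tries each parser in order and returns the first output that passes the
     * acceptance check. A parser that throws is skipped and the next one is tried.
     */
    static String parse(List<SketchParser> parsers, Predicate<String> acceptable, String input) {
        IllegalStateException failure =
                new IllegalStateException("no parser produced acceptable output");
        for (SketchParser parser : parsers) {
            try {
                String output = parser.parse(input);
                if (acceptable.test(output)) {
                    return output;
                }
            } catch (Exception e) {
                // Keep the cause so the caller can see why each parser was rejected.
                failure.addSuppressed(e);
            }
        }
        throw failure;
    }
}
```

For the XML use case, the list would hold a strict XML parser followed by a lenient HTML parser, so the second only runs when the first fails.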
  
- ==Pick the Best Output==
+ == Pick the Best Output ==
  One use case for this: the charset detector identifies two equally likely 
charsets.  Apply both and use the wished-for junk detector (TIKA-1443) to 
determine which output is less likely to be junk.
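
A minimal sketch of the charset use case might look like the following. The junk detector from TIKA-1443 does not exist yet, so `toyScore` is a purely hypothetical stand-in that only penalises U+FFFD replacement characters; `decodeBest` is likewise an illustrative name, not a Tika API.

```java
import java.nio.charset.Charset;
import java.util.List;
import java.util.function.ToDoubleFunction;

final class BestOutputSketch {

    /** Decodes the bytes with every candidate charset and keeps the highest-scoring text. */
    static String decodeBest(byte[] bytes, List<Charset> candidates,
                             ToDoubleFunction<String> score) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Charset charset : candidates) {
            // new String(bytes, charset) replaces malformed input with U+FFFD.
            String text = new String(bytes, charset);
            double s = score.applyAsDouble(text);
            if (best == null || s > bestScore) {
                best = text;
                bestScore = s;
            }
        }
        if (best == null) {
            throw new IllegalArgumentException("no candidate charsets given");
        }
        return best;
    }

    /** Toy stand-in for a junk detector: fewer replacement characters scores higher. */
    static double toyScore(String text) {
        return -text.chars().filter(c -> c == 0xFFFD).count();
    }
}
```

On a tie the first candidate wins, so the order of the candidate list acts as the fallback preference.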
  
+ == Fastest ==
+ If there are two parsers, use the faster one even if it might mean lower 
quality (e.g. by avoiding OCR).
+ 
+ = Allowing the User to select a strategy =
+ The right strategy for one user may not be the right one for another. The right 
strategy for one file may not be the right one for another. We therefore need 
to allow users to pick their strategy, both on an overall basis and on a per-file 
basis.
+ 
+ == From TikaConfig ==
+ ''TODO''
+ 
+ == With a Tika Configuration file ==
+ ''TODO''
+ 
+ == In Code ==
+ ''TODO''
+ 
