[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14217965#comment-14217965 ]
Tim Allison edited comment on TIKA-1445 at 11/19/14 3:01 PM:
-------------------------------------------------------------

How about using the order of parsers as specified in TikaConfig? That should accommodate 6 class files in different jars, no?

Via TikaConfig, we could also specify which subclass of a default composite parser to use. I now see at least three use cases:
1) Tika classic: pick the first parser that applies and hope that it is the one you meant, ignore the others. :)
2) The use case we've been discussing, where each parser is additive.
3) A BackOffOnExceptionParser (TIKA-1483 got me thinking about this)

Wait, for Tika 2.0, couldn't we do all the class loading from TikaConfig? We could also get rid of our one-off parser config hacks (like Solr):
{noformat}
<parser class="org.apache.tika.parser.audio.AudioParser">
  <params>
    <int name="someparam1">2</int>
    <str name="someOtherParam2">something or other</str>
  </params>
  <mime>audio/basic</mime>
  <mime>audio/x-aiff</mime>
  <mime>audio/x-wav</mime>
</parser>
{noformat}

We could specify a ChainingParser on the fly via config:
{noformat}
<parser class="org.apache.tika.parser.ChainingParser" name="MyOCRAndMetadataParser">
  <childparser>org.apache.tika.parser.jpeg.JPegParser</childparser>
  <childparser>...</childparser>
  <childparser>...</childparser>
  <childparser>org.apache.tika.parser.ocr.TesseractOCR</childparser>
  <mime>image/bmp</mime>
  <mime>image/gif</mime>
  <mime>image/png</mime>
  <mime>image/vnd.wap.wbmp</mime>
  <mime>image/x-icon</mime>
  <mime>image/x-ms-bmp</mime>
  <mime>image/x-xcf</mime>
</parser>
{noformat}


was (Author: talli...@mitre.org):
How about using the order of parsers as specified in TikaConfig? That should accommodate 6 class files in different jars, no?

Via TikaConfig, we could also specify which subclass of a default composite parser to use. I now see at least three use cases:
1) Tika classic: pick the first parser that applies and hope that it is the one you meant, ignore the others. :)
2) The use case we've been discussing, where each parser is additive.
3) A BackOffOnExceptionParser (TIKA-1483 got me thinking about this)

> Figure out how to add Image metadata extraction to Tesseract parser
> -------------------------------------------------------------------
>
>                 Key: TIKA-1445
>                 URL: https://issues.apache.org/jira/browse/TIKA-1445
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Chris A. Mattmann
>            Assignee: Chris A. Mattmann
>             Fix For: 1.8
>
>         Attachments: TIKA-1445.Mattmann.101214.patch.txt, TIKA-1445.Palsulich.102614.patch, TIKA-1445_tallison_20141027.patch.txt, TIKA-1445_tallison_v2_20141027.patch, TIKA-1445_tallison_v3_20141027.patch
>
>
> Now that Tesseract is the default image parser in Tika for many image types,
> consider how to add back in the metadata extraction capabilities by the other
> Image parsers.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
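
For illustration, use cases 2 and 3 from the comment above (additive parsers and back-off on exception) could be sketched roughly as follows. This is a minimal, self-contained sketch and not Tika code: `SimpleParser`, `chaining`, and `backOffOnException` are hypothetical names, and the interface takes a plain string and throws only unchecked exceptions, unlike Tika's real `Parser`.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ParserStrategiesSketch {

    /** Hypothetical stand-in for a parser; NOT Tika's real Parser interface. */
    public interface SimpleParser {
        void parse(String input, Map<String, String> metadata);
    }

    /** Use case 2: every child runs, and each one adds to the same metadata. */
    public static SimpleParser chaining(List<SimpleParser> children) {
        return (input, metadata) -> {
            for (SimpleParser child : children) {
                child.parse(input, metadata);
            }
        };
    }

    /** Use case 3: try children in order and stop at the first that succeeds. */
    public static SimpleParser backOffOnException(List<SimpleParser> children) {
        return (input, metadata) -> {
            RuntimeException last = null;
            for (SimpleParser child : children) {
                // Scratch map so a failing child leaves no partial metadata behind.
                Map<String, String> scratch = new LinkedHashMap<>();
                try {
                    child.parse(input, scratch);
                    metadata.putAll(scratch);
                    return;
                } catch (RuntimeException e) {
                    last = e; // back off to the next child
                }
            }
            if (last != null) {
                throw last; // every child failed
            }
        };
    }

    public static void main(String[] args) {
        SimpleParser imageMetadata = (in, md) -> md.put("width", "640");
        SimpleParser ocr = (in, md) -> md.put("text", "hello");
        SimpleParser broken = (in, md) -> { throw new IllegalStateException("boom"); };

        Map<String, String> additive = new LinkedHashMap<>();
        chaining(Arrays.asList(imageMetadata, ocr)).parse("img", additive);
        System.out.println(additive); // {width=640, text=hello}

        Map<String, String> fallback = new LinkedHashMap<>();
        backOffOnException(Arrays.asList(broken, ocr)).parse("img", fallback);
        System.out.println(fallback); // {text=hello}
    }
}
```

In this sketch the order of the child list decides both the order in which metadata is added and the back-off order, which lines up with the suggestion to take parser order from TikaConfig.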