[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14217965#comment-14217965
 ] 

Tim Allison edited comment on TIKA-1445 at 11/19/14 3:01 PM:
-------------------------------------------------------------

How about using the order of parsers as specified in TikaConfig?  That should 
accommodate 6 class files in different jars, no?

Via TikaConfig, we could also specify which subclass of a default composite 
parser to use.  I now see at least three use cases:
1) Tika classic: pick the first parser that applies and hope that it is the one 
you meant, ignore the others. :)
2) The use case we've been discussing, where each parser is additive.
3) A BackOffOnExceptionParser (TIKA-1483 got me thinking about this)
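
To make the three cases concrete, here is a rough Java sketch. It does not use Tika's actual Parser API: MiniParser, firstMatch, additive, backOff and the parser(...) factory are all made-up names for illustration, and metadata is reduced to a plain string map. The only point is how the dispatch differs when parsers are tried in configured order.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration only; none of these types are Tika's API.
final class ParserStrategies {

    /** Minimal parser abstraction: may add metadata entries, may throw. */
    interface MiniParser {
        boolean supports(String mime);
        void parse(String input, Map<String, String> metadata);
    }

    /** 1) "Classic": only the first parser that supports the type runs. */
    static Map<String, String> firstMatch(List<MiniParser> parsers, String mime, String input) {
        Map<String, String> metadata = new LinkedHashMap<>();
        for (MiniParser p : parsers) {
            if (p.supports(mime)) {
                p.parse(input, metadata);
                break; // every later parser is ignored
            }
        }
        return metadata;
    }

    /** 2) Additive: every supporting parser runs, in configured order. */
    static Map<String, String> additive(List<MiniParser> parsers, String mime, String input) {
        Map<String, String> metadata = new LinkedHashMap<>();
        for (MiniParser p : parsers) {
            if (p.supports(mime)) {
                p.parse(input, metadata);
            }
        }
        return metadata;
    }

    /** 3) Back-off: try parsers in order and fall through to the next on failure. */
    static Map<String, String> backOff(List<MiniParser> parsers, String mime, String input) {
        RuntimeException last = null;
        for (MiniParser p : parsers) {
            if (!p.supports(mime)) {
                continue;
            }
            // Fresh map per attempt so a failed parser leaves nothing behind.
            Map<String, String> metadata = new LinkedHashMap<>();
            try {
                p.parse(input, metadata);
                return metadata;
            } catch (RuntimeException e) {
                last = e;
            }
        }
        if (last != null) {
            throw last; // every applicable parser failed
        }
        return new LinkedHashMap<>();
    }

    /** Toy parser that writes one key/value pair, or fails on demand. */
    static MiniParser parser(String mime, String key, String value, boolean fails) {
        return new MiniParser() {
            public boolean supports(String m) {
                return mime.equals(m);
            }
            public void parse(String input, Map<String, String> metadata) {
                if (fails) {
                    throw new IllegalStateException(key + " parser failed");
                }
                metadata.put(key, value);
            }
        };
    }

    public static void main(String[] args) {
        List<MiniParser> parsers = Arrays.asList(
                parser("image/png", "ocr", "recognized text", false),
                parser("image/png", "width", "100", false));
        System.out.println(firstMatch(parsers, "image/png", "bytes"));
        System.out.println(additive(parsers, "image/png", "bytes"));
    }
}
```

In this sketch the list order plays the role of the parser order in TikaConfig, so the same configuration could drive any of the three behaviors.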

Wait, for Tika 2.0, couldn't we do all the class loading from TikaConfig?  We 
could also replace our one-off parser config hacks with Solr-style parameters:

{noformat}
    <parser class="org.apache.tika.parser.audio.AudioParser">
      <params>
        <int name="someparam1">2</int>
        <str name="someOtherParam2">something or other</str>
      </params>
      <mime>audio/basic</mime>
      <mime>audio/x-aiff</mime>
      <mime>audio/x-wav</mime>
    </parser>
{noformat}
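
As a sketch of how such an element might be consumed, the snippet below reads one <parser> element with the JDK's DOM parser. ParserConfigReader and ParserSpec are hypothetical names, not existing Tika classes, and the assumption that only <int> and <str> parameter types exist comes from the example above, not from any real schema.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical reader for the proposed config format; not part of Tika.
final class ParserConfigReader {

    /** Parsed form of one <parser> element. */
    static final class ParserSpec {
        final String className;
        final Map<String, Object> params = new LinkedHashMap<>();
        final List<String> mimes = new ArrayList<>();

        ParserSpec(String className) {
            this.className = className;
        }
    }

    static ParserSpec read(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Element parser = doc.getDocumentElement();
            ParserSpec spec = new ParserSpec(parser.getAttribute("class"));

            NodeList paramBlocks = parser.getElementsByTagName("params");
            if (paramBlocks.getLength() > 0) {
                NodeList children = paramBlocks.item(0).getChildNodes();
                for (int i = 0; i < children.getLength(); i++) {
                    Node node = children.item(i);
                    if (node.getNodeType() != Node.ELEMENT_NODE) {
                        continue; // skip whitespace text nodes
                    }
                    Element param = (Element) node;
                    String text = param.getTextContent().trim();
                    Object value;
                    // The tag name carries the type, as in the example above.
                    switch (param.getTagName()) {
                        case "int":
                            value = Integer.valueOf(text);
                            break;
                        case "str":
                            value = text;
                            break;
                        default:
                            throw new IllegalArgumentException(
                                    "Unsupported param type: " + param.getTagName());
                    }
                    spec.params.put(param.getAttribute("name"), value);
                }
            }

            NodeList mimes = parser.getElementsByTagName("mime");
            for (int i = 0; i < mimes.getLength(); i++) {
                spec.mimes.add(mimes.item(i).getTextContent().trim());
            }
            return spec;
        } catch (Exception e) {
            throw new IllegalArgumentException("Bad parser config", e);
        }
    }
}
```

Typed values land in a generic map here; how a parser would actually receive them (setters, constructor, a context object) is left open.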

We could specify a ChainingParser on the fly via config:
{noformat}
    <parser class="org.apache.tika.parser.ChainingParser"
            name="MyOCRAndMetadataParser">
      <childparser>org.apache.tika.parser.jpeg.JpegParser</childparser>
      <childparser>...</childparser>
      <childparser>...</childparser>
      <childparser>org.apache.tika.parser.ocr.TesseractOCRParser</childparser>

      <mime>image/bmp</mime>
      <mime>image/gif</mime>
      <mime>image/png</mime>
      <mime>image/vnd.wap.wbmp</mime>
      <mime>image/x-icon</mime>
      <mime>image/x-ms-bmp</mime>
      <mime>image/x-xcf</mime>

    </parser>
{noformat}
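
One way the chain order could be carried from config into a lookup table is sketched below. ChainConfigReader is a hypothetical name, the child class names in the usage are placeholders under com.example, and nothing is instantiated: the sketch only shows that the <childparser> order in the file can become the execution order for every listed MIME type.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical reader for the proposed chaining config; not part of Tika.
final class ChainConfigReader {

    /** Maps each configured MIME type to the child parser class names, in file order. */
    static Map<String, List<String>> read(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Element parser = doc.getDocumentElement();

            // getElementsByTagName returns nodes in document order,
            // so the list preserves the order written in the config.
            List<String> chain = new ArrayList<>();
            NodeList children = parser.getElementsByTagName("childparser");
            for (int i = 0; i < children.getLength(); i++) {
                chain.add(children.item(i).getTextContent().trim());
            }
            List<String> readOnlyChain = Collections.unmodifiableList(chain);

            Map<String, List<String>> byMime = new LinkedHashMap<>();
            NodeList mimes = parser.getElementsByTagName("mime");
            for (int i = 0; i < mimes.getLength(); i++) {
                byMime.put(mimes.item(i).getTextContent().trim(), readOnlyChain);
            }
            return byMime;
        } catch (Exception e) {
            throw new IllegalArgumentException("Bad chaining config", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<parser class=\"org.apache.tika.parser.ChainingParser\">"
                + "<childparser>com.example.MetadataParser</childparser>"
                + "<childparser>com.example.OcrParser</childparser>"
                + "<mime>image/png</mime><mime>image/gif</mime></parser>";
        System.out.println(read(xml));
    }
}
```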



> Figure out how to add Image metadata extraction to Tesseract parser
> -------------------------------------------------------------------
>
>                 Key: TIKA-1445
>                 URL: https://issues.apache.org/jira/browse/TIKA-1445
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Chris A. Mattmann
>            Assignee: Chris A. Mattmann
>             Fix For: 1.8
>
>         Attachments: TIKA-1445.Mattmann.101214.patch.txt, 
> TIKA-1445.Palsulich.102614.patch, TIKA-1445_tallison_20141027.patch.txt, 
> TIKA-1445_tallison_v2_20141027.patch, TIKA-1445_tallison_v3_20141027.patch
>
>
> Now that Tesseract is the default image parser in Tika for many image types, 
> consider how to add back in the metadata extraction capabilities by the other 
> Image parsers.


