[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

Hong-Thai Nguyen (JIRA) Mon, 13 Oct 2014 01:58:33 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14169090#comment-14169090
 ]


Hong-Thai Nguyen commented on TIKA-1445:
----------------------------------------

Interesting question!
In my view, parser selection and parser priority should be decided at 
runtime through configuration, not inside a parser.
The image case is an interesting example of competing parsers (Tesseract vs. 
the classical image parsers). We have two problems here:
1. When several parsers can handle the same MIME type, which one is selected?
2. When several parsers match, can we apply all of them and merge the results 
(metadata and handler output)?

* For case 1, if we use an override configuration of parsers at runtime, we can 
declare several parsers with a matching MIME type, and the later one in the 
list will be selected. We may extend the CLI/web service to inject this kind 
of configuration.
* For case 2, we don't have a solution for now. We may extend CompositeParser 
to accept a 'many' parsers mode and call the matching parsers in a chain. 
Merging the results is another problem. We can accept that a metadata name 
set by one parser is overridden by another parser. The ideal solution is 
(again) a nested structure in our metadata, which would make it possible to 
store each parser's result.
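
To make the two cases concrete, here is a minimal, self-contained Java sketch. 
It does not use Tika's real API: SimpleParser, selectLast, parseAll and the 
"parserName:key" naming are all hypothetical, chosen only for illustration. 
selectLast shows case 1 (the later matching parser in the list wins), and 
parseAll shows case 2 (call every matching parser in a chain, let later 
parsers override flat metadata names, and also keep a per-parser copy as a 
stand-in for a nested metadata structure).

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ChainedParserSketch {

    /** Hypothetical stand-in for a parser; this is NOT Tika's Parser interface. */
    interface SimpleParser {
        String name();
        boolean supports(String mimeType);
        Map<String, String> parse(String input);
    }

    /** Test helper: a parser that always returns the same metadata. */
    static SimpleParser fixed(String name, String mimeType, Map<String, String> metadata) {
        return new SimpleParser() {
            public String name() { return name; }
            public boolean supports(String type) { return mimeType.equals(type); }
            public Map<String, String> parse(String input) { return metadata; }
        };
    }

    /** Case 1: among the matching parsers, the later one in the list is selected. */
    static SimpleParser selectLast(List<SimpleParser> parsers, String mimeType) {
        SimpleParser selected = null;
        for (SimpleParser p : parsers) {
            if (p.supports(mimeType)) {
                selected = p; // a later match replaces an earlier one
            }
        }
        return selected; // null when no parser matches
    }

    /** Case 2: call every matching parser in list order and merge the metadata. */
    static Map<String, String> parseAll(List<SimpleParser> parsers, String mimeType, String input) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (SimpleParser p : parsers) {
            if (!p.supports(mimeType)) {
                continue;
            }
            for (Map.Entry<String, String> e : p.parse(input).entrySet()) {
                // Flat name: a later parser overrides an earlier one.
                merged.put(e.getKey(), e.getValue());
                // Per-parser name: keeps each parser's own result (assumed convention).
                merged.put(p.name() + ":" + e.getKey(), e.getValue());
            }
        }
        return merged;
    }
}
```

With an "image" parser followed by an "ocr" parser for the same MIME type, 
selectLast returns the "ocr" parser, while parseAll keeps both results: the 
flat name holds the last value written, and the prefixed names preserve what 
each parser reported.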

> Figure out how to add Image metadata extraction to Tesseract parser
> -------------------------------------------------------------------
>
>                 Key: TIKA-1445
>                 URL: https://issues.apache.org/jira/browse/TIKA-1445
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Chris A. Mattmann
>            Assignee: Chris A. Mattmann
>             Fix For: 1.7
>
>         Attachments: TIKA-1445.Mattmann.101214.patch.txt
>
>
> Now that Tesseract is the default image parser in Tika for many image types, 
> consider how to add back in the metadata extraction capabilities by the other 
> Image parsers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
