[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

Nick Burch (JIRA) Tue, 18 Nov 2014 09:14:00 -0800

    [ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14216444#comment-14216444
 ]


Nick Burch commented on TIKA-1445:
----------------------------------

I think it's fairly common for people to have 4-5 parser services files, and 
whatever we do needs to accept that as a "normal" use case. Pretty much anyone 
depending on tika-parsers is going to have at least 2.

Think of the case of
{code:title="tika-parsers.jar:META-INF/services/org.apache.tika.parser.Parser"}
org.apache.tika.parser.gdal.GDALParser
org.apache.tika.parser.html.HtmlParser
org.apache.tika.parser.image.ImageParser
{code}
{code:title="my-tika-extension.jar:META-INF/services/org.apache.tika.parser.Parser"}
com.example.tika.ocr.CustomOCRParser
org.apache.tika.parser.image.ImageParser
{code}

Under your plan, given that the JVM could return the two service files to you 
in any order, how do you decide which of GDALParser or ImageParser goes second 
after the OCR one? In one services file, ImageParser is listed after 
GDALParser; in the other, it comes straight after the OCR parser. Which wins? 
How do we make it deterministic, and not just based on which jar the JVM spots 
first?
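
To make the ordering problem concrete, here is a minimal sketch in plain Java. 
It assumes a naive "first occurrence wins" merge of the two service files, 
which is only an illustration and not Tika's actual loader or necessarily the 
plan under discussion. The class name ServiceOrderDemo, the merge() helper and 
the short parser labels are all hypothetical stand-ins for the entries above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical demo class; not part of Tika.
final class ServiceOrderDemo {

    // Naive merge (an assumption for illustration): walk the service files in
    // the order they were discovered and keep the first occurrence of each entry.
    static List<String> merge(List<List<String>> filesInDiscoveryOrder) {
        LinkedHashSet<String> ordered = new LinkedHashSet<>();
        for (List<String> file : filesInDiscoveryOrder) {
            ordered.addAll(file);
        }
        return new ArrayList<>(ordered);
    }

    public static void main(String[] args) {
        // Short labels standing in for the class names in the two service files.
        List<String> tikaParsers =
                Arrays.asList("GDALParser", "HtmlParser", "ImageParser");
        List<String> extension =
                Arrays.asList("CustomOcrParser", "ImageParser");

        // Same two jars, two possible discovery orders.
        System.out.println(merge(Arrays.asList(tikaParsers, extension)));
        System.out.println(merge(Arrays.asList(extension, tikaParsers)));
    }
}
```

With this merge, GDALParser lands ahead of ImageParser when tika-parsers.jar is 
found first, and behind it when the extension jar is found first, so the 
relative order of the two depends purely on discovery order.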

> Figure out how to add Image metadata extraction to Tesseract parser
> -------------------------------------------------------------------
>
>                 Key: TIKA-1445
>                 URL: https://issues.apache.org/jira/browse/TIKA-1445
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Chris A. Mattmann
>            Assignee: Chris A. Mattmann
>             Fix For: 1.8
>
>         Attachments: TIKA-1445.Mattmann.101214.patch.txt, 
> TIKA-1445.Palsulich.102614.patch, TIKA-1445_tallison_20141027.patch.txt, 
> TIKA-1445_tallison_v2_20141027.patch, TIKA-1445_tallison_v3_20141027.patch
>
>
> Now that Tesseract is the default image parser in Tika for many image types, 
> consider how to add back in the metadata extraction capabilities by the other 
> Image parsers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
