[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

Nick Burch (JIRA) Tue, 18 Nov 2014 09:32:40 -0800

    [ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14216466#comment-14216466 ]


Nick Burch commented on TIKA-1445:
----------------------------------

Anyone using tika-parser out of the box has two parser services files: built-in 
and vorbis. Anyone adding a third-party parser under a non-ASLv2 license from 
the wiki will get a third. Anyone adding their own custom parsers following the 
instructions on the website will get a few more. 

My hunch is that most users won't care at all about the order in which the 
parsers are asked "hey, can you handle this file type". My second hunch is that 
users who do care will typically only care about it for a handful of formats, 
e.g. "for jpeg try ocr then image, everything else default is fine". 

We also need to support those users who currently say "I don't care what you 
find on the classpath, I only ever want you to use these 5 parsers and in this 
explicit order I'm passing you now".
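
To make those three cases concrete, here is a minimal Java sketch. It is purely 
illustrative and not a proposal for the fix: ParserOrder, resolve and the parser 
names are all hypothetical, none of them are Tika APIs, and the sketch only 
models ordering, not which parser supports which type.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of parser ordering; not part of Tika. */
public class ParserOrder {

    /**
     * Returns the parser names to try, in order, for one media type.
     *
     * @param mediaType    the type being parsed, e.g. "image/jpeg"
     * @param discovered   parsers found on the classpath, in discovery order
     * @param overrides    per-type preferred order; empty for most users
     * @param explicitOnly if non-null, the only parsers allowed, in this order
     */
    public static List<String> resolve(String mediaType,
                                       List<String> discovered,
                                       Map<String, List<String>> overrides,
                                       List<String> explicitOnly) {
        // "Only these parsers, in this order": ignore the classpath entirely.
        if (explicitOnly != null) {
            return new ArrayList<>(explicitOnly);
        }
        // "For jpeg try ocr then image": preferred parsers go first,
        // skipping any that were not actually discovered.
        Set<String> ordered = new LinkedHashSet<>();
        for (String name : overrides.getOrDefault(mediaType, List.of())) {
            if (discovered.contains(name)) {
                ordered.add(name);
            }
        }
        // "Everything else default is fine": the rest keep discovery order.
        // The LinkedHashSet drops duplicates and preserves insertion order.
        ordered.addAll(discovered);
        return new ArrayList<>(ordered);
    }
}
```

A user who doesn't care passes an empty overrides map and a null explicit list, 
and simply gets the discovery order back.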

I can describe the problem, but I'm not sure of the right solution at this 
point...

> Figure out how to add Image metadata extraction to Tesseract parser
> -------------------------------------------------------------------
>
>                 Key: TIKA-1445
>                 URL: https://issues.apache.org/jira/browse/TIKA-1445
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Chris A. Mattmann
>            Assignee: Chris A. Mattmann
>             Fix For: 1.8
>
>         Attachments: TIKA-1445.Mattmann.101214.patch.txt, 
> TIKA-1445.Palsulich.102614.patch, TIKA-1445_tallison_20141027.patch.txt, 
> TIKA-1445_tallison_v2_20141027.patch, TIKA-1445_tallison_v3_20141027.patch
>
>
> Now that Tesseract is the default image parser in Tika for many image types, 
> consider how to add back in the metadata extraction capabilities by the other 
> Image parsers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
