[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14216466#comment-14216466 ]

Nick Burch commented on TIKA-1445:
----------------------------------
Anyone using tika-parsers out of the box has two parser services files: the
built-in one and the vorbis one. Anyone adding a third party parser under a
non-ASLv2 license from the wiki will get a third. Anyone adding their own
custom parsers following the instructions on the website will get a few more.
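
To see how many services files a given classpath really has, something like
the following should do. It is a minimal sketch, assuming the parsers are
registered through the standard Java service-provider file
META-INF/services/org.apache.tika.parser.Parser; the class and method names
are made up for illustration and are not part of Tika.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Tika.
final class ParserServiceFiles {
    // Assumed resource name: the service-provider file for Tika's Parser interface.
    static final String RESOURCE = "META-INF/services/org.apache.tika.parser.Parser";

    // Returns one URL per services file visible to the given class loader.
    static List<URL> find(ClassLoader loader) {
        try {
            return Collections.list(loader.getResources(RESOURCE));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Print each services file found on the application classpath.
        for (URL url : find(ClassLoader.getSystemClassLoader())) {
            System.out.println(url);
        }
    }
}
```

Each URL printed corresponds to one jar (or directory) contributing parsers.
With no parser jars on the classpath the list is simply empty.
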

My hunch is that most users won't care at all about the order in which the
parsers are asked "hey, can you handle this file type?". My second hunch is
that users who do care will typically only care about it for a handful of
formats, e.g. "for jpeg try ocr then image, everything else default is fine".
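
To make the "handful of formats" case concrete, here is one possible shape
for it. This is only a sketch of the idea, not a proposal for the actual API:
every name in it is hypothetical, parsers are stood in for by plain strings,
and it only models the ordering, not which parser can handle which type.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these names exist in Tika.
final class ParserOrderSketch {
    // Default order, e.g. the order in which parsers were discovered.
    private final List<String> defaultOrder;
    // Per-media-type overrides: parsers listed here are tried first, in this order.
    private final Map<String, List<String>> overrides = new LinkedHashMap<>();

    ParserOrderSketch(List<String> defaultOrder) {
        this.defaultOrder = new ArrayList<>(defaultOrder);
    }

    // Registers a preferred order for one media type and returns this for chaining.
    ParserOrderSketch prefer(String mediaType, String... parserNames) {
        overrides.put(mediaType, List.of(parserNames));
        return this;
    }

    // Preferred parsers first, then the remaining defaults in their usual order.
    List<String> orderFor(String mediaType) {
        List<String> result = new ArrayList<>(overrides.getOrDefault(mediaType, List.of()));
        for (String name : defaultOrder) {
            if (!result.contains(name)) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        ParserOrderSketch order = new ParserOrderSketch(List.of("image", "ocr", "pdf"))
                .prefer("image/jpeg", "ocr", "image");
        System.out.println(order.orderFor("image/jpeg"));      // [ocr, image, pdf]
        System.out.println(order.orderFor("application/pdf")); // [image, ocr, pdf]
    }
}
```

Users who don't care never call prefer() and get the default order for
everything; users who care about jpeg only say so for jpeg.
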

We also need to support those users who currently say "I don't care what you
find on the classpath, I only ever want you to use these 5 parsers, in the
explicit order I'm passing you now".

I can describe the problem, but I'm not sure of the right solution at this
point...

> Figure out how to add Image metadata extraction to Tesseract parser
> -------------------------------------------------------------------
>
> Key: TIKA-1445
> URL: https://issues.apache.org/jira/browse/TIKA-1445
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Chris A. Mattmann
> Assignee: Chris A. Mattmann
> Fix For: 1.8
>
> Attachments: TIKA-1445.Mattmann.101214.patch.txt,
> TIKA-1445.Palsulich.102614.patch, TIKA-1445_tallison_20141027.patch.txt,
> TIKA-1445_tallison_v2_20141027.patch, TIKA-1445_tallison_v3_20141027.patch
>
>
> Now that Tesseract is the default image parser in Tika for many image types,
> consider how to add back in the metadata extraction capabilities by the other
> Image parsers.