[
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14183873#comment-14183873
]
Tyler Palsulich commented on TIKA-1445:
---------------------------------------
I've been trying my hand at this for some time now. One idea I had was to write
the input InputStream to a temporary file, then open a new input stream from
that file for each Parser we want to run.
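A minimal sketch of the temporary-file idea, using only the JDK. The
{{StreamConsumer}} interface and the {{runEach}} method are hypothetical
stand-ins for "a Parser" and "run every Parser"; they are not Tika APIs, and a
real version would have to pass the Metadata, ContentHandler, and ParseContext
through as well.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

class SpoolAndReplay {

    /** Hypothetical stand-in for a Parser: reads a stream, returns some result. */
    interface StreamConsumer<T> {
        T consume(InputStream in) throws IOException;
    }

    /**
     * Copies the one-shot input stream to a temporary file, then hands each
     * consumer its own fresh stream opened on that file.
     */
    static <T> List<T> runEach(InputStream input, List<StreamConsumer<T>> consumers)
            throws IOException {
        Path tmp = Files.createTempFile("spool-", ".tmp");
        try {
            // createTempFile already created the file, so allow overwriting it.
            Files.copy(input, tmp, StandardCopyOption.REPLACE_EXISTING);
            List<T> results = new ArrayList<>();
            for (StreamConsumer<T> consumer : consumers) {
                // Each consumer starts reading from byte 0 of the spooled copy.
                try (InputStream fresh = Files.newInputStream(tmp)) {
                    results.add(consumer.consume(fresh));
                }
            }
            return results;
        } finally {
            // Remove the temporary file even if a consumer throws.
            Files.deleteIfExists(tmp);
        }
    }
}
```

The cost is one extra copy of the document on disk, in exchange for not
depending on mark/reset support in the original stream.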
But before this OCR Parser, we only ran one Parser on the image anyway. So,
what if there was a way to get the "second best" default parser for the image?
One option is to hard-code the exact Parsers that work. But, in my opinion, we
should load them dynamically. That would require getting a {{List<Parser>}}
for a given MediaType, instead of just the single "best" Parser that
{{CompositeParser.getParsers(ParseContext)}} gives us.
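To make the "second best" lookup concrete, here is a rough sketch of a
registry that keeps every candidate per media type instead of only the
winner. {{RankedRegistry}} and its methods are hypothetical names, not part
of Tika, and the ranking rule (the most recently registered parser wins) is an
assumption of this sketch. Plain strings stand in for Parser instances.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

class RankedRegistry<P> {

    // Media type -> candidates, best first.
    private final Map<String, List<P>> byType = new LinkedHashMap<>();

    /** Assumption: a later registration outranks earlier ones for the same type. */
    void register(String mediaType, P parser) {
        byType.computeIfAbsent(mediaType, k -> new ArrayList<>()).add(0, parser);
    }

    /** All candidates for the type, best first; empty if none are registered. */
    List<P> candidates(String mediaType) {
        List<P> found = byType.get(mediaType);
        return found == null
                ? Collections.<P>emptyList()
                : Collections.unmodifiableList(found);
    }

    /** The runner-up for the type, if at least two candidates are registered. */
    Optional<P> secondBest(String mediaType) {
        List<P> found = candidates(mediaType);
        return found.size() > 1 ? Optional.of(found.get(1)) : Optional.empty();
    }
}
```

With "JpegParser" registered first and "TesseractOCRParser" second for
{{image/jpeg}}, {{secondBest("image/jpeg")}} would hand back "JpegParser",
which is the parser we would want for metadata extraction.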
If we only chose the second-best Parser, we wouldn't have to merge the Metadata
results, since the OCRParser doesn't add Metadata. But that second Parser might
still call the ContentHandler.
> Figure out how to add Image metadata extraction to Tesseract parser
> -------------------------------------------------------------------
>
> Key: TIKA-1445
> URL: https://issues.apache.org/jira/browse/TIKA-1445
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Chris A. Mattmann
> Assignee: Chris A. Mattmann
> Fix For: 1.7
>
> Attachments: TIKA-1445.Mattmann.101214.patch.txt
>
>
> Now that Tesseract is the default image parser in Tika for many image types,
> consider how to add back in the metadata extraction capabilities by the other
> Image parsers.