[
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14216365#comment-14216365
]
Tim Allison commented on TIKA-1445:
-----------------------------------
Copied from dev discussion to record points on this issue. Will not duplicate
in future. Sorry!
On issue 1: The proposal is that we'd send a fresh Metadata object to each
parser and then combine the results into a new Metadata object via either
add or set. If we go this route, we'll lose the restrictions that Properties
may originally have carried (e.g. a single value, as with TikaCoreProperties.TITLE).
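
To make the add-versus-set trade-off concrete, here is a minimal sketch. It
deliberately uses plain java.util maps as a stand-in for Tika's Metadata
class, and the names MetadataMergeSketch, mergeWithAdd and mergeWithSet are
hypothetical, chosen only for illustration. It shows that "add" can leave a
nominally single-valued key such as a title with several values, while "set"
silently keeps only the last parser's value.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: plain maps stand in for per-parser Metadata objects.
class MetadataMergeSketch {

    // "add" semantics: keep every value reported by every parser.
    static Map<String, List<String>> mergeWithAdd(List<Map<String, List<String>>> perParser) {
        Map<String, List<String>> merged = new LinkedHashMap<>();
        for (Map<String, List<String>> metadata : perParser) {
            for (Map.Entry<String, List<String>> e : metadata.entrySet()) {
                merged.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                      .addAll(e.getValue());
            }
        }
        return merged;
    }

    // "set" semantics: a later parser's values replace earlier ones.
    static Map<String, List<String>> mergeWithSet(List<Map<String, List<String>>> perParser) {
        Map<String, List<String>> merged = new LinkedHashMap<>();
        for (Map<String, List<String>> metadata : perParser) {
            for (Map.Entry<String, List<String>> e : metadata.entrySet()) {
                merged.put(e.getKey(), new ArrayList<>(e.getValue()));
            }
        }
        return merged;
    }
}
```

Neither variant knows that a given key was meant to hold exactly one value,
which is the restriction that would be lost in the merge.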
On issue 2:
I think we're talking about different things. Yes, we'll definitely need to
reset or spool the stream depending on its length. My concern was more with
the handlers. If the first parser calls endDocument() and we don't shield
that call, then anyone using the BodyContentHandler might not see content
from the second or third parser, because the initial parser has already
"ended" the document. I still need to test this concern, but I think it was
the root cause of TIKA-1124.
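
A rough sketch of what "shielding" could look like, using only the SAX
classes that ship with the JDK. The class name DocumentBoundaryShield is
hypothetical and this is not taken from any of the attached patches. The
idea is that each parser gets a wrapper that forwards content events but
swallows startDocument() and endDocument(), and the caller emits those two
events on the real handler exactly once.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative wrapper: passes content through to a shared downstream
// handler, but hides document start/end events from it.
class DocumentBoundaryShield extends DefaultHandler {
    private final ContentHandler downstream;

    DocumentBoundaryShield(ContentHandler downstream) {
        this.downstream = downstream;
    }

    @Override
    public void startDocument() {
        // Swallowed: the caller starts the real document once.
    }

    @Override
    public void endDocument() {
        // Swallowed: the caller ends the real document once, after all parsers.
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        downstream.startElement(uri, localName, qName, atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        downstream.endElement(uri, localName, qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        downstream.characters(ch, start, length);
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) throws SAXException {
        downstream.ignorableWhitespace(ch, start, length);
    }
}
```

With this in place, the calling code would invoke startDocument() on the
real handler, hand a new DocumentBoundaryShield to each parser in turn, and
call endDocument() on the real handler only after the last parser returns.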
> Figure out how to add Image metadata extraction to Tesseract parser
> -------------------------------------------------------------------
>
> Key: TIKA-1445
> URL: https://issues.apache.org/jira/browse/TIKA-1445
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Chris A. Mattmann
> Assignee: Chris A. Mattmann
> Fix For: 1.8
>
> Attachments: TIKA-1445.Mattmann.101214.patch.txt,
> TIKA-1445.Palsulich.102614.patch, TIKA-1445_tallison_20141027.patch.txt,
> TIKA-1445_tallison_v2_20141027.patch, TIKA-1445_tallison_v3_20141027.patch
>
>
> Now that Tesseract is the default image parser in Tika for many image types,
> consider how to add back in the metadata extraction capabilities by the other
> Image parsers.