[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

Tim Allison (JIRA) Tue, 18 Nov 2014 08:17:56 -0800

    [ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14216365#comment-14216365
 ]


Tim Allison commented on TIKA-1445:
-----------------------------------

Copied from dev discussion to record points on this issue.  Will not duplicate 
in future.  Sorry!

On Issue 1: The proposal is that we'd send a fresh Metadata object to each 
parser and then combine the results into a new Metadata object via either 
add or set.  If we go this route, we'll lose any restrictions that the 
Properties originally carried (e.g. single-valued, as with 
TikaCoreProperties.TITLE).
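
To make the trade-off concrete, here is a minimal sketch of the two merge 
strategies.  It deliberately does not use Tika's Metadata class: a plain 
Map<String, List<String>> stands in for it, and the names 
MetadataMergeSketch, mergeWithAdd and mergeWithSet are hypothetical.  The 
point is only that an add-style merge can leave two values under a key that 
was meant to hold one, while a set-style merge silently keeps the last 
parser's value.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MetadataMergeSketch {

    // Hypothetical stand-in for a metadata object: key -> list of values.
    // "add" semantics: every value from every parser is appended.
    static Map<String, List<String>> mergeWithAdd(List<Map<String, List<String>>> perParser) {
        Map<String, List<String>> merged = new LinkedHashMap<>();
        for (Map<String, List<String>> m : perParser) {
            for (Map.Entry<String, List<String>> e : m.entrySet()) {
                merged.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                      .addAll(e.getValue());
            }
        }
        return merged;
    }

    // "set" semantics: a later parser's values replace earlier ones.
    static Map<String, List<String>> mergeWithSet(List<Map<String, List<String>>> perParser) {
        Map<String, List<String>> merged = new LinkedHashMap<>();
        for (Map<String, List<String>> m : perParser) {
            for (Map.Entry<String, List<String>> e : m.entrySet()) {
                merged.put(e.getKey(), new ArrayList<>(e.getValue()));
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Two parsers each report a value for a key intended to be single-valued.
        Map<String, List<String>> ocr = Map.of("title", List.of("Scanned page"));
        Map<String, List<String>> exif = Map.of("title", List.of("IMG_0001"));

        System.out.println(mergeWithAdd(List.of(ocr, exif)));
        System.out.println(mergeWithSet(List.of(ocr, exif)));
    }
}
```

Neither strategy knows that "title" was supposed to be single-valued, which 
is the restriction that gets lost in the merge.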

On Issue 2:
I think we're talking about different things.  Yes, we'll definitely need to 
reset or spool the stream depending on its length.  My concern was more with 
the handlers.  If the first parser calls endDocument() and we don't shield 
that call, someone using the BodyContentHandler might not see content from 
the second or third parser, because the first parser has already "ended" the 
document.  I still need to test this concern, but I think it was the root of 
TIKA-1124.
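
A rough sketch of the kind of shielding I mean, using only the JDK's SAX 
classes.  Everything here is illustrative: DocumentBoundaryShield, 
ClosingSink and fakeParse are hypothetical names, and ClosingSink merely 
simulates a handler that stops accepting content after endDocument().  
Whether BodyContentHandler really behaves that way is exactly the part that 
still needs testing.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

public class ShieldedHandlerSketch {

    // Forwards content events to the delegate but swallows the
    // startDocument()/endDocument() calls made by an individual parser.
    static class DocumentBoundaryShield extends XMLFilterImpl {
        DocumentBoundaryShield(ContentHandler delegate) {
            setContentHandler(delegate);
        }
        @Override public void startDocument() { }
        @Override public void endDocument() { }
    }

    // Simulated downstream handler: ignores characters after endDocument().
    static class ClosingSink extends DefaultHandler {
        final StringBuilder text = new StringBuilder();
        private boolean ended = false;

        @Override public void characters(char[] ch, int start, int length) {
            if (!ended) {
                text.append(ch, start, length);
            }
        }
        @Override public void endDocument() { ended = true; }
    }

    // Stand-in for a parser: opens the document, emits text, closes it.
    static void fakeParse(ContentHandler handler, String content) throws SAXException {
        handler.startDocument();
        handler.characters(content.toCharArray(), 0, content.length());
        handler.endDocument();
    }

    // Runs two fake parsers against one sink, with or without shielding.
    static String run(boolean shield) {
        ClosingSink sink = new ClosingSink();
        try {
            sink.startDocument();
            for (String content : new String[] {"first", " second"}) {
                ContentHandler h = shield ? new DocumentBoundaryShield(sink) : sink;
                fakeParse(h, content);
            }
            sink.endDocument(); // the outer caller ends the document exactly once
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return sink.text.toString();
    }

    public static void main(String[] args) {
        System.out.println(run(false)); // second parser's text is dropped
        System.out.println(run(true));  // both parsers' text is kept
    }
}
```

With the shield in place, only the outer caller decides when the document 
starts and ends, so each parser can be run against the same handler.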

> Figure out how to add Image metadata extraction to Tesseract parser
> -------------------------------------------------------------------
>
>                 Key: TIKA-1445
>                 URL: https://issues.apache.org/jira/browse/TIKA-1445
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Chris A. Mattmann
>            Assignee: Chris A. Mattmann
>             Fix For: 1.8
>
>         Attachments: TIKA-1445.Mattmann.101214.patch.txt, 
> TIKA-1445.Palsulich.102614.patch, TIKA-1445_tallison_20141027.patch.txt, 
> TIKA-1445_tallison_v2_20141027.patch, TIKA-1445_tallison_v3_20141027.patch
>
>
> Now that Tesseract is the default image parser in Tika for many image types, 
> consider how to add back in the metadata extraction capabilities by the other 
> Image parsers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
