[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

Tim Allison (JIRA) Mon, 17 Nov 2014 06:02:47 -0800

    [ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214668#comment-14214668
 ]


Tim Allison commented on TIKA-1445:
-----------------------------------

This might muddy results initially, but users could choose to turn off, or not 
load, parsers that they don't want.  It would be a significant change from what 
we're currently doing.

How will we handle:
1) Metadata:  What happens when two parsers both "set" a value in the Metadata 
object?  Will the second overwrite the value of the first?
2) Content:  How will we know when a document ends?  Would AutoDetectParser 
wrap the handler in an EndDocumentShieldingContentHandler and then call 
endDocument when done?
3) Will the user be able to parse the output from the handler to figure out 
which parser is responsible for which content?  Let's say a user wants to pull 
the electronic text out of a PDF _and_ render the page as an image and then run 
it through OCR.  Would we have something like <div parser="o.a.t.p.PDFParser"> 
or similar?
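
To make questions 2 and 3 concrete, here is a rough sketch of one way it could 
work.  It uses only the JDK's SAX classes, not Tika.  EndDocumentShield is a 
hypothetical stand-in for the shielding handler mentioned above, runParser is a 
made-up placeholder for a real parser, and "example.OcrParser" is an invented 
name.  The parser attribute on the div is only the idea floated in question 3, 
not an existing convention.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

class MultiParserSketch {

    /** Downstream handler: records events as a flat string and counts endDocument calls. */
    static class RecordingHandler extends DefaultHandler {
        final StringBuilder out = new StringBuilder();
        int endDocumentCalls = 0;

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            out.append('<').append(qName);
            for (int i = 0; i < atts.getLength(); i++) {
                out.append(' ').append(atts.getQName(i))
                   .append("=\"").append(atts.getValue(i)).append('"');
            }
            out.append('>');
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            out.append("</").append(qName).append('>');
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            out.append(ch, start, length);
        }

        @Override
        public void endDocument() {
            endDocumentCalls++;
        }
    }

    /** Hypothetical shield: forwards content events but swallows endDocument. */
    static class EndDocumentShield extends DefaultHandler {
        private final ContentHandler delegate;

        EndDocumentShield(ContentHandler delegate) {
            this.delegate = delegate;
        }

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts)
                throws SAXException {
            delegate.startElement(uri, local, qName, atts);
        }

        @Override
        public void endElement(String uri, String local, String qName) throws SAXException {
            delegate.endElement(uri, local, qName);
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            delegate.characters(ch, start, length);
        }

        @Override
        public void endDocument() {
            // Swallowed: an individual parser must not close the shared document.
        }

        /** Called once by the coordinator after every parser has run. */
        void reallyEndDocument() throws SAXException {
            delegate.endDocument();
        }
    }

    /** Placeholder parser: wraps its text in a div that names the parser, then ends "its" document. */
    static void runParser(String parserName, String text, ContentHandler handler)
            throws SAXException {
        AttributesImpl atts = new AttributesImpl();
        atts.addAttribute("", "parser", "parser", "CDATA", parserName);
        handler.startElement("", "div", "div", atts);
        handler.characters(text.toCharArray(), 0, text.length());
        handler.endElement("", "div", "div");
        handler.endDocument();
    }

    /** Coordinator: runs two parsers against one shielded handler, then ends the document once. */
    static RecordingHandler parseWithAll() {
        RecordingHandler sink = new RecordingHandler();
        EndDocumentShield shield = new EndDocumentShield(sink);
        try {
            runParser("o.a.t.p.PDFParser", "electronic text", shield);
            runParser("example.OcrParser", "page image text", shield);
            shield.reallyEndDocument();
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return sink;
    }

    public static void main(String[] args) {
        RecordingHandler sink = parseWithAll();
        System.out.println(sink.out);
        System.out.println("endDocument calls: " + sink.endDocumentCalls);
    }
}
```

In this sketch the downstream handler sees a single endDocument no matter how 
many parsers ran, and each parser's content sits in its own div, so a consumer 
could split the output by the parser attribute.  It does not address question 
1; the parsers here never touch a shared Metadata object.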

If we go this route, we'd want to make sure we don't have exact duplicate 
parsers (as we do now).

This sounds more complicated than having parent parsers know which children 
they control and how to control them, but it might make sense.

Aside from OCR, what other use cases do we have where we might want multiple 
parsers operating on the same doc type?

> Figure out how to add Image metadata extraction to Tesseract parser
> -------------------------------------------------------------------
>
>                 Key: TIKA-1445
>                 URL: https://issues.apache.org/jira/browse/TIKA-1445
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Chris A. Mattmann
>            Assignee: Chris A. Mattmann
>             Fix For: 1.8
>
>         Attachments: TIKA-1445.Mattmann.101214.patch.txt, 
> TIKA-1445.Palsulich.102614.patch, TIKA-1445_tallison_20141027.patch.txt, 
> TIKA-1445_tallison_v2_20141027.patch, TIKA-1445_tallison_v3_20141027.patch
>
>
> Now that Tesseract is the default image parser in Tika for many image types, 
> consider how to add back in the metadata extraction capabilities by the other 
> Image parsers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
