[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214668#comment-14214668 ]
Tim Allison commented on TIKA-1445:
-----------------------------------
This might muddy results initially, but users could turn off (or simply not
load) parsers that they don't want. It would be a significant change from what
we're currently doing.
How will we handle:
1) Metadata: What happens when two parsers both "set" the same value in the
Metadata object? Will the second overwrite the value of the first?
2) Content: How will we know when a document ends? Would AutoDetectParser
wrap the handler in an EndDocumentShieldingContentHandler and then call
endDocument itself when done?
3) Will the user be able to parse the output from the handler to figure out
which parser is responsible for which content? Say a user wants to pull the
electronic text out of a PDF _and_ render the page as an image and then run
it through OCR. Would we have something like <div parser="o.a.t.p.PDFParser">
or similar?
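
To make points 2 and 3 concrete, here is a rough sketch of what I have in
mind. It uses only the JDK's SAX classes, not Tika's. BoundaryShield,
ToyParser, Recorder and MultiParserSketch are all made-up names for
illustration, and the shield here swallows startDocument as well as
endDocument, which is my assumption and not necessarily what we'd want in
Tika. The coordinator owns the document boundaries and wraps each parser's
output in a div that names the parser.

```java
import java.util.Arrays;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical decorator: forwards all SAX events except document start/end. */
class BoundaryShield extends XMLFilterImpl {
    BoundaryShield(ContentHandler delegate) { setContentHandler(delegate); }
    @Override public void startDocument() { /* swallowed; the coordinator owns this */ }
    @Override public void endDocument() { /* swallowed; the coordinator owns this */ }
}

/** Hypothetical stand-in for a parser; not Tika's Parser interface. */
interface ToyParser {
    String name();
    void parse(ContentHandler handler) throws SAXException;
}

/** Records the events it receives as a flat string so the result is easy to inspect. */
class Recorder extends DefaultHandler {
    final StringBuilder out = new StringBuilder();
    @Override public void startDocument() { out.append("[start]"); }
    @Override public void endDocument() { out.append("[end]"); }
    @Override public void startElement(String uri, String local, String qName, Attributes atts) {
        out.append('<').append(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            out.append(' ').append(atts.getQName(i))
               .append("=\"").append(atts.getValue(i)).append('"');
        }
        out.append('>');
    }
    @Override public void endElement(String uri, String local, String qName) {
        out.append("</").append(qName).append('>');
    }
    @Override public void characters(char[] ch, int start, int length) {
        out.append(ch, start, length);
    }
}

class MultiParserSketch {
    /** Runs every parser against one target, emitting a single start/end pair. */
    static void runAll(List<ToyParser> parsers, ContentHandler target) throws SAXException {
        ContentHandler shielded = new BoundaryShield(target);
        target.startDocument();
        for (ToyParser p : parsers) {
            // Tag each parser's contribution so a consumer can tell them apart.
            AttributesImpl atts = new AttributesImpl();
            atts.addAttribute("", "parser", "parser", "CDATA", p.name());
            target.startElement("", "div", "div", atts);
            p.parse(shielded);
            target.endElement("", "div", "div");
        }
        target.endDocument();
    }

    /** A toy parser that behaves as if it owned the whole document. */
    static ToyParser toy(final String name, final String text) {
        return new ToyParser() {
            public String name() { return name; }
            public void parse(ContentHandler h) throws SAXException {
                h.startDocument();
                h.characters(text.toCharArray(), 0, text.length());
                h.endDocument();
            }
        };
    }

    static String demo() {
        Recorder rec = new Recorder();
        try {
            runAll(Arrays.asList(toy("pdf-text", "electronic text"),
                                 toy("ocr", "ocr text")), rec);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return rec.out.toString();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

With something like this, the downstream handler sees exactly one document
start and one document end no matter how many parsers ran, and the parser
attribute on each div answers the "who produced this?" question. It does
nothing for the Metadata question in point 1.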
If we go this route, we'd want to make sure we don't have literally duplicate
parsers (as we do now).
This sounds more complicated than having parent parsers know which children
they control and how to control them, but it might make sense.
Aside from OCR, what other use cases do we have where we might want multiple
parsers operating on the same doc type?
> Figure out how to add Image metadata extraction to Tesseract parser
> -------------------------------------------------------------------
>
> Key: TIKA-1445
> URL: https://issues.apache.org/jira/browse/TIKA-1445
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Chris A. Mattmann
> Assignee: Chris A. Mattmann
> Fix For: 1.8
>
> Attachments: TIKA-1445.Mattmann.101214.patch.txt,
> TIKA-1445.Palsulich.102614.patch, TIKA-1445_tallison_20141027.patch.txt,
> TIKA-1445_tallison_v2_20141027.patch, TIKA-1445_tallison_v3_20141027.patch
>
>
> Now that Tesseract is the default image parser in Tika for many image types,
> consider how to add back in the metadata extraction capabilities by the other
> Image parsers.