[
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14217407#comment-14217407
]
Lewis John McGibbney commented on TIKA-1445:
--------------------------------------------
OK, so in Any23, take the example of a *single document extraction*, e.g. (0).
For any given document, when we run the extraction (1) we:
* from all registered extractors, filter the extractors by MimeType (2)
* from all matching extractors for the given MimeType, create each extractor (3)
* loop through the matching extractors and actually run (4) each extractor on
the local document source as an InputStream (5) for instance.
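
To make the three steps above concrete, here is a minimal Java sketch of the
filter / create / run chain. Every name in it (ChainedExtractionSketch,
ExtractorFactory, Extractor, factory, run) is hypothetical and chosen for
illustration; it is not the Any23 or the Tika API, and the real
SingleDocumentExtraction does considerably more.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only: none of these types exist in Any23 or Tika.
final class ChainedExtractionSketch {

    /** Something that reads a document stream and returns a short result. */
    interface Extractor {
        String extract(InputStream in) throws IOException;
    }

    /** A registered factory: knows which MIME types it handles and creates its extractor. */
    interface ExtractorFactory {
        String name();
        boolean supports(String mimeType);
        Extractor create();
    }

    /** Convenience helper that builds a factory matching on a MIME type prefix. */
    static ExtractorFactory factory(String name, String mimePrefix, Extractor extractor) {
        return new ExtractorFactory() {
            public String name() { return name; }
            public boolean supports(String mimeType) { return mimeType.startsWith(mimePrefix); }
            public Extractor create() { return extractor; }
        };
    }

    /** Runs every matching extractor over the same document bytes, in registration order. */
    static Map<String, String> run(List<ExtractorFactory> registered,
                                   String mimeType,
                                   byte[] document) throws IOException {
        // Step 1: from all registered extractors, keep those matching the MIME type.
        List<ExtractorFactory> matching = new ArrayList<>();
        for (ExtractorFactory f : registered) {
            if (f.supports(mimeType)) {
                matching.add(f);
            }
        }
        Map<String, String> results = new LinkedHashMap<>();
        for (ExtractorFactory f : matching) {
            // Step 2: create the extractor for each matching factory.
            Extractor extractor = f.create();
            // Step 3: run it on the local document source, opened as a fresh InputStream.
            try (InputStream in = new ByteArrayInputStream(document)) {
                results.put(f.name(), extractor.extract(in));
            }
        }
        return results;
    }

    public static void main(String[] args) throws IOException {
        List<ExtractorFactory> registered = Arrays.asList(
                factory("ocr-text", "image/", in -> "bytes=" + in.available()),
                factory("image-metadata", "image/", in -> "first-byte=" + in.read()),
                factory("html-title", "text/html", in -> "unused"));
        System.out.println(run(registered, "image/png", new byte[] {1, 2, 3}));
    }
}
```

The point relevant to this issue is that each matching extractor gets its own
fresh stream over the same document, so an OCR extractor and a metadata
extractor for the same image type can both run without one consuming the input
of the other.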
We also have Extraction Context and Extraction Reporting layers within Any23
which may be of use to Tika. To be honest, I find the report and context
objects extremely useful for obtaining metrics from extraction... maybe we
could do the same for Tika?
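
As a rough illustration of what such context and report objects could look
like, here is a small Java sketch. The class names, fields and methods
(ExtractionReportSketch, Context, Report, recordRun, recordIssue) are
assumptions made up for this example; they are not the actual Any23 classes
and nothing like this exists in Tika today.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch only: not the Any23 ExtractionContext or report classes.
final class ExtractionReportSketch {

    /** Identifies one extractor run: which extractor, on which document. */
    static final class Context {
        final String extractorName;
        final String documentUri;

        Context(String extractorName, String documentUri) {
            this.extractorName = extractorName;
            this.documentUri = documentUri;
        }
    }

    /** Accumulates simple metrics across all extractor runs for one document. */
    static final class Report {
        private final List<String> extractorsRun = new ArrayList<>();
        private final List<String> issues = new ArrayList<>();

        /** Records that the extractor named in the context was run. */
        void recordRun(Context ctx) {
            extractorsRun.add(ctx.extractorName);
        }

        /** Records a problem, tagged with the extractor and document it came from. */
        void recordIssue(Context ctx, String message) {
            issues.add(ctx.extractorName + " @ " + ctx.documentUri + ": " + message);
        }

        List<String> extractorsRun() {
            return Collections.unmodifiableList(extractorsRun);
        }

        List<String> issues() {
            return Collections.unmodifiableList(issues);
        }
    }

    public static void main(String[] args) {
        Report report = new Report();
        Context ocr = new Context("ocr-text", "file:/tmp/scan.png");
        Context meta = new Context("image-metadata", "file:/tmp/scan.png");
        report.recordRun(ocr);
        report.recordRun(meta);
        report.recordIssue(meta, "no EXIF block found");
        System.out.println(report.extractorsRun().size() + " extractors run, "
                + report.issues().size() + " issue(s)");
    }
}
```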
There are some improvements which can be made to SingleDocumentExtraction
within Any23; however, that conversation is not relevant here. Hopefully the high
level overview of the chaining extraction algorithm within Any23 is of some
value to this conversation.
(0)
https://github.com/apache/any23/blob/master/core/src/main/java/org/apache/any23/extractor/SingleDocumentExtraction.java
(1)
https://github.com/apache/any23/blob/master/core/src/main/java/org/apache/any23/extractor/SingleDocumentExtraction.java#L205
(2)
https://github.com/apache/any23/blob/master/core/src/main/java/org/apache/any23/extractor/SingleDocumentExtraction.java#L223
(3)
https://github.com/apache/any23/blob/master/core/src/main/java/org/apache/any23/extractor/SingleDocumentExtraction.java#L252
(4)
https://github.com/apache/any23/blob/master/core/src/main/java/org/apache/any23/extractor/SingleDocumentExtraction.java#L440
(5)
https://github.com/apache/any23/blob/master/core/src/main/java/org/apache/any23/extractor/SingleDocumentExtraction.java#L465
> Figure out how to add Image metadata extraction to Tesseract parser
> -------------------------------------------------------------------
>
> Key: TIKA-1445
> URL: https://issues.apache.org/jira/browse/TIKA-1445
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Chris A. Mattmann
> Assignee: Chris A. Mattmann
> Fix For: 1.8
>
> Attachments: TIKA-1445.Mattmann.101214.patch.txt,
> TIKA-1445.Palsulich.102614.patch, TIKA-1445_tallison_20141027.patch.txt,
> TIKA-1445_tallison_v2_20141027.patch, TIKA-1445_tallison_v3_20141027.patch
>
>
> Now that Tesseract is the default image parser in Tika for many image types,
> consider how to add back in the metadata extraction capabilities by the other
> Image parsers.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)