[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

Lewis John McGibbney (JIRA) Tue, 18 Nov 2014 20:38:51 -0800

    [ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14217407#comment-14217407
 ]


Lewis John McGibbney commented on TIKA-1445:
--------------------------------------------

OK, so in Any23, take the following example, which focuses on a *single 
document extraction* (0). For any given document, when we run the extraction 
(1) we:
 * filter all registered extractors by MimeType (2) 
 * create an extractor for each one that matches the given MimeType (3)
 * loop through the matching extractors and actually run (4) each one on the 
local document source, for instance as an InputStream (5).
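
To make the three steps above concrete, here is a minimal, self-contained 
sketch of the filter / create / run chain. It is not the Any23 code: the 
Extractor and ExtractorFactory types, the byteCounter helper and the extractor 
names are all hypothetical and exist only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

public class ChainedExtractionSketch {

    /** Hypothetical extractor contract; NOT the Any23 Extractor interface. */
    interface Extractor {
        void run(InputStream in, List<String> results) throws IOException;
    }

    /** Hypothetical factory: declares supported MIME types, creates extractors. */
    static final class ExtractorFactory {
        final List<String> mimeTypes;
        final Supplier<Extractor> creator;

        ExtractorFactory(List<String> mimeTypes, Supplier<Extractor> creator) {
            this.mimeTypes = mimeTypes;
            this.creator = creator;
        }
    }

    /** Toy extractor that only counts the bytes it reads from the stream. */
    static Extractor byteCounter(final String name) {
        return new Extractor() {
            @Override
            public void run(InputStream in, List<String> results) throws IOException {
                int count = 0;
                while (in.read() != -1) {
                    count++;
                }
                results.add(name + ":" + count);
            }
        };
    }

    static List<String> extract(List<ExtractorFactory> registered,
                                String mimeType, byte[] document) {
        // Step 1: from all registered extractors, keep those matching the MIME type.
        List<ExtractorFactory> matching = new ArrayList<>();
        for (ExtractorFactory factory : registered) {
            if (factory.mimeTypes.contains(mimeType)) {
                matching.add(factory);
            }
        }
        // Step 2: create an extractor instance from each matching factory.
        List<Extractor> extractors = new ArrayList<>();
        for (ExtractorFactory factory : matching) {
            extractors.add(factory.creator.get());
        }
        // Step 3: run every extractor on its own InputStream over the local document.
        List<String> results = new ArrayList<>();
        for (Extractor extractor : extractors) {
            try (InputStream in = new ByteArrayInputStream(document)) {
                extractor.run(in, results);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return results;
    }

    /** Runs the chain on a three-byte dummy document declared as image/png. */
    static List<String> demo() {
        List<ExtractorFactory> registered = Arrays.asList(
            new ExtractorFactory(Arrays.asList("image/png", "image/jpeg"),
                                 () -> byteCounter("ocr")),
            new ExtractorFactory(Arrays.asList("image/png"),
                                 () -> byteCounter("image-metadata")),
            new ExtractorFactory(Arrays.asList("text/html"),
                                 () -> byteCounter("html")));
        byte[] document = "abc".getBytes(StandardCharsets.US_ASCII);
        return extract(registered, "image/png", document);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The point for this issue is that every extractor matching the MimeType runs, 
so an OCR step and a metadata step can both contribute for the same image 
rather than one replacing the other.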

We also have Extraction Context and Extraction Reporting layers within Any23 
which may be of use to Tika. To be honest, I find the report and context 
objects extremely useful for obtaining metrics from extraction. Maybe we 
could do the same for Tika?
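
As a rough illustration of what report and context objects of this kind could 
carry, here is a small sketch. The Context and Report classes below are 
hypothetical and far simpler than anything in Any23; they only show the idea 
of recording, per extractor and per document, what each run produced.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ExtractionReportSketch {

    /** Hypothetical context: which extractor ran on which document. */
    static final class Context {
        final String extractorName;
        final String documentId;

        Context(String extractorName, String documentId) {
            this.extractorName = extractorName;
            this.documentId = documentId;
        }
    }

    /** Hypothetical report: accumulates simple metrics across extractor runs. */
    static final class Report {
        private final List<String> lines = new ArrayList<>();
        private int totalItems;
        private int failures;

        void record(Context ctx, int items, boolean failed) {
            if (failed) {
                failures++;
            } else {
                totalItems += items;
            }
            lines.add(ctx.documentId + " " + ctx.extractorName
                    + " items=" + items + " failed=" + failed);
        }

        int totalItems() { return totalItems; }

        int failures() { return failures; }

        List<String> lines() { return Collections.unmodifiableList(lines); }
    }

    /** Records two made-up extractor runs against the same document. */
    static Report demo() {
        Report report = new Report();
        report.record(new Context("ocr", "doc-1"), 12, false);
        report.record(new Context("image-metadata", "doc-1"), 0, true);
        return report;
    }

    public static void main(String[] args) {
        Report report = demo();
        for (String line : report.lines()) {
            System.out.println(line);
        }
        System.out.println("items=" + report.totalItems()
                + " failures=" + report.failures());
    }
}
```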

There are some improvements which can be made to SingleDocumentExtraction 
within Any23; however, that conversation is not relevant here. Hopefully this 
high-level overview of the chained extraction algorithm within Any23 is of 
some value to this conversation.

(0) 
https://github.com/apache/any23/blob/master/core/src/main/java/org/apache/any23/extractor/SingleDocumentExtraction.java
(1) 
https://github.com/apache/any23/blob/master/core/src/main/java/org/apache/any23/extractor/SingleDocumentExtraction.java#L205
(2) 
https://github.com/apache/any23/blob/master/core/src/main/java/org/apache/any23/extractor/SingleDocumentExtraction.java#L223
(3) 
https://github.com/apache/any23/blob/master/core/src/main/java/org/apache/any23/extractor/SingleDocumentExtraction.java#L252
(4) 
https://github.com/apache/any23/blob/master/core/src/main/java/org/apache/any23/extractor/SingleDocumentExtraction.java#L440
(5) 
https://github.com/apache/any23/blob/master/core/src/main/java/org/apache/any23/extractor/SingleDocumentExtraction.java#L465

> Figure out how to add Image metadata extraction to Tesseract parser
> -------------------------------------------------------------------
>
>                 Key: TIKA-1445
>                 URL: https://issues.apache.org/jira/browse/TIKA-1445
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Chris A. Mattmann
>            Assignee: Chris A. Mattmann
>             Fix For: 1.8
>
>         Attachments: TIKA-1445.Mattmann.101214.patch.txt, 
> TIKA-1445.Palsulich.102614.patch, TIKA-1445_tallison_20141027.patch.txt, 
> TIKA-1445_tallison_v2_20141027.patch, TIKA-1445_tallison_v3_20141027.patch
>
>
> Now that Tesseract is the default image parser in Tika for many image types, 
> consider how to add back in the metadata extraction capabilities by the other 
> Image parsers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
