[jira] [Comment Edited] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

Tim Allison (JIRA) Tue, 28 Oct 2014 04:09:07 -0700

    [ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14186696#comment-14186696 ]


Tim Allison edited comment on TIKA-1445 at 10/28/14 11:08 AM:
--------------------------------------------------------------

On further thought...I won't have time to sketch this out until tonight or 
tomorrow...

It might make sense to get rid of the AbstractTerminalMetadataParser class, and 
have AbstractOCRParser load the image metadata parsers from a services file; we 
could then remove the image metadata parsers from the Parser services list.  
For those without Tesseract installed, the TesseractOCRParser would be a 
pass-through to the old behavior (no copying of streams, just classic metadata 
parsing); for those with it installed, TesseractOCRParser would copy the stream 
and do a double pass, once for the metadata and once for the OCR (as in Tyler's 
patch).

This solution would remove our reliance on the reverse alphabetic sort order 
of parser class names to pick oat.parser.ocr.TesseractOCRParser as the "best" 
parser for .gif, .jpeg, etc.  Of course, we would still be relying on that 
order to pick TesseractOCRParser over GDAL for .png files...
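
To make the pass-through vs. double-pass idea above a bit more concrete, here is 
a rough, self-contained sketch. It does not use any Tika classes: 
MetadataExtractor, OcrEngine, Result, byteCounter() and fakeOcr() are all 
hypothetical names invented for illustration, and buffering the whole stream in 
memory is just one assumed way of "copying the stream", not necessarily what 
the patches do.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class DoublePassSketch {

    /** Hypothetical stand-in for an image metadata parser; not a Tika type. */
    public interface MetadataExtractor {
        void extract(InputStream in, Map<String, String> metadata) throws IOException;
    }

    /** Hypothetical stand-in for the OCR step (Tesseract in the proposal). */
    public interface OcrEngine {
        boolean isAvailable();
        String recognize(InputStream in) throws IOException;
    }

    /** Holds what the two passes produce. */
    public static final class Result {
        public final Map<String, String> metadata = new LinkedHashMap<>();
        public String text = "";
    }

    public static Result parse(InputStream in, MetadataExtractor extractor, OcrEngine ocr)
            throws IOException {
        Result result = new Result();
        if (!ocr.isAvailable()) {
            // Pass-through: no copy of the stream, metadata parsing only.
            extractor.extract(in, result.metadata);
            return result;
        }
        // Double pass: copy the stream once, then read the copy twice.
        byte[] copy = readAll(in);
        extractor.extract(new ByteArrayInputStream(copy), result.metadata);
        result.text = ocr.recognize(new ByteArrayInputStream(copy));
        return result;
    }

    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    /** Toy extractor: records how many bytes it saw. */
    public static MetadataExtractor byteCounter() {
        return (in, metadata) -> metadata.put("byteCount", String.valueOf(readAll(in).length));
    }

    /** Toy OCR engine: "recognizes" the bytes as UTF-8 text. */
    public static OcrEngine fakeOcr(final boolean available) {
        return new OcrEngine() {
            @Override public boolean isAvailable() { return available; }
            @Override public String recognize(InputStream in) throws IOException {
                return new String(readAll(in), StandardCharsets.UTF_8);
            }
        };
    }

    public static void main(String[] args) throws IOException {
        byte[] image = "pretend image bytes".getBytes(StandardCharsets.UTF_8);
        Result with = parse(new ByteArrayInputStream(image), byteCounter(), fakeOcr(true));
        Result without = parse(new ByteArrayInputStream(image), byteCounter(), fakeOcr(false));
        System.out.println("OCR available:   " + with.metadata + " text=\"" + with.text + "\"");
        System.out.println("OCR unavailable: " + without.metadata + " text=\"" + without.text + "\"");
    }
}
```

In a real implementation the extractor would presumably be looked up per media 
type from the services file rather than passed in, and a temp file might be a 
better copy target than a byte array for large images.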


was (Author: talli...@mitre.org):
On further thought...I won't have time to sketch this out until tonight or 
tomorrow...

It might make sense to get rid of the AbstractTerminalMetadataParser class, and 
have AbstractOCRParser load the image metadata parsers from a services file; we 
could then remove the image metadata parsers from the Parser services list.  
For those without Tesseract installed, the TesseractOCRParser would be a 
pass-through to the old behavior (no copying of streams, just classic metadata 
parsing); for those with it installed, TesseractOCRParser would copy the stream 
and do a double pass, once for the metadata and once for the OCR (as in Tyler's 
patch).

This solution would get us out of the reliance on alphabetic sort order of 
parser class names to pick the oat.ocr.TesseractOCRParser as "best" parser for 
.gif, .jpeg, etc.  Of course, we're still relying on that order to pick 
TesseractOCRParser over GDAL for .png files...

> Figure out how to add Image metadata extraction to Tesseract parser
> -------------------------------------------------------------------
>
>                 Key: TIKA-1445
>                 URL: https://issues.apache.org/jira/browse/TIKA-1445
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Chris A. Mattmann
>            Assignee: Chris A. Mattmann
>             Fix For: 1.8
>
>         Attachments: TIKA-1445.Mattmann.101214.patch.txt, 
> TIKA-1445.Palsulich.102614.patch, TIKA-1445_tallison_20141027.patch.txt, 
> TIKA-1445_tallison_v2_20141027.patch
>
>
> Now that Tesseract is the default image parser in Tika for many image types, 
> consider how to add back in the metadata extraction capabilities by the other 
> Image parsers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
