[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14186200#comment-14186200 ]

Tim Allison edited comment on TIKA-1445 at 10/28/14 1:42 AM:
-------------------------------------------------------------

This is more invasive than I'd like, it does not solve all of the problems, and 
there are still some important printlns in there.

I'm sure this was part of the plan in the integration, but it seems a bit on 
the side of dark magic that the AutoDetectParser selects the Tesseract parser 
for image files only because its full class name sorts after 
oat.image.ImageParser, etc.  I'd prefer the solution recommended above (if we 
thoroughly document it), where we explicitly pick the last (or first?) parser 
in the list, but I don't think that is currently happening.
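
To make the ordering issue concrete, here is a minimal sketch of a 
last-one-wins type map.  This is not Tika's actual internals, just an 
illustration; the MiniParser interface and all of the names in it are made up 
for the example.

{code:java}
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class LastOneWinsSketch {

    // Stand-in for the real Parser interface, just enough to show the issue.
    interface MiniParser {
        Set<String> getSupportedTypes();
    }

    // When two parsers claim the same media type, whichever comes later in the
    // list silently replaces the earlier one in the type-to-parser map.
    static Map<String, MiniParser> buildTypeMap(List<MiniParser> parsers) {
        Map<String, MiniParser> map = new HashMap<String, MiniParser>();
        for (MiniParser p : parsers) {
            for (String type : p.getSupportedTypes()) {
                map.put(type, p);
            }
        }
        return map;
    }

    public static void main(String[] args) {
        MiniParser imageParser = () -> new HashSet<>(Arrays.asList("image/png", "image/jpeg"));
        MiniParser tesseract = () -> new HashSet<>(Arrays.asList("image/png", "image/jpeg"));

        // This ordering is what sorting on full class name currently produces;
        // reverse the list and the plain image parser wins instead.
        Map<String, MiniParser> map = buildTypeMap(Arrays.asList(imageParser, tesseract));
        System.out.println(map.get("image/png") == tesseract);  // prints "true"
    }
}
{code}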

Am I understanding this correctly?  Do we want to take away some of the magic?

I added an AbstractTerminalImageMetadataParser so that we can gather together 
all of the classes that parse only the metadata of images.  This allows the 
OCR parser to walk all of the parsers and pick out just those that are not 
composite but do parse image metadata.  Perhaps we should remove the parsers 
that extend this class from the AutoDetectParser?  Still a bit dark...
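
Roughly what I have in mind, heavily simplified.  The marker class name is the 
one from my patch, but the finder helper below and its shape are only 
illustrative, not what the patch literally does.

{code:java}
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

import org.apache.tika.parser.AbstractParser;
import org.apache.tika.parser.CompositeParser;
import org.apache.tika.parser.Parser;

// Marker base class: "terminal" image-metadata parsers extend this so they can
// be discovered without hard-coding their class names anywhere.
abstract class AbstractTerminalImageMetadataParser extends AbstractParser {
}

class TerminalImageParserFinder {

    // Walk a composite parser's type map and keep only the non-composite
    // parsers that exist purely to extract image metadata.
    static List<Parser> findImageMetadataParsers(CompositeParser composite) {
        Set<Parser> found = new LinkedHashSet<Parser>();
        for (Parser p : composite.getParsers().values()) {
            if (!(p instanceof CompositeParser)
                    && p instanceof AbstractTerminalImageMetadataParser) {
                found.add(p);
            }
        }
        return new ArrayList<Parser>(found);
    }
}
{code}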

I think our tests should not add the TesseractOCRParser to the ParseContext as 
the parser for embedded documents.  It would be far better to pass in an 
AutoDetectParser so that the TesseractOCRParser operates on all embedded 
images, no matter the depth.
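
In other words, something like the standard recursive-parsing idiom below in 
the test setup (the test resource name is just a placeholder):

{code:java}
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;

public class EmbeddedImageOcrExample {

    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        ParseContext context = new ParseContext();
        // Register the AutoDetectParser itself for embedded documents so that
        // images nested at any depth are routed back through auto-detection
        // (and from there to Tesseract), rather than to a fixed OCR parser.
        context.set(Parser.class, parser);

        Metadata metadata = new Metadata();
        BodyContentHandler handler = new BodyContentHandler(-1);
        try (InputStream stream =
                EmbeddedImageOcrExample.class.getResourceAsStream("/test.docx")) {
            parser.parse(stream, handler, metadata, context);
        }
        System.out.println(handler.toString());
    }
}
{code}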

This patch is not a solution, only some thoughts.


> Figure out how to add Image metadata extraction to Tesseract parser
> -------------------------------------------------------------------
>
>                 Key: TIKA-1445
>                 URL: https://issues.apache.org/jira/browse/TIKA-1445
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Chris A. Mattmann
>            Assignee: Chris A. Mattmann
>             Fix For: 1.8
>
>         Attachments: TIKA-1445.Mattmann.101214.patch.txt, 
> TIKA-1445.Palsulich.102614.patch, TIKA-1445_tallison_20141027.patch.txt, 
> TIKA-1445_tallison_v2_20141027.patch
>
>
> Now that Tesseract is the default image parser in Tika for many image types, 
> consider how to add back the metadata extraction capabilities provided by the 
> other image parsers.



