[ 
https://issues.apache.org/jira/browse/TIKA-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17262931#comment-17262931
 ] 

Tim Allison commented on TIKA-3266:
-----------------------------------

Merged this into main just now.  Rebuilding from a fresh pull before closing 
this.

This was a fairly substantial refactoring.  The good news is that the pdf 
module no longer depends on the image module or the ocr module.  Users will get 
those together in tika-classic-parser.  But people who really, really only want 
PDF text extraction can have that now.

More importantly, we no longer hard-code Tesseract as _the_ OCR parser.  
Users can now add their own OCR parser, and it will be executed by the image 
parsers and the PDFParser.

The trick, for now, is that by default, the OCR parser is called on mime types 
like "image/ocr-jpeg".  The ImageParser now runs metadata extraction and then 
checks to see if there's a parser available for "image/ocr-x" (where x is the 
image subtype, e.g. jpeg).  If there is, it executes that parser.
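
A minimal sketch of the lookup described above, to make the order of operations concrete.  It does not use any Tika classes: ToyParser, ToyRegistry, toOcrType and parseImage are hypothetical names invented for illustration, and only the "image/ocr-x" naming rule comes from this comment.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class OcrDispatchSketch {

    /** Hypothetical stand-in for a parser; this is NOT Tika's Parser interface. */
    interface ToyParser {
        String parse(String input);
    }

    /** Hypothetical registry that maps a mime type string to a parser. */
    static final class ToyRegistry {
        private final Map<String, ToyParser> parsers = new HashMap<>();

        void register(String mimeType, ToyParser parser) {
            parsers.put(mimeType, parser);
        }

        /** Returns the registered parser, or null if none handles the type. */
        ToyParser lookup(String mimeType) {
            return parsers.get(mimeType);
        }
    }

    /** Maps e.g. "image/jpeg" to "image/ocr-jpeg". */
    static String toOcrType(String mimeType) {
        String prefix = "image/";
        if (!mimeType.startsWith(prefix)) {
            throw new IllegalArgumentException("not an image type: " + mimeType);
        }
        return prefix + "ocr-" + mimeType.substring(prefix.length());
    }

    /** Records the metadata step first, then runs OCR only if a parser is registered. */
    static List<String> parseImage(String mimeType, String input, ToyRegistry registry) {
        List<String> steps = new ArrayList<>();
        steps.add("metadata:" + mimeType);
        ToyParser ocr = registry.lookup(toOcrType(mimeType));
        if (ocr != null) {
            steps.add(ocr.parse(input));
        }
        return steps;
    }

    public static void main(String[] args) {
        ToyRegistry registry = new ToyRegistry();
        // Only an OCR parser for JPEG is registered in this toy setup.
        registry.register("image/ocr-jpeg", input -> "ocr:" + input);

        System.out.println(parseImage("image/jpeg", "scan-1", registry));
        System.out.println(parseImage("image/png", "scan-2", registry));
    }
}
```

With no parser registered for "image/ocr-png", the PNG call only performs the metadata step, which is the behavior you would expect when no OCR parser is on the classpath.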

Rather than have the TesseractOCRParser run an internal ImageParser, the image 
parsers and the PDFParser now use this workaround.

Users who want to run _just_ tesseract without the wrapping of the ImageParsers 
can decorate the TesseractOCRParser so that it handles, e.g., "image/jpeg" 
directly.
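
The comment does not spell out how that decoration is done, so the following is only a toy illustration of the idea: wrap a parser so that it advertises a different set of mime types while delegating the actual work.  ToyParser, ToyOcrParser and withTypes are hypothetical names for this sketch and are not Tika classes.

```java
import java.util.Collections;
import java.util.Set;

public class TypeDecoratorSketch {

    /** Hypothetical stand-in for a parser; this is NOT Tika's Parser interface. */
    interface ToyParser {
        Set<String> supportedTypes();
        String parse(String input);
    }

    /** Toy OCR parser that, by default, only claims the "image/ocr-jpeg" type. */
    static final class ToyOcrParser implements ToyParser {
        @Override
        public Set<String> supportedTypes() {
            return Collections.singleton("image/ocr-jpeg");
        }

        @Override
        public String parse(String input) {
            return "ocr:" + input;
        }
    }

    /** Wraps a parser so it reports the given types but delegates the parsing. */
    static ToyParser withTypes(ToyParser delegate, Set<String> types) {
        return new ToyParser() {
            @Override
            public Set<String> supportedTypes() {
                return types;
            }

            @Override
            public String parse(String input) {
                return delegate.parse(input);
            }
        };
    }

    public static void main(String[] args) {
        ToyParser plain = new ToyOcrParser();
        // Decorated so it would be selected for "image/jpeg" directly.
        ToyParser decorated = withTypes(plain, Collections.singleton("image/jpeg"));

        System.out.println(plain.supportedTypes());
        System.out.println(decorated.supportedTypes());
        System.out.println(decorated.parse("scan-1"));
    }
}
```

In this toy model the decorated parser is picked for "image/jpeg" itself, so no image parser runs first and no metadata step precedes the OCR call.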

 

> Generalize OCRParser so that users can service load custom ocr parsers
> ----------------------------------------------------------------------
>
>                 Key: TIKA-3266
>                 URL: https://issues.apache.org/jira/browse/TIKA-3266
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Major
>
> For Tika 2.0.0, it would be great if we could allow users to add custom OCR 
> parsers.  We've hardcoded Tesseract a bit much...


