[jira] [Commented] (TIKA-3258) Run OCR on PDFs with 'auto' mode as default in Tika 2.0.0

Hudson (Jira) Wed, 13 Jan 2021 13:01:05 -0800


    [ https://issues.apache.org/jira/browse/TIKA-3258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17264425#comment-17264425 ]


Hudson commented on TIKA-3258:
------------------------------

SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk8 #125 (See [https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/125/])
TIKA-3258 -- in Tika 2.0.0, the default for OCR'ing of PDFs is 'auto' (tallison: [https://github.com/apache/tika/commit/4bd897a7df772c208fb6918b6b7559e37e5ec3b9])
* (edit) tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-pdf-module/src/main/resources/org/apache/tika/parser/pdf/PDFParser.properties
* (edit) tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParserConfig.java
* (edit) tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-ocr-module/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
* (edit) tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/OCR2XHTML.java
* (edit) tika-parsers/tika-parsers-classic/tika-parsers-classic-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/AbstractPDF2XHTML.java
* (edit) tika-parsers/tika-parsers-classic/tika-parsers-classic-package/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
* (edit) CHANGES.txt


> Run OCR on PDFs with 'auto' mode as default in Tika 2.0.0
> ---------------------------------------------------------
>
>                 Key: TIKA-3258
>                 URL: https://issues.apache.org/jira/browse/TIKA-3258
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Major
>
> In Tika 1.x, users currently have to configure OCR of PDFs explicitly, 
> which is fiddly; it doesn't just work out of the box.  We did this 
> initially because of concerns (well, the reality) about crazy resource 
> consumption for some PDFs, which can have thousands of images per page that 
> are stitched together to make a reasonable composite.
> Since then, we've made two changes.  First, we've added option 2, which 
> renders each page and then runs OCR on that composite image rather than on 
> each inline image, so we call tesseract only once per page.  Second, we've 
> added an 'auto' mode that runs OCR only on pages that didn't have much text 
> extracted.  While there is plenty of room for improvement in the 'auto' 
> heuristic, I think we should move to running OCR automatically on PDFs by 
> default in 2.0.0.
> Under this proposal, users will have to disable OCR if they have tesseract 
> installed but don't want to run it on PDFs.
> This will be a breaking change, and we'll make sure to document it early and 
> often in the "Breaking Changes" sections of the readme.txt.
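
To illustrate the per-page decision described in the issue, here is a minimal sketch in plain Java. It is not Tika's implementation: the class `AutoOcrSketch`, the `Strategy` enum, and the 10-character threshold are all hypothetical names and values chosen for illustration. It only assumes what the description states: 'auto' runs OCR on pages with little extracted text, and each rendered page costs at most one OCR call.

```java
import java.util.List;

/** Hypothetical sketch of an 'auto' OCR decision; these are not Tika classes. */
final class AutoOcrSketch {

    /** Illustrative modes only; the names are assumptions made for this sketch. */
    enum Strategy { NO_OCR, AUTO, OCR_ALWAYS }

    /** Assumed threshold: in AUTO mode, pages with fewer visible characters get OCR'd. */
    static final int MIN_CHARS_TO_SKIP_OCR = 10;

    /** Decides whether a single page should be rendered and sent to OCR. */
    static boolean shouldOcr(Strategy strategy, String extractedText) {
        switch (strategy) {
            case NO_OCR:
                return false; // the user disabled OCR explicitly
            case OCR_ALWAYS:
                return true;
            default:
                // AUTO: run OCR only when little text was extracted from the page
                int visible = extractedText == null ? 0 : extractedText.trim().length();
                return visible < MIN_CHARS_TO_SKIP_OCR;
        }
    }

    /**
     * Counts OCR calls for a document when each page is rendered to one
     * composite image, so there is at most one call per page regardless of
     * how many inline images the page contains.
     */
    static int countOcrCalls(Strategy strategy, List<String> pageTexts) {
        int calls = 0;
        for (String text : pageTexts) {
            if (shouldOcr(strategy, text)) {
                calls++;
            }
        }
        return calls;
    }
}
```

Under these assumptions, a document whose pages all yield plenty of extracted text triggers no OCR calls in `AUTO` mode, while `NO_OCR` models the opt-out that users with tesseract installed would need after the change.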



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
