[ https://issues.apache.org/jira/browse/TIKA-1994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15314671#comment-15314671 ]
Hudson commented on TIKA-1994:
------------------------------
SUCCESS: Integrated in tika-2.x #107 (See
[https://builds.apache.org/job/tika-2.x/107/])
TIKA-1994 -- Integrate TesseractOCR with full page image rendering for
(tallison: rev ebe70289815776f6ce6c271c7faf8d23cfd31337)
* tika-parser-bundles/tika-parser-pdf-bundle/src/test/java/org/apache/tika/module/pdf/BundleIT.java
* tika-parser-modules/tika-parser-multimedia-module/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java
* tika-parser-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/OCR2XHTML.java
* tika-parser-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
* tika-parser-modules/tika-parser-pdf-module/src/main/resources/org/apache/tika/parser/pdf/PDFParser.properties
* tika-parser-modules/tika-parser-multimedia-module/pom.xml
* tika-parser-modules/tika-parser-multimedia-module/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
* CHANGES.txt
* tika-parser-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/AbstractPDF2XHTML.java
* tika-parser-bundles/tika-parser-journal-bundle/src/test/java/org/apache/tika/module/journal/BundleIT.java
* tika-parser-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDFParserConfig.java
* tika-parser-modules/tika-parser-pdf-module/pom.xml
* tika-parser-bundles/tika-parser-pdf-bundle/pom.xml
* tika-parser-modules/tika-parser-pdf-module/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
* tika-parser-modules/tika-parser-pdf-module/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java
> Integrate OCR with PDFParser
> ----------------------------
>
> Key: TIKA-1994
> URL: https://issues.apache.org/jira/browse/TIKA-1994
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Assignee: Tim Allison
> Fix For: 2.0, 1.14
>
>
> Users can now run OCR on individual images embedded inline in PDFs if they
> get the configuration right.
> There are some drawbacks: 1) the text appears as an attachment if using the
> RecursiveParserWrapper, 2) text may be more cleanly extracted from the fully
> rendered page than from the individual images (this is still TBD).
> It might be useful to run OCR against each rendered page (instead of the
> component images).
> Integrating OCR is on the roadmap for PDFBox 2.1 (PDFBOX-1912). This will
> allow us to experiment with strategies until the cleaner integration is
> available with PDFBox 2.1.