Dear Wiki user, You have subscribed to a wiki page or wiki category on "Tika Wiki" for change notification.
The "PDFParser (Apache PDFBox)" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29?action=diff&rev1=13&rev2=14

  Start with the instructions on [[https://wiki.apache.org/tika/TikaOCR|TikaOCR]]. In short, you need to have Tesseract installed.
+ 
  There are two ways of running OCR on PDFs:
   1. Extracting the inline images and letting Tesseract run on each inline image.
   2. Rendering each PDF page as a single image and running Tesseract on that single image.

@@ -58, +59 @@

  </params>
  </parser ...
+ }}}
+ 
+ Note, '''if their licenses are compatible with your application''', you may want to include levigo and jai in your classpath to handle jp2, jpeg2000 and tiff files.
+ 
+ {{{
+ <dependency>
+   <groupId>org.apache.tika</groupId>
+   <artifactId>tika-parsers</artifactId>
+   <version>1.13</version>
+ </dependency>
+ <dependency>
+   <groupId>com.levigo.jbig2</groupId>
+   <artifactId>levigo-jbig2-imageio</artifactId>
+   <version>1.6.5</version>
+ </dependency>
+ <dependency>
+   <groupId>com.github.jai-imageio</groupId>
+   <artifactId>jai-imageio-core</artifactId>
+   <version>1.3.1</version>
+ </dependency>
  }}}

  === Configuring OCR on Rendered Pages ===
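
For illustration only, and not part of the change recorded in this notification: the two OCR approaches listed in the diff might be selected through a tika-config.xml along the following lines. This is a minimal sketch under stated assumptions. The parameter names extractInlineImages and ocrStrategy, and the value ocr_only, are assumed from PDFParserConfig and may not exist in every Tika release (including the 1.13 version named in the dependency block), so check them against the version you actually use.

{{{
<properties>
  <parsers>
    <!-- Assumption: exclude the default PDFParser so that only the configured one below is loaded -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <!-- Approach 1: extract inline images so that Tesseract runs on each one -->
        <param name="extractInlineImages" type="bool">true</param>
        <!-- Approach 2 (assumed alternative): render each page as one image and OCR it,
             e.g. <param name="ocrStrategy" type="string">ocr_only</param> -->
      </params>
    </parser>
  </parsers>
</properties>
}}}

Either approach still requires Tesseract to be installed, as the TikaOCR instructions describe.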
