[Tika Wiki] Update of "PDFParser (Apache PDFBox)" by TimothyAllison

Apache Wiki Mon, 27 Mar 2017 09:53:21 -0700

The "PDFParser (Apache PDFBox)" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29?action=diff&rev1=13&rev2=14

  
  Start with the instructions on 
[[https://wiki.apache.org/tika/TikaOCR|TikaOCR]].  In short, you need to have 
Tesseract installed.
  
+ 
  There are two ways of running OCR on PDFs:
   1. Extracting the inline images and letting Tesseract run on each inline 
image.
   2. Rendering each PDF page as a single image and running Tesseract on that 
single image.
@@ -58, +59 @@

              </params>
          </parser>
  ...
+ }}}
+ 
+ Note: '''if their licenses are compatible with your application''', you may 
want to add the levigo and jai libraries to your classpath to handle JBIG2, 
JPEG 2000, and TIFF images.
+ 
+ {{{
+     <dependency>
+         <groupId>org.apache.tika</groupId>
+         <artifactId>tika-parsers</artifactId>
+         <version>1.13</version>
+     </dependency>
+     <dependency>
+         <groupId>com.levigo.jbig2</groupId>
+         <artifactId>levigo-jbig2-imageio</artifactId>
+         <version>1.6.5</version>
+     </dependency>
+     <dependency>
+         <groupId>com.github.jai-imageio</groupId>
+         <artifactId>jai-imageio-core</artifactId>
+         <version>1.3.1</version>
+     </dependency>
  }}}
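
One way to confirm that the optional image libraries were actually picked up is to ask Java's ImageIO registry which readers it can find. The following is a minimal sketch, not part of Tika: the class name `ImageReaderCheck` is hypothetical, and the format names `jbig2`, `jpeg2000`, and `tiff` are assumptions about how the plugins register themselves, so adjust them if your versions use different names.

```java
import javax.imageio.ImageIO;

// Hypothetical helper (not a Tika class): reports whether an ImageIO reader
// is registered for a given format name on the current classpath.
class ImageReaderCheck {

    // Returns true if at least one ImageIO reader claims the given format name.
    static boolean hasReader(String formatName) {
        return ImageIO.getImageReadersByFormatName(formatName).hasNext();
    }

    public static void main(String[] args) {
        // Re-scan the classpath so plugins contributed by added jars are registered.
        ImageIO.scanForPlugins();

        // Assumed format names; the plugins may register under other names.
        String[] formats = {"jbig2", "jpeg2000", "tiff"};
        for (String format : formats) {
            String status = hasReader(format) ? "reader found" : "no reader on classpath";
            System.out.println(format + ": " + status);
        }
    }
}
```

If a format is reported as missing, images of that type embedded in a PDF may not be decoded before they reach Tesseract.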
  
  === Configuring OCR on Rendered Pages ===
