[Tika Wiki] Update of "PDFParser (Apache PDFBox)" by TimothyAllison

Apache Wiki Fri, 02 Nov 2018 08:17:46 -0700


The "PDFParser (Apache PDFBox)" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29?action=diff&rev1=17&rev2=18

  
  We have not carried out evaluations to determine which strategy is better.  
We suspect that the tried and true ''It Depends(TM)'' is operative here.  We 
added the option of running OCR on a single rendered image of the page because 
some PDFs contain hundreds of images per page, where each image is a tiny part 
of the overall page and OCR on the individual fragments would be useless.  
However, we recognize that if the page is logically broken into sections, 
running OCR on the individual inline images might yield better results.
  
- === Configuring OCR on Inline Images ===
+ === Option 1: Configuring OCR on Inline Images ===
  
- This will extract inline images as if they were attachments, and then, if 
Tesseract is correctly configured, it should run against the images.  Note: by 
default, extracting inline images is turned off because some PDFs contain 
thousands of inline images, and it has a big hit on performance, both memory 
usage and time.
+ This will extract inline images as if they were attachments, and then, if 
Tesseract is correctly configured, it should run against the images.  Note: by 
default, extracting inline images is turned off because some rare PDFs contain 
thousands of inline images per page, and it has a big hit on performance, both 
memory usage and time.
  
  {{{
  ...
@@ -61, +61 @@

  ...
  }}}
  
- Note, '''if their licenses are compatible with your application''', you may 
want to include levigo and jai in your classpath to handle jp2, jpeg2000 and 
tiff files.  '''The licenses are not Apache 2.0 compatible!'''
  
- {{{
-     <dependency>
-         <groupId>com.levigo.jbig2</groupId>
-         <artifactId>levigo-jbig2-imageio</artifactId>
-         <version>1.6.5</version>
-     </dependency>
-     <dependency>
-         <groupId>com.github.jai-imageio</groupId>
-         <artifactId>jai-imageio-core</artifactId>
-         <version>1.3.1</version>
-     </dependency>
- }}}
- 
- === Configuring OCR on Rendered Pages ===
+ === Option 2: Configuring OCR on Rendered Pages ===
  This will render each PDF page and then run OCR on that image.  This method 
of OCR is triggered by the {{{ocrStrategy}}} parameter, but users can 
manipulate other parameters, including the {{{image type}}} (see 
{{{org.apache.pdfbox.rendering.ImageType}}} for options) and the resolution in 
dots per inch ({{{dpi}}}).  The defaults are {{{gray}}} and {{{300}}}, 
respectively.  For {{{ocrStrategy}}}, we currently have: {{{no_ocr}}} (rely on 
regular text extraction only), {{{ocr_only}}} (don't bother extracting text, 
just run OCR on each page), and {{{ocr_and_text}}} (both extract text and run 
OCR).  More advanced strategies would be useful, e.g. run OCR only if regular 
extraction yields just 10 words or so for a page, but we haven't implemented 
those yet.
  
  {{{
@@ -90, +76 @@

          </parser>
  ...
  }}}
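
The word-count idea mentioned under Option 2 is not implemented in Tika. Purely as an illustration, a minimal sketch of such a heuristic might look like the following. Everything here is hypothetical: the class {{{OcrFallbackSketch}}}, its methods, and the threshold constant are invented for this example and are not part of the Tika or PDFBox API. The enum simply mirrors the three {{{ocrStrategy}}} values listed above, and the threshold of 10 words comes from the example in the text.

```java
/**
 * Hypothetical sketch of a per-page OCR fallback heuristic.
 * None of these names exist in Tika or PDFBox.
 */
final class OcrFallbackSketch {

    /** Mirrors the ocrStrategy values: no_ocr, ocr_only, ocr_and_text. */
    enum Strategy { NO_OCR, OCR_ONLY, OCR_AND_TEXT }

    /** Assumed cut-off: a page with at most this many words is treated as sparse. */
    static final int MAX_WORDS_BEFORE_OCR = 10;

    private OcrFallbackSketch() {
    }

    /** Counts whitespace-separated tokens in the text extracted from one page. */
    static int countWords(String pageText) {
        if (pageText == null) {
            return 0;
        }
        String trimmed = pageText.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    /**
     * Picks a strategy for one page: sparse pages get OCR in addition to the
     * little text that was extracted; all other pages skip OCR.
     */
    static Strategy choose(String pageText) {
        return countWords(pageText) <= MAX_WORDS_BEFORE_OCR
                ? Strategy.OCR_AND_TEXT
                : Strategy.NO_OCR;
    }
}
```

Returning {{{OCR_AND_TEXT}}} rather than {{{OCR_ONLY}}} for sparse pages is a design choice of this sketch: it keeps whatever text was extracted instead of discarding it. A real implementation would also have to decide how to count words in languages that do not separate words with whitespace.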
+ 
+ === Optional Dependencies ===
+ 
+ Note, you should include the following dependency to process JBIG2 images:
+ 
+ {{{
+     <dependency>
+         <groupId>org.apache.pdfbox</groupId>
+         <artifactId>jbig2-imageio</artifactId>
+         <version>3.0.2</version>
+     </dependency>
+ }}}
+ 
+ Note, '''if their licenses are compatible with your application''', you may 
want to include the following jai libraries in your classpath to handle jp2, 
jpeg2000 and tiff files.  '''The licenses are not Apache 2.0 compatible!'''
+ 
+ {{{
+     <dependency>
+         <groupId>com.github.jai-imageio</groupId>
+         <artifactId>jai-imageio-core</artifactId>
+         <version>1.4.0</version>
+     </dependency>
+     <dependency>
+         <groupId>com.github.jai-imageio</groupId>
+         <artifactId>jai-imageio-jpeg2000</artifactId>
+         <version>1.3.0</version>
+         <scope>test</scope>
+     </dependency>
+ }}}
+ 
  
  If you are using Java 8, make sure to see 
[[https://pdfbox.apache.org/2.0/migration.html#pdf-rendering|pdf-rendering]] 
for JVM settings that may improve the speed of processing.
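
As far as we know, the main setting described on that PDFBox page is the {{{sun.java2d.cmm}}} system property, which switches Java 8 to the KCMS color management provider. It is usually passed on the command line as {{{-Dsun.java2d.cmm=sun.java2d.cmm.kcms.KcmsServiceProvider}}}. The sketch below sets it programmatically instead. The class and method names are hypothetical, and the Java 8 check via {{{java.specification.version}}} is an assumption of this example; consult the linked page for the authoritative list of settings.

```java
/**
 * Hypothetical helper that applies the KCMS color management setting on Java 8.
 * Call it early, before any rendering or color management classes are loaded.
 */
final class Java8RenderingSketch {

    static final String CMM_PROPERTY = "sun.java2d.cmm";
    static final String KCMS_PROVIDER = "sun.java2d.cmm.kcms.KcmsServiceProvider";

    private Java8RenderingSketch() {
    }

    /** Java 8 reports its specification version as "1.8". */
    static boolean isJava8() {
        return "1.8".equals(System.getProperty("java.specification.version"));
    }

    /**
     * Sets the property only on Java 8 and only if the user has not already
     * chosen a value. Returns true if the property was set by this call.
     */
    static boolean useKcmsIfJava8() {
        if (isJava8() && System.getProperty(CMM_PROPERTY) == null) {
            System.setProperty(CMM_PROPERTY, KCMS_PROVIDER);
            return true;
        }
        return false;
    }
}
```

Leaving an existing value untouched means a {{{-D}}} flag given on the command line always wins over this helper.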
