[Tika Wiki] Update of "PDFParser (Apache PDFBox)" by TimothyAllison

Apache Wiki Wed, 09 Nov 2016 07:46:40 -0800

The "PDFParser (Apache PDFBox)" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29?action=diff&rev1=2&rev2=3

  == Configuration options ==
  There are three ways of configuring the PDFParser.  
   1. Programmatically via setter methods on the PDFParser.
-  2. Programmatically via the PDFParserConfig object submitted through the 
ParseContext.
+  2. Programmatically via the PDFParserConfig object submitted through the 
`ParseContext`.
   3. Via the tika-config.xml file (many thanks to Thamme Gowda and Chris 
Mattmann's work on TIKA-1508).
  
  The first two are fairly self-explanatory through the javadocs.
  
- Here follows an example tika-config.xml file for setting 
catchIntermediateExceptions to {{{false}}} and for checking for whether the PDF 
allows for extraction for accessibility.
+ The following example tika-config.xml file sets 
{{{catchIntermediateExceptions}}} to {{{false}}} and turns on the check for 
whether the PDF allows extraction for accessibility.
  {{{
  <?xml version="1.0" encoding="UTF-8"?>
  <properties>
@@ -20, +20 @@

              <params>
                  <param name="allowExtractionForAccessibility" 
type="bool">true</param>
                  <param name="catchIntermediateExceptions" 
type="bool">false</param>
-                 <!-- we really should throw an exception for this!! 
+                 <!-- we really should throw an exception for this.
-                      We are currently swallowing it -->
+                      We are currently not checking -->
                  <param name="someRandomThingOrOther" type="bool">true</param>
              </params>
          </parser>
@@ -29, +29 @@

  </properties>
  }}}
  
+ 
  == OCR ==
+ Start with the instructions on 
[[TikaOCR|https://wiki.apache.org/tika/TikaOCR]].  In short, you need to have 
Tesseract installed.
+ 
- There are two ways of running OCR on PDFs
+ There are two ways of running OCR on PDFs:
-  1. Extracting the inline images and letting Tesseract run on those
+  1. Extracting the inline images and letting Tesseract run on those.
   2. Rendering each PDF page as a single image and running Tesseract on that 
single image.
  
  We have not carried out evaluations to determine which strategy is better.  
We suspect that the tried and true ''It Depends(TM)'' is operative here.  We 
added the option of OCR'ing a single rendered image per page because some PDFs 
contain hundreds of images per page, where each image is a tiny part of the 
overall page and running OCR on each fragment would be useless.  However, we 
recognize that if the page is logically broken into sections, running OCR on 
the individual inline images might yield better results.
  
  === Configuring OCR on Inline Images ===
  
+ This will extract inline images as if they were attachments, and then, if 
Tesseract is correctly configured, it should run against the images.  Note: by 
default, extracting inline images is turned off because some PDFs contain 
thousands of inline images, and extracting them carries a heavy cost in both 
memory usage and time.
+ 
+ {{{
+ ...
+         <parser class="org.apache.tika.parser.pdf.PDFParser">
+             <params>
+                 <param name="extractInlineImages" type="bool">true</param>
+             </params>
+         </parser>
+ ...
+ }}}
  
  === Configuring OCR on Rendered Pages ===
+ This will render each PDF page and then run OCR on that image.  Users can 
select the image type via {{{ocrImageType}}} (see 
{{{org.apache.pdfbox.rendering.ImageType}}} for options) and the dots per inch 
via {{{ocrDPI}}}.  For {{{ocrStrategy}}}, we currently have: {{{no_ocr}}} (rely 
on regular text extraction only), {{{ocr_only}}} (don't bother extracting text, 
just run OCR on each page), and {{{ocr_and_text}}} (both extract text and run 
OCR).  We should add more advanced strategies, e.g. run OCR only if regular 
extraction yields about 10 words or fewer from a page, but we haven't 
implemented those yet.
+ 
+ {{{
+ ...
+         <parser class="org.apache.tika.parser.pdf.PDFParser">
+             <params>
+                 <param name="ocrStrategy" type="string">ocr_only</param>
+                 <param name="ocrImageType" type="string">rgb</param>
+                 <param name="ocrDPI" type="int">100</param>
+             </params>
+         </parser>
+ ...
+ }}}
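
The more advanced strategy mentioned above (run OCR only when regular text extraction yields very few words for a page) is not implemented in the parser. Purely as an illustration, the decision could be sketched in plain Java roughly as follows. The class `OcrFallbackSketch`, its methods, and the exact threshold ("fewer than 10 words") are hypothetical and are not part of Tika or PDFBox. The sketch also assumes that the text of each page has already been extracted by other means.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: none of these names exist in Tika or PDFBox.
class OcrFallbackSketch {

    // Assumed threshold, borrowed from the "10 words" example in the text.
    static final int MIN_WORDS_PER_PAGE = 10;

    // Counts whitespace-separated tokens in the text extracted from one page.
    static int countWords(String pageText) {
        if (pageText == null) {
            return 0;
        }
        String trimmed = pageText.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    // True when regular extraction produced fewer than minWords words,
    // which suggests the page may be a scan and OCR could be worth its cost.
    static boolean shouldRunOcr(String pageText, int minWords) {
        return countWords(pageText) < minWords;
    }

    public static void main(String[] args) {
        // Stand-ins for per-page text from regular extraction.
        List<String> pages = Arrays.asList(
                "",
                "This page carries a normal text layer with clearly more than ten words in it.");
        for (int i = 0; i < pages.size(); i++) {
            boolean ocr = shouldRunOcr(pages.get(i), MIN_WORDS_PER_PAGE);
            System.out.println("page " + (i + 1) + ": run OCR = " + ocr);
        }
    }
}
```

A per-page word count is only one possible signal. A real implementation would presumably have to weigh it against the cost of rendering and OCR, which is why the threshold is passed in as a parameter here and not hard-coded.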
+ 
  
  == See also ==
