[Tika Wiki] Update of "PDFParser (Apache PDFBox)" by TimothyAllison

Apache Wiki Wed, 09 Nov 2016 07:23:23 -0800

The "PDFParser (Apache PDFBox)" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29?action=diff&rev1=1&rev2=2

  = PDFParser =
  
+ 
  == Configuration options ==
+ There are three ways of configuring the PDFParser:
+  1. Programmatically via setter methods on the PDFParser.
+  2. Programmatically via the PDFParserConfig object submitted through the ParseContext.
+  3. Via the tika-config.xml file (many thanks to Thamme Gowda and Chris Mattmann for their work on TIKA-1508).
  
- ...to be filled in...
+ The first two options are documented in the javadocs and are fairly self-explanatory.
+ 
+ The following example tika-config.xml file sets catchIntermediateExceptions to {{{false}}} and checks whether the PDF allows extraction for accessibility.
+ {{{
+ <?xml version="1.0" encoding="UTF-8"?>
+ <properties>
+     <parsers>
+         <parser class="org.apache.tika.parser.pdf.PDFParser">
+             <params>
+                 <param name="allowExtractionForAccessibility" type="bool">true</param>
+                 <param name="catchIntermediateExceptions" type="bool">false</param>
+                 <!-- We really should throw an exception for an unrecognized
+                      parameter like this one! We are currently swallowing it. -->
+                 <param name="someRandomThingOrOther" type="bool">true</param>
+             </params>
+         </parser>
+     </parsers>
+ </properties>
+ }}}
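
Because an unrecognized parameter such as someRandomThingOrOther is currently swallowed rather than reported, it may help to sanity-check a config file before handing it to Tika. The following is a minimal sketch using only the JDK's XML parser, not Tika itself. The class name {{{PdfParserParamCheck}}}, the method {{{unknownParams}}}, and the set of "known" names are hypothetical: the set contains only the two parameter names used in the example above and is not an authoritative list of what PDFParser accepts.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class PdfParserParamCheck {
    // Assumption: only the two names from the example config are treated as known.
    static final Set<String> KNOWN = new HashSet<>(Arrays.asList(
            "allowExtractionForAccessibility",
            "catchIntermediateExceptions"));

    // Returns the names of all <param> elements that are not in KNOWN.
    static List<String> unknownParams(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            List<String> unknown = new ArrayList<>();
            NodeList params = doc.getElementsByTagName("param");
            for (int i = 0; i < params.getLength(); i++) {
                String name = ((Element) params.item(i)).getAttribute("name");
                if (!KNOWN.contains(name)) {
                    unknown.add(name);
                }
            }
            return unknown;
        } catch (Exception e) {
            // Malformed XML or parser setup failure.
            throw new IllegalArgumentException("Could not parse config", e);
        }
    }

    public static void main(String[] args) {
        String xml =
                "<properties><parsers>"
              + "<parser class='org.apache.tika.parser.pdf.PDFParser'><params>"
              + "<param name='allowExtractionForAccessibility' type='bool'>true</param>"
              + "<param name='catchIntermediateExceptions' type='bool'>false</param>"
              + "<param name='someRandomThingOrOther' type='bool'>true</param>"
              + "</params></parser></parsers></properties>";
        System.out.println(unknownParams(xml)); // prints [someRandomThingOrOther]
    }
}
```

The check is deliberately strict about names only; it does not validate the {{{type}}} attribute or the values, and the known-name set would need to be extended to match the parameters your Tika version actually supports.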
+ 
+ == OCR ==
+ There are two ways of running OCR on PDFs:
+  1. Extracting the inline images and letting Tesseract run on those.
+  2. Rendering each PDF page as a single image and running Tesseract on that single image.
+ 
+ We have not carried out evaluations to determine which strategy is better; we suspect that the tried and true ''It Depends(TM)'' applies here.  We added the option of running OCR on a single rendered page image because some PDFs contain hundreds of images per page, where each image is a tiny part of the overall page, and running OCR on those fragments would be useless.  However, we recognize that if the page is logically broken into sections, running OCR on the individual inline images might yield better results.
+ 
+ === Configuring OCR on Inline Images ===
+ 
+ 
+ === Configuring OCR on Rendered Pages ===
  
  == See also ==
