[Tika Wiki] Update of "PDFParser (Apache PDFBox)" by TimothyAllison

Apache Wiki Wed, 09 Nov 2016 07:50:24 -0800

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tika Wiki" for change 
notification.


The "PDFParser (Apache PDFBox)" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29?action=diff&rev1=4&rev2=5

   2. Programmatically via the PDFParserConfig object submitted through the 
`ParseContext`.
   3. Via the tika-config.xml file (many thanks to Thamme Gowda and Chris 
Mattmann's work on TIKA-1508).
  
- The first two are fairly self-explanatory through the javadocs.
+ The first two are fairly self-explanatory through the javadocs. 
  
  The following example tika-config.xml file sets 
{{{catchIntermediateExceptions}}} to {{{false}}} and checks whether the PDF 
permits extraction for accessibility. 
  {{{
@@ -54, +54 @@

  }}}
  
  === Configuring OCR on Rendered Pages ===
- This will render each PDF page and then run OCR on that image.  Users can 
select the {{image type}} (see {{org.apache.pdfbox.rendering.ImageType}} for 
options) and the dots per inch {{dpi}}.  For {{ocrStrategy}}, we currently 
have: no_ocr (rely on regular text extraction only), ocr_only (don't bother 
extracting text, just run OCR on each page), ocr_and_text (both extract text 
and run OCR). We should add more advanced strategies, e.g. if you only get 10 
words out of a page, run OCR, but we haven't implemented those yet.
+ This will render each PDF page and then run OCR on that image.  Users can 
select the {{{image type}}} (see {{{org.apache.pdfbox.rendering.ImageType}}} 
for options) and the dots per inch {{{dpi}}}.  For {{{ocrStrategy}}}, we 
currently have: {{{no_ocr}}} (rely on regular text extraction only), 
{{{ocr_only}}} (don't bother extracting text, just run OCR on each page), 
{{{ocr_and_text}}} (both extract text and run OCR). We should add more advanced 
strategies, e.g. if you only get 10 words out of a page, run OCR, but we 
haven't implemented those yet.
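
As a rough illustration of the three {{{ocrStrategy}}} values described above, the following self-contained Java sketch maps each value to the per-page steps it implies. The class OcrStrategySketch, its OcrStrategy enum, and the step names are hypothetical and exist only for this illustration; they are not Tika's own classes or API, and the real parser may organize this differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class OcrStrategySketch {

    // Hypothetical mirror of the three strategy names listed above;
    // this is NOT Tika's own enum.
    enum OcrStrategy { NO_OCR, OCR_ONLY, OCR_AND_TEXT }

    // Parse a lowercase config value such as "ocr_only" into the enum.
    // Throws IllegalArgumentException for unknown values.
    static OcrStrategy parse(String value) {
        return OcrStrategy.valueOf(value.trim().toUpperCase(Locale.ROOT));
    }

    // Return the per-page steps that a strategy implies (step names are
    // illustrative labels, not real method names).
    static List<String> stepsFor(OcrStrategy strategy) {
        List<String> steps = new ArrayList<>();
        if (strategy != OcrStrategy.OCR_ONLY) {
            steps.add("extract_text");     // regular text extraction
        }
        if (strategy != OcrStrategy.NO_OCR) {
            steps.add("render_and_ocr");   // render the page, then OCR the image
        }
        return steps;
    }

    public static void main(String[] args) {
        for (String name : new String[] {"no_ocr", "ocr_only", "ocr_and_text"}) {
            System.out.println(name + " -> " + stepsFor(parse(name)));
        }
    }
}
```

A more advanced strategy, such as falling back to OCR only when text extraction yields very few words, would need a further branch in stepsFor; as noted above, nothing like that is implemented yet.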
  
  {{{
  ...
