[Tika Wiki] Update of "PDFParser (Apache PDFBox)" by TimothyAllison

Apache Wiki Thu, 10 Nov 2016 12:08:51 -0800

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tika Wiki" for change 
notification.


The "PDFParser (Apache PDFBox)" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29?action=diff&rev1=11&rev2=12

  }}}
  
  === Configuring OCR on Rendered Pages ===
- This will render each PDF page and then run OCR on that image.  Users can 
select the {{{image type}}} (see {{{org.apache.pdfbox.rendering.ImageType}}} 
for options) and the dots per inch {{{dpi}}}.  For {{{ocrStrategy}}}, we 
currently have: {{{no_ocr}}} (rely on regular text extraction only), 
{{{ocr_only}}} (don't bother extracting text, just run OCR on each page), 
{{{ocr_and_text}}} (both extract text and run OCR). We should add more advanced 
strategies, e.g. if you only get 10 words out of a page, run OCR, but we 
haven't implemented those yet.
+ This will render each PDF page and then run OCR on the rendered image.  This 
method of OCR is triggered by the {{{ocrStrategy}}} parameter, but users can 
also set other parameters, including the {{{image type}}} (see 
{{{org.apache.pdfbox.rendering.ImageType}}} for options) and the dots per inch 
{{{dpi}}}.  The defaults are {{{gray}}} and {{{200}}}, respectively.  For 
{{{ocrStrategy}}}, we currently have: {{{no_ocr}}} (rely on regular text 
extraction only), {{{ocr_only}}} (skip text extraction and just run OCR on 
each page), and {{{ocr_and_text}}} (both extract text and run OCR).  We should 
add more advanced strategies (e.g. run OCR when regular extraction yields only 
10 or so words from a page), but we haven't implemented those yet.
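
To make the three strategies concrete, here is a minimal, self-contained Java sketch of how each {{{ocrStrategy}}} value maps onto the two extraction steps described above. It is only an illustration: the class, enum, and field names ({{{OcrStrategySketch}}}, {{{OcrStrategy}}}, {{{OcrSettings}}}, {{{extractsText}}}, {{{runsOcr}}}) are hypothetical and are not Tika's API. Only the strategy names and the {{{gray}}} / {{{200}}} defaults come from the text.

```java
import java.util.Locale;

public class OcrStrategySketch {

    /** Hypothetical mirror of the three strategy names listed above (not Tika's own enum). */
    enum OcrStrategy {
        NO_OCR(true, false),        // regular text extraction only
        OCR_ONLY(false, true),      // skip text extraction, OCR each rendered page
        OCR_AND_TEXT(true, true);   // do both

        final boolean extractsText;
        final boolean runsOcr;

        OcrStrategy(boolean extractsText, boolean runsOcr) {
            this.extractsText = extractsText;
            this.runsOcr = runsOcr;
        }

        /** Maps a configuration string such as "ocr_only" onto the enum constant. */
        static OcrStrategy parse(String name) {
            return valueOf(name.trim().toUpperCase(Locale.ROOT));
        }
    }

    /** Hypothetical settings holder; image type and dpi start at the defaults named in the text. */
    static final class OcrSettings {
        final OcrStrategy strategy;
        String imageType = "gray";
        int dpi = 200;

        OcrSettings(OcrStrategy strategy) {
            this.strategy = strategy;
        }

        String describe() {
            return "extractText=" + strategy.extractsText
                    + " runOcr=" + strategy.runsOcr
                    + " imageType=" + imageType
                    + " dpi=" + dpi;
        }
    }

    public static void main(String[] args) {
        // Print which steps each strategy would trigger.
        for (String name : new String[] {"no_ocr", "ocr_only", "ocr_and_text"}) {
            OcrSettings settings = new OcrSettings(OcrStrategy.parse(name));
            System.out.println(name + " -> " + settings.describe());
        }
    }
}
```

In this sketch the strategy must be given explicitly, because the text above states defaults only for the image type and dpi, not for {{{ocrStrategy}}} itself.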
  
  {{{
  ...
