Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tika Wiki" for change notification.

The "PDFParser (Apache PDFBox)" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29?action=diff&rev1=11&rev2=12

  }}}
  === Configuring OCR on Rendered Pages ===
- This will render each PDF page and then run OCR on that image. Users can select the {{{image type}}} (see {{{org.apache.pdfbox.rendering.ImageType}}} for options) and the dots per inch {{{dpi}}}. For {{{ocrStrategy}}}, we currently have: {{{no_ocr}}} (rely on regular text extraction only), {{{ocr_only}}} (don't bother extracting text, just run OCR on each page), {{{ocr_and_text}}} (both extract text and run OCR). We should add more advanced strategies, e.g. if you only get 10 words out of a page, run OCR, but we haven't implemented those yet.
+ This will render each PDF page and then run OCR on that image. This method of OCR is triggered by the {{{ocrStrategy}}} parameter, but users can manipulate other parameters, including the {{{image type}}} (see {{{org.apache.pdfbox.rendering.ImageType}}} for options) and the dots per inch {{{dpi}}}. The defaults are: {{{gray}}} and {{{200}}} respectively. For {{{ocrStrategy}}}, we currently have: {{{no_ocr}}} (rely on regular text extraction only), {{{ocr_only}}} (don't bother extracting text, just run OCR on each page), {{{ocr_and_text}}} (both extract text and run OCR). We should add more advanced strategies, e.g. if you only get 10 words out of a page, run OCR, but we haven't implemented those yet.
  {{{
  ...
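
The three {{{ocrStrategy}}} values described in the changed paragraph can be read as two independent switches: whether to run regular text extraction and whether to run OCR on the rendered page. The following is a minimal, self-contained Java sketch of that reading. It does not use Tika's or PDFBox's classes: {{{OcrStrategySketch}}}, {{{OcrStrategy}}}, {{{RenderSettings}}} and all of their members are hypothetical names introduced only for illustration. The only values taken from the page are the three strategy names and the stated defaults ({{{gray}}} and {{{200}}}).

```java
import java.util.Locale;

public class OcrStrategySketch {

    /** Hypothetical model of the three strategy names listed on the wiki page. */
    enum OcrStrategy {
        NO_OCR(true, false),       // rely on regular text extraction only
        OCR_ONLY(false, true),     // skip text extraction, run OCR on each page
        OCR_AND_TEXT(true, true);  // extract text and also run OCR

        final boolean extractText;
        final boolean runOcr;

        OcrStrategy(boolean extractText, boolean runOcr) {
            this.extractText = extractText;
            this.runOcr = runOcr;
        }

        /** Maps a configuration string such as "ocr_only" to the enum constant. */
        static OcrStrategy parse(String name) {
            return valueOf(name.trim().toUpperCase(Locale.ROOT));
        }
    }

    /** Hypothetical holder for the rendering parameters used before OCR. */
    static final class RenderSettings {
        final String imageType; // assumed to name an image type such as "gray"
        final int dpi;          // dots per inch used when rendering the page

        RenderSettings(String imageType, int dpi) {
            this.imageType = imageType;
            this.dpi = dpi;
        }

        /** Defaults as stated on the wiki page: gray and 200. */
        static RenderSettings defaults() {
            return new RenderSettings("gray", 200);
        }
    }

    public static void main(String[] args) {
        RenderSettings settings = RenderSettings.defaults();
        System.out.println("imageType=" + settings.imageType + ", dpi=" + settings.dpi);

        // Show which steps each strategy would enable under this sketch.
        for (String name : new String[] {"no_ocr", "ocr_only", "ocr_and_text"}) {
            OcrStrategy strategy = OcrStrategy.parse(name);
            System.out.println(name
                    + ": extractText=" + strategy.extractText
                    + ", runOcr=" + strategy.runOcr);
        }
    }
}
```

Modelling the strategy as two flags is only one possible design. It would let a future strategy, such as the word-count threshold the page mentions as not yet implemented, be expressed as a per-page decision rather than a new constant, but that is speculation about how such a feature might be shaped, not a description of the parser.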
