Dear Wiki user, You have subscribed to a wiki page or wiki category on "Tika Wiki" for change notification.
The "PDFParser (Apache PDFBox)" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29?action=diff&rev1=4&rev2=5

  2. Programmatically via the PDFParserConfig object submitted through the `ParseContext`.
  3. Via the tika-config.xml file (many thanks to Thamme Gowda and Chris Mattmann's work on TIKA-1508).

- The first two are fairly self-explanatory through the javadocs.
+ The first two are fairly self-explanatory through the javadocs.

  Here follows an example tika-config.xml file for setting {{{catchIntermediateExceptions}}} to {{{false}}} and for checking for whether the PDF allows for extraction for accessibility.
  {{{
@@ -54, +54 @@

  }}}
  === Configuring OCR on Rendered Pages ===
- This will render each PDF page and then run OCR on that image. Users can select the {{image type}} (see {{org.apache.pdfbox.rendering.ImageType}} for options) and the dots per inch {{dpi}}. For {{ocrStrategy}}, we currently have: no_ocr (rely on regular text extraction only), ocr_only (don't bother extracting text, just run OCR on each page), ocr_and_text (both extract text and run OCR). We should add more advanced strategies, e.g. if you only get 10 words out of a page, run OCR, but we haven't implemented those yet.
+ This will render each PDF page and then run OCR on that image. Users can select the {{{image type}}} (see {{{org.apache.pdfbox.rendering.ImageType}}} for options) and the dots per inch {{{dpi}}}. For {{{ocrStrategy}}}, we currently have: {{{no_ocr}}} (rely on regular text extraction only), {{{ocr_only}}} (don't bother extracting text, just run OCR on each page), {{{ocr_and_text}}} (both extract text and run OCR). We should add more advanced strategies, e.g. if you only get 10 words out of a page, run OCR, but we haven't implemented those yet.

  {{{
  ...
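
The three {{{ocrStrategy}}} values described in the changed paragraph can be illustrated with a small stand-alone sketch. This is not Tika's implementation and does not call Tika or PDFBox: the class `OcrStrategySketch`, its `Strategy` enum, and the methods `fromConfig` and `pageContent` are hypothetical names invented for this example, and the two string arguments stand in for whatever the real text extractor and OCR engine would return for a page. How the real parser combines the two outputs for {{{ocr_and_text}}} is not stated on the page; joining them with a newline is an assumption.

```java
import java.util.Locale;

/** Hypothetical model of the three ocrStrategy values; not Tika's API. */
final class OcrStrategySketch {

    /** Mirrors the names on the wiki page: no_ocr, ocr_only, ocr_and_text. */
    enum Strategy {
        NO_OCR, OCR_ONLY, OCR_AND_TEXT;

        /** Parses the lowercase spelling used in the config, e.g. "ocr_only". */
        static Strategy fromConfig(String value) {
            return valueOf(value.trim().toUpperCase(Locale.ROOT));
        }
    }

    /**
     * Picks the content for one page. The two strings are placeholders for the
     * output of regular text extraction and of OCR on the rendered page image.
     */
    static String pageContent(Strategy strategy, String extractedText, String ocrText) {
        switch (strategy) {
            case NO_OCR:
                // Rely on regular text extraction only.
                return extractedText;
            case OCR_ONLY:
                // Skip text extraction and use the OCR result alone.
                return ocrText;
            case OCR_AND_TEXT:
                // Keep both; the newline separator is an assumption of this sketch.
                return extractedText + "\n" + ocrText;
            default:
                throw new IllegalArgumentException("Unknown strategy: " + strategy);
        }
    }

    public static void main(String[] args) {
        Strategy strategy = Strategy.fromConfig("ocr_and_text");
        System.out.println(pageContent(strategy, "embedded text", "ocr text"));
    }
}
```

In a real deployment the strategy would be set through the PDFParserConfig object or tika-config.xml as the page describes; the sketch only makes the difference between the three values concrete.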
