Dear Wiki user, You have subscribed to a wiki page or wiki category on "Tika Wiki" for change notification.
The "PDFParser (Apache PDFBox)" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29?action=diff&rev1=1&rev2=2

  = PDFParser =
+ == Configuration options ==
+ There are three ways of configuring the PDFParser:
+  1. Programmatically via setter methods on the PDFParser.
+  2. Programmatically via the PDFParserConfig object submitted through the ParseContext.
+  3. Via the tika-config.xml file (many thanks to Thamme Gowda and Chris Mattmann's work on TIKA-1508).
- ...to be filled in...
+ The first two are fairly self-explanatory through the javadocs.
+
+ Here follows an example tika-config.xml file that sets catchIntermediateExceptions to {{{false}}} and checks whether the PDF allows extraction for accessibility.
+ {{{
+ <?xml version="1.0" encoding="UTF-8"?>
+ <properties>
+   <parsers>
+     <parser class="org.apache.tika.parser.pdf.PDFParser">
+       <params>
+         <param name="allowExtractionForAccessibility" type="bool">true</param>
+         <param name="catchIntermediateExceptions" type="bool">false</param>
+         <!-- we really should throw an exception for this!!
+              We are currently swallowing it -->
+         <param name="someRandomThingOrOther" type="bool">true</param>
+       </params>
+     </parser>
+   </parsers>
+ </properties>
+ }}}
+
+ == OCR ==
+ There are two ways of running OCR on PDFs:
+  1. Extracting the inline images and letting Tesseract run on those.
+  2. Rendering each PDF page as a single image and running Tesseract on that single image.
+
+ We have not carried out evaluations to determine which strategy is better. We suspect that the tried and true ''It Depends(TM)'' is operative here. We added the option of OCR'ing a single rendered image because some PDFs contain hundreds of images per page, where each image is a tiny part of the overall page and running OCR on each one would be useless. However, we recognize that if the page is logically broken into sections, running OCR on the individual inline images might yield better results.
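
As a rough sketch of how the two OCR strategies above might be selected through tika-config.xml, the fragment below turns on inline-image extraction (strategy 1) and shows the rendered-page alternative (strategy 2) in a comment. It assumes that the param names {{{extractInlineImages}}} and {{{ocrStrategy}}} map onto the corresponding PDFParserConfig setters in your Tika version, and that Tesseract is installed and on the path; neither name appears in the change above, so please check the javadocs before relying on them.

{{{
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <!-- Strategy 1 (assumed param name): extract inline images
             so that Tesseract can run on each of them -->
        <param name="extractInlineImages" type="bool">true</param>
        <!-- Strategy 2 (assumed param name and value): render each page
             as a single image and OCR that instead. Swap it in for the
             param above rather than enabling both:
             <param name="ocrStrategy" type="string">ocr_only</param>
        -->
      </params>
    </parser>
  </parsers>
</properties>
}}}

Only one strategy is enabled at a time here so that the results of the two approaches can be compared on the same set of PDFs.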
+
+ === Configuring OCR on Inline Images ===
+
+
+ === Configuring OCR on Rendered Pages ===
  == See also ==
