Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tika Wiki" for change notification.
The "TikaOCR" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/TikaOCR?action=diff&rev1=10&rev2=11

  = OCR and PDFs =
+ See also [[https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29|PDFParser notes]] for more details on options for performing OCR on PDFs.
+
- With Tika server, the PDFConfig is generated for each document, so any configurations that you may do in the tika-config.xml file are overwritten.
+ Note: With Tika server, the PDFConfig is generated for each document, so any configurations that you may specify in the tika-config.xml file that you pass to the tika-server on startup are overwritten.
+ You need to specify configurations for the PDFParser like so: `curl -T testOCR.pdf http://localhost:9998/rmeta/text --header "X-Tika-PDFextractInlineImages: true"`
-
- See [[https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29|PDFParser notes]].

  = Disable Tika OCR =
  Tika's OCR will trigger on images embedded within, say, office documents in addition to images you upload directly. Because OCR slows down Tika, you might want to disable it if you don't need the results. You can disable OCR by simply uninstalling tesseract, but if that's not an option, here is a tika.xml config file that disables OCR:
