[Tika Wiki] Update of "TikaOCR" by TimothyAllison

Apache Wiki Wed, 15 Feb 2017 11:12:06 -0800

The "TikaOCR" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/TikaOCR?action=diff&rev1=10&rev2=11

  
  = OCR and PDFs =
  
+ See also [[https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29|PDFParser notes]] for more details on options for performing OCR on PDFs.
+ 
- With Tika server, the PDFConfig is generated for each document, so any configurations that you may do in the tika-config.xml file are overwritten.
+ Note: With Tika server, the PDFConfig is generated for each document, so any configurations that you may specify in the tika-config.xml file that you pass to the tika-server on startup are overwritten.
+ 
  You need to specify configurations for the PDFParser like so:
  
  `curl -T testOCR.pdf http://localhost:9998/rmeta/text --header "X-Tika-PDFextractInlineImages: true"`
- 
- See [[https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29|PDFParser notes]].
  
  = Disable Tika OCR =
  Tika's OCR will trigger on images embedded within, say, office documents in 
addition to images you upload directly. Because OCR slows down Tika, you might 
want to disable it if you don't need the results. You can disable OCR by simply 
uninstalling tesseract, but if that's not an option, here is a tika.xml config 
file that disables OCR:
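
A minimal sketch of such a config, assuming Tika's standard `parser-exclude` mechanism: it keeps the default parser set but leaves out the Tesseract-backed OCR parser. The class names below are the ones used in Tika's own packages; check them against the Tika version you have installed.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Load every default parser except the one that calls tesseract -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
  </parsers>
</properties>
```

With the OCR parser excluded, images are still detected and their metadata can still be extracted, but no text recognition is attempted on them.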
