Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tika Wiki" for change 
notification.

The "Troubleshooting Tika" page has been changed by NickBurch:
https://wiki.apache.org/tika/Troubleshooting%20Tika?action=diff&rev1=8&rev2=9

Comment:
PDF text issues

   * Make sure Tika is able to correctly detect your file's type, see 
'''Content Incorrectly Detected'''
   * Make sure Tika used the parser you meant it to, see '''Wrong Parser Used'''
   * Make sure you're actually using the version of Tika you meant to use! See 
'''Identifying your Tika Version'''
+  * Problems with a PDF? See '''PDF Text Problems'''
  
  == No Content Extracted ==
   * Make sure Tika is able to correctly detect your file's type, see 
'''Content Incorrectly Detected'''
@@ -239, +240 @@

  
  ''TODO describe how to use a ServiceLoader.LoadErrorHandler.ERROR to trigger 
an exception''
  
+ == PDF Text Problems ==
+ If Tika isn't extracting the right text from a PDF, and/or is giving errors, 
the first thing to do is work out whether this is a Tika issue or an issue with 
the underlying Apache PDFBox library that Tika uses.
+ 
+ To check, grab the latest [[http://pdfbox.apache.org/download.cgi|Apache 
PDFBox pdfbox-app jar]] and use the 
[[http://pdfbox.apache.org/2.0/commandline.html#extracttext|ExtractText command 
line tool]] on your problematic PDF. 
+ 
+ If that shows the same problem, it's a PDFBox bug. Please 
[[http://pdfbox.apache.org/support.html|file an Apache PDFBox bug report]] and 
attach at least one failing file to the bug. Once it is fixed, Tika will pick 
up the new PDFBox release and get the fix.
+ 
+ If PDFBox ExtractText works fine, it's likely a Tika bug. Please 
[[http://tika.apache.org/contribute.html|report an Apache Tika bug]], attach at 
least one failing file, and mention that PDFBox ExtractText works fine.
+ 
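As a rough sketch of the PDFBox check described above, the Python below wraps the ExtractText command line tool so its output can be captured and compared with what Tika returns. It assumes `java` is on the PATH and follows the PDFBox 2.0 command line syntax (`ExtractText -console <file>`); the jar file name, the PDF name and both helper function names are placeholders for illustration, not part of Tika or PDFBox.

```python
import subprocess


def build_extracttext_command(pdfbox_app_jar, pdf_path):
    """Build a PDFBox 2.0-style ExtractText command line.

    The -console option makes ExtractText print the text to stdout
    instead of writing it to an output file.
    """
    return ["java", "-jar", str(pdfbox_app_jar),
            "ExtractText", "-console", str(pdf_path)]


def extract_with_pdfbox(pdfbox_app_jar, pdf_path):
    """Run ExtractText and return (exit_code, extracted_text, error_output)."""
    result = subprocess.run(
        build_extracttext_command(pdfbox_app_jar, pdf_path),
        capture_output=True,  # collect stdout and stderr instead of printing them
        text=True,            # decode the output as text
        errors="replace",     # do not fail on bytes that cannot be decoded
    )
    return result.returncode, result.stdout, result.stderr
```

Calling `extract_with_pdfbox("pdfbox-app-2.y.z.jar", "problem.pdf")` and comparing the returned text and error output with what Tika produces for the same file should indicate which project the bug report belongs to.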
