Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tika Wiki" for change notification.

The "Troubleshooting Tika" page has been changed by NickBurch:
https://wiki.apache.org/tika/Troubleshooting%20Tika?action=diff&rev1=8&rev2=9

Comment:
PDF text issues

   * Make sure Tika is able to correctly detect your file's type, see '''Content Incorrectly Detected'''
   * Make sure Tika used the parser you meant it to, see '''Wrong Parser Used'''
   * Make sure you're actually using the version of Tika you meant to use! See '''Identifying your Tika Version'''
+  * Problems with a PDF? See '''PDF Text Problems'''

  == No Content Extracted ==
   * Make sure Tika is able to correctly detect your file's type, see '''Content Incorrectly Detected'''
@@ -239, +240 @@

  ''TODO describe how to use a ServiceLoader.LoadErrorHandler.ERROR to trigger an exception''

+ == PDF Text Problems ==
+ If Tika isn't extracting the right text from a PDF, and/or is giving errors, the first thing to do is identify if this is a Tika issue, or an issue with the underlying Apache PDFBox library used.
+
+ To check, grab the latest [[http://pdfbox.apache.org/download.cgi|Apache PDFBox pdfbox-app jar]] and use the [[http://pdfbox.apache.org/2.0/commandline.html#extracttext|ExtractText command line tool]] on your problematic PDF.
+
+ If that shows the same problem, it's a PDFBox bug. Please [[http://pdfbox.apache.org/support.html|file an Apache PDFBox bug report]] and attach at least one failing file to the bug. When that gets fixed, Tika will pick up the new release and will get the fix.
+
+ If the PDFBox ExtractText works fine, it's likely a Tika bug. Please [[http://tika.apache.org/contribute.html|report an Apache Tika bug]], attach at least one failing file, and mention that PDFBox ExtractText works fine.
+
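
The ExtractText check described in the "PDF Text Problems" section can be scripted. The sketch below is only an illustration and is not part of Tika or PDFBox. It assumes the PDFBox 2.0 command-line syntax from the page linked above (`java -jar <pdfbox-app jar> ExtractText <input> <output>`), which may differ in other PDFBox releases. The class name `PdfBoxExtractTextCheck` and the jar and file paths passed to it are hypothetical.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of Tika or PDFBox.
class PdfBoxExtractTextCheck {

    // Builds the command line, assuming the PDFBox 2.0 syntax:
    //   java -jar <pdfbox-app jar> ExtractText <input.pdf> <output.txt>
    static List<String> buildCommand(String pdfboxAppJar, String inputPdf, String outputTxt) {
        return Arrays.asList("java", "-jar", pdfboxAppJar, "ExtractText", inputPdf, outputTxt);
    }

    // Runs ExtractText as a child process and returns its exit code.
    // The child's stdout and stderr are forwarded to this process.
    static int run(String pdfboxAppJar, String inputPdf, String outputTxt)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(pdfboxAppJar, inputPdf, outputTxt))
                .inheritIO()
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 3) {
            System.err.println(
                "usage: PdfBoxExtractTextCheck <pdfbox-app.jar> <input.pdf> <output.txt>");
            return;
        }
        int exitCode = run(args[0], args[1], args[2]);
        System.out.println("ExtractText exit code: " + exitCode);
    }
}
```

Compare the resulting text file with what Tika produced for the same PDF. If both show the same problem, that points to PDFBox. If only the Tika output is wrong, that points to Tika.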
