[ https://issues.apache.org/jira/browse/TIKA-1729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14728962#comment-14728962 ]
Tim Allison commented on TIKA-1729:
-----------------------------------

Right, we haven't integrated OCR with PDF files yet. You can do this now by extracting the image files (see ExtractEmbeddedFiles in tika-example) and then running OCR on each, but I agree that we should eventually add this capability.

> OCR in PDF files
> ----------------
>
>                 Key: TIKA-1729
>                 URL: https://issues.apache.org/jira/browse/TIKA-1729
>             Project: Tika
>          Issue Type: Bug
>          Components: config, parser
>    Affects Versions: 1.9, 1.10
>         Environment: Windows 7, 64-bit, JDK 1.8.0_51 64-bit
>                      Windows 10, 64-bit, JDK 1.8.0_51 32-bit
>            Reporter: Loris Bachert
>              Labels: java, ocr, parser, pdf
>
> As described in this
> [stackoverflow-post|http://stackoverflow.com/questions/32354209/apache-tika-extract-scanned-pdf-files]
> I'm having trouble extracting text from scanned PDF files. By scanned PDF
> files I mean PDF files that consist only of images. Because each page is an
> image, I can't extract them using a custom ParsingEmbeddedDocumentExtractor. I
> also tried the setExtractInlineImages method of PDFParserConfig, but
> this didn't work either.
> There was already a [ticket|https://issues.apache.org/jira/browse/TIKA-93]
> regarding OCR support, which includes the [PDF
> file|https://issues.apache.org/jira/secure/attachment/12627866/testOCR.pdf]
> I'm using for my tests.
> Here is a JUnit-test about my issue:
> {code:title=PDFOCRTest.java|borderStyle=solid}
> @Test
> public void testPDFOCRExtraction() throws IOException, SAXException, TikaException {
>     File file = new File(filePath);
>     InputStream stream = new FileInputStream(file);
>
>     BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
>     Metadata metadata = new Metadata();
>     PDFParserConfig config = new PDFParserConfig();
>     config.setExtractInlineImages(true);
>     ParseContext context = new ParseContext();
>     context.set(PDFParserConfig.class, config);
>
>     PDFParser pdfParser = new PDFParser();
>     pdfParser.setPDFParserConfig(config);
>     pdfParser.parse(stream, handler, metadata, context);
>     String text = handler.toString().trim();
>     assertFalse(text.isEmpty());
> }
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
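
The second half of the workaround in the comment above (run OCR on each extracted image) could be sketched roughly as follows. This is only an illustration under several assumptions: the page images have already been written to a directory (for example with something like ExtractEmbeddedFiles from tika-example), the tesseract command-line tool is installed and on the PATH, and the class and method names (OcrExtractedImages, tesseractCommand, ocr) are hypothetical, not part of Tika.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Tika: runs the tesseract CLI on image files
// that were previously extracted from a PDF.
class OcrExtractedImages {

    // Builds the tesseract command line. Passing "stdout" as the output base
    // makes tesseract print the recognized text instead of writing a .txt file.
    static List<String> tesseractCommand(Path image) {
        return Arrays.asList("tesseract", image.toString(), "stdout");
    }

    // Runs tesseract on one image and returns the recognized text.
    // Assumes the tesseract executable is on the PATH.
    static String ocr(Path image) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(tesseractCommand(image))
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .start();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream in = process.getInputStream()) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tesseract exited with code " + exitCode + " for " + image);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8).trim();
    }

    // Usage: java OcrExtractedImages <directory-with-extracted-images>
    public static void main(String[] args) throws Exception {
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        // Only common raster formats are considered here; adjust as needed.
        try (DirectoryStream<Path> images =
                     Files.newDirectoryStream(dir, "*.{png,jpg,jpeg,tif,tiff}")) {
            for (Path image : images) {
                System.out.println("== " + image.getFileName() + " ==");
                System.out.println(ocr(image));
            }
        }
    }
}
```

Shelling out to the CLI keeps the sketch free of extra dependencies; it does not show how Tika itself would integrate OCR into the PDF parser.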