[ 
https://issues.apache.org/jira/browse/TIKA-1729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14729036#comment-14729036
 ] 

Tim Allison commented on TIKA-1729:
-----------------------------------

Where is it failing?

Are you able to extract the inline images into separate files?

Are you able to run OCR against those files?

IIRC, Tesseract doesn't handle all of the image formats that PDF uses, so you 
might need to convert the extracted images first... Ugh.  That's probably part 
of the reason we haven't implemented this yet.
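
To make the conversion step concrete, here is a rough sketch using only the JDK. Everything in it is an assumption for illustration: OcrImagePrep and its methods are hypothetical names (not Tika API), PNG is just one format Tesseract is generally able to read, and the command assumes a tesseract binary on the PATH. It also assumes the inline image has already been decoded into a BufferedImage; stock ImageIO has no JPEG 2000 or JBIG2 readers, so those PDF image types would still need an extra decoder.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import javax.imageio.ImageIO;

/** Hypothetical helper for illustration only; not part of Tika. */
final class OcrImagePrep {

    private OcrImagePrep() {}

    /** Redraws a decoded image as plain RGB on a white background. */
    static BufferedImage toRgb(BufferedImage src) {
        BufferedImage rgb = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_INT_RGB);
        Graphics2D g = rgb.createGraphics();
        try {
            // White fill replaces transparent pixels, if there are any.
            g.drawImage(src, 0, 0, Color.WHITE, null);
        } finally {
            g.dispose();
        }
        return rgb;
    }

    /** Re-encodes the image as PNG bytes that can be written to a temp file. */
    static byte[] toPng(BufferedImage image) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            if (!ImageIO.write(toRgb(image), "png", out)) {
                throw new IllegalStateException("No PNG writer available");
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /**
     * Builds "tesseract <image> <outputBase>"; Tesseract then writes the
     * recognized text to <outputBase>.txt.
     */
    static List<String> tesseractCommand(Path image, Path outputBase) {
        return Arrays.asList("tesseract", image.toString(), outputBase.toString());
    }
}
```

The idea would be to write the toPng bytes to a temp file, pass the list from tesseractCommand to a ProcessBuilder, and read the resulting .txt file back in. Running this by hand on a few extracted images should show whether extraction or OCR is the step that fails.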

> OCR in PDF files
> ----------------
>
>                 Key: TIKA-1729
>                 URL: https://issues.apache.org/jira/browse/TIKA-1729
>             Project: Tika
>          Issue Type: Improvement
>          Components: config, parser
>    Affects Versions: 1.9, 1.10
>         Environment: Windows 7, 64-bit, JDK 1.8.0_51 64-bit
> Windows 10, 64-bit, JDK 1.8.0_51 32-bit
>            Reporter: Loris Bachert
>              Labels: java, ocr, parser, pdf
>
> As described in this 
> [stackoverflow-post|http://stackoverflow.com/questions/32354209/apache-tika-extract-scanned-pdf-files],
> I'm having trouble extracting text from scanned PDF files. By scanned PDF 
> files I mean PDF files that consist only of images. Because each page is an 
> image, I can't extract them using a custom ParsingEmbeddedDocumentExtractor. I 
> also tried the setExtractInlineImages method of PDFParserConfig, but 
> this didn't work either.
> There was already a [ticket|https://issues.apache.org/jira/browse/TIKA-93] 
> regarding OCR support, which includes the [PDF 
> file|https://issues.apache.org/jira/secure/attachment/12627866/testOCR.pdf] 
> I'm using for my tests.
> Here is a JUnit test that demonstrates my issue:
> {code:title=PDFOCRTest.java|borderStyle=solid}
> @Test
> public void testPDFOCRExtraction() throws IOException, SAXException, TikaException {
>       // filePath is defined elsewhere in the test class; it points to the scanned PDF
>       File file = new File(filePath);
>       
>       BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
>       Metadata metadata = new Metadata();
>       // Ask the PDF parser to extract inline images
>       PDFParserConfig config = new PDFParserConfig();
>       config.setExtractInlineImages(true);
>       ParseContext context = new ParseContext();
>       context.set(PDFParserConfig.class, config);
>       
>       PDFParser pdfParser = new PDFParser();
>       pdfParser.setPDFParserConfig(config);
>       // try-with-resources closes the stream even if parsing throws
>       try (InputStream stream = new FileInputStream(file)) {
>             pdfParser.parse(stream, handler, metadata, context);
>       }
>       // Fails for image-only PDFs: no text comes back
>       String text = handler.toString().trim();
>       assertFalse(text.isEmpty());
> }
> {code}



