[jira] [Commented] (TIKA-1729) OCR in PDF files

Tim Allison (JIRA) Thu, 03 Sep 2015 07:16:30 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-1729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14729117#comment-14729117
 ]


Tim Allison commented on TIKA-1729:
-----------------------------------

Sorry for my misinformation above.  

I think the issue is that you need to use the AutoDetectParser.  If you call 
only the PDFParser, it doesn't know what to do (by itself) with embedded 
documents, which here are the inline images that need to go to OCR.

Something like this:
{code}
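        // Imports needed by this snippet
        import org.apache.tika.metadata.Metadata;
        import org.apache.tika.parser.AutoDetectParser;
        import org.apache.tika.parser.ParseContext;
        import org.apache.tika.parser.Parser;
        import org.apache.tika.parser.ocr.TesseractOCRConfig;
        import org.apache.tika.parser.pdf.PDFParserConfig;
        import org.apache.tika.sax.BodyContentHandler;

        // "stream" is the InputStream over the PDF, as in the test in the issue
        Parser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);

        TesseractOCRConfig config = new TesseractOCRConfig();
        PDFParserConfig pdfConfig = new PDFParserConfig();
        pdfConfig.setExtractInlineImages(true);

        ParseContext parseContext = new ParseContext();
        parseContext.set(TesseractOCRConfig.class, config);
        parseContext.set(PDFParserConfig.class, pdfConfig);
        // need to add this to make sure recursive parsing happens!
        parseContext.set(Parser.class, parser);

        parser.parse(stream, handler, new Metadata(), parseContext);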
{code}
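
To illustrate why registering the parser in the context matters, here is a toy model in plain Java. It is only a sketch of the general idea, not Tika's code: every type in it (ToyParser, ToyDoc, ToyContext, ToyPdfParser, ToyAutoParser, ToyDemo) is made up for this example, and the fallback to a do-nothing parser is my understanding of the behavior rather than something verified against a particular Tika version. The point is that the PDF-like parser looks up a delegate in the context for each embedded document, and if nothing is registered the embedded images are silently skipped.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: all types here are hypothetical stand-ins, not Tika classes.
interface ToyParser {
    String parse(ToyDoc doc, ToyContext context);
}

// A document with its own text plus embedded children (e.g. page images).
final class ToyDoc {
    final String type;
    final String text;
    final List<ToyDoc> embedded;

    ToyDoc(String type, String text, List<ToyDoc> embedded) {
        this.type = type;
        this.text = text;
        this.embedded = embedded;
    }
}

// Class-keyed lookup table, in the spirit of a parse context.
final class ToyContext {
    private final Map<Class<?>, Object> entries = new HashMap<>();

    <T> void set(Class<T> key, T value) {
        entries.put(key, value);
    }

    <T> T get(Class<T> key, T defaultValue) {
        Object value = entries.get(key);
        return value == null ? defaultValue : key.cast(value);
    }
}

// Stand-in for a format-specific parser: it emits its own text and hands
// each embedded document to whatever parser the context provides.
final class ToyPdfParser implements ToyParser {
    // Used when no delegate is registered: embedded content is dropped.
    static final ToyParser EMPTY = (doc, context) -> "";

    @Override
    public String parse(ToyDoc doc, ToyContext context) {
        ToyParser delegate = context.get(ToyParser.class, EMPTY);
        StringBuilder out = new StringBuilder(doc.text);
        for (ToyDoc child : doc.embedded) {
            out.append(delegate.parse(child, context));
        }
        return out.toString();
    }
}

// Stand-in for an auto-detecting parser: dispatches on the document type.
final class ToyAutoParser implements ToyParser {
    @Override
    public String parse(ToyDoc doc, ToyContext context) {
        if ("pdf".equals(doc.type)) {
            return new ToyPdfParser().parse(doc, context);
        }
        if ("image".equals(doc.type)) {
            return "[ocr:" + doc.text + "]"; // pretend OCR result
        }
        return "";
    }
}

final class ToyDemo {
    // Parses a one-page "scanned PDF", with or without a registered delegate.
    static String run(boolean registerDelegate) {
        ToyDoc page = new ToyDoc("image", "scanned page", Collections.<ToyDoc>emptyList());
        ToyDoc pdf = new ToyDoc("pdf", "", Collections.singletonList(page));
        ToyParser parser = new ToyAutoParser();
        ToyContext context = new ToyContext();
        if (registerDelegate) {
            context.set(ToyParser.class, parser); // the step that enables recursion
        }
        return parser.parse(pdf, context);
    }

    public static void main(String[] args) {
        System.out.println("without delegate: '" + ToyDemo.run(false) + "'");
        System.out.println("with delegate:    '" + ToyDemo.run(true) + "'");
    }
}
```

In this model the run without a registered delegate yields an empty string, which matches the symptom in the issue (the assertFalse on an empty text failing), while the run with the delegate registered returns the text produced for the embedded image.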



> OCR in PDF files
> ----------------
>
>                 Key: TIKA-1729
>                 URL: https://issues.apache.org/jira/browse/TIKA-1729
>             Project: Tika
>          Issue Type: Improvement
>          Components: config, parser
>    Affects Versions: 1.9, 1.10
>         Environment: Windows 7, 64-bit, JDK 1.8.0_51 64-bit
> Windows 10, 64-bit, JDK 1.8.0_51 32-bit
>            Reporter: Loris Bachert
>              Labels: java, ocr, parser, pdf
>
> As described in this 
> [stackoverflow-post|http://stackoverflow.com/questions/32354209/apache-tika-extract-scanned-pdf-files]
>  I'm having trouble extracting text from scanned PDF files. By scanned PDF 
> files I mean PDF files that consist only of images. Because each page is an 
> image, I can't extract them using a custom ParsingEmbeddedDocumentExtractor. I 
> also tried using the setExtractInlineImages method of the PDFParserConfig, but 
> this didn't work either.
> There is already a [ticket|https://issues.apache.org/jira/browse/TIKA-93] 
> regarding OCR support, which includes the [PDF 
> file|https://issues.apache.org/jira/secure/attachment/12627866/testOCR.pdf] 
> I'm using for my tests.
> Here is a JUnit test that reproduces my issue:
> {code:title=PDFOCRTest.java|borderStyle=solid}
> @Test
> public void testPDFOCRExtraction() throws IOException, SAXException, 
> TikaException {
>       File file = new File(filePath);
>       InputStream stream = new FileInputStream(file);
>       
>       BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
>       Metadata metadata = new Metadata();
>       PDFParserConfig config = new PDFParserConfig();
>       config.setExtractInlineImages(true);
>       ParseContext context = new ParseContext();
>       context.set(PDFParserConfig.class, config);
>       
>       PDFParser pdfParser = new PDFParser();
>       pdfParser.setPDFParserConfig(config);
>       pdfParser.parse(stream, handler, metadata, context);
>       String text = handler.toString().trim();
>       assertFalse(text.isEmpty());
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
