[jira] [Updated] (TIKA-1729) OCR in PDF files

Loris Bachert (JIRA) Thu, 03 Sep 2015 02:52:59 -0700

     [ https://issues.apache.org/jira/browse/TIKA-1729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Loris Bachert updated TIKA-1729:
--------------------------------
    Description: 
As described in this 
[stackoverflow-post|http://stackoverflow.com/questions/32354209/apache-tika-extract-scanned-pdf-files],
I'm having trouble extracting text from scanned PDF files. By scanned PDF 
files I mean PDF files that consist only of images. Because each page is an 
image, I can't extract them using a custom ParsingEmbeddedDocumentExtractor. I 
also tried the setExtractInlineImages method of PDFParserConfig, but that 
didn't work either.
There was already a [ticket|https://issues.apache.org/jira/browse/TIKA-93] 
regarding OCR support, which includes the [PDF 
file|https://issues.apache.org/jira/secure/attachment/12627866/testOCR.pdf] I'm 
using for my tests.
Here is a JUnit test that demonstrates the issue:
{code:title=PDFOCRTest.java|borderStyle=solid}
@Test
public void testPDFOCRExtraction() throws IOException, SAXException, TikaException {
        // filePath points to the scanned test PDF (each page is a single image)
        File file = new File(filePath);
        // try-with-resources closes the stream even if parsing fails
        try (InputStream stream = new FileInputStream(file)) {
                BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
                Metadata metadata = new Metadata();

                // Ask the PDF parser to extract inline images
                PDFParserConfig config = new PDFParserConfig();
                config.setExtractInlineImages(true);
                ParseContext context = new ParseContext();
                context.set(PDFParserConfig.class, config);

                PDFParser pdfParser = new PDFParser();
                pdfParser.setPDFParserConfig(config);
                pdfParser.parse(stream, handler, metadata, context);

                // The scanned PDF should yield some text
                String text = handler.toString().trim();
                assertFalse(text.isEmpty());
        }
}
{code}

  was:
As described in this 
[stackoverflow-post|http://stackoverflow.com/questions/32354209/apache-tika-extract-scanned-pdf-files],
I'm having trouble extracting text from scanned PDF files. By scanned PDF 
files I mean PDF files that consist only of images. Because each page is an 
image, I can't extract them using a custom ParsingEmbeddedDocumentExtractor. I 
also tried the setExtractInlineImages method of PDFParserConfig, but that 
didn't work either.
There was already a [ticket|https://issues.apache.org/jira/browse/TIKA-93] 
regarding OCR support, which includes the [PDF 
file|https://issues.apache.org/jira/secure/attachment/12627866/testOCR.pdf] I'm 
using for my tests.
Here is a JUnit test that demonstrates the issue:
{code:title=PDFOCRTest.java|borderStyle=solid}
@Test
public void testPDFOCRExtraction() throws IOException, SAXException, TikaException {
        // filePath points to the scanned test PDF (each page is a single image)
        File file = new File(filePath);
        // try-with-resources closes the stream even if parsing fails
        try (InputStream stream = new FileInputStream(file)) {
                BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
                Metadata metadata = new Metadata();

                // Ask the PDF parser to extract inline images
                PDFParserConfig config = new PDFParserConfig();
                config.setExtractInlineImages(true);
                ParseContext context = new ParseContext();
                context.set(PDFParserConfig.class, config);

                PDFParser pdfParser = new PDFParser();
                pdfParser.setPDFParserConfig(config);
                pdfParser.parse(stream, handler, metadata, context);

                // The scanned PDF should yield some text
                String text = handler.toString();
                assertFalse(text.isEmpty());
        }
}
{code}
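
The only change between the two versions of the test above is the trim() call on the handler output. The following is a minimal sketch of why that call can matter, assuming (this is not confirmed in the report) that the handler returns only line breaks for an image-only page. The class name TrimCheck, the helper isEffectivelyEmpty, and the sample strings are hypothetical and not part of Tika.

```java
public class TrimCheck {

    /** Returns true when the extracted text contains no visible characters. */
    static boolean isEffectivelyEmpty(String extracted) {
        return extracted == null || extracted.trim().isEmpty();
    }

    public static void main(String[] args) {
        // Hypothetical handler output for an image-only page: line breaks only.
        String whitespaceOnly = "\n\n\n";

        // Without trim(), whitespace-only output is not considered empty.
        System.out.println(whitespaceOnly.isEmpty());           // false

        // With trim(), the same output is recognized as containing no text.
        System.out.println(isEffectivelyEmpty(whitespaceOnly)); // true

        // Output containing real text is never treated as empty.
        System.out.println(isEffectivelyEmpty("Hello OCR"));    // false
    }
}
```

Under that assumption, assertFalse(text.isEmpty()) without trim() could pass even though no text was recognized, so the trimmed check is the stricter of the two.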


> OCR in PDF files
> ----------------
>
>                 Key: TIKA-1729
>                 URL: https://issues.apache.org/jira/browse/TIKA-1729
>             Project: Tika
>          Issue Type: Bug
>          Components: config, parser
>    Affects Versions: 1.9, 1.10
>         Environment: Windows 7, 64-bit, JDK 1.8.0_51 64 bit
>            Reporter: Loris Bachert
>              Labels: java, ocr, parser, pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
