[
https://issues.apache.org/jira/browse/TIKA-1729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Loris Bachert updated TIKA-1729:
--------------------------------
Description:
As described in this
[Stack Overflow post|http://stackoverflow.com/questions/32354209/apache-tika-extract-scanned-pdf-files]
I'm having trouble extracting text from scanned PDF files. By scanned PDF
files I mean PDF files that consist only of images. Because each page is an
image, I can't extract the text using a custom ParsingEmbeddedDocumentExtractor.
I also tried the setExtractInlineImages method of PDFParserConfig, but
this didn't work either.
There is already a [ticket|https://issues.apache.org/jira/browse/TIKA-93]
about OCR support, which includes the [PDF
file|https://issues.apache.org/jira/secure/attachment/12627866/testOCR.pdf] I'm
using for my tests.
Here is a JUnit test that reproduces my issue:
{code:title=PDFOCRTest.java|borderStyle=solid}
@Test
public void testPDFOCRExtraction() throws IOException, SAXException,
        TikaException {
    // filePath is a field of the test class pointing to the scanned PDF
    File file = new File(filePath);

    BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
    Metadata metadata = new Metadata();
    PDFParserConfig config = new PDFParserConfig();
    config.setExtractInlineImages(true);
    ParseContext context = new ParseContext();
    context.set(PDFParserConfig.class, config);

    PDFParser pdfParser = new PDFParser();
    pdfParser.setPDFParserConfig(config);
    // Close the stream once parsing is done
    try (InputStream stream = new FileInputStream(file)) {
        pdfParser.parse(stream, handler, metadata, context);
    }

    // Fails for the scanned PDF: no text is extracted
    String text = handler.toString().trim();
    assertFalse(text.isEmpty());
}
{code}
was:
As described in this
[Stack Overflow post|http://stackoverflow.com/questions/32354209/apache-tika-extract-scanned-pdf-files]
I'm having trouble extracting text from scanned PDF files. By scanned PDF
files I mean PDF files that consist only of images. Because each page is an
image, I can't extract the text using a custom ParsingEmbeddedDocumentExtractor.
I also tried the setExtractInlineImages method of PDFParserConfig, but
this didn't work either.
There is already a [ticket|https://issues.apache.org/jira/browse/TIKA-93]
about OCR support, which includes the [PDF
file|https://issues.apache.org/jira/secure/attachment/12627866/testOCR.pdf] I'm
using for my tests.
Here is a JUnit test that reproduces my issue:
{code:title=PDFOCRTest.java|borderStyle=solid}
@Test
public void testPDFOCRExtraction() throws IOException, SAXException,
        TikaException {
    // filePath is a field of the test class pointing to the scanned PDF
    File file = new File(filePath);

    BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
    Metadata metadata = new Metadata();
    PDFParserConfig config = new PDFParserConfig();
    config.setExtractInlineImages(true);
    ParseContext context = new ParseContext();
    context.set(PDFParserConfig.class, config);

    PDFParser pdfParser = new PDFParser();
    pdfParser.setPDFParserConfig(config);
    // Close the stream once parsing is done
    try (InputStream stream = new FileInputStream(file)) {
        pdfParser.parse(stream, handler, metadata, context);
    }

    String text = handler.toString();
    assertFalse(text.isEmpty());
}
{code}
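The only difference between the two versions of the test above is the added
trim() call on the handler output. A plausible reason, stated here as an
assumption rather than something the ticket confirms, is that the handler can
return whitespace-only output (such as line breaks) for an image-only PDF. In
that case isEmpty() without trim() returns false, and the assertion passes even
though no text was extracted. The plain-Java sketch below illustrates this with
a hypothetical class ExtractedTextCheck and a made-up whitespace-only string in
place of real Tika output:

```java
// Hypothetical helper for illustration; not part of Tika or of the reporter's test.
final class ExtractedTextCheck {

    // Mirrors the old assertion: only a zero-length string counts as empty.
    static boolean isEmptyUntrimmed(String text) {
        return text.isEmpty();
    }

    // Mirrors the updated assertion: whitespace-only output counts as empty too.
    static boolean isEmptyTrimmed(String text) {
        return text.trim().isEmpty();
    }

    public static void main(String[] args) {
        // Assumed stand-in for handler.toString() on an image-only PDF.
        String whitespaceOnly = "\n\n\n";
        System.out.println(isEmptyUntrimmed(whitespaceOnly)); // false
        System.out.println(isEmptyTrimmed(whitespaceOnly));   // true
    }
}
```

With the untrimmed check, assertFalse(text.isEmpty()) would succeed on such
output, so only the trimmed variant makes the test fail when nothing readable
was extracted.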
> OCR in PDF files
> ----------------
>
> Key: TIKA-1729
> URL: https://issues.apache.org/jira/browse/TIKA-1729
> Project: Tika
> Issue Type: Bug
> Components: config, parser
> Affects Versions: 1.9, 1.10
> Environment: Windows 7, 64-bit, JDK 1.8.0_51 64 bit
> Reporter: Loris Bachert
> Labels: java, ocr, parser, pdf