[jira] [Updated] (TIKA-2844) OCR_STRATEGY.OCR_ONLY does not extract any text

Horst Krause (JIRA) Mon, 25 Mar 2019 00:11:46 -0700


     [ https://issues.apache.org/jira/browse/TIKA-2844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Horst Krause updated TIKA-2844:
-------------------------------
    Description: 
I have some PDFs that were scanned and OCRed with some other software, but the 
recognized text quality is quite poor. So I would like to ignore the text in 
the PDF and just run a new OCR pass with Tesseract.

So I use OCR_STRATEGY.OCR_ONLY. Unfortunately this does not extract any text 
from the PDF. When I use OCR_AND_TEXT_EXTRACTION I get the poor text from the 
original PDF.

After trying several tutorials and examples, this is my code:
{code:java}
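import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.tika.Tika;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.parser.pdf.PDFParserConfig.OCR_STRATEGY;
import org.apache.tika.sax.BodyContentHandler;

final InputStream pdf = Files.newInputStream(Paths.get("e:/path/to/my.pdf"));

// Log which Tika build and top-level parser are in use
final TikaConfig config = TikaConfig.getDefaultConfig();
final String version = (new Tika(config)).toString();
LOG.info("Tika version " + version + " / "
    + config.getParser().getClass().getName());

// Integer.MAX_VALUE as write limit, so long documents are not cut off
final BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);

// Ask the PDF parser to OCR the pages instead of using the embedded text
final PDFParserConfig pdfConfig = new PDFParserConfig();
pdfConfig.setExtractInlineImages(true);
pdfConfig.setOcrStrategy(OCR_STRATEGY.OCR_ONLY);

// Directories of the Tesseract and ImageMagick installations (8.3 short names)
final TesseractOCRConfig tesserConfig = new TesseractOCRConfig();
tesserConfig.setTesseractPath("c:/Progra~1/TESSER~1");
tesserConfig.setImageMagickPath("C:/Progra~1/IMAGEM~1.8-Q");
tesserConfig.setEnableImageProcessing(1); // 1 = preprocess images before OCR

final Parser parser = new AutoDetectParser();
final Metadata meta = new Metadata();
final ParseContext parsecontext = new ParseContext();

// Register the parser for embedded documents, plus both configs
parsecontext.set(Parser.class, parser);
parsecontext.set(PDFParserConfig.class, pdfConfig);
parsecontext.set(TesseractOCRConfig.class, tesserConfig);

parser.parse(pdf, handler, meta, parsecontext);
System.out.println("OCR Result: " + handler.toString());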

{code}
My maven dependencies:
{code:xml}
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>1.20</version>
</dependency>

<dependency>
  <groupId>com.levigo.jbig2</groupId>
  <artifactId>levigo-jbig2-imageio</artifactId>
  <version>1.6.5</version>
</dependency>

<dependency>
  <groupId>com.github.jai-imageio</groupId>
  <artifactId>jai-imageio-core</artifactId>
  <version>1.3.1</version> <!-- 1.4.0 -->
</dependency>

<dependency>
  <groupId>com.github.jai-imageio</groupId>
  <artifactId>jai-imageio-jpeg2000</artifactId>
  <version>1.3.0</version>
</dependency>

<dependency>
  <groupId>org.apache.pdfbox</groupId>
  <artifactId>jbig2-imageio</artifactId>
  <version>3.0.0</version>
</dependency>

{code}
 

As there is no error message or stack trace at all, I don't understand why I 
don't get any result. If it is not a bug, Tika should at least output some hint 
about what is going wrong.
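
Since nothing is logged, one thing that could be ruled out first is whether the external tools are actually present in the configured directories. The following is only a minimal sketch using the plain JDK, not a Tika API: the class name OcrPreflight and both helper methods are hypothetical, and the executable names ("tesseract", and "convert" for ImageMagick) are assumptions that may not match every installation.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical pre-flight check: are the OCR tools where the config says they are? */
final class OcrPreflight {

    /** Builds the expected executable path inside a tool directory. */
    static Path resolveExecutable(String toolDir, String baseName, boolean windows) {
        String fileName = windows ? baseName + ".exe" : baseName;
        return Paths.get(toolDir).resolve(fileName);
    }

    /** True only if the path is an existing regular file that may be executed. */
    static boolean isRunnable(Path executable) {
        return Files.isRegularFile(executable) && Files.isExecutable(executable);
    }

    public static void main(String[] args) {
        boolean windows = System.getProperty("os.name").toLowerCase().startsWith("windows");

        // Directories copied from the report; adjust them for the local installation.
        // The executable base names are assumptions, not taken from the report.
        Path tesseract = resolveExecutable("c:/Progra~1/TESSER~1", "tesseract", windows);
        Path convert = resolveExecutable("C:/Progra~1/IMAGEM~1.8-Q", "convert", windows);

        System.out.println(tesseract + " runnable: " + isRunnable(tesseract));
        System.out.println(convert + " runnable: " + isRunnable(convert));
    }
}
```

If either line reports false, the missing tool would be a candidate explanation for the empty result, although this check alone does not prove it.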

  was:
I have some PDFs that were scanned and OCRed with some other software, but the 
recognized text quality is quite poor. So I would like to ignore the text in 
the PDF and just run a new OCR pass with Tesseract.

So I use OCR_STRATEGY.OCR_ONLY. Unfortunately this does not extract any text 
from the PDF. When I use OCR_AND_TEXT_EXTRACTION I get the poor text from the 
original PDF.

After trying several tutorials and examples, this is my code:
{code:java}
final InputStream pdf = Files.newInputStream(Paths.get("e:/path/to/my.pdf"));
final ByteArrayOutputStream out = new ByteArrayOutputStream();

final TikaConfig config = TikaConfig.getDefaultConfig();
final String version = (new Tika(config)).toString();
LOG.info("Tika version " + version + " / " + 
config.getParser().getClass().getName());

final BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);

final PDFParserConfig pdfConfig = new PDFParserConfig();
pdfConfig.setExtractInlineImages(true);
pdfConfig.setOcrStrategy(OCR_STRATEGY.OCR_ONLY);

final TesseractOCRConfig tesserConfig = new TesseractOCRConfig();
tesserConfig.setTesseractPath("c:/Progra~1/TESSER~1");
tesserConfig.setImageMagickPath("C:/Progra~1/IMAGEM~1.8-Q");
tesserConfig.setEnableImageProcessing(1);

final Parser parser = new AutoDetectParser();
final Metadata meta = new Metadata();
final ParseContext parsecontext = new ParseContext();

parsecontext.set(Parser.class, parser);
parsecontext.set(PDFParserConfig.class, pdfConfig);
parsecontext.set(TesseractOCRConfig.class, tesserConfig);

parser.parse(pdf, handler, meta, parsecontext);
System.out.println("OCR Result: " + handler.toString());

{code}
As there is no error message or stack trace at all, I don't understand why I 
don't get any result. If it is not a bug, Tika should at least output some hint 
about what is going wrong.


> OCR_STRATEGY.OCR_ONLY does not extract any text
> -----------------------------------------------
>
>                 Key: TIKA-2844
>                 URL: https://issues.apache.org/jira/browse/TIKA-2844
>             Project: Tika
>          Issue Type: Bug
>          Components: ocr
>    Affects Versions: 1.20
>         Environment: Win7, 64-bit, Tesseract 4.1.0 and ImageMagick 7.0.8 
> installed
>            Reporter: Horst Krause
>            Priority: Major
>
> I have some PDFs that were scanned and OCRed with some other software, but 
> the recognized text quality is quite poor. So I would like to ignore the 
> text in the PDF and just run a new OCR pass with Tesseract.
> So I use OCR_STRATEGY.OCR_ONLY. Unfortunately this does not extract any text 
> from the PDF. When I use OCR_AND_TEXT_EXTRACTION I get the poor text from the 
> original PDF.
> After trying several tutorials and examples, this is my code:
> {code:java}
> final InputStream pdf = Files.newInputStream(Paths.get("e:/path/to/my.pdf"));
> final ByteArrayOutputStream out = new ByteArrayOutputStream();
> final TikaConfig config = TikaConfig.getDefaultConfig();
> final String version = (new Tika(config)).toString();
> LOG.info("Tika version " + version + " / " + 
> config.getParser().getClass().getName());
> final BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
> final PDFParserConfig pdfConfig = new PDFParserConfig();
> pdfConfig.setExtractInlineImages(true);
> pdfConfig.setOcrStrategy(OCR_STRATEGY.OCR_ONLY);
> final TesseractOCRConfig tesserConfig = new TesseractOCRConfig();
> tesserConfig.setTesseractPath("c:/Progra~1/TESSER~1");
> tesserConfig.setImageMagickPath("C:/Progra~1/IMAGEM~1.8-Q");
> tesserConfig.setEnableImageProcessing(1);
> final Parser parser = new AutoDetectParser();
> final Metadata meta = new Metadata();
> final ParseContext parsecontext = new ParseContext();
> parsecontext.set(Parser.class, parser);
> parsecontext.set(PDFParserConfig.class, pdfConfig);
> parsecontext.set(TesseractOCRConfig.class, tesserConfig);
> parser.parse(pdf, handler, meta, parsecontext);
> System.out.println("OCR Result: " + handler.toString());
> {code}
> My maven dependencies:
> {code:xml}
> <dependency>
> <groupId>org.apache.tika</groupId>
> <artifactId>tika-parsers</artifactId>
> <version>1.20</version> <!-- 1.20 -->
> </dependency>
> <dependency>
> <groupId>com.levigo.jbig2</groupId>
> <artifactId>levigo-jbig2-imageio</artifactId>
> <version>1.6.5</version>
> </dependency>
> <dependency>
> <groupId>com.github.jai-imageio</groupId>
> <artifactId>jai-imageio-core</artifactId>
> <version>1.3.1</version> <!-- 1.4.0 -->
> </dependency>
> <dependency>
> <groupId>com.github.jai-imageio</groupId>
> <artifactId>jai-imageio-jpeg2000</artifactId>
> <version>1.3.0</version>
> </dependency>
> <dependency>
> <groupId>org.apache.pdfbox</groupId>
> <artifactId>jbig2-imageio</artifactId>
> <version>3.0.0</version>
> </dependency>
> {code}
>  
> As there is no error message or stack trace at all, I don't understand why I 
> don't get any result. If it is not a bug, Tika should at least output some 
> hint about what is going wrong.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
