RE: apache tikka is not working for scanned image documents

Allison, Timothy B. Wed, 05 Apr 2017 04:17:27 -0700

Yes, and please also see the "Optional Dependencies" section here:

https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29
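
Scanned PDFs often store their page images as JBIG2 or JPEG 2000 streams, and PDFBox can only decode those when a matching ImageIO plugin is on the classpath, which is what the optional dependencies provide. A rough way to check is sketched below using only the JDK's javax.imageio API. The class name ImageReaderCheck is made up for illustration, and the format names "JBIG2" and "JPEG2000" are the ones I would expect the plugins to register, so treat them as an assumption:

```java
import javax.imageio.ImageIO;

final class ImageReaderCheck {

    // True if some ImageIO plugin on the classpath can read the given format.
    static boolean hasReader(String formatName) {
        return ImageIO.getImageReadersByFormatName(formatName).hasNext();
    }

    public static void main(String[] args) {
        // Assumed format names; adjust if your plugins register different ones.
        for (String format : new String[] {"JBIG2", "JPEG2000"}) {
            System.out.println(format + " reader available: " + hasReader(format));
        }
    }
}
```

If either line reports false, adding the corresponding optional dependency is probably worth trying before digging further into the OCR setup.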




-----Original Message-----
From: Thejan Wijesinghe [mailto:[email protected]] 
Sent: Wednesday, April 5, 2017 2:24 AM
To: [email protected]
Subject: Re: apache tikka is not working for scanned image documents

Hi Vadivelhan,

As Chris mentioned, please visit https://wiki.apache.org/tika/TikaOCR and 
install Tesseract on your machine. To check that Tesseract is available, 
run the command "tesseract test.jpg out" (without the quotes) in a terminal 
and check that it OCRs the image and writes the text to out.txt.
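
If you would rather make that check from Java, something like the following should do. It is only a sketch built on the JDK's ProcessBuilder: the class and method names are hypothetical, and it assumes the executable is called "tesseract", that it is on the PATH seen by the JVM (which can differ from your shell's PATH), and that "tesseract --version" exits with status 0.

```java
import java.io.IOException;

final class TesseractCheck {

    // Returns true if the command could be started and exited with status 0.
    static boolean canRun(String... command) {
        try {
            // inheritIO() forwards the child's output to this console,
            // so the version banner is visible and no pipe can fill up.
            Process process = new ProcessBuilder(command).inheritIO().start();
            return process.waitFor() == 0;
        } catch (IOException e) {
            // Typically means the executable was not found on the PATH.
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("tesseract available: " + canRun("tesseract", "--version"));
    }
}
```

A false result here usually points to an installation or PATH problem, not to anything in the Tika configuration.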

Here is a code snippet to OCR a PDF; give it a run.

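import java.io.InputStream;
import java.util.List;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.RecursiveParserWrapper;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BasicContentHandlerFactory;
import org.xml.sax.helpers.DefaultHandler;

public void doOCR() throws Exception {

    // Classpath resource; it is looked up relative to the class used below.
    String resource = "yourPDF.pdf";

    // Default Tesseract settings (expects tesseract on the PATH).
    TesseractOCRConfig config = new TesseractOCRConfig();

    // Collects plain text for the PDF and for every embedded image;
    // -1 means no limit on the amount of text written.
    Parser parser = new RecursiveParserWrapper(new AutoDetectParser(),
            new BasicContentHandlerFactory(
                    BasicContentHandlerFactory.HANDLER_TYPE.TEXT, -1));

    // Inline images are not extracted by default, so nothing reaches the OCR
    // parser unless this is switched on.
    PDFParserConfig pdfConfig = new PDFParserConfig();
    pdfConfig.setExtractInlineImages(true);

    ParseContext parseContext = new ParseContext();
    parseContext.set(TesseractOCRConfig.class, config);
    parseContext.set(Parser.class, parser);
    parseContext.set(PDFParserConfig.class, pdfConfig);

    // TesseractOCRParserTest comes from Tika's own tests; substitute a class
    // of yours that sits next to the PDF on the classpath.
    try (InputStream stream =
            TesseractOCRParserTest.class.getResourceAsStream(resource)) {
        parser.parse(stream, new DefaultHandler(), new Metadata(), parseContext);
    }
    List<Metadata> metadataList = ((RecursiveParserWrapper) parser).getMetadata();

    StringBuilder contents = new StringBuilder();
    for (Metadata m : metadataList) {
        // Entries without extracted text return null; skip them so the
        // literal string "null" is not appended.
        String text = m.get(RecursiveParserWrapper.TIKA_CONTENT);
        if (text != null) {
            contents.append(text);
        }
    }

    System.out.println(contents.toString());
}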


On Wed, Apr 5, 2017 at 9:07 AM, Vadivelhan <[email protected]> wrote:

> Hi ,
>
> I tested Apache Tika with the OCR configuration. It is not able to 
> extract text from the PDF document. I have attached the document. 
> Please check and let me know the result. This is very urgent. 
> It would be really appreciated.
>
>
> Best Regards,
> M.Vadivelhan
> Cell No: +91 7708435395
>
> On Tue, 04 Apr 2017 23:09:15 +0530 Chris Mattmann wrote
> > Hi,
> >
> > Have you checked out: http://wiki.apache.org/tika/TikaOCR
> >
> > What specifically isn’t working?
> >
> > Moving this to [email protected]:
> >
> > Cheers,
> > Chris
> >
> > From: on behalf of Vadivelhan
> > Date: Tuesday, April 4, 2017 at 8:25 AM
> > To: "[email protected]"
> > Subject: apache tikka is not working for scanned image documents
> >
> > HI
> >
> > apache tikka is not working for scanned image documents. please 
> > suggest your help
>
> Regards,
> M.Vadivelhan
>
>
>
