Yes, and please also see the "Optional Dependencies" section here: https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29
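
In case it helps, the Tesseract availability check described in the quoted message below can also be scripted. This is only a minimal sketch and not part of Tika: the class name TesseractCheck and the method isAvailable are made up for illustration, and it assumes Java 9+ (for Redirect.DISCARD) and that the installed tesseract binary accepts "--version" and exits with status 0.

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;

// Hypothetical helper (not a Tika class): reports whether a command can be launched.
class TesseractCheck {

    // Runs "<command> --version" and returns true only if the process starts,
    // finishes within 10 seconds, and exits with status 0.
    // Assumption: the tool supports a --version flag.
    static boolean isAvailable(String command) {
        ProcessBuilder pb = new ProcessBuilder(command, "--version");
        pb.redirectErrorStream(true);                        // merge stderr into stdout
        pb.redirectOutput(ProcessBuilder.Redirect.DISCARD);  // Java 9+: drop the output
        try {
            Process process = pb.start();
            if (!process.waitFor(10, TimeUnit.SECONDS)) {
                process.destroyForcibly();                   // timed out
                return false;
            }
            return process.exitValue() == 0;
        } catch (IOException e) {
            return false;                                    // not found or not executable
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();              // preserve interrupt status
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("tesseract available: " + isAvailable("tesseract"));
    }
}
```

If this reports false, the tesseract command is probably not on the PATH of the process running Tika, so that would be the first thing to fix before looking at the Tika configuration.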

-----Original Message-----
From: Thejan Wijesinghe [mailto:[email protected]]
Sent: Wednesday, April 5, 2017 2:24 AM
To: [email protected]
Subject: Re: apache tikka is not working for scanned image documents

Hi Vadivelhan,

As Chris mentioned, please visit https://wiki.apache.org/tika/TikaOCR and install Tesseract on your machine. To check that Tesseract is available, type this command (without the quotes) in a terminal: "tesseract test.jpg out". Then check whether it can OCR the image and write the text to a file (Tesseract names it out.txt).

Here is a code snippet that OCRs a PDF; give it a run.

    import java.io.InputStream;
    import java.util.List;

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.Parser;
    import org.apache.tika.parser.RecursiveParserWrapper;
    import org.apache.tika.parser.ocr.TesseractOCRConfig;
    import org.apache.tika.parser.pdf.PDFParserConfig;
    import org.apache.tika.sax.BasicContentHandlerFactory;
    import org.xml.sax.helpers.DefaultHandler;

    public void doOCR() throws Exception {
        String resource = "yourPDF.pdf";
        TesseractOCRConfig config = new TesseractOCRConfig();

        // Wrap the parser so that embedded documents (the page images) are parsed too.
        Parser parser = new RecursiveParserWrapper(new AutoDetectParser(),
                new BasicContentHandlerFactory(
                        BasicContentHandlerFactory.HANDLER_TYPE.TEXT, -1));

        // Inline image extraction is off by default; scanned PDFs need it for OCR.
        PDFParserConfig pdfConfig = new PDFParserConfig();
        pdfConfig.setExtractInlineImages(true);

        ParseContext parseContext = new ParseContext();
        parseContext.set(TesseractOCRConfig.class, config);
        parseContext.set(Parser.class, parser);
        parseContext.set(PDFParserConfig.class, pdfConfig);

        // TesseractOCRParserTest is Tika's own test class; use your own class here
        // to load the PDF from your classpath.
        try (InputStream stream = TesseractOCRParserTest.class.getResourceAsStream(resource)) {
            parser.parse(stream, new DefaultHandler(), new Metadata(), parseContext);
        }

        // One Metadata object per document (container plus each embedded image).
        List<Metadata> metadataList = ((RecursiveParserWrapper) parser).getMetadata();
        StringBuilder contents = new StringBuilder();
        for (Metadata m : metadataList) {
            contents.append(m.get(RecursiveParserWrapper.TIKA_CONTENT));
        }
        System.out.println(contents.toString());
    }

On Wed, Apr 5, 2017 at 9:07 AM, Vadivelhan <[email protected]> wrote:

> Hi,
>
> I tested Apache Tika with the OCR configuration. It is not able to
> provide extracted text from the PDF document. I have attached the same
> document. Please check and update me with the result. This is very urgent.
> It would be really appreciated.
>
> Best Regards,
> M.Vadivelhan
> Cell No:+91 7708435395
>
> On Tue, 04 Apr 2017 23:09:15 +0530 Chris Mattmann wrote
>
> Hi,
>
> Have you checked out: http://wiki.apache.org/tika/TikaOCR
>
> What specifically isn’t working?
>
> Moving this to [email protected]:
>
> Cheers,
> Chris
>
> From: on behalf of Vadivelhan
> Date: Tuesday, April 4, 2017 at 8:25 AM
> To: "[email protected]"
> Subject: apache tikka is not working for scanned image documents
>
> HI
>
> apache tikka is not working for scanned image documents. please
> suggest your help
>
> Regards,
> M.Vadivelhan
