btw, it's not tikka. It's Tika :)

On Wed, Apr 5, 2017 at 11:53 AM, Thejan Wijesinghe <[email protected]> wrote:
> Hi Vadivelhan,
>
> As Chris mentioned, please visit https://wiki.apache.org/tika/TikaOCR and
> install Tesseract on your machine. To check that Tesseract is available
> on your machine, type this command without quotes "tesseract test.jpg out"
> in the terminal and check whether you can OCR an image and output it
> to a file.
>
> This is a code snippet to OCR a PDF, give it a run.
>
> public void doOCR() throws Exception {
>
>     String resource = "yourPDF.pdf";
>
>     TesseractOCRConfig config = new TesseractOCRConfig();
>
>     // Collect plain text for the PDF and every embedded document;
>     // -1 means no write limit.
>     Parser parser = new RecursiveParserWrapper(new AutoDetectParser(),
>             new BasicContentHandlerFactory(
>                     BasicContentHandlerFactory.HANDLER_TYPE.TEXT, -1));
>
>     // Extract inline images so the scanned pages reach the OCR parser.
>     PDFParserConfig pdfConfig = new PDFParserConfig();
>     pdfConfig.setExtractInlineImages(true);
>
>     ParseContext parseContext = new ParseContext();
>     parseContext.set(TesseractOCRConfig.class, config);
>     parseContext.set(Parser.class, parser);
>     parseContext.set(PDFParserConfig.class, pdfConfig);
>
>     try (InputStream stream =
>             TesseractOCRParserTest.class.getResourceAsStream(resource)) {
>         parser.parse(stream, new DefaultHandler(), new Metadata(),
>                 parseContext);
>     }
>
>     // One Metadata entry per document (container plus embedded images).
>     List<Metadata> metadataList = ((RecursiveParserWrapper)
>             parser).getMetadata();
>
>     StringBuilder contents = new StringBuilder();
>     for (Metadata m : metadataList) {
>         contents.append(m.get(RecursiveParserWrapper.TIKA_CONTENT));
>     }
>
>     System.out.println(contents.toString());
> }
>
>
> On Wed, Apr 5, 2017 at 9:07 AM, Vadivelhan <vadivelcommunicationid@rediffmail.com> wrote:
>
>> Hi,
>>
>> I tested Apache Tikka with the OCR configuration. It is not able to
>> provide extracted text from the PDF document. I attached the same
>> document. Please check and update me with the result. This is very
>> urgent. It would be really appreciated.
>>
>>
>> Best Regards,
>> M.Vadivelhan
>> Cell No: +91 7708435395
>>
>> On Tue, 04 Apr 2017 23:09:15 +0530 Chris Mattmann wrote
>>
>> > Hi,
>> >
>> > Have you checked out:
>> >
>> > http://wiki.apache.org/tika/TikaOCR
>> >
>> > What specifically isn’t working?
>> >
>> > Moving this to [email protected]:
>> >
>> > Cheers,
>> > Chris
>> >
>> > From: on behalf of Vadivelhan
>> > Date: Tuesday, April 4, 2017 at 8:25 AM
>> > To: "[email protected]"
>> > Subject: apache tikka is not working for scanned image documents
>> >
>> > HI
>> >
>> > apache tikka is not working for scanned image documents. please suggest
>> > your help
>> >
>> > Regards,
>> > M.Vadivelhan
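
P.S. On the availability check Thejan describes: the same check can be done from Java before handing a document to Tika. This is only a rough sketch, not anything Tika ships. The class name TesseractCheck and the method isAvailable are made up for illustration, it assumes Java 9 or later (for Redirect.DISCARD), and it assumes that "tesseract --version" exits with status 0 on your install.

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Tika: reports whether a command can be launched.
final class TesseractCheck {

    // Returns true if "<command> --version" starts and exits with status 0
    // within ten seconds; returns false if it is missing, fails, or hangs.
    static boolean isAvailable(String command) {
        try {
            Process process = new ProcessBuilder(command, "--version")
                    .redirectErrorStream(true)                       // merge stderr into stdout
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD) // we only need the exit code
                    .start();
            if (!process.waitFor(10, TimeUnit.SECONDS)) {
                process.destroyForcibly();
                return false;
            }
            return process.exitValue() == 0;
        } catch (IOException e) {
            // The binary is not on the PATH or is not executable.
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("tesseract available: " + isAvailable("tesseract"));
    }
}
```

If this prints false, the OCR step has nothing to call, so it is worth fixing the Tesseract install (or the PATH) before looking at the Tika code.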
