Hi Dimuthu,

I cloned your code and did some experiments with it, and it’s working nicely. I’m glad that subclassing PDFTextStripper has been a success; it’s a nice, clean implementation.

> Tesseract API [1]
>
> 1. Currently all necessary functions were implemented and test cases were
> written in order to check proper functionality
>
> 2. Support for Mac and linux operating systems. In future I'll try to add
> support for Windows also

That’s fine for now.

> 3. All static libs for Tesseract and Leptonica were pre built and added to
> resources folder.

Perfect.

> 4. At build phase it dynamically identify correct libs that support to
> particular Operating system
>
> 5. If some one needs to build above static libs manually, instructions were
> given in read me.
> 6. In future, I'll work on adding those static libs creation when project is
> built. Currently they must be manually built.

That would be handy.

> OCR plugin [2]
>
> 1. Almost finished implementing.
>
> 2. Working fine with sample PDF files I have given. Is there any set of PDF
> files that can be used to test accuracy and performance?

Currently, no, but I’ll take a look in my collection of test files…

> In addition to that, there are some code formatting and commenting stuff to
> be done.

It might be nice to add a command line utility to your OCR-Plugin, you could copy ExtractText.java from org.apache.pdfbox.tools and rename it to OCRText and have it use your PDFOCRTextStripper class instead of PDFTextStripper. That way your plugin is immediately usable by end-users.

-- John
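
P.S. On point 4 (picking the right prebuilt libs per operating system), here is a rough sketch of how that kind of selection can be done from the `os.name` system property. I have not checked how your build actually does it, so treat this as an illustration only: the class name `NativeLibPaths`, the `/native/...` resource layout, the folder names and the library file name are all made up for the example.

```java
import java.util.Locale;

// Hypothetical helper: maps the running OS to a resource folder holding
// prebuilt native libraries. Folder names and layout are assumptions.
public class NativeLibPaths {

    // Returns the (assumed) resource folder name for a given os.name value.
    static String osFolder(String osName) {
        String os = osName.toLowerCase(Locale.ROOT);
        if (os.contains("mac") || os.contains("darwin")) {
            return "macosx";
        }
        if (os.contains("linux")) {
            return "linux";
        }
        // Windows is not supported yet, so it falls through to here as well.
        throw new UnsupportedOperationException(
                "No prebuilt libraries for: " + osName);
    }

    // Builds the (assumed) classpath resource path for one library file.
    static String resourcePath(String osName, String libName) {
        return "/native/" + osFolder(osName) + "/" + libName;
    }

    public static void main(String[] args) {
        String osName = System.getProperty("os.name");
        try {
            // "libtesseract.a" is a placeholder file name for the example.
            System.out.println(resourcePath(osName, "libtesseract.a"));
        } catch (UnsupportedOperationException e) {
            System.err.println(e.getMessage());
        }
    }
}
```

Failing loudly on an unrecognised OS seems preferable to silently picking a default, since a mismatched native library tends to fail in much less obvious ways later on.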
