Re: OCR for PDFBox : Progress

John Hewson Tue, 10 Jun 2014 18:47:24 -0700

Hi Dimuthu,

I cloned your code and did some experiments with it, and it’s working nicely. I’m
glad that subclassing PDFTextStripper has been a success; it’s a nice, clean
implementation.


> Tesseract API [1]
> 
> 1. Currently all necessary functions were implemented and test cases were 
> written in order to check proper functionality
> 
> 2. Support for Mac and linux operating systems. In future I'll try to add 
> support for Windows also

That’s fine for now.

> 3. All static libs for Tesseract and Leptonica were pre built and added to 
> resources folder. 

Perfect.

> 4. At build phase it dynamically identify correct libs that support to 
> particular Operating system
> 
> 5. If some one needs to build above static libs manually, instructions were 
> given in read me.

> 6. In future, I'll work on adding those static libs creation when project  is 
> built. Currently they must be manually built.

That would be handy.

> OCR plugin [2]
> 
> 1. Almost finished implementing. 
> 
> 2. Working fine with sample PDF files I have given. Is there any set of PDF 
> files that can be used to test accuracy and performance?

Currently, no, but I’ll take a look in my collection of test files…

> In addition to that, there are some code formatting and commenting stuff to 
> be done.

It might be nice to add a command-line utility to your OCR-Plugin. You could
copy ExtractText.java from org.apache.pdfbox.tools, rename it to OCRText, and
have it use your PDFOCRTextStripper class instead of PDFTextStripper. That way
your plugin is immediately usable by end users.

-- John
