Re: OCR for PDFBox : Progress

DImuthu Upeksha Sat, 14 Jun 2014 12:37:19 -0700

Hi John,

Thank you for your valuable feedback.

As you suggested, I copied ExtractText.java and created OCRText.java
with the changes you mentioned.
https://github.com/DImuthuUpe/OCR-Plugin/blob/master/src/main/java/org/apache/pdfbox/tools/OCRText.java

Now it's working properly.

I removed arguments such as -html, which do not make sense for OCR.

A small comment about PDFTextStripper: why is the variable currentPageNo
private? Is there a special reason? In some cases I needed to access
currentPageNo in PDFOCRTextStripper.java. Because it is private, I had to
keep my own local page number variable, which is incremented manually in
the processStream method. I don't think this is good practice, but I had no
other way to do it. Can we make currentPageNo protected, so that it is
accessible to subclasses of PDFTextStripper in future?
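
To make the question concrete, here is a minimal sketch of the two options.
It does not use PDFBox at all: BaseStripper, processPages, processPage and
getCurrentPageNo are hypothetical stand-ins chosen for illustration, not the
actual PDFTextStripper API. The first subclass shows the manual-counter
workaround described above, the second shows what a subclass could do if
the base class exposed its counter to subclasses.

```java
import java.util.ArrayList;
import java.util.List;

// Small demo: both approaches should report the same page numbers.
class PageCounterDemo {
    public static void main(String[] args) {
        ManualCountStripper manual = new ManualCountStripper();
        manual.processPages(3);

        AccessorStripper accessor = new AccessorStripper();
        accessor.processPages(3);

        System.out.println("manual counter:   " + manual.seen);
        System.out.println("base class value: " + accessor.seen);
    }
}

// Hypothetical stand-in for a base class whose page counter is private.
class BaseStripper {
    private int currentPageNo = 0; // not visible to subclasses

    // Walks the pages and calls the subclass hook once per page.
    public final void processPages(int pageCount) {
        for (int i = 0; i < pageCount; i++) {
            currentPageNo++;
            processPage();
        }
    }

    // Hook for subclasses; the base class does nothing here.
    protected void processPage() {
    }

    // The kind of change being asked for: subclass-visible access.
    protected int getCurrentPageNo() {
        return currentPageNo;
    }
}

// Workaround: the subclass keeps its own counter and increments it by hand.
class ManualCountStripper extends BaseStripper {
    private int localPageNo = 0;
    final List<Integer> seen = new ArrayList<>();

    @Override
    protected void processPage() {
        localPageNo++;
        seen.add(localPageNo);
    }
}

// Alternative: the subclass reads the base class counter directly.
class AccessorStripper extends BaseStripper {
    final List<Integer> seen = new ArrayList<>();

    @Override
    protected void processPage() {
        seen.add(getCurrentPageNo());
    }
}
```

The manual counter only stays correct as long as the hook is called exactly
once per page, which is why reading the base class value seems safer.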

Thanks
Dimuthu

On Wed, Jun 11, 2014 at 7:16 AM, John Hewson <[email protected]> wrote:

> Hi Dimuthu,
>
> I cloned your code and did some experiments with it  - it’s working
> nicely. I’m glad that subclassing
> PDFTextStripper has been a success, it’s a nice clean implementation.
>
> *Tesseract API [1]*
>
> 1. Currently all necessary functions were implemented and test cases were
> written in order to check proper functionality
>
> 2. Support for Mac and Linux operating systems. In future I'll try to add
> support for Windows as well.
>
>
> That’s fine for now.
>
> 3. All static libs for Tesseract and Leptonica were pre built and added to
> resources folder.
>
>
> Perfect.
>
> 4. At build time it dynamically identifies the correct libs for the
> particular operating system.
>
> 5. If someone needs to build the above static libs manually, instructions
> are given in the README.
>
>
> 6. In future, I'll work on building those static libs as part of the
> project build. Currently they must be built manually.
>
>
> That would be handy.
>
>
> *OCR plugin [2]*
>
> 1. Almost finished implementing.
>
> 2. Working fine with sample PDF files I have given. Is there any set of
> PDF files that can be used to test accuracy and performance?
>
>
> Currently, no, but I’ll take a look in my collection of test files…
>
> In addition to that, there is some code formatting and commenting still
> to be done.
>
>
> It might be nice to add a command line utility to your OCR-Plugin, you
> could copy ExtractText.java from org.apache.pdfbox.tools and rename it to
> OCRText and have it use your PDFOCRTextStripper class instead of
> PDFTextStripper. That way your plugin is immediately usable by end-users.
>
> -- John
>
>

-- 
Regards

W.Dimuthu Upeksha
Undergraduate
Department of Computer Science And Engineering

University of Moratuwa, Sri Lanka
