Re: Improving OCR plugin for PDFBox

DImuthu Upeksha
Sun, 06 Jul 2014 09:35:37 -0700

Hi John,

I added the Apache header to all Java files and POM files in the Tesseract API
and OCR plugin. The ICLA has two fields, for preferred Apache id and notify
projects. What should I put in those fields?


For the new functionality you suggested, I implemented a command line
tool[1] that writes the OCR'd text to the original PDF as visible text. However,
it currently writes the text at a constant font size (12); the size should be
adjusted dynamically. In addition, I need to know how to make that
text invisible inside the PDF. How can I make it invisible?
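
A rough sketch of one possible direction for both points, not the tool's
current code: the PDF specification defines text rendering mode 3 (set with
the Tr operator) as "neither fill nor stroke", which leaves text invisible but
still selectable and searchable, and the font size could be estimated from the
height of the OCR bounding box. The class and method names below
(InvisibleTextSketch, fontSizeForBox, invisibleTextOps), the font resource
name /F1, and the height-to-size ratio of 1.0 are assumptions made for
illustration; they are not part of PDFBox or the OCR plugin.

```java
import java.util.Locale;

public class InvisibleTextSketch {

    // Hypothetical helper: estimate a font size (in points) from the height of
    // the OCR bounding box, assuming the box is already in PDF user-space units.
    public static double fontSizeForBox(double boxHeight, double heightToSizeRatio) {
        if (boxHeight <= 0 || heightToSizeRatio <= 0) {
            throw new IllegalArgumentException("height and ratio must be positive");
        }
        return boxHeight / heightToSizeRatio;
    }

    // Escape the characters that are special inside a PDF literal string.
    // This sketch only considers plain ASCII text.
    public static String escapePdfString(String text) {
        return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)");
    }

    // Build content-stream operators that place text using rendering mode 3
    // (neither filled nor stroked). /F1 is an assumed font resource name.
    public static String invisibleTextOps(String text, double x, double y, double fontSize) {
        return String.format(Locale.ROOT,
                "BT\n/F1 %.2f Tf\n3 Tr\n%.2f %.2f Td\n(%s) Tj\nET\n",
                fontSize, x, y, escapePdfString(text));
    }

    public static void main(String[] args) {
        // Assumed example values: an 18-unit-high box placed at (72, 700).
        double size = fontSizeForBox(18.0, 1.0);
        System.out.print(invisibleTextOps("Hello (OCR)", 72.0, 700.0, size));
    }
}
```

The sketch only builds the operator string; how those operators would be
appended to a page, and what ratio suits a given font, would still need to be
worked out against the actual tool.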

[1]
https://github.com/DImuthuUpe/OCR-Plugin/blob/master/src/main/java/org/apache/pdfbox/tools/OCRToPDF.java

Thank You
Dimuthu


On Fri, Jun 27, 2014 at 12:28 PM, John Hewson <[email protected]> wrote:

> Hi Dimuthu
>
> That’s great. We should wait until closer to the end of the GSoC period to
> integrate your work with PDFBox, as ideally we only want to have to do it
> once. We’ve not included C++ dependencies before so no, there won’t be a
> standard way, we’ll have to think something up. We’ll either make it an
> optional sub-project, or the Tesseract JNI bindings might be better off
> having their own branch so that they are more like an external dependency -
> I’ll ask the dev mailing list.
>
> To prepare your code for contribution you’ll need to add the Apache header
> to each .java file (see any PDFBox .java file for an example) and submit a
> signed ICLA http://www.apache.org/licenses/icla.pdf to Apache.
>
> Regarding additional functionality, the most useful would be a new
> command line tool which could write the OCR’d text back into the original
> PDF file as “invisible text”, which would allow for copy and paste and text
> search to then work for that PDF file. A starting point for this would be
> to try and write the OCR’d text into the original PDF as “visible” text -
> we can make it invisible later!
>
> -- John
>
> On 19 Jun 2014, at 13:57, DImuthu Upeksha <[email protected]>
> wrote:
>
> Hi John,
> Except for providing compatibility for platforms like Windows, I think most
> of the functionality of the OCR plugin is finished (please correct me if I'm
> wrong). But I would like to contribute to the project further. Do you have
> anything to add as new functionality? And if you plan to add this to the
> PDFBox code, how should I prepare my code? Is there a standard way?
>
> Thanks
> Dimuthu
> --
> Regards
> W.Dimuthu Upeksha
> Undergraduate
> Department of Computer Science And Engineering
> University of Moratuwa, Sri Lanka
>
>
>


-- 
Regards

W.Dimuthu Upeksha
Undergraduate
Department of Computer Science And Engineering

University of Moratuwa, Sri Lanka
