Re: Improving OCR plugin for PDFBox

DImuthu Upeksha Sun, 20 Jul 2014 00:45:08 -0700

Hi John,

I made the font size dynamically adjustable, and the text is now written to
the PDF file as invisible text [1]. You can find the sample PDF file I used
for testing, along with the resulting PDF file after adding the invisible
text, at [2]. I'll be testing more files in the future.
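
For readers following along, here is a minimal sketch of the two ideas above:
scaling the font size so a recognized word fills its OCR bounding box, and
emitting it with text rendering mode 3 so it stays invisible. It is not the
code in OCRToPDF.java. The class and method names are hypothetical, the text
width at font size 1 is assumed to be supplied by the caller (in PDFBox it
could probably be derived from PDFont#getStringWidth, which reports widths
in 1/1000 units), and only the raw PDF content stream operators are produced.

```java
import java.util.Locale;

// Hypothetical helper for illustration only; not taken from OCRToPDF.java.
final class InvisibleTextSketch {

    // Font size at which the text is exactly as wide as the OCR bounding box.
    // widthAtUnitSize is the width of the string at font size 1, in points
    // (assumed to be supplied by the caller).
    static float fitFontSize(float boxWidth, float widthAtUnitSize) {
        if (widthAtUnitSize <= 0f) {
            throw new IllegalArgumentException("widthAtUnitSize must be positive");
        }
        return boxWidth / widthAtUnitSize;
    }

    // Escape the characters that are special inside a PDF literal string.
    // Only plain ASCII text is handled in this sketch.
    static String escape(String text) {
        return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)");
    }

    // Content stream fragment that draws the text invisibly:
    // q/Q save and restore the graphics state, "3 Tr" selects rendering
    // mode 3 (neither fill nor stroke), Tf sets the font and size,
    // Td positions the text and Tj shows it.
    static String invisibleTextOps(String fontName, float fontSize,
                                   float x, float y, String text) {
        return String.format(Locale.ROOT,
                "q\nBT\n/%s %.2f Tf\n3 Tr\n%.2f %.2f Td\n(%s) Tj\nET\nQ\n",
                fontName, fontSize, x, y, escape(text));
    }

    public static void main(String[] args) {
        // A word whose box is 50 pt wide and whose width at size 1 is 5 pt.
        float size = fitFontSize(50f, 5f);
        System.out.print(invisibleTextOps("F1", size, 72f, 700f, "Hello"));
    }
}
```

Wrapping the fragment in q ... Q keeps the rendering mode from leaking into
whatever is drawn afterwards, which matches the save/restore advice quoted
further down in this thread.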


I added a new argument to the tool called 'Separation Mode' (-s). Separation
mode controls whether data is extracted from the PDF file character by
character (mode=0) or word by word (mode=1). When the quality of the images
in the PDF file is low or the text alignment is not perfect, use mode 0.
However, mode 0 takes more time than mode 1 because it processes the data
character by character.
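
As an illustration of how such a flag might be read, here is a minimal
sketch of parsing -s from the command line. It is not the parsing code in
the tool. The class, enum and method names are hypothetical, and defaulting
to word mode when -s is absent is an assumption, since the default is not
stated here.

```java
// Hypothetical sketch for illustration; the real tool may parse -s differently.
final class SeparationModeSketch {

    // 0 = character by character, 1 = word by word, as described above.
    enum Mode { CHARACTER, WORD }

    static Mode parse(String[] args) {
        Mode mode = Mode.WORD; // assumption: default when -s is not given
        for (int i = 0; i < args.length; i++) {
            if ("-s".equals(args[i])) {
                if (i + 1 >= args.length) {
                    throw new IllegalArgumentException("-s requires a value (0 or 1)");
                }
                String value = args[++i];
                switch (value) {
                    case "0":
                        mode = Mode.CHARACTER;
                        break;
                    case "1":
                        mode = Mode.WORD;
                        break;
                    default:
                        throw new IllegalArgumentException(
                                "Unknown separation mode: " + value);
                }
            }
        }
        return mode;
    }

    public static void main(String[] args) {
        System.out.println("Separation mode: " + parse(args));
    }
}
```

Rejecting values other than 0 and 1 up front avoids silently running the
slower character mode because of a typo.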

I made some improvements to Tesseract-API [3] recently. If you are going to
test this code, you may also need to pull and build the latest version of
Tesseract-API.

[1]
https://github.com/DImuthuUpe/OCR-Plugin/blob/master/src/main/java/org/apache/pdfbox/tools/OCRToPDF.java
[2]
https://github.com/DImuthuUpe/PDFBox-OCR-Plugin-Samples/tree/master/OCRToPDF
[3] https://github.com/DImuthuUpe/Tesseract-API

Thank You
Dimuthu



On Wed, Jul 9, 2014 at 7:13 AM, John Hewson <[email protected]> wrote:

> Hi Dimuthu
>
> In the ICLA there are two fields, for preferred Apache id and notify
> projects. What should I put in those fields?
>
>
> You can leave the preferred id blank because you’re not applying to be a
> contributor, just a patch submitter.
> For notify projects put “PDFBox”.
>
> For the new functionality you suggested, I implemented a command line
> tool[1] that writes OCR'd text to the original PDF as visible text. However,
> it currently writes text to the PDF in a constant font size (12). It should
> be dynamically adjusted.
>
>
> Yes, you should be able to set the font size in the graphics state.
>
> In addition to that, I need to know how to make that text invisible
> inside the PDF. How can I make it invisible?
>
>
> This can be done by setting the text rendering mode to 3 (neither fill nor
> stroke) in the text state, you can call:
>
>
> PDGraphicsState#getTextState().setRenderingMode(RENDERING_MODE_NEITHER_FILL_NOR_STROKE_TEXT)
>
> You might need to save/restore the state before/after your text rendering
> too.
>
> -- John
>
> On 6 Jul 2014, at 09:34, DImuthu Upeksha <[email protected]>
> wrote:
>
> Hi John,
>
> I added the Apache header to all Java files and pom files in Tesseract API
> and the OCR plugin. In the ICLA there are two fields, for preferred Apache
> id and notify projects. What should I put in those fields?
>
> For the new functionality you suggested, I implemented a command line
> tool[1] that writes OCR'd text to the original PDF as visible text. However,
> it currently writes text to the PDF in a constant font size (12). It should
> be dynamically adjusted. In addition to that, I need to know how to make
> that text invisible inside the PDF. How can I make it invisible?
>
> [1]
> https://github.com/DImuthuUpe/OCR-Plugin/blob/master/src/main/java/org/apache/pdfbox/tools/OCRToPDF.java
>
> Thank You
> Dimuthu
>
>
> On Fri, Jun 27, 2014 at 12:28 PM, John Hewson <[email protected]> wrote:
>
>> Hi Dimuthu
>>
>> That’s great. We should wait until closer to the end of the GSoC period
>> to integrate your work with PDFBox, as ideally we only want to have to do
>> it once. We’ve not included C++ dependencies before, so no, there won’t be a
>> standard way; we’ll have to think something up. We could make it an
>> optional sub-project, and the Tesseract JNI bindings might be better off
>> having their own branch so that they are more like an external dependency.
>> I’ll ask the dev mailing list.
>>
>> To prepare your code for contribution you’ll need to add the Apache
>> header to each .java file (see any PDFBox .java file for an example) and
>> submit a signed ICLA http://www.apache.org/licenses/icla.pdf to Apache.
>>
>> Regarding additional functionality, the most useful would be for a new
>> command line tool which could write the OCR’d text back into the original
>> PDF file as “invisible text”, which would allow for copy and paste and text
>> search to then work for that PDF file. A starting point for this would be
>> to try and write the OCR’d text into the original PDF as “visible” text -
>> we can make it invisible later!
>>
>> -- John
>>
>> On 19 Jun 2014, at 13:57, DImuthu Upeksha <[email protected]>
>> wrote:
>>
>> Hi John,
>> Apart from providing compatibility for platforms like Windows, I think most
>> of the functionality of the OCR plugin is finished (please correct me if I'm
>> wrong). But I would like to contribute to the project further. Do you have
>> anything to add as new functionality? And if you plan to add this to the
>> PDFBox code, how should I prepare my code? Is there any standard way?
>>
>> Thanks
>> Dimuthu
>> --
>> Regards
>> W.Dimuthu Upeksha
>> Undergraduate
>> Department of Computer Science And Engineering
>>  University of Moratuwa, Sri Lanka
>>
>>
>>
>
>
> --
> Regards
> W.Dimuthu Upeksha
> Undergraduate
> Department of Computer Science And Engineering
> University of Moratuwa, Sri Lanka
>
>
>


-- 
Regards

W.Dimuthu Upeksha
Undergraduate
Department of Computer Science And Engineering

University of Moratuwa, Sri Lanka
