Hi John,

I made the font size dynamically adjustable, and the text is now written to the PDF file as invisible text [1]. The sample PDF file I used for testing, together with the resulting PDF file after adding the invisible text, is available at [2]. I'll be testing more files in the future.

I added a new argument to the tool called 'Separation Mode' (-s). Separation mode controls whether data is extracted from the PDF file character by character (mode = 0) or word by word (mode = 1). When the quality of the images in the PDF file is low or the text alignment is not perfect, use mode 0. It takes more time than mode 1, however, because it processes the data character by character.

I recently made some improvements to Tesseract-API [3]. If you are going to test this code, you may need to pull and build the latest version of Tesseract-API as well.

[1] https://github.com/DImuthuUpe/OCR-Plugin/blob/master/src/main/java/org/apache/pdfbox/tools/OCRToPDF.java
[2] https://github.com/DImuthuUpe/PDFBox-OCR-Plugin-Samples/tree/master/OCRToPDF
[3] https://github.com/DImuthuUpe/Tesseract-API

Thank You
Dimuthu

On Wed, Jul 9, 2014 at 7:13 AM, John Hewson <[email protected]> wrote:

> Hi Dimuthu
>
> In ICLA there are two fields for preferred Apache id and notify projects.
> What should I put in those fields?
>
> You can leave the preferred id blank because you’re not applying to be a
> contributor, just a patch submitter.
> For notify projects put “PDFBox”.
>
> For new functionality you have suggested, I implemented a command line
> tool[1] that writes OCR'd text to original pdf as visible text. However it
> currently writes text to the PDF in constant font size (12). It should be
> dynamically adjusted.
>
> Yes, you should be able to set the font size in the graphics state.
>
> In addition to that, I need to know how to make those text invisible
> inside the PDF. How can I make them invisible?
>
> This can be done by setting the text rendering mode to 3 (neither fill nor
> stroke) in the text state, you can call:
>
> PDGraphicsState#getTextState().setRenderingMode(RENDERING_MODE_NEITHER_FILL_NOR_STROKE_TEXT)
>
> You might need to save/restore the state before/after your text rendering
> too.
>
> -- John
>
> On 6 Jul 2014, at 09:34, DImuthu Upeksha <[email protected]>
> wrote:
>
> Hi John,
>
> I added Apache header to all java files and pom files in Tesseract API and
> OCR plugin. In ICLA there are two fields for preferred Apache id and notify
> projects. What should I put in those fields?
>
> For new functionality you have suggested, I implemented a command line
> tool[1] that writes OCR'd text to original pdf as visible text. However it
> currently writes text to the PDF in constant font size (12). It should be
> dynamically adjusted. In addition to that, I need to know how to make those
> text invisible inside the PDF. How can I make them invisible?
>
> [1]
> https://github.com/DImuthuUpe/OCR-Plugin/blob/master/src/main/java/org/apache/pdfbox/tools/OCRToPDF.java
>
> Thank You
> Dimuthu
>
>
> On Fri, Jun 27, 2014 at 12:28 PM, John Hewson <[email protected]> wrote:
>
>> Hi Dimuthu
>>
>> That’s great. We should wait until closer to the end of the GSoC period
>> to integrate your work with PDFBox, as ideally we only want to have to do
>> it once. We’ve not included C++ dependencies before so no, there won’t be a
>> standard way, we’ll have to think something up. We’ll either make it an
>> optional sub-project and the Tesseract JNI bindings might be better off
>> having their own branch so that they are more like an external dependency -
>> I’ll ask the dev mailing list.
>>
>> To prepare your code for contribution you’ll need to add the Apache
>> header to each .java file (see any PDFBox .java file for an example) and
>> submit a signed ICLA http://www.apache.org/licenses/icla.pdf to Apache.
>>
>> Regarding additional functionality, the most useful would be for a new
>> command line tool which could write the OCR’d text back into the original
>> PDF file as “invisible text”, which would allow for copy and paste and text
>> search to then work for that PDF file.
>> A starting point for this would be
>> to try and write the OCR’d text into the original PDF as “visible” text -
>> we can make it invisible later!
>>
>> -- John
>>
>> On 19 Jun 2014, at 13:57, DImuthu Upeksha <[email protected]>
>> wrote:
>>
>> Hi John,
>> Except providing compatibility for platforms like windows, I think most
>> of the functionalities of OCR plugin are finished (Please correct me if I'm
>> wrong). But I would like to contribute to project further. Do you have
>> anything to add as a new functionality? And If you plan to add this to
>> PDFBox code, how should prepare my code? Is there any standard way?
>>
>> Thanks
>> Dimuthu
>> --
>> Regards
>> W.Dimuthu Upeksha
>> Undergraduate
>> Department of Computer Science And Engineering
>> University of Moratuwa, Sri Lanka
>>
>
> --
> Regards
> W.Dimuthu Upeksha
> Undergraduate
> Department of Computer Science And Engineering
> University of Moratuwa, Sri Lanka
>

--
Regards
W.Dimuthu Upeksha
Undergraduate
Department of Computer Science And Engineering
University of Moratuwa, Sri Lanka
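
For readers following the thread, here is a minimal sketch of the invisible-text idea John describes in the quoted reply (rendering mode 3, with the state saved and restored around the text). It is plain Java without PDFBox so that it runs standalone, and it only builds the raw PDF content-stream operators for one word: q/Q save and restore the graphics state, 3 Tr selects text rendering mode 3 (neither fill nor stroke), Tf sets the font and size, Td positions the text, and Tj shows it. The class name, the helper names, the font resource name F1, and the rule of taking the font size from the OCR bounding-box height are all assumptions for illustration, not the actual OCRToPDF implementation.

```java
import java.util.Locale;

/** Sketch only: builds content-stream operators for one invisible OCR word. */
final class InvisibleTextSketch {

    // Hypothetical rule: use the OCR bounding-box height (assumed to be in
    // PDF user-space units) as the font size, with a floor of 1.
    static float fontSizeForBox(float boxHeight) {
        return Math.max(1f, boxHeight);
    }

    // Escape the characters that are special inside a PDF literal string.
    // The backslash must be handled first so later escapes are not doubled.
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)");
    }

    // Emit: save state, begin text, rendering mode 3, font and size,
    // position, show text, end text, restore state.
    static String invisibleWordOps(String fontName, String word,
                                   float x, float y, float boxHeight) {
        return String.format(Locale.ROOT,
                "q\nBT\n3 Tr\n/%s %.2f Tf\n%.2f %.2f Td\n(%s) Tj\nET\nQ\n",
                fontName, fontSizeForBox(boxHeight), x, y, escape(word));
    }

    public static void main(String[] args) {
        // Example values are made up; a real tool would take them from the OCR result.
        System.out.print(invisibleWordOps("F1", "Hello (OCR)", 72f, 700f, 14.5f));
    }
}
```

Because the rendering mode is part of the graphics state, wrapping the text object in q/Q keeps mode 3 from leaking into anything drawn afterwards, which is the reason for the save/restore John mentions.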
