Hi Alin,

Thank you. It helped me a lot. I'll look into that further.

About OCR: I use the Tesseract C library to do the OCR, and I have written some native calls to communicate with the Tesseract API. [2]

[2] https://github.com/DImuthuUpe/Tesseract-API

On Sat, May 17, 2014 at 10:43 PM, Alin Mazilu <impet...@gmail.com> wrote:

> Hello,
>
> I commented on the gist. You have to use setSortByPosition(true) in the
> constructor right after super(). Be careful with your coordinate system.
> When you do textPosition1.getY() you get 792, not 0. I don't remember
> exactly where, but there is a class that uses the lower left corner of the
> page as the origin (0,0), not the upper left corner as would be natural.
>
> I hope that helps.
>
> Alin
>
> PS Is the OCR going to be pure Java, or will you be writing it in another
> language and using native calls?
>
>
> On Sat, May 17, 2014 at 8:13 AM, DImuthu Upeksha <dimuthu.upeks...@gmail.com>
> wrote:
>
>> Hi Alin,
>>
>> You can find my source code here:
>> https://gist.github.com/DImuthuUpe/5dcfa9758f017794c649
>> As you can see, I set
>> X-offset : 0 and Y-offset : 0 for "H"
>> X-offset : 32 and Y-offset : 0 for "W"
>> in the text matrices. Is that enough? Is there another way to set the X,Y
>> co-ordinates?
>>
>>
>> On Sat, May 17, 2014 at 12:18 PM, Alin Mazilu <impet...@gmail.com> wrote:
>>
>> > What are the x and y coordinates of H and W?
>> >
>> > Alin Mazilu
>> > SKE GlobalTech, LLC
>> > 3250 West Market St. Suite 307D
>> > Fairlawn, OH 44333
>> >
>> > Sent from my Galaxy S3
>> > On May 17, 2014 2:42 AM, "DImuthu Upeksha" <dimuthu.upeks...@gmail.com>
>> > wrote:
>> >
>> >> Hi all,
>> >>
>> >> I was trying to manually feed text position objects to the
>> >> processTextPosition method in the PDFTextStripper class. I created a
>> >> subclass of PDFTextStripper and overrode the processStream method. In
>> >> processStream I manually created two text position objects for the
>> >> words "W" and "H". At the end I passed them to processTextPosition:
>> >>
>> >> processTextPosition(textPosition1);
>> >> processTextPosition(textPosition2);
>> >>
>> >> Then I tested it using
>> >>
>> >> PDFTextStripper ocrStripper = new PDFOCRTextStripper();
>> >> PDDocument document = PDDocument.load("some pdf file");
>> >> String data = ocrStripper.getText(document);
>> >> System.out.println(data);
>> >>
>> >> Output was : H W
>> >>
>> >> Then I changed the sequence of passing TextPosition objects in [1]
>> >>
>> >> processTextPosition(textPosition2);
>> >> processTextPosition(textPosition1);
>> >>
>> >> Output was : WH
>> >>
>> >> ------------------------------
>> >>
>> >> As far as I understood, processTextPosition works with the text
>> >> position metadata, like the x and y co-ordinates of the input text. It
>> >> should not depend on the order of the input sequence. But in this case
>> >> it seems like the processTextPosition method works according to the
>> >> order of input.
>> >> Ex. If I input W first, it prints W first without considering its
>> >> actual position.
>> >>
>> >> Is this the normal behaviour? Or am I missing something here?
>> >>
>> >> [1] https://gist.github.com/DImuthuUpe/5dcfa9758f017794c649
>> >> --
>> >> Regards
>> >>
>> >> W.Dimuthu Upeksha
>> >> Undergraduate
>> >>
>> >> Department of Computer Science And Engineering
>> >>
>> >> University of Moratuwa, Sri Lanka
>> >>
>>
>> --
>> Regards
>>
>> W.Dimuthu Upeksha
>> Undergraduate
>>
>> Department of Computer Science And Engineering
>>
>> University of Moratuwa, Sri Lanka
>

--
Regards

W.Dimuthu Upeksha
Undergraduate

Department of Computer Science And Engineering

University of Moratuwa, Sri Lanka
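
Note on the coordinate remark quoted above: the two points in Alin's reply (a lower-left page origin, and ordering by position rather than by input order) can be illustrated with a small, self-contained sketch. This is not PDFBox code and does not reproduce how PDFTextStripper sorts internally. The class PositionSortSketch, the Glyph type, and the method names are hypothetical, and the page height of 792 points is an assumption taken from the value mentioned in the thread.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; not part of PDFBox.
public class PositionSortSketch {

    /** Hypothetical stand-in for a text position: a string plus page coordinates. */
    static final class Glyph {
        final String text;
        final float x;
        final float yFromBottom; // measured from the lower-left corner of the page

        Glyph(String text, float x, float yFromBottom) {
            this.text = text;
            this.x = x;
            this.yFromBottom = yFromBottom;
        }
    }

    /** Converts a y measured from the bottom of the page into a y measured from the top. */
    static float toTopOrigin(float yFromBottom, float pageHeight) {
        return pageHeight - yFromBottom;
    }

    /** Joins the glyph texts in reading order: top to bottom, then left to right. */
    static String inReadingOrder(List<Glyph> glyphs, float pageHeight) {
        List<Glyph> sorted = new ArrayList<>(glyphs); // leave the caller's list untouched
        sorted.sort(Comparator
                .comparingDouble((Glyph g) -> toTopOrigin(g.yFromBottom, pageHeight))
                .thenComparingDouble(g -> g.x));
        StringBuilder sb = new StringBuilder();
        for (Glyph g : sorted) {
            sb.append(g.text);
        }
        return sb.toString();
    }
}
```

With "W" at (32, 0) and "H" at (0, 0) passed in that order, inReadingOrder returns the letters sorted by x, independent of the input order, and toTopOrigin(0, 792) gives 792, which matches the getY() value mentioned in the reply.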