That’s great!

-- John

On 22 May 2014, at 10:12, DImuthu Upeksha <dimuthu.upeks...@gmail.com> wrote:

> Yes I double checked it by debugging processTextPosition method in
> normal operation. Thanks for the information. Now text position
> details from OCR plugin are successfully fed into processTextPosition.
> Output text also pretty good for first sample PDFs.
>
> On Thu, May 22, 2014 at 10:31 PM, John Hewson <j...@jahewson.com> wrote:
>> Yes, as Alin says, the y-axis in PDF uses y=0 as the bottom of the page,
>> instead of the top as is usually the case in Java. PDFBox uses both
>> styles of coordinates internally at various points.
>>
>> -- John
>>
>> On 17 May 2014, at 11:45, DImuthu Upeksha <dimuthu.upeks...@gmail.com> wrote:
>>
>>> Hi Alin,
>>> Thank you. It helped me a lot. I'll look into that further.
>>>
>>> About OCR.
>>> I use Tesseract C library to do OCR and I have written some native
>>> calls to communicate with Tesseract API. [2]
>>>
>>> [2] https://github.com/DImuthuUpe/Tesseract-API
>>>
>>> On Sat, May 17, 2014 at 10:43 PM, Alin Mazilu <impet...@gmail.com> wrote:
>>>> Hello,
>>>>
>>>> I commented on the gist. You have to use setSortByPosition(true) in the
>>>> constructor right after super(). Be careful with your coordinate system.
>>>> When you do textPosition1.getY() you get 792 not 0. I don't remember
>>>> exactly where, but there is a class that uses the lower left corner of the
>>>> page as the origin (0,0), not the upper left corner as it is natural.
>>>>
>>>> I hope that helps.
>>>>
>>>> Alin
>>>>
>>>> PS Is the OCR going to be pure Java or will you be writing it in other
>>>> language and use native calls?
>>>>
>>>> On Sat, May 17, 2014 at 8:13 AM, DImuthu Upeksha <dimuthu.upeks...@gmail.com> wrote:
>>>>
>>>>> Hi Alin,
>>>>>
>>>>> You can find my source code from here
>>>>> https://gist.github.com/DImuthuUpe/5dcfa9758f017794c649
>>>>> As you can see I set
>>>>> X-offset : 0 and Y-offset : 0 for "H"
>>>>> X-offset : 32 and Y-offset : 0 for "W"
>>>>> in Text Matrices. Is that enough? Is there other way to set X,Y
>>>>> co-ordinates?
>>>>>
>>>>> On Sat, May 17, 2014 at 12:18 PM, Alin Mazilu <impet...@gmail.com> wrote:
>>>>>> What are the x and y coordinates of H and W?
>>>>>>
>>>>>> Alin Mazilu
>>>>>> SKE GlobalTech, LLC
>>>>>> 3250 West Market St. Suite 307D
>>>>>> Fairlawn, OH 44333
>>>>>>
>>>>>> Sent from my Galaxy S3
>>>>>> On May 17, 2014 2:42 AM, "DImuthu Upeksha" <dimuthu.upeks...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I was trying to manually feed text position objects to
>>>>>>> processTextPosition method in PDFTextStripper class. I created a sub
>>>>>>> class of PDFTextStripper and overrode processStream method. In
>>>>>>> processStream method I manually created two text position objects for
>>>>>>> words "W" and "H".
>>>>>>> At the end I passed them to processTextPosition
>>>>>>>
>>>>>>> processTextPosition(textPosition1);
>>>>>>> processTextPosition(textPosition2);
>>>>>>>
>>>>>>> Then I tested it using
>>>>>>>
>>>>>>> PDFTextStripper ocrStripper = new PDFOCRTextStripper();
>>>>>>> PDDocument document = PDDocument.load("some pdf file");
>>>>>>> String data = ocrStripper.getText(document);
>>>>>>> System.out.println(data);
>>>>>>>
>>>>>>> Output was : H W
>>>>>>>
>>>>>>> Then I changed the sequence of passing TextPosition objects in [1]
>>>>>>>
>>>>>>> processTextPosition(textPosition2);
>>>>>>> processTextPosition(textPosition1);
>>>>>>>
>>>>>>> Output was : WH
>>>>>>>
>>>>>>> ------------------------------
>>>>>>>
>>>>>>> As far as I understood processTextPosition works with the text
>>>>>>> position metadata like x and y co-ordinates of the input text. It
>>>>>>> should not depend on the order of the input sequence. But in this
>>>>>>> case it seems that processTextPosition method works according to
>>>>>>> the order of input.
>>>>>>> Ex. If I input W first, it prints W first without considering its
>>>>>>> actual position.
>>>>>>>
>>>>>>> Is this the normal behaviour? Or am I missing something here?
>>>>>>>
>>>>>>> [1] https://gist.github.com/DImuthuUpe/5dcfa9758f017794c649
>>>>>>> --
>>>>>>> Regards
>>>>>>>
>>>>>>> W.Dimuthu Upeksha
>>>>>>> Undergraduate
>>>>>>>
>>>>>>> Department of Computer Science And Engineering
>>>>>>>
>>>>>>> University of Moratuwa, Sri Lanka
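
The coordinate point raised in this thread can be illustrated with a small standalone sketch. It does not use PDFBox and is not how PDFTextStripper is implemented; the class PositionSortSketch, its Glyph type, and the methods toTopOrigin and readingOrder are hypothetical names invented for illustration, and the page height of 792 points is an assumption taken from the getY() value mentioned above. The sketch converts a bottom-origin y value to a top-origin one and orders glyphs by position, so the result no longer depends on the order in which they were fed in:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; none of these names come from PDFBox.
class PositionSortSketch {

    /** A glyph positioned in PDF user space, where y = 0 is the bottom of the page. */
    static final class Glyph {
        final String text;
        final float x;
        final float yFromBottom;

        Glyph(String text, float x, float yFromBottom) {
            this.text = text;
            this.x = x;
            this.yFromBottom = yFromBottom;
        }
    }

    /** Converts a bottom-origin y into a top-origin y for a page of the given height. */
    static float toTopOrigin(float yFromBottom, float pageHeight) {
        return pageHeight - yFromBottom;
    }

    /** Concatenates glyph texts in reading order: top to bottom, then left to right. */
    static String readingOrder(List<Glyph> glyphs, float pageHeight) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator
                .comparingDouble((Glyph g) -> toTopOrigin(g.yFromBottom, pageHeight))
                .thenComparingDouble(g -> g.x));
        StringBuilder out = new StringBuilder();
        for (Glyph g : sorted) {
            out.append(g.text);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        float pageHeight = 792f; // assumed page height in points

        // Offsets from the thread: "H" at (0, 0) and "W" at (32, 0).
        Glyph h = new Glyph("H", 0f, 0f);
        Glyph w = new Glyph("W", 32f, 0f);

        // A bottom-origin y of 0 becomes 792 when measured from the top.
        System.out.println(toTopOrigin(h.yFromBottom, pageHeight)); // prints 792.0

        // Input order does not matter once the glyphs are sorted by position.
        System.out.println(readingOrder(List.of(w, h), pageHeight)); // prints HW
        System.out.println(readingOrder(List.of(h, w), pageHeight)); // prints HW
    }
}
```

With both input orders the sketch prints HW, which is the order-independent result the original question expected. In PDFBox itself, the thread's suggestion is to call setSortByPosition(true) rather than to sort by hand.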