Yes, I double-checked it by debugging the processTextPosition method during normal operation. Thanks for the information. The text position details from the OCR plugin are now fed into processTextPosition successfully, and the output text looks pretty good for the first sample PDFs.
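For anyone following the thread, here is a minimal sketch of the y-axis flip discussed in the quoted replies. It assumes the OCR boxes are measured from the top-left corner of the page and uses a page height of 792 points (the value Alin mentions, which matches US Letter). The class and method names are hypothetical; they are not part of PDFBox or Tesseract, and any pixel-to-point scaling is left out.

```java
// Hypothetical helper, not a PDFBox or Tesseract class.
final class OcrToPdfCoordinates {
    private OcrToPdfCoordinates() {}

    // Converts a y value measured down from the top edge of the page
    // into a y value measured up from the bottom edge (PDF style).
    static float flipY(float yFromTop, float pageHeight) {
        return pageHeight - yFromTop;
    }

    // Returns the PDF-style y of the bottom edge of a box whose top edge
    // and height are given in top-left-origin coordinates.
    static float boxBottomFromTopLeft(float boxTop, float boxHeight, float pageHeight) {
        return pageHeight - (boxTop + boxHeight);
    }

    public static void main(String[] args) {
        float pageHeight = 792f; // assumption: US Letter page, in points

        // A point on the top edge ends up at y = 792 in bottom-left coordinates.
        System.out.println(flipY(0f, pageHeight)); // prints 792.0

        // A 20-point-tall box starting 100 points below the top edge.
        System.out.println(boxBottomFromTopLeft(100f, 20f, pageHeight)); // prints 672.0
    }
}
```

Which of the two conventions a given PDFBox getter returns should still be checked in the debugger, since PDFBox uses both internally.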

On Thu, May 22, 2014 at 10:31 PM, John Hewson <j...@jahewson.com> wrote:

> Yes, as Alin says, the y-axis in PDF uses y=0 as the bottom of the page,
> instead of the top as is usually the case in Java. PDFBox uses both styles
> of coordinates internally at various points.
>
> -- John
>
> On 17 May 2014, at 11:45, DImuthu Upeksha <dimuthu.upeks...@gmail.com> wrote:
>
>> Hi Alin,
>> Thank you. It helped me a lot. I'll look into that further.
>>
>> About OCR.
>> I use the Tesseract C library to do OCR and I have written some native
>> calls to communicate with the Tesseract API. [2]
>>
>> [2] https://github.com/DImuthuUpe/Tesseract-API
>>
>> On Sat, May 17, 2014 at 10:43 PM, Alin Mazilu <impet...@gmail.com> wrote:
>>> Hello,
>>>
>>> I commented on the gist. You have to use setSortByPosition(true) in the
>>> constructor right after super(). Be careful with your coordinate system.
>>> When you do textPosition1.getY() you get 792 not 0. I don't remember
>>> exactly where, but there is a class that uses the lower left corner of the
>>> page as the origin (0,0), not the upper left corner as it is natural.
>>>
>>> I hope that helps.
>>>
>>> Alin
>>>
>>> PS Is the OCR going to be pure Java or will you be writing it in other
>>> language and use native calls?
>>>
>>>
>>> On Sat, May 17, 2014 at 8:13 AM, DImuthu Upeksha <dimuthu.upeks...@gmail.com> wrote:
>>>
>>>> Hi Alin,
>>>>
>>>> You can find my source code here:
>>>> https://gist.github.com/DImuthuUpe/5dcfa9758f017794c649
>>>> As you can see I set
>>>> X-offset : 0 and Y-offset : 0 for "H"
>>>> X-offset : 32 and Y-offset : 0 for "W"
>>>> in Text Matrices. Is that enough? Is there another way to set X,Y
>>>> co-ordinates?
>>>>
>>>>
>>>> On Sat, May 17, 2014 at 12:18 PM, Alin Mazilu <impet...@gmail.com> wrote:
>>>>> What are the x and y coordinates of H and W?
>>>>>
>>>>> Alin Mazilu
>>>>> SKE GlobalTech, LLC
>>>>> 3250 West Market St.
>>>>> Suite 307D
>>>>> Fairlawn, OH 44333
>>>>>
>>>>> Sent from my Galaxy S3
>>>>> On May 17, 2014 2:42 AM, "DImuthu Upeksha" <dimuthu.upeks...@gmail.com> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I was trying to manually feed text position objects to the
>>>>>> processTextPosition method in the PDFTextStripper class. I created a
>>>>>> subclass of PDFTextStripper and overrode the processStream method. In
>>>>>> processStream I manually created two text position objects for the
>>>>>> words "W" and "H". At the end I passed them to processTextPosition:
>>>>>>
>>>>>> processTextPosition(textPosition1);
>>>>>> processTextPosition(textPosition2);
>>>>>>
>>>>>> Then I tested it using
>>>>>>
>>>>>> PDFTextStripper ocrStripper = new PDFOCRTextStripper();
>>>>>> PDDocument document = PDDocument.load("some pdf file");
>>>>>> String data = ocrStripper.getText(document);
>>>>>> System.out.println(data);
>>>>>>
>>>>>> Output was : H W
>>>>>>
>>>>>> Then I changed the sequence of passing TextPosition objects in [1]
>>>>>>
>>>>>> processTextPosition(textPosition2);
>>>>>> processTextPosition(textPosition1);
>>>>>>
>>>>>> Output was : WH
>>>>>>
>>>>>> ------------------------------
>>>>>>
>>>>>> As far as I understood, processTextPosition works with the text
>>>>>> position metadata, like the x and y co-ordinates of the input text. It
>>>>>> should not depend on the order of the input sequence. But in this case
>>>>>> it seems like the processTextPosition method works according to the
>>>>>> order of input.
>>>>>> Ex. If I input W first, it prints W first without considering its
>>>>>> actual position.
>>>>>>
>>>>>> Is this the normal behaviour? Or am I missing something here?
>>>>>>
>>>>>> [1] https://gist.github.com/DImuthuUpe/5dcfa9758f017794c649
>>>>>> --
>>>>>> Regards
>>>>>>
>>>>>> W.Dimuthu Upeksha
>>>>>> Undergraduate
>>>>>>
>>>>>> Department of Computer Science And Engineering
>>>>>>
>>>>>> University of Moratuwa, Sri Lanka

--
Regards

W.Dimuthu Upeksha
Undergraduate

Department of Computer Science And Engineering

University of Moratuwa, Sri Lanka