That’s great!

-- John

On 22 May 2014, at 10:12, DImuthu Upeksha <dimuthu.upeks...@gmail.com> wrote:

> Yes, I double-checked it by debugging the processTextPosition method in
> normal operation. Thanks for the information. Text position details
> from the OCR plugin are now successfully fed into processTextPosition.
> The output text is also pretty good for the first sample PDFs.
> 
> On Thu, May 22, 2014 at 10:31 PM, John Hewson <j...@jahewson.com> wrote:
>> Yes, as Alin says, the y-axis in PDF uses y=0 as the bottom of the page,
>> instead of the top as is usually the case in Java. PDFBox uses both styles
>> of coordinates internally at various points.
>> 
>> -- John
>> 
>> On 17 May 2014, at 11:45, DImuthu Upeksha <dimuthu.upeks...@gmail.com> wrote:
>> 
>>> Hi Alin,
>>> Thank you. It helped me a lot. I'll look into that further.
>>> 
>>> About OCR.
>>> I use the Tesseract C library to do OCR, and I have written some native
>>> calls to communicate with the Tesseract API. [2]
>>> 
>>> [2] https://github.com/DImuthuUpe/Tesseract-API
>>> 
>>> On Sat, May 17, 2014 at 10:43 PM, Alin Mazilu <impet...@gmail.com> wrote:
>>>> Hello,
>>>> 
>>>> I commented on the gist. You have to use setSortByPosition(true) in the
>>>> constructor right after super(). Be careful with your coordinate system.
>>>> When you call textPosition1.getY() you get 792, not 0. I don't remember
>>>> exactly where, but there is a class that uses the lower left corner of the
>>>> page as the origin (0,0), not the upper left corner as would be natural.
>>>> 
>>>> I hope that helps.
>>>> 
>>>> Alin
>>>> 
>>>> PS Is the OCR going to be pure Java, or will you be writing it in another
>>>> language and using native calls?
>>>> 
>>>> 
>>>> On Sat, May 17, 2014 at 8:13 AM, DImuthu Upeksha <dimuthu.upeks...@gmail.com> wrote:
>>>> 
>>>>> Hi Alin,
>>>>> 
>>>>> You can find my source code here:
>>>>> https://gist.github.com/DImuthuUpe/5dcfa9758f017794c649
>>>>> As you can see I set
>>>>> X-offset : 0 and Y-offset : 0 for "H"
>>>>> X-offset : 32 and Y-offset : 0 for "W"
>>>>> in the text matrices. Is that enough? Is there another way to set the
>>>>> X,Y co-ordinates?
>>>>> 
>>>>> 
>>>>> On Sat, May 17, 2014 at 12:18 PM, Alin Mazilu <impet...@gmail.com> wrote:
>>>>>> What are the x and y coordinates of H and W?
>>>>>> 
>>>>>> Alin Mazilu
>>>>>> SKE GlobalTech, LLC
>>>>>> 3250 West Market St. Suite 307D
>>>>>> Fairlawn, OH 44333
>>>>>> 
>>>>>> Sent from my Galaxy S3
>>>>>> On May 17, 2014 2:42 AM, "DImuthu Upeksha" <dimuthu.upeks...@gmail.com>
>>>>>> wrote:
>>>>>> 
>>>>>>> Hi all,
>>>>>>> 
>>>>>>> I was trying to manually feed text position objects to the
>>>>>>> processTextPosition method in the PDFTextStripper class. I created a
>>>>>>> subclass of PDFTextStripper and overrode the processStream method. In
>>>>>>> processStream I manually created two text position objects for the
>>>>>>> letters "W" and "H". At the end I passed them to processTextPosition:
>>>>>>> 
>>>>>>> processTextPosition(textPosition1);
>>>>>>> processTextPosition(textPosition2);
>>>>>>> 
>>>>>>> Then I tested it using
>>>>>>> 
>>>>>>> PDFTextStripper ocrStripper = new PDFOCRTextStripper();
>>>>>>> PDDocument document = PDDocument.load("some pdf file");
>>>>>>> String data = ocrStripper.getText(document);
>>>>>>> System.out.println(data);
>>>>>>> 
>>>>>>> Output was : H W
>>>>>>> 
>>>>>>> Then I changed the sequence of passing TextPosition objects in [1]
>>>>>>> 
>>>>>>> processTextPosition(textPosition2);
>>>>>>> processTextPosition(textPosition1);
>>>>>>> 
>>>>>>> Output was : WH
>>>>>>> 
>>>>>>> ------------------------------
>>>>>>> 
>>>>>>> As far as I understand, processTextPosition works with the text
>>>>>>> position metadata, such as the x and y co-ordinates of the input text.
>>>>>>> It should not depend on the order of the input sequence. But in this
>>>>>>> case it seems that processTextPosition works according to the order of
>>>>>>> the input.
>>>>>>> E.g. if I input W first, it prints W first without considering its
>>>>>>> actual position.
>>>>>>> 
>>>>>>> Is this the normal behaviour? Or am I missing something here?
>>>>>>> 
>>>>>>> [1] https://gist.github.com/DImuthuUpe/5dcfa9758f017794c649
>>>>>>> --
>>>>>>> Regards
>>>>>>> 
>>>>>>> W.Dimuthu Upeksha
>>>>>>> Undergraduate
>>>>>>> 
>>>>>>> Department of Computer Science And Engineering
>>>>>>> 
>>>>>>> University of Moratuwa, Sri Lanka
>>>>>>> 
>>>>> 
>>> 
>> 
> 
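A minimal sketch of the two points raised in this thread, written as plain Java with no PDFBox dependency. It is an illustration under stated assumptions, not PDFBox's implementation: the nested Glyph class is a hypothetical stand-in for TextPosition, flipY and sortByPosition are made-up helper names, and the page height of 792 points is assumed from the value Alin mentions. In PDFBox itself, the advice given above is simply to call setSortByPosition(true) in the subclass constructor.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class PositionSortSketch {
    // Assumed page height in points; real code should read it from the page.
    static final float PAGE_HEIGHT = 792f;

    // Hypothetical stand-in for a text position: a string plus its x/y
    // co-ordinates, with y measured from the top of the page downwards.
    static final class Glyph {
        final String text;
        final float x;
        final float y;

        Glyph(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Converts a y value between the bottom-left-origin convention (PDF)
    // and the top-left-origin convention. The formula is its own inverse.
    static float flipY(float y, float pageHeight) {
        return pageHeight - y;
    }

    // Concatenates glyph texts in reading order (top to bottom, then left
    // to right), so the result does not depend on the input order.
    static String sortByPosition(List<Glyph> glyphs) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.<Glyph>comparingDouble(g -> g.y)
                .thenComparingDouble(g -> g.x));
        StringBuilder out = new StringBuilder();
        for (Glyph g : sorted) {
            out.append(g.text);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "H" at x=0 and "W" at x=32 on the same line, fed in reverse order.
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph("W", 32f, 0f));
        glyphs.add(new Glyph("H", 0f, 0f));
        System.out.println(sortByPosition(glyphs));
        System.out.println(flipY(0f, PAGE_HEIGHT));
    }
}
```

Without the sort step, the glyphs come out in the order they were fed in, which matches the "WH" output reported in the original question. With it, the same input yields "HW".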
