Re: Problem with processTextPosition

DImuthu Upeksha Sat, 17 May 2014 12:25:28 -0700

Hi Alin,
Thank you. It helped me a lot. I'll look into that further.
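
For anyone following the thread, here is a minimal sketch of the two points
above as I understand them: sorting by position makes the output independent
of the order in which the positions are fed in, and a y value measured from
the bottom of the page has to be flipped before it can be compared in a
top-left coordinate system. This sketch does not use PDFBox. Glyph, flipY,
render and BY_POSITION are hypothetical names made up for illustration, and
the page height of 792 points is an assumption taken from the value Alin
mentioned.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class PositionSortSketch {

    // Hypothetical stand-in for a text position; this is NOT PDFBox's TextPosition.
    static final class Glyph {
        final String text;
        final float x;
        final float y; // distance from the top edge of the page, growing downward

        Glyph(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Converts a y value measured up from the bottom edge into one measured
    // down from the top edge of a page with the given height.
    static float flipY(float yFromBottom, float pageHeight) {
        return pageHeight - yFromBottom;
    }

    // Orders glyphs top to bottom, then left to right.
    static final Comparator<Glyph> BY_POSITION =
            Comparator.<Glyph>comparingDouble(g -> g.y).thenComparingDouble(g -> g.x);

    // Joins the glyph text after sorting a copy, so the caller's order does not matter.
    static String render(List<Glyph> glyphs) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(BY_POSITION);
        StringBuilder out = new StringBuilder();
        for (Glyph g : sorted) {
            out.append(g.text);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        float pageHeight = 792f; // assumed page height in points

        // Offsets (0, 0) and (32, 0) as in the text matrices, with y flipped
        // to a top-left origin: a bottom-origin y of 0 becomes 792.
        Glyph h = new Glyph("H", 0f, flipY(0f, pageHeight));
        Glyph w = new Glyph("W", 32f, flipY(0f, pageHeight));

        System.out.println(render(Arrays.asList(h, w))); // HW
        System.out.println(render(Arrays.asList(w, h))); // HW
    }
}
```

The sketch only joins the letters, so it does not reproduce the space in
"H W"; it just shows that both input orders give the same result once the
positions are sorted.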

About OCR:
I use the Tesseract C library for OCR, and I have written some native
calls to communicate with the Tesseract API. [2]


[2] https://github.com/DImuthuUpe/Tesseract-API

On Sat, May 17, 2014 at 10:43 PM, Alin Mazilu <[email protected]> wrote:
> Hello,
>
> I commented on the gist. You have to call setSortByPosition(true) in the
> constructor right after super(). Be careful with your coordinate system.
> When you call textPosition1.getY() you get 792, not 0. I don't remember
> exactly where, but there is a class that uses the lower left corner of the
> page as the origin (0,0), not the upper left corner as would be natural.
>
> I hope that helps.
>
> Alin
>
> PS: Is the OCR going to be pure Java, or will you write it in another
> language and use native calls?
>
>
> On Sat, May 17, 2014 at 8:13 AM, DImuthu Upeksha <[email protected]> wrote:
>
>> Hi Alin,
>>
>> You can find my source code here:
>> https://gist.github.com/DImuthuUpe/5dcfa9758f017794c649
>> As you can see I set
>> X-offset : 0 and Y-offset : 0 for "H"
>> X-offset : 32 and Y-offset : 0 for "W"
>> in the text matrices. Is that enough? Is there another way to set the X,Y
>> coordinates?
>>
>>
>> On Sat, May 17, 2014 at 12:18 PM, Alin Mazilu <[email protected]> wrote:
>> > What are the x and y coordinates of H and W?
>> >
>> > Alin Mazilu
>> > SKE GlobalTech, LLC
>> > 3250 West Market St. Suite 307D
>> > Fairlawn, OH 44333
>> >
>> > Sent from my Galaxy S3
>> > On May 17, 2014 2:42 AM, "DImuthu Upeksha" <[email protected]>
>> > wrote:
>> >
>> >> Hi all,
>> >>
>> >> I was trying to manually feed TextPosition objects to the
>> >> processTextPosition method of the PDFTextStripper class. I created a
>> >> subclass of PDFTextStripper and overrode the processStream method. In
>> >> processStream I manually created two TextPosition objects for the
>> >> letters "W" and "H". At the end I passed them to processTextPosition:
>> >>
>> >> processTextPosition(textPosition1);
>> >> processTextPosition(textPosition2);
>> >>
>> >> Then I tested it using:
>> >>
>> >> PDFTextStripper ocrStripper = new PDFOCRTextStripper();
>> >> PDDocument document = PDDocument.load("some pdf file");
>> >> String data = ocrStripper.getText(document);
>> >> System.out.println(data);
>> >>
>> >> Output was : H W
>> >>
>> >> Then I changed the order in which the TextPosition objects are passed in [1]:
>> >>
>> >> processTextPosition(textPosition2);
>> >> processTextPosition(textPosition1);
>> >>
>> >> Output was : WH
>> >>
>> >> ------------------------------
>> >>
>> >> As far as I understand, processTextPosition works with the text
>> >> position metadata, such as the x and y coordinates of the input text,
>> >> so it should not depend on the order of the input sequence. In this
>> >> case, however, processTextPosition seems to follow the order of the
>> >> input. For example, if I pass W first, it prints W first without
>> >> considering its actual position.
>> >>
>> >> Is this the normal behaviour? Or am I missing something here?
>> >>
>> >> [1] https://gist.github.com/DImuthuUpe/5dcfa9758f017794c649
>> >> --
>> >> Regards
>> >>
>> >> W.Dimuthu Upeksha
>> >> Undergraduate
>> >>
>> >> Department of Computer Science And Engineering
>> >>
>> >> University of Moratuwa, Sri Lanka
>> >>



-- 
Regards

W.Dimuthu Upeksha
Undergraduate

Department of Computer Science And Engineering

University of Moratuwa, Sri Lanka
