Debasis Mandal,

Debasis Mandal wrote
> I am working on extracting text from pdf and want to get exact position of
> all words (in the form of co-ordinates) from pdf by using itextsharp dll.
> I am using .Net Framework. But I am facing some problem - when i am
> extracting words from pdf, I can not get the right words. It's split
> multiple part of a word. For example, If word="PAGE", first time its
> render word="PAG" then next render word="E". Also facing same problem for
> finding co-ordinate of a word.

Unfortunately you have not told us how you try to extract words from your
PDF. Thus, I have to guess what you are doing.

I assume you have implemented your own RenderListener and process each
TextRenderInfo immediately when you receive it, starting from the premise
that each word is completely contained in one TextRenderInfo.

This premise is wrong. PDF page content consists of numerous groups of
glyphs, each of which is drawn starting at its own position. Such a group
may be anything: a whole text line, several words, a single word, part of a
word, or an individual character. A group may even contain the end of one
word and the start of the next without containing either word completely.
Furthermore, the groups don't have to appear in reading order.

And each TextRenderInfo represents one such glyph group.

Thus, to find the coordinates of a word, you first have to collect all glyph
groups (i.e. TextRenderInfo objects) which may contribute to that word, and
only then determine the coordinates.
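
To illustrate the collect-then-merge idea, here is a minimal sketch in Java
(iText's original language; the same logic ports to C#). It does not use
iText at all: Chunk, Word, WordAssembler and both tolerance values are
hypothetical names made up for this example. A Chunk stands in for the text
and baseline coordinates you would copy out of each TextRenderInfo in your
listener. The sketch also assumes horizontal text and that no chunk contains
whitespace in the middle; a chunk spanning two words would first have to be
split into smaller pieces.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical stand-in for the data copied out of one TextRenderInfo. */
final class Chunk {
    final String text;
    final float startX, endX, baselineY;
    Chunk(String text, float startX, float endX, float baselineY) {
        this.text = text; this.startX = startX;
        this.endX = endX; this.baselineY = baselineY;
    }
}

/** A reassembled word with its horizontal extent on the baseline. */
final class Word {
    final String text;
    final float startX, endX, baselineY;
    Word(String text, float startX, float endX, float baselineY) {
        this.text = text; this.startX = startX;
        this.endX = endX; this.baselineY = baselineY;
    }
    @Override public String toString() {
        return text + " [" + startX + ".." + endX + "] @ y=" + baselineY;
    }
}

final class WordAssembler {
    // Assumed tolerance: a larger horizontal gap than this starts a new word.
    static final float GAP_TOLERANCE = 1f;

    static List<Word> assemble(List<Chunk> chunks) {
        // Collect first, then sort: lines top to bottom (PDF y grows upward,
        // baselines rounded to whole units), then left to right in a line.
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator
                .comparingInt((Chunk c) -> -Math.round(c.baselineY))
                .thenComparingDouble(c -> c.startX));

        List<Word> words = new ArrayList<>();
        StringBuilder text = new StringBuilder();
        float wordStart = 0, wordEnd = 0, wordBaseline = 0;
        int wordLine = 0;
        for (Chunk c : sorted) {
            int line = Math.round(c.baselineY);
            boolean blank = c.text.trim().isEmpty();
            // A chunk extends the current word only if it is non-blank, on
            // the same line, and starts right where the word currently ends.
            boolean continues = text.length() > 0 && !blank
                    && line == wordLine
                    && c.startX - wordEnd <= GAP_TOLERANCE;
            if (!continues && text.length() > 0) {
                words.add(new Word(text.toString(), wordStart, wordEnd, wordBaseline));
                text.setLength(0);
            }
            if (blank) continue; // whitespace only separates words
            if (text.length() == 0) {
                wordStart = c.startX;
                wordLine = line;
                wordBaseline = c.baselineY;
            }
            text.append(c.text);
            wordEnd = c.endX;
        }
        if (text.length() > 0) {
            words.add(new Word(text.toString(), wordStart, wordEnd, wordBaseline));
        }
        return words;
    }

    public static void main(String[] args) {
        List<Chunk> chunks = new ArrayList<>();
        chunks.add(new Chunk("E", 130f, 140f, 700f));   // arrives before "PAG"
        chunks.add(new Chunk("PAG", 100f, 130f, 700f));
        chunks.add(new Chunk(" ", 140f, 145f, 700f));
        chunks.add(new Chunk("1", 145f, 150f, 700f));
        chunks.add(new Chunk("Next", 100f, 140f, 680f));
        for (Word w : assemble(chunks)) {
            System.out.println(w);
        }
    }
}
```

With the sample chunks in main, "PAG" and "E" are merged into the single
word "PAGE" spanning x = 100 to 140, although "E" was received first. Real
documents will need tuned tolerances and some handling of rotated text.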

The source of the LocationTextExtractionStrategy shows you how to collect
and sort the text render information objects.

Some more hints can be found, for example, in this answer on Stack Overflow:
http://stackoverflow.com/questions/13714605/retrieve-the-respective-coordinates-of-all-words-on-the-page-with-itextsharp/13719947

Regards,   Michael



