Debasis Mandal,

Debasis Mandal wrote
> I am working on extracting text from pdf and want to get exact position of
> all words (in the form of co-ordinates) from pdf by using itextsharp dll.
> I am using .Net Framework. But I am facing some problem - when i am
> extracting words from pdf, I can not get the right words. It's split
> multiple part of a word. For example, If word="PAGE", first time its
> render word="PAG" then next render word="E". Also facing same problem for
> finding co-ordinate of a word.
Unfortunately, you have not told us how you try to extract words from your PDF, so I have to guess what you are doing.

I assume you have implemented your own RenderListener and process each TextRenderInfo as soon as you receive it, on the premise that each word is completely contained in a single TextRenderInfo. This premise is wrong.

PDF page content contains numerous groups of glyphs, each of which is displayed starting from its own starting position. These groups can be anything: whole text lines, multiple words, single words, word parts, or individual characters. A group may even contain the end of one word and the start of the next without containing either word completely. Furthermore, the groups do not have to appear in reading order. Each TextRenderInfo represents one such glyph group.

Thus, to find the coordinates of the words, you have to collect all glyph groups (TextRenderInfo objects) that may make up a word and only then determine the coordinates. The source of LocationTextExtractionStrategy shows how to collect and sort the text render information objects.

Some more hints can be found in this answer on Stack Overflow:
http://stackoverflow.com/questions/13714605/retrieve-the-respective-coordinates-of-all-words-on-the-page-with-itextsharp/13719947

Regards,
Michael

--
View this message in context: http://itext-general.2136553.n4.nabble.com/Extracting-word-and-finding-co-ordinates-from-pdf-Net-Framework-tp4657283p4657286.html
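To illustrate the collect, sort, and merge idea from the reply, here is a minimal Java sketch that does not depend on iText at all. Chunk, Word, WordLocator, and both tolerance constants are hypothetical names invented for this example; in a real RenderListener you would fill each Chunk from the text and baseline of a TextRenderInfo. The sketch also assumes horizontal text and spreads a chunk's width evenly over its characters, which is a simplification, not how glyph widths really work.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class WordLocator {
    /** Hypothetical stand-in for the data copied out of one glyph group. */
    static final class Chunk {
        final String text; final double startX, endX, baselineY;
        Chunk(String text, double startX, double endX, double baselineY) {
            this.text = text; this.startX = startX; this.endX = endX; this.baselineY = baselineY;
        }
    }

    /** A reassembled word with its horizontal extent and baseline. */
    static final class Word {
        final String text; final double startX, endX, baselineY;
        Word(String text, double startX, double endX, double baselineY) {
            this.text = text; this.startX = startX; this.endX = endX; this.baselineY = baselineY;
        }
    }

    // Assumed tolerances in PDF user-space units; tune them for real documents.
    static final double LINE_TOLERANCE = 1.0;
    static final double GAP_TOLERANCE = 1.0;

    /** Chunks whose baselines round to the same key are treated as one line. */
    static long lineKey(Chunk c) {
        return Math.round(c.baselineY / LINE_TOLERANCE);
    }

    static List<Word> locateWords(List<Chunk> chunks) {
        // Step 1: collect everything first, then sort into reading order:
        // top to bottom (PDF y grows upward), then left to right.
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.<Chunk>comparingLong(WordLocator::lineKey).reversed()
                .thenComparingDouble(c -> c.startX));

        List<Word> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        double wordStart = 0, wordEnd = 0, wordBaseline = 0;
        long wordLine = 0;

        // Step 2: merge adjacent chunks and split on whitespace.
        for (Chunk c : sorted) {
            boolean continuesWord = current.length() > 0
                    && lineKey(c) == wordLine
                    && c.startX - wordEnd <= GAP_TOLERANCE;
            if (!continuesWord) {
                flush(current, wordStart, wordEnd, wordBaseline, words);
            }
            int n = c.text.length();
            double charWidth = n == 0 ? 0 : (c.endX - c.startX) / n; // even-width assumption
            for (int i = 0; i < n; i++) {
                char ch = c.text.charAt(i);
                double x = c.startX + i * charWidth;
                if (Character.isWhitespace(ch)) {
                    flush(current, wordStart, wordEnd, wordBaseline, words);
                    continue;
                }
                if (current.length() == 0) { // first character of a new word
                    wordStart = x;
                    wordBaseline = c.baselineY;
                    wordLine = lineKey(c);
                }
                current.append(ch);
                wordEnd = x + charWidth;
            }
        }
        flush(current, wordStart, wordEnd, wordBaseline, words);
        return words;
    }

    /** Emits the pending word, if any, and resets the buffer. */
    private static void flush(StringBuilder sb, double startX, double endX,
                              double baselineY, List<Word> out) {
        if (sb.length() > 0) {
            out.add(new Word(sb.toString(), startX, endX, baselineY));
            sb.setLength(0);
        }
    }
}
```

Feeding it the chunks "E 2" and "PAG" in that (non-reading) order, placed next to each other on the same baseline, yields the words "PAGE" and "2", each with one start and end coordinate. The point of the design is that nothing is emitted while chunks are still arriving; words are only built once all chunks of the page have been collected and sorted.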
_______________________________________________
iText-questions mailing list
iText-questions@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/itext-questions

iText(R) is a registered trademark of 1T3XT BVBA.
Many questions posted to this list can (and will) be answered with a reference to the iText book: http://www.itextpdf.com/book/
Please check the keywords list before you ask for examples: http://itextpdf.com/themes/keywords.php