Kalani,

Kalani Bright wrote
> Thank you for pointing me there. It does seem that's the best place to
> start and it seems I would need to rewrite/extend PRTokenizer or something
> similar.
IMO the PRTokenizer is the wrong place to look, as there already is the PDF parser API. All you have to do is write a more intelligent RenderListener than LocationTextExtractionStrategy (or a TextExtractionStrategy, if you prefer to additionally implement a getResultantText method).

RenderListeners get the smallest text fragments directly available from the PDF content streams, i.e. the string arguments of the text-showing operators, together with the relevant transformation matrix information. The TextRenderInfo object wrapping this information offers some methods to analyze a fragment; you might need some more functionality inspired by those methods, though.

Your RenderListener implementation merely has to process this information to collect the words, their locations and their widths.

Be aware, though, that a text fragment presented to the RenderListener may contain multiple words, or a part of a word, or even multiple parts of multiple words. E.g. you might receive "w i", "rd stuff", and "e", the first two positioned one after the other and the last positioned to fit into the double space gap in "w i", and you would have to build "weird" and "stuff" from that. This fragmentation might be done in the PDF to position the 'i' and 'r' nearer to each other than proposed by their font, and to display the 'e' in a different font.

Regards,
Michael

--
View this message in context: http://itext-general.2136553.n4.nabble.com/Calculating-text-regions-of-individual-words-from-an-existing-PDF-tp4655616p4655624.html
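
P.S. To illustrate the fragment problem, here is a minimal sketch of the word-building step in plain Java. It deliberately uses no iText classes: WordAssembler, Glyph, Word, glyphsOf and assemble are hypothetical names made up for this example. It assumes all glyphs lie on one horizontal line, that every character has the same advance width, and that a gap wider than a chosen threshold separates two words. In a real RenderListener the positions and widths would have to be derived from the TextRenderInfo of each fragment instead.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper (not part of iText): rebuilds words from positioned glyphs. */
class WordAssembler {

    /** One character with its start x coordinate and its advance width. */
    static final class Glyph {
        final char ch;
        final double x;
        final double width;

        Glyph(char ch, double x, double width) {
            this.ch = ch;
            this.x = x;
            this.width = width;
        }
    }

    /** A rebuilt word with the x coordinate of its first glyph and its total width. */
    static final class Word {
        final String text;
        final double x;
        final double width;

        Word(String text, double x, double width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }

        @Override
        public String toString() {
            return text + " @ x=" + x + ", width=" + width;
        }
    }

    /** Splits a fragment into glyphs; ASSUMPTION: a fixed advance per character. */
    static List<Glyph> glyphsOf(String text, double startX, double advance) {
        List<Glyph> glyphs = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            glyphs.add(new Glyph(text.charAt(i), startX + i * advance, advance));
        }
        return glyphs;
    }

    /** Sorts the glyphs by position and starts a new word at every gap wider than maxGap. */
    static List<Word> assemble(List<Glyph> glyphs, double maxGap) {
        // Space characters only move the position; the gap test below replaces them.
        List<Glyph> sorted = new ArrayList<>();
        for (Glyph g : glyphs) {
            if (!Character.isWhitespace(g.ch)) {
                sorted.add(g);
            }
        }
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.x));

        List<Word> words = new ArrayList<>();
        StringBuilder text = new StringBuilder();
        double wordStart = 0;
        double wordEnd = 0;
        for (Glyph g : sorted) {
            if (text.length() > 0 && g.x - wordEnd > maxGap) {
                words.add(new Word(text.toString(), wordStart, wordEnd - wordStart));
                text.setLength(0);
            }
            if (text.length() == 0) {
                wordStart = g.x;
            }
            text.append(g.ch);
            wordEnd = g.x + g.width;
        }
        if (text.length() > 0) {
            words.add(new Word(text.toString(), wordStart, wordEnd - wordStart));
        }
        return words;
    }

    public static void main(String[] args) {
        // The three fragments from the example above, in content stream order.
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.addAll(glyphsOf("w i", 0, 10));       // 'w' at 0, space at 10, 'i' at 20
        glyphs.addAll(glyphsOf("rd stuff", 30, 10)); // continues directly after the 'i'
        glyphs.addAll(glyphsOf("e", 10, 10));        // fills the gap between 'w' and 'i'
        for (Word word : assemble(glyphs, 2.5)) {
            System.out.println(word);
        }
    }
}
```

With these made-up coordinates the three fragments come out as the two words "weird" and "stuff". Sorting by position before grouping is what allows the late "e" to end up inside "weird"; real documents additionally need line detection, rotated text handling and a gap threshold derived from the font size.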