Kalani,

Kalani Bright wrote
> Thank you for pointing me there. It does seem that's the best place to
> start and it seems I would need to rewrite/extend PRTokenizer or something
> similar.

IMO the PRTokenizer is the wrong place to look, as the PDF parser API
already exists for this purpose.

All you have to do is write a more intelligent RenderListener (or
TextExtractionStrategy if you prefer to additionally implement a
getResultantText method) than LocationTextExtractionStrategy.
RenderListeners get the smallest text fragments directly available from the
PDF content streams, i.e. the string arguments of the text-showing commands,
together with the relevant transformation matrix information. The
TextRenderInfo object wrapping this information offers some methods to
analyze such a fragment; you might need some more functionality inspired by
those methods, though.

Your RenderListener implementation merely has to process this information to
collect the words, their locations and their widths. Be aware, though, that
the text fragments presented to the RenderListener may contain multiple
words, or a part of a word, or even multiple parts of multiple words.

E.g. you might receive "w  i", "rd stuff", and "e", the first two with
location information positioning them one after the other and the last one
positioned to fit into the double-space gap in "w  i", and you would have to
build "weird" and "stuff" from that. This fragmentation might be done in the
PDF to position the 'i' and 'r' nearer to each other than proposed by their
font and to display the 'e' in a different font.
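
To make that concrete, here is a minimal stand-alone sketch of the
reassembly step. It deliberately does not use iText: Fragment, Glyph, Word and
WordAssembler are hypothetical names of mine, the coordinates are made up, and
it assumes horizontal text on a single line with one uniform character width
per fragment. In a real RenderListener the text and start position would come
from the TextRenderInfo (e.g. getText() and getBaseline()), and the widths
from the font metrics.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch; none of these types are iText classes. */
class WordAssembler {

    /** A text fragment as a listener might record it (simplified model). */
    static final class Fragment {
        final String text;
        final double x;        // x coordinate where the fragment starts
        final double advance;  // assumed uniform width of each character
        Fragment(String text, double x, double advance) {
            this.text = text; this.x = x; this.advance = advance;
        }
    }

    /** A single non-space character with its own position and width. */
    static final class Glyph {
        final char c; final double x; final double width;
        Glyph(char c, double x, double width) {
            this.c = c; this.x = x; this.width = width;
        }
    }

    /** A reassembled word with its start position and total width. */
    static final class Word {
        final String text; final double x; final double width;
        Word(String text, double x, double width) {
            this.text = text; this.x = x; this.width = width;
        }
        @Override public String toString() {
            return text + " at x=" + x + ", width=" + width;
        }
    }

    /** Splits fragments into glyphs, sorts them by x, and cuts at large gaps. */
    static List<Word> assemble(List<Fragment> fragments, double maxGap) {
        List<Glyph> glyphs = new ArrayList<Glyph>();
        for (Fragment f : fragments) {
            for (int i = 0; i < f.text.length(); i++) {
                char c = f.text.charAt(i);
                // Spaces only contribute their advance; gaps are found geometrically.
                if (!Character.isWhitespace(c)) {
                    glyphs.add(new Glyph(c, f.x + i * f.advance, f.advance));
                }
            }
        }
        glyphs.sort(Comparator.comparingDouble((Glyph g) -> g.x));

        List<Word> words = new ArrayList<Word>();
        StringBuilder text = new StringBuilder();
        double start = 0, end = 0;
        for (Glyph g : glyphs) {
            // A gap wider than maxGap after the previous glyph ends the word.
            if (text.length() > 0 && g.x - end > maxGap) {
                words.add(new Word(text.toString(), start, end - start));
                text.setLength(0);
            }
            if (text.length() == 0) {
                start = g.x;
                end = g.x + g.width;
            }
            text.append(g.c);
            end = Math.max(end, g.x + g.width);
        }
        if (text.length() > 0) {
            words.add(new Word(text.toString(), start, end - start));
        }
        return words;
    }

    public static void main(String[] args) {
        // The three fragments from the example, with invented coordinates.
        List<Fragment> fragments = Arrays.asList(
                new Fragment("w  i", 0, 6),
                new Fragment("rd stuff", 23, 6),  // 'r' pulled slightly under the 'i'
                new Fragment("e", 7, 10));        // wider 'e' from another font
        for (Word w : assemble(fragments, 3)) {
            System.out.println(w);
        }
    }
}
```

With these invented numbers the glyphs sort into w, e, i, r, d, s, t, u, f, f,
and the only gap larger than the threshold lies between 'd' and 's', so the
sketch yields "weird" and "stuff". Real documents additionally need handling
for multiple lines, rotated text and a gap threshold derived from the font,
which this sketch leaves out.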

Regards,   Michael
