There are currently two text extraction strategies. One is a very simple extraction of text directly from the content stream. The other is a much more advanced, location-based extraction (this is the default).
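To illustrate the difference between the two approaches, here is a rough, self-contained sketch. It is not the actual iText code: `Chunk`, `LocationSketch`, and the two tolerance constants are hypothetical names and values made up for this example, and the real strategies handle many more cases (rotated text, font metrics, and so on).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a positioned text fragment; not an iText class. */
final class Chunk {
    final String text;
    final float x;      // left edge of the fragment
    final float y;      // baseline; PDF y coordinates grow upward
    final float width;  // rendered width of the fragment

    Chunk(String text, float x, float y, float width) {
        this.text = text;
        this.x = x;
        this.y = y;
        this.width = width;
    }
}

final class LocationSketch {
    // Assumed tolerances, chosen only for illustration.
    static final float LINE_TOLERANCE = 2f; // max baseline drift within one line
    static final float SPACE_GAP = 1f;      // horizontal gap that implies a space

    /** Simple approach: concatenate fragments in content-stream order. */
    static String inStreamOrder(List<Chunk> chunks) {
        StringBuilder sb = new StringBuilder();
        for (Chunk c : chunks) {
            sb.append(c.text);
        }
        return sb.toString();
    }

    /** Location-based approach: group fragments into lines, then order by x. */
    static String byLocation(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort((a, b) -> Float.compare(b.y, a.y)); // top of page first
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < sorted.size()) {
            // Collect every fragment whose baseline is close to this line's.
            float lineY = sorted.get(i).y;
            List<Chunk> line = new ArrayList<>();
            while (i < sorted.size() && lineY - sorted.get(i).y <= LINE_TOLERANCE) {
                line.add(sorted.get(i++));
            }
            line.sort((a, b) -> Float.compare(a.x, b.x)); // left to right
            if (sb.length() > 0) {
                sb.append('\n');
            }
            Chunk prev = null;
            for (Chunk c : line) {
                // Insert a space when there is a visible gap between fragments.
                if (prev != null && c.x - (prev.x + prev.width) > SPACE_GAP) {
                    sb.append(' ');
                }
                sb.append(c.text);
                prev = c;
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Fragments listed in the order they appear in the content stream.
        List<Chunk> chunks = new ArrayList<>();
        chunks.add(new Chunk("World", 60f, 700f, 30f));
        chunks.add(new Chunk("Second line", 10f, 680f, 60f));
        chunks.add(new Chunk("Hello", 10f, 700f, 30f));
        System.out.println(inStreamOrder(chunks));
        System.out.println(byLocation(chunks));
    }
}
```

With this sample data, the stream-order version prints `WorldSecond lineHello`, while the location-based version prints `Hello World` followed by `Second line` on its own line. A formatting heuristic would work from the same kind of positional data.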
Extending the location-based strategy to add formatting capabilities is possible, and was the intent of the architecture. The advanced formatting will probably require some heuristic development, because PDF doesn't really carry the sort of layout information that would make this straightforward, but the framework will provide all the information you need as inputs to the heuristic. So, if you are up for contributing a PrettyTextExtractionStrategy to the parser, the vast majority of the infrastructure for doing this has already been built, and I welcome the contribution. One thing I can think of that would make this easier for others is to add a protected method to LocationTextExtractionStrategy that exposes the locationalResults member collection. If this is something that you'd like to take a crack at, I'll refactor that. Let me know.

- K
