There are currently two text extraction strategies.  One is a very simple
extraction of text directly from the content stream.  The other is a much
more advanced, location-based extraction (this is the default).
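
To make the difference concrete, here is a minimal, self-contained sketch
of the two ideas.  It does not use the iText API: PositionedChunk and
ExtractionSketch are hypothetical stand-ins, and the rounding and gap
threshold are assumptions chosen only for illustration.  The real strategy
is more careful than this.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical stand-in for a positioned run of text; not an iText class. */
final class PositionedChunk {
    final String text;
    final float x;      // left edge in user-space units
    final float y;      // baseline; PDF y coordinates grow upward
    final float width;  // rendered width of the run

    PositionedChunk(String text, float x, float y, float width) {
        this.text = text;
        this.x = x;
        this.y = y;
        this.width = width;
    }
}

final class ExtractionSketch {
    /** Assumed minimum horizontal gap that counts as a word break. */
    static final float SPACE_GAP = 1.0f;

    /** "Simple" idea: concatenate chunks in content-stream order. */
    static String simple(List<PositionedChunk> chunks) {
        StringBuilder sb = new StringBuilder();
        for (PositionedChunk c : chunks) {
            sb.append(c.text);
        }
        return sb.toString();
    }

    /** "Location-based" idea: order top-to-bottom, left-to-right, then join. */
    static String byLocation(List<PositionedChunk> chunks) {
        List<PositionedChunk> sorted = new ArrayList<>(chunks);
        // Baselines are rounded so chunks on the same visual line compare equal.
        sorted.sort(Comparator
                .comparingInt((PositionedChunk c) -> -Math.round(c.y))
                .thenComparingDouble(c -> c.x));

        StringBuilder sb = new StringBuilder();
        PositionedChunk prev = null;
        for (PositionedChunk c : sorted) {
            if (prev != null) {
                if (Math.round(prev.y) != Math.round(c.y)) {
                    sb.append('\n');                 // new baseline, new line
                } else if (c.x - (prev.x + prev.width) > SPACE_GAP) {
                    sb.append(' ');                  // visible gap, word break
                }
            }
            sb.append(c.text);
            prev = c;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stream order deliberately differs from visual order.
        List<PositionedChunk> chunks = List.of(
                new PositionedChunk("World", 60f, 700f, 30f),
                new PositionedChunk("Hello", 20f, 700f, 30f),
                new PositionedChunk("Second line", 20f, 680f, 60f));
        System.out.println(simple(chunks));
        System.out.println(byLocation(chunks));
    }
}
```

The point of the sketch is only that content-stream order and visual order
can disagree, which is why the location-based approach sorts by position
before joining.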

Extending that to add additional formatting capabilities is possible, and
was the intent of the architecture.  The advanced formatting will probably
require some heuristic development, because PDF doesn't really carry the
sort of layout information that would make this straightforward, but the
framework will provide all the information you need as inputs to the
heuristic.
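
As one example of the kind of heuristic I mean, a paragraph break could be
guessed from the vertical gap between consecutive lines.  The sketch below
is purely illustrative: TextLine and ParagraphHeuristic are hypothetical
names, and the 1.5x font-size threshold is an assumption that would need
tuning against real documents.

```java
import java.util.List;

/** Hypothetical line of already-ordered text; not an iText class. */
final class TextLine {
    final String text;
    final float baseline;  // PDF y coordinate of the baseline
    final float fontSize;

    TextLine(String text, float baseline, float fontSize) {
        this.text = text;
        this.baseline = baseline;
        this.fontSize = fontSize;
    }
}

final class ParagraphHeuristic {
    /** Assumed threshold: a gap above 1.5x the font size starts a paragraph. */
    static final float GAP_FACTOR = 1.5f;

    /** Joins lines (given top to bottom), adding a blank line at large gaps. */
    static String format(List<TextLine> lines) {
        StringBuilder sb = new StringBuilder();
        TextLine prev = null;
        for (TextLine line : lines) {
            if (prev != null) {
                // y grows upward, so the previous baseline is the larger value.
                float gap = prev.baseline - line.baseline;
                sb.append(gap > GAP_FACTOR * prev.fontSize ? "\n\n" : "\n");
            }
            sb.append(line.text);
            prev = line;
        }
        return sb.toString();
    }
}
```

Indentation, columns and tables would each need their own rule of this
sort, built from the same position and font inputs.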

So, if you are up for contributing a PrettyTextExtractionStrategy to the
parser, the vast majority of the infrastructure for doing this has been
built, and I welcome the contribution.

One thing I can think of that would make this easier for others is to add
a protected method to LocationTextExtractionStrategy that exposes the
locationalResults member collection.  If this is something that you'd like
to take a crack at, I'll refactor that.  Let me know.
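
Roughly, the shape of that refactoring might look like the sketch below.
This is not the actual iText source: LocationStrategySketch stands in for
LocationTextExtractionStrategy, the accessor name getLocationalResults is
an assumption, and plain strings stand in for the real chunk objects.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in for the location-based strategy; not the real class. */
class LocationStrategySketch {
    // The collection stays private; subclasses go through the accessor.
    private final List<String> locationalResults = new ArrayList<>();

    void addChunk(String text) {
        locationalResults.add(text);
    }

    /** The proposed hook: read-only access for subclasses. */
    protected List<String> getLocationalResults() {
        return Collections.unmodifiableList(locationalResults);
    }

    String getResultantText() {
        return String.join("", locationalResults);
    }
}

/** A subclass can then apply its own formatting to the same data. */
class PrettyStrategySketch extends LocationStrategySketch {
    @Override
    String getResultantText() {
        // Placeholder formatting; a real heuristic would use positions.
        return String.join(" | ", getLocationalResults());
    }
}
```

Returning an unmodifiable view is one possible design choice here: it lets
a subclass read the collected chunks without being able to corrupt the
parent's state.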

- K

--
View this message in context: 
http://itext-general.2136553.n4.nabble.com/Save-PDF-as-plain-text-tp4041246p4042929.html
Sent from the iText - General mailing list archive at Nabble.com.
