Re: Extracting formatted text from PDF files

Daniel Noll Thu, 22 Mar 2007 14:16:36 -0800

Mike O'Leary wrote:

Please forgive the laziness inherent in this question, as I haven't looked
through the PDFBox code yet. I am wondering if that code supports extracting
text from PDF files while preserving such things as sequences of whitespace
between characters and other layout and formatting information. I am working
with a project that extracts and operates on certain table-like blocks of
text from PDF files, and a lot of freeware and shareware PDF to text
converters seem to either ignore formatting or try to preserve formatting
and not get it quite right.

Even pdftohtml? The sample outputs I've seen from that application don't look too bad to me.


Daniel


--
Daniel Noll

Nuix Pty Ltd
Suite 79, 89 Jones St, Ultimo NSW 2007, Australia    Ph: +61 2 9280 0699
Web: http://nuix.com/                               Fax: +61 2 9212 6902

This message is intended only for the named recipient. If you are not
the intended recipient you are notified that disclosing, copying,
distributing or taking any action in reliance on the contents of this
message or attachment is strictly prohibited.

