First, if you can get the data some other way from the source application 
(besides parsing PDF), you will be much happier for it.  I strongly recommend 
that you try that first.

If that is not possible, then read on...

The request to convert whitespace into appropriate spaces/tabs/newlines 
is not, in the general case, going to happen.  After all, how do you define 
what a tab represents?  That said, it may be possible for you to define 
physical zones on your pages, and perform text extraction based on those zones. 
From the data set you provided, you could probably look for the location of 
the headings of each column (JAN, FEB, etc.) to determine the right-hand 
extent of each column.  That would allow you to determine the area of the page 
devoted to a particular column.  The exception here is the JAN column, but you 
may be able to fudge this.
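
To make the zone idea concrete, here is a minimal sketch in plain Java.  It 
does not use iText at all: ColumnZones, columnFor(), firstColumnLeft and the 
tolerance are all hypothetical names of mine, and it assumes you have already 
extracted the right-hand X coordinate of each heading and of each value, and 
that values end at or before their heading's right edge.  The explicit left 
boundary is one possible way to fudge the JAN column.

```java
import java.util.Arrays;

// Hypothetical helper, not part of iText: maps the right-hand X coordinate
// of an extracted string to a column zone.
class ColumnZones {
    private final double firstColumnLeft; // assumed left edge of the first column
    private final double[] rightEdges;    // right edge of each heading, ascending

    ColumnZones(double firstColumnLeft, double[] headingRightEdges) {
        this.firstColumnLeft = firstColumnLeft;
        this.rightEdges = headingRightEdges.clone();
        Arrays.sort(this.rightEdges);
    }

    // Returns the zero-based column whose zone contains stringRightX, or -1
    // if the string ends left of the first column or right of the last one.
    // The tolerance absorbs small rounding differences in the coordinates.
    int columnFor(double stringRightX, double tolerance) {
        if (stringRightX < firstColumnLeft - tolerance) {
            return -1;
        }
        for (int i = 0; i < rightEdges.length; i++) {
            if (stringRightX <= rightEdges[i] + tolerance) {
                return i;
            }
        }
        return -1;
    }
}
```

Strings that map to -1 (row labels, page furniture) can then be handled 
separately from the column values.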

If this can be done, then you could construct a custom text extractor that 
would identify which column each text string belongs to.  The fact that your 
source uses a mono-space font will make your job a bit easier here - I wouldn't 
be surprised if you could actually assign a character X,Y location (as opposed 
to physical page location X,Y) to each string that you extract.  That would 
make it pretty easy to determine which values belong to a given column.

You can take a look at the 
com.lowagie.text.pdf.parser.SimpleTextExtractingPdfContentStreamProcessor class 
as a starting point for a solution (actually start with PdfTextExtractor to see 
how to just get a basic text extraction, then dive into 
SimpleTextExtractingPdfContentStreamProcessor for how to tweak it to your 
needs).  The start of the string is defined by the current text matrix in the 
graphics state.  The end of the string is defined by the endingTextMatrix that 
is passed in to the displayText() method.  If you look at how we compute the 
currentX position, you can probably just use that value as the 'X' location of 
the string in question.  The Y location comes from 
textLineMatrix.get(Matrix.I32).  


The following approach will only work because you are using a mono-space font, 
but take any simplifications that you can get!!

If you take a couple of strings that get passed into displayText() with varying 
xPosition, you should be able to come up with a common scaling factor such that:

(X-X0)*Sx ==>  Xcharpos == [0, 1, 2, ... m]

where m is the total number of character output columns.  You won't know m to 
begin with - the key is that you solve for X0 and Sx such that Xcharpos is an 
integer (or very close to it - maybe within 0.0001).

A similar calculation can be done for the Y direction (given two strings with 
varying yPosition), such that:

(Y-Y0)*Sy ==>  Ycharpos == [0, 1, 2, ... n]

where n is the total number of character output rows.
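
One possible way to solve for the origin and scale along a single axis is 
sketched below in plain Java.  This is my own guess at a workable search, not 
anything that exists in iText: GridFit, solve(), cell() and maxDivisor are 
hypothetical names.  It takes the smallest sampled coordinate as the origin, 
then tries the smallest gap divided by 1, 2, 3, ... as the character pitch 
until every sample lands within the tolerance of an integer.

```java
// Hypothetical helper, not part of iText: fits origin and scale so that
// (coordinate - origin) * scale is close to an integer for every sample.
class GridFit {
    final double origin; // X0 (or Y0)
    final double scale;  // Sx (or Sy), in character cells per PDF unit

    private GridFit(double origin, double scale) {
        this.origin = origin;
        this.scale = scale;
    }

    static GridFit solve(double[] samples, double tolerance, int maxDivisor) {
        double origin = Double.POSITIVE_INFINITY;
        for (double s : samples) {
            origin = Math.min(origin, s);
        }
        // Smallest non-zero distance from the origin; the true pitch is
        // assumed to be this gap divided by some small whole number.
        double smallestGap = Double.POSITIVE_INFINITY;
        for (double s : samples) {
            double gap = s - origin;
            if (gap > 1e-9) {
                smallestGap = Math.min(smallestGap, gap);
            }
        }
        if (Double.isInfinite(smallestGap)) {
            throw new IllegalArgumentException("need at least two distinct positions");
        }
        for (int k = 1; k <= maxDivisor; k++) {
            double scale = k / smallestGap;
            if (allNearIntegers(samples, origin, scale, tolerance)) {
                return new GridFit(origin, scale);
            }
        }
        throw new IllegalArgumentException("no scale fits within the tolerance");
    }

    private static boolean allNearIntegers(double[] samples, double origin,
                                           double scale, double tolerance) {
        for (double s : samples) {
            double pos = (s - origin) * scale;
            if (Math.abs(pos - Math.rint(pos)) > tolerance) {
                return false;
            }
        }
        return true;
    }

    // Converts a page coordinate to its integer character position.
    int cell(double coordinate) {
        return (int) Math.rint((coordinate - origin) * scale);
    }
}
```

Two caveats with this sketch: with too few samples it can settle on a multiple 
of the real character pitch, so feed it as many positions as you can, and 
since PDF Y coordinates normally grow upward, the row numbers it yields count 
from the bottom of the page unless you flip them.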


Once you solve for X0, Sx, Y0, Sy (which will all be floating point numbers), 
each character can then be placed into an array of char[m][n] using Xcharpos 
and Ycharpos as the array indexes, and you've basically got the equivalent of 
an old-school dumb terminal text display that you can parse to your heart's 
content.
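
A rough sketch of that character grid, again in plain Java with hypothetical 
names (TextGrid, put(), line()).  It assumes you already have the integer row 
and starting column for each string, and that every character in a string 
occupies exactly one cell.  I index the array as [row][column] rather than 
[m][n] so that a whole line can be read back as one String.

```java
import java.util.Arrays;

// Hypothetical helper, not part of iText: a fixed-size character screen
// that extracted strings are written into.
class TextGrid {
    private final char[][] cells; // cells[row][column], blank-filled

    TextGrid(int rows, int columns) {
        cells = new char[rows][columns];
        for (char[] row : cells) {
            Arrays.fill(row, ' ');
        }
    }

    // Writes text starting at the given character column of the given row.
    // Characters that would fall outside the grid are silently dropped.
    void put(int row, int column, String text) {
        if (row < 0 || row >= cells.length) {
            return;
        }
        for (int i = 0; i < text.length(); i++) {
            int c = column + i;
            if (c >= 0 && c < cells[row].length) {
                cells[row][c] = text.charAt(i);
            }
        }
    }

    // Returns one row of the grid, including its padding spaces.
    String line(int row) {
        return new String(cells[row]);
    }
}
```

Once every string has been put() into the grid, each line() can be split on 
fixed character offsets just like a line of terminal output.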


Note that because you are dealing with a fixed-width font, there is no need to 
do the really nasty determination of whether two adjacent strings actually have 
a space between them or not.


This would be a useful PdfContentStreamProcessor for parsing mono-spaced fonts - 
please let us know what you come up with!

- K


----------------------- Original Message -----------------------
 
From: "Eoin Hinchy" <[EMAIL PROTECTED]>
To: [email protected]
Cc: 
Date: Sat, 8 Nov 2008 12:36:21 +0000
Subject: [iText-questions] Read PDF replacing whitespace with spaces
 
Hi guys,

I was wondering if it's possible to use iText to read in a PDF and
replace all the whitespace in it with spaces/tabs/newlines.
For example:
Read in the file http://www.plainsight.info/dev/example.pdf
and output something along the lines of:
http://www.plainsight.info/dev/desired.txt

I've been looking through the itext forums/mail lists for the answer
to my question but I couldn't find it.
Is it even possible?

Thanks a mill, any help massively appreciated,
Eoin

-------------------------------------------------------------------------
This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
Build the coolest Linux based applications with Moblin SDK & win great prizes
Grand prize is a trip for two to an Open Source event anywhere in the world
http://moblin-contest.org/redirect.php?banner_id=100&url=/
_______________________________________________
iText-questions mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/itext-questions

Buy the iText book: http://www.1t3xt.com/docs/book.php
