First, if you can get the data from the source application some other way (besides parsing the PDF), you will be much happier for it. I strongly recommend that you try that first.
If that is not possible, then read on...

The request to convert whitespace into appropriate spaces/tabs/newlines is not, in the general case, going to happen. After all, how do you define what a tab represents?

That said, it may be possible for you to define physical zones on your pages, and perform text extraction based on those zones. From the data set you provided, you could probably look for the location of the headings of each column (JAN, FEB, etc...) to determine the right-hand extent of each column. That would allow you to determine the area of the page devoted to a particular column. The exception here is the JAN column, but you may be able to fudge this. If this can be done, then you could construct a custom text extractor that would identify which column each text string belongs to.

The fact that your source uses a mono-space font will make your job a bit easier here - I wouldn't be surprised if you could actually assign a character X,Y location (as opposed to a physical page X,Y location) to each string that you extract. That would make it pretty easy to determine which values belong to a given column.

You can take a look at the com.lowagie.text.pdf.parser.SimpleTextExtractingPdfContentStreamProcessor class as a starting point for a solution (actually, start with PdfTextExtractor to see how to get a basic text extraction, then dive into SimpleTextExtractingPdfContentStreamProcessor to see how to tweak it to your needs).

The start of the string is defined by the current text matrix in the graphics state. The end of the string is defined by the endingTextMatrix that is passed in to the displayText() method. If you look at how we compute the currentX position, you can probably just use that value as the 'X' location of the string in question. The Y location comes from textLineMatrix.get(Matrix.I32).

The following approach will only work because you are using a mono-space font, but take any simplifications that you can get!
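A minimal sketch of that approach, ahead of the details below. It is plain Java and deliberately does not touch iText: Chunk, MonoGrid and toLines are hypothetical names, and it assumes you have already captured each string's starting X, ending X and Y inside your own displayText() override. It also takes two shortcuts that are my assumptions rather than anything iText does: Sx is estimated as total characters over total string width, and X0/Y0 are taken from the leftmost and topmost strings, so the grid starts at the first printed character rather than the page edge. The grid is indexed [row][column] so that each row prints as a line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch: rebuild a fixed-width character grid from positioned strings. */
final class MonoGrid {

    /** Hypothetical holder for one string seen in displayText(). */
    static final class Chunk {
        final float x;      // X where the string starts (text matrix)
        final float endX;   // X where the string ends (ending text matrix)
        final float y;      // Y of the text line
        final String text;

        Chunk(float x, float endX, float y, String text) {
            this.x = x; this.endX = endX; this.y = y; this.text = text;
        }
    }

    static List<String> toLines(List<Chunk> chunks) {
        List<String> lines = new ArrayList<>();
        if (chunks.isEmpty()) return lines;

        double width = 0, x0 = Double.MAX_VALUE, y0 = -Double.MAX_VALUE;
        int glyphs = 0;
        for (Chunk c : chunks) {
            width += c.endX - c.x;
            glyphs += c.text.length();
            x0 = Math.min(x0, c.x);   // assumed: leftmost string starts in column 0
            y0 = Math.max(y0, c.y);   // assumed: highest Y is row 0 (PDF Y grows upward)
        }
        if (width <= 0) throw new IllegalArgumentException("no measurable text width");
        double sx = glyphs / width;   // Sx: characters per unit of X

        // Sy: the smallest gap between distinct Y values is taken as one row.
        double[] ys = chunks.stream().mapToDouble(c -> c.y).sorted().toArray();
        double pitch = Double.MAX_VALUE;
        for (int i = 1; i < ys.length; i++) {
            double gap = ys[i] - ys[i - 1];
            if (gap > 0.01) pitch = Math.min(pitch, gap);
        }
        double sy = (pitch == Double.MAX_VALUE) ? 0 : -1.0 / pitch;

        // Convert every string position to integer (row, column) indexes.
        int n = chunks.size(), rows = 0, cols = 0;
        int[] row = new int[n], col = new int[n];
        for (int i = 0; i < n; i++) {
            Chunk c = chunks.get(i);
            col[i] = (int) Math.round((c.x - x0) * sx);
            row[i] = (int) Math.round((c.y - y0) * sy);
            rows = Math.max(rows, row[i] + 1);
            cols = Math.max(cols, col[i] + c.text.length());
        }

        // Fill a blank grid, then copy each string into its row at its column.
        char[][] grid = new char[rows][cols];
        for (char[] r : grid) Arrays.fill(r, ' ');
        for (int i = 0; i < n; i++) {
            String t = chunks.get(i).text;
            t.getChars(0, t.length(), grid[row[i]], col[i]);
        }
        for (char[] r : grid) lines.add(new String(r));
        return lines;
    }

    public static void main(String[] args) {
        // Made-up positions: 6 units per character, 12 units per line.
        List<Chunk> sample = Arrays.asList(
            new Chunk(100f, 118f, 700f, "JAN"),
            new Chunk(136f, 154f, 700f, "FEB"),
            new Chunk(106f, 118f, 688f, "12"),
            new Chunk(148f, 154f, 688f, "7"));
        for (String line : toLines(sample)) System.out.println(line);
    }
}
```

The row estimate is the weakest part: if every pair of printed lines is separated by a blank line, the smallest gap is two rows, not one, so you may want to fix the line pitch by hand once you know it for your documents.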
If you take a couple of strings that get passed into displayText() with varying xPosition, you should be able to come up with a common scaling factor such that:

    (X - X0) * Sx ==> Xcharpos == [0, 1, 2, ... m]

where m is the total number of character output columns. You won't know m to begin with - the key is that you solve for X0 and Sx such that Xcharpos is an integer (or very close to it - maybe within 0.0001).

A similar calculation can be done for the Y direction (given two strings with varying yPosition), such that:

    (Y - Y0) * Sy ==> Ycharpos == [0, 1, 2, ... n]

where n is the total number of character output rows.

Once you solve for X0, Sx, Y0 and Sy (which will all be floating point numbers), each character can then be placed into an array of char[m][n] using Xcharpos and Ycharpos as the array indexes, and you've basically got the equivalent of an old-school dumb terminal text display that you can parse to your heart's content.

Note that because you are dealing with fixed-spaced fonts, there is no need to do the really nasty determination of whether two adjacent strings actually have a space between them or not.

This would be a useful PdfContentStreamProcessor for parsing mono-spaced fonts - please let us know what you come up with!

- K

----------------------- Original Message -----------------------
From: "Eoin Hinchy" <[EMAIL PROTECTED]>
To: [email protected]
Cc:
Date: Sat, 8 Nov 2008 12:36:21 +0000
Subject: [iText-questions] Read PDF replacing whitespace with spaces

Hi guys,

I was wondering if it's possible to use iText to read in a PDF and replace all the whitespace in it with spaces/tabs/newlines.

For example: Read in the file http://www.plainsight.info/dev/example.pdf and output something along the lines of: http://www.plainsight.info/dev/desired.txt

I've been looking through the iText forums/mail lists for the answer to my question but I couldn't find it. Is it even possible?
Thanks a mill, any help massively appreciated,
Eoin

-------------------------------------------------------------------------
This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
Build the coolest Linux based applications with Moblin SDK & win great prizes
Grand prize is a trip for two to an Open Source event anywhere in the world
http://moblin-contest.org/redirect.php?banner_id=100&url=/
_______________________________________________
iText-questions mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/itext-questions

Buy the iText book: http://www.1t3xt.com/docs/book.php
