Good suggestion. Done and committed to HEAD.
I wound up making the member variables of the PdfContentStreamProcessor private (leaving them package exposed was just laziness on my part during initial development) - hopefully that doesn't break anything you have written.
I may decide that a better approach here would be to expose a text positioning state (that would itself contain the matrices) - not sure if that will be necessary or not.
- K
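A minimal sketch of the arrangement described above: private state in the base class, reached from a subclass in any package through a protected accessor. Everything here is a hypothetical stand-in (ProcessorSketch, PositionRecorder, showTextAt, and the six-number matrix array are assumptions for illustration); it is not the iText class or its API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a parser base class; NOT the real iText class. */
abstract class ProcessorSketch {
    // Private state: subclasses can only read it through the accessor below.
    // Assumed layout: {a, b, c, d, e, f}, where e and f are the x/y translation.
    private final float[] textMatrix = {1f, 0f, 0f, 1f, 0f, 0f};

    /** Protected accessor, so subclasses outside this package can read the matrix. */
    protected float[] getTextMatrix() {
        return textMatrix.clone(); // defensive copy keeps the field effectively read-only
    }

    /** Simulates the parser moving the text position and then showing a string. */
    final void showTextAt(float x, float y, String text) {
        textMatrix[4] = x;
        textMatrix[5] = y;
        displayText(text);
    }

    /** Callback invoked once per shown string. */
    protected abstract void displayText(String text);
}

/** Example subclass that records each string together with its position. */
final class PositionRecorder extends ProcessorSketch {
    final List<String> seen = new ArrayList<>();

    @Override
    protected void displayText(String text) {
        float[] m = getTextMatrix();
        seen.add(text + "@" + m[4] + "," + m[5]);
    }

    public static void main(String[] args) {
        PositionRecorder recorder = new PositionRecorder();
        recorder.showTextAt(72f, 700f, "Rate");
        System.out.println(recorder.seen);
    }
}
```

Returning a copy from the accessor is a design choice in this sketch, not something the thread specifies; it lets subclasses read the position without being able to corrupt the parser's state.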
----------------------- Original Message -----------------------
From: "Neil Aggarwal" <[EMAIL PROTECTED]>
To: "'Post all your questions about iText here'" <[email protected]>
Cc:
Date: Wed, 19 Nov 2008 12:32:40 -0600
Subject: Re: [iText-questions] Access tables from PDF file?
Kevin:
I wrote a class that extends PdfContentStreamProcessor and implemented my own displayText method. I am able to get the text and x and y coordinates. That gives me what I need.
The only problem was that since my class is not in the com.lowagie.text.pdf.parser package, I had to add a protected accessor method for the textMatrix member in PdfContentStreamProcessor. You may want to consider adding that to the mainline code so people can write their own subclasses without having to rebuild iText.
Thank you for the help.
Neil
--
Neil Aggarwal, (832)245-7314, www.JAMMConsulting.com
Eliminate junk email and reclaim your inbox.
Visit http://www.spammilter.com for details.
________________________________
From: Kevin Day [mailto:[EMAIL PROTECTED]
Sent: Tuesday, November 18, 2008 7:10 PM
To: IText Questions; 'Post all your questions about iText here'
Subject: Re: [iText-questions] Access tables from PDF file?
Well, SimpleTextExtractingPdfContentStreamProcessor is just a listener on the parser. It determines whether a carriage return is appropriate given the vertical position of the current text relative to the previous text (this, by the way, is *not* a robust solution - it's quite possible for text to be placed in the content stream in a different order than it appears on screen - but it seems to work for many PDF files - at least the ones I need to process :-) ).
I'm not sure that you'd want to turn around and put a listener on the listener (there may be reasons to do this, but I haven't hit one just yet). You could, of course, just take the string that SimpleTextExtractingPdfContentStreamProcessor generates and split it by carriage returns - that would be the easiest thing.
If you need to do any fancier sort of handling, you'd be looking at your own implementation of displayText(). The code that sets hardReturn = true is the check you are asking about for determining the presence of a hard return.
The X position that a given string is going to be placed at is computed in the currentX variable. Be very careful here, though - the strings passed in to displayText() are *not* guaranteed to be (and in most cases are not) words. It is quite possible to have 3 displayText() calls containing text from 5 words. Computing where we are supposed to put the spaces in between those words is the hardest part of this parsing stuff - a very basic implementation is in the 'else if (lastEndingTextMatrix != null)' portion of the if-block. It works for many cases, but has not been exhaustively tested (for example, I am *certain* that there are fonts out there that don't specify a width for a space character). Note that it's also possible for a single displayText() call to contain multiple words, with a space in between them... Welcome to PDF :-)
I think your best bet is to create your own stream processor, copy the existing displayText() code and insert your special handling as needed. At some point this whole thing will stabilize and you'll be able to sub-class, etc... but not yet.
- K
----------------------- Original Message -----------------------
From: "Neil Aggarwal" <[EMAIL PROTECTED]>
To: "'Post all your questions about iText here'" <[email protected]>
Cc:
Date: Tue, 18 Nov 2008 18:25:00 -0600
Subject: Re: [iText-questions] Access tables from PDF file?
Kevin:
Actually, I have an idea. It seems to me that if I can get the operations that are going on, such as a hard return and the X position when it writes text, I can do what I need. I am thinking a callback that SimpleTextExtractingPdfContentStreamProcessor invokes to let it know what is happening is the best approach.
What do you think?
Neil
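The hard-return check and the space heuristic discussed above could be sketched roughly as follows. This is a self-contained illustration, not the iText code: TextChunk, ChunkJoiner, the 0.5-unit line tolerance, and the "gap wider than half a space" threshold are all assumptions made for the example, and the space width is passed in rather than read from a font.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for one displayText() call: a string plus where it starts and ends. */
final class TextChunk {
    final String text;
    final float startX;
    final float endX;
    final float y;

    TextChunk(String text, float startX, float endX, float y) {
        this.text = text;
        this.startX = startX;
        this.endX = endX;
        this.y = y;
    }
}

/** Joins chunks in stream order, guessing where line breaks and spaces belong. */
final class ChunkJoiner {
    // Assumed tolerance: baselines closer than this count as the same line.
    static final float LINE_TOLERANCE = 0.5f;

    static String join(List<TextChunk> chunks, float spaceWidth) {
        StringBuilder out = new StringBuilder();
        TextChunk prev = null;
        for (TextChunk c : chunks) {
            if (prev != null) {
                if (Math.abs(c.y - prev.y) > LINE_TOLERANCE) {
                    out.append('\n'); // vertical position changed: treat as a hard return
                } else if (c.startX - prev.endX > spaceWidth / 2f) {
                    out.append(' ');  // horizontal gap is wide enough to be a word break
                }
            }
            out.append(c.text);
            prev = c;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Four chunks spell two words on one line; a fifth starts a new line.
        List<TextChunk> chunks = Arrays.asList(
                new TextChunk("Hel", 0f, 15f, 100f),
                new TextChunk("lo", 15f, 25f, 100f),
                new TextChunk("wor", 30f, 45f, 100f),
                new TextChunk("ld", 45f, 55f, 100f),
                new TextChunk("next", 0f, 20f, 88f));
        System.out.println(join(chunks, 5f));
    }
}
```

With a space width of 5, the sample chunks join to "Hello world" on the first line and "next" on the second. Like the approach it illustrates, this breaks down when the content stream is not in reading order or when the font reports no usable space width.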
________________________________
From: Kevin Day [mailto:[EMAIL PROTECTED]
Sent: Tuesday, November 18, 2008 3:32 PM
To: IText Questions
Subject: Re: [iText-questions] Access tables from PDF file?
Let me know if you need to bounce ideas...
- K
----------------------- Original Message -----------------------
From: "Neil Aggarwal" <[EMAIL PROTECTED]>
To: "'Post all your questions about iText here'" <[email protected]>
Cc:
Date: Tue, 18 Nov 2008 15:26:41 -0600
Subject: Re: [iText-questions] Access tables from PDF file?
Kevin:
Thank you for the tip. I think you are right in that the most straightforward solution is to look at the column heading locations. I will try that approach first.
Thanks,
Neil
________________________________
From: Kevin Day [mailto:[EMAIL PROTECTED]
Sent: Tuesday, November 18, 2008 3:07 PM
To: IText Questions
Subject: Re: [iText-questions] Access tables from PDF file?
Code executing in SimpleTextExtractingPdfContentStreamProcessor.displayText() has all the information you need to determine where on the page a given text string is going to be placed. You should be able to use that information to create multiple text outputs instead of a single text output (one output per column). You'll probably wind up writing your own custom PdfContentStreamProcessor implementation.
The challenge will be determining where the columns start and stop. If you are processing a ton of files that all have the same format, you may be able to hard code these.
Otherwise, I suppose that you could add a first pass PdfContentStreamProcessor that looks for the draw operations that create the vertical lines of the graph, and construct your columns using those. That might be trickier...
If you are going to go down this path, you'll need to add some additional ContentOperators to PdfContentStreamProcessor. If you need to do this, let me know and I'll add some methods to PdfContentStreamProcessor to allow you to add additional operators from sub-classes (not allowed right now).
You might also be able to look for the column header text, and figure out the column margins from that - a bit less accurate, but might work (and probably would be a lot easier than trying to figure out *which* vertical line draw operations are for column borders).
- K
----------------------- Original Message -----------------------
From: "Neil Aggarwal" <[EMAIL PROTECTED]>
To: <[email protected]>
Cc:
Date: Tue, 18 Nov 2008 14:06:46 -0600
Subject: [iText-questions] Access tables from PDF file?
Hello:
I am trying to use iText to read the values from this file: http://www.bcad.org/PDFs/TAX-RATE-CHARTS%202008.pdf
I used the text extractor on the file and it gave me all the text, as I expected.
I am trying to figure out if it is possible for me to step through the tables and pull the text myself, since I need to know which columns the values come from.
I am investigating and hoping someone can tell me if I am barking up the wrong tree.
Thanks,
Neil
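The idea raised in the thread of deriving column margins from the header text could be sketched as below. This is only an illustration under stated assumptions: ColumnLocator and columnFor are hypothetical names, the header X positions are assumed to have been collected already, and cell text is assumed to start at or to the right of its header (right-aligned numeric columns would need end positions or midpoints instead).

```java
import java.util.Arrays;

/** Hypothetical helper: maps an X position to a column using the headers' start positions. */
final class ColumnLocator {
    private final float[] columnStarts; // header start X values, sorted ascending

    ColumnLocator(float[] headerStartXs) {
        columnStarts = headerStartXs.clone();
        Arrays.sort(columnStarts);
    }

    /**
     * Returns the index of the right-most column whose header starts at or
     * left of x, or -1 if x lies left of every header.
     */
    int columnFor(float x) {
        int column = -1;
        for (int i = 0; i < columnStarts.length; i++) {
            if (x >= columnStarts[i]) {
                column = i;
            }
        }
        return column;
    }

    public static void main(String[] args) {
        // Made-up header positions for a three-column table.
        ColumnLocator locator = new ColumnLocator(new float[] {50f, 200f, 350f});
        System.out.println(locator.columnFor(60f));
        System.out.println(locator.columnFor(200f));
        System.out.println(locator.columnFor(400f));
    }
}
```

Each string seen by a custom displayText() implementation could then be appended to the output buffer for its column index, giving one text output per column.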
------------------------------------------------------------------------- This SF.Net email is sponsored by the Moblin Your Move Developer's challenge Build the coolest Linux based applications with Moblin SDK & win great prizes Grand prize is a trip for two to an Open Source event anywhere in the world http://moblin-contest.org/redirect.php?banner_id=100&url=/
_______________________________________________ iText-questions mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/itext-questions
Buy the iText book: http://www.1t3xt.com/docs/book.php
