Kevin:
I wrote a class that extends PdfContentStreamProcessor
and implemented my own displayText method. I am able
to get the text and x and y coordinates. That gives me
what I need.
The only problem was that since my class is not
in the com.lowagie.text.pdf.parser package, I had
to add a protected accessor method for the textMatrix
member in PdfContentStreamProcessor.
You may want to consider adding that to the mainline code
so people can write their own subclasses without having to
rebuild iText.
Thank you for the help.
Neil
--
Neil Aggarwal, (832)245-7314, www.JAMMConsulting.com
Eliminate junk email and reclaim your inbox.
Visit http://www.spammilter.com for details.
________________________________
From: Kevin Day [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, November 18, 2008 7:10 PM
To: IText Questions; 'Post all your questions about iText here'
Subject: Re: [iText-questions] Access tables from PDF file?
Well, SimpleTextExtractingPdfContentStreamProcessor is just
a listener on the parser. It determines whether a carriage return is
appropriate given the vertical position of the current text relative to the
previous text (this, by the way, is *not* a robust solution - it's quite
possible for text to be placed in the content stream in a different order
than it appears on screen - but it seems to work for many PDF files - at
least the ones I need to process :-) ). I'm not sure that you'd want to
turn around and put a listener on the listener (there may be reasons to do
this, but I haven't hit one just yet).
You could, of course, just take the string that
SimpleTextExtractingPdfContentStreamProcessor generates and split it by
carriage returns - that would be the easiest thing.
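For reference, the "split it by carriage returns" option can be sketched in plain Java. This is only an illustration: the class name ExtractedTextLines is made up, and it assumes the extracted text is already available as a String (how you obtain it from the processor is not shown here).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of iText. */
final class ExtractedTextLines {
    /** Splits extracted text on \r\n, \r, or \n and drops blank lines. */
    static List<String> split(String extractedText) {
        List<String> lines = new ArrayList<>();
        for (String line : extractedText.split("\\r\\n|\\r|\\n")) {
            if (!line.trim().isEmpty()) {
                lines.add(line);
            }
        }
        return lines;
    }
}
```

Each returned line still mixes all columns together, which is why this only helps when the row structure alone is enough.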
If you need to do any fancier sort of handling, you'd be looking at
your own implementation of displayText(). The code that sets hardReturn =
true is the check you are asking about for determining the presence of a
hard return. The X position that a given string is going to be placed at is
computed in the currentX variable.
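A rough sketch of the callback idea, written without any iText types so it stands on its own. TextEventListener and LineTracker are hypothetical names, the tolerance parameter is an assumption, and this is not the code inside displayText(); a custom displayText() would be the place that calls text() with the x and y it has computed.

```java
/** Hypothetical callback interface; not part of iText. */
interface TextEventListener {
    void onHardReturn();
    void onText(String text, float x);
}

/** Hypothetical tracker that compares each string's y position with the previous one. */
final class LineTracker {
    private final TextEventListener listener;
    private final float tolerance; // assumed: y differences up to this count as the same line
    private Float lastY;           // null until the first string arrives

    LineTracker(TextEventListener listener, float tolerance) {
        this.listener = listener;
        this.tolerance = tolerance;
    }

    /** Call once per string, in content-stream order, with its starting position. */
    void text(String text, float x, float y) {
        if (lastY != null && Math.abs(y - lastY) > tolerance) {
            listener.onHardReturn();
        }
        listener.onText(text, x);
        lastY = y;
    }
}
```

Like the existing processor, this relies on content-stream order matching on-screen order, so it inherits the same weakness.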
Be very careful here, though - the strings passed in to
displayText() are *not* guaranteed to be (and in most cases are not) words.
It is quite possible to have 3 displayText() calls containing text from 5
words. Computing where we are supposed to put the spaces in between those
words is the hardest part of this parsing stuff - a very basic
implementation is in the 'else if (lastEndingTextMatrix != null)' portion of
the if-block. It works for many cases, but has not been exhaustively tested
(for example, I am *certain* that there are fonts out there that don't
specify a width for a space character).
Note that it's also possible for a single displayText() call to
contain multiple words, with a space in between them...
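To make the spacing problem concrete, here is a minimal sketch of one possible heuristic. It is not the logic in the 'else if (lastEndingTextMatrix != null)' branch: Chunk and ChunkJoiner are hypothetical names, the "gap larger than half a space width" threshold is an assumption, and the caller has to supply spaceWidth (with some fallback for fonts that do not define one).

```java
import java.util.List;

/** Hypothetical value type: one displayText()-style string and its horizontal extent. */
final class Chunk {
    final String text;
    final float startX;
    final float endX;

    Chunk(String text, float startX, float endX) {
        this.text = text;
        this.startX = startX;
        this.endX = endX;
    }
}

/** Hypothetical helper that rebuilds one line of text from its chunks. */
final class ChunkJoiner {
    /**
     * Joins chunks from a single line, left to right, inserting a space when the
     * gap between neighbours exceeds half of spaceWidth (an assumed threshold)
     * and neither side already carries a space.
     */
    static String join(List<Chunk> chunks, float spaceWidth) {
        StringBuilder out = new StringBuilder();
        Chunk previous = null;
        for (Chunk c : chunks) {
            if (previous != null) {
                float gap = c.startX - previous.endX;
                boolean alreadySpaced = previous.text.endsWith(" ") || c.text.startsWith(" ");
                if (gap > spaceWidth / 2f && !alreadySpaced) {
                    out.append(' ');
                }
            }
            out.append(c.text);
            previous = c;
        }
        return out.toString();
    }
}
```

With chunks "Tax R", "ate" and "Chart", where only the last one is separated by a visible gap, this yields "Tax Rate Chart": three calls, three words, and the boundaries do not line up.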
Welcome to PDF :-)
I think your best bet is to create your own stream processor, copy
the existing displayText() code and insert your special handling as needed.
At some point this whole thing will stabilize and you'll be able to
sub-class, etc... but not yet.
- K
----------------------- Original Message -----------------------
From: "Neil Aggarwal" <[EMAIL PROTECTED]>
To: "'Post all your questions about iText here'" <[email protected]>
Cc:
Date: Tue, 18 Nov 2008 18:25:00 -0600
Subject: Re: [iText-questions] Access tables from PDF file?
Kevin:
Actually, I have an idea. It seems to me that if I can get the
operations that are going on, such as a hard return and
the X position when it writes text, I can do what I need.
I am thinking a callback that
SimpleTextExtractingPdfContentStreamProcessor
invokes to let it know what is happening is the best approach.
What do you think?
Neil
________________________________
From: Kevin Day [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, November 18, 2008 3:32 PM
To: IText Questions
Subject: Re: [iText-questions] Access tables from PDF file?
Let me know if you need to bounce ideas...
- K
----------------------- Original Message -----------------------
From: "Neil Aggarwal" <[EMAIL PROTECTED]>
To: "'Post all your questions about iText here'" <[email protected]>
Cc:
Date: Tue, 18 Nov 2008 15:26:41 -0600
Subject: Re: [iText-questions] Access tables from PDF file?
Kevin:
Thank you for the tip. I think you are right that the
most straightforward solution is to look at the column
heading locations. I will try that approach first.
Thanks,
Neil
________________________________
From: Kevin Day [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, November 18, 2008 3:07 PM
To: IText Questions
Subject: Re: [iText-questions] Access tables from PDF file?
Code executing in
SimpleTextExtractingPdfContentStreamProcessor.displayText() has all the
information you need to determine where on the page a given text string is
going to be placed. You should be able to use that information to create
multiple text outputs instead of a single text output (one output per
column). You'll probably wind up writing your own custom
PdfContentStreamProcessor implementation.
The challenge will be determining where the columns
start and stop. If you are processing a ton of files that all have the same
format, you may be able to hard code these.
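If the column positions can be hard coded, the "one output per column" idea might look roughly like this. FixedColumns is a hypothetical class, not an iText type, and the x coordinates passed to it are whatever you measure from your own files; a custom displayText() would call add() with each string and its starting x.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that routes each string to one text output per column. */
final class FixedColumns {
    private final float[] leftEdges; // ascending x coordinates where each column starts
    private final List<StringBuilder> outputs = new ArrayList<>();

    FixedColumns(float... leftEdges) {
        this.leftEdges = leftEdges.clone();
        for (int i = 0; i < leftEdges.length; i++) {
            outputs.add(new StringBuilder());
        }
    }

    /** Index of the last column whose left edge is at or before x (0 if x is left of all). */
    int columnFor(float x) {
        int column = 0;
        for (int i = 0; i < leftEdges.length; i++) {
            if (x >= leftEdges[i]) {
                column = i;
            }
        }
        return column;
    }

    /** Appends the string to the output of the column that contains x. */
    void add(String text, float x) {
        outputs.get(columnFor(x)).append(text).append('\n');
    }

    String columnText(int column) {
        return outputs.get(column).toString();
    }
}
```

For example, new FixedColumns(72f, 250f, 400f) defines three columns, and a string starting at x = 255 lands in the second one.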
Otherwise, I suppose that you could add a first pass
PdfContentStreamProcessor that looks for the draw operations that create the
vertical lines of the table, and construct your columns using those. That
might be trickier... If you are going to go down this path, you'll need to
add some additional ContentOperators to PdfContentStreamProcessor. If you
need to do this, let me know and I'll add some methods to
PdfContentStreamProcessor to allow you to add additional operators from
sub-classes (not allowed right now).
You might also be able to look for the column header
text, and figure out the column margins from that - a bit less accurate, but
might work (and probably would be a lot easier than trying to figure out
*which* vertical line draw operations are for column borders).
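One way the header-text approach could be sketched: record the horizontal centre of each column header on a first pass, then place column boundaries midway between neighbouring headers. HeaderColumns is a hypothetical name, and using midpoints between header centres is an assumption; it will misplace values in tables whose cell text is aligned very differently from the header.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper that derives column boundaries from header positions. */
final class HeaderColumns {
    /**
     * Given the x coordinate of each header's centre, returns boundaries placed
     * midway between neighbouring headers (one fewer boundary than headers).
     */
    static float[] boundaries(List<Float> headerCenterXs) {
        List<Float> sorted = new ArrayList<>(headerCenterXs);
        Collections.sort(sorted);
        float[] result = new float[Math.max(0, sorted.size() - 1)];
        for (int i = 0; i < result.length; i++) {
            result[i] = (sorted.get(i) + sorted.get(i + 1)) / 2f;
        }
        return result;
    }

    /** Column index for a string whose centre is at x; boundaries must be ascending. */
    static int columnFor(float x, float[] boundaries) {
        int column = 0;
        while (column < boundaries.length && x >= boundaries[column]) {
            column++;
        }
        return column;
    }
}
```

Headers centred at 100, 300 and 500 give boundaries at 200 and 400, so a value centred at 450 is assigned to the third column.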
- K
----------------------- Original Message -----------------------
From: "Neil Aggarwal" <[EMAIL PROTECTED]>
To: <[email protected]>
Cc:
Date: Tue, 18 Nov 2008 14:06:46 -0600
Subject: [iText-questions] Access tables from PDF file?
Hello:
I am trying to use iText to read the values from
this file:
http://www.bcad.org/PDFs/TAX-RATE-CHARTS%202008.pdf
I ran the text extractor on the file and it gave
me all the text, as I expected.
I am trying to figure out whether it is possible for me
to step through the tables and pull the text myself,
since I need to know which columns the values
come from.
I am investigating and hoping someone can tell me if
I am barking up the wrong tree.
Thanks,
Neil
-------------------------------------------------------------------------
This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
Build the coolest Linux based applications with Moblin SDK & win great prizes
Grand prize is a trip for two to an Open Source event anywhere in the world
http://moblin-contest.org/redirect.php?banner_id=100&url=/
_______________________________________________
iText-questions mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/itext-questions
Buy the iText book: http://www.1t3xt.com/docs/book.php