Good suggestion.  Done and committed to HEAD.
 
I wound up making the member variables of the PdfContentStreamProcessor private (leaving them package exposed was just laziness on my part during initial development) - hopefully that doesn't break anything you have written.
 
I may decide that a better approach here would be to expose a text positioning state (that would itself contain the matrices) - not sure if that will be necessary or not.
 
- K
 
----------------------- Original Message -----------------------
  
From: "Neil Aggarwal" <[EMAIL PROTECTED]>
To: "'Post all your questions about iText here'" <[email protected]>
Cc: 
Date: Wed, 19 Nov 2008 12:32:40 -0600
Subject: Re: [iText-questions] Access tables from PDF file?
  
Kevin:

I wrote a class that extends PdfContentStreamProcessor
and implemented my own displayText method.  I am able
to get the text and x and y coordinates.  That gives me
what I need.

The only problem was that since my class is not
in the com.lowagie.text.pdf.parser package, I had
to add a protected accessor method for the textMatrix
member in PdfContentStreamProcessor.  

You may want to consider adding that to the mainline code
so people can write their own subclasses without having to
rebuild iText.

Thank you for the help.

    Neil


--
Neil Aggarwal, (832)245-7314, www.JAMMConsulting.com
Eliminate junk email and reclaim your inbox.
Visit http://www.spammilter.com <http://www.spammilter.com/>  for details.




________________________________

    From: Kevin Day [mailto:[EMAIL PROTECTED]
    Sent: Tuesday, November 18, 2008 7:10 PM
    To: IText Questions; 'Post all your questions about iText here'
    Subject: Re: [iText-questions] Access tables from PDF file?
    
    
        Well, SimpleTextExtractingPdfContentStreamProcessor is just
a listener on the parser.  It determines whether a carriage return is
appropriate given the vertical position of the current text relative to the
previous text (this, by the way, is *not* a robust solution - it's quite
possible for text to be placed in the content stream in a different order than it appears on screen - but it seems to work for many PDF files - at
least the ones I need to process :-)  ).  I'm not sure that you'd want to
turn around and put a listener on the listener (there may be reasons to do
this, but I haven't hit one just yet).
     
    You could, of course, just take the string that
SimpleTextExtractingPdfContentStreamProcessor generates and split it by
carriage returns - that would be the easiest thing.
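
A minimal sketch of that easiest route, assuming the extractor's output is already available as a plain Java String. The class name ExtractedTextLines and the sample text are made up for illustration and are not part of iText.

```java
import java.util.ArrayList;
import java.util.List;

class ExtractedTextLines {
    /** Splits extracted text on CRLF, CR, or LF and drops blank lines. */
    static List<String> split(String extracted) {
        List<String> lines = new ArrayList<String>();
        for (String line : extracted.split("\\r\\n|\\r|\\n")) {
            if (line.trim().length() > 0) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Stand-in for the string produced by the text extractor (hypothetical data).
        String extracted = "Jurisdiction  Rate\nCity A  0.57\n\nCity B  0.61\n";
        for (String line : split(extracted)) {
            System.out.println(line);
        }
    }
}
```

Each returned line still has to be broken into columns, which plain splitting cannot do reliably because the position information is already gone at that point.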
     
    If you need to do any fancier sort of handling, you'd be looking at
your own implementation of displayText().  The code that sets hardReturn =
true is the check you are asking about for determining the presence of a
hard return.  The X position that a given string is going to be placed at is
computed in the currentX variable.
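
The hard-return check described above can be sketched roughly as follows. This is not the iText code: LineBreakDetector, its tolerance parameter, and the idea of passing in a bare Y coordinate are assumptions made to keep the example self-contained. In a real displayText() override the Y value would come from the current text matrix.

```java
class LineBreakDetector {
    private final float tolerance; // how far Y may drift before we call it a new line
    private Float lastY;           // null until the first string has been seen

    LineBreakDetector(float tolerance) {
        this.tolerance = tolerance;
    }

    /** Returns true when a string at this Y starts a new line relative to the previous string. */
    boolean isHardReturn(float y) {
        boolean hardReturn = lastY != null && Math.abs(y - lastY.floatValue()) > tolerance;
        lastY = Float.valueOf(y);
        return hardReturn;
    }

    public static void main(String[] args) {
        LineBreakDetector detector = new LineBreakDetector(0.5f);
        float[] ys = {700f, 700f, 688f}; // hypothetical Y positions of three strings
        for (float y : ys) {
            System.out.println(y + " hardReturn=" + detector.isHardReturn(y));
        }
    }
}
```

As noted earlier in the thread, this only works when strings arrive in roughly reading order; a content stream that draws text out of order will defeat it.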
     
    Be very careful here, though - the strings passed in to
displayText() are *not* guaranteed to be (and in most cases are not) words.
It is quite possible to have 3 displayText() calls containing text from 5
words.  Computing where we are supposed to put the spaces in between those
words is the hardest part of this parsing stuff - a very basic
implementation is in the 'else if (lastEndingTextMatrix != null)' portion of
the if-block.  It works for many cases, but has not been exhaustively tested
(for example, I am *certain* that there are fonts out there that don't
specify a width for a space character).
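
One way to picture the space-insertion problem is the sketch below. It is an illustration under stated assumptions, not the implementation in iText: ChunkJoiner is a made-up class, the caller is assumed to supply the start and end X of every string plus a usable space width, and the "gap larger than half a space" threshold is an arbitrary choice for the example.

```java
class ChunkJoiner {
    private final StringBuilder text = new StringBuilder();
    private final float spaceWidth;      // width of a space in the same units as X
    private float lastEndX = Float.NaN;  // NaN until the first string is appended

    ChunkJoiner(float spaceWidth) {
        this.spaceWidth = spaceWidth;
    }

    /** Appends a string, inserting a space first if the gap since the last string is wide enough. */
    void append(String chunk, float startX, float endX) {
        // Assumed threshold: a gap wider than half a space counts as a word break.
        if (!Float.isNaN(lastEndX) && startX - lastEndX > spaceWidth * 0.5f) {
            text.append(' ');
        }
        text.append(chunk);
        lastEndX = endX;
    }

    String result() {
        return text.toString();
    }

    public static void main(String[] args) {
        ChunkJoiner joiner = new ChunkJoiner(3f);
        // Three strings that together spell two words (hypothetical positions).
        joiner.append("Ta", 10f, 20f);
        joiner.append("x", 20f, 25f);
        joiner.append("rate", 28f, 48f);
        System.out.println(joiner.result());
    }
}
```

If the font reports no width for the space character, the caller has to pick a fallback (for example a fraction of the font size), which is exactly the kind of case mentioned above as untested.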
     
    Note that it's also possible for a single displayText() call to
contain multiple words, with a space in between them...
     
    Welcome to PDF :-)
     
    I think your best bet is to create your own stream processor, copy
the existing displayText() code and insert your special handling as needed.
At some point this whole thing will stabilize and you'll be able to
sub-class, etc... but not yet.
     
    - K
    
     
    ----------------------- Original Message -----------------------
      
    From: "Neil Aggarwal" <[EMAIL PROTECTED]>
<mailto:[EMAIL PROTECTED]>
    To: "'Post all your questions about iText here'"
<[email protected]>
<mailto:[email protected]>
    Cc:
    Date: Tue, 18 Nov 2008 18:25:00 -0600
    Subject: Re: [iText-questions] Access tables from PDF file?
      
    Kevin:
     
    Actually, I have an idea.  It seems to me that if I can get the
    operations that are going on, such as a hard return and
    the X position when it writes text, I can do what I need.
     
    I am thinking a callback that
SimpleTextExtractingPdfContentStreamProcessor
    invokes to let it know what is happening is the best approach.
     
    What do you think?
     
        Neil
     

    --
    Neil Aggarwal, (832)245-7314, www.JAMMConsulting.com
    Eliminate junk email and reclaim your inbox.
    Visit http://www.spammilter.com <http://www.spammilter.com/>  for
details.

     


________________________________

        From: Kevin Day [mailto:[EMAIL PROTECTED]
        Sent: Tuesday, November 18, 2008 3:32 PM
        To: IText Questions
        Subject: Re: [iText-questions] Access tables from PDF file?
        
        
                Let me know if you need to bounce ideas...
         
        - K
        
         
        ----------------------- Original Message -----------------------
          
        From: "Neil Aggarwal" <[EMAIL PROTECTED]>
<mailto:[EMAIL PROTECTED]>
        To: "'Post all your questions about iText here'"
<[email protected]>
<mailto:[email protected]>
        Cc:
        Date: Tue, 18 Nov 2008 15:26:41 -0600
        Subject: Re: [iText-questions] Access tables from PDF file?
          
        Kevin:
         
        Thank you for the tip.  I think you are right in that the
most straightforward solution
        is to look at the column heading locations. I will try that
approach first.
         
        Thanks,    
            Neil
         

        --
        Neil Aggarwal, (832)245-7314, www.JAMMConsulting.com
        Eliminate junk email and reclaim your inbox.
        Visit http://www.spammilter.com <http://www.spammilter.com/>
for details.

         


________________________________

            From: Kevin Day [mailto:[EMAIL PROTECTED]
            Sent: Tuesday, November 18, 2008 3:07 PM
            To: IText Questions
            Subject: Re: [iText-questions] Access tables from
PDF file?
            
                        Code executing in
SimpleTextExtractingPdfContentStreamProcessor.displayText() has all the
information you need to determine where on the page a given text string is
going to be placed.  You should be able to use that information to create
multiple text outputs instead of a single text output (one output per
column).  You'll probably wind up writing your own custom
PdfContentStreamProcessor implementation.
             
            The challenge will be determining where the columns
start and stop.  If you are processing a ton of files that all have the same
format, you may be able to hard code these.
             
            Otherwise, I suppose that you could add a first pass
PdfContentStreamProcessor that looks for the draw operations that create the
vertical lines of the graph, and construct your columns using those.  That
might be trickier...  If you are going to go down this path, you'll need to
add some additional ContentOperators to PdfContentStreamProcessor.  If you
need to do this, let me know and I'll add some methods to
PdfContentStreamProcessor to allow you to add additional operators from
sub-classes (not allowed right now).
             
            You might also be able to look for the column header
text, and figure out the column margins from that - a bit less accurate, but
might work (and probably would be a lot easier than trying to figure out
*which* vertical line draw operations are for column borders).
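
A rough sketch of the column-header idea, assuming the X position of each header string has already been captured during extraction. ColumnLocator and the sample coordinates are hypothetical. The rule used here (a value belongs to the last column whose header starts at or to the left of it) is just one simple choice.

```java
import java.util.Arrays;

class ColumnLocator {
    private final float[] columnStarts; // left edges of the columns, ascending

    ColumnLocator(float[] headerX) {
        this.columnStarts = headerX.clone();
        Arrays.sort(this.columnStarts);
    }

    /** Index of the last column whose left edge is at or left of x (0 if x is left of every column). */
    int columnFor(float x) {
        int column = 0;
        for (int i = 0; i < columnStarts.length; i++) {
            if (x >= columnStarts[i]) {
                column = i;
            }
        }
        return column;
    }

    public static void main(String[] args) {
        // Hypothetical X positions of three column headers.
        ColumnLocator locator = new ColumnLocator(new float[] {50f, 200f, 350f});
        System.out.println("x=210 is in column " + locator.columnFor(210f));
    }
}
```

Right-aligned or centered values can start to the left of their header, so real tables may need margins around each edge or a comparison against the midpoint between neighboring headers.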
             
            - K
             
            ----------------------- Original Message -----------------------
              
            From: "Neil Aggarwal" <[EMAIL PROTECTED]>
<mailto:[EMAIL PROTECTED]>
            To: <[email protected]>
<mailto:[email protected]>
            Cc:
            Date: Tue, 18 Nov 2008 14:06:46 -0600
            Subject: [iText-questions] Access tables from PDF
file?
              
                        Hello:
            
             I am trying to use iText to read the values from
this file:
            http://www.bcad.org/PDFs/TAX-RATE-CHARTS%202008.pdf
<http://www.bcad.org/PDFs/TAX-RATE-CHARTS%202008.pdf>
            
            I used the text extractor on the file and it gave
            me all the text which I expected.  
            
            I am trying to figure out if it is possible for me
            to step through the tables and pull the text myself
            since I need to know which columns the values
            come from.
            
            I am investigating and hoping someone can tell me if
            I am barking up the wrong tree.
            
            Thanks,
                Neil
             
            --
            Neil Aggarwal, (832)245-7314, www.JAMMConsulting.com
<http://www.JAMMConsulting.com>
            Eliminate junk email and reclaim your inbox.
            Visit http://www.spammilter.com
<http://www.spammilter.com>  for details.
            
            
            -------------------------------------------------------------------------
            This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
            Build the coolest Linux based applications with Moblin SDK & win great prizes
            Grand prize is a trip for two to an Open Source event anywhere in the world
            http://moblin-contest.org/redirect.php?banner_id=100&url=/
            _______________________________________________
            iText-questions mailing list
            [email protected]
            https://lists.sourceforge.net/lists/listinfo/itext-questions

            Buy the iText book: http://www.1t3xt.com/docs/book.php
            
            
    