When I ran the code through reader.getContent(398)
and saved the results to a file, it produced the output below.
Obviously, this is a start, but I need to get the text on a line-by-line
basis.
So I loaded reader.getPageN(398) into a PDFDictionary, but now I am
stuck, as my limited knowledge of PDF shines through.  My guess is that
there is a way to iterate through the dictionary and get what I want
(like Bruno showed me how to do with the AcroForm.Fields). Any code to
do that would be great.

0 g
1 i 
/GS0 gs
BT
/T1_0 1 Tf
0.0023 Tc 0 Tw 0  Ts 100  Tz 0 Tr 9.96001 0 0 9.96001 50.40001 50.28011
Tm
[( )-2108(Publication 1346 )-2108(             August 30, 2005 )-7229(
)-2259(Part 2 Page 12 )]TJ
/TT0 1 Tf
0 Tc 0 -1.2048 TD
( )Tj
/T1_0 1 Tf
0.0023 Tc 0 67.6506 TD
(           FORM 1040 PAGE 1             U.S. Individual Income Tax
Retur\
n )Tj
0 Tc 0 -1.0663 TD
( )Tj
0.0023 Tc T*
(           Field Identification         Form       Length  Field
Descrip\
tion )Tj
T*
(           No.                          Ref. )Tj
T*
(           ----- --------------         ----       ------
-------------\
---- )Tj
0 Tc T*
( )Tj
0.0023 Tc T*
(                 Byte Count                           4    "1450" for
Fi\
xed;         | )Tj
T*
(                                                           "nnnn" for
va\
riable )Tj
T*
(                                                           format )Tj
0 Tc T*
( )Tj
0.0023 Tc T*
(                 Start of Record Sentinel             4    Value "****"
\
)Tj
0 Tc T*
( )Tj
0.0023 Tc T*
(           0000  Record ID                            6    "RETbbb" )Tj
T*
(                                                            )Tj
T*
(           0001  Type                                 6    "1040bb" )Tj
T*
(                                                            )Tj
T*
(           0002  Page Number                          5    "PG01b" )Tj
T*
(                                                            )Tj
T*
(           0003  Taxpayer                             9    N \(Primary
S\
SN\) )Tj
T*
(                 Identification                             )Tj
T*
(                 Number                                     )Tj
T*
(                                                            )Tj
T*
(           0004  Filler                               1    blank )Tj
T*
(                                                            )Tj
T*
(           0005  Tax Period                           6    Value
"200512\
", YYYYMM | )Tj
T*
(                                                            )Tj
T*
(           0006  Filler                               1    blank )Tj
T*
(                                                            )Tj
T*
(           0007  Return Sequence                     16    N )Tj
T*
(                 Number                                     )Tj
T*
(                                                            )Tj
T*
(           0008  Declaration Control                 14    N )Tj
T*
(                 Number                                     )Tj
T*
(                                                            )Tj
T*
(           0010  Primary SSN                          9    N \(Your
Soci\
al )Tj
T*
(                                                           Security
Numb\
er\) )Tj
T*
(                                                            )Tj
T*
(           0020  Primary Date of                      8    YYYYMMDD or
b\
lank )Tj
T*
(                 Death                                      )Tj
T*
(                                                            )Tj
T*
(           0030  Secondary SSN                        9    N or blank
)Tj
T*
(                                                            )Tj
T*
(           0040  Secondary Date of                    8    YYYYMMDD or
b\
lank )Tj
T*
(                 Death                                      )Tj
T*
(                                                            )Tj
T*
(           0050  Primary Name Control                 4    First 4
signi\
ficant )Tj
T*
(                                                           characters
of\
 taxpayer's )Tj
T*
(                                                           last name,
no\
 leading or )Tj
T*
(                                                           embedded
spac\
es; )Tj
T*
(                                                           allowable
cha\
racters are )Tj
T*
(                                                           alpha,
hyphen\
 or space )Tj
T*
(                                                           \(see
special\
 )Tj
T*
(
instructions\)\
 )Tj
T*
(                                                            )Tj
T*
(                                                                    )Tj
ET
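
For what it is worth, here is a rough sketch of one way the stream above
could be split into lines without going through the dictionary at all. It
is plain Java with no iText calls; the class and method names
(ContentStreamLines, extractLines) are made up for illustration. It
assumes the string operands are directly readable text, as they appear to
be in this dump, and it treats every T*, Td, TD, Tm and ET as a line break,
which is a simplification. Kerning numbers inside TJ arrays, hex strings
and octal escapes are ignored.

```java
import java.util.ArrayList;
import java.util.List;

public class ContentStreamLines {

    /** Splits a decoded page content stream into lines of text (simplified). */
    public static List<String> extractLines(String content) {
        List<String> lines = new ArrayList<>();
        StringBuilder line = new StringBuilder();    // text shown since the last line break
        StringBuilder pending = new StringBuilder(); // string operands waiting for Tj/TJ
        int i = 0;
        int n = content.length();
        while (i < n) {
            char c = content.charAt(i);
            if (c == '(') {
                i = readString(content, i + 1, pending);
            } else if (Character.isWhitespace(c) || c == '[' || c == ']') {
                i++; // array brackets are skipped; their strings are still collected
            } else {
                int start = i;
                while (i < n && !Character.isWhitespace(content.charAt(i))
                        && "([]".indexOf(content.charAt(i)) < 0) {
                    i++;
                }
                String token = content.substring(start, i);
                switch (token) {
                    case "Tj": case "TJ":           // show text
                        line.append(pending);
                        pending.setLength(0);
                        break;
                    case "'": case "\"":            // move to next line, then show text
                        flush(line, lines);
                        line.append(pending);
                        pending.setLength(0);
                        break;
                    case "T*": case "Td": case "TD": case "Tm": case "ET":
                        flush(line, lines);         // assumption: any move starts a new line
                        pending.setLength(0);
                        break;
                    default:
                        // Numbers may sit between strings in a TJ array; keep pending then.
                        if (!token.matches("[-+]?(\\d+\\.?\\d*|\\.\\d+)")) {
                            pending.setLength(0);
                        }
                }
            }
        }
        flush(line, lines);
        return lines;
    }

    /** Reads a literal string starting just after '(' and returns the index after ')'. */
    private static int readString(String s, int i, StringBuilder out) {
        int depth = 1;
        while (i < s.length()) {
            char c = s.charAt(i++);
            if (c == '\\' && i < s.length()) {
                char e = s.charAt(i++);
                if (e == '\r') {                    // backslash + CR(LF): line continuation
                    if (i < s.length() && s.charAt(i) == '\n') i++;
                } else if (e == '\n') {
                    // backslash + LF: line continuation, nothing to append
                } else if (e == 'n') {
                    out.append('\n');
                } else if (e == 'r') {
                    out.append('\r');
                } else if (e == 't') {
                    out.append('\t');
                } else {
                    out.append(e);                  // covers \( \) and \\
                }
            } else if (c == '(') {
                depth++;
                out.append(c);
            } else if (c == ')') {
                if (--depth == 0) return i;
                out.append(c);
            } else {
                out.append(c);
            }
        }
        return i;
    }

    /** Stores the current line, minus trailing whitespace, unless it is blank. */
    private static void flush(StringBuilder line, List<String> lines) {
        String text = line.toString().replaceAll("\\s+$", "");
        if (!text.isEmpty()) lines.add(text);
        line.setLength(0);
    }
}
```

The input would be the page 398 content bytes decoded to a String. A more
careful version would track the text matrix to decide whether a Td/TD/Tm
really moves to a new line.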

-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of
Richard Braman
Sent: Tuesday, February 14, 2006 1:35 PM
To: itext-questions@lists.sourceforge.net
Subject: [iText-questions] Reading and Extracting Text from PDF


I have an open source project that is attempting to structure
IRS-produced documents, such as publications and instructions, and parse
out data that is critical to building tax software. An example of such a
file is http://www.irs.gov/pub/irs-pdf/p1346.pdf.
This file contains e-file record layouts, which start on page 398.  They
used to publish this as text, which made parsing relatively easy, but now
it comes in PDF only, and the project needs good open source parsing
technology.  Is iText the right tool for this job?  I have seen it do
good work on parsing the metadata contained in IRS fill-in forms.
 
 
Richard Braman
mailto:[EMAIL PROTECTED]
561.748.4002 (voice) 

http://www.taxcodesoftware.org
Free Open Source Tax Software



-------------------------------------------------------
This SF.net email is sponsored by: Splunk Inc. Do you grep through log
files for problems?  Stop!  Download the new AJAX search engine that
makes searching your log files as easy as surfing the  web.  DOWNLOAD
SPLUNK!
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=103432&bid=230486&dat=121642
_______________________________________________
iText-questions mailing list iText-questions@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/itext-questions


