When I called reader.getPageContent(398) and saved the results to a file, it produced the output below. Obviously this is a start, but I need to get the text on a line-by-line basis. So I loaded reader.getPageN(398) into a PdfDictionary, but now I am stuck, as my limited knowledge of PDF shines through. My guess is that there is a way to iterate through the dictionary and get what I want (like Bruno showed me how to do with the AcroForm.Fields); any code to do that would be great.
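To make the line-by-line goal concrete, here is a minimal sketch of one possible approach that does not go through the page dictionary at all. It scans the content stream text and collects the strings shown by the Tj, TJ, ' and " operators, starting a new line whenever it meets T*, Td, TD, Tm or ET. It is plain Java, not iText API; the class name ContentStreamLines and the method extractLines are hypothetical. It assumes a simple single-byte font whose codes match the characters, treats every Td/TD as a line break even if only the horizontal position changes, ignores the spacing numbers inside TJ arrays, and does not handle hex strings or inline images.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of iText: splits a page content stream into text lines.
class ContentStreamLines {

    /** Returns the text shown by a content stream, one entry per text line. */
    static List<String> extractLines(String stream) {
        List<String> lines = new ArrayList<>();
        StringBuilder line = new StringBuilder();      // text shown on the current line
        StringBuilder operands = new StringBuilder();  // strings seen since the last operator
        boolean shown = false;                         // has the current line received text?
        int i = 0;
        int n = stream.length();
        while (i < n) {
            char c = stream.charAt(i);
            if (c == '(') {                            // literal string operand
                i = readString(stream, i, operands);
                continue;
            }
            if (Character.isWhitespace(c) || c == '[' || c == ']' || c == ')') {
                i++;                                   // separators and array brackets
                continue;
            }
            int start = i;
            while (i < n && !Character.isWhitespace(stream.charAt(i))
                    && "()[]".indexOf(stream.charAt(i)) < 0) {
                i++;
            }
            String tok = stream.substring(start, i);
            char first = tok.charAt(0);
            if (!Character.isLetter(first) && first != '\'' && first != '"') {
                continue;                              // a number or /Name: still an operand
            }
            boolean breaksLine = tok.equals("T*") || tok.equals("Td") || tok.equals("TD")
                    || tok.equals("Tm") || tok.equals("ET")
                    || tok.equals("'") || tok.equals("\"");
            boolean showsText = tok.equals("Tj") || tok.equals("TJ")
                    || tok.equals("'") || tok.equals("\"");
            if (breaksLine && shown) {                 // close the line that was in progress
                lines.add(line.toString());
                line.setLength(0);
                shown = false;
            }
            if (showsText) {
                line.append(operands);
                shown = true;
            }
            operands.setLength(0);                     // operands belong to one operator only
        }
        if (shown) {
            lines.add(line.toString());
        }
        return lines;
    }

    /** Appends the string starting at s[open] == '(' to out; returns the index after its ')'. */
    private static int readString(String s, int open, StringBuilder out) {
        int depth = 1;
        int i = open + 1;
        while (i < s.length() && depth > 0) {
            char c = s.charAt(i++);
            if (c == '\\' && i < s.length()) {
                char e = s.charAt(i++);
                if (e == 'n') out.append('\n');
                else if (e == 'r') out.append('\r');
                else if (e == 't') out.append('\t');
                else if (e == 'b') out.append('\b');
                else if (e == 'f') out.append('\f');
                else if (e == '\n') { /* backslash + newline: line continuation */ }
                else if (e == '\r') { if (i < s.length() && s.charAt(i) == '\n') i++; }
                else if (e >= '0' && e <= '7') {       // octal escape, up to three digits
                    int v = e - '0';
                    for (int k = 0; k < 2 && i < s.length()
                            && s.charAt(i) >= '0' && s.charAt(i) <= '7'; k++) {
                        v = v * 8 + (s.charAt(i++) - '0');
                    }
                    out.append((char) (v & 0xFF));
                } else out.append(e);                  // \( \) \\ and unknown escapes
            } else if (c == '(') {
                depth++;
                out.append(c);
            } else if (c == ')') {
                if (--depth > 0) out.append(c);
            } else {
                out.append(c);
            }
        }
        return i;
    }

    public static void main(String[] args) {
        String sample = "BT /T1_0 1 Tf 0 -1.2 TD ( 0003 Taxpayer 9 N \\(Primary SSN\\) )Tj "
                + "T* ( Identification )Tj T* ( Number )Tj ET";
        for (String l : extractLines(sample)) {
            System.out.println("[" + l + "]");
        }
    }
}
```

If the input comes from the bytes of the page content, decoding them as ISO-8859-1 before calling extractLines keeps one character per byte; that is an assumption that only suits simple fonts like the ones this page appears to use. The column spacing still has to be recovered from the spaces inside each string, since the sketch does no layout.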
0 g 1 i /GS0 gs
BT
/T1_0 1 Tf 0.0023 Tc 0 Tw 0 Ts 100 Tz 0 Tr 9.96001 0 0 9.96001 50.40001 50.28011 Tm
[( )-2108(Publication 1346 )-2108( August 30, 2005 )-7229( )-2259(Part 2 Page 12 )]TJ
/TT0 1 Tf 0 Tc 0 -1.2048 TD ( )Tj
/T1_0 1 Tf 0.0023 Tc 0 67.6506 TD ( FORM 1040 PAGE 1 U.S. Individual Income Tax Retur\ n )Tj
0 Tc 0 -1.0663 TD ( )Tj
0.0023 Tc T* ( Field Identification Form Length Field Descrip\ tion )Tj
T* ( No. Ref. )Tj
T* ( ----- -------------- ---- ------ -------------\ ---- )Tj
0 Tc T* ( )Tj
0.0023 Tc T* ( Byte Count 4 "1450" for Fi\ xed; | )Tj
T* ( "nnnn" for va\ riable )Tj
T* ( format )Tj
0 Tc T* ( )Tj
0.0023 Tc T* ( Start of Record Sentinel 4 Value "****" \ )Tj
0 Tc T* ( )Tj
0.0023 Tc T* ( 0000 Record ID 6 "RETbbb" )Tj
T* ( )Tj
T* ( 0001 Type 6 "1040bb" )Tj
T* ( )Tj
T* ( 0002 Page Number 5 "PG01b" )Tj
T* ( )Tj
T* ( 0003 Taxpayer 9 N \(Primary S\ SN\) )Tj
T* ( Identification )Tj
T* ( Number )Tj
T* ( )Tj
T* ( 0004 Filler 1 blank )Tj
T* ( )Tj
T* ( 0005 Tax Period 6 Value "200512\ ", YYYYMM | )Tj
T* ( )Tj
T* ( 0006 Filler 1 blank )Tj
T* ( )Tj
T* ( 0007 Return Sequence 16 N )Tj
T* ( Number )Tj
T* ( )Tj
T* ( 0008 Declaration Control 14 N )Tj
T* ( Number )Tj
T* ( )Tj
T* ( 0010 Primary SSN 9 N \(Your Soci\ al )Tj
T* ( Security Numb\ er\) )Tj
T* ( )Tj
T* ( 0020 Primary Date of 8 YYYYMMDD or b\ lank )Tj
T* ( Death )Tj
T* ( )Tj
T* ( 0030 Secondary SSN 9 N or blank )Tj
T* ( )Tj
T* ( 0040 Secondary Date of 8 YYYYMMDD or b\ lank )Tj
T* ( Death )Tj
T* ( )Tj
T* ( 0050 Primary Name Control 4 First 4 signi\ ficant )Tj
T* ( characters of\ taxpayer's )Tj
T* ( last name, no\ leading or )Tj
T* ( embedded spac\ es; )Tj
T* ( allowable cha\ racters are )Tj
T* ( alpha, hyphen\ or space )Tj
T* ( \(see special\ )Tj
T* ( instructions\)\ )Tj
T* ( )Tj
T* ( )Tj
ET

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Richard Braman
Sent: Tuesday, February 14, 2006 1:35 PM
To: itext-questions@lists.sourceforge.net
Subject: [iText-questions] Reading
and Extracting Text from PDF

I have an open source project that is attempting to structure IRS-produced documents, such as publications and instructions, and parse out data that is critical to building tax software. An example of such a file is http://www.irs.gov/pub/irs-pdf/p1346.pdf. This file contains e-file record layouts, which start on page 398. The IRS used to publish this as text, which made parsing relatively easy, but now it comes in PDF only, and the project needs good open source parsing technology. Is iText the right tool for this job? I have seen it do good work parsing the metadata contained in IRS fill-in forms.

Richard Braman
mailto:[EMAIL PROTECTED]
561.748.4002 (voice)
http://www.taxcodesoftware.org
Free Open Source Tax Software

_______________________________________________
iText-questions mailing list
iText-questions@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/itext-questions