Going by that particular output sample, you should be able to parse out the text relatively easily.
Anything between '(' and ')' is text (note that there are several escapes that can be used; see the PDF Reference on the String object for more details). "T*" and "TD" are the PDF equivalent of an EOL: TD takes an explicit offset from the start of the previous line, while T* takes no parameters and reuses the current leading (which TD also sets). If the PDFs stick to this format, your job is pretty simple, at least by comparison.

To be a bit more cautious, you might specifically parse out the various other operators, looking for the 'Tm' operator (seen here only at the beginning), which moves the text insertion point to an arbitrary location. You should be able to ignore anything outside BT and ET (begin text, end text).

The following operators are of no consequence to plain-text extraction:

  parm1 parm2 Tf
  parm1 Tc
  parm1 Tw
  parm1 Tr
  parm1 Ts
  parm1 Tz

You'll note that PDF's operators are preceded by their parameters: "Reverse Polish Notation". It's easy to parse and process with a stack, though it takes some getting used to.

As for the ones you'll care about:

  parm1 Tj
  parm1 TJ
    For Tj, the parameter is a single string, wrapped in () or, for hex values, <>. For TJ (which is what the first line of your sample uses), it's an array wrapped in [], holding strings and, optionally, numbers for spacing information (usually kerning).

  T*
    No parameters.

  parm1 parm2 TD
    The parameters provide a horizontal and vertical offset for the placement of the next line.

  parm1 ... parm6 Tm
    This sets the current Text Matrix (it replaces the old one rather than being combined with it). The six parameters define a 3x3 transformation matrix like so (see section 5.3.1 of the PDF Reference):

      p1 p2 0
      p3 p4 0
      p5 p6 1

That won't count for much if you're not up on matrix math, which isn't really necessary in your case. The Reference goes into some general transformation details in section 4.2.2 if you're curious. The web is also peppered with matrix math tutorials, primarily in regard to 3D graphics (where understanding such things is nigh-unavoidable).

PS: I mentioned PDF Structure earlier. This content stream doesn't have any, so no luck there.

PPS: iText has a good content tokenizer already.
That's 'PRTokeniser' (note the spelling of the class name), which is used by AcroFields.splitDAelements() to parse the sort of content stream you'll be working with. It's also used extensively by PdfReader.

--Mark Storer
  Senior Software Engineer
  Cardiff Software

#include <disclaimer>
typedef std::Disclaimer<Cardiff> DisCard;

> -----Original Message-----
> From: [EMAIL PROTECTED] On Behalf Of Richard Braman
> Sent: Tuesday, February 14, 2006 1:32 PM
> To: itext-questions@lists.sourceforge.net
> Subject: RE: [iText-questions] Reading and Extracting Text from PDF
>
>
> When I ran the code through reader.getCOntent(398)
> and saved the results to a file, it produced the output below.
> Obviously, this is a start, but I need to get text on a line-by-line
> basis.
> So I loaded reader.getPageN(398) into a PDFDictionary, but now I am
> stuck as my limited knowledge of PDF shines through. My guess is that
> there is a way to iterate through the dictionary and get what I want
> (like Bruno showed me how to do with the AcroForm.Fields); any code to
> do that would be great.
>
> 0 g
> 1 i
> /GS0 gs
> BT
> /T1_0 1 Tf
> 0.0023 Tc 0 Tw 0 Ts 100 Tz 0 Tr 9.96001 0 0 9.96001
> 50.40001 50.28011
> Tm
> [( )-2108(Publication 1346 )-2108( August 30, 2005 )-7229(
> )-2259(Part 2 Page 12 )]TJ
> /TT0 1 Tf
> 0 Tc 0 -1.2048 TD
> ( )Tj
> /T1_0 1 Tf
> 0.0023 Tc 0 67.6506 TD
> ( FORM 1040 PAGE 1 U.S. Individual Income Tax
> Retur\
> n )Tj
> 0 Tc 0 -1.0663 TD
> ( )Tj
> 0.0023 Tc T*
> ( Field Identification Form Length Field
> Descrip\
> tion )Tj
> T*
> ( No. Ref. )Tj
> T*
> ( ----- -------------- ---- ------
> -------------\
> ---- )Tj
> 0 Tc T*
> ( )Tj
> 0.0023 Tc T*
> ( Byte Count 4 "1450" for
> Fi\
> xed; | )Tj
> T*
> ( "nnnn" for
> va\
> riable )Tj
> T*
> ( format )Tj
> 0 Tc T*
> ( )Tj
> 0.0023 Tc T*
> ( Start of Record Sentinel 4
> Value "****"
> \
> )Tj
> 0 Tc T*
> ( )Tj
> 0.0023 Tc T*
> ( 0000 Record ID 6
> "RETbbb" )Tj
> T*
> ( )Tj
> T*
> ( 0001 Type 6
> "1040bb" )Tj
> T*
> ( )Tj
> T*
> ( 0002 Page Number 5
> "PG01b" )Tj
> T*
> ( )Tj
> T*
> ( 0003 Taxpayer 9 N
> \(Primary
> S\
> SN\) )Tj
> T*
> ( Identification )Tj
> T*
> ( Number )Tj
> T*
> ( )Tj
> T*
> ( 0004 Filler 1 blank )Tj
> T*
> ( )Tj
> T*
> ( 0005 Tax Period 6 Value
> "200512\
> ", YYYYMM | )Tj
> T*
> ( )Tj
> T*
> ( 0006 Filler 1 blank )Tj
> T*
> ( )Tj
> T*
> ( 0007 Return Sequence 16 N )Tj
> T*
> ( Number )Tj
> T*
> ( )Tj
> T*
> ( 0008 Declaration Control 14 N )Tj
> T*
> ( Number )Tj
> T*
> ( )Tj
> T*
> ( 0010 Primary SSN 9 N \(Your
> Soci\
> al )Tj
> T*
> ( Security
> Numb\
> er\) )Tj
> T*
> ( )Tj
> T*
> ( 0020 Primary Date of 8
> YYYYMMDD or
> b\
> lank )Tj
> T*
> ( Death )Tj
> T*
> ( )Tj
> T*
> ( 0030 Secondary SSN 9 N or blank
> )Tj
> T*
> ( )Tj
> T*
> ( 0040 Secondary Date of 8
> YYYYMMDD or
> b\
> lank )Tj
> T*
> ( Death )Tj
> T*
> ( )Tj
> T*
> ( 0050 Primary Name Control 4 First 4
> signi\
> ficant )Tj
> T*
> ( characters
> of\
> taxpayer's )Tj
> T*
> ( last name,
> no\
> leading or )Tj
> T*
> ( embedded
> spac\
> es; )Tj
> T*
> ( allowable
> cha\
> racters are )Tj
> T*
> ( alpha,
> hyphen\
> or space )Tj
> T*
> ( \(see
> special\
> )Tj
> T*
> (
> instructions\)\
> )Tj
> T*
> ( )Tj
> T*
> (
> )Tj
> ET