RE: [iText-questions] Reading and Extracting Text from PDF

Mark Storer Wed, 15 Feb 2006 10:18:10 -0800

Going by that particular output sample, you should be able to parse out the 
text relatively easily.


Anything between '(' and ')' is text (note that several escape sequences can 
appear inside a string; see the PDF Reference on the String object for details).  
"T*" and "TD" are the PDF equivalent of an EOL: TD takes an explicit offset 
from the start of the previous line, while T* takes no parameters and reuses 
the current leading (line spacing).
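
To make the escape handling concrete, here is a minimal Java sketch that 
decodes the characters found between the outer parentheses.  The names 
(PdfLiteralString, decode) are made up for this example and are not part of 
iText, and it assumes each byte maps straight to one character, which ignores 
whatever encoding the font uses.

```java
final class PdfLiteralString {
    private PdfLiteralString() {}

    /** Decodes the raw characters found between the outer '(' and ')'. */
    static String decode(String raw) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < raw.length()) {
            char c = raw.charAt(i++);
            if (c != '\\') {
                out.append(c);          // ordinary character, copied as-is
                continue;
            }
            if (i >= raw.length()) {
                break;                  // lone trailing backslash: drop it
            }
            char e = raw.charAt(i++);
            switch (e) {
                case 'n': out.append('\n'); break;
                case 'r': out.append('\r'); break;
                case 't': out.append('\t'); break;
                case 'b': out.append('\b'); break;
                case 'f': out.append('\f'); break;
                case '\r':
                    // backslash + end-of-line is a line continuation (CR or CR LF)
                    if (i < raw.length() && raw.charAt(i) == '\n') {
                        i++;
                    }
                    break;
                case '\n':
                    break;              // line continuation (LF): emit nothing
                default:
                    if (e >= '0' && e <= '7') {
                        // octal character code, one to three digits
                        int value = e - '0';
                        int digits = 1;
                        while (digits < 3 && i < raw.length()
                                && raw.charAt(i) >= '0' && raw.charAt(i) <= '7') {
                            value = value * 8 + (raw.charAt(i++) - '0');
                            digits++;
                        }
                        out.append((char) (value & 0xFF));
                    } else {
                        out.append(e);  // \( \) \\ and any unknown escape
                    }
            }
        }
        return out.toString();
    }
}
```

Note that the backslash at the end of a line in a dump like yours is one of 
these escapes: it continues the string onto the next line without adding a 
line break to the text.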

If the PDFs stick to this format, your job is pretty simple, at least by 
comparison.  To be a bit more cautious, you might also parse out the other 
operators and watch for 'Tm' (seen here only at the beginning), which moves 
the text insertion point to an arbitrary location.

You should be able to ignore anything outside BT and ET (begin text, end text).

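The following operators are of no consequence to plain-text extraction:
parm1 parm2 Tf   (font and size)
parm1 Tc         (character spacing)
parm1 Tw         (word spacing)
parm1 Tr         (rendering mode)
parm1 Ts         (text rise)
parm1 Tz         (horizontal scaling)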

You'll note that PDF's operators are preceded by their parameters.  This is 
postfix notation ("Reverse Polish Notation"): it's easy to parse and process 
with a stack, though it takes some getting used to.
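
As a rough illustration of the stack idea, the Java sketch below pushes 
operands until it meets an operator, then pairs that operator with everything 
collected so far.  It is deliberately naive: it splits on whitespace, so it 
only works on fragments that contain no string operands (strings may contain 
spaces), and the class name PostfixGrouper is invented for this example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class PostfixGrouper {
    private PostfixGrouper() {}

    /** Returns one entry per operator, e.g. "TD [0, -1.2048]". */
    static List<String> group(String content) {
        Deque<String> operands = new ArrayDeque<>();
        List<String> result = new ArrayList<>();
        for (String token : content.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;                       // empty input
            }
            if (isOperand(token)) {
                operands.addLast(token);        // wait for the operator
            } else {
                // An operator consumes every operand gathered before it.
                result.add(token + " " + operands);
                operands.clear();
            }
        }
        return result;
    }

    /** Numbers and /Names are operands; anything else is treated as an operator. */
    static boolean isOperand(String token) {
        char c = token.charAt(0);
        return c == '/' || c == '-' || c == '+' || c == '.' || Character.isDigit(c);
    }
}
```

Fed the fragment "/T1_0 1 Tf 0.0023 Tc 0 -1.2048 TD T*", this yields four 
entries: Tf with two operands, Tc with one, TD with two, and T* with none.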

As for the ones you'll care about:

parm1 Tj
parm1 TJ

For Tj, the one parameter is a String, wrapped in () or, for hex values, in <>.  
For TJ (capital J, as in the header line of your sample), the parameter is an 
array wrapped in [], holding strings and possibly numbers that carry spacing 
information (usually kerning).
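
Here is a hedged Java sketch of pulling the text out of such an array.  The 
numbers are in thousandths of a text-space unit and are subtracted from the 
current position, so a large negative value opens a gap.  The cutoff of -200 
used to decide that a gap is "really" a space is a guess for this example, as 
are the names ShowTextArray and extract; escapes are handled only crudely and 
hex strings are skipped rather than decoded.

```java
final class ShowTextArray {
    private ShowTextArray() {}

    /** Hypothetical cutoff: adjustments below this are treated as a space. */
    static final double GAP_THRESHOLD = -200.0;

    /** Concatenates the strings of a TJ array such as "[(A)-2108(B)]". */
    static String extract(String array) {
        StringBuilder text = new StringBuilder();
        int i = 0;
        while (i < array.length()) {
            char c = array.charAt(i);
            if (c == '(') {
                i = readString(array, i + 1, text);
            } else if (c == '<') {
                // Hex strings are skipped in this sketch, not decoded.
                int close = array.indexOf('>', i);
                i = (close < 0) ? array.length() : close + 1;
            } else if (c == '-' || c == '+' || c == '.' || Character.isDigit(c)) {
                int start = i;
                while (i < array.length()
                        && "+-.0123456789".indexOf(array.charAt(i)) >= 0) {
                    i++;
                }
                // Assumes a well-formed number.
                double adjustment = Double.parseDouble(array.substring(start, i));
                if (adjustment < GAP_THRESHOLD) {
                    text.append(' ');
                }
            } else {
                i++;                            // brackets and whitespace
            }
        }
        return text.toString();
    }

    /** Copies one literal string; 'i' points just past '('. Returns the index after ')'. */
    private static int readString(String s, int i, StringBuilder out) {
        int depth = 1;                          // balanced parentheses may nest
        while (i < s.length()) {
            char c = s.charAt(i++);
            if (c == '\\' && i < s.length()) {
                out.append(s.charAt(i++));      // simplified: keep the escaped char
            } else if (c == '(') {
                depth++;
                out.append(c);
            } else if (c == ')') {
                if (--depth == 0) {
                    return i;
                }
                out.append(c);
            } else {
                out.append(c);
            }
        }
        return i;
    }
}
```

Small adjustments (kerning) produce no extra space, so "[(A)-50(B)]" comes 
out as "AB".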

T*

No parameters

parm1 parm2 TD

The parameters provide a horizontal and vertical offset for the placement of 
the next line.
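
For line-by-line extraction, the useful part is knowing where each line 
starts.  The Java sketch below tracks that under a few assumptions: 
coordinates are in unscaled text space relative to wherever Tm put the 
origin, TD also sets the leading to the negative of its vertical offset (so a 
later T* repeats the same step), and the class and method names are invented 
for this example.

```java
final class TextLineTracker {
    double x;        // start of the current line, text space
    double y;
    double leading;  // distance T* moves down the page

    /** "tx ty TD": move the line start by (tx, ty) and remember -ty as leading. */
    void td(double tx, double ty) {
        leading = -ty;
        x += tx;
        y += ty;
    }

    /** "T*": equivalent to "0 -leading Td", i.e. one line down at the same x. */
    void nextLine() {
        y -= leading;
    }

    /** A TD starts a new output line only if it changes the vertical position. */
    static boolean startsNewLine(double ty) {
        return ty != 0.0;
    }
}
```

Replaying the start of your sample (0 -1.2048 TD, 0 67.6506 TD, 
0 -1.0663 TD, T*) leaves the leading at 1.0663, so every following T* steps 
down by that amount.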

parm1 ... parm6 Tm

This modifies the current Text Matrix.  The six parameters define a 3x3 
transformation matrix like so (see section 5.3.1 of the PDF reference):

p1 p2 0
p3 p4 0
p5 p6 1

That won't count for much if you're not up on matrix math... which isn't really 
necessary in your case: for plain-text extraction it is usually enough to know 
that p5 and p6 are the x and y position.  The Reference goes into some general 
transformation details in section 4.2.2 if you're curious.  The web is also 
peppered with matrix math tutorials, primarily with regard to 3D graphics 
(where understanding such things is nigh-unavoidable).
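
If you do want the numbers, a point (x, y) is treated as the row vector 
[x y 1] and multiplied by that matrix.  The small Java sketch below does just 
that; the class name TextMatrix and the field names a..f (for p1..p6) are 
chosen for this example only.

```java
final class TextMatrix {
    final double a, b, c, d, e, f;   // p1..p6 of the Tm operator

    TextMatrix(double a, double b, double c, double d, double e, double f) {
        this.a = a; this.b = b; this.c = c;
        this.d = d; this.e = e; this.f = f;
    }

    /** Computes [x y 1] * [[a b 0] [c d 0] [e f 1]] and returns {x', y'}. */
    double[] apply(double x, double y) {
        return new double[] { a * x + c * y + e, b * x + d * y + f };
    }
}
```

With the Tm from your sample (9.96001 0 0 9.96001 50.40001 50.28011), the 
text origin lands at (50.40001, 50.28011), and a TD offset of -1.2048 is 
scaled by 9.96001 to a step of about 12 units.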

PS: I mentioned PDF Structure earlier.  This content stream doesn't contain 
any, so no luck there.

PPS: iText already has a good content tokenizer, 'PRTokeniser' (note the 
spelling), which is used by AcroFields.splitDAelements() to parse the same 
sort of content stream you'll be working with.  It's also used extensively by 
PdfReader.

--Mark Storer
  Senior Software Engineer
  Cardiff Software

#include <disclaimer>
typedef std::Disclaimer<Cardiff> DisCard;



> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] Behalf 
> Of Richard
> Braman
> Sent: Tuesday, February 14, 2006 1:32 PM
> To: itext-questions@lists.sourceforge.net
> Subject: RE: [iText-questions] Reading and Extracting Text from PDF
> 
> 
> When I ran the code through reader.getCOntent(398)
> and saved the results to a file, it produced the output below.
> Obviously, this is a start, but I need to get text on a line-by-line
> basis.
> So I loaded reader.getPageN(398) into a PDFDictionary, but now I am
> stuck as my limited knowledge of PDF shines through.  My guess is that
> there is a way to iterate through the dictionary and get what I want
> (like Bruno showed me how to do with the AcroForm.Fields); any code to
> do that would be great.
> 
> 0 g
> 1 i 
> /GS0 gs
> BT
> /T1_0 1 Tf
> 0.0023 Tc 0 Tw 0  Ts 100  Tz 0 Tr 9.96001 0 0 9.96001 
> 50.40001 50.28011
> Tm
> [( )-2108(Publication 1346 )-2108(             August 30, 2005 )-7229(
> )-2259(Part 2 Page 12 )]TJ
> /TT0 1 Tf
> 0 Tc 0 -1.2048 TD
> ( )Tj
> /T1_0 1 Tf
> 0.0023 Tc 0 67.6506 TD
> (           FORM 1040 PAGE 1             U.S. Individual Income Tax
> Retur\
> n )Tj
> 0 Tc 0 -1.0663 TD
> ( )Tj
> 0.0023 Tc T*
> (           Field Identification         Form       Length  Field
> Descrip\
> tion )Tj
> T*
> (           No.                          Ref. )Tj
> T*
> (           ----- --------------         ----       ------
> -------------\
> ---- )Tj
> 0 Tc T*
> ( )Tj
> 0.0023 Tc T*
> (                 Byte Count                           4    "1450" for
> Fi\
> xed;         | )Tj
> T*
> (                                                           "nnnn" for
> va\
> riable )Tj
> T*
> (                                                           format )Tj
> 0 Tc T*
> ( )Tj
> 0.0023 Tc T*
> (                 Start of Record Sentinel             4    
> Value "****"
> \
> )Tj
> 0 Tc T*
> ( )Tj
> 0.0023 Tc T*
> (           0000  Record ID                            6    
> "RETbbb" )Tj
> T*
> (                                                            )Tj
> T*
> (           0001  Type                                 6    
> "1040bb" )Tj
> T*
> (                                                            )Tj
> T*
> (           0002  Page Number                          5    
> "PG01b" )Tj
> T*
> (                                                            )Tj
> T*
> (           0003  Taxpayer                             9    N 
> \(Primary
> S\
> SN\) )Tj
> T*
> (                 Identification                             )Tj
> T*
> (                 Number                                     )Tj
> T*
> (                                                            )Tj
> T*
> (           0004  Filler                               1    blank )Tj
> T*
> (                                                            )Tj
> T*
> (           0005  Tax Period                           6    Value
> "200512\
> ", YYYYMM | )Tj
> T*
> (                                                            )Tj
> T*
> (           0006  Filler                               1    blank )Tj
> T*
> (                                                            )Tj
> T*
> (           0007  Return Sequence                     16    N )Tj
> T*
> (                 Number                                     )Tj
> T*
> (                                                            )Tj
> T*
> (           0008  Declaration Control                 14    N )Tj
> T*
> (                 Number                                     )Tj
> T*
> (                                                            )Tj
> T*
> (           0010  Primary SSN                          9    N \(Your
> Soci\
> al )Tj
> T*
> (                                                           Security
> Numb\
> er\) )Tj
> T*
> (                                                            )Tj
> T*
> (           0020  Primary Date of                      8    
> YYYYMMDD or
> b\
> lank )Tj
> T*
> (                 Death                                      )Tj
> T*
> (                                                            )Tj
> T*
> (           0030  Secondary SSN                        9    N or blank
> )Tj
> T*
> (                                                            )Tj
> T*
> (           0040  Secondary Date of                    8    
> YYYYMMDD or
> b\
> lank )Tj
> T*
> (                 Death                                      )Tj
> T*
> (                                                            )Tj
> T*
> (           0050  Primary Name Control                 4    First 4
> signi\
> ficant )Tj
> T*
> (                                                           characters
> of\
>  taxpayer's )Tj
> T*
> (                                                           last name,
> no\
>  leading or )Tj
> T*
> (                                                           embedded
> spac\
> es; )Tj
> T*
> (                                                           allowable
> cha\
> racters are )Tj
> T*
> (                                                           alpha,
> hyphen\
>  or space )Tj
> T*
> (                                                           \(see
> special\
>  )Tj
> T*
> (
> instructions\)\
>  )Tj
> T*
> (                                                            )Tj
> T*
> (                                                             
>        )Tj
> ET

