RE: [iText-questions] Reading and Extracting Text from PDF

Mark Storer Tue, 14 Feb 2006 23:48:18 -0800

iText (to my limited knowledge) doesn't have a "Word Finder" like Acrobat does.


The problems with extracting text from PDFs found in the wild are as 
follows:
        1) The text could be in just about any encoding, including one created 
on the spot for that particular piece of text.  Determining this encoding is 
possible, but requires some effort.
        2) The text might be raw glyph indexes ("the Nth character in this 
font" with no ordering guarantees of any kind) ... you'd have to crack open the 
font file and hunt through its character mapping tables to determine the right 
character.  I have yet to see a word finder handle this case.
        3) The "text" might be nothing more than raw drawing commands.  
Curve-to's and line-to's.  OCR is your only recourse at that point.  I have yet 
to see anyone handle this case.
        4) Text doesn't have to appear in a contiguous block.  There can be 
kerning information between letters (information to adjust the spacing between 
letters so things like 'ij' or 'll' look better).  Each letter can be drawn 
individually... heck, it's perfectly legal to draw all the characters in 
alphabetical order rather than by location.  Inefficient (lots of moving the 
current drawing point around), but valid.  The end of a run of characters can 
appear at any time... cutting words in half.
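
To make points 1 and 4 concrete, here is a minimal sketch in Python.  It 
assumes the code-to-character mapping for the font has already been recovered 
somehow (that is the hard part), and it models a run of text loosely on the 
operand of PDF's TJ operator: a list mixing encoded strings with numeric 
spacing adjustments.  The function name and the sample mapping are made up for 
illustration; none of this is iText API.

```python
def decode_show_array(items, code_to_char, missing="\ufffd"):
    """Decode a TJ-style array into plain text.

    items        -- list mixing byte strings (glyph codes) and numbers
                    (spacing adjustments between the strings)
    code_to_char -- hypothetical per-font mapping from code to character
    missing      -- placeholder emitted for codes the mapping doesn't cover
    """
    out = []
    for item in items:
        if isinstance(item, (bytes, bytearray)):
            # Iterating over bytes yields integer codes, one per glyph here.
            out.extend(code_to_char.get(code, missing) for code in item)
        # Numbers only nudge the drawing position; they carry no characters.
    return "".join(out)


# An encoding "created on the spot" for one word: codes 1, 2, 3 and nothing else.
made_up_encoding = {1: "T", 2: "a", 3: "x"}

# The word is cut in two by a kerning adjustment between "T" and "ax".
print(decode_show_array([b"\x01", 80, b"\x02\x03"], made_up_encoding))
```

Note that this simply ignores the adjustments.  Deciding whether a given 
adjustment is kerning or an actual gap between words is exactly the judgment 
call a word finder has to make.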

"Word Finders" like the one found in Acrobat/Reader have to figure out where 
all the letters are on a page, what those letters are, and then build words out 
of them based on their position (letters sharing a base line with only X 
distance between them are part of the same word... that sort of thing).
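
A rough sketch of that idea, in Python, assuming you already have every glyph 
with its character, position, and width.  The Glyph class, the build_words 
function, and both tolerance values are assumptions for illustration, not how 
Acrobat or any particular library actually does it.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float      # left edge of the glyph
    y: float      # baseline position (PDF-style: larger y is higher on the page)
    width: float  # horizontal advance of the glyph


def build_words(glyphs, max_gap=1.0, baseline_tol=0.5):
    """Group glyphs into words, top to bottom, then left to right."""
    # Step 1: bucket glyphs into lines by (nearly) shared baseline.
    lines = []  # each entry is [baseline, list_of_glyphs]
    for g in sorted(glyphs, key=lambda g: -g.y):
        if lines and abs(lines[-1][0] - g.y) <= baseline_tol:
            lines[-1][1].append(g)
        else:
            lines.append([g.y, [g]])

    # Step 2: within each line, split wherever the horizontal gap is too big.
    words = []
    for _, line in lines:
        line.sort(key=lambda g: g.x)
        current, prev = line[0].char, line[0]
        for g in line[1:]:
            gap = g.x - (prev.x + prev.width)
            if gap <= max_gap:
                current += g.char
            else:
                words.append(current)
                current = g.char
            prev = g
        words.append(current)
    return words


# Glyphs deliberately supplied in alphabetical order, not reading order.
page = [
    Glyph("a", 0, 80, 5), Glyph("b", 5, 80, 5),
    Glyph("h", 0, 100, 5), Glyph("i", 5, 100, 2),
    Glyph("o", 15, 100, 5), Glyph("t", 12, 100, 3),
]
print(build_words(page))
```

Because the grouping depends only on position, the order in which the glyphs 
were drawn doesn't matter.  A real word finder would also have to cope with 
rotated text, multiple columns, and font-size-dependent gaps, none of which 
this sketch attempts.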

But that's the worst-case scenario... some random PDF built by some random 
application.  A general-purpose word finder has to work with anything that's 
legal PDF.

You're not in that scenario.


The PDFs produced by the IRS will be from a limited number of applications... 
possibly even "1".  Examining the raw output will show you shortcuts that, 
while handy for your particular case, would be Really Bad in the general case.

Something in the Ghostscript family may have a word finder.  Poking around 
revealed that GSview (http://www.cs.wisc.edu/~ghost/gsview/) claims to be able 
to search for text, which requires knowing how to find words.  GSview is 
released under the GPL.

You may even be so lucky as to have PDF Structure in your PDFs that 
specifically calls out the text of each paragraph... for things like 
text-to-speech software.  The gub'ment is big on accessibility-enabled PDFs.  
At that point, you don't really need to worry about what's drawn on the page at 
all; you can just poke around in the Structure tree (still work, but not so 
daunting).
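
As a rough picture of what "poking around in the Structure tree" amounts to, 
here is a Python sketch that walks a tree and collects paragraph text.  The 
nested-dict layout and the key names ("type", "text", "kids") are a simplified 
stand-in invented for illustration.  A real structure tree lives in PDF 
dictionaries and usually points at content on the page rather than holding the 
text directly, so treat this as the shape of the traversal only.

```python
def collect_text(node, wanted=("P",)):
    """Depth-first walk that returns the text of every element of a wanted type."""
    found = []
    if node.get("type") in wanted and "text" in node:
        found.append(node["text"])
    for kid in node.get("kids", []):
        found.extend(collect_text(kid, wanted))
    return found


# Hypothetical, already-parsed structure tree.
tree = {
    "type": "Document",
    "kids": [
        {"type": "H1", "text": "Record Layouts"},
        {"type": "P", "text": "First paragraph."},
        {"type": "Sect", "kids": [{"type": "P", "text": "Second paragraph."}]},
    ],
}
print(collect_text(tree))
```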

--Mark Storer
  Senior Software Engineer
  Cardiff Software

#include <disclaimer>
typedef std::Disclaimer<Cardiff> DisCard;



> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED]] On Behalf Of Richard Braman
> Sent: Tuesday, February 14, 2006 10:35 AM
> To: itext-questions@lists.sourceforge.net
> Subject: [iText-questions] Reading and Extracting Text from PDF
> 
> 
> I have an open source project that is attempting to structure
> IRS-produced documents such as publications and instructions and parse
> out data that is critical to building tax software.
> An example of such a file is http://www.irs.gov/pub/irs-pdf/p1346.pdf.
> This file contains e-file record layouts, which start on page 398.  They
> used to publish this as text, which made parsing relatively easy, but now
> it comes in PDF only, and the project needs to be able to have good open
> source parsing technology.  Is iText the right tool for this job?  I
> have seen it do good work on parsing the metadata contained in IRS
> fill-in forms.
>  
>  
> Richard Braman
> mailto:[EMAIL PROTECTED]
> 561.748.4002 (voice) 
> 
> http://www.taxcodesoftware.org
> Free Open Source Tax Software
> 
> 
> 
_______________________________________________
iText-questions mailing list
iText-questions@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/itext-questions


