iText (to my limited knowledge) doesn't have a "Word Finder" like Acrobat does.
The problems with extracting text from some PDF found in the wild are as follows:

1) The text could be in just about any encoding, including one created on the spot for that particular piece of text. Determining this encoding is possible, but requires some effort.

2) The text might be raw glyph indexes ("the Nth character in this font", with no ordering guarantees of any kind). You'd have to crack open the font file and hunt through its character mapping tables to determine the right character. I have yet to see a word finder handle this case.

3) The "text" might be nothing but raw drawing commands: curve-tos and line-tos. OCR is your only recourse at that point. I have yet to see anyone handle this case.

4) Text doesn't have to appear in a contiguous block. There can be kerning information between letters (information to adjust the spacing between letters so things like 'ij' or 'll' look better). Each letter can be drawn individually... heck, it's perfectly legal to draw all the characters in alphabetical order rather than by location. Inefficient (lots of moving the current drawing point around), but valid. The end of a run of characters can appear at any time, cutting words in half.

"Word Finders" like the one found in Acrobat/Reader have to figure out where all the letters are on a page, what those letters are, and then build words out of them based on their position (letters sharing a baseline with only X distance between them are part of the same word... that sort of thing).

But that's the worst-case scenario: some random PDF built by some random application. A general-purpose word finder has to work with anything that's legal PDF.

You're not in that scenario. The PDFs produced by the IRS will be from a limited number of applications... possibly even "1". Examining the raw output will show you shortcuts that, while handy for your particular case, would be Really Bad in the general case.

Something in the GhostScript family may have a word finder.
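To make the "build words out of letters based on their position" idea a bit more concrete, here is a rough Python sketch. It assumes the hard part is already done: every glyph has been decoded to a character and you know its left edge, baseline, and advance width. The names `Glyph`, `find_words`, `max_gap`, and `baseline_tolerance` are made up for illustration, and the thresholds are arbitrary. This is not how Acrobat or any particular library does it, just the general shape of the technique.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Glyph:
    """One decoded character and where it was drawn (hypothetical type)."""
    char: str
    x: float         # left edge of the glyph
    baseline: float  # y coordinate of the baseline (PDF y grows upward)
    width: float     # advance width of the glyph


def find_words(glyphs, max_gap=1.5, baseline_tolerance=0.5):
    """Group glyphs into words purely by position, ignoring drawing order.

    max_gap and baseline_tolerance are assumed thresholds in page units;
    real documents would need them tuned (e.g. relative to font size).
    """
    # Cluster glyphs into lines: top of page first, then left to right.
    lines = []
    for g in sorted(glyphs, key=lambda g: (-g.baseline, g.x)):
        if lines and abs(lines[-1][0].baseline - g.baseline) <= baseline_tolerance:
            lines[-1].append(g)
        else:
            lines.append([g])

    words = []
    for line in lines:
        line.sort(key=lambda g: g.x)
        current = [line[0]]
        for prev, g in zip(line, line[1:]):
            # Distance from the right edge of the previous glyph to this one.
            # Kerning can make this slightly negative, which is fine.
            gap = g.x - (prev.x + prev.width)
            if gap > max_gap:
                words.append("".join(c.char for c in current))
                current = [g]
            else:
                current.append(g)
        words.append("".join(c.char for c in current))
    return words


if __name__ == "__main__":
    # Glyphs deliberately listed in alphabetical order, not reading order.
    page = [
        Glyph("h", 0.0, 100.0, 5.0),
        Glyph("i", 5.0, 100.0, 2.0),
        Glyph("k", 5.2, 80.0, 5.0),
        Glyph("o", 0.0, 80.0, 5.0),
        Glyph("o", 17.0, 100.0, 5.0),
        Glyph("y", 12.0, 100.0, 5.0),
    ]
    print(find_words(page))
```

Because the glyphs are sorted by position before grouping, the order in which they were drawn stops mattering, which is exactly what case 4 requires. Cases 1 through 3 (encodings, raw glyph indexes, pure drawing commands) still have to be solved before you can build the glyph list at all.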
Poking around revealed that GSview claims to be able to search for text (which requires knowing how to find words): http://www.cs.wisc.edu/~ghost/gsview/. GSview is released under the GPL.

You may even be so lucky as to have PDF Structure in your PDFs that specifically calls out the text of each paragraph, for things like text-to-speech software. The gub'ment is big on accessibility-enabled PDFs. At that point, you don't really need to worry about what's drawn on the page at all; you can just poke around in the Structure tree (still work, but not so daunting).

--Mark Storer
  Senior Software Engineer
  Cardiff Software

#include <disclaimer>
typedef std::Disclaimer<Cardiff> DisCard;

> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] Behalf Of Richard Braman
> Sent: Tuesday, February 14, 2006 10:35 AM
> To: itext-questions@lists.sourceforge.net
> Subject: [iText-questions] Reading and Extracting Text from PDF
>
> I have an open source project that is attempting to structure IRS
> produced documents such as publications and instructions and parse out
> data that is critical to building tax software.
> An example of such a file is http://www.irs.gov/pub/irs-pdf/p1346.pdf.
> This file contains e-file record layouts, which start on page 398. They
> used to publish this as text which made parsing relatively easy, but now
> it comes in PDF only, and the project needs to be able to have good open
> source parsing technology. Is iText the right tool for this job? I
> have seen it do good work on parsing the metadata contained in IRS
> fill-in forms.
>
> Richard Braman
> mailto:[EMAIL PROTECTED]
> 561.748.4002 (voice)
>
> http://www.taxcodesoftware.org
> Free Open Source Tax Software

_______________________________________________
iText-questions mailing list
iText-questions@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/itext-questions