Re: [iText-questions] How to extract data

Bruno Lowagie Mon, 27 Nov 2006 07:31:21 -0800

Bernhard Wellhöfer wrote:
> Hello,
>  
> I have a PDF document with ~900 pages. The document lists the personal 
> data of people in a long table. My job now is to extract the first 
> and last names into a structured document (e.g. XML, Excel, CSV ...).


I hope you haven't accepted that assignment yet.
Is there still time to turn it down?

> I already managed to open the document via iText. After studying the 
> documentation I still do not understand how to search the document for the 
> table and then process each table cell. Can somebody send me a hint or a 
> link to the documentation on how to find and process the table?

Is the PDF a Tagged PDF file?
If not: congratulations, you have accepted a mission impossible!
PDF is a Page Description Language, not a Word Processing format.
If you add a table to a PDF file, it is painted on a canvas and
all structure is lost (unless the PDF is tagged).
It's similar to creating an image (GIF, JPG) from a table, and
then asking somebody to convert the image to XLS or Word.
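
To make the "painted on a canvas" point concrete, here is a minimal
sketch in plain Java (no iText calls). It assumes that some extraction
step has already produced text fragments with page coordinates; the
Fragment class, the guessRows helper and the sample names are all
hypothetical, invented for illustration. Without tags, rows can only be
guessed from coordinates, roughly like this:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class RowGuesser {
    /** Hypothetical stand-in for one piece of text painted on the page. */
    static final class Fragment {
        final String text;
        final float x; // horizontal position, grows to the right
        final float y; // vertical position, grows upward as in PDF user space

        Fragment(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Guesses table rows: fragments whose y differs from the first fragment
     * of the current row by at most `tolerance` are treated as one row.
     */
    static List<List<String>> guessRows(List<Fragment> fragments, float tolerance) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        // Top of the page first: highest y comes first.
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y).reversed());

        List<List<Fragment>> rows = new ArrayList<>();
        float rowY = 0f;
        for (Fragment f : sorted) {
            if (rows.isEmpty() || Math.abs(f.y - rowY) > tolerance) {
                rows.add(new ArrayList<>()); // start a new row
                rowY = f.y;
            }
            rows.get(rows.size() - 1).add(f);
        }

        List<List<String>> result = new ArrayList<>();
        for (List<Fragment> row : rows) {
            // Within a row, read cells left to right.
            row.sort(Comparator.comparingDouble((Fragment f) -> f.x));
            List<String> cells = new ArrayList<>();
            for (Fragment f : row) {
                cells.add(f.text);
            }
            result.add(cells);
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample data: two rows, painted in arbitrary order.
        List<Fragment> page = new ArrayList<>();
        page.add(new Fragment("Doe", 120f, 700.0f));
        page.add(new Fragment("Smith", 120f, 686.0f));
        page.add(new Fragment("John", 50f, 700.4f));
        page.add(new Fragment("Anna", 50f, 686.2f));
        System.out.println(guessRows(page, 2f));
    }
}
```

This guess breaks down as soon as a cell wraps onto several lines, the
baseline shifts, or a column is empty, which is why an untagged table is
so hard to recover reliably.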

If you want to know what Tagged PDF is, please download the
free chapters from my book:
http://www.manning.com/affiliate/idevaffiliate.php?id=223_53
and read them carefully.

Your best chance is a product that does OCR (but I
don't know of any free ones).

Best regards,
Bruno

