First, if you are going to be doing text extraction from PDF, you REALLY need to read the relevant sections of the PDF Reference/ISO 32000-1, as it will explain everything you need to know.
Now, as Bruno points out, at its "core" the PDF page is just a series of drawing instructions (e.g. moveto, drawstring, moveto, drawline, etc.), so any determination of how these elements go together must be done by various heuristic models. It's complex, but many developers have written solutions.

HOWEVER, PDF DOES support a concept called 'structured PDF' where the various drawing operations are grouped into logical concepts such as paragraphs, tables, etc. In such documents, you have the information you need to make higher-level logical extraction possible without the need to "guess".

Leonard

-----Original Message-----
From: 1T3XT info [mailto:i...@1t3xt.info]
Sent: Tuesday, March 10, 2009 3:34 AM
To: Post all your questions about iText here
Subject: Re: [iText-questions] modifed sample, question on PDF contents

Mike Marchywka wrote:
> Is there any information in the
> PDF that tells me how this stuff is supposed to be organized
> to extract the INFORMATION or is this just a bunch of hopelessly jumbled
> text that can only be read by a human, not a computer?

It's just a bunch of glyphs and lines drawn on a canvas; there is no
structure in the content UNLESS your PDF is tagged.
--
This answer is provided by 1T3XT BVBA
http://www.1t3xt.com/ - http://www.1t3xt.info

------------------------------------------------------------------------------
_______________________________________________
iText-questions mailing list
iText-questions@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/itext-questions

Buy the iText book: http://www.1t3xt.com/docs/book.php