First, if you are going to be doing text extraction from PDF, you REALLY need 
to read the relevant sections of the PDF Reference (ISO 32000-1), as they 
explain how text is actually represented on a page.

Now, as Bruno points out, at its "core" the PDF page is just a series of 
drawing instructions (e.g. moveto, drawstring, moveto, drawline, etc.), and so 
any determination of how these elements go together must be made by various 
heuristic models.  It's complex, but many developers have written solutions.
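
To give a flavor of what such a heuristic looks like, here is a minimal sketch
in Python. It is not iText code and not taken from any particular library: the
Fragment type, the function name and both thresholds are assumptions made up
for illustration. It presumes you have already pulled positioned text runs out
of the page content, groups runs whose baselines nearly coincide into a line,
orders each line left to right, and guesses at a space wherever the horizontal
gap is large enough.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """One positioned text run (hypothetical type, for illustration only)."""
    x: float      # left edge of the run, in page units
    y: float      # baseline height, in page units (PDF y grows upward)
    width: float  # horizontal extent of the run
    text: str


def group_into_lines(fragments, y_tolerance=2.0, gap_threshold=1.0):
    """Guess reading order: top-to-bottom lines, left-to-right within a line.

    Both thresholds are arbitrary assumptions; real documents need tuning.
    """
    lines = []  # each entry is [baseline_y_of_first_fragment, [fragments]]
    for frag in sorted(fragments, key=lambda f: -f.y):
        # A fragment joins the current line if its baseline is close enough.
        if lines and abs(lines[-1][0] - frag.y) <= y_tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append([frag.y, [frag]])

    result = []
    for _, frags in lines:
        frags.sort(key=lambda f: f.x)
        parts = [frags[0].text]
        for prev, cur in zip(frags, frags[1:]):
            # A visible gap between runs is treated as a word break.
            if cur.x - (prev.x + prev.width) > gap_threshold:
                parts.append(" ")
            parts.append(cur.text)
        result.append("".join(parts))
    return result
```

Even this toy version shows why the problem is hard: it knows nothing about
columns, rotated text, tables or hyphenation, and each of those needs further
guesswork.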

HOWEVER, PDF DOES support a concept called 'structured PDF' (the basis of 
tagged PDF), where the various drawing operations are grouped into logical 
elements such as paragraphs, tables, etc.  In such documents, you have the 
information you need to do higher-level logical extraction without the need 
to "guess".

Leonard

-----Original Message-----
From: 1T3XT info [mailto:i...@1t3xt.info] 
Sent: Tuesday, March 10, 2009 3:34 AM
To: Post all your questions about iText here
Subject: Re: [iText-questions] modifed sample, question on PDF contents

Mike Marchywka wrote:
> Is there any information in the
> PDF that tells me how this stuff is supposed to be organized
> to extract the INFORMATION or is this just a bunch of hopelessly jumbled
> text that can only be read by a human, not a computer?

It's just a bunch of glyphs and lines drawn on a canvas;
there is no structure in the content UNLESS your PDF is tagged.
-- 
This answer is provided by 1T3XT BVBA
http://www.1t3xt.com/ - http://www.1t3xt.info

------------------------------------------------------------------------------
_______________________________________________
iText-questions mailing list
iText-questions@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/itext-questions

Buy the iText book: http://www.1t3xt.com/docs/book.php
