Kevin Day wrote:
> It may be possible to do this using spatial analysis - as long as your input
> files are fairly uniform.
> I posted a zip file a couple of days ago with the beginnings of text 
> extraction functionality.

It is now in SVN and there's some promising stuff in it.
By the way: as I was going over the code, I added comments.

> You could extend or re-implement the
> SimpleTextExtractingPdfContentStreamProcessor to capture the X position
> of your column headers, determine an 'average' bounding rectangle for
> each column, then extract text that only falls into that bounding
> rectangle.

That's indeed possible, but a lot of work.
Another interesting implementation of the abstract class
PdfContentStreamProcessor would be a processor that retrieves
the positions of images in a PDF page.
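
To make the column idea concrete, here is a minimal sketch of the
bucketing step only. It does not use the iText API: Fragment and Column
are hypothetical holder classes, and I assume a content stream processor
has already collected each text fragment together with the X position
where it starts. The column bands would come from the header positions.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: bucket positioned text fragments into columns by X range. */
class ColumnBucketSketch {

    /** Hypothetical holder: a text fragment and the X coordinate where it starts. */
    static final class Fragment {
        final String text;
        final float x;

        Fragment(String text, float x) {
            this.text = text;
            this.x = x;
        }
    }

    /** Hypothetical column: a header name plus the horizontal band [left, right). */
    static final class Column {
        final String name;
        final float left;
        final float right;
        final List<String> cells = new ArrayList<String>();

        Column(String name, float left, float right) {
            this.name = name;
            this.left = left;
            this.right = right;
        }

        boolean contains(float x) {
            return x >= left && x < right;
        }
    }

    /** Adds each fragment to the first column whose band contains its X position. */
    static void assign(List<Fragment> fragments, List<Column> columns) {
        for (Fragment f : fragments) {
            for (Column c : columns) {
                if (c.contains(f.x)) {
                    c.cells.add(f.text);
                    break; // a fragment belongs to at most one column
                }
            }
        }
    }
}
```

Fragments that fall outside every band are silently dropped here; a real
implementation would probably want to report them, because they are the
first sign that the input files aren't as uniform as hoped.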

> It would also be possible to analyze the graphic operations that draw
> the cell borders to construct the target rectangles.  This kind of thing
> is *hard*, and impossible in the general case - but for specific cases
> it would probably be doable.

Correct.
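
For one such specific case, a table drawn as a complete, regular grid,
the rectangle construction could look like the sketch below. Everything
in it is an assumption for illustration: I take it that the X positions
of the vertical rules and the Y positions of the horizontal rules have
already been extracted from the line-drawing operators, and that the
page uses the default PDF coordinate system, where Y grows upward.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Sketch: derive cell rectangles from the rules of a fully ruled, regular grid. */
class GridFromRulesSketch {

    /**
     * Returns one rectangle per cell as {left, bottom, right, top},
     * listed row by row from the top of the page, left to right.
     */
    static List<float[]> cells(float[] verticalRuleXs, float[] horizontalRuleYs) {
        // Sort the coordinates and drop duplicates (borders are often drawn twice).
        TreeSet<Float> xs = new TreeSet<Float>();
        for (float x : verticalRuleXs) {
            xs.add(x);
        }
        TreeSet<Float> ys = new TreeSet<Float>();
        for (float y : horizontalRuleYs) {
            ys.add(y);
        }
        Float[] xa = xs.toArray(new Float[0]);
        Float[] ya = ys.toArray(new Float[0]);

        List<float[]> result = new ArrayList<float[]>();
        // Y grows upward, so the top row lies between the two largest Y values.
        for (int r = ya.length - 1; r > 0; r--) {
            for (int c = 0; c + 1 < xa.length; c++) {
                result.add(new float[] { xa[c], ya[r - 1], xa[c + 1], ya[r] });
            }
        }
        return result;
    }
}
```

Merged cells, partial borders, or borders drawn as thin filled
rectangles instead of lines all break this assumption, which is exactly
why the general case is out of reach.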

> No matter what, this wouldn't be 100% perfect (as Paulo says, PDF files
> do not have any sort of meta data that captures the concept of a
> 'table'), but it might be an option for you.

Actually it's Bruno, but I've been away from the list for a very
long time.
-- 
This answer is provided by 1T3XT BVBA
http://www.1t3xt.com/ - http://www.1t3xt.info
