Kevin Day wrote:
> It may be possible to do this doing spatial analysis - as long as your
> input files are fairly uniform.
> I posted a zip file a couple of days ago with the beginnings of text
> extraction functionality.

It is now in SVN and there's some promising stuff in it.
By the way: as I was going over the code, I added comments.

> You could extend or re-implement the
> SimpleTextExtractingPdfContentStreamProcessor to capture the X position
> of your column headers, determine an 'average' bounding rectangle for
> each column, then extract text that only falls into that bounding
> rectangle.

That's indeed possible, but a lot of work.
Another interesting implementation of the abstract class
PdfContentStreamProcessor would be a processor that retrieves
the positions of images in a PDF page.

> It would also be possible to analyze the graphic operations that draw
> the cell borders to construct the target rectangles. This kind of thing
> is *hard*, and impossible in the general case - but for specific cases
> it would probably be doable.

Correct.

> No matter what, this wouldn't be 100% perfect (as Paulo says, PDF files
> do not have any sort of meta data that captures the concept of a
> 'table'), but it might be an option for you.

Actually it's Bruno, but I've been away from the list for a very long time.

-- 
This answer is provided by 1T3XT BVBA
http://www.1t3xt.com/ - http://www.1t3xt.info
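
To make the column idea discussed above a bit more concrete, here is a minimal sketch of the slicing step only. It assumes a custom processor has already recorded each piece of text together with its left and right X coordinates. Nothing in it is iText API: ColumnSlicer, Chunk, Column, columnsFromHeaders and slice are hypothetical names made up for this illustration. Instead of a true 'average' bounding rectangle it uses a simpler approximation, placing each column boundary halfway between two neighbouring headers.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not part of iText: assigns text chunks to columns by X position. */
final class ColumnSlicer {

    /** A piece of text plus the horizontal extent it occupies on the page. */
    static final class Chunk {
        final String text;
        final float left;
        final float right;

        Chunk(String text, float left, float right) {
            this.text = text;
            this.left = left;
            this.right = right;
        }
    }

    /** A named column covering the half-open horizontal range [left, right). */
    static final class Column {
        final String name;
        final float left;
        final float right;

        Column(String name, float left, float right) {
            this.name = name;
            this.left = left;
            this.right = right;
        }

        /** A chunk belongs to the column that contains its horizontal center. */
        boolean contains(Chunk c) {
            float center = (c.left + c.right) / 2f;
            return center >= left && center < right;
        }
    }

    /**
     * Derives columns from header chunks. Boundaries are placed halfway
     * between the right edge of one header and the left edge of the next;
     * the first column starts at 0 and the last one ends at pageWidth.
     */
    static List<Column> columnsFromHeaders(List<Chunk> headers, float pageWidth) {
        List<Chunk> sorted = new ArrayList<>(headers);
        sorted.sort((a, b) -> Float.compare(a.left, b.left));
        List<Column> columns = new ArrayList<>();
        for (int i = 0; i < sorted.size(); i++) {
            Chunk h = sorted.get(i);
            float left = (i == 0) ? 0f : (sorted.get(i - 1).right + h.left) / 2f;
            float right = (i == sorted.size() - 1)
                    ? pageWidth
                    : (h.right + sorted.get(i + 1).left) / 2f;
            columns.add(new Column(h.text, left, right));
        }
        return columns;
    }

    /** Groups chunk texts by column, keeping column order and chunk order. */
    static Map<String, List<String>> slice(List<Column> columns, List<Chunk> chunks) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Column col : columns) {
            result.put(col.name, new ArrayList<>());
        }
        for (Chunk chunk : chunks) {
            for (Column col : columns) {
                if (col.contains(chunk)) {
                    result.get(col.name).add(chunk.text);
                    break;
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Chunk> headers = new ArrayList<>();
        headers.add(new Chunk("Name", 50f, 90f));
        headers.add(new Chunk("Qty", 200f, 230f));
        headers.add(new Chunk("Price", 350f, 390f));

        List<Chunk> body = new ArrayList<>();
        body.add(new Chunk("Widget", 50f, 100f));
        body.add(new Chunk("3", 210f, 216f));
        body.add(new Chunk("9.99", 355f, 385f));

        System.out.println(slice(columnsFromHeaders(headers, 600f), body));
    }
}
```

The hard part remains collecting reliable coordinates from the content stream; this only shows what to do with them afterwards. Text that straddles a boundary, or cells that span several columns, would need extra handling.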
