Hi,
Back in 2006, two PDFBox developers (Richard Braman, Ben Litchfield) asked
me if I was willing to collaborate in the development of text
segmentation/grouping algorithms. At that time, I was working on an
industrial project and this was not possible because of copyright issues.
Since 2008, I have been working on another university project, and have
got approval to publish the work documented in the following research
paper under an open-source licence:
Hassan, T.: Object-Level Document Analysis of PDF Files
2009 ACM Symposium on Document Engineering
http://www.dbai.tuwien.ac.at/staff/hassan/files/p47-hassan.pdf
This paper describes algorithms for text segmentation as well as grouping
of vector graphics into objects.
My current code makes use of a class named PDFObjectExtractor, which
extends PDFStreamEngine, and obtains the text segments, bitmap and vector
graphics as a list of objects.
I don't know if PDFBox has any such functionality yet, but I would be more
than happy to work on integrating these algorithms into PDFBox.
Could you please let me know the best way to go about this?
Best regards,
Tamir Hassan
[email protected]
Database and Artificial Intelligence Group
Technische Universität Wien