Contributing text grouping/segmentation algorithms to PDFBox?

Tamir Hassan Mon, 16 Nov 2009 23:57:47 -0800

Hi,

Back in 2006, two PDFBox developers (Richard Braman, Ben Litchfield) asked me if I was willing to collaborate in the development of text segmentation/grouping algorithms. At that time, I was working on an industrial project and this was not possible because of copyright issues.

Since 2008, I have been working on another university project, and have got approval to publish the work documented in the following research paper under an open-source licence:


Hassan, T.: Object-Level Document Analysis of PDF Files
2009 ACM Symposium on Document Engineering
http://www.dbai.tuwien.ac.at/staff/hassan/files/p47-hassan.pdf

This paper describes algorithms for text segmentation as well as grouping of vector graphics into objects.

My current code makes use of a class named PDFObjectExtractor, which extends PDFStreamEngine, and obtains the text segments, bitmap and vector graphics as a list of objects.

I don't know if PDFBox has any such functionality yet, but I would be more than happy to work on integrating these algorithms into PDFBox.


Could you please let me know the best way to go about this?

Best regards,

Tamir Hassan
[email protected]

Database and Artificial Intelligence Group
Technische Universität Wien
