As mentioned previously, I am adding semantic and logical structuring to the poppler core.
My plan is to work out what fits into which category by post-processing the XML. Any suggestions on how to reverse-engineer (or post-engineer?) this XML back into the PDF would be appreciated. In a few days I will have a very accurate XML generated with <header></header>, <footer></footer> and table-of-contents tags. This will involve "pushing" the actual printed page numbers, adding a hyperlink to each ToC entry, and partitioning the page structure as far as the 1.3 standard allows. My code is extremely modular, neat and efficient, and includes an OO API, so it should be easy to extend with author, title, publisher, year and section-title extraction capabilities.
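To make the post-processing idea concrete, here is a minimal sketch in Python rather than in poppler's own C++. It is not my actual code or schema: the <page>/<text> layout, the "top" and "height" attributes, the "printed" attribute and the 8% margin band are all assumptions chosen for illustration. It wraps lines that fall in the top or bottom band of a page in <header>/<footer> and records a purely numeric footer line as the printed page number.

```python
import xml.etree.ElementTree as ET

# Hypothetical input layout (an assumption, not the real schema):
# each <page> has a height, each <text> line a top offset and a height.
SAMPLE = """<doc>
  <page number="1" height="1000">
    <text top="20" height="12">Annual Report</text>
    <text top="400" height="12">Body text of the page.</text>
    <text top="960" height="12">17</text>
  </page>
</doc>"""


def tag_margins(root, margin=0.08):
    """Wrap text lines in the top/bottom margin band in <header>/<footer>."""
    for page in root.iter("page"):
        page_height = float(page.get("height"))
        header = ET.Element("header")
        footer = ET.Element("footer")
        for text in list(page.findall("text")):
            top = float(text.get("top"))
            bottom = top + float(text.get("height"))
            if bottom <= page_height * margin:
                page.remove(text)
                header.append(text)
            elif top >= page_height * (1 - margin):
                page.remove(text)
                footer.append(text)
        # A purely numeric footer line is taken to be the printed page number.
        for text in footer.iter("text"):
            content = (text.text or "").strip()
            if content.isdigit():
                page.set("printed", content)
        if len(header):
            page.insert(0, header)
        if len(footer):
            page.append(footer)
    return root


root = tag_margins(ET.fromstring(SAMPLE))
print(ET.tostring(root, encoding="unicode"))
```

A fixed margin band is the simplest possible heuristic. Comparing repeated lines across pages would probably be more robust, but that is beyond this sketch.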
_______________________________________________
poppler mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/poppler
