On 6/12/13, Albert Astals Cid <[email protected]> wrote:
> On Wednesday, 12 June 2013, at 00:34:26, Ihar `Philips` Filipau wrote:
>> On 6/11/13, Jeff Lerman <[email protected]> wrote:
>> > Yes, indicating words is an advantage - but failing to indicate that a
>> > given character in a word is in a given font is a bug.
>>
>> This is about the right time to tell you the thing: the focus of poppler
>> is the on-screen representation of the PDF, not helping to extract
>> information from PDFs. Otherwise, a year ago I would have flooded
>> the place with patches. :D
>
> Well, that's a self-fulfilling prophecy: you think that area is not important
> so you never send the patches and that area never gets love and the circle
> never ends.

It's not like that; that was probably a bad choice of words on my part.

The crux of the problem is that PDFs which are easy to convert do not
require any special attention, and even pdftohtml does a very decent job
on them. But if a PDF has a conversion problem, there is no generic way
to work around it.

Otherwise, yes, I had some ideas (and even a sketch) for a generic API
that would expose the various bits of raw information from a PDF as a
DOM-like structure. But there are several problems with the approach:

1. A DOM doesn't fit the paginated nature of PDF well. The generic nature
   of such an API would also sacrifice quite a lot of efficiency, both
   CPU-wise and RAM-wise.

2. It is an API, and as such is useless to end users. (And even to most
   developers, the raw bits of a PDF would be way too low-level and not
   immediately useful.)

3. For many use cases it is redundant, thanks to the plethora of pdf2xml
   tools lying around on the web. Reading XML programmatically is easy
   and already gives an application a form of in-memory DOM of the PDF.

4. Needless to say, I simply lack sufficient PDF and poppler knowledge to
   actually implement it to some level of usefulness. (That's why I tried
   asking some technical questions on the list. I got no responses.)

I took the sketch as far as I could, expecting to hit #4. But instead it
was the combination of #2 and #3 that persuaded me to abandon it: all the
people asking related questions on the list are users, not developers,
and a DOM in some fashion is already available through the assortment of
pdf2xml tools.

wbr.
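
To illustrate point 3, here is a minimal sketch of reading pdf2xml-style
output into an in-memory, per-page structure. It is an assumption-laden
example, not part of poppler: the SAMPLE document is hand-written to
resemble what `pdftohtml -xml` emits (real output may differ in attributes
and nesting), and `load_pages` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hand-written sample, ASSUMED to resemble `pdftohtml -xml` output.
SAMPLE = """<pdf2xml>
  <page number="1" height="1188" width="918">
    <fontspec id="0" size="12" family="Times" color="#000000"/>
    <fontspec id="1" size="18" family="Helvetica" color="#000000"/>
    <text top="100" left="90" width="200" height="20" font="1">Title</text>
    <text top="140" left="90" width="400" height="14" font="0">First line of body text.</text>
  </page>
  <page number="2" height="1188" width="918">
    <text top="100" left="90" width="300" height="14" font="0">Second page.</text>
  </page>
</pdf2xml>"""


def load_pages(xml_text):
    """Hypothetical helper: return one dict per page with its text runs."""
    root = ET.fromstring(xml_text)
    fonts = {}  # font id -> family; kept across pages, since a later
                # page may reuse a font declared on an earlier one
    pages = []
    for page in root.iter("page"):
        for spec in page.iter("fontspec"):
            fonts[spec.get("id")] = spec.get("family")
        runs = []
        for text in page.iter("text"):
            runs.append({
                "top": int(text.get("top")),
                "left": int(text.get("left")),
                "font": fonts.get(text.get("font")),
                # itertext() also collects text inside nested markup
                "content": "".join(text.itertext()),
            })
        pages.append({"number": int(page.get("number")), "runs": runs})
    return pages


if __name__ == "__main__":
    for page in load_pages(SAMPLE):
        for run in page["runs"]:
            print(page["number"], run["font"], run["content"])
```

The font information here is per text run, not per character, which is
the granularity limit discussed earlier in the thread.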
