On 4/19/12, Albert Astals Cid <[email protected]> wrote:
> On Thursday, 19 April 2012, at 14:43:56, William Bader wrote:
>> I don't understand why converting pdf to html requires gs to rasterize to
>> png, especially when pdftoppm can generate png.
>
> Me either, we inherited pdftohtml from some other dead project via krh
> forcing it on us when he was the maintainer, if it had been my decision
> pdftohtml would not be part of poppler, its code quality is worse than the
> rest of poppler.
>
> I'm all for removing it, but that might bring some unwanted death threats
>
I will not send death threats, I promise ;)

Look at it from another side: this is the only version of pdftohtml that is still somewhat maintained. And pdf2html/pdf2xml is at the moment the only way to get (most of) the text formatting out of PDFs. I personally would love to have an alternative, but there is none.

N.B. That, by the way, was the reason for my earlier question: can I get formatted text out of Okular via copy/paste or not? I'd love to be able to open Okular etc., press "Select All", "Copy", switch to OO Writer and press "Paste". But that simply doesn't work.

But I guess 99% of the crowd here is interested solely in how a PDF looks on the screen, not in how to reverse engineer the information back out of it. So the lack of interest isn't surprising to me.

If there are any particular ideas on how to improve the code quality, I'm open to suggestions at the moment. But frankly, the code has to be "forked" first: the integration of Splash added another use case, and untangling all the dependencies would be a hell of a lot of work. There are two HTML modes, a Splash mode and an XML mode. If I were to start cleaning up the code, I would first copy pdftohtml/HtmlOutputDev into three different applications (pdftohtml, pdftosplash, pdftoxml), then start removing redundant stuff and looking for what can be reused. The better part of the code is copy-pasted from (an old version of) TextOutputDev anyway.

But the problem here is the way the code is organized, and I mean the code of poppler itself as well. I can't, for example, reuse the code from TextOutputDev in HtmlOutputDev, because there is no stable in-memory representation of the PDF; every OutputDev has to invent its own. Thus all the logic and algorithms they already implement are not reusable.

All of the above is the result of a rather cursory code review, and I'd love to be proven wrong, e.g. by someone giving me at least rough instructions on how one can implement pdftoxml using poppler's cpp interface.
The last time I looked, I stopped at poppler::page::text(), the only method I found that provides access to a page's text. But it returns the text without any formatting, font information or text coordinates, and thus is useless for this purpose. I found no DOM-like object giving me access to the PDF's innards. So one is back to PDFDoc, back to reinventing yet another OutputDev...
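
To make the "no stable in-memory representation" point a bit more concrete, here is a minimal sketch of what such a shared model might look like. Everything in it is hypothetical: TextRun, PageModel, add_run, escape_markup, to_xml and to_html are names made up for illustration and are not part of poppler or of pdftohtml, and the XML/HTML layout is only an assumption. The idea is simply that the text runs are collected once, with font and coordinates, and several writers then serialize the same data instead of each OutputDev collecting it on its own.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical shared model; none of these types exist in poppler.
struct TextRun {
    std::string text;      // UTF-8 text of the run
    std::string font;      // font name
    double size;           // font size in points
    double x, y;           // top-left corner in page coordinates
    double width, height;  // bounding box dimensions
};

struct PageModel {
    int number;            // 1-based page number
    double width, height;  // page size
    std::vector<TextRun> runs;
};

// Append a run, rejecting obviously broken input early.
void add_run(PageModel& page, TextRun run) {
    assert(run.size > 0.0);
    page.runs.push_back(std::move(run));
}

// Escape the characters that are special in both XML and HTML.
std::string escape_markup(const std::string& in) {
    std::string out;
    for (char c : in) {
        switch (c) {
            case '&': out += "&amp;"; break;
            case '<': out += "&lt;"; break;
            case '>': out += "&gt;"; break;
            case '"': out += "&quot;"; break;
            default:  out += c; break;
        }
    }
    return out;
}

// Writer 1: an XML dump of the page (element names are illustrative).
std::string to_xml(const PageModel& page) {
    std::ostringstream os;
    os << "<page number=\"" << page.number << "\" width=\"" << page.width
       << "\" height=\"" << page.height << "\">\n";
    for (const TextRun& r : page.runs) {
        os << "  <text left=\"" << r.x << "\" top=\"" << r.y
           << "\" width=\"" << r.width << "\" height=\"" << r.height
           << "\" font=\"" << escape_markup(r.font)
           << "\" size=\"" << r.size << "\">"
           << escape_markup(r.text) << "</text>\n";
    }
    os << "</page>\n";
    return os.str();
}

// Writer 2: absolutely positioned HTML built from the same model.
std::string to_html(const PageModel& page) {
    std::ostringstream os;
    os << "<div style=\"position:relative;width:" << page.width
       << "px;height:" << page.height << "px\">\n";
    for (const TextRun& r : page.runs) {
        os << "  <span style=\"position:absolute;left:" << r.x
           << "px;top:" << r.y << "px;font-family:" << escape_markup(r.font)
           << ";font-size:" << r.size << "px\">"
           << escape_markup(r.text) << "</span>\n";
    }
    os << "</div>\n";
    return os.str();
}
```

With something along these lines, a caller would fill one PageModel per page and pass it to whichever writer is needed. The part this sketch leaves out is the hard one: filling the model from a PDF still requires access to text together with its font and position.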
