Hi Todd,

Some of us who are working on the pdftohtml utility have had similar thoughts.  
It's on my wish list to completely remove the need for a poppler output device 
by utilizing the SVG toolset available in modern browsers.  In any case, we are 
achieving high accuracy on Gecko and Webkit browsers with the current version 
(not merged into the Poppler main repo yet, but I can send you an invite for a 
git repo that Alec Taylor made, which has all those latest changes.)  I think 
it might meet your needs as-is, or with some tweaks to make it work better on 
other browsers.

We are currently extracting the text and fonts for the browser to render 
directly, but still must rely on Splash, Cairo, etc. to rasterize other graphic 
operations.  With the way we've done it, we have an easy path to change over to 
SVG, one graphic operation at a time, if you'd be interested in doing that.
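To illustrate what converting a single graphic operation might look like, here is a minimal sketch in Python. It is not the pdftohtml code. It assumes path segments arrive as (operator, coordinates) pairs using the PDF path operators m, l, c and h, and that flipping the y axis against the page height is the only transform needed. The function name pdf_path_to_svg and the input shape are hypothetical.

```python
# Hypothetical sketch: map PDF path construction operators to an SVG path
# "d" string. The input format (list of (operator, coords) pairs) is an
# assumption made for illustration only.

# PDF operator -> SVG path command
SVG_COMMAND = {
    "m": "M",  # moveto
    "l": "L",  # lineto
    "c": "C",  # cubic Bezier curveto (three coordinate pairs)
    "h": "Z",  # closepath
}


def pdf_path_to_svg(ops, page_height):
    """Return SVG path data for a list of (operator, [x0, y0, x1, y1, ...])."""
    parts = []
    for op, coords in ops:
        if op not in SVG_COMMAND:
            raise ValueError(f"unsupported path operator: {op!r}")
        if len(coords) % 2 != 0:
            raise ValueError(f"odd number of coordinates for {op!r}")
        points = []
        for i in range(0, len(coords), 2):
            x, y = coords[i], coords[i + 1]
            # PDF user space has its origin at the bottom-left of the page,
            # SVG at the top-left, so the y coordinate is flipped.
            points.append(f"{x:g},{page_height - y:g}")
        parts.append(SVG_COMMAND[op] + " ".join(points))
    return " ".join(parts)


if __name__ == "__main__":
    triangle = [("m", [0, 0]), ("l", [100, 0]), ("l", [100, 50]), ("h", [])]
    print(pdf_path_to_svg(triangle, page_height=200))
```

A real device would also have to carry the current transformation matrix, clipping, and stroke and fill state, which this sketch leaves out.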

The idea of a separate "data" device is interesting, but I don't think it's the 
right way to go.  In effect, you are talking about changing the PDF data to 
XML, and from there to other formats.  I can appreciate the sentiment, since 
PDF is such a difficult format to work with, but adding a layer of abstraction 
is just going to make things more complex, error-prone, and slow.  Note that 
the current version of pdftohtml already creates XML-compliant HTML (there is 
actually a small bug, but you probably get the point).  You can always 
use the XML-compliant HTML as your easier-to-digest "data" format, which also 
allows us to represent more semantics than are available in the original PDF 
document, and you can always extend it with whatever XML tags you need.  For 
example, I extended it with an attribute describing bounding boxes for all of 
the text spans.  Let me know if you want the repo invite.
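As a rough sketch of consuming that XML-compliant HTML as a "data" format, the Python below reads text spans and their bounding boxes with the standard library's XML parser. The attribute name bbox, its "x0 y0 x1 y1" layout, and the sample markup are all assumptions made for illustration; the actual pdftohtml output may differ.

```python
# Hypothetical sketch: treat XML-compliant HTML as a data format and pull
# out each text span with its bounding box. The "bbox" attribute name and
# its "x0 y0 x1 y1" value layout are assumptions, not the real output.
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"

SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <p><span bbox="72 700 180 712">Hello</span>
       <span bbox="184 700 240 712">world</span></p>
  </body>
</html>"""


def text_spans(xhtml):
    """Return a list of (text, (x0, y0, x1, y1)) for spans that carry a bbox."""
    root = ET.fromstring(xhtml)
    spans = []
    for element in root.iter(XHTML_NS + "span"):
        raw = element.get("bbox")
        if raw is None:
            continue  # span without the (assumed) bounding-box attribute
        x0, y0, x1, y1 = (float(value) for value in raw.split())
        spans.append(("".join(element.itertext()), (x0, y0, x1, y1)))
    return spans


if __name__ == "__main__":
    for text, box in text_spans(SAMPLE):
        print(text, box)
```

Because the markup is well-formed XML, any ordinary XML parser can read it and no PDF-specific tooling is needed on the consuming side.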

Best, --josh

From: Todd Hubers <[email protected]>
Date: Thu, 3 Nov 2011 18:13:52 -0700
To: [email protected]
Subject: [poppler] Poppler - SVG Device

I'm currently using Poppler for Text extraction and using GhostScript for PDF 
to Image functionality, all for viewing PDFs online without requiring a PDF 
plugin in the browser.

I noticed Mozilla was working on an interesting project, PDF.js 
[https://wiki.mozilla.org/PDF.js]. It loads PDF files with pure JavaScript (in 
an HTML5-compatible browser; it probably needs canvas).

This is an opportunity for poppler to steam ahead and get some 
headline-grabbing exposure. The SVG format is well supported by browsers. PDFs 
are portable across systems; SVGs, however, are very portable (and fast) across 
the web.

I propose the building of an SVG Device - PDF to SVG. I am currently 
considering using PDF to XML, to then perform XML to SVG. Given the status quo, 
I believe it's time for PDF to SVG.

I see SVG as a very efficient and therefore powerful web format, and I hope 
others in the poppler community will see the potential as I do.

Thanks,

Todd Hubers (BBIT Hons)
Alivate

PS. Perhaps we could then have PDF>Cairo, PDF>SVG, and then tools for SVG>XML, 
SVG>HTML, SVG>Text. In any case it would be good to have simply one direct 
rendering device and one "data" device.