pdftohtml currently creates one large background image for each HTML page it
renders.  In order to reduce the size of the HTML file bundles, as well as to
improve the semantic value of the HTML, Stephen and I would like to extract and
use only the portions of that background image that are not plain white.

To accomplish this, our idea is to add hooks to SplashOutputDevNoText that
catch each painting operation and record the coordinates of its bounding box.
Whenever a newly recorded box overlaps or touches boxes we already have, we'll
replace them with a single box enclosing them all.  Once we have a list of
disjoint bounding boxes covering every graphics operation that has occurred on
the page, we'll use those boxes to extract only the relevant regions from the
large background image, save each region as a separate file, and reference the
files from the HTML.
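
To make the merging step concrete, here is a rough sketch of the bookkeeping.
It is not working poppler code: Box and RegionList are made-up names, the
coordinates are assumed to be device pixels, and I'm assuming "contiguous"
means the boxes overlap or touch (including at a corner).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical axis-aligned box in device pixels (not a poppler type).
struct Box {
    int x0, y0, x1, y1;
};

// True when the two boxes overlap or touch along an edge or at a corner.
static bool touches(const Box &a, const Box &b) {
    return a.x0 <= b.x1 && b.x0 <= a.x1 && a.y0 <= b.y1 && b.y0 <= a.y1;
}

// Smallest box enclosing both inputs.
static Box unite(const Box &a, const Box &b) {
    return { std::min(a.x0, b.x0), std::min(a.y0, b.y0),
             std::max(a.x1, b.x1), std::max(a.y1, b.y1) };
}

// Hypothetical container the painting hooks would feed.
class RegionList {
public:
    void add(Box box) {
        assert(box.x0 <= box.x1 && box.y0 <= box.y1);
        // A grown box may reach boxes it did not touch before,
        // so rescan the list until nothing more merges.
        bool merged = true;
        while (merged) {
            merged = false;
            for (std::size_t i = 0; i < boxes.size(); ++i) {
                if (touches(boxes[i], box)) {
                    box = unite(boxes[i], box);
                    boxes.erase(boxes.begin() + i);
                    merged = true;
                    break;
                }
            }
        }
        boxes.push_back(box);
    }

    // Boxes that neither overlap nor touch one another.
    const std::vector<Box> &regions() const { return boxes; }

private:
    std::vector<Box> boxes;
};
```

Each painting hook would call add() with the box of what it just drew, and
regions() would give the rectangles to crop out of the page bitmap at the end.
The rescan is quadratic in the worst case, which is probably fine for the
number of graphics operations on a typical page.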

Since we're extending the output device, we'll rename it from 
SplashOutputDevNoText to better capture the new role:  
SplashOutputDevHtmlImages.  If you think we should retain the old behavior with 
a switch, please let me know; I don't see a significant benefit to it.

As always, any comments appreciated.

--josh
_______________________________________________
poppler mailing list
http://lists.freedesktop.org/mailman/listinfo/poppler
