[poppler] images in pdftohtml -xml mode

Igor Slepchin Mon, 14 Nov 2011 16:19:56 -0800

I know that dumping images when running pdftohtml with the -xml flag has been brought up before, and it seems that the devs said they would accept a patch; however, it looks like nothing has made it into the source tree so far. I figured I could give this a try too, so please take a look at my proposed changes if there is still some interest in this functionality: https://github.com/igors/poppler/tree/xml_images

The first commit in the above branch fixes up pdf2xml.dtd to match what pdftohtml generates; the second patch adds support for images in -xml mode. With this patch applied, pdftohtml -xml will dump all image files just like it does in html mode and will add image elements at the beginning of each page that has images, i.e., you'll see something like the following in the generated xml:


<page number="51" position="absolute" top="0" left="0"
      height="896" width="572">
<image top="45" left="26" width="523" height="373" src="filename.jpg"/>
<text top="534" left="81" width="17" height="15" font="18">In </text>

With the -xml switch, the default behavior is now to process images; adding the -i option restores the old behavior.
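
For anyone who wants to consume the new elements, here is a rough sketch of reading them back with Python's standard xml.etree.ElementTree module. It is only an illustration, not part of the patch: the sample string is modeled on the snippet above, the <pdf2xml> root element and the closing tags are assumptions added so the fragment parses on its own, and list_images is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample modeled on the snippet in this message. The <pdf2xml>
# root and the closing tags are assumptions added so the fragment parses.
SAMPLE = """<pdf2xml>
<page number="51" position="absolute" top="0" left="0"
      height="896" width="572">
<image top="45" left="26" width="523" height="373" src="filename.jpg"/>
<text top="534" left="81" width="17" height="15" font="18">In </text>
</page>
</pdf2xml>"""


def list_images(xml_text):
    """Return (page, src, left, top, width, height) for every image element."""
    root = ET.fromstring(xml_text)
    found = []
    for page in root.iter("page"):
        # Image elements are direct children of the page they belong to.
        for image in page.findall("image"):
            found.append((
                int(page.get("number")),
                image.get("src"),
                int(image.get("left")),
                int(image.get("top")),
                int(image.get("width")),
                int(image.get("height")),
            ))
    return found


if __name__ == "__main__":
    for entry in list_images(SAMPLE):
        print(entry)
```

When run with -i, no image elements would be emitted, so the same helper would simply return an empty list for such output.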

The change is small enough that I hope it won't be very controversial, but comments are certainly appreciated.


Thanks,
Igor
