I know that dumping images when running pdftohtml with the -xml flag has been brought up before, and it seems the devs said they would accept a patch; however, nothing appears to have made it into the source tree so far. I figured I could give this a try too, so please take a look at my proposed changes if there is still interest in this functionality: https://github.com/igors/poppler/tree/xml_images

The first commit in the above branch fixes up pdf2xml.dtd to match what pdftohtml actually generates; the second commit adds support for images in -xml mode. With this patch applied, pdftohtml -xml dumps all image files just as it does in HTML mode and adds image elements at the beginning of each page that has images, i.e., you'll see something like the following in the generated XML:

<page number="51" position="absolute" top="0" left="0"
      height="896" width="572">
<image top="45" left="26" width="523" height="373" src="filename.jpg"/>
<text top="534" left="81" width="17" height="15" font="18">In </text>

With this change, the -xml switch processes images by default; adding the -i (ignore images) option restores the old behavior.
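
For anyone who wants to consume the new elements, here is a minimal Python sketch of reading them back out of the generated XML. It is not part of the patch. The sample string repeats the fragment above, and I have added a pdf2xml root element and the closing tags so that it parses on its own. The list_images helper is a hypothetical name used only for this illustration, and the sketch assumes the position attributes are plain integers, as they are in the fragment.

```python
import xml.etree.ElementTree as ET

# The fragment from this message, wrapped in a root element and closed
# so it is well-formed. The wrapping is an assumption for illustration.
SAMPLE = """<pdf2xml>
<page number="51" position="absolute" top="0" left="0"
      height="896" width="572">
<image top="45" left="26" width="523" height="373" src="filename.jpg"/>
<text top="534" left="81" width="17" height="15" font="18">In </text>
</page>
</pdf2xml>"""


def list_images(xml_text):
    """Return (page number, src, (left, top, width, height)) per <image>."""
    root = ET.fromstring(xml_text)
    found = []
    for page in root.iter("page"):
        page_no = int(page.get("number"))
        # Only direct children of <page> are considered here.
        for image in page.findall("image"):
            box = tuple(
                int(image.get(key))
                for key in ("left", "top", "width", "height")
            )
            found.append((page_no, image.get("src"), box))
    return found


if __name__ == "__main__":
    for page_no, src, box in list_images(SAMPLE):
        print(page_no, src, box)
```

Running the same helper on output produced with -i should simply return an empty list, since no image elements are emitted in that case.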

The change is small enough that I hope it won't be very controversial, but comments are certainly appreciated.

Thanks,
Igor
_______________________________________________
poppler mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/poppler
