[poppler] pdftohtml -xml custom built (image tags but no pngs extracted)

Kai Fritsch Tue, 08 May 2018 11:04:28 -0700

Hey,

I'm using the XML output of pdftohtml to classify PDFs. I wondered whether
it would be easy to add an option in a custom build that keeps the image
tags in the XML without extracting the images themselves. I have to classify
a lot of PDFs, and some of them are PowerPoint presentations with lots of
small images (e.g. 26000 per page), which take several hours to extract. I
need the image tags for some of my classification features.
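
To illustrate the kind of feature meant here, a minimal sketch of counting
image tags per page. It assumes the XML consists of <page> elements with a
"number" attribute that contain <image> elements, as pdftohtml -xml
typically emits. The sample string and the function name are hypothetical,
made up for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like pdftohtml -xml output (not real data).
SAMPLE_XML = """<pdf2xml>
  <page number="1" height="1188" width="918">
    <text top="100" left="50" width="200" height="20" font="0">Title</text>
    <image top="10" left="10" width="32" height="32" src="doc-1_1.png"/>
    <image top="60" left="10" width="32" height="32" src="doc-1_2.png"/>
  </page>
  <page number="2" height="1188" width="918">
    <text top="100" left="50" width="200" height="20" font="0">Body</text>
  </page>
</pdf2xml>"""


def image_counts_per_page(xml_text):
    """Map each page number to the number of <image> tags on that page."""
    root = ET.fromstring(xml_text)
    counts = {}
    for page in root.iter("page"):
        # Only the tags are read; the files named in "src" are never opened,
        # which is why the tags alone are enough for this feature.
        counts[int(page.get("number"))] = len(page.findall("image"))
    return counts


if __name__ == "__main__":
    print(image_counts_per_page(SAMPLE_XML))
```

Because only the tag attributes are read, a feature like this would work
the same whether or not the referenced PNG files exist on disk.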


If someone could point me to the place in the code where I could make that
change, that would be very much appreciated. Otherwise I will have to dig
through the code myself.

Many Thanks,
Kai

_______________________________________________
poppler mailing list
[email protected]
https://lists.freedesktop.org/mailman/listinfo/poppler
