Hey, I'm using the XML output of pdftohtml to classify PDFs. I wondered whether it would be easy to add an option, in a custom build, that keeps the image tags in the XML without extracting the images themselves. I have to classify a lot of PDFs, and some of them are PowerPoint presentations with lots of small images (e.g. 26000 per page), which take several hours to extract. I need the image tags for some of my classification features.
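To illustrate the kind of feature involved, here is a minimal sketch that counts image tags per page. It assumes the XML consists of `page` elements carrying a `number` attribute, with `image` elements as direct children; the function name `count_images_per_page` and the sample document are hypothetical and only stand in for real pdftohtml output.

```python
import xml.etree.ElementTree as ET


def count_images_per_page(xml_text):
    """Map each page number to the number of <image> children on that page.

    Assumes a layout like <pdf2xml><page number="1"><image .../></page></pdf2xml>.
    """
    root = ET.fromstring(xml_text)
    counts = {}
    for page in root.iter("page"):
        # Only the tags are inspected; the image files named in src are never opened.
        counts[int(page.get("number"))] = len(page.findall("image"))
    return counts


# Hypothetical sample, shaped like the assumed layout above.
SAMPLE = """<pdf2xml>
  <page number="1" top="0" left="0" height="1188" width="918">
    <image top="10" left="10" width="5" height="5" src="a-1_1.png"/>
    <image top="20" left="20" width="5" height="5" src="a-1_2.png"/>
    <text top="40" left="10" width="100" height="12" font="0">Title</text>
  </page>
  <page number="2" top="0" left="0" height="1188" width="918">
    <text top="40" left="10" width="100" height="12" font="0">Body</text>
  </page>
</pdf2xml>"""

if __name__ == "__main__":
    print(count_images_per_page(SAMPLE))
```

Since a feature like this reads only the tag attributes, the extracted image files would not be needed for it.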
If someone could point me to the place in the code where I could make that change, that would be very much appreciated. Otherwise I will have to go through the code myself.

Many thanks,
Kai
_______________________________________________
poppler mailing list
[email protected]
https://lists.freedesktop.org/mailman/listinfo/poppler
