Once my download of the PDF finished (slow link), I was able to extract all of the images using poppler’s¹ version of pdfimages(1).
I used the -j option to output DCT images as JFIF (i.e., .jpg) files. It output three images for each page: a JFIF for the background, a white-on-black pbm for the text layer, and a ppm blend. All of the pbm images looked just right for OCR-ing.

Looking at poppler’s source, it looks like the JBIG2 support traces back to xpdf, so any recent version of pdfimages should be able to output the pbm files.

So, with those LuraTech PDFs, if you run pdfimages and then drop everything except the .pbm files, you should have usable images for doing OCR.

-JimC

1] http://poppler.freedesktop.org/
   http://cgit.freedesktop.org/poppler/poppler
   git://anongit.freedesktop.org/poppler/poppler

   I use the master branch of the git repo.

-- 
James Cloos
OpenPGP: 1024D/ED7DAEA6
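For anyone who wants to script the "run pdfimages, keep only the .pbm files" step, here is a minimal Python sketch. The function names, the "page" prefix, and the idea of extracting into a dedicated scratch directory are my own assumptions for illustration, not something from the original workflow. The only external call is `pdfimages -j <pdf> <image-root>`, which must be on your PATH.

```python
import subprocess
from pathlib import Path


def extract_images(pdf_path, out_dir, prefix="page"):
    """Run `pdfimages -j`, writing files named <out_dir>/<prefix>-NNN.<ext>.

    The prefix "page" is an arbitrary choice for this sketch.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # -j saves DCT-encoded images as .jpg; other images come out as .pbm or .ppm.
    subprocess.run(
        ["pdfimages", "-j", str(pdf_path), str(out_dir / prefix)],
        check=True,
    )


def keep_only_pbm(out_dir):
    """Delete every regular file in out_dir that is not a .pbm.

    Returns the sorted list of remaining .pbm paths. Point this only at a
    scratch directory used for extraction, since it removes everything else.
    """
    kept = []
    for path in sorted(Path(out_dir).iterdir()):
        if not path.is_file():
            continue  # leave subdirectories alone
        if path.suffix.lower() == ".pbm":
            kept.append(path)
        else:
            path.unlink()
    return kept
```

Typical use would be `extract_images("scan.pdf", "scratch")` followed by `keep_only_pbm("scratch")`, then handing the returned paths to the OCR step. The file name "scan.pdf" and directory "scratch" are placeholders.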
