Re: re ocropus wrapper question (pdfimages)

James Cloos Thu, 01 Jan 2009 13:21:18 -0800

Once my download of the pdf finished (slow link), I was able to grab out
all of the images using poppler’s¹ version of pdfimages(1).


I used the -j option to output DCT images as JFIF (i.e., .jpg) files.

It output three images for each page: a JFIF for the background, a
white-on-black pbm for the text layer and a ppm blend.

All of the pbm images looked just right for OCR-ing.

Looking at poppler’s source, the JBIG2 support appears to trace back to
xpdf, so any recent version of pdfimages should be able to output the
pbm files.

So, with those LuraTech PDFs, if you run pdfimages and then drop
everything except the .pbm files, you should have usable images for
doing OCR.
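
A minimal sketch of that workflow in Python, assuming pdfimages is on
the PATH. The function names, the "page" prefix and the output directory
are only illustrative; the one pdfimages option used is -j, as above.

```python
import subprocess
from pathlib import Path


def extract_images(pdf_path, out_dir, prefix="page"):
    """Run `pdfimages -j`, writing files named <out_dir>/<prefix>-NNN.<ext>."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # -j saves DCT-encoded images as .jpg; other images come out as
    # .pbm (monochrome) or .ppm (colour).
    subprocess.run(
        ["pdfimages", "-j", str(pdf_path), str(out_dir / prefix)],
        check=True,
    )


def keep_only_pbm(out_dir):
    """Delete every file except the .pbm ones; return the kept paths, sorted."""
    kept = []
    for path in sorted(Path(out_dir).iterdir()):
        if not path.is_file():
            continue  # leave subdirectories alone
        if path.suffix.lower() == ".pbm":
            kept.append(path)
        else:
            path.unlink()
    return kept
```

Calling extract_images("scan.pdf", "out") followed by keep_only_pbm("out")
should leave just the text-layer bitmaps in out/ (here "scan.pdf" is a
placeholder name). Since the second step deletes files, it is safest to
point it at a directory that holds nothing but the pdfimages output.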

-JimC

¹  http://poppler.freedesktop.org/
   http://cgit.freedesktop.org/poppler/poppler
   git://anongit.freedesktop.org/poppler/poppler

   I use the master branch of the git repo.
-- 
James Cloos <[email protected]>         OpenPGP: 1024D/ED7DAEA6

