Hi,
> It looks like the djvu algorithm was used on the scans.  The site
> also has a djvu version; the pdf may have been created from that.
>
> If the djvu-ing was done ideally, the foreground image should be
> all one needs for ocr-ing.
>   
That's right. Archive.org uses a PDF compression product made by
LuraTech [1]. It works basically as follows:
- Extract the background and downsample it to a low resolution (like 72 dpi)
- Extract the text layer, convert it to 1 bit, and compress it losslessly
with JBIG2 [2] at its original resolution (probably 300 or 600 dpi)

Both layers (and another one with some sort of blend mask) are
stored per page in the PDF file and rendered by the viewer as a single
page.
Many open-source tools like pdfimages, iText or PDFBox can't deal with
this type of PDF file. As far as I know, the problem is mostly a missing
JBIG2 implementation.
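
To make the layer idea concrete, here is a rough sketch in Python of how a
viewer might put such a page back together. It is only an illustration, not
LuraTech's actual method: it assumes the 1-bit layer acts as a selector
between a low-resolution background and a low-resolution text-colour layer,
and the names composite_page, background, foreground, mask and scale are all
made up for this example.

```python
def composite_page(background, mask, foreground, scale):
    """Rebuild a full-resolution grayscale page from three layers.

    background: low-resolution rows of gray values (0-255)
    mask:       full-resolution 1-bit rows (1 = text pixel)
    foreground: low-resolution rows giving the text colour
    scale:      integer ratio of mask resolution to layer resolution
    All names and the selection rule are assumptions for illustration.
    """
    page = []
    for y, mask_row in enumerate(mask):
        row = []
        for x, bit in enumerate(mask_row):
            # The mask bit picks the layer; integer division does a
            # nearest-neighbour upsampling of the low-resolution layer.
            layer = foreground if bit else background
            row.append(layer[y // scale][x // scale])
        page.append(row)
    return page


# Tiny made-up example: 2x1 low-resolution layers, 4x2 mask.
background = [[200, 210]]
foreground = [[0, 30]]
mask = [
    [0, 1, 1, 0],
    [1, 0, 0, 1],
]
page = composite_page(background, mask, foreground, scale=2)
```

In this model the mask already holds the full-resolution text shapes, which
is why the 1-bit layer alone should be enough input for OCR.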

Best,
Christian

[1] 
http://www.luratech.com/products/document-conversion-solutions/luradocument-pdf-compressor.html
[2] http://en.wikipedia.org/wiki/JBIG2
