Hi,

> It looks like the djvu algorithm was used on the scans. The site
> also has a djvu version; the pdf may have been created from that.
>
> If the djvu-ing was done ideally, the foreground image should be
> all one needs for ocr-ing.

That's right. Archive.org uses a PDF compression product made by LuraTech [1]. It works basically the following way:

- Extract the background and downsample it to a low resolution (such as 72 dpi).
- Extract the text layer, convert it to 1 bit, and compress it losslessly with JBIG2 [2] at its original resolution (probably 300 or 600 dpi).
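
To make the two steps above a bit more concrete, here is a rough sketch in Python. It is not LuraTech's actual algorithm, which I don't know in detail. The function names, the fixed threshold of 128, and the downsampling factor of 4 are all assumptions made for illustration, and a page is simply a list of rows of grayscale values. JBIG2 compression itself is left out.

```python
# Illustrative sketch only; names, threshold, and factor are assumptions.
# A page is a list of rows of grayscale values (0 = black, 255 = white).

def split_layers(page, threshold=128, factor=4):
    """Split a page into a full-resolution 1-bit text layer and a
    downsampled background layer."""
    height, width = len(page), len(page[0])

    # Text layer: 1 bit per pixel at the original resolution (1 = ink).
    text = [[1 if px < threshold else 0 for px in row] for row in page]

    # Background: average the non-ink pixels of each factor x factor block.
    background = []
    for y in range(0, height, factor):
        out_row = []
        for x in range(0, width, factor):
            block = [page[j][i]
                     for j in range(y, min(y + factor, height))
                     for i in range(x, min(x + factor, width))
                     if not text[j][i]]
            # A block that is entirely ink falls back to white.
            out_row.append(sum(block) // len(block) if block else 255)
        background.append(out_row)
    return text, background


def render(text, background, factor=4, ink=0):
    """Recombine the layers: ink where the text bit is set, otherwise
    the (upscaled) background value."""
    return [[ink if bit else background[y // factor][x // factor]
             for x, bit in enumerate(row)]
            for y, row in enumerate(text)]
```

For OCR, only the `text` result would be of interest; `render` just shows how the two layers go back together into one page.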

Both layers, plus a third one holding some sort of blend mask, are stored per page in the PDF file and rendered by the viewer as a single page. Many open-source tools, such as pdfimages, iText, and PDFBox, can't deal with this type of PDF file. As far as I know, the problem is mostly a missing JBIG2 implementation.

Best,
Christian

[1] http://www.luratech.com/products/document-conversion-solutions/luradocument-pdf-compressor.html
[2] http://en.wikipedia.org/wiki/JBIG2
