Hi!

On Tue, 2019-12-03 11:34:06 +1100, Guy Dunphy via cctalk <firstname.lastname@example.org> wrote:
> At 01:57 PM 2/12/2019 -0700, you wrote:
> >On Tue, Nov 26, 2019 at 8:51 PM Jay Jaeger via cctalk <email@example.com>
> >wrote:
> >
> > > When I corresponded with Al Kossow about format several years ago, he
> > > indicated that CCITT Group 4 lossless compression was their standard.
> As for G4 bilevel encoding, the only reasons it isn't treated with the same
> disdain as JBIG2, are:
> 1. Bandwaggon effect - "It must be OK because so many people use it."
> 2. People with little or zero awareness of typography, the visual quality of
> text, and anything to do with preservation of historical character of
> printed works. For them "I can read it OK" is the sole requirement.
>
> G4 compression was invented for fax machines. No one cared much about visual
> quality of faxes, they just had to be readable. Also the technology of fax
> machines was only capable of two-tone B&W reproduction, so that's what G4
> encoding provided.

So it boils down to two distinct tasks:

  * Scan old paper documentation with a proven file format (i.e. no
    compression artifacts, b/w or 16 gray levels for black-and-white
    text, tables and the like).
  * Make these images accessible as usable documentation.

The first step is the work-intensive one; the second step can probably
be redone easily every time we "learn" something about how to make the
documents more useful.

For accessibility, PDF seems to be quite a nice choice, as long as we
see it as a representation only (and not as the information source).
Convert the images to TIFF, for example, possibly downsample them,
possibly OCR them and overlay the recognized text.

> But PDF literally cannot be used as a wrapper for the results, since
> it doesn't incorporate the required image compression formats.
> This is why I use things like html structuring, wrapped as either a zip
> file or RARbook format. Because there is no other option at present.
> There will be eventually. Just not yet. PDF has to be either greatly
> extended, or replaced.

I think that PDF actually works quite well as an output format, but we
should see it as a compilation product of our actual source (the
images), not as the final (and only) product.

> And that's why I get upset when people physically destroy rare old documents
> during or after scanning them currently. It happens so frequently, that by
> the time we have a technically adequate document coding scheme, a lot of old
> documents won't have any surviving paper copies.
> They'll be gone forever, with only really crap quality scans surviving. :-(

Too bad, but that happens all the time.

Thanks,
  Jan-Benedict

--