Re: P112
On 12/3/19 8:15 PM, Fred Cisin via cctech wrote:
> On Wed, 4 Dec 2019, Bill Gunshannon via cctech wrote:
>> Along this line I have solved one problem. I mentioned INIT in RSX180 printing gibberish on the screen when trying to init a hard disk partition where it had worked on a floppy. Problem was the size of the partitions. I had tried just making one partition for the test. I learned that FDISK will make partitions too big for any of the P112 OSes. I now have a hard disk with 5 partitions to play with. On to the next problem.
>
> Is it a specific size limit?
> (something on the order of number of bits for block number?)

Don't know, but I suspect it's around 32M. I seem to remember seeing something mentioned somewhere. I just divided a 42M Seagate into 5 partitions to play with. I may test the limits eventually, but for right now I would just like to get some of the OSes loaded on the hard disk so I can work with them. Especially RSX180, as I have some other plans for that one.

bill
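[A hedged sanity check on the suspected "around 32M" figure: if these OSes address a partition with a 16-bit block number over 512-byte blocks, which is plausible for Z180-era systems like the P112's but is an assumption not confirmed in this thread, the limit falls straight out of the arithmetic:]

```python
# Assumptions (not confirmed in the thread): 512-byte blocks and a
# 16-bit block-number field, as was common on systems of this era.
BLOCK_SIZE = 512          # bytes per block (assumed)
BLOCK_NUMBER_BITS = 16    # width of the block-number field (assumed)

max_partition_bytes = (2 ** BLOCK_NUMBER_BITS) * BLOCK_SIZE
print(max_partition_bytes)                   # 33554432
print(max_partition_bytes // (1024 * 1024))  # 32 (MiB)
```

[Under those assumptions the ceiling is exactly 32 MiB, which would match the suspicion above and explain why one big FDISK partition failed while five slices of a 42M drive work.]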
Re: P112
On Wed, 4 Dec 2019, Bill Gunshannon via cctech wrote:
> Along this line I have solved one problem. I mentioned INIT in RSX180 printing gibberish on the screen when trying to init a hard disk partition where it had worked on a floppy. Problem was the size of the partitions. I had tried just making one partition for the test. I learned that FDISK will make partitions too big for any of the P112 OSes. I now have a hard disk with 5 partitions to play with. On to the next problem.

Is it a specific size limit?
(something on the order of number of bits for block number?)
Re: P112
On 12/3/19 7:51 PM, Craig Ruff via cctech wrote:
> Just in case someone else hasn't already responded, the P112 does not use DOS style fdisk partitioning for a hard disk. It is done in the BIOS image, and then the logical disks have to be initialized. This is described in the "P112 GIDE Construction.pdf" document.
>
> I've only used 3.5" floppies, which work fine. You can also attach a PATA CD-ROM drive and access disks with a program that escapes my memory at the moment.

Along this line I have solved one problem. I mentioned INIT in RSX180 printing gibberish on the screen when trying to init a hard disk partition where it had worked on a floppy. Problem was the size of the partitions. I had tried just making one partition for the test. I learned that FDISK will make partitions too big for any of the P112 OSes. I now have a hard disk with 5 partitions to play with. On to the next problem.

bill
Re: P112
Just in case someone else hasn't already responded, the P112 does not use DOS style fdisk partitioning for a hard disk. It is done in the BIOS image, and then the logical disks have to be initialized. This is described in the "P112 GIDE Construction.pdf" document. I've only used 3.5" floppies, which work fine. You can also attach a PATA CD-ROM drive and access disks with a program that escapes my memory at the moment.
Re: PBX (or something) for modem testing
On Tue, Dec 3, 2019 at 8:35 PM Jim Brain via cctalk wrote:
> That said, I went out to eBay to see if I could source a 2-8 line something to help, and got smacked around with my lack of telephone system knowledge.
>
> So, any ideas (or links to eBay auctions) of brands/models/etc. I should focus on?

When I worked in a group that was doing modem related software over 20 years ago, two of the telephone line simulators we used were from Teltone and from TAS.

You can find Teltone TLS-2, 3, 4, and 5 boxes on eBay. The TLS-2 and TLS-3 are two line boxes; the TLS-4 and TLS-5 are four line boxes. If I remember correctly we had a TLS-5 model that could do more advanced things like Call Waiting and Caller ID signalling. If you're patient you might find a reasonable deal on one of these. Search eBay for "Teltone TLS" and look at completed item sale prices.

The TAS Telephone Network Emulators were much more serious pieces of complex programmable test equipment. We had a big TAS Series II box, which was looped through an additional TAS 240 voiceband subscriber loop emulator which could introduce various impairments on the line. If you search eBay for "TAS telephone emulator" you'll see some of those boxes. It doesn't look like those are typically cheap unless they are likely broken, and they are bigger and heavier than you would want to deal with anyway.

I might have a Teltone TLS-3 that I don't really need. The only problem would be finding it hidden under other stuff...
Re: PBX (or something) for modem testing
On 12/3/19 9:35 PM, Jim Brain via cctalk wrote:
> So, any ideas (or links to eBay auctions) of brands/models/etc. I should focus on?

I would purchase a Partner system from AT&T / Lucent / Avaya. I think they are both analog and digital. The analog will work for modems. You will likely need a digital set with display to program things.

I should clarify: /each/ line is both digital and analog. This is nicer than older Nortel Norstar systems that I've worked with, which needed additional equipment, an Analog Terminal Adapter, as they were (by default) purely digital lines.

Note: You might not be able to get anything faster than 33.6 without some more effort and / or more specialized phone equipment.

There's always the VoIP route, but that's an even deeper, darker, steeper, and more slippery rabbit hole.

Partner systems will likely be 8 analog lines that can be used in short order.

My biggest concern with a Partner system would be the dial plan. As in, will the modems, or other communications applications, be okay with a two digit phone number? Or do you need to make the phone system use longer phone numbers? (As I type this, I seem to remember Nortel systems having an option. I bet Partner systems do too.) I'd cross this particular bridge when you get there, if you get there.

--
Grant. . . .
unix || die
PBX (or something) for modem testing
To continue validating modem functionality, I think it makes sense to set up a closed loop phone system in my lab that will function well enough to allow modems to connect to each other (dial tone, ringing, busy signal, etc.). I know I can probably whip something up with a 9 V battery and a piece of cable with RJ11s, but I think that will fall short.

That said, I went out to eBay to see if I could source a 2-8 line something to help, and got smacked around with my lack of telephone system knowledge.

So, any ideas (or links to eBay auctions) of brands/models/etc. I should focus on?

Also, if anyone has any modems lying around gathering dust, I probably should source a few more models. tcpser handles the Hayes "+++" spec correctly, but I should probably support TIES as well, to cite one example.

Jim

--
Jim Brain
br...@jbrain.com
www.jbrain.com
Re: Scanning docs for bitsavers
On 03/12/2019 20:22, Fred Cisin via cctalk wrote:
> Watch out. PDF with OCR can show you a clear and crisp [possibly wrong] interpretation of the scan, not what the actual scan looked like.

The OCR may well say "0" where the printing says "8", but what your eyes will see will be the representation of the printing. So if you rely only on OCR you may well miss something, but if you fall back to the way you'd have to work without OCR (or even the way you'd have to work if you had the original paper copy) then you have to rely on your eyesight to fail to find what you are looking for ...

Unless, that is, you discard the graphical representation and keep only the OCR result. In which case all bets are off.

Antonio

--
Antonio Carlini
anto...@acarlini.com
Re: Scanning docs for bitsavers
On Tue, 3 Dec 2019, Paul Koning via cctalk wrote:
> The trouble (for both of these) is that many of the users don't know the limitations and blindly use the wrong tools.

"To the man who has a hammer, the whole world looks like a thumb."
(which is an indictment of misuse, not an indictment of hammers)
Re: Scanning docs for bitsavers
>> JBIG2 .. introduces so many actual factual errors (typically
>> substituted letters and numbers)

On Tue, 3 Dec 2019, Noel Chiappa via cctalk wrote:
> It's probably worth noting that there are often errors _in the original documents_, too - so even a perfect image doesn't guarantee no errors.

. . . and how often will the randomizing corruption actually result in changing an error to what it SHOULD HAVE BEEN? :-)

> Although looking again at the PDF, the two digits in question are quite clear and crisp, and don't seem like they could be scanning errors.

Watch out. PDF with OCR can show you a clear and crisp [possibly wrong] interpretation of the scan, not what the actual scan looked like.
Re: Scanning docs for bitsavers
On Tue, Dec 3, 2019 at 10:59 AM Paul Berger via cctalk <cctalk@classiccmp.org> wrote:
> Is there any way to know what compression was used in a pdf file?

There's not necessarily only one. Every object in a PDF file can have its own selection of compression algorithm. I don't know of any user-friendly way to tell. Years ago I used some really awful programs I hacked up to inspect PDF file contents. I'm not sure I can even find them any more.
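[For a quick, if crude, answer to Paul's question, one can scan a PDF's raw bytes for /Filter entries. This little sketch is my own improvisation, not one of the programs mentioned above; it misses filters written as arrays or hidden inside compressed object streams, so treat the result as a hint rather than a full inventory:]

```python
# Crude census of compression filters named in a PDF's raw bytes.
# Limitations: ignores array-valued /Filter entries and anything inside
# compressed object streams, so this is a hint, not an inventory.
import re
from collections import Counter

def filter_census(pdf_bytes: bytes) -> Counter:
    """Count /Filter name occurrences in raw PDF data."""
    names = re.findall(rb"/Filter\s*/(\w+)", pdf_bytes)
    return Counter(n.decode("ascii") for n in names)

# Demo on a handcrafted fragment; real use: filter_census(open(p, "rb").read())
sample = (b"<< /Type /XObject /Subtype /Image /Filter /CCITTFaxDecode >>"
          b"<< /Length 10 /Filter /FlateDecode >>")
print(filter_census(sample))  # Counter({'CCITTFaxDecode': 1, 'FlateDecode': 1})
```

[Seeing CCITTFaxDecode suggests G4-compressed images; DCTDecode means JPEG, JPXDecode means JPEG2000, and JBIG2Decode is the one to worry about.]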
Re: Scanning docs for bitsavers
> On Dec 3, 2019, at 12:59 PM, Paul Berger via cctalk wrote:
>
> ...
> Would TIFF G4 still be preferable to JPEG2000? It would seem I can control the compression used by selecting the pdf compatibility level.

JPEG2000 apparently has a lossless mode (says Wikipedia). If so, it would be acceptable as an alternative to other lossless compressions. If used in lossy mode, it's not suitable for scanned documents, just as regular JPEG isn't.

paul
Re: Scanning docs for bitsavers
On 2019-12-02 4:57 p.m., Eric Smith via cctalk wrote:
> On Tue, Nov 26, 2019 at 8:51 PM Jay Jaeger via cctalk wrote:
>> When I corresponded with Al Kossow about format several years ago, he indicated that CCITT Group 4 lossless compression was their standard.
>
> There are newer bilevel encodings that are somewhat more efficient than G4 (ITU-T T.6), such as JBIG (T.82) and JBIG2 (T.88), but they are not as widely supported, and AFAIK JBIG2 is still patent encumbered. As a result, G4 is still arguably the best bilevel encoding for general-purpose use. PDF has natively supported G4 for ages, though it gained JBIG and JBIG2 support in more recent versions.
>
> Back in 2001, support for G4 encoding in open source software was really awful; where it existed at all, it was horribly slow. There was no good reason for G4 encoding to be slow, which was part of my motivation in writing my own G4 encoder for tumble (an image-to-PDF utility). However, G4 support is generally much better now.

Is there any way to know what compression was used in a pdf file? Do you know anything about the compression format JPEG2000? Would TIFF G4 still be preferable to JPEG2000? It would seem I can control the compression used by selecting the pdf compatibility level.

Paul.
Re: One old Sol, Two old names...
On Mon, Nov 25, 2019 at 7:43 PM William Sudbrink via cctalk <cctalk@classiccmp.org> wrote:
> Other interesting things about the Sol include that it has an 80/64 video modification (with patches all over):
> http://wsudbrink.dyndns.org:8080/images/fixed_sol/20191125_202606.jpg

Cool! Here's one with both the 80/64 daughterboard and a Z80 daughterboard:
https://www.flickr.com/photos/_brouhaha_/6693966999/in/album-72157628862392743/

Unfortunately I have not been able to track down documentation on either.
Re: Scanning docs for bitsavers
On 12/3/19 10:30 AM, Eric Smith via cctalk wrote:
> PDF was never _intended_ for documents that should undergo any further processing.

Okay. Fair rebuttal.

> The few things that have been hacked onto it for interactive use are actually the worst thing about PDF.

Okay.

> I don't have any more difficulty extracting and processing scanned page images out of a PDF file than any other container (e.g., TIFF).

--
Grant. . . .
unix || die
Re: Scanning docs for bitsavers
> On Dec 2, 2019, at 11:12 PM, Grant Taylor via cctalk wrote:
>
> On 12/2/19 9:06 PM, Grant Taylor via cctalk wrote:
>> In my opinion, PDFs are the last place that computer usable data goes. Because getting anything out of a PDF as a data source is next to impossible. Sure, you, a human, can read it and consume the data. Try importing a simple table from a PDF and working with the data in something like a spreadsheet. You can't do it. The raw data is there. But you can't readily use it. This is why I say that a PDF is the end of the line for data. I view it as effectively impossible to take data out of a PDF and do anything with it without first needing to reconstitute it before I can use it.
>
> I'll add this:
>
> PDF is a decent page layout format. But trying to view the contents in any different layout is problematic (at best).
>
> Trying to use the result of a page layout as a data source is ... problematic.

That's hardly surprising. These properties are precisely the intent of PDF. It's basically a portable variant of PostScript, with some cleanups (relatively sane Unicode support, transparency, hyperlinks, a few other things). Its specific purpose is to encode page images, just as they appear on actual paper. Indeed, PDF is often used as a "camera ready copy" format for material going to a print shop. It works quite well for that.

For scanned documents, where each page is just an image, PDF is a decent container format. For documents with actual text, it's far more problematic.

Using PDF as an intermediate form is every bit as inappropriate as using JPEG for line art or any other application where artefacts are impermissible. The trouble (for both of these) is that many of the users don't know the limitations and blindly use the wrong tools.

paul
Re: Scanning docs for bitsavers
On Mon, Dec 2, 2019 at 9:06 PM Grant Taylor via cctalk <cctalk@classiccmp.org> wrote:
> My problem with PDFs starts where most people stop using them.
>
> Take the average PDF of text, try to copy and paste the text into a text file. (That may work.)

Sure. Now try the same thing with a TIFF file.
Re: Scanning docs for bitsavers
On Tue, Dec 3, 2019 at 1:50 AM Christian Corti via cctalk <cctalk@classiccmp.org> wrote:
> *NEVER* use JBIG2! I hope you know about the Xerox JBIG2 bug (e.g. making

That's _LOSSY_ JBIG2. YOU DON'T HAVE TO USE LOSSY MODE!
Re: Scanning docs for bitsavers
On Mon, Dec 2, 2019 at 7:08 PM Grant Taylor via cctalk <cctalk@classiccmp.org> wrote:
> I *HATE* doing anything with PDFs other than reading them.

PDF was never _intended_ for documents that should undergo any further processing. The few things that have been hacked onto it for interactive use are actually the worst thing about PDF.

> My opinion is that PDF is where information goes to die. Creating the PDF was the last time that anything other than a human could use the information as a unit.

I don't have any more difficulty extracting and processing scanned page images out of a PDF file than any other container (e.g., TIFF).
Re: Scanning docs for bitsavers
On Mon, Dec 2, 2019 at 5:34 PM Guy Dunphy via cctalk wrote:
> Mentioning JBIG2 (or any of its predecessors) without noting that it is completely unacceptable as a scanned document compression scheme, demonstrates a lack of awareness of the defects it introduces in encoded documents.

Perhaps you are not aware that the JBIG2 standard has a lossless mode. Certainly JBIG2 lossy mode is _extremely_ lossy, but lossless mode doesn't have those problems. It's entirely possible that the common JBIG2 encoders either don't offer lossless mode, or don't make it easy to configure.

> G4 compression was invented for fax machines. No one cared much about visual quality of faxes, they just had to be readable. Also the technology of fax machines was only capable of two-tone B&W reproduction, so that's what G4 encoding provided.
>
> Thinking these kinds of visual degradation of quality are acceptable when scanning documents for long term preservation, is both short sighted and ignorant of what can already be achieved with better technique.

When used at an appropriate resolution (e.g., not 100 DPI), G4 encoding is perfectly fine for bilevel documents (text and line art) that are in good condition. If the documents were originally bilevel but have suffered from significant degradation in reproduction, then they are effectively no longer bilevel, and G4 (at any resolution) is inappropriate.

> And therefore why PDF isn't acceptable as a container for long term archiving of _scanned_ documents for historical purposes.

You state that as if it was a fact universally agreed upon, which it clearly is not. If you despise PDF as an archival format, by all means please feel free to NOT avail yourself of the hundreds of thousands of pages of archives in PDF format, e.g. on Bitsavers.

I'm of course not claiming that PDF is perfect, nor is G4 encoding.
Re: Scanning docs for bitsavers
> From: Guy Dunphy
> JBIG2 .. introduces so many actual factual errors (typically
> substituted letters and numbers)

It's probably worth noting that there are often errors _in the original documents_, too - so even a perfect image doesn't guarantee no errors.

The most recent one (of many) which I found (although I only had a PDF to work from, so maybe it's a 'scanning induced error') is described at the bottom here:

https://gunkies.org/wiki/KS10

Although looking again at the PDF, the two digits in question are quite clear and crisp, and don't seem like they could be scanning errors.

Noel
Re: Scanning docs for bitsavers
At 01:20 AM 3/12/2019 -0200, you wrote:
>I cannot understand your problems with PDF files.
>I've created lots and lots of PDFs, with treated and untreated scanned
>material. All of them are very readable and in use for years. Of course,
>garbage in, garbage out. I take the utmost care in my scans to have good
>enough source files, so I can create great PDFs.
>
>Of course, Guy's comments are very informative and I'll learn more from them.
>But I still believe in good preservation using PDF files. FOR ME it is the
>best we have in encapsulating info. Forget HTMLs.

I don't propose html as a viable alternative. It has massive inadequacies for representing physical documents. I just use it for experimenting and as a temporary wrapper, because it's entirely transparent and malleable, i.e. I have total control over the result (within the bounds of what html can do.)

>Please, take a look at this PDF, and tell me: Isn't that good enough for
>preservation/use?
>https://drive.google.com/file/d/0B7yahi4JC3juSVVkOEhwRWdUR1E/view

OK, not too bad in comparison to many others. But a few comments:

* The images are fax-mode, and although the resolution is high enough for there to be no ambiguities, it still looks bad and stylistically greatly differs from the original. Pity I don't have a copy of the original, to make demonstration scans of a few illustrations to show what it could be like, for similar file size.

* The text is OCR'd, with a font I expect likely approximates the original fairly well. Though I'd like to see the original. I suspect the PDF font is a bit 'thick' due to an incorrect gray threshold. Also it's searchable, except that the OCR process included paper blemishes as 'characters', so if you copy-paste the text elsewhere you have to carefully vet it. And not all searches will work.
This is an illustration of the point that until we achieve human-level AI, it's never going to be possible to go from images to abstracted OCR text automatically without considerable human oversight and proof-reading. And... human-level AI won't _want_ to do drudgery like that.

* Your automated PDF generation process did a lot of silly things, like chaotic attempts to OCR 'elements' of diagrams. Just try moving a text selection box over the diagrams, you'll see what I mean. Try several diagrams; it's very random.

* The PCB layouts, e.g. PDF page #s 28, 29 - I bet the original used light shading to represent copper, and details over the copper were clearly visible. But when you scanned it in bi-level, all that is lost. These _have_ to be in gray scale, and preferably post-processed to posterize the flat shading areas (for better compression as well as visual accuracy.)

* Why are all the diagram pages variously different widths? I expect the original pages (foldouts?) had common sizes. This variation is because either you didn't use a fixed recipe for scanning and processing, or your PDF generation utility 'handled' that automatically (and messed up.)

* You don't have control of what was OCR'd and what wasn't. For instance, why OCR table contents, if the text selection results are garbage? For example, select the entire block at the bottom of PDF page 48. Does the highlighting create a sense of confidence this is going to work? Now copy and paste into a text editor. Is the result useful? (No.) OCR can be over-used.

* 'Ownership.' As well as your introduction page, you put your tag on every single page. Pretty much everyone does something like this. As if by transcribing the source material you acquired some kind of ownership or bragging rights. But no, others put a very great deal of effort into creating that work, and you just made a digital copy. The originators probably would consider that an aesthetic insult to their efforts. So, why the proud tags everywhere?
Summary: It's fine as a working copy for practical use. Better to have made it than not, so long as you didn't destroy the paper original in the process. But if you're talking about an archival historical record, that someone can look at in 500 years (or 5000) and know what the original actually looked like, how much effort went into making that ink crisp and accurate, then no. It's not good enough.

To be fair, I've never yet seen any PDF scan of any document that I'd consider good enough. Works created originally in PDF as line art are a different class, and typically OK. Though some other flaws of PDF do come into play: difficulty of content export, problems with global page parameters, font failures, sequential vs content page numbers, etc.

With scanning there are multiple points of failure right through the whole process at present, ranging from misunderstandings of the technology among people doing scanning, problems with scanners (why are edge scanners so rare!?), lack of critical capabilities in post-processing utilities (line art on top of ink screening, it's a nightmare,
Re: Scanning docs for bitsavers
Actually we scan to pdf with OCR back, also text, also tiff, also jpeg, with the slooowww hp 11x17 scan fax print thing. I can scan an entire document, then save1 save2 save3 save4 without rescanning each time.

ed at smecc

In a message dated 12/3/2019 2:16:01 AM US Mountain Standard Time, cctalk@classiccmp.org writes:

Hi!

On Tue, 2019-12-03 11:34:06 +1100, Guy Dunphy via cctalk wrote:
> At 01:57 PM 2/12/2019 -0700, you wrote:
> >On Tue, Nov 26, 2019 at 8:51 PM Jay Jaeger via cctalk wrote:
> >> When I corresponded with Al Kossow about format several years ago, he
> >> indicated that CCITT Group 4 lossless compression was their standard.
>
> As for G4 bilevel encoding, the only reasons it isn't treated with the same
> disdain as JBIG2, are:
> 1. Bandwagon effect - "It must be OK because so many people use it."
> 2. People with little or zero awareness of typography, the visual quality of
> text, and anything to do with preservation of historical character of
> printed works. For them "I can read it OK" is the sole requirement.
>
> G4 compression was invented for fax machines. No one cared much about visual
> quality of faxes, they just had to be readable. Also the technology of fax
> machines was only capable of two-tone B&W reproduction, so that's what G4
> encoding provided.

So it boils down to two distinct tasks:

* Scan old paper documentation with a proven file format (ie. no compression artifacts, b/w or 16 gray-level for black-and-white text, tables and the like).

* Make these images accessible as useable documentation.

The first step is the one that's work-intensive; the second step can probably be easily redone every time we "learn" something about how to make the documents more useful. For accessibility, PDF seems to be quite a nice choice, as long as we see that as a representation only (and not as the information source.) Convert the images to TIFF for example, possibly downsample, possibly OCR and overlay it.

> But PDF literally cannot be used as a wrapper for the results, since
> it doesn't incorporate the required image compression formats.
> This is why I use things like html structuring, wrapped as either a zip
> file or RARbook format. Because there is no other option at present.
> There will be eventually. Just not yet. PDF has to be either greatly
> extended, or replaced.

I think that PDF actually is a quite well-working output format, but we'd see it as a compilation product of our actual source (images), not as the final (and only) product.

> And that's why I get upset when people physically destroy rare old documents
> during or after scanning them currently. It happens so frequently, that by
> the time we have a technically adequate document coding scheme, a lot of old
> documents won't have any surviving paper copies.
> They'll be gone forever, with only really crap quality scans surviving. :-(

Too bad, but that happens all the time.

Thanks,
Jan-Benedict

--
Re: Scanning docs for bitsavers
Very nice file. Yep, we prefer pdf with OCR back stuff.

ed
smecc.org

In a message dated 12/2/2019 8:20:36 PM US Mountain Standard Time, cctalk@classiccmp.org writes:

I cannot understand your problems with PDF files. I've created lots and lots of PDFs, with treated and untreated scanned material. All of them are very readable and in use for years. Of course, garbage in, garbage out. I take the utmost care in my scans to have good enough source files, so I can create great PDFs.

Of course, Guy's comments are very informative and I'll learn more from them. But I still believe in good preservation using PDF files. FOR ME it is the best we have in encapsulating info. Forget HTMLs.

Please, take a look at this PDF, and tell me: Isn't that good enough for preservation/use?
https://drive.google.com/file/d/0B7yahi4JC3juSVVkOEhwRWdUR1E/view

Thanks
Alexandre

---8<---Cut here---8<---
http://www.tabajara-labs.blogspot.com
http://www.tabalabs.com.br
---8<---Cut here---8<---

On Tue, Dec 3, 2019 at 00:08, Grant Taylor via cctalk <cctalk@classiccmp.org> wrote:
> On 12/2/19 5:34 PM, Guy Dunphy via cctalk wrote:
>
> Interesting comments Guy.
>
> I'm completely naive when it comes to scanning things for preservation. Your comments do pass my naive understanding.
>
> > But PDF literally cannot be used as a wrapper for the results,
> > since it doesn't incorporate the required image compression formats.
> > This is why I use things like html structuring, wrapped as either a zip
> > file or RARbook format. Because there is no other option at present.
> > There will be eventually. Just not yet. PDF has to be either greatly
> > extended, or replaced.
>
> I *HATE* doing anything with PDFs other than reading them. My opinion is that PDF is where information goes to die. Creating the PDF was the last time that anything other than a human could use the information as a unit. Now, in the future, it's all chopped up lines of text that may be in a nonsensical order. I believe it will take humans (or something yet to be created with human-like ability) to make sense of the content and recreate it in a new form for further consumption.
>
> Have you done any looking at ePub? My understanding is that they are a zip of a directory structure of HTML and associated files. That sounds quite similar to what you're describing.
>
> > And that's why I get upset when people physically destroy rare old
> > documents during or after scanning them currently. It happens so
> > frequently, that by the time we have a technically adequate document
> > coding scheme, a lot of old documents won't have any surviving
> > paper copies. They'll be gone forever, with only really crap quality
> > scans surviving.
>
> Fair enough.
>
> --
> Grant. . . .
> unix || die
Re: Scanning docs for bitsavers
Hi!

On Tue, 2019-12-03 11:34:06 +1100, Guy Dunphy via cctalk wrote:
> At 01:57 PM 2/12/2019 -0700, you wrote:
> >On Tue, Nov 26, 2019 at 8:51 PM Jay Jaeger via cctalk wrote:
> >> When I corresponded with Al Kossow about format several years ago, he
> >> indicated that CCITT Group 4 lossless compression was their standard.
>
> As for G4 bilevel encoding, the only reasons it isn't treated with the same
> disdain as JBIG2, are:
> 1. Bandwagon effect - "It must be OK because so many people use it."
> 2. People with little or zero awareness of typography, the visual quality of
> text, and anything to do with preservation of historical character of
> printed works. For them "I can read it OK" is the sole requirement.
>
> G4 compression was invented for fax machines. No one cared much about visual
> quality of faxes, they just had to be readable. Also the technology of fax
> machines was only capable of two-tone B&W reproduction, so that's what G4
> encoding provided.

So it boils down to two distinct tasks:

* Scan old paper documentation with a proven file format (ie. no compression artifacts, b/w or 16 gray-level for black-and-white text, tables and the like).

* Make these images accessible as useable documentation.

The first step is the one that's work-intensive; the second step can probably be easily redone every time we "learn" something about how to make the documents more useful. For accessibility, PDF seems to be quite a nice choice, as long as we see that as a representation only (and not as the information source.) Convert the images to TIFF for example, possibly downsample, possibly OCR and overlay it.

> But PDF literally cannot be used as a wrapper for the results, since
> it doesn't incorporate the required image compression formats.
> This is why I use things like html structuring, wrapped as either a zip
> file or RARbook format. Because there is no other option at present.
> There will be eventually. Just not yet. PDF has to be either greatly
> extended, or replaced.

I think that PDF actually is a quite well-working output format, but we'd see it as a compilation product of our actual source (images), not as the final (and only) product.

> And that's why I get upset when people physically destroy rare old documents
> during or after scanning them currently. It happens so frequently, that by
> the time we have a technically adequate document coding scheme, a lot of old
> documents won't have any surviving paper copies.
> They'll be gone forever, with only really crap quality scans surviving. :-(

Too bad, but that happens all the time.

Thanks,
Jan-Benedict

--
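[The "PDF as a compilation product" idea can be sketched with the Pillow library (assumed installed; the in-memory pages below stand in for real archival TIFF masters): the lossless scans remain the source of truth, and the PDF is a disposable derivative that can be regenerated at any time as tools improve.]

```python
# Sketch: compile a derived PDF from lossless master page images.
# Assumes Pillow is installed; page images here are synthetic stand-ins.
import io
from PIL import Image

# Stand-ins for scanned master pages; a real workflow would load the
# archival TIFFs with Image.open("page-001.tif"), etc.
pages = [Image.new("1", (128, 64), 1) for _ in range(3)]

# Compile the derived PDF; the masters themselves remain untouched.
buf = io.BytesIO()  # or open("manual.pdf", "wb") in a real run
pages[0].save(buf, format="PDF", save_all=True, append_images=pages[1:])

print(buf.getvalue()[:5])  # b'%PDF-'
```

[Because the masters are never modified, redoing the "accessibility" step (different downsampling, better OCR overlay) is just a rerun of this compilation.]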
Re: Scanning docs for bitsavers
On Mon, 2 Dec 2019, Eric Smith wrote:
> There are newer bilevel encodings that are somewhat more efficient than G4 (ITU-T T.6), such as JBIG (T.82) and JBIG2 (T.88), but they are not as widely supported, and AFAIK JBIG2 is still patent encumbered. As a result,

*NEVER* use JBIG2! I hope you know about the Xerox JBIG2 bug (e.g. making an 8 where there is a 6 in the original). The very idea of duplicating parts of a scan into other areas is a no-go. That's not archiving, that's dumb. Therefore using compression algorithms like the one used in JBIG2 is discouraged or even forbidden, e.g. for legal matters.

Christian