Re: P112

2019-12-03 Thread Bill Gunshannon via cctalk
On 12/3/19 8:15 PM, Fred Cisin via cctech wrote:
> On Wed, 4 Dec 2019, Bill Gunshannon via cctech wrote:
>> Along this line I have solved one problem.  I mentioned INIT in
>> RSX180 printing gibberish on the screen when trying to init a
>> hard disk partition where it had worked on a floppy.  Problem
>> was the size of the partitions.  I had tried just making one
>> partition for the test.  I learned that FDISK will make partitions
>> too big for any of the P112 OSes.  I now have a hard disk with
>> 5 partitions to play with.  On to the next problem.
> 
> Is it a specific size limit?
> (something on the order of number of bits for block number?)

Don't know, but I suspect it's around 32M.  I seem to remember
seeing something mentioned somewhere.  I just divided a 42M
Seagate into 5 partitions to play with.  I may test the limits
eventually but for right now I would just like to get some of
the OSes loaded on the hard disk so I can work with them.
Especially RSX180 as I have some other plans for that one.
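
For what it's worth, a limit of around 32M would be consistent with a
16-bit block number over 512-byte blocks - that's only a guess on my
part, not anything from the P112 documentation:

    # back-of-the-envelope check (assumes 16-bit block numbers and
    # 512-byte blocks; neither is confirmed for the P112 OSes)
    blocks = 2 ** 16                    # blocks addressable by a 16-bit field
    block_size = 512                    # bytes per block
    print(blocks * block_size // 2**20, "MiB")   # prints: 32 MiB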

bill




Re: P112

2019-12-03 Thread Fred Cisin via cctalk

On Wed, 4 Dec 2019, Bill Gunshannon via cctech wrote:

Along this line I have solved one problem.  I mentioned INIT in
RSX180 printing gibberish on the screen when trying to init a
hard disk partition where it had worked on a floppy.  Problem
was the size of the partitions.  I had tried just making one
partition for the test.  I learned that FDISK will make partitions
too big for any of the P112 OSes.  I now have a hard disk with
5 partitions to play with.  On to the next problem.


Is it a specific size limit?
(something on the order of number of bits for block number?)




Re: P112

2019-12-03 Thread Bill Gunshannon via cctalk
On 12/3/19 7:51 PM, Craig Ruff via cctech wrote:
> Just in case someone else hasn't already responded, the P112 does not use DOS 
> style fdisk partitioning for a hard disk. It is done in the BIOS image, and 
> then the logical disks have to be initialized. This is described in the "P112 
> GIDE Construction.pdf" document.
> 
> I've only used 3.5" floppies, which work fine. You can also attach a PATA 
> CD-ROM drive and access disks with a program that escapes my memory at the 
> moment.
> 

Along this line I have solved one problem.  I mentioned INIT in
RSX180 printing gibberish on the screen when trying to init a
hard disk partition where it had worked on a floppy.  Problem
was the size of the partitions.  I had tried just making one
partition for the test.  I learned that FDISK will make partitions
too big for any of the P112 OSes.  I now have a hard disk with
5 partitions to play with.  On to the next problem.

bill



Re: P112

2019-12-03 Thread Craig Ruff via cctalk
Just in case someone else hasn't already responded, the P112 does not use DOS 
style fdisk partitioning for a hard disk. It is done in the BIOS image, and 
then the logical disks have to be initialized. This is described in the "P112 
GIDE Construction.pdf" document.

I've only used 3.5" floppies, which work fine. You can also attach a PATA 
CD-ROM drive and access disks with a program that escapes my memory at the 
moment.

Re: PBX (or something) for modem testing

2019-12-03 Thread Glen Slick via cctalk
On Tue, Dec 3, 2019 at 8:35 PM Jim Brain via cctalk
 wrote:
>
> That said, I went out to eBay to see if I could source a 2-8 line
> something to help, and got smacked around with my lack of telephone
> system knowledge.
>
> So, any ideas (or links to eBay auctions) of brands/models/etc. I should
> focus on?
>

When I worked in a group that was doing modem-related software over 20
years ago, two of the telephone line simulators we used were from
Teltone and from TAS.

You can find Teltone TLS-2, 3, 4, and 5 boxes on eBay. The TLS-2 and
TLS-3 are two line boxes, the TLS-4 and TLS-5 are four line boxes. If
I remember correctly we had a TLS-5 model that could do more advanced
things like Call Waiting and Caller ID signalling. If you're patient
you might find a reasonable deal on one of these. Search eBay for
"Teltone TLS" and look at completed item sale prices.

The TAS Telephone Network Emulators were much more serious pieces of
complex programmable test equipment. We had a big TAS Series II box,
which was looped through an additional TAS 240 voiceband subscriber
loop emulator which could introduce various impairments on the line.
If you search eBay for "TAS telephone emulator" you'll see some of
those boxes. They don't tend to be cheap unless they're likely broken,
and they are bigger and heavier than you would want to deal with
anyway.

I might have a Teltone TLS-3 that I don't really need. The only
problem would be finding it hidden under other stuff...


Re: PBX (or something) for modem testing

2019-12-03 Thread Grant Taylor via cctalk

On 12/3/19 9:35 PM, Jim Brain via cctalk wrote:
So, any ideas (or links to eBay auctions) of brands/models/etc. I should 
focus on?


I would purchase a Partner system from AT&T / Lucent / Avaya.  I think 
they are both analog and digital.  The analog will work for modems.  You 
will likely need a digital set with a display to program things.


I should clarify: /each/ line is both digital and analog.  This is nicer 
than the older Nortel Norstar systems that I've worked with, which needed 
additional equipment, an Analog Terminal Adapter, as they were (by 
default) purely digital lines.


Note:  You might not be able to get anything faster than 33.6 without 
some more effort and / or more specialized phone equipment.


There's always the VoIP route, but that's an even deeper, darker, 
steeper, and more slippery rabbit hole.


A Partner system will likely give you 8 analog lines that can be used in 
short order.


My biggest concern with a Partner system would be the dial plan.  As in, 
will the modems, or other communications applications, be okay with a 
two-digit phone number?  Or do you need to make the phone system use 
longer phone numbers?  (As I type this, I seem to remember Nortel systems 
having an option for that.  I bet Partner systems do too.)  I'd cross 
that particular bridge when you get there, if you get there.




--
Grant. . . .
unix || die


PBX (or something) for modem testing

2019-12-03 Thread Jim Brain via cctalk
To continue validating modem functionality, I think it makes sense to 
set up a closed loop phone system in my lab that will function well 
enough to allow modems to connect to each other (dial tone, ringing, 
busy signal, etc.).


I know I can probably whip something up with a 9 V battery and a piece 
of cable with RJ11s, but I think that will fall short.


That said, I went out to eBay to see if I could source a 2-8 line 
something to help, and got smacked around with my lack of telephone 
system knowledge.


So, any ideas (or links to eBay auctions) of brands/models/etc. I should 
focus on?


Also, if anyone has any modems lying around gathering dust, I probably 
should source a few more models. tcpser handles the Hayes "+++" escape 
spec correctly, but I should probably support TIES as well, to cite one 
example.
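
For reference, the two escape styles differ roughly like this - a sketch
of my own for illustration, not tcpser's actual code: Hayes needs
guard-time silence around "+++", while TIES just watches the byte stream
for "+++AT...<CR>" with no timing requirement.

    # Illustrative sketch only (not tcpser's implementation).
    import time

    GUARD = 1.0  # Hayes guard time: seconds of line silence

    class EscapeDetector:
        def __init__(self):
            self.last_rx = time.monotonic()
            self.window = b""

        def feed(self, data):
            """Return 'hayes', 'ties', or None for each received chunk."""
            now = time.monotonic()
            quiet_before = (now - self.last_rx) >= GUARD
            self.last_rx = now
            self.window = (self.window + data)[-64:]   # small rolling window
            if quiet_before and data == b"+++":
                # Hayes also requires GUARD seconds of silence *after* the
                # "+++"; the caller confirms that with a timer.
                return "hayes"
            if b"+++AT" in self.window and self.window.endswith(b"\r"):
                return "ties"   # Time Independent Escape Sequence
            return None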


Jim

--
Jim Brain
br...@jbrain.com
www.jbrain.com



Re: Scanning docs for bitsavers

2019-12-03 Thread Antonio Carlini via cctalk

On 03/12/2019 20:22, Fred Cisin via cctalk wrote:


Watch out.  PDF with OCR can show you a clear and crisp  [possibly 
wrong] interpretation of the scan, not what the actual scan looked like.




The OCR may well say "0" where the printing says "8", but what your eyes 
will see is the representation of the printing. So if you rely only on 
the OCR you may well miss something, but if you fall back to the way 
you'd have to work without OCR (or even the way you'd have to work if you 
had the original paper copy), then you're relying on your eyesight, which 
can just as easily fail to find what you are looking for ...



Unless, that is, you discard the graphical representation and keep only 
the OCR result. In which case all bets are off.



Antonio


--
Antonio Carlini
anto...@acarlini.com



Re: Scanning docs for bitsavers

2019-12-03 Thread Fred Cisin via cctalk

On Tue, 3 Dec 2019, Paul Koning via cctalk wrote:
The trouble (for both of these) is that many of the 
users don't know the limitations and blindly use the wrong tools.


"To the man who has a hammer, the whole world looks like a thumb."

(which is an indictment of misuse, not an indictment of hammers)


Re: Scanning docs for bitsavers

2019-12-03 Thread Fred Cisin via cctalk

   > JBIG2 .. introduces so many actual factual errors (typically
   > substituted letters and numbers)

On Tue, 3 Dec 2019, Noel Chiappa via cctalk wrote:

It's probably worth noting that there are often errors _in the original
documents_, too - so even a perfect image doesn't guarantee no errors.


. . . and how often will the randomizing corruption actually result in 
changing an error to what it SHOULD HAVE BEEN?  :-)



Although looking again at the PDF, the two digits in question are quite 
clear and crisp, and don't seem like they could be scanning errors.


Watch out.  PDF with OCR can show you a clear and crisp  [possibly wrong] 
interpretation of the scan, not what the actual scan looked like.


Re: Scanning docs for bitsavers

2019-12-03 Thread Eric Smith via cctalk
On Tue, Dec 3, 2019 at 10:59 AM Paul Berger via cctalk <
cctalk@classiccmp.org> wrote:

> Is there any way to know what compression was used in a pdf file?
>

There's not necessarily only one. Every object in a PDF file can have its
own selection of compression algorithm.

I don't know of any user-friendly way to tell. Years ago I used some really
awful programs I hacked up to inspect PDF file contents. I'm not sure I
can even find them any more.
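
For a quick and dirty look, something like this (a throwaway sketch of my
own, not one of the programs mentioned above) will at least list the
/Filter names that appear literally in the file. Filters referenced
indirectly or buried inside compressed object streams won't show up, so
treat the output as a hint only:

    # Count /Filter names appearing literally in a PDF (rough hint only).
    import collections, re, sys

    data = open(sys.argv[1], "rb").read()
    counts = collections.Counter()
    for group in re.findall(rb"/Filter\s*(\[[^\]]*\]|/\w+)", data):
        for name in re.findall(rb"/(\w+)", group):
            counts[name.decode()] += 1
    for name, n in counts.most_common():
        print(name, n)   # e.g. FlateDecode, CCITTFaxDecode, DCTDecode, JBIG2Decode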


Re: Scanning docs for bitsavers

2019-12-03 Thread Paul Koning via cctalk



> On Dec 3, 2019, at 12:59 PM, Paul Berger via cctalk  
> wrote:
> 
> ...
> Would TIFF G4 still be preferable to JPEG2000? It would seem I can control 
> the compression used by selecting the pdf compatibility level.

JPEG2000 apparently has a lossless mode (says Wikipedia).  If so, it would be 
acceptable as an alternative to other lossless compressions.  If used in lossy 
mode, it's not suitable for scanned documents, just as regular JPEG isn't.
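
If you want to experiment, here is a minimal sketch with Pillow's
OpenJPEG-backed writer (file names made up; worth verifying losslessness
by round-tripping, since defaults can vary between builds):

    # Assumed: Pillow with JPEG 2000 support.  irreversible=False selects
    # the reversible 5/3 wavelet, which should be lossless when no quality
    # layers are requested.
    from PIL import Image

    img = Image.open("page_0001.tif").convert("L")   # JP2 won't take 1-bit mode
    img.save("page_0001.jp2", irreversible=False)

    # sanity check: decode and compare pixel for pixel
    assert list(Image.open("page_0001.jp2").getdata()) == list(img.getdata())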

paul



Re: Scanning docs for bitsavers

2019-12-03 Thread Paul Berger via cctalk



On 2019-12-02 4:57 p.m., Eric Smith via cctalk wrote:

On Tue, Nov 26, 2019 at 8:51 PM Jay Jaeger via cctalk 
wrote:


When I corresponded with Al Kossow about format several years ago, he
indicated that CCITT Group 4 lossless compression was their standard.


There are newer bilevel encodings that are somewhat more efficient than G4
(ITU-T T.6), such as JBIG (T.82) and JBIG2 (T.88), but they are not as
widely supported, and AFAIK JBIG2 is still patent encumbered. As a result,
G4 is still arguably the best bilevel encoding for general-purpose use. PDF
has natively supported G4 for ages, though it gained JBIG and JBIG2 support
in more recent versions.

Back in 2001, support for G4 encoding in open source software was really
awful; where it existed at all, it was horribly slow. There was no good
reason for G4 encoding to be slow, which was part of my motivation in
writing my own G4 encoder for tumble (an image-to-PDF utility). However, G4
support is generally much better now.


Is there any way to know what compression was used in a pdf file?

Do you know anything about a compression format JPEG2000?

Would TIFF G4 still be preferable to JPEG2000? It would seem I can 
control the compression used by selecting the pdf compatibility level.


Paul.



Re: One old Sol, Two old names...

2019-12-03 Thread Eric Smith via cctalk
On Mon, Nov 25, 2019 at 7:43 PM William Sudbrink via cctalk <
cctalk@classiccmp.org> wrote:

> Other interesting things about the Sol include that it has an 80/64 video
> modification
> (with patches all over):
> http://wsudbrink.dyndns.org:8080/images/fixed_sol/20191125_202606.jpg
>

Cool!

Here's one with both the 80/64 daughterboard and a Z80 daughterboard:
https://www.flickr.com/photos/_brouhaha_/6693966999/in/album-72157628862392743/

Unfortunately I have not been able to track down documentation on either.


Re: Scanning docs for bitsavers

2019-12-03 Thread Grant Taylor via cctalk

On 12/3/19 10:30 AM, Eric Smith via cctalk wrote:
PDF was never _intended_ for documents that should undergo any further 
processing.


Okay.

Fair rebuttal.

The few things that have been hacked onto it for interactive use are 
actually the worst thing about PDF.


My opinion


Okay.

I don't have any more difficulty extracting and processing scanned page 
images out of a PDF file than any other container (e.g., TIFF).





--
Grant. . . .
unix || die


Re: Scanning docs for bitsavers

2019-12-03 Thread Paul Koning via cctalk



> On Dec 2, 2019, at 11:12 PM, Grant Taylor via cctalk  
> wrote:
> 
> On 12/2/19 9:06 PM, Grant Taylor via cctalk wrote:
>> In my opinion, PDFs are the last place that computer usable data goes. 
>> Because getting anything out of a PDF as a data source is next to impossible.
>> Sure, you, a human, can read it and consume the data.
>> Try importing a simple table from a PDF and working with the data in 
>> something like a spreadsheet.  You can't do it.  The raw data is there.  But 
>> you can't readily use it.
>> This is why I say that a PDF is the end of the line for data.
>> I view it as effectively impossible to take data out of a PDF and do 
>> anything with it without first needing to reconstitute it before I can use 
>> it.
> 
> I'll add this:
> 
> PDF is a decent page layout format.  But trying to view the contents in any 
> different layout is problematic (at best).
> 
> Trying to use the result of a page layout as a data source is ... problematic.

That's hardly surprising.  These properties are precisely the intent of PDF.  
It's basically a portable variant of PostScript, with some cleanups (relatively 
sane Unicode support, transparency, hyperlinks, a few other things).  Its 
specific purpose is to encode page images, just as they appear on actual paper. 
 Indeed, PDF is often used as a "camera ready copy" format for material going 
to a print shop.  It works quite well for that.

For scanned documents, where each page is just an image, PDF is a decent 
container format.  For documents with actual text, it's far more problematic.

Using PDF as an intermediate form is every bit as inappropriate as using JPEG 
for line art or any other application where artefacts are impermissible.  The 
trouble (for both of these) is that many of the users don't know the 
limitations and blindly use the wrong tools.

paul



Re: Scanning docs for bitsavers

2019-12-03 Thread Eric Smith via cctalk
On Mon, Dec 2, 2019 at 9:06 PM Grant Taylor via cctalk <
cctalk@classiccmp.org> wrote:

> My problem with PDFs starts where most people stop using them.
>
> Take the average PDF of text, try to copy and paste the text into a text
> file.  (That may work.)
>

Sure. Now try the same thing with a TIFF file.


Re: Scanning docs for bitsavers

2019-12-03 Thread Eric Smith via cctalk
On Tue, Dec 3, 2019 at 1:50 AM Christian Corti via cctalk <
cctalk@classiccmp.org> wrote:

> *NEVER* use JBIG2! I hope you know about the Xerox JBIG2 bug (e.g. making
>

That's _LOSSY_ JBIG2.

YOU DON"T HAVE TO USE LOSSY MODE!


Re: Scanning docs for bitsavers

2019-12-03 Thread Eric Smith via cctalk
On Mon, Dec 2, 2019 at 7:08 PM Grant Taylor via cctalk <
cctalk@classiccmp.org> wrote:

> I *HATE* doing anything with PDFs other than reading them.


PDF was never _intended_ for documents that should undergo any further
processing. The few things that have been hacked onto it for interactive
use are actually the worst thing about PDF.

  My opinion
> is that PDF is where information goes to die.  Creating the PDF was the
> last time that anything other than a human could use the information as
> a unit.


I don't have any more difficulty extracting and processing scanned page
images out of a PDF file than any other container (e.g., TIFF).
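
For scanned-page PDFs it really is only a few lines; a minimal sketch,
assuming the pikepdf library (file names made up):

    # Pull embedded page images back out of a scanned-document PDF.
    # Assumes pikepdf; any PDF library with raw stream access would do.
    import pikepdf
    from pikepdf import PdfImage

    with pikepdf.open("scan.pdf") as pdf:
        for pageno, page in enumerate(pdf.pages, start=1):
            for name, raw in page.images.items():
                # extract_to() keeps the native encoding where it can
                # (e.g. .tif for CCITT G4, .jpg for DCT-encoded images).
                PdfImage(raw).extract_to(
                    fileprefix=f"page{pageno:04d}_{str(name).lstrip('/')}")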


Re: Scanning docs for bitsavers

2019-12-03 Thread Eric Smith via cctalk
On Mon, Dec 2, 2019 at 5:34 PM Guy Dunphy via cctalk 
wrote:

> Mentioning JBIG2 (or any of its predecessors) without noting that it is
> completely unacceptable as a scanned document compression scheme,
> demonstrates
> a lack of awareness of the defects it introduces in encoded documents.
>

Perhaps you are not aware that the JBIG2 standard has a lossless mode.
Certainly JBIG2 lossy mode is _extremely_ lossy, but lossless mode doesn't
have those problems.

It's entirely possible that the common JBIG2 encoders either don't offer
lossless mode, or don't make it easy to configure.

> G4 compression was invented for fax machines. No one cared much about visual
> quality of faxes, they just had to be readable. Also the technology of fax
> machines was only capable of two-tone B&W reproduction, so that's what G4
> encoding provided.
>
> Thinking these kinds of visual degradation of quality are acceptable when
> scanning documents for long term preservation, is both short sighted and
> ignorant of what can already be achieved with better technique.
>

When used at an appropriate resolution (e.g., not 100 DPI), G4 encoding is
perfectly fine for bilevel documents (text and line art) that are in good
condition. If the documents were originally bilevel but have suffered from
significant degradation in reproduction, then they are effectively no
longer bilevel, and G4 (at any resolution) is inappropriate.
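
These days writing G4 is a one-liner with common libraries; a minimal
sketch with Pillow (file names made up):

    # Write a scanned page as a bilevel, CCITT Group 4 compressed TIFF.
    from PIL import Image

    page = Image.open("page_0001.png").convert("1")   # threshold to 1-bit
    page.save("page_0001.tif", compression="group4")

An image-to-PDF tool (tumble was mentioned earlier in the thread) can then
wrap such pages into a PDF.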

> And therefore why PDF isn't acceptable as a
> container for long term archiving of _scanned_ documents for historical
> purposes.
>

You state that as if it was a fact universally agreed upon, which it
clearly is not.

If you despise PDF as an archival format, by all means please feel free to
NOT avail yourself of the hundreds of thousands of pages of archives in PDF
format e.g. on Bitsavers.

I'm of course not claiming that PDF is perfect, nor is G4 encoding.


Re: Scanning docs for bitsavers

2019-12-03 Thread Noel Chiappa via cctalk
> From: Guy Dunphy

> JBIG2 .. introduces so many actual factual errors (typically
> substituted letters and numbers)

It's probably worth noting that there are often errors _in the original
documents_, too - so even a perfect image doesn't guarantee no errors.

The most recent one (of many) which I found (although I only had a PDF to
work from, so maybe it's a 'scanning induced error') is described at the
bottom here:

  https://gunkies.org/wiki/KS10

Although looking again at the PDF, the two digits in question are quite clear
and crisp, and don't seem like they could be scanning errors.

Noel


Re: Scanning docs for bitsavers

2019-12-03 Thread Guy Dunphy via cctalk
At 01:20 AM 3/12/2019 -0200, you wrote:
>I cannot understand your problems with PDF files.
>I've created lots and lots of PDFs, with treated and untreated scanned
>material. All of them are very readable and in use for years. Of course,
>garbage in, garbage out. I take the utmost care in my scans to have good
>enough source files, so I can create great PDFs.
>
>Of course, Guy's comments are very informative and I'll learn more from them.
>But I still believe in good preservation using PDF files. FOR ME it is the
>best we have in encapsulating info. Forget HTMLs.

I don't propose html as a viable alternative. It has massive inadequacies
for representing physical documents. I just use it for experimenting and
as a temporary wrapper, because it's entirely transparent and malleable,
i.e. I have total control over the result (within the bounds of what html
can do.)

>Please, take a look at this PDF, and tell me: Isn't that good enough for
>preservation/use?
>https://drive.google.com/file/d/0B7yahi4JC3juSVVkOEhwRWdUR1E/view

OK, not too bad in comparison to many others. But a few comments:
* The images are fax-mode, and although the resolution is high enough for
  there to be no ambiguities, it still looks bad and stylistically greatly
  differs from the original. Pity I don't have a copy of the original, to
  make demonstration scans of a few illustrations to show what it could be
  like, for similar file size.

* The text is OCR, with a font I expect likely approximates the original
  fairly well. Though I'd like to see the original. I suspect the PDF font
  is a bit 'thick' due to incorrect gray threshold.
  Also it's searchable, except that the OCR process included paper blemishes
  as 'characters', so if you copy-paste the text elsewhere you have to
  carefully vet it. And not all searches will work.

  This is an illustration of the point that until we achieve human-level AI,
  it's never going to be possible to go from images to abstracted OCR text
  automatically without considerable human oversight and proof-reading.
  And... human-level AI won't _want_ to do drudgery like that.

* Your automated PDF generation process did a lot of silly things, like
  chaotic attempts to OCR 'elements' of diagrams. Just try moving a text
  selection box over the diagrams, you'll see what I mean. Try several
  diagrams, it's very random.

* The PCB layouts, e.g. PDF page #s 28, 29 - I bet the original used light
  shading to represent copper, and details over the copper were clearly
  visible. But when you scanned it in bi-level all that is lost. These
  _have_ to be in gray scale, and preferably post-processed to posterize
  the flat shading areas (for better compression as well as visual accuracy.)

* Why are all the diagram pages variously different widths? I expect the
  original pages (foldouts?) had common sizes. This variation is because
  either you didn't use a fixed recipe for scanning and processing, or your
  PDF generation utility 'handled' that automatically (and messed up.)

* You don't have control of what was OCR'd and what wasn't. For instance,
  why OCR table contents, if the text selection results are garbage? For
  example, select the entire block at the bottom of PDF page 48. Does the
  highlighting create a sense of confidence this is going to work? Now copy
  and paste into a text editor. Is the result useful? (No.)
  OCR can be over-used.

* 'Ownership': As well as your introduction page, you put your tag on every
  single page. Pretty much everyone does something like this. As if by
  transcribing the source material you acquired some kind of ownership or
  bragging rights. But no, others put a very great deal of effort into
  creating that work, and you just made a digital copy - one the originators
  would probably consider an aesthetic insult to their efforts. So, why the
  proud tags everywhere?

Summary: It's fine as a working copy for practical use. Better to have made
it than not, so long as you didn't destroy the paper original in the process.
But if you're talking about an archival historical record, that someone can
look at in 500 years (or 5000) and know what the original actually looked
like, how much effort went into making that ink crisp and accurate, then no.
It's not good enough.

To be fair, I've never yet seen any PDF scan of any document that I'd
consider good enough. Works created originally in PDF as line art are a
different class, and typically OK. Though some other flaws of PDF do come
into play: difficulty of content export, problems with global page
parameters, font failures, sequential vs content page numbers, etc.

With scanning there are multiple points of failure right through the whole
process at present, ranging from misunderstandings of the technology among
people doing scanning, problems with scanners (why are edge scanners so
rare!?), lack of critical capabilities in post-processing utilities (line
art on top of ink screening, it's a nightmare,

Re: Scanning docs for bitsavers

2019-12-03 Thread ED SHARPE via cctalk
Actually, we scan to PDF with OCR in back, also text, also TIFF, also JPEG.
With the slooowww HP 11x17 scan/fax/print thing I can scan the entire
document, then save 1, save 2, save 3, save 4 without rescanning each time.
Ed at SMECC
In a message dated 12/3/2019 2:16:01 AM US Mountain Standard Time, 
cctalk@classiccmp.org writes:

Hi!
On Tue, 2019-12-03 11:34:06 +1100, Guy Dunphy via cctalk 
 wrote:
> At 01:57 PM 2/12/2019 -0700, you wrote:
> >On Tue, Nov 26, 2019 at 8:51 PM Jay Jaeger via cctalk 
> >wrote:
> >
> > > When I corresponded with Al Kossow about format several years ago, he
> > > indicated that CCITT Group 4 lossless compression was their standard.
> As for G4 bilevel encoding, the only reasons it isn't treated with the same
> disdain as JBIG2, are:
> 1. Bandwaggon effect - "It must be OK because so many people use it."
> 2. People with little or zero awareness of typography, the visual quality of
>    text, and anything to do with preservation of historical character of
>    printed works. For them "I can read it OK" is the sole requirement.
> 
> G4 compression was invented for fax machines. No one cared much about visual
> quality of faxes, they just had to be readable. Also the technology of fax
> machines was only capable of two-tone B&W reproduction, so that's what G4
> encoding provided.

So it boils down to two distinct tasks:

  * Scan old paper documentation with a proven file format (i.e. no
    compression artifacts), b/w or 16 gray levels for black-and-white
    text, tables and the like.

  * Make these images accessible as useable documentation.


The first step is the one that's work-intensive; the second step can
probably be easily redone every time we "learn" something about how to
make the documents more useful.

  For accessibility, PDF seems to be quite a nice choice, as long as
we see that as a representation only (and not as the information
source.) Convert the images to TIFF for example, possibly downsample,
possibly OCR and overlay it.

> But PDF literally cannot be used as a wrapper for the results, since
> it doesn't incorporate the required image compression formats. 
> This is why I use things like html structuring, wrapped as either a zip
> file or RARbook format. Because there is no other option at present.
> There will be eventually. Just not yet. PDF has to be either greatly
> extended, or replaced.

I think that PDF actually is a quite well-working output format, but
we'd see it as a compilation product of our actual source (images),
not as the final (and only) product.

> And that's why I get upset when people physically destroy rare old documents
> during or after scanning them currently. It happens so frequently, that by
> the time we have a technically adequate document coding scheme, a lot of old
> documents won't have any surviving paper copies.
> They'll be gone forever, with only really crap quality scans surviving.

:-(  Too bad, but that happens all the time.

Thanks,
  Jan-Benedict

-- 


Re: Scanning docs for bitsavers

2019-12-03 Thread ED SHARPE via cctalk
Very nice file.
Yep, we prefer PDF with OCR stuff in back.   Ed, smecc.org
In a message dated 12/2/2019 8:20:36 PM US Mountain Standard Time, cctalk@classiccmp.org writes:

I cannot understand your problems with PDF files.
I've created lots and lots of PDFs, with treated and untreated scanned
material. All of them are very readable and in use for years. Of course,
garbage in, garbage out. I take the utmost care in my scans to have good
enough source files, so I can create great PDFs.

Of course, Guy's comments are very informative and I'll learn more from them.
But I still believe in good preservation using PDF files. FOR ME it is the
best we have in encapsulating info. Forget HTMLs.

Please, take a look at this PDF, and tell me: Isn't that good enough for
preservation/use?
https://drive.google.com/file/d/0B7yahi4JC3juSVVkOEhwRWdUR1E/view

Thanks
Alexandre

---8<---Cut here---8<---
http://www.tabajara-labs.blogspot.com
http://www.tabalabs.com.br
---8<---Cut here---8<---


On Tue, Dec 3, 2019 at 00:08, Grant Taylor via cctalk <
cctalk@classiccmp.org> wrote:

> On 12/2/19 5:34 PM, Guy Dunphy via cctalk wrote:
>
> Interesting comments Guy.
>
> I'm completely naive when it comes to scanning things for preservation.
>  Your comments do pass my naive understanding.
>
> > But PDF literally cannot be used as a wrapper for the results,
> > since it doesn't incorporate the required image compression formats.
> > This is why I use things like html structuring, wrapped as either a zip
> > file or RARbook format. Because there is no other option at present.
> > There will be eventually. Just not yet. PDF has to be either greatly
> > extended, or replaced.
>
> I *HATE* doing anything with PDFs other than reading them.  My opinion
> is that PDF is where information goes to die.  Creating the PDF was the
> last time that anything other than a human could use the information as
> a unit.  Now, in the future, it's all chopped up lines of text that may
> be in a nonsensical order.  I believe it will take humans (or something
> yet to be created with human like ability) to make sense of the content
> and recreate it in a new form for further consumption.
>
> Have you done any looking at ePub?  My understanding is that they are a
> zip of a directory structure of HTML and associated files.  That sounds
> quite similar to what you're describing.
>
> > And that's why I get upset when people physically destroy rare old
> > documents during or after scanning them currently. It happens so
> > frequently, that by the time we have a technically adequate document
> > coding scheme, a lot of old documents won't have any surviving
> > paper copies.  They'll be gone forever, with only really crap quality
> > scans surviving.
>
> Fair enough.
>
>
>
> --
> Grant. . . .
> unix || die
>


Re: Scanning docs for bitsavers

2019-12-03 Thread Jan-Benedict Glaw via cctalk
Hi!
On Tue, 2019-12-03 11:34:06 +1100, Guy Dunphy via cctalk 
 wrote:
> At 01:57 PM 2/12/2019 -0700, you wrote:
> >On Tue, Nov 26, 2019 at 8:51 PM Jay Jaeger via cctalk 
> >wrote:
> >
> > > When I corresponded with Al Kossow about format several years ago, he
> > > indicated that CCITT Group 4 lossless compression was their standard.
> As for G4 bilevel encoding, the only reasons it isn't treated with the same
> disdain as JBIG2, are:
> 1. Bandwagon effect - "It must be OK because so many people use it."
> 2. People with little or zero awareness of typography, the visual quality of
>    text, and anything to do with preservation of historical character of
>    printed works. For them "I can read it OK" is the sole requirement.
> 
> G4 compression was invented for fax machines. No one cared much about visual
> quality of faxes, they just had to be readable. Also the technology of fax
> machines was only capable of two-tone B&W reproduction, so that's what G4
> encoding provided.

So it boils down to two distinct tasks:

  * Scan old paper documentation with a proven file format (i.e. no
    compression artifacts), b/w or 16 gray levels for black-and-white
    text, tables and the like.

  * Make these images accessible as useable documentation.


The first step is the one that's work-intensive; the second step can
probably be easily redone every time we "learn" something about how to
make the documents more useful.

  For accessibility, PDF seems to be quite a nice choice, as long as
we see that as a representation only (and not as the information
source.) Convert the images to TIFF for example, possibly downsample,
possibly OCR and overlay it.
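
A minimal sketch of that "compilation" step, assuming the img2pdf package
(file names made up): the lossless TIFF masters stay the archival source,
and the PDF gets regenerated from them whenever we learn how to do it
better.

    # Build a reading copy from the archival page images.
    import glob
    import img2pdf

    pages = sorted(glob.glob("masters/page_*.tif"))
    with open("manual_reading_copy.pdf", "wb") as out:
        out.write(img2pdf.convert(pages))

An OCR text layer could then be added on top of the reading copy (e.g.
with OCRmyPDF) without ever touching the masters.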

> But PDF literally cannot be used as a wrapper for the results, since
> it doesn't incorporate the required image compression formats. 
> This is why I use things like html structuring, wrapped as either a zip
> file or RARbook format. Because there is no other option at present.
> There will be eventually. Just not yet. PDF has to be either greatly
> extended, or replaced.

I think that PDF actually is a quite well-working output format, but
we'd see it as a compilation product of our actual source (images),
not as the final (and only) product.

> And that's why I get upset when people physically destroy rare old documents
> during or after scanning them currently. It happens so frequently, that by
> the time we have a technically adequate document coding scheme, a lot of old
> documents won't have any surviving paper copies.
> They'll be gone forever, with only really crap quality scans surviving.

:-(  Too bad, but that happens all the time.

Thanks,
  Jan-Benedict

-- 


Re: Scanning docs for bitsavers

2019-12-03 Thread Christian Corti via cctalk

On Mon, 2 Dec 2019, Eric Smith wrote:

There are newer bilevel encodings that are somewhat more efficient than G4
(ITU-T T.6), such as JBIG (T.82) and JBIG2 (T.88), but they are not as
widely supported, and AFAIK JBIG2 is still patent encumbered. As a result,


*NEVER* use JBIG2! I hope you know about the Xerox JBIG2 bug (e.g. making 
an 8 where there is a 6 in the original). The mere idea of duplicating 
parts of a scan into other areas is a no-go. That's not archiving, that's 
dumb. That is why using compression algorithms like the one used in JBIG2 
is discouraged or even forbidden, e.g. for legal purposes.


Christian