Re: Scanning docs for bitsavers

Guy Dunphy via cctalk Tue, 03 Dec 2019 03:10:19 -0800

At 01:20 AM 3/12/2019 -0200, you wrote:
>I cannot understand your problems with PDF files.
>I've created lots and lots of PDFs, with treated and untreated scanned
>material. All of them are very readable and in use for years. Of course,
>garbage in, garbage out. I take the utmost care in my scans to have good
>enough source files, so I can create great PDFs.
>
>Of course, Guy's comments are very informative and I'll learn more from it.
>But I still believe in good preservation using PDF files. FOR ME it is the
>best we have in encapsulating info. Forget HTMLs.


I don't propose html as a viable alternative. It has massive inadequacies
for representing physical documents. I just use it for experimenting and
as a temporary wrapper, because it's entirely transparent and malleable.
I.e. I have total control over the result (within the bounds of what html
can do.)

>Please, take a look at this PDF, and tell me: Isn't that good enough for
>preservation/use?
>https://drive.google.com/file/d/0B7yahi4JC3juSVVkOEhwRWdUR1E/view

OK, not too bad in comparison to many others. But a few comments:
* The images are fax-mode (bi-level), and although the resolution is high
  enough for there to be no ambiguities, it still looks bad and
  stylistically differs greatly from the original.
  Pity I don't have a copy of the original, to make demonstration scans of
  a few illustrations to show what it could be like, for similar file size.

* The text is OCR, with a font I expect approximates the original fairly
  well, though I'd like to see the original. I suspect the PDF font is a
  bit 'thick' due to an incorrect gray threshold.
  Also it's searchable, except that the OCR process included paper
  blemishes as 'characters', so if you copy-paste the text elsewhere you
  have to carefully vet it. And not all searches will work.

  This is an illustration of the point that until we achieve human-level
  AI, it's never going to be possible to go from images to abstracted OCR
  text automatically without considerable human oversight and
  proof-reading. And... human-level AI won't _want_ to do drudgery like
  that.

* Your automated PDF generation process did a lot of silly things, like
  chaotic attempts to OCR 'elements' of diagrams. Just try moving a text
  selection box over the diagrams, you'll see what I mean. Try several
  diagrams, it's very random.

* The PCB layouts, e.g. PDF pages 28 and 29 - I bet the original used light
  shading to represent copper, and details over the copper were clearly
  visible. But when you scanned it in bi-level all that was lost. These
  _have_ to be in gray scale, and preferably post-processed to posterize
  the flat shading areas (for better compression as well as visual
  accuracy.)

* Why are all the diagram pages variously different widths? I expect the
  original pages (foldouts?) had common sizes. This variation is because
  either you didn't use a fixed recipe for scanning and processing, or
  your PDF generation utility 'handled' that automatically (and messed up.)

* You don't have control of what was OCR'd and what wasn't. For instance,
  why OCR table contents, if the text selection results are garbage? E.g.
  select the entire block at the bottom of PDF page 48. Does the
  highlighting create a sense of confidence this is going to work?
  Now copy and paste into a text editor. Is the result useful? (No.)
  OCR can be over-used.

* 'Ownership'. As well as your introduction page, you put your tag on every
  single page. Pretty much everyone does something like this. As if by
  transcribing the source material you acquired some kind of ownership or
  bragging rights. But no, others put a very great deal of effort into
  creating that work, and you just made a digital copy. One that the
  originators would probably consider an aesthetic insult to their
  efforts. So, why the proud tags everywhere?
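
The PCB point above (bi-level loses the light shading; gray scale plus
posterizing keeps it and compresses better) can be sketched in a few lines
of Python. This is only an illustration under assumptions: the pixel
values, the three target tones, the threshold of 128 and the function names
are all invented for the example, and counting runs is just a crude
stand-in for how well a flat area will compress.

```python
def bilevel(row, threshold=128):
    """Fax-style 1-bit conversion: every pixel becomes black (0) or white (255)."""
    return [0 if p < threshold else 255 for p in row]


def posterize(pixels, levels):
    """Snap each gray value (0-255) to the nearest of a few target tones."""
    return [[min(levels, key=lambda lv: abs(lv - p)) for p in row]
            for row in pixels]


def run_count(row):
    """Number of runs of identical values; fewer runs compress better."""
    return sum(1 for i, v in enumerate(row) if i == 0 or v != row[i - 1])


# Hypothetical scan line: noisy light-gray 'copper' shading with dark ink on it.
scan_row = [198, 203, 201, 196, 30, 25, 28, 204, 199, 202]
# Assumed target tones: black ink, copper shading, paper white.
LEVELS = (0, 200, 255)

flat_row = posterize([scan_row], LEVELS)[0]

print(bilevel(scan_row))  # copper shading becomes 255, same as blank paper
print(flat_row)           # copper stays a distinct, perfectly flat tone
print(run_count(scan_row), "runs before posterizing,", run_count(flat_row), "after")
```

In the bi-level row the copper is indistinguishable from paper, while the
posterized row keeps it as its own tone and collapses the scanner noise
into a few long runs.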

Summary: It's fine as a working copy for practical use. Better to have made
it than not, so long as you didn't destroy the paper original in the
process. But if you're talking about an archival historical record, that
someone can look at in 500 years (or 5000) and know what the original
actually looked like, how much effort went into making that ink crisp and
accurate, then no. It's not good enough.

To be fair, I've never yet seen any PDF scan of any document that I'd
consider good enough. Works created originally in PDF as line art are a
different class, and typically OK. Though some other flaws of PDF do come
into play: difficulty of content export, problems with global page
parameters, font failures, sequential vs content page numbers, etc.

With scanning there are multiple points of failure right through the whole
process at present, ranging from misunderstandings of the technology among
people doing scanning, problems with scanners (why are edge scanners so
rare!?), lack of critical capabilities in post-processing utilities (line
art on top of ink screening, it's a nightmare; also most people can't use
Photoshop well, and it's necessary), failings built unavoidably into PDF,
and not so great PDF viewer utilities. That's apart from the intrinsic
issues (a few advantages aside) with on-screen display and controls
compared to paper.

I hope I have not offended you. Btw my pickiness comes from growing up in a
family with commercial art, typography, printing and technical art
involvement. And having in later years assisted a little with such things.
So at least I know how much effort goes into such things.

Keep the original. Methods and utilities will improve, and in 10 or 20
years it may be possible to make a visually perfect digital copy (with
minimal effort), worthy of becoming a sole record of that thing (if history
goes that way.)

Guy
