Re: Scanning docs for bitsavers

2019-12-04 Thread Christian Corti via cctalk

Dear Mister Noname,

On Tue, 3 Dec 2019, it was written

That's _LOSSY_ JBIG2.

YOU DON"T HAVE TO USE LOSSY MODE!


Don't shout!!
And for the topic: you don't have to use JBIG2. Space isn't really an 
issue today for scanned bilevel documents, so you can just stick with TIFF 
G4 or PNG.


Christian

PS:
You should have a look at the netiquette that says "Only post with a 
name" and "Be polite, don't shout".


Re: Scanning docs for bitsavers

2019-12-03 Thread Antonio Carlini via cctalk

On 03/12/2019 20:22, Fred Cisin via cctalk wrote:


Watch out.  PDF with OCR can show you a clear and crisp  [possibly 
wrong] interpretation of the scan, not what the actual scan looked like.




The OCR may well say "0" where the printing says "8", but what your eyes 
will see will be the representation of the printing. So if you rely only 
on OCR you may well miss something, but if you fall back to the way 
you'd have to work without OCR (or even the way you'd have to work if you 
had the original paper copy), then you have to rely on your eyesight to 
fail to find what you are looking for ...



Unless, that is, you discard the graphical representation and keep only 
the OCR result. In which case all bets are off.



Antonio


--
Antonio Carlini
anto...@acarlini.com



Re: Scanning docs for bitsavers

2019-12-03 Thread Fred Cisin via cctalk

On Tue, 3 Dec 2019, Paul Koning via cctalk wrote:
The trouble (for both of these) is that many of the 
users don't know the limitations and blindly use the wrong tools.


"To the man who has a hammer, the whole world looks like a thumb."

(which is an indictment of misuse, not an indictment of hammers)


Re: Scanning docs for bitsavers

2019-12-03 Thread Fred Cisin via cctalk

   > JBIG2 .. introduces so many actual factual errors (typically
   > substituted letters and numbers)

On Tue, 3 Dec 2019, Noel Chiappa via cctalk wrote:

It's probably worth noting that there are often errors _in the original
documents_, too - so even a perfect image doesn't guarantee no errors.


. . . and how often will the randomizing corruption actually result in 
changing an error to what it SHOULD HAVE BEEN?  :-)



Although looking again at the PDF, the two digits in question are quite 
clear and crisp, and don't seem like they could be scanning errors.


Watch out.  PDF with OCR can show you a clear and crisp  [possibly wrong] 
interpretation of the scan, not what the actual scan looked like.


Re: Scanning docs for bitsavers

2019-12-03 Thread Eric Smith via cctalk
On Tue, Dec 3, 2019 at 10:59 AM Paul Berger via cctalk <
cctalk@classiccmp.org> wrote:

> Is there any way to know what compression was used in a pdf file?
>

There's not necessarily only one. Every object in a PDF file can have its
own selection of compression algorithm.

I don't know of any user-friendly way to tell. Years ago I used some really
awful programs I hacked up to inspect PDF file contents. I'm not sure I
even can find them any more.
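
For what it's worth, a sketch of a reasonably friendly way to check these days,
assuming the poppler-utils and qpdf packages are installed (the file names are
just placeholders):

  pdfimages -list scanned.pdf
  # one row per embedded image; the "enc" column shows the codec used for
  # each page image, e.g. ccitt, jbig2, jpeg, jp2k or image

  qpdf --qdf --object-streams=disable scanned.pdf expanded.pdf
  # rewrites the PDF with its object dictionaries in plain text, so the
  # /Filter entries (CCITTFaxDecode, DCTDecode, JBIG2Decode, ...) can be grepped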


Re: Scanning docs for bitsavers

2019-12-03 Thread Paul Koning via cctalk



> On Dec 3, 2019, at 12:59 PM, Paul Berger via cctalk  
> wrote:
> 
> ...
> Would TIFF G4 still be preferable to JPEG2000? It would seem I can control 
> the compression used by selecting the pdf compatibility level.

JPEG2000 apparently has a lossless mode (says Wikipedia).  If so, it would be 
acceptable as an alternative to other lossless compressions.  If used in lossy 
mode, it's not suitable for scanned documents, just as regular JPEG isn't.

paul



Re: Scanning docs for bitsavers

2019-12-03 Thread Paul Berger via cctalk



On 2019-12-02 4:57 p.m., Eric Smith via cctalk wrote:

On Tue, Nov 26, 2019 at 8:51 PM Jay Jaeger via cctalk 
wrote:


When I corresponded with Al Kossow about format several years ago, he
indicated that CCITT Group 4 lossless compression was their standard.


There are newer bilevel encodings that are somewhat more efficient than G4
(ITU-T T.6), such as JBIG (T.82) and JBIG2 (T.88), but they are not as
widely supported, and AFAIK JBIG2 is still patent encumbered. As a result,
G4 is still arguably the best bilevel encoding for general-purpose use. PDF
has natively supported G4 for ages, though it gained JBIG and JBIG2 support
in more recent versions.

Back in 2001, support for G4 encoding in open source software was really
awful; where it existed at all, it was horribly slow. There was no good
reason for G4 encoding to be slow, which was part of my motivation in
writing my own G4 encoder for tumble (an image-to-PDF utility). However, G4
support is generally much better now.


Is there any way to know what compression was used in a pdf file?

Do you know anything about a compression format JPEG2000?

Would TIFF G4 still be preferable to JPEG2000? It would seem I can 
control the compression used by selecting the pdf compatibility level.


Paul.



Re: Scanning docs for bitsavers

2019-12-03 Thread Grant Taylor via cctalk

On 12/3/19 10:30 AM, Eric Smith via cctalk wrote:
PDF was never _intended_ for documents that should undergo any further 
processing.


Okay.

Fair rebuttal.

The few things that have been hacked onto it for interactive use are 
actually the worst thing about PDF.


My opinion


Okay.

I don't have any more difficulty extracting and processing scanned page 
images out of a PDF file than any other container (e.g., TIFF).





--
Grant. . . .
unix || die


Re: Scanning docs for bitsavers

2019-12-03 Thread Paul Koning via cctalk



> On Dec 2, 2019, at 11:12 PM, Grant Taylor via cctalk  
> wrote:
> 
> On 12/2/19 9:06 PM, Grant Taylor via cctalk wrote:
>> In my opinion, PDFs are the last place that computer usable data goes. 
>> Because getting anything out of a PDF as a data source is next to impossible.
>> Sure, you, a human, can read it and consume the data.
>> Try importing a simple table from a PDF and working with the data in 
>> something like a spreadsheet.  You can't do it.  The raw data is there.  But 
>> you can't readily use it.
>> This is why I say that a PDF is the end of the line for data.
>> I view it as effectively impossible to take data out of a PDF and do 
>> anything with it without first needing to reconstitute it before I can use 
>> it.
> 
> I'll add this:
> 
> PDF is a decent page layout format.  But trying to view the contents in any 
> different layout is problematic (at best).
> 
> Trying to use the result of a page layout as a data source is ... problematic.

That's hardly surprising.  These properties are precisely the intent of PDF.  
It's basically a portable variant of PostScript, with some cleanups (relatively 
sane Unicode support, transparency, hyperlinks, a few other things).  Its 
specific purpose is to encode page images, just as they appear on actual paper. 
 Indeed, PDF is often used as a "camera ready copy" format for material going 
to a print shop.  It works quite well for that.

For scanned documents, where each page is just an image, PDF is a decent 
container format.  For documents with actual text, it's far more problematic.

Using PDF as an intermediate form is every bit as inappropriate as using JPEG 
for line art or any other application where artefacts are impermissible.  The 
trouble (for both of these) is that many of the users don't know the 
limitations and blindly use the wrong tools.

paul



Re: Scanning docs for bitsavers

2019-12-03 Thread Eric Smith via cctalk
On Mon, Dec 2, 2019 at 9:06 PM Grant Taylor via cctalk <
cctalk@classiccmp.org> wrote:

> My problem with PDFs starts where most people stop using them.
>
> Take the average PDF of text, try to copy and paste the text into a text
> file.  (That may work.)
>

Sure. Now try the same thing with a TIFF file.


Re: Scanning docs for bitsavers

2019-12-03 Thread Eric Smith via cctalk
On Tue, Dec 3, 2019 at 1:50 AM Christian Corti via cctalk <
cctalk@classiccmp.org> wrote:

> *NEVER* use JBIG2! I hope you know about the Xerox JBIG2 bug (e.g. making
>

That's _LOSSY_ JBIG2.

YOU DON"T HAVE TO USE LOSSY MODE!


Re: Scanning docs for bitsavers

2019-12-03 Thread Eric Smith via cctalk
On Mon, Dec 2, 2019 at 7:08 PM Grant Taylor via cctalk <
cctalk@classiccmp.org> wrote:

> I *HATE* doing anything with PDFs other than reading them.


PDF was never _intended_ for documents that should undergo any further
processing. The few things that have been hacked onto it for interactive
use are actually the worst thing about PDF.

> My opinion
> is that PDF is where information goes to die.  Creating the PDF was the
> last time that anything other than a human could use the information as
> a unit.


I don't have any more difficulty extracting and processing scanned page
images out of a PDF file than any other container (e.g., TIFF).
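
A minimal sketch of that sort of extraction, assuming poppler-utils is
installed (the file names are just placeholders):

  pdfimages -all manual.pdf page
  # writes page-000.*, page-001.*, ... keeping each image's native encoding
  # (CCITT, JBIG2, JPEG) where it can, and PNG otherwise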


Re: Scanning docs for bitsavers

2019-12-03 Thread Eric Smith via cctalk
On Mon, Dec 2, 2019 at 5:34 PM Guy Dunphy via cctalk 
wrote:

> Mentioning JBIG2 (or any of its predecessors) without noting that it is
> completely unacceptable as a scanned document compression scheme,
> demonstrates
> a lack of awareness of the defects it introduces in encoded documents.
>

Perhaps you are not aware that the JBIG2 standard has a lossless mode.
Certainly JBIG2 lossy mode is _extremely_ lossy, but lossless mode doesn't
have those problems.

It's entirely possible that the common JBIG2 encoders either don't offer
lossless mode, or don't make it easy to configure.

> G4 compression was invented for fax machines. No one cared much about visual
> quality of faxes, they just had to be readable. Also the technology of fax
> machines was only capable of two-tone B&W reproduction, so that's what G4
> encoding provided.
>
> Thinking these kinds of visual degradation of quality are acceptable when
> scanning documents for long term preservation, is both short sighted and
> ignorant of what can already be achieved with better technique.
>

When used at an appropriate resolution (e.g., not 100 DPI), G4 encoding is
perfectly fine for bilevel documents (text and line art) that are in good
condition. If the documents were originally bilevel but have suffered from
significant degradation in reproduction, then they are effectively no
longer bilevel, and G4 (at any resolution) is inappropriate.

> And that is why PDF isn't acceptable as a
> container for long term archiving of _scanned_ documents for historical
> purposes.
>

You state that as if it was a fact universally agreed upon, which it
clearly is not.

If you despise PDF as an archival format, by all means please feel free to
NOT avail yourself of the hundreds of thousands of pages of archives in PDF
format e.g. on Bitsavers.

I'm of course not claiming that PDF is perfect, nor is G4 encoding.


Re: Scanning docs for bitsavers

2019-12-03 Thread Noel Chiappa via cctalk
> From: Guy Dunphy

> JBIG2 .. introduces so many actual factual errors (typically
> substituted letters and numbers)

It's probably worth noting that there are often errors _in the original
documents_, too - so even a perfect image doesn't guarantee no errors.

The most recent one (of many) which I found (although I only had a PDF to
work from, so maybe it's a 'scanning induced error') is described at the
bottom here:

  https://gunkies.org/wiki/KS10

Although looking again at the PDF, the two digits in question are quite clear
and crisp, and don't seem like they could be scanning errors.

Noel


Re: Scanning docs for bitsavers

2019-12-03 Thread Guy Dunphy via cctalk
At 01:20 AM 3/12/2019 -0200, you wrote:
>I cannot understand your problems with PDF files.
>I've created lots and lots of PDFs, with treated and untreated scanned
>material. All of them are very readable and in use for years. Of course,
>garbage in, garbage out. I take the utmost care in my scans to have good
>enough source files, so I can create great PDFs.
>
>Of course, Guy's comments are very informative and I'll learn more from them.
>But I still believe in good preservation using PDF files. FOR ME it is the
>best we have in encapsulating info. Forget HTMLs.

I don't propose html as a viable alternative. It has massive inadequacies
for representing physical documents. I just use it for experimenting and
as a temporary wrapper, because it's entirely transparent and malleable,
i.e. I have total control over the result (within the bounds of what html
can do).

>Please, take a look at this PDF, and tell me: Isn't that good enough for
>preservation/use?
>https://drive.google.com/file/d/0B7yahi4JC3juSVVkOEhwRWdUR1E/view

OK, not too bad in comparison to many others. But a few comments:
* The images are fax-mode, and although the resolution is high enough for there
  to be no ambiguities, it still looks bad and stylistically greatly differs
  from the original. Pity I don't have a copy of the original, to make
  demonstration scans of a few illustrations to show what it could be like,
  for similar file size.

* The text is OCR, with a font I expect likely approximates the original
  fairly well, though I'd like to see the original. I suspect the PDF font is
  a bit 'thick' due to an incorrect gray threshold.
  Also it's searchable, except that the OCR process included paper blemishes
  as 'characters', so if you copy-paste the text elsewhere you have to
  carefully vet it. And not all searches will work.

  This is an illustration of the point that until we achieve human-level AI,
  it's never going to be possible to go from images to abstracted OCR text
  automatically without considerable human oversight and proof-reading.
  And... human-level AI won't _want_ to do drudgery like that.

* Your automated PDF generation process did a lot of silly things, like
  chaotic attempts to OCR 'elements' of diagrams. Just try moving a text
  selection box over the diagrams, you'll see what I mean. Try several
  diagrams, it's very random.

* The PCB layouts, e.g. PDF page #s 28, 29 - I bet the original used light
  shading to represent copper, and details over the copper were clearly
  visible. But when you scanned it in bi-level all that is lost. These _have_
  to be in gray scale, and preferably post-processed to posterize the flat
  shading areas (for better compression as well as visual accuracy).

* Why are all the diagram pages variously different widths? I expect the
  original pages (foldouts?) had common sizes. This variation is because
  either you didn't use a fixed recipe for scanning and processing, or your
  PDF generation utility 'handled' that automatically (and messed up).

* You don't have control of what was OCR'd and what wasn't. For instance, why
  OCR table contents, if the text selection results are garbage? For example,
  select the entire block at the bottom of PDF page 48. Does the highlighting
  create a sense of confidence this is going to work? Now copy and paste into
  a text editor. Is the result useful? (No.)
  OCR can be over-used.

* 'Ownership': as well as your introduction page, you put your tag on every
  single page. Pretty much everyone does something like this, as if by
  transcribing the source material you acquired some kind of ownership or
  bragging rights. But no, others put a very great deal of effort into
  creating that work, and you just made a digital copy. The originators would
  probably consider that an aesthetic insult to their efforts. So, why the
  proud tags everywhere?

Summary: It's fine as a working copy for practical use. Better to have made
it than not, so long as you didn't destroy the paper original in the process.
But if you're talking about an archival historical record, that someone can
look at in 500 years (or 5000) and know what the original actually looked
like, how much effort went into making that ink crisp and accurate, then no.
It's not good enough.

To be fair, I've never yet seen any PDF scan of any document that I'd
consider good enough. Works created originally in PDF as line art are a
different class, and typically OK, though some other flaws of PDF do come
into play: difficulty of content export, problems with global page
parameters, font failures, sequential vs content page numbers, etc.

With scanning there are multiple points of failure right through the whole
process at present, ranging from misunderstandings of the technology among
people doing scanning, problems with scanners (why are edge scanners so
rare!?), and lack of critical capabilities in post-processing utilities
(line art on top of ink screening, it's a nightmare,

Re: Scanning docs for bitsavers

2019-12-03 Thread ED SHARPE via cctalk
Actually, we scan to PDF with OCR text behind it, and also to plain text,
TIFF and JPEG. With the slooowww HP 11x17 scan/fax/print thing I can scan
the entire document, then save 1, save 2, save 3, save 4 without rescanning
each time.   Ed, at SMECC
In a message dated 12/3/2019 2:16:01 AM US Mountain Standard Time, 
cctalk@classiccmp.org writes:

Hi!
On Tue, 2019-12-03 11:34:06 +1100, Guy Dunphy via cctalk 
 wrote:
> At 01:57 PM 2/12/2019 -0700, you wrote:
> >On Tue, Nov 26, 2019 at 8:51 PM Jay Jaeger via cctalk 
> >wrote:
> >
> > > When I corresponded with Al Kossow about format several years ago, he
> > > indicated that CCITT Group 4 lossless compression was their standard.
> As for G4 bilevel encoding, the only reasons it isn't treated with the same
> disdain as JBIG2, are:
> 1. Bandwaggon effect - "It must be OK because so many people use it."
> 2. People with little or zero awareness of typography, the visual quality of
>    text, and anything to do with preservation of historical character of
>    printed works. For them "I can read it OK" is the sole requirement.
> 
> G4 compression was invented for fax machines. No one cared much about visual
> quality of faxes, they just had to be readable. Also the technology of fax
> machines was only capable of two-tone B&W reproduction, so that's what G4
> encoding provided.

So it boils down to two distinct tasks:

  * Scan old paper documentation with a proven file format (i.e. no
    compression artifacts; b/w or 16 gray-level for black-and-white
    text, tables and the like).

  * Make these images accessible as useable documentation.


The first step is the work-intensive one; the second step can probably
be redone easily every time we "learn" something about how to make the
documents more useful.

  For accessibility, PDF seems to be quite a nice choice, as long as
we see that as a representation only (and not as the information
source.) Convert the images to TIFF for example, possibly downsample,
possibly OCR and overlay it.

> But PDF literally cannot be used as a wrapper for the results, since
> it doesn't incorporate the required image compression formats. 
> This is why I use things like html structuring, wrapped as either a zip
> file or RARbook format. Because there is no other option at present.
> There will be eventually. Just not yet. PDF has to be either greatly
> extended, or replaced.

I think that PDF actually is a quite well-working output format, but
we should see it as a compilation product of our actual source (images),
not as the final (and only) product.

> And that's why I get upset when people physically destroy rare old documents
> during or after scanning them currently. It happens so frequently, that by
> the time we have a technically adequate document coding scheme, a lot of old
> documents won't have any surviving paper copies.
> They'll be gone forever, with only really crap quality scans surviving.

:-(  Too bad, but that happens all the time.

Thanks,
  Jan-Benedict

-- 


Re: Scanning docs for bitsavers

2019-12-03 Thread ED SHARPE via cctalk
Very nice file.
Yep, we prefer PDF with OCR text behind it.   Ed, smecc.org

In a message dated 12/2/2019 8:20:36 PM US Mountain Standard Time, cctalk@classiccmp.org writes:

I cannot understand your problems with PDF files.
I've created lots and lots of PDFs, with treated and untreated scanned
material. All of them are very readable and in use for years. Of course,
garbage in, garbage out. I take the utmost care in my scans to have good
enough source files, so I can create great PDFs.

Of course, Guy's comments are very informative and I'll learn more from them.
But I still believe in good preservation using PDF files. FOR ME it is the
best we have in encapsulating info. Forget HTMLs.

Please, take a look at this PDF, and tell me: Isn't that good enough for
preservation/use?
https://drive.google.com/file/d/0B7yahi4JC3juSVVkOEhwRWdUR1E/view

Thanks
Alexandre

---8<---Corte aqui---8<---
http://www.tabajara-labs.blogspot.com
http://www.tabalabs.com.br
---8<---Corte aqui---8<---


On Tue, 3 Dec 2019 at 00:08, Grant Taylor via cctalk <
cctalk@classiccmp.org> wrote:

> On 12/2/19 5:34 PM, Guy Dunphy via cctalk wrote:
>
> Interesting comments Guy.
>
> I'm completely naive when it comes to scanning things for preservation.
>  Your comments do pass my naive understanding.
>
> > But PDF literally cannot be used as a wrapper for the results,
> > since it doesn't incorporate the required image compression formats.
> > This is why I use things like html structuring, wrapped as either a zip
> > file or RARbook format. Because there is no other option at present.
> > There will be eventually. Just not yet. PDF has to be either greatly
> > extended, or replaced.
>
> I *HATE* doing anything with PDFs other than reading them.  My opinion
> is that PDF is where information goes to die.  Creating the PDF was the
> last time that anything other than a human could use the information as
> a unit.  Now, in the future, it's all chopped up lines of text that may
> be in a nonsensical order.  I believe it will take humans (or something
> yet to be created with human like ability) to make sense of the content
> and recreate it in a new form for further consumption.
>
> Have you done any looking at ePub?  My understanding is that they are a
> zip of a directory structure of HTML and associated files.  That sounds
> quite similar to what you're describing.
>
> > And that's why I get upset when people physically destroy rare old
> > documents during or after scanning them currently. It happens so
> > frequently, that by the time we have a technically adequate document
> > coding scheme, a lot of old documents won't have any surviving
> > paper copies.  They'll be gone forever, with only really crap quality
> > scans surviving.
>
> Fair enough.
>
>
>
> --
> Grant. . . .
> unix || die
>


Re: Scanning docs for bitsavers

2019-12-03 Thread Jan-Benedict Glaw via cctalk
Hi!
On Tue, 2019-12-03 11:34:06 +1100, Guy Dunphy via cctalk 
 wrote:
> At 01:57 PM 2/12/2019 -0700, you wrote:
> >On Tue, Nov 26, 2019 at 8:51 PM Jay Jaeger via cctalk 
> >wrote:
> >
> > > When I corresponded with Al Kossow about format several years ago, he
> > > indicated that CCITT Group 4 lossless compression was their standard.
> As for G4 bilevel encoding, the only reasons it isn't treated with the same
> disdain as JBIG2, are:
> 1. Bandwagon effect - "It must be OK because so many people use it."
> 2. People with little or zero awareness of typography, the visual quality of
>text, and anything to do with preservation of historical character of
>printed works. For them "I can read it OK" is the sole requirement.
> 
> G4 compression was invented for fax machines. No one cared much about visual
> quality of faxes, they just had to be readable. Also the technology of fax
> machines was only capable of two-tone B&W reproduction, so that's what G4
> encoding provided.

So it boils down to two distinct tasks:

  * Scan old paper documentation with a proven file format (i.e. no
    compression artifacts; b/w or 16 gray-level for black-and-white
    text, tables and the like).

  * Make these images accessible as useable documentation.


The first step is the work-intensive one; the second step can probably
be redone easily every time we "learn" something about how to make the
documents more useful.

  For accessibility, PDF seems to be quite a nice choice, as long as
we see that as a representation only (and not as the information
source.) Convert the images to TIFF for example, possibly downsample,
possibly OCR and overlay it.
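
A rough sketch of that accessibility step with common tools (ImageMagick,
img2pdf and ocrmypdf; the file names and the 300 dpi figure are placeholder
assumptions, not a recommendation):

  convert master_0001.tif -resample 300 access_0001.tif  # downsample the archival master
  img2pdf access_*.tif -o access.pdf                      # wrap the page images in a PDF, losslessly
  ocrmypdf access.pdf access_ocr.pdf                      # overlay a searchable OCR text layer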

> But PDF literally cannot be used as a wrapper for the results, since
> it doesn't incorporate the required image compression formats. 
> This is why I use things like html structuring, wrapped as either a zip
> file or RARbook format. Because there is no other option at present.
> There will be eventually. Just not yet. PDF has to be either greatly
> extended, or replaced.

I think that PDF actually is a quite well-working output format, but
we should see it as a compilation product of our actual source (images),
not as the final (and only) product.

> And that's why I get upset when people physically destroy rare old documents
> during or after scanning them currently. It happens so frequently, that by
> the time we have a technically adequate document coding scheme, a lot of old
> documents won't have any surviving paper copies.
> They'll be gone forever, with only really crap quality scans surviving.

:-(  Too bad, but that happens all the time.

Thanks,
  Jan-Benedict

-- 


Re: Scanning docs for bitsavers

2019-12-03 Thread Christian Corti via cctalk

On Mon, 2 Dec 2019, Eric Smith wrote:

There are newer bilevel encodings that are somewhat more efficient than G4
(ITU-T T.6), such as JBIG (T.82) and JBIG2 (T.88), but they are not as
widely supported, and AFAIK JBIG2 is still patent encumbered. As a result,


*NEVER* use JBIG2! I hope you know about the Xerox JBIG2 bug (e.g. making 
an 8 where there is a 6 in the original). The very idea of duplicating 
parts of a scan into other areas is a no-go. That's not archiving, that's 
dumb. That is why using compression algorithms like the one used in JBIG2 
is discouraged, or even forbidden, e.g. in legal matters.


Christian


Re: Scanning docs for bitsavers

2019-12-02 Thread Grant Taylor via cctalk

On 12/2/19 9:06 PM, Grant Taylor via cctalk wrote:
In my opinion, PDFs are the last place that computer usable data goes. 
Because getting anything out of a PDF as a data source is next to 
impossible.


Sure, you, a human, can read it and consume the data.

Try importing a simple table from a PDF and working with the data in 
something like a spreadsheet.  You can't do it.  The raw data is there. 
 But you can't readily use it.


This is why I say that a PDF is the end of the line for data.

I view it as effectively impossible to take data out of a PDF and do 
anything with it without first needing to reconstitute it before I can 
use it.


I'll add this:

PDF is a decent page layout format.  But trying to view the contents in 
any different layout is problematic (at best).


Trying to use the result of a page layout as a data source is ... 
problematic.




--
Grant. . . .
unix || die





--
Grant. . . .
unix || die


Re: Scanning docs for bitsavers

2019-12-02 Thread Grant Taylor via cctalk

On 12/2/19 8:20 PM, Alexandre Souza via cctalk wrote:

I cannot understand your problems with PDF files.


My problem with PDFs starts where most people stop using them.

Take the average PDF of text, try to copy and paste the text into a text 
file.  (That may work.)


Now try to edit a piece of the text, such as taking part of a line out, 
or adding to a line.  (You can probably do that too.)


Now fix the line wrapping to get the margins back to where they should 
be.  (This will likely be a nightmare without a good text editor to 
reflow the text.)


All of the text I get out of PDFs is (at best) discrete lines that are 
unassociated with other lines.  They just happen to be next to each other.


Conversely, if I copy text off of a web page or out of many programs, I 
can paste into an editor, make my desired changes, and the line 
re-wrapping is already done for me.  This works for non-PDF sources 
because it's a continuous line of text that can be re-wrapped and re-used.


In my opinion, PDFs are the last place that computer usable data goes. 
Because getting anything out of a PDF as a data source is next to 
impossible.


Sure, you, a human, can read it and consume the data.

Try importing a simple table from a PDF and working with the data in 
something like a spreadsheet.  You can't do it.  The raw data is there. 
  But you can't readily use it.


This is why I say that a PDF is the end of the line for data.

I view it as effectively impossible to take data out of a PDF and do 
anything with it without first needing to reconstitute it before I can 
use it.



I've created lots and lots of PDFs, with treated and untreated scanned
material. All of them are very readable and in use for years.


Sure, you, a human, can quite easily read it.  But you are not 
processing the data the way that I'm talking about.



Of course, garbage in, garbage out.


I'm not talking about GIGO.

I take the utmost care in my scans to have good enough source files, 
so I can create great PDFs.


Of course, Guy's comments are very informative and I'll learn more from them.
But I still believe in good preservation using PDF files. FOR ME it is the
best we have in encapsulating info. Forget HTMLs.


I find HTML to be IMMENSELY easier to extract data from.


Please, take a look at this PDF, and tell me: Isn't that good enough for
preservation/use?


It's good enough for humans to use.

But it suffers from the same problem that I'm describing.

Try copying the text and pasting it into a wider or narrower document. 
What happens to the line wrapping or margins?  Based on my experience, 
they are crap.


With HTML, I can copy content and paste it into a wider or narrower 
window without any problem.


Data is originated somewhere.  Something is done to it.  It's 
manipulated, reformatted, processed, displayed and / or printed, and 
ultimately consumed.  In my experience, PDF files are the end of that 
chain.  There is no good way to get text out of a PDF.


Take (part of) the first paragraph of your sample PDF:  What's easier to 
re-use in a new document:


This (direct copy and paste):
--8<--
Os transceptores Control modelo TAC-45 (versão de 10 a 45 Watts) e 
TAC-70 (versão de 10 a
70 Watts) foram um marco na radiocomunicação comercial brasileira. 
Lançados em 1983,
consistiam num transceptor dividido em dois blocos: o corpo do rádio e 
um cabeçote de
comando, onde ficavam os comandos de volume, squelch e o seletor de 4 
canais.

-->8--

Or this:
--8<--
Os transceptores Control modelo TAC-45 (versão de 10 a 45 Watts) e 
TAC-70 (versão de 10 a 70 Watts) foram um marco na radiocomunicação 
comercial brasileira. Lançados em 1983, consistiam num transceptor 
dividido em dois blocos: o corpo do rádio e um cabeçote de comando, onde 
ficavam os comandos de volume, squelch e o seletor de 4 canais.

-->8--

With format=flowed, the second copy will re-scale to any window width. I 
can also triple click to select the entire paragraph, something I can't 
do with the first copy.  Heck, I can't even reliably do anything with a 
sentence in the first copy.  It's all broken lines.  The second copy is 
a continuous string that makes up (part of) the paragraph.


Which format would you like to work with if you need to extract text 
from a file and use it in something else?  Something where you have to 
repair the damage introduced by the file format?  Or something that 
preserves the text's integrity?




--
Grant. . . .
unix || die





--
Grant. . . .
unix || die


Re: Scanning docs for bitsavers

2019-12-02 Thread Alexandre Souza via cctalk
I cannot understand your problems with PDF files.
I've created lots and lots of PDFs, with treated and untreated scanned
material. All of them are very readable and in use for years. Of course,
garbage in, garbage out. I take the utmost care in my scans to have good
enough source files, so I can create great PDFs.

Of course, Guy's comments are very informative and I'll learn more from them.
But I still believe in good preservation using PDF files. FOR ME it is the
best we have in encapsulating info. Forget HTMLs.

Please, take a look at this PDF, and tell me: Isn't that good enough for
preservation/use?
https://drive.google.com/file/d/0B7yahi4JC3juSVVkOEhwRWdUR1E/view

Thanks
Alexandre

---8<---Corte aqui---8<---
http://www.tabajara-labs.blogspot.com
http://www.tabalabs.com.br
---8<---Corte aqui---8<---


On Tue, 3 Dec 2019 at 00:08, Grant Taylor via cctalk <
cctalk@classiccmp.org> wrote:

> On 12/2/19 5:34 PM, Guy Dunphy via cctalk wrote:
>
> Interesting comments Guy.
>
> I'm completely naive when it comes to scanning things for preservation.
>   Your comments do pass my naive understanding.
>
> > But PDF literally cannot be used as a wrapper for the results,
> > since it doesn't incorporate the required image compression formats.
> > This is why I use things like html structuring, wrapped as either a zip
> > file or RARbook format. Because there is no other option at present.
> > There will be eventually. Just not yet. PDF has to be either greatly
> > extended, or replaced.
>
> I *HATE* doing anything with PDFs other than reading them.  My opinion
> is that PDF is where information goes to die.  Creating the PDF was the
> last time that anything other than a human could use the information as
> a unit.  Now, in the future, it's all chopped up lines of text that may
> be in a nonsensical order.  I believe it will take humans (or something
> yet to be created with human like ability) to make sense of the content
> and recreate it in a new form for further consumption.
>
> Have you done any looking at ePub?  My understanding is that they are a
> zip of a directory structure of HTML and associated files.  That sounds
> quite similar to what you're describing.
>
> > And that's why I get upset when people physically destroy rare old
> > documents during or after scanning them currently. It happens so
> > frequently, that by the time we have a technically adequate document
> > coding scheme, a lot of old documents won't have any surviving
> > paper copies.  They'll be gone forever, with only really crap quality
> > scans surviving.
>
> Fair enough.
>
>
>
> --
> Grant. . . .
> unix || die
>


Re: Scanning docs for bitsavers

2019-12-02 Thread Grant Taylor via cctalk

On 12/2/19 5:34 PM, Guy Dunphy via cctalk wrote:

Interesting comments Guy.

I'm completely naive when it comes to scanning things for preservation. 
 Your comments do pass my naive understanding.


But PDF literally cannot be used as a wrapper for the results, 
since it doesn't incorporate the required image compression formats. 
This is why I use things like html structuring, wrapped as either a zip 
file or RARbook format. Because there is no other option at present. 
There will be eventually. Just not yet. PDF has to be either greatly 
extended, or replaced.


I *HATE* doing anything with PDFs other than reading them.  My opinion 
is that PDF is where information goes to die.  Creating the PDF was the 
last time that anything other than a human could use the information as 
a unit.  Now, in the future, it's all chopped up lines of text that may 
be in a nonsensical order.  I believe it will take humans (or something 
yet to be created with human like ability) to make sense of the content 
and recreate it in a new form for further consumption.


Have you done any looking at ePub?  My understanding is that they are a 
zip of a directory structure of HTML and associated files.  That sounds 
quite similar to what you're describing.


And that's why I get upset when people physically destroy rare old 
documents during or after scanning them currently. It happens so 
frequently, that by the time we have a technically adequate document 
coding scheme, a lot of old documents won't have any surviving 
paper copies.  They'll be gone forever, with only really crap quality 
scans surviving.


Fair enough.



--
Grant. . . .
unix || die


Re: Scanning docs for bitsavers

2019-12-02 Thread Guy Dunphy via cctalk
At 01:57 PM 2/12/2019 -0700, you wrote:
>On Tue, Nov 26, 2019 at 8:51 PM Jay Jaeger via cctalk 
>wrote:
>
>> When I corresponded with Al Kossow about format several years ago, he
>> indicated that CCITT Group 4 lossless compression was their standard.
>>
>
>There are newer bilevel encodings that are somewhat more efficient than G4
>(ITU-T T.6), such as JBIG (T.82) and JBIG2 (T.88), but they are not as
>widely supported, and AFAIK JBIG2 is still patent encumbered. As a result,
>G4 is still arguably the best bilevel encoding for general-purpose use. PDF
>has natively supported G4 for ages, though it gained JBIG and JBIG2 support
>in more recent versions.
>
>Back in 2001, support for G4 encoding in open source software was really
>awful; where it existed at all, it was horribly slow. There was no good
>reason for G4 encoding to be slow, which was part of my motivation in
>writing my own G4 encoder for tumble (an image-to-PDF utility). However, G4
>support is generally much better now.



Mentioning JBIG2 (or any of its predecessors) without noting that it is
completely unacceptable as a scanned document compression scheme, demonstrates
a lack of awareness of the defects it introduces in encoded documents.
See http://everist.org/NobLog/20131122_an_actual_knob.htm#jbig2
JBIG2 typically produces visually appalling results, and also introduces so
many actual factual errors (typically substituted letters and numbers) that
documents encoded with it have been ruled inadmissible as evidence in court.
Sucks to be an engineering or financial institution, which scanned all its
archives with JBIG2 then shredded the paper originals to save space.
The fuzziness of JBIG is adjustable, but fundamentally there will always
be some degree of visible patchiness and risk of incorrect substitution.

As for G4 bilevel encoding, the only reasons it isn't treated with the same
disdain as JBIG2 are:
1. Bandwagon effect - "It must be OK because so many people use it."
2. People with little or zero awareness of typography, the visual quality of
   text, and anything to do with preservation of historical character of
   printed works. For them "I can read it OK" is the sole requirement.

G4 compression was invented for fax machines. No one cared much about visual
quality of faxes, they just had to be readable. Also the technology of fax
machines was only capable of two-tone B&W reproduction, so that's what G4
encoding provided.

Thinking these kinds of visual degradation of quality are acceptable when
scanning documents for long term preservation, is both short sighted and
ignorant of what can already be achieved with better technique.

For example, B&W text and line diagram material can be presented very nicely
using 16-level gray shading. That's enough to visually preserve all the
line and edge quality. The PNG compression scheme provides a color indexed
4 bits/pixel format, combined with PNG's run-length coding. When documents
are scanned with sensible thresholds plus post-processed to ensure all white
paper is actually #FF, and solid blacks are actually #0, but edges retain
adequate gray shading, PNG achieves an excellent level of filesize compression.
The visual results are _far_ superior to G4 and JBIG2 coding, and surprisingly
the file sizes can actually be smaller. It's easy to achieve on-screen results
that are visually indistinguishable from looking at the paper original, with
quite acceptable filesizes.
And that's the way it should be.
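
A rough sketch of that sort of post-processing, using ImageMagick's convert
(the threshold percentages are per-document judgment calls and the file names
are placeholders):

  # pull near-white paper to pure white and near-black ink to pure black,
  # keep the gray edge pixels, and write an indexed PNG with at most 16 levels
  convert page-gray.tif -colorspace Gray -level 10%,90% -colors 16 -depth 4 page.png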

Which brings us to PDF, which most people love because they use it all the
time, have never looked into the details of its internals, and can't imagine
anything better.
Just one point here. PDF does not support PNG image encoding. *All* the
image compression schemes PDF does support are flawed in various cases.
But because PDF structuring is opaque to users, very few are aware of
this and its other problems. And that is why PDF isn't acceptable as a
container for long term archiving of _scanned_ documents for historical
purposes. Even though PDF was at least extended to include an 'archival'
form in which all the font definitions must be included.

When I scan things I'm generally doing it in an experimental sense,
still exploring solutions to various issues such as the best way to deal
with screened print images and cases where ink screening for tonal images
has been overlaid with fine detail line art and text. Which makes processing
to a high quality digital image quite difficult.

But PDF literally cannot be used as a wrapper for the results, since
it doesn't incorporate the required image compression formats. 
This is why I use things like html structuring, wrapped as either a zip
file or RARbook format. Because there is no other option at present.
There will be eventually. Just not yet. PDF has to be either greatly
extended, or replaced.

And that's why I get upset when people physically destroy rare old documents
during or after scanning them currently. It happens so frequently, that by

Re: Scanning docs for bitsavers

2019-12-02 Thread Eric Smith via cctalk
On Tue, Nov 26, 2019 at 8:51 PM Jay Jaeger via cctalk 
wrote:

> When I corresponded with Al Kossow about format several years ago, he
> indicated that CCITT Group 4 lossless compression was their standard.
>

There are newer bilevel encodings that are somewhat more efficient than G4
(ITU-T T.6), such as JBIG (T.82) and JBIG2 (T.88), but they are not as
widely supported, and AFAIK JBIG2 is still patent encumbered. As a result,
G4 is still arguably the best bilevel encoding for general-purpose use. PDF
has natively supported G4 for ages, though it gained JBIG and JBIG2 support
in more recent versions.

Back in 2001, support for G4 encoding in open source software was really
awful; where it existed at all, it was horribly slow. There was no good
reason for G4 encoding to be slow, which was part of my motivation in
writing my own G4 encoder for tumble (an image-to-PDF utility). However, G4
support is generally much better now.


Re: Scanning docs for bitsavers

2019-11-27 Thread Alexandre Souza via cctalk
>My recommendation: use a proper multi-function copier (the big copiers)
>that can also scan to network. I currently use our big Konica-Minolta

I've got a Lexmark X646E full duplex printer/scanner. I'm still learning
how to use it to its max, but I believe I'll scan TONS of documents I have
stored at home as soon as I learn how to PROPERLY do that.

Interesting machine. Very cheap, but with a very fast ADF duplex scanner. I
just need to learn all the adjustments and fix the ADF rollers.

Sent from my Tele-Movel

On Wed, Nov 27, 2019, 13:12 Christian Corti via cctalk <
cctalk@classiccmp.org> wrote:

> On Wed, 27 Nov 2019, mloe...@cpumagic.scol.pa.us wrote:
> > On Wed, 27 Nov 2019, Noel Chiappa via cctalk wrote:
> >> That's what I use too; it has tons of useful features, including being able
> >> to drive my single-sided page-feed scanner and being able to number the
> >> even-sided pages correctly. The one I use for this is the 'batch mode'; I can
> >> do the entire document into CCITT 4 in one operation.
> >
> >   For scanning software, I highly recommend VueScan:
> >
> > https://www.hamrick.com/
> >
> >   There are Linux, Windows and Mac versions, and it supports thousands of
> > scanner models, including some very old ones. VueScan can also do CCITT G4
> > compression, and directly create PDF files. If you buy the pro version,
> > updates are free. I've been using it for years.
>
> My recommendation: use a proper multi-function copier (the big copiers)
> that can also scan to network. I currently use our big Konica-Minolta
> bizhub 754. Although it's a b/w copier, it can also scan in color. This
> machine scans a two-sided page without flipping the paper, resolution
> 600dpi, color/bw, and I scan to TIFF multipage images (sometimes I use
> JPEG for color pages). No problems scanning a batch of A3 schematics ;-)
>
> Then I use tumble (either directly on the generated .tif or after
> tiffsplit and rearranging pages) and ocrmypdf to produce the PDF file.
> I guess my setup is much faster than Al's ;-)))
>
> Christian
>


Re: Scanning docs for bitsavers

2019-11-27 Thread Jason T via cctalk
On Wed, Nov 27, 2019 at 2:01 PM Paul Koning  wrote:

> Another problem with bilevel scans is that, on some machines at least, they 
> can be very noisy.  That's what I saw on the copier/scanner at the office.  
> For good scans I use gray scale scanning, with post-processing if desired to 
> convert to clean bilevel data, without all the noise.  Not only does it make 
> looking at the material more pleasant, but it also makes the files much 
> smaller -- noise doesn't compress well.

That's a good point.  With my Fujitsu's TWAIN driver, I've got a
choice of 2 or 3 scan algorithms, plus some sliders to tweak, and I
get good b/w scans.  With fewer options, a grey scan + post-processing
(another vote for IrfanView's batch mode) is the way to go.


Re: Scanning docs for bitsavers

2019-11-27 Thread Christian Corti via cctalk

On Wed, 27 Nov 2019, Paul Koning wrote:

On Nov 27, 2019, at 2:56 PM, Jason T via cctalk  wrote:

On Wed, Nov 27, 2019 at 10:12 AM Christian Corti via cctalk
 wrote:

My recommendation: use a proper multi-function copier (the big copiers)
that can also scan to network. I currently use our big Konica-Minolta
bizhub 754. Although it's a b/w copier, it can also scan in color. This


These are great for cranking through big stacks of paper, but watch
out for the presets.  Some (like the older Xerox I used to use at
work) would scan to "TIFF"...which was really a JPG-compressed image
in a TIFF wrapper.  So even my 600dpi bilevel scans would have
compression artifacts.


Another problem with bilevel scans is that, on some machines at least, 
they can be very noisy.  That's what I saw on the copier/scanner at the 
office.  For good scans I use gray scale scanning, with post-processing 
if desired to convert to clean bilevel data, without all the noise. 
Not only does it make looking at the material more pleasant, but it also 
makes the files much smaller -- noise doesn't compress well.


You both are right, and I had to make the proper settings for getting 
clean b/w TIFF files without noise in black parts, but it is possible. And 
the bilevel TIFF files generated are G4 compressed.


Christian


Re: Scanning docs for bitsavers

2019-11-27 Thread Jason T via cctalk
On Wed, Nov 27, 2019 at 10:12 AM Christian Corti via cctalk
 wrote:
> My recommendation: use a proper multi-function copier (the big copiers)
> that can also scan to network. I currently use our big Konica-Minolta
> bizhub 754. Although it's a b/w copier, it can also scan in color. This

These are great for cranking through big stacks of paper, but watch
out for the presets.  Some (like the older Xerox I used to use at
work) would scan to "TIFF"...which was really a JPG-compressed image
in a TIFF wrapper.  So even my 600dpi bilevel scans would have
compression artifacts.
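
One quick way to catch that, assuming libtiff's tools are installed (the file
name is a placeholder):

  tiffinfo scan_0001.tif | grep -i compression
  # "Compression Scheme: CCITT Group 4" is what you want for bilevel pages;
  # "Compression Scheme: JPEG" means the machine quietly re-encoded the scan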

All manufacturers are different, YMMV, etc.  Just something I got
burned on and worth watching out for.

j


Re: Scanning docs for bitsavers

2019-11-27 Thread Christian Corti via cctalk

On Wed, 27 Nov 2019, mloe...@cpumagic.scol.pa.us wrote:

On Wed, 27 Nov 2019, Noel Chiappa via cctalk wrote:

That's what I use too; it has tons of useful features, including being able
to drive my single-sided page-feed scanner and being able to number the
even-sided pages correctly. The one I use for this is the 'batch mode'; I can
do the entire document into CCITT 4 in one operation.


  For scanning software, I highly recommend VueScan:

https://www.hamrick.com/

  There are Linux, Windows and Mac versions, and it supports thousands of 
scanner models, including some very old ones. VueScan can also do CCITT G4 
compression, and directly create PDF files. If you buy the pro version, 
updates are free. I've been using it for years.


My recommendation: use a proper multi-function copier (the big copiers) 
that can also scan to network. I currently use our big Konica-Minolta 
bizhub 754. Although it's a b/w copier, it can also scan in color. This 
machine scans a two-sided page without flipping the paper, resolution 
600dpi, color/bw, and I scan to TIFF multipage images (sometimes I use 
JPEG for color pages). No problems scanning a batch of A3 schematics ;-)


Then I use tumble (either directly on the generated .tif or after 
tiffsplit and rearranging pages) and ocrmypdf to produce the PDF file.
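
A sketch of that sort of pipeline (with img2pdf shown as a stand-in for tumble;
the file names are placeholders):

  tiffsplit batch.tif page_          # split into page_aaa.tif, page_aab.tif, ...
  img2pdf page_*.tif -o draft.pdf    # wrap the page images in a PDF without loss
  ocrmypdf draft.pdf manual.pdf      # add a searchable OCR text layer over the images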

I guess my setup is much faster than Al's ;-)))

Christian


Re: Scanning docs for bitsavers

2019-11-27 Thread Mike Loewen via cctalk

On Wed, 27 Nov 2019, Noel Chiappa via cctalk wrote:


   > From: Jay Jaeger

   > CCITT Group 4 lossless compression

That's very good indeed. I scan text pages in B+W at slightly less resolution
(engineering prints I do higher, they need it), but compressed they turn out
to be ~50KB per page, or less - for long documents (e.g. the DOS-11 System
Programmer's Manual), that produces a reasonably-sized file.

   > The software I have been using i[s] Irfanview.

That's what I use too; it has tons of useful features, including being able
to drive my single-sided page-feed scanner and being able to number the
even-sided pages correctly. The one I use for this is the 'batch mode'; I can
do the entire document into CCITT 4 in one operation.


   For scanning software, I highly recommend VueScan:

https://www.hamrick.com/

   There are Linux, Windows and Mac versions, and it supports thousands of 
scanner models, including some very old ones. VueScan can also do CCITT G4 
compression, and directly create PDF files. If you buy the pro version, 
updates are free. I've been using it for years.



Mike Loewen mloe...@cpumagic.scol.pa.us
Old Technology  http://q7.neurotica.com/Oldtech/


Re: Scanning docs for bitsavers

2019-11-27 Thread Noel Chiappa via cctalk
> From: Jay Jaeger

> CCITT Group 4 lossless compression

That's very good indeed. I scan text pages in B+W at slightly less resolution
(engineering prints I do higher, they need it), but compressed they turn out
to be ~50KB per page, or less - for long documents (e.g. the DOS-11 System
Programmer's Manual), that produces a reasonably-sized file.

> The software I have been using i[s] Irfanview.

That's what I use too; it has tons of useful features, including being able
to drive my single-sided page-feed scanner and being able to number the
even-sided pages correctly. The one I use for this is the 'batch mode'; I can
do the entire document into CCITT 4 in one operation.

Noel


Re: Scanning docs for bitsavers

2019-11-26 Thread Dennis Boone via cctalk
> As far as multi-page documents, it seems as if my scanner (or its
> software) only does uncompressed TIFF. At bitsaver's recommended 400
> dpi, that means about 4M per page.

If you're on unix of some sort, the libtiff tools can convert these
uncompressed images to G4.  The command you'd use would be tiffcp.
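
A minimal example (the input needs to be a true bilevel/1-bit image for G4; the
file names are placeholders):

  tiffcp -c g4 scan-uncompressed.tif scan-g4.tif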

De


Re: Scanning docs for bitsavers

2019-11-26 Thread Al Kossow via cctalk



On 11/26/19 7:10 PM, Alexandre Souza wrote:
> Al, is there a "standard" you would recommend us mere mortals to scan and 
> archive docs?

I've moved to 600dpi bi-tonal tiffs for all new text work since that is the
maximum resolution my Panasonic KV-S3065 scanner supports. I use a flatbed at
300dpi jpeg for pages with images.

I'm not going to discuss the pros and cons of my work. This is what my
workflow is and is unlikely to change any time soon.




Re: Scanning docs for bitsavers

2019-11-26 Thread Jay Jaeger via cctalk
On 11/26/2019 8:52 PM, Alan Perry via cctalk wrote:
> 
> I am going through stuff in my office and found that I have some SCSI
> device docs that aren't on bitsavers. As far as multi-page documents, it
> seems as if my scanner (or its software) only does uncompressed TIFF. At
> bitsaver's recommended 400 dpi, that means about 4M per page.
> 
> What should I do? Scan the docs in and find a tool to convert to
> lossless compression. Scan the docs in and just submit the huge files?
> Something else?
> 
> The docs that I have are copies, not originals. Does anyone here want
> them after I scan them?
> 
> alan
> 

This would be a function of the software, not the scanner.

When I corresponded with Al Kossow about format several years ago, he
indicated that CCITT Group 4 lossless compression was their standard.

I have been using a scanner (Ricoh IS300e) that needs a driver that
doesn't install under Windows 10, so now I use it from inside a Windows
7 VM.  The software I have been using is Irfanview.

I expect it would not be tremendously difficult to find something that
would do the compression on a raw file.

JRJ


Re: Scanning docs for bitsavers

2019-11-26 Thread Alan Perry via cctalk




On 11/26/19 7:05 PM, Chuck Guzis via cctalk wrote:

On 11/26/19 6:52 PM, Alan Perry via cctalk wrote:


I am going through stuff in my office and found that I have some SCSI
device docs that aren't on bitsavers. As far as multi-page documents, it
seems as if my scanner (or its software) only does uncompressed TIFF. At
bitsaver's recommended 400 dpi, that means about 4M per page.

What should I do? Scan the docs in and find a tool to convert to
lossless compression. Scan the docs in and just submit the huge files?
Something else?

The docs that I have are copies, not originals. Does anyone here want
them after I scan them?


Are these standards; e.g. official ANSI X3T10 docs?


No, they are vendor docs on specific devices. CDC Wren IV. Exabyte 8200. 
Adaptec AIC-6250 and AHA-1540A/1542A. Something else that I can't 
recall. Most had related docs on bitsavers, but not these specific 
documents.


alan




--Chuck



Re: Scanning docs for bitsavers

2019-11-26 Thread Alexandre Souza via cctalk
Al, is there a "standard" you would recommend us mere mortals to scan and
archive docs?

---8<---Corte aqui---8<---
http://www.tabajara-labs.blogspot.com
http://www.tabalabs.com.br
---8<---Corte aqui---8<---


On Wed, 27 Nov 2019 at 01:07, Al Kossow via cctalk <
cctalk@classiccmp.org> wrote:

> you can ftp the uncompressed files to me and I'll take care of the
> conversions
>
> On 11/26/19 6:52 PM, Alan Perry via cctalk wrote:
> >
> > I am going through stuff in my office and found that I have some SCSI
> device docs that aren't on bitsavers. As far as
> > multi-page documents, it seems as if my scanner (or its software) only
> does uncompressed TIFF. At bitsaver's recommended
> > 400 dpi, that means about 4M per page.
> >
> > What should I do? Scan the docs in and find a tool to convert to
> lossless compression. Scan the docs in and just submit
> > the huge files? Something else?
> >
> > The docs that I have are copies, not originals. Does anyone here want
> them after I scan them?
> >
> > alan
>
>


Re: Scanning docs for bitsavers

2019-11-26 Thread Al Kossow via cctalk
you can ftp the uncompressed files to me and I'll take care of the conversions

On 11/26/19 6:52 PM, Alan Perry via cctalk wrote:
> 
> I am going through stuff in my office and found that I have some SCSI device 
> docs that aren't on bitsavers. As far as
> multi-page documents, it seems as if my scanner (or its software) only does 
> uncompressed TIFF. At bitsaver's recommended
> 400 dpi, that means about 4M per page.
> 
> What should I do? Scan the docs in and find a tool to convert to lossless 
> compression. Scan the docs in and just submit
> the huge files? Something else?
> 
> The docs that I have are copies, not originals. Does anyone here want them 
> after I scan them?
> 
> alan



Re: Scanning docs for bitsavers

2019-11-26 Thread Chuck Guzis via cctalk
On 11/26/19 6:52 PM, Alan Perry via cctalk wrote:
> 
> I am going through stuff in my office and found that I have some SCSI
> device docs that aren't on bitsavers. As far as multi-page documents, it
> seems as if my scanner (or its software) only does uncompressed TIFF. At
> bitsaver's recommended 400 dpi, that means about 4M per page.
> 
> What should I do? Scan the docs in and find a tool to convert to
> lossless compression. Scan the docs in and just submit the huge files?
> Something else?
> 
> The docs that I have are copies, not originals. Does anyone here want
> them after I scan them?

Are these standards; e.g. official ANSI X3T10 docs?

--Chuck