At 01:20 AM 3/12/2019 -0200, you wrote:
>I cannot understand your problems with PDF files.
>I've created lots and lots of PDFs, with treated and untreated scanned
>material. All of them are very readable and in use for years. Of course,
>garbage in, garbage out. I take the utmost care in my scans to have good
>enough source files, so I can create great PDFs.
>
>Of course, Guy's comments are very informative and I'll learn more from it.
>But I still believe in good preservation using PDF files. FOR ME it is the
>best we have in encapsulating info. Forget HTMLs.

I don't propose html as a viable alternative. It has massive inadequacies for representing physical documents. I just use it for experimenting and as a temporary wrapper, because it's entirely transparent and malleable, i.e. I have total control over the result (within the bounds of what html can do.)

>Please, take a look at this PDF, and tell me: Isn't that good enough for
>preservation/use?
>https://drive.google.com/file/d/0B7yahi4JC3juSVVkOEhwRWdUR1E/view

OK, not too bad in comparison to many others. But a few comments:

* The images are fax-mode, and although the resolution is high enough for there to be no ambiguities, it still looks bad and stylistically differs greatly from the original. Pity I don't have a copy of the original, to make demonstration scans of a few illustrations to show what it could be like, for similar file size.

* The text is OCR, with a font I expect approximates the original fairly well. Though I'd like to see the original. I suspect the PDF font is a bit 'thick' due to an incorrect gray threshold. Also it's searchable, except that the OCR process included paper blemishes as 'characters', so if you copy-paste the text elsewhere you have to carefully vet it. And not all searches will work. This illustrates the point that until we achieve human-level AI, it's never going to be possible to go from images to abstracted OCR text automatically without considerable human oversight and proof-reading. And... human-level AI won't _want_ to do drudgery like that.

* Your automated PDF generation process did a lot of silly things, like chaotic attempts to OCR 'elements' of diagrams. Just try moving a text selection box over the diagrams, you'll see what I mean. Try several diagrams, it's very random.

* The PCB layouts, e.g. PDF page #s 28, 29 - I bet the original used light shading to represent copper, and details over the copper were clearly visible. But when you scanned it in bi-level all that is lost.
These _have_ to be in gray scale, and preferably post-processed to posterize the flat shading areas (for better compression as well as visual accuracy.)

* Why are all the diagram pages variously different widths? I expect the original pages (foldouts?) had common sizes. This variation is because either you didn't use a fixed recipe for scanning and processing, or your PDF generation utility 'handled' that automatically (and messed up.)

* You don't have control of what was OCR'd and what wasn't. For instance, why OCR table contents, if the text selection results are garbage? E.g. select the entire block at the bottom of PDF page 48. Does the highlighting create a sense of confidence this is going to work? Now copy and paste into a text editor. Is the result useful? (No.) OCR can be over-used.

* 'ownership' As well as your introduction page, you put your tag on every single page. Pretty much everyone does something like this. As if by transcribing the source material you acquired some kind of ownership or bragging rights. But no, others put a very great deal of effort into creating that work, and you just made a digital copy. One that the originators would probably consider an aesthetic insult to their efforts. So, why the proud tags everywhere?

Summary: It's fine as a working copy for practical use. Better to have made it than not, so long as you didn't destroy the paper original in the process. But if you're talking about an archival historical record, that someone can look at in 500 years (or 5000) and know what the original actually looked like, how much effort went into making that ink crisp and accurate, then no. It's not good enough.

To be fair, I've never yet seen any PDF scan of any document that I'd consider good enough. Works created originally in PDF as line art are a different class, and typically OK. Though some other flaws of PDF do come into play.
Difficulty of content export, problems with global page parameters, font failures, sequential vs content page numbers, etc.

With scanning there are multiple points of failure right through the whole process at present, ranging from misunderstandings of the technology among people doing scanning, problems with scanners (why are edge scanners so rare!?), and lack of critical capabilities in post-processing utilities (line art on top of ink screening is a nightmare, and most people can't use Photoshop well, though it's necessary), to failings built unavoidably into PDF and not so great PDF viewer utilities. That is apart from the intrinsic issues (aside from a few advantages) with on-screen display and controls compared to paper.

I hope I have not offended you. Btw my pickiness comes from growing up in a family with commercial art, typography, printing and technical art involvement. And having in later years assisted a little with such things. So at least I know how much effort goes into such things.

Keep the original. Methods and utilities will improve, and in 10 or 20 years it may be possible to make a visually perfect digital copy (with minimal effort), worthy of becoming a sole record of that thing (if history goes that way.)

Guy
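
On the PCB layout point above (bi-level scanning versus posterized gray scale), a minimal sketch may help. It is not a description of any real scanning workflow: the "scan" is synthetic data, the helper names `posterize` and `synthetic_shaded_scan` are hypothetical, the threshold of 128 and the 4 gray levels are arbitrary assumptions, and zlib stands in for whatever lossless compression the final file would actually use.

```python
import random
import zlib


def posterize(pixels, levels):
    """Snap each 8-bit gray value to the nearest of `levels` evenly spaced values."""
    step = 255 / (levels - 1)
    return bytes(round(round(p / step) * step) for p in pixels)


def synthetic_shaded_scan(width=200, height=100, seed=1):
    """Hypothetical stand-in for a gray-scale scan of a shaded diagram.

    Three nominally flat regions (paper, light shading, ink), each with a
    little simulated scanner noise added.
    """
    rng = random.Random(seed)
    pixels = bytearray()
    for _ in range(height):
        for x in range(width):
            if x < 80:
                base = 245   # paper
            elif x < 160:
                base = 170   # light shading, e.g. copper areas
            else:
                base = 10    # ink
            pixels.append(max(0, min(255, base + rng.randint(-6, 6))))
    return bytes(pixels)


raw = synthetic_shaded_scan()

# Bi-level: everything at or above the (assumed) threshold becomes white,
# so the light shading turns into plain paper and is lost.
bilevel = bytes(255 if p >= 128 else 0 for p in raw)

# Posterized gray scale: the noise is flattened but the shading survives.
flat = posterize(raw, levels=4)

raw_size = len(zlib.compress(raw, 9))
flat_size = len(zlib.compress(flat, 9))
print(f"noisy gray scale: {raw_size} bytes, posterized: {flat_size} bytes")
print("gray values after bi-level:", sorted(set(bilevel)))
print("gray values after posterizing:", sorted(set(flat)))
```

In this toy case the bi-level version keeps only black and white, while the posterized version keeps the shading as a third flat tone and compresses far better than the noisy gray-scale data. Real scans have uneven illumination, halftone screening and soft edges, so choosing levels and cleaning up shading takes much more care than this.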