> On Apr 9, 2020, at 10:16 AM, emanuel stiebler via cctalk
> <[email protected]> wrote:
>
> Hi All,
> somebody scanned documents for me in .pdfs.
> Looking into them, they are pages of jpgs embedded in .pdf ..
> (100 pages resulting in 350MBytes ...)
>
> Any easy way to convert them into some b/w .pdf file?
> It is all text, no drawings ...
>
> Pointers?
>
> Thanks
A good source of information is Al Kossow's Bitsavers archive, specifically
the section where he describes the tools he uses.
It's very unfortunate your original scan files are JPG; those are the wrong
format for text or line art -- JPG is ONLY for photographs and similar
continuous tone images. TIFF or PNG or B/W FAX formats are all superior, and
often more compact.
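To make the "more compact" point concrete, here is a rough sketch of the core
of a grayscale-to-b/w conversion: pick a threshold, then pack each pixel into
one bit. It assumes the page has already been decoded into rows of 0-255 gray
values (getting those out of the PDF is a separate step that needs some other
tool). The function names are made up for illustration, and Otsu's method is
just one common way to choose the threshold; I am not claiming it is what any
particular tool does.

```python
def otsu_threshold(pixels):
    """Pick the gray level (0-255) that best splits dark ink from light paper.

    pixels is a list of rows, each row a list of ints in 0..255.
    Values <= the returned level are treated as ink.
    """
    hist = [0] * 256
    for row in pixels:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_level, best_var = 0, -1.0
    w_dark, sum_dark = 0, 0
    for level in range(256):
        w_dark += hist[level]
        if w_dark == 0:
            continue          # nothing this dark yet
        w_light = total - w_dark
        if w_light == 0:
            break             # everything is already on the dark side
        sum_dark += level * hist[level]
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        # Between-class variance (up to a constant factor).
        var = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var > best_var:
            best_var, best_level = var, level
    return best_level


def to_pbm(pixels, threshold):
    """Return the image as raw PBM (P4) bytes: 1 bit per pixel, 1 = black."""
    height, width = len(pixels), len(pixels[0])
    out = bytearray(f"P4\n{width} {height}\n".encode("ascii"))
    for row in pixels:
        byte, nbits = 0, 0
        for v in row:
            byte = (byte << 1) | (1 if v <= threshold else 0)
            nbits += 1
            if nbits == 8:
                out.append(byte)
                byte, nbits = 0, 0
        if nbits:
            # PBM pads each row to a whole byte, leftmost pixel in the top bit.
            out.append(byte << (8 - nbits))
    return bytes(out)
```

One bit per pixel is an eighth of the size of 8-bit grayscale before any
compression is applied, and fax-style coders shrink text pages a good deal
further. Thresholding will not undo blur that JPG has already baked in around
the characters, so results depend on the scans.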
If by "convert to b/w" you mean to b/w images, Al's tools will help. If you
mean extracting the actual text, that's a different matter: for that you need
an OCR tool. There are good commercial OCR programs around. I don't know of a
usable open source one; I've seen one, but it didn't work well enough to be
worth the trouble. OCR may be extremely effective or not effective at all,
depending on the quality of the source material. In really extreme cases you
may have to type things in by hand; I've done that with 600 pages of blurry
listings because there was a good reason to go to that effort.
paul