On Sunday, August 15, 2021 at 1:20, Alexander Schreiber wrote:

> My current toolchain for that:

Thanks; that was quite helpful. One aspect that I find of great assistance in navigating large PDF manuals is original page numbers. Often a manual will contain references, e.g., to "page 4-13" or "Appendix B-23". Having just a set of ascending integers for PDF page numbers and having to guess where Section 4 page 13 might be in that list is difficult, especially when PDF page 1 doesn't correspond to manual page 1-1 and the sections are very large. Being able to enter a referenced page number directly into a PDF reader's "go to page" dialog is very convenient.

> > http://www.leptonica.org/
>
> Thanks for the pointer, I'm going to take a look - apparently
> tesseract uses leptonica for some image processing work.

You're welcome. Yes, tesseract is one of the major users of Leptonica. When I first started using the library about ten years ago, I found the documentation very reminiscent of those school mathematics textbooks that said, "The proof is left as an exercise for the reader." There were a couple of examples on the host site but no comprehensive index of the 2500+ library routines. The approach was, "read the source," which was fine if one was familiar with image processing terms, such as affine transformations, morphology, convolution, and octcube-based color quantization. It may be better now, but it was something of an intellectual challenge at the time.

> What is that? Never heard of linearizing PDF before....

It's documented in the PDF Reference Manual from Adobe. Apparently, it's been around since PDF 1.2. The introduction to the chapter says:

    A linearized PDF file is one that has been organized in a special
    way to enable efficient incremental access in a network
    environment. The file is valid PDF in all respects, and it is
    compatible with all existing viewers and other PDF applications.
    Enhanced viewers can recognize that a PDF file has been linearized
    and can take advantage of that organization to enhance viewing
    performance.

...which, as others have mentioned, essentially is to allow page-at-a-time access via a browser without having to download the entire file first.

-- Dave
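
P.S. For anyone who wants to check whether a given file claims to be linearized: as I understand the same chapter of the PDF Reference, a linearized file starts with a "linearization parameter dictionary" that carries a /Linearized key and has to fit within the first 1024 bytes. A minimal Python sketch under that assumption follows. The function name looks_linearized is made up for illustration, and this only looks for the marker; it does not validate the hint tables or notice a file that was modified later by an incremental update.

```python
import re


def looks_linearized(data: bytes) -> bool:
    """Heuristic check for the /Linearized marker near the start of a PDF.

    Assumption: the linearization parameter dictionary, if present,
    lies entirely within the first 1024 bytes of the file.
    """
    head = data[:1024]
    if not head.startswith(b"%PDF-"):
        return False  # does not even look like a PDF header
    # The key is followed by a version number, e.g. "/Linearized 1"
    return re.search(rb"/Linearized\s*[\d.]+", head) is not None


def file_looks_linearized(path: str) -> bool:
    """Read only the first 1024 bytes of a file and apply the check."""
    with open(path, "rb") as f:
        return looks_linearized(f.read(1024))
```

Usage would be something like file_looks_linearized("manual.pdf"), where the file name is just a placeholder. A True result means the marker is there, not that the file is guaranteed to be served page-at-a-time.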
