On Thu, 17 Jan 2013 07:55:09 -0600 Carson Chittom <car...@wistly.net> wrote:
> "C. Thomas Stover" <c...@thomasstover.com> writes:
>
> > Well if hardcopy means scanned paper (no ocr) then it sounds like a
> > very large binary file set.
>
> I'm showing my ignorance, but does OCR matter in this case?  We
> already have OCR capabilities, and I had intended to scan in the
> documents using it--because, why not, if you can?  I didn't think to
> mention it in my original post to the list because I didn't think it
> would change the average file size significantly.

Well, think about it like this. To get enough detail for most
purposes, these document scanners run at somewhere around 600x600 dpi.
At first you might think monochrome would work great (and it is still
used sometimes with very high-res modes), but in practice grayscale (or
color) is really needed for handwriting, old paper, charts, and all
sorts of applications. So the uncompressed bitmap for a single page can
be quite big.

So what about image/raster data compression? Well, you have either
lossless (PNG), which works great for rendered vector graphics
(diagrams, screenshots, etc.), or lossy (JPEG), which exploits the way
human vision processes color and works great for photographs. Neither
one works that well for generic pieces of paper. What ends up happening
is people just resize the image to a lower resolution, which
(especially for handwriting) can be self-defeating.

On the other hand, think how much space a page of UTF-8 text takes.
Not much. So perfect OCR (which is a virtual impossibility) would take
a 10+ MB bitmap and convert it into a 2 KB text file. The "solution"
today's technology uses is a container format like PDF, where both
images and text can be stored: the scanner software/firmware will OCR
what it can and then mix that with little cropped images. This of
course leads to "your mileage may vary" file sizes.

-- 
C. Thomas Stover
www.thomasstover.com
_______________________________________________
fossil-users mailing list
fossil-users@lists.fossil-scm.org
http://lists.fossil-scm.org:8080/cgi-bin/mailman/listinfo/fossil-users
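
For a rough sense of the numbers in the message above, here is a minimal Python sketch of the arithmetic. It assumes a US Letter page (8.5 x 11 in) and about 2000 characters of mostly ASCII text per page; neither figure comes from the thread, and the function names are made up for illustration.

```python
# Rough size arithmetic for one scanned page (illustrative assumptions only).
PAGE_INCHES = (8.5, 11.0)  # assumed US Letter; the thread does not name a page size
DPI = 600                  # scanner resolution mentioned in the message


def raw_page_bytes(dpi, bits_per_pixel, page_inches=PAGE_INCHES):
    """Uncompressed bitmap size in bytes for one page at the given depth."""
    width_px = round(page_inches[0] * dpi)
    height_px = round(page_inches[1] * dpi)
    return width_px * height_px * bits_per_pixel // 8


def text_page_bytes(chars_per_page=2000):
    """Size of a page of ASCII-only UTF-8 text (assumed ~2000 characters)."""
    return len(("x" * chars_per_page).encode("utf-8"))


if __name__ == "__main__":
    depths = [("1-bit monochrome", 1), ("8-bit grayscale", 8), ("24-bit color", 24)]
    for label, bpp in depths:
        print(f"{label}: {raw_page_bytes(DPI, bpp) / 1e6:.1f} MB")
    print(f"UTF-8 text: {text_page_bytes() / 1e3:.1f} kB")
```

Under those assumptions an 8-bit grayscale page comes to about 33.7 MB uncompressed against roughly 2 kB of text, which is in line with the "10+ MB bitmap versus 2 KB text file" comparison.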