On Thu, 17 Jan 2013 07:55:09 -0600
Carson Chittom <car...@wistly.net> wrote:

> "C. Thomas Stover" <c...@thomasstover.com> writes:
> 
> > Well if hardcopy means scanned paper (no ocr) then it sounds like a
> > very large binary file set. 
> 
> I'm showing my ignorance, but does OCR matter in this case?  We
> already have OCR capabilities, and I had intended to scan in the
> documents using it--because, why not, if you can?  I didn't think to
> mention it in my original post to the list because I didn't think it
> would change the average file size significantly.
> 
> 

Well, think about it like this. To get good enough detail for most
purposes, these document scanners have somewhere around 600x600 dpi
resolution. At first you might think monochrome would work great (and
it is still used sometimes with very high-res modes), but in practice
grayscale (or color) is really needed for handwriting, old paper,
charts, and all sorts of other applications. So the uncompressed bitmap
for a single page can be quite big.
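
To put rough numbers on that, here is a back-of-the-envelope sketch in
Python. The US Letter page size (8.5 x 11 in) and the bit depths are my
assumptions for illustration, not anything specific to a particular
scanner, and row padding and file headers are ignored.

```python
def uncompressed_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Raw bitmap size in bytes for a page scanned at `dpi` dots per inch.

    Ignores row padding and file headers, so real files are slightly larger.
    """
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8  # round up to whole bytes


# Assumed page: US Letter, 8.5 x 11 inches, scanned at 600 dpi.
for label, bpp in [("monochrome", 1), ("grayscale", 8), ("color", 24)]:
    size = uncompressed_bytes(8.5, 11, 600, bpp)
    print(f"{label:>10}: {size / 1e6:.1f} MB")
```

Under those assumptions a grayscale page is a few tens of megabytes
before any compression, and 24-bit color is three times that.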

So what about image/raster data compression? Well, you either have
lossless (PNG), which works great for rendered vector graphics
(diagrams, screenshots, etc.), or lossy (JPEG), which uses the
characteristics of the way human vision processes colors to work
really well for photographs. Neither one of these works that well for
generic pieces of paper. What ends up happening is people just resize
the image to a smaller resolution, which (especially for handwriting)
can be self-defeating.

On the other hand, think how much space it takes for a page of UTF-8
text. Not much. So perfect OCR (which is a virtual impossibility) would
take a 10+ MB bitmap and convert it into a 2 KB text file. The "solution"
today's technology uses is a container format like PDF, where both
images and text can be stored: the scanner software/firmware will OCR
what it can and then mix that with little cropped images. This of
course leads to "your mileage may vary" file sizes.
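
For a feel of the gap between the two extremes, a minimal sketch that
compares a page of plain text against a raw scan. The sample text, the
roughly 2,000 characters per page, and the 5100 x 6600 pixel 8-bit
grayscale scan are all assumptions picked for illustration.

```python
# Hypothetical page of plain ASCII text, roughly 2,000 characters.
page_text = "The quick brown fox jumps over the lazy dog. " * 45
text_bytes = len(page_text.encode("utf-8"))

# Assumed raw scan: US Letter at 600 dpi, 8-bit grayscale, no compression.
bitmap_bytes = 5100 * 6600 * 1

ratio = bitmap_bytes / text_bytes
print(f"text:   {text_bytes} bytes")
print(f"bitmap: {bitmap_bytes} bytes")
print(f"the bitmap is about {ratio:,.0f} times larger")
```

A real scanned PDF lands somewhere between those two numbers, depending
on how much of the page the OCR manages to turn into text.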

-- 
C. Thomas Stover
www.thomasstover.com


_______________________________________________
fossil-users mailing list
fossil-users@lists.fossil-scm.org
http://lists.fossil-scm.org:8080/cgi-bin/mailman/listinfo/fossil-users
