<subject line changed>

On Mon, Jun 22, 2009 at 12:55 AM, Anthony <[email protected]> wrote:
>
> On Sun, Jun 21, 2009 at 10:23 AM, Anthony <[email protected]> wrote:
>
> > On Sun, Jun 21, 2009 at 8:35 AM, John Vandenberg <[email protected]> wrote:
> >
> >> I suggest you take a look at a few of the DJVU files provided by
> >> Internet Archive.  Then you can point out real faults that you see.
> >
> >
> > I will.  My apologies for misunderstanding your email.
> >
>
> Okay, http://www.archive.org/details/catholicencyclo16herbgoog happened to
> be the first book I randomly picked from Google Book Search.  There's no
> text version.

Lucky you.  Most of the other CE1913 volumes on Internet Archive have a
DJVU file.

http://www.archive.org/search.php?query=The%20Catholic%20Encyclopedia%20AND%20mediatype%3Atexts

> And the text version I find of other editions seems to be much much worse
> than the google OCR results.

The OCR engines, especially tesseract which Google uses, have only
recently started to handle multiple columns well, so older OCR output
is of lesser quality.  If an old DJVU has been copied over to Internet
Archive, Google Books may have reprocessed that book since then, in
which case better OCR is available from Google.  Internet Archive also
reprocesses its DJVU files, and Wikisource has its own "OCR" button,
which lets an OCR bot reprocess a single page in the background.

However, CE1913 is not a good example, as it would be a bit silly to use
OCR from _anywhere_: there are multiple complete proof-read editions on
the web, including on Wikisource ;-)

http://en.wikisource.org/wiki/CE1913

Also note that Google Books shows the volumes of CE1913 as mostly "No
preview available" to me, probably because I am in Australia, and only
one or two are "Snippet view".

http://books.google.com.au/books?q=intitle%3A"Catholic+Encyclopedia"

--
John Vandenberg

_______________________________________________
foundation-l mailing list
[email protected]
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/foundation-l
