On Mon, 2009-11-02 at 08:31 -0700, Matt Graham wrote:
> >From: Alex Dean <[email protected]>
> > On Nov 1, 2009, at 9:24 PM, Ted Gould wrote:
> >> I'd recommend gscan2pdf.  It works with SANE, but does nice things
> >> like handle double sided stuff easily.  It will also work with
> >> GOCR to do OCR
>
> That's not exactly a great thing.  GOCR is much worse than commercial
> OCR engines, especially if the original image is skewed/broken.
Oh, well, it's pluggable; GOCR is just all I have :)
In general, it does a bunch of cleanup to make the OCR reasonable. I
wouldn't say it's anywhere near perfect, but it seems to pull most of
the keywords out of things like credit card statements. I would say
it's good enough for search, but it's not perfect by any stretch of the
imagination.
> > Someday I'm going to start digitizing and OCR-ing the 100 years of
> > local newspapers which are gathering mold in the library basement. I
> > really have no firm plan as to how I'm going to do it, but doing it
> > with free software would be a big plus.
>
> I spent 3 or 4 years doing stuff like this on the NYT, Wall Street
> Journal, Christian Science Monitor, and Boston Globe. You will NOT
> be able to get decent OCR with free software. Newspapers require
> a different approach than most OCR packages take; you have to split
> each article up into multiple individual image files and OCR each
> file separately, then stitch the results back together. And editing
> the results is totally necessary since newspaper text is so horrible
> in quality.
>
> (I can talk about this for at least half an hour; contact offlist
> for more info.)
+1, I wouldn't use it for archival things like that yet. But, you might
be able to use GOCR with the work Google is doing -- I'm not sure if
they're open sourcing all of it or not.
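
For what it's worth, the split / OCR / stitch workflow Matt describes
above could be sketched roughly like this. This is only an
illustration, not what his toolchain actually did: the region boxes,
the `crop` helper and the `ocr` callable are all hypothetical, and
`ocr` stands in for whatever engine you plug in (GOCR or otherwise).
The page is modelled as plain rows of pixels so the sketch runs
without any imaging library.

```python
from typing import Callable, List, Sequence, Tuple

# (left, top, right, bottom) in pixels; hypothetical region format.
Box = Tuple[int, int, int, int]
Region = List[list]


def crop(page: Sequence[Sequence], box: Box) -> Region:
    """Cut one rectangular region out of a page stored as rows of pixels."""
    left, top, right, bottom = box
    return [list(row[left:right]) for row in page[top:bottom]]


def ocr_article(
    page: Sequence[Sequence],
    boxes: Sequence[Box],
    ocr: Callable[[Region], str],
) -> str:
    """OCR each region on its own, then stitch the text back together.

    `boxes` must already be in reading order; finding and ordering the
    regions of an article is the hard part and is not attempted here.
    """
    pieces = [ocr(crop(page, box)).strip() for box in boxes]
    # Drop regions that produced no text, keep the rest in order.
    return "\n\n".join(piece for piece in pieces if piece)
```

The stitched output would still need hand editing, as Matt says; the
sketch only shows where the per-region OCR call sits in the loop.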
--Ted
---------------------------------------------------
PLUG-discuss mailing list - [email protected]
To subscribe, unsubscribe, or to change your mail settings:
http://lists.PLUG.phoenix.az.us/mailman/listinfo/plug-discuss
