> I've been poking around the list archives and didn't really come up against
> anything interesting. Anyone using Lucene to index OCR text? Any
> strategies/algorithms/packages you recommend?
>
> I have a large collection (10^7 docs) that's mostly the result of OCR. We
> index/search/etc. with Lucene without any trouble, but OCR errors are a
> problem, when doing exact phrase matches in particular. I'm looking for
> ideas on how to deal with this thorny problem.

How about letter-by-letter n-grams coupled with SpanQueries (or, more
likely, a custom query built on the TermPositions iterator)? A single
misread letter only damages the few n-grams that overlap it, so the rest
of the word can still match in position.

--
Kyle Maxwell
Software Engineer
CastTV, Inc
http://www.casttv.com
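
A rough sketch of the letter n-gram idea suggested above, in plain Java with no Lucene calls at all. It only illustrates why n-grams tolerate OCR errors in a phrase match; it is not a SpanQuery or TermPositions implementation. The class name, method names, the bigram size, and the 0.5 threshold are all assumptions made up for this illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper class; nothing here is part of Lucene.
class NgramSketch {

    // Split a term into overlapping letter n-grams,
    // e.g. "lucene" with n = 3 gives luc, uce, cen, ene.
    static List<String> ngrams(String term, int n) {
        List<String> grams = new ArrayList<>();
        String t = term.toLowerCase();
        if (t.length() < n) {
            // Terms shorter than n are kept whole (an assumption of this sketch).
            if (!t.isEmpty()) {
                grams.add(t);
            }
            return grams;
        }
        for (int i = 0; i + n <= t.length(); i++) {
            grams.add(t.substring(i, i + n));
        }
        return grams;
    }

    // Fraction of the query term's distinct n-grams that also occur in the candidate term.
    static double overlap(String query, String candidate, int n) {
        Set<String> q = new HashSet<>(ngrams(query, n));
        if (q.isEmpty()) {
            return 0.0;
        }
        Set<String> c = new HashSet<>(ngrams(candidate, n));
        int hits = 0;
        for (String g : q) {
            if (c.contains(g)) {
                hits++;
            }
        }
        return (double) hits / q.size();
    }

    // Return the first word position in docWords where every query word, in order,
    // shares at least `threshold` of its n-grams with the document word at the
    // matching offset; return -1 if there is no such position.
    static int findPhrase(String[] queryWords, String[] docWords, int n, double threshold) {
        for (int start = 0; start + queryWords.length <= docWords.length; start++) {
            boolean allMatch = true;
            for (int k = 0; k < queryWords.length; k++) {
                if (overlap(queryWords[k], docWords[start + k], n) < threshold) {
                    allMatch = false;
                    break;
                }
            }
            if (allMatch) {
                return start;
            }
        }
        return -1;
    }
}
```

For example, with bigrams and a threshold of 0.5, the phrase "search engine" is found at word position 2 in the OCR output "full text scarch engine library", even though "search" was misread as "scarch": three of the five bigrams of "search" survive the error. In a real index the n-grams would be terms with positions, and the positional check would be done by the query rather than by scanning the text.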