> I've been poking around the list archives and didn't really come up against
> anything interesting. Anyone using Lucene to index OCR text? Any
> strategies/algorithms/packages you recommend?
>
> I have a large collection (10^7 docs) that's mostly the result of OCR. We
> index/search/etc. with Lucene without any trouble, but OCR errors are a
> problem, when doing exact phrase matches in particular. I'm looking for
> ideas on how to deal with this thorny problem.

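How about letter-by-letter n-grams coupled with SpanQueries (or, more
likely, a custom query that uses the TermPositions iterator)? A single
OCR error only corrupts the few n-grams that overlap it, so a phrase can
still match on the remaining n-grams as long as they occur in the right
order.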
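
To make the idea concrete, here is a rough sketch in plain Python rather
than Lucene. It is only an illustration under my own assumptions: the
function names (char_ngrams, fuzzy_phrase_score), the trigram size, and
the offset-voting scheme are all made up for this example and are not
Lucene APIs. The dictionary of positions stands in for what the term
positions in an index would give you.

```python
def char_ngrams(text, n=3):
    """Return (position, ngram) pairs for a lowercased, whitespace-normalized string."""
    s = " ".join(text.lower().split())
    return [(i, s[i:i + n]) for i in range(len(s) - n + 1)]


def fuzzy_phrase_score(phrase, document, n=3):
    """Best fraction of the phrase's n-grams found at mutually consistent positions."""
    query = char_ngrams(phrase, n)
    if not query:
        return 0.0

    # Toy positional index: ngram -> character positions in the document.
    index = {}
    for pos, gram in char_ngrams(document, n):
        index.setdefault(gram, []).append(pos)

    # Every hit votes for the document offset at which the phrase would start.
    # N-grams that agree on the same offset are in the right relative order.
    votes = {}
    for qpos, gram in query:
        for dpos in index.get(gram, []):
            start = dpos - qpos
            votes[start] = votes.get(start, 0) + 1

    return max(votes.values(), default=0) / len(query)


if __name__ == "__main__":
    clean = "problem when doing exact phrase matches"
    noisy = "problem when doing exaet phrase matches"  # OCR read "c" as "e"
    print(fuzzy_phrase_score("exact phrase", clean))
    print(fuzzy_phrase_score("exact phrase", noisy))
```

A clean document scores 1.0, and one with a substituted character scores
lower but still well above unrelated text, so a threshold can replace
the exact phrase match. Note that voting on an exact offset only
tolerates substitutions; OCR insertions and deletions shift the
positions, so a real query would need some slop in the position
comparison.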

-- 
Kyle Maxwell
Software Engineer
CastTV, Inc
http://www.casttv.com
