Re: Lucene to index OCR text

Kyle Maxwell Thu, 24 Jan 2008 18:46:59 -0800

> I've been poking around the list archives and didn't really come up against
> anything interesting. Anyone using Lucene to index OCR text? Any
> strategies/algorithms/packages you recommend?
>
> I have a large collection (10^7 docs) that's mostly the result of OCR. We
> index/search/etc. with Lucene without any trouble, but OCR errors are a
> problem, when doing exact phrase matches in particular. I'm looking for
> ideas on how to deal with this thorny problem.


How about Letter-by-letter ngrams coupled with SpanQueries (or more
likely, a custom query utilizing the TermPositions iterator)?

-- 
Kyle Maxwell
Software Engineer
CastTV, Inc
http://www.casttv.com

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Lucene to index OCR text

Reply via email to