Rajan, Renuka wrote:
> I am trying to match accented characters with non-accented characters
> in French/Spanish and other Western European languages. The use case
> is that users may type letters without accents in error, and we
> still want to be able to retrieve valid matches. One idea,
> albeit naïve, is to normalize the data on the inbound side as well as
> the data in the database (prior to full-text indexing) and retrieve
> matches.
>
> For instance, if the database contains a word like BE/BE/ (/ being the
> equivalent of aigu since I don't have a French keyboard :-)) and the
> input is erroneously provided as BE/BE (last aigu missing), we still
> want to be able to retrieve BE/BE/ as a candidate match, admittedly
> with an error margin.
>
> Has anyone used Lucene successfully (i.e., with decent performance
> AND valid results) to match non-accented characters with accented
> ones using some method? Any method? Does anyone have suggestions to
> improve on the idea above?

Some of the work to do the deaccenting (normalization) is already done:

<http://lucene.apache.org/java/docs/api/org/apache/lucene/analysis/ISOLatin1AccentFilter.html>

Simplest method: Index and search against a single deaccented field.
This has the advantage that incompletely accented text (as in your
example) will still match. The disadvantage is that terms which differ
only by accent(s) will be conflated, thus lowering average precision.
This may not be a big enough problem for you to justify greater effort,
though.

Three other alternatives (roughly in increasing order of complexity):

1. Put the original and the deaccented versions of the tokens at the
   same position in a single field, and use the same analyzer to
   construct queries. The precision-lowering conflation effect mentioned
   above will be partially offset by better scores for documents
   containing tokens with accents that match those given by the user.

2. Have two fields on each document, one for the original
   (non-deaccented) token stream, and another for the deaccented token
   stream, which you can create using the above-linked
   ISOLatin1AccentFilter. Then when you perform searches, you can
   construct a query against both fields, giving a higher boost to the
   original (non-deaccented) field. This is probably closest to the
   solution you had in mind.

3. Use the ICU library
   <http://ibm.com/software/globalization/icu/index.jsp> to create sort
   keys for each token, which you would use both when indexing and
   searching. See Ken Krugler's post on this topic:
   <http://mail-archives.apache.org/mod_mbox/lucene-java-user/200506.mbox/[EMAIL PROTECTED]>

Steve
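
To make the deaccenting step and alternative 1 above a little more
concrete, here is a minimal sketch in plain Java. It does not use
Lucene: it stands in for ISOLatin1AccentFilter with the JDK's
java.text.Normalizer, and the names DeaccentSketch, deaccent and
formsAtSamePosition are made up for illustration. Note that this
approach only strips combining marks, so characters with no canonical
decomposition (for example ø, æ or ß) pass through unchanged.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene.
final class DeaccentSketch {

    // Decompose to NFD so that an accented letter becomes
    // base letter + combining mark, then drop the combining marks.
    static String deaccent(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    // Alternative 1: the forms one would place at the same token position.
    // The original comes first; the deaccented form is added only if it differs.
    static List<String> formsAtSamePosition(String token) {
        List<String> forms = new ArrayList<>();
        forms.add(token);
        String plain = deaccent(token);
        if (!plain.equals(token)) {
            forms.add(plain);
        }
        return forms;
    }

    public static void main(String[] args) {
        String indexed = "b\u00e9b\u00e9"; // fully accented, as stored
        String typed = "beb\u00e9";        // user missed the first accent

        // Both sides reduce to the same deaccented term, so they match.
        System.out.println(deaccent(indexed).equals(deaccent(typed)));
        System.out.println(formsAtSamePosition(indexed));
    }
}
```

In a real analyzer the second form would be emitted as an extra token
with a position increment of zero; the list returned here only shows
which terms would share the position.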