Re: Matching accented with non-accented characters

Steven Rowe Tue, 25 Jul 2006 10:41:08 -0700

Rajan, Renuka wrote:
> I am trying to match accented characters with non-accented characters
> in French/Spanish and other Western European languages.  The use case
> is that the users may type letters without accents in error and we
> still want to be able to retrieve valid matches.  The one idea,
> albeit naïve, is to normalize the data on the inbound side as well as
> the data in the database (prior to full text indexing) and retrieve
> matches.
> 
> For instance if the database contains a word like BE/BE/ (/ being the
> equivalent of aigu since I don't have a French keyboard:-)) and the
> input is erroneously provided as BE/BE (last aigu missing), we still
> want to be able to retrieve BE/BE/ as a candidate match, admittedly
> with an error margin.
> 
> Has anyone used Lucene successfully (i.e., in terms of decent
> performance AND validity of results) to match non-accented characters
> with accented ones using some method?  Any method?  Does anyone have
> suggestions to improve the approach above?

Some of the work to do the deaccenting (normalization) is already done:

<http://lucene.apache.org/java/docs/api/org/apache/lucene/analysis/ISOLatin1AccentFilter.html>

Simplest method: Index and search against a single deaccented field.
This has the advantage that incompletely accented text (as in your
example) will still match. The disadvantage is that terms which differ
only by accent(s) will be conflated, thus lowering average precision.
This may not be a big enough problem for you to justify greater effort,
though.
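
To make the single-field idea concrete, here is a minimal sketch of the
normalization step applied identically at index time and at query time.
It does not use ISOLatin1AccentFilter; as an assumption for illustration
it uses the JDK's java.text.Normalizer instead, and the class and method
names (Deaccent, strip) are made up for this example.

```java
import java.text.Normalizer;

final class Deaccent {
    // Decompose to NFD so each accent becomes a separate combining mark,
    // then drop every combining mark (Unicode category M).
    static String strip(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // \u00e9 is e with acute accent.
        String indexed = strip("b\u00e9b\u00e9"); // fully accented term as stored
        String query = strip("beb\u00e9");        // user omitted the first accent
        System.out.println(indexed + " / " + query + " / " + indexed.equals(query));
    }
}
```

Note that this only removes accents that decompose into a base letter
plus a combining mark; letters such as o-with-stroke have no such
decomposition and would need an explicit mapping.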

Three other alternatives (roughly in increasing order of complexity):

1. Put the original and the deaccented versions of the tokens at the
same position in a single field, and use the same analyzer to construct
queries. The precision-lowering conflation effect mentioned above will
be partially offset by better scores for documents containing tokens
with accents that match those given by the user.

2. Have two fields on each document, one for the original
(non-deaccented) token stream, and another for the deaccented token
stream, which you can create using the above-linked
ISOLatin1AccentFilter. Then when you perform searches, you can
construct a query against both fields, giving a higher boost to the
original (non-deaccented) field. This is probably closest to the
solution you had in mind.

3. Use the ICU library
<http://ibm.com/software/globalization/icu/index.jsp> to create sort
keys for each token, which you would use both when indexing and
searching. See Ken Krugler's post on this topic:
<http://mail-archives.apache.org/mod_mbox/lucene-java-user/200506.mbox/[EMAIL
PROTECTED]>
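
As a rough illustration of alternative 3, the sketch below builds
accent-insensitive sort keys. It is an assumption-laden stand-in: it
uses the JDK's own java.text.Collator rather than ICU, the French
locale is picked arbitrarily, and the names SortKeys, key and sameKey
are hypothetical. At PRIMARY strength the collator compares base
letters only, so tokens differing only by accents (or case) should
yield the same key bytes, which you could then index and search on.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

final class SortKeys {
    // Build a collation key that reflects base letters only.
    static byte[] key(String token) {
        Collator collator = Collator.getInstance(Locale.FRENCH);
        collator.setStrength(Collator.PRIMARY); // ignore accent and case differences
        return collator.getCollationKey(token).toByteArray();
    }

    // Two tokens match if their primary-strength keys are byte-identical.
    static boolean sameKey(String a, String b) {
        return Arrays.equals(key(a), key(b));
    }

    public static void main(String[] args) {
        // \u00e9 is e with acute accent.
        System.out.println(sameKey("b\u00e9b\u00e9", "bebe"));
        System.out.println(sameKey("b\u00e9b\u00e9", "baba"));
    }
}
```

The key bytes are binary, so to store them as Lucene terms you would
need some string encoding of them; that part is left out here.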

Steve
