We've done this in a pre-Solr Lucene context by using the position increment: 
when a token contains accented characters, we index a stripped version of that 
token with a position increment of zero, so that for matching purposes the 
original and the stripped version sit at the same position. We do this after 
lower-casing the token. Accents are not stripped from queries. The effect is 
that an accented search matches only your Doc A, and an unaccented search 
matches Docs A and B.
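
To make the idea concrete, here is a minimal sketch in plain Python rather than our actual Lucene code. It only simulates a token stream as (term, position increment) pairs; the function names (strip_accents, analyze, positions, phrase_matches) are made up for illustration and are not Lucene or Solr APIs. It also assumes whitespace tokenization and that "stripping" means dropping Unicode combining marks.

```python
import unicodedata


def strip_accents(text):
    """Drop combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze(text):
    """Index-time analysis: return (term, position_increment) pairs."""
    tokens = []
    for word in text.split():
        # Lower-case first, and normalize so equal terms compare equal.
        term = unicodedata.normalize("NFC", word.lower())
        tokens.append((term, 1))  # the original token advances the position
        stripped = strip_accents(term)
        if stripped != term:
            tokens.append((stripped, 0))  # stripped variant, same position
    return tokens


def positions(tokens):
    """Resolve increments into absolute positions: {term: {positions}}."""
    index, pos = {}, -1
    for term, increment in tokens:
        pos += increment
        index.setdefault(term, set()).add(pos)
    return index


def phrase_matches(doc_text, query_text):
    """Phrase match; the query is lower-cased but keeps its accents."""
    index = positions(analyze(doc_text))
    terms = [unicodedata.normalize("NFC", w.lower()) for w in query_text.split()]
    if not terms:
        return False
    return any(
        all(start + i in index.get(term, ()) for i, term in enumerate(terms))
        for start in index.get(terms[0], ())
    )


doc_a = "Lập Trình Viên"
doc_b = "Lap Trinh Vien"
for query in ("Lập Trình Viên", "Lap Trinh Vien"):
    hits = [name for name, doc in (("A", doc_a), ("B", doc_b))
            if phrase_matches(doc, query)]
    print(query, "->", hits)
```

Because the stripped variant shares a position with the original, phrase queries line up either way. In a real analyzer the same thing would be done in a token filter that sets the position increment on the extra token.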

There are some limitations. Users might start to expect that they can freely 
add accents to restrict their search to accented hits, but a query whose 
accents don't match the indexed token exactly gets no hits at all. For 
example, if a word contains two accented characters and the user accents only 
one of them in the query, it matches neither the accented nor the unaccented 
version.
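
A small sketch of that failure case, again in plain Python and not our Lucene code. The word "résumé" is just an illustrative example of a token with two accented characters, and indexed_terms and term_matches are hypothetical helper names. It assumes the index holds exactly two variants per token: the lower-cased original and the fully stripped form.

```python
import unicodedata


def strip_accents(term):
    """Drop combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def indexed_terms(word):
    """Variants stored at one position: the original and the stripped form."""
    term = unicodedata.normalize("NFC", word.lower())
    return {term, strip_accents(term)}


def term_matches(indexed_word, query_word):
    """The query term is lower-cased but its accents are left as typed."""
    query = unicodedata.normalize("NFC", query_word.lower())
    return query in indexed_terms(indexed_word)


for query in ("résumé", "resume", "résume"):
    print(query, term_matches("Résumé", query))
```

The partially accented query "résume" equals neither stored variant, so it finds nothing, even though both the fully accented and the unaccented queries find the document.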

Peter

Peter Binkley
Digital Initiatives Technology Librarian
Information Technology Services
4-30 Cameron Library
University of Alberta Libraries
Edmonton, Alberta
Canada T6G 2J8
Phone: (780) 492-3743
Fax: (780) 492-9243
e-mail: [EMAIL PROTECTED]

~ The code is willing, but the data is weak. ~


-----Original Message-----
From: climbingrose [mailto:[EMAIL PROTECTED] 
Sent: Monday, March 10, 2008 10:01 PM
To: solr-user@lucene.apache.org
Subject: Accented search

Hi guys,

I'm running into some problems with accented (UTF-8) languages. I'd love to hear 
some ideas about how to use Solr with those languages. Basically, I want to 
achieve what Google does with UTF-8 languages.

My requirements include:
1) Accent insensitive search and proper highlighting:
  For example, we have 2 documents:

  Doc A (title:Lập Trình Viên)
  Doc B (title:Lap Trinh Vien)

  if the user enters "Lập Trình Viên", then Doc B is also matched and "Lập 
Trình Viên" is highlighted.
  On the other hand, if the query is "Lap Trinh Vien", Doc A is also matched.
2) Assign proper scores to accented or non-accented searches:
  if the user enters "Lập Trình Viên", then Doc A should be given a higher score 
than Doc B.
  if the query is "Lap Trinh Vien", Doc A should be given a higher score.

Any ideas guys? Thanks in advance!

--
Regards,

Cuong Hoang
