I'm not sure how to boost scores in this case, but you can get the basic matching by applying the same filter at both index time and query time. The ISOLatin1AccentFilter may work for you, though I'm not certain it covers all the accented characters you need.

My approach has been to write two new filters: one to normalize the Unicode text into its "decomposed" form, and one to strip out all of the "add-on" (combining) characters, meaning anything with a decimal code point greater than 256. I don't know if this will always work, but it has worked well for me so far.
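
To make the decompose-then-strip idea concrete, here is a minimal sketch in plain Python rather than as a Lucene/Solr filter. The function name fold_accents is hypothetical, and the sketch assumes a slightly narrower rule than the one described above: it drops only Unicode combining marks instead of every code point above 256.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Return text with combining marks removed (hypothetical helper).

    Step 1: normalize to NFD so each accented letter becomes a base
            letter followed by one or more combining marks.
    Step 2: drop the combining marks, keeping everything else.
    Assumption: this filters on the combining class, not on
    "code point > 256" as in the filters described in the post.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    print(fold_accents("Lập Trình Viên"))  # Lap Trinh Vien
    print(fold_accents("Lap Trinh Vien"))  # Lap Trinh Vien
```

Both titles from the question fold to the same string, which is what makes the match accent-insensitive when the same folding is applied to the index and to the query. One caveat: a letter such as "đ" has no canonical decomposition, so this sketch leaves it unchanged, whereas a "greater than 256" rule would remove it entirely.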

I would try adding a <filter class="solr.ISOLatin1AccentFilterFactory"/> to your analyzer. It might do the trick. Once again, with this approach I'm not sure how to boost either score, so someone else may have better ideas. I'm pretty new to all of this.

Peter

climbingrose wrote:
Hi guys,

I'm running into some problems with accented (UTF-8) languages. I'd love to
hear some ideas about how to use Solr with those languages. Basically, I
want to achieve what Google did with UTF-8 languages.

My requirements include:
1) Accent insensitive search and proper highlighting:
  For example, we have 2 documents:

  Doc A (title:Lập Trình Viên)
  Doc B (title:Lap Trinh Vien)

  if the user enters "Lập Trình Viên", then Doc B is also matched and "Lập
Trình Viên" is highlighted.
  On the other hand, if the query is "Lap Trinh Vien", Doc A is also
matched.
2) Assign proper scores to accented or non-accented searches:
  if the user enters "Lập Trình Viên", then Doc A should be given a higher
score than Doc B.
  if the query is "Lap Trinh Vien", Doc A should be given a higher score.

Any ideas guys? Thanks in advance!
