Snippets and encoding characters

Johnny Mariéthoz Fri, 10 Aug 2012 02:40:51 -0700

Dear all,

I have a problem with snippets for documents containing accented 
characters.


Some '?' characters are displayed instead of the accented letters in the 
snippets. I attach a patch that seems to solve it. The problem comes from the 
fact that the indexes (computed by get_tokens) should be computed on a unicode 
string instead of a standard (byte) string. For example:
'été'[0:3]
is not the same as
u'été'[0:3]
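
To make the difference concrete, here is a minimal sketch, assuming the text is stored as UTF-8. It is written so that it also runs on Python 3, where the "standard string" of Python 2 corresponds to a bytes object. The variable names are only for illustration and do not come from the patch.

```python
# -*- coding: utf-8 -*-
# Illustration only: character offsets versus byte offsets in UTF-8.
text = u"été"                # unicode string: 3 characters
raw = text.encode("utf-8")   # byte string: 5 bytes, each 'é' takes 2 bytes

# Slicing the unicode string counts characters, so the whole word comes back.
char_slice = text[0:3]

# Slicing the byte string counts bytes, so the same offsets stop after 't'.
byte_slice = raw[0:3].decode("utf-8")

# An offset that falls in the middle of a 2-byte 'é' leaves an incomplete
# sequence, which is shown as a replacement character when displayed.
broken = raw[0:4].decode("utf-8", errors="replace")

print(repr(char_slice), repr(byte_slice), repr(broken))
```

Offsets computed on one representation and applied to the other therefore cut the text in the wrong place, which would explain the '?' characters in the snippets.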

0001-Bad-unicode-chars-for-snippets-display.patch
Description: Binary data


Another problem is that a full-text search with an accented query does not 
retrieve documents containing accented matches. For example, if I search for 
"Méthode" in French, I retrieve all English documents with "Method" but not 
the French document containing "Méthode fonctionnelle". Probably some 
pre-processing is required to remove accents before the SOLR indexing.
Moreover, we also need to modify the snippets so that keywords keep their 
accents during the grep process.
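
The two points above could be sketched as follows. This is only a rough illustration and not the attached patch: strip_accents and find_keyword are hypothetical helper names, and the accent removal could probably also be done on the SOLR side in the analyzer configuration instead. The idea is to compare accent-free, lower-cased forms while remembering which original character each folded character came from, so that the snippet can still show the accented text.

```python
import unicodedata


def strip_accents(text):
    """Remove combining marks after canonical decomposition (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def find_keyword(text, keyword):
    """Find keyword in text ignoring accents and case.

    Returns the matching substring of the ORIGINAL text (accents kept),
    or None when there is no match.
    """
    folded = []   # accent-free, lower-cased characters
    origin = []   # origin[i] = index in text of the char that produced folded[i]
    for index, ch in enumerate(text):
        for folded_ch in strip_accents(ch).lower():
            folded.append(folded_ch)
            origin.append(index)

    needle = strip_accents(keyword).lower()
    position = "".join(folded).find(needle)
    if position < 0 or not needle:
        return None

    # Map the match in the folded text back to offsets in the original text.
    start = origin[position]
    end = origin[position + len(needle) - 1] + 1
    return text[start:end]


print(strip_accents(u"Méthode fonctionnelle"))
print(find_keyword(u"Méthode fonctionnelle", u"methode"))
```

With this kind of folding, a query for "Méthode" and a query for "methode" both match the French document, and the substring returned for the snippet is still "Méthode".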

Regards,

----------------------------------------------------------------------
Johnny Mariéthoz
RERO, Av. de la Gare 45, CH - 1920 MARTIGNY
Téléphone:  +41(0)27 721 8579
Fax              : +41(0)27 721 8586
Web            : http://www.rero.ch
ReroDoc    : http://doc.rero.ch, [email protected]
----------------------------------------------------------------------
