OpenLibrary's search really ought to normalize Unicode in the index; 
apparently it doesn't.

If you convert to Unicode Normalization Form KC (NFKC) before indexing, 
and convert the query to the same form before trying to match, then 
matches should happen regardless of the _form_ of the diacritics entered. 
If you don't do that, then a query entered with a combining diacritic 
won't match an indexed value stored as a single precomposed codepoint, or 
vice versa.
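
To make that concrete, here is a rough sketch in Python using the standard 
unicodedata module. The function name normalize_for_index and the sample 
strings are just my own illustration, not anything from the OL code; the 
point is only that the same normalization is applied on both the index 
side and the query side.

```python
import unicodedata


def normalize_for_index(text: str) -> str:
    """Return text in Unicode Normalization Form KC (NFKC).

    Hypothetical helper: call it on values before indexing and on the
    query string before matching.
    """
    return unicodedata.normalize("NFKC", text)


# The same visible name, spelled two different ways.
composed = "Kh\u00e1lid"      # a-acute as one precomposed codepoint (U+00E1)
decomposed = "Kha\u0301lid"   # plain "a" followed by combining acute (U+0301)

# Compared raw, the two strings differ, so a naive match fails.
print(composed == decomposed)

# After normalizing both sides to NFKC, they compare equal.
print(normalize_for_index(composed) == normalize_for_index(decomposed))
```

Either normalization form would work for matching as long as the index 
and the query use the same one.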

To additionally make searches on non-accented characters match the 
original with accents (note that English speakers like this, but in some 
other languages it's actually the wrong choice), you would instead 
convert any accented characters to their non-accented forms, both in the 
text before indexing and in the query string before matching. I forget if 
OL uses Solr as a back-end, but if so there are a couple of ways I know 
of to do that with Solr, if desired.
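
Outside of Solr, one common way to do the accent folding is to decompose 
the string and then throw away the combining marks. Here is a minimal 
sketch in Python under that assumption; fold_accents is a name I made up 
for illustration, and this is not necessarily what any particular Solr 
filter does internally.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Strip diacritics by decomposing and dropping combining marks.

    Hypothetical helper: apply it to values before indexing and to the
    query string before matching.
    """
    # NFKD splits a precomposed character into base letter + combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # unicodedata.combining() returns a nonzero class for combining marks.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose whatever is left so the result is in a single stable form.
    return unicodedata.normalize("NFKC", stripped)


# Both spellings of the accented name fold to the same unaccented key.
print(fold_accents("Kh\u00e1lid"))
print(fold_accents("Kha\u0301lid"))
```

Note that this only handles characters that actually decompose into a 
base letter plus combining marks. A letter like "ø" has no such 
decomposition, so it passes through unchanged and would need an explicit 
mapping if you wanted it folded to "o".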

Existing behavior seems to be just plain broken though, if searching on 
one Unicode representation of a diacritic won't match a stored value with 
a different Unicode representation of the exact same diacritic.

On 12/13/2010 3:54 PM, Alan Millar wrote:
> On Mon, Dec 13, 2010 at 9:45 AM, Lee Passey<[email protected]>  wrote:
>> In your case the encoding can be a little confusing, because OL has used the
>> "Combining diacritical marks" set (range 300-36f) [1]. These Unicode
>> "characters" are designed not to be used as standalone characters, but rather
>> as a means of modifying the /preceding/ character.
> Very interesting; I didn't know Unicode had that.  I think I
> understand now how some of my wildcard author searches pick up accent
> variations, while others don't.
>
> This search:
> http://openlibrary.org/search/authors?q=kha*lid
> gives results with both accented and unaccented "a"s, because the
> combining diacritical falls into the wildcard range.  But this search
> http://openlibrary.org/search/authors?q=k*alid
> only gives results with un-accented "a"s because there is no place in
> the search pattern to match the combining diacritical.  Right?
>
> I learn something new every day  :-)
>
> - Alan
> _______________________________________________
> Ol-tech mailing list
> [email protected]
> http://mail.archive.org/cgi-bin/mailman/listinfo/ol-tech
> To unsubscribe from this mailing list, send email to 
> [email protected]
