Re: [ol-tech] Bulk Download and Request

Karen Coyle Mon, 13 Dec 2010 13:58:13 -0800

I'm not sure this is the same problem, but here's a bug report:
    https://bugs.launchpad.net/openlibrary/+bug/389217
that covers at least some of the issues. I remember that this has come  
up before, and may have something to do with Solr. Definitely, one  
needs to be able to search on accented and unaccented characters with  
the same query. Also, those of us with dumb ASCII keyboards find it  
very hard to key accented characters, although we may want to search  
on names with diacritics.
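
A minimal sketch of the two steps described in the quoted message below, using only Python's standard unicodedata module. This is not how OL or Solr actually does it; the function names normalize_form and fold_accents and the sample name are made up for illustration. The idea is to apply the same function to both the indexed value and the query.

```python
import unicodedata

def normalize_form(text: str) -> str:
    """Bring text to NFKC so composed and combining forms compare equal."""
    return unicodedata.normalize("NFKC", text)

def fold_accents(text: str) -> str:
    """Strip diacritics: decompose, drop combining marks, recompose.

    Assumption: dropping every combining mark is acceptable. As noted in
    the thread, that is the wrong choice for some languages.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFKC", stripped)

# Hypothetical sample: the same name in two Unicode representations.
composed = "Kh\u0101lid"    # "a with macron" as a single code point (U+0101)
combining = "Kha\u0304lid"  # plain "a" followed by combining macron (U+0304)

print(composed == combining)                                  # False
print(normalize_form(composed) == normalize_form(combining))  # True
print(fold_accents(composed), fold_accents(combining))        # Khalid Khalid
```

Normalizing alone fixes the mismatch between the two representations; folding additionally lets a query typed on a plain ASCII keyboard find the accented name.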


kc

Quoting Jonathan Rochkind <[email protected]>:

> OpenLibrary's search really ought to normalize Unicode in the index;
> apparently it doesn't.
>
> If you convert to Unicode normalization form KC (NFKC) before indexing, and
> convert the query to that form before trying to match, then matches
> should happen regardless of the _form_ of the diacritics entered. If you
> don't do that, then a query entered with a combining diacritic won't match
> an indexed value stored as a single precomposed codepoint, or vice versa.
>
> To additionally make searches on non-accented characters match the
> original with accents (note that English speakers like this, but in some
> other languages it's actually the wrong choice), you would instead
> convert any accented characters to their non-accented forms before
> indexing, and do the same to the query string before matching. I forget
> if OL uses Solr as a back-end, but if so there are a couple of ways I know
> of to do that with Solr, if desired.
>
> Existing behavior seems to be just plain broken though, if searching on
> one Unicode representation of a diacritic won't match a stored value with
> a different Unicode representation of the exact same diacritic.
>
> On 12/13/2010 3:54 PM, Alan Millar wrote:
>> On Mon, Dec 13, 2010 at 9:45 AM, Lee Passey<[email protected]>  wrote:
>>> In your case the encoding can be a little confusing, because OL has used
>>> the "Combining diacritical marks" set (range U+0300-U+036F) [1]. These
>>> Unicode "characters" are designed not to be used as standalone characters,
>>> but rather as a means of modifying the /preceding/ character.
>> Very interesting; I didn't know Unicode had that.  I think I
>> understand now how some of my wildcard author searches pick up accent
>> variations, while others don't.
>>
>> This search:
>> http://openlibrary.org/search/authors?q=kha*lid
>> gives results with both accented and unaccented "a"s, because the
>> combining diacritical falls into the wildcard range.  But this search
>> http://openlibrary.org/search/authors?q=k*alid
>> only gives results with un-accented "a"s because there is no place in
>> the search pattern to match the combining diacritical.  Right?
>>
>> I learn something new every day  :-)
>>
>> - Alan
>> _______________________________________________
>> Ol-tech mailing list
>> [email protected]
>> http://mail.archive.org/cgi-bin/mailman/listinfo/ol-tech
>> To unsubscribe from this mailing list, send email to  
>> [email protected]
>
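
Alan's wildcard observation quoted above can be reproduced in a rough way with shell-style pattern matching. This is only an illustration under assumptions: fnmatch is not what OL's search uses, and the stored name here is a made-up lowercase value in which the accented "a" is a plain "a" followed by a combining macron (U+0304).

```python
from fnmatch import fnmatchcase

# Hypothetical stored values, lowercased for simplicity.
decomposed = "kha\u0304lid"  # "a" followed by a combining macron
plain = "khalid"             # no diacritic at all

def wildcard_match(name: str, pattern: str) -> bool:
    """Case-sensitive shell-style match; '*' matches any run of characters."""
    return fnmatchcase(name, pattern)

# The '*' sits right after the "a", so it can absorb the combining mark.
print(wildcard_match(decomposed, "kha*lid"))  # True
print(wildcard_match(plain, "kha*lid"))       # True

# Here the '*' comes before the "a", leaving nothing to absorb the mark.
print(wildcard_match(decomposed, "k*alid"))   # False
print(wildcard_match(plain, "k*alid"))        # True
```

If the stored value were normalized or folded before indexing, both patterns would behave the same way for accented and unaccented names.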



-- 
Karen Coyle
[email protected] http://kcoyle.net
ph: 1-510-540-7596
m: 1-510-435-8234
skype: kcoylenet

