Can't say for sure without doing the debugging, but that ticket does seem like it's likely part of the same issue.
It has to do with Solr only in the sense that Solr is flexible enough to let you solve the problem, but it won't magically solve it for you. The solution is to normalize in the way that fits your use cases. Solr will let you do that, but it can't do it automatically because it can't guess your use cases. Some people want to normalize by removing all accents; others (Germans, for instance) don't.

Almost everyone probably wants to normalize to Unicode NFC, at least. Solr doesn't do that automatically either, but I'm sure people using Solr are doing it, and a developer can ask on the Solr list for advice on how.

In my own app, I don't normalize to NFC, because I normalize by removing pretty much all accents and converting to plain ASCII, which is generally the right thing for English speakers.

On 12/13/2010 4:57 PM, Karen Coyle wrote:
> I'm not sure this is the same problem, but here's a bug report:
> https://bugs.launchpad.net/openlibrary/+bug/389217
> that covers at least some of the issues. I remember that this has come
> up before, and may have something to do with Solr. Definitely, one
> needs to be able to search on accented and unaccented characters with
> the same query. Also, those of us with dumb ASCII keyboards find it
> very hard to key accented characters, although we may want to search
> on names with diacritics.
>
> kc
>
> Quoting Jonathan Rochkind <[email protected]>:
>
>> OpenLibrary search algorithms really ought to normalize unicode in the
>> index, apparently they don't.
>>
>> If you convert to unicode normalization form KC before indexing, as well
>> as convert the query to that form before trying to match -- then matches
>> should happen regardless of the _form_ of the diacritics entered. If you
>> don't do that, then a query entered as a combining diacritic won't match
>> an indexed value stored as a single codepoint, or vice versa.
>>
>> To additionally make searches on non-accented characters match the
>> original with accents (note that English speakers like this, but in some
>> other languages it's actually the wrong choice), you would instead
>> convert any accented characters to their non-accented forms before
>> indexing, and do the same to the query string before matching. I forget
>> if OL uses Solr as a back-end, but if so there are a couple ways I know
>> of to do that with Solr, if desired.
>>
>> Existing behavior seems to be just plain broken though, if searching on
>> one unicode encoding of a diacritic won't match a stored value with a
>> different unicode encoding of the exact same diacritic.
>>
>> On 12/13/2010 3:54 PM, Alan Millar wrote:
>>> On Mon, Dec 13, 2010 at 9:45 AM, Lee Passey <[email protected]> wrote:
>>>> In your case the encoding can be a little confusing, because OL has
>>>> used the "Combining diacritical marks" set (range 300-36f) [1]. These
>>>> Unicode "characters" are designed not to be used as standalone
>>>> characters, but rather as a means of modifying the /preceding/
>>>> character.
>>> Very interesting; I didn't know Unicode had that. I think I
>>> understand now how some of my wildcard author searches pick up accent
>>> variations, while others don't.
>>>
>>> This search:
>>> http://openlibrary.org/search/authors?q=kha*lid
>>> gives results with both accented and unaccented "a"s, because the
>>> combining diacritical falls into the wildcard range. But this search
>>> http://openlibrary.org/search/authors?q=k*alid
>>> only gives results with un-accented "a"s because there is no place in
>>> the search pattern to match the combining diacritical. Right?
>>>
>>> I learn something new every day :-)
>>>
>>> - Alan
>>> _______________________________________________
>>> Ol-tech mailing list
>>> [email protected]
>>> http://mail.archive.org/cgi-bin/mailman/listinfo/ol-tech
>>> To unsubscribe from this mailing list, send email to
>>> [email protected]
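
For anyone who wants to see the difference concretely, here is a minimal Python sketch of the two kinds of normalization discussed in this thread. It is only an illustration under stated assumptions: it is not the code OL, Solr, or any app mentioned here actually runs, the function names normalize_form and fold_accents are made up for the example, and shell-style wildcard matching from Python's fnmatch module stands in for whatever the real search backend does with "*".

```python
import unicodedata
from fnmatch import fnmatchcase


def normalize_form(text: str) -> str:
    """Normalize to NFC so composed and combining spellings compare equal."""
    return unicodedata.normalize("NFC", text)


def fold_accents(text: str) -> str:
    """Decompose, then drop combining marks (hypothetical helper).

    Characters with no decomposition (for example U+00F8) are left as-is,
    so this does not guarantee pure ASCII output.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


composed = "Kh\u0101lid"    # a-with-macron as the single code point U+0101
combining = "Kha\u0304lid"  # plain "a" followed by combining macron U+0304

# Same text to a human reader, different code point sequences.
print(composed == combining)

# After normalizing both sides to one form, they compare equal.
print(normalize_form(composed) == normalize_form(combining))

# Folding removes the diacritic from either spelling.
print(fold_accents(composed), fold_accents(combining))

# Wildcard illustration, with fnmatch as a stand-in (an assumption):
# "*" can absorb the combining mark only when it sits right after the "a".
stored = combining.lower()
print(fnmatchcase(stored, "kha*lid"))
print(fnmatchcase(stored, "k*alid"))
print(fnmatchcase(fold_accents(stored), "k*alid"))
```

Whichever function you pick has to be applied to both the indexed value and the query string. Note that fold_accents also turns "ü" into "u", which is exactly the behavior that is wrong for some languages.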
