Re: [GENERAL] Searching for "bare" letters

Cody Caughlan Sat, 01 Oct 2011 19:33:40 -0700

One approach would be to "normalize" all the text and search against the
normalized form. That is, convert each non-ASCII character to its closest
ASCII equivalent.

I've had to do this in Solr for search, for exactly the reasons you've 
outlined: treat "ñ" as "n". Ditto for "ü" -> "u", "é" -> "e", etc.

This is easily done in Solr via the included ASCIIFoldingFilterFactory:

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ASCIIFoldingFilterFactory

You could look at its source to see how the conversion is done and implement
the same thing yourself.
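
To make the folding idea concrete, here is a minimal sketch in Python. It is not
how Solr does it (as far as I know, that filter uses an explicit character
mapping table); this version leans on Unicode decomposition instead. The names
fold_to_bare and bare_match are made up for illustration.

```python
import unicodedata


def fold_to_bare(text: str) -> str:
    """Return text with accents and other combining marks removed.

    Assumption: decomposing to NFKD and dropping combining marks is
    "good enough" for Spanish and Portuguese. Letters with no Unicode
    decomposition (for example "ø" or "ß") pass through unchanged and
    would need an explicit mapping table.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def bare_match(needle: str, haystack: str) -> bool:
    """Case-insensitive substring test on the folded forms of both strings."""
    return fold_to_bare(needle).lower() in fold_to_bare(haystack).lower()
```

The same folding has to be applied to both the stored text and the search
string, otherwise a user who types "ñ" would stop matching rows that were
folded to "n".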

/Cody

On Oct 1, 2011, at 7:09 PM, planas wrote:

> On Sun, 2011-10-02 at 01:25 +0200, Reuven M. Lerner wrote:
>> Hi, everyone.  I'm working on a project on PostgreSQL 9.0 (soon to be 
>> upgraded to 9.1, given that we haven't yet launched).  The project will 
>> involve numerous text fields containing English, Spanish, and Portuguese.  
>> Some of those text fields will be searchable by the user.  That's easy 
>> enough to do; for our purposes, I was planning to use some combination of 
>> LIKE searches; the database is small enough that this doesn't take very much 
>> time, and we don't expect the number of searchable records (or columns 
>> within those records) to be all that large.
>> 
>> The thing is, the people running the site want searches to work on what I'm 
>> calling (for lack of a better term) "bare" letters.  That is, if the user 
>> searches for "n", then the search should also match Spanish words containing 
>> "ñ".  I'm told by Spanish-speaking members of the team that this is how they 
>> would expect searches to work.  However, when I just did a quick test using 
>> a UTF-8 encoded 9.0 database, I found that PostgreSQL didn't see the two 
>> characters as identical.  (I must say, this is the behavior that I would 
>> have expected, had the Spanish-speaking team member not said anything on the 
>> subject.)
>> 
>> So my question is whether I can somehow wrangle PostgreSQL into thinking 
>> that "n" and "ñ" are the same character for search purposes, or if I need to 
>> do something else -- use regexps, keep a "naked," searchable version of each 
>> column alongside the native one, or something else entirely -- to get this 
>> to work.
>> 
> Could you parse the search string for the non-English characters and convert 
> them to the appropriate English character? My skills are not that good or I 
> would offer more details.
>> Any ideas?
>> 
>> Thanks,
>> 
>> Reuven
>> 
>> 
>> -- 
>> Reuven M. Lerner -- Web development, consulting, and training
>> Mobile: +972-54-496-8405 * US phone: 847-230-9795
>> Skype/AIM: reuvenlerner
> 
> 
> -- 
> Jay Lozier
> [email protected]
