Re: Accent-insensitive searches

Knut Anders Hatlen Fri, 11 Sep 2009 02:46:16 -0700

josu <[email protected]> writes:

> I'm working on a database application. Items in the database are all in
> Spanish. It's mandatory that searches are accent-insensitive,
> meaning that, for example, a search for the word 'electrico' (no accent)
> must return entries containing 'eléctrico' (with accent).
>
> Searching the web for a solution, I find I must set these two properties
> when creating the database: 
>
> territory=es_ES
> collation=TERRITORY_BASED
>
> But it still doesn't work this way. Looks like the default collation for
> es_ES is still accent-sensitive.
>
> So I try to use a custom collator that will behave as I need to. I find some
> instructions for this in the following blog:
>
>   http://blogs.sun.com/kah/entry/user_defined_collation_in_apache
>
>  In brief, I define a new CollatorProvider and register it with the JVM.
> Here's the code for this class:
>
>
> import java.text.Collator;
> import java.text.spi.CollatorProvider;
> import java.util.Locale;
>
> public class IgnoraAcentosCollatorProvider extends CollatorProvider {
>
>     @Override
>     public Collator getInstance(Locale locale) {
>         if (!locale.equals(new Locale("es","ES","accentinsensitive"))){
>             throw new IllegalArgumentException("Solo acepta es_ES_accentinsensitive");
>         }
>         Collator c=Collator.getInstance(new Locale("es","ES"));
>         c.setStrength(Collator.PRIMARY);
>         return c;
>     }
>
>     @Override
>     public Locale[] getAvailableLocales() {
>         return new Locale[]{
>             new Locale("es","ES","accentinsensitive")
>         };
>     }
>
> }
>
>
>  It simply takes the default es_ES Collator and changes strength to PRIMARY.
> This makes the collator return 0 when comparing 'electrico' and 'eléctrico'.
>
> After making sure this new Collator is available for the JVM, I re-start
> Derby and make a new database, now setting territory=es_ES_accentinsensitive
>
> The database is created without errors (meaning Derby reaches my Collator),
> but searches are still accent-sensitive (no matter if I use = or LIKE
> operators).
>
> Any clue? I searched extensively for this issue but found no
> solution. I can avoid the problem simply by using MySQL (its default Spanish
> configuration already has the desired behaviour), but I would like to keep
> using Derby if possible.
>
> I'm using JavaDB-Derby 10.4.2.1


Hi,

It seems to work in my environment. I used the collator provider class
that you posted and performed the steps below. (Note that when you set
the strength of the collator to PRIMARY, it will be case-insensitive in
addition to being accent-insensitive. Making it accent-insensitive and
at the same time case-sensitive is probably more work, but I think it
should be doable.)
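
To illustrate the note about strength, here is a minimal sketch using only the JDK's java.text.Collator. The class name CollatorStrengthDemo and the helper equalAt are made up for this example and are not part of Derby or of the posted provider. It compares the same three strings used in the session below at each strength level, so you can see that PRIMARY ignores both accents and case.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical demo class; not part of Derby or the posted provider.
class CollatorStrengthDemo {

    // True if a and b compare as equal under the es_ES collator
    // once its strength has been set to the given level.
    static boolean equalAt(int strength, String a, String b) {
        Collator c = Collator.getInstance(new Locale("es", "ES"));
        c.setStrength(strength);
        return c.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        String plain = "electrico";
        String accented = "el\u00e9ctrico"; // "eléctrico", escaped to avoid source-encoding issues
        String capital = "Electrico";

        int[] levels = { Collator.PRIMARY, Collator.SECONDARY, Collator.TERTIARY };
        String[] names = { "PRIMARY", "SECONDARY", "TERTIARY" };
        for (int i = 0; i < levels.length; i++) {
            System.out.println(names[i]
                + ": accent ignored=" + equalAt(levels[i], plain, accented)
                + ", case ignored=" + equalAt(levels[i], plain, capital));
        }
    }
}
```

At PRIMARY both comparisons should report equality, which matches the three rows returned by the query in the session below.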

k...@tecra:/tmp/coll % javac IgnoraAcentosCollatorProvider.java
k...@tecra:/tmp/coll % mkdir -p META-INF/services
k...@tecra:/tmp/coll % echo IgnoraAcentosCollatorProvider > META-INF/services/java.text.spi.CollatorProvider
k...@tecra:/tmp/coll % jar cf coll.jar IgnoraAcentosCollatorProvider* META-INF
k...@tecra:/tmp/coll % java -version
java version "1.6.0_15"
Java(TM) SE Runtime Environment (build 1.6.0_15-b03)
Java HotSpot(TM) Server VM (build 14.1-b02, mixed mode)
k...@tecra:/tmp/coll % java -Djava.ext.dirs=. -jar /code/derby/oldreleases/10.4.2.0/derbyrun.jar ij
ij version 10.4
ij> connect 'jdbc:derby:mydb;create=true;territory=es_ES_accentinsensitive;collation=TERRITORY_BASED';
ij> create table t(x varchar(20));
0 rows inserted/updated/deleted
ij> insert into t values 'electrico','eléctrico','Electrico';
3 rows inserted/updated/deleted
ij> select * from t where x='electrico';
X                   
--------------------
electrico           
eléctrico           
Electrico           

3 rows selected
ij> 


You can verify that the database was created with the correct territory
and collation by evaluating this in IJ and checking that "TERRITORY_BASED"
is returned:

ij> values syscs_util.syscs_get_database_property('derby.database.collation');
1
-------------------------
TERRITORY_BASED

1 row selected

And also check that service.properties in the database directory
contains the following line:

derby.serviceLocale=es_ES_accentinsensitive
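
If you would rather script that check than open the file by hand, something like the following might do. It is only a sketch: it assumes service.properties can be parsed as an ordinary java.util.Properties file and that the database directory is ./mydb, as in the session above. The class name ServiceLocaleCheck and the helper serviceLocale are invented for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

// Hypothetical helper; not part of Derby.
class ServiceLocaleCheck {

    // Parses the text of a service.properties file and returns the value of
    // derby.serviceLocale, or null if the property is not present.
    static String serviceLocale(String propertiesText) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return p.getProperty("derby.serviceLocale");
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the database directory is ./mydb unless given as the first argument.
        Path file = Paths.get(args.length > 0 ? args[0] : "mydb", "service.properties");
        String text = new String(Files.readAllBytes(file), StandardCharsets.ISO_8859_1);
        System.out.println("derby.serviceLocale=" + serviceLocale(text));
    }
}
```

Run it from the directory that contains the database, or pass the database directory as the first argument.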

Hope this helps,

-- 
Knut Anders
