Hello!
I recently discovered that my database was using the latin1 default encoding. I 
switched to UTF-8 (which also meant running SQL CONVERTs and friends). Everything 
seemed to work fine (except for some strangely encoded titles), but then I 
discovered some terrible indexing problems. I reindexed the whole site using 
the instructions found in bibindex's documentation.
However, searches for accented names still seem to be broken, for example: 
http://infoscience.epfl.ch/search?as=0&sc=1&p=süsstrunk&f=&Submit=Search
(on my failover server there would be something like 115 results). If I use the 
verbose mode, I see that Süsstrunk is actually converted to susstrunk and then 
searched. Looking at the tables, there is a difference:

failover (latin-1 db): 
select term  from idxWORD04F where term like "S%sstrunk";
+-----------+
| term      |
+-----------+
| susstrunk | 
+-----------+
1 row in set (0.01 sec)

Production (utf8 db):
select term  from idxWORD04F where term like "S%sstrunk";
+-----------+
| term      |
+-----------+
| s?sstrunk | 
| susstrunk | 
+-----------+
2 rows in set (0.00 sec)
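
For reference, here is a minimal Python sketch of the folding that the verbose
output suggests (ü becoming u), together with the byte sequences the accented
term would have in each encoding. The helper name strip_accents is hypothetical
and this is not the indexer's actual code; it is only meant to give something
to compare against the output of SELECT term, HEX(term) on the odd row.

```python
import unicodedata


def strip_accents(word):
    """Fold accented letters to their base letters (hypothetical helper)."""
    # NFKD splits "ü" into "u" followed by a combining diaeresis.
    decomposed = unicodedata.normalize("NFKD", word)
    # Drop the combining marks and keep only the base characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


term = "süsstrunk"

folded = strip_accents(term)                # what the search appears to use
utf8_hex = term.encode("utf-8").hex()       # bytes if the term is stored as UTF-8
latin1_hex = term.encode("latin-1").hex()   # bytes if the term is stored as latin1

print(folded)
print(utf8_hex)
print(latin1_hex)
```

If HEX(term) for the s?sstrunk row matches neither byte sequence, that would
suggest the term was damaged during the conversion rather than during indexing,
but that is only a guess.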

My SQL variables are:
| character_set_client     | utf8                       | 
| character_set_connection | utf8                       | 
| character_set_database   | utf8                       | 
| character_set_filesystem | binary                     | 
| character_set_results    | utf8                       | 
| character_set_server     | latin1                     | 
| character_set_system     | utf8                       | 
| character_sets_dir       | /usr/share/mysql/charsets/ | 
| collation_connection     | utf8_general_ci            | 
| collation_database       | utf8_swedish_ci            | 
| collation_server         | latin1_swedish_ci          |

show create table idxWORD04F;
CREATE TABLE `idxWORD04F` (
  `id` mediumint(9) unsigned NOT NULL auto_increment,
  `term` varchar(50) default NULL,
  `hitlist` longblob,
  PRIMARY KEY  (`id`),
  UNIQUE KEY `term` (`term`)
) ENGINE=MyISAM AUTO_INCREMENT=52242 DEFAULT CHARSET=utf8

Any suggestions on how to solve this?

Best regards,
Greg


____________________________________________________________________

Gregory Favre
Coordinateur Infoscience
École Polytechnique Fédérale de Lausanne
KIS - DIT
Case Postale 121
CH-1015 Lausanne
+41 21 693 22 88
+ 41 79 599 09 06
[email protected]
http://plan.epfl.ch/?sciper=128933
____________________________________________________________________



