Hi Tibor,

Thanks for your reply!

> Can you check for which records süsstrunk vs susstrunk appear in the
> index? If you isolate record ID examples for both forms, then check
> those records' MARCXML values and the bibxxx table values, to see if
> there is some difference between them in stored values? Chances are
> there will be.

It seems that both bibxxx tables and idxWORDXXX tables have issues
(duplicates, ? chars which appear to be real question marks and not
encoding issues).

> Since you mention there were some title encoding troubles, maybe the
> tables were not fully properly converted from Latin-1 to UTF-8? The
> conversion usually goes like:

The other tables (bibdoc, collectionname and friends) seem to be encoded
correctly though.

Is there a way to reindex completely (bibxxx and idxWORD tables) from the
bibfmt tables? Of course, I'd rather not truncate bibrec tables and
reinsert the 60'000 xml files ;-)

Best regards,

Greg

> $ mysqldump -u root -p cdsinvenio --default-character-set=latin1 collectionname > z.sql
> $ vi z.sql # change "SET NAMES latin1" to "SET NAMES utf8" and/or "DEFAULT CHARSET=latin1" to "DEFAULT CHARSET=utf8"
> $ cat z.sql | mysql --default-character-set=utf8 -u root -p cdsinvenio
>
> (very schematically speaking)
>
> Best regards
> --
> Tibor Simko ** CERN Document Server ** <http://cds.cern.ch/>

____________________________________________________________________
Gregory Favre
Coordinateur Infoscience
École Polytechnique Fédérale de Lausanne
KIS - DIT
Case Postale 121
CH-1015 Lausanne
+41 21 693 22 88
+41 79 599 09 06
[email protected]
http://plan.epfl.ch/?sciper=128933
____________________________________________________________________
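
The manual "vi z.sql" step in the quoted conversion recipe could probably be scripted. What follows is only a minimal sketch, assuming the dump contains the literal strings "SET NAMES latin1" and "DEFAULT CHARSET=latin1" as quoted above; the function name fix_charset and the output file name z.utf8.sql are made up for illustration and are not part of Invenio or MySQL.

```shell
# Hypothetical helper (not part of Invenio or MySQL): rewrite the charset
# declarations in a mysqldump stream. Reads the dump on stdin and writes
# the edited dump to stdout; every other line passes through unchanged.
fix_charset() {
    sed -e 's/SET NAMES latin1/SET NAMES utf8/g' \
        -e 's/DEFAULT CHARSET=latin1/DEFAULT CHARSET=utf8/g'
}

# Possible usage, with the file name from the quoted commands
# (z.utf8.sql is an assumed name for the edited copy):
#   fix_charset < z.sql > z.utf8.sql
```

Note that this is a blind textual substitution: it would also rewrite those same strings if they happened to occur inside row data, so it seems safer to diff the edited dump against the original before loading it.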
