On Tue, 6 Jan 2009 11:49:55 +0100 "Olof Sjobergh" <[email protected]> babbled:
> Hi,
>
> I'm working on a Swedish dictionary and keyboard for Illume, but I'm
> having some trouble with sorting of utf8 chars in the dictionary. I
> can't seem to get the sorting right. Looking at the code, Illume sorts
> the dictionary after first normalizing the strings according to the
> internal normalization table. Is there any way to reproduce this
> sorting with the sort command? I've tried with a few different locales
> (C, en_US.utf8) which all make the unix sort command work differently.
> But no matter what I try, words don't show up correctly.

sort -f i think does it... i think...

> Another issue I found is that the built-in normalization table is not
> very good for typing Swedish text. On a standard Swedish qwerty
> layout, we have three additional letters (å, ä and ö). These are used
> very frequently in Swedish and there are many common words that have
> different meanings if spelled with a, å or ä (for example har, här and
> hår are all very common words). But in Illume these are all normalized
> to a. Writing Swedish with a US qwerty layout and then having to
> select a, å or ä manually after the dictionary lookup is a pain, since
> many common words will have to be selected from the lookup list each
> time.
>
> Instead, what you want is a Swedish qwerty layout (which is very
> simple to implement as a .kbd file), and not normalize åäö for the
> Swedish dictionary lookup. So the normalization table would really
> need to be configurable, either as a part of the dictionary or the
> .kbd file. I suppose this problem exists for other languages as well.
> If I were to work on such a change, what would be the best approach?

hmm interesting. i was just going off german/french and portuguese on
this, where i thought i could get away with simple normalisation and a
basic qwerty layout - with selecting the matches (Vogel/Vögel for
example). making the table part of the dictionary does make a lot of
sense of course.

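To make the idea of a per-dictionary normalization table concrete, here is a minimal Python sketch. It is not Illume's code: the table contents, the function names and the tie-breaking rule are all assumptions chosen for illustration, and Illume's real internal table is not reproduced here.

```python
# Hypothetical normalization tables; Illume's real table is not shown here.
# DEFAULT_TABLE folds accented letters onto their base letter.
DEFAULT_TABLE = {"å": "a", "ä": "a", "ö": "o", "ü": "u", "é": "e"}
# SWEDISH_TABLE leaves å, ä and ö alone so they stay distinct letters.
SWEDISH_TABLE = {"ü": "u", "é": "e"}


def normalize(word, table):
    """Lowercase the word, then map each character through the table."""
    return "".join(table.get(ch, ch) for ch in word.lower())


def sort_dictionary(words, table):
    """Sort by normalized form; ties fall back to the raw word (assumed rule)."""
    return sorted(words, key=lambda w: (normalize(w, table), w))


def lookup(typed, words, table):
    """Return every word whose normalized form equals the normalized input."""
    key = normalize(typed, table)
    return [w for w in words if normalize(w, table) == key]


words = ["hår", "har", "här", "hat"]

# With the folding table, typing "har" matches three different words.
print(lookup("har", words, DEFAULT_TABLE))
# With the Swedish table, only the exact word matches.
print(lookup("har", words, SWEDISH_TABLE))
print(sort_dictionary(words, SWEDISH_TABLE))
```

With the folding table, "har" pulls in hår, har and här, which is the manual-selection problem described above; with the Swedish table only "har" matches. Note that the sketch compares strings by Unicode code point, which puts ä before å and so is not Swedish alphabetical order. A dictionary file has to be sorted with whatever comparison the lookup code uses, which may be why no locale setting of the sort command reproduces it.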
the dict format does need to change to make it a lot faster and
intl-char friendly. i avoided this at the time as i'd need to
efficiently encode a b-tree in the file and be able to mmap() it
efficiently and use it.

-- 
------------- Codito, ergo sum - "I code, therefore I am" --------------
The Rasterman (Carsten Haitzler)    [email protected]

_______________________________________________
Openmoko community mailing list
[email protected]
http://lists.openmoko.org/mailman/listinfo/community

