On Tue, 06 Jan 2009 15:43:35 +0100 Pander <[email protected]> babbled:
> Carsten Haitzler (The Rasterman) wrote:
> > On Tue, 6 Jan 2009 11:49:55 +0100 "Olof Sjobergh" <[email protected]>
> > babbled:
> >
> >> Hi,
> >>
> >> I'm working on a Swedish dictionary and keyboard for Illume, but I'm
> >> having some trouble with sorting of utf8 chars in the dictionary. I
> >> can't seem to get the sorting right. Looking at the code, Illume sorts
> >> the dictionary after first normalizing the strings according to the
> >> internal normalization table. Is there any way to reproduce this
> >> sorting with the sort command? I've tried with a few different locales
> >> (C, en_US.utf8) which all make the unix sort command work differently.
> >> But no matter what I try words don't show up correctly.
> >
> > sort -f i think does it... i think...
> >
> >> Another issue I found is that the built in normalization table is not
> >> very good for typing Swedish text. On a standard Swedish qwerty
> >> layout, we have three additional letters (å, ä and ö). These are used
> >> very frequently in Swedish and there are many common words that have
> >> different meanings if spelled with a, å or ä (for example har, här and
> >> hår are all very common words). But in Illume these are all normalized
> >> to a. Writing Swedish with a US qwerty layout and then having to
> >> select aåä manually after the dictionary lookup is a pain, since many
> >> common words will have to be selected from the lookup list each time.
> >>
> >> Instead, what you want is a Swedish qwerty layout (which is very
> >> simple to implement as a .kbd file), and not normalize åäö for the
> >> Swedish dictionary lookup. So the normalization table would really
> >> need to be configurable, either as a part of the dictionary or the
> >> .kbd file. I suppose this problem exists for other languages as well.
> >> If I were to work on such a change, what would be the best approach?
> >
> > hmm interesting i was just going off german/french and portuguese on this
> > where i thought i could get away with simple normalisation and a basic
> > qwerty layout
> > - with selecting the matches (Vogel/Vögel for example). making the table
> > part of the dictionary does make a lot of sense of course. the dict format
> > does need to change to make it a lot faster and intl-char friendly. i
> > avoided this at the time as i'd need to efficiently encode a b-tree in the
> > file and be able to mmap () it efficiently and use it.
>
> Mapping of cafe to café (French) and Vogel to Vögel (German) is indeed
> handy, this functionality would be handy internationally for most languages.
>
> What about mapping Koeln to Köln etcetera? This would be handy for
> German only. Like the above story is (maybe) specific for Swedish.

yup. i've gone over this before. i think the solution is a dict change. you
have a match string and a list of possible outputs:

vogel -> Vogel,Vögel
koln -> Köln
koeln -> Köln

etc. etc. - this allows arbitrary mappings from 1 string to any other. should
cover a whole HOST of languages (japanese, chinese and korean included if using
the romanised input methods of these languages). again - whole dict format
change would be needed and it'd be much harder to create dicts.

> Perhaps an optional config file can be provided for the dictionaries
> that need one. Keeping this info outside the dict itself eases sorting
> of the dict and upgrading dicts. I would keep this optional config
> surely independent of the .kbd keyboard configs.
>
> Raster, the dicts I'm making for Dutch will be a large version (250.000
> words) and a small version. Do you have an indication how many words is
> advisable for the small version?

you don't really need a small one - the small english one i used 1. because it
was simpler to check my match results in a small set of data and it used less
ram in my initial "in memory only" dict code.
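
The match-string idea above (one normalised key, a list of possible outputs)
can be sketched roughly as follows. This is only an illustration in Python,
not Illume's dict code or file format: the names normalize, build_index, keep
and extra are made up for the example, and accent stripping via Unicode
decomposition stands in for Illume's internal normalization table, which may
behave differently.

```python
import unicodedata
from collections import defaultdict


def normalize(word, keep=""):
    """Lowercase a word and strip accents, except for characters in `keep`.

    `keep` is a hypothetical per-language setting, e.g. "åäö" for Swedish.
    """
    out = []
    # NFC first, so accented letters are single code points we can test
    for ch in unicodedata.normalize("NFC", word).lower():
        if ch in keep:
            out.append(ch)
            continue
        # NFD splits "ö" into "o" plus a combining mark; drop the marks
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)


def build_index(words, keep="", extra=None):
    """Map each match string to the list of words it can expand to.

    `extra` holds hand-written mappings such as koeln -> Köln that no
    normalization rule would produce on its own.
    """
    index = defaultdict(list)
    for word in words:
        index[normalize(word, keep)].append(word)
    for key, outputs in (extra or {}).items():
        for word in outputs:
            if word not in index[key]:
                index[key].append(word)
    return dict(index)


# German: umlauts fold onto the base letter, plus one explicit extra mapping
german = build_index(["Vogel", "Vögel", "Köln"], extra={"koeln": ["Köln"]})

# Swedish: å, ä and ö stay distinct, so har / här / hår get separate keys
swedish = build_index(["har", "här", "hår"], keep="åäö")
```

With this, german["vogel"] lists both Vogel and Vögel, while the Swedish index
keeps har, här and hår apart, so typing on a Swedish layout would not need a
manual pick from the match list for those words.
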

in the end there likely needs to be a major dict format and data content change
to basically support all this stuff. but once done it should cover a whole slew
of languages.

> However it would be desirable that each .kbd file can indicate:
> - predictive mode is not possible, e.g. for numeric keyboards. I don't
> want it to remember my PIN, credit card number, etcetera. (numeric
> keyboard, a real one, without the é, ë, ..)

outputting keysyms instead of strings (like Terminal.kbd) bypasses the dict. so
this is how it is effectively turned off.

> - predictive mode is default on, but user can temporarily disable it,
> e.g. when going into a shell (alpha keyboard)

that's what Terminal.kbd is for... ?

> - predictive mode is default off, but user can temporarily enable it,
> e.g. when typing prose inside a shell (terminal keyboard)

of course this can be done - the problem is - where do i "conveniently" attach
all the controls. i guess if no word is composed currently ^ on the top-left
can pop up a control panel. but for now - kbd is not on my radar - got other
things to do at the moment. :(

-- 
------------- Codito, ergo sum - "I code, therefore I am" --------------
The Rasterman (Carsten Haitzler)    [email protected]

_______________________________________________
Openmoko community mailing list
[email protected]
http://lists.openmoko.org/mailman/listinfo/community

