Illume keyboard dictionary sorting and normalization

2009-01-06 Thread Olof Sjobergh
Hi,

I'm working on a Swedish dictionary and keyboard for Illume, but I'm
having some trouble with sorting of utf8 chars in the dictionary. I
can't seem to get the sorting right. Looking at the code, Illume sorts
the dictionary after first normalizing the strings according to the
internal normalization table. Is there any way to reproduce this
sorting with the sort command? I've tried with a few different locales
(C, en_US.utf8) which all make the unix sort command work differently.
But no matter what I try words don't show up correctly.
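
The mismatch with plain `sort` comes from comparing raw strings instead of normalized ones. A minimal sketch of sorting on the normalized key follows. It is an illustration only: `normalize` is a hypothetical stand-in that lowercases and strips accents through Unicode decomposition, and it does not reproduce Illume's actual internal table.

```python
import unicodedata

def normalize(word: str) -> str:
    """Hypothetical stand-in for Illume's table: lowercase and drop accents."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def sort_dictionary(words):
    """Sort by normalized form; ties are broken by the original spelling."""
    return sorted(words, key=lambda w: (normalize(w), w))

words = ["hår", "Zebra", "här", "apa", "har", "Äpple"]
sorted_words = sort_dictionary(words)
# har, här and hår share the key "har", so they end up next to each other.
```

Whatever tool produces the dictionary file, the point is that the sort key has to be the same normalized string that the lookup code compares against.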

Another issue I found is that the built-in normalization table is not
very good for typing Swedish text. On a standard Swedish qwerty
layout, we have three additional letters (å, ä and ö). These are used
very frequently in Swedish and there are many common words that have
different meanings if spelled with a, å or ä (for example har, här and
hår are all very common words). But in Illume these are all normalized
to a. Writing Swedish with a US qwerty layout and then having to
select aåä manually after the dictionary lookup is a pain, since many
common words will have to be selected from the lookup list each time.

Instead, what you want is a Swedish qwerty layout (which is very
simple to implement as a .kbd file), and not normalize åäö for the
Swedish dictionary lookup. So the normalization table would really
need to be configurable, either as a part of the dictionary or the
.kbd file. I suppose this problem exists for other languages as well.
If I were to work on such a change, what would be the best approach?
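
One possible shape for a configurable table is a normalizer built from a per-language list of letters that must be left alone. The sketch below is hypothetical: `make_normalizer` and its `keep` parameter are invented for illustration and do not correspond to anything in Illume, and the input is assumed to use precomposed (NFC) characters.

```python
import unicodedata

def make_normalizer(keep: str = ""):
    """Return a normalizer that strips accents except for letters in `keep`."""
    keep_set = set(keep)

    def normalize(word: str) -> str:
        out = []
        for ch in word.lower():
            if ch in keep_set:
                out.append(ch)  # language-specific letter: leave untouched
            else:
                base = unicodedata.normalize("NFD", ch)
                out.append("".join(c for c in base if not unicodedata.combining(c)))
        return "".join(out)

    return normalize

default_norm = make_normalizer()       # everything folded to base letters
swedish_norm = make_normalizer("åäö")  # hypothetical per-dictionary setting

print([default_norm(w) for w in ["har", "här", "hår"]])  # ['har', 'har', 'har']
print([swedish_norm(w) for w in ["har", "här", "hår"]])  # ['har', 'här', 'hår']
```

With the Swedish setting the three words stay distinct, while accents in loanwords such as café are still folded.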

Best regards,

Olof Sjobergh

___
Openmoko community mailing list
community@lists.openmoko.org
http://lists.openmoko.org/mailman/listinfo/community


Re: Illume keyboard dictionary sorting and normalization

2009-01-06 Thread Pander
Carsten Haitzler (The Rasterman) wrote:
 On Tue, 6 Jan 2009 11:49:55 +0100 Olof Sjobergh olo...@gmail.com babbled:
 
 Hi,

 I'm working on a Swedish dictionary and keyboard for Illume, but I'm
 having some trouble with sorting of utf8 chars in the dictionary. I
 can't seem to get the sorting right. Looking at the code, Illume sorts
 the dictionary after first normalizing the strings according to the
 internal normalization table. Is there any way to reproduce this
 sorting with the sort command? I've tried with a few different locales
 (C, en_US.utf8) which all make the unix sort command work differently.
 But no matter what I try words don't show up correctly.
 
 sort -f i think does it... i think...
 
 Another issue I found is that the built in normalization table is not
 very good for typing Swedish text. On a standard Swedish qwerty
 layout, we have three additional letters (å, ä and ö). These are used
 very frequently in Swedish and there are many common words that have
 different meanings if spelled with a, å or ä (for example har, här and
 hår are all very common words). But in Illume these are all normalized
 to a. Writing Swedish with a US qwerty layout and then having to
 select aåä manually after the dictionary lookup is a pain, since many
 common words will have to be selected from the lookup list each time.

 Instead, what you want is a Swedish qwerty layout (which is very
 simple to implement as a .kbd file), and not normalize åäö for the
 Swedish dictionary lookup. So the normalization table would really
 need to be configurable, either as a part of the dictionary or the
 .kbd file. I suppose this problem exists for other languages as well.
 If I were to work on such a change, what would be the best approach?
 
 hmm interesting i was just going off german/french and portuguese on this where
 i thought i could get away with simple normalisation and a basic qwerty layout
 - with selecting the matches (Vogel/Vögel for example). making the table part
 of the dictionary does make a lot of sense of course. the dict format does need
 to change to make it a lot faster and intl-char friendly. i avoided this at the
 time as i'd need to efficiently encode a b-tree in the file and be able to
 mmap() it efficiently and use it.

Mapping cafe to café (French) and Vogel to Vögel (German) is indeed
handy; this functionality would be useful internationally for most languages.

What about mapping Koeln to Köln, etc.? That would be handy for German
only, just as the case described above is (maybe) specific to Swedish.

Perhaps an optional config file can be provided for the dictionaries
that need one. Keeping this info outside the dict itself eases sorting
and upgrading of the dicts. I would definitely keep this optional config
independent of the .kbd keyboard configs.

Raster, the dicts I'm making for Dutch will come in a large version (250,000
words) and a small version. Do you have an indication of how many words are
advisable for the small version?

However, it would be desirable for each .kbd file to be able to indicate one of:
- predictive mode is not possible, e.g. for numeric keyboards. I don't
want it to remember my PIN, credit card number, etc. (numeric
keyboard, a real one, without the é, ë, ..)
- predictive mode is on by default, but the user can temporarily disable it,
e.g. when going into a shell (alpha keyboard)
- predictive mode is off by default, but the user can temporarily enable it,
e.g. when typing prose inside a shell (terminal keyboard)



 
 




Re: Illume keyboard dictionary sorting and normalization

2009-01-06 Thread Pander
Pander wrote:
...
 However it would be desirable that each .kbd file can indicate:
 - predictive mode is not possible, e.g. for numeric keyboards. I don't
 want it to remember my PIN, credit card number, etcetera. (numeric
 keyboard, a real one, without the é, ë, ..)
 - predictive mode is default on, but user can temporarily disable it,
 e.g. when going into a shell (alpha keyboard)
 - predictive mode is default off, but user can temporarily enable it,
 e.g. when typing prose inside a shell (terminal keyboard)







I think the functionality described above is already partly implemented by
these settings in the .kbd file:
- type NUMERIC
- type ALPHA
- type TERMINAL
But what behaviour exactly do these types enable?

Could TERMINAL and NUMERIC by default also be non-predictive?

It is not clear to me from
http://trac.enlightenment.org/e/browser/trunk/e/src/modules/illume/e_kbd.c
and
http://trac.enlightenment.org/e/browser/trunk/e/src/modules/illume/e_kbd_int.c
what effect the type has on predictive mode. Also, there are many more modes
available, like HEX, PASSWORD etc.



Re: Illume keyboard dictionary sorting and normalization

2009-01-06 Thread Olof Sjobergh
On Tue, Jan 6, 2009 at 11:57 AM, The Rasterman Carsten Haitzler
ras...@rasterman.com wrote:
 sort -f i think does it... i think...

Thanks, that seems to work.

I created a package and uploaded it to
http://www.opkg.org/package_90.html for anyone who is interested. The
source is hosted at http://github.com/olofsj/swedish-illume.

 hmm interesting i was just going off german/french and portuguese on this where
 i thought i could get away with simple normalisation and a basic qwerty layout
 - with selecting the matches (Vogel/Vögel for example). making the table part
 of the dictionary does make a lot of sense of course. the dict format does need
 to change to make it a lot faster and intl-char friendly. i avoided this at the
 time as i'd need to efficiently encode a b-tree in the file and be able to
 mmap() it efficiently and use it.

I understand it would make the dictionary format more complicated.
Maybe it could be split into 2 files, one with general configuration
data such as a normalisation table, an icon etc, and then a raw
dictionary file like there is now.

Best regards,

Olof Sjöbergh



Re: Illume keyboard dictionary sorting and normalization

2009-01-06 Thread The Rasterman
On Tue, 06 Jan 2009 15:43:35 +0100 Pander pan...@users.sourceforge.net
babbled:

 Carsten Haitzler (The Rasterman) wrote:
  On Tue, 6 Jan 2009 11:49:55 +0100 Olof Sjobergh olo...@gmail.com
  babbled:
  
  Hi,
 
  I'm working on a Swedish dictionary and keyboard for Illume, but I'm
  having some trouble with sorting of utf8 chars in the dictionary. I
  can't seem to get the sorting right. Looking at the code, Illume sorts
  the dictionary after first normalizing the strings according to the
  internal normalization table. Is there any way to reproduce this
  sorting with the sort command? I've tried with a few different locales
  (C, en_US.utf8) which all make the unix sort command work differently.
  But no matter what I try words don't show up correctly.
  
  sort -f i think does it... i think...
  
  Another issue I found is that the built in normalization table is not
  very good for typing Swedish text. On a standard Swedish qwerty
  layout, we have three additional letters (å, ä and ö). These are used
  very frequently in Swedish and there are many common words that have
  different meanings if spelled with a, å or ä (for example har, här and
  hår are all very common words). But in Illume these are all normalized
  to a. Writing Swedish with a US qwerty layout and then having to
  select aåä manually after the dictionary lookup is a pain, since many
  common words will have to be selected from the lookup list each time.
 
  Instead, what you want is a Swedish qwerty layout (which is very
  simple to implement as a .kbd file), and not normalize åäö for the
  Swedish dictionary lookup. So the normalization table would really
  need to be configurable, either as a part of the dictionary or the
  .kbd file. I suppose this problem exists for other languages as well.
  If I were to work on such a change, what would be the best approach?
  
  hmm interesting i was just going off german/french and portuguese on this
  where i thought i could get away with simple normalisation and a basic
  qwerty layout
  - with selecting the matches (Vogel/Vögel for example). making the table
  part of the dictionary does make a lot of sense of course. the dict format
  does need to change to make it a lot faster and intl-char friendly. i
  avoided this at the time as i'd need to efficiently encode a b-tree in the
  file and be able to mmap () it efficiently and use it.
 
 Mapping of cafe to café (French) and Vogel to Vögel (German) is indeed
 handy, this functionality would be handy internationally for most languages.
 
 What about mapping Koeln to Köln etcetera? This would be handy for
 German only. Like the above story is (maybe) specific for Swedish.

yup. i've gone over this before. i think the solution is a dict change. you
have a match string and a list of possible outputs:

vogel - Vogel,Vögel
koln - Köln
koeln - Köln

etc. etc. - this allows arbitrary mappings from 1 string to any other. should
cover a whole HOST of languages (japanese, chinese and korean included if using
the romanised input methods of these languages). again - a whole dict format
change would be needed and it'd be much harder to create dicts.
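
A minimal sketch of such a match-string table follows. The `match - Out1,Out2` line syntax is just the notation used in this mail, not an agreed file format, and `parse_dict` and `candidates` are hypothetical names.

```python
from collections import defaultdict

def parse_dict(lines):
    """Parse 'match - Out1,Out2' lines into {match: [outputs]}."""
    table = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        match, _, outputs = line.partition(" - ")
        for out in outputs.split(","):
            out = out.strip()
            if out and out not in table[match]:
                table[match].append(out)
    return dict(table)

entries = parse_dict([
    "vogel - Vogel,Vögel",
    "koln - Köln",
    "koeln - Köln",
])

def candidates(typed: str):
    """Return every display string registered for what the user typed."""
    return entries.get(typed.lower(), [])
```

Here `candidates("vogel")` offers both Vogel and Vögel, while `koln` and `koeln` each lead to Köln, so no separate normalization table is involved.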

 Perhaps an optional config file can be provided for the dictionaries
 that need one. Keeping this info outside the dict itself eases sorting
 of the dict and upgrading dicts. I would keep this optional config
 surely independent of the .kbd keyboard configs.
 
 Raster, the dicts I'm making for Dutch will be a large version (250.000
 words) and a small version. Do you have an indication how many words is
 advisable for the small version?

you don't really need a small one - i used the small english one because it
was simpler to check my match results in a small set of data and it used less
ram in my initial in-memory-only dict code. in the end there likely needs to be
a major dict format and data content change to basically support all this stuff.
but once done it should cover a whole slew of languages.

 However it would be desirable that each .kbd file can indicate:
 - predictive mode is not possible, e.g. for numeric keyboards. I don't
 want it to remember my PIN, credit card number, etcetera. (numeric
 keyboard, a real one, without the é, ë, ..)

outputting keysyms instead of strings (like Terminal.kbd) bypasses the dict. so
this is how it is effectively turned off.

 - predictive mode is default on, but user can temporarily disable it,
 e.g. when going into a shell (alpha keyboard)

that's what Terminal.kbd is for... ?

 - predictive mode is default off, but user can temporarily enable it,
 e.g. when typing prose inside a shell (terminal keyboard)

of course this can be done - the problem is - where do i conveniently attach
all the controls. i guess if no word is composed currently ^ on the top-left
can pop up a control panel.

but for now - kbd is not on my radar - got other things to do at the moment. :(

-- 
- Codito, ergo sum - I code, therefore I am --
The Rasterman (Carsten 

Re: Illume keyboard dictionary sorting and normalization

2009-01-06 Thread The Rasterman
On Tue, 06 Jan 2009 15:58:55 +0100 Pander pan...@users.sourceforge.net
babbled:

 Pander wrote:
 ...
  However it would be desirable that each .kbd file can indicate:
  - predictive mode is not possible, e.g. for numeric keyboards. I don't
  want it to remember my PIN, credit card number, etcetera. (numeric
  keyboard, a real one, without the é, ë, ..)
  - predictive mode is default on, but user can temporarily disable it,
  e.g. when going into a shell (alpha keyboard)
  - predictive mode is default off, but user can temporarily enable it,
  e.g. when typing prose inside a shell (terminal keyboard)
 
 
 
 
 
 
 
 I think the functionality described above is already partly implemented for:
 - type NUMERIC
 - type ALPHA
 - type TERMINAL
 settings in the .kbd file. But what behaviour is exactly enabled by the
 above types?

none. it's a hint. for example if the app says you're in a numeric input
field - it asks the keyboard for a numeric mode so the keyboard code can switch
to that layout. it's there to hook in with the hints from apps.

 Could TERMINAL and NUMERIC by default also be non-predictive?

they are both non predictive. remember the kbd only puts strings through the
dict - keysyms (raw key names) it doesn't.

 It is not clear for me from
 http://trac.enlightenment.org/e/browser/trunk/e/src/modules/illume/e_kbd.c
 and
 http://trac.enlightenment.org/e/browser/trunk/e/src/modules/illume/e_kbd_int.c
 what effect it has on predictive mode. Also, there are many more modes
 available like HEX, PASSWORD etc.

dictionary use is implied by how the key outputs a char. i.e. using ! instead
of exclam (for exclamation mark) puts ! thru the dict, but exclam goes straight
to the app - no dict involved. that's why Terminal.kbd doesn't go through the
dictionary prediction at all - as no keys there output a string.
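
The routing rule described above can be sketched as follows. This is a toy model, not the code in e_kbd_int.c: the `Key` and `Keyboard` classes and their fields are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Key:
    output: str
    is_keysym: bool  # True: raw key name such as "exclam"; False: string such as "!"

@dataclass
class Keyboard:
    word_buffer: list = field(default_factory=list)  # text pending dictionary lookup
    sent_to_app: list = field(default_factory=list)  # keysyms delivered directly

    def press(self, key: Key) -> None:
        if key.is_keysym:
            self.sent_to_app.append(key.output)  # bypasses the dictionary
        else:
            self.word_buffer.append(key.output)  # goes through prediction

kbd = Keyboard()
kbd.press(Key("!", is_keysym=False))
kbd.press(Key("exclam", is_keysym=True))
```

A layout made only of keysym keys never fills the word buffer, which is the sense in which such a layout is non-predictive.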

-- 
- Codito, ergo sum - I code, therefore I am --
The Rasterman (Carsten Haitzler)    ras...@rasterman.com




Re: Illume keyboard dictionary sorting and normalization

2009-01-06 Thread The Rasterman
On Tue, 6 Jan 2009 17:28:30 +0100 Olof Sjobergh olo...@gmail.com babbled:

 On Tue, Jan 6, 2009 at 11:57 AM, The Rasterman Carsten Haitzler
 ras...@rasterman.com wrote:
  sort -f i think does it... i think...
 
 Thanks, that seems to work.
 
 I created a package and uploaded it to
 http://www.opkg.org/package_90.html for anyone who is interested. The
 source is hosted at http://github.com/olofsj/swedish-illume.
 
  hmm interesting i was just going off german/french and portuguese on this
  where i thought i could get away with simple normalisation and a basic
  qwerty layout
  - with selecting the matches (Vogel/Vögel for example). making the table
  part of the dictionary does make a lot of sense of course. the dict format
  does need to change to make it a lot faster and intl-char friendly. i
  avoided this at the time as i'd need to efficiently encode a b-tree in the
  file and be able to mmap () it efficiently and use it.
 
 I understand it would make the dictionary format more complicated.
 Maybe it could be split into 2 files, one with general configuration
 data such as a normalisation table, an icon etc, and then a raw
 dictionary file like there is now.

it could be - but there needs to be a redo of the dict format. i need to at
least add the following:

1. actual match string and display string should be different. i.e.:
(german) vogel - Vogel,Vögel
(japanese) sakana - さかな,サカナ,魚,肴,茶菓な

yes japanese is a silly example as there is no way to type matches in kanji
(chinese chars) - you NEED an input method (this is where vkbd and xim etc.
need to tie in eventually).

2. something much faster to just mmap() and use dynamically. it currently is
mmaped with a small lookup table built for faster access - but it still needs to
parse whole lines on the fly to do matching even tho it's mmaped. so
if the format changes - then it doesn't matter much.
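
To illustrate the kind of lookup table meant in point 2, here is a rough sketch. It assumes a byte-sorted, one-word-per-line buffer and an index keyed on the first byte of each line; the actual table Illume builds may be organised differently, and `build_index` and `lookup` are hypothetical names. A plain `bytes` object stands in for the mmapped file.

```python
def build_index(data: bytes) -> dict:
    """Map each leading byte to the offset of the first line starting with it."""
    index = {}
    pos = 0
    while pos < len(data):
        index.setdefault(data[pos], pos)
        nl = data.find(b"\n", pos)
        pos = len(data) if nl == -1 else nl + 1
    return index

def lookup(data: bytes, index: dict, prefix: bytes) -> list:
    """Return words starting with `prefix`, scanning only its first-byte bucket."""
    if not prefix or prefix[0] not in index:
        return []
    results = []
    pos = index[prefix[0]]
    while pos < len(data) and data[pos] == prefix[0]:
        nl = data.find(b"\n", pos)
        end = len(data) if nl == -1 else nl
        line = data[pos:end]  # each candidate line is still parsed on the fly
        if line.startswith(prefix):
            results.append(line.decode("utf-8"))
        pos = end + 1
    return results

data = "apa\nhar\nhus\nhär\nzebra\n".encode("utf-8")
index = build_index(data)
```

The index only narrows the search to one bucket. Every line in that bucket is still scanned and decoded, which is the cost a binary, tree-shaped format would avoid.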

so the above ability to map 1 input match to multiple possible outputs (that
could even be radically different) would negate the need for a mapping table :)

-- 
- Codito, ergo sum - I code, therefore I am --
The Rasterman (Carsten Haitzler)    ras...@rasterman.com

