On Thu, 28 Aug 2008 04:37:23 +0200 "Marco Trevisan (Treviño)"
<[EMAIL PROTECTED]> babbled:

> Carsten Haitzler (The Rasterman) wrote:
> > On Wed, 27 Aug 2008 23:12:59 +0200 "Marco Trevisan (Treviño)"
> > <[EMAIL PROTECTED]> babbled:
> >> Imho, a way to reduce the size would be allowing a rule to set suffix 
> >> and prefix (for composed words) that would reduce the dictionary size.
> >> So, for example, in my dictionary instead of using 50 lines for each 
> >> verb I would use only one line per verb; i.e.:
> >>
> >> Italian verb "parlare" (to talk) would be (not complete)
> >>   parl{o,i,a,iamo,ate,ano,avo,avi,ava,avamo,avate,avano,ai,asti,ò,ammo, \
> >>        aste,arono,erò,erai,erà,eremo,erete,eranno,erei,eresti,erebbe, \
> >>        eremmo,ereste,erebbero,ii,iamo,iate,ino,assi,asse,assimo, \
> >>        assero,ino,ando,ante,ato,ata,ati}
> >>
> >> Italian noun "casa" (house) would be
> >>   cas{a,e}
> >>
> >> Italian adjective "libero" (free [as freedom]) would be
> >>   liber{a,i,o}
> > 
> > yup yup. don't worry - i understand why :) i speak several languages myself
> > (not italian - but i did study latin, and speak french, german, english,
> > japanese, some usable level of portuguese). i definitely get the language
> > issues - for both european and asian languages :) yes. the above would
> > reduce dictionary size. it would make parsing it much harder.
> I suspected this :/. I had hoped to be wrong...

i am thinking about this... i have some ideas that may improve this... this is
my thought train:

right now format is either:

word 123\n
word 23\n
(sorted case-insensitive).

the numbers are "frequency of use" so those used more will have more primary
position in the match list
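
a rough python sketch of reading that format and ranking matches by frequency.
this is not the illume code - the function names, the prefix matching and the
"missing count means 0" rule are assumptions made for illustration only:

```python
def parse_dict(text):
    """Parse 'word NNN' lines into (word, frequency) pairs."""
    entries = []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        word = parts[0]
        # assumption: a line without a count gets frequency 0
        freq = int(parts[1]) if len(parts) > 1 else 0
        entries.append((word, freq))
    return entries


def matches(entries, typed):
    """Words starting with `typed` (case-insensitive), most used first."""
    key = typed.lower()
    hits = [(w, f) for w, f in entries if w.lower().startswith(key)]
    # stable sort: entries with equal frequency keep their dictionary order
    hits.sort(key=lambda wf: -wf[1])
    return [w for w, _ in hits]


if __name__ == "__main__":
    entries = parse_dict("Hello 5\nhelm 12\nhelp 40\nworld 7\n")
    print(matches(entries, "hel"))
```

a linear scan like this is only meant to show the ranking - it ignores the fact
that the file is sorted, which a real lookup would want to exploit.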

1. add a line skip byte at the start of the line - means skipping to the next
line will be much faster (just jump N bytes as per the byte - if line > 255
bytes then byte-jump == 0 and skip the slow way until newline (shouldn't be very
common)).
2. extend the line to be:

word NNN match1 match2 match3 ~suffix1 ~suffix2\n

i can't give you an italian example.. but this SHOULD work for italian, french,
german, spanish etc.. example in german:

lauf 1 ~e ~en ~st ~t\n
blod 1 blöd\n
bloss 1 bloß\n

so now we have the ability to match and "append" a suffix. suffix is ~XXX and
full replacement words are just listed. this should remain fast as i only
"lookup" on the first word on the line that is the initial match - so it
builds a list of candidates. the problem is that once you exceed the "base" it
needs to dynamically build matches for all combinations of base + extension.
also for full replacements (as in the last 2 lines) it needs to be able to
match these as well, so they end up being full entries too. the real problem is
generating such a dictionary - i tried to keep the dict format so simple that
it was trivial to generate. but it'd solve your problem. the cool bit is.. this
ALSO solves japanese and chinese... (romaji and pinyin - and even kana input)
so for example:

sakana 1 魚 さかな 肴 坂な 茶菓な サカナ\n

(sakana is fish in japanese - but can match other kanji too and could be
written in hiragana or katakana).
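
the extended line format above could be expanded roughly like this. it is a
python sketch under stated assumptions, not an implementation of anything that
exists: parse_entry and completions are hypothetical names, and offering the
base word itself as a candidate is a guess:

```python
def parse_entry(line):
    """Split 'word NNN match1 ~suffix1 ...' into base, frequency, candidates."""
    fields = line.split()
    base, freq = fields[0], int(fields[1])
    candidates = [base]  # assumption: the base itself is also offered
    for field in fields[2:]:
        if field.startswith("~"):
            candidates.append(base + field[1:])  # ~XXX appends to the base
        else:
            candidates.append(field)  # anything else is a full replacement
    return base, freq, candidates


def completions(line, typed):
    """Candidates from one dictionary line for what has been typed so far."""
    base, _, candidates = parse_entry(line)
    if base.startswith(typed):
        return candidates  # still inside the base: offer everything
    # typed past the base: only the built-up forms that still fit
    return [c for c in candidates if c.startswith(typed)]


if __name__ == "__main__":
    print(parse_entry("lauf 1 ~e ~en ~st ~t")[2])
    print(completions("lauf 1 ~e ~en ~st ~t", "laufe"))
    print(parse_entry("sakana 1 魚 さかな 肴 坂な 茶菓な サカナ")[2])
```

the second function is the "exceed the base" case: once more than the base has
been typed, the matches have to be built from base + extension before they can
be compared.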

anyway... this almost makes the illume keyboard... a full input method... just
not using XIM/SCIM/UIM... and i am a bit wary of treading down that path right
now. but as such a dictionary then COULD list all these completions when you're
typing in roman text. this should apply for chinese too. not sure about korean.
there are other languages this may work for as well...

anyway. if i am going to go expand the dictionary format, i really need to be
careful. i kept it simple because i didn't want to solve the world's dictionary
problems - i did want to keep it basic but working. as best i can tell the OM
userbase is still mainly western-speaking (yes - i know we have people here
from asia! :) not forgetting! just looking at dealing with the majority first!)

anyway... i am mulling this over. the byte-skip may solve some performance
issues, but this means i now need a special dict generator tool. i was trying
to avoid that :(
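
for the byte-skip idea, such a generator tool and the matching reader could be
sketched in python as below. the names are hypothetical and the exact meaning
of the skip byte (here: length of the rest of the line including the newline,
or 0 when that does not fit in one byte) is an assumption, since the mail only
describes the idea:

```python
def add_skip_bytes(lines):
    """Generator side: prefix every line with a one-byte skip length."""
    out = bytearray()
    for line in lines:
        body = line.encode("utf-8") + b"\n"
        # 0 marks "too long for one byte, scan for the newline instead"
        out.append(len(body) if len(body) <= 255 else 0)
        out += body
    return bytes(out)


def next_line(data, pos):
    """Reader side: offset of the line after the one starting at `pos`."""
    skip = data[pos]
    if skip:
        return pos + 1 + skip  # fast path: jump straight over the line
    return data.index(b"\n", pos + 1) + 1  # slow path: search for newline


def read_lines(data):
    """Walk the whole buffer and yield each line without its skip byte."""
    pos = 0
    while pos < len(data):
        end = next_line(data, pos)
        yield data[pos + 1:end - 1].decode("utf-8")
        pos = end


if __name__ == "__main__":
    blob = add_skip_bytes(["lauf 1 ~e ~en ~st ~t", "blod 1 blöd"])
    print(list(read_lines(blob)))
```

the length is counted in bytes, not characters, so utf-8 words like "blöd" are
skipped correctly.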

> >> Anyway, let me know if I should send you the dict I have.
> > 
> > it's italian - right?
> Yes, it's an Italian dict.
> >> Italian standard linux dictionary (/usr/share/dict/italian) "weights" 
> >> 1,2mb but it's mostly incomplete.
> > 
> > aaah. ok. i guess that's not great quality then :)
> Not at all...
> >> And this is a great thing. Since this phone without a great virtual 
> >> keyboard (like the one you're doing) won't be usable/cool as it should 
> >> be. Imho this is the killer tool of illume.
> > 
> > thanks :) though really.. there is much more to illume :)
> Yes, the keyboard is not illume (that is a cool wm for mobiles however) 
> but its keyboard makes it unique!

well the keyboard, i hoped, would be a "fairly generic basic keyboard" that
would cover most such "small screen mobile touchscreen ui usage". i know people
want things like dasher or a gazillion other bizarre and wonderful input
systems (handwriting, graffiti, etc.). thus illume supports a generic keyboard
app that can be run and thrown into the keyboard slot - the inbuilt one was
just meant to efficiently cover the basic set of stuff for most people. i need
to look into this - BUT... input on this (no pun intended) is always
appreciated. as per above - your idea of having a list of suffixes led me on
the above path. i have a feeling it still isn't perfect, but it's an
improvement. it means the dict now knows about prefix and suffix and so when u
type the "root" of a word that is conjugated, the dict can even offer the
conjugated forms as matches. that's good for western languages - even works for
japanese. chinese as best i know doesn't conjugate, not sure about korean. and
don't even ask about the rest! :)

the cool bit about all this is... it's a bit of a lesson in linguistics. i've
spent a fair bit of my life learning various languages - some i don't really
"speak" anymore, but understand when written no problem (just rusty), but the
principles learnt are coming in handy for this little exercise :) but i don't
pretend to know everything - so please, input here is valuable. i just would
like to come up with a solution that only has to be written once (ie no special
plugins per language) to cover most people/languages. i know some may always be
left out - thus the "plug in external keyboard" option above. :)

> > hehehe - i just haven't done it. that's all. accent char normalising is
> > easy:
> > 
> > ñ -> n
> > é -> e
> > ö -> o
> > 
> > etc. - just strip any accent (and convert to lower case). what i was
> > wondering was:
> > 
> > æ -> ?
> > ß -> ? (maybe s?)
> Yeah... They should be transformed in two chars, in fact ("ae" and 
> "ss"). Can't this be supported? Considering them as two inputs!

the problem here is the code expects the string lengths to be the same and not
change (expand or contract) so i really need to map 1:1. :(
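
a length-preserving normaliser along those lines might look like this in
python. it is only a sketch: the FALLBACK table is hypothetical (ß -> s is the
guess from above, æ -> a is picked arbitrarily), and it assumes the input uses
precomposed characters:

```python
import unicodedata

# hypothetical 1:1 choices for letters that have no accent to strip
FALLBACK = {"ß": "s", "æ": "a"}


def normalise_char(ch):
    """Map one character to exactly one lower-case, accent-free character."""
    ch = ch.lower()
    if ch in FALLBACK:
        return FALLBACK[ch]
    # NFD splits 'é' into 'e' + combining accent; keep only the base letter
    return unicodedata.normalize("NFD", ch)[0]


def normalise(word):
    """Normalise a word without ever changing its length."""
    out = "".join(normalise_char(c) for c in word)
    assert len(out) == len(word)  # the 1:1 requirement
    return out


if __name__ == "__main__":
    for word in ("Ñandú", "blöd", "bloß"):
        print(word, "->", normalise(word))
```

note that with a strict 1:1 mapping "bloß" normalises to "blos", not "bloss",
which is exactly the limitation described above.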

> -- 
> Treviño's World - Life and Linux
> http://www.3v1n0.net/
> _______________________________________________
> Openmoko community mailing list
> community@lists.openmoko.org
> http://lists.openmoko.org/mailman/listinfo/community

Carsten Haitzler (The Rasterman) <[EMAIL PROTECTED]>
