On Mon, 2 Feb 2009 19:39:52 +0200 Kostis Anagnostopoulos <ankos...@gmail.com> said:
> On Sun 01 Feb 2009 00:31:09 Carsten Haitzler wrote:
> > On Fri, 30 Jan 2009 21:16:57 +0100 Olof Sjobergh <olo...@gmail.com> said:
> > > On Fri, Jan 30, 2009 at 8:12 PM, The Rasterman Carsten Haitzler
> > > <ras...@rasterman.com> wrote:
> > > > On Fri, 30 Jan 2009 08:31:43 +0100 Olof Sjobergh <olo...@gmail.com> said:
> > > >> But I think a dictionary format in plain utf8 that includes the
> > > >> normalised words as well as any candidates to display would be the
> > > >> best way. Then the dictionary itself could choose which characters to
> > > >> normalise and which to leave as is. So for Swedish, you can leave å, ä
> > > >> and ö as they are but normalise é, à etc. Searching would be as simple
> > > >> as in your original implementation (no need to convert from multibyte
> > > >> format).
> > > >
> > > > the problem is - the dict in utf8 means searching is slow as you do it
> > > > in utf8 space. the dict is mmap()ed to save ram - if it wasn't, it'd need
> > > > to be allocated in non-swappable ram (it's a phone - it has no swap) and
> > > > thus a few mb of your ram goes into the kbd dict at all times. by using
> > > > mmap you leave it to the kernel's paging system to figure it out.
> > > >
> > > > so as such a dict change will mean a non-ascii format in future for
> > > > this reason. but there will then need to be a tool to generate such a
> > > > file.
> > >
> > > Searching in utf8 doesn't mean it has to be slow. A simple strcmp works
> > > fine on multibyte utf8 strings as well, and should be as fast as the
> > > dictionary was before adding the multibyte-to-widechar conversions. But
> > > if you have some other idea in mind, please don't let me disturb. =)
> >
> > the problem is - it ISN'T a simple key -> value lookup. it's a possible-match
> > tree built on-the-fly. that means you jump about examining 1 character at a
> > time. the problem here is that 1 char may or may not be 1 byte or more, and
> > that makes it really nasty.
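The variable-width problem above can be seen in a minimal sketch (names here are invented for illustration, not Illume code): stepping to the n-th character of a UTF-8 string is an O(n) byte walk, whereas in a fixed-width encoding like UCS-2 it is plain offset arithmetic.

```python
# Sketch only: why "examine 1 character at a time" is awkward in UTF-8.
# Each character occupies 1-4 bytes, so you cannot compute the byte
# offset of character i without walking all the characters before it.

def utf8_char_len(first_byte: int) -> int:
    """Number of bytes in the UTF-8 sequence starting with first_byte."""
    if first_byte < 0x80:
        return 1
    if first_byte < 0xE0:
        return 2
    if first_byte < 0xF0:
        return 3
    return 4

def nth_char_offset(data: bytes, n: int) -> int:
    """Byte offset of the n-th character: an O(n) walk, not arithmetic.
    With fixed-width UCS-2 this would simply be 2 * n."""
    off = 0
    for _ in range(n):
        off += utf8_char_len(data[off])
    return off

word = "fåtölj".encode("utf-8")       # 6 characters, but 8 bytes
assert nth_char_offset(word, 2) == 3  # 'å' took two bytes
```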
> > if it were a simple key lookup for a given
> > simple string - life would be easy. this is possible - but then you'd have to
> > generate ALL permutations first, then look ALL of them up. if you weed out
> > permutations AS you look them up, you can weed out something like 90% of the
> > permutations, as you KNOW there are no words starting with qz... so as you
> > go through qa... qs... qx... qz... you can easily stop all the
> > combinations with qs, qz and qx, as no words begin with those. if you have an
> > 8 letter word with 8 possible letters per character in the word, that's 8^6
> > lookups you avoided (in the case above - i.e. all permutations of the other 6
> > letters). that's 262144 lookups avoided... just there. for 1 of the above
> > impossible permutation trees. now add it up over all of them.
>
> Do you consider this paper relevant?
> http://citeseer.ist.psu.edu/schulz02fast.html
> "Fast String Correction with Levenshtein-Automata" (2002), Klaus Schulz,
> Stoyan Mihov
>
> It actually uses tries to avoid generating and comparing exhaustively all
> permutations of the input word (typed keys),
> but instead traverses *only* known words and accumulates permutations unless
> a max-errors limit gets exceeded, in which case this path dies.

not sure that's that good.. that will drop possible matches - the current scheme walks the tree of known words using the permutation list to pick paths - it won't follow paths that don't exist, so that's already done. i was just saying that you need the permutation list per letter + the walking of the data to be inherently combined, as without that you need to generate every permutation and throw it at a 1 key -> value lookup hash. it still uses a trie (a tree with the letters inlined as part of the tree struct). :) just reading the abstract tho.. document is 67 pages i have to dig through...
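The weeding-out described above - walking the tree of known words with a per-key list of candidate letters, and abandoning any branch the moment no dictionary word continues it - can be sketched like this (a toy illustration only, not Illume's actual code or dictionary format):

```python
# Toy sketch of permutation pruning during a trie walk. Instead of
# generating all len(c1) * len(c2) * ... key permutations up front and
# looking each one up, we walk the trie of known words and only follow
# candidate letters that actually continue some word.

def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = True  # end-of-word marker
    return root

def matches(trie, candidates):
    """candidates: one list of possible letters per typed key press."""
    results = []
    def walk(node, depth, prefix):
        if depth == len(candidates):
            if "$" in node:
                results.append(prefix)
            return
        for ch in candidates[depth]:
            if ch in node:  # prune: branches no word continues die here
                walk(node[ch], depth + 1, prefix + ch)
    walk(trie, 0, "")
    return results

trie = build_trie(["quit", "quiz", "suit"])
# fuzzy key presses near q/s, u, i, t/z:
print(matches(trie, [["q", "s", "a"], ["u"], ["i"], ["t", "z"]]))
# -> ['quit', 'quiz', 'suit']  ("a..." and "sz"-style branches never expand)
```

The pruning pays off exactly as the arithmetic above suggests: a dead two-letter prefix kills every permutation of the remaining letters in one step.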
> It describes a mathematical model for correcting typos,
> but since i have already implemented it (in java)
> i now think it can be retrofitted to perform what you describe in:
> http://wiki.openmoko.org/wiki/Illume_keyboard

sure - can it be implemented so all data is mmaped from files? that's the biggest problem. the first dict for illume (before the current one) used a 27-way per-node tree - lookups were hyper-fast, but it ate ram. i went to the opposite end where i just mmaped the text file and built a very small 2-level char offset lookup table to avoid ram usage. this isn't that fast - but it was ok. i know i could improve the parsing by having it all in ucs2 to avoid slower utf8 decoding, and with line jump-tables built into the file it'd avoid scanning a whole line to jump to the next entry when a match fails. as such it's more a matter of just having a fast dict format that can be mmaped and walked easily while spooling off the permutations of chars per letter (and thus being able to spot a match and calculate its relative distance).

> Keep up the good work.
> Kostis
>
> _______________________________________________
> Openmoko community mailing list
> community@lists.openmoko.org
> http://lists.openmoko.org/mailman/listinfo/community

-- 
------------- Codito, ergo sum - "I code, therefore I am" --------------
The Rasterman (Carsten Haitzler)    ras...@rasterman.com
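The mmap approach described in the reply above can be sketched as follows: a plain sorted word file mapped read-only so the kernel pages it in and out instead of pinning megabytes of RAM, plus a small in-memory offset table to skip straight to a region of the file. A one-level first-byte table stands in here for the 2-level table mentioned; the file format is invented for illustration, not Illume's actual format.

```python
# Sketch: mmap'd flat dictionary with a first-byte offset table.
# The word list stays on disk; only the tiny table lives in the heap.
import mmap
import os
import tempfile

# Build a demo dictionary file (sorted, newline-delimited utf8).
words = sorted(["apple", "quit", "quiz", "suit", "zebra"])
path = os.path.join(tempfile.mkdtemp(), "dict.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("\n".join(words) + "\n")

f = open(path, "rb")
mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

# One pass at load time: map each first byte to its first line's offset.
table = {}
off = 0
while off < len(mm):
    end = mm.find(b"\n", off)
    table.setdefault(mm[off:off + 1], off)
    off = end + 1

def words_with_prefix(prefix: bytes):
    """Scan only the region starting at the prefix's first byte; the
    file is sorted, so we can stop once we've passed the prefix block."""
    off = table.get(prefix[:1])
    out = []
    while off is not None and off < len(mm):
        end = mm.find(b"\n", off)
        w = mm[off:end]
        if w.startswith(prefix):
            out.append(w.decode("utf-8"))
        elif out:  # past the matching block - stop scanning
            break
        off = end + 1
    return out

print(words_with_prefix(b"qui"))  # -> ['quit', 'quiz']
```

The line jump-tables mentioned in the reply would go one step further: storing each line's length (or next-line offset) in the file itself, so a failed match skips ahead without scanning for the newline.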