scoring and evolving keyboard designs

Kragen Sitaker Tue, 03 Feb 2004 01:39:45 -0800

Like other things posted here without notices to the contrary, this
code is in the public domain.


I recently posted a prototype pie-menu-based input method that
included the cryptic message:

        # some research from mine in June, redone with a better wordlist
        # in October, suggested 'erlu kxti fgpy ascb vqohm jwznd'
        # but also found that simple abcd efgh ijlkm nopq rstu vwxyz was
        # very nearly as good.  Both had less than 1.02 strokes per character.
        # however, some more research in October found that
        # 'ht ciksz dgn ervx fjmopqu ablwy' was marginally better still,
        # and also groups together letters that occur together.

This refers to the decision of which letters to group together on
which pie-menu items.  (This will be incomprehensible if you don't
understand how the input method works.) Since each menu selection is
ambiguous, it's desirable to group together letters that tend not to
occur in the same places, in order to minimize the list of candidate
words when you're done selecting letters.  To shorten the time to
enter a word, it's also desirable that subsequent letters be on the
same menu item.  So, in October, I wrote a program that does
hill-climbing keyboard design to maximize the input bit rate, assuming
that selecting a "key" (menu item) that's different from the key you
selected before takes 1 time unit, while selecting the same key again
takes 1/2 time unit.

It generally gave the same results, except when I shrank the initial
population too small, and it generally took only a few minutes.  It
thinks its preferred keyboard allows 2.988 bits of input per time
unit, which sounds pretty good for a six-key keyboard, but it usually
got to 2.7 or better within the first generation.

#!/usr/bin/python
import random, sys, keybits
digraphs = {}
for line in file('digraphs.out'):
    count, digraph = line.split()
    digraphs[digraph] = int(count)

nkeys = 6

def sum(seq):
    rv = 0
    for item in seq: rv += item
    return rv

# time(keys):
# time to enter the sample text on the keys
# exclusive of disambiguation time.
# hitting a key takes 1 time unit
# unless it's the same as the previous key
# in which case it's 1/2 unit.
# Some empirical testing (before distorting the menu in different directions,
# with four-letter words) showed that 'bass' (all letters on one key) took
# about 2/3 of the time of 'onto', doing each one ten times.

# So the time for a particular letter
# is the number of occurrences of that letter
# minus two thirds of the number of digraphs ending in that letter
# and beginning in another letter on that key
def timeletter(let, key):
    basescore = keybits.freq[let]
    digraphscore = sum([digraphs.get(prevlet + let, 0) for prevlet in key])
    return basescore - digraphscore / 2.0

def timekey(key):
    return sum([timeletter(let, key) for let in key])

def time(keys):
    return sum([timekey(key) for key in keys])
# print time(['abcdefghijklmnopqrstuvwxyz'])
# print time(['abcdefghijklm', 'nopqrstuvwxyz'])
# print time('abcdefghijklmnopqrstuvwxyz')

def try_some_stuff(maxlosers):
    bestscore = 0
    losers = 0
    while 1:
        kb = genkb()
        score = scorekb(kb)
        if score > bestscore:
            print score, kb
            sys.stdout.flush()
            bestscore = score
            losers = 0
        else:
            losers += 1
            if losers > maxlosers: return

def scorekb(kb): return keybits.kbbits(kb)/time(kb)
def randvec(): return [random.randrange(nkeys) for ii in range(26)]
def genkb(): return mkkb(randvec())

def mkkb(vec):
    rv = ['' for ii in range(nkeys)]
    for letter, ii in zip('abcdefghijklmnopqrstuvwxyz', vec):
        rv[ii] += letter
    return rv

class Breeder:
    def __init__(self, n):
        self.n = n
        self.population = [randvec() for ii in range(self.n)]
        
    def cull(self):
        kbs = [(mkkb(vec), vec) for vec in self.population]
        scores = [(scorekb(kb), kb, vec) for kb, vec in kbs]
        scores.sort()
        scores.reverse()
        print scores[0]
        self.population = [vec for score, kb, vec in scores[:self.n/2]]

    def breed(self):
        while len(self.population) < self.n:
            new = random.choice(self.population)[:]
            new[random.randrange(len(new))] = random.randrange(nkeys)
            self.population.append(new)
            

if __name__ == '__main__':
    x = Breeder(300)
    while 1:
        x.breed()
        x.cull()


This program depends on the 'keybits' module, which follows:
#!/usr/bin/python
import math
totlets = 0.0
freq = {}
for line in file('letfreq.out'):
    count, letter = line.split()
    freq[letter] = int(count)
    totlets += freq[letter]
def totfreq(akey):
    thiskey = 0
    for letter in akey: thiskey += freq[letter]
    return thiskey
def bits(akey):
    return math.log(totlets/totfreq(akey)) / math.log(2)
def weightedbits(akey):
    if totfreq(akey) == 0: return 0
    return totfreq(akey) * bits(akey)
def kbbits(kb):
    total = 0
    for key in kb: total += weightedbits(key)
    return total

The file 'letfreq.out' just says:
  51256110 e
  38440017 t
  33184490 a
  31852968 o
  30294727 i
  28596682 n
  26258206 s
  25032539 r
  22561628 h
  16778132 l
  15511322 d
  12620467 c
  11535260 u
  10302917 m
   9168289 f
   8198387 p
   8167927 w
   8109688 y
   7817011 g
   6612137 b
   4207506 v
   2908330 k
    803302 x
    678152 j
    437522 q
    237994 z

The file 'digraphs.out' is much larger, but it begins:
12106855 th
10791177 he
 7585883 in
 6643408 er
 6415329 an
 5789288 re
 5154157 on
 4428445 en
 4426168 nd
 4390391 at

It was produced from my wordlist
(http://pobox.com/~kragen/sw/wordlist) with this program:

#!/usr/bin/python
# given a file like this:
# 6187927 the
# 2941790 of
# 2682878 and
# 2544858 to
# 2150885 a
# 1849882 in
# 1089559 it
# 998867 is
# 923975 was
# 920582 that
# with word frequencies, report on the frequency of each pair of letters.

# In the above sequence, 'th' appears twice, while a number of other
# pairs of letters occur only once.  So 'th' gets a frequency of
# 6187927 + 920582, while 'he' gets a frequency only of 6187927, 'of'
# gets a frequency only of 2941790, 'an' and 'nd' get frequencies only
# of 2682878, etc.

# This information was once useful for cryptanalysis, but is now
# useful to me for keyboard design.

# This program runs in 14 seconds on a 109 000 word list computed from
# the British National Corpus, part of which is shown above.

from __future__ import generators

def pairs(astring):
    ii = iter(astring)
    first = ii.next()
    for item in ii:
        yield first + item
        first = item

digraphs = {}
for line in file('wordlist'):
    count, word = line.split()
    for digraph in pairs(word):
        try: digraphs[digraph] += int(count)
        except KeyError: digraphs[digraph] = int(count)
counts = [(count, digraph) for digraph, count in digraphs.iteritems()]
counts.sort()
counts.reverse()
for item in counts: print "%8d %s" % item

scoring and evolving keyboard designs

Reply via email to