Learning Common Lisp, and especially reading Paul Graham's just-published-on-the-Web _On Lisp_, I often wish for a quick online help program that will tell me what a particular Lisp idiom means. Kent Pitman made this awesomely cool thing called the Common Lisp HyperSpec, which weighs in at 18 megs, is densely hyperlinked, and is thoroughly and meticulously indexed, with some 3000 index entries, but the indices have to be navigated by pointy-clicky web browsing. This is nice for when I'm looking for something I don't know about, but it's a nuisance when I want to know the semantics of mapcan or the argument list of map.
So here's a command-line program that looks stuff up in the index and points your web browser at it. One of these days I might hook it up to the web button toolbar so I can double-click a word in my xterm and click the appropriate button.

#!/usr/bin/python
"""Look up a term in the Common Lisp HyperSpec, version 6-0.

Requires Python 1.5.2 or newer, BSD DB support, and either a local copy
of the HyperSpec or Internet access. Try running it with the argument
"mapcar".

The first time you run this program, assuming all goes well, it will
fetch 27 files totaling half a megabyte from xanalys.com, and from them
it will create a 370-kilobyte Berkeley DB file called 'hsindex.db' in
your home directory. On a 56K modem, this will probably take three
minutes or so.

The details of the index files will probably change again in the next
version, and this program won't work anymore unless you saved a copy of
the version 6-0 HyperSpec.

It would be nice to use the 'webbrowser' objects in Python 2.x, but
unfortunately they don't default to links the way I'd like them to, and
the Mozilla interface doesn't work properly when opening a new Mozilla.
(My old Mozilla version doesn't accept URLs on the command line.) But
using them is as simple as
>>> import webbrowser
>>> webbrowser.open('http://www.yahoo.com/')

This program builds a Berkeley DB file, which was completely
unnecessary in the situation I started with --- I had version 4-0 on my
local disk, so reading all the HTML index files took 0.4 seconds
instead of 0.08 seconds. "What a stupid waste of coding effort," I
thought. Then I realized it would be worthwhile if only it worked
against the online version, which would be as simple as changing 'open'
to 'urllib.urlopen', and providing a better place to keep the index db
file when using a remote copy.
But the online version was version 6-0, so I hacked this program to
work with 6-0, which has several times as many index entries and a
different HTML format in the index, and which also necessitated adding
subhead support. Now the DB file is a win even for local queries, but
only because Python is so slow (and because my HTML parsing code is so
slow). This was probably a case of throwing good money after bad, or
irrationally trying to rescue sunk costs. But it was fun.

I am unhappy with the broken Berkeley DB iteration API, which conflates
setting a position and getting a value (leading to error-handling and
iteration-terminating code being far more complicated than it needs to
be) and conflates iterators and database handles (meaning that finding
the strings x such that both a=x and b=x are in the database is
unnecessarily difficult --- not that I was doing that).

"""

import bsddb, sys, os, string, urlparse, urllib

# magic constants
topdir = "/home/kragen/docs/hyperspec/HyperSpec"  # where the hyperspec is
if not os.path.exists(topdir):
    topdir = "http://www.xanalys.com/software_tools/reference/HyperSpec"
browsercmd = "links %s"  # how to launch a browser
# stuff for the index file format
wordmarker = '<B>'      # where an index term starts
urlmarker = ' <A REL='  # marker for index URLs
urlstart = '../Body/'   # how the URL starts

def startswith(seq, prefix):
    """Returns true if seq starts with prefix.
    Like string.startswith in Python 2.x."""
    return seq[:len(prefix)] == prefix

def html_unesc(ss):
    """Convert HTML to ordinary ASCII text by removing entities."""
    return string.replace(string.replace(string.replace(string.replace(
        ss, '&lt;', '<'), '&gt;', '>'), '&quot;', '"'), '&amp;', '&')

def build_index(topdir, dbfilename):
    """Munge the HTML of the HyperSpec index to make a list of index terms."""
    print "Building index of %s in %s" % (topdir, dbfilename)
    files = map(lambda x, topdir=topdir: '%s/Front/X_Mast_%s.htm' % (topdir, x),
                map(chr, range(ord('A'), ord('Z') + 1)) + ['9'])
    dbfile = bsddb.btopen(dbfilename, 'c')
    try:
        file = None
        for filename in files:
            file = urllib.urlopen(filename)
            try:
                word = urltail = None
                for line in file.readlines():
                    if startswith(line, wordmarker):
                        ws = len(wordmarker)
                        # this will die if there's no ending < on the line
                        # html_unesc here is so things like &rest will work.
                        word = string.lower(
                            html_unesc(line[ws:string.index(line, '<', ws)]))
                    elif startswith(line, urlmarker):
                        if word is None: raise "URL before a word"
                        # this part here should *really* use a regex!
                        start = len(urlmarker)
                        # this will die if the expected strings aren't found
                        us = (string.index(line, urlstart, start)
                              + len(urlstart))
                        ue = string.index(line, '">', us)  # url end
                        urltail = line[us:ue]
                        subtermend = string.index(line, '<', ue)
                        subterm = string.lower(
                            html_unesc(line[ue+2:subtermend]))
                        wholeterm = "%s -- %s" % (word, subterm)
                        dbfile[wholeterm] = "Body/%s" % urltail
                if word is None:
                    raise "No words in %s" % filename
                elif urltail is None:
                    raise "No urltails in %s" % filename
            finally:
                file.close()
        if file is None: raise "Couldn't open any files"
    except:
        dbfile.close()
        os.unlink(dbfilename)
        raise
    else:
        dbfile.close()

def get_index(topdir):
    """Return an open Berkeley DB file containing an index of topdir.

    topdir is a URL containing the HyperSpec, preferably on your local
    filesystem. The index is created if necessary; no check is made to
    see if it's out of date.
""" scheme, host, path, _, _, _ = urlparse.urlparse(topdir) if scheme in ['', 'file']: # We have a local copy, so put the index in it... dir = path else: # store it in home dir or, failing that, root dir dir = os.environ.get('HOME', '') dbfilename = "%s/hsindex.db" % dir # No up-to-date check. Delete the index yourself if you update # the HyperSpec. urllib is too primitive to tell us how old a file is... if not os.path.exists(dbfilename): build_index(topdir, dbfilename) return bsddb.btopen(dbfilename) def getwords(term): """Return the list of indexed terms that match the requested term.""" term = string.lower(term) dbfile = get_index(topdir) try: try: # stupidity: set_location('') breaks if term != '': key, value = dbfile.set_location(term) else: key, value = dbfile.first() except KeyError: # more stupidity: set_location to something past the end gives # a KeyError return [] rv = [] while 1: # special case: complete, but not unique # (no longer useful since addition of subheads) if key == term: return [(key, value)] if not startswith(key, term): return rv rv.append((key, value)) try: # still more stupidity: next() off the end of the file gives # a KeyError key, value = dbfile.next() except KeyError: return rv finally: dbfile.close() def getwords_fancy(term): """Like getwords, but sometimes more selective. If the specified term is a main index item, return only the items found under that item, not all the main index items that start with it. This way, things like 'map' and 'handle' send you to the (single) appropriate item instead of giving you a list of possibilities: map, mapcar, mapcan, etc., or handle, handler, handler-bind, etc. """ rv = getwords(term) exact_term_matches = [] wanted = "%s -- " % term for found in rv: if startswith(found[0], wanted): exact_term_matches.append(found) if len(exact_term_matches) > 0: return exact_term_matches else: return rv def shrepr(ss): """Shell representation of a string. Works on Unix, but probably not elsewhere. 
    Returns a string which, if fed to a shell, will produce a sequence
    of arguments which, when rejoined by spaces, produces the original
    string.
    """
    rv = []
    needquotes = 0
    lastcharwsp = 1
    safechars = string.uppercase + string.lowercase + string.digits + ' :-,/'
    for char in str(ss):
        if char not in safechars:
            needquotes = 1
            # The only way to put "don't" in a single-quoted csh string
            # is 'don'\''t'. sh is saner.
            rv.append("'\\%s'" % char)
        else:
            rv.append(char)
        if char in string.whitespace:
            if lastcharwsp:
                needquotes = 1
        lastcharwsp = (char in string.whitespace)
    rvs = string.join(rv, '')
    if needquotes:
        rvs = "'%s'" % rvs
    return rvs

term = string.join(sys.argv[1:], ' ')
words = getwords_fancy(term)
me = os.path.basename(sys.argv[0])
if words == []:
    print "%s: No matches for '%s'" % (me, term)
    sys.exit(1)
elif len(words) == 1:
    sys.exit(os.system(browsercmd % "%s/%s" % (topdir, words[0][1])))
else:
    print "%s: '%s' is ambiguous; possibilities follow:" % (me, term)
    for word in words:
        # Stuff the user can cut and paste into their shell.
        print " %s %s" % (me, shrepr(word[0]))
    sys.exit(1)
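The prefix scan that getwords does against the btree can be sketched without Berkeley DB at all: over a plain sorted list of (key, value) pairs, bisect plays the role of set_location, and the loop walks forward until keys stop sharing the prefix. This is just a hedged illustration of the same algorithm; prefix_matches and the sample index entries below are made up for the example, not part of the script.

```python
import bisect

def prefix_matches(pairs, term):
    # pairs is sorted by key, as a btree traversal would be;
    # bisect_left finds the first key >= term, like set_location.
    keys = [key for key, value in pairs]
    i = bisect.bisect_left(keys, term)
    rv = []
    while i < len(pairs):
        key, value = pairs[i]
        if key == term:              # complete, but not unique
            return [(key, value)]
        if not key.startswith(term):
            break                    # past the last key sharing the prefix
        rv.append((key, value))
        i = i + 1
    return rv

# toy stand-in for the real index; the URL tails are invented
index = [('map -- function', 'Body/f_map.htm'),
         ('mapcan -- function', 'Body/f_mapc_.htm'),
         ('mapcar -- function', 'Body/f_mapc_.htm')]
print(prefix_matches(index, 'mapcar'))
# prints [('mapcar -- function', 'Body/f_mapc_.htm')]
```

The nice part, compared with the bsddb version, is that running off the end of the list is an ordinary loop exit instead of a KeyError to catch.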