"oerlap": interactive analysis of tuple-element frequencies

Kragen Sitaker Fri, 01 Feb 2002 00:22:29 -0800

I wanted to call this "colapse", because it seemed like the perfect
  term, but that seems to be a widely-used word.  Google finds 5560
  hits, both misspellings and other things.


"oerlap" isn't quite as apt a term, but it isn't used for anything
  else, except as a rare misspelling of "overlap".

This is sort of OLAPish (see http://www.olapreport.com/fasmi.htm) but
  not full OLAP.  (The FASMI criteria are "fast, analysis, shared,
  multidimensional, information"; "fast" means "simplest analyses
  under one second, most responses within five seconds, very few more
  than 20 seconds"; "analysis" means end-users can program it to do
  business logic, statistical analysis, and other ad hoc calculations;
  "shared" means it supports reasonable security with shared
  read-write access; "multidimensional" means it must provide a
  multidimensional conceptual view of the data with hierarchies and
  multiple hierarchies; and "information" means it handles lots of
  information.  oerlap is a half-assed hack at all of these.)

It's often the case that I have a bunch (tens or hundreds of thousands
  of rows) of tabular data that I want to explore interactively, and I
  don't have a good way to do that.

I envision "oerlap": a simple UI that makes this easy.  You feed it
  tabular data; it presents you with a table.

Initially, the table has one row, with one cell for each field in the
  input data.  Each cell contains a list of the most frequent three
  values in that field, with their respective numbers of occurrences.
  There is an extra cell that indicates the number of input rows.
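
Here is a minimal sketch of how that initial summary row might be
  computed.  It is written for present-day Python 3, not the Python of
  the code at the end of this post; the names "summarize" and "top" are
  made up for illustration, and it assumes every row has the same
  number of fields.  The sample rows are the test data from the end of
  this post.

```python
from collections import Counter

def summarize(rows, top=3):
    """Return (row_count, cells): one cell per field, listing that
    field's `top` most frequent values with their occurrence counts."""
    rows = list(rows)
    if not rows:
        return 0, []
    # One Counter per field; assumes all rows have the same width.
    counters = [Counter() for _ in rows[0]]
    for row in rows:
        for counter, value in zip(counters, row):
            counter[value] += 1
    return len(rows), [c.most_common(top) for c in counters]

rows = [('a', 1, 32), ('a', 1, 33), ('b', 1, 31), ('c', 2, 30), ('a', 0, 30)]
count, cells = summarize(rows)
```

With these five rows, count is 5 and the first cell leads with
  ('a', 3).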

Clicking on a cell causes the table to expand until it has one row for each
  value of that field; it is sorted by the number of occurrences of those
  values, so that the first few rows are the ones that represent most of the
  input data records.  The extra cell indicating the number of input records is
  still there, but now it's an entire column, indicating the number of input
  records represented by each row.  The remaining un-broken-out columns are
  displayed as before: each cell contains the most frequent three values for
  that field, with their respective numbers of occurrences.
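
Breaking out a single column could look roughly like the sketch below
  (again present-day Python 3).  "break_out" and its (value, row_count,
  cells) row layout are assumptions of mine for illustration, not part
  of the code at the end of this post.

```python
from collections import Counter, defaultdict

def break_out(rows, field, top=3):
    """One output row per distinct value of `field`, biggest groups first.

    Each output row is (value, row_count, cells), where cells maps each
    remaining field index to its `top` most frequent (value, count) pairs.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[field]].append(row)
    table = []
    for value, members in groups.items():
        cells = {}
        for i in range(len(members[0])):
            if i != field:
                cells[i] = Counter(r[i] for r in members).most_common(top)
        table.append((value, len(members), cells))
    # Sort by the number of input rows each output row represents.
    table.sort(key=lambda entry: entry[1], reverse=True)
    return table

rows = [('a', 1, 32), ('a', 1, 33), ('b', 1, 31), ('c', 2, 30), ('a', 0, 30)]
table = break_out(rows, 0)
```

Here the first output row is for 'a', which accounts for three of the
  five input rows.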

So each column is in one of two states, broken-out or summary; there is one
  row in the displayed table for each distinct tuple of values from the
  broken-out columns.  Clicking on a value in a column switches that column
  between broken-out and summary state.

Clicking on a column header causes the table to be sorted by the values in 
  that column; by default, it's sorted by the extra column indicating
  number of input rows.
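
The state the UI has to remember is small.  A sketch, in present-day
  Python 3, with a hypothetical "ViewState" class whose name and methods
  are invented here; resetting the sort when its column is collapsed is
  my assumption, not something decided above.

```python
class ViewState:
    """Which columns are broken out, and which column the table is sorted by."""

    def __init__(self):
        self.broken_out = ()   # indexes of columns in broken-out state
        self.sort_by = None    # None means the default: by row count

    def click_cell(self, column):
        """Flip a column between summary and broken-out state."""
        if column in self.broken_out:
            self.broken_out = tuple(c for c in self.broken_out if c != column)
            if self.sort_by == column:
                self.sort_by = None   # assumption: fall back to the default
        else:
            self.broken_out += (column,)

    def click_header(self, column):
        """Sort the table by the given column."""
        self.sort_by = column

    def row_keys(self, rows):
        """One displayed row per distinct tuple of broken-out values."""
        return {tuple(row[c] for c in self.broken_out) for row in rows}

rows = [('a', 1, 32), ('a', 1, 33), ('b', 1, 31), ('c', 2, 30), ('a', 0, 30)]
view = ViewState()
initial_keys = view.row_keys(rows)   # nothing broken out: one row, key ()
view.click_cell(0)
view.click_cell(1)
two_column_keys = view.row_keys(rows)
```

With nothing broken out there is a single key, the empty tuple, which
  matches the one-row initial table.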

In its current state, the code below only does the analysis; it doesn't provide
  the sorting, HTML interface, and interactivity I envision.  Maybe
  soon.

# incredibly powerful secret web log analysis tool
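import string
try:
    intern                      # a builtin in older Pythons
except NameError:
    from sys import intern      # Python 3 moved it into sys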

def oerlap(datasrc, breakoutby):
    """Analyze data.

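    'datasrc' is an object whose .next() method returns the next row as a
    tuple, or None once the data runs out; 'breakoutby' is a sequence of
    field indexes to break out by.  Returns a dictionary mapping each
    distinct tuple of broken-out values to a list holding one
    {value: count} dictionary per field.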

    """
    results = {}
    while 1:
        line = datasrc.next()
        if line is None: return results
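        key = tuple([line[f] for f in breakoutby])
        r = results.setdefault(key, [{} for _ in line])
        # Rows may differ in width.  Give each newly seen field its own
        # dict; [{}] * n would make every new slot share a single dict.
        if len(r) < len(line):
            r.extend([{} for _ in range(len(line) - len(r))])
        # A row narrower than r counts None for its missing fields.
        padded = tuple(line) + (None,) * (len(r) - len(line))
        for counts, value in zip(r, padded):
            counts[value] = counts.get(value, 0) + 1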

class filelines:
    "Return each line of a file as a tuple of whitespace-separated fields."
    def __init__(self, somefile):
        self.file = somefile
    def next(self):
        line = self.file.readline()
        if line == "": return None
        return tuple([intern(x) for x in line.split()])

class arrayitems:
    "For testing.  Return tuples from an array."
    def __init__(self, somearray):
        self.array = somearray
        self.ii = 0
    def next(self):
        if self.ii == len(self.array): return None
        try: return self.array[self.ii]
        finally: self.ii = self.ii + 1

testdata = [('a', 1, 32),
            ('a', 1, 33),
            ('b', 1, 31),
            ('c', 2, 30),
            ('a', 0, 30)]

def test(bb=[]): return oerlap(arrayitems(testdata), bb)
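
The analysis returns nested dictionaries rather than anything you
  could display.  A minimal sketch of the missing sorting step follows,
  in present-day Python 3; "display_rows" and "top" are names invented
  for this illustration, and the "results" literal is what I work out by
  hand that test([0]) should return for testdata.

```python
def display_rows(results, top=3):
    """Turn an oerlap()-style result into display rows, biggest first.

    `results` maps a key tuple to a list with one {value: count} dict per
    field.  Each display row is (key, row_count, cells).
    """
    table = []
    for key, fields in results.items():
        # Every input row adds one count to field 0, so that field's
        # total is the number of rows behind this key (assumes no
        # empty rows).
        row_count = sum(fields[0].values()) if fields else 0
        cells = [sorted(f.items(), key=lambda kv: -kv[1])[:top]
                 for f in fields]
        table.append((key, row_count, cells))
    table.sort(key=lambda row: row[1], reverse=True)
    return table

# Hand-derived result of breaking testdata out by field 0.
results = {
    ('a',): [{'a': 3}, {1: 2, 0: 1}, {32: 1, 33: 1, 30: 1}],
    ('b',): [{'b': 1}, {1: 1}, {31: 1}],
    ('c',): [{'c': 1}, {2: 1}, {30: 1}],
}
table = display_rows(results)
```

The first display row is the one for ('a',), with a row count of 3.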