Re: [Numpy-discussion] How to implement a 'pivot table?'

Timothy Hochberg Mon, 06 Aug 2007 06:54:27 -0700

Nicely done Travis. Working code is always better than theory. I copied your
interface and used the brute-force, non-numpy approach to construct the
pivot table. On the one hand, it doesn't preserve the order that the entires
are discovered in as the original does. On the other hand, it's about 40%
faster for large files on my machine (see pivot2). Probably because you
don't have to loop through the data so many times. You can get further
improvements if you know the operation in advance as shown in pivotsum,
although this won't work on median ASAIK.


regards,

-tim


On 8/1/07, Travis Vaught <[EMAIL PROTECTED]> wrote:
>
> Greetings,
>
> Speaking of brute force... I've attached a rather ugly module that
> let's you do things with a pretty simple interface (session shown
> below).  I haven't fully tested the performance, but a million
> records with 5 fields takes about 11 seconds on my Mac to do a
> 'mean'.  I'm not sure what your performance considerations are, but
> this may be useful.  Record arrays are really nice if they make sense
> for your data.
>
> Travis
>
>
> (from an ipython command prompt)
>
> In [1]: import testpivot as p
>
> In [2]: a = p.sample_data()
>
> In [3]: a
> Out[3]:
> recarray([('ACorp', 'Region 1', 'Q1', 20000.0),
>         ('ACorp', 'Region 1', 'Q2', 22000.0),
>         ('ACorp', 'Region 1', 'Q3', 21000.0),
>         ('ACorp', 'Region 1', 'Q4', 26000.0 ),
>         ('ACorp', 'Region 2', 'Q1', 23000.0),
>         ('ACorp', 'Region 2', 'Q2', 20000.0),
>         ('ACorp', 'Region 2', 'Q3', 22000.0),
>         ('ACorp', 'Region 2', 'Q4', 21000.0),
>         ('ACorp', 'Region 3', 'Q1', 26000.0),
>         ('ACorp', 'Region 3', 'Q2', 23000.0),
>         ('ACorp', 'Region 3', 'Q3', 29000.0),
>         ('ACorp', 'Region 3', 'Q4', 27000.0),
>         ('BCorp', 'Region 1', 'Q1', 20000.0),
>         ('BCorp', 'Region 1', 'Q2', 20000.0),
>         ('BCorp', 'Region 1', 'Q3', 24000.0),
>         ('BCorp', 'Region 1', 'Q4', 24000.0),
>         ('BCorp', 'Region 2', 'Q1', 21000.0 ),
>         ('BCorp', 'Region 2', 'Q2', 21000.0),
>         ('BCorp', 'Region 2', 'Q3', 22000.0),
>         ('BCorp', 'Region 2', 'Q4', 29000.0),
>         ('BCorp', 'Region 3', 'Q1', 28000.0),
>         ('BCorp', 'Region 3', 'Q2', 25000.0),
>         ('BCorp', 'Region 3', 'Q3', 22000.0),
>         ('BCorp', 'Region 3', 'Q4', 21000.0)],
>        dtype=[('company', '|S5'), ('region', '|S8'), ('quarter', '|
> S2'), ('income', '<f8')])
>
> In [4]: p.pivot(a, 'company', 'region', 'income', p.psum)
> ######## Summary by company and region ##########
> cols:['ACorp' 'BCorp']
> rows:['Region 1' 'Region 2' 'Region 3']
> [[  89000.   88000.]
> [  86000.   93000.]
> [ 105000.   96000.]]
>
> In [5]: p.pivot(a, 'company', 'quarter', 'income', p.psum)
> ######## Summary by company and quarter ##########
> cols:['ACorp' 'BCorp']
> rows:['Q1' 'Q2' 'Q3' 'Q4']
> [[ 69000.  69000.]
> [ 65000.  66000.]
> [ 72000.  68000.]
> [ 74000.  74000.]]
>
> In [6]: p.pivot(a, 'company', 'quarter', 'income', p.pmean)
> ######## Summary by company and quarter ##########
> cols:['ACorp' 'BCorp']
> rows:['Q1' 'Q2' 'Q3' 'Q4']
> [[ 23000.          23000.        ]
> [ 21666.66666667   22000.        ]
> [ 24000.          22666.66666667]
> [ 24666.66666667  24666.66666667]]
>
>
>
>
> On Aug 1, 2007, at 2:02 PM, Bruce Southey wrote:
>
> > Hi,
> > The hard part is knowing what aggregate function that you want. So a
> > hard way, even after cheating, to take the data provided is given
> > below. (The Numpy Example List was very useful especially on the where
> > function)!
> >
> > I tried to be a little generic so you can replace the sum by any
> > suitable function and probably the array type as well. Of course it is
> > not complete because you still need to know the levels of the 'rows'
> > and 'columns' and also is not efficient as it has loops.
> >
> > Bruce
> >
> > from numpy import *
> > A=array([[1,1,10],
> >          [1,1,20],
> >          [1,2,30],
> >          [2,1,40],
> >          [2,2,50],
> >          [2,2,60] ])
> > C = zeros((2,2))
> >
> > for i in range(2):
> >       crit1 = (A[:,0]==1+i)
> >       subA=A[crit1,1:]
> >       for j in range(2):
> >             crit2 = (subA[:,0]==1+j)
> >             subB=subA[crit2,1:]
> >             C[i,j]=subB.sum()
> >
> >
> > print C
> >
> > On 7/30/07, Geoffrey Zhu <[EMAIL PROTECTED]> wrote:
> >> Hi Everyone,
> >>
> >> I am wondering what is the best (and fast) way to build a pivot table
> >> aside from the 'brute force way?'
> >>
> >> I want to transform an numpy array into a pivot table. For
> >> example, if
> >> I have a numpy array like below:
> >>
> >> Region     Date          # of Units
> >> ----------    ----------        --------------
> >> East        1/1             10
> >> East        1/1             20
> >> East        1/2             30
> >> West       1/1             40
> >> West       1/2             50
> >> West       1/2             60
> >>
> >> I want  to transform this into the following table, where f() is a
> >> given aggregate function:
> >>
> >>            Date
> >> Region           1/1          1/2
> >> ----------
> >> East         f(10,20)         f(30)
> >> West        f(40)             f(50,60)
> >>
> >>
> >> I can regroup them into 'sets' and do it the brute force way, but
> >> that
> >> is kind of slow to execute. Does anyone know a better way?
> >>
> >>
> >> Thanks,
> >> Geoffrey
> >> _______________________________________________
> >> Numpy-discussion mailing list
> >> [email protected]
> >> http://projects.scipy.org/mailman/listinfo/numpy-discussion
> >>
> > _______________________________________________
> > Numpy-discussion mailing list
> > [email protected]
> > http://projects.scipy.org/mailman/listinfo/numpy-discussion
> >
>
>
> _______________________________________________
> Numpy-discussion mailing list
> [email protected]
> http://projects.scipy.org/mailman/listinfo/numpy-discussion
>
>
>


-- 
.  __
.   |-\
.
.  [EMAIL PROTECTED]

import numpy as N
from pprint import pprint
from itertools import izip

def sample_data():
    # return a 'big-enough' sample record array to play with
    #   note: the strings should be int keys in any real table

    company_column = ['ACorp', 'ACorp', 'ACorp', 'ACorp',
                  'ACorp', 'ACorp', 'ACorp', 'ACorp',
                  'ACorp', 'ACorp', 'ACorp', 'ACorp',
                  'BCorp', 'BCorp', 'BCorp', 'BCorp',
                  'BCorp', 'BCorp', 'BCorp', 'BCorp',
                  'BCorp', 'BCorp', 'BCorp', 'BCorp' ]

    region_column = ['Region 1', 'Region 1', 'Region 1', 'Region 1',
                 'Region 2', 'Region 2', 'Region 2', 'Region 2',
                 'Region 3', 'Region 3', 'Region 3', 'Region 3',
                 'Region 1', 'Region 1', 'Region 1', 'Region 1',
                 'Region 2', 'Region 2', 'Region 2', 'Region 2',
                 'Region 3', 'Region 3', 'Region 3', 'Region 3']

    quarter_column = ['Q1', 'Q2', 'Q3', 'Q4',
                  'Q1', 'Q2', 'Q3', 'Q4',
                  'Q1', 'Q2', 'Q3', 'Q4',
                  'Q1', 'Q2', 'Q3', 'Q4',
                  'Q1', 'Q2', 'Q3', 'Q4',
                  'Q1', 'Q2', 'Q3', 'Q4']

    income_column = [20000., 22000., 21000., 26000.,
                 23000., 20000., 22000., 21000.,
                 26000., 23000., 29000., 27000.,
                 20000., 20000., 24000., 24000.,
                 21000., 21000., 22000., 29000.,
                 28000., 25000., 22000., 21000.]

    return N.rec.fromarrays([company_column, region_column, quarter_column, income_column],
                         names='company,region,quarter,income')


def big_sample_data(rows, cols):
    """ make a big recarray with specified row count (rows), two 'name'
        columns and specified data column count (cols) """

    namelst = []
    for i in range(rows):
        namelst.append(str(N.random.randint(1,10)))
    
    # make first two columns text that may be crosstab headings
    name1 = N.array(namelst)
    name2 = name1.copy()

    name1.sort()
    datacols=[]
    collist = ['name1', 'name2']
    
    for j in range(cols):
        datacols.append(N.random.randn(rows))
        collist.append('data' + str(j))

    #print name1, name2, datacols
    reclist = [name1, name2]
    #print reclist
    reclist.extend(datacols)
    #print reclist
    colnames = ','.join(collist)

    # dump all the names and data into a recarray
    return N.rec.fromarrays(reclist, names=colnames)

    #return reclist, colnames


def subtotal(ary, col1, col2):
    # Subtotals by col1
    print '######### Subtotals by %s:  ###########' % col1
    for itm in N.unique(ary.field(col1)):
        msk = N.array(ary.field(col1)==itm)
        print "%s: %s" % (itm, ary.field(col2)[msk].sum())

def psum(ary):
    return ary.sum()

def pmean(ary):
    return ary.mean()

def pmean2(ary):
    return N.asarray(ary).mean()

def pmax(ary):
    return ary.max()

def pmin(ary):
    return ary.min()

def pivot(ary, col1, col2, col3, func=psum):
    """ inputs: ary - numpy record array
                col1 - column from rec array in pivot columns
                col2 - column from rec array in pivot rows
                col3 - column from rec array to aggregate
                func - callable to perform aggregate operation
        """

    print '######## Summary by %s and %s ##########' % (col1, col2)

    i=0

    # get list of unique values
    top_row = N.unique(ary.field(col2))
    left_col = N.unique(ary.field(col1))

    pt = N.zeros((len(top_row), len(left_col)))

    for itm in left_col:
        msk = N.array(ary.field(col1)==itm)
        j=0
        for col in top_row:
            msk2 = msk&N.array(ary.field(col2)==col)
            val = func(ary.field(col3)[msk2])
            pt[j,i] = val
            j+=1
        i+=1
    print 'cols:%s' % (left_col)
    print 'rows:%s' % (top_row)
    print pt


def pivot2(ary, col1, col2, col3, func=psum):
    """ inputs: ary - numpy record array
                col1 - column from rec array in pivot columns
                col2 - column from rec array in pivot rows
                col3 - column from rec array to aggregate
                func - callable to perform aggregate operation
        """

    print '######## Summary by %s and %s ##########' % (col1, col2)
    map = {}
    rows = set()
    cols = set()
    column = ary.field(col1)
    row = ary.field(col2)
    data = ary.field(col3)
    for (c, r, d) in izip(column, row, data):
        cols.add(c)
        rows.add(r)
        key = (c, r)
        if key not in map:
             map[key] = []
        map[key].append(d)  
    for c in cols:
        for r in rows:
            key = (c,r)
            map[key] = func(map[key])
    cols = sorted(cols)
    rows = sorted(rows)
    print 'cols:%s' % (cols)
    print 'rows:%s' % (rows)
    pt = N.array([[map[(c, r)] for c in cols] for r in rows])
    print pt


def pivotsum(ary, col1, col2, col3, func=psum):
    """ inputs: ary - numpy record array
                col1 - column from rec array in pivot columns
                col2 - column from rec array in pivot rows
                col3 - column from rec array to aggregate
                func - callable to perform aggregate operation
        """

    print '######## Summary by %s and %s ##########' % (col1, col2)
    map = {}
    rows = set()
    cols = set()
    column = ary.field(col1)
    row = ary.field(col2)
    data = ary.field(col3)
    for (c, r, d) in izip(column, row, data):
        cols.add(c)
        rows.add(r)
        key = (c, r)
        if key not in map:
             map[key] = 0.0
        map[key] += d
    cols = sorted(cols)
    rows = sorted(rows)
    print 'cols:%s' % (cols)
    print 'rows:%s' % (rows)
    pt = N.array([[map[(c, r)] for c in cols] for r in rows])
    print pt


if __name__ == "__main__":
    import time
    a = sample_data()
    pivot(a, 'company', 'region', 'income', psum)
    pivot(a, 'company', 'quarter', 'income', psum)
    pivot(a, 'company', 'quarter', 'income', pmean)
    b = big_sample_data(1000000, 5)
    t0 = time.clock()
    pivot(b, 'name1', 'name2', 'data2', psum)
    print
    print "time =", time.clock() - t0

    print "------------------- pivot2"
    pivot2(a, 'company', 'region', 'income', sum)
    pivot2(a, 'company', 'quarter', 'income', sum)
    pivot2(a, 'company', 'quarter', 'income', pmean2)
    t0 = time.clock()
    pivotsum(b, 'name1', 'name2', 'data2', sum)
    print
    print "time =", time.clock() - t0

_______________________________________________
Numpy-discussion mailing list
[email protected]
http://projects.scipy.org/mailman/listinfo/numpy-discussion

Re: [Numpy-discussion] How to implement a 'pivot table?'

Reply via email to