aggregating a large sorted text file

Shane Geiger Wed, 21 Mar 2007 10:48:18 -0800

Apparently you want to use this data to know how many blue circles, bluesquares, red circles and red squares. In other words, I doubt you wantto output redundant data columns, you just want this data in a moreusable format and that you don't actually need to do multiple passesover it.

This is a fun problem to solve because it uses two very powerful tools:cvs.dictreader and bitwise categorization.

Note: your initial data has three records with the same ID. I assumesthe ID is the unique key. So I changed the data slightly.






[EMAIL PROTECTED] wrote:

Hi,

Given a large ascii file (delimited or fixed width) with one ID field
and dimensions/measures fields, sorted by dimensions, I'd like to
"flatten" or "rollup" the file by creating new columns: one for each
combination of dimension level, and summing up measures over all
records for a given ID.

If the wheel has already been invented, great, please point me in the
right direction. If not, please share some pointers on how to think
about this problem in order to write efficient code.

Is a hash with dimension level combinations a good approach, with
values reset at each new ID level?

I know mysql, Oracle etc will do this , but they all have a cap on #
of columns allowed. SAS will allow unlimited columns, but I don't own
SAS.

Thanks.


ID,color,shape,msr1
------------------------------
001, blue, square,  4
001, red , circle,    5
001, red,  circle,    6


ID, blue_circle, blue_square, red_circle, red_square
--------------------------------------------------------------------------
001,0,4,11,0
002 ...


--
Shane Geiger
IT Director
National Council on Economic Education
[EMAIL PROTECTED]  |  402-438-8958  |  http://www.ncee.net

Leading the Campaign for Economic and Financial Literacy

"""

Apparently you want to use this data to know how many blue circles, blue 
squares, red circles and red squares.  In other words, I doubt you want to 
output redundant data columns, you just want this data in a more usable format 
and that you don't actually need to do multiple passes over it.

This is a fun problem to solve because it uses two very powerful tools: 
cvs.dictreader and bitwise categorization.

Note: your initial data has three records with the same ID.  I assumes the ID 
is the unique key.  So I changed the data slightly.

------

Given a large ascii file (delimited or fixed width) with one ID field
and dimensions/measures fields, sorted by dimensions, I'd like to
"flatten" or "rollup" the file by creating new columns: one for each
combination of dimension level, and summing up measures over all
records for a given ID.

If the wheel has already been invented, great, please point me in the
right direction. If not, please share some pointers on how to think
about this problem in order to write efficient code.

Is a hash with dimension level combinations a good approach, with
values reset at each new ID level?

I know mysql, Oracle etc will do this , but they all have a cap on #
of columns allowed. SAS will allow unlimited columns, but I don't own
SAS.

Thanks.


ID,color,shape,msr1
------------------------------
001, blue, square,  4
001, red , circle,    5
001, red,  circle,    6


ID, blue_circle, blue_square, red_circle, red_square
--------------------------------------------------------------------------
001,0,4,11,0
002 ...

"""

import string


## BITWISE CATEGORIZATION STUFF

def gNextBit(val=0):
    while True:
        y = 2**val
        val += 1
        yield y

nb = gNextBit()

categories = ['blue','red','square','circle']
#categories_value = ['blue','red','square','circle']

def bitwise_categorize(items):
    d = {}
    for item in items:
        d[item] = nb.next()
    return d

categories_dict = bitwise_categorize(categories)

#print categories_dict  # {'blue': 1, 'circle': 8, 'square': 4, 'red': 2}  

def get_properties(category_int):
    p_list = []
    for k,v in categories_dict.items():
        if category_int & v == v:  
            p_list.append(k)
    return p_list

def list_properties():
    for i in range(len(categories)**2):
        print "Properties for something with category_int 
of",str(i),str(get_properties(i))

#list_properties()



### EXAMPLE DATA

header_fields = ['id','color','shape','msr1']

example_data = """
001, blue, square,  4
002, red , circle,    5
003, red,  circle,    6
"""

# write out the example
import os
def writefile(f, data, perms=750): open(f, 'w').write(data) and os.chmod(f, 
perms)
csv_file = "/Users/shanegeiger/temp.csv"
writefile(csv_file, example_data)




###   READING IN THE DATA AND CATEGORIZING IT WITH BITWISE CATEGORIZATION

import csv    
reader = csv.DictReader(open(csv_file), [], delimiter=",")

data = []
info = {}

while True:
    try:
        # Read next "header" line (if there isn't one then exit the loop)
        reader.fieldnames = header_fields
        rdr = reader.next()
        data.append(rdr) 
                         
    except StopIteration: break

    categories_int = 0

#    print "categories_dict:",categories_dict

    for rec in data:   # for each record
        color_cat = categories_dict[string.strip(rec['color'])]  # should add 
them to the color category
        shape_cat = categories_dict[string.strip(rec['shape'])]  # should add 
them to the color category
        combined_cat = color_cat + shape_cat 

        #print "color_category:", color_cat,
        #print "  shape_category:", shape_cat,
        #print "  combined categories:", combined_cat

        rec['categories_int'] = combined_cat
        if rec['id']:
            info[rec['id']] = rec
        elif rec['color']:
            info[rec['color']] = rec
        elif rec['shape']:
            info[rec['shape']] = rec
        elif rec['msr1']:
            info[rec['msr1']] = rec
        else:
            print "Warning: Not all fields found for record:",str(rec)

for k,v in info.items():
    print k,"-->",v
    cat_int = v['categories_int']
    print " -- The categories_int variable has value ",cat_int,"which means 
this record has these properties: ", get_properties(cat_int)

    print 



print "Just to be clear, here are the bitwise categories:"
list_properties()

begin:vcard
fn:Shane Geiger
n:Geiger;Shane
org:National Council on Economic Education (NCEE)
adr:Suite 215;;201 N. 8th Street;Lincoln;NE;68508;United States
email;internet:[EMAIL PROTECTED]
title:IT Director
tel;work:402-438-8958
x-mozilla-html:FALSE
url:http://www.ncee.net
version:2.1
end:vcard

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: flattening/rolling up/aggregating a large sorted text file

Reply via email to