On Tuesday 12 May 2009 12:29:04 Armando Serrano Lombillo wrote:
> Compression: zlib, level 1.
> Size: 150 MB (compressed) but it could be even bigger, or it could be less
> than 1 MB. Anyway, even with small files, I find it slower than I would
> expect.
> Available memory: depends. I am now running it with 512 MB of RAM.
> Expectedrows: no, I didn't know about it.
> Other information: I first create the table, save the file and close it. At
> the time of creating the table I can't know how many rows there will be. I
> then reopen the file and extract the unique values.
> I'm running it on Windows XP, Python 2.5.
Well, 150 MB is not really a big deal for PyTables. I think the problem is
more the way you are accessing the columns. The following is much more
efficient:
# Build a dictionary for getting the different values per column
uniqvals = dict((name, set()) for name in t.colnames)
# Get the unique values for every column
for row in t:
    for column in uniqvals:
        uniqvals[column].add(row[column])
Here you have some timings for a really small table (400 KB):
Creating table with 10000 rows and 5 columns...
Getting unique values for each column (method 1)...
column: col0, unique values: set([0])
column: col1, unique values: set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
column: col2, unique values: set([0, 8, 2, 4, 6])
column: col3, unique values: set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
column: col4, unique values: set([0, 8, 2, 4, 6])
Time for finding unique values (method 1): 4.928
Getting unique values for each column (method 2)...
column: col0, unique values: set([0])
column: col1, unique values: set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
column: col2, unique values: set([0, 8, 2, 4, 6])
column: col3, unique values: set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
column: col4, unique values: set([0, 8, 2, 4, 6])
Time for finding unique values (method 2): 0.017
where method 1 is what you are using and method 2 is the one above. As you
can see, the speed-up is more than 250x: iterating the table row by row makes
a single buffered pass over the data, while pulling each column out separately
pays the full access overhead once per column.
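If the whole table fits in memory, you could also avoid the per-row Python
loop entirely with NumPy. A minimal sketch (just an idea to try, assuming the
table can be read in one shot; use t.read(start, stop) to work in slices
otherwise):

import numpy as np

def tuniq_numpy(t):
    # Read the whole table as a NumPy structured array in a single pass
    data = t.read()
    # np.unique on each field gives the distinct values per column
    return dict((name, set(np.unique(data[name]))) for name in t.colnames)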
Also, using a dictionary instead of a set could be worth a try:
# Build a dictionary for getting the different values per column
uniqvals = dict((name, {}) for name in t.colnames)
# Get the unique values for every column
for row in t:
    for column in uniqvals:
        uniqvals[column][row[column]] = None
This method (let's call it method 3) is a little faster:
Creating table with 1000000 rows and 10 columns...
Getting unique values for each column (method 2)...
Time for finding unique values (method 2): 2.563
Getting unique values for each column (method 3)...
Time for finding unique values (method 3): 2.032
This last run used a 40 MB table, so this operation can be done at around
20 MB/s (40 MB / 2.03 s), which is pretty good.
I'm attaching my small benchmark in case you want to experiment a little
more with different compressors or other parameters.
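For instance, you could plug in your zlib setup, or uncomment the expectedrows
hint so PyTables can choose a better chunk size; a minimal variation of the
createtable() call in the script (both are standard PyTables parameters):

t = f.createTable(f.root, 'table', tdesc,
                  expectedrows=NROWS,  # hint for choosing a sensible chunk size
                  filters=tb.Filters(complib='zlib', complevel=1))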
Hope this helps,
--
Francesc Alted
import tables as tb
from time import time

FNAME = "/tmp/prova.h5"
NROWS = int(1e6)
NCOLS = 10
MAXDIFF = 10

def createtable(filename):
    # Build a table descriptor as a dictionary
    tdesc = {}
    for i in range(NCOLS):
        tdesc["col%d" % i] = tb.IntCol()
    f = tb.openFile(filename, "w")
    t = f.createTable(f.root, 'table', tdesc,  #expectedrows=NROWS,
                      filters=tb.Filters(complib='lzo', complevel=1))
    # Fill the table with values
    row = t.row
    for j in range(NROWS):
        for i in range(NCOLS):
            row["col%d" % i] = j*i % MAXDIFF  # MAXDIFF different values
        row.append()
    f.close()

def tuniq1(filename):
    f = tb.openFile(filename, "r")
    t = f.root.table
    # Get the different values per column in one shot
    uniqvals = dict((name, set(t.cols._f_col(name))) for name in t.colnames)
    f.close()
    return uniqvals

def tuniq2(filename):
    f = tb.openFile(filename, "r")
    t = f.root.table
    # Build a dictionary for getting the different values per column
    uniqvals = dict((name, set()) for name in t.colnames)
    # Get the unique values for every column
    for row in t:
        for column in uniqvals:
            uniqvals[column].add(row[column])
    f.close()
    return uniqvals

def tuniq3(filename):
    f = tb.openFile(filename, "r")
    t = f.root.table
    # Build a dictionary for getting the different values per column
    uniqvals = dict((name, {}) for name in t.colnames)
    # Get the unique values for every column
    for row in t:
        for column in uniqvals:
            uniqvals[column][row[column]] = None
    f.close()
    return uniqvals

def print_uniq(uniqvals):
    cols = uniqvals.keys()
    cols.sort()
    for column in cols:
        print "column: %s, unique values: %s" % (column, uniqvals[column])

if __name__ == '__main__':
    print "Creating table with %s rows and %s columns..." % (NROWS, NCOLS)
    createtable(FNAME)
    #print "Getting unique values for each column (method 1)..."
    #t0 = time()
    #uniqvals = tuniq1(FNAME)
    #t1 = time() - t0
    #print_uniq(uniqvals)
    #print "Time for finding unique values (method 1):", round(t1, 3)
    print "Getting unique values for each column (method 2)..."
    t0 = time()
    uniqvals = tuniq2(FNAME)
    t1 = time() - t0
    #print_uniq(uniqvals)
    print "Time for finding unique values (method 2):", round(t1, 3)
    print "Getting unique values for each column (method 3)..."
    t0 = time()
    uniqvals = tuniq3(FNAME)
    t1 = time() - t0
    #print_uniq(uniqvals)
    print "Time for finding unique values (method 3):", round(t1, 3)