Re: [Pytables-users] Extracting unique values of a column

Armando Serrano Lombillo Tue, 12 May 2009 05:01:57 -0700

Ok, it looks like we were writing similar emails at the same time. :)

I'll change my code right away, but I'm still interested in what exactly was
slowing down my first approach. Was it the way I accessed the file, that is, is
t.colinstances[ind] slow? Or was it that building the set directly is slower
than using .add()? The difference is huge, as both my impressions and your
benchmarks showed.
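
One rough way to test the second hypothesis on its own, without PyTables, is to
time both set-building styles on a plain in-memory list. This is only a sketch:
the `values` list is a made-up stand-in for one column, and the function names
are hypothetical. If the two timings come out within a small factor of each
other, set construction alone is unlikely to explain a gap of this size.

```python
import timeit

# Hypothetical stand-in for one column's values (not read from a real table).
values = [i % 10 for i in range(100000)]

def build_direct(vals):
    """Build the set in a single call."""
    return set(vals)

def build_with_add(vals):
    """Build the set incrementally with .add()."""
    unique = set()
    for v in vals:
        unique.add(v)
    return unique

# Both approaches must produce the same set of unique values.
assert build_direct(values) == build_with_add(values)

# Time each approach over the same number of repetitions.
t_direct = timeit.timeit(lambda: build_direct(values), number=10)
t_add = timeit.timeit(lambda: build_with_add(values), number=10)
print("set(values): %.4f s" % t_direct)
print(".add() loop: %.4f s" % t_add)
```

The absolute numbers depend on the machine and the Python version, so only the
ratio between the two is worth reading.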


Thanks again for your fast and helpful response.
Armando.

On Tue, May 12, 2009 at 1:44 PM, Francesc Alted <[email protected]> wrote:

> On Tuesday 12 May 2009 12:29:04 Armando Serrano Lombillo wrote:
> > Compression: zlib, level 1.
> > Size: 150 MB (compressed), but it could be even bigger, or it could be less
> > than 1 MB. Anyway, even with small files, I find it slower than I would
> > expect.
> > Available memory: depends. I am now running it with 512 MB of RAM.
> > Expectedrows: no, I didn't know about it.
> > Other information: I first create the table, save the file and close it. At
> > the time of creating the table I can't know how many rows there will be. I
> > then reopen the file and extract the unique values.
> > I'm running it on Windows XP, Python 2.5.
>
> Well, 150 MB is not really a big deal for PyTables.  I think the problem is
> more the way you are accessing columns.  The following is considerably more
> efficient:
>
>    # Build a dictionary for getting the different values per column
>    uniqvals = dict((name, set()) for name in t.colnames)
>
>    # Get the unique values for every column
>    for row in t:
>        for column in uniqvals:
>            uniqvals[column].add(row[column])
>
> Here you have some timings for a really small table (400 KB):
>
> Creating table with 10000 rows and 5 columns...
> Getting unique values for each column (method 1)...
> column: col0, unique values: set([0])
> column: col1, unique values: set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
> column: col2, unique values: set([0, 8, 2, 4, 6])
> column: col3, unique values: set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
> column: col4, unique values: set([0, 8, 2, 4, 6])
> Time for finding unique values (method 1): 4.928
> Getting unique values for each column (method 2)...
> column: col0, unique values: set([0])
> column: col1, unique values: set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
> column: col2, unique values: set([0, 8, 2, 4, 6])
> column: col3, unique values: set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
> column: col4, unique values: set([0, 8, 2, 4, 6])
> Time for finding unique values (method 2): 0.017
>
> where method 1 is what you are using and method 2 is the above one.  As you
> can see, the speed-up is more than 250x.
>
> Also, using a dictionary instead of a set could be worth a try:
>
>    # Build a dictionary for getting the different values per column
>    uniqvals = dict((name, {}) for name in t.colnames)
>
>    # Get the unique values for every column
>    for row in t:
>        for column in uniqvals:
>            uniqvals[column][row[column]] = None
>
> This method (let's call it method 3) is a little faster:
>
> Creating table with 1000000 rows and 10 columns...
> Getting unique values for each column (method 2)...
> Time for finding unique values (method 2): 2.563
> Getting unique values for each column (method 3)...
> Time for finding unique values (method 3): 2.032
>
> This last run was using a 40 MB table.  So, this operation can be done at
> around 20 MB/s, which is pretty good.
>
> I'm attaching my small benchmark in case you want to experiment a little
> more with different compressors or other parameters.
>
> Hope this helps,
>
> --
> Francesc Alted
>
>
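
For reference, the two loops quoted above (method 2 with sets, method 3 with
dictionaries) can be tried without a PyTables file. The following is a minimal
sketch, assuming rows can be simulated as plain dicts; the column names, the
generated data and the function names are made up for illustration and are not
taken from the attached benchmark. With a real table, the loops would iterate
over the table and use its colnames instead.

```python
# Hypothetical data: rows simulated as dicts instead of PyTables rows.
colnames = ["col0", "col1", "col2"]
rows = [{"col0": 0, "col1": i % 10, "col2": (i % 5) * 2}
        for i in range(10000)]

def unique_with_sets(rows, colnames):
    """Method 2: one set per column, filled with .add() in a single pass."""
    uniqvals = dict((name, set()) for name in colnames)
    for row in rows:
        for column in uniqvals:
            uniqvals[column].add(row[column])
    return uniqvals

def unique_with_dicts(rows, colnames):
    """Method 3: one dict per column, using the keys as the unique values."""
    uniqvals = dict((name, {}) for name in colnames)
    for row in rows:
        for column in uniqvals:
            uniqvals[column][row[column]] = None
    # Convert the dict keys to sets so both methods return the same type.
    return dict((name, set(seen)) for name, seen in uniqvals.items())

by_set = unique_with_sets(rows, colnames)
by_dict = unique_with_dicts(rows, colnames)

# Both methods should agree on every column.
assert by_set == by_dict

for name in colnames:
    print("column: %s, unique values: %s" % (name, sorted(by_set[name])))
```

Both variants read every row exactly once, so the cost is a single pass over
the data regardless of the number of columns inspected.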

_______________________________________________
Pytables-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/pytables-users
