Oh, and it also surprises me that using a dictionary of Nones is faster than
using a set. Maybe Python's set type needs some performance optimizations,
but that has nothing to do with pytables.
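
For what it's worth, the set-vs-dict gap can be timed in isolation with
plain Python, independent of pytables (a minimal sketch; the data and
function names below are made up for illustration):

```python
# Compare set.add() against dict-of-Nones insertion for deduplicating
# a sequence with many repeated values.
import timeit

data = list(range(10)) * 100000  # many repeats of 10 distinct values

def dedup_set(values):
    # Collect unique values with a set
    s = set()
    for v in values:
        s.add(v)
    return s

def dedup_dict(values):
    # Collect unique values as keys of a dict mapping to None
    d = {}
    for v in values:
        d[v] = None
    return d

t_set = timeit.timeit(lambda: dedup_set(data), number=5)
t_dict = timeit.timeit(lambda: dedup_dict(data), number=5)
print("set:  %.3f s" % t_set)
print("dict: %.3f s" % t_dict)
```

Both variants find the same unique values; which one wins will depend on
the Python version and build, so the timings are only indicative.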

Armando.

On Tue, May 12, 2009 at 2:00 PM, Armando Serrano Lombillo <
arser...@gmail.com> wrote:

> Ok, it looks like we were writing similar emails at the same time. :)
>
> I'll change my code right away, but I'm still interested in what exactly
> was slowing my first approach. Was it the way I accessed the file, that is,
> is t.colinstances[ind] slow? Or was it that directly building the set is
> slower than using .add()? The difference is huge, as my impressions and your
> benchmarks showed.
>
> Thanks again for your fast and helpful response.
> Armando.
>
> On Tue, May 12, 2009 at 1:44 PM, Francesc Alted <fal...@pytables.org> wrote:
>
>> On Tuesday 12 May 2009 12:29:04 Armando Serrano Lombillo wrote:
>> > Compression: zlib, level 1.
>> > Size: 150 MB (compressed), but it could be even bigger or it could be
>> > less than 1 MB. Anyway, even with small files, I find it slower than I
>> > would expect.
>> > Available memory: depends. I am now running it with 512 MB of RAM.
>> > Expectedrows: no, I didn't know about it.
>> > Other information: I first create the table, save the file and close it.
>> > At the time of creating the table I can't know how many rows there will
>> > be. I then reopen the file and extract the unique values.
>> > I'm running it on Windows XP, Python 2.5.
>>
>> Well, 150 MB is not really a big deal for PyTables.  I think the problem
>> is more the way you are accessing columns.  The following is much more
>> efficient:
>>
>>    # Build a dictionary for getting the different values per column
>>    uniqvals = dict((name, set()) for name in t.colnames)
>>
>>    # Get the unique values for every column
>>    for row in t:
>>        for column in uniqvals:
>>            uniqvals[column].add(row[column])
>>
>> Here you have some timings for a really small table (400 KB):
>>
>> Creating table with 10000 rows and 5 columns...
>> Getting unique values for each column (method 1)...
>> column: col0, unique values: set([0])
>> column: col1, unique values: set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>> column: col2, unique values: set([0, 8, 2, 4, 6])
>> column: col3, unique values: set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>> column: col4, unique values: set([0, 8, 2, 4, 6])
>> Time for finding unique values (method 1): 4.928
>> Getting unique values for each column (method 2)...
>> column: col0, unique values: set([0])
>> column: col1, unique values: set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>> column: col2, unique values: set([0, 8, 2, 4, 6])
>> column: col3, unique values: set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>> column: col4, unique values: set([0, 8, 2, 4, 6])
>> Time for finding unique values (method 2): 0.017
>>
>> where method 1 is what you are using and method 2 is the one above.  As
>> you can see, the speed-up is more than 250x.
>>
>> Also, using a dictionary instead of a set could be worth a try:
>>
>>    # Build a dictionary for getting the different values per column
>>    uniqvals = dict((name, {}) for name in t.colnames)
>>
>>    # Get the unique values for every column
>>    for row in t:
>>        for column in uniqvals:
>>            uniqvals[column][row[column]] = None
>>
>> This method (let's call it method 3) is a little faster:
>>
>> Creating table with 1000000 rows and 10 columns...
>> Getting unique values for each column (method 2)...
>> Time for finding unique values (method 2): 2.563
>> Getting unique values for each column (method 3)...
>> Time for finding unique values (method 3): 2.032
>>
>> This last run was using a 40 MB table.  So, this operation can be done at
>> around 20 MB/s, which is pretty good.
>>
>> I'm attaching my small benchmark in case you want to experiment a little
>> more with different compressors or other parameters.
>>
>> Hope this helps,
>>
>> --
>> Francesc Alted
>>
>>
>> ------------------------------------------------------------------------------
>> The NEW KODAK i700 Series Scanners deliver under ANY circumstances! Your
>> production scanning environment may not be a perfect world - but thanks to
>> Kodak, there's a perfect scanner to get the job done! With the NEW KODAK
>> i700
>> Series Scanner you'll get full speed at 300 dpi even with all image
>> processing features enabled. http://p.sf.net/sfu/kodak-com
>> _______________________________________________
>> Pytables-users mailing list
>> Pytables-users@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/pytables-users
>>
>>
>
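
P.S. For readers without the attached benchmark handy, the per-column
pattern from method 2 can be exercised on plain Python rows as well (no
pytables needed; the column names and row data below are invented):

```python
# Pure-Python stand-in for method 2: one pass over the rows,
# one set per column collecting the distinct values.
colnames = ["col0", "col1"]
rows = [{"col0": i % 2, "col1": i % 5} for i in range(1000)]

# Build a dictionary mapping each column name to an empty set
uniqvals = dict((name, set()) for name in colnames)

# Get the unique values for every column in a single pass
for row in rows:
    for column in uniqvals:
        uniqvals[column].add(row[column])

for name in colnames:
    print("column: %s, unique values: %s" % (name, sorted(uniqvals[name])))
```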