Re: [Pytables-users] Extracting unique values of a column

Francesc Alted Tue, 12 May 2009 05:35:16 -0700

On Tuesday 12 May 2009 14:00:41 Armando Serrano Lombillo wrote:
> Ok, it looks like we were writing similar emails at the same time. :)
>
> I'll change my code right away, but I'm still interested in what exactly
> was slowing my first approach. Was it the way I accessed the file, that is,
> is t.colinstances[ind] slow? Or was it that directly building the set is
> slower than using .add()? The difference is huge, as my impressions and
> your benchmarks showed.


That's a good question.  Since I was not certain what was happening there,
I did some profiling.  Here are the routines that consumed the most time in
your first method:

Tue May 12 14:07:25 2009    tuniq1.prof

         2401085 function calls (2401062 primitive calls) in 5.835 CPU seconds

   Ordered by: internal time, call count
   List reduced from 184 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    50000    2.788    0.000    3.092    0.000 {method '_fillCol' of 'tables.tableExtension.Row' objects}
    50000    0.442    0.000    3.569    0.000 table.py:1496(_read)
   100000    0.313    0.000    0.861    0.000 leaf.py:425(_processRange)
150030/150010    0.253    0.000    0.491    0.000 file.py:880(_getNode)
    50005    0.241    0.000    5.759    0.000 table.py:2914(__getitem__)
   150025    0.220    0.000    0.236    0.000 file.py:249(__getitem__)
    50000    0.209    0.000    4.822    0.000 table.py:1553(read)
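
For reference, a report in this format (ordered by internal time and call
count, limited to 20 entries) can be produced with the standard `cProfile` and
`pstats` modules.  What follows is only a minimal sketch: `method1_placeholder`
and `profile_report` are hypothetical names, and the placeholder body stands in
for the real code being profiled, which is not reproduced in this message.

```python
import cProfile
import io
import pstats


def method1_placeholder(n=1000):
    """Hypothetical stand-in for the code being profiled."""
    seen = set()
    for i in range(n):
        seen.add(i % 7)
    return seen


def profile_report(func, limit=20):
    """Run func under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func)
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    # "time" sorts by internal time (tottime), "calls" by call count;
    # print_stats(limit) keeps only the first `limit` entries.
    stats.sort_stats("time", "calls").print_stats(limit)
    return result, buf.getvalue()


result, report = profile_report(method1_placeholder)
print(report)
```

The `tottime` column is time spent inside the function itself, while `cumtime`
also includes the time spent in everything it calls.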

The profile shows that a `Table.__getitem__()` call was issued for every
*single* item in the table (50005 calls, accounting for 5.759 of the 5.835
seconds in cumulative time).  As this is a user-accessible method, it must
first run many checks to ensure that the user is requesting a valid item, and
those checks carry a lot of overhead.

In comparison, the second method uses a table iterator, which is implemented
as an extension (i.e. it is fast) and, besides, only performs checks at the
beginning.  Also, by using the iterator you only have to read each row once
per run, instead of once per existing column (remember that tables are stored
row-wise, and you were accessing items column-wise in method1).  Finally,
table iterators always do buffered I/O, reading data ahead and re-using it in
subsequent iterations.  All in all, this approach is much faster.
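
The three effects described above (checks on every access, one read per
column instead of one per row, and buffered reads) can be illustrated with a
toy model.  This is a sketch in plain Python and not PyTables code: `ToyTable`,
its `get()` method, and the `checks`/`reads` counters are all made up for the
illustration, and the real library's internals are certainly more involved.

```python
class ToyTable:
    """Toy row-wise table; NOT PyTables, only a model of the two access paths."""

    def __init__(self, rows, colnames, bufsize=4):
        self.rows = rows          # list of tuples, stored row-wise
        self.colnames = colnames
        self.bufsize = bufsize    # rows fetched per buffered read
        self.checks = 0           # number of argument validations performed
        self.reads = 0            # number of separate "disk" reads performed

    def _check(self, start, stop):
        self.checks += 1
        if not (0 <= start <= stop <= len(self.rows)):
            raise IndexError("row range out of bounds")

    def get(self, colname, i):
        """User-facing single-item access: validate, then read one row."""
        self._check(i, i + 1)
        self.reads += 1
        return self.rows[i][self.colnames.index(colname)]

    def __iter__(self):
        """Iterator: validate once, then read rows in buffered chunks."""
        self._check(0, len(self.rows))
        for start in range(0, len(self.rows), self.bufsize):
            self.reads += 1
            for row in self.rows[start:start + self.bufsize]:
                yield dict(zip(self.colnames, row))


rows = [(i % 3, i % 2) for i in range(12)]
cols = ["a", "b"]

# Column-wise: one checked single-item access per item and per column.
t1 = ToyTable(rows, cols)
uniq1 = {c: set(t1.get(c, i) for i in range(len(rows))) for c in cols}

# Row-wise: a single pass with the iterator, filling one set per column.
t2 = ToyTable(rows, cols)
uniq2 = {c: set() for c in cols}
for row in t2:
    for c in cols:
        uniq2[c].add(row[c])

print(uniq1 == uniq2, t1.checks, t1.reads, t2.checks, t2.reads)
```

Both passes find the same unique values, but in this model the column-wise
pass performs 24 checks and 24 reads for 12 rows and 2 columns, while the
iterator performs 1 check and 3 buffered reads.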

The moral of this is: use table iterators whenever you can :)

Cheers,

-- 
Francesc Alted
