2012/1/17 Anthony Scopatz <scop...@gmail.com>

>
>
> On Tue, Jan 17, 2012 at 4:35 AM, Ümit Seren <uemit.se...@gmail.com> wrote:
>
>> @Anthony:
>> Thanks for the quick reply.
>> I fixed my problem (I will get to it later) but first to my previous
>> problem:
>>
>> I actually made a mistake in my previous mail.
>> My setup is the following: I have around 290,000 groups. In each of
>> these groups I have 5 result tables.
>> Each of these tables contains approx. 31k rows. From each of these
>> tables I try to retrieve the subset which fulfills a specific score
>> criterion (score higher than a given threshold).
>>
>> This is the pseudo code for my script:
>>
>>    h5_f = tables.openFile(hdf5_file, 'r+')
>>    table = h5_f.createTable('/', 'top_results', TopResults,
>>                             'Top results from all results')
>>    row = table.row
>>    for group_info in h5_f.root.result_list:
>>        group_name = group_info['name']
>>        group = h5_f.getNode('/results/%s' % group_name)
>>        for result_type in ['result1', 'result2', 'result3',
>>                            'result4', 'result5']:
>>            res_table = h5_f.getNode(group, result_type)
>>            # retrieve rows above min_score
>>            results = _getTopResults(res_table, min_score)
>>            for result_row in results:
>>                assignValues(row, result_row)
>>                row.append()
>>    table.flush()
>>    table.cols.score.createIndex()
>>    h5_f.flush()
>>    h5_f.close()
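(For reference, a `_getTopResults` along these lines could be an in-kernel query. This is only a sketch: the `Result` description, the `score` column name, and the temporary file are assumptions, and it is written with the snake_case names of newer PyTables; the 2.x spellings are `openFile`/`createTable`/`readWhere`.)

```python
import os
import tempfile

import tables

# Hypothetical row description; only the `score` column is implied by
# the pseudo code above.
class Result(tables.IsDescription):
    name = tables.StringCol(16)
    score = tables.Float64Col()

def get_top_results(res_table, min_score):
    # In-kernel query: the condition is evaluated inside the HDF5 read
    # loop, so only the matching rows are materialized in memory.
    return res_table.read_where('score > min_score')

# Minimal round trip against a throwaway file.
path = os.path.join(tempfile.mkdtemp(), 'demo.h5')
with tables.open_file(path, 'w') as h5_f:
    t = h5_f.create_table('/', 'result1', Result)
    t.append([(b'a', 0.2), (b'b', 0.9), (b'c', 0.7)])
    t.flush()
    top = get_top_results(t, 0.5)
    print(len(top))  # 2 rows score above 0.5
```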
>>
>> So the first couple of thousand tables are processed really quickly,
>> but then performance degrades to 1 table/second or even less.
>>
>> After the script finished I checked the resulting table: it
>> contained around 51 million rows.
>>
>> I fixed my problem in several ways:
>>
>>  1.) Split up the one huge table into 5 tables (one per result_type).
>>  2.) Set NODE_CACHE_SLOTS=1024 in tables.openFile().
>>  3.) First retrieve all rows for a specific result_type and then
>> append them in one go via table.append().
>>
>> By doing this the performance doesn't degrade at all. Memory
>> consumption is also reasonable.
>>
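Sketched with hypothetical names and a throwaway file, the combination of fixes 2 and 3 looks roughly like this (the `TopResult` description and the `score` column are assumptions; again using the modern snake_case API, where PyTables 2.x spells these `openFile`/`createTable`/`readWhere`):

```python
import os
import tempfile

import tables

class TopResult(tables.IsDescription):
    name = tables.StringCol(16)
    score = tables.Float64Col()

path = os.path.join(tempfile.mkdtemp(), 'demo.h5')
# Fix 2: a larger node cache keeps recently used nodes alive while the
# loop walks hundreds of thousands of groups and tables.
with tables.open_file(path, 'w', NODE_CACHE_SLOTS=1024) as h5_f:
    src = h5_f.create_table('/', 'result1', TopResult)
    src.append([(b'a', 0.2), (b'b', 0.9), (b'c', 0.7)])
    src.flush()

    top = h5_f.create_table('/', 'top_results', TopResult)
    # Fix 3: read all matching rows at once and append them in a
    # single Table.append() call instead of one Row.append() per row.
    matches = src.read_where('score > 0.5')
    top.append(matches)
    top.flush()
    print(top.nrows)  # 2
```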
>
> Great to hear that this works for you.  I think Table.append() is the
> real fix here.
>

Hmmm, Row.append() uses a buffered approach, so I generally recommend
it. Ümit, could you assess how much speed-up Table.append() is buying
you?  Just curious.

-- 
Francesc Alted
_______________________________________________
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users
