Hi Francesc, I will try to get some numbers as soon as I have some time at hand. However, I am not sure if I can come up with an absolute number. It seems that at the beginning (first 1000 tables) I see no performance penalty; after that, however, the performance quickly degrades. Does traversing/accessing a huge number of groups/datasets have an effect on row.append()? Just as a side note: when I did my test I didn't change any of the default parameters like METADATA_CACHE_SIZE or NODE_CACHE_SLOTS=1024. BTW, I am using PyTables 2.3 and HDF5 1.8.7.
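For reference, this is roughly how those parameters can be overridden at open time. A minimal sketch only: the file name and the METADATA_CACHE_SIZE value are placeholders, and I'm assuming PyTables 2.x accepts the parameter names from tables/parameters.py as keyword arguments to openFile (NODE_CACHE_SLOTS=1024 is the value mentioned in the quoted mail below):

import tables

# Placeholder file name; the keyword arguments are assumed to override the
# defaults from tables/parameters.py for this file handle only.
h5_f = tables.openFile('results.h5', mode='r+',
                       NODE_CACHE_SLOTS=1024,        # more slots in the node LRU cache
                       METADATA_CACHE_SIZE=2097152)  # placeholder HDF5 metadata cache size, in bytes
# ... do the work ...
h5_f.close()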
On Tue, Jan 17, 2012 at 7:53 PM, Francesc Alted <fal...@pytables.org> wrote:
> 2012/1/17 Anthony Scopatz <scop...@gmail.com>
>>
>> On Tue, Jan 17, 2012 at 4:35 AM, Ümit Seren <uemit.se...@gmail.com> wrote:
>>>
>>> @Anthony:
>>> Thanks for the quick reply.
>>> I fixed my problem (I will get to it later), but first to my previous
>>> problem:
>>>
>>> I actually made a mistake in my previous mail.
>>> My setup is the following: I have around 29 0000 groups. In each of
>>> these groups I have 5 result tables.
>>> Each of these tables contains approx. 31k rows. From each of these
>>> tables I try to retrieve the subset which fulfills a specific score
>>> criterion (higher than a specific score).
>>>
>>> This is the pseudo code for my script:
>>>
>>> h5_f = tables.openFile(hdf5_file, 'r+')
>>> table = h5_f.createTable('/', 'top_results', TopResults,
>>>                          'Top results from all results')
>>> row = table.row
>>> for group_info in h5_f.root.result_list:
>>>     group_name = group_info['name']
>>>     group = h5_f.getNode('/results/%s' % group_name)
>>>     for result_type in ['result1', 'result2', 'result3', 'result4', 'result5']:
>>>         res_table = h5_f.getNode(group, result_type)
>>>         # retrieves top results
>>>         results = _getTopResults(res_table, min_score)
>>>         for result_row in results:
>>>             assignValues(row, result_row)
>>>             row.append()
>>>     table.flush()
>>> table.cols.score.createIndex()
>>> h5_f.flush()
>>> h5_f.close()
>>>
>>> So the first couple of thousand tables run really quickly and then the
>>> performance degrades to 1 table/second or even less.
>>>
>>> After the script finished I checked the resulting tables and they
>>> contained around 51 million rows.
>>>
>>> I fixed my problem in several ways:
>>>
>>> 1.) Split up the one huge table into 5 tables (one for each result_type).
>>> 2.) I set NODE_CACHE_SLOTS=1024 in tables.openFile().
>>> 3.) First retrieve all rows for a specific result_type and then
>>> append them via table.append().
>>>
>>> By doing this the performance doesn't degrade at all. Memory
>>> consumption is also reasonable.
>>
>> Great to hear that this works for you. I think Table.append() is the
>> real fix.
>
> Hmmm, Row.append() uses a buffered approach, so I generally recommend it.
> Umit, could you assess how much speed-up Table.append() is buying you?
> Just curious.
>
> --
> Francesc Alted
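For completeness, here is a minimal sketch of the bulk-append variant described in point 3 above, in case it helps with the timing comparison. It assumes the top_results table already exists with a description matching the columns coming back from the result tables, that min_score is defined, and it uses readWhere() (an in-kernel query that returns a NumPy structured array) in place of the _getTopResults() helper from the pseudo code:

import tables

min_score = 5.0  # placeholder threshold
h5_f = tables.openFile('results.h5', 'r+')  # placeholder file name
top = h5_f.getNode('/', 'top_results')

for group_info in h5_f.root.result_list:
    group = h5_f.getNode('/results/%s' % group_info['name'])
    for result_type in ['result1', 'result2', 'result3', 'result4', 'result5']:
        res_table = h5_f.getNode(group, result_type)
        # one in-kernel query per source table instead of a Python-level row loop
        rows = res_table.readWhere('score > min_score')
        if len(rows):
            top.append(rows)  # a single buffered write per source table
    top.flush()

h5_f.close()

With Row.append() the same data goes through one Python call per row, so the difference between the two versions should show up directly in the timings.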