Hi Francesc, I will try to get some numbers as soon as I have some time at hand. However, I am not sure if I can come up with an absolute number. It seems that at the beginning (first 1000 tables) I see no performance penalty; after that, however, the performance quickly degrades. Does traversing/accessing a huge number of groups/datasets have an effect on row.append()? Just as a side note: when I did my test I didn't change any of the default parameters like METADATA_CACHE_SIZE or NODE_CACHE_SLOTS=1024. BTW, I am using PyTables 2.3 and HDF5 1.8.7.
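For reference, this is roughly how those parameters can be overridden at open time. A minimal sketch only: the file name and the METADATA_CACHE_SIZE value are placeholders, and I'm assuming PyTables 2.x accepts the parameter names from tables/parameters.py as keyword arguments to openFile (NODE_CACHE_SLOTS=1024 is the value mentioned in the quoted mail below):

import tables

# Placeholder file name; the keyword arguments are assumed to override the
# defaults from tables/parameters.py for this file handle only.
h5_f = tables.openFile('results.h5', mode='r+',
                       NODE_CACHE_SLOTS=1024,        # more slots in the node LRU cache
                       METADATA_CACHE_SIZE=2097152)  # placeholder HDF5 metadata cache size, in bytes
# ... do the work ...
h5_f.close()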
On Tue, Jan 17, 2012 at 7:53 PM, Francesc Alted <fal...@pytables.org> wrote:
> 2012/1/17 Anthony Scopatz <scop...@gmail.com>
>>
>> On Tue, Jan 17, 2012 at 4:35 AM, Ümit Seren <uemit.se...@gmail.com> wrote:
>>>
>>> @Anthony:
>>> Thanks for the quick reply.
>>> I fixed my problem (I will get to it later), but first to my previous
>>> problem:
>>>
>>> I actually made a mistake in my previous mail.
>>> My setup is the following: I have around 29 0000 groups. In each of
>>> these groups I have 5 result tables.
>>> Each of these tables contains approx. 31k rows. From each of these
>>> tables I try to retrieve the subset which fulfills a specific score
>>> criterion (higher than a specific score).
>>>
>>> This is the pseudo code for my script:
>>>
>>> h5_f = tables.openFile(hdf5_file, 'r+')
>>> table = h5_f.createTable('/', 'top_results', TopResults,
>>>                          'Top results from all results')
>>> row = table.row
>>> for group_info in h5_f.root.result_list:
>>>     group_name = group_info['name']
>>>     group = h5_f.getNode('/results/%s' % group_name)
>>>     for result_type in ['result1', 'result2', 'result3', 'result4', 'result5']:
>>>         res_table = h5_f.getNode(group, result_type)
>>>         # retrieves top results
>>>         results = _getTopResults(res_table, min_score)
>>>         for result_row in results:
>>>             assignValues(row, result_row)
>>>             row.append()
>>>     table.flush()
>>> table.cols.score.createIndex()
>>> h5_f.flush()
>>> h5_f.close()
>>>
>>> So the first couple of thousand tables run really quickly and then the
>>> performance degrades to 1 table/second or even less.
>>>
>>> After the script finished I checked the resulting tables and they
>>> contained around 51 million rows.
>>>
>>> I fixed my problem in several ways:
>>>
>>> 1.) Split up the one huge table into 5 tables (one for each result_type).
>>> 2.) I set NODE_CACHE_SLOTS=1024 in tables.openFile().
>>> 3.) First retrieve all rows for a specific result_type and then
>>> append them via table.append().
>>>
>>> By doing this the performance doesn't degrade at all. Memory
>>> consumption is also reasonable.
>>
>> Great to hear that this works for you. I think Table.append() is the
>> real fix.
>
> Hmmm, Row.append() uses a buffered approach, so I generally recommend it.
> Umit, could you assess how much speed-up Table.append() is buying you?
> Just curious.
>
> --
> Francesc Alted
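For completeness, here is a minimal sketch of the bulk-append variant described in point 3 above, in case it helps with the timing comparison. It assumes the top_results table already exists with a description matching the columns coming back from the result tables, that min_score is defined, and it uses readWhere() (an in-kernel query that returns a NumPy structured array) in place of the _getTopResults() helper from the pseudo code:

import tables

min_score = 5.0  # placeholder threshold
h5_f = tables.openFile('results.h5', 'r+')  # placeholder file name
top = h5_f.getNode('/', 'top_results')

for group_info in h5_f.root.result_list:
    group = h5_f.getNode('/results/%s' % group_info['name'])
    for result_type in ['result1', 'result2', 'result3', 'result4', 'result5']:
        res_table = h5_f.getNode(group, result_type)
        # one in-kernel query per source table instead of a Python-level row loop
        rows = res_table.readWhere('score > min_score')
        if len(rows):
            top.append(rows)  # a single buffered write per source table
    top.flush()

h5_f.close()

With Row.append() the same data goes through one Python call per row, so the difference between the two versions should show up directly in the timings.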