@Anthony: Thanks for the quick reply. I fixed my problem (more on that below), but first back to my previous problem:
I actually made a mistake in my previous mail. My setup is the following: I have around 29 000 groups, and in each of these groups I have 5 result tables. Each of these tables contains approx. 31k rows. From each of these tables I try to retrieve the subset that fulfills a specific score criterion (score higher than a given value).

This is the pseudo code for my script:

    import tables

    h5_f = tables.openFile(hdf5_file, 'r+')
    table = h5_f.createTable('/', 'top_results', TopResults,
                             'Top results from all results')
    row = table.row
    for group_info in h5_f.root.result_list:
        group_name = group_info['name']
        group = h5_f.getNode('/results/%s' % group_name)
        for result_type in ['result1', 'result2', 'result3', 'result4', 'result5']:
            res_table = h5_f.getNode(group, result_type)
            # retrieve the rows above min_score
            results = _getTopResults(res_table, min_score)
            for result_row in results:
                assignValues(row, result_row)
                row.append()
            table.flush()
    table.cols.score.createIndex()
    h5_f.flush()
    h5_f.close()

The first couple of thousand tables run really quickly, and then the performance degrades to 1 table/second or even less. After the script finished I checked the resulting table: it contained around 51 million rows.

I fixed my problem in several ways (see the sketch at the end of this mail for roughly what the batched version looks like):

1.) Split up the one huge table into 5 tables (one per result_type).
2.) Set NODE_CACHE_SLOTS=1024 in tables.openFile().
3.) First retrieve all matching rows for a specific result_type and then append them in one go via table.append().

With these changes the performance doesn't degrade at all, and memory consumption is also reasonable.

cheers
Ümit

P.S.: Sorry for writing this mail in this way; somehow I didn't get your response directly via mail.

On Mon, Jan 16, 2012 at 7:43 PM, Ümit Seren <uemit.se...@gmail.com> wrote:
> I created a hdf5 file with PyTables which contains around 29 000
> tables with around 31k rows each.
> I am trying to create a caching table in the same hdf5 file which
> contains a subset of those 29 000 tables.
>
> I wrote a script which basically iterates through each of the 29 000
> tables, retrieves a subset and then writes it to the caching table.
> Basically it goes through the subset and adds the rows from the
> subset one by one to the caching table.
> The first couple of thousand tables run really quickly (around 5-8 tables
> per second or so). However, the longer the script runs the slower it
> becomes (down to 1 table per second).
>
> Does anyone know why this is the case? (LRU cache maybe?)
>
> Right now I write row by row using row.append().
> Is it faster to create the dataset in memory and then write it as a
> whole block to the table?
>
> thanks in advance
>
> Ümit

>> Yes. In general, the more you can read / write in one go, the better
>> performance is. There is overhead in both Python and HDF5 to the
>> method calls.
>> However, the 9x slowdown per call is a little disconcerting. Do you have
>> a demonstration script that you can share?
>> Be Well
>> Anthony
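For reference, here is roughly what the batched version looks like now. It's only a sketch: TopResults, hdf5_file and min_score come from the pseudo code above, I'm assuming the five cache tables can reuse the same description as the source tables (so the record array that readWhere() returns can be appended as-is), and I'm assuming the source tables have a 'score' column the condition can refer to.

    import tables

    # Sketch only: TopResults, hdf5_file and min_score are the placeholders
    # from the pseudo code above; the cache tables are assumed to share the
    # source tables' column layout so readWhere() results append directly.
    h5_f = tables.openFile(hdf5_file, 'r+', NODE_CACHE_SLOTS=1024)

    result_types = ['result1', 'result2', 'result3', 'result4', 'result5']
    cache = {}
    for result_type in result_types:
        cache[result_type] = h5_f.createTable(
            '/', 'top_%s' % result_type, TopResults,
            'Top %s rows from all groups' % result_type)

    for group_info in h5_f.root.result_list:
        group = h5_f.getNode('/results/%s' % group_info['name'])
        for result_type in result_types:
            res_table = h5_f.getNode(group, result_type)
            # one in-kernel query per source table returns all matching rows
            # as a single numpy record array ...
            top_rows = res_table.readWhere('score > min_score',
                                           condvars={'min_score': min_score})
            if len(top_rows):
                # ... which gets appended in one call instead of row by row
                cache[result_type].append(top_rows)

    for cache_table in cache.values():
        cache_table.flush()
        cache_table.cols.score.createIndex()
    h5_f.flush()
    h5_f.close()

The single append() per source table is what removes the ~31k individual row.append() calls per table, which matches Anthony's point about the per-call overhead in Python and HDF5.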
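And since the score index came up: once createIndex() has run, later lookups against a cache table can take advantage of it. A minimal query sketch (the node name 'top_result1' and the cutoff value are just placeholders):

    import tables

    h5_f = tables.openFile(hdf5_file, 'r')
    top_table = h5_f.getNode('/', 'top_result1')  # placeholder node name
    # the condition is evaluated in-kernel and can use the index on the
    # 'score' column instead of iterating over the rows in Python
    best_rows = top_table.readWhere('score > cutoff', condvars={'cutoff': 0.99})
    h5_f.close()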