@Anthony:
Thanks for the quick reply.
I fixed my problem (I will get to that later), but first a correction
regarding my previous mail:

I actually made a mistake in my previous mail.
My setup is the following: I have around 29 000 groups, and in each of
these groups I have 5 result tables.
Each of these tables contains approx. 31k rows. From each of these
tables I try to retrieve the subset of rows that fulfill a specific
score criterion (score higher than a given minimum).

This is the pseudo code for my script:

    h5_f = tables.openFile(hdf5_file, 'r+')
    table = h5_f.createTable('/', 'top_results', TopResults,
                             'Top results from all results')
    row = table.row
    for group_info in h5_f.root.result_list:
        group_name = group_info['name']
        group = h5_f.getNode('/results/%s' % group_name)
        for result_type in ['result1', 'result2', 'result3',
                            'result4', 'result5']:
            res_table = h5_f.getNode(group, result_type)
            # retrieve the top results from this table
            results = _getTopResults(res_table, min_score)
            for result_row in results:
                assignValues(row, result_row)
                row.append()
    table.flush()
    table.cols.score.createIndex()
    h5_f.flush()
    h5_f.close()
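
For reference, _getTopResults basically does an in-kernel query on the
score column; it is roughly equivalent to the following sketch (the
condition string and the condvars argument are simplified here):

    def _getTopResults(res_table, min_score):
        # return all rows whose score is above the threshold
        return res_table.readWhere('score > min_score',
                                   condvars={'min_score': min_score})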

So the first couple of thousand tables run really quickly and then the
performance degrades to 1 table/second or even less.

After the script finished I checked the output: in total it contained
around 51 million rows.

I fixed my problem in several ways:

 1.) I split up the one huge table into 5 tables (one per result_type).
 2.) I set NODE_CACHE_SLOTS=1024 in tables.openFile().
 3.) I first retrieve all rows for a specific result_type and then
     append them in one go via table.append().

By doing this the performance doesn't degrade at all. Memory
consumption is also reasonable.
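
In code, the revised loop looks roughly like this (simplified; the
'top_%s' table names are just placeholders, and it assumes the rows
returned by _getTopResults have the same description as TopResults, so
they can be passed straight to table.append()):

    h5_f = tables.openFile(hdf5_file, 'r+', NODE_CACHE_SLOTS=1024)
    result_types = ['result1', 'result2', 'result3', 'result4', 'result5']
    # one caching table per result_type instead of one huge table
    top_tables = {}
    for result_type in result_types:
        top_tables[result_type] = h5_f.createTable(
            '/', 'top_%s' % result_type, TopResults,
            'Top results for %s' % result_type)
    for group_info in h5_f.root.result_list:
        group = h5_f.getNode('/results/%s' % group_info['name'])
        for result_type in result_types:
            res_table = h5_f.getNode(group, result_type)
            # fetch the whole subset at once ...
            results = _getTopResults(res_table, min_score)
            if len(results) > 0:
                # ... and append it as a single block
                top_tables[result_type].append(results)
    for top_table in top_tables.values():
        top_table.flush()
        top_table.cols.score.createIndex()
    h5_f.flush()
    h5_f.close()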

cheers
Ümit

P.S.: Sorry for replying to the thread in this way; somehow I didn't
get your response directly via mail.

On Mon, Jan 16, 2012 at 7:43 PM, Ümit Seren <uemit.se...@gmail.com> wrote:
> I created an HDF5 file with PyTables which contains around 29 000
> tables with around 31k rows each.
> I am trying to create a caching table in the same hdf5 file which
> contains a subset of those 29 000 tables.
>
> I wrote a script which basically iterates through each of the 29 000
> tables, retrieves a subset, and then writes it to the caching table.
> Basically it goes through the subset and then adds its rows one by
> one to the caching table.
> The first couple of thousand tables run really quickly (around 5-8
> tables per second or so). However, the longer the script runs, the
> slower it becomes (down to 1 table per second).
>
> Does anyone know why this is the case? (LRU cache maybe?)
>
> Right now I write row by row using row.append().
> Is it faster to create the dataset in memory and then write it as a
> whole block to the table?
>
> thanks in advance
>
> Ümit

>> Yes.  In general, the more you can read / write in one go, the better
>> performance is.  There is overhead in both Python and HDF5 to the
>> method calls.

>> However, the 9x slowdown per call is a little disconcerting.  Do you have
>> a demonstration script that you can share?

>> Be Well
>> Anthony
