On Tue, Jan 17, 2012 at 4:35 AM, Ümit Seren <uemit.se...@gmail.com> wrote:

> @Anthony:
> Thanks for the quick reply.
> I fixed my problem (I will get to it later) but first to my previous
> problem:
>
> I actually made a mistake in my previous mail.
> My setup is the following: I have around 29 000 groups, and each of
> these groups contains 5 result tables.
> Each of these tables contains approx. 31k rows. From each table I
> retrieve the subset of rows that fulfills a specific score criterion
> (score higher than a given threshold).
>
> This is the pseudo code for my script:
>
>    h5_f = tables.openFile(hdf5_file, 'r+')
>    table = h5_f.createTable('/', 'top_results', TopResults,
>                             'Top results from all results')
>    row = table.row
>    for group_info in h5_f.root.result_list:
>        group_name = group_info['name']
>        group = h5_f.getNode('/results/%s' % group_name)
>        for result_type in ['result1', 'result2', 'result3',
>                            'result4', 'result5']:
>            res_table = h5_f.getNode(group, result_type)
>            # retrieve the top results for this table
>            results = _getTopResults(res_table, min_score)
>            for result_row in results:
>                assignValues(row, result_row)
>                row.append()
>    table.flush()
>    table.cols.score.createIndex()
>    h5_f.flush()
>    h5_f.close()
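>
> (For reference, a minimal sketch of what the _getTopResults helper
> could look like, assuming the result tables have a column named
> 'score'; the actual helper may do more:)
>
>    def _getTopResults(res_table, min_score):
>        # read every row above the threshold in one go, as a
>        # numpy record array (in-kernel query via readWhere)
>        return res_table.readWhere('score > min_score',
>                                   condvars={'min_score': min_score})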
>
> The first couple of thousand tables are processed really quickly,
> but then the performance degrades to 1 table/second or even less.
>
> After the script finished I checked the resulting tables; in total
> they contained around 51 million rows.
>
> I fixed my problem in several ways:
>
>  1.) Split up the one huge table into 5 tables (one per result_type).
>  2.) Set NODE_CACHE_SLOTS=1024 in tables.openFile().
>  3.) First retrieve all rows for a specific result_type and then
> append them in one go via Table.append().
>
> By doing this the performance doesn't degrade at all, and memory
> consumption is also reasonable.
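>
> For reference, this is roughly what the fixed script looks like (a
> minimal sketch; the caching table names are made up here, and I
> assume the per-type tables can reuse the TopResults description;
> in practice the fields may need remapping before the append):
>
>    import tables
>
>    # fix 2: a larger node cache avoids thrashing the LRU node
>    # cache when touching ~145 000 leaf nodes
>    h5_f = tables.openFile(hdf5_file, 'r+', NODE_CACHE_SLOTS=1024)
>
>    result_types = ['result1', 'result2', 'result3',
>                    'result4', 'result5']
>    # fix 1: one caching table per result_type instead of one
>    # huge table
>    cache_tables = {}
>    for result_type in result_types:
>        cache_tables[result_type] = h5_f.createTable(
>            '/', 'top_%s' % result_type, TopResults,
>            'Top results for %s' % result_type)
>
>    for group_info in h5_f.root.result_list:
>        group = h5_f.getNode('/results/%s' % group_info['name'])
>        for result_type in result_types:
>            res_table = h5_f.getNode(group, result_type)
>            results = _getTopResults(res_table, min_score)
>            # fix 3: append the whole subset as one block
>            # instead of row by row
>            cache_tables[result_type].append(results)
>
>    for cache_table in cache_tables.values():
>        cache_table.flush()
>        cache_table.cols.score.createIndex()
>    h5_f.close()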
>

Great to hear that this works for you. I think Table.append() is the
real fix.

Be Well
Anthony


>
> cheers
> Ümit
>
> P.S.: Sorry for writing this mail in this way; somehow I didn't get
> your response directly via mail.
>
> On Mon, Jan 16, 2012 at 7:43 PM, Ümit Seren <uemit.se...@gmail.com> wrote:
> > I created an HDF5 file with PyTables which contains around 29 000
> > tables with around 31k rows each.
> > I am trying to create a caching table in the same hdf5 file which
> > contains a subset of those 29 000 tables.
> >
> > I wrote a script which iterates through each of the 29 000 tables,
> > retrieves a subset, and then writes it to the caching table.
> > It goes through the subset and adds its rows one by one to the
> > caching table.
> > The first couple of thousand tables run really quickly (around 5-8
> > tables per second). However, the longer the script runs, the slower
> > it becomes (down to 1 table per second).
> >
> > Does anyone know why this is the case? (LRU cache maybe?)
> >
> > Right now I write row by row using row.append().
> > Is it faster to create the dataset in memory and then write it as a
> > whole block to the table?
> >
> > thanks in advance
> >
> > Ümit
>
> >> Yes.  In general, the more you can read / write in one go, the
> >> better performance is.  Each method call carries overhead in both
> >> Python and HDF5.
>
> >> However, the 9x slowdown per call is a little disconcerting.  Do
> >> you have a demonstration script that you can share?
>
> >> Be Well
> >> Anthony
>
>