My advice regarding parallelization is: do not worry about this *at all* unless you have already spent a long time profiling your problem and you are sure that parallelizing would help. 99% of the time it is much more productive to focus on improving serial speed.
Please try to follow Anthony's suggestion and split your queries into blocks, passing each block to PyTables. That should be a huge win. For example, fetch the first block with `SELECT * FROM your_table LIMIT 10000 OFFSET 0` and send the results to `Table.append`. Then fetch the second block with `SELECT * FROM your_table LIMIT 10000 OFFSET 10000` and pass it to `Table.append` as well, and so on until you have exhausted all the data in your tables.
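A minimal sketch of that loop, assuming a psycopg2 connection and an already-created PyTables Table whose description matches the columns of x.train (connection string, file name and node path below are illustrative, not taken from your setup):

    # Block-wise copy from PostgreSQL into an existing PyTables Table.
    # Illustrative sketch: adjust the connection string, file name and node path.
    import psycopg2
    import tables as tb

    BLOCK = 10000    # rows fetched and appended per iteration

    conn = psycopg2.connect("dbname=x user=postgres")
    cur = conn.cursor()

    h5 = tb.open_file("train.h5", mode="a")
    table = h5.root.train      # existing Table matching x.train's column types

    offset = 0
    while True:
        # ORDER BY a unique key so consecutive LIMIT/OFFSET windows
        # neither overlap nor skip rows.
        cur.execute(
            "SELECT * FROM x.train ORDER BY tr_id LIMIT %s OFFSET %s",
            (BLOCK, offset),
        )
        rows = cur.fetchall()
        if not rows:
            break                  # source table exhausted
        table.append(rows)         # one append per block, not per row
        offset += len(rows)

    table.flush()
    h5.close()
    cur.close()
    conn.close()

Note that OFFSET still makes PostgreSQL walk over all the skipped rows, so it gets slower as the offset grows; since tr_id is an indexed primary key, ranged conditions like `WHERE tr_id > %s AND tr_id <= %s` would be an even cheaper way to step through the table in blocks.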
Hope this helps,

Francesc

On Mar 19, 2012, at 11:36 AM, sreeaurovindh viswanathan wrote:

> Hi,
>
> Thanks for your reply. In that case, what will my querying efficiency be? Will I be able to query in parallel, i.e. run multiple queries against a single file? Also, if I do it in 6 chunks, will I be able to parallelize it?
>
> Thanks
> Sree aurovindh Viswanathan
>
> On Mon, Mar 19, 2012 at 10:01 PM, Anthony Scopatz <scop...@gmail.com> wrote:
> Is there any way that you can query and write in much larger chunks than 6? I don't know much about postgresql specifically, but in general HDF5 does much better if you can take larger chunks. Perhaps you could at least do the postgresql part in parallel.
>
> Be Well
> Anthony
>
> On Mon, Mar 19, 2012 at 11:23 AM, sreeaurovindh viswanathan <sreeaurovi...@gmail.com> wrote:
> The problem is the writing speed of my computer and the postgresql query performance. I will explain the scenario in detail.
>
> I have about 80 GB of data (with appropriate database indexes in place). I am trying to read it from a Postgresql database and write it into HDF5 using PyTables. I have 1 table and 5 variable arrays in one HDF5 file. The implementation of HDF5 is not multithreaded or enabled for symmetric multiprocessing.
>
> As far as the postgresql data is concerned, the overall record count is 140 million, and there are 5 related tables referenced through primary/foreign keys. I am not using joins, as they do not scale.
>
> So for a single record I do 6 lookups without joins and write the results into HDF5 format. For each lookup I do 6 inserts, into the table and its corresponding arrays.
>
> The queries are really simple:
>
>   select * from x.train where tr_id=1   (primary key & indexed)
>   select q_t from x.qt where q_id=2     (non-primary key, but indexed)
>   (and similarly four more queries)
>
> Each computer writes two HDF5 files, so the total count comes to around 20 files.
>
> Some calculations and statistics:
>
>   Total number of records:             143,700,000
>   Total number of records per file:    143,700,000 / 20 = 7,185,000
>   Total number of records in each file: 7,185,000 * 5 = 35,925,000
>
> Current postgresql database config:
>
> My current machine: 8 GB RAM with a 2nd-generation i7 processor.
>
> I made the following changes to the postgresql configuration file:
>   shared_buffers: 2 GB
>   effective_cache_size: 4 GB
>
> Note on current performance:
>
> I have run it for about ten hours and the performance is as follows: the total number of records written for a single file is only about 2,500,000 * 5 = 12,500,000, and it has written 2 such files. Considering the sizes, it would take at least 20 hrs for 2 files; with about 20 files in total that comes to roughly 200 hrs, i.e. about 9 days. I have to start my experiments as early as possible, and 10 days is too much. Can you please help me enhance the performance?
>
> Questions:
> 1. Should I use symmetric multiprocessing on my computer? In that case, what is suggested or preferable?
> 2. Should I use multithreading? In that case, any links or pointers would be of great help.
>
> Thanks
> Sree aurovindh V

--
Francesc Alted

_______________________________________________
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users