Thanks for your clarification and immense help.
Regards,
Sree aurovindh V
On Mon, Mar 19, 2012 at 11:58 PM, Anthony Scopatz <scop...@gmail.com> wrote:
> On Mon, Mar 19, 2012 at 12:08 PM, sreeaurovindh viswanathan <
> sreeaurovi...@gmail.com> wrote:
>
> [snip]
>
>
>> 2) Can you please point me to an example of how to do a block HDF5 file
>> write using PyTables? (Sorry for this naive question.)
>>
>
> The Table.append() method (
> http://pytables.github.com/usersguide/libref.html#tables.Table.append)
> allows you to write multiple rows at the same time. Arrays have similar
> methods (http://pytables.github.com/usersguide/libref.html#earray-methods)
> if they are extensible. Please note that these methods accept any
> sequence which can be converted to a numpy record array!
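>
> A minimal sketch of such a block write, assuming the PyTables 2.x API of
> the time and a hypothetical two-column table (adjust the names and dtypes
> to your schema):
>
>     import numpy as np
>     import tables
>
>     h5 = tables.openFile("data.h5", mode="w")
>     table = h5.createTable("/", "train",
>                            {"tr_id": tables.Int64Col(pos=0),
>                             "q_t": tables.Float64Col(pos=1)})
>
>     # Build a block of rows as a NumPy record array and append it
>     # in a single call, instead of writing row by row.
>     block = np.zeros(10000, dtype=[("tr_id", "i8"), ("q_t", "f8")])
>     table.append(block)
>     table.flush()
>     h5.close()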
>
> Be Well
> Anthony
>
>
>>
>> Thanks
>> Sree aurovindh V
>>
>> On Mon, Mar 19, 2012 at 10:16 PM, Anthony Scopatz <scop...@gmail.com> wrote:
>>
>>> What Francesc said ;)
>>>
>>>
>>> On Mon, Mar 19, 2012 at 11:43 AM, Francesc Alted <fal...@gmail.com> wrote:
>>>
>>>> My advice regarding parallelization is: do not worry about this *at
>>>> all* unless you have already spent a long time profiling your problem and
>>>> you are sure that parallelizing could help. 99% of the time it is much more
>>>> productive to focus on improving serial speed.
>>>>
>>>> Please try to follow Anthony's suggestion and split your queries into
>>>> blocks, passing each block to PyTables. That would represent a huge
>>>> win. For example, use:
>>>>
>>>> SELECT * FROM your_table LIMIT 10000 OFFSET 0
>>>>
>>>> for the first block, and send the results to `Table.append`. Then fetch
>>>> the second block with:
>>>>
>>>> SELECT * FROM your_table LIMIT 10000 OFFSET 10000
>>>>
>>>> and pass this to `Table.append`, and so on until you have exhausted all
>>>> the data in your tables. (Note that PostgreSQL uses the LIMIT ... OFFSET ...
>>>> form rather than MySQL's LIMIT offset, count.)
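>>>>
>>>> A rough sketch of that loop, assuming psycopg2 on the PostgreSQL side
>>>> (the DSN, table, and column names are placeholders):
>>>>
>>>>     import psycopg2
>>>>     import tables
>>>>
>>>>     conn = psycopg2.connect("dbname=mydb")   # hypothetical DSN
>>>>     cur = conn.cursor()
>>>>     h5 = tables.openFile("train.h5", mode="a")
>>>>     table = h5.root.train   # an already-created table whose columns
>>>>                             # match the SELECT output order
>>>>
>>>>     BLOCK = 10000
>>>>     offset = 0
>>>>     while True:
>>>>         # ORDER BY makes the LIMIT/OFFSET paging deterministic.
>>>>         cur.execute("SELECT * FROM x.train ORDER BY tr_id "
>>>>                     "LIMIT %s OFFSET %s", (BLOCK, offset))
>>>>         rows = cur.fetchall()
>>>>         if not rows:
>>>>             break
>>>>         table.append(rows)   # one append per block, not per row
>>>>         offset += BLOCK
>>>>
>>>>     table.flush()
>>>>     h5.close()
>>>>     conn.close()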
>>>>
>>>> Hope this helps,
>>>>
>>>> Francesc
>>>>
>>>> On Mar 19, 2012, at 11:36 AM, sreeaurovindh viswanathan wrote:
>>>>
>>>> > Hi,
>>>> >
>>>> > Thanks for your reply. In that case, what will my querying
>>>> efficiency be? Will I be able to query in parallel, i.e. run
>>>> multiple queries on a single file? Also, if I do it in 6 chunks, will I be
>>>> able to parallelize it?
>>>> >
>>>> >
>>>> > Thanks
>>>> > Sree aurovindh Viswanathan
>>>> > On Mon, Mar 19, 2012 at 10:01 PM, Anthony Scopatz <scop...@gmail.com>
>>>> wrote:
>>>> > Is there any way that you can query and write in much larger chunks
>>>> than 6? I don't know much about PostgreSQL specifically, but in general
>>>> HDF5 does much better if you can take larger chunks. Perhaps you could at
>>>> least do the PostgreSQL queries in parallel.
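>>>> >
>>>> > A hedged sketch of one way to do that: a pool of reader processes
>>>> > fetches blocks in parallel while a single process appends to the HDF5
>>>> > file (the DSN, table name, and block size are placeholders):
>>>> >
>>>> >     import multiprocessing as mp
>>>> >     import psycopg2
>>>> >     import tables
>>>> >
>>>> >     BLOCK = 10000
>>>> >
>>>> >     def fetch_block(offset):
>>>> >         # Each worker opens its own connection; psycopg2 connections
>>>> >         # must not be shared across processes.
>>>> >         conn = psycopg2.connect("dbname=mydb")   # hypothetical DSN
>>>> >         cur = conn.cursor()
>>>> >         cur.execute("SELECT * FROM x.train ORDER BY tr_id "
>>>> >                     "LIMIT %s OFFSET %s", (BLOCK, offset))
>>>> >         rows = cur.fetchall()
>>>> >         conn.close()
>>>> >         return rows
>>>> >
>>>> >     if __name__ == "__main__":
>>>> >         pool = mp.Pool(4)   # 4 parallel readers
>>>> >         h5 = tables.openFile("train.h5", mode="a")
>>>> >         table = h5.root.train
>>>> >         # imap yields blocks in order; only this process ever
>>>> >         # touches the HDF5 file, which is not multithreaded.
>>>> >         for rows in pool.imap(fetch_block, range(0, 143700000, BLOCK)):
>>>> >             if rows:
>>>> >                 table.append(rows)
>>>> >         pool.close()
>>>> >         pool.join()
>>>> >         table.flush()
>>>> >         h5.close()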
>>>> >
>>>> > Be Well
>>>> > Anthony
>>>> >
>>>> > On Mon, Mar 19, 2012 at 11:23 AM, sreeaurovindh viswanathan <
>>>> sreeaurovi...@gmail.com> wrote:
>>>> > The problem is with respect to the writing speed of my computer and
>>>> the PostgreSQL query performance. I will explain the scenario in detail.
>>>> >
>>>> > I have about 80 GB of data (along with appropriate database indexes in
>>>> place). I am trying to read it from a PostgreSQL database and write it into
>>>> HDF5 using PyTables. I have 1 table and 5 variable arrays in one HDF5
>>>> file. The HDF5 implementation is not multithreaded or enabled for
>>>> symmetric multiprocessing.
>>>> >
>>>> > As far as the PostgreSQL tables are concerned, the overall record count
>>>> is 140 million, and I have 5 tables related by primary/foreign keys. I am
>>>> not using joins as they do not scale.
>>>> >
>>>> > So for a single record I do 6 lookups without joins and write the
>>>> results into HDF5 format. For each lookup I do 6 inserts into the table
>>>> and its corresponding arrays.
>>>> >
>>>> > The queries are really simple:
>>>> >
>>>> > select * from x.train where tr_id=1 (primary key & indexed)
>>>> >
>>>> > select q_t from x.qt where q_id=2 (non-primary key but indexed)
>>>> >
>>>> > (and similarly four more queries)
>>>> >
>>>> > Each computer writes two HDF5 files, and hence the total count comes
>>>> to around 20 files.
>>>> >
>>>> > Some calculations and statistics:
>>>> >
>>>> > Total number of records : 143,700,000
>>>> >
>>>> > Total number of records per file : 143,700,000 / 20 = 7,185,000
>>>> >
>>>> > Total number of rows in each file : 7,185,000 * 5 = 35,925,000
>>>> > Current PostgreSQL database config:
>>>> >
>>>> > My current machine: 8 GB RAM with a 2nd-generation i7 processor.
>>>> >
>>>> > I made the following changes to the PostgreSQL configuration file
>>>> (see the snippet below): shared_buffers: 2 GB, effective_cache_size: 4 GB.
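>>>> >
>>>> > For reference, the corresponding lines in postgresql.conf (these are
>>>> > the standard parameter names) would read:
>>>> >
>>>> >     shared_buffers = 2GB
>>>> >     effective_cache_size = 4GB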
>>>> >
>>>> > Note on current performance:
>>>> >
>>>> > I have run it for about ten hours, and the performance is as follows:
>>>> the total number of records written for a single file is only about
>>>> 2,500,000 * 5 = 12,500,000. It has written 2 such files so far; considering
>>>> the size, it would take me at least 20 hrs for 2 files. I have about 10
>>>> files, and hence the total would be 200 hrs, i.e. about 9 days. I have to
>>>> start my experiments as early as possible, and 10 days is too much. Can you
>>>> please help me enhance the performance?
>>>> >
>>>> > Questions: 1. Should I use symmetric multiprocessing on my
>>>> computer? In that case, what is suggested or preferable? 2. Should I use
>>>> multithreading? In that case, any links or pointers would be of great
>>>> help.
>>>> >
>>>> > Thanks
>>>> >
>>>> > Sree aurovindh V
>>>> >
>>>>
>>>> -- Francesc Alted