Hi,

Thanks for your suggestions.

Sorry for misphrasing my question. By querying speed I meant the speed of the
PyTables querying, not the PostgreSQL querying. To rephrase:

1) Will I be able to query (using in-kernel queries) a single HDF5 file with
PyTables in parallel from five different programs? And how will the efficiency
be affected in that case? A rough sketch of what I have in mind is below.
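This is only a sketch of the intended usage, assuming the newer
open_file/read_where names; the file path, node path, and column name are
placeholders for my real data. Each worker process opens the same file
read-only and runs its own in-kernel query:

    import multiprocessing as mp
    import tables

    H5_PATH = "data.h5"        # placeholder path
    TABLE_NODE = "/x/train"    # placeholder node

    def run_query(tr_id):
        # Each process opens the file independently in read-only mode,
        # so no file handle is shared between processes.
        with tables.open_file(H5_PATH, mode="r") as h5:
            table = h5.get_node(TABLE_NODE)
            # In-kernel query; 'tr_id' is assumed to be a column of the table.
            rows = table.read_where("tr_id == value", {"value": tr_id})
        return len(rows)

    if __name__ == "__main__":
        with mp.Pool(processes=5) as pool:
            print(pool.map(run_query, [1, 2, 3, 4, 5]))

Is this roughly how concurrent read-only access is supposed to be done, or is
there a safer pattern?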
Secondly, as per the suggestions, I will break the work into 6 chunks as you
advised and try to incorporate that into the code. I will also try to break my
query into blocks and write the results into the HDF5 tables in blocks, as
Francesc advised. But:

2) Can you please point me to an example of doing block writes to an HDF5 file
with PyTables? (Sorry for the naive question.) A sketch of what I think the
loop looks like follows.
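This is just my rough understanding of Francesc's suggestion, not working code:
psycopg2 on the PostgreSQL side, a block size of 10000, placeholder connection
string, file name, and node path, and an existing PyTables table whose
description matches the column order of the SELECT.

    import psycopg2
    import tables

    BLOCK = 10000  # rows fetched and appended per block (placeholder)

    conn = psycopg2.connect("dbname=mydb user=me")  # placeholder DSN
    cur = conn.cursor()

    h5 = tables.open_file("train.h5", mode="a")     # placeholder file
    table = h5.get_node("/x/train")                 # existing Table node

    offset = 0
    while True:
        # Fetch one block from PostgreSQL; ORDER BY keeps the paging stable.
        cur.execute(
            "SELECT * FROM x.train ORDER BY tr_id LIMIT %s OFFSET %s",
            (BLOCK, offset),
        )
        rows = cur.fetchall()
        if not rows:
            break
        # Append the whole block at once instead of inserting row by row.
        table.append(rows)
        offset += BLOCK

    table.flush()
    h5.close()
    cur.close()
    conn.close()

Is Table.append the right call for this, and is appending a list of tuples per
block the intended usage?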
Thanks
Sree aurovindh V
On Mon, Mar 19, 2012 at 10:16 PM, Anthony Scopatz <scop...@gmail.com> wrote:
> What Francesc said ;)
>
>
> On Mon, Mar 19, 2012 at 11:43 AM, Francesc Alted <fal...@gmail.com> wrote:
>
>> My advice regarding parallelization is: do not worry about this *at all*
>> unless you have already spent a long time profiling your problem and you are
>> sure that parallelizing could help. 99% of the time it is much more
>> productive to focus on improving serial speed.
>>
>> Please try to follow Anthony's suggestion and split your queries into
>> blocks, and pass these blocks to PyTables. That would be a huge
>> win. For example, use:
>>
>> SELECT * FROM `your_table` LIMIT 10000 OFFSET 0
>>
>> for the first block, and send the results to `Table.append`. Then go for
>> the second block as:
>>
>> SELECT * FROM `your_table` LIMIT 10000 OFFSET 10000
>>
>> and pass this to `Table.append`. And so on and so forth until you
>> exhaust all the data in your tables.
>>
>> Hope this helps,
>>
>> Francesc
>>
>> On Mar 19, 2012, at 11:36 AM, sreeaurovindh viswanathan wrote:
>>
>> > Hi,
>> >
>> > Thanks for your reply. In that case, how will my querying efficiency be
>> > affected? Will I be able to query in parallel, i.e., will I be able to run
>> > multiple queries on a single file? Also, if I do it in 6 chunks, will I be
>> > able to parallelize it?
>> >
>> >
>> > Thanks
>> > Sree aurovindh Viswanathan
>> > On Mon, Mar 19, 2012 at 10:01 PM, Anthony Scopatz <scop...@gmail.com>
>> wrote:
>> > Is there any way that you can query and write in much larger chunks
>> > than 6? I don't know much about PostgreSQL specifically, but in general
>> > HDF5 does much better if you can use larger chunks. Perhaps you could at
>> > least do the PostgreSQL queries in parallel.
>> >
>> > Be Well
>> > Anthony
>> >
>> > On Mon, Mar 19, 2012 at 11:23 AM, sreeaurovindh viswanathan <
>> sreeaurovi...@gmail.com> wrote:
>> > The problem is with respect to the writing speed of my computer and the
>> > PostgreSQL query performance. I will explain the scenario in detail.
>> >
>> > I have about 80 GB of data (along with appropriate database indexes in
>> > place). I am trying to read it from a PostgreSQL database and write it into
>> > HDF5 using PyTables. I have 1 table and 5 variable arrays in one HDF5 file.
>> > The implementation of HDF5 is not multithreaded or enabled for symmetric
>> > multiprocessing.
>> >
>> > As far as the PostgreSQL tables are concerned, the overall record count is
>> > 140 million, and I have 5 tables related by primary/foreign keys. I am not
>> > using joins as they do not scale.
>> >
>> > So for a single lookup I do 6 lookups without joins and write the results
>> > into HDF5 format. For each lookup I do 6 inserts into the table and its
>> > corresponding arrays.
>> >
>> > The queries are really simple:
>> >
>> > select * from x.train where tr_id=1 (primary key & indexed)
>> >
>> > select q_t from x.qt where q_id=2 (non-primary key but indexed)
>> >
>> > (similarly, four more queries)
>> >
>> > Each computer writes two HDF5 files, and hence the total count comes to
>> > around 20 files.
>> >
>> > Some calculations and statistics:
>> >
>> > Total number of records: 143,700,000
>> > Total number of records per file: 143,700,000 / 20 = 7,185,000
>> > Total number of records in each file: 7,185,000 * 5 = 35,925,000
>> >
>> > Current PostgreSQL database config:
>> >
>> > My current machine: 8 GB RAM with an i7 2nd-generation processor.
>> >
>> > I made the following changes to the PostgreSQL configuration file:
>> > shared_buffers: 2 GB, effective_cache_size: 4 GB
>> >
>> > Note on current performance:
>> >
>> > I have run it for about ten hours and the performance is as follows: the
>> > total number of records written for a single file is only about 2,500,000 *
>> > 5 = 12,500,000. It has written 2 such files. Considering the size, it would
>> > take me at least 20 hrs for 2 files. I have about 10 files, and hence the
>> > total would be 200 hrs = 9 days. I have to start my experiments as early as
>> > possible, and 10 days is too much. Can you please help me enhance the
>> > performance?
>> >
>> > Questions: 1. Should I use symmetric multiprocessing on my computer? In
>> > that case, what is suggested or preferable? 2. Should I use multithreading?
>> > In that case, any links or pointers would be of great help.
>> >
>> >
>> >
>> > Thanks
>> >
>> > Sree aurovindh V
>> >
>> >
>> >
>>
>> -- Francesc Alted
>>
>
>
>
>
>
_______________________________________________
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users