Re: [Pytables-users] Merging multiple DB

Anthony Scopatz Tue, 06 Mar 2012 19:39:00 -0800

Hi Daπid,

So in general there are a couple of different ways of tackling this issue.
The one that you choose will depend on your desired scalability and existing
architecture.  I'll outline some options now:

1) What you described.  Every process writes out its own library and then
a post-process sweeps through and combines them all later.  This is
probably the easiest to implement.  You wouldn't even need to dump them
to ASCII.  Tables have an append() method you would find useful.

2) Have one library per node (i.e. 10 total libraries, 4 processes per
library).  If the writing is done in a thread-safe way, then you only have
to sweep through and post-process 10 files.  Naturally, the individual
files are larger.

3) Have one master process whose sole job it is to write the single library.
All other 'compute' processes communicate with this process.  The compute
processes will calculate a row of the table and send it back over the wire
to the master process as a tuple.  The master process will take this row,
put it on a queue and work through the queue, adding rows to the table when
it has free time.  For communication, you could use something like XML-RPC
(in the Python standard library; JSON-RPC needs a third-party package),
ZeroMQ / pyzmq (which is easy to use and has a lot of nice features) or
MPI / mpi4py (which is meant for high performance computing concerns).
No post-processing is needed for this strategy.
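
To make option 3 a bit more concrete, here is a minimal single-machine
sketch of the single-writer pattern.  It is only an illustration under
some assumptions: threads and a standard-library queue stand in for the
compute processes and the network transport (ZeroMQ, MPI, ...), and the
names compute_worker, writer, run and append_rows are made up for this
example.  In real use append_rows would be something like the append()
method of the table.

```python
import queue
import threading

STOP = object()  # sentinel a worker sends when it has no more rows


def compute_worker(worker_id, n_rows, row_queue):
    """Stand-in for a compute process: build rows and hand them to the writer."""
    for i in range(n_rows):
        # Each row is a plain tuple, as it would be when sent over the wire.
        row_queue.put((worker_id, i, float(worker_id * i)))
    row_queue.put(STOP)


def writer(row_queue, n_workers, append_rows, batch_size=100):
    """The only code that touches the table; drains the queue in batches."""
    batch = []
    finished = 0
    while finished < n_workers:
        item = row_queue.get()
        if item is STOP:
            finished += 1
            continue
        batch.append(item)
        if len(batch) >= batch_size:
            append_rows(batch)
            batch = []
    if batch:
        # Write whatever is left once every worker has finished.
        append_rows(batch)


def run(n_workers=4, n_rows=250):
    """Start the workers, run the writer, and return everything it stored."""
    stored = []  # hypothetical sink; replace stored.extend with table.append
    row_queue = queue.Queue()
    workers = [
        threading.Thread(target=compute_worker, args=(w, n_rows, row_queue))
        for w in range(n_workers)
    ]
    for t in workers:
        t.start()
    writer(row_queue, n_workers, stored.extend)
    for t in workers:
        t.join()
    return stored
```

Batching the writes keeps the number of append calls small, and because
only the writer touches the file, no locking around the library is needed.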

None of this should ever require you to write a plain text file.  I hope
that this helps!
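
For option 1, the post-processing sweep could look roughly like the
sketch below.  It assumes the PyTables 3.x function names (open_file,
create_table, get_node), that every per-process file holds a table with
the same name and the same columns directly under the root group, and
that each table fits in memory.  The function name merge_tables and the
table name "results" are placeholders, not anything from your setup.

```python
import tables  # PyTables; 3.x naming (open_file, create_table) is assumed


def merge_tables(input_paths, output_path, table_name="results"):
    """Append every row of /<table_name> from each input file to one table.

    Returns the total number of rows written to the merged table.
    """
    total = 0
    with tables.open_file(output_path, mode="w") as out_file:
        merged = None
        for path in input_paths:
            with tables.open_file(path, mode="r") as in_file:
                source = in_file.get_node("/", table_name)
                if merged is None:
                    # Reuse the first file's column layout for the merged table.
                    merged = out_file.create_table(
                        "/", table_name, description=source.description
                    )
                # read() returns the whole table as a NumPy structured array;
                # for very large tables, read and append in slices instead.
                merged.append(source.read())
                total += source.nrows
        if merged is not None:
            merged.flush()
    return total
```

With forty files named, say, run_00.h5 to run_39.h5 (hypothetical names),
this would be called as merge_tables(sorted(glob.glob("run_*.h5")),
"merged.h5") after importing glob.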

Be Well
Anthony

On Tue, Mar 6, 2012 at 6:08 PM, Daπid <davidmen...@gmail.com> wrote:

> It was me, at that moment I hadn't confirmed my subscription. Sorry!
>
> On Wed, Mar 7, 2012 at 1:06 AM, Francesc Alted <fal...@pytables.org>
> wrote:
> > This has been probably sent from an unsubscribed address.
> >
> > Begin forwarded message:
> >
> > From: pytables-users-boun...@lists.sourceforge.net
> > Subject: Auto-discard notification
> > Date: March 6, 2012 2:55:54 PM PST
> > To: pytables-users-ow...@lists.sourceforge.net
> >
> > The attached message has been automatically discarded.
> > From: Daπid <davidmen...@gmail.com>
> > Subject: Merging multiple DB
> > Date: March 6, 2012 2:55:27 PM PST
> > To: pytables-users@lists.sourceforge.net
> >
> >
> > Hello.
> >
> > First of all, I have to warn I am an absolute newbie to PyTables and
> > DB, so please forgive my conceptual holes.
> >
> > I am running a Monte Carlo simulation of an embarrassingly
> > parallelizable problem. The calculations are being done on a grid of
> > 10 quad-core computers, each running four independent processes. I
> > assume the safest approach is to generate one DB per process, ending
> > up with forty different (but equivalent) DBs. My question is: is there
> > any easy way of merging all of them?
> >
> > The final size of the DB will be around ten columns of numbers by a
> > few million rows, relatively small, so compression is not required
> > and reading optimization is not vital.
> >
> > The simplest (and maybe shabby) way I can think of is to have every
> > process write its own ASCII file, then read them all and insert them
> > into a master DB, but this looks inefficient and cumbersome to me.
> >
> >
> > Thank you very much,
> >
> > David.
> >
> >
> >
> >
> > -- Francesc Alted
> >
> ------------------------------------------------------------------------------
> > Virtualization & Cloud Management Using Capacity Planning
> > Cloud computing makes use of virtualization - but cloud computing
> > also focuses on allowing computing to be delivered as a service.
> > http://www.accelacomm.com/jaw/sfnl/114/51521223/
> > _______________________________________________
> > Pytables-users mailing list
> > Pytables-users@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/pytables-users
> >
