@Lev, thanks for the tip; I will look into it.

In the meanwhile, I am running into some speed issues. I notice that
training slows down progressively, by almost 50%, in just 7000 updates: it
starts at about 2.6 sec/mini-batch (average speed), but after 7000
mini-batches the time increases to 3.7 sec/mini-batch.

I suspect that I may not be sending the host memory pointers but the actual
arrays, serialized by zmq's send_pyobj (see the code below). Could someone
confirm whether I am doing this correctly? Should I just be sending/
receiving host memory pointers?
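For what it's worth, my understanding (please correct me if I'm wrong) is
that send_pyobj()/recv_pyobj() are pickle round-trips under the hood, so
the receiving side always gets a fresh copy of the data, never a pointer to
my pagelocked buffers. A stdlib-only sketch of that behaviour (array.array
stands in for a pagelocked numpy buffer, purely for illustration):

```python
import array
import pickle

# stand-in for one pagelocked host buffer
host_buf = array.array('f', [1.0, 2.0, 3.0])

# send_pyobj()/recv_pyobj() amount to a pickle round-trip
clone = pickle.loads(pickle.dumps(host_buf))

print(clone == host_buf)    # True: the values survive the round-trip
print(clone is host_buf)    # False: it is a new copy, not a shared pointer
```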

Also, is it correct that the host memory pointers don't change throughout
training? I call pagelocked_zeros_like once and then just copy into the
same memory with memcpy_dtoh_async. If so, I thought I wouldn't have to
send/receive the tr_params_host_list/tr_params_host_other_list every time;
however, that didn't work.

Here are the relevant snippets from my code:

    # We need to create gpuarrays to receive values from the other gpu;
    # done once before training
    for tr_param in itemlist(tparams):
        # create empty gpuarrays to receive from the other gpu
        tr_param_other = theano.shared(tr_param.get_value() * 0.)
        tr_param_ga_other = theano.misc.pycuda_utils.to_gpuarray(
            tr_param_other.container.value)
        tr_params_other_list.append(tr_param_other)
        tr_params_ga_other_list.append(tr_param_ga_other)

        # gpuarrays for the current params
        tr_param_ga = theano.misc.pycuda_utils.to_gpuarray(
            tr_param.container.value)
        tr_param_host = drv.pagelocked_zeros_like(tr_param_ga)
        tr_params_ga_list.append(tr_param_ga)
        tr_params_host_list.append(tr_param_host)

    # Now during training, we need to copy to host and then exchange the
    # params in host memory
    for x, y in train:
        mb_start = time.time()
        ...
        f_cost = f_update(x, y)

        if numpy.mod(uidx, syncFreq) == 0:
            # copy from device to host memory and pass the host params list
            d2h_start = time.time()
            for tr_param_host, tr_param_ga in zip(tr_params_host_list,
                                                  tr_params_ga_list):
                drv.memcpy_dtoh_async(tr_param_host, tr_param_ga.ptr)

            sock.send_pyobj(tr_params_host_list)
            d2h = time.time() - d2h_start
            d2h_tot += d2h

            h2d_start = time.time()
            # receive the host params list
            tr_params_host_other_list = sock.recv_pyobj()

            for tr_param_ga_other, tr_param_host_other in zip(
                    tr_params_ga_other_list, tr_params_host_other_list):
                drv.memcpy_htod_async(tr_param_ga_other.ptr,
                                      tr_param_host_other)
            h2d = time.time() - h2d_start
            h2d_tot += h2d
            f_avg_params(x, y)    # average the params on the two gpus

        mb_tot += time.time() - mb_start

The other possibility is that send_pyobj() and recv_pyobj() are blocking,
causing the slowdown while they wait. But the d2h/h2d times increase only
marginally, for example from 0.1 secs per minibatch in the beginning to
0.24 secs after 7k minibatches, so that clearly doesn't explain more than a
second of slowdown. In any case, I have now added zmq.NOBLOCK to
send_pyobj(); I will have to see if it helps.

Thanks a lot for any help on these.

Best
- Baskaran


On Wed, Nov 11, 2015 at 1:58 PM, Lev Givon <[email protected]> wrote:

> Received from Baskaran Sankaran on Wed, Nov 11, 2015 at 01:50:09PM EST:
> > Thanks Andreas for the hint. Actually, what I am trying is a little bit
> > more complex than that. I have two python processes running on two GPUs.
> > In a simpler setting, I have array x in gpu0's Python process to be
> > transferred to gpu1's process and vice versa.
> >
> > I solved it in this schema:
> > * alloc-host-memory
> > * memcpy from device to host (gpu0 to host; gpu1 to host)
> > * send/receive objects in host memory to the Python process in the other
> > gpu
> > * memcpy from host to device within respective gpu
> >
> > The solution and output from a sample run follow. Now, I wonder if it is
> > possible to improve this further. One possibility is whether the device
> > to host copy can be eliminated. Because I need to transfer several
> > theano tensors between multiple (up to 4) gpus, and I need to do this
> > quite frequently (say every nth mini-batch) during training.
> >
> > Note: Not all gpus are P2P capable, so memcpy_peer wouldn't work.
>
> If you have access to a recent release of OpenMPI or MVAPICH2 built with
> CUDA
> support, you may wish to try using mpi4py for transferring data between
> GPUArrays in different processes; you can pass the MPI wrapper functions
> the
> GPUArray pointers and let the underlying MPI implementation determine when
> to
> take advantage of P2P.
> --
> Lev Givon
> Bionet Group | Neurokernel Project
> http://lebedov.github.io/
> http://neurokernel.github.io/
>
>
_______________________________________________
PyCUDA mailing list
[email protected]
http://lists.tiker.net/listinfo/pycuda
