[PyCUDA] Compute several FFT with GPU using Python multiprocessing and pyfft: how to avoid GPU memory leak?

Marco Tazzari Wed, 21 Jan 2015 06:58:32 -0800

I am trying to implement in Python the following pattern for *multi-CPU and
single-GPU* computation using the *pycuda* and *pyfft* packages.

I would like to have *several processes* (e.g. launched with
multiprocessing.Pool()), *each of them* able to perform *FFTs on the GPU
(using NVIDIA CUDA)*.

However, I have the following problem: if I run too many processes or too
many FFTs per process, *the overall script hangs without terminating* (and
without computing all the FFTs that are due). From further investigation I
suppose this is due to the *memory limit* of the GPU (currently 2048 MB on an
NVIDIA GeForce GT 750M). It seems that the multiprocessing pool is not able
to regain control. Is there any way to avoid this?

Since each process requires less than 2048 MB, I would like to have something
like a *queue* where each process can /book/ the usage of the GPU and, when a
process releases the context, the next process in the queue starts using it.
Is this doable? Alternatively, is it possible to enforce that only one
process uses the GPU at a given time?

I have tried these solutions separately, but they do not work (or I have
probably not implemented them correctly):

 1. synchronize the stream, with proc_stream.synchronize()
 2. clear the context caches, with pycuda.tools.clear_context_caches()
 3. change the compute mode, with
    cuda.compute_mode = cuda.compute_mode.EXCLUSIVE

*Note:* Solution 2 seems to free some memory, but it makes the computation
much slower and does not solve the problem: e.g. when increasing the number
of FFTs to be computed, the script shows the same behaviour.

Here is the code. To start from a simple task,
here each process computes 1 FFT (one can then use the batch option in
execute() to do more FFTs in a row).

    import multiprocessing

    import numpy as np
    import pycuda.driver as cuda
    import pycuda.gpuarray as gpuarray
    from pycuda.tools import make_default_context
    from pyfft.cuda import Plan

    def main():
        # generate a simple matrix (e.g. an image with a signal at the center)
        size = 4096
        center = size // 2
        in_matrix = np.zeros((size, size), dtype='complex64')
        in_matrix[center:center+2, center:center+2] = 10.

        pool_size = 4  # integer up to multiprocessing.cpu_count()
        pool = multiprocessing.Pool(processes=pool_size)
        func = FuncWrapper(in_matrix, size)
        nffts = 16  # total number of FFTs to be computed
        par = np.arange(nffts)

        results = pool.map(func, par)
        pool.close()
        pool.join()

        print(results)

And here is
the function wrapper:

    class FuncWrapper(object):
        def __init__(self, matrix, size):
            self.in_matrix = matrix
            self.size = size
            print("Func initialized with matrix size=%i" % size)

        def __call__(self, par):
            proc_id = multiprocessing.current_process().name

            # take control over the GPU
            cuda.init()
            context = make_default_context()
            device = context.get_device()
            proc_stream = cuda.Stream()

            # move data to GPU
            # multiplication self.in_matrix*par is just to have each process
            # computing different matrices
            in_map_gpu = gpuarray.to_gpu(self.in_matrix*par)

            # create Plan, execute FFT and get back the result from GPU
            plan = Plan((self.size, self.size), dtype=np.complex64,
                        fast_math=False, normalize=False, wait_for_finish=True,
                        stream=proc_stream)
            plan.execute(in_map_gpu, wait_for_finish=True)
            result = in_map_gpu.get()

            # free memory on GPU
            del in_map_gpu

            mem = np.array(cuda.mem_get_info())/1.e6
            print("%s free=%f\ttot=%f" % (proc_id, mem[0], mem[1]))

            # release context
            context.pop()

            return par

Now, with nffts=16 and pool_size=4
the script terminates correctly and gives this output:

    Func initialized with matrix size=4096
    PoolWorker-1 free=1481.019392   tot=2147.024896
    PoolWorker-2 free=1331.011584   tot=2147.024896
    PoolWorker-3 free=1181.003776   tot=2147.024896
    PoolWorker-4 free=1030.631424   tot=2147.024896
    PoolWorker-1 free=881.074176    tot=2147.024896
    PoolWorker-2 free=731.746304    tot=2147.024896
    PoolWorker-3 free=582.418432    tot=2147.024896
    PoolWorker-4 free=433.090560    tot=2147.024896
    PoolWorker-1 free=582.754304    tot=2147.024896
    PoolWorker-2 free=718.946304    tot=2147.024896
    PoolWorker-3 free=881.254400    tot=2147.024896
    PoolWorker-4 free=1030.684672   tot=2147.024896
    PoolWorker-1 free=868.028416    tot=2147.024896
    PoolWorker-2 free=731.713536    tot=2147.024896
    PoolWorker-3 free=582.402048    tot=2147.024896
    PoolWorker-4 free=433.090560    tot=2147.024896
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]

But with nffts=18 and pool_size=4 the script
does not terminate and gives this output, remaining stuck at the last line:

    Func initialized with matrix size=4096
    PoolWorker-1 free=1416.392704   tot=2147.024896
    PoolWorker-2 free=982.544384    tot=2147.024896
    PoolWorker-1 free=1101.037568   tot=2147.024896
    PoolWorker-2 free=682.991616    tot=2147.024896
    PoolWorker-3 free=815.747072    tot=2147.024896
    PoolWorker-4 free=396.918784    tot=2147.024896
    PoolWorker-3 free=503.046144    tot=2147.024896
    PoolWorker-4 free=397.144064    tot=2147.024896
    PoolWorker-1 free=531.361792    tot=2147.024896
    PoolWorker-1 free=397.246464    tot=2147.024896
    PoolWorker-2 free=518.610944    tot=2147.024896
    PoolWorker-2 free=397.021184    tot=2147.024896
    PoolWorker-3 free=517.193728    tot=2147.024896
    PoolWorker-4 free=397.021184    tot=2147.024896
    PoolWorker-3 free=504.336384    tot=2147.024896
    PoolWorker-4 free=149.123072    tot=2147.024896
    PoolWorker-1 free=283.340800    tot=2147.024896

Many thanks for your help!
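
As a rough check of the memory numbers in the output above, the small sketch below computes the size of one 4096x4096 complex64 array and compares it with the drop between the first two "free" readings of the nffts=16 run. The helper name array_megabytes is made up for illustration, and MB is taken as 10**6 bytes, matching the division by 1.e6 in the script.

```python
def array_megabytes(size, bytes_per_element=8):
    """Size in MB (10**6 bytes) of a size x size array.

    complex64 stores two 32-bit floats, i.e. 8 bytes per element.
    """
    return size * size * bytes_per_element / 1.e6


# One input matrix as used in main(): 4096 x 4096 complex64.
matrix_mb = array_megabytes(4096)
print("one 4096x4096 complex64 array: %.3f MB" % matrix_mb)

# First two "free" readings printed in the nffts=16 run (values from the log).
observed_drop_mb = 1481.019392 - 1331.011584
print("drop between the first two readings: %.3f MB" % observed_drop_mb)
```

The array alone accounts for about 134 MB, while consecutive readings differ by about 150 MB, so each call appears to hold somewhat more than the array itself. Where the remainder comes from is not established here; it is only a hint that memory is not being returned between calls.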

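One possible way to let only one process use the GPU at a time, sketched here as an untested assumption rather than a known fix for the hang, is to share a multiprocessing.Lock among the pool workers through the Pool initializer. The names init_worker, gpu_task and _gpu_lock are hypothetical, and the squared number is only a placeholder for the real GPU section (context creation, Plan, FFT, context release).

```python
import multiprocessing

# Hypothetical module-level slot for the shared lock; each pool worker fills
# it once via init_worker(). These names are illustrative, not PyCUDA API.
_gpu_lock = None


def init_worker(lock):
    """Pool initializer: remember the lock shared by all workers."""
    global _gpu_lock
    _gpu_lock = lock


def gpu_task(par):
    """Run one job while holding the lock, so GPU sections never overlap."""
    with _gpu_lock:
        # Placeholder for the real GPU section: create the context, build
        # the Plan, execute the FFT, copy the result back, release the
        # context. Here we only square the parameter.
        result = par * par
    return result


def main():
    lock = multiprocessing.Lock()
    pool = multiprocessing.Pool(processes=4, initializer=init_worker,
                                initargs=(lock,))
    try:
        results = pool.map(gpu_task, range(18))
    finally:
        pool.close()
        pool.join()
    print(results)


if __name__ == "__main__":
    main()
```

The lock only serializes the GPU section; CPU-side work placed outside the with block can still run in parallel. It does not by itself free GPU memory that a finished call has failed to release.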



--
View this message in context: 
http://pycuda.2962900.n2.nabble.com/Compute-several-FFT-with-GPU-using-Python-multiprocessing-and-pyfft-how-to-avoid-GPU-memory-leak-tp7575532.html
Sent from the PyCuda mailing list archive at Nabble.com.

