Hi Lorenzo:

This is more of a question for the Python community.  However, there are a
couple of things I have noticed.  Pandas tends to be much slower than
working in numpy directly.
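For instance, elementwise arithmetic on the underlying numpy arrays skips
pandas' index-alignment machinery. A minimal sketch (with made-up data, just
to illustrate the idea):

```python
import numpy as np
import pandas as pd

# Made-up coefficient table, purely for illustration
df = pd.DataFrame({'k1': np.random.rand(100000),
                   'k2': np.random.rand(100000)})

# pandas elementwise arithmetic goes through index alignment
out_pd = df['k1'] * 2.0 + df['k2'] * 3.0

# the same computation on the raw numpy arrays avoids that overhead
out_np = df['k1'].values * 2.0 + df['k2'].values * 3.0

# both produce identical values
assert np.allclose(out_pd.values, out_np)
```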
I never saw an improvement in timings when using Pool().  What I do is
use Process() and Queue() or JoinableQueue() from the multiprocessing
library.

You can set up a pool of workers that all read from the same input queue.
Then you can just feed your data into the queue so the workers can process
the maps.





Jerl Simpson
Sr. Systems Engineer
Weather Trends International
http://www.weathertrends360.com/

This communication is privileged and may contain confidential information.
It's intended only for the use of the person or entity named above.
If you are not the intended recipient, do not distribute or copy this
communication.
If you have received this communication in error,
please notify the sender immediately and return the original to the
email address above.
© Copyright 2016 Weather Trends International, Inc.


On Wed, Mar 2, 2016 at 5:44 PM, Lorenzo Bottaccioli <
[email protected]> wrote:

> Hi,
> I'm trying to parallelize a raster calculation with Gdal_calc.py,
> but I'm getting really bad results. I need to perform several raster
> operations of the form FILE_out = FILE_a*k1 + FILE_b*k2.
>
> This is the code I'm using:
>
> import os
> import time
> from multiprocessing import Pool
>
> import pandas as pd
>
>
> def mapcalc(df):
>     month = {1: '17', 2: '47', 3: '75', 4: '105', 5: '135', 6: '162',
>              7: '198', 8: '228', 9: '258', 10: '288', 11: '318', 12: '344'}
>     hour = {4: '04', 5: '05', 6: '06', 7: '07', 8: '08', 9: '09',
>             10: '10', 11: '11', 12: '12', 13: '13', 14: '14', 15: '15',
>             16: '16', 17: '17', 18: '18', 19: '19', 20: '20', 21: '21', 22: '22'}
>     minute = {0: '00', 15: '15', 30: '30', 45: '45'}
>     directory = '/home/user/Raster/'
>     tmp = '/home/usr/tmp/'
>     for i in df.index:
>         if 4 <= i.hour < 22:
>             timeg = time.time()
>             stamp = month[i.month] + '_' + hour[i.hour] + minute[i.minute]
>             os.system('gdal_calc.py'
>                       ' -A ' + directory + 'filea_' + stamp +
>                       ' -B ' + directory + 'fileb_' + stamp +
>                       ' --outfile=' + tmp + str(i.date()) + '_' + str(i.time()) +
>                       ' --calc=A*' + str(df.loc[i, 'k1']) +
>                       '+B*' + str(df.loc[i, 'k2']))
>             print(i, "--- %s seconds ---" % (time.time() - timeg))
>
>
> if __name__ == '__main__':
>     df = pd.read_csv('input.csv', sep=';', index_col='Date Time', decimal=',')
>     df.index = pd.to_datetime(df.index, unit='s')
>
>     start_time = time.time()
>     pool = Pool(processes=8)
>     pool.map(mapcalc, [df.iloc[i*20:(i+1)*20]
>                        for i in range(len(df.index) // 20 + 1)])
>     pool.close()
>     pool.join()
>     print("--- %s seconds ---" % (time.time() - start_time))
>
> If I run the code without parallelization, it takes around 650 s to
> complete the calculation, and each iteration of the for loop executes in
> ~10 s. If I run it with parallelization, it takes ~900 s to complete,
> and each iteration takes ~30 s.
>
> Why is that? How can I fix it?
>
> Best L
>
_______________________________________________
gdal-dev mailing list
[email protected]
http://lists.osgeo.org/mailman/listinfo/gdal-dev
