Hi Lorenzo,

This is more of a question for the Python community. However, here are a couple of things I have noticed. Pandas tends to be much slower than working in NumPy directly. I have never seen an improvement in timings when using Pool(). What I do instead is use Process() and Queue() or JoinableQueue() from the multiprocessing library.
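A minimal sketch of that Process()/JoinableQueue() pattern (the function names and the toy doubling task are mine, not from this thread; the real work item would be one gdal_calc.py call):

```python
from multiprocessing import Process, JoinableQueue, Queue

def worker(in_q, out_q):
    # Each worker pulls items from the shared input queue until it
    # receives the None sentinel, then exits.
    while True:
        item = in_q.get()
        if item is None:
            in_q.task_done()
            break
        out_q.put(item * 2)   # stand-in for the real map calculation
        in_q.task_done()

def run_pool(items, n_workers=4):
    in_q, out_q = JoinableQueue(), Queue()
    procs = [Process(target=worker, args=(in_q, out_q)) for _ in range(n_workers)]
    for p in procs:
        p.start()
    for item in items:
        in_q.put(item)        # feed the shared input queue
    for _ in procs:
        in_q.put(None)        # one sentinel per worker
    in_q.join()               # block until every item is task_done()
    for p in procs:
        p.join()
    return sorted(out_q.get() for _ in items)

if __name__ == "__main__":
    print(run_pool([1, 2, 3, 4]))  # -> [2, 4, 6, 8]
```

Because every worker reads from the same JoinableQueue, the work items balance themselves across processes, and in_q.join() gives you a clean barrier once everything has been consumed.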
You can set up a pool of workers that all read from the same input queue. Then you can just feed your data into the queue so the workers can process the maps.

Jerl Simpson
Sr. Systems Engineer
Weather Trends International
http://www.weathertrends360.com/

On Wed, Mar 2, 2016 at 5:44 PM, Lorenzo Bottaccioli <[email protected]> wrote:
> Hi,
> I'm trying to parallelize a code for raster calculation with Gdal_calc.py,
> but I have really bad results. I need to perform several raster operations
> like FILE_out = FILE_a*k1 + FILE_b*k2.
>
> This is the code I'm using:
>
> import pandas as pd
> import os
> import time
> from multiprocessing import Pool
>
> df = pd.read_csv('input.csv', sep=";", index_col='Date Time', decimal=',')
> df.index = pd.to_datetime(df.index, unit='s')
>
> start_time = time.time()
> pool = Pool(processes=8)
> pool.map(mapcalc, [df.iloc[i*20:(i+1)*20] for i in range(len(df.index)/20+1)])
> pool.close()
> pool.join()
> print("--- %s seconds ---" % (time.time() - start_time))
>
> def mapcalc(df):
>     month={1:'17',2:'47',3:'75',4:'105',5:'135',6:'162',7:'198',8:'228',9:'258',10:'288',11:'318',12:'344'}
>     hour={4:'04',5:'05',6:'06',7:'07',8:'08',9:'09',10:'10',11:'11',12:'12',13:'13',14:'14',15:'15',16:'16',17:'17',18:'18',19:'19',20:'20',21:'21',22:'22'}
>     minute={0:'00',15:'15',30:'30',45:'45'}
>     directory='/home/user/Raster/'
>     tmp='/home/usr/tmp/'
>     for i in df.index:
>         if 4<=i.hour<22:
>             #try:
>             timeg=time.time()
>             os.system('gdal_calc.py -A '+directory+'filea_'+month[i.month]+'_'+hour[i.hour]+minute[i.minute]+
>                       ' -B '+directory+'fileb_'+month[i.month]+'_'+hour[i.hour]+minute[i.minute]+
>                       ' --outfile='+tmp+str(i.date())+'_'+str(i.time())+
>                       ' --calc=A*'+str(df.ix[i,'k1'])+'+B*'+str(df.ix[i,'k2']))
>             print(i, "--- %s seconds ---" % (time.time() - timeg))
>
> If I run the code without parallelization it takes around 650 s to
> complete the calculation. Each iteration of the for loop executes in ~10 s.
> If I run with parallelization it takes ~900 s to complete, and each
> iteration of the for loop takes ~30 s.
>
> How is that? How can I fix this?
>
> Best,
> L
>
> _______________________________________________
> gdal-dev mailing list
> [email protected]
> http://lists.osgeo.org/mailman/listinfo/gdal-dev
_______________________________________________
gdal-dev mailing list
[email protected]
http://lists.osgeo.org/mailman/listinfo/gdal-dev
