Here is an interim summary of what I have found out so far:

   - base starting case (load average 10-15)
   - converting the input file from .vrt to .tif improves performance a
   little (load average 12-17; see the sketch below)
   - ITK_USE_THREADPOOL=ON with the .tif input improves performance a little
   (load average 17-28)
   - using the .tif input with MPI improves performance a lot (load average
   80-85)
   
Load average as reported by htop.
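
For reference, the .vrt to .tif conversion is just a plain GDAL translate
(gdal_translate on the command line). A rough Python sketch, assuming the
GDAL 2.x Python bindings; the output name and creation options are only
illustrative:

from osgeo import gdal

# Flatten the VRT mosaic into a single tiled GeoTIFF before smoothing.
gdal.Translate('areaofinterest.tif', 'tmp-23081-areaofinterest.vrt',
               format='GTiff',
               creationOptions=['TILED=YES'])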

On Friday, May 19, 2017 at 7:11:40 AM UTC-4, Manuel Grizonnet wrote:
>
> Hi Stephen,
>
> just want to add that there is perhaps something else to try: the ITK 
> mechanism which allows using a pool of threads:
>
>
> https://github.com/InsightSoftwareConsortium/ITK/blob/master/Modules/Core/Common/include/itkMultiThreader.h#L210
>
> You can easily test this by setting the environment variable 
> ITK_USE_THREADPOOL (to 'ON' for instance).
>
> I have never personally tried this configuration and was not able to find 
> much documentation about it so far.
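>
> If you drive the application from Python, a minimal (untested) sketch 
> would be to set the variable before importing otbApplication, so it is 
> already in the environment when ITK is loaded:
>
> import os
> os.environ['ITK_USE_THREADPOOL'] = 'ON'  # set before ITK/OTB is loaded
> import otbApplication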
>
> Best regards,
>
> Manuel
>
>
> 2017-05-18 23:52 GMT+02:00 Stephen Woodbridge <[email protected]>:
>
>> Hello Remi,
>>
>> I have never used MPI before. I can run the LSMS smoothing from the CLI. 
>> The system has 2 CPU sockets with 14 cores per socket (2 threads per core, 
>> 56 logical CPUs):
>>
>> $ lscpu
>> Architecture:          x86_64
>> CPU op-mode(s):        32-bit, 64-bit
>> Byte Order:            Little Endian
>> CPU(s):                56
>> On-line CPU(s) list:   0-55
>> Thread(s) per core:    2
>> Core(s) per socket:    14
>> Socket(s):             2
>> NUMA node(s):          2
>> Vendor ID:             GenuineIntel
>> CPU family:            6
>> Model:                 79
>> Model name:            Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
>> Stepping:              1
>> CPU MHz:               1497.888
>> CPU max MHz:           2600.0000
>> CPU min MHz:           1200.0000
>> BogoMIPS:              5207.83
>> Virtualization:        VT-x
>> L1d cache:             32K
>> L1i cache:             32K
>> L2 cache:              256K
>> L3 cache:              35840K
>> NUMA node0 CPU(s):     0-13,28-41
>> NUMA node1 CPU(s):     14-27,42-55
>>
>> So I launch something like:
>>
>> mpirun -n 4 --bind-to socket otbcli_MeanShiftSmoothing -in maur_rgb.png 
>> -fout smooth.tif -foutpos position.tif -spatialr 16 -ranger 16 -thres 0.1 
>> -maxiter 100
>>
>> So if I understand correctly, this launches 4 copies of the application, 
>> but how do they know which instance is working on what? Is that just the 
>> magic of MPI?
>>
>> -Steve
>>
>>
>> On Thursday, May 18, 2017 at 12:07:52 PM UTC-4, remicres wrote:
>>>
>>> Hello Stephen,
>>> I am really interested in your results. 
>>> A few years ago I failed to get good benchmarks of OTB apps (that is, 
>>> good scalability with the number of CPUs used) on the same kind of 
>>> machine as yours. The speedup collapsed at around 10-30 CPUs (depending 
>>> on the app). I suspected a lack of fine tuning to be the cause, and I did 
>>> not have the time to persevere. This poor speedup might be related to 
>>> thread placement or cache issues: the current framework scales well 
>>> across CPUs when processing images in a shared-memory context, 
>>> particularly when the threads are on the same CPU socket. Depending on 
>>> the algorithm used, I suspect one might also need to fine tune the 
>>> environment settings. 
>>> Could you provide the number of sockets in your machine (with the 
>>> number of CPUs on each one)?
>>>
>>> If this machine has many sockets, one quick workaround to get a good 
>>> speedup could be to use the MPI support and force the binding of MPI 
>>> processes to the sockets (e.g. with OpenMPI: "mpirun -n <nb of sockets of 
>>> your machine> --bind-to socket ..."). However, I am not sure how to use 
>>> it from Python.
>>>
>>> Keep us updated!
>>>
>>> Rémi
>>>
>>> Le mercredi 17 mai 2017 21:34:48 UTC+2, Stephen Woodbridge a écrit :
>>>>
>>>> I started watching this with htop and all the CPUs are getting action. 
>>>> There is a pattern where the number of threads spikes from about 162 up 
>>>> to 215 and the number of running threads spikes to about 50ish for a few 
>>>> seconds, then the number of running threads drops to 2 for 5-10 seconds, 
>>>> and the pattern repeats. I'm thinking that the parent thread is spinning 
>>>> up a bunch of workers, they finish, then the parent thread cycles through 
>>>> each of the finished workers collecting the results and presumably writes 
>>>> them to disk or something. If it is writing to disk, there could be a 
>>>> huge potential performance improvement: write the output to memory if 
>>>> enough memory is available (which is clearly the case on this machine), 
>>>> then flush the memory to disk. The current process is only using 3 GB of 
>>>> memory when it has 100 GB available to it and the system has 120 GB.
>>>>
>>>> On Wednesday, May 17, 2017 at 12:13:04 PM UTC-4, Stephen Woodbridge 
>>>> wrote:
>>>>>
>>>>> Hi, first I want to say the LSMS Segmentation is very cool and works 
>>>>> nicely. I recently got access to a server with 56 cores and 128 GB of 
>>>>> memory, but I can't seem to get it to use more than 10-15 cores. I'm 
>>>>> running the smoothing on an image approx 20000x20000 in size. The image 
>>>>> is a GDAL VRT file that combines 8 DOQQ images into a mosaic. It has 4 
>>>>> bands (R, G, B, IR), each with Mask Flags: PER_DATASET (see below). I'm 
>>>>> running this from a Python script like:
>>>>>
>>>>> import otbApplication
>>>>>
>>>>> def smoothing(fin, fout, foutpos, spatialr, ranger, rangeramp, thres, maxiter, ram):
>>>>>     app = otbApplication.Registry.CreateApplication('MeanShiftSmoothing')
>>>>>     app.SetParameterString('in', fin)
>>>>>     app.SetParameterString('fout', fout)
>>>>>     app.SetParameterString('foutpos', foutpos)
>>>>>     app.SetParameterInt('spatialr', spatialr)
>>>>>     app.SetParameterFloat('ranger', ranger)
>>>>>     app.SetParameterFloat('rangeramp', rangeramp)
>>>>>     app.SetParameterFloat('thres', thres)
>>>>>     app.SetParameterInt('maxiter', maxiter)
>>>>>     app.SetParameterInt('ram', ram)
>>>>>     app.SetParameterInt('modesearch', 0)
>>>>>     app.ExecuteAndWriteOutput()
>>>>>
>>>>> Where:
>>>>> spatialr: 24
>>>>> ranger: 36
>>>>> rangeramp: 0
>>>>> thres: 0.1
>>>>> maxiter: 100
>>>>> ram: 102400
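>>>>>
>>>>> So the actual call looks roughly like this (the output file names here 
>>>>> are just illustrative):
>>>>>
>>>>> smoothing('tmp-23081-areaofinterest.vrt', 'smooth.tif', 'position.tif',
>>>>>           spatialr=24, ranger=36, rangeramp=0, thres=0.1,
>>>>>           maxiter=100, ram=102400)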
>>>>>
>>>>> Any thoughts on how I can get this to utilize more of the processing 
>>>>> power of this machine?
>>>>>
>>>>> -Steve
>>>>>
>>>>> woodbri@optane28:/u/ror/buildings/tmp$ otbcli_ReadImageInfo -in tmp-23081-areaofinterest.vrt
>>>>> 2017 May 17 15:36:04  :  Application.logger  (INFO)
>>>>> Image general information:
>>>>>         Number of bands : 4
>>>>>         No data flags : Not found
>>>>>         Start index :  [0,0]
>>>>>         Size :  [19933,19763]
>>>>>         Origin :  [-118.442,34.0035]
>>>>>         Spacing :  [9.83578e-06,-9.83578e-06]
>>>>>         Estimated ground spacing (in meters): [0.90856,1.09369]
>>>>>
>>>>> Image acquisition information:
>>>>>         Sensor :
>>>>>         Image identification number:
>>>>>         Image projection : GEOGCS["WGS 84",
>>>>>     DATUM["WGS_1984",
>>>>>         SPHEROID["WGS 84",6378137,298.257223563,
>>>>>             AUTHORITY["EPSG","7030"]],
>>>>>         AUTHORITY["EPSG","6326"]],
>>>>>     PRIMEM["Greenwich",0],
>>>>>     UNIT["degree",0.0174532925199433],
>>>>>     AUTHORITY["EPSG","4326"]]
>>>>>
>>>>> Image default RGB composition:
>>>>>         [R, G, B] = [0,1,2]
>>>>>
>>>>> Ground control points information:
>>>>>         Number of GCPs = 0
>>>>>         GCPs projection =
>>>>>
>>>>> Output parameters value:
>>>>> indexx: 0
>>>>> indexy: 0
>>>>> sizex: 19933
>>>>> sizey: 19763
>>>>> spacingx: 9.835776837e-06
>>>>> spacingy: -9.835776837e-06
>>>>> originx: -118.4418488
>>>>> originy: 34.00345612
>>>>> estimatedgroundspacingx: 0.9085595012
>>>>> estimatedgroundspacingy: 1.093693733
>>>>> numberbands: 4
>>>>> sensor:
>>>>> id:
>>>>> time:
>>>>> ullat: 0
>>>>> ullon: 0
>>>>> urlat: 0
>>>>> urlon: 0
>>>>> lrlat: 0
>>>>> lrlon: 0
>>>>> lllat: 0
>>>>> lllon: 0
>>>>> town:
>>>>> country:
>>>>> rgb.r: 0
>>>>> rgb.g: 1
>>>>> rgb.b: 2
>>>>> projectionref: GEOGCS["WGS 84",
>>>>>     DATUM["WGS_1984",
>>>>>         SPHEROID["WGS 84",6378137,298.257223563,
>>>>>             AUTHORITY["EPSG","7030"]],
>>>>>         AUTHORITY["EPSG","6326"]],
>>>>>     PRIMEM["Greenwich",0],
>>>>>     UNIT["degree",0.0174532925199433],
>>>>>     AUTHORITY["EPSG","4326"]]
>>>>> keyword:
>>>>> gcp.count: 0
>>>>> gcp.proj:
>>>>> gcp.ids:
>>>>> gcp.info:
>>>>> gcp.imcoord:
>>>>> gcp.geocoord:
>>>>>
>>>>> woodbri@optane28:/u/ror/buildings/tmp$ gdalinfo tmp-23081-areaofinterest.vrt
>>>>> Driver: VRT/Virtual Raster
>>>>> Files: tmp-23081-areaofinterest.vrt
>>>>>        /u/ror/buildings/tmp/tmp-23081-areaofinterest.vrt.vrt
>>>>> Size is 19933, 19763
>>>>> Coordinate System is:
>>>>> GEOGCS["WGS 84",
>>>>>     DATUM["WGS_1984",
>>>>>         SPHEROID["WGS 84",6378137,298.257223563,
>>>>>             AUTHORITY["EPSG","7030"]],
>>>>>         AUTHORITY["EPSG","6326"]],
>>>>>     PRIMEM["Greenwich",0],
>>>>>     UNIT["degree",0.0174532925199433],
>>>>>     AUTHORITY["EPSG","4326"]]
>>>>> Origin = (-118.441851318576212,34.003461706049677)
>>>>> Pixel Size = (0.000009835776490,-0.000009835776490)
>>>>> Corner Coordinates:
>>>>> Upper Left  (-118.4418513,  34.0034617) (118d26'30.66"W, 34d 0'12.46"N)
>>>>> Lower Left  (-118.4418513,  33.8090773) (118d26'30.66"W, 33d48'32.68"N)
>>>>> Upper Right (-118.2457948,  34.0034617) (118d14'44.86"W, 34d 0'12.46"N)
>>>>> Lower Right (-118.2457948,  33.8090773) (118d14'44.86"W, 33d48'32.68"N)
>>>>> Center      (-118.3438231,  33.9062695) (118d20'37.76"W, 33d54'22.57"N)
>>>>> Band 1 Block=128x128 Type=Byte, ColorInterp=Red
>>>>>   Mask Flags: PER_DATASET
>>>>> Band 2 Block=128x128 Type=Byte, ColorInterp=Green
>>>>>   Mask Flags: PER_DATASET
>>>>> Band 3 Block=128x128 Type=Byte, ColorInterp=Blue
>>>>>   Mask Flags: PER_DATASET
>>>>> Band 4 Block=128x128 Type=Byte, ColorInterp=Gray
>>>>>   Mask Flags: PER_DATASET
>>>>>
>>>>>
>>>>>
>
>
>
> -- 
> Manuel Grizonnet
>
