Yes, I was running Ubuntu 16.04 in a Docker container on a CentOS box. I have 
since rebuilt the boxes with native Ubuntu 16.04. I actually have 
evaluation access to three of these Intel boxes and I tried to run MPI 
across all three, but I'm not sure I have the settings correct, since it 
ran for approximately the same elapsed time as running it on a single box 
(see the verification sketch after the mpirun command below).

$ cat mpi-3host
optane30 slots=4
optane29 slots=4
optane28 slots=4


$ mpirun --bind-to socket --hostfile mpi-3host -x ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=14 \
    otbcli_MeanShiftSmoothing \
    -in /u/ror/buildings/data/naip/doqqs/2014/33118/m_3311805_se_11_1_20140513.tif \
    -fout /u/ror/buildings/tmp/test5-smooth.tif \
    -foutpos /u/ror/buildings/tmp/test5-smoothpos.tif \
    -spatialr 24 -ranger 36 -ram 102400 -thres 0.1 -maxiter 100
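
For what it's worth, with slots=4 per node and ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=14 
I expect 4 x 14 = 56 threads per box, which matches the 56 logical CPUs that lscpu 
reports below, so I don't think I'm oversubscribing. To verify that the ranks are 
actually being spread across all three hosts (and where they get bound), I'm planning 
to try something like the following; the --display-map / --report-bindings flags and 
the ppr mapping are just my reading of the Open MPI docs, so please correct me if 
there is a better way:

# Print the process map and bindings for a trivial command, to confirm that
# ranks actually land on optane30, optane29 and optane28
$ mpirun --hostfile mpi-3host --display-map --report-bindings hostname

# Ask explicitly for 4 ranks per node instead of relying on the slots= defaults
$ mpirun --map-by ppr:4:node --bind-to socket --hostfile mpi-3host \
    -x ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=14 \
    otbcli_MeanShiftSmoothing \
    -in /u/ror/buildings/data/naip/doqqs/2014/33118/m_3311805_se_11_1_20140513.tif \
    -fout /u/ror/buildings/tmp/test5-smooth.tif \
    -foutpos /u/ror/buildings/tmp/test5-smoothpos.tif \
    -spatialr 24 -ranger 36 -ram 102400 -thres 0.1 -maxiter 100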

All three boxes have identical software installed. optane30 is the master; 
I exported /u/ror/buildings from it over NFS and mounted it on the other 
two boxes (rough commands below). All the work gets done in the 
/u/ror/buildings/tmp/ directory.
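
The NFS sharing was set up roughly like this; the export options are approximate 
(from memory), so don't take them literally:

# On optane30 (the master), /etc/exports has an entry along these lines:
#   /u/ror/buildings  optane29(rw,sync,no_subtree_check) optane28(rw,sync,no_subtree_check)
$ sudo exportfs -ra

# On optane29 and optane28, mount it at the same path so the commands
# are identical on every box
$ sudo mount -t nfs optane30:/u/ror/buildings /u/ror/buildings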

$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                56
On-line CPU(s) list:   0-55
Thread(s) per core:    2
Core(s) per socket:    14
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
Stepping:              1
CPU MHz:               1845.289
CPU max MHz:           3500.0000
CPU min MHz:           1200.0000
BogoMIPS:              5202.36
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              35840K
NUMA node0 CPU(s):     0-13,28-41
NUMA node1 CPU(s):     14-27,42-55
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge 
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall 
nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl 
xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor 
ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 
x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm 
abm 3dnowprefetch epb intel_pt tpr_shadow vnmi flexpriority ept vpid 
fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed adx 
smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm arat 
pln pts


Any suggestions on how to get this working properly as a cluster?

Thanks,
  -Steve

Performance stats so far:

Intel Server 
Description                                  Elapsed (s)   CPU (s)   Elapsed (h)
Test 1  -thres 0.001 -maxiter 4  4,521.9   205,217.4   1.26  
smoothing  3,747.0   203,160.2  
time mpirun -np 4 --bind-to socket otbcli_MeanShiftSmoothing -in  
/u/ror/buildings/data/naip/doqqs/2014/33118/m_3311805_se_11_1_20140513.tif 
-fout /u/ror/buildings/tmp/test1-smooth.tif -foutpos 
/u/ror/buildings/tmp/test1-smoothpos.tif -spatialr 24 -ranger 36 -ram 102400 
segmentation  318.1   378.0  
time otbcli_LSMSSegmentation -in /u/ror/buildings/tmp/test1-smooth.tif 
-inpos /u/ror/buildings/tmp/test1-smoothpos.tif -out 
/u/ror/buildings/tmp/test1-segs.tif -tmpdir /u/ror/buildings/tmp -spatialr 
24 -ranger 36 -minsize 128 -tilesizex 1025 -tilesizey 1025 
vectorization  456.7   1,679.2  
time mpirun -np 4 --bind-to socket otbcli_LSMSVectorization -in 
/u/ror/buildings/tmp/test1-smooth.tif -inseg 
/u/ror/buildings/tmp/test1-segs.tif -out 
/u/ror/buildings/tmp/test1-segments-mpi.shp -tilesizex 1025 -tilesizey 1025 





Test 2  -thres 0.001 -maxiter 4  4,318.0   204,090.5   1.20  
smoothing  3,746.8   203,418.8  
time mpirun -np 4 --bind-to socket otbcli_MeanShiftSmoothing -in 
/u/ror/buildings/data/naip/doqqs/2014/33118/m_3311805_se_11_1_20140513.tif 
-fout /u/ror/buildings/tmp/test2-smooth.tif -foutpos 
/u/ror/buildings/tmp/test2-smoothpos.tif -spatialr 24 -ranger 36 -ram 102400 
segmentation  155.1   164.0  
time otbcli_LSMSSegmentation -in /u/ror/buildings/tmp/test2-smooth.tif 
-inpos /u/ror/buildings/tmp/test2-smoothpos.tif -out 
/u/ror/buildings/tmp/test2-segs.tif -tmpdir /u/ror/buildings/tmp -spatialr 
24 -ranger 36 -minsize 128 -tilesizex 1025 -tilesizey 1025 
vectorization  416.1   507.7  
time otbcli_LSMSVectorization -in /u/ror/buildings/tmp/test2-smooth.tif 
-inseg /u/ror/buildings/tmp/test2-segs.tif -out 
/u/ror/buildings/tmp/test2-segments.shp -tilesizex 1025 -tilesizey 1025 





Test 3   -thres 0.1 -maxiter 100  4,256.2   203,771.0   1.18  
smoothing  3,747.1   203,180.5  
time mpirun -np 4 --bind-to socket otbcli_MeanShiftSmoothing -in 
/u/ror/buildings/data/naip/doqqs/2014/33118/m_3311805_se_11_1_20140513.tif 
-fout /u/ror/buildings/tmp/test3-smooth.tif -foutpos 
/u/ror/buildings/tmp/test3-smoothpos.tif -spatialr 24 -ranger 36 -ram 
102400 -thres 0.1 -maxiter 100 
segmentation  167.7   178.8  
time otbcli_LSMSSegmentation -in /u/ror/buildings/tmp/test3-smooth.tif 
-inpos /u/ror/buildings/tmp/test3-smoothpos.tif -out 
/u/ror/buildings/tmp/test3-segs.tif -tmpdir /u/ror/buildings/tmp -spatialr 
24 -ranger 36 -minsize 128 -tilesizex 1025 -tilesizey 1025 
vectorization  341.4   411.7  
time otbcli_LSMSVectorization -in /u/ror/buildings/tmp/test3-smooth.tif 
-inseg /u/ror/buildings/tmp/test3-segs.tif -out 
/u/ror/buildings/tmp/test3-segments.shp -tilesizex 1025 -tilesizey 1025 





Test 4  ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=14  3,871.2   151,724.4   1.08  
smoothing  3,131.0   150,921.2  
time mpirun -np 4 --bind-to socket otbcli_MeanShiftSmoothing -in 
/u/ror/buildings/data/naip/doqqs/2014/33118/m_3311805_se_11_1_20140513.tif 
-fout /u/ror/buildings/tmp/test4-smooth.tif -foutpos 
/u/ror/buildings/tmp/test4-smoothpos.tif -spatialr 24 -ranger 36 -ram 
102400 -thres 0.1 -maxiter 100 
segmentation  184.3   195.1  
time otbcli_LSMSSegmentation -in /u/ror/buildings/tmp/test4-smooth.tif 
-inpos /u/ror/buildings/tmp/test4-smoothpos.tif -out 
/u/ror/buildings/tmp/test4-segs.tif -tmpdir /u/ror/buildings/tmp -spatialr 
24 -ranger 36 -minsize 128 -tilesizex 1025 -tilesizey 1025 
vectorization  555.9   608.1  
time otbcli_LSMSVectorization -in /u/ror/buildings/tmp/test4-smooth.tif 
-inseg /u/ror/buildings/tmp/test4-segs.tif -out 
/u/ror/buildings/tmp/test4-segments.shp -tilesizex 1025 -tilesizey 1025 





Test 1  -thres 0.1 -maxiter 100  (Native Ubuntu)  6,238.1   157,528.8   1.73 
smoothing  5,571.6   156,831.6  
mpirun -np 4 --bind-to socket otbcli_MeanShiftSmoothing -in 
/u/ror/buildings/data/naip/doqqs/2014/33118/m_3311805_se_11_1_20140513.tif 
-fout /u/ror/buildings/tmp/test1-smooth.tif -foutpos 
/u/ror/buildings/tmp/test1-smoothpos.tif -spatialr 24 -ranger 36 -ram 
102400 -thres 0.1 -maxiter 100 
segmentation  157.5   157.4  
otbcli_LSMSSegmentation -in /u/ror/buildings/tmp/test1-smooth.tif -inpos 
/u/ror/buildings/tmp/test1-smoothpos.tif -out 
/u/ror/buildings/tmp/test1-segs.tif -tmpdir /u/ror/buildings/tmp -spatialr 
24 -ranger 36 -minsize 128 -tilesizex 1025 -tilesizey 1025 
vectorization  509.1   539.7  
otbcli_LSMSVectorization -in /u/ror/buildings/tmp/test1-smooth.tif -inseg 
/u/ror/buildings/tmp/test1-segs.tif -out 
/u/ror/buildings/tmp/test1-segments.shp -tilesizex 1025 -tilesizey 1025 





Test 5 3-Server MPI on Ubuntu  3,722.5   149,495.6   1.03  
smoothing  3,081.3   148,862.0  
mpirun --bind-to socket --hostfile mpi-3host -x 
ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=14 otbcli_MeanShiftSmoothing -in 
/u/ror/buildings/data/naip/doqqs/2014/33118/m_3311805_se_11_1_20140513.tif 
-fout /u/ror/buildings/tmp/test5-smooth.tif -foutpos 
/u/ror/buildings/tmp/test5-smoothpos.tif -spatialr 24 -ranger 36 -ram 
102400 -thres 0.1 -maxiter 100 
segmentation  169.1   123.8  
otbcli_LSMSSegmentation -in /u/ror/buildings/tmp/test5-smooth.tif -inpos 
/u/ror/buildings/tmp/test5-smoothpos.tif -out 
/u/ror/buildings/tmp/test5-segs.tif -tmpdir /u/ror/buildings/tmp -spatialr 
24 -ranger 36 -minsize 128 -tilesizex 1025 -tilesizey 1025 
vectorization  472.1   509.8  
otbcli_LSMSVectorization -in /u/ror/buildings/tmp/test5-smooth.tif -inseg 
/u/ror/buildings/tmp/test5-segs.tif -out 
/u/ror/buildings/tmp/test5-segments.shp -tilesizex 1025 -tilesizey 1025 

On Tuesday, May 23, 2017 at 4:00:37 AM UTC-4, remicres wrote:
>
> Hi Stephen,
>
>
>> Thanks, I have this more or less working. I have not set the env variable 
>> ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS but I will try that; I seem to be 
>> getting about 4 times that many threads running.
>>
>  
> If you do not set the correct number of threads, you can end up with more 
> threads than CPUs, which will lead to poor performance (the goal is to have 
> one thread per CPU, virtually speaking).
>
>
>
>> Below are various problems I've run into. Some of these might be code 
>> bugs, or config issues, or who knows what :)
>>
>> I get an error with --bind-to socket
>>
>> mpirun -np 4 --bind-to socket otbcli_MeanShiftSmoothing \
>>     -in /u/ror/buildings/data/naip/doqqs/2014/33118/m_3311805_se_11_1_20140513.tif \
>>     -fout /u/ror/buildings/tmp/test1-smooth.tif \
>>     -foutpos /u/ror/buildings/tmp/test1-smoothpos.tif \
>>     -spatialr 24 -ranger 36 -ram 102400
>> Unexpected end of /proc/mounts line `overlay / overlay 
>> rw,seclabel,relatime,lowerdir=/var/lib/docker/overlay2/l/JPC7E5F4RB77LOK22ETL5FMEPN:/var/lib/docker/overlay2/l/DM3Q73J52BCAIEZVAQZGAMXLCX:/var/lib/docker/overlay2/l/WC5LQTPG4RBGOUEZ7KBJZLUB2R:/var/lib/docker/overlay2/l/BESSO2WOBICH2P4GSVX7VSCGG6:/var/lib/docker/overlay2/l/FMSJDZMFK67RHOIIZOLKOICAHI:/var/lib/docker/overlay2/l/U7AFHXIVI6KAKUO2VJMZWLQOHH:/var/lib/docker/overlay2/l/EIRHWP2GOK3F2PH7SHY4FK6J6P,upperdir=/var/lib/docker/overlay2/73d138b0a2dadf534a9d9c7d2ed894484515bfe3d2f1807a2b8'
>> --------------------------------------------------------------------------
>> WARNING: Open MPI tried to bind a process but failed.  This is a
>> warning only; your job will continue, though performance may
>> be degraded.
>>
>>   Local host:        optane30
>>   Application name:  /usr/bin/otbcli_MeanShiftSmoothing
>>   Error message:     failed to bind memory
>>   Location:          odls_default_module.c:639
>>
>> --------------------------------------------------------------------------
>>
>>
>>
> From your logs, it looks like you are using a virtual environment (Docker?).
> My first impression is that MPI fails to bind processes in this environment; 
> however, I have never used MPI in such a configuration.
>  
>
>> But the job runs to completion. When I try to run otbcli_LSMSVectorization 
>> under MPI it fails; the same command runs fine without MPI. If this command 
>> shouldn't run under MPI, you might want to add a check and report it to the 
>> user, or just internally disable MPI.
>>
>
> Indeed, LSMSVectorization does not currently support MPI.
> You are absolutely right, we need to add something to prevent the use of 
> applications that can't work with MPI.
>
> Thank you for this useful feedback; we will take care of improving the 
> MPI feature and providing more documentation!
> Rémi
>
>
