I believe the performance penalty will still always be greater than zero,
however, since the TCP stack is smart enough to take an optimized path for
loopback traffic as opposed to inter-node communication.


On Mon, Jan 25, 2016 at 4:28 AM, Gilles Gouaillardet <
gilles.gouaillar...@gmail.com> wrote:

> Federico,
>
> I did not expect 0% degradation, since you are now comparing two different
> cases:
> 1 orted means tasks are bound to sockets
> 16 orted means tasks are not bound.
>
> a quick way to improve things is to use a wrapper that binds the MPI tasks:
> mpirun --bind-to none wrapper.sh skampi
>
> wrapper.sh can use an environment variable to retrieve the rank id
> (PMI(X)_RANK iirc) and then bind the task with taskset or the hwloc utils
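>
> For illustration, here is a minimal sketch of what such a wrapper could
> look like (only an assumption, not tested here: it relies on the rank id
> being exported as OMPI_COMM_WORLD_RANK / PMIX_RANK / PMI_RANK, uses
> taskset, and the two-cores-per-rank mapping is just an example to adapt
> to the real topology):
>
>     #!/bin/sh
>     # wrapper.sh - bind the current MPI task, then exec the real binary
>
>     # pick up the rank id from whichever variable the runtime exports
>     rank=${OMPI_COMM_WORLD_RANK:-${PMIX_RANK:-$PMI_RANK}}
>
>     # example mapping: rank N -> cores 2N and 2N+1
>     first=$((rank * 2))
>     last=$((first + 1))
>
>     # bind this task to its cores and run the benchmark with its arguments
>     exec taskset -c "${first}-${last}" "$@"
>
> and it would be invoked exactly as above: mpirun --bind-to none wrapper.sh skampi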
>
> mpirun --tag-output grep Cpus_allowed_list /proc/self/status
> with 1 orted should return the same output as
> mpirun --tag-output --bind-to none wrapper.sh grep Cpus_allowed_list
> /proc/self/status
> with 16 orted
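>
> To compare the two cases mechanically, a small sketch using standard tools
> (--tag-output is dropped here because its job-id prefix can differ between
> runs and would otherwise show up in the diff):
>
>     # run in the 1 orted setup: binding done by mpirun
>     mpirun grep Cpus_allowed_list /proc/self/status | sort > bound.txt
>     # run in the 16 orted setup: binding done by wrapper.sh instead of mpirun
>     mpirun --bind-to none wrapper.sh grep Cpus_allowed_list /proc/self/status \
>         | sort > wrapped.txt
>     # an empty diff means the wrapper reproduces the same bindings
>     diff bound.txt wrapped.txt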
>
> when wrapper.sh works fine, the skampi degradation with 16 orted should be
> smaller
>
> Cheers,
>
> Gilles
>
> On Monday, January 25, 2016, Federico Reghenzani <
> federico1.reghenz...@mail.polimi.it> wrote:
>
>> Thank you Gilles, you're right: with --bind-to none we have ~15%
>> degradation rather than 50%.
>>
>> It's much better now, but I think it should be (in theory) around 0%.
>> The benchmark is MPI bound (it is the standard benchmark provided with
>> SkaMPI); it tests these functions: MPI_Bcast, MPI_Barrier, MPI_Reduce,
>> MPI_Allreduce, MPI_Alltoall, MPI_Gather, MPI_Scatter, MPI_Scan, and
>> MPI_Send/Recv.
>>
>> Cheers,
>> Federico
>> __
>> Federico Reghenzani
>> M.Eng. Student @ Politecnico di Milano
>> Computer Science and Engineering
>>
>>
>>
>> 2016-01-25 12:17 GMT+01:00 Gilles Gouaillardet <
>> gilles.gouaillar...@gmail.com>:
>>
>>> Federico,
>>>
>>> unless you already took care of that, I would guess all 16 orteds
>>> bound their child MPI tasks to socket 0
>>>
>>> can you try
>>> mpirun --bind-to none ...
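>>>
>>> a quick way to see what binding each task actually gets is Open MPI's
>>> --report-bindings option, e.g. (sketch only, with "skampi" standing in
>>> for the real benchmark invocation):
>>>
>>>     mpirun --report-bindings -np 16 --mca btl self,tcp skampi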
>>>
>>> btw, is your benchmark application CPU bound? memory bound? MPI bound?
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>>
>>> On Monday, January 25, 2016, Federico Reghenzani <
>>> federico1.reghenz...@mail.polimi.it> wrote:
>>>
>>>> Hello,
>>>>
>>>> we have executed a benchmark (SkaMPI) on the same machine (a 32-core
>>>> Intel Xeon, x86_64) with these two configurations (sketched below):
>>>> - 1 orted with 16 processes, with the BTL forced to TCP (--mca btl self,tcp)
>>>> - 16 orteds with 1 process each (also over TCP)
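>>>>
>>>> For reference, a sketch of how the first configuration might be launched
>>>> (only illustrative: the second configuration goes through our custom RAS,
>>>> and "skampi" stands for the actual benchmark invocation):
>>>>
>>>>     # configuration 1: a single orted, 16 ranks on one node, TCP BTL only
>>>>     mpirun -np 16 --mca btl self,tcp skampi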
>>>>
>>>> We use a custom RAS to allow multiple orteds on the same machine (I know
>>>> it seems nonsensical to run multiple orteds on the same machine for the
>>>> same application, but we are doing some experiments with migration).
>>>>
>>>> Initially we expected approximately the same performance in both cases
>>>> (we have 16 processes communicating via TCP either way), but we see a
>>>> degradation of about 50%, and we are sure it is not overhead from the
>>>> orteds' initialization.
>>>>
>>>> Do you have any idea how multiple orteds can influence the processes'
>>>> performance?
>>>>
>>>>
>>>> Cheers,
>>>> Federico
>>>> __
>>>> Federico Reghenzani
>>>> M.Eng. Student @ Politecnico di Milano
>>>> Computer Science and Engineering
>>>>
>>>>
>>>>
