Re: [OMPI devel] Benchmark with multiple orteds

Ralph Castain Mon, 25 Jan 2016 12:01:50 -0500 (EST)

I also assumed that was true. However, when communicating between two
procs, the TCP stack will use a shortcut in the loopback code if the two
procs are known to be on the same node. In the case of multiple orteds, it
isn't clear to me that the stack can tell, since the orteds, at least,
must have unique IP addresses and think they are on separate nodes.


On Mon, Jan 25, 2016 at 6:32 AM, Gilles Gouaillardet <
gilles.gouaillar...@gmail.com> wrote:

> Though I did not repeat it, I assumed --mca btl tcp,self is always used,
> as described in the initial email
>
> Cheers,
>
> Gilles
>
>
> On Monday, January 25, 2016, Ralph Castain <r...@open-mpi.org> wrote:
>
>> I believe the performance penalty will still always be greater than zero,
>> however, as the TCP stack is smart enough to take an optimized path when
>> doing a loopback as opposed to inter-node communication.
>>
>>
>> On Mon, Jan 25, 2016 at 4:28 AM, Gilles Gouaillardet <
>> gilles.gouaillar...@gmail.com> wrote:
>>
>>> Federico,
>>>
>>> I did not expect 0% degradation, since you are now comparing two
>>> different cases:
>>> - 1 orted means tasks are bound to sockets
>>> - 16 orteds means tasks are not bound
>>>
>>> a quick way to improve things is to use a wrapper that binds MPI tasks:
>>> mpirun --bind-to none wrapper.sh skampi
>>>
>>> wrapper.sh can use an environment variable to retrieve the rank id
>>> (PMI(X)_RANK iirc) and then bind the task with taskset or the hwloc
>>> utils
>>>
>>> mpirun --tag-output grep Cpus_allowed_list /proc/self/status
>>> with 1 orted should return the same output as
>>> mpirun --tag-output --bind-to none wrapper.sh grep Cpus_allowed_list
>>> /proc/self/status
>>> with 16 orteds
>>>
>>> once wrapper.sh works, the skampi degradation with 16 orteds should be
>>> smaller
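
A minimal sketch of the wrapper.sh described above, assuming Linux with
taskset (util-linux) installed. The rank variables (PMIX_RANK, falling
back to OMPI_COMM_WORLD_RANK) and the one-rank-per-CPU mapping are
assumptions, not something confirmed in this thread, and pick_cpu is a
hypothetical helper. To mirror the 1-orted case, the mapping would need to
be replaced with the CPU lists that the 1-orted run actually reports.

```shell
#!/bin/sh
# wrapper.sh (sketch): bind the current MPI task to one CPU, then run it.

# Hypothetical helper: map a rank to a CPU id, wrapping around if there
# are more ranks than online CPUs.
pick_cpu() {
    # $1 = rank, $2 = number of online CPUs
    echo $(( $1 % $2 ))
}

# Assumption: the launcher exports the rank in one of these variables.
rank="${PMIX_RANK:-${OMPI_COMM_WORLD_RANK:-0}}"
ncpus="$(getconf _NPROCESSORS_ONLN)"
cpu="$(pick_cpu "$rank" "$ncpus")"

# Only replace this shell with the bound command when one was given.
if [ "$#" -gt 0 ]; then
    exec taskset -c "$cpu" "$@"
fi
```

It would be invoked as in the command above, for example
mpirun --bind-to none wrapper.sh skampi, and checked with the
Cpus_allowed_list comparison.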
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On Monday, January 25, 2016, Federico Reghenzani <
>>> federico1.reghenz...@mail.polimi.it> wrote:
>>>
>>>> Thank you Gilles, you're right, with --bind-to none we have ~15%
>>>> degradation rather than 50%.
>>>>
>>>> It's much better now, but I think it should be (in theory) around 0%.
>>>> The benchmark is MPI bound (it is the standard benchmark provided with
>>>> SkaMPI) and tests these functions: MPI_Bcast, MPI_Barrier, MPI_Reduce,
>>>> MPI_Allreduce, MPI_Alltoall, MPI_Gather, MPI_Scatter, MPI_Scan,
>>>> MPI_Send/Recv
>>>>
>>>> Cheers,
>>>> Federico
>>>> __
>>>> Federico Reghenzani
>>>> M.Eng. Student @ Politecnico di Milano
>>>> Computer Science and Engineering
>>>>
>>>>
>>>>
>>>> 2016-01-25 12:17 GMT+01:00 Gilles Gouaillardet <
>>>> gilles.gouaillar...@gmail.com>:
>>>>
>>>>> Federico,
>>>>>
>>>>> unless you already took care of that, I would guess all 16 orteds
>>>>> bind their child MPI tasks to socket 0
>>>>>
>>>>> can you try
>>>>> mpirun --bind-to none ...
>>>>>
>>>>> btw, is your benchmark application CPU bound? memory bound? MPI
>>>>> bound?
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Gilles
>>>>>
>>>>>
>>>>> On Monday, January 25, 2016, Federico Reghenzani <
>>>>> federico1.reghenz...@mail.polimi.it> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> we have run a benchmark (SkaMPI) on the same machine (32-core
>>>>>> Intel Xeon x86_64) with these two configurations:
>>>>>> - 1 orted with 16 processes, with the BTL forced to TCP (--mca btl
>>>>>> self,tcp)
>>>>>> - 16 orteds with 1 process each (also using TCP)
>>>>>>
>>>>>> We use a custom RAS to allow multiple orteds on the same machine (I
>>>>>> know that it seems nonsense to have multiple orteds on the same machine
>>>>>> for the same application, but we are doing some experiments for
>>>>>> migration).
>>>>>>
>>>>>> Initially we expected approximately the same performance in both
>>>>>> cases (we have 16 processes communicating via TCP in both cases), but we
>>>>>> see a degradation of 50%, and we are sure that it is not overhead due to
>>>>>> orted initialization.
>>>>>>
>>>>>> Do you have any idea how multiple orteds can influence the processes'
>>>>>> performance?
>>>>>>
>>>>>>
>>>>>> Cheers,
>>>>>> Federico
>>>>>> __
>>>>>> Federico Reghenzani
>>>>>> M.Eng. Student @ Politecnico di Milano
>>>>>> Computer Science and Engineering
>>>>>>
>>>>>>
>>>>>>
>>>>> _______________________________________________
>>>>> devel mailing list
>>>>> de...@open-mpi.org
>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>> Link to this post:
>>>>> http://www.open-mpi.org/community/lists/devel/2016/01/18499.php
>>>>>
>>>>
>>>>
>>
>>
