For other purposes, some of us have been looking at a "profiling" tool
that would be run on a cluster and output a "recommended" mca param
file to optimize OMPI's behavior for that environment. The idea was
that a sys admin would launch this once across the cluster so we could
do things like determine if the system is homogeneous (so modex can be
flagged for reduction), do some collective tuning, etc.
I would think something like this could easily be included in such a
tool. If memcpy is truly implemented as a component, then specifying
the particular component to use in a default MCA param file would
seem to solve the problem, and would be more in keeping with the OMPI
design than a run-to-run global registry (which sounds too much like Windows).
Ralph
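To make the param-file idea above concrete: a minimal sketch of what the
profiling tool might emit, assuming the memcpy framework follows the usual
MCA naming conventions. The component name and tuning parameter below are
hypothetical placeholders, not actual OMPI parameters:

    # $prefix/etc/openmpi-mca-params.conf (cluster-wide defaults)
    # Hypothetical: pin the memcpy component chosen by the profiling tool
    memcpy = x86_nt
    # Hypothetical tuning knob exported by that component
    memcpy_x86_nt_min_copy_size = 32768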
On Aug 18, 2008, at 7:42 AM, George Bosilca wrote:
We don't really need finer-grained knowledge about the processor at
compile time. The only thing we should detect is whether a bit of code
can or cannot be compiled. We can deal with the processor
characteristics at runtime. I imagine that most of today's processors
can export an ID string, with bits set for the supported instruction
sets (at least x86 does). Based on these bits [at runtime] we can
figure out whether a special version of memcpy can be used or not.
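A minimal sketch of that runtime check on x86, using GCC's <cpuid.h>.
The SSE2/SSE3 variant names are stand-ins for whatever tuned routines a
component would actually provide:

    #include <cpuid.h>
    #include <stddef.h>
    #include <string.h>

    typedef void *(*memcpy_fn_t)(void *, const void *, size_t);

    /* Stand-ins for the tuned variants a real component would provide. */
    static void *memcpy_sse2(void *d, const void *s, size_t n) { return memcpy(d, s, n); }
    static void *memcpy_sse3(void *d, const void *s, size_t n) { return memcpy(d, s, n); }

    static memcpy_fn_t select_memcpy(void)
    {
        unsigned int eax, ebx, ecx, edx;

        /* CPUID leaf 1: feature bits live in ECX/EDX (x86/x86-64 only). */
        if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
            if (ecx & bit_SSE3) return memcpy_sse3;
            if (edx & bit_SSE2) return memcpy_sse2;
        }
        return memcpy;   /* fall back to the plain libc memcpy */
    }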
The second question is how and when to figure out which of the
available memcpy functions gives the best performance. On a
homogeneous architecture, this might be a one-node selection [I
don't imagine using the modex to spread this information], while on a
heterogeneous one every class of processors should do it. The really
annoying thing here is that, in a perfect world, this should be done
once per cluster; there is no need to run the benchmark at each
startup. We should think about a storage mechanism where nodes can
push small bits of information that will be available on subsequent
runs. A little bit like the registry, but more stable...
george.
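A rough sketch of the benchmark-and-remember idea, assuming the candidates
have already passed the runtime feature check. The cache-file handling and
names are illustrative only, not an actual OMPI mechanism:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    #define BUF_SIZE  (1 << 20)   /* 1 MB test copy */
    #define NUM_ITERS 100

    typedef void *(*memcpy_fn_t)(void *, const void *, size_t);

    /* Time one candidate: lower is better. */
    static double bench_one(memcpy_fn_t fn, void *dst, void *src)
    {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < NUM_ITERS; i++) {
            fn(dst, src, BUF_SIZE);
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    }

    /* Pick the fastest candidate and remember the choice on disk so the
     * benchmark does not have to be repeated on every startup. */
    static int select_and_cache(memcpy_fn_t *cands, const char **names,
                                int n, const char *cache_path)
    {
        void *src = malloc(BUF_SIZE), *dst = malloc(BUF_SIZE);
        int best = 0;
        double best_t;

        memset(src, 1, BUF_SIZE);
        best_t = bench_one(cands[0], dst, src);
        for (int i = 1; i < n; i++) {
            double t = bench_one(cands[i], dst, src);
            if (t < best_t) { best_t = t; best = i; }
        }

        FILE *f = fopen(cache_path, "w");
        if (f) { fprintf(f, "%s\n", names[best]); fclose(f); }
        free(src); free(dst);
        return best;
    }

On later startups the stored name would be read back instead of re-running
the loop, which is the "once per cluster" behavior described above.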
On Aug 18, 2008, at 3:16 AM, Brian Barrett wrote:
I obviously won't be in Dublin (I'll be in a fishing boat in the
middle of nowhere Canada -- much better), so I'm going to chime in
now.
The m4 part actually isn't too bad and is pretty simple. I'm not
sure there is much to check other than looking at some variables set
by ompi_config_asm. The hard part is dealing with the finer-grained
instruction set requirements.
On x86 in particular, many of the operations in the memcpy are part
of SSE, SSE2, or SSE3. Currently, we don't have any finer concept
of a processor than x86 and most compilers target an instruction
set that will run on anything considered 686, which is almost
everything out there. We'd have to decide how to handle
instruction streams which are no longer going to work on every
chip. Since we know we have a number of users with heterogeneous
x86 clusters, this is something to think about.
Brian
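For a sense of what the m4 checks reduce to: configure mostly needs to know
whether a given instruction stream builds at all, which an
AC_LINK_IFELSE-style test can answer by compiling a tiny probe. The probe
below is a sketch of that kind of test program, not the actual OMPI macro:

    /* Probe: can this compiler emit SSE2 streaming stores?  If this
     * file fails to compile or link, the corresponding memcpy variant
     * is simply not built on this platform. */
    #include <emmintrin.h>   /* SSE2 intrinsics */

    int main(void)
    {
        __m128i zero = _mm_setzero_si128();
        __m128i buf[2];
        _mm_stream_si128(buf, zero);   /* non-temporal 16-byte store */
        _mm_sfence();                  /* order the streaming store */
        return 0;
    }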
On Aug 17, 2008, at 7:57 AM, Jeff Squyres wrote:
Let's talk about this in Dublin. I can probably help with the m4
magic, but I need to understand exactly what needs to be done first.
On Aug 16, 2008, at 11:51 AM, Terry Dontje wrote:
George Bosilca wrote:
The intent of the memcpy framework is to allow a selection
between several memcpy implementations at runtime. Of course, there
will be a preselection at compile time, but all versions that can
compile on a given architecture will be benchmarked at runtime and
the best one will be selected. There is a file with several versions
of memcpy for x86 (32- and 64-bit) somewhere around (I should have
one, if anyone is interested) that can be used as a starting point.
Ok, I guess I need to look at this code. I wonder if there may
be cases for Sun's machines in which this benchmark could end up
picking the wrong memcpy?
The only thing we need is a volunteer to build the m4 magic.
Figuring out what we can compile is kind of tricky, as some of
the functions are in assembly, some others in C, and some others
a mixture (the MMX headers).
Isn't the atomic code very similar? If I get to this point
before anyone else I probably will volunteer.
--td
george.
On Aug 16, 2008, at 3:19 PM, Terry Dontje wrote:
Hi Tim,
Thanks for bringing the below up and asking for a redirection
to the devel list. I think looking/using the MCA memcpy
framework would be a good thing to do and maybe we can work on
this together once I get out from under some commitments.
However, one of the challenges that originally scared me away
from looking at the memcpy MCA is whether we really want all
of the OMPI memcpy's to be replaced or just specific ones. Also,
I was concerned about trying to figure out which version of memcpy
I should be using. I believe that currently you get one version
based on which system you compile on. For Sun there may be several
different SPARC platforms that would need to use different memcpy
code, but we would like to ship just one set of bits.
I'm not saying the above isn't doable under the memcpy MCA
framework, just that it somewhat scared me away from thinking about
it at first glance.
--td
On Aug 15, 2008, at 12:08 PM, Tim Mattox wrote on the users list
(Re: [OMPI users] SM btl slows down bandwidth?):
Hi Terry (and others),
I have previously explored this some on Linux/x86-64 and concluded
that Open MPI needs to supply its own memcpy routine to get good sm
performance, since the memcpy supplied by glibc is not even close to
optimal. We have an unused MCA framework already set up to supply an
opal_memcpy. AFAIK, George and Brian did the original work to set up
that framework. It has been on my to-do list for a while to start
implementing opal_memcpy components for the architectures I have
access to, and to modify OMPI to actually use opal_memcpy where it
makes sense. Terry, I presume what you suggest could be dealt with
similarly when we are running/building on SPARC. Any followup
discussion on this should probably happen on the developer mailing
list.
On Thu, Aug 14, 2008 at 12:19 PM, Terry Dontje <terry.don...@sun.com> wrote:
> Interestingly enough, on the SPARC platform the Solaris memcpy's
> actually use non-temporal stores for copies >= 64KB. By default some
> of the mca parameters to the sm BTL stop at 32KB. I've done
> experiments bumping the sm segment sizes to above 64K and seen
> incredible speedup on our M9000 platforms. I am looking for some
> nice way to integrate into Open MPI a memcpy that lowers this
> boundary to 32KB or lower.
> I have not looked into whether the Solaris x86/x64 memcpy's use
> non-temporal stores or not.
>
> --td
>> On Aug 14, 2008, at 9:28 AM, Jeff Squyres wrote:
>>
>> At this time, we are not using non-temporal stores for shared
>> memory operations.
>>
>> On Aug 13, 2008, at 11:46 AM, Ron Brightwell wrote:
>>
>>>> [...]
>>>>
>>>> MPICH2 manages to get about 5GB/s in shared memory performance
>>>> on the Xeon 5420 system.
>>>
>>> Does the sm btl use a memcpy with non-temporal stores like MPICH2?
>>> This can be a big win for bandwidth benchmarks that don't actually
>>> touch their receive buffers at all...
>>>
>>> -Ron
>>
>> --
>> Jeff Squyres
>> Cisco Systems
--
Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
tmat...@gmail.com || timat...@open-mpi.org
I'm a bright... http://www.the-brights.net/
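To make the non-temporal-store discussion in the quoted thread concrete:
a minimal sketch of a streaming-store memcpy using SSE2 intrinsics, with
the 32KB cut-off Terry mentions wanting. The threshold and the alignment
handling are simplified for illustration and are not the Solaris or
MPICH2 implementation:

    #include <emmintrin.h>
    #include <stdint.h>
    #include <string.h>

    #define NT_THRESHOLD (32 * 1024)  /* use streaming stores at 32KB+ */

    /* Copy with non-temporal (cache-bypassing) stores for large
     * buffers.  Requires 16-byte-aligned pointers and a length that is
     * a multiple of 64; anything else falls back to libc memcpy. */
    static void *memcpy_nt(void *dst, const void *src, size_t len)
    {
        if (len < NT_THRESHOLD ||
            (((uintptr_t)dst | (uintptr_t)src) & 15) || (len & 63)) {
            return memcpy(dst, src, len);
        }

        const __m128i *s = (const __m128i *)src;
        __m128i *d = (__m128i *)dst;

        for (size_t i = 0; i < len / 16; i += 4) {
            __m128i a = _mm_load_si128(s + i);
            __m128i b = _mm_load_si128(s + i + 1);
            __m128i c = _mm_load_si128(s + i + 2);
            __m128i e = _mm_load_si128(s + i + 3);
            _mm_stream_si128(d + i,     a);
            _mm_stream_si128(d + i + 1, b);
            _mm_stream_si128(d + i + 2, c);
            _mm_stream_si128(d + i + 3, e);
        }
        _mm_sfence();  /* make the streaming stores globally visible */
        return dst;
    }

Because the streaming stores bypass the cache, a bandwidth benchmark that
never reads its receive buffer sees exactly the win Ron describes above.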
--
Jeff Squyres
Cisco Systems