For other purposes, some of us have been looking at a "profiling" tool
that would be run on a cluster and output a "recommended" mca param
file to optimize OMPI's behavior for that environment. The idea was
that a sys admin would launch this once across the cluster so we could
do things like determine if the system is homogeneous (so modex can be
flagged for reduction), do some collective tuning, etc.
I would think something like this could easily be included in such a
tool. If memcpy is truly implemented as a component, then specifying
the particular component to use in a default MCA param file would
seem to solve the problem, and would be more in keeping with the OMPI
design than a run-to-run global registry (which sounds too much like Windows).
Ralph
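To make the param-file idea above concrete: a minimal sketch of what the
profiling tool might emit, assuming the memcpy framework follows the usual
MCA naming conventions. The component name and tuning parameter below are
hypothetical placeholders, not actual OMPI parameters:

    # $prefix/etc/openmpi-mca-params.conf (cluster-wide defaults)
    # Hypothetical: pin the memcpy component chosen by the profiling tool
    memcpy = x86_nt
    # Hypothetical tuning knob exported by that component
    memcpy_x86_nt_min_copy_size = 32768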
On Aug 18, 2008, at 7:42 AM, George Bosilca wrote:
We don't really need finer-grained knowledge about the processor at
compile time. The only thing we should detect is whether a bit of code
can or cannot be compiled. We can deal with the processor
characteristics at runtime. I imagine that most of today's processors
can export an ID string, with bits set for the supported instruction
sets (at least x86 does). Based on these bits [at runtime] we can
figure out whether a special version of memcpy can be used or not.
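A minimal sketch of that runtime check on x86, using GCC's <cpuid.h>.
The SSE2/SSE3 variant names are stand-ins for whatever tuned routines a
component would actually provide:

    #include <cpuid.h>
    #include <stddef.h>
    #include <string.h>

    typedef void *(*memcpy_fn_t)(void *, const void *, size_t);

    /* Stand-ins for the tuned variants a real component would provide. */
    static void *memcpy_sse2(void *d, const void *s, size_t n) { return memcpy(d, s, n); }
    static void *memcpy_sse3(void *d, const void *s, size_t n) { return memcpy(d, s, n); }

    static memcpy_fn_t select_memcpy(void)
    {
        unsigned int eax, ebx, ecx, edx;

        /* CPUID leaf 1: feature bits live in ECX/EDX (x86/x86-64 only). */
        if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
            if (ecx & bit_SSE3) return memcpy_sse3;
            if (edx & bit_SSE2) return memcpy_sse2;
        }
        return memcpy;   /* fall back to the plain libc memcpy */
    }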
The second question is how and when to figure out which of the
available memcpy functions gives the best performance. On a
homogeneous architecture, this might be a one-node selection [I
don't imagine using the modex to spread this information], while on a
heterogeneous one every class of processors should do it. The really
annoying thing here is that, in a perfect world, this should be done
once per cluster; there is no need to run the benchmark at each
startup. We should think about a storage mechanism where nodes can
push small bits of information that will be available on subsequent
runs. A little bit like the registry, but more stable...
george.
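A rough sketch of the benchmark-and-remember idea, assuming the candidates
have already passed the runtime feature check. The cache-file handling and
names are illustrative only, not an actual OMPI mechanism:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    #define BUF_SIZE  (1 << 20)   /* 1 MB test copy */
    #define NUM_ITERS 100

    typedef void *(*memcpy_fn_t)(void *, const void *, size_t);

    /* Time one candidate: lower is better. */
    static double bench_one(memcpy_fn_t fn, void *dst, void *src)
    {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < NUM_ITERS; i++) {
            fn(dst, src, BUF_SIZE);
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    }

    /* Pick the fastest candidate and remember the choice on disk so the
     * benchmark does not have to be repeated on every startup. */
    static int select_and_cache(memcpy_fn_t *cands, const char **names,
                                int n, const char *cache_path)
    {
        void *src = malloc(BUF_SIZE), *dst = malloc(BUF_SIZE);
        int best = 0;
        double best_t;

        memset(src, 1, BUF_SIZE);
        best_t = bench_one(cands[0], dst, src);
        for (int i = 1; i < n; i++) {
            double t = bench_one(cands[i], dst, src);
            if (t < best_t) { best_t = t; best = i; }
        }

        FILE *f = fopen(cache_path, "w");
        if (f) { fprintf(f, "%s\n", names[best]); fclose(f); }
        free(src); free(dst);
        return best;
    }

On later startups the stored name would be read back instead of re-running
the loop, which is the "once per cluster" behavior described above.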
On Aug 18, 2008, at 3:16 AM, Brian Barrett wrote:
I obviously won't be in Dublin (I'll be in a fishing boat in the
middle of nowhere Canada -- much better), so I'm going to chime in
now.
The m4 part actually isn't too bad and is pretty simple. I'm not
sure there is much to check other than looking at some variables set
by ompi_config_asm. The hard part is dealing with the finer-grained
instruction set requirements.
On x86 in particular, many of the operations in the memcpy are part
of SSE, SSE2, or SSE3. Currently, we don't have any finer concept
of a processor than x86 and most compilers target an instruction
set that will run on anything considered 686, which is almost
everything out there. We'd have to decide how to handle
instruction streams which are no longer going to work on every
chip. Since we know we have a number of users with heterogeneous
x86 clusters, this is something to think about.
Brian
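For a sense of what the m4 checks reduce to: configure mostly needs to know
whether a given instruction stream builds at all, which an
AC_LINK_IFELSE-style test can answer by compiling a tiny probe. The probe
below is a sketch of that kind of test program, not the actual OMPI macro:

    /* Probe: can this compiler emit SSE2 streaming stores?  If this
     * file fails to compile or link, the corresponding memcpy variant
     * is simply not built on this platform. */
    #include <emmintrin.h>   /* SSE2 intrinsics */

    int main(void)
    {
        __m128i zero = _mm_setzero_si128();
        __m128i buf[2];
        _mm_stream_si128(buf, zero);   /* non-temporal 16-byte store */
        _mm_sfence();                  /* order the streaming store */
        return 0;
    }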
On Aug 17, 2008, at 7:57 AM, Jeff Squyres wrote:
Let's talk about this in Dublin. I can probably help with the m4
magic, but I need to understand exactly what needs to be done first.
On Aug 16, 2008, at 11:51 AM, Terry Dontje wrote:
George Bosilca wrote:
The intent of the memcpy framework is to allow a selection
between several memcpy implementations at runtime. Of course, there
will be a preselection at compile time, but all versions that can
compile on a given architecture will be benchmarked at runtime and
the best one will be selected. There is a file with several versions
of memcpy for x86 (32- and 64-bit) somewhere around (I should have
one, if anyone is interested) that can be used as a starting point.
Ok, I guess I need to look at this code. I wonder if there may
be cases for Sun's machines in which this benchmark could end up
picking the wrong memcpy?
The only thing we need is a volunteer to build the m4 magic.
Figuring out what we can compile is kind of tricky, as some of
the functions are in assembly, some others in C, and some others
a mixture (the MMX headers).
Isn't the atomic code very similar? If I get to this point
before anyone else I probably will volunteer.
--td
george.
On Aug 16, 2008, at 3:19 PM, Terry Dontje wrote:
Hi Tim,
Thanks for bringing the below up and asking for a redirection
to the devel list. I think looking/using the MCA memcpy
framework would be a good thing to do and maybe we can work on
this together once I get out from under some commitments.
However, one of the challenges that originally scared me away
from looking at the memcpy MCA is whether we really want all
of the OMPI memcpy's to be replaced or just specific ones. Also,
I was concerned about trying to figure out which version of memcpy
I should be using. I believe that currently you get one version
based on which system you compile on. For Sun there may be several
different SPARC platforms that would need to use different memcpy
code, but we would like to ship just one set of bits.
I'm not saying the above isn't doable under the memcpy MCA
framework, just that it somewhat scared me away from thinking about
it at first glance.
--td
On Aug 15, 2008, at 12:08 PM, Tim Mattox wrote on the users list
(Re: [OMPI users] SM btl slows down bandwidth?):
Hi Terry (and others),
I have previously explored this some on Linux/x86-64 and concluded
that Open MPI needs to supply its own memcpy routine to get good sm
performance, since the memcpy supplied by glibc is not even close to
optimal. We have an unused MCA framework already set up to supply an
opal_memcpy. AFAIK, George and Brian did the original work to set up
that framework. It has been on my to-do list for a while to start
implementing opal_memcpy components for the architectures I have
access to, and to modify OMPI to actually use opal_memcpy where it
makes sense. Terry, I presume what you suggest could be dealt with
similarly when we are running/building on SPARC. Any followup
discussion on this should probably happen on the developer mailing
list.
On Thu, Aug 14, 2008 at 12:19 PM, Terry Dontje <terry.don...@sun.com> wrote:
> Interestingly enough, on the SPARC platform the Solaris memcpy's
> actually use non-temporal stores for copies >= 64KB. By default some
> of the mca parameters to the sm BTL stop at 32KB. I've done
> experiments bumping the sm segment sizes to above 64K and seen
> incredible speedup on our M9000 platforms. I am looking for some
> nice way to integrate into Open MPI a memcpy that lowers this
> boundary to 32KB or lower.
> I have not looked into whether the Solaris x86/x64 memcpy's use
> non-temporal stores or not.
>
> --td
>> On Aug 14, 2008, at 9:28 AM, Jeff Squyres wrote:
>>
>> At this time, we are not using non-temporal stores for shared
>> memory operations.
>>
>> On Aug 13, 2008, at 11:46 AM, Ron Brightwell wrote:
>>
>>>> [...]
>>>>
>>>> MPICH2 manages to get about 5GB/s in shared memory performance
>>>> on the Xeon 5420 system.
>>>
>>> Does the sm btl use a memcpy with non-temporal stores like MPICH2?
>>> This can be a big win for bandwidth benchmarks that don't actually
>>> touch their receive buffers at all...
>>>
>>> -Ron
>>
>> --
>> Jeff Squyres
>> Cisco Systems
--
Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
tmat...@gmail.com || timat...@open-mpi.org
I'm a bright... http://www.the-brights.net/
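To make the non-temporal-store discussion in the quoted thread concrete:
a minimal sketch of a streaming-store memcpy using SSE2 intrinsics, with
the 32KB cut-off Terry mentions wanting. The threshold and the alignment
handling are simplified for illustration and are not the Solaris or
MPICH2 implementation:

    #include <emmintrin.h>
    #include <stdint.h>
    #include <string.h>

    #define NT_THRESHOLD (32 * 1024)  /* use streaming stores at 32KB+ */

    /* Copy with non-temporal (cache-bypassing) stores for large
     * buffers.  Requires 16-byte-aligned pointers and a length that is
     * a multiple of 64; anything else falls back to libc memcpy. */
    static void *memcpy_nt(void *dst, const void *src, size_t len)
    {
        if (len < NT_THRESHOLD ||
            (((uintptr_t)dst | (uintptr_t)src) & 15) || (len & 63)) {
            return memcpy(dst, src, len);
        }

        const __m128i *s = (const __m128i *)src;
        __m128i *d = (__m128i *)dst;

        for (size_t i = 0; i < len / 16; i += 4) {
            __m128i a = _mm_load_si128(s + i);
            __m128i b = _mm_load_si128(s + i + 1);
            __m128i c = _mm_load_si128(s + i + 2);
            __m128i e = _mm_load_si128(s + i + 3);
            _mm_stream_si128(d + i,     a);
            _mm_stream_si128(d + i + 1, b);
            _mm_stream_si128(d + i + 2, c);
            _mm_stream_si128(d + i + 3, e);
        }
        _mm_sfence();  /* make the streaming stores globally visible */
        return dst;
    }

Because the streaming stores bypass the cache, a bandwidth benchmark that
never reads its receive buffer sees exactly the win Ron describes above.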
--
Jeff Squyres
Cisco Systems