[hwloc-devel] Create success (hwloc r1.5a1r4360)

2012-03-05 Thread MPI Team
Creating nightly hwloc snapshot SVN tarball was a success.

Snapshot:   hwloc 1.5a1r4360
Start time: Mon Mar  5 21:01:01 EST 2012
End time:   Mon Mar  5 21:04:19 EST 2012

Your friendly daemon,
Cyrador


Re: [OMPI devel] Collective communications may abend when using buffers over 2GiB

2012-03-05 Thread N.M. Maclaren

On Mar 5 2012, George Bosilca wrote:


I gave it a try (r26103). It was messy, and I hope I got it right. Let's
soak it for a few days with our nightly testing to see how it behaves.


That'll at least check that it's not totally broken.  The killer about
such wording is that you cannot guarantee exactly how a new vendor will
interpret it.  That caused a LOT of trouble with C90 and to some extent
with C99.  Currently compilers are following the path of doing less and
less optimisation, rather than more (which was happening then), which
reduces the chances of problems.

I think the chances of any new standard not honouring casts in such
expressions is very low, but I didn't expect the breakages in C99.
I didn't expect that WG14 would completely change their interpretation
of the same wording in a decade, either.   As far as I know, C11 hasn't
broken anything in this area (though it has in others), but I haven't
looked at it in detail.


Regards,
Nick Maclaren.



Re: [OMPI devel] Collective communications may abend when using buffers over 2GiB

2012-03-05 Thread George Bosilca
I gave it a try (r26103). It was messy, and I hope I got it right. Let's soak 
it for a few days with our nightly testing to see how it behaves.

  george.

On Mar 5, 2012, at 16:37 , N.M. Maclaren wrote:

> On Mar 5 2012, George Bosilca wrote:
>> 
>> I was afraid of all those little intermediary steps. I asked a compiler 
>> guy and apparently reversing the order (aka starting with the ptrdiff_t 
>> variable) will not solve anything. The only portable way to solve this is to 
>> cast every single member, to prevent __any__ compiler from hurting us.
> 
> That is true, but even that may not help, given that each version of
> the C standard has been incompatible with its predecessors.  And see
> below.
> 
>>> In my copy of C99, section 6.5 Expressions says "the order of evaluation 
>>> of subexpressions and the order in which side effects take place are both 
>>> unspecified." There is a footnote 71 that "specifies the precedence of 
>>> operators in the evaluation of an expression, which is the same as the 
>>> order of the major subclauses of this subclause, highest precedence first." 
>>> It is the footnote that implies multiplication (6.5.5 Multiplicative 
>>> operators) has higher precedence than addition (6.5.6 Additive operators) 
>>> in the expression "(char*) rbuf + rank * rcount * rext". But the main text 
>>> states that there is no ordering of the subexpression "rank * rcount * 
>>> rext". When the compiler chooses to evaluate "rank * rcount" first, the 
>>> overflow described by Yuki can result. I think you are correct that the 
>>> subexpression will get promoted to (ptrdiff_t), but that is not quite the 
>>> same thing.
> 
> No, it's not as simple as that :-(
> 
> That was the intent during the standardisation of C90, but those of
> us who tried failed to get any explicit statement into it, and the
> situation during C99 was that "but everybody knows that" the syntax
> rules also define the evaluation order.  We failed to get that stated
> then, either :-(  That interpretation was apparently also the one
> assumed by C++03, too, and now is explicitly (if informally) stated in
> C++11.  So you theoretically can just cast the first operand to the
> maximum precision and it will all work.
> 
> What it means by the "order of evaluation of subexpressions" is that
> the assignments in '(a = b) + (c = d) + (e = f)' can take place in
> any order, which is a different issue. 
> HOWEVER, about half of the C communities have given C99 the thumbs
> down, I doubt that C11 will be taken much notice of, gcc is the
> de facto standard definer, and most compilers have optimisation
> options that say "ignore the standard when it helps to go faster".
> So the only feasible rule is to do your damnedest to defend yourself
> against the aberrations, ambiguities and inconsistencies of C, and
> hope for the best.  I.e. what George recommends.
> 
> But will even that work reliably in the medium term?  I wouldn't
> bet on it :-(
> 
> 
> Regards,
> Nick Maclaren.
> 
> 




Re: [OMPI devel] Collective communications may abend when using buffers over 2GiB

2012-03-05 Thread George Bosilca
I was afraid of all those little intermediary steps. I asked a compiler guy 
and apparently reversing the order (aka starting with the ptrdiff_t variable) 
will not solve anything. The only portable way to solve this is to cast every 
single member, to prevent __any__ compiler from hurting us.
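
For illustration, a minimal sketch of the fully-cast form of the displacement
computation discussed in this thread (the helper name and values are made up
for the example; the variable roles follow the coll_tuned snippet quoted below):

    #include <stddef.h>   /* ptrdiff_t */

    /* Sketch only: 'rank' and 'rcount' are plain ints, 'rext' is the
     * datatype extent (ptrdiff_t), as in the quoted coll_tuned code. */
    char *displacement(void *rbuf, int rank, int rcount, ptrdiff_t rext)
    {
        /* Unsafe: 'rank * rcount' is evaluated in int and can overflow
         * before anything is converted to ptrdiff_t:
         *     (char*) rbuf + rank * rcount * rext;
         * Defensive: cast every operand so no intermediate product is
         * computed in a type narrower than ptrdiff_t. */
        return (char*) rbuf + (ptrdiff_t) rank * (ptrdiff_t) rcount * rext;
    }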

  george.

On Mar 5, 2012, at 13:40 , Larry Baker wrote:

> George,
> 
> I think Yuki's interpretation is correct.
> 
>>> The following is one of the suspicious parts.
>>> (There is much similar code in ompi/coll/tuned/*.c.)
>>> 
>>> --- in ompi/coll/tuned/coll_tuned_allgather.c (V1.4.X's trunk)---
>>> 398    tmprecv = (char*) rbuf + rank * rcount * rext;
>>> -
>>> 
>>> if this condition is met, "rank * rcount" overflows.
>>> So, we fixed it tentatively as follows
>>> (casting int to size_t):
>>> --- in ompi/coll/tuned/coll_tuned_allgather.c --
>>> 398    tmprecv = (char*) rbuf + (size_t)rank * rcount * rext;
>>> 
>> 
>> Based on my understanding of the C standard this operation should be done on 
>> the most extended type, in this particular case that of rext (ptrdiff_t). 
>> Thus I would say the displacement should be correctly computed.
> 
> In my copy of C99, section 6.5 Expressions says "the order of evaluation of 
> subexpressions and the order in which side effects take place are both 
> unspecified." There is a footnote 71 that "specifies the precedence of 
> operators in the evaluation of an expression, which is the same as the order 
> of the major subclauses of this subclause, highest precedence first." It is 
> the footnote that implies multiplication (6.5.5 Multiplicative operators) has 
> higher precedence than addition (6.5.6 Additive operators) in the expression 
> "(char*) rbuf + rank * rcount * rext". But the main text states that there 
> is no ordering of the subexpression "rank * rcount * rext". When the 
> compiler chooses to evaluate "rank * rcount" first, the overflow described by 
> Yuki can result. I think you are correct that the subexpression will get 
> promoted to (ptrdiff_t), but that is not quite the same thing.
> 
> Larry Baker
> US Geological Survey
> 650-329-5608
> ba...@usgs.gov
> 



Re: [OMPI devel] Collective communications may abend when using buffers over 2GiB

2012-03-05 Thread Larry Baker

George,

I think Yuki's interpretation is correct.


The following is one of the suspicious parts.
(There is much similar code in ompi/coll/tuned/*.c.)

--- in ompi/coll/tuned/coll_tuned_allgather.c (V1.4.X's trunk)---
398    tmprecv = (char*) rbuf + rank * rcount * rext;
-

if this condition is met, "rank * rcount" overflows.
So, we fixed it tentatively as follows
(casting int to size_t):
--- in ompi/coll/tuned/coll_tuned_allgather.c --
398    tmprecv = (char*) rbuf + (size_t)rank * rcount * rext;



Based on my understanding of the C standard this operation should be  
done on the most extended type, in this particular case that of  
rext (ptrdiff_t). Thus I would say the displacement should be  
correctly computed.


In my copy of C99, section 6.5 Expressions says "the order of  
evaluation of subexpressions and the order in which side effects take  
place are both unspecified." There is a footnote 71 that "specifies  
the precedence of operators in the evaluation of an expression, which  
is the same as the order of the major subclauses of this subclause,  
highest precedence first." It is the footnote that implies  
multiplication (6.5.5 Multiplicative operators) has higher precedence  
than addition (6.5.6 Additive operators) in the expression "(char*)  
rbuf + rank * rcount * rext". But the main text states that there is  
no ordering of the subexpression "rank * rcount * rext". When the  
compiler chooses to evaluate "rank * rcount" first, the overflow  
described by Yuki can result. I think you are correct that the  
subexpression will get promoted to (ptrdiff_t), but that is not quite  
the same thing.
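
To see the effect concretely, a small self-contained sketch, assuming a
32-bit int and purely illustrative values (not taken from any real run):

    #include <stdio.h>
    #include <stddef.h>

    int main(void)
    {
        int rank   = 3;            /* illustrative values only */
        int rcount = 800000000;
        ptrdiff_t rext = 4;        /* e.g. the extent of a 4-byte type */

        /* 'rank * rcount' is an int * int multiplication; 2,400,000,000
         * does not fit in a 32-bit int, so it overflows (formally,
         * undefined behaviour) before the ptrdiff_t operand is involved. */
        ptrdiff_t bad  = rank * rcount * rext;

        /* Casting up front forces the whole product into ptrdiff_t. */
        ptrdiff_t good = (ptrdiff_t) rank * rcount * rext;

        printf("bad  = %td\ngood = %td\n", bad, good);
        return 0;
    }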


Larry Baker
US Geological Survey
650-329-5608
ba...@usgs.gov



Re: [OMPI devel] Collective communications may abend when using buffers over 2GiB

2012-03-05 Thread George Bosilca
Yuki,

I pushed a fix for this issue in the trunk (r26097). However, I disagree with 
you on some of the topics below.

On Mar 5, 2012, at 04:02 , Y.MATSUMOTO wrote:

> Dear All,
> 
> Next feedback is about "collective communications".
> 
> Collective communication may abend when it uses a buffer larger than 2GiB.
> This problem occurs under the following condition:
> -- communicator_size * count (scount/rcount) >= 2GiB
> It can occur even in a small PC cluster.
> 
> The following is one of the suspicious parts.
> (There is much similar code in ompi/coll/tuned/*.c.)
> 
> --- in ompi/coll/tuned/coll_tuned_allgather.c (V1.4.X's trunk)---
> 398    tmprecv = (char*) rbuf + rank * rcount * rext;
> -
> 
> if this condition is met, "rank * rcount" overflows.
> So, we fixed it tentatively as follows
> (casting int to size_t):
> --- in ompi/coll/tuned/coll_tuned_allgather.c --
> 398    tmprecv = (char*) rbuf + (size_t)rank * rcount * rext;
> 

Based on my understanding of the C standard this operation should be done on 
the most extended type, in this particular case that of rext (ptrdiff_t). 
Thus I would say the displacement should be correctly computed.

> Fixing this problem requires changes not only in "ompi/coll/tuned" but also 
> in other code.
> We have tried to fix it, but the following functions still have the problem 
> (an argument may overflow):
> - "ompi_coll_tuned_sendrecv" may be called with "scount/rcount" set over 2GiB.
> - "ompi_datatype_copy_content_same_ddt" may be called with "count" set over 
> 2GiB.

These two should have been fixed by the previous commit (r26097)

> -"basic_linear in Allgather": Bcast may be called when "count" sets over 2GiB.

Fixed in r26098.

  george.

> 
> Best Regards,
> Yuki Matsumoto
> MPI development team,
> Fujitsu
> 




[OMPI devel] Collective communications may abend when using buffers over 2GiB

2012-03-05 Thread Y.MATSUMOTO
Dear All,

Next feedback is about "collective communications".

Collective communication may abend when it uses a buffer larger than 2GiB.
This problem occurs under the following condition:
-- communicator_size * count (scount/rcount) >= 2GiB
It can occur even in a small PC cluster.

The following is one of the suspicious parts.
(There is much similar code in ompi/coll/tuned/*.c.)

--- in ompi/coll/tuned/coll_tuned_allgather.c (V1.4.X's trunk)---
398    tmprecv = (char*) rbuf + rank * rcount * rext;
-

if this condition is met, "rank * rcount" overflows.
So, we fixed it tentatively as follows
(casting int to size_t):
--- in ompi/coll/tuned/coll_tuned_allgather.c --
398    tmprecv = (char*) rbuf + (size_t)rank * rcount * rext;


Fixing this problem requires changes not only in "ompi/coll/tuned" but also in 
other code.
We have tried to fix it, but the following functions still have the problem 
(an argument may overflow):
- "ompi_coll_tuned_sendrecv" may be called with "scount/rcount" set over 2GiB.
- "ompi_datatype_copy_content_same_ddt" may be called with "count" set over 
2GiB.
- "basic_linear in Allgather": Bcast may be called with "count" set over 2GiB.

Best Regards,
Yuki Matsumoto
MPI development team,
Fujitsu



Re: [OMPI devel] poor btl sm latency

2012-03-05 Thread Matthias Jurenz
Here the SM BTL parameters:

$ ompi_info --param btl sm
MCA btl: parameter "btl_base_verbose" (current value: <0>, data source: 
default value) Verbosity level of the BTL framework
MCA btl: parameter "btl" (current value: , data source: file 
[/sw/atlas/libraries/openmpi/1.5.5rc3/x86_64/etc/openmpi-mca-params.conf]) 
Default selection set of components for the btl framework ( means use 
all components that can be found)
MCA btl: information "btl_sm_have_knem_support" (value: <1>, data source: 
default value) Whether this component supports the knem Linux kernel module or 
not
MCA btl: parameter "btl_sm_use_knem" (current value: <-1>, data source: 
default value) Whether knem support is desired or not (negative = try to 
enable knem support, but continue even if it is not available, 0 = do not 
enable knem support, positive = try to enable knem support and fail if it is 
not available)
MCA btl: parameter "btl_sm_knem_dma_min" (current value: <0>, data source: 
default value) Minimum message size (in bytes) to use the knem DMA mode; 
ignored if knem does not support DMA mode (0 = do not use the knem DMA mode)
MCA btl: parameter "btl_sm_knem_max_simultaneous" (current value: <0>, data 
source: default value) Max number of simultaneous ongoing knem operations to 
support (0 = do everything synchronously, which probably gives the best large 
message latency; >0 means to do all operations asynchronously, which supports 
better overlap for simultaneous large message sends)
MCA btl: parameter "btl_sm_free_list_num" (current value: <8>, data source: 
default value)
MCA btl: parameter "btl_sm_free_list_max" (current value: <-1>, data source: 
default value)
MCA btl: parameter "btl_sm_free_list_inc" (current value: <64>, data source: 
default value)
MCA btl: parameter "btl_sm_max_procs" (current value: <-1>, data source: 
default value)
MCA btl: parameter "btl_sm_mpool" (current value: , data source: default 
value)
MCA btl: parameter "btl_sm_fifo_size" (current value: <4096>, data source: 
default value)
MCA btl: parameter "btl_sm_num_fifos" (current value: <1>, data source: default 
value)
MCA btl: parameter "btl_sm_fifo_lazy_free" (current value: <120>, data source: 
default value)
MCA btl: parameter "btl_sm_sm_extra_procs" (current value: <0>, data source: 
default value)
MCA btl: parameter "btl_sm_exclusivity" (current value: <65535>, data source: 
default value) BTL exclusivity (must be >= 0)
MCA btl: parameter "btl_sm_flags" (current value: <5>, data source: default 
value) BTL bit flags (general flags: SEND=1, PUT=2, GET=4, SEND_INPLACE=8, 
RDMA_MATCHED=64, HETEROGENEOUS_RDMA=256; flags only used by the "dr" PML 
(ignored by others): ACK=16, CHECKSUM=32, RDMA_COMPLETION=128; flags only used 
by the "bfo" PML (ignored by others): FAILOVER_SUPPORT=512)
MCA btl: parameter "btl_sm_rndv_eager_limit" (current value: <4096>, data 
source: default value) Size (in bytes) of "phase 1" fragment sent for all 
large messages (must be >= 0 and <= eager_limit)
MCA btl: parameter "btl_sm_eager_limit" (current value: <4096>, data source: 
default value) Maximum size (in bytes) of "short" messages (must be >= 1).
MCA btl: parameter "btl_sm_max_send_size" (current value: <32768>, data 
source: default value) Maximum size (in bytes) of a single "phase 2" fragment 
of a long message when using the pipeline protocol (must be >= 1)
MCA btl: parameter "btl_sm_bandwidth" (current value: <9000>, data source: 
default value) Approximate maximum bandwidth of interconnect(0 = auto-detect 
value at run-time [not supported in all BTL modules], >= 1 = bandwidth in 
Mbps)
MCA btl: parameter "btl_sm_latency" (current value: <1>, data source: default 
value) Approximate latency of interconnect (must be >= 0)
MCA btl: parameter "btl_sm_priority" (current value: <0>, data source: default 
value)
MCA btl: parameter "btl_base_warn_component_unused" (current value: <1>, data 
source: default value) This parameter is used to turn on warning messages when 
certain NICs are not used
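
For reference, any of the parameters listed above can be overridden per run or
persistently; a minimal sketch using the btl_sm_fifo_lazy_free value discussed
further down in this thread (the benchmark binary name and the value are only
illustrative):

    # per-run override on the mpirun command line
    mpirun --mca btl_sm_fifo_lazy_free 1 -np 2 ./latency_benchmark

    # or persistently, in the openmpi-mca-params.conf file referenced above:
    #   btl_sm_fifo_lazy_free = 1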

Matthias

On Friday 02 March 2012 16:23:32 George Bosilca wrote:
> Please do a "ompi_info --param btl sm" on your environment. The lazy_free
> parameter directs the internals of the SM BTL not to release the memory
> fragments used to communicate until the lazy limit is reached. The default
> value was deemed reasonable a while back, when the number of default
> fragments was large. Lately there were some patches to reduce the memory
> footprint of the SM BTL and these might have lowered the available fragments
> to a limit where the default value for the lazy_free is now too large.
> 
>   george.
> 
> On Mar 2, 2012, at 10:08 , Matthias Jurenz wrote:
> > Thanks to the OTPO tool, I figured out that setting the MCA parameter
> > btl_sm_fifo_lazy_free to 1 (default is 120) improves the latency
> > significantly: 0.88µs
> > 
> > But somehow I get the feeling that this doesn't eliminate the actual
> > problem...
> > 
> > Matthias
> > 
> > On Friday 02 March 2012