Re: [OMPI devel] C/R and orte_oob

2014-03-06 Thread Ralph Castain

On Mar 6, 2014, at 1:02 PM, Adrian Reber wrote:

> On Tue, Feb 18, 2014 at 03:46:58PM +0100, Adrian Reber wrote:
>> I tried to implement something like you described. It is not yet event
>> driven, but before continuing I wanted to get some feedback if it is at
>> least the right start:
>> 
>> https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=5048a9cec2cd0bc4867eadfd7e48412b73267706
>> 
>> I looked at the other ORTE_OOB_* macros and tried to model my
>> functionality a bit after what I have seen there. Right now it is still
>> a simple function which just tries to call ft_event() on all oob
>> components. Does this look right so far?
> 
> Sorry for delay - yes, that looks like the right direction. I would 
> suggest doing it via the current state machine, though, by simply 
> defining another job or proc state in orte/mca/plm/plm_types.h, and then 
> registering a callback function using the 
> orte_state.add_job[proc]_state(state, function to be called, 
> ORTE_ERR_PRI). Then you can activate it by calling 
> ORTE_ACTIVATE_JOB[PROC]_STATE(NULL, state) and it will be handled in the 
> proper order.
 
>>>> What is a job/proc in the Open MPI context?
>>> 
>>> A "job" is the entire application, while a "proc" is just one process in 
>>> that application. In this case you could use either one as you are 
>>> checkpointing the entire job, but all this activity is occurring inside 
>>> each proc. So I'd suggest defining it as a proc state since it only really 
>>> involves local actions.
>>> 
>>> If you like, I can define the required code in the trunk and let you fill 
>>> in the event functionality.
>> 
>> That would be great.
> 
> Thanks for your changes. When using --with-ft there are a few compiler
> errors which I tried to fix with the following patch:
> 
> https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=71521789ef9d248a7eef53030d2ec5de900faa4c

That looks okay, with the only caveat being that you wouldn't ordinarily pass 
the state_caddy_t into a function. It's just there to pass along the job etc in 
case the callback function needs to reference something. In this case, I can't 
think of anything the FT event function would need to know - you just want it 
to quiet all messaging.


> 
>   Adrian
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/03/14309.php



Re: [OMPI devel] C/R and orte_oob

2014-03-06 Thread Adrian Reber
On Tue, Feb 18, 2014 at 03:46:58PM +0100, Adrian Reber wrote:
> > >>> I tried to implement something like you described. It is not yet event
> > >>> driven, but before continuing I wanted to get some feedback if it is at
> > >>> least the right start:
> > >>> 
> > >>> https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=5048a9cec2cd0bc4867eadfd7e48412b73267706
> > >>> 
> > >>> I looked at the other ORTE_OOB_* macros and tried to model my
> > >>> functionality a bit after what I have seen there. Right now it is still
> > >>> a simple function which just tries to call ft_event() on all oob
> > >>> components. Does this look right so far?
> > >> 
> > >> Sorry for delay - yes, that looks like the right direction. I would 
> > >> suggest doing it via the current state machine, though, by simply 
> > >> defining another job or proc state in orte/mca/plm/plm_types.h, and then 
> > >> registering a callback function using the 
> > >> orte_state.add_job[proc]_state(state, function to be called, 
> > >> ORTE_ERR_PRI). Then you can activate it by calling 
> > >> ORTE_ACTIVATE_JOB[PROC]_STATE(NULL, state) and it will be handled in the 
> > >> proper order.
> > > 
> > > What is a job/proc in the Open MPI context.
> > 
> > A "job" is the entire application, while a "proc" is just one process in 
> > that application. In this case you could use either one as you are 
> > checkpointing the entire job, but all this activity is occurring inside 
> > each proc. So I'd suggest defining it as a proc state since it only really 
> > involves local actions.
> > 
> > If you like, I can define the required code in the trunk and let you fill 
> > in the event functionality.
> 
> That would be great.

Thanks for your changes. When using --with-ft there are a few compiler
errors which I tried to fix with the following patch:

https://lisas.de/git/?p=open-mpi.git;a=commitdiff;h=71521789ef9d248a7eef53030d2ec5de900faa4c

Adrian


Re: [OMPI devel] autoconf warnings: openib BTL

2014-03-06 Thread Mike Dubman
but AF_IB is always declared, regardless of actual presence in the kernel.


On Thu, Mar 6, 2014 at 5:56 PM, Ralph Castain  wrote:

> Let me see if I can help translate. I think the problem here is Jeff's
> comment about a "run time check", which wasn't actually what he is
> proposing here.
>
> If you look at Jeff's proposed code, what he is saying is that you don't
> need to use AC_TRY_RUN - you can just build based on whether or not AF_IB
> is declared, and so AC_CHECK_DECLS is adequate. If the resulting code
> fails, then that's an error anyway. So you can just protect the code as he
> shows and be done with it.
>
> This would avoid all the warnings we are now receiving on the trunk, and
> do what you need. Make sense?
>
>
>
>
>
> On Thu, Mar 6, 2014 at 7:26 AM, Jeff Squyres (jsquyres) <
> jsquy...@cisco.com> wrote:
>
>> On Mar 6, 2014, at 4:08 AM, Vasily Filipov 
>> wrote:
>>
>> >> #if HAVE_DECL_AF_IB
>> >>    rc = try_using_af_ib();
>> >>    if (OMPI_ERR_NOT_AVAILABLE == rc) {
>> >>       rc = try_the_other_way();
>> >>    }
>> >> #else
>> >>    rc = try_the_other_way();
>> >> #endif
>> > I mean I cannot use "another way" (i.e. call "try_the_other_way()") if
>> > the call to "try_using_af_ib" fails, because RDMACM was compiled for
>> > AF_IB usage (different fields in structs, different function
>> > prototypes).
>>
>> Ok, that means the implementation is reduced to:
>>
>> #if HAVE_DECL_AF_IB
>>    rc = try_using_af_ib();
>> #else
>>    rc = try_the_other_way();
>> #endif
>>
>> Right?  If so, I don't see why you need the AC_TRY_RUN -- if RDMACM is
>> easily detectable as to which way it is compiled (because it has, for
>> example, different fields), then AC_CHECK_DECLS should be enough, right?
>>
>> I must be missing something...?
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>
>


Re: [OMPI devel] autoconf warnings: openib BTL

2014-03-06 Thread Ralph Castain
Let me see if I can help translate. I think the problem here is Jeff's
comment about a "run time check", which wasn't actually what he is
proposing here.

If you look at Jeff's proposed code, what he is saying is that you don't
need to use AC_TRY_RUN - you can just build based on whether or not AF_IB
is declared, and so AC_CHECK_DECLS is adequate. If the resulting code
fails, then that's an error anyway. So you can just protect the code as he
shows and be done with it.

This would avoid all the warnings we are now receiving on the trunk, and do
what you need. Make sense?





On Thu, Mar 6, 2014 at 7:26 AM, Jeff Squyres (jsquyres)
wrote:

> On Mar 6, 2014, at 4:08 AM, Vasily Filipov 
> wrote:
>
> >> #if HAVE_DECL_AF_IB
> >>    rc = try_using_af_ib();
> >>    if (OMPI_ERR_NOT_AVAILABLE == rc) {
> >>       rc = try_the_other_way();
> >>    }
> >> #else
> >>    rc = try_the_other_way();
> >> #endif
> > I mean I cannot use "another way" (i.e. call "try_the_other_way()") if
> > the call to "try_using_af_ib" fails, because RDMACM was compiled for
> > AF_IB usage (different fields in structs, different function prototypes).
>
> Ok, that means the implementation is reduced to:
>
> #if HAVE_DECL_AF_IB
>    rc = try_using_af_ib();
> #else
>    rc = try_the_other_way();
> #endif
>
> Right?  If so, I don't see why you need the AC_TRY_RUN -- if RDMACM is
> easily detectable as to which way it is compiled (because it has, for
> example, different fields), then AC_CHECK_DECLS should be enough, right?
>
> I must be missing something...?
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>


Re: [OMPI devel] autoconf warnings: openib BTL

2014-03-06 Thread Jeff Squyres (jsquyres)
On Mar 6, 2014, at 4:08 AM, Vasily Filipov wrote:

>> #if HAVE_DECL_AF_IB
>>    rc = try_using_af_ib();
>>    if (OMPI_ERR_NOT_AVAILABLE == rc) {
>>       rc = try_the_other_way();
>>    }
>> #else
>>    rc = try_the_other_way();
>> #endif
> I mean I cannot use "another way" (i.e. call "try_the_other_way()") if the
> call to "try_using_af_ib" fails, because RDMACM was compiled for AF_IB
> usage (different fields in structs, different function prototypes).

Ok, that means the implementation is reduced to:

#if HAVE_DECL_AF_IB
   rc = try_using_af_ib();
#else
   rc = try_the_other_way();
#endif

Right?  If so, I don't see why you need the AC_TRY_RUN -- if RDMACM is easily 
detectable as to which way it is compiled (because it has, for example, 
different fields), then AC_CHECK_DECLS should be enough, right?

I must be missing something...?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] autoconf warnings: openib BTL

2014-03-06 Thread Vasily Filipov


On 05-Mar-14 18:08, Jeff Squyres (jsquyres) wrote:

> On Mar 3, 2014, at 10:59 PM, Vasily Filipov wrote:
>
>> Yes, it is possible, but there is a difference if I do it this way:
>>   With the current implementation (in the trunk today), if AC_RUN_IFELSE
>> fails => the old RDMACM code is used.
>>   With the way you suggest, if we postpone the decision to run time and the
>> check fails => we have to abort RDMACM altogether, because it was compiled
>> to work with AF_IB.
>>   So my question to you, taking all of the above into account:
>>   what is the right way to implement it? What do you think?

> I'm not sure I understand.  Can't you write something like:
>
> #if HAVE_DECL_AF_IB
>    rc = try_using_af_ib();
>    if (OMPI_ERR_NOT_AVAILABLE == rc) {
>       rc = try_the_other_way();
>    }
> #else
>    rc = try_the_other_way();
> #endif

I mean I cannot use "another way" (i.e. call "try_the_other_way()") if the
call to "try_using_af_ib" fails, because RDMACM was compiled for AF_IB
usage (different fields in structs, different function prototypes).