Re: [hwloc-devel] [Xen-devel] Hwloc with Xen host topology

2014-01-02 Thread Andrew Cooper
On 02/01/14 21:55, Samuel Thibault wrote:
> Andrew Cooper, on Thu 02 Jan 2014 21:50:06 +, wrote:
>> On 02/01/14 21:24, Samuel Thibault wrote:
>>> Andrew Cooper, on Thu 02 Jan 2014 20:26:49 +, wrote:
>>>> Cores are numbered per-socket in Xen, while sockets,
>>>> numa nodes and cpus are numbered on an absolute scale.  There is
>>>> currently a gross hack in my hwloc code which adds (socket_id *
>>>> cores_per_socket * threads_per_core) onto each core id to make them
>>>> similarly numbered on an absolute scale.  This is fine for a homogeneous
>>>> system, but not for a heterogeneous system.
>>> BTW, hwloc does not need these physical ids to be unique, it can cope
>>> with duplication and whatnot.  That said, having a coherent interface at
>>> the Xen layer would be a good thing, indeed :)
>> If I take out the described hack, I am presented with
>>
>> 
>> * hwloc has encountered what looks like an error from the operating system.
>> *
>> * object (Core P#0 cpuset 0x3003) intersection without inclusion!
>> * Error occurred in topology.c line 853
>> *
>> * Please report this error message to the hwloc user's mailing list,
>> * along with the output from the hwloc-gather-topology.sh script.
>> 
>>
>> Which I took to mean "I have done something stupid".  I looked and saw
>> that I was attempting to insert a second Core P#0 object with a
>> different cpuset and decided to renumber the cores so they didn't
>> overlap in physical ids.
>>
>> If you believe that this should indeed work, then I guess I need to
>> raise a bug...
> Well, logical processor physical ids, i.e. what is used for indexing
> physical cpusets, have to be unique. The core/socket/node IDs don't have
> to.
>
> Samuel

Then a bug needs raising.  My hack only changes the Core physical ID as
far as hwloc is concerned.  The PU physical IDs are unchanged by the
hack, and already unique as presented by Xen.

~Andrew
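
Editorial sketch of the renumbering discussed above. This is not code from
Xen or hwloc; struct socket_desc and both function names are hypothetical.
It assumes the hack uses one fixed cores_per_socket and threads_per_core
for every socket, and shows one way that could collide once sockets differ
in size. A running per-socket offset is one possible alternative:

```c
/* Hypothetical per-socket description; not a Xen or hwloc structure. */
struct socket_desc {
    unsigned cores;            /* cores in this socket */
    unsigned threads_per_core; /* hardware threads per core */
};

/* The hack from the thread: one fixed stride for every socket. */
static unsigned core_id_fixed_stride(unsigned socket_id,
                                     unsigned core_in_socket,
                                     unsigned cores_per_socket,
                                     unsigned threads_per_core)
{
    return socket_id * cores_per_socket * threads_per_core + core_in_socket;
}

/* Alternative: offset by the cores actually present in earlier sockets. */
static unsigned core_id_prefix_sum(const struct socket_desc *sockets,
                                   unsigned socket_id,
                                   unsigned core_in_socket)
{
    unsigned base = 0;
    for (unsigned s = 0; s < socket_id; s++)
        base += sockets[s].cores;
    return base + core_in_socket;
}
```

With sockets of 2, 4 and 2 cores and the stride taken from the first
socket, core 2 of socket 1 and core 0 of socket 2 both map to 4 under the
fixed-stride scheme, while the prefix-sum scheme gives 4 and 6.  Per
Samuel's note, hwloc itself only needs the PU physical ids to be unique.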



Re: [OMPI devel] [EXTERNAL] Re: bug in mca framework?

2014-01-02 Thread Barrett, Brian W
Igor -

Sorry for the slow reply; I was on vacation for the last week and a half.

The patch doesn't look quite right to me.  If the cm PML is used, the spml
(or something else in the OSHMEM layer) is going to have to call add_procs
on the BML to initialize the procs arrays for the BTLs.

Brian

On 12/23/13 3:49 AM, "Igor Ivanov"  wrote:

>Brian,
>
>Could you look at the patch based on your suggestion? It resolves the issue
>with the mca variable.
>
>Igor
>
>On 18.12.2013 01:48, Barrett, Brian W wrote:
>> The proposed solution at the bottom is wrong.  There aren't two different
>> BMLs, there's one, and it lives in OMPI.
>>
>> The solution is to open the bml and btls in ompi_mpi_init and not in the
>> pmls.  I checked, and the bml will deal with add_procs being called
>> multiple times on the same proc, so just moving the framework open / init
>> is sufficient.  This will also solve the MTL problem.
>>
>> Brian
>>
>> On 12/17/13 8:33 AM, "Joshua Ladd"  wrote:
>>
>>> I believe Devendar Bureddy nailed the root cause. I am providing his
>>> excellent analysis below:
>>>
>>> From Devendar:
>>> Out of curiosity I looked at this issue. Here's my 2 cents.
>>> I think the issue is that the BTL components are opened
>>> twice (ompi_init, yoda), which leads to incorrect usage of var groups.
>>> The following sequence of events creates invalid memory:
>>>
>>> 1) all openib component parameters registered in ompi_mpi_init
>>> main > start_pes> shmem_init -> oshmem_shmem_init -> ompi_mpi_init ->
>>> mca_base_framework_open -> mca_pml_base_open . mca_bml_base_open...
>>> -> btl_openib_component_register()
>>>
>>> *   For all string variables it allocates a memory block
>>>     (var->mbv_storage = PTR)
>>>
>>> At this time a new var group id:114 (of parent group id: 112) is
>>> created for all openib component variables.
>>>
>>> 2) This var group is de-registered in ompi_mpi_init. This marks all
>>> variables as invalid, but the group still exists.
>>> main > start_pes> shmem_init -> oshmem_shmem_init ->
>>> mca_pml_base_select -> mca_base_components_close -> ... ->
>>> mca_bml_base_close -> mca_base_framework_close ->
>>> mca_base_var_group_deregister(groupid: 114)
>>>
>>> *   All string variables' memory is deallocated (set var->mbv_storage =
>>>     NULL;)
>>>
>>> 3) Because of step 2), the btl_openib.so shared lib is dlclosed.
>>>
>>> 4) Now we are reopening openib in yoda and registering the openib
>>> variables again.
>>> main > start_pes> shmem_init > oshmem_shmem_init -> _shmem_init ->
>>> mca_base_framework_open -> mca_spml_base_open>
>>> mca_spml_yoda_component_open-> . mca_bml_base_open... ->
>>> btl_openib_component_register -> register_variables()
>>>
>>> *   In register_variables(), var_find() finds this variable (from the
>>>     same old group: 114) and resets the variables.
>>> *   For string variables, it allocates the buffers again
>>>     (var->mbv_storage = PTR).
>>> *   Note that group 114 does not belong to the yoda component.
>>>
>>> 5) In yoda component close, it never finds the above group (114) because
>>> the group does not belong to this component. So it does not call
>>> mca_base_var_group_deregister() again on the var group, and the string
>>> var memory is not deallocated.
>>> main > start_pes> shmem_init > oshmem_shmem_init -> _shmem_init ->
>>> mca_spml_base_select ->..> mca_spml_yoda_component_close ->
>>> mca_bml_base_close -> mca_base_var_group_find().
>>>
>>> 6) Because of step 5), btl_openib.so is dlclosed. This step
>>> invalidates all openib string var memory (var->mbv_storage = PTR)
>>> allocated in step 4).
>>>
>>> 7) ompi_mpi_finalize() loops through all vars, finalizes them, and
>>> deallocates the string var memory (var->mbv_storage = PTR).
>>> ompi_mpi_finalize >...> mca_base_var_finalize
>>>
>>> *   var->mbv_storage = PTR is invalid at this stage, which causes the
>>>     SEGFAULT.
>>>
>>>
>>> This also explains why Dinar's patch, kostul_fix.patch
>>> (http://bgate.mellanox.com/redmine/attachments/1643/kostul_fix.patch),
>>> resolves the issue. His patch prevents you from finding the invalid
>>> already opened params.
>>> So, I see in a lot of these registration functions the signature has an
>>> entry for the project name, but for now NULL is always passed. I see a
>>> note by Nathan in
>>>
>>> ../opal/mca/base/mca_base_var.c +1311
>>> {
>>> /* XXX -- component_update -- We will stash the project name in the
>>> component */
>>> return mca_base_var_register (NULL, component->mca_type_name,
>>>
>>>
>>> Seems knowing the project name, oshmem, would allow us to distinguish
>>> between the different BMLs.
>>>
>>> Nathan, please advise.
>>>
>>> Josh
>>>
>>>
>>> -Original Message-
>>> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Nathan
>>>Hjelm
>>> Sent: Monday, December 16, 2013 12:44 PM
>>> To: Open MPI Developers
>>> Subject: Re: [OMPI devel] bug in mca framework?
>>>
>>> On Mon, Dec 16, 2013 at 05:21:05PM +, Joshua Ladd wrote:
 After 
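
Editorial sketch for the thread above. Devendar's sequence comes down to
two independent open/close cycles sharing one piece of variable storage,
and Brian's fix is to open the bml and btls once, in ompi_mpi_init. The
code below is not the MCA implementation; struct framework and both
functions are hypothetical, and it only illustrates the general idea of a
single owner with a reference count, so that a second opener neither
re-registers variables nor frees storage that is still in use:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a framework owning one string variable. */
struct framework {
    int   refcount;       /* number of outstanding opens */
    char *storage;        /* plays the role of var->mbv_storage */
    int   registrations;  /* how often variables were registered */
};

static int framework_open(struct framework *fw)
{
    if (fw->refcount++ > 0)
        return 0;                 /* already open: nothing to register */
    fw->storage = malloc(16);
    if (NULL == fw->storage) {
        fw->refcount = 0;
        return -1;
    }
    strcpy(fw->storage, "default");
    fw->registrations++;
    return 0;
}

static void framework_close(struct framework *fw)
{
    assert(fw->refcount > 0);
    if (--fw->refcount > 0)
        return;                   /* another user still holds it open */
    free(fw->storage);
    fw->storage = NULL;           /* leave no stale pointer behind */
}
```

Opening twice and closing once leaves the storage valid; only the last
close releases it, so a later finalize pass never sees a dangling pointer.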

Re: [OMPI devel] return value of opal_compress_base_register() in opal/mca/compress/base/compress_base_open.c

2014-01-02 Thread Josh Hursey
I think the only reason I protected that framework is to reduce the
overhead of an application using a build of Open MPI with CR support, but
not enabling it at runtime. Nothing in the compress framework depends on the
CR infrastructure (although the CR infrastructure can use the compress
framework if the user chooses to). So I bet we can remove the protection
altogether and be fine.

So I think this patch is fine. I might also go as far as removing the 'if'
block altogether as the protection should not be needed any longer.

Thanks,
Josh



On Fri, Dec 27, 2013 at 3:46 PM, Adrian Reber  wrote:

> Right now the C/R code fails because of a change introduced in
> opal/mca/compress/base/compress_base_open.c in 2013 with commit
>
> git 734c724ff76d9bf814f3ab0396bcd9ee6fddcd1b
> svn r28239
>
> Update OPAL frameworks to use the MCA framework system.
>
> This commit changed a lot, including the return value of the function, from
> OPAL_SUCCESS to OPAL_ERR_NOT_AVAILABLE. With the following patch I can make
> it work again:
>
> diff --git a/opal/mca/compress/base/compress_base_open.c
> b/opal/mca/compress/base/compress_base_open.c
> index a09fe59..69eabfa 100644
> --- a/opal/mca/compress/base/compress_base_open.c
> +++ b/opal/mca/compress/base/compress_base_open.c
> @@ -45,11 +45,11 @@ MCA_BASE_FRAMEWORK_DECLARE(opal, compress, NULL,
> opal_compress_base_register, op
>
>  static int opal_compress_base_register (mca_base_register_flag_t flags)
>  {
>  /* Compression currently only used with C/R */
>  if( !opal_cr_is_enabled ) {
>  opal_output_verbose(10,
> opal_compress_base_framework.framework_output,
>  "compress:open: FT is not enabled,
> skipping!");
> -return OPAL_ERR_NOT_AVAILABLE;
> +return OPAL_SUCCESS;
>  }
>
>  return OPAL_SUCCESS;
>
>
> My question is whether OPAL_ERR_NOT_AVAILABLE is indeed the correct return
> value, in which case the function calling opal_compress_base_register()
> should actually handle OPAL_ERR_NOT_AVAILABLE, or whether returning
> OPAL_SUCCESS is still the right return value.
>
> Adrian
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>



-- 
Joshua Hursey
Assistant Professor of Computer Science
University of Wisconsin-La Crosse
http://cs.uwlax.edu/~jjhursey
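
Editorial sketch of the return-value question in this thread. It assumes
(this is not confirmed in the thread) that the caller of the register
function treats any non-success value as fatal for the framework open.
All names and the error value are hypothetical and are not the OPAL
definitions:

```c
#include <stdbool.h>

enum { SKETCH_SUCCESS = 0, SKETCH_ERR_NOT_AVAILABLE = -16 };

typedef int (*register_fn)(bool cr_enabled);

/* Behaviour after the framework conversion: report "not available". */
static int register_strict(bool cr_enabled)
{
    return cr_enabled ? SKETCH_SUCCESS : SKETCH_ERR_NOT_AVAILABLE;
}

/* Behaviour with Adrian's patch: skipping is not an error. */
static int register_lenient(bool cr_enabled)
{
    (void)cr_enabled;
    return SKETCH_SUCCESS;
}

/* Hypothetical caller that gives up on any registration error. */
static int framework_open(register_fn reg, bool cr_enabled, bool *opened)
{
    int rc = reg(cr_enabled);
    *opened = (SKETCH_SUCCESS == rc);
    return rc;
}
```

Under that assumption, the strict variant prevents the open whenever C/R
is disabled, while the lenient variant lets it proceed. Removing the 'if'
block entirely, as suggested above, would behave like the lenient variant.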


Re: [OMPI devel] [PATCH v3 1/2] Trying to get the C/R code to compile again. (recv_*_nb)

2014-01-02 Thread Josh Hursey
+1



On Thu, Dec 19, 2013 at 4:04 PM, Ralph Castain  wrote:

> Looks okay to me. On the places where you need to block while waiting for
> an answer, you can use OMPI_WAIT_FOR_COMPLETION - this will spin on
> opal_progress until the condition is met. We use it elsewhere for similar
> purposes.
>
> See ompi/mca/rte/rte.h for the definition
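
Editorial sketch of the wait-for-completion pattern Ralph describes. The
real macro is defined in the header he names; the macro, the flag polarity
(true while a reply is outstanding) and the simulated progress function
below are assumptions made for illustration only:

```c
#include <stdbool.h>

static bool active = false;     /* true while a reply is outstanding */
static int  ticks_until_reply;  /* simulated latency, in progress calls */
static int  progress_calls = 0; /* how often the engine was driven */

/* Callback the non-blocking receive would invoke on arrival. */
static void recv_callback(void)
{
    active = false;
}

/* Stand-in for opal_progress(): delivers the reply after some calls. */
static void progress(void)
{
    progress_calls++;
    if (ticks_until_reply > 0 && --ticks_until_reply == 0)
        recv_callback();
}

/* Hypothetical macro: drive progress until the flag is cleared. */
#define SKETCH_WAIT_FOR_COMPLETION(flag) \
    do { while (flag) progress(); } while (0)

/* Emulates a blocking receive on top of a non-blocking one. */
static int recv_and_wait(int latency)
{
    ticks_until_reply = latency < 1 ? 1 : latency;
    active = true;              /* a non-blocking receive is now posted */
    SKETCH_WAIT_FOR_COMPLETION(active);
    return 0;
}
```

The caller blocks in the sense that it does not return until the callback
has fired, but the progress engine keeps running in the meantime.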
>
>
> On Dec 19, 2013, at 12:54 PM, Adrian Reber  wrote:
>
> > From: Adrian Reber 
> >
> > This patch changes all recv/recv_buffer occurrences in the C/R code
> > to recv_nb/recv_buffer_nb.
> > The old code is still there but disabled using ifdefs (ENABLE_FT_FIXED).
> > The new code compiles but does not work.
> >
> > Changes from V1:
> > * #ifdef out the code (so it is preserved for later re-design)
> > * marked the broken C/R code with ENABLE_FT_FIXED
> >
> > Changes from V2:
> > * only #ifdef out the code where the behaviour is changed
> >  (used to be blocking; now non-blocking)
> >
> > Signed-off-by: Adrian Reber 
> > ---
> > ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c| 41 +
> > orte/mca/errmgr/base/errmgr_base_tool.c | 16 +
> > orte/mca/rml/ftrm/rml_ftrm.h| 27 ++---
> > orte/mca/rml/ftrm/rml_ftrm_component.c  |  2 -
> > orte/mca/rml/ftrm/rml_ftrm_module.c | 78
> +++--
> > orte/mca/snapc/full/snapc_full_app.c| 12 
> > orte/mca/snapc/full/snapc_full_global.c | 37 +++-
> > orte/mca/snapc/full/snapc_full_local.c  | 36 +++-
> > orte/mca/sstore/central/sstore_central_app.c|  6 ++
> > orte/mca/sstore/central/sstore_central_global.c | 17 +-
> > orte/mca/sstore/central/sstore_central_local.c  | 17 +-
> > orte/mca/sstore/stage/sstore_stage_app.c|  5 ++
> > orte/mca/sstore/stage/sstore_stage_global.c | 17 +-
> > orte/mca/sstore/stage/sstore_stage_local.c  | 17 +-
> > orte/tools/orte-checkpoint/orte-checkpoint.c| 16 +
> > orte/tools/orte-migrate/orte-migrate.c  | 16 +
> > 16 files changed, 87 insertions(+), 273 deletions(-)
> >
> > diff --git a/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c
> b/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c
> > index 5d4005f..05cd598 100644
> > --- a/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c
> > +++ b/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c
> > @@ -4717,7 +4717,6 @@ static int ft_event_post_drain_acks(void)
> > ompi_crcp_bkmrk_pml_drain_message_ack_ref_t * drain_msg_ack = NULL;
> > opal_list_item_t* item = NULL;
> > size_t req_size;
> > -int ret;
> >
> > req_size  = opal_list_get_size(&drain_msg_ack_list);
> > if(req_size <= 0) {
> > @@ -4739,17 +4738,8 @@ static int ft_event_post_drain_acks(void)
> > drain_msg_ack =
> (ompi_crcp_bkmrk_pml_drain_message_ack_ref_t*)item;
> >
> > /* Post the receive */
> > -if( OMPI_SUCCESS != (ret = ompi_rte_recv_buffer_nb(
> &drain_msg_ack->peer,
> > -
>  OMPI_CRCP_COORD_BOOKMARK_TAG,
> > -0,
> > -
>  drain_message_ack_cbfunc,
> > -NULL) ) ) {
> > -opal_output(mca_crcp_bkmrk_component.super.output_handle,
> > -"crcp:bkmrk: %s <-- %s: Failed to post a RML
> receive to the peer\n",
> > -OMPI_NAME_PRINT(OMPI_PROC_MY_NAME),
> > -OMPI_NAME_PRINT(&(drain_msg_ack->peer)));
> > -return ret;
> > -}
> > +ompi_rte_recv_buffer_nb(&drain_msg_ack->peer,
> OMPI_CRCP_COORD_BOOKMARK_TAG,
> > +0, drain_message_ack_cbfunc, NULL);
> > }
> >
> > return OMPI_SUCCESS;
> > @@ -5322,26 +5312,14 @@ static int send_bookmarks(int peer_idx)
> > static int recv_bookmarks(int peer_idx)
> > {
> > ompi_process_name_t peer_name;
> > -int exit_status = OMPI_SUCCESS;
> > -int ret;
> >
> > START_TIMER(CRCP_TIMER_CKPT_EX_PEER_R);
> >
> > peer_name.jobid  = OMPI_PROC_MY_NAME->jobid;
> > peer_name.vpid   = peer_idx;
> >
> > -if ( 0 > (ret = ompi_rte_recv_buffer_nb(&peer_name,
> > -
>  OMPI_CRCP_COORD_BOOKMARK_TAG,
> > -0,
> > -recv_bookmarks_cbfunc,
> > -NULL) ) ) {
> > -opal_output(mca_crcp_bkmrk_component.super.output_handle,
> > -"crcp:bkmrk: recv_bookmarks: Failed to post receive
> bookmark from peer %s: Return %d\n",
> > -OMPI_NAME_PRINT(&peer_name),
> > -ret);
> > -exit_status = ret;
> > -goto cleanup;
> > -}
> > +ompi_rte_recv_buffer_nb(&peer_name, OMPI_CRCP_COORD_BOOKMARK_TAG,
> > +0, recv_bookmarks_cbfunc, NULL);
> >
> > ++total_recv_bookmarks;
> >
> > @@ -5350,7 +5328,7 @@ static int recv_bookmarks(int peer_idx)
> >

Re: [OMPI devel] [PATCH v3 2/2] Trying to get the C/R code to compile again. (send_*_nb)

2014-01-02 Thread Josh Hursey
(Sorry for the delay, just catching up on email after the holidays)

I think that looks good too.


On Thu, Dec 19, 2013 at 4:01 PM, Ralph Castain  wrote:

> +1 from me
>
>
> On Dec 19, 2013, at 12:54 PM, Adrian Reber  wrote:
>
> > From: Adrian Reber 
> >
> > This patch changes all send/send_buffer occurrences in the C/R code
> > to send_nb/send_buffer_nb.
> > The new code compiles but does not work.
> >
> > Changes from V1:
> > * #ifdef out the code (so it is preserved for later re-design)
> > * marked the broken C/R code with ENABLE_FT_FIXED
> >
> > Changes from V2:
> > * just replace the blocking calls with the non-blocking calls
> > * all #ifdef's introduced in V1 are gone
> > * send_* returns error code or ORTE_SUCCESS (not the number of bytes)
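
Editorial sketch of why the return-value check changes along with the
call. It assumes, from the note above, that the old blocking call
returned a byte count on success and a negative value on error, while the
non-blocking call returns only a status code. The functions and constants
are hypothetical stand-ins, not the ORTE or OMPI API:

```c
#include <stddef.h>

enum { SKETCH_SUCCESS = 0, SKETCH_ERROR = -1 };

/* Old convention: bytes sent on success, negative on error. */
static int send_blocking(const void *buf, size_t len)
{
    return (NULL != buf) ? (int)len : SKETCH_ERROR;
}

/* New convention: status code only; completion comes via callback. */
static int send_nb(const void *buf, size_t len)
{
    (void)len;
    return (NULL != buf) ? SKETCH_SUCCESS : SKETCH_ERROR;
}

/* Check used with the old call: any negative value is an error. */
static int failed_old_style(int ret)
{
    return 0 > ret;
}

/* Check used with the new call: anything but success is an error. */
static int failed_new_style(int ret)
{
    return SKETCH_SUCCESS != ret;
}
```

Applying the new-style check to the old call would report every
successful send as a failure, because a positive byte count is not equal
to the success code. That is why each call site changes both the function
and the comparison.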
> >
> > Signed-off-by: Adrian Reber 
> > ---
> > ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c| 23 ++
> > orte/mca/errmgr/base/errmgr_base_tool.c |  4 +-
> > orte/mca/rml/ftrm/rml_ftrm.h| 19 
> > orte/mca/rml/ftrm/rml_ftrm_component.c  |  2 -
> > orte/mca/rml/ftrm/rml_ftrm_module.c | 61
> +++--
> > orte/mca/snapc/full/snapc_full_app.c| 20 ++--
> > orte/mca/snapc/full/snapc_full_global.c | 15 --
> > orte/mca/snapc/full/snapc_full_local.c  |  4 +-
> > orte/mca/sstore/central/sstore_central_app.c|  8 +++-
> > orte/mca/sstore/central/sstore_central_global.c |  4 +-
> > orte/mca/sstore/central/sstore_central_local.c  | 12 +++--
> > orte/mca/sstore/stage/sstore_stage_app.c|  8 +++-
> > orte/mca/sstore/stage/sstore_stage_global.c |  4 +-
> > orte/mca/sstore/stage/sstore_stage_local.c  | 16 +--
> > orte/tools/orte-checkpoint/orte-checkpoint.c|  4 +-
> > orte/tools/orte-migrate/orte-migrate.c  |  4 +-
> > 16 files changed, 99 insertions(+), 109 deletions(-)
> >
> > diff --git a/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c
> b/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c
> > index 05cd598..5ad9a3e 100644
> > --- a/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c
> > +++ b/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c
> > @@ -5077,7 +5077,7 @@ static int wait_quiesce_drained(void)
> >  "crcp:bkmrk: %s --> %s Send ACKs to
> Peer\n",
> >  OMPI_NAME_PRINT(OMPI_PROC_MY_NAME),
> >
>  OMPI_NAME_PRINT(&(cur_peer_ref->proc_name)) ));
> > -
> > +
> > /* Send All Clear to Peer */
> > if (NULL == (buffer = OBJ_NEW(opal_buffer_t))) {
> > exit_status = OMPI_ERROR;
> > @@ -5087,7 +5087,9 @@ static int wait_quiesce_drained(void)
> > PACK_BUFFER(buffer, response, 1, OPAL_SIZE, "");
> >
> > /* JJH - Performance Optimization? - Why not post all
> isends, then wait? */
> > -if ( 0 > ( ret =
> ompi_rte_send_buffer(&(cur_peer_ref->proc_name), buffer,
> OMPI_CRCP_COORD_BOOKMARK_TAG, 0)) ) {
> > +if (ORTE_SUCCESS != (ret =
> ompi_rte_send_buffer_nb(&(cur_peer_ref->proc_name),
> > +   buffer,
> OMPI_CRCP_COORD_BOOKMARK_TAG,
> > +
> orte_rml_send_callback, NULL))) {
> > exit_status = ret;
> > goto cleanup;
> > }
> > @@ -5288,7 +5290,9 @@ static int send_bookmarks(int peer_idx)
> > PACK_BUFFER(buffer, (peer_ref->total_msgs_recvd), 1, OPAL_UINT32,
> > "crcp:bkmrk: send_bookmarks: Unable to pack
> total_msgs_recvd");
> >
> > -if ( 0 > ( ret = ompi_rte_send_buffer(&peer_name, buffer,
> OMPI_CRCP_COORD_BOOKMARK_TAG, 0)) ) {
> > +if (ORTE_SUCCESS != (ret = ompi_rte_send_buffer_nb(&peer_name,
> buffer,
> > +
>  OMPI_CRCP_COORD_BOOKMARK_TAG,
> > +
>  orte_rml_send_callback, NULL))) {
> > opal_output(mca_crcp_bkmrk_component.super.output_handle,
> > "crcp:bkmrk: send_bookmarks: Failed to send bookmark
> to peer %s: Return %d\n",
> > OMPI_NAME_PRINT(&peer_name),
> > @@ -5567,13 +5571,14 @@ static int
> do_send_msg_detail(ompi_crcp_bkmrk_pml_peer_ref_t *peer_ref,
> > /*
> >  * Do the send...
> >  */
> > -if ( 0 > ( ret = ompi_rte_send_buffer(&peer_ref->proc_name, buffer,
> > -  OMPI_CRCP_COORD_BOOKMARK_TAG,
> 0)) ) {
> > +if (ORTE_SUCCESS != (ret =
> ompi_rte_send_buffer_nb(&peer_ref->proc_name, buffer,
> > +
> OMPI_CRCP_COORD_BOOKMARK_TAG,
> > +
> orte_rml_send_callback, NULL))) {
> > opal_output(mca_crcp_bkmrk_component.super.output_handle,
> > "crcp:bkmrk: do_send_msg_detail: Unable to send
> message details to peer %s: Return %d\n",
> > OMPI_NAME_PRINT(&peer_ref->proc_name),
> > ret);
> > -
> > +
> > exit_status = OMPI_ERROR;
> > goto cleanup;
> > }
> > @@ -6185,8 +6190,10 @@ static int
> do_recv_msg_detail_resp(ompi_crcp_bkmrk_pml_peer_ref_t 

Re: [OMPI devel] [PATCH v2 1/2] Trying to get the C/R code to compile again. (recv_*_nb)

2014-01-02 Thread Josh Hursey
(Sorry for the delay, just catching up on email after the holidays)

I agree with Ralph. You can remove the old function signatures, but keep
the places where you replace a blocking send/recv with a non-blocking
version. Then I think it is good.

Thanks,
Josh



On Wed, Dec 18, 2013 at 9:52 AM, Ralph Castain  wrote:

> Hi Adrian
>
> No point in keeping the old code for those places where you update the
> syntax of a non-blocking recv (i.e., you remove the no-longer-reqd extra
> param). I would only keep it where you have to replace a blocking recv with
> a non-blocking one as that is where the behavior will change.
>
> Other than that, it looks fine to me.
>
> On Dec 18, 2013, at 6:42 AM, Adrian Reber  wrote:
>
> > From: Adrian Reber 
> >
> > This patch changes all recv/recv_buffer occurrences in the C/R code
> > to recv_nb/recv_buffer_nb.
> > The old code is still there but disabled using ifdefs (ENABLE_FT_FIXED).
> > The new code compiles but does not work.
> >
> > Changes from V1:
> > * #ifdef out the code (so it is preserved for later re-design)
> > * marked the broken C/R code with ENABLE_FT_FIXED
> >
> > Signed-off-by: Adrian Reber 
> > ---
> > ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c| 19 ++
> > orte/mca/errmgr/base/errmgr_base_tool.c |  6 +-
> > orte/mca/rml/ftrm/rml_ftrm.h| 27 ++---
> > orte/mca/rml/ftrm/rml_ftrm_component.c  |  2 -
> > orte/mca/rml/ftrm/rml_ftrm_module.c | 78
> +++--
> > orte/mca/snapc/full/snapc_full_app.c| 12 
> > orte/mca/snapc/full/snapc_full_global.c | 25 
> > orte/mca/snapc/full/snapc_full_local.c  | 24 
> > orte/mca/sstore/central/sstore_central_app.c|  6 ++
> > orte/mca/sstore/central/sstore_central_global.c | 11 ++--
> > orte/mca/sstore/central/sstore_central_local.c  | 11 ++--
> > orte/mca/sstore/stage/sstore_stage_app.c|  5 ++
> > orte/mca/sstore/stage/sstore_stage_global.c | 11 ++--
> > orte/mca/sstore/stage/sstore_stage_local.c  | 11 ++--
> > orte/tools/orte-checkpoint/orte-checkpoint.c|  9 ++-
> > orte/tools/orte-migrate/orte-migrate.c  |  9 ++-
> > 16 files changed, 124 insertions(+), 142 deletions(-)
> >
> > diff --git a/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c
> b/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c
> > index 5d4005f..cba7586 100644
> > --- a/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c
> > +++ b/ompi/mca/crcp/bkmrk/crcp_bkmrk_pml.c
> > @@ -4739,6 +4739,8 @@ static int ft_event_post_drain_acks(void)
> > drain_msg_ack =
> (ompi_crcp_bkmrk_pml_drain_message_ack_ref_t*)item;
> >
> > /* Post the receive */
> > +#ifdef ENABLE_FT_FIXED
> > +/* This is the old, now broken code */
> > if( OMPI_SUCCESS != (ret = ompi_rte_recv_buffer_nb(
> &drain_msg_ack->peer,
> >
> OMPI_CRCP_COORD_BOOKMARK_TAG,
> > 0,
> > @@ -4750,6 +4752,9 @@ static int ft_event_post_drain_acks(void)
> > OMPI_NAME_PRINT(&(drain_msg_ack->peer)));
> > return ret;
> > }
> > +#endif /* ENABLE_FT_FIXED */
> > +ompi_rte_recv_buffer_nb(&drain_msg_ack->peer,
> OMPI_CRCP_COORD_BOOKMARK_TAG,
> > +0, drain_message_ack_cbfunc, NULL);
> > }
> >
> > return OMPI_SUCCESS;
> > @@ -5330,6 +5335,8 @@ static int recv_bookmarks(int peer_idx)
> > peer_name.jobid  = OMPI_PROC_MY_NAME->jobid;
> > peer_name.vpid   = peer_idx;
> >
> > +#ifdef ENABLE_FT_FIXED
> > +/* This is the old, now broken code */
> > if ( 0 > (ret = ompi_rte_recv_buffer_nb(&peer_name,
> > OMPI_CRCP_COORD_BOOKMARK_TAG,
> > 0,
> > @@ -5342,6 +5349,9 @@ static int recv_bookmarks(int peer_idx)
> > exit_status = ret;
> > goto cleanup;
> > }
> > +#endif /* ENABLE_FT_FIXED */
> > +ompi_rte_recv_buffer_nb(&peer_name, OMPI_CRCP_COORD_BOOKMARK_TAG,
> > +   0, recv_bookmarks_cbfunc, NULL);
> >
> > ++total_recv_bookmarks;
> >
> > @@ -5616,6 +5626,8 @@ static int
> do_send_msg_detail(ompi_crcp_bkmrk_pml_peer_ref_t *peer_ref,
> > /*
> >  * Recv the ACK msg
> >  */
> > +#ifdef ENABLE_FT_FIXED
> > +/* This is the old, now broken code */
> > if ( 0 > (ret = ompi_rte_recv_buffer(&peer_ref->proc_name, buffer,
> >  OMPI_CRCP_COORD_BOOKMARK_TAG,
> 0) ) ) {
> > opal_output(mca_crcp_bkmrk_component.super.output_handle,
> > @@ -5626,6 +5638,9 @@ static int
> do_send_msg_detail(ompi_crcp_bkmrk_pml_peer_ref_t *peer_ref,
> > exit_status = ret;
> > goto cleanup;
> > }
> > +#endif /* ENABLE_FT_FIXED */
> > +ompi_rte_recv_buffer_nb(&peer_ref->proc_name,
> OMPI_CRCP_COORD_BOOKMARK_TAG, 0,
> > +orte_rml_recv_callback, NULL);