Folks,

I found (at least) two issues with oshmem put when btl/vader is used with
knem enabled:

$ oshrun -np 2 --mca btl vader,self ./oshmem_max_reduction
--------------------------------------------------------------------------
SHMEM_ABORT was invoked on rank 0 (pid 11936, host=soleil) with
errorcode -1.
--------------------------------------------------------------------------
[soleil.iferc.local:11934] 1 more process has sent help message
help-shmem-api.txt / shmem-abort
[soleil.iferc.local:11934] Set MCA parameter "orte_base_help_aggregate"
to 0 to see all help / error messages


the error message is not helpful at all ...
the abort happens in the vader btl in mca_btl_vader_put_knem
    if (OPAL_UNLIKELY(0 != ioctl (mca_btl_vader.knem_fd, KNEM_CMD_INLINE_COPY, &icopy))) {
        return OPAL_ERROR;
    }
ioctl fails with EACCES

the root cause is that the symmetric memory was "prepared" with
vader_prepare_src, which uses
knem_cr.protection = PROT_READ;

a trivial workaround (probably not good for production) is to use
knem_cr.protection = PROT_READ|PROT_WRITE;
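
to illustrate why the copy is refused, here is a minimal sketch of the
permission check as i understand it. this is only a model : region_model
and check_access are hypothetical names, not the knem or vader API, and the
real check happens inside the knem driver.

```c
#include <errno.h>
#include <sys/mman.h>   /* PROT_READ, PROT_WRITE */

/* hypothetical stand-in for a registered region; NOT the knem API */
struct region_model {
    int protection;     /* PROT_* bits granted when the region was created */
};

/* return 0 if every requested PROT_* bit was granted, -EACCES otherwise */
static int check_access(const struct region_model *r, int wanted)
{
    return ((r->protection & wanted) == wanted) ? 0 : -EACCES;
}
```

with protection set to PROT_READ only, a put (which needs PROT_WRITE on the
target region) is rejected in this model, whereas PROT_READ|PROT_WRITE lets
both get and put through.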


then we run into the second issue :

in mca_btl_vader_put_knem :
    icopy.remote_offset     = 0;

and this is clearly not what we want ...
in my environment, we want to put to 0x0600df0, so the remote_offset
should be 0xdf0 since the symmetric memory was "prepared" starting at
0x0600000
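
the arithmetic above can be sketched as follows. remote_offset_for,
seg_base and target are illustrative names i made up for this example,
they are not vader or yoda identifiers, and this assumes the target address
lies inside the prepared segment.

```c
#include <stdint.h>

/* offset of a target address inside a segment that was "prepared"
 * starting at seg_base; illustrative helper, not an Open MPI function */
static uint64_t remote_offset_for(uint64_t seg_base, uint64_t target)
{
    return target - seg_base;
}
```

with seg_base = 0x0600000 and target = 0x0600df0, this gives 0xdf0, which
is the value icopy.remote_offset would need in my environment.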

i do not think the vader btl is to be blamed here ... i'd rather think
the way yoda uses the btl is not correct (but only for put with the vader
btl when knem is used)

i can get the test program to run correctly by manually setting
icopy.remote_offset with a debugger.

please note i fixed a typo in the vader btl, so make sure you update to the
latest master.


in the meantime, what about forcing put_via_send to 1 in
mca_spml_yoda_put_internal ?
/* another option is to unset the MCA_BTL_FLAGS_PUT flag in the vader
btl if knem is used, but i do not believe this is a vader issue */

Cheers,

Gilles
