I thought we had fixed this issue… the problem that never dies…
Most versions of OTF2 (2.2 and lower, I believe) have an uninitialized variable
that sometimes leads to this error message and causes the archive
initialization to bail out early, which then leads to other problems. Which
version of OTF2 are you using?
I just checked the 2.2 and 2.3 source code, and the bug is still there (the
`status` variable is uninitialized for ranks other than 0). There is no public
GitHub repository for OTF2 to link to, sadly, so here is the relevant snippet
from src/otf2_archive_int.c:
/************/
    OTF2_ErrorCode status;

    /* It is time to create the directories by the root rank. */
    if ( archive->file_mode == OTF2_FILEMODE_WRITE )
    {
        if ( otf2_archive_is_master( archive ) )
        {
            status = otf2_archive_create_directory( archive );
        }

        OTF2_CallbackCode callback_ret =
            otf2_collectives_bcast_error( archive,
                                          archive->global_comm_context,
                                          &status,
                                          OTF2_COLLECTIVES_ROOT );
        if ( OTF2_CALLBACK_SUCCESS != callback_ret )
        {
            status = UTILS_ERROR( OTF2_ERROR_COLLECTIVE_CALLBACK,
                                  "Can't broadcast failed for result of creating the directories." );
            goto out;
        }

        if ( OTF2_SUCCESS != status ) /**** <--- THIS WILL RESOLVE AS TRUE
                                                 IF status HAS A GARBAGE VALUE! ****/
        {
            UTILS_ERROR( status, "Couldn't create directories on root." );
            goto out;
        }
    }
/*********************/
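The fix is trivial: give `status` a defined value at its declaration, so the
check after the broadcast cannot misfire on ranks where the broadcast callback
never actually writes to it. A minimal sketch of what I have in mind (not the
final patch file):

    /* Initialize to success so ranks whose broadcast callback is a
       no-op don't read a garbage value in the check below. */
    OTF2_ErrorCode status = OTF2_SUCCESS;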
Also, HPX _should_ be passing in the number of ranks when APEX is initialized;
if that’s not happening, this might be an HPX bug (not likely).
Finally, during startup APEX checks for MPI and SLURM environment variables
(https://github.com/UO-OACISS/apex/blob/07471212e1c826d275e911e79c6128ac9a05bb66/src/apex/utils.cpp#L419-L533),
in case the initialization is lied to about the rank and size
(https://github.com/UO-OACISS/apex/blob/07471212e1c826d275e911e79c6128ac9a05bb66/src/apex/apex.cpp#L388-L391).
It’s possible I am checking the wrong SLURM environment variables. Could you
please run something like the following on your system (with a small test case)
and see what you get?
`srun <srun arguments> env | grep SLURM`
That should confirm whether I am checking/interpreting the correct SLURM
variables, and whether SLURM is actually setting them as expected.
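For reference, the kind of check APEX does is roughly the following. This is
just an illustrative sketch using getenv() on two common srun-provided
variables (SLURM_PROCID and SLURM_NTASKS), not the actual APEX code - the real
list of variables it consults is in the utils.cpp link above:

    #include <stdio.h>
    #include <stdlib.h>

    int main( void )
    {
        /* srun normally exports these for every task it launches. */
        const char* rank_str = getenv( "SLURM_PROCID" ); /* rank of this task     */
        const char* size_str = getenv( "SLURM_NTASKS" ); /* total number of tasks */

        long rank = rank_str ? strtol( rank_str, NULL, 10 ) : 0;
        long size = size_str ? strtol( size_str, NULL, 10 ) : 1;

        printf( "detected rank %ld of %ld\n", rank, size );
        return 0;
    }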
I would test a patched OTF2 first and see if that helps. I will create a patch
file for this bug; if you let APEX build OTF2 automatically, the patch will be
applied as part of that build.
Thanks!
Kevin
> On Sep 3, 2021, at 8:08 AM, Hartmut Kaiser <[email protected]> wrote:
>
> I'm tying in Kevin, as he might have a solution.
>
> Regards Hartmut
> ---------------
> https://stellar-group.org
> https://github.com/STEllAR-GROUP/hpx
>
>
>> -----Original Message-----
>> From: [email protected] <hpx-users-bounces@stellar-
>> group.org> On Behalf Of Kor de Jong
>> Sent: Friday, September 3, 2021 9:40 AM
>> To: [email protected]; Kilian Werner <[email protected]>
>> Cc: [email protected]
>> Subject: Re: [hpx-users] Generate OTF2 traces of distributed runs with
>> APEX, re-revisited
>>
>> Hi Kilian,
>>
>> Thanks for your reply. Indeed, passing '--hpx:ignore-batch-env --
>> hpx:localities=[N]' gets rid of the warning messages. Great! You mention
>> this only works when using a single locality per node, but it seems to
>> work in my case with 8 localities on a single node as well. Or do I
>> misunderstand you maybe?
>>
>> OTF2 still has issues writing the trace, unfortunately. The same error
>> messages are printed:
>>
>> [OTF2] src/otf2_archive_int.c:1108: error: Unknown error code:
>> Couldn't create directories on root.
>>
>> I noticed they are printed 7 times, so it seems one process is able to
>> write its trace, while the other 7 are not. This confirms John's remark
>> that "Ranks 1-N-1 try to create the otf files and clobber each other".
>>
>> Kor
>>
>>
>> On 9/3/21 3:55 PM, Kilian Werner wrote:
>>> Hi Kor,
>>>
>>> regarding the SLURM error there is a workaround that I was taught in
>>> this issue: https://github.com/STEllAR-GROUP/hpx/issues/4297
>>>
>>> ./application --hpx:ignore-batch-env --hpx:localities=[N]
>>>
>>> Where [N] is the number of nodes you use (works only for one locality
>>> per node). This should get rid of the "every locality thinks its rank 0"
>>> problem. Please let me know if it works and the two issues were indeed
>>> related.
>>>
>>> Kind regards,
>>>
>>> Kilian
>> _______________________________________________
>> hpx-users mailing list
>> [email protected]
>> https://mail.cct.lsu.edu/mailman/listinfo/hpx-users
>
>
--
Kevin Huck, PhD
Research Associate / Computer Scientist
OACISS - Oregon Advanced Computing Institute for Science and Society
University of Oregon
[email protected]
http://tau.uoregon.edu
http://oaciss.uoregon.edu
_______________________________________________
hpx-users mailing list
[email protected]
https://mail.cct.lsu.edu/mailman/listinfo/hpx-users