I thought we had fixed this issue… the problem that never dies…
Most versions of OTF2 (2.2 and lower, I believe) have an uninitialized variable
that sometimes leads to this error message and causes the archive
initialization to bail out early, which then leads to other problems. Which
version of OTF2 are you using?
I just checked the 2.2 and 2.3 source code, and the bug is still there (the
`status` variable is uninitialized for ranks other than 0). There is no public
GitHub repository for OTF2 to link to, sadly, so here is the relevant snippet
from src/otf2_archive_int.c:
/************/
    OTF2_ErrorCode status;

    /* It is time to create the directories by the root rank. */
    if ( archive->file_mode == OTF2_FILEMODE_WRITE )
    {
        if ( otf2_archive_is_master( archive ) )
        {
            status = otf2_archive_create_directory( archive );
        }

        OTF2_CallbackCode callback_ret =
            otf2_collectives_bcast_error( archive,
                                          archive->global_comm_context,
                                          &status,
                                          OTF2_COLLECTIVES_ROOT );
        if ( OTF2_CALLBACK_SUCCESS != callback_ret )
        {
            status = UTILS_ERROR( OTF2_ERROR_COLLECTIVE_CALLBACK,
                                  "Can't broadcast failed for result of creating the directories." );
            goto out;
        }

        if ( OTF2_SUCCESS != status ) /**** <--- THIS WILL RESOLVE AS TRUE
                                                 IF status HAS A GARBAGE VALUE! ****/
        {
            UTILS_ERROR( status, "Couldn't create directories on root." );
            goto out;
        }
    }
/*********************/
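The fix is trivial: give `status` a defined value at its declaration, so the
check after the broadcast cannot misfire on ranks where the broadcast callback
never actually writes to it. A minimal sketch of what I have in mind (not the
final patch file):

    /* Initialize to success so ranks whose broadcast callback is a
       no-op don't read a garbage value in the check below. */
    OTF2_ErrorCode status = OTF2_SUCCESS;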
Also, HPX _should_ be passing in the number of ranks when APEX is initialized;
if that’s not happening, this might be an HPX bug (not likely).
Finally, during startup APEX checks for MPI and SLURM environment variables
(https://github.com/UO-OACISS/apex/blob/07471212e1c826d275e911e79c6128ac9a05bb66/src/apex/utils.cpp#L419-L533),
in case the initialization is lied to about the rank and size
(https://github.com/UO-OACISS/apex/blob/07471212e1c826d275e911e79c6128ac9a05bb66/src/apex/apex.cpp#L388-L391).
It’s possible I am checking the wrong SLURM environment variables. Could you
please run something like the following on your system (with a small test case)
and see what you get?
`srun <srun arguments> env | grep SLURM`
That should confirm whether I am checking/interpreting the correct SLURM
variables, and whether SLURM is actually setting them as expected.
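For reference, the kind of check APEX does is roughly the following. This is
just an illustrative sketch using getenv() on two common srun-provided
variables (SLURM_PROCID and SLURM_NTASKS), not the actual APEX code - the real
list of variables it consults is in the utils.cpp link above:

    #include <stdio.h>
    #include <stdlib.h>

    int main( void )
    {
        /* srun normally exports these for every task it launches. */
        const char* rank_str = getenv( "SLURM_PROCID" ); /* rank of this task     */
        const char* size_str = getenv( "SLURM_NTASKS" ); /* total number of tasks */

        long rank = rank_str ? strtol( rank_str, NULL, 10 ) : 0;
        long size = size_str ? strtol( size_str, NULL, 10 ) : 1;

        printf( "detected rank %ld of %ld\n", rank, size );
        return 0;
    }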
I would test a patched OTF2 first and see if that helps. I will create a patch
file for this bug; if you let APEX build OTF2 automatically, the patch will be
applied as part of that build.
Thanks!
Kevin
> On Sep 3, 2021, at 8:08 AM, Hartmut Kaiser <[email protected]> wrote:
>
> I'm tying in Kevin, as he might have a solution.
>
> Regards Hartmut
> ---------------
> https://stellar-group.org
> https://github.com/STEllAR-GROUP/hpx
>
>
>> -----Original Message-----
>> From: [email protected] <hpx-users-bounces@stellar-
>> group.org> On Behalf Of Kor de Jong
>> Sent: Friday, September 3, 2021 9:40 AM
>> To: [email protected]; Kilian Werner <[email protected]>
>> Cc: [email protected]
>> Subject: Re: [hpx-users] Generate OTF2 traces of distributed runs with
>> APEX, re-revisited
>>
>> Hi Kilian,
>>
>> Thanks for your reply. Indeed, passing '--hpx:ignore-batch-env --
>> hpx:localities=[N]' gets rid of the warning messages. Great! You mention
>> this only works when using a single locality per node, but it seems to
>> work in my case with 8 localities on a single node as well. Or do I
>> misunderstand you maybe?
>>
>> OTF2 still has issues writing the trace, unfortunately. The same error
>> messages are printed:
>>
>> [OTF2] src/otf2_archive_int.c:1108: error: Unknown error code:
>> Couldn't create directories on root.
>>
>> I noticed they are printed 7 times, so it seems one process is able to
>> write its trace, while the other 7 are not. This confirms John's remark
>> that "Ranks 1-N-1 try to create the otf files and clobber each other".
>>
>> Kor
>>
>>
>> On 9/3/21 3:55 PM, Kilian Werner wrote:
>>> Hi Kor,
>>>
>>> regarding the SLURM error there is a workaround that I was taught in
>>> this issue: https://github.com/STEllAR-GROUP/hpx/issues/4297
>>>
>>> ./application --hpx:ignore-batch-env --hpx:localities=[N]
>>>
>>> Where [N] is the number of nodes you use (works only for one locality
>>> per node). This should get rid of the "every locality thinks its rank 0"
>>> problem. Please let me know if it works and the two issues were indeed
>>> related.
>>>
>>> Kind regards,
>>>
>>> Kilian
>> _______________________________________________
>> hpx-users mailing list
>> [email protected]
>> https://mail.cct.lsu.edu/mailman/listinfo/hpx-users
>
>
--
Kevin Huck, PhD
Research Associate / Computer Scientist
OACISS - Oregon Advanced Computing Institute for Science and Society
University of Oregon
[email protected]
http://tau.uoregon.edu
http://oaciss.uoregon.edu
_______________________________________________
hpx-users mailing list
[email protected]
https://mail.cct.lsu.edu/mailman/listinfo/hpx-users