OK, I am now on the openmpi-1.9a1r27954 tarball.
In order to build OMPI and compile apps on this machine I must:

1) edit the xe6 platform to --disable-shared/--enable-static (site-specific)
2) edit the xe6 platform file to provide a full path to the alps headers because the logic in orte_check_alps.m4 for default values is wrong
3) edit the xe6 platform file to remove with_devel_headers=yes because --with-devel-headers breaks "make install"
4) edit configure (!!!) to allow ras_alps_CPPFLAGS (and other vars) to get set at configure time
5) edit orte/mca/ras/alps/ras_alps_component.c and/or orte/mca/ras/alps/ras-alps-command.sh with the proper path to apstat (perhaps only one needs to be edited?)

Item (1) is due to site differences, and is not an OMPI bug.
The other 4 have all been reported in one form or another on this list.

Now, the *next* bug is the following:

> $ ./INSTALL/bin/mpirun -mca ras_base_verbose 1 -mca orte_debug_verbose 1 -np 2 ./ring_c 2>&1 | tee -a log
> [nid00704:21984] ras:alps:allocate: Trying ALPS configuration file: "/etc/sysconfig/alps"
> [nid00704:21984] ras:alps:allocate: parser_ini
> [nid00704:21984] ras:alps:allocate: Trying ALPS configuration file: "/etc/alps.conf"
> [nid00704:21984] ras:alps:allocate: Skipping ALPS configuration file: "/etc/alps.conf" (No such file or directory).
> [nid00704:21984] ras:alps:allocate: Could not locate ALPS scheduler file.
> [nid00704:21984] [[8668,0],0] ORTE_ERROR_LOG: Not found in file ../../../../orte/mca/ras/base/ras_base_allocate.c at line 178

My best guess is that this is related to something Ralph said in
http://www.open-mpi.org/community/lists/devel/2013/01/11989.php

> I'm currently tracking down a problem on the Cray XE6 - it appears that
> recent OS release changed the way alps stores allocation info :-(

Looking at the debug output prior to the error, and examining the system, I made the following 1-line addition:

--- openmpi-1.9a1r27954/orte/mca/ras/alps/ras_alps_module.c~ 2013-01-28 23:54:31.443749000 -0800
+++ openmpi-1.9a1r27954/orte/mca/ras/alps/ras_alps_module.c 2013-01-28 23:54:34.770766635 -0800
@@ -74,6 +74,7 @@ static int parser_separated_columns(char
 static const orte_ras_alps_sysconfig_t sysconfigs[] = {
     {"/etc/sysconfig/alps", "ALPS_SHARED_DIR_PATH", parser_ini},
     {"/etc/alps.conf" , "sharedDir" , parser_separated_columns},
+    {"/etc/opt/cray/alps/alps.conf", "sharedDir" , parser_separated_columns},
     /* must be last element */
     {NULL , NULL , NULL}
 };

That appears to work for locating the allocation:

> $ ./INSTALL/bin/mpirun -mca ras_base_verbose 1 -mca orte_debug_verbose 1 -np 2 ./ring_c 2>&1 | tee -a log
> [nid00320:22990] ras:alps:allocate: Trying ALPS configuration file: "/etc/sysconfig/alps"
> [nid00320:22990] ras:alps:allocate: parser_ini
> [nid00320:22990] ras:alps:allocate: Trying ALPS configuration file: "/etc/alps.conf"
> [nid00320:22990] ras:alps:allocate: Skipping ALPS configuration file: "/etc/alps.conf" (No such file or directory).
> [nid00320:22990] ras:alps:allocate: Trying ALPS configuration file: "/etc/opt/cray/alps/alps.conf"
> [nid00320:22990] ras:alps:allocate: parser_separated_columns
> [nid00320:22990] ras:alps:allocate: Located ALPS scheduler file: "/ufs/alps_shared/appinfo"
> [nid00320:22990] ras:alps:allocate: begin processing appinfo file
> [nid00320:22990] ras:alps:allocate: file /ufs/alps_shared/appinfo read
> [nid00320:22990] ras:alps:allocate: 3 entries in file
> [nid00320:22990] ras:alps:allocate: read data for resId 26 - myId 41
> [nid00320:22990] ras:alps:allocate: read data for resId 26 - myId 41
> [nid00320:22990] ras:alps:allocate: read data for resId 41 - myId 41
> [nid00320:22990] ras:alps:allocate: success

But wait, where is the application output?
Did anything even run?

I honestly don't know where to go from here.
Please let me know what I can do to help move forward on any of these issues.

-Paul

-- 
Paul H. Hargrove                          phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department     Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900