OK, I am now on the openmpi-1.9a1r27954 tarball.
To build OMPI and compile apps on this machine, I must:

1) edit the xe6 platform file to --disable-shared/--enable-static (site-specific)

2) edit the xe6 platform file to provide a full path to the alps headers
because the logic in orte_check_alps.m4 for default values is wrong

3) edit the xe6 platform file to remove with_devel_headers=yes because
--with-devel-headers breaks "make install"

4) edit configure (!!!) to allow ras_alps_CPPFLAGS (and other vars) to get
set at configure time

5) edit orte/mca/ras/alps/ras_alps_component.c and/or
orte/mca/ras/alps/ras-alps-command.sh with the proper path to apstat
(perhaps only one needs to be edited?)

Item (1) is due to site differences, and is not an OMPI bug.
The other four have all been reported in one form or another on this list.

Now, the *next* bug is the following:

> $ ./INSTALL/bin/mpirun -mca ras_base_verbose 1 -mca orte_debug_verbose 1 -np 2 ./ring_c 2>&1 | tee -a log
> [nid00704:21984] ras:alps:allocate: Trying ALPS configuration file: "/etc/sysconfig/alps"
> [nid00704:21984] ras:alps:allocate: parser_ini
> [nid00704:21984] ras:alps:allocate: Trying ALPS configuration file: "/etc/alps.conf"
> [nid00704:21984] ras:alps:allocate: Skipping ALPS configuration file: "/etc/alps.conf" (No such file or directory).
> [nid00704:21984] ras:alps:allocate: Could not locate ALPS scheduler file.
> [nid00704:21984] [[8668,0],0] ORTE_ERROR_LOG: Not found in file ../../../../orte/mca/ras/base/ras_base_allocate.c at line 178



My best guess is that this is related to something Ralph said in
http://www.open-mpi.org/community/lists/devel/2013/01/11989.php :

> I'm currently tracking down a problem on the Cray XE6 - it appears that
> recent OS release changed the way alps stores allocation info :-(


Looking at the debug output prior to the error, and examining the system, I
made the following 1-line addition:
--- openmpi-1.9a1r27954/orte/mca/ras/alps/ras_alps_module.c~    2013-01-28 23:54:31.443749000 -0800
+++ openmpi-1.9a1r27954/orte/mca/ras/alps/ras_alps_module.c     2013-01-28 23:54:34.770766635 -0800
@@ -74,6 +74,7 @@ static int parser_separated_columns(char
 static const orte_ras_alps_sysconfig_t sysconfigs[] = {
     {"/etc/sysconfig/alps", "ALPS_SHARED_DIR_PATH", parser_ini},
     {"/etc/alps.conf"     , "sharedDir"           , parser_separated_columns},
+    {"/etc/opt/cray/alps/alps.conf", "sharedDir"  , parser_separated_columns},
     /* must be last element */
     {NULL                 , NULL                  , NULL}
 };

That appears to work for locating the allocation:

> $ ./INSTALL/bin/mpirun -mca ras_base_verbose 1 -mca orte_debug_verbose 1 -np 2 ./ring_c 2>&1 | tee -a log
> [nid00320:22990] ras:alps:allocate: Trying ALPS configuration file: "/etc/sysconfig/alps"
> [nid00320:22990] ras:alps:allocate: parser_ini
> [nid00320:22990] ras:alps:allocate: Trying ALPS configuration file: "/etc/alps.conf"
> [nid00320:22990] ras:alps:allocate: Skipping ALPS configuration file: "/etc/alps.conf" (No such file or directory).
> [nid00320:22990] ras:alps:allocate: Trying ALPS configuration file: "/etc/opt/cray/alps/alps.conf"
> [nid00320:22990] ras:alps:allocate: parser_separated_columns
> [nid00320:22990] ras:alps:allocate: Located ALPS scheduler file: "/ufs/alps_shared/appinfo"
> [nid00320:22990] ras:alps:allocate: begin processing appinfo file
> [nid00320:22990] ras:alps:allocate: file /ufs/alps_shared/appinfo read
> [nid00320:22990] ras:alps:allocate: 3 entries in file
> [nid00320:22990] ras:alps:allocate: read data for resId 26 - myId 41
> [nid00320:22990] ras:alps:allocate: read data for resId 26 - myId 41
> [nid00320:22990] ras:alps:allocate: read data for resId 41 - myId 41
> [nid00320:22990] ras:alps:allocate: success
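
For anyone following along, here is a minimal sketch of the lookup that the
sysconfigs table drives, as I read it from the debug output above. This is
NOT the actual ORTE code: sysconfig_t, find_key() and locate_shared_dir() are
hypothetical names, and one simplified parser stands in for both parser_ini
and parser_separated_columns. The config line formats it accepts
("key value" and key="value") are also an assumption on my part.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for orte_ras_alps_sysconfig_t: one candidate
 * config file plus the key that names the ALPS shared directory. */
typedef struct {
    const char *path;
    const char *key;
} sysconfig_t;

/* Same candidates, in the same order, as the patched table. */
static const sysconfig_t sysconfigs[] = {
    {"/etc/sysconfig/alps",          "ALPS_SHARED_DIR_PATH"},
    {"/etc/alps.conf",               "sharedDir"},
    {"/etc/opt/cray/alps/alps.conf", "sharedDir"},
    {NULL, NULL}   /* must be last element */
};

/* Simplified parser (assumption): accepts `key value` or `key="value"`,
 * and copies the value, minus quotes and whitespace, into out.
 * Returns 0 if the key was found, -1 otherwise. */
static int find_key(FILE *fp, const char *key, char *out, size_t outlen)
{
    char line[1024];
    size_t klen = strlen(key);

    assert(out != NULL && outlen > 0);
    while (fgets(line, sizeof line, fp) != NULL) {
        char *p = line + strspn(line, " \t");      /* skip indentation */
        if (strncmp(p, key, klen) != 0)
            continue;
        p += klen;
        if (*p != '=' && *p != ' ' && *p != '\t')  /* key must end here */
            continue;
        p += strspn(p, " \t=\"");                  /* skip separator/quote */
        p[strcspn(p, "\" \t\r\n")] = '\0';         /* cut at end of value */
        if (*p == '\0')
            continue;
        snprintf(out, outlen, "%s", p);
        return 0;
    }
    return -1;
}

/* Walk the table up to the NULL sentinel. A file that cannot be opened is
 * skipped; a file that opens but lacks the key also falls through to the
 * next entry. The first hit wins. */
static int locate_shared_dir(const sysconfig_t *table, char *out, size_t outlen)
{
    for (const sysconfig_t *c = table; c->path != NULL; ++c) {
        FILE *fp = fopen(c->path, "r");
        if (fp == NULL)
            continue;
        int rc = find_key(fp, c->key, out, outlen);
        fclose(fp);
        if (rc == 0)
            return 0;
    }
    return -1;
}
```

Calling locate_shared_dir(sysconfigs, buf, sizeof buf) would mirror the
"Trying ... / Skipping ... / Located ..." sequence in the log. The point of
the sketch is only that the table is searched in order, so appending the
/etc/opt/cray/alps/alps.conf entry cannot change behavior on systems where
one of the first two files already yields the key.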


But wait, where is the application output?
Did anything even run?
I honestly don't know where to go from here.

Please let me know what I can do to help move forward on any of these
issues.

-Paul

-- 
Paul H. Hargrove                          phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department     Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
