Indeed. Sorry to jump back into the melee so late. I did reproduce the problem on a second SPARC system, to answer Ralph's earlier question; I don't know how interesting that is, given that it's very similar to the original system. And, to corroborate Paul's AMD observation, we have an x86/Solaris/Studio system that is *not* seeing the problem. Thanks to Paul for identifying the likely cause.

On 8/24/2012 6:32 PM, Ralph Castain wrote:
Thanks Paul!! That is very helpful - hopefully the ORNL folks can now fix the problem.

On Aug 24, 2012, at 6:29 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:

I *can* reproduce the problem on SPARC/Solaris-10 with the SS12.3 compiler and an ALMOST vanilla configure:
$ [path_to]configure \
       --prefix=[blah]  CC=cc CXX=CC F77=f77 FC=f90 \
CFLAGS="-m64" --with-wrapper-cflags="-m64" CXXFLAGS="-m64" --with-wrapper-cxxflags="-m64" \ FFLAGS="-m64" --with-wrapper-fflags="-m64" FCFLAGS="-m64" --with-wrapper-fcflags="-m64" \
       CXXFLAGS="-m64 -library=stlport4"

I did NOT manage to reproduce the problem on AMD64/Solaris-11, which completed a build w/ VT disabled. Unfortunately, I have neither SPARC/Solaris-11 nor AMD64/Solaris-10 readily available to disambiguate the key factor. Hopefully it is enough to know that the problem is reproducible w/o Oracle's massive configure command line.


The build isn't complete, but I can already see that the symbol has "leaked" into libmpi:

$ grep -arl mca_coll_ml_memsync_intra BLD/
BLD/ompi/mca/bcol/.libs/libmca_bcol.a
BLD/ompi/mca/bcol/base/.libs/bcol_base_open.o
BLD/ompi/.libs/libmpi.so.0.0.0
BLD/ompi/.libs/libmpi.so
BLD/ompi/.libs/libmpi.so.0

It is referenced by mca_coll_ml_generic_collectives_launcher:

$ nm BLD/ompi/.libs/libmpi.so.0.0.0 | grep -B1 mca_coll_ml_memsync_intra
00000000006a6088 t mca_coll_ml_generic_collectives_launcher
                 U mca_coll_ml_memsync_intra

This is coming from libmca_bcol.a:
$ nm BLD/ompi/mca/bcol/.libs/libmca_bcol.a | grep -B1 mca_coll_ml_memsync_intra
0000000000005248 t mca_coll_ml_generic_collectives_launcher
                 U mca_coll_ml_memsync_intra


This appears to be via the following chain of calls within coll_ml.h:

mca_coll_ml_generic_collectives_launcher
   mca_coll_ml_task_completion_processing
      coll_ml_fragment_completion_processing
         mca_coll_ml_buffer_recycling
             mca_coll_ml_memsync_intra

All of which are marked as "static inline __opal_attribute_always_inline__".

-Paul


On Fri, Aug 24, 2012 at 4:55 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:

    OK, I have a vanilla configure+make running on both
    SPARC/Solaris-10 and AMD64/Solaris-11.
    I am using the 12.3 Oracle compilers in both cases to match the
    original report.
    I'll post the results when they complete.

    In the meantime, I took a quick look at the code and have a
    pretty reasonable guess as to the cause.
    Looking at ompi/mca/coll/ml/coll_ml.h I see:

       827  int mca_coll_ml_memsync_intra(mca_coll_ml_module_t *module, int bank_index);
    [...]
       996  static inline __opal_attribute_always_inline__
       997          int mca_coll_ml_buffer_recycling(mca_coll_ml_collective_operation_progress_t *ml_request)
       998  {
    [...]
      1023                  rc = mca_coll_ml_memsync_intra(ml_module, ml_memblock->memsync_counter);
    [...]
      1041  }

    Based on past experience with the Sun/Oracle compilers on another
    project (see
    http://bugzilla.hcs.ufl.edu/cgi-bin/bugzilla3/show_bug.cgi?id=193 ),
    I suspect that this static-inline-always function is being emitted
    by the compiler into every object that includes this header, even
    objects that never call it.  The call on line 1023 then results in
    the undefined reference to mca_coll_ml_memsync_intra.  Basically,
    it is not safe for an inline function in a header to call an extern
    function that isn't available to every object that includes the
    header, REGARDLESS of whether the object invokes the inline
    function or not.
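
    To see the hazard in isolation, here is a minimal, self-contained
    sketch.  The file and symbol names (leaky.h, do_memsync,
    recycle_buffer) are hypothetical, not the actual Open MPI sources,
    and I use the GCC-style attribute spelling that the
    __opal_attribute_always_inline__ macro stands in for where the
    compiler supports it:

        /* leaky.h -- hypothetical header reproducing the pattern */
        #ifndef LEAKY_H
        #define LEAKY_H

        /* Extern function: defined only in the objects of one component,
         * NOT in every library that ends up including this header.       */
        extern int do_memsync(int bank_index);

        /* If the compiler emits this function into every object that
         * includes the header -- even objects that never call it -- then
         * each such object carries an undefined reference to do_memsync(). */
        static inline __attribute__((always_inline))
        int recycle_buffer(int bank_index)
        {
            return do_memsync(bank_index);
        }

        #endif

        /* unrelated.c -- includes the header but never calls recycle_buffer() */
        #include "leaky.h"

        int unrelated_work(void)
        {
            return 42;
        }

    With a compiler that discards the unused static inline, "nm unrelated.o"
    shows no trace of do_memsync; with one that emits the function anyway
    (as the SS12.3 behavior above suggests), nm shows "U do_memsync", and
    any library that absorbs unrelated.o now depends on a symbol it may
    never be able to resolve.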

    -Paul



    On Fri, Aug 24, 2012 at 4:40 PM, Ralph Castain <r...@open-mpi.org> wrote:

        Oracle uses an abysmally complicated configure line, but
        nearly all of it is irrelevant to the problem here. For this,
        I would suggest just doing a vanilla ./configure - if the
        component gets pulled into libmpi, then we know there is a
        problem.

        Thanks!

        Just FYI: here is their actual configure line, just in case
        you spot something problematic:

        CC=cc CXX=CC F77=f77 FC=f90 --with-openib --enable-openib-connectx-xrc
        --without-udapl --disable-openib-ibcm --enable-btl-openib-failover
        --without-dtrace --enable-heterogeneous --enable-cxx-exceptions
        --enable-shared --enable-orterun-prefix-by-default --with-sge
        --enable-mpi-f90 --with-mpi-f90-size=small --disable-peruse
        --disable-state --disable-mpi-thread-multiple --disable-debug
        --disable-mem-debug --disable-mem-profile
        CFLAGS="-xtarget=ultra3 -m32 -xarch=sparcvis2 -xprefetch -xprefetch_level=2
        -xvector=lib -Qoption cg -xregs=no%appl -xdepend=yes -xbuiltin=%all -xO5"
        CXXFLAGS="-xtarget=ultra3 -m32 -xarch=sparcvis2 -xprefetch -xprefetch_level=2
        -xvector=lib -Qoption cg -xregs=no%appl -xdepend=yes -xbuiltin=%all -xO5
        -Bstatic -lCrun -lCstd -Bdynamic"
        FFLAGS="-xtarget=ultra3 -m32 -xarch=sparcvis2 -xprefetch -xprefetch_level=2
        -xvector=lib -Qoption cg -xregs=no%appl -stackvar -xO5"
        FCFLAGS="-xtarget=ultra3 -m32 -xarch=sparcvis2 -xprefetch -xprefetch_level=2
        -xvector=lib -Qoption cg -xregs=no%appl -stackvar -xO5"
        --prefix=/workspace/euloh/hpc/mtt-scratch/burl-ct-t2k-3/ompi-tarball-testing/installs/JA08/install
        --mandir=${prefix}/man --bindir=${prefix}/bin --libdir=${prefix}/lib
        --includedir=${prefix}/include
        --with-tm=/ws/ompi-tools/orte/torque/current/shared-install32
        --enable-contrib-no-build=vt
        --with-package-string="Oracle Message Passing Toolkit "
        --with-ident-string="@(#)RELEASE VERSION 1.9openmpi-1.5.4-r1.9a1r27092"


        and the error he gets is:

        make[2]: Entering directory `/workspace/euloh/hpc/mtt-scratch/burl-ct-t2k-3/ompi-tarball-testing/mpi-install/s3rI/src/openmpi-1.9a1r27092/ompi/tools/ompi_info'
          CCLD     ompi_info
        Undefined                       first referenced
          symbol                            in file
        mca_coll_ml_memsync_intra           ../../../ompi/.libs/libmpi.so
        ld: fatal: symbol referencing errors. No output written to .libs/ompi_info
        make[2]: *** [ompi_info] Error 2
        make[2]: Leaving directory `/workspace/euloh/hpc/mtt-scratch/burl-ct-t2k-3/ompi-tarball-testing/mpi-install/s3rI/src/openmpi-1.9a1r27092/ompi/tools/ompi_info'
        make[1]: *** [install-recursive] Error 1
        make[1]: Leaving directory `/workspace/euloh/hpc/mtt-scratch/burl-ct-t2k-3/ompi-tarball-testing/mpi-install/s3rI/src/openmpi-1.9a1r27092/ompi'
        make: *** [install-recursive] Error 1


        On Aug 24, 2012, at 4:30 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:

        I have access to a few different Solaris machines and can
        offer to build the trunk if somebody tells me what configure
        flags are desired.

        -Paul

        On Fri, Aug 24, 2012 at 8:54 AM, Ralph Castain <r...@open-mpi.org> wrote:

            Eugene - can you confirm that this is only happening on
            the one Solaris system? In other words, is this a
            general issue or something specific to that one machine?

            I'm wondering because if it is just the one machine,
            then it might be something strange about how it is set up
            - perhaps the version of Solaris, or it is configured with
            --enable-static, or...

            Just trying to assess how general a problem this might
            be, and thus if this should be a blocker or not.

            On Aug 24, 2012, at 8:00 AM, Eugene Loh <eugene....@oracle.com> wrote:

            > On 08/24/12 09:54, Shamis, Pavel wrote:
            >> Maybe there is a chance to get direct access to this system?
            > No.
            >
            > But I'm attaching compressed log files from configure/make.
            >
            >
            
            > <tarball-of-log-files.tar.bz2>






        --
        Paul H. Hargrove                          phhargr...@lbl.gov
        Future Technologies Group
        Computer and Data Sciences Department     Tel: +1-510-495-2352
        Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900







    --
    Paul H. Hargrove                          phhargr...@lbl.gov
    Future Technologies Group
    Computer and Data Sciences Department     Tel: +1-510-495-2352
    Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900




--
Paul H. Hargrove                          phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department     Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900



_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
