Classification: UNCLASSIFIED
Caveats: NONE

I'm seeing those errors with both client-server parallel pvserver and with pvbatch. I'm going to throw this over to our systems people to see if they have any ideas, but I'm suspicious that it's a ParaView thing, since it happens on two different machines, two different compute platforms.
________________________________
Rick Angelini
USArmy Research Laboratory
CISD/HPC Architectures Team
Building 120 Cube 315
Aberdeen Proving Ground, MD
Phone: 410-278-6266

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Utkarsh Ayachit
Sent: Tuesday, December 03, 2013 10:07 PM
To: Angelini, Richard C (Rick) CIV USARMY ARL (US)
Cc: [email protected]
Subject: Re: [Paraview] 4.1.0 release candidate build (UNCLASSIFIED)

I can't think of anything in particular that changed that could affect this. Are you trying this with pvserver? Can you try pvbatch? Same problem?

On Tue, Dec 3, 2013 at 12:44 PM, Angelini, Richard C (Rick) CIV USARMY ARL (US) <[email protected]> wrote:
> Classification: UNCLASSIFIED
> Caveats: NONE
>
> I've built 4.1.0 on a couple of our HPC systems and I'm getting a clean
> build, but it fails on execution of parallel servers. On both systems (an SGI
> Altix/ICE and an IBM iDataPlex) I'm using gcc and Open MPI and the exact
> same build environment that I used to build 4.0.1. However, both systems
> are failing with identical errors that begin with a complaint about the
> "leave pinned" MPI feature, which is a flag set in our mpirun command
> environment and works with 4.0.1. Did something change behind the scenes
> in ParaView 4.1.0 that impacts the build or runtime parameters?
>
> orterun -x MODULE_VERSION_STACK -x MANPATH -x MPI_VER -x HOSTNAME -x
> _MODULESBEGINENV_ -x PBS_ACCOUNT -x HOST -x SHELL -x TMPDIR -x
> PBS_JOBNAME -x PBS_ENVIRONMENT -x PBS_O_WORKDIR -x NCPUS -x DAAC_HOME
> -x GROUP -x PBS_TASKNUM -x USER -x LD_LIBRARY_PATH -x LS_COLORS -x
> PBS_O_HOME -x COMPILER_VER -x HOSTTYPE -x PBS_MOMPORT -x PV_ROOT -x
> PBS_O_QUEUE -x NLSPATH -x MODULE_VERSION -x MAIL -x PBS_O_LOGNAME -x
> PATH -x PBS_O_LANG -x PBS_JOBCOOKIE -x F90 -x PWD -x _LMFILES_ -x
> PBS_NODENUM -x LANG -x MODULEPATH -x LOADEDMODULES -x PBS_JOBDIR -x
> F77 -x PBS_O_SHELL -x PBS_JOBID -x MPICC_F77 -x CXX -x ENVIRONMENT -x
> SHLVL -x HOME -x OSTYPE -x PBS_O_HOST -x MPIHOME -x FC -x VENDOR -x
> MACHTYPE -x LOGNAME -x MPICC_CXX -x PBS_QUEUE -x MPI_HOME -x
> MODULESHOME -x COMPILER -x LESSOPEN -x OMP_NUM_THREADS -x PBS_O_MAIL
> -x CC -x PBS_O_SYSTEM -x MPICC_F90 -x G_BROKEN_FILENAMES -x
> PBS_NODEFILE -x MPICC_CC -x PBS_O_PATH -x module -x } -x premode -x
> premod -x PBS_HOME -x PBS_GET_IBWINS -x NUM_MPITASKS -np 3
> -machinefile new.1133.machines.txt --prefix
> /usr/cta/unsupported/openmpi/gcc/4.4.0/openmpi-1.6.3 -mca
> orte_rsh_agent ssh -mca mpi_paffinity_alone 1 -mca maffinity first_use
> -mca mpi_leave_pinned 1 -mca btl openib,self -mca
> orte_default_hostname new.1133.machines.txt pvserver
> --use-offscreen-rendering --server-port=50481 --client-host=localhost
> --reverse-connection --timeout=15 --connect-id=30526
> [pershing-n0221:01190] Warning: could not find environment variable "}"
> --------------------------------------------------------------------------
> A process attempted to use the "leave pinned" MPI feature, but no
> memory registration hooks were found on the system at run time. This
> may be the result of running on a system that does not support memory
> hooks or having some other software subvert Open MPI's use of the
> memory hooks.
> You can disable Open MPI's use of memory hooks by
> setting both the mpi_leave_pinned and mpi_leave_pinned_pipeline MCA
> parameters to 0.
>
> Open MPI will disable any transports that are attempting to use the
> leave pinned functionality; your job may still run, but may fall back
> to a slower network transport (such as TCP).
>
> Mpool name: rdma
> Process: [[43622,1],0]
> Local host: xxx-n0221
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> WARNING: There is at least one OpenFabrics device found but there are
> no active ports detected (or Open MPI was unable to use them). This
> is most certainly not what you wanted. Check your cables, subnet
> manager configuration, etc. The openib BTL will be ignored for this
> job.
>
> Local host: xxx-n0221
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> At least one pair of MPI processes are unable to reach each other
> for MPI communications. This means that no Open MPI device has
> indicated that it can be used to communicate between these processes.
> This is an error; Open MPI requires that all MPI processes be able to
> reach each other. This error can sometimes be the result of
> forgetting to specify the "self" BTL.
>
> Process 1 ([[43622,1],2]) is on host: xxx-n0221
> Process 2 ([[43622,1],0]) is on host: xxx-n0221
> BTLs attempted: self
>
> Your MPI job is now going to abort; sorry.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> MPI_INIT has failed because at least one MPI process is
> unreachable from another. This *usually* means that an underlying
> communication plugin -- such as a BTL or an MTL -- has either not
> loaded or not allowed itself to be used.
> Your MPI job will now abort.
>
> You may wish to try to narrow down the problem:
>
> * Check the output of ompi_info to see which BTL/MTL plugins are
>   available.
> * Run your application with MPI_THREAD_SINGLE.
> * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose,
>   if using MTL-based communications) to see exactly which
>   communication plugins were considered and/or discarded.
> --------------------------------------------------------------------------
> [pershing-n0221:1198] *** An error occurred in MPI_Init
> [pershing-n0221:1198] *** on a NULL communicator
> [pershing-n0221:1198] *** Unknown error
> [pershing-n0221:1198] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
> --------------------------------------------------------------------------
> An MPI process is aborting at a time when it cannot guarantee
> that all of its peer processes in the job will be killed properly.
> You should double check that everything has shut down cleanly.
>
> Reason: Before MPI_INIT completed
> Local host: pershing-n0221
> PID: 1198
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> orterun has exited due to process rank 2 with PID 1198 on node
> pershing-n0221 exiting improperly. There are two reasons this could
> occur:
>
> 1. this process did not call "init" before exiting, but others in the
> job did. This can cause a job to hang indefinitely while it waits for
> all processes to call "init". By rule, if one process calls "init",
> then ALL processes must call "init" prior to termination.
>
> 2. this process called "init", but exited without calling "finalize".
> By rule, all processes that call "init" MUST call "finalize" prior to
> exiting or it will be considered an "abnormal termination".
>
> This may have caused other processes in the application to be
> terminated by signals sent by orterun (as reported here).
> --------------------------------------------------------------------------
> [pershing-n0221:01190] 2 more processes have sent help message
> help-mpool-base.txt / leave pinned failed
> [pershing-n0221:01190] Set MCA parameter "orte_base_help_aggregate" to 0
> to see all help / error messages
> [pershing-n0221:01190] 2 more processes have sent help message
> help-mpi-btl-openib.txt / no active ports found
> [pershing-n0221:01190] 2 more processes have sent help message
> help-mca-bml-r2.txt / unreachable proc
> [pershing-n0221:01190] 2 more processes have sent help message
> help-mpi-runtime / mpi_init:startup:pml-add-procs-fail
> [pershing-n0221:01190] 2 more processes have sent help message
> help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
> [pershing-n0221:01190] 2 more processes have sent help message
> help-mpi-runtime.txt / ompi mpi abort:cannot guarantee all killed
>
> ________________________________
> Rick Angelini
> USArmy Research Laboratory
> CISD/HPC Architectures Team
> Building 120 Cube 315
> Aberdeen Proving Ground, MD
> Phone: 410-278-6266
>
> Classification: UNCLASSIFIED
> Caveats: NONE
>
> _______________________________________________
> Powered by www.kitware.com
>
> Visit other Kitware open-source projects at
> http://www.kitware.com/opensource/opensource.html
>
> Please keep messages on-topic and check the ParaView Wiki at:
> http://paraview.org/Wiki/ParaView
>
> Follow this link to subscribe/unsubscribe:
> http://www.paraview.org/mailman/listinfo/paraview

Classification: UNCLASSIFIED
Caveats: NONE
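For reference, the first help message in the log above suggests its own workaround: set both leave-pinned MCA parameters to 0 so Open MPI skips the memory registration hooks entirely. A minimal sketch of what that change looks like against the orterun invocation above (only the two -mca flags differ from the failing run; the -np count, port, and connect-id here are just the values from the log, not a recommendation):

```shell
# Sketch of the leave-pinned workaround from the Open MPI help text:
# replace "-mca mpi_leave_pinned 1" with both parameters set to 0.
# The job may fall back to a slower transport (e.g. TCP), per the warning.
orterun -np 3 \
    -mca mpi_leave_pinned 0 \
    -mca mpi_leave_pinned_pipeline 0 \
    -mca btl openib,self \
    pvserver --use-offscreen-rendering --server-port=50481 \
             --client-host=localhost --reverse-connection --connect-id=30526
```

Note this only silences the pinned-memory complaint; the "no active ports" openib warning in the same log suggests the InfiniBand stack itself may also need attention.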
