Classification: UNCLASSIFIED
Caveats: NONE

I've built ParaView 4.1.0 on a couple of our HPC systems. The build is clean, but execution of the parallel servers fails. On both systems (an SGI Altix/ICE and an IBM iDataPlex) I'm using gcc and openmpi and the exact same build environment that I used to build 4.0.1. However, both systems fail with identical errors, beginning with a complaint about the "leave pinned" MPI feature, which is enabled by a flag in our mpirun command environment and works with 4.0.1. Did something change behind the scenes in ParaView 4.1.0 that impacts the build or runtime parameters?
orterun -x MODULE_VERSION_STACK -x MANPATH -x MPI_VER -x HOSTNAME -x _MODULESBEGINENV_ -x PBS_ACCOUNT -x HOST -x SHELL -x TMPDIR -x PBS_JOBNAME -x PBS_ENVIRONMENT -x PBS_O_WORKDIR -x NCPUS -x DAAC_HOME -x GROUP -x PBS_TASKNUM -x USER -x LD_LIBRARY_PATH -x LS_COLORS -x PBS_O_HOME -x COMPILER_VER -x HOSTTYPE -x PBS_MOMPORT -x PV_ROOT -x PBS_O_QUEUE -x NLSPATH -x MODULE_VERSION -x MAIL -x PBS_O_LOGNAME -x PATH -x PBS_O_LANG -x PBS_JOBCOOKIE -x F90 -x PWD -x _LMFILES_ -x PBS_NODENUM -x LANG -x MODULEPATH -x LOADEDMODULES -x PBS_JOBDIR -x F77 -x PBS_O_SHELL -x PBS_JOBID -x MPICC_F77 -x CXX -x ENVIRONMENT -x SHLVL -x HOME -x OSTYPE -x PBS_O_HOST -x MPIHOME -x FC -x VENDOR -x MACHTYPE -x LOGNAME -x MPICC_CXX -x PBS_QUEUE -x MPI_HOME -x MODULESHOME -x COMPILER -x LESSOPEN -x OMP_NUM_THREADS -x PBS_O_MAIL -x CC -x PBS_O_SYSTEM -x MPICC_F90 -x G_BROKEN_FILENAMES -x PBS_NODEFILE -x MPICC_CC -x PBS_O_PATH -x module -x } -x premode -x premod -x PBS_HOME -x PBS_GET_IBWINS -x NUM_MPITASKS -np 3 -machinefile new.1133.machines.txt --prefix /usr/cta/unsupported/openmpi/gcc/4.4.0/openmpi-1.6.3 -mca orte_rsh_agent ssh -mca mpi_paffinity_alone 1 -mca maffinity first_use -mca mpi_leave_pinned 1 -mca btl openib,self -mca orte_default_hostname new.1133.machines.txt pvserver --use-offscreen-rendering --server-port=50481 --client-host=localhost --reverse-connection --timeout=15 --connect-id=30526

[pershing-n0221:01190] Warning: could not find environment variable "}"
--------------------------------------------------------------------------
A process attempted to use the "leave pinned" MPI feature, but no memory registration hooks were found on the system at run time. This may be the result of running on a system that does not support memory hooks or having some other software subvert Open MPI's use of the memory hooks. You can disable Open MPI's use of memory hooks by setting both the mpi_leave_pinned and mpi_leave_pinned_pipeline MCA parameters to 0.

Open MPI will disable any transports that are attempting to use the leave pinned functionality; your job may still run, but may fall back to a slower network transport (such as TCP).

  Mpool name: rdma
  Process: [[43622,1],0]
  Local host: xxx-n0221
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There is at least one OpenFabrics device found but there are no active ports detected (or Open MPI was unable to use them). This is most certainly not what you wanted. Check your cables, subnet manager configuration, etc. The openib BTL will be ignored for this job.

  Local host: xxx-n0221
--------------------------------------------------------------------------
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for MPI communications. This means that no Open MPI device has indicated that it can be used to communicate between these processes. This is an error; Open MPI requires that all MPI processes be able to reach each other. This error can sometimes be the result of forgetting to specify the "self" BTL.

  Process 1 ([[43622,1],2]) is on host: xxx-n0221
  Process 2 ([[43622,1],0]) is on host: xxx-n0221
  BTLs attempted: self

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
MPI_INIT has failed because at least one MPI process is unreachable from another. This *usually* means that an underlying communication plugin -- such as a BTL or an MTL -- has either not loaded or not allowed itself to be used. Your MPI job will now abort.

You may wish to try to narrow down the problem;

 * Check the output of ompi_info to see which BTL/MTL plugins are available.
 * Run your application with MPI_THREAD_SINGLE.
 * Set the MCA parameter btl_base_verbose to 100 (or mtl_base_verbose, if using MTL-based communications) to see exactly which communication plugins were considered and/or discarded.
--------------------------------------------------------------------------
[pershing-n0221:1198] *** An error occurred in MPI_Init
[pershing-n0221:1198] *** on a NULL communicator
[pershing-n0221:1198] *** Unknown error
[pershing-n0221:1198] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
--------------------------------------------------------------------------
An MPI process is aborting at a time when it cannot guarantee that all of its peer processes in the job will be killed properly. You should double check that everything has shut down cleanly.

  Reason: Before MPI_INIT completed
  Local host: pershing-n0221
  PID: 1198
--------------------------------------------------------------------------
--------------------------------------------------------------------------
orterun has exited due to process rank 2 with PID 1198 on node pershing-n0221 exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in the job did. This can cause a job to hang indefinitely while it waits for all processes to call "init". By rule, if one process calls "init", then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize". By rule, all processes that call "init" MUST call "finalize" prior to exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be terminated by signals sent by orterun (as reported here).
--------------------------------------------------------------------------
[pershing-n0221:01190] 2 more processes have sent help message help-mpool-base.txt / leave pinned failed
[pershing-n0221:01190] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[pershing-n0221:01190] 2 more processes have sent help message help-mpi-btl-openib.txt / no active ports found
[pershing-n0221:01190] 2 more processes have sent help message help-mca-bml-r2.txt / unreachable proc
[pershing-n0221:01190] 2 more processes have sent help message help-mpi-runtime / mpi_init:startup:pml-add-procs-fail
[pershing-n0221:01190] 2 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
[pershing-n0221:01190] 2 more processes have sent help message help-mpi-runtime.txt / ompi mpi abort:cannot guarantee all killed

________________________________
Rick Angelini
US Army Research Laboratory
CISD/HPC Architectures Team
Building 120 Cube 315
Aberdeen Proving Ground, MD
Phone: 410-278-6266
