WTF

On Jan 21, 2014, at 1:14 PM, Roc Wang <[email protected]> wrote:

> Hi,
>
> I am trying to run a PETSc program with 1024 MPI ranks on
> vesta.alcf.anl.gov. The original program, which was debugged and run
> successfully on other clusters and on vesta with a small number of ranks,
> includes many PETSc calls that use the KSP solver, but they are commented
> out to test PETSc initialization. Therefore, only PetscInitialize(),
> PetscFinalize(), and some output functions remain in the program. The
> command to run the job is:
>
> qsub -n <number of nodes> -t 10 --mode <ranks per node> --env "F00=a:BAR=b" ./x.r
>
> The total number of ranks is 1024, with different combinations of
> <number of nodes> and <ranks per node>, such as -n 64 --mode c16 or
> -n 16 --mode 64.
>
> With -n 64 --mode c16, PetscInitialize() does not appear to start the
> PETSc process: nothing is printed to stdout. The .cobaltlog file shows
> that the job started, but the .output file did not record any output.
> The .error file looks like this:
>
> 2014-01-21 16:31:50.414 (INFO ) [0x40000a3bc20] 32092:ibm.runjob.AbstractOptions: using properties file /bgsys/local/etc/bg.properties
> 2014-01-21 16:31:50.416 (INFO ) [0x40000a3bc20] 32092:ibm.runjob.AbstractOptions: max open file descriptors: 65536
> 2014-01-21 16:31:50.416 (INFO ) [0x40000a3bc20] 32092:ibm.runjob.AbstractOptions: core file limit: 18446744073709551615
> 2014-01-21 16:31:50.416 (INFO ) [0x40000a3bc20] 32092:tatu.runjob.client: scheduler job id is 154599
> 2014-01-21 16:31:50.419 (INFO ) [0x400004034e0] 32092:tatu.runjob.monitor: monitor started
> 2014-01-21 16:31:50.421 (INFO ) [0x40000a3bc20] VST-00420-11731-64:32092:ibm.runjob.client.options.Parser: set local socket to runjob_mux from properties file
> 2014-01-21 16:31:53.111 (INFO ) [0x40000a3bc20] VST-00420-11731-64:729041:ibm.runjob.client.Job: job 729041 started
> 2014-01-21 16:32:03.603 (WARN ) [0x400004034e0] 32092:tatu.runjob.monitor: tracklib terminated with exit code 1
> 2014-01-21 16:41:09.554 (WARN ) [0x40000a3bc20] VST-00420-11731-64:ibm.runjob.LogSignalInfo: received signal 15
> 2014-01-21 16:41:09.555 (WARN ) [0x40000a3bc20] VST-00420-11731-64:ibm.runjob.LogSignalInfo: signal sent from USER
> 2014-01-21 16:41:09.555 (WARN ) [0x40000a3bc20] VST-00420-11731-64:ibm.runjob.LogSignalInfo: sent from pid 5894
> 2014-01-21 16:41:09.555 (WARN ) [0x40000a3bc20] VST-00420-11731-64:ibm.runjob.LogSignalInfo: could not read /proc/5894/exe
> 2014-01-21 16:41:09.555 (WARN ) [0x40000a3bc20] VST-00420-11731-64:ibm.runjob.LogSignalInfo: Permission denied
> 2014-01-21 16:41:09.555 (WARN ) [0x40000a3bc20] VST-00420-11731-64:ibm.runjob.LogSignalInfo: sent from uid 0 (root)
> 2014-01-21 16:41:11.248 (WARN ) [0x40000a3bc20] VST-00420-11731-64:729041:ibm.runjob.client.Job: terminated by signal 9
> 2014-01-21 16:41:11.248 (WARN ) [0x40000a3bc20] VST-00420-11731-64:729041:ibm.runjob.client.Job: abnormal termination by signal 9 from rank 720
> 2014-01-21 16:41:11.248 (INFO ) [0x40000a3bc20] tatu.runjob.client: task terminated by signal 9
> 2014-01-21 16:41:11.248 (INFO ) [0x400004034e0] 32092:tatu.runjob.monitor: monitor terminating
> 2014-01-21 16:41:11.250 (INFO ) [0x40000a3bc20] tatu.runjob.client: monitor completed
>
> PETSc can start with -n 16 --mode 64 and with -n 1024 --mode c1. I also
> replaced PetscInitialize() with MPI_Init(), and the program then starts
> correctly with all combinations of the options.
>
> What causes this strange result? Thanks.
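
For reference, the node and ranks-per-node combinations that give 1024 total ranks can be listed with a short script. This is only a sketch: the function name layouts and the set of modes (c1 through c64, powers of two) are assumptions for illustration and should be checked against the machine's documentation. The printed qsub lines omit the --env option from the quoted command.

```python
# Sketch: list (nodes, ranks per node) pairs that multiply to a target rank count.
# ASSUMED_MODES is an assumption (c1 ... c64), not taken from the quoted message.
ASSUMED_MODES = (1, 2, 4, 8, 16, 32, 64)


def layouts(total_ranks, modes=ASSUMED_MODES):
    """Return (nodes, ranks_per_node) pairs whose product equals total_ranks."""
    return [(total_ranks // m, m) for m in modes if total_ranks % m == 0]


if __name__ == "__main__":
    # Print one simplified qsub line per layout for 1024 ranks.
    for nodes, ranks_per_node in layouts(1024):
        print(f"qsub -n {nodes} -t 10 --mode c{ranks_per_node} ./x.r")
```

Running the same initialization-only test over each listed layout would show whether the failure is specific to -n 64 --mode c16 or also affects other combinations.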
