WTF

On Jan 21, 2014, at 1:14 PM, Roc Wang <[email protected]> wrote:

> Hi,
> 
>    I am trying to run a PETSc program with 1024 MPI ranks on 
> vesta.alcf.anl.gov.  The original program, which was debugged and ran 
> successfully on other clusters and on vesta with a small number of ranks, 
> calls many PETSc functions to use the KSP solver, but they are commented out 
> to test PETSc initialization. Therefore, only PetscInitialize(), 
> PetscFinalize(), and some output functions remain in the program. The command 
> to run the job is:
> 
> qsub -n <number of nodes> -t 10 --mode <ranks per node> --env "F00=a:BAR=b" 
> ./x.r 
> 
> The total number of ranks is 1024, obtained with different combinations of 
> <number of nodes> and <ranks per node>, such as -n 64 --mode c16 or 
> -n 16 --mode c64.
> 
> The results suggest that PetscInitialize() does not complete with 
> -n 64 --mode c16, since nothing is printed to stdout.  The .cobaltlog file 
> shows that the job started, but the .output file did not record any output. 
> The .error file contains:
> 
> 2014-01-21 16:31:50.414 (INFO ) [0x40000a3bc20] 
> 32092:ibm.runjob.AbstractOptions: using properties file 
> /bgsys/local/etc/bg.properties
> 2014-01-21 16:31:50.416 (INFO ) [0x40000a3bc20] 
> 32092:ibm.runjob.AbstractOptions: max open file descriptors: 65536
> 2014-01-21 16:31:50.416 (INFO ) [0x40000a3bc20] 
> 32092:ibm.runjob.AbstractOptions: core file limit: 18446744073709551615
> 2014-01-21 16:31:50.416 (INFO ) [0x40000a3bc20] 32092:tatu.runjob.client: 
> scheduler job id is 154599
> 2014-01-21 16:31:50.419 (INFO ) [0x400004034e0] 32092:tatu.runjob.monitor: 
> monitor started
> 2014-01-21 16:31:50.421 (INFO ) [0x40000a3bc20] 
> VST-00420-11731-64:32092:ibm.runjob.client.options.Parser: set local socket 
> to runjob_mux from properties file
> 2014-01-21 16:31:53.111 (INFO ) [0x40000a3bc20] 
> VST-00420-11731-64:729041:ibm.runjob.client.Job: job 729041 started
> 2014-01-21 16:32:03.603 (WARN ) [0x400004034e0] 32092:tatu.runjob.monitor: 
> tracklib terminated with exit code 1
> 2014-01-21 16:41:09.554 (WARN ) [0x40000a3bc20] 
> VST-00420-11731-64:ibm.runjob.LogSignalInfo: received signal 15
> 2014-01-21 16:41:09.555 (WARN ) [0x40000a3bc20] 
> VST-00420-11731-64:ibm.runjob.LogSignalInfo: signal sent from USER
> 2014-01-21 16:41:09.555 (WARN ) [0x40000a3bc20] 
> VST-00420-11731-64:ibm.runjob.LogSignalInfo: sent from pid 5894
> 2014-01-21 16:41:09.555 (WARN ) [0x40000a3bc20] 
> VST-00420-11731-64:ibm.runjob.LogSignalInfo: could not read /proc/5894/exe
> 2014-01-21 16:41:09.555 (WARN ) [0x40000a3bc20] 
> VST-00420-11731-64:ibm.runjob.LogSignalInfo: Permission denied
> 2014-01-21 16:41:09.555 (WARN ) [0x40000a3bc20] 
> VST-00420-11731-64:ibm.runjob.LogSignalInfo: sent from uid 0 (root)
> 2014-01-21 16:41:11.248 (WARN ) [0x40000a3bc20] 
> VST-00420-11731-64:729041:ibm.runjob.client.Job: terminated by signal 9
> 2014-01-21 16:41:11.248 (WARN ) [0x40000a3bc20] 
> VST-00420-11731-64:729041:ibm.runjob.client.Job: abnormal termination by 
> signal 9 from rank 720
> 2014-01-21 16:41:11.248 (INFO ) [0x40000a3bc20] tatu.runjob.client: task 
> terminated by signal 9
> 2014-01-21 16:41:11.248 (INFO ) [0x400004034e0] 32092:tatu.runjob.monitor: 
> monitor terminating
> 2014-01-21 16:41:11.250 (INFO ) [0x40000a3bc20] tatu.runjob.client: monitor 
> completed
> 
> 
> PETSc starts with -n 16 --mode c64 and -n 1024 --mode c1.  I also 
> replaced PetscInitialize() with MPI_Init(), and the program then starts 
> correctly with all combinations of the options. 
> 
> What could be causing this strange result? Thanks.
