On Aug 25, 2013, at 4:30 PM, Frank <[email protected]> wrote:
> Hi,
> I have very weird problem here.
> I am using FORTRAN to call PETSc to solve Poisson equation.
> When I run my code with 8 cores, it works fine, and the consumed memory does
> not increase. However, when it is run with 64 cores, first of all it gives
> lots of error like this:
>
> [n310:18951] [[62652,0],2] -> [[62652,0],10] (node: n219) oob-tcp:
> Number of attempts to create TCP connection has been exceeded. Can not
> communicate with peer
> [n310:18951] [[62652,0],2] -> [[62652,0],18] (node: n128) oob-tcp:
> Number of attempts to create TCP connection has been exceeded. Can not
> communicate with peer
> [n310:18951] [[62652,0],2] -> [[62652,0],34] (node: n089) oob-tcp:
> Number of attempts to create TCP connection has been exceeded. Can not
> communicate with peer
> [n310:18951] [[62652,0],2] ORTED_CMD_PROCESSOR: STUCK IN INFINITE LOOP -
> ABORTING

I don't know where you are getting "memory" errors, but this looks like a pretty fatal error. Unless someone recognizes something else, I'd look at this in a debugger and see where this is happening. See if it's deterministic or not. And if it is, see what code is killing it.
Mark

> [n310:18951] *** Process received signal ***
> [n310:18951] Signal: Aborted (6)
> [n310:18951] Signal code: (-6)
> [n310:18951] [ 0] /lib64/libpthread.so.0() [0x35b120f500]
> [n310:18951] [ 1] /lib64/libc.so.6(gsignal+0x35) [0x35b0e328a5]
> [n310:18951] [ 2] /lib64/libc.so.6(abort+0x175) [0x35b0e34085]
> [n310:18951] [ 3]
> /global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(orte_daemon_cmd_processor+0x243)
> [0x2ae5e02f0813]
> [n310:18951] [ 4]
> /global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(opal_event_base_loop+0x31a)
> [0x2ae5e032f56a]
> [n310:18951] [ 5]
> /global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(opal_event_loop+0x12)
> [0x2ae5e032f242]
> [n310:18951] [ 6]
> /global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(opal_progress+0x5c)
> [0x2ae5e031845c]
> [n310:18951] [ 7]
> /global/software/openmpi-1.6.1-intel1/lib/openmpi/mca_grpcomm_bad.so(+0x1bd7)
> [0x2ae5e28debd7]
> [n310:18951] [ 8]
> /global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(orte_ess_base_orted_finalize+0x1e)
> [0x2ae5e02f431e]
> [n310:18951] [ 9]
> /global/software/openmpi-1.6.1-intel1/lib/openmpi/mca_ess_tm.so(+0x1294)
> [0x2ae5e1ab1294]
> [n310:18951] [10]
> /global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(orte_finalize+0x4e)
> [0x2ae5e02d0fbe]
> [n310:18951] [11]
> /global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(+0x4840b)
> [0x2ae5e02f040b]
> [n310:18951] [12]
> /global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(opal_event_base_loop+0x31a)
> [0x2ae5e032f56a]
> [n310:18951] [13]
> /global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(opal_event_loop+0x12)
> [0x2ae5e032f242]
> [n310:18951] [14]
> /global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(opal_progress+0x5c)
> [0x2ae5e031845c]
> [n310:18951] [15]
> /global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(orte_trigger_event+0x50)
> [0x2ae5e02dc930]
> [n310:18951] [16]
> /global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(+0x4916f)
> [0x2ae5e02f116f]
> [n310:18951] [17]
> /global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(orte_daemon_cmd_processor+0x149)
> [0x2ae5e02f0719]
> [n310:18951] [18]
> /global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(opal_event_base_loop+0x31a)
> [0x2ae5e032f56a]
> [n310:18951] [19]
> /global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(opal_event_loop+0x12)
> [0x2ae5e032f242]
> [n310:18951] [20]
> /global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(opal_event_dispatch+0x8)
> [0x2ae5e032f228]
> [n310:18951] [21]
> /global/software/openmpi-1.6.1-intel1/lib/libopen-rte.so.4(orte_daemon+0x9f0)
> [0x2ae5e02ef8a0]
> [n310:18951] [22] orted(main+0x88) [0x4024d8]
> [n310:18951] [23] /lib64/libc.so.6(__libc_start_main+0xfd) [0x35b0e1ecdd]
> [n310:18951] [24] orted() [0x402389]
> [n310:18951] *** End of error message ***
>
> but the program still gives the right result for a short period. After that,
> it suddenly stopped because memory exceeds some limit. I don't understand
> this. If there is memory leakage in my code, how come it can work with 8
> cores? Please help me. Thank you so much!
>
> Sincerely
> Xingjun
