Let me take a look at it. How did you configure your build?
Thanks,
--
Samuel K. Gutierrez
Los Alamos National Laboratory
On Sep 20, 2010, at 10:14 AM, <ananda.mu...@wipro.com> wrote:
Hi,
I believe the new common shared memory component was committed to
the trunk sometime in late August. I had not tried this trunk
version until last week, and I have found a problem with this
component related to checkpoint functionality: I am not able to
checkpoint any program with the latest trunk version. Am I missing
something here? Should I be using any other options to enable
checkpoint functionality for the shared memory component?
However, if I disable the shared memory component and use only self,
tcp, and openib (--mca btl self,tcp,openib), I can checkpoint
successfully.
These are the options I used with mpirun:
mpirun -am ft-enable-cr --mca opal_cr_enable_timer 1 \
    --mca sstore_stage_global_is_shared 1 \
    --mca sstore_base_global_snapshot_dir /scratch/hpl005/UIT_test/amudar/FWI \
    --mca mpi_paffinity_alone 1 -np 32 -hostfile hostfile-32 ../hellompi
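For comparison, the workaround mentioned earlier (restricting the BTL list to self, tcp, and openib) could be combined with the same options roughly as follows. This is only a sketch: BTL_LIST, SNAPSHOT_DIR, and CMD are illustrative shell variable names, not Open MPI settings, and the script prints the command rather than launching it.

```shell
#!/bin/sh
# Sketch only: the reported mpirun invocation with the sm BTL left out.
# BTL_LIST, SNAPSHOT_DIR and CMD are illustrative names chosen for this example.
BTL_LIST="self,tcp,openib"
SNAPSHOT_DIR="/scratch/hpl005/UIT_test/amudar/FWI"

# Assemble the command as a string so it can be inspected before it is run.
CMD="mpirun -am ft-enable-cr \
  --mca btl ${BTL_LIST} \
  --mca opal_cr_enable_timer 1 \
  --mca sstore_stage_global_is_shared 1 \
  --mca sstore_base_global_snapshot_dir ${SNAPSHOT_DIR} \
  --mca mpi_paffinity_alone 1 \
  -np 32 -hostfile hostfile-32 ../hellompi"

# Dry run: print the assembled command instead of executing it.
printf '%s\n' "$CMD"
```

To launch the job instead of printing it, the final printf line can be replaced with eval "$CMD".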
Please note that hellompi is a very simple program without any
collective calls. When I issue a checkpoint, the program fails with
the following messages:
[hplcnlj158:13937] Signal: Segmentation fault (11)
[hplcnlj158:13937] Signal code: Address not mapped (1)
[hplcnlj158:13937] Failing at address: 0x2aaa00000001
[hplcnlj158:13937] [ 0] /lib64/libpthread.so.0 [0x2b4019a064c0]
[hplcnlj158:13937] [ 1] /users/amudar/openmpi-1.7/lib/libmca_common_sm.so.0(mca_common_sm_param_register+0x262) [0x2aaaad96628a]
[hplcnlj158:13937] [ 2] /users/amudar/openmpi-1.7/lib/openmpi/mca_btl_sm.so [0x2aaaaf0a55e8]
[hplcnlj158:13937] [ 3] /users/amudar/openmpi-1.7/lib/libmpi.so.0 [0x2b4018c3c11b]
[hplcnlj158:13937] [ 4] /users/amudar/openmpi-1.7/lib/libmpi.so.0(mca_base_components_open+0x3ef) [0x2b4018c3b70b]
[hplcnlj158:13937] [ 5] /users/amudar/openmpi-1.7/lib/libmpi.so.0(mca_btl_base_open+0xfd) [0x2b4018b620fe]
[hplcnlj158:13937] [ 6] /users/amudar/openmpi-1.7/lib/openmpi/mca_bml_r2.so [0x2aaaadd9e4fb]
[hplcnlj158:13937] [ 7] /users/amudar/openmpi-1.7/lib/openmpi/mca_pml_ob1.so [0x2aaaae5fa429]
[hplcnlj158:13937] [ 8] /users/amudar/openmpi-1.7/lib/openmpi/mca_pml_crcpw.so [0x2aaaadfadce6]
[hplcnlj158:13937] [ 9] /users/amudar/openmpi-1.7/lib/libmpi.so.0 [0x2b4018b01a0d]
[hplcnlj158:13937] [10] /users/amudar/openmpi-1.7/lib/libmpi.so.0(ompi_cr_coord+0xc0) [0x2b4018b017ba]
[hplcnlj158:13937] [11] /users/amudar/openmpi-1.7/lib/libmpi.so.0(opal_cr_inc_core_recover+0xed) [0x2b4018c0efab]
[hplcnlj158:13937] [12] /users/amudar/openmpi-1.7/lib/openmpi/mca_snapc_full.so [0x2aaaabd280fc]
[hplcnlj158:13937] [13] /users/amudar/openmpi-1.7/lib/libmpi.so.0(opal_cr_test_if_checkpoint_ready+0x11b) [0x2b4018c0ecd3]
[hplcnlj158:13937] [14] /users/amudar/openmpi-1.7/lib/libmpi.so.0 [0x2b4018c0f6e7]
[hplcnlj158:13937] [15] /lib64/libpthread.so.0 [0x2b40199fe367]
[hplcnlj158:13937] [16] /lib64/libc.so.6(clone+0x6d) [0x2b4019ce5f7d]
[hplcnlj158:13937] *** End of error message ***
[hplcnlj161:00637] *** Process received signal ***
[hplcnlj161:00637] Signal: Segmentation fault (11)
[hplcnlj161:00637] Signal code: Address not mapped (1)
[hplcnlj161:00637] Failing at address: 0x2aaa00000001
[hplcnlj161:00649] *** Process received signal ***
[hplcnlj161:00649] Signal: Segmentation fault (11)
[hplcnlj161:00649] Signal code: Address not mapped (1)
[hplcnlj161:00649] Failing at address: 0x2aaa00000001
/users/amudar/Fix_for_pidinuse/cr_restart: line 5: 14012 Segmentation fault /usr/blcr/bin/cr_restart --no-restore-pid "$@"
[hplcnlj161:00643] *** Process received signal ***
[hplcnlj161:00643] Signal: Segmentation fault (11)
[hplcnlj161:00643] Signal code: Address not mapped (1)
[hplcnlj161:00643] Failing at address: 0x2aaa00000001
[hplcnlj161:00640] *** Process received signal ***
[hplcnlj161:00640] Signal: Segmentation fault (11)
[hplcnlj161:00640] Signal code: Address not mapped (1)
[hplcnlj161:00640] Failing at address: 0x2aaa00000001
[hplcnlj161:00636] *** Process received signal ***
[hplcnlj161:00652] *** Process received signal ***
[hplcnlj161:00652] Signal: Segmentation fault (11)
[hplcnlj161:00652] Signal code: Address not mapped (1)
[hplcnlj161:00652] Failing at address: 0x2aaa00000001
[hplcnlj161:00636] Signal: Segmentation fault (11)
[hplcnlj161:00636] Signal code: Address not mapped (1)
[hplcnlj161:00636] Failing at address: 0x2aaa00000001
[hplcnlj161:00637] [ 0] /lib64/libpthread.so.0 [0x2b86c74694c0]
[hplcnlj161:00637] [ 1] /users/amudar/openmpi-1.7/lib/libmca_common_sm.so.0(mca_common_sm_param_register+0x262) [0x2aaaad96628a]
[hplcnlj161:00637] [ 2] /users/amudar/openmpi-1.7/lib/openmpi/mca_btl_sm.so [0x2aaaaf0a55e8]
[hplcnlj161:00637] [ 3] /users/amudar/openmpi-1.7/lib/libmpi.so.0 [0x2b86c669f11b]
[hplcnlj161:00637] [ 4] /users/amudar/openmpi-1.7/lib/libmpi.so.0(mca_base_components_open+0x3ef) [0x2b86c669e70b]
[hplcnlj161:00637] [ 5] /users/amudar/openmpi-1.7/lib/libmpi.so.0(mca_btl_base_open+0xfd) [0x2b86c65c50fe]
[hplcnlj161:00637] [ 6] /users/amudar/openmpi-1.7/lib/openmpi/mca_bml_r2.so [0x2aaaadd9e4fb]
[hplcnlj161:00637] [ 7] /users/amudar/openmpi-1.7/lib/openmpi/mca_pml_ob1.so [0x2aaaae5fa429]
[hplcnlj161:00637] [ 8] /users/amudar/openmpi-1.7/lib/openmpi/mca_pml_crcpw.so [0x2aaaadfadce6]
[hplcnlj161:00637] [ 9] /users/amudar/openmpi-1.7/lib/libmpi.so.0 [0x2b86c6564a0d]
[hplcnlj161:00637] [10] /users/amudar/openmpi-1.7/lib/libmpi.so.0(ompi_cr_coord+0xc0) [0x2b86c65647ba]
[hplcnlj161:00637] [11] /users/amudar/openmpi-1.7/lib/libmpi.so.0(opal_cr_inc_core_recover+0xed) [0x2b86c6671fab]
[hplcnlj161:00637] [12] /users/amudar/openmpi-1.7/lib/openmpi/mca_snapc_full.so [0x2aaaabd280fc]
[hplcnlj161:00637] [13] /users/amudar/openmpi-1.7/lib/libmpi.so.0(opal_cr_test_if_checkpoint_ready+0x11b) [0x2b86c6671cd3]
[hplcnlj161:00637] [14] /users/amudar/openmpi-1.7/lib/libmpi.so.0 [0x2b86c66726e7]
[hplcnlj161:00637] [15] /lib64/libpthread.so.0 [0x2b86c7461367]
[hplcnlj161:00637] [16] /lib64/libc.so.6(clone+0x6d) [0x2b86c7748f7d]
[hplcnlj161:00637] *** End of error message ***
[hplcnlj161:00649] [ 0] /lib64/libpthread.so.0 [0x2b7bfa6204c0]
[hplcnlj161:00649] [ 1] /users/amudar/openmpi-1.7/lib/libmca_common_sm.so.0(mca_common_sm_param_register+0x262) [0x2aaaad96628a]
[hplcnlj161:00649] [ 2] /users/amudar/openmpi-1.7/lib/openmpi/mca_btl_sm.so [0x2aaaaf0a55e8]
[hplcnlj161:00649] [ 3] /users/amudar/openmpi-1.7/lib/libmpi.so.0 [0x2b7bf985611b]
[hplcnlj161:00649] [ 4] /users/amudar/openmpi-1.7/lib/libmpi.so.0(mca_base_components_open+0x3ef) [0x2b7bf985570b]
[hplcnlj161:00649] [ 5] /users/amudar/openmpi-1.7/lib/libmpi.so.0(mca_btl_base_open+0xfd) [0x2b7bf977c0fe]
[hplcnlj161:00649] [ 6] /users/amudar/openmpi-1.7/lib/openmpi/mca_bml_r2.so [0x2aaaadd9e4fb]
[hplcnlj161:00649] [ 7] /users/amudar/openmpi-1.7/lib/openmpi/mca_pml_ob1.so [0x2aaaae5fa429]
[hplcnlj161:00649] [ 8] /users/amudar/openmpi-1.7/lib/openmpi/mca_pml_crcpw.so [0x2aaaadfadce6]
[hplcnlj161:00649] [ 9] /users/amudar/openmpi-1.7/lib/libmpi.so.0 [0x2b7bf971ba0d]
[hplcnlj161:00649] [10] /users/amudar/openmpi-1.7/lib/libmpi.so.0(ompi_cr_coord+0xc0) [0x2b7bf971b7ba]
[hplcnlj161:00649] [11] /users/amudar/openmpi-1.7/lib/libmpi.so.0(opal_cr_inc_core_recover+0xed) [0x2b7bf9828fab]
[hplcnlj161:00649] [12] /users/amudar/openmpi-1.7/lib/openmpi/mca_snapc_full.so [0x2aaaabd280fc]
[hplcnlj161:00649] [13] /users/amudar/openmpi-1.7/lib/libmpi.so.0(opal_cr_test_if_checkpoint_ready+0x11b) [0x2b7bf9828cd3]
[hplcnlj161:00649] [14] /users/amudar/openmpi-1.7/lib/libmpi.so.0 [0x2b7bf98296e7]
[hplcnlj161:00649] [15] /lib64/libpthread.so.0 [0x2b7bfa618367]
[hplcnlj161:00649] [16] /lib64/libc.so.6(clone+0x6d) [0x2b7bfa8fff7d]
[hplcnlj161:00649] *** End of error message ***
Thanks
Ananda
Ananda B Mudar, PMP
Senior Technical Architect
Wipro Technologies
Ph: 972 765 8093
ananda.mu...@wipro.com
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel