Ralph and I are trying to track down the mysterious ORTE error. In doing so, I have found at least one fairly repeatable error on my cluster: when running through SLURM the ibm/dynamic/spawn test, where we mpirun 3 procs and then we MPI_COMM_SPAWN 3 more. Running the orteds through valgrind, I see a bunch of uninitialized epoch issues.
Attached at the 2 valgrind outputs. Can these be fixed? I don't know if they're actual problems or not, but seeing uninitialized values go by makes me extremely nervous. Thanks! -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
==4436== Memcheck, a memory error detector ==4436== Copyright (C) 2002-2009, and GNU GPL'd, by Julian Seward et al. ==4436== Using Valgrind-3.5.0 and LibVEX; rerun with -h for copyright info ==4436== Command: /home/jsquyres/bogus/bin/orted -mca ess slurm -mca orte_ess_jobid 2778071040 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri "2778071040.0.0;tcp://172.29.218.140:40955;tcp://10.10.10.140:40955;tcp://10.10.20.140:40955;tcp://10.10.30.140:40955" --mca orte_startup_timeout 10000 --mca mpi_leave_pinned 0 --mca btl tcp,self ==4436== ==4436== Conditional jump or move depends on uninitialised value(s) ==4436== at 0x4E6634C: orte_util_print_epoch (name_fns.c:301) ==4436== by 0x4E65CB3: orte_util_print_name_args (name_fns.c:144) ==4436== by 0x4E898D6: orte_ess_base_proc_get_epoch (ess_base_select.c:46) ==4436== by 0x4EA3B6D: orte_odls_base_default_construct_child_list (odls_base_default_fns.c:737) ==4436== by 0xA36FD92: orte_odls_default_launch_local_procs (odls_default_module.c:1496) ==4436== by 0x4E823A3: orte_daemon_process_commands (orted_comm.c:508) ==4436== by 0x4E819B1: orte_daemon_cmd_processor (orted_comm.c:324) ==4436== by 0x4F3EA39: event_process_active_single_queue (event.c:1303) ==4436== by 0x4F3EBAF: event_process_active (event.c:1370) ==4436== by 0x4F3EFBE: opal_libevent207_event_base_loop (event.c:1566) ==4436== by 0x4E806AE: orte_daemon (orted_main.c:682) ==4436== by 0x400929: main (orted.c:62) ==4436== ==4436== Conditional jump or move depends on uninitialised value(s) ==4436== at 0x4E66392: orte_util_print_epoch (name_fns.c:303) ==4436== by 0x4E65CB3: orte_util_print_name_args (name_fns.c:144) ==4436== by 0x4E898D6: orte_ess_base_proc_get_epoch (ess_base_select.c:46) ==4436== by 0x4EA3B6D: orte_odls_base_default_construct_child_list (odls_base_default_fns.c:737) ==4436== by 0xA36FD92: orte_odls_default_launch_local_procs (odls_default_module.c:1496) ==4436== by 0x4E823A3: orte_daemon_process_commands (orted_comm.c:508) ==4436== by 0x4E819B1: orte_daemon_cmd_processor (orted_comm.c:324) ==4436== by 0x4F3EA39: event_process_active_single_queue (event.c:1303) ==4436== by 0x4F3EBAF: event_process_active (event.c:1370) ==4436== by 0x4F3EFBE: opal_libevent207_event_base_loop (event.c:1566) ==4436== by 0x4E806AE: orte_daemon (orted_main.c:682) ==4436== by 0x400929: main (orted.c:62) ==4436== ==4436== Use of uninitialised value of size 8 ==4436== at 0x64649BD: _itoa_word (in /lib64/libc-2.5.so) ==4436== by 0x6467E5A: vfprintf (in /lib64/libc-2.5.so) ==4436== by 0x648C889: vsnprintf (in /lib64/libc-2.5.so) ==4436== by 0x6470492: snprintf (in /lib64/libc-2.5.so) ==4436== by 0x4E6640E: orte_util_print_epoch (name_fns.c:306) ==4436== by 0x4E65CB3: orte_util_print_name_args (name_fns.c:144) ==4436== by 0x4E898D6: orte_ess_base_proc_get_epoch (ess_base_select.c:46) ==4436== by 0x4EA3B6D: orte_odls_base_default_construct_child_list (odls_base_default_fns.c:737) ==4436== by 0xA36FD92: orte_odls_default_launch_local_procs (odls_default_module.c:1496) ==4436== by 0x4E823A3: orte_daemon_process_commands (orted_comm.c:508) ==4436== by 0x4E819B1: orte_daemon_cmd_processor (orted_comm.c:324) ==4436== by 0x4F3EA39: event_process_active_single_queue (event.c:1303) ==4436== ==4436== Conditional jump or move depends on uninitialised value(s) ==4436== at 0x64649C7: _itoa_word (in /lib64/libc-2.5.so) ==4436== by 0x6467E5A: vfprintf (in /lib64/libc-2.5.so) ==4436== by 0x648C889: vsnprintf (in /lib64/libc-2.5.so) ==4436== by 0x6470492: snprintf (in /lib64/libc-2.5.so) ==4436== by 0x4E6640E: orte_util_print_epoch (name_fns.c:306) ==4436== by 0x4E65CB3: orte_util_print_name_args (name_fns.c:144) ==4436== by 0x4E898D6: orte_ess_base_proc_get_epoch (ess_base_select.c:46) ==4436== by 0x4EA3B6D: orte_odls_base_default_construct_child_list (odls_base_default_fns.c:737) ==4436== by 0xA36FD92: orte_odls_default_launch_local_procs (odls_default_module.c:1496) ==4436== by 0x4E823A3: orte_daemon_process_commands (orted_comm.c:508) ==4436== by 0x4E819B1: orte_daemon_cmd_processor (orted_comm.c:324) ==4436== by 0x4F3EA39: event_process_active_single_queue (event.c:1303) ==4436== ==4436== Conditional jump or move depends on uninitialised value(s) ==4436== at 0x6467ED4: vfprintf (in /lib64/libc-2.5.so) ==4436== by 0x648C889: vsnprintf (in /lib64/libc-2.5.so) ==4436== by 0x6470492: snprintf (in /lib64/libc-2.5.so) ==4436== by 0x4E6640E: orte_util_print_epoch (name_fns.c:306) ==4436== by 0x4E65CB3: orte_util_print_name_args (name_fns.c:144) ==4436== by 0x4E898D6: orte_ess_base_proc_get_epoch (ess_base_select.c:46) ==4436== by 0x4EA3B6D: orte_odls_base_default_construct_child_list (odls_base_default_fns.c:737) ==4436== by 0xA36FD92: orte_odls_default_launch_local_procs (odls_default_module.c:1496) ==4436== by 0x4E823A3: orte_daemon_process_commands (orted_comm.c:508) ==4436== by 0x4E819B1: orte_daemon_cmd_processor (orted_comm.c:324) ==4436== by 0x4F3EA39: event_process_active_single_queue (event.c:1303) ==4436== by 0x4F3EBAF: event_process_active (event.c:1370) ==4436== ==4436== Conditional jump or move depends on uninitialised value(s) ==4436== at 0x4E6634C: orte_util_print_epoch (name_fns.c:301) ==4436== by 0x4E65CB3: orte_util_print_name_args (name_fns.c:144) ==4436== by 0x4E898D6: orte_ess_base_proc_get_epoch (ess_base_select.c:46) ==4436== by 0x4E96DBF: orte_grpcomm_base_daemon_collective (grpcomm_base_coll.c:715) ==4436== by 0x4E979D4: process_msg (grpcomm_base_coll.c:883) ==4436== by 0x4F3EA39: event_process_active_single_queue (event.c:1303) ==4436== by 0x4F3EBAF: event_process_active (event.c:1370) ==4436== by 0x4F3EFBE: opal_libevent207_event_base_loop (event.c:1566) ==4436== by 0x4E806AE: orte_daemon (orted_main.c:682) ==4436== by 0x400929: main (orted.c:62) ==4436== ==4436== Conditional jump or move depends on uninitialised value(s) ==4436== at 0x4E66392: orte_util_print_epoch (name_fns.c:303) ==4436== by 0x4E65CB3: orte_util_print_name_args (name_fns.c:144) ==4436== by 0x4E898D6: orte_ess_base_proc_get_epoch (ess_base_select.c:46) ==4436== by 0x4E96DBF: orte_grpcomm_base_daemon_collective (grpcomm_base_coll.c:715) ==4436== by 0x4E979D4: process_msg (grpcomm_base_coll.c:883) ==4436== by 0x4F3EA39: event_process_active_single_queue (event.c:1303) ==4436== by 0x4F3EBAF: event_process_active (event.c:1370) ==4436== by 0x4F3EFBE: opal_libevent207_event_base_loop (event.c:1566) ==4436== by 0x4E806AE: orte_daemon (orted_main.c:682) ==4436== by 0x400929: main (orted.c:62) ==4436== ==4436== Invalid free() / delete / delete[] ==4436== at 0x4C20A31: free (vg_replace_malloc.c:325) ==4436== by 0x4F1B433: opal_free (malloc.c:190) ==4436== by 0x4E67987: orte_proc_info_finalize (proc_info.c:200) ==4436== by 0x4E4EE18: orte_finalize (orte_finalize.c:67) ==4436== by 0x4E53972: orte_quit (orte_quit.c:155) ==4436== by 0x4E83147: orte_daemon_process_commands (orted_comm.c:756) ==4436== by 0x4E819B1: orte_daemon_cmd_processor (orted_comm.c:324) ==4436== by 0x4F3EA39: event_process_active_single_queue (event.c:1303) ==4436== by 0x4F3EBAF: event_process_active (event.c:1370) ==4436== by 0x4F3EFBE: opal_libevent207_event_base_loop (event.c:1566) ==4436== by 0x4E806AE: orte_daemon (orted_main.c:682) ==4436== by 0x400929: main (orted.c:62) ==4436== Address 0x7ff000f5f is on thread 1's stack ==4436== ==4436== ==4436== HEAP SUMMARY: ==4436== in use at exit: 192,837 bytes in 442 blocks ==4436== total heap usage: 8,178 allocs, 7,737 frees, 12,493,593 bytes allocated ==4436== ==4436== LEAK SUMMARY: ==4436== definitely lost: 50,169 bytes in 72 blocks ==4436== indirectly lost: 52,519 bytes in 81 blocks ==4436== possibly lost: 8,408 bytes in 3 blocks ==4436== still reachable: 81,741 bytes in 286 blocks ==4436== suppressed: 0 bytes in 0 blocks ==4436== Rerun with --leak-check=full to see details of leaked memory ==4436== ==4436== For counts of detected and suppressed errors, rerun with: -v ==4436== Use --track-origins=yes to see where uninitialised values come from ==4436== ERROR SUMMARY: 151 errors from 8 contexts (suppressed: 7 from 7)
==4267== Memcheck, a memory error detector ==4267== Copyright (C) 2002-2009, and GNU GPL'd, by Julian Seward et al. ==4267== Using Valgrind-3.5.0 and LibVEX; rerun with -h for copyright info ==4267== Command: /home/jsquyres/bogus/bin/orted -mca ess slurm -mca orte_ess_jobid 2778071040 -mca orte_ess_vpid 2 -mca orte_ess_num_procs 3 --hnp-uri "2778071040.0.0;tcp://172.29.218.140:40955;tcp://10.10.10.140:40955;tcp://10.10.20.140:40955;tcp://10.10.30.140:40955" --mca orte_startup_timeout 10000 --mca mpi_leave_pinned 0 --mca btl tcp,self ==4267== ==4267== Conditional jump or move depends on uninitialised value(s) ==4267== at 0x4E6634C: orte_util_print_epoch (name_fns.c:301) ==4267== by 0x4E65CB3: orte_util_print_name_args (name_fns.c:144) ==4267== by 0x4E898D6: orte_ess_base_proc_get_epoch (ess_base_select.c:46) ==4267== by 0x4EA3B6D: orte_odls_base_default_construct_child_list (odls_base_default_fns.c:737) ==4267== by 0xA36FD92: orte_odls_default_launch_local_procs (odls_default_module.c:1496) ==4267== by 0x4E823A3: orte_daemon_process_commands (orted_comm.c:508) ==4267== by 0x4E819B1: orte_daemon_cmd_processor (orted_comm.c:324) ==4267== by 0x4F3EA39: event_process_active_single_queue (event.c:1303) ==4267== by 0x4F3EBAF: event_process_active (event.c:1370) ==4267== by 0x4F3EFBE: opal_libevent207_event_base_loop (event.c:1566) ==4267== by 0x4E806AE: orte_daemon (orted_main.c:682) ==4267== by 0x400929: main (orted.c:62) ==4267== ==4267== Conditional jump or move depends on uninitialised value(s) ==4267== at 0x4E66392: orte_util_print_epoch (name_fns.c:303) ==4267== by 0x4E65CB3: orte_util_print_name_args (name_fns.c:144) ==4267== by 0x4E898D6: orte_ess_base_proc_get_epoch (ess_base_select.c:46) ==4267== by 0x4EA3B6D: orte_odls_base_default_construct_child_list (odls_base_default_fns.c:737) ==4267== by 0xA36FD92: orte_odls_default_launch_local_procs (odls_default_module.c:1496) ==4267== by 0x4E823A3: orte_daemon_process_commands (orted_comm.c:508) ==4267== by 0x4E819B1: orte_daemon_cmd_processor (orted_comm.c:324) ==4267== by 0x4F3EA39: event_process_active_single_queue (event.c:1303) ==4267== by 0x4F3EBAF: event_process_active (event.c:1370) ==4267== by 0x4F3EFBE: opal_libevent207_event_base_loop (event.c:1566) ==4267== by 0x4E806AE: orte_daemon (orted_main.c:682) ==4267== by 0x400929: main (orted.c:62) ==4267== ==4267== Use of uninitialised value of size 8 ==4267== at 0x64649BD: _itoa_word (in /lib64/libc-2.5.so) ==4267== by 0x6467E5A: vfprintf (in /lib64/libc-2.5.so) ==4267== by 0x648C889: vsnprintf (in /lib64/libc-2.5.so) ==4267== by 0x6470492: snprintf (in /lib64/libc-2.5.so) ==4267== by 0x4E6640E: orte_util_print_epoch (name_fns.c:306) ==4267== by 0x4E65CB3: orte_util_print_name_args (name_fns.c:144) ==4267== by 0x4E898D6: orte_ess_base_proc_get_epoch (ess_base_select.c:46) ==4267== by 0x4EA3B6D: orte_odls_base_default_construct_child_list (odls_base_default_fns.c:737) ==4267== by 0xA36FD92: orte_odls_default_launch_local_procs (odls_default_module.c:1496) ==4267== by 0x4E823A3: orte_daemon_process_commands (orted_comm.c:508) ==4267== by 0x4E819B1: orte_daemon_cmd_processor (orted_comm.c:324) ==4267== by 0x4F3EA39: event_process_active_single_queue (event.c:1303) ==4267== ==4267== Conditional jump or move depends on uninitialised value(s) ==4267== at 0x64649C7: _itoa_word (in /lib64/libc-2.5.so) ==4267== by 0x6467E5A: vfprintf (in /lib64/libc-2.5.so) ==4267== by 0x648C889: vsnprintf (in /lib64/libc-2.5.so) ==4267== by 0x6470492: snprintf (in /lib64/libc-2.5.so) ==4267== by 0x4E6640E: orte_util_print_epoch (name_fns.c:306) ==4267== by 0x4E65CB3: orte_util_print_name_args (name_fns.c:144) ==4267== by 0x4E898D6: orte_ess_base_proc_get_epoch (ess_base_select.c:46) ==4267== by 0x4EA3B6D: orte_odls_base_default_construct_child_list (odls_base_default_fns.c:737) ==4267== by 0xA36FD92: orte_odls_default_launch_local_procs (odls_default_module.c:1496) ==4267== by 0x4E823A3: orte_daemon_process_commands (orted_comm.c:508) ==4267== by 0x4E819B1: orte_daemon_cmd_processor (orted_comm.c:324) ==4267== by 0x4F3EA39: event_process_active_single_queue (event.c:1303) ==4267== ==4267== Conditional jump or move depends on uninitialised value(s) ==4267== at 0x6467ED4: vfprintf (in /lib64/libc-2.5.so) ==4267== by 0x648C889: vsnprintf (in /lib64/libc-2.5.so) ==4267== by 0x6470492: snprintf (in /lib64/libc-2.5.so) ==4267== by 0x4E6640E: orte_util_print_epoch (name_fns.c:306) ==4267== Address 0x7ff000f5f is on thread 1's stack ==4267== ==4267== ==4267== HEAP SUMMARY: ==4267== in use at exit: 174,607 bytes in 393 blocks ==4267== total heap usage: 6,696 allocs, 6,304 frees, 11,958,108 bytes allocated ==4267== ==4267== LEAK SUMMARY: ==4267== definitely lost: 41,547 bytes in 42 blocks ==4267== indirectly lost: 43,935 bytes in 62 blocks ==4267== possibly lost: 7,384 bytes in 3 blocks ==4267== still reachable: 81,741 bytes in 286 blocks ==4267== suppressed: 0 bytes in 0 blocks ==4267== Rerun with --leak-check=full to see details of leaked memory ==4267== ==4267== For counts of detected and suppressed errors, rerun with: -v ==4267== Use --track-origins=yes to see where uninitialised values come from ==4267== ERROR SUMMARY: 69 errors from 8 contexts (suppressed: 7 from 7)