Siegmar -- So it looks like the net problem is fixed; good. I'll commit and CMR that.
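For anyone following along, the net fix boils down to copying the bytes into a properly aligned temporary before reading them, instead of casting the incoming pointer. Here is a minimal sketch of that pattern; the helper names are made up for illustration and are not the code in opal/util/net.c:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not from opal/util/net.c): read a 32-bit value
 * from an address that may not be 4-byte aligned. memcpy has no
 * alignment requirement, so this is safe on strict-alignment CPUs. */
static uint32_t load_u32_unaligned(const void *p)
{
    uint32_t v;
    assert(p != NULL);
    memcpy(&v, p, sizeof(v));
    return v;
}

/* Hypothetical helper: compare two 32-bit addresses under a netmask,
 * returning 1 when the masked values match and 0 otherwise. */
static int same_network_u32(const void *a, const void *b, uint32_t netmask)
{
    return (load_u32_unaligned(a) & netmask) ==
           (load_u32_unaligned(b) & netmask);
}
```

Calling same_network_u32() on pointers into a byte buffer works even when the addresses are odd. Dereferencing a cast pointer such as (uint32_t *)(buf + 1) directly is the kind of access that can raise SIGBUS on SPARC.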
For the DDT test, can you give us access to this machine?  It might help speed debugging a lot.  (I'll let Nathan reply about the var problem)

If not, can you provide the following information about the DDT test:

1. It SIGBUS's at a point; can you send the full backtrace?

2. It complains about a misaligned read of a variable and shows its address.  Can you print the values of all the parameters of the function so that we can see *which* one it is using for the misaligned read?  (the printf is using 4 different variables, and we don't know which one is causing the misaligned read)


On Dec 19, 2013, at 8:52 AM, Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de> wrote:

> Hi,
>
> at first thank you very much for your help.
>
> 1st patch:
>
>> Can you apply the following patch to a trunk tarball and see if it works
>> for you?
>
> 2nd patch:
>
>> Found the problem. Was accessing a boolean variable using intval. That
>> is a bug that has gone unnoticed on all platforms but thankfully Solaris
>> caught it.
>>
>> Please try the attached patch.
>
>
> I applied both patches manually to openmpi-1.9a1r29972, because
> my patch program couldn't use the patches. Unfortunately I still
> get a Bus Error. Hopefully I didn't make a mistake applying your
> patches. Therefore I show you a "diff" for my files. By the way,
> I tried to apply your patches with "patch -b -i <your file>".
> Is it necessary to use a different command?
>
>
> tyr openmpi-1.9a1r29972 161 ls -l opal/mca/base/mca_base_var.c*
> -rw-r--r-- 1 fd1026 inf 60418 Dec 19 08:35 opal/mca/base/mca_base_var.c
> -rw-r--r-- 1 fd1026 inf 60236 Dec 19 03:05 opal/mca/base/mca_base_var.c.orig
> tyr openmpi-1.9a1r29972 162 diff opal/mca/base/mca_base_var.c*
> 1685,1689c1685
> < if (MCA_BASE_VAR_TYPE_BOOL == var->mbv_type) {
> <     ret = var->mbv_enumerator->string_from_value(var->mbv_enumerator, value->boolval, &tmp);
> < } else {
> <     ret = var->mbv_enumerator->string_from_value(var->mbv_enumerator, value->intval, &tmp);
> < }
> ---
>> ret = var->mbv_enumerator->string_from_value(var->mbv_enumerator, value->intval, &tmp);
> tyr openmpi-1.9a1r29972 163
>
>
>
> tyr openmpi-1.9a1r29972 165 ls -l opal/util/net.c*
> -rw-r--r-- 1 fd1026 inf 12922 Dec 19 07:55 opal/util/net.c
> -rw-r--r-- 1 fd1026 inf 12675 Dec 19 03:05 opal/util/net.c.orig
> tyr openmpi-1.9a1r29972 166 diff opal/util/net.c*
> 267,271c267,268
> < struct sockaddr_in inaddr1, inaddr2;
> < /* Use temporary variables and memcpy's so that we don't
> <    run into bus errors on Solaris/SPARC */
> < memcpy(&inaddr1, addr1, sizeof(inaddr1));
> < memcpy(&inaddr2, addr2, sizeof(inaddr2));
> ---
>> const struct sockaddr_in *inaddr1 = (struct sockaddr_in*) addr1;
>> const struct sockaddr_in *inaddr2 = (struct sockaddr_in*) addr2;
> 274,275c271,272
> < if((inaddr1.sin_addr.s_addr & netmask) ==
> <    (inaddr2.sin_addr.s_addr & netmask)) {
> ---
>> if((inaddr1->sin_addr.s_addr & netmask) ==
>>    (inaddr2->sin_addr.s_addr & netmask)) {
> 284,290c281,284
> < struct sockaddr_in6 inaddr1, inaddr2;
> < /* Use temporary variables and memcpy's so that we don't
> <    run into bus errors on Solaris/SPARC */
> < memcpy(&inaddr1, addr1, sizeof(inaddr1));
> < memcpy(&inaddr2, addr2, sizeof(inaddr2));
> < struct in6_addr *a6_1 = (struct in6_addr*) &inaddr1.sin6_addr;
> < struct in6_addr *a6_2 = (struct in6_addr*) &inaddr2.sin6_addr;
> ---
>> const struct sockaddr_in6 *inaddr1 = (struct sockaddr_in6*) addr1;
>> const struct sockaddr_in6 *inaddr2 = (struct sockaddr_in6*) addr2;
>> struct in6_addr *a6_1 = (struct in6_addr*) &inaddr1->sin6_addr;
>> struct in6_addr *a6_2 = (struct in6_addr*) &inaddr2->sin6_addr;
> tyr openmpi-1.9a1r29972 167
>
>
>
> Now my debug information.
>
> tyr fd1026 52 cd /usr/local/openmpi-1.9_64_cc/bin/
> tyr bin 53 /opt/solstudio12.3/bin/sparcv9/dbx ompi_info
> For information about new features see `help changes'
> To remove this message, put `dbxenv suppress_startup_message 7.9' in your .dbxrc
> Reading ompi_info
> Reading ld.so.1
> Reading libmpi.so.0.0.0
> Reading libopen-rte.so.0.0.0
> Reading libopen-pal.so.0.0.0
> Reading libsendfile.so.1
> Reading libpicl.so.1
> Reading libkstat.so.1
> Reading liblgrp.so.1
> Reading libsocket.so.1
> Reading libnsl.so.1
> Reading librt.so.1
> Reading libm.so.2
> Reading libthread.so.1
> Reading libc.so.1
> Reading libdoor.so.1
> Reading libaio.so.1
> Reading libmd.so.1
> (dbx) run -a
> Running: ompi_info -a
> (process id 10998)
> Reading libc_psr.so.1
> ...
> MCA compress: parameter "compress_base_verbose" (current value:
>               "-1", data source: default, level: 8 dev/detail,
>               type: int)
>               Verbosity level for the compress framework (0 = no
>               verbosity)
> t@1 (l@1) signal BUS (invalid address alignment) in var_value_string
> at line 1680 in file "mca_base_var.c"
> 1680 ret = asprintf (value_string, var_type_formats[var->mbv_type], value[0]);
> (dbx)
> (dbx)
> (dbx) check -all
> dbx: warning: check -all will be turned on in the next run of the process
> access checking - OFF
> memuse checking - OFF
> (dbx) run -a
> Running: ompi_info -a
> (process id 11000)
> Reading rtcapihook.so
> Reading libdl.so.1
> Reading rtcaudit.so
> Reading libmapmalloc.so.1
> Reading rtcboot.so
> Reading librtc.so
> Reading libmd_psr.so.1
> RTC: Enabling Error Checking...
> RTC: Using UltraSparc trap mechanism
> RTC: See `help rtc showmap' and `help rtc limitations' for details.
> RTC: Running program...
> Read from uninitialized (rui) on thread 1:
> Attempting to read 4 bytes at address 0xffffffff7fffd5f8
> which is 184 bytes above the current stack pointer
> Variable is 'index'
> t@1 (l@1) stopped in var_find at line 802 in file "mca_base_var.c"
> 802 return (OPAL_SUCCESS != ret) ? ret : index;
> (dbx)
>
>
> In my opinion it is the same error as before.
>
>
>
> I still get a Bus Error with "make check".
>
> tyr bin 54 cd /export2/src/openmpi-1.9/openmpi-1.9a1r29972-SunOS.sparc.64_cc/test/datatype/.libs/
> tyr .libs 55 /opt/solstudio12.3/bin/sparcv9/dbx ddt_raw
> For information about new features see `help changes'
> To remove this message, put `dbxenv suppress_startup_message 7.9' in your .dbxrc
> Reading ddt_raw
> Reading ld.so.1
> Reading libmpi.so.0.0.0
> Reading libopen-rte.so.0.0.0
> Reading libopen-pal.so.0.0.0
> Reading libsendfile.so.1
> Reading libpicl.so.1
> Reading libkstat.so.1
> Reading liblgrp.so.1
> Reading libsocket.so.1
> Reading libnsl.so.1
> Reading librt.so.1
> Reading libm.so.2
> Reading libthread.so.1
> Reading libc.so.1
> Reading libdoor.so.1
> Reading libaio.so.1
> Reading libmd.so.1
> (dbx) run
> Running: ddt_raw
> (process id 11018)
> Reading libc_psr.so.1
>
>
> #
> * TEST INVERSED VECTOR
> #
>
> t@1 (l@1) signal BUS (invalid address alignment) in opal_convertor_raw
> at line 71 in file "opal_convertor_raw.c"
> 71 DO_DEBUG( opal_output( 0, "opal_convertor_raw( %p, {%p, %u}, %lu )\n", (void*)pConvertor,
> (dbx)
>
>
> Once more I think it is the same error. I have the same problem with
> my small program.
>
>
>
>
> tyr small_prog 62 mpicc init_finalize.c
> tyr small_prog 63 /opt/solstudio12.3/bin/sparcv9/dbx \
>   /usr/local/openmpi-1.9_64_cc/bin/mpiexec
> For information about new features see `help changes'
> To remove this message, put `dbxenv suppress_startup_message 7.9' in your .dbxrc
> Reading mpiexec
> Reading ld.so.1
> Reading libopen-rte.so.0.0.0
> Reading libopen-pal.so.0.0.0
> Reading libsendfile.so.1
> Reading libpicl.so.1
> Reading libkstat.so.1
> Reading liblgrp.so.1
> Reading libsocket.so.1
> Reading libnsl.so.1
> Reading librt.so.1
> Reading libm.so.2
> Reading libthread.so.1
> Reading libc.so.1
> Reading libdoor.so.1
> Reading libaio.so.1
> Reading libmd.so.1
> (dbx)
> (dbx) run -np 1 a.out
> Running: mpiexec -np 1 a.out
> (process id 11050)
> Reading libc_psr.so.1
> Reading mca_shmem_mmap.so
> Reading libmp.so.2
> Reading libscf.so.1
> Reading libuutil.so.1
> Reading libgen.so.1
> Reading mca_shmem_posix.so
> Reading mca_shmem_sysv.so
> Reading mca_ess_env.so
> Reading mca_ess_hnp.so
> Reading mca_ess_singleton.so
> Reading mca_ess_tool.so
> Reading mca_pstat_test.so
> Reading mca_state_app.so
> Reading mca_state_hnp.so
> Reading mca_state_novm.so
> Reading mca_state_orted.so
> Reading mca_state_staged_hnp.so
> Reading mca_state_staged_orted.so
> Reading mca_state_tool.so
> Reading mca_errmgr_default_app.so
> Reading mca_errmgr_default_hnp.so
> Reading mca_errmgr_default_orted.so
> Reading mca_errmgr_default_tool.so
> Reading mca_plm_rsh.so
> Reading mca_oob_tcp.so
> Reading mca_rml_oob.so
> Reading mca_routed_binomial.so
> Reading mca_routed_debruijn.so
> Reading mca_routed_direct.so
> Reading mca_routed_radix.so
> Reading mca_db_hash.so
> Reading mca_db_print.so
> Reading mca_grpcomm_bad.so
> Reading mca_ras_simulator.so
> Reading mca_rmaps_lama.so
> Reading mca_rmaps_mindist.so
> Reading mca_rmaps_ppr.so
> Reading mca_rmaps_rank_file.so
> Reading mca_rmaps_resilient.so
> Reading mca_rmaps_round_robin.so
> Reading mca_rmaps_seq.so
> Reading mca_rmaps_staged.so
> Reading mca_odls_default.so
> Reading mca_iof_hnp.so
> Reading mca_iof_mr_hnp.so
> Reading mca_iof_mr_orted.so
> Reading mca_iof_orted.so
> Reading mca_iof_tool.so
> Reading mca_filem_raw.so
> Reading mca_dfs_app.so
> Reading mca_dfs_orted.so
> Reading mca_dfs_test.so
>
> Now the program hangs.
>
> ^Cdbx: warning: Interrupt ignored but forwarded to child.
> t@1 (l@1) signal INT (Interrupt) in __pollsys at 0xffffffff7d5dc740
> 0xffffffff7d5dc740: __pollsys+0x0004: ta %icc,0x0000000000000040
> Current function is orterun
> 1049 opal_event_loop(orte_event_base, OPAL_EVLOOP_ONCE);
> (dbx)
> (dbx)
> (dbx)
> (dbx) check -all
> dbx: warning: check -all will be turned on in the next run of the process
> access checking - OFF
> memuse checking - OFF
> (dbx) run -np 1 a.out
> Running: mpiexec -np 1 a.out
> (process id 11054)
> Reading rtcapihook.so
> Reading libdl.so.1
> Reading rtcaudit.so
> Reading libmapmalloc.so.1
> Reading rtcboot.so
> Reading librtc.so
> Reading libmd_psr.so.1
> RTC: Enabling Error Checking...
> RTC: Using UltraSparc trap mechanism
> RTC: See `help rtc showmap' and `help rtc limitations' for details.
> RTC: Running program...
> Read from uninitialized (rui) on thread 1:
> Attempting to read 4 bytes at address 0xffffffff7fffd438
> which is 184 bytes above the current stack pointer
> Variable is 'index'
> t@1 (l@1) stopped in var_find at line 802 in file "mca_base_var.c"
> 802 return (OPAL_SUCCESS != ret) ? ret : index;
> (dbx)
>
>
>
> I'm sorry that you have so much trouble with me and Solaris. On the
> other hand I still hope that you can solve the problem(s). Once more
> thank you very much for your help in advance.
>
>
> Kind regards
>
> Siegmar
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
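
P.S. As background on the var problem quoted above (reading a boolean variable through intval): a bool typically occupies one byte, so reading it as a 4-byte int can hit a misaligned address on SPARC. Below is a rough sketch of the "check the type first, then read only that many bytes" idea. The type and function names are invented for illustration and are not the Open MPI definitions:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical type tag, standing in for the real variable-type enum. */
typedef enum { VAR_TYPE_INT, VAR_TYPE_BOOL } var_type_t;

/* Hypothetical reader: return the stored value widened to int.
 * Only sizeof(bool) bytes are read for a boolean, so a 1-byte bool
 * is never accessed as if it were a 4-byte int. */
static int var_value_as_int(var_type_t type, const void *storage)
{
    assert(storage != NULL);
    if (VAR_TYPE_BOOL == type) {
        bool b;
        memcpy(&b, storage, sizeof(b));
        return b ? 1 : 0;
    } else {
        int i;
        memcpy(&i, storage, sizeof(i));
        return i;
    }
}
```

With a `bool flag = true;`, calling var_value_as_int(VAR_TYPE_BOOL, &flag) reads a single bool and returns 1; the int branch is only taken when the tag says the storage really holds an int.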