Siegmar --

So it looks like the net problem is fixed; good.  I'll commit and CMR that.

For the DDT test, can you give us access to this machine?  It might speed up 
debugging a lot.  (I'll let Nathan reply about the var problem.)

If not, can you provide the following information about the DDT test:

1. It dies with SIGBUS at some point; can you send the full backtrace (e.g., 
the output of "where" in dbx)?
2. It complains about a misaligned read of a variable and shows its address.  
Can you print the values of all the parameters of the function so that we can 
see *which* one it is using for the misaligned read?  (The printf uses 4 
different variables, and we don't know which one is causing the misaligned 
read.)
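
For illustration only, here is a minimal sketch of the kind of check that would
help with item 2.  The helper names (is_aligned, report_alignment, load_u32) are
hypothetical and not part of Open MPI; the sketch assumes the relevant values
are plain pointers that can be inspected just before the line that faults.  It
also shows the memcpy approach that the net.c patch uses to avoid a misaligned
load.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 if p is aligned to `align` bytes; align must be a power of two. */
static int is_aligned(const void *p, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return ((uintptr_t) p & (align - 1)) == 0;
}

/* Hypothetical debugging helper: print one pointer and whether it is
   suitably aligned for an access of `align` bytes. */
static void report_alignment(const char *name, const void *p, size_t align)
{
    fprintf(stderr, "%s = %p (%s for %lu-byte access)\n", name, p,
            is_aligned(p, align) ? "aligned" : "MISALIGNED",
            (unsigned long) align);
}

/* Read a 32-bit value from a possibly misaligned address.  Copying through
   memcpy avoids the direct load that can raise SIGBUS on SPARC. */
static uint32_t load_u32(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof(v));
    return v;
}
```

Calling something like report_alignment("pConvertor", pConvertor, 8) for each
pointer argument, right before the DO_DEBUG line, should show which one is off
(assuming an 8-byte alignment requirement for that argument, which is a guess on
a 64-bit SPARC build).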


On Dec 19, 2013, at 8:52 AM, Siegmar Gross 
<siegmar.gr...@informatik.hs-fulda.de> wrote:

> Hi,
> 
> at first thank you very much for your help.
> 
> 1st patch:
> 
>> Can you apply the following patch to a trunk tarball and see if it works
>> for you?
> 
> 2nd patch:
> 
>> Found the problem. Was accessing a boolean variable using intval. That
>> is a bug that has gone unnoticed on all platforms but thankfully Solaris
>> caught it.
>> 
>> Please try the attached patch.
> 
> 
> I applied both patches manually to openmpi-1.9a1r29972, because
> my patch program couldn't use the patches. Unfortunately I still
> get a Bus Error. Hopefully I didn't make a mistake applying your
> patches. Therefore I show you a "diff" for my files. By the way,
> I tried to apply your patches with "patch -b -i <your file>".
> Is it necessary to use a different command?
> 
> 
> tyr openmpi-1.9a1r29972 161 ls -l opal/mca/base/mca_base_var.c*
> -rw-r--r-- 1 fd1026 inf 60418 Dec 19 08:35 opal/mca/base/mca_base_var.c
> -rw-r--r-- 1 fd1026 inf 60236 Dec 19 03:05 opal/mca/base/mca_base_var.c.orig
> tyr openmpi-1.9a1r29972 162 diff opal/mca/base/mca_base_var.c*
> 1685,1689c1685
> <        if (MCA_BASE_VAR_TYPE_BOOL == var->mbv_type) {
> <            ret = var->mbv_enumerator->string_from_value(var->mbv_enumerator, value->boolval, &tmp);
> <        } else {
> <            ret = var->mbv_enumerator->string_from_value(var->mbv_enumerator, value->intval, &tmp);
> <        }
> ---
>>        ret = var->mbv_enumerator->string_from_value(var->mbv_enumerator, value->intval, &tmp);
> tyr openmpi-1.9a1r29972 163 
> 
> 
> 
> tyr openmpi-1.9a1r29972 165 ls -l opal/util/net.c*
> -rw-r--r-- 1 fd1026 inf 12922 Dec 19 07:55 opal/util/net.c
> -rw-r--r-- 1 fd1026 inf 12675 Dec 19 03:05 opal/util/net.c.orig
> tyr openmpi-1.9a1r29972 166 diff opal/util/net.c*
> 267,271c267,268
> <             struct sockaddr_in inaddr1, inaddr2;
> <             /* Use temporary variables and memcpy's so that we don't
> <                run into bus errors on Solaris/SPARC */
> <             memcpy(&inaddr1, addr1, sizeof(inaddr1));
> <             memcpy(&inaddr2, addr2, sizeof(inaddr2));
> ---
>>            const struct sockaddr_in *inaddr1 = (struct sockaddr_in*) addr1;
>>            const struct sockaddr_in *inaddr2 = (struct sockaddr_in*) addr2;
> 274,275c271,272
> <             if((inaddr1.sin_addr.s_addr & netmask) ==
> <                (inaddr2.sin_addr.s_addr & netmask)) {
> ---
>>            if((inaddr1->sin_addr.s_addr & netmask) ==
>>               (inaddr2->sin_addr.s_addr & netmask)) {
> 284,290c281,284
> <             struct sockaddr_in6 inaddr1, inaddr2;
> <             /* Use temporary variables and memcpy's so that we don't
> <                run into bus errors on Solaris/SPARC */
> <             memcpy(&inaddr1, addr1, sizeof(inaddr1));
> <             memcpy(&inaddr2, addr2, sizeof(inaddr2));
> <             struct in6_addr *a6_1 = (struct in6_addr*) &inaddr1.sin6_addr;
> <             struct in6_addr *a6_2 = (struct in6_addr*) &inaddr2.sin6_addr;
> ---
>>            const struct sockaddr_in6 *inaddr1 = (struct sockaddr_in6*) addr1;
>>            const struct sockaddr_in6 *inaddr2 = (struct sockaddr_in6*) addr2;
>>            struct in6_addr *a6_1 = (struct in6_addr*) &inaddr1->sin6_addr;
>>            struct in6_addr *a6_2 = (struct in6_addr*) &inaddr2->sin6_addr;
> tyr openmpi-1.9a1r29972 167 
> 
> 
> 
> Now my debug information.
> 
> tyr fd1026 52 cd /usr/local/openmpi-1.9_64_cc/bin/
> tyr bin 53 /opt/solstudio12.3/bin/sparcv9/dbx ompi_info
> For information about new features see `help changes'
> To remove this message, put `dbxenv suppress_startup_message 7.9' in your 
> .dbxrc
> Reading ompi_info
> Reading ld.so.1
> Reading libmpi.so.0.0.0
> Reading libopen-rte.so.0.0.0
> Reading libopen-pal.so.0.0.0
> Reading libsendfile.so.1
> Reading libpicl.so.1
> Reading libkstat.so.1
> Reading liblgrp.so.1
> Reading libsocket.so.1
> Reading libnsl.so.1
> Reading librt.so.1
> Reading libm.so.2
> Reading libthread.so.1
> Reading libc.so.1
> Reading libdoor.so.1
> Reading libaio.so.1
> Reading libmd.so.1
> (dbx) run -a
> Running: ompi_info -a 
> (process id 10998)
> Reading libc_psr.so.1
> ...
>    MCA compress: parameter "compress_base_verbose" (current value:
>                  "-1", data source: default, level: 8 dev/detail,
>                  type: int)
>                  Verbosity level for the compress framework (0 = no
>                  verbosity)
> t@1 (l@1) signal BUS (invalid address alignment) in var_value_string
>  at line 1680 in file "mca_base_var.c"
> 1680  ret = asprintf (value_string, var_type_formats[var->mbv_type],
>  value[0]);
> (dbx) 
> (dbx) 
> (dbx) check -all
> dbx: warning: check -all will be turned on in the next run of the process
> access checking - OFF
> memuse checking - OFF
> (dbx) run -a
> Running: ompi_info -a 
> (process id 11000)
> Reading rtcapihook.so
> Reading libdl.so.1
> Reading rtcaudit.so
> Reading libmapmalloc.so.1
> Reading rtcboot.so
> Reading librtc.so
> Reading libmd_psr.so.1
> RTC: Enabling Error Checking...
> RTC: Using UltraSparc trap mechanism
> RTC: See `help rtc showmap' and `help rtc limitations' for details.
> RTC: Running program...
> Read from uninitialized (rui) on thread 1:
> Attempting to read 4 bytes at address 0xffffffff7fffd5f8
>    which is 184 bytes above the current stack pointer
> Variable is 'index'
> t@1 (l@1) stopped in var_find at line 802 in file "mca_base_var.c"
>  802       return (OPAL_SUCCESS != ret) ? ret : index;
> (dbx) 
> 
> 
> In my opinion it is the same error as before.
> 
> 
> 
> I still get a Bus Error with "make check".
> 
> tyr bin 54 cd /export2/src/openmpi-1.9/openmpi-1.9a1r29972-SunOS.sparc.64_cc/test/datatype/.libs/
> tyr .libs 55 /opt/solstudio12.3/bin/sparcv9/dbx ddt_raw
> For information about new features see `help changes'
> To remove this message, put `dbxenv suppress_startup_message 7.9' in your 
> .dbxrc
> Reading ddt_raw
> Reading ld.so.1
> Reading libmpi.so.0.0.0
> Reading libopen-rte.so.0.0.0
> Reading libopen-pal.so.0.0.0
> Reading libsendfile.so.1
> Reading libpicl.so.1
> Reading libkstat.so.1
> Reading liblgrp.so.1
> Reading libsocket.so.1
> Reading libnsl.so.1
> Reading librt.so.1
> Reading libm.so.2
> Reading libthread.so.1
> Reading libc.so.1
> Reading libdoor.so.1
> Reading libaio.so.1
> Reading libmd.so.1
> (dbx) run
> Running: ddt_raw 
> (process id 11018)
> Reading libc_psr.so.1
> 
> 
> #
> * TEST INVERSED VECTOR
> #
> 
> t@1 (l@1) signal BUS (invalid address alignment) in opal_convertor_raw
>  at line 71 in file "opal_convertor_raw.c"
>   71       DO_DEBUG( opal_output( 0, "opal_convertor_raw( %p, {%p,
>   %u}, %lu )\n", (void*)pConvertor,
> (dbx) 
> 
> 
> Once more I think it is the same error. I have the same problem with
> my small program.
> 
> 
> 
> 
> tyr small_prog 62 mpicc init_finalize.c
> tyr small_prog 63 /opt/solstudio12.3/bin/sparcv9/dbx \
>  /usr/local/openmpi-1.9_64_cc/bin/mpiexec
> For information about new features see `help changes'
> To remove this message, put `dbxenv suppress_startup_message 7.9'
>   in your .dbxrc
> Reading mpiexec
> Reading ld.so.1
> Reading libopen-rte.so.0.0.0
> Reading libopen-pal.so.0.0.0
> Reading libsendfile.so.1
> Reading libpicl.so.1
> Reading libkstat.so.1
> Reading liblgrp.so.1
> Reading libsocket.so.1
> Reading libnsl.so.1
> Reading librt.so.1
> Reading libm.so.2
> Reading libthread.so.1
> Reading libc.so.1
> Reading libdoor.so.1
> Reading libaio.so.1
> Reading libmd.so.1
> (dbx) 
> (dbx) run -np 1 a.out
> Running: mpiexec -np 1 a.out 
> (process id 11050)
> Reading libc_psr.so.1
> Reading mca_shmem_mmap.so
> Reading libmp.so.2
> Reading libscf.so.1
> Reading libuutil.so.1
> Reading libgen.so.1
> Reading mca_shmem_posix.so
> Reading mca_shmem_sysv.so
> Reading mca_ess_env.so
> Reading mca_ess_hnp.so
> Reading mca_ess_singleton.so
> Reading mca_ess_tool.so
> Reading mca_pstat_test.so
> Reading mca_state_app.so
> Reading mca_state_hnp.so
> Reading mca_state_novm.so
> Reading mca_state_orted.so
> Reading mca_state_staged_hnp.so
> Reading mca_state_staged_orted.so
> Reading mca_state_tool.so
> Reading mca_errmgr_default_app.so
> Reading mca_errmgr_default_hnp.so
> Reading mca_errmgr_default_orted.so
> Reading mca_errmgr_default_tool.so
> Reading mca_plm_rsh.so
> Reading mca_oob_tcp.so
> Reading mca_rml_oob.so
> Reading mca_routed_binomial.so
> Reading mca_routed_debruijn.so
> Reading mca_routed_direct.so
> Reading mca_routed_radix.so
> Reading mca_db_hash.so
> Reading mca_db_print.so
> Reading mca_grpcomm_bad.so
> Reading mca_ras_simulator.so
> Reading mca_rmaps_lama.so
> Reading mca_rmaps_mindist.so
> Reading mca_rmaps_ppr.so
> Reading mca_rmaps_rank_file.so
> Reading mca_rmaps_resilient.so
> Reading mca_rmaps_round_robin.so
> Reading mca_rmaps_seq.so
> Reading mca_rmaps_staged.so
> Reading mca_odls_default.so
> Reading mca_iof_hnp.so
> Reading mca_iof_mr_hnp.so
> Reading mca_iof_mr_orted.so
> Reading mca_iof_orted.so
> Reading mca_iof_tool.so
> Reading mca_filem_raw.so
> Reading mca_dfs_app.so
> Reading mca_dfs_orted.so
> Reading mca_dfs_test.so
> 
> Now the program hangs.
> 
> ^Cdbx: warning: Interrupt ignored but forwarded to child.
> t@1 (l@1) signal INT (Interrupt) in __pollsys at 0xffffffff7d5dc740
> 0xffffffff7d5dc740: __pollsys+0x0004:   ta       %icc,0x0000000000000040
> Current function is orterun
> 1049           opal_event_loop(orte_event_base, OPAL_EVLOOP_ONCE);
> (dbx) 
> (dbx) 
> (dbx) 
> (dbx) check -all
> dbx: warning: check -all will be turned on in the next run of the process
> access checking - OFF
> memuse checking - OFF
> (dbx) run -np 1 a.out
> Running: mpiexec -np 1 a.out 
> (process id 11054)
> Reading rtcapihook.so
> Reading libdl.so.1
> Reading rtcaudit.so
> Reading libmapmalloc.so.1
> Reading rtcboot.so
> Reading librtc.so
> Reading libmd_psr.so.1
> RTC: Enabling Error Checking...
> RTC: Using UltraSparc trap mechanism
> RTC: See `help rtc showmap' and `help rtc limitations' for details.
> RTC: Running program...
> Read from uninitialized (rui) on thread 1:
> Attempting to read 4 bytes at address 0xffffffff7fffd438
>    which is 184 bytes above the current stack pointer
> Variable is 'index'
> t@1 (l@1) stopped in var_find at line 802 in file "mca_base_var.c"
>  802       return (OPAL_SUCCESS != ret) ? ret : index;
> (dbx) 
> 
> 
> 
> I'm sorry that you have so much trouble with me and Solaris. On the
> other hand I still hope that you can solve the problem(s). Once more
> thank you very much for your help in advance.
> 
> 
> Kind regards
> 
> Siegmar
> 
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
