Nathan,

The difference seems to be in the flags set at registration.

Normal MCA variables such as shmem_sysv_priority have the flag
MCA_BASE_VAR_FLAG_DWG, so they are deregistered through
mca_base_var_group_deregister in mca_base_component_unload.

But shmem_sysv_major_version doesn't have the flag.
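
To illustrate the effect, here is a minimal self-contained sketch. It is
not the Open MPI code: every sketch_* name is hypothetical, and the bit
values are assumptions picked only so that the results match the
mbv_flags values seen in GDB (0x44 and 0x10003). The real definitions
are in opal/mca/base/mca_base_var.h and may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit values for this sketch only (assumptions). */
enum {
    SKETCH_FLAG_DWG   = 0x00040, /* "deregister with group" */
    SKETCH_FLAG_VALID = 0x10000  /* variable is currently registered */
};

enum { SKETCH_SUCCESS = 0, SKETCH_ERR_NOT_FOUND = -1 };

typedef struct {
    const char *name;
    uint32_t    flags;
} sketch_var_t;

/* Models group deregistration at component unload: only variables
 * carrying the DWG flag lose their VALID bit. */
void sketch_group_deregister(sketch_var_t *vars, size_t n)
{
    assert(vars != NULL || n == 0);
    for (size_t i = 0; i < n; ++i) {
        if (vars[i].flags & SKETCH_FLAG_DWG) {
            vars[i].flags &= ~(uint32_t)SKETCH_FLAG_VALID;
        }
    }
}

/* Models a lookup that rejects variables that are no longer valid. */
int sketch_var_get(const sketch_var_t *var)
{
    return (var->flags & SKETCH_FLAG_VALID) ? SKETCH_SUCCESS
                                            : SKETCH_ERR_NOT_FOUND;
}
```

In this model, a variable registered without the DWG bit keeps its VALID
bit after the component is unloaded, so the lookup still succeeds even
though the storage it refers to may already be gone.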

Regards,
KAWASHIMA Takahiro

> This is odd. The variable in question is registered by the MCA itself. I
> will take a look and see if I can determine why it isn't being
> deregistered correctly when the rest of the component's parameters are.
> 
> -Nathan
> 
> On Wed, Jul 30, 2014 at 08:17:15AM +0900, KAWASHIMA Takahiro wrote:
> > Nathan,
> > 
> > Thanks for your response.
> > 
> > Yes. The result in my previous mail was obtained with that code
> > uncommented. I have now also pulled the latest varList source, which
> > has the section you mentioned uncommented, but the result is the same.
> > 
> > If MPI_T_cvar_get_info should return MPI_T_ERR_INVALID_INDEX
> > for variables for unloaded components, not returning
> > MPI_T_ERR_INVALID_INDEX is the problem.
> > 
> > I ran varList under GDB and found that MPI_T_cvar_get_info returns
> > MPI_T_ERR_INVALID_INDEX for shmem_sysv_priority (this is sane),
> > but it returns MPI_SUCCESS for shmem_sysv_major_version.
> > The difference is in the mbv_flags values. At the
> > MPI_T_cvar_get_info call, mbv_flags is 0x44 for
> > shmem_sysv_priority, so the mca_base_var_get function in
> > opal/mca/base/mca_base_var.c returns OPAL_ERR_NOT_FOUND. But
> > mbv_flags is 0x10003 for shmem_sysv_major_version, so
> > mca_base_var_get returns OPAL_SUCCESS.
> > 
> > Are control variables of unloaded components not being deregistered
> > completely?
> > 
> > I can track it more when I have time.
> > 
> > My environment:
> >   OS: Debian GNU/Linux wheezy
> >   CPU: x86_64
> >   Run: mpiexec -n 1 varList
> >   Open MPI source: trunk r32338 (almost latest)
> >   Open MPI configure:
> >     enable_picky=yes
> >     enable_debug=yes
> >     enable_mem_debug=yes
> >     enable_mem_profile=yes
> >     enable_memchecker=no
> >     enable_mca_no_build=btl-elan,btl-gm,btl-mx,btl-ofud,btl-portals,btl-sctp,btl-template,btl-udapl,common-mx,common-portals,ess-alps,ess-cnos,ess-lsf,ess-portals_utcp,ess-singleton,ess-slurm,grpcomm-cnos,mpool-fake,mtl,notifier,plm-alps,plm-ccp,plm-lsf,plm-process,plm-slurm,plm-submit,plm-tm,plm-xgrid,pml-cm,pml-csum,pml-example,pml-v,ras
> >     enable_contrib_no_build=vt
> >     enable_mpi_cxx=no
> >     enable_mpi_f77=no
> >     enable_mpi_f90=no
> >     enable_ipv6=no
> >     enable_mpi_io=no
> >     with_devel_headers=no
> >     with_wrapper_cflags=-g
> >     with_wrapper_cxxflags=-g
> >     with_wrapper_fflags=-g
> >     with_wrapper_fcflags=-g
> > 
> > Regards,
> > KAWASHIMA Takahiro
> > 
> > > The problem is the code in question does not check the return code of
> > > MPI_T_cvar_handle_alloc . We are returning an error and they still try
> > > to use the handle (which is stale). Uncomment this section of the code:
> > > 
> > > 
> > >                 //if (MPI_T_ERR_INVALID_INDEX == err)// { NOTE TZI: This variable is not recognized by Mvapich. It is OpenMPI specific.
> > >                 //      continue;
> > > 
> > > 
> > > Note that MPI_T_ERR_INVALID_INDEX is in the MPI-3 standard but mvapich
> > > must not have implemented it (and thus should not claim to be MPI 3.0).
> > > 
> > > -Nathan
> > > 
> > > On Wed, Jul 30, 2014 at 12:04:55AM +0900, KAWASHIMA Takahiro wrote:
> > > > Hi,
> > > > 
> > > > I encountered the same SEGV that was reported on the users list
> > > > when running the varList program.
> > > > 
> > > >   http://www.open-mpi.org/community/lists/users/2014/07/24792.php
> > > > 
> > > > mpiexec -n 1 ./varList:
> > > > ----------------------------------------------------------------
> > > > ... snip ...
> > > > event                                             U/D-2 CHAR   n/a      ALL
> > > > event_base_verbose                                D/D-8 INT    n/a      LOCAL    0
> > > > event_libevent2021_event_include                  U/A-3 CHAR   n/a      LOCAL    poll
> > > > opal_event_include                                U/A-3 CHAR   n/a      LOCAL    poll
> > > > event_libevent2021_major_version                  D/A-9 INT    n/a      UNKNOWN  1
> > > > event_libevent2021_minor_version                  D/A-9 INT    n/a      UNKNOWN  9
> > > > event_libevent2021_release_version                D/A-9 INT    n/a      UNKNOWN  0
> > > > shmem                                             U/D-2 CHAR   n/a      ALL
> > > > shmem_base_verbose                                D/D-8 INT    n/a      LOCAL    0
> > > > shmem_base_RUNTIME_QUERY_hint                     D/A-9 CHAR   n/a      ALL-EQ
> > > > shmem_mmap_priority                               U/A-3 INT    n/a      ALL      50
> > > > shmem_mmap_enable_nfs_warning                     D/A-9 INT    n/a      LOCAL    true
> > > > shmem_mmap_relocate_backing_file                  D/A-9 INT    n/a      ALL      0
> > > > shmem_mmap_backing_file_base_dir                  D/A-9 CHAR   n/a      ALL      /dev/shm
> > > > shmem_mmap_major_version                          D/A-9 INT    n/a      UNKNOWN  1
> > > > shmem_mmap_minor_version                          D/A-9 INT    n/a      UNKNOWN  9
> > > > shmem_mmap_release_version                        D/A-9 INT    n/a      UNKNOWN  0
> > > > shmem_posix_major_version                         D/A-9 INT    n/a      UNKNOWN  1201644720
> > > > shmem_posix_minor_version                         D/A-9 INT    n/a      UNKNOWN  32756
> > > > shmem_posix_release_version                       D/A-9 INT    n/a      UNKNOWN  6
> > > > [ppc:12688] *** Process received signal ***
> > > > [ppc:12688] Signal: Segmentation fault (11)
> > > > [ppc:12688] Signal code: Invalid permissions (2)
> > > > [ppc:12688] Failing at address: 0x7ff4479f83d8
> > > > [ppc:12688] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x325c0)[0x7ff4493015c0]
> > > > [ppc:12688] [ 1] /home/rivis/opt/openmpi-trunk-debug/lib/libmpi.so.0(PMPI_T_cvar_read+0xbc)[0x7ff44970abb7]
> > > > [ppc:12688] [ 2] ./varlist(list_cvars+0x56a)[0x4029bc]
> > > > [ppc:12688] [ 3] ./varlist(main+0x42b)[0x403598]
> > > > [ppc:12688] [ 4] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfd)[0x7ff4492edeed]
> > > > [ppc:12688] [ 5] ./varlist[0x4016c9]
> > > > [ppc:12688] *** End of error message ***
> > > > ----------------------------------------------------------------
> > > > 
> > > > I tracked this error and found that this seems related to DSO.
> > > > 
> > > > The error occurs when accessing value->intval for the
> > > > control variable shmem_sysv_major_version in MPI_T_cvar_read.
> > > > 
> > > >   https://svn.open-mpi.org/trac/ompi/browser/trunk/ompi/mpi/tool/cvar_read.c
> > > > 
> > > > The 'value' was obtained by mca_base_var_get_value and it points to
> > > > mca_shmem_sysv_component.super.base_version.mca_component_major_version,
> > > > which was dlclose'd in MPI_INIT for DSO.
> > > > (component mmap is selected on my environment)
> > > > 
> > > > The abnormal shmem_posix_{major,minor,release}_version values in
> > > > my output above have the same cause. The SEGV occurs if the memory
> > > > has already been returned to the kernel; abnormal values are
> > > > printed if it has not.
> > > > 
> > > > So this SEGV doesn't occur if I configure Open MPI with
> > > > --disable-dlopen option. I think it's the reason why Nathan
> > > > doesn't see this error.
