#13447: Make libsingular multivariate polynomial rings collectable
---------------------------+------------------------------------------------
       Reporter:  nbruin   |         Owner:  rlm     
           Type:  defect   |        Status:  new     
       Priority:  major    |     Milestone:  sage-5.4
      Component:  memleak  |    Resolution:          
       Keywords:           |   Work issues:          
Report Upstream:  N/A      |     Reviewers:          
        Authors:           |     Merged in:          
   Dependencies:           |      Stopgaps:          
---------------------------+------------------------------------------------

Comment (by nbruin):

 On Sage 5.4.beta0 with #715 and #11521 applied, there is a doctest failure on
 `bsd.math.washington.edu`, an x86_64 machine running Mac OS X 10.6:
 {{{
 bash-3.2$ ../../sage -t sage/misc/cachefunc.pyx
 sage -t  "devel/sage-main/sage/misc/cachefunc.pyx"
 The doctested process was killed by signal 11
          [12.7 s]

 ----------------------------------------------------------------------
 The following tests failed:


         sage -t  "devel/sage-main/sage/misc/cachefunc.pyx" # Killed/crashed
 Total time for all tests: 12.7 seconds
 }}}
 The segmentation fault happens reliably, but it is hard to study because:
 - running the same examples in an interactive session does not trigger the
   problem;
 - running with `sage -t -gdb` (yes, that's possible!) or
   `sage -t --verbose` does not trigger the problem either;
 - the order of tests in the file seems fairly important: some tests can be
   changed or deleted without affecting the crash, but not others. The likely
   explanation is that a garbage collection has to be triggered under the
   right conditions, so that the memory corruption (which probably happens
   during deallocation somewhere) lands in the right spot.
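 The timing dependence above is typical of objects that only Python's cyclic garbage collector can free: reference counting alone never releases them, so their deallocation (and any memory corruption inside a deallocation routine) happens whenever a collection happens to run. A minimal sketch in plain Python, where the class `Ring` is a hypothetical stand-in for a parent object caught in a reference cycle. One debugging idea it suggests (an assumption, not something tried in this ticket) is to call `gc.collect()` explicitly at a fixed point in the doctest so that the deallocation happens at a deterministic moment.

```python
import gc
import weakref


class Ring:
    """Hypothetical stand-in for a parent object caught in a reference cycle."""

    def __init__(self):
        self.self_ref = self  # the cycle keeps the refcount above zero


def make_and_drop():
    """Create a Ring, drop the only external reference, return a weak ref."""
    r = Ring()
    return weakref.ref(r)


def demo():
    was_enabled = gc.isenabled()
    gc.disable()  # stop the automatic collector from running at a random point
    try:
        w = make_and_drop()
        alive_before = w() is not None  # still alive: only the cycle refers to it
        gc.collect()                    # explicit collection frees the cycle here
        alive_after = w() is not None
    finally:
        if was_enabled:
            gc.enable()
    return alive_before, alive_after


print(demo())  # expected (True, False) on CPython
```

 With automatic collection enabled, the moment the cycle is freed depends on how many allocations preceded it, which is one plausible reason why reordering or deleting unrelated doctests changes whether the crash appears.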

 The segfault happens in the doctest for
 `CachedMethodCaller._instance_call` (line 1038 in the sage source;
 `example_27` in the file `~/.sage/tmp/cachefunc_*.py` left after
 doctesting), in the line
 {{{
             sage: P.<a,b,c,d> = QQ[]
 }}}
 Further instrumentation showed that the segfault happens in
 `sage/libs/singular/ring.pyx`, in `singular_ring_new`, in the part that
 copies the strings over:
 {{{#!diff
 +    sys.stderr.write("before _names allocation\n")
      _names = <char**>omAlloc0(sizeof(char*)*(len(names)))
 +    sys.stderr.write("after _names allocation\n")

      for i from 0 <= i < n:
          _name = names[i]
 +        sys.stderr.write("calling omStrDup for i=%s with name=%s\n"%(i,names[i]))
          _names[i] = omStrDup(_name)
 +        sys.stderr.write("after omStrDup\n")
 }}}
 The call to `_omStrDup` segfaults for `i=1`. Unwinding the `_omStrDup` call into an explicit allocation and copy loop:
 {{{#!diff
      for i from 0 <= i < n:
          _name = names[i]
 +        sys.stderr.write("calling omStrDup for i=%s with name=%s\n"%(i,names[i]))
 -        _names[i] = omStrDup(_name)
 +        j = 0
 +        while <bint> _name[j]:
 +            j+=1
 +        j+=1     #increment to include the 0
 +        sys.stderr.write("string length (including 0) seems to be %s\n"%j)
 +        copiedname =  <char*>omAlloc(sizeof(char)*(j+perturb))
 +        sys.stderr.write("Done reserving memory buffer; got address %x\n"%(<long>copiedname))
 +        for 0 <= offset < j:
 +            sys.stderr.write("copying character nr %s\n"%offset)
 +            copiedname[offset] = _name[offset]
 +        _names[i] = copiedname
 +        sys.stderr.write("after omStrDup\n")
 }}}
 shows that it's actually the `omAlloc` call that segfaults. For `perturb=7`
 or higher, the segfault does not happen; for lower values of `perturb` it
 does. With the one-character names used here (2 bytes including the
 terminating 0), any `perturb` below 7 keeps the request at 8 bytes or less,
 which is consistent with an 8-byte bin. Given that the addresses returned by
 earlier `omAlloc` calls do not seem close to a page boundary, the only way
 `omAlloc` can fail is basically a corrupted free list for an 8-byte bin.
 Likely culprits:
  - a double free (although I'd expect that to trigger problems on more
    architectures);
  - someone writing out of bounds in omAlloc-managed memory. Perhaps someone
    claims an `<int>,<int>` structure and stores a `<void *>` in the second
    field? On x86_64 an `int` is 4 bytes and a pointer is 8, so this would
    write 4 bytes past the end of an 8-byte block.
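 To make the free-list hypothesis concrete: in many bin allocators the link to the next free chunk is stored inside the free chunk itself, so a 4-byte overrun into a neighbouring free chunk corrupts the list with no immediate symptom. The crash only appears at a later, innocent allocation from the same bin, which would be consistent with `omAlloc` itself segfaulting, while larger requests (such as those with `perturb` of 7 or more) would be served from a different bin and never touch the damaged list. Below is a toy Python model under that assumption; `ToyBin`, its 8-byte chunks, and its in-band link layout are hypothetical and are not omalloc's actual data structures.

```python
import struct

CHUNK = 8                   # toy assumption: a single bin of 8-byte chunks
NCHUNKS = 8
NULL = 0xFFFFFFFFFFFFFFFF   # sentinel marking the end of the free list


class ToyBin:
    """Hypothetical 8-byte bin allocator with an in-band free list.

    Each free chunk stores the offset of the next free chunk in its first
    8 bytes. Illustration only; not omalloc's real layout.
    """

    def __init__(self):
        self.heap = bytearray(CHUNK * NCHUNKS)
        # Thread all chunks onto the free list: 0 -> 8 -> 16 -> ... -> NULL.
        for i in range(NCHUNKS):
            nxt = (i + 1) * CHUNK if i + 1 < NCHUNKS else NULL
            struct.pack_into("<Q", self.heap, i * CHUNK, nxt)
        self.free_head = 0

    def alloc(self):
        addr = self.free_head
        if addr == NULL:
            raise MemoryError("bin exhausted")
        if addr % CHUNK or not 0 <= addr < len(self.heap):
            # A real allocator would dereference this address and segfault.
            raise RuntimeError("corrupted free list: bogus address %#x" % addr)
        # Pop: the new head is the link stored inside the chunk handed out.
        (self.free_head,) = struct.unpack_from("<Q", self.heap, addr)
        return addr

    def free(self, addr):
        # Push: store the old head inside the freed chunk.
        struct.pack_into("<Q", self.heap, addr, self.free_head)
        self.free_head = addr


heap = ToyBin()
a = heap.alloc()   # offset 0: the "<int>,<int>" struct
b = heap.alloc()   # offset 8, adjacent to a
heap.free(b)       # b is back on the free list; its first 8 bytes are a link
# Bug: store an 8-byte pointer into the second 4-byte int field of a,
# overwriting the low 4 bytes of b's free-list link.
struct.pack_into("<Q", heap.heap, a + 4, 0x00007FFF5FBFF8A0)
print(heap.alloc())  # 8: this allocation still looks fine
try:
    heap.alloc()     # only now does the corruption surface
except RuntimeError as err:
    print(err)
```

 The point of the model is that the faulty write and the crash happen in unrelated code, which is why a memory checker that watches every write is far more useful here than a debugger at the crash site.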

 Note the `<char*>` to `<long>` cast in the print statement. With an `<int>`
 cast the compiler complains about loss of precision, but not with `<long>`.
 I haven't checked whether `<long>` is really 64 bits on this machine, though.
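 Whether `long` is wide enough to hold a pointer can be checked from the same machine's Python with the standard `ctypes` module, assuming that Python uses the same C ABI as the compiled Cython extensions. A small sketch; the helper name `c_type_widths` is hypothetical:

```python
import ctypes


def c_type_widths():
    """Return the sizes (in bytes) of a few C types on the running platform."""
    widths = {
        "int": ctypes.sizeof(ctypes.c_int),
        "long": ctypes.sizeof(ctypes.c_long),
        "void*": ctypes.sizeof(ctypes.c_void_p),
    }
    # Casting a pointer to long is lossless only if long is at least as wide.
    widths["pointer_fits_in_long"] = widths["long"] >= widths["void*"]
    return widths


print(c_type_widths())
```

 On LP64 platforms such as x86_64 Mac OS X, `long` and pointers are both 8 bytes; on 64-bit Windows (LLP64) `long` is only 4 bytes, so there a cast to `<long>` would still truncate.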

 I have also tried the old Singular, and the problem seems to persist there
 too (5.4.beta0 ships a recently upgraded Singular).

 It would help a lot if someone could build Singular to use plain malloc
 throughout and then run it under valgrind or a similar tool, which should
 immediately catch a double free or an out-of-bounds write. If the root of
 the problem is not OS X-specific, this would show up on other architectures
 as well.

 See also [http://trac.sagemath.org/sage_trac/ticket/715#comment:295 #715, comment 295]
 and the comments below it for more details on how the diagnosis above was
 obtained.

-- 
Ticket URL: <http://trac.sagemath.org/sage_trac/ticket/13447#comment:1>
Sage <http://www.sagemath.org>
Sage: Creating a Viable Open Source Alternative to Magma, Maple, Mathematica, 
and MATLAB
