Re: [racket-users] Diagnosing a segfault in production

Scott Bell Thu, 09 Jul 2015 10:34:08 -0700

On Wednesday, July 8, 2015 at 8:05:47 PM UTC-7, neil wrote:
> Scott Bell wrote on 07/08/2015 09:23 PM:
> [...]
> >> ...
> >>      #6  0x000000080153e004 in strlen () from /lib/libc.so.7
> >>      #7  0x0000000800a72bf3 in scheme_make_byte_string_without_copying ()
> >>         from /usr/local/lib/libracket3m-6.1.1.so
> >>      #8  0x0000000800ade3f2 in c_to_scheme ()
> >>         from /usr/local/lib/libracket3m-6.1.1.so
> >>      #9  0x0000000800adee02 in ffi_do_call ()
> >>         from /usr/local/lib/libracket3m-6.1.1.so
> >> ...
> >>
> >> That certainly points me in the direction of a bad FFI call.
> > When I say bad FFI call, I'm leaving open the possibility that
> > the issue may be in the FFI machinery as well as in our source,
> > but I should point out that the minimal amount of FFI code
> > that we have is years old and has been running without issue.
> > We first observed this crash only a few months ago.
> 
> (Just throwing out some ideas, mainly for any readers new to this kind 
> of debugging, and to help put the fear of C into students.)
> 
> Maybe bad/freed pointer (or the C string got corrupted and there is no 
> NUL byte to be found for miles, or some earlier failure led to the code 
> doing `strlen` on bytes that weren't a C string in the first place).
> 
> Are you doing FFI to C code that you developed, or a well-established 
> off-the-shelf C library, or something else?
> 
> I might initially plan the attack with the following rough ordering.
> 
> 0. Brief pause to see whether cause of a `strlen` barfing in this code 
> magically pops into head.
> 
> 1. Quickly check whether a version change of the deployed native code 
> happened around the time your problem appeared in deployment (probably 
> easy to check), and the known bugs and fixes for that native code.  
> Might not be likely cause, but it's often the easiest thing to check 
> before more expensive investigation.
> 
> 2. Take a look at the core dump for other useful things easily spotted, other 
> than just the stack entry points (such as arguments to the calls, and 
> any persistent data maintained by the C library), but don't get far into 
> learning JIT'd Racket VM deciphering before proceeding to easier options.
> 
> 3. The following in either order, or in parallel/interleaved:
> 
> * Audit the application's Racket source code doing the FFI on the 
> application side.
> 
> * Audit the source code of the native code that's being called at the 
> time you have the failure.
> 
> 4. Audit the native code that might have been called sometime before 
> failure, but left data inconsistent or memory corrupted for the failure 
> time.
> 
> 5. Attempt to make the native code fail in debugging setup, not in 
> production, without involving Racket.  Can include fancy memory 
> debugging tools, `gdb`, stress-testing, fuzz testing, etc.
> 
> 6. Attempt to make failure in debugging setup of the whole shebang 
> (including Racket), still not in production.
> 
> Possible alternative option to keep in mind... One thing I've done in 
> the past with a Racket server app when C code is suspect, is eliminate 
> calls to C libraries (other than those backing core Racket), by 
> rewriting the functionality in pure Racket.  That's not always possible, 
> but, in the cases I did it, it took a massive unpredictable suspect out 
> of debugging any subsequent strange production problem.  (Performance 
> improvement in some cases, due to not blocking all Racket threads in a 
> native call, or due to letting the JIT do its job, was a side benefit.)
> 
> Ideally, the hardest thing you then have to debug in production is swap 
> thrashing or OOM kill because someone made an algorithmic space oops 
> that is obvious as soon as you look at the application Racket code. :)
> 
> Neil V.


Thanks Neil, I appreciate the ideas! 

Fuzzing our FFI calls would be a good technique to come up
with a reproduction case, if there is one. Depending on which 
call is at fault we might be looking at one crash per >10M 
calls :)
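
For what it's worth, here is roughly the shape of fuzz loop I have in mind,
as a minimal sketch in C that leaves Racket out of the picture. Everything
named here is an assumption for illustration: `lib_format` is a hypothetical
stand-in for a native function (with a planted bug), not anything from our
code or from a real library, and the buffer sizes are arbitrary. The idea is
to check that a returned buffer actually contains a NUL before anything like
`strlen` gets to walk it, and to use a seeded PRNG so a failing run can be
repeated.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the native function reached through the FFI.
   Planted bug: when the input fills `out` exactly (or overflows it), no
   NUL terminator is written. */
static void lib_format(const unsigned char *in, size_t in_len,
                       char *out, size_t out_cap)
{
    size_t n = in_len < out_cap ? in_len : out_cap;
    memcpy(out, in, n);
    if (n < out_cap)
        out[n] = '\0';
}

/* Returns 1 if `buf` holds a NUL within `cap` bytes, i.e. strlen would
   stay inside the buffer; returns 0 otherwise. */
static int is_terminated(const char *buf, size_t cap)
{
    return memchr(buf, '\0', cap) != NULL;
}

/* Small deterministic PRNG (xorshift32) so a failing seed can be rerun. */
static uint32_t next_rand(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Makes `iterations` calls with random inputs. Returns the index of the
   first iteration whose output is not NUL-terminated, or -1 if none. */
static long fuzz_lib_format(uint32_t seed, long iterations)
{
    uint32_t state = seed ? seed : 1u;   /* xorshift must not start at 0 */
    unsigned char in[32];
    char out[16];

    for (long i = 0; i < iterations; i++) {
        size_t len = next_rand(&state) % (sizeof in + 1);
        for (size_t j = 0; j < len; j++)
            in[j] = (unsigned char)(1 + next_rand(&state) % 255); /* no NULs */
        memset(out, 'X', sizeof out);    /* poison so stale NULs can't hide the bug */
        lib_format(in, len, out, sizeof out);
        if (!is_terminated(out, sizeof out))
            return i;
    }
    return -1;
}
```

Calling `fuzz_lib_format(seed, n)` from a small driver and logging the seed
together with the returned index would give a repeatable case. For a bug as
rare as ours the iteration count would have to be far larger, and a real
crash would kill the process rather than return an index, so the loop would
likely need to run in a child process.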

It would be even simpler to replay all of the historical calls
based on log data, but I'm afraid that the nature of the crash
might mean that triggering inputs, if they exist, wouldn't make
it into the logs. However, if the problem were related to state
that wasn't input-sensitive, I could probably test this by
replaying log data repeatedly, noting the conditions of the
crash each time.
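
A rough sketch of that replay idea, again in C and again under loud
assumptions: `lib_call` is a hypothetical native function whose failure
depends only on hidden state (here, a call counter), the "log" is just an
in-memory array of strings, and failure is signalled by a return code rather
than a segfault. The point is only to show what I'd record on each pass: if
the failing record changes from pass to pass, the input is probably not the
trigger.

```c
#include <stddef.h>

/* Hypothetical stand-in for a native call with hidden state: it fails on
   every 7th call no matter what the input is. */
static unsigned long lib_calls = 0;

static int lib_call(const char *input)
{
    (void)input;                 /* the input is deliberately irrelevant */
    lib_calls++;
    return (lib_calls % 7 == 0) ? -1 : 0;
}

/* Where a failure was observed: which pass over the log, which record. */
struct failure {
    int pass;
    size_t record;
};

/* Replays `records` `passes` times. Stores up to `out_cap` failures in
   `out` and returns the total number of failures seen. */
static size_t replay(const char *const *records, size_t n_records,
                     int passes, struct failure *out, size_t out_cap)
{
    size_t found = 0;
    for (int p = 0; p < passes; p++) {
        for (size_t r = 0; r < n_records; r++) {
            if (lib_call(records[r]) != 0) {
                if (found < out_cap) {
                    out[found].pass = p;
                    out[found].record = r;
                }
                found++;
            }
        }
    }
    return found;
}

/* Returns 1 if every recorded failure hit the same record (suggesting an
   input-sensitive bug), 0 if the failing record varies or n is 0. */
static int same_record_every_time(const struct failure *f, size_t n)
{
    for (size_t i = 1; i < n; i++)
        if (f[i].record != f[0].record)
            return 0;
    return n > 0;
}
```

With five records and three passes, this stand-in fails at a different
record on different passes, which is the signature I'd be looking for if
the real problem is state rather than input.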

