Re: [racket-users] Diagnosing a segfault in production

Neil Van Dyke Wed, 08 Jul 2015 20:05:58 -0700


Scott Bell wrote on 07/08/2015 09:23 PM:
[...]

...
     #6  0x000000080153e004 in strlen () from /lib/libc.so.7
     #7  0x0000000800a72bf3 in scheme_make_byte_string_without_copying ()
        from /usr/local/lib/libracket3m-6.1.1.so
     #8  0x0000000800ade3f2 in c_to_scheme ()
        from /usr/local/lib/libracket3m-6.1.1.so
     #9  0x0000000800adee02 in ffi_do_call ()
        from /usr/local/lib/libracket3m-6.1.1.so
...


That certainly points me in the direction of a bad FFI call.

When I say bad FFI call, I'm leaving open the possibility that
the issue may be in the FFI machinery as well as in our source,
but I should point out that the minimal amount of FFI code
that we have is years old and has been running without issue.
We first observed this crash only a few months ago.

(Just throwing out some ideas, mainly for any readers new to this kind of debugging, and to help put the fear of C into students.)

Maybe a bad/freed pointer (or the C string got corrupted and there is no NUL byte to be found for miles, or some earlier failure led to the code doing `strlen` on bytes that weren't a C string in the first place).
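
To make the "no NUL byte for miles" case concrete, here is a minimal C sketch of a bounded check that could be run on a buffer before it is treated as a C string. The helper name `bounded_c_string_length` and the assumption that a sensible maximum length is known are mine, for illustration only; neither is part of Racket's FFI or of the code discussed in this thread.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a Racket or libc API).
 * Returns the string length if a NUL byte is found within the first
 * max_len bytes of buf, or -1 if buf is NULL or no NUL is found there. */
long bounded_c_string_length(const char *buf, size_t max_len)
{
    if (buf == NULL)
        return -1;

    /* memchr stops at max_len, unlike strlen, which keeps scanning
     * until it hits a NUL byte or an unmapped page. */
    const char *nul = memchr(buf, '\0', max_len);
    if (nul == NULL)
        return -1;

    return (long)(nul - buf);
}
```

A check like this only catches the missing-terminator case within memory that is still mapped. It cannot tell you that a pointer was freed or never valid in the first place.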

Are you doing FFI to C code that you developed, or a well-established off-the-shelf C library, or something else?


I might initially plan the attack with the following rough ordering.

0. Brief pause to see whether cause of a `strlen` barfing in this code magically pops into head.

1. Quickly check whether a version change of the deployed native code happened around the time your problem appeared in deployment (probably easy to check), and the known bugs and fixes for that native code. Might not be likely cause, but it's often the easiest thing to check before more expensive investigation.

2. Take a look at the core dump for other useful things easily spotted, other than just the stack entry points (such as arguments to the calls, and any persistent data maintained by the C library), but don't get far into learning JIT'd Racket VM deciphering before proceeding to easier options.


3. The following in either order, or in parallel/interleaved:

* Audit the application's Racket source code doing the FFI on the application side.

* Audit the source code of the native code that's being called at the time you have the failure.

4. Audit the native code that might have been called sometime before failure, but left data inconsistent or memory corrupted for the failure time.

5. Attempt to make the native code fail in debugging setup, not in production, without involving Racket. Can include fancy memory debugging tools, `gdb`, stress-testing, fuzz testing, etc.

6. Attempt to make failure in debugging setup of the whole shebang (including Racket), still not in production.
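
As a rough illustration of step 5, a standalone C harness can hammer a string-returning native function in a loop and check each result, with no Racket involved. Everything named here is hypothetical: `stand_in_source` takes the place of whatever native call the FFI binding wraps (and is rigged to misbehave once, so the harness has something to find), and `MAX_EXPECTED_LEN` is an assumed upper bound on the strings that call returns.

```c
#include <stddef.h>
#include <string.h>

/* Assumed upper bound on the length of any string the call returns. */
#define MAX_EXPECTED_LEN 16

typedef const char *(*string_source_fn)(unsigned iteration);

/* Hypothetical stand-in for the suspect native function.
 * It deliberately returns an unterminated buffer on iteration 500
 * to simulate an intermittent bug. */
const char *stand_in_source(unsigned iteration)
{
    static const char good[MAX_EXPECTED_LEN] = "ok";
    static char unterminated[MAX_EXPECTED_LEN];

    if (iteration == 500) {
        memset(unterminated, 'x', sizeof unterminated);
        return unterminated;
    }
    return good;
}

/* Calls fn repeatedly and returns the index of the first iteration whose
 * result is NULL or has no NUL byte within MAX_EXPECTED_LEN bytes,
 * or -1 if every result looked like a valid C string. */
long first_bad_iteration(string_source_fn fn, unsigned iterations)
{
    for (unsigned i = 0; i < iterations; i++) {
        const char *s = fn(i);
        if (s == NULL || memchr(s, '\0', MAX_EXPECTED_LEN) == NULL)
            return (long)i;
    }
    return -1;
}
```

Running a harness of this shape under a memory checker such as Valgrind, or building it with AddressSanitizer, tends to flag the corruption closer to where it happens than a production core dump does.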

Possible alternative option to keep in mind... One thing I've done in the past with a Racket server app when C code is suspect, is eliminate calls to C libraries (other than those backing core Racket), by rewriting the functionality in pure Racket. That's not always possible, but, in the cases I did it, it took a massive unpredictable suspect out of debugging any subsequent strange production problem. (Performance improvement in some cases, due to not blocking all Racket threads in a native call, or due to letting the JIT do its job, was a side benefit.)

Ideally, the hardest thing you then have to debug in production is swap thrashing or OOM kill because someone made an algorithmic space oops that is obvious as soon as you look at the application Racket code. :)


Neil V.
