Scott Bell wrote on 07/08/2015 09:23 PM:
[...]
...
     #6  0x000000080153e004 in strlen () from /lib/libc.so.7
     #7  0x0000000800a72bf3 in scheme_make_byte_string_without_copying ()
        from /usr/local/lib/libracket3m-6.1.1.so
     #8  0x0000000800ade3f2 in c_to_scheme ()
        from /usr/local/lib/libracket3m-6.1.1.so
     #9  0x0000000800adee02 in ffi_do_call ()
        from /usr/local/lib/libracket3m-6.1.1.so
...

That certainly points me in the direction of a bad FFI call.
When I say bad FFI call, I'm leaving open the possibility that
the issue may be in the FFI machinery as well as in our source,
but I should point out that the minimal amount of FFI code
that we have is years old and has been running without issue.
We first observed this crash only a few months ago.

(Just throwing out some ideas, mainly for any readers new to this kind of debugging, and to help put the fear of C into students.)

Maybe a bad or freed pointer. Or the C string got corrupted and there is no NUL byte to be found for miles, or some earlier failure led to the code doing `strlen` on bytes that weren't a C string in the first place.
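To make the "no NUL byte for miles" case concrete, here is a minimal C sketch of a bounded length check. The helper name `bounded_strlen` is hypothetical (it is not from your code or from Racket), and it assumes the caller knows the capacity of the buffer, which is exactly the information a bare `strlen` doesn't have.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, for illustration only.
 * Returns the length of the C string in buf if a NUL terminator exists
 * within the first cap bytes, or -1 if none is found.
 * Unlike strlen, it never reads past buf[cap - 1]. */
static long bounded_strlen(const char *buf, size_t cap)
{
    assert(buf != NULL);
    const char *nul = memchr(buf, '\0', cap);
    return nul ? (long)(nul - buf) : -1L;
}
```

Given `char ok[8] = "abc";` the call `bounded_strlen(ok, sizeof ok)` finds the terminator, while a 4-byte array holding `'a','b','c','d'` with no NUL yields -1 instead of wandering off into neighboring memory the way `strlen` would.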

Are you doing FFI to C code that you developed, or a well-established off-the-shelf C library, or something else?

I might initially plan the attack in the following rough order.

0. Brief pause to see whether the cause of a `strlen` barfing in this code magically pops into your head.

1. Quickly check whether a version change of the deployed native code happened around the time your problem appeared in deployment (probably easy to check), and look at the known bugs and fixes for that native code. Might not be the likely cause, but it's often the easiest thing to check before more expensive investigation.

2. Take a look at the core dump for other useful things that are easily spotted, beyond just the stack entry points (such as arguments to the calls, and any persistent data maintained by the C library), but don't get far into learning to decipher the JIT'd Racket VM before proceeding to easier options.

3. The following in either order, or in parallel/interleaved:

* Audit the application's Racket source code doing the FFI on the application side.

* Audit the source code of the native code that's being called at the time you have the failure.

4. Audit the native code that might have been called sometime before the failure and left data inconsistent or memory corrupted by the time of the failure.

5. Attempt to make the native code fail in a debugging setup, not in production, without involving Racket. This can include fancy memory debugging tools, `gdb`, stress-testing, fuzz testing, etc.
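As a rough idea of what a Racket-free harness for step 5 might look like, here is a small C sketch that surrounds an output buffer with guard bytes, calls a function under test, and then checks both the guards and the NUL terminator. Everything here is an assumption for illustration: `check_fill`, `good_fill`, and `bad_fill` are made-up names, and the two fill functions are stand-ins for whatever your native library actually exposes. Real memory debugging tools do this far more thoroughly.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define GUARD 16          /* guard bytes on each side of the buffer */
#define MAX_CAP 256       /* largest capacity this harness supports */
#define GUARD_BYTE 0xA5   /* fill pattern; also pre-fills the payload */

/* Shape of the (hypothetical) function under test:
 * writes a C string into out, which has room for cap bytes. */
typedef void (*fill_fn)(char *out, size_t cap);

/* Returns 1 if fn left both guard regions intact and NUL-terminated
 * its output within cap bytes, 0 otherwise. */
static int check_fill(fill_fn fn, size_t cap)
{
    unsigned char arena[GUARD + MAX_CAP + GUARD];
    assert(cap <= MAX_CAP);
    memset(arena, GUARD_BYTE, sizeof arena);

    char *out = (char *)arena + GUARD;
    fn(out, cap);

    for (size_t i = 0; i < GUARD; i++) {
        if (arena[i] != GUARD_BYTE) return 0;               /* underrun */
        if (arena[GUARD + cap + i] != GUARD_BYTE) return 0; /* overrun */
    }
    /* The payload was pre-filled with GUARD_BYTE, so a NUL can only
     * be present if fn actually wrote one. */
    return memchr(out, '\0', cap) != NULL;
}

/* Stand-in for well-behaved native code: truncates and terminates. */
static void good_fill(char *out, size_t cap)
{
    const char *src = "hello";
    size_t i = 0;
    if (cap == 0) return;
    while (i + 1 < cap && src[i] != '\0') { out[i] = src[i]; i++; }
    out[i] = '\0';
}

/* Stand-in for buggy native code: ignores cap and never writes a NUL. */
static void bad_fill(char *out, size_t cap)
{
    (void)cap;
    memcpy(out, "hello", 5);
}
```

With these stand-ins, `check_fill(good_fill, 4)` passes, `check_fill(bad_fill, 4)` fails on the trailing guard, and `check_fill(bad_fill, 8)` fails on the missing terminator. Run in a loop with varied capacities, this is a poor man's stress test; it only catches overruns that land within the guard regions.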

6. Attempt to reproduce the failure in a debugging setup of the whole shebang (including Racket), still not in production.

Possible alternative option to keep in mind... One thing I've done in the past with a Racket server app when C code is suspect is to eliminate calls to C libraries (other than those backing core Racket) by rewriting the functionality in pure Racket. That's not always possible, but in the cases where I did it, it took a massive, unpredictable suspect out of debugging any subsequent strange production problem. (A performance improvement in some cases, due to not blocking all Racket threads in a native call, or due to letting the JIT do its job, was a side benefit.)

Ideally, the hardest thing you then have to debug in production is swap thrashing or OOM kill because someone made an algorithmic space oops that is obvious as soon as you look at the application Racket code. :)

Neil V.
