Scott Bell wrote on 07/08/2015 09:23 PM:
[...]
...
#6 0x000000080153e004 in strlen () from /lib/libc.so.7
#7 0x0000000800a72bf3 in scheme_make_byte_string_without_copying ()
from /usr/local/lib/libracket3m-6.1.1.so
#8 0x0000000800ade3f2 in c_to_scheme ()
from /usr/local/lib/libracket3m-6.1.1.so
#9 0x0000000800adee02 in ffi_do_call ()
from /usr/local/lib/libracket3m-6.1.1.so
...
That certainly points me in the direction of a bad FFI call.
When I say bad FFI call, I'm leaving open the possibility that
the issue may be in the FFI machinery as well as in our source,
but I should point out that the minimal amount of FFI code
that we have is years old and has been running without issue.
We first observed this crash only a few months ago.
(Just throwing out some ideas, mainly for any readers new to this kind
of debugging, and to help put the fear of C into students.)
Maybe a bad or freed pointer (or the C string got corrupted and there is no
NUL byte to be found for miles, or some earlier failure led to the code
calling `strlen` on bytes that weren't a C string in the first place).
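To make the "no NUL byte for miles" case concrete, here is a minimal C sketch of a bounded length check. The name `bounded_strlen` is made up for illustration and is not part of Racket or of any library mentioned in this thread. It assumes the caller knows the capacity of the buffer, which `strlen` never does.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from Racket or any library in this thread).
   Returns the string length if a NUL byte occurs within the first `cap`
   bytes of `buf`, or -1 if none is found. Unlike strlen, it never reads
   past buf[cap - 1]. */
long bounded_strlen(const char *buf, size_t cap)
{
    assert(buf != NULL);
    const char *nul = memchr(buf, '\0', cap);
    return nul ? (long)(nul - buf) : -1L;
}
```

A check like this only helps where the buffer size is actually known, e.g. in a test harness or a wrapper around the native call. It does nothing for a pointer that was already freed.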
Are you doing FFI to C code that you developed, or a well-established
off-the-shelf C library, or something else?
I might initially plan the attack in roughly the following order.
0. Brief pause to see whether the cause of `strlen` barfing in this code
magically pops into your head.
1. Quickly check whether a version change of the deployed native code
happened around the time your problem appeared in deployment (probably
easy to check), and the known bugs and fixes for that native code.
Might not be the likely cause, but it's often the easiest thing to check
before more expensive investigation.
2. Take a look at the core dump for other useful things easily spotted, other
than just the stack entry points (such as arguments to the calls, and
any persistent data maintained by the C library), but don't get far into
learning JIT'd Racket VM deciphering before proceeding to easier options.
3. The following in either order, or in parallel/interleaved:
* Audit the application's Racket source code doing the FFI on the
application side.
* Audit the source code of the native code that's being called at the
time you have the failure.
4. Audit the native code that might have been called sometime before the
failure and left data inconsistent or memory corrupted by the time of the
failure.
5. Attempt to make the native code fail in a debugging setup, not in
production, without involving Racket. Can include fancy memory
debugging tools, `gdb`, stress-testing, fuzz testing, etc.
6. Attempt to reproduce the failure in a debugging setup of the whole shebang
(including Racket), still not in production.
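For step 5, a standalone stress harness might look roughly like the following C sketch. Everything named here is an assumption for illustration: the `native_fn` signature, the two `demo_*` stand-ins, and `OUT_CAP` are invented and would be replaced by the real library call. The idea is to poison the output buffer before each call and then verify, within a known bound, that the call left a NUL terminator behind.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define OUT_CAP 64

/* Hypothetical shape of the native call under test: writes a C string
   into `out`, which has room for `out_cap` bytes. */
typedef void (*native_fn)(const char *in, char *out, size_t out_cap);

/* Stand-in for the real library call: copies and always terminates. */
void demo_copy(const char *in, char *out, size_t out_cap)
{
    if (out_cap == 0)
        return;
    strncpy(out, in, out_cap - 1);
    out[out_cap - 1] = '\0';
}

/* Deliberately buggy stand-in: fills the buffer and never writes a NUL. */
void demo_unterminated(const char *in, char *out, size_t out_cap)
{
    (void)in;
    memset(out, 'x', out_cap);
}

/* Calls `fn` `rounds` times with pseudo-random lowercase inputs and
   returns how many outputs had no NUL byte within OUT_CAP bytes. */
int stress(native_fn fn, int rounds, unsigned seed)
{
    char in[OUT_CAP * 2];
    char out[OUT_CAP];
    int failures = 0;

    assert(fn != NULL);
    srand(seed);
    for (int i = 0; i < rounds; i++) {
        size_t len = (size_t)rand() % sizeof in;   /* 0 .. sizeof in - 1 */
        for (size_t j = 0; j < len; j++)
            in[j] = (char)('a' + rand() % 26);
        in[len] = '\0';

        /* Poison the output so a stale NUL cannot hide a bug. */
        memset(out, 0x7f, sizeof out);
        fn(in, out, sizeof out);

        if (memchr(out, '\0', sizeof out) == NULL)
            failures++;
    }
    return failures;
}
```

Running something like this under a memory debugger, with the real call substituted for the stand-ins, is one way to get a failure on a workstation instead of in production.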
Possible alternative option to keep in mind... One thing I've done in
the past with a Racket server app when C code is suspect, is eliminate
calls to C libraries (other than those backing core Racket), by
rewriting the functionality in pure Racket. That's not always possible,
but, in the cases I did it, it took a massive unpredictable suspect out
of debugging any subsequent strange production problem. (Performance
improvement in some cases, due to not blocking all Racket threads in a
native call, or due to letting the JIT do its job, was a side benefit.)
Ideally, the hardest thing you then have to debug in production is swap
thrashing or OOM kill because someone made an algorithmic space oops
that is obvious as soon as you look at the application Racket code. :)
Neil V.