On Fri, May 26, 2017 at 5:15 PM, Alan Conway <[email protected]> wrote:
> On Fri, 2017-05-26 at 10:48 +0200, Jiri Danek wrote:
> > On Fri, May 26, 2017 at 12:18 AM, Alan Conway (JIRA) <[email protected]> wrote:
> > > [ https://issues.apache.org/jira/browse/DISPATCH-777?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025475#comment-16025475 ]
> > >
> > > Alan Conway commented on DISPATCH-777:
> > > --------------------------------------
> > >
> > > This appears to be a race condition (double free) in the epoll
> > > proactor, fixing...
> > >
> >
> > Could you maybe describe in more detail how you went about triaging
> > it? So that I know what more steps I can take next time I am
> > reporting a crash like this. Thank you.
> > --
> > Jiří Daněk
> > Messaging QA
>
> I ran the test in a loop with 'rr' http://rr-project.org/ until it
> crashed.

I stumbled upon it this morning when writing the question! I googled
https://www.google.cz/search?q=gdb+time+traveling+record+execution
and this was the second result from the top.

> 'rr' is a truly amazing extension to gdb - it records a
> complete execution trace of the program (without imposing much run-time
> overhead) that you can replay forwards *and backwards* in gdb,
> examining memory etc. as you normally would.
>
> So playing the program up to the segfault in rr showed me that it
> crashed on a pointer with the value 0x4242424242. Now I have this in my
> .bashrc:
>
>     export MALLOC_PERTURB_=66 # 0x42
>
> So freed memory is always filled with the hex pattern 424242. Now I
> know the pointer is in memory that was previously freed so I do:
>
>     watch -l ptr   # Standard gdb watchpoint on the pointer
>     reverse-cont   # rr magic - continue *backwards* to the watchpoint
>
> This runs the program back to the exact point where it was freed!
>
> The rest is knowledge of the code: the crash comes just after the
> pointer was extracted from epoll_wait(), the free is during
> finalization of a closed connection - so I'm fairly sure there's a race
> where we sometimes free memory used by a connection while it is still
> registered with epoll.
>

So next time I report a crash like this, I will set

    export MALLOC_PERTURB_=66 # 0x42

and attach an rr trace to the Jira issue, assuming it can be moved
between computers and that it compresses reasonably well. I haven't
actually tried using that yet.
-- 
Jiří Daněk
Messaging QA
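
For reference, the "run the test in a loop until it crashes" step described
above could be sketched roughly as follows. This is only a sketch under
assumptions: run_until_failure is a hypothetical helper name, ./my_test is a
placeholder for the real test binary, and it assumes that 'rr record' exits
with a non-zero status when the recorded program crashes.

```shell
#!/usr/bin/env bash
# Fill freed memory with 0x42 bytes (glibc malloc feature), as in the mail above.
export MALLOC_PERTURB_=66

# Hypothetical helper: run a command repeatedly until it fails.
# Prints the 1-based attempt number of the first failure and returns 0.
# Returns 1 if the command never failed within max_attempts runs.
run_until_failure() {
    local max_attempts=$1
    shift
    local attempt
    for ((attempt = 1; attempt <= max_attempts; attempt++)); do
        if ! "$@"; then
            echo "$attempt"
            return 0
        fi
    done
    return 1
}

# Example usage (not executed here; ./my_test is a placeholder):
#   run_until_failure 100 rr record ./my_test
#   rr replay        # opens the last recording in gdb
```

Once a failing run has been recorded, the gdb commands quoted above
(watch -l on the bad pointer, then reverse-cont) apply inside the
'rr replay' session.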
