Hey all,

For the past several days I've been tracking down a crash in traffic_server 6.2.1 that is happening fairly regularly on our cache nodes. The basic pathology is that server_vc is NULL under some circumstances, which causes segfaults in code paths that do not guard against this condition.

This issue was raised last year in ticket TS-5046 and reported to be fixed with the release of 6.2.1:
https://issues.apache.org/jira/browse/TS-5046

The fix was in PR 1222, which backported the fix for a different, related issue reported in TS-4938, also reported as fixed in 6.2.1:

https://issues.apache.org/jira/browse/TS-4938

In the stack trace given in TS-5046, the reporter saw the segfault in HttpSM::tunnel_handler_server(). While their path to get there was different from ours, the effect was the same. In my case, the segfaults always happen during processing in the gzip plugin:

https://paste.ec/paste/ZJnXEDC6#Rd-AeEkmmbsqM6E4rA/WuFdcgnUrv9By+rKO6eaxWmY

As a workaround, I applied the band-aid fix to HttpSM::tunnel_handler_server() that Masa mentioned in TS-5046. This of course avoids the null pointer dereference there, but now it occurs at the next point down the stack:

https://paste.ec/paste/+JwWMtuB#ew-87NnW1hQeFzwOCZ5GC+93EPF45YeJut/MtZwsdro

Obviously we can continue playing whack-a-mole and add guards for a NULL server_vc in HttpServerSession::release(), but I feel that this probably isn't the desired route. Questions such as "Is server_vc *ever* supposed to be NULL?" and "Why would it be NULL in the first place?" persist, so I'm turning to more seasoned eyes here for your thoughts.

At any rate, TS-5046/4938 do not appear to have completely addressed the situation.

/dale