Alexey Serbin has posted comments on this change.
( http://gerrit.cloudera.org:8080/20739 )

Change subject: KUDU-3524 Fix crash when sending periodic keep-alive requests
......................................................................


Patch Set 6:

(2 comments)

Thank you very much for the fix!

I built client-test with this patch and ran the test many times as prescribed
by the instructions in KUDU-3524. The test is stable and doesn't crash now,
while before this fix it failed in 100% of the runs on my macOS laptop.

http://gerrit.cloudera.org:8080/#/c/20739/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/20739/3//COMMIT_MSG@10
PS3, Line 10: eriodical
> This case only happens in macOS. The community server is linux. And this ca
I'm not exactly sure this happens only on macOS, but I could not reproduce
the exact crash when running the test on Linux as prescribed in KUDU-3524.

Maybe it manifests itself a bit differently there? I'm not sure about that,
and I didn't spend any time investigating this on Linux: having the test fail
100% of the time on macOS with the same trace under the debugger was enough
evidence of a bug for me.

However, I know there are many failures of the
KeepAlivePeriodically/KeepAlivePeriodicallyTest.TestScannerKeepAlivePeriodicallyScannerTolerate/1
test on the dist-test dashboard. The majority of the client-test failures
there were caused by that test failing:
  http://dist-test.cloudera.org:8080/test_drilldown?test_name=client-test

One example is at 
http://dist-test.cloudera.org:8080/diagnose?key=0c43a1ce-9a29-11ee-b18e-0242ac110002

The test failed because of a timeout, but you can see that some abnormal
condition had been detected and the traces of all threads were printed out.
Probably, that masks the issue on Linux; on macOS we don't have similar
tracing, so the process simply crashes, making the issue easier to comprehend
and troubleshoot.


http://gerrit.cloudera.org:8080/#/c/20739/3//COMMIT_MSG@10
PS3, Line 10: eriodical
> Could you please add a unit test to reproduce this? And make sure the fix w
We have had many failures in dist-test recently after introducing the new 
keep-alive functionality:
  http://dist-test.cloudera.org:8080/test_drilldown?test_name=client-test

The majority of those failures are caused by this bug, I guess, but maybe it
manifests itself differently on Linux?
I didn't do any investigation on Linux, but I bet it's the same issue, and
with this fix the flakiness in client-test will go down. I mean, we already
have tests that are failing on Linux because of this bug; it just manifests
itself a bit differently there :)

I also verified that the test, when built with this patch, no longer crashes
on macOS, so the issue is fixed.



--
To view, visit http://gerrit.cloudera.org:8080/20739
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I130db970a091cdf7689245a79dc4ea445d1f739f
Gerrit-Change-Number: 20739
Gerrit-PatchSet: 6
Gerrit-Owner: Wang Xixu <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Ashwani Raina <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Wang Xixu <[email protected]>
Gerrit-Reviewer: Yifan Zhang <[email protected]>
Gerrit-Reviewer: Yingchun Lai <[email protected]>
Gerrit-Reviewer: Ádám Bakai <[email protected]>
Gerrit-Comment-Date: Fri, 15 Dec 2023 05:08:10 +0000
Gerrit-HasComments: Yes
