Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/20282 )
Change subject: KUDU-3498 Scanner keeps alive in periodically ...................................................................... Patch Set 4: (19 comments) Thank you for working on this feature! http://gerrit.cloudera.org:8080/#/c/20282/4//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/20282/4//COMMIT_MSG@9 PS4, Line 9: Kudu caches the scanner id in the tablet server In fact, Kudu tablet servers run those scanner objects when service scan requests -- that's not only about scanner identifiers, but the actual scanner objects are hosted by the server side as runtime structures. http://gerrit.cloudera.org:8080/#/c/20282/4//COMMIT_MSG@12 PS4, Line 12: is every large, it takes a long time to handle it Just FYI: another way of handling that is setting smaller batch size by calling SetBatchSizeBytes() -- that way there is no need to send any KeepAlive requests. http://gerrit.cloudera.org:8080/#/c/20282/4//COMMIT_MSG@14 PS4, Line 14: will be expired even if it has sent a keep alive request There is no sense in sending KeepAlive request at the same time (e.g., right before or right after) when the client fetches the data using NextBatch() requests and alike. It would make sense to send in KeepAlive requests in an independent thread that runs concurrently with the thread that handles the actual scan data or from the data handler thread from time to time while it processes the data that has been fetched from the previous batch. AFAIK, the server resets the TTL countdown for a scanner if upon receiving request to continue the scan (i.e. when the client calls NextBatch() and alike). Also, a scanner might expire unexpectedly when KeepAlive requests are dropped out of the RPC queue because the tablet server is too busy. In that sense, it's necessary to check for the return status of the KuduScanner::KeepAlive() method invocation to make sure the keep-alive RPC actually resets the scanner's TTL. http://gerrit.cloudera.org:8080/#/c/20282/4/src/kudu/client/client-test.cc File src/kudu/client/client-test.cc: http://gerrit.cloudera.org:8080/#/c/20282/4/src/kudu/client/client-test.cc@2964 PS4, Line 2964: TEST_P(KeepAlivePeriodicallyTest, TestScannerKeepAlivePeriodically) { It would be great to add a test to verify how the new functionality works in case of the scan spans multiple tablets that are hosted by different tablet servers, so multiple scanner objects at different tablet servers are involved. That's to make sure the client sends the keep-alive requests to appropriate tablet servers while the data is being fetched. Also, it's necessary to make sure the functionality works as expected for fault-tolerant scanners: in those scenarios, a tablet server that is currently serving the scan should fail/crash/stopped, and the client automatically restarts the scanning at some other tablet replica hosted by different tablet server. In that case, the client library should stop sending keep-alive requests to the former server, but start sending the requests to the new server where the scanning has been switched to. Using the DoScanWithCallback() utility function could help in creating such a test scenario. http://gerrit.cloudera.org:8080/#/c/20282/4/src/kudu/client/client-test.cc@2965 PS4, Line 2965: const auto keep_alive_periodically = GetParam(); Since this test scenario assumes running for at least 3 seconds, it's considered slow -- please add SKIP_IF_SLOW_NOT_ALLOWED() macro in the very beginning. http://gerrit.cloudera.org:8080/#/c/20282/4/src/kudu/client/client-test.cc@2967 PS4, Line 2967: table with only one replica non-replicated table http://gerrit.cloudera.org:8080/#/c/20282/4/src/kudu/client/client-test.cc@2972 PS4, Line 2972: // Start a scan but don't get the whole data back nit: add a period (.) in the end of the sentence http://gerrit.cloudera.org:8080/#/c/20282/4/src/kudu/client/client-test.cc@2974 PS4, Line 2974: to the second tablet Why the second tablet is important here? How do you even know that multiple tablets are involved in the scope of this test scenario? It would be great to add comments to explain that. http://gerrit.cloudera.org:8080/#/c/20282/4/src/kudu/client/client-test.cc@2986 PS4, Line 2986: make scanner exipred when without keepalive peroidically make sure the scanner expires when keep-alive requests are not sent http://gerrit.cloudera.org:8080/#/c/20282/4/src/kudu/client/client-test.cc@2993 PS4, Line 2993: has been expired has expired http://gerrit.cloudera.org:8080/#/c/20282/4/src/kudu/client/client-test.cc@2994 PS4, Line 2994: ASSERT_TRUE(s.IsNotFound()); nit: when using this sort of assert, consider adding s.ToString() -- that would be easier to troubleshoot this ever triggers ASSERT_TRUE(s.IsNotFound()) << s.ToString(); http://gerrit.cloudera.org:8080/#/c/20282/4/src/kudu/client/client.h File src/kudu/client/client.h: http://gerrit.cloudera.org:8080/#/c/20282/4/src/kudu/client/client.h@2957 PS4, Line 2957: it build a messeger failed building a messenger failed http://gerrit.cloudera.org:8080/#/c/20282/4/src/kudu/client/scanner-internal.h File src/kudu/client/scanner-internal.h: http://gerrit.cloudera.org:8080/#/c/20282/4/src/kudu/client/scanner-internal.h@277 PS4, Line 277: Status KeepAlivePeriodically(uint64_t keep_alive_interval_ms); Please add a small doc comment (as the other methods in this class have). http://gerrit.cloudera.org:8080/#/c/20282/4/src/kudu/client/scanner-internal.h@302 PS4, Line 302: DISALLOW_COPY_AND_ASSIGN(Data); nit: please keep the DISALLOW_COPY_AND_ASSIGN() macro as the last element in the list of members for easier readability http://gerrit.cloudera.org:8080/#/c/20282/4/src/kudu/client/scanner-internal.h@303 PS4, Line 303: keep_alive_timer_ nit: please add a comment at least to describe when this member becomes non-null http://gerrit.cloudera.org:8080/#/c/20282/4/src/kudu/client/scanner-internal.cc File src/kudu/client/scanner-internal.cc: http://gerrit.cloudera.org:8080/#/c/20282/4/src/kudu/client/scanner-internal.cc@104 PS4, Line 104: KeepAlivePeriodically What happens with the periodic timer when KuduScanner::Data::OpenTablet() is called (e.g., that happens upon calling KuduScanner::Data::OpenNextTablet())? http://gerrit.cloudera.org:8080/#/c/20282/4/src/kudu/client/scanner-internal.cc@109 PS4, Line 109: keep_alive_timer_ Do we need to do anything with this member field upon an explicit call of the KuduScanner::Close() method? http://gerrit.cloudera.org:8080/#/c/20282/4/src/kudu/client/scanner-internal.cc@110 PS4, Line 110: messenger nit: use std::move() here? http://gerrit.cloudera.org:8080/#/c/20282/4/src/kudu/client/scanner-internal.cc@115 PS4, Line 115: Start Does it make sense to stop sending keep-alive requests for the scanner if all the data has been already fetched for a particular tablet? -- To view, visit http://gerrit.cloudera.org:8080/20282 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I1165d96814eb4bcd5db9b5cb60403fffc5b18c81 Gerrit-Change-Number: 20282 Gerrit-PatchSet: 4 Gerrit-Owner: Wang Xixu <[email protected]> Gerrit-Reviewer: Alexey Serbin <[email protected]> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Mahesh Reddy <[email protected]> Gerrit-Reviewer: Wang Xixu <[email protected]> Gerrit-Reviewer: Yifan Zhang <[email protected]> Gerrit-Reviewer: Yingchun Lai <[email protected]> Gerrit-Reviewer: Yuqi Du <[email protected]> Gerrit-Comment-Date: Wed, 23 Aug 2023 00:53:32 +0000 Gerrit-HasComments: Yes
