Alexey Serbin has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20282 )

Change subject: KUDU-3498 Scanner keeps alive in periodically
......................................................................


Patch Set 4:

(19 comments)

Thank you for working on this feature!

http://gerrit.cloudera.org:8080/#/c/20282/4//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/20282/4//COMMIT_MSG@9
PS4, Line 9: Kudu caches the scanner id in the tablet server
In fact, Kudu tablet servers run those scanner objects when service scan 
requests -- that's not only about scanner identifiers, but the actual scanner 
objects are hosted by the server side as runtime structures.


http://gerrit.cloudera.org:8080/#/c/20282/4//COMMIT_MSG@12
PS4, Line 12: is every large, it takes a long time to handle it
Just FYI: another way of handling that is setting smaller batch size by calling 
SetBatchSizeBytes() -- that way there is no need to send any KeepAlive requests.


http://gerrit.cloudera.org:8080/#/c/20282/4//COMMIT_MSG@14
PS4, Line 14: will be expired even if it has sent a keep alive request
There is no sense in sending KeepAlive request at the same time (e.g., right 
before or right after) when the client fetches the data using NextBatch() 
requests and alike.  It would make sense to send in KeepAlive requests in an 
independent thread that runs concurrently with the thread that handles the 
actual scan data or from the data handler thread from time to time while it 
processes the data that has been fetched from the previous batch.  AFAIK, the 
server resets the TTL countdown for a scanner if upon receiving request to 
continue the scan (i.e. when the client calls NextBatch() and alike).

Also, a scanner might expire unexpectedly when KeepAlive requests are dropped 
out of the RPC queue because the tablet server is too busy.  In that sense, 
it's necessary to check for the return status of the KuduScanner::KeepAlive() 
method invocation to make sure the keep-alive RPC actually resets the scanner's 
TTL.


http://gerrit.cloudera.org:8080/#/c/20282/4/src/kudu/client/client-test.cc
File src/kudu/client/client-test.cc:

http://gerrit.cloudera.org:8080/#/c/20282/4/src/kudu/client/client-test.cc@2964
PS4, Line 2964: TEST_P(KeepAlivePeriodicallyTest, 
TestScannerKeepAlivePeriodically) {
It would be great to add a test to verify how the new functionality works in 
case of the scan spans multiple tablets that are hosted by different tablet 
servers, so multiple scanner objects at different tablet servers are involved.  
That's to make sure the client sends the keep-alive requests to appropriate 
tablet servers while the data is being fetched.

Also, it's necessary to make sure the functionality works as expected for 
fault-tolerant scanners: in those scenarios, a tablet server that is currently 
serving the scan should fail/crash/stopped, and the client automatically 
restarts the scanning at some other tablet replica hosted by different tablet 
server.  In that case, the client library should stop sending keep-alive 
requests to the former server, but start sending the requests to the new server 
where the scanning has been switched to.  Using the DoScanWithCallback() 
utility function could help in creating such a test scenario.


http://gerrit.cloudera.org:8080/#/c/20282/4/src/kudu/client/client-test.cc@2965
PS4, Line 2965:   const auto keep_alive_periodically = GetParam();
Since this test scenario assumes running for at least 3 seconds, it's 
considered slow -- please add SKIP_IF_SLOW_NOT_ALLOWED() macro in the very 
beginning.


http://gerrit.cloudera.org:8080/#/c/20282/4/src/kudu/client/client-test.cc@2967
PS4, Line 2967: table with only one replica
non-replicated table


http://gerrit.cloudera.org:8080/#/c/20282/4/src/kudu/client/client-test.cc@2972
PS4, Line 2972:   // Start a scan but don't get the whole data back
nit: add a period (.) in the end of the sentence


http://gerrit.cloudera.org:8080/#/c/20282/4/src/kudu/client/client-test.cc@2974
PS4, Line 2974: to the second tablet
Why the second tablet is important here?  How do you even know that multiple 
tablets are involved in the scope of this test scenario?

It would be great to add comments to explain that.


http://gerrit.cloudera.org:8080/#/c/20282/4/src/kudu/client/client-test.cc@2986
PS4, Line 2986: make scanner exipred when without keepalive peroidically
make sure the scanner expires when keep-alive requests are not sent


http://gerrit.cloudera.org:8080/#/c/20282/4/src/kudu/client/client-test.cc@2993
PS4, Line 2993: has been expired
has expired


http://gerrit.cloudera.org:8080/#/c/20282/4/src/kudu/client/client-test.cc@2994
PS4, Line 2994: ASSERT_TRUE(s.IsNotFound());
nit: when using this sort of assert, consider adding s.ToString() -- that would 
be easier to troubleshoot this ever triggers

ASSERT_TRUE(s.IsNotFound()) << s.ToString();


http://gerrit.cloudera.org:8080/#/c/20282/4/src/kudu/client/client.h
File src/kudu/client/client.h:

http://gerrit.cloudera.org:8080/#/c/20282/4/src/kudu/client/client.h@2957
PS4, Line 2957: it build a messeger failed
building a messenger failed


http://gerrit.cloudera.org:8080/#/c/20282/4/src/kudu/client/scanner-internal.h
File src/kudu/client/scanner-internal.h:

http://gerrit.cloudera.org:8080/#/c/20282/4/src/kudu/client/scanner-internal.h@277
PS4, Line 277:   Status KeepAlivePeriodically(uint64_t keep_alive_interval_ms);
Please add a small doc comment (as the other methods in this class have).


http://gerrit.cloudera.org:8080/#/c/20282/4/src/kudu/client/scanner-internal.h@302
PS4, Line 302: DISALLOW_COPY_AND_ASSIGN(Data);
nit: please keep the DISALLOW_COPY_AND_ASSIGN() macro as the last element in 
the list of members for easier readability


http://gerrit.cloudera.org:8080/#/c/20282/4/src/kudu/client/scanner-internal.h@303
PS4, Line 303: keep_alive_timer_
nit: please add a comment at least to describe when this member becomes non-null


http://gerrit.cloudera.org:8080/#/c/20282/4/src/kudu/client/scanner-internal.cc
File src/kudu/client/scanner-internal.cc:

http://gerrit.cloudera.org:8080/#/c/20282/4/src/kudu/client/scanner-internal.cc@104
PS4, Line 104: KeepAlivePeriodically
What happens with the periodic timer when KuduScanner::Data::OpenTablet() is 
called (e.g., that happens upon calling KuduScanner::Data::OpenNextTablet())?


http://gerrit.cloudera.org:8080/#/c/20282/4/src/kudu/client/scanner-internal.cc@109
PS4, Line 109: keep_alive_timer_
Do we need to do anything with this member field upon an explicit call of the  
KuduScanner::Close() method?


http://gerrit.cloudera.org:8080/#/c/20282/4/src/kudu/client/scanner-internal.cc@110
PS4, Line 110: messenger
nit: use std::move() here?


http://gerrit.cloudera.org:8080/#/c/20282/4/src/kudu/client/scanner-internal.cc@115
PS4, Line 115: Start
Does it make sense to stop sending keep-alive requests for the scanner if all 
the data has been already fetched for a particular tablet?



--
To view, visit http://gerrit.cloudera.org:8080/20282
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I1165d96814eb4bcd5db9b5cb60403fffc5b18c81
Gerrit-Change-Number: 20282
Gerrit-PatchSet: 4
Gerrit-Owner: Wang Xixu <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mahesh Reddy <[email protected]>
Gerrit-Reviewer: Wang Xixu <[email protected]>
Gerrit-Reviewer: Yifan Zhang <[email protected]>
Gerrit-Reviewer: Yingchun Lai <[email protected]>
Gerrit-Reviewer: Yuqi Du <[email protected]>
Gerrit-Comment-Date: Wed, 23 Aug 2023 00:53:32 +0000
Gerrit-HasComments: Yes

Reply via email to