Ádám Bakai has posted comments on this change. ( http://gerrit.cloudera.org:8080/22056 )
Change subject: [subprocess] KUDU-3624 Fix DoWait thread-safety ...................................................................... Patch Set 1: (5 comments) http://gerrit.cloudera.org:8080/#/c/22056/2//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/22056/2//COMMIT_MSG@15 PS2, Line 15: a > the following Done http://gerrit.cloudera.org:8080/#/c/22056/2/src/kudu/util/subprocess-test.cc File src/kudu/util/subprocess-test.cc: http://gerrit.cloudera.org:8080/#/c/22056/2/src/kudu/util/subprocess-test.cc@227 PS2, Line 227: odes = > Maybe, add an some extra delay before exiting ('sleep 1' or alike), otherwi Done http://gerrit.cloudera.org:8080/#/c/22056/2/src/kudu/util/subprocess-test.cc@233 PS2, Line 233: string exit_info; > Move this into a scoped cleanup, otherwise the thread wouldn't be ripped as Done http://gerrit.cloudera.org:8080/#/c/22056/1/src/kudu/util/subprocess.cc File src/kudu/util/subprocess.cc: http://gerrit.cloudera.org:8080/#/c/22056/1/src/kudu/util/subprocess.cc@815 PS1, Line 815: std::unique_lock lock(wait_mutex_, std::try_to_lock); > What I meant was this - moving whole lock acquisition logic to the start so You are correct. I thought that not having that condition can cause a deadlock, but tested and thought about more and it can't. http://gerrit.cloudera.org:8080/#/c/22056/2/src/kudu/util/subprocess.cc File src/kudu/util/subprocess.cc: http://gerrit.cloudera.org:8080/#/c/22056/2/src/kudu/util/subprocess.cc@809 PS2, Line 809: if (state_ != kRunning) { > Since now this creates a protected section in the Subprocess's only method The problem with extending the protected section is that we need to both support blocking and non blocking wait. If we start the lock at the start of the function, then it will block for non blocking functions. This is even true if there is a separate read and write lock for waitpid, because the waitpid part can run for very long and read lock can not be acquired at that time. what could solve the issue is having two separate mutex, one for the waitpid(that is already existing) and one for state, and basically this is in write lock: child_pid_ = -1; wait_status_ = status; state_ = kExited; And this is in a read lock, because theoretically it can happen that the state is still in running, but the pid is already -1: Status Subprocess::Kill(int signal) { if (state_ != kRunning) { const string err_str = "Sub-process is not running"; LOG(DFATAL) << err_str; return Status::IllegalState(err_str); } int ret = kill(child_pid_, signal); But it can happen even in this case that the kill is unsucessful, since the child may have already exited. I checked the code and its pretty common, that there is only a warning if the kill is unsucessful: WARN_NOT_OK(hms_process_->Kill(SIGQUIT), "failed to send SIGQUIT to HMS"); About using atomics: I think it makes sense, so that writing and reading the same variable can create any kind of weird problem, that TSAN was complaining about. And the solution is to make them atomic. So in a nuthsell I think: 1. Using shared lock wouldn't help, since write lock takes a long time and we only use it as a write mutex. using read mutexes with it doesn't work, because these functions have to be fast. 2. using two locks, one for the current waitpid() (as a unique_mutex) and an other(shared_mutex) for to conserve the consistency between state_/child_pid_/wait_status_ makes sense, if we are trying to make the whole subprocess thread-safe. By first look, Subprocess::Kill and Subprocess::GetExitStatus is not thread safe. But the main point is making DoWait thread-safe since this is the main issue that makes the tests fail. 3. Using the atomic state is useful, because this way it is enforced that state doesn't get written at the exact same time as it is read in DoWait, preventing errors. I think it's better this way -- To view, visit http://gerrit.cloudera.org:8080/22056 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I1cb540860b439c26e1c8529123c8b29940d9f84f Gerrit-Change-Number: 22056 Gerrit-PatchSet: 1 Gerrit-Owner: Ádám Bakai <[email protected]> Gerrit-Reviewer: Alexey Serbin <[email protected]> Gerrit-Reviewer: Ashwani Raina <[email protected]> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Ádám Bakai <[email protected]> Gerrit-Comment-Date: Thu, 21 Nov 2024 16:47:16 +0000 Gerrit-HasComments: Yes
