Ádám Bakai has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/22056 )

Change subject: [subprocess] KUDU-3624 Fix DoWait thread-safety
......................................................................


Patch Set 1:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/22056/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/22056/2//COMMIT_MSG@15
PS2, Line 15: a
> the following
Done


http://gerrit.cloudera.org:8080/#/c/22056/2/src/kudu/util/subprocess-test.cc
File src/kudu/util/subprocess-test.cc:

http://gerrit.cloudera.org:8080/#/c/22056/2/src/kudu/util/subprocess-test.cc@227
PS2, Line 227: odes =
> Maybe, add an some extra delay before exiting ('sleep 1' or alike), otherwi
Done


http://gerrit.cloudera.org:8080/#/c/22056/2/src/kudu/util/subprocess-test.cc@233
PS2, Line 233:     string exit_info;
> Move this into a scoped cleanup, otherwise the thread wouldn't be ripped as
Done


http://gerrit.cloudera.org:8080/#/c/22056/1/src/kudu/util/subprocess.cc
File src/kudu/util/subprocess.cc:

http://gerrit.cloudera.org:8080/#/c/22056/1/src/kudu/util/subprocess.cc@815
PS1, Line 815:   std::unique_lock lock(wait_mutex_, std::try_to_lock);
> What I meant was this - moving whole lock acquisition logic to the start so
You are correct. I thought that not having that condition can cause a deadlock, 
but tested and thought about more and it can't.


http://gerrit.cloudera.org:8080/#/c/22056/2/src/kudu/util/subprocess.cc
File src/kudu/util/subprocess.cc:

http://gerrit.cloudera.org:8080/#/c/22056/2/src/kudu/util/subprocess.cc@809
PS2, Line 809:   if (state_ != kRunning) {
> Since now this creates a protected section in the Subprocess's only method
The problem with extending the protected section is that we need to both 
support blocking and non blocking  wait. If we start the lock at the start of 
the function, then it will block for non blocking functions. This is even true 
if there is a separate read and write lock for waitpid, because the waitpid 
part can run for very long and read lock can not be acquired at that time.
what could solve the issue is having two separate mutex, one for the 
waitpid(that is already existing) and one for state, and basically this is in 
write lock:

  child_pid_ = -1;
  wait_status_ = status;
  state_ = kExited;


And this is in a read lock, because theoretically it can happen that the state 
is still in running, but the pid is already -1:

Status Subprocess::Kill(int signal) {
  if (state_ != kRunning) {
    const string err_str = "Sub-process is not running";
    LOG(DFATAL) << err_str;
    return Status::IllegalState(err_str);
  }
  int ret = kill(child_pid_, signal);

But it can happen even in this case that the kill is unsucessful, since the 
child may have already exited. I checked the code and its pretty common, that 
there is only a warning if the kill is unsucessful:
    WARN_NOT_OK(hms_process_->Kill(SIGQUIT), "failed to send SIGQUIT to HMS");

About using atomics: I think it makes sense, so that writing and reading the 
same variable can create any kind of weird problem, that TSAN was complaining 
about. And the solution is to make them atomic.

So in a nuthsell I think:
1. Using shared lock wouldn't help, since write lock takes a long time and we 
only use it as a write mutex. using read mutexes with it doesn't work, because 
these functions have to be fast.
2. using two locks, one for the current waitpid() (as a unique_mutex) and an 
other(shared_mutex) for to conserve the consistency between 
state_/child_pid_/wait_status_ makes sense, if we are trying to make the whole 
subprocess thread-safe. By first look, Subprocess::Kill and 
Subprocess::GetExitStatus is not thread safe. But the main point is making 
DoWait thread-safe since this is the main issue that makes the tests fail.
3. Using the atomic state is useful, because this way it is enforced that state 
doesn't get written at the exact same time as it is read in DoWait, preventing 
errors.

I think it's better this way



--
To view, visit http://gerrit.cloudera.org:8080/22056
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I1cb540860b439c26e1c8529123c8b29940d9f84f
Gerrit-Change-Number: 22056
Gerrit-PatchSet: 1
Gerrit-Owner: Ádám Bakai <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Ashwani Raina <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Ádám Bakai <[email protected]>
Gerrit-Comment-Date: Thu, 21 Nov 2024 16:47:16 +0000
Gerrit-HasComments: Yes

Reply via email to