Henry Robinson has posted comments on this change.

Change subject: IMPALA-3882: Simplify some query exec state locking
......................................................................


Patch Set 4:

(9 comments)

Here's how I think about the lock safety (sorry, long, but hopefully helps to 
convince):

1. For callers that passed 'false' to GetQES(), nothing changes. That code path 
stays the same. 

2. For callers that passed 'true' to GetQES(), there's a new set of possible 
executions. Previously no two 'true' callers could obtain a reference to the 
QES simultaneously. That has now changed. 

However, apart from registration or UnregisterQuery()*, there's no reason I can 
think of to care about whether there are other references to the QES; rather 
the concern is whether the QES is accessed safely under that concurrency. 
Registration and UnregisterQuery() are safe anyhow.

Safety under concurrent references to the same QES is managed by QES::lock(), 
so the important thing in this patch is making sure that all 'true' callers 
correctly obtain the lock when they need to. Conveniently, that's the same 
point at which they all used to *adopt* the same lock, so it's pretty easy to 
check that we've got that right.

I added some more defensive locking even to the 'false' callers when they read 
from the QES in the latest patch version. 

3. Previously, no 'true' caller could obtain a reference to the QES 
simultaneously with planning. That's changed (effectively by passing 'false' in 
all those GetQES() calls), and planning doesn't hold the lock now anyhow.

* One might think that UnregisterQuery() erased the QES from the map while 
holding both locks, so that 'true' callers couldn't get at the QES after it was 
removed - guaranteeing that all 'true' callers were finished with the QES - but 
that was not the case even before this patch.

I ran this patch for ~10k queries under the stress test. I'll leave it running.

http://gerrit.cloudera.org:8080/#/c/4935/4/be/src/service/impala-beeswax-server.cc
File be/src/service/impala-beeswax-server.cc:

Line 279:   lock_guard<mutex> l(*exec_state->lock());
> There still is a race between GetQES() and this lock(). What if it the quer
I think that race exists currently, and is benign. If you look at 
https://github.com/apache/incubator-impala/blob/master/be/src/service/impala-server.cc#L904,
 UnregisterQuery() doesn't do anything to make sure that other references to 
the exec state have gone away, so it's totally possible for some other caller 
to have an existing shared_ptr and to read from the QES while it calls Done().

So the question becomes: is it safe to do anything to a QES if Done() might be 
called concurrently? Depends on the definition of 'safe':

* safe == free from concurrent access bugs -> both caller and Done() need to 
take lock_, which they [should] do.
* safe == free from *logic* bugs, where Done() destroys state that caller wants 
to read -> that's not what Done() does. Mostly Done() fills in some query 
profile information, and tells the coordinator it can stop executing (and the 
coordinator doesn't get destroyed otherwise).


http://gerrit.cloudera.org:8080/#/c/4935/4/be/src/service/impala-hs2-server.cc
File be/src/service/impala-hs2-server.cc:

PS4, Line 625: .get()
> I don't feel that strongly, but comparisons with NULL for scoped_ptr/unique
I don't feel that strongly either, but went with this because a) it's 
consistent with other usages in this file and b) it's explicit, rather than 
relying on operator overloading.


Line 641:   return;
> Unneeded return.
Done


PS4, Line 708: shared_ptr<QueryExecState> exec_state = 
GetQueryExecState(query_id);
> If this line is moved above L701, we wouldn't need to call GetQueryExecStat
Done


http://gerrit.cloudera.org:8080/#/c/4935/4/be/src/service/impala-server.cc
File be/src/service/impala-server.cc:

Line 789:   RETURN_IF_ERROR((*exec_state)->ExecQuery());
> What happens if Cancel() was called before we reach here?
Do you mean this comment to be on line 785? Cancel() could happen before this 
because the query is 'visible' after RegisterQuery() returns, and anyone that 
looks at e.g. the debug web page can find out the query id. 

But there is a new window for cancellation between RegisterQuery() and 
PlanQuery(). PlanQuery() has code to deal with that. Let me know if I'm not 
seeing the same race that you are.


http://gerrit.cloudera.org:8080/#/c/4935/4/be/src/service/impala-server.h
File be/src/service/impala-server.h:

Line 82: /// 2. session_state_map_lock_
> Curious - do we have a similar problem with the session map lock?
Similar-ish, but not as acute. We don't have the problem where we have this 
weird lock-inversion like we do when calling GetQES(..., true). 
GetSessionState(). But there is one place where we get s_s_m_l_ and then the 
s_s_l_, and other places where we don't always get the s_s_m_l_ first. 

This can lead to deadlock if the latter case then calls GetSessionState() with 
the session ID for the session it already has the lock for.

I think it would be a good idea to move s_s_m_l_ to the bottom of this 
hierarchy as well by ensuring that it should never be taken in conjunction with 
any locks. Any iteration over the set of sessions can be done by copying out 
the shared_ptrs, releasing the lock and then iterating again.


http://gerrit.cloudera.org:8080/#/c/4935/4/be/src/service/query-exec-state.cc
File be/src/service/query-exec-state.cc:

PS4, Line 586: Like the other Exec(), fill out as much profile information as 
we're able to.
> Update comment.
Done


PS4, Line 594: SetPlanningFinished();
> Why call SetPlanningFinished() here?
Because this is effectively the 'planning' path for metadata ops, some of which 
have result sets and therefore metadata as well. 

I changed the method name to SetMetadataAvailable() to be clearer. I'm in two 
minds about whether it's a good change or not.


Line 620:       block_on_wait_cv_.wait(l);
> Should we switch to promises instead of cond vars here?
I think that's a good idea (see the jira for this patch), but it's a separate 
change to simplify the logic a bit.


-- 
To view, visit http://gerrit.cloudera.org:8080/4935
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I516357d2b5e9eb83e8209872cbe4c078c778a629
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Henry Robinson <[email protected]>
Gerrit-Reviewer: Henry Robinson <[email protected]>
Gerrit-Reviewer: Sailesh Mukil <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-HasComments: Yes

Reply via email to