Thomas Tauber-Marshall has posted comments on this change.

Change subject: IMPALA-5749: coordinator race hits DCHECK 
'num_remaining_backends_ > 0'
......................................................................


Patch Set 2:

(1 comment)

> > Does this trigger only when there are two concurrent calls to
 > > UpdateBackendExecStatus() from the same backend? If so, do we
 > > understand why that happens so often?
 > 
 > My understanding is this:
 > A fragment instance sends reports every 'n' seconds. Due to a
 > congested network, two of these reports for the same fragment
 > instance from a backend can arrive at the coordinator and start
 > being processed at around the same time, hence leading to this
 > issue.
 > 
 > Ideally a second report cannot be send until the first one is ACKd
 > by the coordinator, since a lock is held until the report is ACKd,
 > in the ReportProfileThread(); but there is only one case where a
 > second report will be sent before the first one is responded to,
 > i.e.  from FragmentInstanceState::Finalize().
 > 
 > So ReportProfileThread() sends the one report of the last
 > finstance, then Finalize() sends the second report of the same
 > finstance before the first one is responded to.

This may be possible, but the query I've been using to repro it runs fast 
enough that there are never any updates sent by the report thread, just from 
Finalize, so I haven't observed it.

The reason I've been seeing it fail is: there are two different fragments from 
the same backend that send reports at the same time. When they both reach 
Coordinator::UpdateBackendExecStatus(), they both get past the IsDone() check.

Then ApplyExecStatusReport() gets called twice but both calls return true in 
the out parameter 'done', because the first call contained a fragment with an 
error status which causes the backend to be considered done, so both threads 
end up decreasing num_remaining_backends_ for the same backend.

http://gerrit.cloudera.org:8080/#/c/7577/1/be/src/runtime/coordinator-backend-state.h
File be/src/runtime/coordinator-backend-state.h:

Line 75:   /// this update was not applied because this backend has previously 
completed. This can
> Please also add:
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7577
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1528661e5df6d9732ebfeb414576c82ec5c92241
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <[email protected]>
Gerrit-Reviewer: Henry Robinson <[email protected]>
Gerrit-Reviewer: Michael Ho <[email protected]>
Gerrit-Reviewer: Sailesh Mukil <[email protected]>
Gerrit-Reviewer: Thomas Tauber-Marshall <[email protected]>
Gerrit-HasComments: Yes

Reply via email to