Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7820

to look at the new patch set (#2).

Change subject: Fix flaky ts_recovery-itest TestChangeMaxCellSize
......................................................................

Fix flaky ts_recovery-itest TestChangeMaxCellSize

In TSTabletManager::CreateReportedTabletPB(), each replica's state is
reported, and if this state is FAILED, it additionally reports the error
state that caused it to fail. However, replica state is transient, and
can, on rare occasion, change between these two retrievals of replica
state.

This led to a DCHECK failure in ts_recovery-itest when parsing a
report, as reports with errors are expected to be FAILED. The failure
was consistent with the following sequence of events:
  1. The tablet is opened, hits an error, and starts shutting down
  2. The report notes the replica state (STOPPING)
  3. The tablet finishes shutting down and switches its state (FAILED)
  4. The report, seeing the replica is FAILED, notes the error that
     caused the shut down
  5. The catalog manager goes through the report, and notes the error
     tacked onto the report and the fact that the tablet is STOPPING
     instead of FAILED. This fails the DCHECK, as logged below:

BlockManagerType/TsRecoveryITest.TestChangeMaxCellSize/0: 
catalog_manager.cc:2489] Check failed: report.state() == tablet::FAILED (3 vs. 
2)

This sequence may be more frequent since a9d17c0, which enforces the
tablet stops (i.e. go into the STOPPING state and shut down the
replica's internals) before going into the FAILED state, instead of
going directly to the FAILED state.

Validated by injecting a pause in the CreateReportedTabletPB between the
two calls and hitting the DCHECK failure. This is fixed by reporting the
error regardless of state and removing this DCHECK.

Change-Id: I6f0a6b19756777e5f4081ef6c8cb5af4ecc8a3d6
---
M src/kudu/master/catalog_manager.cc
M src/kudu/tserver/ts_tablet_manager.cc
2 files changed, 4 insertions(+), 5 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/20/7820/2
-- 
To view, visit http://gerrit.cloudera.org:8080/7820
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I6f0a6b19756777e5f4081ef6c8cb5af4ecc8a3d6
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <[email protected]>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <[email protected]>
Gerrit-Reviewer: Todd Lipcon <[email protected]>

Reply via email to