[ https://issues.apache.org/jira/browse/KUDU-915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Todd Lipcon resolved KUDU-915.
------------------------------
Resolution: Fixed
Fix Version/s: 0.5.0
I'm going to call this fixed, since the workaround committed back in August
seems to be good enough for now. If we decide we want to make the locking more
fine-grained we can attack it as a perf optimization at some point down the
road.
> Bootstrap can fail shortly after an alter-table
> -----------------------------------------------
>
> Key: KUDU-915
> URL: https://issues.apache.org/jira/browse/KUDU-915
> Project: Kudu
> Issue Type: Bug
> Components: tablet
> Affects Versions: Private Beta
> Reporter: Todd Lipcon
> Priority: Critical
> Fix For: 0.5.0
>
> Attachments: alter_table-randomized-test (5).txt.gz
>
>
> I saw a test failure which seems to be due to the following sequence:
> 1) Log: REPLICATE 1.8 ALTER_SCHEMA
> 2) Log: REPLICATE 1.9 WRITE
> 3) Log: COMMIT 1.9 WRITE
> 4) TabletMetadata::Flush()
> 5) crash (before COMMIT 1.8 ALTER_SCHEMA)
> During bootstrap, we then have an issue that, because we haven't seen a
> commit message for 1.8, we consider operation 1.9 to be still pending. We are
> relying on the tablet peer's FlushInFlightsToLogCallback to ensure that we
> don't flush metadata until the COMMIT message is in the log, but that isn't
> strong enough -- we need to actually wait until COMMIT messages are in the
> log for _all_ prior operations, not just all prior _writes_. The
> implementation currently uses MvccManager::WaitForAllInFlightToCommit, but
> since AlterSchema doesn't use MvccManager, we aren't waiting for it.
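
To make the gap described above concrete, here is a toy model of the log sequence from the report. It is only a sketch: the `Entry` type and the `uncommitted_ops` and `safe_to_flush` helpers are hypothetical names invented for illustration and are not Kudu APIs. The `writes_only` flag stands in for a guard that waits only on operations tracked by MvccManager (WRITE operations in this model), which is how the description characterizes the current behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """One log entry in this toy model (hypothetical, not a Kudu type)."""
    kind: str     # "REPLICATE" or "COMMIT"
    op_id: str    # e.g. "1.8"
    op_type: str  # "ALTER_SCHEMA" or "WRITE"


def uncommitted_ops(log):
    """Map op id -> op type for ops with a REPLICATE entry but no COMMIT entry."""
    pending = {}
    for entry in log:
        if entry.kind == "REPLICATE":
            pending[entry.op_id] = entry.op_type
        elif entry.kind == "COMMIT":
            pending.pop(entry.op_id, None)
    return pending


def safe_to_flush(log, writes_only):
    """Decide whether a metadata flush would be allowed at the end of `log`.

    writes_only=True models a guard that waits only for in-flight WRITE
    operations; writes_only=False models waiting for every prior operation.
    """
    pending = uncommitted_ops(log)
    if writes_only:
        return all(op_type != "WRITE" for op_type in pending.values())
    return not pending


# Steps 1-3 of the reported sequence, i.e. the log as it stands just
# before TabletMetadata::Flush() in step 4.
log = [
    Entry("REPLICATE", "1.8", "ALTER_SCHEMA"),
    Entry("REPLICATE", "1.9", "WRITE"),
    Entry("COMMIT", "1.9", "WRITE"),
]

allowed_by_write_only_guard = safe_to_flush(log, writes_only=True)
allowed_by_all_ops_guard = safe_to_flush(log, writes_only=False)
```

In this model the write-only guard allows the flush even though 1.8 ALTER_SCHEMA has no COMMIT entry yet, while the all-operations guard blocks it until that COMMIT is logged.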