[
https://issues.apache.org/jira/browse/KUDU-1761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15693535#comment-15693535
]
Mike Percy commented on KUDU-1761:
----------------------------------
The output below, from a run of tablet_history_gc-itest, supports the
description of this JIRA.
The client appears to be able to deliver writes out of order. Our exactly-once
mechanism likely does not enforce ordering of client flushes / writes, so
concurrent client flushes can apparently be applied in the wrong order, as the
test results below show.
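To make the suspected race concrete, here is a minimal sketch in Python. It is a toy model, not Kudu client code: the `flushes` structure, the `client_order` field, and both helper functions are hypothetical. It assumes each flush carries one update to the same row and that the server simply applies updates in arrival order, with the last write winning. The key and values are the ones from the test output in this comment.

```python
from itertools import permutations

# Hypothetical model: one update per flush, tagged with the order in
# which the client session applied it. Values come from the test log.
flushes = [
    {"client_order": 1, "key": 1401, "int_val": 1853363533},
    {"client_order": 2, "key": 1401, "int_val": 230599396},
]

def apply_in_arrival_order(arrivals):
    """Apply updates in the order they reach the server; last write wins."""
    row = {}
    for flush in arrivals:
        row[flush["key"]] = flush["int_val"]
    return row

def final_values(flushes, key):
    """Map each possible arrival order to the final int_val for `key`."""
    outcomes = {}
    for arrivals in permutations(flushes):
        order = tuple(f["client_order"] for f in arrivals)
        outcomes[order] = apply_in_arrival_order(arrivals)[key]
    return outcomes

outcomes = final_values(flushes, key=1401)
for order, value in sorted(outcomes.items()):
    print(order, value)
```

In this model, arrival order (1, 2) leaves the row at 230599396, the value the test expected, while arrival order (2, 1) leaves it at 1853363533, the value the assertion actually saw.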
Assertion failure from a run of tablet_history_gc-itest:
{noformat}
I1124 14:58:47.745049 6573 tablet_history_gc-itest.cc:214] Round 26: Verifying
snapshot scan for timestamp P: 1000000000 usec, L: 1559 (4096000001559)
../../src/kudu/integration-tests/tablet_history_gc-itest.cc:243: Failure
Value of: int_val
Actual: 1853363533
Expected: snap_iter->second.int_val
Which is: 230599396
at row key 1401
{noformat}
Expected order of applied operations from the test log:
{noformat}
I1124 14:58:43.812958 6573 tablet_history_gc-itest.cc:477] Updating row to {
1401, 1853363533, 1853363532, NOT_DELETED }
I1124 14:58:43.812971 6573 tablet_history_gc-itest.cc:477] Updating row to {
1401, 230599396, 230599395, NOT_DELETED }
{noformat}
The actual order of applied operations was the opposite, according to the WAL
generated by the test. Note that seq_no 1296 is logged at index 1297, ahead of
seq_no 1295 at index 1298:
{noformat}
1.1297@4096000001314 REPLICATE WRITE_OP
Tablet: 87bd5c7e164744399cf6e9b02ff4f588
RequestId: client_id: "dec6b320491a46928b5914deb42e16e7" seq_no: 1296
first_incomplete_seq_no: 1295 attempt_no: 0
Consistency: CLIENT_PROPAGATED
op 0: MUTATE (int32 key=1401) SET int_val=230599396,
string_val=230599395
op_type: WRITE_OP commited_op_id { term: 1 index: 1297 } result { ops {
mutated_stores { mrs_id: 0 } } }
1.1298@4096000001315 REPLICATE WRITE_OP
Tablet: 87bd5c7e164744399cf6e9b02ff4f588
RequestId: client_id: "dec6b320491a46928b5914deb42e16e7" seq_no: 1295
first_incomplete_seq_no: 1294 attempt_no: 0
Consistency: CLIENT_PROPAGATED
op 0: MUTATE (int32 key=1401) SET int_val=1853363533,
string_val=1853363532
{noformat}
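The inversion in the WAL excerpt above can also be checked mechanically. The following Python sketch is only an illustration: `WalEntry` and `find_inversions` are hypothetical names, the entries are transcribed by hand from the excerpt rather than parsed from a real WAL dump, and it assumes that `seq_no` reflects the order in which the client issued the requests, as the excerpt suggests.

```python
from collections import namedtuple

# Hypothetical record type; fields transcribed from the WAL excerpt.
WalEntry = namedtuple("WalEntry", "log_index client_id seq_no key")

CLIENT = "dec6b320491a46928b5914deb42e16e7"
entries = [
    WalEntry(1297, CLIENT, 1296, 1401),
    WalEntry(1298, CLIENT, 1295, 1401),
]

def find_inversions(entries):
    """Return (earlier, later) pairs where a later log entry for the same
    client and row key carries a lower seq_no than an earlier one."""
    inversions = []
    highest = {}  # (client_id, key) -> entry with the highest seq_no so far
    for entry in sorted(entries, key=lambda e: e.log_index):
        slot = (entry.client_id, entry.key)
        prev = highest.get(slot)
        if prev is not None and entry.seq_no < prev.seq_no:
            inversions.append((prev, entry))
        else:
            highest[slot] = entry
    return inversions

for earlier, later in find_inversions(entries):
    print(f"key {later.key}: seq_no {later.seq_no} at index {later.log_index} "
          f"follows seq_no {earlier.seq_no} at index {earlier.log_index}")
```

For the two entries above this reports a single inversion on row key 1401, matching the failed assertion.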
> Flaky tablet_history_gc-itest due to interleaving of concurrent client flushes
> ------------------------------------------------------------------------------
>
> Key: KUDU-1761
> URL: https://issues.apache.org/jira/browse/KUDU-1761
> Project: Kudu
> Issue Type: Bug
> Components: client, test
> Affects Versions: 1.1.0
> Reporter: Mike Percy
>
> It appears that tablet_history_gc-itest is flaky due to interleaving of
> client operations when automatic flush is enabled. The test is particularly
> susceptible if an async flush is triggered after each operation.
> The issue becomes more apparent when there are two updates to the same row in
> quick succession, and an async flush is triggered after each one. Sometimes
> the 2nd update is applied first on the server and is then overwritten by the
> 1st update, even though the 1st update was applied first to the client
> session. This race may manifest randomly, depending on thread scheduling and
> network latency.