[jira] [Commented] (KUDU-1761) Flaky tablet_history_gc-itest due to interleaving of concurrent client flushes

Mike Percy (JIRA) Thu, 24 Nov 2016 07:27:43 -0800

    [ 
https://issues.apache.org/jira/browse/KUDU-1761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15693535#comment-15693535
 ]


Mike Percy commented on KUDU-1761:
----------------------------------

Below is evidence from a run of tablet_history_gc-itest that supports the 
description of this JIRA.

There appears to be a problem with out-of-order writes from the client. Our 
exactly-once mechanism likely does not enforce ordering across client flushes / 
writes, so concurrent client flushes can apparently result in operations being 
applied out of order, as the test results below show.
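To make the effect of the misordering concrete, here is a minimal sketch in Python. It assumes a simplified last-writer-wins model of the server; `apply_in_order` is a hypothetical helper, not part of Kudu. The row key and values are the ones from the test output below.

```python
# Hypothetical model: each update is (row_key, int_val), and the server keeps
# whichever value was applied last for a given key.
def apply_in_order(updates):
    rows = {}
    for key, int_val in updates:
        rows[key] = int_val
    return rows

# Order in which the test applied the updates to the client session.
client_order = [(1401, 1853363533), (1401, 230599396)]
# Order in which the operations appear in the WAL (reversed).
server_order = list(reversed(client_order))

print(apply_in_order(client_order)[1401])  # value the test expects
print(apply_in_order(server_order)[1401])  # value the snapshot scan returned
```

Under this model the final value of the row depends only on which update is applied last, which is why swapping the two operations produces the assertion failure.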

Assertion failure from a run of tablet_history_gc-itest:

{noformat}
I1124 14:58:47.745049  6573 tablet_history_gc-itest.cc:214] Round 26: Verifying snapshot scan for timestamp P: 1000000000 usec, L: 1559 (4096000001559)
../../src/kudu/integration-tests/tablet_history_gc-itest.cc:243: Failure
Value of: int_val
  Actual: 1853363533
Expected: snap_iter->second.int_val
Which is: 230599396
at row key 1401
{noformat}

Expected order of applied operations from the test log:

{noformat}
I1124 14:58:43.812958  6573 tablet_history_gc-itest.cc:477] Updating row to { 1401, 1853363533, 1853363532, NOT_DELETED }
I1124 14:58:43.812971  6573 tablet_history_gc-itest.cc:477] Updating row to { 1401, 230599396, 230599395, NOT_DELETED }
{noformat}

The actual order of applied operations was the opposite, according to the WAL 
generated by the test. Note that seq_no 1296 is replicated before seq_no 1295:

{noformat}
1.1297@4096000001314    REPLICATE WRITE_OP
        Tablet: 87bd5c7e164744399cf6e9b02ff4f588
        RequestId: client_id: "dec6b320491a46928b5914deb42e16e7" seq_no: 1296 first_incomplete_seq_no: 1295 attempt_no: 0
        Consistency: CLIENT_PROPAGATED
        op 0: MUTATE (int32 key=1401) SET int_val=230599396, string_val=230599395
        op_type: WRITE_OP commited_op_id { term: 1 index: 1297 } result { ops { mutated_stores { mrs_id: 0 } } }
1.1298@4096000001315    REPLICATE WRITE_OP
        Tablet: 87bd5c7e164744399cf6e9b02ff4f588
        RequestId: client_id: "dec6b320491a46928b5914deb42e16e7" seq_no: 1295 first_incomplete_seq_no: 1294 attempt_no: 0
        Consistency: CLIENT_PROPAGATED
        op 0: MUTATE (int32 key=1401) SET int_val=1853363533, string_val=1853363532
{noformat}
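The inversion in the WAL excerpt above can be checked mechanically. The following is a rough sketch, assuming the WAL dump has already been parsed into (log index, client_id, seq_no) tuples; `WalEntry` and `find_seq_no_inversions` are hypothetical names for illustration, not Kudu APIs.

```python
from collections import namedtuple

# Hypothetical parsed form of one REPLICATE entry from the WAL dump.
WalEntry = namedtuple("WalEntry", ["index", "client_id", "seq_no"])

def find_seq_no_inversions(entries):
    """Return (previous, current) pairs where a client's seq_no is lower
    than the seq_no of that client's preceding entry in log-index order."""
    last_seen = {}
    inversions = []
    for entry in sorted(entries, key=lambda e: e.index):
        prev = last_seen.get(entry.client_id)
        if prev is not None and entry.seq_no < prev.seq_no:
            inversions.append((prev, entry))
        last_seen[entry.client_id] = entry
    return inversions

# The two entries from the WAL excerpt above.
entries = [
    WalEntry(1297, "dec6b320491a46928b5914deb42e16e7", 1296),
    WalEntry(1298, "dec6b320491a46928b5914deb42e16e7", 1295),
]

for prev, cur in find_seq_no_inversions(entries):
    print("index %d has seq_no %d, but index %d has seq_no %d"
          % (prev.index, prev.seq_no, cur.index, cur.seq_no))
```

A decreasing seq_no is not by itself proof of a bug, but here the two inverted entries are exactly the two updates to row key 1401, which matches the assertion failure.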

> Flaky tablet_history_gc-itest due to interleaving of concurrent client flushes
> ------------------------------------------------------------------------------
>
>                 Key: KUDU-1761
>                 URL: https://issues.apache.org/jira/browse/KUDU-1761
>             Project: Kudu
>          Issue Type: Bug
>          Components: client, test
>    Affects Versions: 1.1.0
>            Reporter: Mike Percy
>
> It appears that tablet_history_gc-itest is flaky due to interleaving of 
> client operations when automatic flush is enabled. The test is particularly 
> susceptible if an async flush is triggered after each operation.
> The issue becomes more apparent when there are two updates to the same row in 
> quick succession, and an async flush is triggered after each one. Sometimes 
> the 2nd update is applied first on the server, then overwritten by the 1st 
> update, even though the 1st update was applied to the client session first. 
> This race may manifest randomly depending on thread scheduling and network 
> latency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
