[ https://issues.apache.org/jira/browse/KUDU-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15171168#comment-15171168 ]

David Alves edited comment on KUDU-1354 at 2/28/16 8:31 PM:
------------------------------------------------------------

After discussing this on Slack, this appears to be a bug in the way we release 
locks and then mvcc-commit transactions that have intersecting read sets.

Some preliminary thoughts on possible alternatives:

1 - Forego releasing locks before the mvcc commit:
Todd suggests we could simply release locks after we commit, making sure that 
the two transactions never overlap. This is the simple option and likely the 
one we should implement first.
It has the disadvantage of making all transactions queued to acquire locks wait 
for the additional period between the instant at which we currently release 
locks (before the mvcc commit) and the new instant (after the mvcc commit). But 
this is likely not so bad, as this interval is usually not proportional to the 
txn size and the prepare thread would block anyway.
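
A minimal sketch of what the reordering in option 1 amounts to. The types and 
names here (RowLock, MvccManager, FinishApply) are stand-ins for illustration, 
not the real Kudu classes:

{noformat}
#include <cstdint>
#include <mutex>
#include <vector>

// Hypothetical stand-ins for the row locks and the MVCC manager; the shapes
// here are assumptions for the sketch, not Kudu's actual API.
struct RowLock {
  std::mutex* m;
  void Release() { m->unlock(); }
};
struct MvccManager {
  void Commit(int64_t txid) { /* mark txid committed in the MVCC state */ }
};

// Option 1: commit to MVCC while still holding the row locks, then release.
// A transaction queued on the same locks can no longer commit "around" this
// one, at the cost of waiters blocking for the extra apply-to-commit interval.
void FinishApply(int64_t txid, std::vector<RowLock>& locks, MvccManager& mvcc) {
  mvcc.Commit(txid);            // 1) the txn becomes visible to new snapshots
  for (RowLock& l : locks) {
    l.Release();                // 2) only now wake up the waiters
  }
}
{noformat}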

2 - Keep track of the dependencies and make sure the transactions commit in 
order:
This would make much more sense if the lock manager had per-lock wait queues, 
since we'd already be tracking the dependencies, in a way.

We could do something like:

Tx1 acquires locks with LOCK_EXCLUSIVE and marks itself as the lock owner.
Tx1 goes through prepare, apply, etc., then changes the locks to LOCK_SHARED.
Tx1 mvcc-commits and removes itself from all lock queues.

Tx2, when acquiring the locks:
- if it observes a lock with no owner, acquires it as LOCK_EXCLUSIVE and marks 
itself as the owner.
- if it observes a lock with LOCK_EXCLUSIVE, adds itself to the wait queue and 
adds the owner to its dependency set.
- if it observes a lock with LOCK_SHARED, changes it to LOCK_EXCLUSIVE, marks 
itself as the owner, and adds the previous owner to its dependency set.

Tx2 can release locks as soon as it applies.
Tx2 won't mvcc-commit until all txns in its dependency set have committed.
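
A rough sketch of the lock state machine and dependency tracking described 
above; LockEntry, TxnState, Acquire and CanCommit are made-up names for 
illustration, not the real lock manager API:

{noformat}
#include <cstdint>
#include <deque>
#include <set>

enum class LockMode { NONE, LOCK_SHARED, LOCK_EXCLUSIVE };

struct LockEntry {
  LockMode mode = LockMode::NONE;
  int64_t owner = -1;               // txn currently responsible for ordering
  std::deque<int64_t> wait_queue;   // txns blocked behind LOCK_EXCLUSIVE
};

struct TxnState {
  int64_t id;
  std::set<int64_t> deps;           // txns that must mvcc-commit before us
};

// The rules Tx2 follows when acquiring one of its locks.
void Acquire(LockEntry& lock, TxnState& txn) {
  switch (lock.mode) {
    case LockMode::NONE:              // no owner: take it exclusively
      lock.mode = LockMode::LOCK_EXCLUSIVE;
      lock.owner = txn.id;
      break;
    case LockMode::LOCK_EXCLUSIVE:    // owner hasn't applied yet: wait on it
      txn.deps.insert(lock.owner);
      lock.wait_queue.push_back(txn.id);
      break;
    case LockMode::LOCK_SHARED:       // owner applied but not mvcc-committed:
      txn.deps.insert(lock.owner);    // we can proceed, but must commit later
      lock.mode = LockMode::LOCK_EXCLUSIVE;
      lock.owner = txn.id;
      break;
  }
}

// Tx2's mvcc commit is deferred until every txn in its dependency set has
// committed; whatever drives the commit would poll or be signalled on this.
bool CanCommit(const TxnState& txn, const std::set<int64_t>& committed) {
  for (int64_t dep : txn.deps) {
    if (committed.count(dep) == 0) return false;
  }
  return true;
}
{noformat}

The LOCK_SHARED state is what lets Tx2 start preparing/applying before Tx1 has 
mvcc-committed, while still forcing Tx2's commit to order after Tx1's.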





> MVCC Snapshots chosen during flush can contain out-of-order transactions
> ------------------------------------------------------------------------
>
>                 Key: KUDU-1354
>                 URL: https://issues.apache.org/jira/browse/KUDU-1354
>             Project: Kudu
>          Issue Type: Bug
>          Components: tablet
>    Affects Versions: 0.7.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Critical
>
> I spent a while trying to debug a failure of alter_table-randomized-test and 
> found the following interesting logs:
> - We have two operations in the WAL which arrived in short succession (about 
> 4ms apart) just before an alter table. I've renumbered the txids for 
> readability here:
> {noformat}
> 1.13@2        REPLICATE WRITE_OP
>       op 0: MUTATE (int32 key=1643562) SET c6=1107303203
> 1.14@4        REPLICATE WRITE_OP
>       op 0: MUTATE (int32 key=1643562) DELETE
> {noformat}
> - and the Flush that was caused by the AlterTable has the following snapshots:
> {noformat}
> ... Phase 1 snapshot:  MvccSnapshot[committed={T|T < 2 or (T in (4))]
> ...
> ... Phase 2 snapshot: MvccSnapshot[committed={T|T < 2 or (T in (4, 2))]
> {noformat}
> Note that the first snapshot considers the 'DELETE' committed but not the 
> 'UPDATE'. We then fill in the 'UPDATE' in the second snapshot. The end result 
> here is that we end up flushing REDO deltas as follows:
> REDO file 1 (flushed in phase 1): includes only the DELETE
> REDO file 2 (flushed after ReupdateMissedDeltas): includes only the UPDATE
> When we later proceed to compact this rowset, we get "Check failed: 
> !is_deleted Got UPDATE for deleted row."
> Scenarios like this seem to reproduce a few tenths of a percent of the time 
> in this stress test.
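
For illustration, a toy model of the two snapshots quoted above (not Kudu's 
MvccSnapshot class): a txid counts as committed if it is below the clean bound 
or explicitly listed, so the phase 1 snapshot admits the DELETE (txid 4) but 
not the UPDATE (txid 2), which is how the two mutations end up in separate 
REDO files in the wrong order.

{noformat}
#include <cstdint>
#include <iostream>
#include <set>

// Toy model of the quoted snapshot strings, e.g. {T | T < 2 or (T in (4))}.
struct Snapshot {
  int64_t all_committed_before;    // every T below this bound is committed
  std::set<int64_t> extra;         // explicitly committed txids above the bound
  bool IsCommitted(int64_t t) const {
    return t < all_committed_before || extra.count(t) > 0;
  }
};

int main() {
  Snapshot phase1{2, {4}};         // phase 1: T < 2 or T in (4)
  Snapshot phase2{2, {4, 2}};      // phase 2: T < 2 or T in (4, 2)

  // txid 2 = UPDATE, txid 4 = DELETE (txids renumbered as in the description).
  std::cout << "phase 1 sees DELETE(4): " << phase1.IsCommitted(4)      // 1
            << ", UPDATE(2): " << phase1.IsCommitted(2) << "\n";        // 0
  std::cout << "phase 2 sees UPDATE(2): " << phase2.IsCommitted(2) << "\n";
  // Phase 1 therefore flushes only the DELETE; the UPDATE is only picked up
  // by ReupdateMissedDeltas, landing in a later REDO file than the DELETE.
  return 0;
}
{noformat}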



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
