[
https://issues.apache.org/jira/browse/KUDU-3446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
shenxingwuying updated KUDU-3446:
---------------------------------
Description:
h1. Background
In Kudu, WAL records have two types: 'replicate' and 'commit'. The 'replicate'
records are the Raft log entries; the 'commit' records durably record the opids
that have been applied to the Kudu storage engine.
Currently, ops are applied via 'apply_pool_->Submit()' (i.e. a concurrent
thread pool).
The apply task mainly runs the following statements:
{code:java}
// op_driver.cc (abridged)
apply_pool_->Submit([this]() { this->ApplyTask(); });

void OpDriver::ApplyTask() {
  CommitMsg* commit_msg;
  // Apply the op to the storage engine, then append its COMMIT record to the WAL.
  Status s = op_->Apply(&commit_msg);
  log_->AsyncAppendCommit(*commit_msg, ...
}
{code}
apply_pool_ is a concurrent thread pool, so ApplyTask invocations run
concurrently. As a result, even though the Raft log entries satisfy a
happens-before relationship, the order in which they are applied to the Kudu
storage engine may not.
For example, for the 4 log records of 2 ops, we expect:
replicate 1.1
commit 1.1
replicate 1.2
commit 1.2
or
replicate 1.1
replicate 1.2
commit 1.1
commit 1.2
An incorrect order (IMO) is:
replicate 1.1
replicate 1.2
commit 1.2
commit 1.1
Currently, this order is valid in Kudu: the system allows it, and some test
cases and the bootstrap processing reflect this.
But it means that 1.2 would very likely take effect in the Kudu engine before
1.1, which may not be expected.
The scenario is simple to reproduce when there are enough WriteRequests. I will
write a test for this.
I obtained a case like this:
./bin/kudu wal dump $wal_file | egrep "REPLICATE|COMMIT" | less
1.75939@6812005919066001408 REPLICATE WRITE_OP
1.75940@6812005919066857472 REPLICATE WRITE_OP
1.75941@6812005919067430912 REPLICATE WRITE_OP
COMMIT 1.75939
COMMIT 1.75941
COMMIT 1.75940
1.75942@6812005919193690112 REPLICATE WRITE_OP
COMMIT 1.75942
1.75943@6812005919311241216 REPLICATE WRITE_OP
1.75944@6812005919312207872 REPLICATE WRITE_OP
1.75945@6812005919312932864 REPLICATE WRITE_OP
1.75946@6812005919313645568 REPLICATE WRITE_OP
COMMIT 1.75943
COMMIT 1.75945
COMMIT 1.75944
COMMIT 1.75946
1.75947@6812005919354585088 REPLICATE WRITE_OP
COMMIT 1.75947
1.75948@6812005919430410240 REPLICATE WRITE_OP
1.75949@6812005919431192576 REPLICATE WRITE_OP
1.75950@6812005919431778304 REPLICATE WRITE_OP
COMMIT 1.75948
COMMIT 1.75950
COMMIT 1.75949
We can see the out-of-order COMMITs:
COMMIT 1.75939
COMMIT 1.75941
COMMIT 1.75940
and
COMMIT 1.75943
COMMIT 1.75945
COMMIT 1.75944
and
COMMIT 1.75948
COMMIT 1.75950
COMMIT 1.75949
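The reordering in the dump above can also be detected mechanically. The following is a minimal sketch, not part of Kudu: FindLateCommits is a hypothetical helper, and it assumes the input lines have exactly the shape of the filtered output shown above and that all ops belong to the same term.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not a Kudu API). Scans lines shaped like
// "COMMIT <term>.<index>" and returns the index of every COMMIT that was
// logged after a COMMIT with a higher index. All other lines (e.g. the
// REPLICATE lines) are skipped. Assumes a single term, as in the dump above.
std::vector<int64_t> FindLateCommits(const std::vector<std::string>& lines) {
  std::vector<int64_t> late;
  int64_t max_seen = -1;
  for (const std::string& line : lines) {
    std::istringstream in(line);
    std::string tag, opid;
    if (!(in >> tag >> opid) || tag != "COMMIT") continue;
    const std::size_t dot = opid.find('.');
    if (dot == std::string::npos) continue;  // not a <term>.<index> opid
    const int64_t index = std::stoll(opid.substr(dot + 1));
    if (index < max_seen) {
      late.push_back(index);  // an earlier COMMIT already had a higher index
    } else {
      max_seen = index;
    }
  }
  return late;
}
```

Applied to the first group above (COMMIT 1.75939, 1.75941, 1.75940), this would report index 75940 as the late one.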
h1. Motivation
I think the correct order should satisfy the following invariants:
r: replicate
c: commit
e[i]: the pair of replicate and commit records for index i.
# r(e[i]) < r(e[i+1]): this is Raft's requirement.
# r(e[i]) < c(e[i]): this is obvious.
# c(e[i]) < c(e[i+1]): this should hold for the same reason as 1.
The Raft log is a total order on the server side. The Kudu storage engine is
the state machine, and its applied order should be the same as the Raft log
order. So we should fix the problem.
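To make the three invariants concrete, here is a small sketch of a checker over a sequence of WAL records. The names Record, RecordType and SatisfiesOrderInvariants are made up for illustration and are not Kudu types; the sketch also assumes a single term, so only the index is compared.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical types for illustration only (not Kudu's log entry types).
enum class RecordType { kReplicate, kCommit };

struct Record {
  RecordType type;
  int64_t index;  // op index; the term is assumed constant and is ignored
};

// Returns true if the records, in WAL order, satisfy invariants 1-3.
bool SatisfiesOrderInvariants(const std::vector<Record>& wal) {
  int64_t last_replicate = -1;
  int64_t last_commit = -1;
  std::set<int64_t> replicated;
  for (const Record& r : wal) {
    if (r.type == RecordType::kReplicate) {
      if (r.index <= last_replicate) return false;  // invariant 1
      last_replicate = r.index;
      replicated.insert(r.index);
    } else {
      if (replicated.count(r.index) == 0) return false;  // invariant 2
      if (r.index <= last_commit) return false;          // invariant 3
      last_commit = r.index;
    }
  }
  return true;
}
```

Under this check, the two expected orders from the Background section pass, while the order "replicate 1.1, replicate 1.2, commit 1.2, commit 1.1" fails on invariant 3.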
h1. Solution
I think we should use an 'apply_pool_token_' with SERIAL_MODE, created from
apply_pool_, instead of using 'apply_pool_' directly. If we do this, some test
cases will need to be fixed at the same time.
First, we should discuss what I described above: is it correct?
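As a rough illustration of what serial submission buys, the sketch below uses only the C++ standard library. It does not use Kudu's ThreadPool or token classes; SerialExecutor and ApplySerially are hypothetical names, and the single worker thread merely stands in for a token that runs its tasks one at a time in submission order.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for a serial token (not Kudu's implementation):
// tasks run one at a time, in submission order, on a single worker thread.
class SerialExecutor {
 public:
  SerialExecutor() : worker_([this] { Run(); }) {}

  ~SerialExecutor() {
    {
      std::lock_guard<std::mutex> l(mu_);
      done_ = true;
    }
    cv_.notify_one();
    worker_.join();  // the worker drains the queue before exiting
  }

  void Submit(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> l(mu_);
      tasks_.push(std::move(task));
    }
    cv_.notify_one();
  }

 private:
  void Run() {
    for (;;) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> l(mu_);
        cv_.wait(l, [this] { return done_ || !tasks_.empty(); });
        if (tasks_.empty()) return;  // shutting down and nothing left to run
        task = std::move(tasks_.front());
        tasks_.pop();
      }
      task();  // run outside the lock so Submit() is never blocked by a task
    }
  }

  std::mutex mu_;
  std::condition_variable cv_;
  std::queue<std::function<void()>> tasks_;
  bool done_ = false;
  std::thread worker_;  // declared last: the members above exist before Run() starts
};

// Submits "apply op i" for i = 1..n and returns the order in which the
// simulated COMMITs were recorded.
std::vector<int> ApplySerially(int n) {
  std::vector<int> commit_order;
  {
    SerialExecutor token;
    for (int i = 1; i <= n; ++i) {
      token.Submit([&commit_order, i] { commit_order.push_back(i); });
    }
  }  // the destructor joins the worker, so every task has finished here
  return commit_order;
}
```

With this shape, the recorded commit order always matches the submission order, at the cost of applying one op at a time. Whether that trade-off is acceptable for Kudu is part of what needs discussion.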
> I think we should talk about CommitMsg's order in WAL
> -----------------------------------------------------
>
> Key: KUDU-3446
> URL: https://issues.apache.org/jira/browse/KUDU-3446
> Project: Kudu
> Issue Type: Improvement
> Reporter: shenxingwuying
> Assignee: shenxingwuying
> Priority: Major
>