[ https://issues.apache.org/jira/browse/MESOS-1271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13985868#comment-13985868 ]
Jie Yu commented on MESOS-1271:
-------------------------------
OK, I think we are hitting this TODO in the code:
{noformat}
void ReplicaProcess::write(const WriteRequest& request)
{
  ...
  Result<Action> result = read(request.position());
  ...
  if (result.isError()) {
    ...
  } else if (result.isNone()) {
    ...
  } else if (result.isSome()) {
    ...
    if (request.proposal() < action.promised()) {
      ...
    } else {
      // TODO(benh): Check if this position has already been learned,
      // and if so, check that we are re-writing the same value!
      //
      // TODO(jieyu): Interestingly, in the presence of truncations,
      // we may encounter a situation where this position has already
      // been learned, but we are re-writing a different value. For
      // example, assume that there are 5 replicas (R1 ~ R5). First,
      // an append operation has been agreed at position 5 by R1, R2,
      // R3 and R4, but only R1 receives a learned message. Later, a
      // truncate operation has been agreed at position 10 by R1, R2
      // and R3, but only R1 receives a learned message. Now, a leader
      // failover happens and R5 is filled with a NOP at position 5
      // because its coordinator receives a learned NOP at position 5
      // from R1 (because of its learned truncation at position 10).
      // Now, another leader failover happens and R4's coordinator
      // tries to fill position 5. However, it is only able to contact
      // R2, R3 and R4 during the explicit promise phase. As a result,
      // it will try to write an append operation at position 5 to R5
      // while R5 currently has a learned NOP stored at position 5.
    }
  }
}
{noformat}
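To make the case in the TODO concrete, here is a minimal sketch of the check it asks for. It is not the Mesos implementation: `Action`, `WriteRequest`, `Verdict` and `classify` are simplified, hypothetical stand-ins that only borrow the `position`, `proposal` and `promised` names from the snippet above, and the `learned` flag and payload comparison are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical, simplified stand-ins; not the real Mesos log types.
enum class Type { NOP, APPEND, TRUNCATE };

struct Action {
  std::uint64_t position;
  std::uint64_t promised;  // Highest proposal number promised at this position.
  bool learned;            // True once a learned message has been persisted.
  Type type;
  std::string bytes;       // Payload for APPEND; empty otherwise.
};

struct WriteRequest {
  std::uint64_t position;
  std::uint64_t proposal;
  Type type;
  std::string bytes;
};

enum class Verdict {
  REJECTED,                // The proposal is older than what was promised.
  ACCEPTED,                // Nothing learned yet; an ordinary write.
  REWRITE_SAME_VALUE,      // Learned, and the request carries the same value.
  REWRITE_DIFFERENT_VALUE  // Learned, but the request carries another value.
};

// Classifies a write request against the action already stored at the
// same position. The last verdict is the situation the TODO describes.
Verdict classify(const Action& stored, const WriteRequest& request)
{
  assert(stored.position == request.position);

  if (request.proposal < stored.promised) {
    return Verdict::REJECTED;
  }

  if (!stored.learned) {
    return Verdict::ACCEPTED;
  }

  const bool same =
    stored.type == request.type && stored.bytes == request.bytes;

  return same ? Verdict::REWRITE_SAME_VALUE
              : Verdict::REWRITE_DIFFERENT_VALUE;
}
```

In the scenario from the TODO, R5 would hold something like `Action{5, 2, true, Type::NOP, ""}` and R4's coordinator would send something like `WriteRequest{5, 3, Type::APPEND, "data"}` (the proposal numbers and payload are made up). `classify` then returns `Verdict::REWRITE_DIFFERENT_VALUE`, so a plain CHECK that a learned position is only ever re-written with the same value would fire here even though no replica misbehaved.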
> CHECK failure in replica.
> -------------------------
>
> Key: MESOS-1271
> URL: https://issues.apache.org/jira/browse/MESOS-1271
> Project: Mesos
> Issue Type: Bug
> Components: replicated log
> Affects Versions: 0.19.0
> Reporter: Benjamin Mahler
> Assignee: Jie Yu
> Fix For: 0.19.0
>
>
> {noformat}
> I0430 02:26:34.484668 45920 registrar.cpp:427] Successfully updated 'registry'
> I0430 02:26:34.558326 45920 registrar.cpp:379] Attempting to update the
> 'registry'
> I0430 02:26:34.658385 45910 log.cpp:680] Attempting to append 719410 bytes to
> the log
> I0430 02:26:34.658483 45910 coordinator.cpp:339] Coordinator attempting to
> write APPEND action at position 6129
> I0430 02:26:34.666218 45925 replica.cpp:508] Replica received write request
> for position 6129
> I0430 02:26:34.672080 45925 leveldb.cpp:341] Persisting action (719433 bytes)
> to leveldb took 5.610781ms
> I0430 02:26:34.672142 45925 replica.cpp:664] Persisted action at 6129
> I0430 02:26:34.682070 45913 replica.cpp:643] Replica received learned notice
> for position 6129
> I0430 02:26:34.687636 45913 leveldb.cpp:341] Persisting action (719435 bytes)
> to leveldb took 5.50696ms
> I0430 02:26:34.687713 45913 replica.cpp:664] Persisted action at 6129
> I0430 02:26:34.687729 45913 replica.cpp:649] Replica learned APPEND action at
> position 6129
> I0430 02:26:34.688134 45912 log.cpp:699] Attempting to truncate the log to
> 6129
> I0430 02:26:34.688251 45911 coordinator.cpp:339] Coordinator attempting to
> write TRUNCATE action at position 6130
> I0430 02:26:34.689167 45911 replica.cpp:508] Replica received write request
> for position 6130
> I0430 02:26:34.689728 45911 leveldb.cpp:341] Persisting action (18 bytes) to
> leveldb took 529731ns
> I0430 02:26:34.689746 45911 replica.cpp:664] Persisted action at 6130
> I0430 02:26:34.701628 45919 replica.cpp:643] Replica received learned notice
> for position 6130
> I0430 02:26:34.702505 45919 leveldb.cpp:341] Persisting action (20 bytes) to
> leveldb took 762510ns
> I0430 02:26:34.702551 45919 leveldb.cpp:399] Deleting ~2 keys from leveldb
> took 20442ns
> I0430 02:26:34.702568 45919 replica.cpp:664] Persisted action at 6130
> I0430 02:26:34.702590 45919 replica.cpp:649] Replica learned TRUNCATE action
> at position 6130
> I0430 02:26:35.163915 45920 registrar.cpp:427] Successfully updated 'registry'
> I0430 02:26:35.246116 45920 registrar.cpp:379] Attempting to update the
> 'registry'
> I0430 02:26:35.348455 45910 log.cpp:680] Attempting to append 718498 bytes to
> the log
> I0430 02:26:35.350102 45906 coordinator.cpp:339] Coordinator attempting to
> write APPEND action at position 6131
> I0430 02:26:35.378063 45908 replica.cpp:643] Replica received learned notice
> for position 6131
> I0430 02:26:35.383350 45908 leveldb.cpp:341] Persisting action (718523 bytes)
> to leveldb took 5.173633ms
> I0430 02:26:35.383401 45908 replica.cpp:664] Persisted action at 6131
> I0430 02:26:35.383414 45908 replica.cpp:649] Replica learned APPEND action at
> position 6131
> I0430 02:26:35.383997 45923 replica.cpp:508] Replica received write request
> for position 6131
> I0430 02:26:35.384345 45923 leveldb.cpp:436] Reading position from leveldb
> took 308357ns
> I0430 02:26:35.389766 45923 leveldb.cpp:341] Persisting action (718521 bytes)
> to leveldb took 5.054618ms
> I0430 02:26:35.389829 45923 replica.cpp:664] Persisted action at 6131
> F0430 02:26:35.393795 45903 coordinator.cpp:399] Check failed: !missing Not
> expecting local replica to be missing position 6131 after the writing is done
> *** Check failure stack trace: ***
> @ 0x7f9081eed5fd google::LogMessage::Fail()
> @ 0x7f9081eef444 google::LogMessage::SendToLog()
> @ 0x7f9081eed1ec google::LogMessage::Flush()
> @ 0x7f9081eefd39 google::LogMessageFatal::~LogMessageFatal()
> @ 0x7f9081d06c52
> mesos::internal::log::CoordinatorProcess::updateIndexAfterWritten()
> @ 0x7f9081d18b63
> _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchI6OptionImEN5mesos8internal3log18CoordinatorProcessEbbEENS0_6FutureIT_EERKNS0_3PIDIT0_EEMSF_FSD_T1_ET2_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
> @ 0x7f9081e231c2 process::ProcessManager::resume()
> @ 0x7f9081e234bc process::schedule()
> @ 0x7f908139783d start_thread
> @ 0x7f90800ff26d clone
> {noformat}