[jira] [Commented] (MESOS-1271) CHECK failure in replica.

Jie Yu (JIRA) Wed, 30 Apr 2014 11:22:44 -0700

    [ 
https://issues.apache.org/jira/browse/MESOS-1271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13985868#comment-13985868
 ]


Jie Yu commented on MESOS-1271:
-------------------------------

OK, I think we are hitting this TODO in the code:

{noformat}
void ReplicaProcess::write(const WriteRequest& request)
{
  ...
  Result<Action> result = read(request.position());
  ...
  if (result.isError()) {
    ...
  } else if (result.isNone()) {
    ...
  } else if (result.isSome()) {
    ...
    if (request.proposal() < action.promised()) {
       ...
    } else {
       // TODO(benh): Check if this position has already been learned,
       // and if so, check that we are re-writing the same value!
       //
       // TODO(jieyu): Interestingly, in the presence of truncations,
       // we may encounter a situation where this position has already
       // been learned, but we are re-writing a different value. For
       // example, assume that there are 5 replicas (R1 ~ R5). First,
       // an append operation has been agreed at position 5 by R1, R2,
       // R3 and R4, but only R1 receives a learned message. Later, a
       // truncate operation has been agreed at position 10 by R1, R2
       // and R3, but only R1 receives a learned message. Now, a leader
       // failover happens and R5 is filled with a NOP at position 5
       // because its coordinator receives a learned NOP at position 5
       // from R1 (because of its learned truncation at position 10).
       // Now, another leader failover happens and R4's coordinator
       // tries to fill position 5. However, it is only able to contact
       // R2, R3 and R4 during the explicit promise phase. As a result,
       // it will try to write an append operation at position 5 to R5
       // while R5 currently have a learned NOP stored at position 5.
    }
  }
}
{noformat}

> CHECK failure in replica.
> -------------------------
>
>                 Key: MESOS-1271
>                 URL: https://issues.apache.org/jira/browse/MESOS-1271
>             Project: Mesos
>          Issue Type: Bug
>          Components: replicated log
>    Affects Versions: 0.19.0
>            Reporter: Benjamin Mahler
>            Assignee: Jie Yu
>             Fix For: 0.19.0
>
>
> {noformat}
> I0430 02:26:34.484668 45920 registrar.cpp:427] Successfully updated 'registry'
> I0430 02:26:34.558326 45920 registrar.cpp:379] Attempting to update the 
> 'registry'
> I0430 02:26:34.658385 45910 log.cpp:680] Attempting to append 719410 bytes to 
> the log
> I0430 02:26:34.658483 45910 coordinator.cpp:339] Coordinator attempting to 
> write APPEND action at position 6129
> I0430 02:26:34.666218 45925 replica.cpp:508] Replica received write request 
> for position 6129
> I0430 02:26:34.672080 45925 leveldb.cpp:341] Persisting action (719433 bytes) 
> to leveldb took 5.610781ms
> I0430 02:26:34.672142 45925 replica.cpp:664] Persisted action at 6129
> I0430 02:26:34.682070 45913 replica.cpp:643] Replica received learned notice 
> for position 6129
> I0430 02:26:34.687636 45913 leveldb.cpp:341] Persisting action (719435 bytes) 
> to leveldb took 5.50696ms
> I0430 02:26:34.687713 45913 replica.cpp:664] Persisted action at 6129
> I0430 02:26:34.687729 45913 replica.cpp:649] Replica learned APPEND action at 
> position 6129
> I0430 02:26:34.688134 45912 log.cpp:699] Attempting to truncate the log to 
> 6129
> I0430 02:26:34.688251 45911 coordinator.cpp:339] Coordinator attempting to 
> write TRUNCATE action at position 6130
> I0430 02:26:34.689167 45911 replica.cpp:508] Replica received write request 
> for position 6130
> I0430 02:26:34.689728 45911 leveldb.cpp:341] Persisting action (18 bytes) to 
> leveldb took 529731ns
> I0430 02:26:34.689746 45911 replica.cpp:664] Persisted action at 6130
> I0430 02:26:34.701628 45919 replica.cpp:643] Replica received learned notice 
> for position 6130
> I0430 02:26:34.702505 45919 leveldb.cpp:341] Persisting action (20 bytes) to 
> leveldb took 762510ns
> I0430 02:26:34.702551 45919 leveldb.cpp:399] Deleting ~2 keys from leveldb 
> took 20442ns
> I0430 02:26:34.702568 45919 replica.cpp:664] Persisted action at 6130
> I0430 02:26:34.702590 45919 replica.cpp:649] Replica learned TRUNCATE action 
> at position 6130
> I0430 02:26:35.163915 45920 registrar.cpp:427] Successfully updated 'registry'
> I0430 02:26:35.246116 45920 registrar.cpp:379] Attempting to update the 
> 'registry'
> I0430 02:26:35.348455 45910 log.cpp:680] Attempting to append 718498 bytes to 
> the log
> I0430 02:26:35.350102 45906 coordinator.cpp:339] Coordinator attempting to 
> write APPEND action at position 6131
> I0430 02:26:35.378063 45908 replica.cpp:643] Replica received learned notice 
> for position 6131
> I0430 02:26:35.383350 45908 leveldb.cpp:341] Persisting action (718523 bytes) 
> to leveldb took 5.173633ms
> I0430 02:26:35.383401 45908 replica.cpp:664] Persisted action at 6131
> I0430 02:26:35.383414 45908 replica.cpp:649] Replica learned APPEND action at 
> position 6131
> I0430 02:26:35.383997 45923 replica.cpp:508] Replica received write request 
> for position 6131
> I0430 02:26:35.384345 45923 leveldb.cpp:436] Reading position from leveldb 
> took 308357ns
> I0430 02:26:35.389766 45923 leveldb.cpp:341] Persisting action (718521 bytes) 
> to leveldb took 5.054618ms
> I0430 02:26:35.389829 45923 replica.cpp:664] Persisted action at 6131
> F0430 02:26:35.393795 45903 coordinator.cpp:399] Check failed: !missing Not 
> expecting local replica to be missing position 6131 after the writing is done
> *** Check failure stack trace: ***
>     @     0x7f9081eed5fd  google::LogMessage::Fail()
>     @     0x7f9081eef444  google::LogMessage::SendToLog()
>     @     0x7f9081eed1ec  google::LogMessage::Flush()
>     @     0x7f9081eefd39  google::LogMessageFatal::~LogMessageFatal()
>     @     0x7f9081d06c52  
> mesos::internal::log::CoordinatorProcess::updateIndexAfterWritten()
>     @     0x7f9081d18b63  
> _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchI6OptionImEN5mesos8internal3log18CoordinatorProcessEbbEENS0_6FutureIT_EERKNS0_3PIDIT0_EEMSF_FSD_T1_ET2_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
>     @     0x7f9081e231c2  process::ProcessManager::resume()
>     @     0x7f9081e234bc  process::schedule()
>     @     0x7f908139783d  start_thread
>     @     0x7f90800ff26d  clone
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (MESOS-1271) CHECK failure in replica.

Reply via email to