[ 
https://issues.apache.org/jira/browse/HDDS-230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16565058#comment-16565058
 ] 

Mukul Kumar Singh commented on HDDS-230:
----------------------------------------

I looked into this issue and was able to reproduce it on a cluster.

On the leader:
{code}
2018-07-05 18:09:35,474 [grpc-default-executor-10] INFO       - adding chunk:8f10dd8e0e8a4fa236ffb1ec1f40bdc2_stream_35d91bc0-6b33-485d-bce6-19a96557180c_chunk_1 for container:14
{code}

On follower 1:
{code}
2018-07-05 18:09:35,575 [grpc-default-executor-3] INFO       - adding chunk:8f10dd8e0e8a4fa236ffb1ec1f40bdc2_stream_35d91bc0-6b33-485d-bce6-19a96557180c_chunk_1 for container:14
{code}

On follower 2, which went into a stop-the-world GC pause before this transaction:
{code}
2018-07-05 14:10:01,606 [StateMachineUpdater-40356aa1-741f-499c-aad1-b500f2620a3d_9858] INFO       - removing chunk:8f10dd8e0e8a4fa236ffb1ec1f40bdc2_stream_35d91bc0-6b33-485d-bce6-19a96557180c_chunk_1 for container:14
{code}

This is the case where a transaction was committed on the leader and one 
follower, after which the leader discarded its cached state machine data. A 
follower that catches up afterwards then requests new append entries whose 
state machine data has already been discarded.
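
A minimal sketch of that sequence in plain Java. The names here 
(LeaderCacheSketch, append, onMajorityCommit, dataFor) are hypothetical and are 
not Ratis or Ozone code; the sketch only models a leader that caches state 
machine data by log index and drops it once a majority has committed.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical model of a leader-side cache of state machine data; not Ratis code. */
class LeaderCacheSketch {
    // State machine data (e.g. chunk bytes), kept only until a majority commits.
    private final Map<Long, byte[]> cache = new HashMap<>();

    /** The leader appends an entry and caches its state machine data. */
    void append(long logIndex, byte[] data) {
        cache.put(logIndex, data);
    }

    /** The leader and one follower have committed, so the cached data is dropped. */
    void onMajorityCommit(long logIndex) {
        cache.remove(logIndex);
    }

    /** Data to attach to an append-entries request for a lagging follower. */
    Optional<byte[]> dataFor(long logIndex) {
        return Optional.ofNullable(cache.get(logIndex));
    }

    public static void main(String[] args) {
        LeaderCacheSketch leader = new LeaderCacheSketch();
        leader.append(1L, new byte[] {1, 2, 3});
        leader.onMajorityCommit(1L);
        // The follower that was paused now asks for index 1.
        System.out.println("data present: " + leader.dataFor(1L).isPresent());
    }
}
```

In this model the lookup for the late follower comes back empty, which 
corresponds to an entry reaching the follower without its state machine data.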

This issue has been fixed in Ratis by RATIS-281, where the state machine 
provides an API called readStateMachineData, through which it can supply the 
stateMachineData that is missing inside Ratis.

This JIRA proposes to fix the issue by changing ContainerStateMachine to 
provide the state machine data to the Ratis leader.
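
The proposed change could look roughly like the sketch below. It is only an 
illustration under assumptions: ChunkStore, StateMachineDataSource and readData 
are hypothetical names, not the Ratis or ContainerStateMachine API, and the 
actual readStateMachineData signature is not reproduced here. The idea shown is 
a cache lookup that falls back to reading the already written chunk from the 
state machine's own storage.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for the datanode's on-disk chunk storage. */
interface ChunkStore {
    byte[] readChunk(long logIndex);
}

/** Hypothetical sketch of a state machine that can re-supply discarded data. */
class StateMachineDataSource {
    private final Map<Long, byte[]> cache = new ConcurrentHashMap<>();
    private final ChunkStore store;

    StateMachineDataSource(ChunkStore store) {
        this.store = store;
    }

    void cache(long logIndex, byte[] data) {
        cache.put(logIndex, data);
    }

    void evict(long logIndex) {
        cache.remove(logIndex);
    }

    /** Return cached data if present; otherwise read it back from storage. */
    CompletableFuture<byte[]> readData(long logIndex) {
        byte[] cached = cache.get(logIndex);
        if (cached != null) {
            return CompletableFuture.completedFuture(cached);
        }
        // Cache miss: the data was discarded, so load it from the chunk store.
        return CompletableFuture.supplyAsync(() -> store.readChunk(logIndex));
    }
}
```

With a fallback of this kind, a leader whose cache entry is gone can still 
attach the data when it sends append entries to a follower that fell behind.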


> Ozone Datanode exits during data write through Ratis
> ----------------------------------------------------
>
>                 Key: HDDS-230
>                 URL: https://issues.apache.org/jira/browse/HDDS-230
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>          Components: Ozone Datanode
>    Affects Versions: 0.2.1
>            Reporter: Mukul Kumar Singh
>            Assignee: Mukul Kumar Singh
>            Priority: Critical
>             Fix For: 0.2.1
>
>
> Ozone datanode exits during data write with the following exception.
> {code}
> 2018-07-05 14:10:01,605 INFO org.apache.ratis.server.storage.RaftLogWorker: Rolling segment:40356aa1-741f-499c-aad1-b500f2620a3d_9858-RaftLogWorker index to:4565
> 2018-07-05 14:10:01,607 ERROR org.apache.ratis.server.impl.StateMachineUpdater: Terminating with exit status 2: StateMachineUpdater-40356aa1-741f-499c-aad1-b500f2620a3d_9858: the StateMachineUpdater hits Throwable
> java.lang.NullPointerException
>         at org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.applyTransaction(ContainerStateMachine.java:272)
>         at org.apache.ratis.server.impl.RaftServerImpl.applyLogToStateMachine(RaftServerImpl.java:1058)
>         at org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:154)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
> This might be the result of a Ratis transaction that was not written 
> through the "writeStateMachineData" phase but was still added to the Raft 
> log. This implies that StateMachineUpdater now applies a transaction without 
> the corresponding entry having been added to the stateMachine.
> I am raising this JIRA to track the issue and will also raise a Ratis JIRA if 
> required.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
