[
https://issues.apache.org/jira/browse/HDDS-230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16565058#comment-16565058
]
Mukul Kumar Singh commented on HDDS-230:
----------------------------------------
I looked into this issue and was able to reproduce it on a cluster.
On the leader:
{code}
2018-07-05 18:09:35,474 [grpc-default-executor-10] INFO - adding
chunk:8f10dd8e0e8a4fa236ffb1ec1f40bdc2_stream_35d91bc0-6b33-485d-bce6-19a96557180c_chunk_1
for container:14
{code}
On follower 1:
{code}
2018-07-05 18:09:35,575 [grpc-default-executor-3] INFO - adding
chunk:8f10dd8e0e8a4fa236ffb1ec1f40bdc2_stream_35d91bc0-6b33-485d-bce6-19a96557180c_chunk_1
for container:14
{code}
On follower 2, which went into a stop-the-world GC pause before this transaction:
{code}
2018-07-05 14:10:01,606
[StateMachineUpdater-40356aa1-741f-499c-aad1-b500f2620a3d_9858] INFO -
removing
chunk:8f10dd8e0e8a4fa236ffb1ec1f40bdc2_stream_35d91bc0-6b33-485d-bce6-19a96557180c_chunk_1
for container:14
{code}
This is the case where a transaction was committed on the leader and one
follower, after which the leader discarded its cached state machine data. The
follower that catches up afterwards (follower 2 here) then requests the append
entries it missed, but the state machine data for those entries has already
been discarded.
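To make the sequence concrete, here is a minimal sketch of the failure. It is
a toy model, not Ratis or Ozone code: the class DiscardedCacheSketch and all of
its methods are hypothetical names, and the cache is assumed to be a plain map
keyed by log index.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the failure; none of these names exist in Ratis or Ozone.
class DiscardedCacheSketch {
    // Assumed leader-side cache of chunk data, keyed by raft log index.
    static final Map<Long, byte[]> leaderCache = new HashMap<>();

    // The leader appends an entry and caches the chunk data that goes with it.
    static void append(long index, byte[] chunk) {
        leaderCache.put(index, chunk);
    }

    // Once a majority (leader plus one follower) has committed, the data is dropped.
    static void onMajorityCommit(long index) {
        leaderCache.remove(index);
    }

    // Data the leader can ship to a follower; null once the cache entry is gone.
    static byte[] dataForFollower(long index) {
        return leaderCache.get(index);
    }

    // Apply step that assumes the data is present, as the failing code path did.
    static int apply(byte[] chunk) {
        return chunk.length; // throws NullPointerException when chunk is null
    }

    public static void main(String[] args) {
        append(1L, new byte[] {1, 2, 3});
        onMajorityCommit(1L);
        try {
            // The lagging follower catches up only after the data was discarded.
            apply(dataForFollower(1L));
        } catch (NullPointerException e) {
            System.out.println("lagging follower hit NPE while applying index 1");
        }
    }
}
```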
This issue has been fixed in Ratis by RATIS-281, which adds a state machine
API called readStateMachineData. Through it, the state machine can supply the
stateMachineData that is missing inside Ratis.
This jira proposes to fix the issue with changes in ContainerStateMachine to
provide the state machine data to the Ratis leader.
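A rough sketch of the read-back idea follows. It does not use the Ratis
interfaces: the class ReadBackSketch, the two maps and the method signature are
assumptions made for illustration, and only the name readStateMachineData is
taken from RATIS-281. The point is that a cache miss falls back to the data
the write phase already persisted, instead of returning nothing.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; names and signatures are illustrative, not the Ratis API.
class ReadBackSketch {
    // Stand-in for chunk data persisted by the write phase (assumed durable).
    static final Map<Long, byte[]> durableChunks = new ConcurrentHashMap<>();
    // Stand-in for the leader's in-memory cache, which may be evicted.
    static final Map<Long, byte[]> cache = new ConcurrentHashMap<>();

    // Write phase: persist the chunk and keep a cached copy.
    static void write(long index, byte[] chunk) {
        durableChunks.put(index, chunk);
        cache.put(index, chunk);
    }

    // Cache eviction after the entry has been committed.
    static void evict(long index) {
        cache.remove(index);
    }

    // Read-back hook: serve from the cache, else from the persisted copy.
    static CompletableFuture<byte[]> readStateMachineData(long index) {
        byte[] cached = cache.get(index);
        if (cached != null) {
            return CompletableFuture.completedFuture(cached);
        }
        byte[] stored = durableChunks.get(index);
        if (stored == null) {
            // Fail loudly instead of handing a null payload to the caller.
            CompletableFuture<byte[]> failed = new CompletableFuture<>();
            failed.completeExceptionally(
                new IllegalStateException("no state machine data for index " + index));
            return failed;
        }
        return CompletableFuture.completedFuture(stored);
    }

    public static void main(String[] args) {
        write(1L, new byte[] {1, 2, 3});
        evict(1L);
        byte[] data = readStateMachineData(1L).join();
        System.out.println("recovered " + data.length + " bytes for index 1");
    }
}
```

With a hook of this shape, the leader can rebuild a complete append entry for
a lagging follower even after its cache entry is gone.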
> Ozone Datanode exits during data write through Ratis
> ----------------------------------------------------
>
> Key: HDDS-230
> URL: https://issues.apache.org/jira/browse/HDDS-230
> Project: Hadoop Distributed Data Store
> Issue Type: Bug
> Components: Ozone Datanode
> Affects Versions: 0.2.1
> Reporter: Mukul Kumar Singh
> Assignee: Mukul Kumar Singh
> Priority: Critical
> Fix For: 0.2.1
>
>
> Ozone datanode exits during data write with the following exception.
> {code}
> 2018-07-05 14:10:01,605 INFO org.apache.ratis.server.storage.RaftLogWorker:
> Rolling segment:40356aa1-741f-499c-aad1-b500f2620a3d_9858-RaftLogWorker index
> to:4565
> 2018-07-05 14:10:01,607 ERROR
> org.apache.ratis.server.impl.StateMachineUpdater: Terminating with exit
> status 2: StateMachineUpdater-40356aa1-741f-499c-aad1-b500f2620a3d_9858: the
> StateMachineUpdater hits Throwable
> java.lang.NullPointerException
> at
> org.apache.hadoop.ozone.container.common.transport.server.ratis.ContainerStateMachine.applyTransaction(ContainerStateMachine.java:272)
> at
> org.apache.ratis.server.impl.RaftServerImpl.applyLogToStateMachine(RaftServerImpl.java:1058)
> at
> org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:154)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> This might be the result of a Ratis transaction that was not written
> through the "writeStateMachineData" phase but was still added to the raft
> log. This implies that StateMachineUpdater now applies a transaction without
> the corresponding entry having been added to the state machine.
> I am raising this jira to track the issue and will also raise a Ratis jira if
> required.