[
https://issues.apache.org/jira/browse/HDDS-451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16614645#comment-16614645
]
Shashikant Banerjee commented on HDDS-451:
------------------------------------------
>From ozone.log on 172.27.19.198
{code:java}
2018-09-13 12:11:52,396 [pool-3-thread-2] DEBUG (ChunkManagerImpl.java:85) -
writing
chunk:bd80b58a5eba888200a4832a0f2aafb3_stream_5f3b2505-6964-45c9-a7ad-827388a1e6a0_chunk_2
chunk stage:WRITE_DATA chunk
file:/tmp/hadoop-root/dfs/data/hdds/de0a9e01-4a12-40e3-b567-51b9bd83248e/current/containerDir0/16/chunks/bd80b58a5eba888200a4832a0f2aafb3_stream_5f3b2505-6964-45c9-a7ad-827388a1e6a0_chunk_2
tmp chunk file
2018-09-13 12:11:56,433 [pool-3-thread-3] DEBUG (ChunkManagerImpl.java:85) -
writing
chunk:bd80b58a5eba888200a4832a0f2aafb3_stream_5f3b2505-6964-45c9-a7ad-827388a1e6a0_chunk_2
chunk stage:WRITE_DATA chunk
file:/tmp/hadoop-root/dfs/data/hdds/de0a9e01-4a12-40e3-b567-51b9bd83248e/current/containerDir0/16/chunks/bd80b58a5eba888200a4832a0f2aafb3_stream_5f3b2505-6964-45c9-a7ad-827388a1e6a0_chunk_2
tmp chunk file
2018-09-13 12:11:59,226 [ForkJoinPool.commonPool-worker-4] DEBUG
(ChunkManagerImpl.java:85) - writing
chunk:bd80b58a5eba888200a4832a0f2aafb3_stream_5f3b2505-6964-45c9-a7ad-827388a1e6a0_chunk_2
chunk stage:COMMIT_DATA chunk
file:/tmp/hadoop-root/dfs/data/hdds/de0a9e01-4a12-40e3-b567-51b9bd83248e/current/containerDir0/16/chunks/bd80b58a5eba888200a4832a0f2aafb3_stream_5f3b2505-6964-45c9-a7ad-827388a1e6a0_chunk_2
tmp chunk file (Called in ApplyTransaction in Ratis)
2018-09-13 12:12:59,338 [pool-3-thread-4] ERROR (ChunkUtils.java:244) -
Rejecting write chunk request. Chunk overwrite without explicit request.
ChunkInfo{chunkName='bd80b58a5eba888200a4832a0f2aafb3_stream_5f3b2505-6964-45c9-a7ad-827388a1e6a0_chunk_2,
offset=0, len=16777216} -> (Same request retried the transaction){code}
Committing writeChunk happens during applyTransaction call in Ratis. Since ,
commit chunk call is happening twice It affirms that there two raft log entries
for the same client request.
During the same time, the leader relection is triggered. On the
leader(172.27.19.198), the lo is applied first and sent to followers which time
out. The client request will fail and Ratis client will retry the request to
the new leader which is 172.27.27.132. There might be a race condition , where
the new Ratis Leader will now apply its log and send it back to node
172.27.19.198 which will make an appendLog entry for the same request as a
result of which we have two log entries for the same request.
{code:java}
2018-09-13 12:11:52,405 INFO org.apache.ratis.server.impl.RaftServerImpl:
c3a12636-4f91-4865-8f0f-954db2f3e440: change Leader from
c3a12636-4f91-4865-8f0f-954db2f3e440 to null at term 53 for updateCurrentTerm
2018-09-13 12:11:52,405 INFO org.apache.ratis.server.impl.RaftServerImpl:
c3a12636-4f91-4865-8f0f-954db2f3e440 changes role from
org.apache.ratis.server.impl.RoleInfo@444b5719 to FOLLOWER at term 53 for
changeToFollower
2018-09-13 12:11:52,405 INFO org.apache.ratis.server.impl.PendingRequests:
c3a12636-4f91-4865-8f0f-954db2f3e440 sends responses before shutting down
PendingRequestsHandler
2018-09-13 12:11:52,406 WARN org.apache.ratis.server.impl.LogAppender:
GRpcLogAppender(c3a12636-4f91-4865-8f0f-954db2f3e440 ->
08cc013f-157b-4db2-8738-54b2d6e709e0): Wait interrupted by
java.lang.InterruptedException
2018-09-13 12:11:52,406 WARN org.apache.ratis.server.impl.LogAppender:
GRpcLogAppender(c3a12636-4f91-4865-8f0f-954db2f3e440 ->
4f5a4f28-4cb2-45bb-b29d-a0cf8a02582b): Wait interrupted by
java.lang.InterruptedException
2018-09-13 12:11:52,408 INFO org.apache.ratis.server.impl.RaftServerImpl:
c3a12636-4f91-4865-8f0f-954db2f3e440: change Leader from null to
4f5a4f28-4cb2-45bb-b29d-a0cf8a02582b at term 53 for appendEntries, leader
elected after 2ms
2018-09-13 12:11:52,408 INFO org.apache.ratis.server.impl.RaftServerImpl:
c3a12636-4f91-4865-8f0f-954db2f3e440-org.apache.ratis.server.impl.RoleInfo@444b5719:
Withhold vote from candidate 4f5a4f28-4cb2-45bb-b29d-a0cf8a02582b with term
53. State: leader=4f5a4f28-4cb2-45bb-b29d-a0cf8a02582b, term=53,
lastRpcElapsed=0ms
2018-09-13 12:11:53,100 INFO org.apache.ratis.server.impl.LogAppender:
c3a12636-4f91-4865-8f0f-954db2f3e440 stops appending log entries to follower
4f5a4f28-4cb2-45bb-b29d-a0cf8a02582b(next=1381, match=1379, attendVote=true,
lastRpcSendTime=710, lastRpcResponseTime=11)
2018-09-13 12:11:53,195 INFO org.apache.ratis.server.impl.LogAppender:
c3a12636-4f91-4865-8f0f-954db2f3e440 stops appending log entries to follower
08cc013f-157b-4db2-8738-54b2d6e709e0(next=1381, match=1379, attendVote=true,
lastRpcSendTime=804, lastRpcResponseTime=9)
2018-09-13 12:11:55,390 WARN org.apache.ratis.server.impl.LogAppender:
GRpcLogAppender(c3a12636-4f91-4865-8f0f-954db2f3e440 ->
4f5a4f28-4cb2-45bb-b29d-a0cf8a02582b): appendEntries Timeout,
request=c3a12636-4f91-4865-8f0f-954db2f3e440->4f5a4f28-4cb2-45bb-b29d-a0cf8a02582b#1285
2018-09-13 12:11:55,390 WARN org.apache.ratis.server.impl.LogAppender:
GRpcLogAppender(c3a12636-4f91-4865-8f0f-954db2f3e440 ->
08cc013f-157b-4db2-8738-54b2d6e709e0): appendEntries Timeout,
request=c3a12636-4f91-4865-8f0f-954db2f3e440->08cc013f-157b-4db2-8738-54b2d6e709e0#1298
2018-09-13 12:11:56,857 INFO org.apache.ratis.server.impl.RaftServerImpl:
c3a12636-4f91-4865-8f0f-954db2f3e440: change Leader from
4f5a4f28-4cb2-45bb-b29d-a0cf8a02582b to null at term 54 for updateCurrentTerm
2018-09-13 12:11:56,873 INFO
org.apache.ratis.grpc.server.RaftServerProtocolService:
c3a12636-4f91-4865-8f0f-954db2f3e440: appendEntries completed{code}
This needs some more investigation.
> PutKey failed due to error "Rejecting write chunk request. Chunk overwrite
> without explicit request"
> ----------------------------------------------------------------------------------------------------
>
> Key: HDDS-451
> URL: https://issues.apache.org/jira/browse/HDDS-451
> Project: Hadoop Distributed Data Store
> Issue Type: Bug
> Components: Ozone Client
> Affects Versions: 0.2.1
> Reporter: Nilotpal Nandi
> Assignee: Shashikant Banerjee
> Priority: Blocker
> Attachments: all-node-ozone-logs-1536841590.tar.gz
>
>
> steps taken :
> ------------------
> # Ran Put Key command to write 50GB data. Put Key client operation failed
> after 17 mins.
> error seen ozone.log :
> ------------------------------------
>
> {noformat}
> 2018-09-13 12:11:53,734 [ForkJoinPool.commonPool-worker-20] DEBUG
> (ChunkManagerImpl.java:85) - writing
> chunk:bd80b58a5eba888200a4832a0f2aafb3_stream_5f3b2505-6964-45c9-a7ad-827388a1e6a0_chunk_1
> chunk stage:COMMIT_DATA chunk
> file:/tmp/hadoop-root/dfs/data/hdds/de0a9e01-4a12-40e3-b567-51b9bd83248e/current/containerDir0/16/chunks/bd80b58a5eba888200a4832a0f2aafb3_stream_5f3b2505-6964-45c9-a7ad-827388a1e6a0_chunk_1
> tmp chunk file
> 2018-09-13 12:11:56,576 [pool-3-thread-60] DEBUG (ChunkManagerImpl.java:85) -
> writing
> chunk:bd80b58a5eba888200a4832a0f2aafb3_stream_5f3b2505-6964-45c9-a7ad-827388a1e6a0_chunk_2
> chunk stage:WRITE_DATA chunk
> file:/tmp/hadoop-root/dfs/data/hdds/de0a9e01-4a12-40e3-b567-51b9bd83248e/current/containerDir0/16/chunks/bd80b58a5eba888200a4832a0f2aafb3_stream_5f3b2505-6964-45c9-a7ad-827388a1e6a0_chunk_2
> tmp chunk file
> 2018-09-13 12:11:56,739 [ForkJoinPool.commonPool-worker-20] DEBUG
> (ChunkManagerImpl.java:85) - writing
> chunk:bd80b58a5eba888200a4832a0f2aafb3_stream_5f3b2505-6964-45c9-a7ad-827388a1e6a0_chunk_2
> chunk stage:COMMIT_DATA chunk
> file:/tmp/hadoop-root/dfs/data/hdds/de0a9e01-4a12-40e3-b567-51b9bd83248e/current/containerDir0/16/chunks/bd80b58a5eba888200a4832a0f2aafb3_stream_5f3b2505-6964-45c9-a7ad-827388a1e6a0_chunk_2
> tmp chunk file
> 2018-09-13 12:12:21,410 [Datanode State Machine Thread - 0] DEBUG
> (DatanodeStateMachine.java:148) - Executing cycle Number : 206
> 2018-09-13 12:12:51,411 [Datanode State Machine Thread - 0] DEBUG
> (DatanodeStateMachine.java:148) - Executing cycle Number : 207
> 2018-09-13 12:12:53,525 [BlockDeletingService#1] DEBUG
> (TopNOrderedContainerDeletionChoosingPolicy.java:79) - Stop looking for next
> container, there is no pending deletion block contained in remaining
> containers.
> 2018-09-13 12:12:55,048 [Datanode ReportManager Thread - 1] DEBUG
> (ContainerSet.java:191) - Starting container report iteration.
> 2018-09-13 12:13:02,626 [pool-3-thread-1] ERROR (ChunkUtils.java:244) -
> Rejecting write chunk request. Chunk overwrite without explicit request.
> ChunkInfo{chunkName='bd80b58a5eba888200a4832a0f2aafb3_stream_5f3b2505-6964-45c9-a7ad-827388a1e6a0_chunk_2,
> offset=0, len=16777216}
> 2018-09-13 12:13:03,035 [pool-3-thread-1] INFO (ContainerUtils.java:149) -
> Operation: WriteChunk : Trace ID: 54834b29-603d-4ba9-9d68-0885215759d8 :
> Message: Rejecting write chunk request. OverWrite flag
> required.ChunkInfo{chunkName='bd80b58a5eba888200a4832a0f2aafb3_stream_5f3b2505-6964-45c9-a7ad-827388a1e6a0_chunk_2,
> offset=0, len=16777216} : Result: OVERWRITE_FLAG_REQUIRED
> 2018-09-13 12:13:03,037 [ForkJoinPool.commonPool-worker-11] ERROR
> (ChunkUtils.java:244) - Rejecting write chunk request. Chunk overwrite
> without explicit request.
> ChunkInfo{chunkName='bd80b58a5eba888200a4832a0f2aafb3_stream_5f3b2505-6964-45c9-a7ad-827388a1e6a0_chunk_2,
> offset=0, len=16777216}
> 2018-09-13 12:13:03,037 [ForkJoinPool.commonPool-worker-11] INFO
> (ContainerUtils.java:149) - Operation: WriteChunk : Trace ID:
> 54834b29-603d-4ba9-9d68-0885215759d8 : Message: Rejecting write chunk
> request. OverWrite flag
> required.ChunkInfo{chunkName='bd80b58a5eba888200a4832a0f2aafb3_stream_5f3b2505-6964-45c9-a7ad-827388a1e6a0_chunk_2,
> offset=0, len=16777216} : Result: OVERWRITE_FLAG_REQUIRED
>
> {noformat}
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]