[
https://issues.apache.org/jira/browse/HDFS-16293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452859#comment-17452859
]
Yuanxin Zhu edited comment on HDFS-16293 at 12/3/21, 10:07 AM:
---------------------------------------------------------------
[~tasanuma] Thanks for your feedback. My concern is that the unit test failed
because of a threading problem.
I think there are two situations:
* Without the DataStreamer fix, the congestedNodes thread may run one step
ahead of the dataQueue thread, so the size of congestedNodes becomes greater
than 1. This can be solved by increasing the sleep time of the congestedNodes
thread.
* With the DataStreamer fix, the previous unit test exits as soon as the
dataQueue thread ends, in order to save time. The test may therefore exit
early, while the size of congestedNodes is still not greater than 1. This can
be solved by increasing the number of times the congestedNodes thread runs and
moving the exit code into the congestedNodes thread, but that will affect the
running time of the unit test without the DataStreamer fix.
If the test occasionally fails to finish, we can increase the number of times
the dataQueue thread runs, so that the DataStreamer does not wait on an empty
dataQueue, or add another packet before the congestedNodes thread ends.
Could you check it?
> Client sleeps and holds 'dataQueue' when DataNodes are congested
> ----------------------------------------------------------------
>
> Key: HDFS-16293
> URL: https://issues.apache.org/jira/browse/HDFS-16293
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hdfs-client
> Affects Versions: 3.2.2, 3.3.1, 3.2.3
> Reporter: Yuanxin Zhu
> Assignee: Yuanxin Zhu
> Priority: Major
> Attachments: HDFS-16293.01-branch-3.2.2.patch, HDFS-16293.01.patch,
> HDFS-16293.02.patch, HDFS-16293.03.patch, HDFS-16293.04.patch,
> HDFS-16293.05.patch
>
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> When I enable ECN and run Terasort (500G data, 8 DataNodes, 76 vcores/DN)
> for testing, the DataNodes become congested (HDFS-8008). After receiving the
> ACK many times, the client enters the sleep state but does not release the
> 'dataQueue'. The ResponseProcessor thread needs the 'dataQueue' to execute
> 'ackQueue.getFirst()', so the ResponseProcessor waits for the client to
> release the 'dataQueue'. This effectively puts the ResponseProcessor thread
> to sleep as well, resulting in ACK delay. MapReduce tasks can be delayed by
> tens of minutes or even hours.
> The DataStreamer thread can first execute 'one = dataQueue.getFirst()',
> release the 'dataQueue', and then decide whether to execute
> 'backOffIfNecessary()' according to 'one.isHeartbeatPacket()'.
>
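
The ordering proposed in the description can be sketched roughly as follows.
This is a minimal illustration and not the actual DataStreamer code: the class
BackoffOutsideLock, the Packet stand-in, the two take* methods and the
recording fields are hypothetical names introduced here. It also assumes that
heartbeat packets skip the back-off, and the real backOffIfNecessary() sleep
is replaced by a stub that only records whether the 'dataQueue' monitor was
held when it ran.

```java
import java.util.LinkedList;

// Hypothetical sketch; not the real org.apache.hadoop.hdfs.DataStreamer.
class BackoffOutsideLock {
    /** Hypothetical stand-in for a packet waiting in the dataQueue. */
    static final class Packet {
        private final boolean heartbeat;
        Packet(boolean heartbeat) { this.heartbeat = heartbeat; }
        boolean isHeartbeatPacket() { return heartbeat; }
    }

    private final LinkedList<Packet> dataQueue = new LinkedList<>();

    // Whether the most recent back-off ran while holding the dataQueue monitor.
    boolean backedOffHoldingLock = false;
    // How many times the back-off stub has run.
    int backoffCount = 0;

    void enqueue(Packet p) {
        synchronized (dataQueue) {
            dataQueue.addLast(p);
            dataQueue.notifyAll();
        }
    }

    // Stub for backOffIfNecessary(): the real method may sleep when DataNodes
    // are congested. Here it only records the lock state (assumption).
    private void backOffIfNecessary() {
        backedOffHoldingLock = Thread.holdsLock(dataQueue);
        backoffCount++;
    }

    /** Problematic ordering: back off while still holding the dataQueue. */
    Packet takeBackingOffInsideLock() {
        synchronized (dataQueue) {
            backOffIfNecessary();
            return dataQueue.getFirst();
        }
    }

    /** Proposed ordering: read the packet, release the dataQueue, then back off. */
    Packet takeBackingOffOutsideLock() {
        Packet one;
        synchronized (dataQueue) {
            one = dataQueue.getFirst();
        }
        // The monitor is released here, so a thread such as the
        // ResponseProcessor could enter synchronized (dataQueue) during the
        // back-off. Skipping heartbeat packets is an assumption of this sketch.
        if (!one.isHeartbeatPacket()) {
            backOffIfNecessary();
        }
        return one;
    }

    public static void main(String[] args) {
        BackoffOutsideLock s = new BackoffOutsideLock();
        s.enqueue(new Packet(false));
        s.takeBackingOffInsideLock();
        System.out.println("inside lock, monitor held: " + s.backedOffHoldingLock);
        s.takeBackingOffOutsideLock();
        System.out.println("outside lock, monitor held: " + s.backedOffHoldingLock);
    }
}
```

With the second ordering the back-off no longer blocks other threads that
synchronize on the 'dataQueue', which is the point of the proposed change.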