[
https://issues.apache.org/jira/browse/HDFS-10743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16202482#comment-16202482
]
Daryn Sharp commented on HDFS-10743:
------------------------------------
Triggering a block report immediately after the heartbeat isn't addressing the
main issue of delayed reconnects after a cluster restart. Eliminating that
delay will save a lot of time.
The DNs are stuck waiting for the next heartbeat or stuck in
{{sleepAfterException}}. The mini cluster has some "triggerBlah" methods, but
they are synchronous and wait for the operation to complete, which we can't do
because sometimes DNs are expected to fail to connect. An async wakeup could be
done with {{DataNode#scheduleAllBlockReport(0)}} if it also triggered a
heartbeat. Maybe add a flag to that method for sending a heartbeat.
Something needs to be done to wake the thread from {{sleepAfterException}}
because tests will likely encounter that delay during restarts.
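To illustrate the kind of wakeup meant here, a minimal sketch of a sleep that another thread can cut short without waiting for any operation to complete. This is not the actual {{BPServiceActor}} code; the class {{WakeableSleeper}} and its methods are hypothetical names used only for illustration.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical helper, not part of Hadoop: a sleep that can be ended early
// by an asynchronous wakeUp() call from another thread.
final class WakeableSleeper {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition wake = lock.newCondition();
    private boolean wakeRequested = false;

    /**
     * Sleeps for up to the given number of milliseconds.
     * Returns true if a wakeup ended the sleep early (or was already
     * pending), false if the full timeout elapsed.
     */
    boolean sleep(long millis) throws InterruptedException {
        long nanos = TimeUnit.MILLISECONDS.toNanos(millis);
        lock.lock();
        try {
            // Loop to tolerate spurious wakeups; awaitNanos returns the
            // remaining time, which is <= 0 once the timeout has elapsed.
            while (!wakeRequested && nanos > 0) {
                nanos = wake.awaitNanos(nanos);
            }
            boolean woken = wakeRequested;
            wakeRequested = false; // consume the request
            return woken;
        } finally {
            lock.unlock();
        }
    }

    /** Requests a wakeup and returns immediately, without waiting for the sleeper. */
    void wakeUp() {
        lock.lock();
        try {
            wakeRequested = true;
            wake.signalAll();
        } finally {
            lock.unlock();
        }
    }
}
```

With something along these lines, a restart path in the mini cluster could call {{wakeUp()}} on each DN's sleeper and move on, whether or not that DN manages to reconnect.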
> MiniDFSCluster test runtimes can be drastically reduced
> -------------------------------------------------------
>
> Key: HDFS-10743
> URL: https://issues.apache.org/jira/browse/HDFS-10743
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: hdfs
> Affects Versions: 2.0.0-alpha
> Reporter: Daryn Sharp
> Assignee: Kuhu Shukla
> Attachments: HDFS-10743.001.patch, HDFS-10743.002.patch,
> HDFS-10743.003.patch
>
>
> {{MiniDFSCluster}} tests have excessive runtimes. The main problem appears
> to be the heartbeat interval. The NN may have to wait up to 3s (default
> value) for all DNs to heartbeat, triggering registration, so NN can go
> active. Tests that repeatedly restart the NN are severely affected.
> Example for varying heartbeat intervals for {{TestFSImageWithAcl}}:
> * 3s = ~70s -- (disgusting, why I investigated)
> * 1s = ~27s
> * 500ms = ~17s -- (had to hack DNConf for millisecond precision)
> That's a 4x improvement in runtime.
> 17s is still excessively long for what the test does. Further areas to
> explore when running tests:
> * Reduce the numerous sleep intervals in DN's {{BPServiceActor}}.
> * Ensure heartbeats and initial BR are sent immediately upon (re)registration.