[
https://issues.apache.org/jira/browse/HDFS-14973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Erik Krogen updated HDFS-14973:
-------------------------------
Description:
In HDFS-11384, a mechanism was added to make the {{getBlocks}} RPC calls issued
by the Balancer/Mover more dispersed, to alleviate load on the NameNode, since
{{getBlocks}} can be very expensive and the Balancer should not impact normal
cluster operation.
Unfortunately, this mechanism does not work as expected, especially
when the dispatcher thread count is low. The primary issue is that the delay is
applied only to the first N threads that are submitted to the dispatcher's
executor, where N is the size of the dispatcher's threadpool, but *not* to the
first R threads, where R is the number of allowed {{getBlocks}} QPS (currently
hardcoded to 20). For example, if the threadpool size is 100 (the default),
threads 0-19 have no delay, 20-99 have increasing levels of delay, and 100+ have
no delay. As I understand it, the intent of the logic was that the delay
applied to the first 100 threads would force the dispatcher executor's threads
to all be consumed, thus blocking subsequent (non-delayed) threads until the
delay period has expired. However, threads 0-19 can finish very quickly (their
work can often be fulfilled in the time it takes to execute a single
{{getBlocks}} RPC, on the order of tens of milliseconds), thus opening up 20
new slots in the executor, which are then consumed by non-delayed threads
100-119, and so on. So, although 80 threads have had a delay applied, the
non-delayed threads rush through the 20 slots that carry no delay.
This problem gets even worse when the dispatcher threadpool size is less than
the max {{getBlocks}} QPS. For example, if the threadpool size is 10, _no
threads ever have a delay applied_, and the feature is effectively disabled.
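The index ranges described above can be reproduced with a small model. This is a minimal sketch, not the Dispatcher code itself: the class and method names are hypothetical, and the assumption that the delay grows by one second per batch of R tasks is made only for illustration.

```java
// Hypothetical model of the delay assignment described above; not the actual Dispatcher code.
class GetBlocksDelaySketch {

  // Assumed for illustration: the delay grows by one second per batch of maxQps tasks.
  static long delaySeconds(int taskIndex, int poolSize, int maxQps) {
    if (taskIndex >= poolSize) {
      return 0; // tasks submitted after the first poolSize tasks are never delayed
    }
    return taskIndex / maxQps; // the first maxQps tasks get 0, the next batch 1, and so on
  }

  public static void main(String[] args) {
    int maxQps = 20; // R in the description
    for (int poolSize : new int[] {100, 10}) { // N in the description
      System.out.println("poolSize=" + poolSize);
      for (int i : new int[] {0, 19, 20, 99, 100, 119}) {
        System.out.println("  task " + i + " -> delay "
            + delaySeconds(i, poolSize, maxQps) + "s");
      }
    }
  }
}
```

Under these assumptions, a pool size of 100 leaves tasks 0-19 and 100+ undelayed, and a pool size of 10 yields a zero delay for every task index, which is the degenerate case described above.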
This problem did not surface in the original JIRA because the test incorrectly
measured the period across which {{getBlocks}} RPCs were distributed. The
variables {{startGetBlocksTime}} and {{endGetBlocksTime}} were used to track
the time over which the {{getBlocks}} calls were made. However,
{{startGetBlocksTime}} was initialized at the time of creation of the
{{FSNamesystem}} spy, which is before the mock DataNodes are started. Even
worse, the Balancer in this test takes 2 iterations to complete balancing the
cluster, so the time period {{endGetBlocksTime - startGetBlocksTime}} actually
represents:
{code}
2 * (time to submit getBlocks RPCs) + (DataNode startup time) + 2 * (time for
the Dispatcher to complete an iteration of moving blocks)
{code}
Thus, the RPC QPS reported by the test is much lower than the RPC QPS seen
during the period of initial block fetching.
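To make the effect of the inflated window concrete, here is a rough worked example. Every number in it is hypothetical, as are the class and method names; only the shape of the measured window follows the breakdown above.

```java
// Hypothetical numbers throughout; only the shape of the calculation follows the text.
class MeasuredQpsSketch {

  // Queries per second for rpcCount calls spread over windowMillis milliseconds.
  static double qps(long rpcCount, long windowMillis) {
    return rpcCount * 1000.0 / windowMillis;
  }

  public static void main(String[] args) {
    long rpcsPerIteration = 100;  // assumed getBlocks calls per Balancer iteration
    long submitMillis = 1_000;    // assumed time to submit one iteration's RPCs
    long dnStartupMillis = 5_000; // assumed DataNode startup time
    long moveMillis = 10_000;     // assumed time for one iteration of moving blocks

    // Window the test measured: 2 * submit + DataNode startup + 2 * move
    long measuredWindow = 2 * submitMillis + dnStartupMillis + 2 * moveMillis;

    double reported = qps(2 * rpcsPerIteration, measuredWindow);
    double duringFetch = qps(rpcsPerIteration, submitMillis);
    System.out.printf("reported=%.1f QPS, during initial fetch=%.1f QPS%n",
        reported, duringFetch);
  }
}
```

With these made-up values the test would report roughly 7.4 QPS, comfortably under the limit of 20, even though the rate during the initial fetch is 100 QPS.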
> Balancer getBlocks RPC dispersal does not function properly
> -----------------------------------------------------------
>
> Key: HDFS-14973
> URL: https://issues.apache.org/jira/browse/HDFS-14973
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: balancer & mover
> Affects Versions: 2.9.0, 2.7.4, 2.8.2, 3.0.0
> Reporter: Erik Krogen
> Assignee: Erik Krogen
> Priority: Major
>