[ https://issues.apache.org/jira/browse/HDFS-14973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Erik Krogen updated HDFS-14973:
-------------------------------
    Description: 
In HDFS-11384, a mechanism was added to make the {{getBlocks}} RPC calls issued 
by the Balancer/Mover more dispersed, to alleviate load on the NameNode, since 
{{getBlocks}} can be very expensive and the Balancer should not impact normal 
cluster operation.

Unfortunately, this mechanism does not work as expected, especially 
when the dispatcher thread count is low. The primary issue is that the delay is 
applied only to the first N threads that are submitted to the dispatcher's 
executor, where N is the size of the dispatcher's threadpool, but *not* to the 
first R threads, where R is the number of allowed {{getBlocks}} QPS (currently 
hardcoded to 20). For example, if the threadpool size is 100 (the default), 
threads 0-19 have no delay, 20-99 have increased levels of delay, and 100+ have 
no delay. As I understand it, the intent of the logic was that the delay 
applied to the first 100 threads would force the dispatcher executor's threads 
to all be consumed, thus blocking subsequent (non-delayed) threads until the 
delay period has expired. However, threads 0-19 can finish very quickly (their 
work can often be fulfilled in the time it takes to execute a single 
{{getBlocks}} RPC, on the order of tens of milliseconds), thus opening up 20 
new slots in the executor, which are then consumed by non-delayed threads 
100-119, and so on. So, although 80 threads have had a delay applied, the 
non-delayed threads rush through in the 20 undelayed slots.
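
To make the failure mode concrete, here is a minimal, self-contained model of 
the delay assignment described above (an illustration only, *not* the actual 
{{Dispatcher}} code; the class and method names are invented for the sketch):
{code}
/** Simplified model of the described delay assignment; NOT the real Balancer code. */
public class GetBlocksDelayModel {

  /** Delay, in seconds, applied to the thread submitted at position 'index'. */
  static long delaySeconds(int index, int poolSize, int maxGetBlocksQps) {
    if (index >= poolSize) {
      return 0; // threads beyond the first poolSize submissions are never delayed
    }
    // Within the first poolSize submissions, each group of maxGetBlocksQps
    // threads gets one more second of delay; the first group gets none.
    return index / maxGetBlocksQps;
  }

  public static void main(String[] args) {
    // Default pool size (100) with the hardcoded 20 getBlocks QPS:
    // indices 0-19 -> 0s, 20-39 -> 1s, ..., 80-99 -> 4s, and 100+ -> 0s again.
    for (int index : new int[] {0, 19, 20, 50, 99, 100, 150}) {
      System.out.println("index " + index + " -> delay "
          + delaySeconds(index, 100, 20) + "s");
    }
  }
}
{code}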

This problem gets even worse when the dispatcher threadpool size is less than 
the max {{getBlocks}} QPS. For example, if the threadpool size is 10, _no 
threads ever have a delay applied_, and the feature is effectively disabled.
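
Under the same model, a threadpool size of 10 yields a zero delay for every 
index ({{delaySeconds(index, 10, 20) == 0}}): indices 0-9 fall inside the 
first QPS group, and everything from 10 on falls outside the pool-size window, 
so no thread is ever delayed.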

This problem wasn't surfaced in the original JIRA because the test incorrectly 
measured the period across which {{getBlocks}} RPCs were distributed. The 
variables {{startGetBlocksTime}} and {{endGetBlocksTime}} were used to track 
the time over which the {{getBlocks}} calls were made. However, 
{{startGetBlocksTime}} was initialized at the time of creation of the 
{{FSNamesystem}} spy, which is before the mock DataNodes are started. Even 
worse, the Balancer in this test takes 2 iterations to complete balancing the 
cluster, so the time period {{endGetBlocksTime - startGetBlocksTime}} actually 
represents:
{code}
2 * (time to submit getBlocks RPCs)
  + (DataNode startup time)
  + 2 * (time for the Dispatcher to complete an iteration of moving blocks)
{code}
Thus, the RPC QPS reported by the test is much lower than the RPC QPS seen 
during the period of initial block fetching.
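
For a sense of scale, here is an illustration with made-up numbers (they are 
not measurements from the test) of how the inflated window hides the real 
{{getBlocks}} burst rate:
{code}
// Illustration only: fabricated timings, not taken from the actual test.
public class ReportedQpsExample {
  public static void main(String[] args) {
    int callsPerIteration = 20;   // getBlocks RPCs issued in one fetching burst
    double submitSec = 1.0;       // length of that burst (time to submit the RPCs)
    double dnStartupSec = 8.0;    // DataNode startup time wrongly inside the window
    double iterationSec = 5.0;    // one Dispatcher iteration of moving blocks
    int iterations = 2;           // the test's Balancer needs two iterations

    double measuredSec =
        iterations * submitSec + dnStartupSec + iterations * iterationSec;

    System.out.printf("QPS during a fetching burst: %.1f%n",
        callsPerIteration / submitSec);                       // 20.0
    System.out.printf("QPS the test reports:        %.1f%n",
        iterations * callsPerIteration / measuredSec);        // 2.0
  }
}
{code}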

> Balancer getBlocks RPC dispersal does not function properly
> -----------------------------------------------------------
>
>                 Key: HDFS-14973
>                 URL: https://issues.apache.org/jira/browse/HDFS-14973
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: balancer & mover
>    Affects Versions: 2.9.0, 2.7.4, 2.8.2, 3.0.0
>            Reporter: Erik Krogen
>            Assignee: Erik Krogen
>            Priority: Major
>


