[ https://issues.apache.org/jira/browse/HDFS-14973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971813#comment-16971813 ]

Erik Krogen commented on HDFS-14973:
------------------------------------

Thanks for taking a look, [~shv]!
{quote}The next wave of dispatcher threads after 100 should not hit the 
NameNode right away. It is supposed first to executePendingMove(), then call 
getBlocks(). And executePendingMove() naturally throttles the dispatcher, so it 
was not necessary to delay the subsequent waves.
{quote}
This might work, except that {{executePendingMove}} is non-blocking:
{code:java}
  public void executePendingMove(final PendingMove p) {
    // move the reportedBlock
    final DDatanode targetDn = p.target.getDDatanode();
    ExecutorService moveExecutor = targetDn.getMoveExecutor();
    if (moveExecutor == null) {
      final int nThreads = moverThreadAllocator.allocate();
      if (nThreads > 0) {
        moveExecutor = targetDn.initMoveExecutor(nThreads);
      }
    }
    if (moveExecutor == null) {
      LOG.warn("No mover threads available: skip moving " + p);
      targetDn.removePendingBlock(p);
      p.proxySource.removePendingBlock(p);
      return;
    }
    moveExecutor.execute(new Runnable() {
      @Override
      public void run() {
        p.dispatch();
      }
    });
  }
{code}
It simply allocates a thread pool (if one does not already exist) and submits a 
task to it; the actual block movement happens later, on the {{moveExecutor}}'s 
threads. Even as far back as 2.6.1 (which doesn't have HDFS-11742), 
{{executePendingMove}} was similarly non-blocking:
{code:java}
  public void executePendingMove(final PendingMove p) {
    // move the block
    moveExecutor.execute(new Runnable() {
      @Override
      public void run() {
        p.dispatch();
      }
    });
  }
{code}
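To make the non-blocking claim concrete, here's a quick standalone sketch (a 
made-up demo class, not Balancer code) showing that {{ExecutorService#execute}} 
returns immediately no matter how slow the queued work is, so the submitting 
thread is never throttled:
{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical demo, not from the Hadoop codebase.
public class NonBlockingSubmitDemo {
  public static void main(String[] args) {
    // One worker thread with an unbounded queue, like a per-target moveExecutor.
    ExecutorService moveExecutor = Executors.newFixedThreadPool(1);
    long start = System.nanoTime();
    for (int i = 0; i < 1000; i++) {
      moveExecutor.execute(() -> {
        try {
          Thread.sleep(100); // stand-in for a slow block transfer
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      });
    }
    long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    // All 1000 submissions return in a few milliseconds even though the
    // queued work would take ~100 seconds to drain.
    System.out.println("Submitted 1000 tasks in " + elapsedMs + " ms");
    moveExecutor.shutdownNow();
  }
}
{code}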
However, I certainly agree that it's possible the changes to the Balancer 
(HDFS-8818, HDFS-11742) exacerbated this issue.

With this non-blocking behavior, you end up with the scenario I described, where 
the first 20 slots in the {{dispatchExecutor}} continually push through dispatch 
tasks at a throughput above the throttling limit.
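To illustrate with a simplified model (hypothetical names and constants, not 
the actual {{Dispatcher}} logic): only submissions 20-99 are assigned a delay, 
so the 20 undelayed slots keep turning over and the effective {{getBlocks}} 
rate is never capped.
{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical simulation of the flawed delay assignment, not Balancer code.
public class DelaySlotDemo {
  public static void main(String[] args) throws InterruptedException {
    final int poolSize = 100;     // dispatcher threadpool size (default)
    final int getBlocksQps = 20;  // hardcoded allowed getBlocks rate
    ExecutorService dispatchExecutor = Executors.newFixedThreadPool(poolSize);
    AtomicInteger undelayed = new AtomicInteger();
    for (int i = 0; i < 300; i++) {
      // Flawed scheme: only submissions 20..99 receive a delay.
      final long delayMs =
          (i >= getBlocksQps && i < poolSize) ? 1000L * (i / getBlocksQps) : 0;
      dispatchExecutor.execute(() -> {
        try {
          Thread.sleep(delayMs); // throttling delay, if any
          Thread.sleep(20);      // stand-in for a fast getBlocks RPC
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
        if (delayMs == 0) {
          undelayed.incrementAndGet();
        }
      });
    }
    dispatchExecutor.shutdown();
    dispatchExecutor.awaitTermination(1, TimeUnit.MINUTES);
    // Tasks 0-19 and 100-299 (220 of 300) are never delayed; since each one
    // finishes in ~20 ms, the 20 free slots recycle far faster than 20 QPS.
    System.out.println("Undelayed getBlocks calls: " + undelayed.get());
  }
}
{code}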

Attached a v1 patch addressing the hdfs-site.xml issue.

> Balancer getBlocks RPC dispersal does not function properly
> -----------------------------------------------------------
>
>                 Key: HDFS-14973
>                 URL: https://issues.apache.org/jira/browse/HDFS-14973
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: balancer & mover
>    Affects Versions: 2.9.0, 2.7.4, 2.8.2, 3.0.0
>            Reporter: Erik Krogen
>            Assignee: Erik Krogen
>            Priority: Major
>         Attachments: HDFS-14973.000.patch, HDFS-14973.test.patch
>
>
> In HDFS-11384, a mechanism was added to make the {{getBlocks}} RPC calls 
> issued by the Balancer/Mover more dispersed, to alleviate load on the 
> NameNode, since {{getBlocks}} can be very expensive and the Balancer should 
> not impact normal cluster operation.
> Unfortunately, this mechanism does not function as expected, especially 
> when the dispatcher thread count is low. The primary issue is that the delay 
> is applied only to the first N threads that are submitted to the dispatcher's 
> executor, where N is the size of the dispatcher's threadpool, but *not* to 
> the first R threads, where R is the number of allowed {{getBlocks}} QPS 
> (currently hardcoded to 20). For example, if the threadpool size is 100 (the 
> default), threads 0-19 have no delay, 20-99 have increased levels of delay, 
> and 100+ have no delay. As I understand it, the intent of the logic was that 
> the delay applied to the first 100 threads would force the dispatcher 
> executor's threads to all be consumed, thus blocking subsequent (non-delayed) 
> threads until the delay period has expired. However, threads 0-19 can finish 
> very quickly (their work can often be fulfilled in the time it takes to 
> execute a single {{getBlocks}} RPC, on the order of tens of milliseconds), 
> thus opening up 20 new slots in the executor, which are then consumed by 
> non-delayed threads 100-119, and so on. So, although 80 threads have had a 
> delay applied, the non-delay threads rush through in the 20 non-delay slots.
> This problem gets even worse when the dispatcher threadpool size is less than 
> the max {{getBlocks}} QPS. For example, if the threadpool size is 10, _no 
> threads ever have a delay applied_, and the feature is not enabled at all.
> This problem wasn't surfaced in the original JIRA because the test 
> incorrectly measured the period across which {{getBlocks}} RPCs were 
> distributed. The variables {{startGetBlocksTime}} and {{endGetBlocksTime}} 
> were used to track the time over which the {{getBlocks}} calls were made. 
> However, {{startGetBlocksTime}} was initialized at the time of creation of 
> the {{FSNamesystem}} spy, which is before the mock DataNodes are started. Even 
> worse, the Balancer in this test takes 2 iterations to complete balancing the 
> cluster, so the time period {{endGetBlocksTime - startGetBlocksTime}} 
> actually represents:
> {code}
> (time to submit getBlocks RPCs) + (DataNode startup time) + (time for the 
> Dispatcher to complete an iteration of moving blocks)
> {code}
> Thus, the RPC QPS reported by the test is much lower than the RPC QPS seen 
> during the period of initial block fetching.


