[ https://issues.apache.org/jira/browse/HDFS-15415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17140477#comment-17140477 ]

Stephen O'Donnell commented on HDFS-15415:
------------------------------------------

I have annotated the main loop that the DirectoryScanner runs under the
datanode lock below, with comments starting "SOD:". It's important to keep in
mind that this compare phase creates a list of differences, and those
differences are then checked again, block by block, under the datanode lock in
the reconcile step. Some "incorrect" differences are likely to be recorded even
under the lock, as scanning the disks takes time. The scan is performed outside
of the lock, so the DN could be appending, deleting and adding blocks while it
runs. Therefore, if some more changes happen while we compare the disk results
to the in-memory blocks, it is not a big deal: they will get re-checked and
resolved during the reconcile step.
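
To make the "harmless extra diffs" point concrete, here is a minimal,
self-contained sketch of the two-phase pattern (toy names, not the Hadoop
code): differences gathered without the lock are re-validated under the lock
before anything is acted on, so a stale diff just falls out as a no-op.

{code}
import java.util.ArrayList;
import java.util.List;

// Toy model of the scan/reconcile pattern; "Diff" stands in for ScanInfo.
class TwoPhaseScanner {
  record Diff(long blockId) {}

  private final Object lock = new Object(); // stands in for the dataset lock
  private final List<Diff> diffs = new ArrayList<>();

  // Phase 1: compare disk vs. memory WITHOUT the lock. Some recorded diffs
  // may already be stale by the time we look at them again.
  void scan(List<Long> onDisk, List<Long> inMemory) {
    for (long id : onDisk) {
      if (!inMemory.contains(id)) {
        diffs.add(new Diff(id));
      }
    }
  }

  // Phase 2: re-validate every diff under the lock. A stale diff (the block
  // turned up in memory in the meantime) is simply skipped, which is why
  // extra differences from phase 1 are not a correctness problem.
  void reconcile(List<Long> inMemory) {
    synchronized (lock) {
      for (Diff d : diffs) {
        if (inMemory.contains(d.blockId())) {
          continue; // stale diff: nothing to fix
        }
        // ... repair the real discrepancy here ...
      }
    }
  }
}
{code}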

I have a small concern that if the disk balancer is running and moving blocks
around, it could cause more differences. However, I don't see any protection
against that when scanning the volumes either, so a block could potentially be
counted on vol1, moved to vol2, and then counted again.

Overall, I feel it is safe to limit the lock to just the call to
`dataset.getSortedFinalizedBlocks(bpid)`.
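
Concretely, the change I have in mind looks something like this (a sketch
against the loop below, not a tested patch):

{code}
    for (final String bpid : blockPoolReport.getBlockPoolIds()) {
      // ...
      final List<ReplicaInfo> bl;
      // Hold the dataset lock only long enough to copy a consistent,
      // already-sorted snapshot of the finalized replicas for this pool.
      try (AutoCloseableLock lock = dataset.acquireDatasetLock()) {
        bl = dataset.getSortedFinalizedBlocks(bpid);
      }
      // The whole compare loop then runs without the lock, working off the
      // snapshot; any races are caught later in the reconcile step.
      // ...
    }
{code}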

Please let me know if anyone thinks that is wrong, or I am missing something 
obvious.


{code}

    // Hold FSDataset lock to prevent further changes to the block map
    try (AutoCloseableLock lock = dataset.acquireDatasetLock()) {
      for (final String bpid : blockPoolReport.getBlockPoolIds()) {
        List<ScanInfo> blockpoolReport = blockPoolReport.getScanInfo(bpid);

        Stats statsRecord = new Stats(bpid);
        stats.put(bpid, statsRecord);
        Collection<ScanInfo> diffRecord = new ArrayList<>();

        statsRecord.totalBlocks = blockpoolReport.size();
        // Need to hold a lock here to prevent the replica map changing
        final List<ReplicaInfo> bl = dataset.getSortedFinalizedBlocks(bpid);

        // SOD: After here, we have a "snapshot" of the replicas that were in
        // the replica map. It doesn't really matter if those replicas change
        // or not as we go through the checks, as we are working off the
        // snapshot. The in-memory version will have diverged from the on-disk
        // details as the disk is scanned anyway.

        int d = 0; // index for blockpoolReport
        int m = 0; // index for memReport
        while (m < bl.size() && d < blockpoolReport.size()) {
          ReplicaInfo memBlock = bl.get(m);
          ScanInfo info = blockpoolReport.get(d);
          // SOD: This block is safe to run outside of the lock
          if (info.getBlockId() < memBlock.getBlockId()) {
            // SOD: isDeletingBlock() is a synchronized method, so we don't
            // need a lock to check it.
            if (!dataset.isDeletingBlock(bpid, info.getBlockId())) {
              // Block is missing in memory
              statsRecord.missingMemoryBlocks++;
              addDifference(diffRecord, statsRecord, info);
            }
            d++;
            continue;
          }
          // SOD: This is safe outside the lock
          if (info.getBlockId() > memBlock.getBlockId()) {
            // Block is missing on the disk
            addDifference(diffRecord, statsRecord, memBlock.getBlockId(),
                info.getVolume());
            m++;
            continue;
          }
          // Block file and/or metadata file exists on the disk
          // Block exists in memory
          // SOD: This branch looks safe
          if (info.getVolume().getStorageType() != StorageType.PROVIDED
              && info.getBlockFile() == null) {
            // Block metadata file exists and block file is missing
            addDifference(diffRecord, statsRecord, info);
          // SOD: If we don't have a lock here, an append or truncate could
          // alter the block length or gen stamp. However, these could already
          // have changed as the disk was scanned. Therefore I believe it is
          // safe to do this outside the lock. Worst case we gather some extra
          // differences, but they get handled in the reconcile step.
          } else if (info.getGenStamp() != memBlock.getGenerationStamp()
              || info.getBlockLength() != memBlock.getNumBytes()) {
            // Block metadata file is missing or has wrong generation stamp,
            // or block file length is different than expected
            statsRecord.mismatchBlocks++;
            addDifference(diffRecord, statsRecord, info);
          // SOD: The compareWith method checks the expected location of the
          // block (i.e. vol/subdir/subdir/blk_xxxx) against what was found on
          // the disk scan. This section is a concern, as the disk balancer
          // could move a block and the move would then be logged as a
          // difference. Again, it will be handled in the reconcile step.
          } else if (memBlock.compareWith(info) != 0) {
            // volumeMap record and on-disk files do not match.
            statsRecord.duplicateBlocks++;
            addDifference(diffRecord, statsRecord, info);
          }
          d++;

          // SOD: This seems safe outside of the lock
          if (d < blockpoolReport.size()) {
            // There may be multiple on-disk records for the same block, do not
            // increment the memory record pointer if so.
            ScanInfo nextInfo = blockpoolReport.get(d);
            if (nextInfo.getBlockId() != info.getBlockId()) {
              ++m;
            }
          } else {
            ++m;
          }
        }
        // SOD: This seems safe
        while (m < bl.size()) {
          ReplicaInfo current = bl.get(m++);
          addDifference(diffRecord, statsRecord, current.getBlockId(),
              current.getVolume());
        }
        // SOD: This seems safe
        while (d < blockpoolReport.size()) {
          if (!dataset.isDeletingBlock(bpid,
              blockpoolReport.get(d).getBlockId())) {
            statsRecord.missingMemoryBlocks++;
            addDifference(diffRecord, statsRecord, blockpoolReport.get(d));
          }
          d++;
        }
        synchronized (diffs) {
          diffs.addAll(bpid, diffRecord);
        }
        LOG.info("Scan Results: {}", statsRecord);
      }
    }
{code}
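
As an aside, the loop above is a standard two-pointer merge over two
blockId-sorted lists, which is why the snapshot has to come back sorted. A
stripped-down, self-contained version of the same walk (toy types, not the
Hadoop code):

{code}
import java.util.List;

class SortedDiff {
  // Walk two blockId-sorted lists in lockstep and classify each id.
  static void diff(List<Long> disk, List<Long> memory) {
    int d = 0, m = 0;
    while (d < disk.size() && m < memory.size()) {
      long diskId = disk.get(d), memId = memory.get(m);
      if (diskId < memId) {
        System.out.println(diskId + ": on disk, missing in memory");
        d++;
      } else if (diskId > memId) {
        System.out.println(memId + ": in memory, missing on disk");
        m++;
      } else {
        d++; // present in both: this is where genstamp/length are compared
        m++;
      }
    }
    while (d < disk.size()) {
      System.out.println(disk.get(d++) + ": on disk, missing in memory");
    }
    while (m < memory.size()) {
      System.out.println(memory.get(m++) + ": in memory, missing on disk");
    }
  }

  public static void main(String[] args) {
    // Prints "1: on disk, missing in memory" and
    // "3: in memory, missing on disk".
    diff(List.of(1L, 2L, 4L), List.of(2L, 3L, 4L));
  }
}
{code}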

> Reduce locking in Datanode DirectoryScanner
> -------------------------------------------
>
>                 Key: HDFS-15415
>                 URL: https://issues.apache.org/jira/browse/HDFS-15415
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: datanode
>    Affects Versions: 3.4.0
>            Reporter: Stephen O'Donnell
>            Assignee: Stephen O'Donnell
>            Priority: Major
>         Attachments: HDFS-15415.001.patch
>
>
> In HDFS-15406, we have a small change to greatly reduce the runtime and
> locking time of the datanode DirectoryScanner. There may be room for further
> improvement here:
> 1. These lines of code in DirectoryScanner#scan() obtain a snapshot of the
> finalized blocks from memory and then sort them, under the DN lock. However,
> the blocks are already stored in a sorted structure (FoldedTreeSet), and
> hence the sort should be unnecessary.
> {code}
>   final List<ReplicaInfo> bl = dataset.getFinalizedBlocks(bpid);
>   Collections.sort(bl); // Sort based on blockId
> {code}
> 2. From the scan step, we have captured a snapshot of what is on disk. After
> calling `dataset.getFinalizedBlocks(bpid);` as above, we have taken a
> snapshot of what is in memory. The two snapshots are never 100% in sync, as
> things are always changing as the disk is scanned.
> We are only comparing finalized blocks, so they should not really change:
> * If a block is deleted after our snapshot, our snapshot will not see it and 
> that is OK.
> * A finalized block could be appended. If that happens, both the genstamp and
> length will change, but that should be handled by reconcile when it calls
> `FSDatasetImpl.checkAndUpdate()`, and there is nothing stopping blocks from
> being appended after they have been scanned from disk but before they have
> been compared with memory.
> My suspicion is that we can do all the comparison work outside of the lock,
> since checkAndUpdate() re-checks any differences later, under the lock, on a
> block-by-block basis.
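
For point 1 in the description, the shape of the fix is simply to let the
dataset hand back an already-sorted copy, as the getSortedFinalizedBlocks call
in the annotated loop above does (a sketch, relying on the replicas already
being ordered by blockId inside FoldedTreeSet):

{code}
// Before: snapshot under the lock, then an O(n log n) sort of the copy.
final List<ReplicaInfo> bl = dataset.getFinalizedBlocks(bpid);
Collections.sort(bl); // Sort based on blockId

// After: the replica map is already blockId-ordered, so the dataset can
// return a sorted copy directly and the explicit sort goes away.
final List<ReplicaInfo> sorted = dataset.getSortedFinalizedBlocks(bpid);
{code}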



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
