Daryn Sharp created HDFS-8674:
---------------------------------
Summary: Improve performance of postponed block scans
Key: HDFS-8674
URL: https://issues.apache.org/jira/browse/HDFS-8674
Project: Hadoop HDFS
Issue Type: Improvement
Components: HDFS
Affects Versions: 2.6.0
Reporter: Daryn Sharp
Assignee: Daryn Sharp
Priority: Critical
When a standby goes active, it marks all nodes as "stale", which causes
block invalidations for over-replicated blocks to be queued until full block
reports are received from the nodes holding the block. The replication monitor
scans this queue in O(N) time, where N is the number of postponed blocks: to
randomize which blocks are scanned, it picks a random offset and iterates
through the set to reach it.
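
To illustrate why a random offset into a set costs O(N), here is a minimal
sketch. It is not the actual BlockManager code: the class name, method name,
and the use of a LinkedHashSet of block IDs are assumptions made for
illustration only. A set has no positional access, so reaching the offset
means stepping an iterator past every earlier element.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Hypothetical sketch, not the HDFS implementation.
public class PostponedScanSketch {
    /** Iterator steps taken by the most recent scan (skipped + visited). */
    static long steps = 0;

    /**
     * Scans up to 'limit' entries starting at a random offset.
     * This sketch does not wrap around, so it may return fewer than
     * 'limit' entries when the offset lands near the end of the set.
     */
    static List<Long> scanFromRandomOffset(Set<Long> postponed, int limit, Random rng) {
        List<Long> scanned = new ArrayList<>();
        steps = 0;
        if (postponed.isEmpty() || limit <= 0) {
            return scanned;
        }
        int offset = rng.nextInt(postponed.size());
        Iterator<Long> it = postponed.iterator();
        // Skipping to the offset is the O(N) part: one step per skipped entry.
        for (int i = 0; i < offset; i++) {
            it.next();
            steps++;
        }
        // Only this part does useful work, bounded by 'limit'.
        while (it.hasNext() && scanned.size() < limit) {
            scanned.add(it.next());
            steps++;
        }
        return scanned;
    }

    public static void main(String[] args) {
        Set<Long> postponed = new LinkedHashSet<>();
        for (long id = 0; id < 1_000_000; id++) {
            postponed.add(id);
        }
        List<Long> scanned = scanFromRandomOffset(postponed, 2000, new Random());
        System.out.println("scanned=" + scanned.size() + " steps=" + steps);
    }
}
```

With a million entries, the expected offset is about half the set, so most
of the steps are spent skipping rather than scanning.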
The result is devastating when a cluster loses multiple nodes during a rolling
upgrade. Re-replication occurs, the nodes come back, and the excess block
invalidations are postponed. Rescanning just 2k blocks out of millions of
postponed blocks may take multiple seconds. The write lock is held for the
duration of the scan, which stalls all other processing.