[
https://issues.apache.org/jira/browse/HDFS-5380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Eric Sirianni updated HDFS-5380:
--------------------------------
Attachment: ExcessReplicaPruningTest.java
JUnit test that demonstrates this issue using {{MiniDFSCluster}}
> NameNode returns stale block locations to clients during excess replica
> pruning
> -------------------------------------------------------------------------------
>
> Key: HDFS-5380
> URL: https://issues.apache.org/jira/browse/HDFS-5380
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 2.0.0-alpha, 1.2.1
> Reporter: Eric Sirianni
> Priority: Minor
> Attachments: ExcessReplicaPruningTest.java
>
>
> Consider the following contrived example:
> {code}
> // Step 1: Create file with replication factor = 2
> Path path = ...;
> short replication = 2;
> OutputStream os = fs.create(path, ..., replication, ...);
> // Step 2: Write to file
> os.write(...);
> os.close();
> // Step 3: Reduce replication factor to 1
> fs.setReplication(path, (short) 1);
> // Wait for namenode to prune excess replicas
> // Step 4: Read from file
> InputStream is = fs.open(path);
> is.read(...);
> {code}
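> Fleshed out slightly, the same sequence looks roughly like the following
> against a plain {{FileSystem}} (the path, data size, and fixed wait are
> placeholders for illustration, not the exact code in the attached
> {{ExcessReplicaPruningTest.java}}):
> {code}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FSDataInputStream;
> import org.apache.hadoop.fs.FSDataOutputStream;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> public class StaleLocationRepro {
>   public static void main(String[] args) throws Exception {
>     Configuration conf = new Configuration();
>     FileSystem fs = FileSystem.get(conf);
>     Path path = new Path("/tmp/excess-replica-repro");  // placeholder path
>
>     // Step 1 + 2: create the file with replication factor 2 and write data
>     FSDataOutputStream os = fs.create(path, (short) 2);
>     os.write(new byte[64 * 1024]);
>     os.close();
>
>     // Step 3: drop the replication factor to 1, then wait for the NameNode
>     // to schedule the excess-replica deletion and the DataNode to act on it
>     fs.setReplication(path, (short) 1);
>     Thread.sleep(30 * 1000);  // crude wait, for illustration only
>
>     // Step 4: read the file back; the client may still be handed the
>     // location of the replica that was just pruned
>     FSDataInputStream is = fs.open(path);
>     byte[] buf = new byte[64 * 1024];
>     is.readFully(0, buf);
>     is.close();
>     fs.close();
>   }
> }
> {code}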
> During the read in _Step 4_, the {{DFSInputStream}} client receives "stale"
> block locations from the NameNode. Specifically, it receives block locations
> that the NameNode has already pruned/invalidated (and the DataNodes have
> already deleted).
> The net effect of this is unnecessary churn in the {{DFSClient}} (timeouts,
> retries, extra RPCs, etc.). In particular:
> {noformat}
> WARN hdfs.DFSClient - Failed to connect to datanode-1 for block, add to
> deadNodes and continue.
> {noformat}
> Blacklisting DataNodes that are, in fact, functioning properly can lead to
> poor read locality. Since the blacklist is _cumulative_ across all blocks in
> the file, this can have a noticeable impact for large files.
> A pathological case can occur when *all* block locations are in the
> blacklist. In this case, the {{DFSInputStream}} will sleep and refetch
> locations from the NameNode, causing unnecessary RPCs and a client-side
> sleep:
> {noformat}
> INFO hdfs.DFSClient - Could not obtain blk_1073741826_1002 from any node:
> java.io.IOException: No live nodes contain current block. Will get new block
> locations from namenode and retry...
> {noformat}
> This pathological case can occur in the following example (for a read of file
> {{foo}}):
> # {{DFSInputStream}} attempts to read block 1 of {{foo}}.
> # Gets locations: {{( dn1(stale), dn2 )}}
> # Attempts read from {{dn1}}. Fails. Adds {{dn1}} to blacklist.
> # {{DFSInputStream}} attempts to read block 2 of {{foo}}.
> # Gets locations: {{( dn1, dn2(stale) )}}
> # Attempts read from {{dn2}} ({{dn1}} already blacklisted). Fails. Adds
> {{dn2}} to blacklist.
> # All locations for block 2 are now in blacklist.
> # Clears blacklists
> # Sleeps up to 3 seconds
> # Refetches locations from the NameNode
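> In other words, the client-side behavior amounts to the following retry
> pattern (a simplified illustration only; the type and method names below are
> invented for this sketch and are not the actual {{DFSInputStream}}
> internals):
> {code}
> import java.util.HashSet;
> import java.util.List;
> import java.util.Set;
>
> // Simplified illustration of the read / blacklist / refetch loop described
> // above.  BlockLocationSource, DataNodeReader, etc. are stand-ins, not real
> // HDFS classes.
> class BlockReadLoop {
>   interface BlockLocationSource {
>     List<String> getLocations(long blockId);        // RPC to the NameNode
>   }
>   interface DataNodeReader {
>     boolean tryRead(String datanode, long blockId); // false == read failure
>   }
>
>   // The blacklist is cumulative across all blocks of the file being read.
>   private final Set<String> deadNodes = new HashSet<String>();
>
>   void readBlock(long blockId, BlockLocationSource nn, DataNodeReader dn)
>       throws InterruptedException {
>     while (true) {
>       for (String location : nn.getLocations(blockId)) {
>         if (deadNodes.contains(location)) {
>           continue;                  // skip blacklisted DataNodes
>         }
>         if (dn.tryRead(location, blockId)) {
>           return;                    // success
>         }
>         // A stale location fails here even though the DataNode is healthy,
>         // and the healthy node stays blacklisted for the rest of the read.
>         deadNodes.add(location);
>       }
>       // Every location is blacklisted: clear the list, sleep, and refetch
>       // locations from the NameNode (the extra RPC and delay noted above).
>       deadNodes.clear();
>       Thread.sleep(3000);
>     }
>   }
> }
> {code}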
> A solution would be to change the NameNode so that it does not return block
> locations to clients for replicas that it has already asked DataNodes to
> invalidate.
> A quick look at the {{BlockManager.chooseExcessReplicates()}} code path seems
> to indicate that the NameNode does not actually remove the pruned replica
> from the {{BlocksMap}} until the subsequent blockReport is received. This can
> leave a substantial window where the NameNode can return stale replica
> locations to clients.
> If the NameNode were to proactively update the {{BlocksMap}} upon excess
> replica pruning, this situation could be avoided. If the DataNode did not in
> fact invalidate the replica as asked, the NameNode would simply re-add the
> replica to the {{BlocksMap}} upon the next blockReport and go through the
> pruning exercise again.
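> A rough sketch of that idea follows (the names here are illustrative
> stand-ins for the NameNode's bookkeeping, not a patch against the actual
> {{BlockManager}}):
> {code}
> import java.util.ArrayList;
> import java.util.Collection;
> import java.util.List;
>
> // Illustration of the proposed behavior: when assembling block locations for
> // a client, drop replicas that the NameNode has already asked a DataNode to
> // invalidate.  PendingInvalidations is a stand-in for the NameNode's
> // excess/invalidate bookkeeping.
> class LocationFilter {
>   interface PendingInvalidations {
>     boolean isScheduledForDeletion(String datanodeId, long blockId);
>   }
>
>   List<String> filterLocations(long blockId,
>                                Collection<String> allReplicaLocations,
>                                PendingInvalidations pending) {
>     List<String> live = new ArrayList<String>();
>     for (String datanodeId : allReplicaLocations) {
>       if (!pending.isScheduledForDeletion(datanodeId, blockId)) {
>         live.add(datanodeId);  // only report replicas not slated for pruning
>       }
>     }
>     return live;
>   }
> }
> {code}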
--
This message was sent by Atlassian JIRA
(v6.1#6144)