Eric Sirianni created HDFS-5380:
-----------------------------------
Summary: NameNode returns stale block locations to clients during excess replica pruning
Key: HDFS-5380
URL: https://issues.apache.org/jira/browse/HDFS-5380
Project: Hadoop HDFS
Issue Type: Bug
Components: namenode
Affects Versions: 1.2.1, 2.0.0-alpha
Reporter: Eric Sirianni
Priority: Minor
Consider the following contrived example:
{code}
// Step 1: Create file with replication factor = 2
Path path = ...;
short replication = 2;
OutputStream os = fs.create(path, ..., replication, ...);
// Step 2: Write to file
os.write(...);
// Step 3: Reduce replication factor to 1
fs.setReplication(path, 1);
// Wait for the namenode to prune excess replicas
// Step 4: Read from file
InputStream is = fs.open(path);
is.read(...);
{code}
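For reference, a fleshed-out version of the repro (a minimal sketch only; the path, write size, and sleep-based wait for pruning are illustrative assumptions, not part of the report above):
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StaleLocationRepro {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/tmp/stale-location-repro");  // hypothetical path

    // Step 1 & 2: create the file with replication factor 2 and write some data
    FSDataOutputStream os = fs.create(path, (short) 2);
    byte[] data = new byte[8 * 1024 * 1024];
    os.write(data);
    os.close();

    // Step 3: reduce the replication factor to 1 and give the NameNode time
    // to schedule/prune the excess replica (crude sleep, for illustration only)
    fs.setReplication(path, (short) 1);
    Thread.sleep(60000L);

    // Step 4: read the file back; the client may be handed the pruned location
    FSDataInputStream is = fs.open(path);
    byte[] buf = new byte[4096];
    is.read(buf);
    is.close();
  }
}
{code}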
During the read in _Step 4_, the {{DFSInputStream}} client receives "stale"
block locations from the NameNode. Specifically, it receives block locations
that the NameNode has already pruned/invalidated (and the DataNodes have
already deleted).
The net effect is unnecessary churn in the {{DFSClient}} (timeouts, retries,
extra RPCs, etc.). In particular, each stale location produces a warning like:
{noformat}
WARN hdfs.DFSClient - Failed to connect to datanode-1 for block, add to
deadNodes and continue.
{noformat}
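The stale entries can also be observed directly by asking the NameNode for the file's block locations after the replication change; a small sketch using the public {{FileSystem#getFileBlockLocations}} API (the {{dumpLocations}} helper name is just for illustration, reusing the {{fs}} and {{path}} from the example above):
{code}
import java.io.IOException;
import java.util.Arrays;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

static void dumpLocations(FileSystem fs, Path path) throws IOException {
  FileStatus status = fs.getFileStatus(path);
  BlockLocation[] locations = fs.getFileBlockLocations(status, 0, status.getLen());
  for (BlockLocation location : locations) {
    // After pruning down to replication 1, each block should list one host;
    // a second host here is a stale (already invalidated) replica location.
    System.out.println(location.getOffset() + " -> "
        + Arrays.toString(location.getHosts()));
  }
}
{code}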
The blacklisting of DataNodes that are, in fact, functioning properly can lead
to poor read locality. Since the blacklist is _cumulative_ across all blocks in
the file, this can have a noticeable impact for large files.
A pathological case occurs when *all* of a block's locations end up in the
blacklist. In this case, the {{DFSInputStream}} clears the blacklist, sleeps,
and refetches locations from the NameNode, costing an extra RPC and a
client-side delay:
{noformat}
INFO hdfs.DFSClient - Could not obtain blk_1073741826_1002 from any node:
java.io.IOException: No live nodes contain current block. Will get new block
locations from namenode and retry...
{noformat}
This pathological case can occur in the following example (for a read of file
{{foo}}):
# {{DFSInputStream}} attempts to read block 1 of {{foo}}.
# Gets locations: {{( dn1(stale), dn2 )}}
# Attempts read from {{dn1}}. Fails. Adds {{dn1}} to blacklist.
# {{DFSInputStream}} attempts to read block 2 of {{foo}}.
# Gets locations: {{( dn1, dn2(stale) )}}
# Attempts read from {{dn2}} ({{dn1}} already blacklisted). Fails. Adds {{dn2}} to blacklist.
# All locations for block 2 are now in the blacklist.
# Clears the blacklist
# Sleeps up to 3 seconds
# Refetches locations from the NameNode
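The client-side behavior in the sequence above boils down to a retry loop along these lines (an illustrative sketch only; the class and the {{fetchLocations()}} / {{readFrom()}} placeholders are not the actual {{DFSInputStream}} code):
{code}
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch of the client-side retry pattern; placeholder names,
// not the actual DFSInputStream implementation.
abstract class RetryLoopSketch {
  private final Set<String> deadNodes = new HashSet<String>();  // cumulative across all blocks

  // Placeholders for the real client behavior: an RPC to the NameNode for
  // block locations, and a read from a chosen DataNode.
  abstract String[] fetchLocations(long blockId);
  abstract void readFrom(String host, long blockId) throws IOException;

  void readBlock(long blockId) throws IOException, InterruptedException {
    while (true) {
      String chosen = null;
      for (String host : fetchLocations(blockId)) {
        if (!deadNodes.contains(host)) {
          chosen = host;
          break;
        }
      }
      if (chosen != null) {
        try {
          readFrom(chosen, blockId);
          return;                    // success
        } catch (IOException e) {
          deadNodes.add(chosen);     // node blacklisted even though only the replica is gone
        }
      } else {
        // All locations blacklisted: clear the blacklist, sleep, and refetch
        // locations from the NameNode (the extra RPC and delay shown above).
        deadNodes.clear();
        Thread.sleep(3000L);
      }
    }
  }
}
{code}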
A solution would be to change the NameNode so that it does not return block
locations to clients for replicas that it has already asked DataNodes to
invalidate.
A quick look at the {{BlockManager.chooseExcessReplicates()}} code path suggests
that the NameNode does not actually remove the pruned replica from the
{{BlocksMap}} until the subsequent blockReport is received. This leaves a
substantial window during which the NameNode can return stale replica locations
to clients.
If the NameNode were to proactively update the {{BlocksMap}} upon excess
replica pruning, this situation could be avoided. If the DataNode did not in
fact invalidate the replica as asked, the NameNode would simply re-add the
replica to the {{BlocksMap}} upon the next blockReport and go through the
pruning exercise again.
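Conceptually, the change amounts to something like the following in the excess-replica pruning path (pseudocode sketch only; {{pruneExcessReplica}} and the helper calls are placeholders illustrating the idea, not a patch against the actual {{BlockManager}}):
{code}
// Pseudocode sketch of the proposed NameNode behavior (placeholder names).
void pruneExcessReplica(Block block, DatanodeDescriptor excessNode) {
  // Existing behavior: schedule the replica for deletion on the DataNode.
  addToInvalidates(block, excessNode);

  // Proposed: also drop the location from the BlocksMap immediately, so
  // getBlockLocations() stops returning this replica to clients.
  blocksMap.removeNode(block, excessNode);

  // If the DataNode never actually deletes the replica, its next blockReport
  // re-adds the location and the pruning logic simply runs again.
}
{code}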