[
https://issues.apache.org/jira/browse/HDFS-4931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13692897#comment-13692897
]
Steve Loughran commented on HDFS-4931:
--------------------------------------
I can see the benefits of this in some applications -though MR jobs aren't
necessarily among them, as scattering the blocks gives you better bandwidth. By
keeping them all on one node, the maximum read bandwidth is bounded by the
number of HDDs on that node, minus all other work going on against those disks.
If scattered, the bandwidth is bounded by the number of blocks in the file,
minus other work going on against the same blocks. To make things worse, any
other code that is trying to access another file on the same machine is going
to fight for exactly the same set of hard disks.
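To put rough numbers on that trade-off, here is a back-of-envelope sketch; the
per-disk throughput, disk count, and block count are illustrative assumptions,
not measurements:
{code:java}
// Back-of-envelope comparison of peak read bandwidth for the two policies.
// All figures are assumptions for illustration only.
public class PlacementBandwidth {
  public static void main(String[] args) {
    double mbPerSecPerDisk = 100.0;  // assumed sequential HDD throughput
    int disksPerNode = 12;           // assumed disks in one datanode
    int blocksInFile = 64;           // e.g. an 8 GB file with 128 MB blocks

    // Co-located: every block on one node, so reads are capped by that
    // node's disks (and shared with all other work on the node).
    double colocated = disksPerNode * mbPerSecPerDisk;

    // Scattered: each block can be served by a different node/disk, so
    // the ceiling grows with the block count instead.
    double scattered = blocksInFile * mbPerSecPerDisk;

    System.out.printf("co-located peak: %.0f MB/s%n", colocated); // 1200 MB/s
    System.out.printf("scattered peak:  %.0f MB/s%n", scattered); // 6400 MB/s
  }
}
{code}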
# The failure mode of the cluster will change; you should look at that
carefully.
# You aren't going to handle a full disk very well, as at that point your
placement constraints can no longer be satisfied.
# Rebalance and recovery time will increase, as now all the rebalanced blocks
are being directed to a single server, limited by both the HDD and network
bandwidth of that one device, rather than by the aggregate bandwidth of the
cluster. Assuming all three replicas of a file's blocks are stored on only 3
machines, you get hurt at both ends. As that time to recover increases, the
exposure to multiple HDD/node failures increases too (see the sketch after
this list).
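Here is a similarly rough sketch of the recovery-time point; the data volume,
node count, and write ceiling are assumptions for illustration:
{code:java}
// Illustrative re-replication times after losing one datanode's data,
// comparing a single target node against targets spread cluster-wide.
// All figures are assumptions, not measurements.
public class RecoveryTime {
  public static void main(String[] args) {
    double lostDataGb = 4096.0;        // assumed data held by the failed node
    double nodeWriteMbPerSec = 800.0;  // assumed single-node write ceiling
                                       // (disks or NIC, whichever is lower)
    int healthyNodes = 100;            // assumed cluster size

    // Grouped placement: all re-replicated blocks funnel into one node.
    double groupedSecs = lostDataGb * 1024 / nodeWriteMbPerSec;

    // Scattered placement: aggregate write bandwidth scales with node count.
    double scatteredSecs = lostDataGb * 1024 / (nodeWriteMbPerSec * healthyNodes);

    System.out.printf("grouped:   %.0f s (~%.1f h)%n",
        groupedSecs, groupedSecs / 3600);                 // ~5243 s, ~1.5 h
    System.out.printf("scattered: %.0f s%n", scatteredSecs); // ~52 s
  }
}
{code}
The longer that window, the greater the chance that a second or third failure
lands inside it.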
I think it may be an interesting experiment, but you need to start looking at
the impact of failures and at the performance problems. Overall, though, I'm
not convinced it scales well, either to large files or to large clusters -the
latter offering the IO and network bandwidth this policy would fail to exploit,
along with the highest failure rates. Normally that failure rate is background
noise, but with this placement policy it may become much more visible.
What may be more useful is revisiting Facebook's work on a sub-cluster
placement policy, where all blocks of a file are stored in the same set of
racks within a larger cluster. You get a better chance of rack locality for
multiple blocks, and when a rack fails, while some files suffer more, a lot of
files suffer less -and recovery bandwidth is restricted to a fraction of the
network, which, on a multi-layered network, may protect the backbone.
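As a rough illustration of that idea, here is a minimal sketch of per-file
rack-group placement; the class and method names are hypothetical and
deliberately much simpler than HDFS's real BlockPlacementPolicy interface:
{code:java}
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical sketch of a sub-cluster ("rack group") placement policy:
// every block of a file lands somewhere inside one fixed set of racks.
// Names and structure are illustrative, not the real HDFS interface.
public class RackGroupPolicy {
  private final List<List<String>> rackGroups;  // e.g. groups of 3 racks
  private final Map<String, Integer> fileToGroup = new HashMap<String, Integer>();
  private final Random random = new Random();

  public RackGroupPolicy(List<List<String>> rackGroups) {
    this.rackGroups = rackGroups;
  }

  // Pick a rack for a block of the given file, always from the same group,
  // so rack locality improves and a rack failure touches a bounded file set.
  public synchronized String chooseRack(String filePath) {
    Integer group = fileToGroup.get(filePath);
    if (group == null) {
      group = random.nextInt(rackGroups.size());
      fileToGroup.put(filePath, group);
    }
    List<String> racks = rackGroups.get(group);
    return racks.get(random.nextInt(racks.size()));
  }
}
{code}
Recovery traffic for any one file then stays inside its group's slice of the
network rather than crossing the backbone.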
Because it's experimental and has scale issues, I don't see a rush to commit
patches to support it unless it's backed up by the theory and the data
justifying this tactic.
> Extend the block placement policy interface to utilize the location
> information of previously stored files
> ------------------------------------------------------------------------------------------------------------
>
> Key: HDFS-4931
> URL: https://issues.apache.org/jira/browse/HDFS-4931
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: Jihoon Son
> Attachments: HDFS-4931.patch
>
>
> Currently, I'm implementing a locality-preserving block placement policy which
> stores the files of a directory on the same datanode. That is to say, given a
> root directory, files under the root directory are grouped by the paths of
> their parent directories. After that, the files of a group are stored on the
> same datanode.
> When a new file is stored in HDFS, the block placement policy chooses the
> target datanode considering the locations of previously stored files.
> There are some problems with the current block placement policy interface.
> The first problem is that there is no interface to retain the locations of
> previously stored files when HDFS is restarted. To restore the location
> information of all files, this process should be done during the safe mode of
> the namenode.
> To solve the first problem, I modified the block placement policy interface
> and FSNamesystem. Before leaving safe mode, all the necessary location
> information is sent to the block placement policy.
> However, this requires too many changes of access modifiers from private to
> public in my implementation, which may violate the design of the interface.
> The second problem occurs when some blocks are moved by the balancer or by
> node failures. In this case, the block placement policy should recognize the
> current status and return a new datanode to which the blocks should be moved.
> However, the current interface does not support this.
> The attached patch solves the first problem but, as mentioned above, it may
> violate the design of the interface.
> Do you have any good ideas?
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira