[
https://issues.apache.org/jira/browse/HDFS-12946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16650426#comment-16650426
]
Kitti Nanasi commented on HDFS-12946:
-------------------------------------
Thanks [~xiaochen] for the comments and the summary!
I agree that ClientProtocol might not be a good place for this RPC call,
since CP is responsible for far more important RPC calls than this one, and
the same is even more true for DFSClient. For the time being I created a
patch (patch v003) which removes the DFSClient changes but keeps the new RPC
in the ClientProtocol, and I also fixed the related test failures.
Having this command in fsck is a good idea; my only concern is that this new
verify command would be more general than the usual fsck checks (it cannot be
calculated per directory, since it applies to the cluster as a whole), and
that could cause some confusion.
There are some other solutions in my mind which could work as well:
* It could work like the reconfig command in DFSAdmin, which implements a
custom ReconfigurationProtocol. That is good because it doesn't use the
existing ClientProtocol, but I don't like it so much because the command
requires the address of the namenode as a parameter.
* It could be a JMX call in ECAdmin when the new command is executed. The
problem with this is that we would have to obtain the namenode's IP address
(I'm not sure how to do that in an HA setup) and fetch the verify result via
JMX.
* It could be a metric inside ECBlockGroupStats (which is already exposed on
the ClientProtocol). The problem with this is that the new metric should not
be recalculated at every invocation; it should be stored on the namenode,
like the other metrics, and then recalculated whenever a policy is enabled or
disabled, or when a datanode dies or is added. The last event would be the
most difficult to react to.
Overall I think the fsck approach is the best and simplest solution, so I
also uploaded an initial patch for it (I will add tests later).
Note that I think the return value of the verify method should contain the
result message; I plan to change that in a later patch as well.
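For reference, the rack/datanode sufficiency logic that such a verify check
would need can be sketched roughly as below. This is a hypothetical
illustration only, not code from any of the attached patches: the class
EcRackCheck and method canSupport are invented names, and it assumes
BPPRackFaultTolerant spreads a block group's data+parity units as evenly as
possible across racks, so that losing any single rack must still leave
enough units to reconstruct the data.

```java
// Hypothetical sketch (names invented for illustration): can a cluster
// topology hold one block group of a given erasure coding policy?
public class EcRackCheck {

    /**
     * Returns true if liveDataNodes nodes spread over the given number of
     * racks can hold one block group with dataUnits + parityUnits storage
     * units, assuming an even spread places at most
     * ceil(totalUnits / racks) units on any single rack.
     */
    static boolean canSupport(int dataUnits, int parityUnits,
                              int liveDataNodes, int racks) {
        if (racks <= 0 || liveDataNodes <= 0) {
            return false;
        }
        int totalUnits = dataUnits + parityUnits;
        // Scenario: not enough datanodes for the data+parity units.
        if (liveDataNodes < totalUnits) {
            return false;
        }
        // Scenario: rack fault tolerance. Losing the largest rack must
        // still leave at least dataUnits units readable, i.e. under an
        // even spread no rack may hold more than parityUnits units.
        int maxUnitsPerRack = (totalUnits + racks - 1) / racks; // ceil
        return maxUnitsPerRack <= parityUnits;
    }

    public static void main(String[] args) {
        // RS-6-3: 9 units total, needs at least 9 nodes and 3 racks.
        System.out.println(canSupport(6, 3, 9, 3)); // true
        System.out.println(canSupport(6, 3, 9, 2)); // false: too few racks
        System.out.println(canSupport(6, 3, 8, 3)); // false: too few nodes
    }
}
```

The "highly uneven racks" and considerLoad scenarios from the issue
description would need real topology input rather than simple counts, so
they are out of scope for a sketch like this.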
> Add a tool to check rack configuration against EC policies
> ----------------------------------------------------------
>
> Key: HDFS-12946
> URL: https://issues.apache.org/jira/browse/HDFS-12946
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: erasure-coding
> Reporter: Xiao Chen
> Assignee: Kitti Nanasi
> Priority: Major
> Attachments: HDFS-12946.01.patch, HDFS-12946.02.patch,
> HDFS-12946.03.patch, HDFS-12946.04.fsck.patch
>
>
> From testing we have seen setups with problematic racks / datanodes that
> would not suffice for basic EC usage. These are usually found out only
> after the tests have failed.
> We should provide a way to check this beforehand.
> Some scenarios:
> - not enough datanodes compared to the EC policy's highest data+parity
> number
> - not enough racks to satisfy BPPRackFaultTolerant
> - racks too uneven to satisfy BPPRackFaultTolerant
> - highly uneven racks (so that the BPP's considerLoad logic may exclude
> some busy nodes on the rack, resulting in #2)