[
https://issues.apache.org/jira/browse/HDFS-12946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16650426#comment-16650426
]
Kitti Nanasi commented on HDFS-12946:
-------------------------------------
Thanks [~xiaochen] for the comments and the summary!
I agree that ClientProtocol might not be a good place for this RPC call,
since CP is responsible for far more important RPC calls than this one, and
the same is even more true for DFSClient. For the time being I created a
patch (patch v003) which removes the DFSClient changes but keeps the new RPC
in the ClientProtocol, and I also fixed the related test failures.
Having this command in fsck is a good idea; my only concern is that this new
verify command would be more general than the usual fsck checks (it cannot be
calculated per directory, since it applies to the cluster as a whole), and
that could cause some confusion.
There are some other solutions in my mind which could work as well:
* It could work like the reconfig command in DFSAdmin, which implements a
custom ReconfigurationProtocol. That is good because it doesn't use the
existing ClientProtocol, but I don't like it so much because the command
requires the address of the namenode as a parameter.
* It could be a JMX call in ECAdmin when the new command is executed. The
problem with this is that we would have to obtain the namenode's IP address
(I'm not sure how to do that in an HA setup) and fetch the verify result via
JMX.
* It could be a metric inside ECBlockGroupStats (which is already exposed on
the ClientProtocol). The problem with this is that the new metric should not
be recalculated at every invocation; it should be stored on the namenode,
like the other metrics, and then recalculated whenever a policy is enabled or
disabled, or when a datanode dies or is added. The last event would be the
most difficult to react to.
Overall I think the fsck approach is the best and simplest solution, so I
also uploaded an initial patch for it (I will add tests later).
Note that I think the return value of the verify method should contain the
result message; I plan to change that in a later patch as well.
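For reference, the rack/datanode sufficiency logic that such a verify check
would need can be sketched roughly as below. This is a hypothetical
illustration only, not code from any of the attached patches: the class
EcRackCheck and method canSupport are invented names, and it assumes
BPPRackFaultTolerant spreads a block group's data+parity units as evenly as
possible across racks, so that losing any single rack must still leave
enough units to reconstruct the data.

```java
// Hypothetical sketch (names invented for illustration): can a cluster
// topology hold one block group of a given erasure coding policy?
public class EcRackCheck {

    /**
     * Returns true if liveDataNodes nodes spread over the given number of
     * racks can hold one block group with dataUnits + parityUnits storage
     * units, assuming an even spread places at most
     * ceil(totalUnits / racks) units on any single rack.
     */
    static boolean canSupport(int dataUnits, int parityUnits,
                              int liveDataNodes, int racks) {
        if (racks <= 0 || liveDataNodes <= 0) {
            return false;
        }
        int totalUnits = dataUnits + parityUnits;
        // Scenario: not enough datanodes for the data+parity units.
        if (liveDataNodes < totalUnits) {
            return false;
        }
        // Scenario: rack fault tolerance. Losing the largest rack must
        // still leave at least dataUnits units readable, i.e. under an
        // even spread no rack may hold more than parityUnits units.
        int maxUnitsPerRack = (totalUnits + racks - 1) / racks; // ceil
        return maxUnitsPerRack <= parityUnits;
    }

    public static void main(String[] args) {
        // RS-6-3: 9 units total, needs at least 9 nodes and 3 racks.
        System.out.println(canSupport(6, 3, 9, 3)); // true
        System.out.println(canSupport(6, 3, 9, 2)); // false: too few racks
        System.out.println(canSupport(6, 3, 8, 3)); // false: too few nodes
    }
}
```

The "highly uneven racks" and considerLoad scenarios from the issue
description would need real topology input rather than simple counts, so
they are out of scope for a sketch like this.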
> Add a tool to check rack configuration against EC policies
> ----------------------------------------------------------
>
> Key: HDFS-12946
> URL: https://issues.apache.org/jira/browse/HDFS-12946
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: erasure-coding
> Reporter: Xiao Chen
> Assignee: Kitti Nanasi
> Priority: Major
> Attachments: HDFS-12946.01.patch, HDFS-12946.02.patch,
> HDFS-12946.03.patch, HDFS-12946.04.fsck.patch
>
>
> From testing we have seen setups with problematic racks / datanodes that
> would not suffice for basic EC usage. These are usually found out only
> after the tests have failed.
> We should provide a way to check this beforehand.
> Some scenarios:
> - not enough datanodes compared to the EC policy's highest data+parity
> number
> - not enough racks to satisfy BPPRackFaultTolerant
> - racks too uneven to satisfy BPPRackFaultTolerant
> - highly uneven racks (so that the BPP's considerLoad logic may exclude
> some busy nodes on the rack, resulting in #2)