[ https://issues.apache.org/jira/browse/HDFS-12946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16652115#comment-16652115 ]

Andrew Wang commented on HDFS-12946:
------------------------------------
Hi folks, thanks for working on this! Catching up on the discussion, this is a
nice change and is something I've hit before too (though hopefully not
something we see too often in production).
My main question (as with most "monitoring"-type features) is about the use
case. Cluster admins want to automate their alerting and reporting. If they've
gotten to the point that they need to take some manual action (e.g. use fsck,
{{hdfs debug}}, or call this new RPC), it's because something external has
told them there is an issue. Interactive debugging tools are where I go for the
next level of detail on alerts that can't be easily automated.
In this case, it seems like most users would want to automate an alert based on
the metric, much as they do for mis-replication. The RPC isn't as useful IMO,
since it doesn't tell you anything beyond what the metric already does. I would
suggest, though, logging a WARN/ERROR when an EC policy is enabled while this
condition is true.
Are there any existing ways of querying the cluster topology and enabled EC
policies, and then computing this client-side? If not, I think this would be a
more generally useful admin interface than the very-lightweight new RPC.
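
As a rough illustration of the client-side computation asked about above, here
is a minimal sketch. Everything in it is hypothetical: {{EcTopologyCheck}},
{{Policy}} and {{Result}} are not HDFS classes, and the rack rule (at least
ceil((data + parity) / parity) racks, so that losing one rack never costs more
than the parity count) is an assumption about what "enough racks" should mean.
It takes a map of rack name to datanode count plus the enabled policies, and
does not try to detect uneven racks.

```java
import java.util.Collection;
import java.util.Map;

/** Hypothetical client-side check; none of these types exist in HDFS. */
final class EcTopologyCheck {

  /** Stand-in for an enabled EC policy: a name plus data and parity unit counts. */
  static final class Policy {
    final String name;
    final int dataUnits;
    final int parityUnits;

    Policy(String name, int dataUnits, int parityUnits) {
      if (dataUnits <= 0 || parityUnits <= 0) {
        throw new IllegalArgumentException("data and parity units must be positive");
      }
      this.name = name;
      this.dataUnits = dataUnits;
      this.parityUnits = parityUnits;
    }

    /** Number of distinct datanodes one block group needs. */
    int totalUnits() {
      return dataUnits + parityUnits;
    }

    /**
     * Assumed rule: to survive the loss of one rack, no rack may hold more
     * than parityUnits blocks of a group, so ceil(total / parity) racks are needed.
     */
    int minRacks() {
      return (totalUnits() + parityUnits - 1) / parityUnits;
    }
  }

  /** Outcome as named booleans rather than a status code. */
  static final class Result {
    final boolean enoughDataNodes;
    final boolean enoughRacks;

    Result(boolean enoughDataNodes, boolean enoughRacks) {
      this.enoughDataNodes = enoughDataNodes;
      this.enoughRacks = enoughRacks;
    }
  }

  /** Checks the topology against every enabled policy; all must be satisfied. */
  static Result check(Map<String, Integer> dataNodesPerRack,
                      Collection<Policy> enabledPolicies) {
    int dataNodes = 0;
    int racks = 0;
    for (int n : dataNodesPerRack.values()) {
      if (n > 0) {          // ignore racks that currently have no datanodes
        dataNodes += n;
        racks++;
      }
    }
    boolean enoughDataNodes = true;
    boolean enoughRacks = true;
    for (Policy p : enabledPolicies) {
      enoughDataNodes &= dataNodes >= p.totalUnits();
      enoughRacks &= racks >= p.minRacks();
    }
    return new Result(enoughDataNodes, enoughRacks);
  }
}
```

With a 6 data + 3 parity policy, three racks of three datanodes pass both
checks under these assumptions, while two racks of four datanodes fail both
(8 datanodes for 9 units, 2 racks where 3 are required).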
One code comment: I would prefer exposing some booleans in the MXBean rather
than the integer, since a bare int return type is a bit opaque. In regular code
I'd recommend using an enum or named static constants, but that doesn't work
for the MXBean.
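
To make the suggestion concrete, here is a small sketch of the shape this could
take: an enum internally, and one boolean getter per condition on the MXBean.
All names ({{EcTopologyProblem}}, {{EcTopologyStatusMXBean}},
{{EcTopologyStatus}} and the getter names) are made up for illustration and are
not taken from the patch.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical internal representation: an enum instead of a bare int. */
enum EcTopologyProblem { NOT_ENOUGH_DATANODES, NOT_ENOUGH_RACKS }

/**
 * Hypothetical management interface exposing one boolean per condition.
 * Kept package-private only so the sketch fits in one file; an interface
 * that is actually registered with JMX needs to be public.
 */
interface EcTopologyStatusMXBean {
  boolean isEnoughDataNodesForEnabledEcPolicies();
  boolean isEnoughRacksForEnabledEcPolicies();
}

/** Translates the internal enum set into the self-describing booleans. */
final class EcTopologyStatus implements EcTopologyStatusMXBean {
  private final Set<EcTopologyProblem> problems;

  EcTopologyStatus(Set<EcTopologyProblem> problems) {
    // EnumSet.copyOf rejects an empty non-EnumSet collection, so guard for it.
    this.problems = problems.isEmpty()
        ? EnumSet.noneOf(EcTopologyProblem.class)
        : EnumSet.copyOf(problems);
  }

  @Override
  public boolean isEnoughDataNodesForEnabledEcPolicies() {
    return !problems.contains(EcTopologyProblem.NOT_ENOUGH_DATANODES);
  }

  @Override
  public boolean isEnoughRacksForEnabledEcPolicies() {
    return !problems.contains(EcTopologyProblem.NOT_ENOUGH_RACKS);
  }
}
```

An alert rule can then key off an attribute whose name says what it means,
instead of having to know what each integer value stands for.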
> Add a tool to check rack configuration against EC policies
> ----------------------------------------------------------
>
> Key: HDFS-12946
> URL: https://issues.apache.org/jira/browse/HDFS-12946
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: erasure-coding
> Reporter: Xiao Chen
> Assignee: Kitti Nanasi
> Priority: Major
> Attachments: HDFS-12946.01.patch, HDFS-12946.02.patch,
> HDFS-12946.03.patch, HDFS-12946.04.fsck.patch
>
>
> From testing we have seen setups with problematic racks / datanodes that
> would not suffice for basic EC usage. These are usually found only after the
> tests have failed.
> We should provide a way to check this beforehand.
> Some scenarios:
> - not enough datanodes compared to EC policy's highest data+parity number
> - not enough racks to satisfy BPPRackFaultTolerant
> - highly uneven racks to satisfy BPPRackFaultTolerant
> - highly uneven racks (so that BPP's considerLoad logic may exclude some busy
> nodes on the rack, resulting in #2)