[ https://issues.apache.org/jira/browse/HDFS-7343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15811374#comment-15811374 ]
Wei Zhou commented on HDFS-7343:
--------------------------------
Thanks [~anu] for reviewing the design document and the great comments.
{code}
1. I would like to understand the technical trade-offs that was considered in
making this choice.
{code}
The NN already has mechanisms for collecting metrics and events from DNs, so
it is better to store this data in the NN and keep SSM stateless, as suggested
by Andrew. It also makes SSM more robust, because the data lives in the NN: if
the SSM node fails, we can simply launch another instance on another node.
{code}
2. throttle the number of times a particular rule is executed in a time window?
{code}
Yes, good suggestion. I think we can make it part of the rule (for example,
by providing a keyword). For now, it is better to keep SSM predictable: a
rule is executed whenever its condition is fulfilled. If a throttle were added
at the rule-engine level, it would be hard for users to predict when a rule
executes, which introduces uncertainty. We can implement an automatic
rule-engine-level throttle in Phase 2.
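To make the per-rule throttle idea concrete, here is a minimal sketch of a sliding-window limit that a rule could opt into through a keyword. The class name {{RuleThrottle}} and its parameters are hypothetical and are not part of the SSM design document; timestamps are passed in explicitly to keep the behavior predictable.

```python
from collections import deque


class RuleThrottle:
    """Hypothetical per-rule throttle: at most max_runs executions
    of one rule within any window of window_seconds."""

    def __init__(self, max_runs, window_seconds):
        self.max_runs = max_runs
        self.window_seconds = window_seconds
        self._runs = deque()  # timestamps of recent executions, oldest first

    def try_acquire(self, now):
        """Return True and record the run if the rule may execute at time now."""
        # Forget executions that have fallen out of the window.
        while self._runs and now - self._runs[0] >= self.window_seconds:
            self._runs.popleft()
        if len(self._runs) >= self.max_runs:
            return False
        self._runs.append(now)
        return True
```

Because the limit is declared on the rule itself, a user can still tell from the rule text when it will and will not run.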
{code}
3. Do we need to store the rules inside Namenode ?
{code}
Rules are the core of SSM. For convenience and reliability, it is better to
store them in the NN, which keeps SSM simple and stateless as suggested. Rules
are also very small (pure text), so they should never be a burden on the NN.
{code}
4. HA support
{code}
Yes, good question. We can support HA in several ways, for example by
periodically checkpointing the data to HDFS or by storing the data in the same
way as the edit log.
{code}
5. but how do you intend to protect this end point?
{code}
Yes. If the cluster has Kerberos enabled, then the web interface, consoles,
and other parts of SSM will all work with Kerberos as well.
{code}
6. How do we prevent a run-away rule?
{code}
This is a very good question.
First, we provide a verification mechanism when a rule is added. For example,
we can warn the user when the number of candidate files for an action (such
as move) exceeds a certain value.
Second, the execution state and other related info can be shown in the
dashboard or queried, so users can conveniently track the status and take
action accordingly. Implementing a timeout mechanism is also a good idea.
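As a rough illustration of the two safeguards above (a warning at rule-submission time and a timeout on execution), consider the sketch below. The function names, the threshold parameter, and the warning text are all assumptions made for this example; the actual SSM checks may look quite different.

```python
def check_rule_before_adding(candidate_files, max_candidates):
    """Hypothetical pre-check: return a warning string if the rule's action
    would touch more than max_candidates files, otherwise None."""
    count = len(candidate_files)
    if count > max_candidates:
        return "rule matches %d files, above the limit of %d" % (
            count, max_candidates)
    return None


def has_timed_out(started_at, now, timeout_seconds):
    """Hypothetical timeout check for a rule execution that is still running."""
    return now - started_at > timeout_seconds
```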
{code}
7. On the HDFS client querying SSM before writing, what happens if the SSM is
down?
{code}
Sorry for not making it clear. The client queries SSM only once, just before
creating the file; SSM does not need to participate in the write procedure.
So if the query fails, the HDFS client bypasses SSM and falls back to the
original workflow. This has almost no effect on existing I/O.
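The fallback behavior described above can be sketched roughly as follows. The {{query_ssm}} callable and the notion of a returned "policy" are placeholders invented for this example, not the real client API; the point is only that a failed query leaves the client on its normal path.

```python
def choose_storage_policy(query_ssm, path, default_policy=None):
    """Ask SSM once for a storage hint before creating a file.

    query_ssm is a hypothetical callable standing in for the client-to-SSM
    query. If SSM is unreachable, return default_policy so the caller
    continues with the original HDFS write flow.
    """
    try:
        return query_ssm(path)
    except OSError:
        # ConnectionError and TimeoutError are subclasses of OSError.
        return default_policy
```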
{code}
I would love to learn how this is working out in real world clusters.
{code}
We built some prototypes as a POC. Three typical cases were implemented, with
some simplification:
# Move data to SSD based on the access count
# Cache data based on the access count
# Archive data based on the file's age
The following chart shows the test result for the first case. The rule is "if
a file has been read more than 2 times within 10 mins, then move the file to
SSD". As we can see, the time used for reads decreases after the rule has been
executed.
!move.jpg!
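For readers who want to see how the condition of the first rule could be evaluated, here is a minimal sketch that counts reads per file in a sliding 10-minute window. The class and method names are hypothetical and this is not the prototype's code; it only illustrates the "more than 2 reads within 10 mins" condition.

```python
from collections import defaultdict, deque


class AccessCountRule:
    """Hypothetical evaluator for: if a file has been read more than
    min_reads times within window_seconds, move it to SSD."""

    def __init__(self, min_reads=2, window_seconds=600):
        self.min_reads = min_reads
        self.window_seconds = window_seconds
        self._reads = defaultdict(deque)  # path -> read timestamps

    def record_read(self, path, now):
        """Record one read; return True if the file should be moved to SSD."""
        reads = self._reads[path]
        reads.append(now)
        # Keep only the reads that are still inside the window.
        while reads and now - reads[0] > self.window_seconds:
            reads.popleft()
        # "More than 2 times" means strictly greater than min_reads.
        return len(reads) > self.min_reads
```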
{code}
I think we have accidentally omitted reference to our classic balancer here.
{code}
Yes, thanks for your reminder.
> HDFS smart storage management
> -----------------------------
>
> Key: HDFS-7343
> URL: https://issues.apache.org/jira/browse/HDFS-7343
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: Kai Zheng
> Assignee: Wei Zhou
> Attachments: HDFS-Smart-Storage-Management-update.pdf,
> HDFS-Smart-Storage-Management.pdf, move.jpg
>
>
> As discussed in HDFS-7285, it would be better to have a comprehensive and
> flexible storage policy engine that considers file attributes, metadata, data
> temperature, storage type, EC codec, available hardware capabilities,
> user/application preferences, etc.
> Modified the title to reflect the new purpose.
> We'd extend this effort a bit and aim for a comprehensive solution that
> provides a smart storage management service for convenient, intelligent, and
> effective use of erasure coding or replicas, the HDFS cache facility, HSM
> offerings, and all kinds of tools (balancer, mover, disk balancer, and so on)
> in a large cluster.