[ 
https://issues.apache.org/jira/browse/HDFS-7343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15811374#comment-15811374
 ] 

Wei Zhou commented on HDFS-7343:
--------------------------------

Thanks [~anu] for reviewing the design document and the great comments. 
{code}
1. I would like to understand the technical trade-offs that was considered in 
making this choice.
{code}
The NN already has mechanisms to collect metrics and events from DNs, so it 
is better to store this data in the NN and keep SSM stateless, as suggested 
by Andrew. It also makes SSM more robust, since the data lives in the NN: if 
the SSM node fails, we can simply launch another instance on another node.

{code}
2. throttle the number of times a particular rule is executed in a time window? 
{code}
Yes, good suggestion. I think we can make it part of the rule itself (for 
example, by providing a keyword). For now, it is better to keep SSM 
predictable: a rule gets executed when its condition is fulfilled. If 
throttling were added at the rule-engine level, it would be hard for users to 
predict when a rule executes, which introduces uncertainty. We can implement 
automatic rule-engine-level throttling in Phase 2.
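
As a rough sketch of what a throttle declared on the rule itself could look 
like: the class name, the `max_runs` and `window_s` parameters and the rule 
name below are all hypothetical and not taken from the design document. The 
point is only that the limit belongs to the rule, so the user who wrote the 
rule can still predict when it runs.

```python
from collections import deque


class ThrottledRule:
    """Hypothetical rule wrapper: at most `max_runs` executions per `window_s` seconds."""

    def __init__(self, name, max_runs, window_s):
        self.name = name
        self.max_runs = max_runs
        self.window_s = window_s
        self._runs = deque()  # timestamps of executions still inside the window

    def try_execute(self, now):
        """Return True and record the run if the throttle allows it at time `now`."""
        # Drop executions that have fallen out of the sliding window.
        while self._runs and now - self._runs[0] >= self.window_s:
            self._runs.popleft()
        if len(self._runs) >= self.max_runs:
            return False  # condition may hold, but the rule's own limit is reached
        self._runs.append(now)
        return True


# Illustrative use: a rule limited to 2 executions per 60 seconds.
rule = ThrottledRule("move-hot-files-to-ssd", max_runs=2, window_s=60)
results = [rule.try_execute(t) for t in (0, 10, 20, 61, 75)]
```

Timestamps are passed in explicitly here only to keep the sketch 
deterministic; a real implementation would read a clock.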

{code}
3. Do we need to store the rules inside Namenode ?
{code}
Rules are the core of SSM. For convenience and reliability, it is better to 
store them in the NN to keep SSM simple and stateless, as suggested. Also, a 
rule is very small (pure text), so it should never be a burden to the NN.

{code}
4. HA support
{code}
Yes, good question. We can support HA in many ways, for example by 
periodically checkpointing the data to HDFS or by storing the data in the 
same way as the edit log.

{code}
5. but how do you intend to protect this end point?
{code}
Yes, if the cluster has Kerberos enabled, then the web interface, consoles 
and other parts of SSM will all work with Kerberos as well.

{code}
6. How do we prevent a run-away rule?
{code}
This is a very good question.
First, we can provide a verification mechanism when a rule is added. For 
example, we can warn the user when the number of candidate files of an action 
(such as move) exceeds a certain value. 
Second, the execution state and other related info can also be shown in the 
dashboard or queried, so it is convenient for users to track the status and 
take action accordingly. It would also be good to implement a timeout 
mechanism.

{code}
7. On the HDFS client querying SSM before writing, what happens if the SSM is 
down?
{code}
Sorry for not making it clear. The client queries SSM only once, just before 
creating the file; SSM does not need to participate in the write procedure. 
So the HDFS client will bypass SSM when the query fails and fall back to the 
original workflow. It has almost no effect on the existing I/O.
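
The fallback can be sketched as follows. This is not the actual HDFS client 
code: `choose_storage_policy` and the `query_ssm` callable are hypothetical 
stand-ins for the single pre-create query, and returning `None` stands for 
"use the normal create path without any SSM hint".

```python
def choose_storage_policy(path, query_ssm, default_policy=None):
    """Ask SSM once for a policy hint; fall back to the default if SSM is unreachable."""
    try:
        return query_ssm(path)
    except OSError:
        # ConnectionError and TimeoutError are subclasses of OSError, so an
        # unreachable or slow SSM simply means the client proceeds as before.
        return default_policy


def ssm_down(path):
    # Simulates an SSM instance that cannot be reached.
    raise ConnectionError("SSM is not reachable")


policy_when_up = choose_storage_policy("/data/f", lambda p: "ALL_SSD")
policy_when_down = choose_storage_policy("/data/f", ssm_down)
```

Because the query happens once and before the file is created, a failure 
costs the client one failed call and nothing in the write pipeline itself.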

{code}
I would love to learn how this is working out in real world clusters.
{code}
We built some prototypes as a POC. Three typical cases were implemented, with 
some simplification:
# Move data to SSD based on the access count
# Cache data based on the access count
# Archive data based on file's age

The following chart shows the test result for the first case. The rule is "if 
a file has been read more than 2 times within 10 mins, then move the file to 
SSD". As we can see, the time used for reads decreases after the rule has 
been executed.
!move.jpg!
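
For illustration only, the condition of that rule can be sketched as a 
sliding-window access counter. This is not the prototype code: the class 
name, method name and file path are hypothetical, and the threshold and 
window defaults just mirror the "more than 2 times within 10 mins" wording.

```python
from collections import defaultdict, deque


class AccessCountRule:
    """Hypothetical check: file read more than `threshold` times within `window_s` seconds."""

    def __init__(self, threshold=2, window_s=600):
        self.threshold = threshold
        self.window_s = window_s
        self._reads = defaultdict(deque)  # path -> timestamps of recent reads

    def record_read(self, path, now):
        """Record a read at time `now`; return True if the file now qualifies for the move."""
        reads = self._reads[path]
        reads.append(now)
        # Forget reads older than the window (the read just added always stays).
        while now - reads[0] > self.window_s:
            reads.popleft()
        return len(reads) > self.threshold


rule = AccessCountRule()
hits = [rule.record_read("/data/f", t) for t in (0, 100, 200, 900)]
```

In this sketch the third read within 10 minutes is the one that would trigger 
the move to SSD; the read at 900 seconds starts a new window.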

{code}
I think we have accidentally omitted reference to our classic balancer here.
{code}
Yes, thanks for the reminder.

> HDFS smart storage management
> -----------------------------
>
>                 Key: HDFS-7343
>                 URL: https://issues.apache.org/jira/browse/HDFS-7343
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Kai Zheng
>            Assignee: Wei Zhou
>         Attachments: HDFS-Smart-Storage-Management-update.pdf, 
> HDFS-Smart-Storage-Management.pdf, move.jpg
>
>
> As discussed in HDFS-7285, it would be better to have a comprehensive and 
> flexible storage policy engine considering file attributes, metadata, data 
> temperature, storage type, EC codec, available hardware capabilities, 
> user/application preference and etc.
> Modified the title for re-purpose.
> We'd extend this effort some bit and aim to work on a comprehensive solution 
> to provide smart storage management service in order for convenient, 
> intelligent and effective utilizing of erasure coding or replicas, HDFS cache 
> facility, HSM offering, and all kinds of tools (balancer, mover, disk 
> balancer and so on) in a large cluster.



