[ 
https://issues.apache.org/jira/browse/HDFS-7343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15805521#comment-15805521
 ] 

Anu Engineer commented on HDFS-7343:
------------------------------------

[~zhouwei] Thank you for addressing all the issues I had. The updated design 
doc looks excellent. If there are any JIRAs that you need help with, please let 
me know and I will be happy to chip in. I am almost at a +1. However, I had 
some questions and comments. Please treat the following sections as questions 
(things that I don't understand and would like to know), comments (subjective 
remarks; please feel free to ignore them), and nitpicks (completely ignorable; 
written down to avoid someone else asking the same question later).

h4. Questions

{noformat}
NameNode then stores the data into database. In this way, SSM has no need to 
maintain state checkpoints. 
{noformat}
1. I would like to understand the technical trade-offs that were considered in 
making this choice. Other applications that do this, such as Ambari, choose to 
store this data in a database maintained within the application itself. For 
SSM, however, you are choosing to store it via the NameNode. In fact, when I 
look at the architecture diagram (it is very well drawn, thank you), it looks 
trivial to keep the LevelDB on the SSM side instead of on the NameNode. So I am 
wondering what advantage we gain by maintaining more metadata on the NameNode 
side.
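To make the alternative concrete, here is a minimal sketch of an SSM-side state 
store built on the same leveldbjni bindings Hadoop already ships. The 
{{SsmStateStore}} name and the key layout are my invention, purely for 
illustration:
{code:java}
import static org.fusesource.leveldbjni.JniDBFactory.bytes;
import static org.fusesource.leveldbjni.JniDBFactory.factory;

import java.io.File;
import java.io.IOException;
import org.iq80.leveldb.DB;
import org.iq80.leveldb.Options;

/** Hypothetical SSM-local state store; no NameNode involvement. */
public class SsmStateStore implements AutoCloseable {
  private final DB db;

  public SsmStateStore(File dir) throws IOException {
    Options options = new Options();
    options.createIfMissing(true);
    // The LevelDB lives on the SSM host, so NameNode metadata stays untouched.
    db = factory.open(dir, options);
  }

  public void putFileState(String path, String state) {
    db.put(bytes("file:" + path), bytes(state));
  }

  public String getFileState(String path) {
    byte[] v = db.get(bytes("file:" + path));
    return v == null ? null : new String(v);
  }

  @Override
  public void close() throws IOException {
    db.close();
  }
}
{code}
With something like this, SSM owns its own state and checkpointing, and the 
NameNode carries no extra metadata on SSM's behalf.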

2. In the Rule Engine section, can we throttle the number of times a particular 
rule is executed within a time window? What I am trying to prevent is flapping, 
where two opposing rules trigger each other continuously; a sketch of the kind 
of guard I have in mind follows.
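For illustration only (none of this is in the design doc), the rule engine 
could keep a per-rule execution log over a sliding window and refuse to fire a 
rule that has exhausted its budget:
{code:java}
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical per-rule throttle: at most maxRuns executions per windowMillis. */
public class RuleThrottle {
  private final int maxRuns;
  private final long windowMillis;
  private final Deque<Long> recentRuns = new ArrayDeque<>();

  public RuleThrottle(int maxRuns, long windowMillis) {
    this.maxRuns = maxRuns;
    this.windowMillis = windowMillis;
  }

  /** Returns true if the rule may fire now; records the execution if so. */
  public synchronized boolean tryAcquire() {
    long now = System.currentTimeMillis();
    // Drop executions that have fallen out of the window.
    while (!recentRuns.isEmpty() && now - recentRuns.peekFirst() > windowMillis) {
      recentRuns.pollFirst();
    }
    if (recentRuns.size() >= maxRuns) {
      return false;  // flapping guard: this rule has fired too often recently
    }
    recentRuns.addLast(now);
    return true;
  }
}
{code}
Two opposing rules held to budgets like this would settle down instead of 
oscillating indefinitely.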

3. There is also a reference: {noformat} SSM also stores some data (for example, 
rules) into database through NameNode. {noformat} Do we need to store the rules 
inside the NameNode? Would it make more sense to store them in SSM itself? The 
reason I am asking is that in the future I can see this platform being 
leveraged by Hive or HBase. If that is the case, having an independent rule 
store might be more interesting than a pure HDFS one.
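What I have in mind is something like the hypothetical interface below, which 
SSM could own today and Hive or HBase could plug into later; none of these 
names come from the design doc:
{code:java}
import java.io.IOException;
import java.util.List;

/** Hypothetical rule store, independent of any one storage system. */
public interface RuleStore {
  void saveRule(String ruleId, String ruleText) throws IOException;

  String loadRule(String ruleId) throws IOException;

  List<String> listRuleIds() throws IOException;

  void deleteRule(String ruleId) throws IOException;
}
{code}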
4. {noformat} HA supporting will be considered later {noformat} I am all for 
postponing HA support, but I am not able to understand how we are going to 
store rules in the NameNode and ignore HA. Are we going to say SSM will not 
work if HA is enabled? Most clusters we see are HA-enabled. However, if we 
avoid dependencies on the NameNode, SSM might work with an HA-enabled cluster; 
I cannot see anything in SSM that inherently prevents it from working with one.
5. {noformat} SSM also exports interface (for example, through RESTful API) for 
administrator and client to manage SSM or query information.{noformat} Maybe 
something for later, but how do you intend to protect this endpoint? Is it 
going to be Kerberos? Usually I would never ask a service to consider security 
before the core modules are done, but in this case the potential for abuse is 
very high. I understand that we might not have it in the initial releases of 
SSM, but it might be good to think about it.
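For instance, if the answer turns out to be Kerberos, the stock hadoop-auth 
{{AuthenticationFilter}} could be put in front of the REST endpoint. The sketch 
below wires it into an embedded Jetty server; the port, principal, and keytab 
path are made up for illustration:
{code:java}
import java.util.EnumSet;
import javax.servlet.DispatcherType;
import org.apache.hadoop.security.authentication.server.AuthenticationFilter;
import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.servlet.FilterHolder;
import org.eclipse.jetty.servlet.ServletContextHandler;

public class SecureSsmRestServer {
  public static void main(String[] args) throws Exception {
    Server server = new Server(9871);  // hypothetical SSM REST port
    ServletContextHandler context = new ServletContextHandler();
    context.setContextPath("/");

    // SPNEGO/Kerberos via the stock hadoop-auth filter.
    FilterHolder auth = new FilterHolder(AuthenticationFilter.class);
    auth.setInitParameter("type", "kerberos");
    auth.setInitParameter("kerberos.principal", "HTTP/[email protected]");
    auth.setInitParameter("kerberos.keytab", "/etc/security/keytabs/spnego.keytab");
    context.addFilter(auth, "/*", EnumSet.of(DispatcherType.REQUEST));

    // ... register the SSM REST servlets on this context ...
    server.setHandler(context);
    server.start();
    server.join();
  }
}
{code}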

6. How do we prevent a run-away rule? Let me give you an example: recently I 
was cleaning a cluster and decided to run "rm -rf" on the data directories. 
Each datanode had more than 12 million files, and even a normal file system 
operation was taking forever. So a rule like *file.path matches "/fooA/*.dat"* 
might run forever (I am looking at you, Hive). Are you planning to provide a 
timeout on the execution of a rule, or will a rule run until it reaches the end 
of processing? If we don't have timeouts, it might be hard to honor other rules 
that want to run at a specific time. Even with multiple threads, you might not 
be able to make much progress, since most of these rules are going to run 
against the NameNode and you will have limited bandwidth to work with.
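Concretely, I am hoping for something along the lines of this sketch (names are 
hypothetical): run each rule body under an executor and cancel it when its time 
budget expires, so one pathological rule cannot starve the schedule:
{code:java}
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical bounded rule runner: one rule cannot hold the engine hostage. */
public class BoundedRuleRunner {
  private final ExecutorService pool = Executors.newFixedThreadPool(4);

  /** Returns true if the rule finished within its budget. */
  public boolean runWithTimeout(Callable<Void> ruleBody, long timeoutSec)
      throws Exception {
    Future<Void> f = pool.submit(ruleBody);
    try {
      f.get(timeoutSec, TimeUnit.SECONDS);
      return true;
    } catch (TimeoutException e) {
      f.cancel(true);  // interrupt the scan; reschedule the rule for later
      return false;
    }
  }
}
{code}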

7. On the HDFS client querying SSM before writing: what happens if SSM is down? 
Will the client wait and retry, potentially making I/O slower, and eventually 
bypass SSM? Have you considered using the Storage Policy Satisfier 
(HDFS-10285)? Even if SSM is down, or a client does not talk to SSM, we could 
rely on SPS to move the data to the right location; some of your storage 
manager functionality can leverage what is being done there. So can you please 
clarify how you will handle clients that do not talk to SSM, and what happens 
to I/O when SSM is down?
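What I would expect on the client side is a hedge like the following (all names 
hypothetical): ask SSM with a short deadline, and on any failure fall back to a 
plain write, letting SPS converge the placement later:
{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Hypothetical client-side hedge around SSM advice. */
public class SsmAdvisor {
  private final ExecutorService pool = Executors.newSingleThreadExecutor();

  /** Ask SSM for a storage policy; fall back to the default if SSM is slow or down. */
  public String adviseOrDefault(String path, long timeoutMs, String defaultPolicy) {
    Future<String> advice = pool.submit(() -> querySsm(path));
    try {
      return advice.get(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (Exception e) {
      advice.cancel(true);
      // SSM unreachable: write with the default policy and let SPS
      // (HDFS-10285) fix the placement asynchronously.
      return defaultPolicy;
    }
  }

  private String querySsm(String path) {
    throw new UnsupportedOperationException("placeholder for the SSM RPC");
  }
}
{code}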

h4. Comments 

1. I still share some of the concerns voiced by [~andrew.wang]. It is going to 
be challenging to create a set of static rules for the changing conditions of a 
cluster, especially when workloads differ. But sometimes we learn surprising 
things by doing rather than talking, and I would love to learn how this works 
out in real-world clusters. If you have any data to share, I would appreciate 
it.
h4. Nitpick
{noformat}
 HSM, Cache, SPS, DataNode Disk Balancer(HDFS-1312) and EC to do the actual 
data manipulation work.
{noformat}
I think we have accidentally omitted a reference to our classic Balancer here.






> HDFS smart storage management
> -----------------------------
>
>                 Key: HDFS-7343
>                 URL: https://issues.apache.org/jira/browse/HDFS-7343
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Kai Zheng
>            Assignee: Wei Zhou
>         Attachments: HDFS-Smart-Storage-Management-update.pdf, 
> HDFS-Smart-Storage-Management.pdf
>
>
> As discussed in HDFS-7285, it would be better to have a comprehensive and 
> flexible storage policy engine considering file attributes, metadata, data 
> temperature, storage type, EC codec, available hardware capabilities, 
> user/application preference and etc.
> Modified the title for re-purpose.
> We'd extend this effort some bit and aim to work on a comprehensive solution 
> to provide smart storage management service in order for convenient, 
> intelligent and effective utilizing of erasure coding or replicas, HDFS cache 
> facility, HSM offering, and all kinds of tools (balancer, mover, disk 
> balancer and so on) in a large cluster.


