[ 
https://issues.apache.org/jira/browse/HDFS-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16106283#comment-16106283
 ] 

Uma Maheswara Rao G commented on HDFS-10285:
--------------------------------------------

[~andrew.wang] Thanks a lot, Andrew, for spending time on the review and for the 
very valuable comments.

Please find my replies to the comments/questions.
{quote}The "-satisfyStoragePolicy" command is asynchronous. One difficulty for 
async APIs is status reporting. "-isSpsRunning" doesn't give much insight. 
How does a client track the progress of their request? How are errors 
propagated? A client like HBase can't read the NN log to find a stacktrace. 
Section 5.3 lists some possible errors for block movement on the DN. It might 
be helpful to think about NN-side errors too: out of quota, out of capacity, 
other BPP failure, slow/stuck SPS tasks, etc.
{quote}
Interesting question, and we thought about this, but it's pretty hard to 
communicate statuses back to the user. IMO, this async API is basically a 
facility for the user to trigger HDFS to start satisfying the blocks as per the 
storage policy set. For example, if we enable automatic movements in the future, 
error statuses will not be reported to users either; it is HDFS's responsibility 
to satisfy the policy as soon as possible once it has changed.
One possible way for admins to notice failures would be via metrics reporting. 
I am also thinking of providing an option in the fsck command to check the 
current pending/in-progress status. I understand this kind of status tracking 
may be useful for SSM-like systems to act upon, say raising alarm alerts etc. 
But an HBase-like system would probably not take any action in its business 
logic even if movement statuses are failures. Right now, HDFS itself will keep 
retrying until the policy is satisfied.
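To make the "metrics instead of per-request status" idea concrete, here is a 
minimal sketch (all names are hypothetical, not the actual SPS classes or 
metrics): the NN-side satisfier re-queues failed files internally and only 
exposes aggregate counters that an admin tool such as fsck or a metrics sink 
could read, while the client that called the async API sees nothing per request.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongPredicate;

// Hypothetical sketch: the NN keeps retrying files internally and only
// exposes aggregate counters to admins; clients get no per-request status.
public class SatisfierMetricsSketch {
    final AtomicLong pending = new AtomicLong();
    final AtomicLong satisfied = new AtomicLong();
    final AtomicLong retries = new AtomicLong();
    private final Queue<Long> queue = new ArrayDeque<>(); // pending inode ids

    // The async entry point: record the request and return immediately.
    public void satisfyStoragePolicy(long inodeId) {
        queue.add(inodeId);
        pending.incrementAndGet();
    }

    // One scheduling pass: files whose movement fails are re-queued rather
    // than reported back to the caller.
    public void runOnePass(LongPredicate moveSucceeds) {
        int n = queue.size();
        for (int i = 0; i < n; i++) {
            long id = queue.poll();
            if (moveSucceeds.test(id)) {
                pending.decrementAndGet();
                satisfied.incrementAndGet();
            } else {
                retries.incrementAndGet();
                queue.add(id); // HDFS keeps retrying until satisfied
            }
        }
    }
}
```

The `moveSucceeds` predicate stands in for the actual block-movement result 
coming back from the DNs.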
{quote}It might be helpful to think about NN-side errors too: out of quota, out 
of capacity, other BPP failure, slow/stuck SPS tasks, etc.{quote}
Sure, let me think about whether there are such conditions. Ideally SPS does 
not make namespace changes (except adding an xattr for internal use), but it 
does move data to different volumes on the DN. We will look at collecting 
possible metrics from the NN side as well, specifically for ERROR conditions.

{quote}Rather than using the acronym (which a user might not know), maybe 
rename "-isSpsRunning" to "-isSatisfierRunning" ?{quote}
Makes sense; we will change that.

{quote}How is leader election done for the C-DN? Is there some kind of lease 
system so an old C-DN aborts if it can't reach the NN? This prevents split 
brain.{quote}
Here we choose the C-DN loosely: we just pick the first source in the list. The 
C-DN sends back IN_PROGRESS pings every 5 minutes (via heartbeat). If no 
IN_PROGRESS ping arrives and the 
dfs.storage.policy.satisfier.self.retry.timeout.millis timeout elapses, the NN 
will simply choose another C-DN and reschedule. Even if the older C-DN comes 
back, on re-registration we send a dropSPSWork request to the DNs, which 
prevents two C-DNs from running.
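The lease-like timeout described above can be sketched as follows (class and 
method names are hypothetical; only the config key name comes from the design): 
the NN remembers the last IN_PROGRESS ping time and, once the self-retry 
timeout elapses without one, picks the next source DN as coordinator.

```java
import java.util.List;

// Hypothetical sketch of the C-DN liveness check: if no IN_PROGRESS ping
// arrives within dfs.storage.policy.satisfier.self.retry.timeout.millis,
// the NN picks the next source DN in the list and reschedules the work.
public class CoordinatorTracker {
    private final List<String> sourceDns;
    private final long timeoutMs;
    private int current = 0;          // index of the active C-DN
    private long lastInProgressMs;    // time of the last IN_PROGRESS ping

    public CoordinatorTracker(List<String> sourceDns, long timeoutMs, long nowMs) {
        this.sourceDns = sourceDns;
        this.timeoutMs = timeoutMs;
        this.lastInProgressMs = nowMs;
    }

    public String coordinator() { return sourceDns.get(current); }

    // Called when an IN_PROGRESS ping arrives via heartbeat.
    public void onInProgressPing(long nowMs) { lastInProgressMs = nowMs; }

    // Called periodically by the NN; returns true if work was rescheduled
    // onto a new C-DN because the old one went silent.
    public boolean checkTimeout(long nowMs) {
        if (nowMs - lastInProgressMs <= timeoutMs) return false;
        current = (current + 1) % sourceDns.size(); // choose another C-DN
        lastInProgressMs = nowMs;                   // restart the lease window
        return true;
    }
}
```

The dropSPSWork-on-re-registration step is what actually prevents split brain; 
this sketch only models the timeout-and-reschedule side.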

{quote}Any plans to trigger the satisfier automatically on events like rename 
or setStoragePolicy? When I explain HSM to users, they're often surprised that 
they need to trigger movement manually. Here, it's easier since it's 
Mover-as-a-service, but still manually triggered.
{quote}
Actually, this is our long-term plan. To simplify the solution, we are 
targeting manual triggering for the first phase. Once the current code base is 
performing well and is stable enough, we will work on enabling automatic 
triggering as follow-up work. To avoid losing this requirement, I will add the 
task to the follow-up JIRA.


{quote}Docs say that right now the user has to trigger SPS tasks recursively 
for a directory. Why? I believe the Mover works recursively. xiaojian is doing 
some work on HDFS-10899 that involves an efficient recursive directory 
iteration, maybe can take some ideas from there.
{quote}
IIRC, we intentionally restricted the recursive operation. We wanted to be more 
careful about NN overhead: if some user accidentally calls it on the root 
directory, it may trigger a lot of unnecessary overlapping scans.
In the Mover case, it runs outside the NN, so all scan overheads stay outside 
the NN. Here, if a user really requires recursive policy satisfaction, they can 
invoke it recursively themselves (which can't happen accidentally).
I agree that allowing recursion would make things much easier for users who 
need recursive execution; the only constraint we considered was keeping the 
operation as lightweight as possible.
If you feel recursive is fine, we are OK with enabling it.

{quote}
HDFS-10899 also stores the cursor of the iterator in the EZ root xattr to track 
progress and handle restarts. I wonder if we can do something similar here to 
avoid having an xattr per file being moved.
{quote}
Thank you for pointing out this optimization and possible solutions. We 
discussed it before in 
[HDFS-11150|https://issues.apache.org/jira/browse/HDFS-11150?focusedCommentId=15763884&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15763884].
I filed HDFS-12225 to track it.
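As a rough illustration of the cursor idea borrowed from HDFS-10899 (all names 
here are hypothetical, not the SPS implementation): instead of tagging every 
file with its own xattr, the satisfier could persist a single cursor xattr on 
the directory root recording the last inode id processed, so a restart resumes 
the scan with one xattr per tree rather than one per file.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical sketch: one cursor xattr per directory tree instead of a
// marker xattr on every file, so restarts resume from the stored cursor.
public class CursorScanSketch {
    private final TreeSet<Long> inodeIds = new TreeSet<>();
    // Stands in for the root xattr: one entry per tree, not per file.
    private final Map<String, Long> rootCursorXattr = new HashMap<>();

    public void addFile(long inodeId) { inodeIds.add(inodeId); }

    // Process up to 'batch' files past the stored cursor, then persist
    // the advanced cursor; returns how many files were processed.
    public int scanBatch(String root, int batch) {
        long cursor = rootCursorXattr.getOrDefault(root, Long.MIN_VALUE);
        int done = 0;
        for (Long id : inodeIds.tailSet(cursor, false)) {
            if (done == batch) break;
            cursor = id; // "process" the file, then advance the cursor
            done++;
        }
        rootCursorXattr.put(root, cursor);
        return done;
    }

    public int xattrCount() { return rootCursorXattr.size(); }
}
```

A crash between batches would at most reprocess the current batch, which is 
safe because satisfying an already-satisfied file is a no-op.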
{quote}
What's the NN memory overhead when I try to satisfy a directory full of files? 
A user might try to SPS a significant chunk of their data during an initial 
rollout.
{quote}
Actually, the NN will not track movement at the block level; we track it at the 
file level. The NN tracks only the inode id to be satisfied fully. Also, with 
the above optimization to avoid keeping an xattr for each file, the overhead 
should be pretty small, as the overlapping block scanning will happen 
sequentially.
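A back-of-envelope sketch of why file-level tracking matters for that rollout 
scenario (class and method names are hypothetical): state kept per inode id 
grows with the number of files, whereas a block-level scheme would multiply 
that by the blocks per file.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical comparison: NN-side tracking state per file (inode id)
// versus what a per-block scheme would have to hold for the same data.
public class TrackingOverheadSketch {
    private final Set<Long> pendingInodes = new HashSet<>();

    public void track(long inodeId) { pendingInodes.add(inodeId); }

    // Entries held under file-level tracking: one per file.
    public int fileLevelEntries() { return pendingInodes.size(); }

    // Entries a block-level scheme would hold: files * blocksPerFile.
    public long blockLevelEntries(int avgBlocksPerFile) {
        return (long) pendingInodes.size() * avgBlocksPerFile;
    }
}
```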

{quote}
Uma also mentioned some future work items in the DISCUSS email. Are these 
tracked in JIRAs?
{quote}
We have filed a separate follow-up JIRA, HDFS-12226, to track all the follow-up 
tasks mentioned in the design doc.


> Storage Policy Satisfier in Namenode
> ------------------------------------
>
>                 Key: HDFS-10285
>                 URL: https://issues.apache.org/jira/browse/HDFS-10285
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: datanode, namenode
>    Affects Versions: HDFS-10285
>            Reporter: Uma Maheswara Rao G
>            Assignee: Uma Maheswara Rao G
>         Attachments: HDFS-10285-consolidated-merge-patch-00.patch, 
> HDFS-10285-consolidated-merge-patch-01.patch, 
> HDFS-SPS-TestReport-20170708.pdf, 
> Storage-Policy-Satisfier-in-HDFS-June-20-2017.pdf, 
> Storage-Policy-Satisfier-in-HDFS-May10.pdf
>
>
> Heterogeneous storage in HDFS introduced the concept of storage policies. 
> These policies can be set on a directory/file to specify the user's 
> preference for where the physical blocks should be stored. When the user sets 
> the storage policy before writing data, the blocks can take advantage of the 
> policy preferences and be stored accordingly. 
> If the user sets the storage policy after the file has been written and 
> completed, the blocks will already have been written with the default storage 
> policy (namely DISK). The user then has to run the 'Mover tool' explicitly, 
> specifying all such file names as a list. In some distributed system 
> scenarios (ex: HBase) it would be difficult to collect all the files and run 
> the tool, as different nodes can write files separately and files can have 
> different paths.
> Another scenario: when the user renames a file from a directory with one 
> effective storage policy (inherited from the parent directory) into a 
> directory with another effective storage policy, the inherited policy is not 
> copied from the source, so the destination parent's storage policy takes 
> effect. This rename operation is just a metadata change in the Namenode; the 
> physical blocks still remain under the source storage policy.
> So, tracking all such file names across distributed nodes (ex: region 
> servers) and running the Mover tool could be difficult for admins. The 
> proposal here is to provide an API from the Namenode itself to trigger 
> storage policy satisfaction. A daemon thread inside the Namenode should track 
> such calls and send movement commands to the DNs. 
> Will post the detailed design thoughts document soon. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
