[
https://issues.apache.org/jira/browse/HDFS-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16106283#comment-16106283
]
Uma Maheswara Rao G commented on HDFS-10285:
--------------------------------------------
[~andrew.wang] Thanks a lot, Andrew, for spending time on the review and for the
very valuable comments.
Please find my replies to the comments/questions below.
{quote}The "-satisfyStoragePolicy" command is asynchronous. One difficulty for
async APIs is status reporting. "-isSpsRunning" doesn't give much insight.
How does a client track the progress of their request? How are errors
propagated? A client like HBase can't read the NN log to find a stacktrace.
Section 5.3 lists some possible errors for block movement on the DN. It might
be helpful to think about NN-side errors too: out of quota, out of capacity,
other BPP failure, slow/stuck SPS tasks, etc.
{quote}
Interesting question, and we thought about this, but it is pretty hard to
communicate statuses back to the user. IMO, this async API is basically a
facility for the user to trigger HDFS to start satisfying the blocks as per the
storage policy that is set. For example, if we enable automatic movements in the
future, error statuses will not be reported to users either; it is HDFS's
responsibility to satisfy the policy as soon as possible once it changes.
One possible way for admins to notice failures would be via metrics reporting. I
am also thinking of providing an option in the fsck command to check the current
pending/in-progress status. I understand this kind of status tracking may be
useful for SSM-like systems to act upon, say raising alarms/alerts etc. But an
HBase-like system would probably not take any action in its business logic even
if movement statuses report failures. Right now, HDFS itself will keep retrying
until the policy is satisfied.
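Just to make the shape of the API concrete, here is a minimal sketch of a client
trigger, assuming the DistributedFileSystem#satisfyStoragePolicy(Path) API
described in the design doc (the exact signature may still change on the branch):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class SatisfyTrigger {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path dir = new Path("/apps/hbase/archive"); // example path
    // Assumes fs.defaultFS points at an HDFS cluster.
    try (DistributedFileSystem dfs =
        (DistributedFileSystem) dir.getFileSystem(conf)) {
      // The satisfier only moves blocks to match the policy already set.
      dfs.setStoragePolicy(dir, "COLD");
      // Asynchronous: the NN queues the inode and schedules block moves;
      // no per-call completion or error status comes back to the caller.
      dfs.satisfyStoragePolicy(dir);
    }
  }
}
{code}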
{quote}It might be helpful to think about NN-side errors too: out of quota, out
of capacity, other BPP failure, slow/stuck SPS tasks, etc.{quote}
Sure, let me think about what the possible conditions are here. Ideally SPS does
not deal with namespace changes (except adding an Xattr for internal-use
purposes), but it does move data to different volumes on the DNs. We will look
into collecting possible metrics from the NN side as well, specifically for
ERROR conditions.
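As a strawman, that could be a metrics2 source with error counters, something
like the following (metric names here are hypothetical, nothing is committed
code yet):
{code:java}
import org.apache.hadoop.metrics2.annotation.Metric;
import org.apache.hadoop.metrics2.annotation.Metrics;
import org.apache.hadoop.metrics2.lib.DefaultMetricsSystem;
import org.apache.hadoop.metrics2.lib.MutableCounterLong;

// Hypothetical SPS error counters exposed through Hadoop metrics2.
@Metrics(name = "StoragePolicySatisfier", context = "dfs")
public class SpsMetrics {
  @Metric("Block movement attempts that failed on the DN side")
  MutableCounterLong blockMoveFailures;
  @Metric("Satisfy tasks that timed out waiting on a coordinator DN")
  MutableCounterLong coordinatorTimeouts;

  /** Registering with the metrics system instantiates the counter fields. */
  public static SpsMetrics create() {
    return DefaultMetricsSystem.instance()
        .register("StoragePolicySatisfier", "SPS metrics", new SpsMetrics());
  }
}
{code}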
{quote}Rather than using the acronym (which a user might not know), maybe
rename "-isSpsRunning" to "-isSatisfierRunning" ?{quote}
Makes sense; we will change that.
{quote}How is leader election done for the C-DN? Is there some kind of lease
system so an old C-DN aborts if it can't reach the NN? This prevents split
brain.{quote}
Here we choose the C-DN loosely: we just pick the first source in the list. The
C-DN sends back IN_PROGRESS pings every 5 minutes (via heartbeat). If no
IN_PROGRESS pings arrive and the timeout
dfs.storage.policy.satisfier.self.retry.timeout.millis elapses, then the NN will
simply choose another DN as the C-DN and reschedule the work. Even if the older
C-DN comes back, on re-registration we send a dropSPSWork request to the DNs,
which prevents two C-DNs running at once.
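For clarity, the NN-side check is roughly the following (class and method names
here are illustrative only, not the actual branch code):
{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongConsumer;

/** Illustrative sketch of the NN-side C-DN liveness check described above. */
public class SpsWorkMonitor {
  /** Tracked block-collection id -> last IN_PROGRESS report time (ms). */
  private final Map<Long, Long> lastReportTime = new ConcurrentHashMap<>();
  // Corresponds to dfs.storage.policy.satisfier.self.retry.timeout.millis;
  // the C-DN reports IN_PROGRESS roughly every 5 minutes via heartbeat.
  private final long selfRetryTimeoutMs;

  public SpsWorkMonitor(long selfRetryTimeoutMs) {
    this.selfRetryTimeoutMs = selfRetryTimeoutMs;
  }

  /** Called when a heartbeat carries an IN_PROGRESS report for this work. */
  public void onInProgressReport(long blockCollectionId) {
    lastReportTime.put(blockCollectionId, System.currentTimeMillis());
  }

  /** Periodic scan: requeue work whose C-DN has gone silent for too long. */
  public void checkTimeouts(LongConsumer rescheduleOnNewCoordinator) {
    long now = System.currentTimeMillis();
    for (Map.Entry<Long, Long> e : lastReportTime.entrySet()) {
      if (now - e.getValue() > selfRetryTimeoutMs) {
        lastReportTime.remove(e.getKey());
        // Pick another DN as coordinator; a re-registering old C-DN gets a
        // dropSPSWork command, so two coordinators never run the same work.
        rescheduleOnNewCoordinator.accept(e.getKey());
      }
    }
  }
}
{code}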
{quote}Any plans to trigger the satisfier automatically on events like rename
or setStoragePolicy? When I explain HSM to users, they're often surprised that
they need to trigger movement manually. Here, it's easier since it's
Mover-as-a-service, but still manually triggered.
{quote}
Actually, this is our long-term plan. To simplify the solution, we are targeting
manual triggering for the first phase. Once the current code base is performing
well and is stable enough, we will work on enabling automatic triggering as
follow-up work. To avoid losing this requirement, I will add the task to the
follow-up JIRA.
{quote}Docs say that right now the user has to trigger SPS tasks recursively
for a directory. Why? I believe the Mover works recursively. xiaojian is doing
some work on HDFS-10899 that involves an efficient recursive directory
iteration, maybe can take some ideas from there.
{quote}
IIRC, we intentionally restricted the recursive operation. We wanted to be more
careful about NN overheads: if some user accidentally calls it on the root
directory, it may trigger a lot of unnecessary overlapping scans.
In the Mover case, the tool runs outside the NN, so all scan overheads stay
outside the NN. So here, if a user really requires recursive policy
satisfaction, they can recurse themselves (that can't happen accidentally).
I agree that allowing recursion would make things much easier for users who
need it; the only constraint we had in mind was keeping the operation as
lightweight as possible.
If you feel recursive is fine, we are OK with enabling it.
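Until then, a user who really needs recursion can drive it from the client side,
keeping the scan cost outside the NN just as the Mover does. A minimal sketch,
assuming a non-recursive call covers the files directly under the given
directory:
{code:java}
import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class RecursiveSatisfy {
  /** Issues one satisfy call per directory in the subtree, so the
   *  directory walk (and its cost) stays on the client, not the NN. */
  static void satisfyRecursively(DistributedFileSystem dfs, Path dir)
      throws IOException {
    dfs.satisfyStoragePolicy(dir);
    for (FileStatus st : dfs.listStatus(dir)) {
      if (st.isDirectory()) {
        satisfyRecursively(dfs, st.getPath());
      }
    }
  }
}
{code}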
{quote}
HDFS-10899 also stores the cursor of the iterator in the EZ root xattr to track
progress and handle restarts. I wonder if we can do something similar here to
avoid having an xattr per file being moved.
{quote}
Thank you for pointing out this optimization and possible solutions. We
discussed it before in
[HDFS-11150|https://issues.apache.org/jira/browse/HDFS-11150?focusedCommentId=15763884&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15763884].
I have filed HDFS-12225 to track it.
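To illustrate, the cursor approach amounts to one xattr on the root of the
satisfy request instead of one per file; a rough sketch with a hypothetical
xattr name (internally this would live in a reserved namespace rather than
user.):
{code:java}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical: persist a single traversal cursor on the root of the
// satisfy request instead of marking every file with its own xattr.
public class SpsCursor {
  private static final String CURSOR_XATTR = "user.sps.cursor";

  static void saveCursor(FileSystem fs, Path root, String lastProcessedPath)
      throws IOException {
    fs.setXAttr(root, CURSOR_XATTR,
        lastProcessedPath.getBytes(StandardCharsets.UTF_8));
  }

  /** Returns the saved cursor, or null if the traversal never started. */
  static String loadCursor(FileSystem fs, Path root) throws IOException {
    byte[] raw = fs.getXAttrs(root).get(CURSOR_XATTR);
    return raw == null ? null : new String(raw, StandardCharsets.UTF_8);
  }
}
{code}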
{quote}
What's the NN memory overhead when I try to satisfy a directory full of files?
A user might try to SPS a significant chunk of their data during an initial
rollout.
{quote}
Actually, the NN will not track movement at the block level; we are tracking at
the file level. The NN tracks only the inode id of each file to be satisfied
fully. Also, with the above optimization of avoiding an Xattr per file, the
overhead should be quite small, as the overlapping block scanning will happen
sequentially.
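In other words, the NN-side bookkeeping is roughly a queue of inode ids, one
long per pending file rather than any per-block state; an illustrative sketch:
{code:java}
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Illustrative: memory grows with the number of pending files (one id
// each), not with block count; files are drained and scanned one at a
// time, which keeps the scan overhead sequential and bounded.
public class SpsPendingWork {
  private final Queue<Long> pendingInodeIds = new ConcurrentLinkedQueue<>();

  void add(long inodeId) {
    pendingInodeIds.add(inodeId);
  }

  Long pollNextToSatisfy() {
    return pendingInodeIds.poll();
  }
}
{code}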
{quote}
Uma also mentioned some future work items in the DISCUSS email. Are these
tracked in JIRAs?
{quote}
We have filed a separate follow-up JIRA, HDFS-12226, to track all the follow-up
tasks mentioned in the design doc.
> Storage Policy Satisfier in Namenode
> ------------------------------------
>
> Key: HDFS-10285
> URL: https://issues.apache.org/jira/browse/HDFS-10285
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: datanode, namenode
> Affects Versions: HDFS-10285
> Reporter: Uma Maheswara Rao G
> Assignee: Uma Maheswara Rao G
> Attachments: HDFS-10285-consolidated-merge-patch-00.patch,
> HDFS-10285-consolidated-merge-patch-01.patch,
> HDFS-SPS-TestReport-20170708.pdf,
> Storage-Policy-Satisfier-in-HDFS-June-20-2017.pdf,
> Storage-Policy-Satisfier-in-HDFS-May10.pdf
>
>
> Heterogeneous storage in HDFS introduced the concept of storage policies. These
> policies can be set on a directory/file to specify the user's preference for
> where the physical blocks should be stored. When the user sets the storage
> policy before writing data, the blocks can take advantage of the storage
> policy preference and be stored accordingly.
> If the user sets the storage policy after writing and completing the file,
> the blocks will already have been written with the default storage policy
> (nothing but DISK). The user then has to run the 'Mover tool' explicitly,
> specifying all such file names as a list. In some distributed system
> scenarios (ex: HBase) it would be difficult to collect all the files and run
> the tool, as different nodes can write files separately and the files can
> have different paths.
> Another scenario is when the user renames a file from a directory with one
> effective storage policy (inherited from the parent directory) into a
> directory with a different storage policy. The rename does not copy the
> inherited storage policy from the source, so the file takes its effective
> policy from the destination parent. This rename operation is just a metadata
> change in the Namenode; the physical blocks still remain placed per the
> source storage policy.
> So, tracking all such business-logic-driven file names from distributed
> nodes (ex: region servers) and running the Mover tool could be difficult for
> admins. The proposal here is to provide an API in the Namenode itself to
> trigger storage policy satisfaction. A daemon thread inside the Namenode
> should track such calls and dispatch movement commands to the DNs.
> Will post the detailed design thoughts document soon.