[
https://issues.apache.org/jira/browse/HDFS-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16107830#comment-16107830
]
Andrew Wang commented on HDFS-10285:
------------------------------------
Hi Uma, thanks for the replies,
bq. One possible way for admins to notice the failures would be via metrics
reporting. I am also thinking to provide an option in the fsck command to check
the current pending/in-progress status. I understand this kind of status
tracking may be useful for SSM-like systems to act upon, say raising alarms
etc. But an HBase-like system may not take any action in its business logic
even if movement statuses are failures. Right now, HDFS itself will keep
retrying until the policy is satisfied.
For the automatic use case, I agree that metrics are probably the best we can
do. However, the API exposed here is for interactive use cases (e.g. a user
calling the shell command and polling until it's done). I think we need to do
more here to expose the status.
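To make the interactive use case concrete, here is a rough sketch (Python, with entirely hypothetical names; neither `SpsStatus` nor `poll_until_done` corresponds to any real HDFS API) of the polling loop a shell client would run if the NN exposed a per-path satisfier status:

```python
import time
from enum import Enum

class SpsStatus(Enum):
    # Hypothetical status values an SPS status API could expose;
    # these are illustrative names, not actual HDFS constants.
    PENDING = "PENDING"
    IN_PROGRESS = "IN_PROGRESS"
    SUCCESS = "SUCCESS"
    FAILURE = "FAILURE"

def poll_until_done(get_status, path, interval_s=1.0, timeout_s=3600.0,
                    sleep=time.sleep):
    """Poll a status callback until the satisfier reaches a terminal state."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = get_status(path)
        if status in (SpsStatus.SUCCESS, SpsStatus.FAILURE):
            return status
        sleep(interval_s)
    raise TimeoutError(f"satisfier still running for {path}")
```

The point is just that without a queryable terminal SUCCESS/FAILURE state, a shell user has no way to write this loop at all.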
Even for the HBase use case, HBase would still want to know about the satisfier
status so it can bubble it up to an HBase admin.
bq. I agree, allowing recursive invocation will make things much easier for
users when they need recursive execution. The only constraint we considered was
keeping the operation as lightweight as possible.
Can this be addressed by throttling? I think the SPS operations aren't too
different from decommissioning, since they're both doing block placement and
tracking data movement, and the decom throttles work okay.
We've also encountered directories with millions of files before, so there's a
need for throttles anyway. Maybe we can do something generic here that can be
shared with HDFS-10899.
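The kind of throttle I have in mind is the decom-style one: cap the number of block moves in flight at any time, regardless of how large the submitted directory is. A minimal sketch (illustrative only; `MoveThrottle` is a made-up name, not an HDFS class):

```python
import threading

class MoveThrottle:
    """Caps concurrent block moves, similar in spirit to how decommissioning
    caps replication work per heartbeat. Illustrative sketch only."""

    def __init__(self, max_in_flight):
        self._sem = threading.BoundedSemaphore(max_in_flight)

    def try_start_move(self):
        # Non-blocking: if the budget is used up, the scheduler simply
        # skips this round and retries on the next pass.
        return self._sem.acquire(blocking=False)

    def finish_move(self):
        # Called when a block movement completes (success or failure).
        self._sem.release()
```

Something this generic could plausibly be shared between SPS and the HDFS-10899 re-encryption work, since both walk a namespace and dispatch bounded batches of background work.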
bq. The pain point with recursive SPS could be that... it may take a while to
finish all data movements under that directory. Meanwhile, if a user attempts
to change some policies again under some subdirectory (say /a/b) and wants to
satisfy them, we can't block him because a previous large directory execution
was in progress. Each file will have its own priority. In the re-encryption
zone case, blocking may make sense as the overall operation may finish in a
reasonable time. But SPS is data movement, and it will definitely take a while
depending on bandwidth, DN performance, etc. Sometimes ops could fail due to
network glitches, and we retry those operations.
Re-encryption will be faster than SPS, but it's not fast since it needs to talk
to the KMS. Xiao's benchmarks indicate that a re-encrypt operation will likely
run for hours. On the upside, the benchmarks also show that scanning through an
already-re-encrypted zone is quite fast (seconds). I expect it'll be similarly
fast for SPS if a user submits subdirectory or duplicate requests. It would be
good to benchmark this.
I also don't understand the aversion to FIFO execution. It reduces code
complexity and is easy for admins to reason about. If we want to do something
fancier, there should be a broader discussion about the API for resource
management. Is it fair share, priorities, limits, some combination? What are
these applied to (users, files, directories, queues with ACLs)?
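For reference, the FIFO scheme I'm picturing is trivially small, and duplicate suppression keeps repeated or overlapping requests cheap. A hypothetical sketch (the class and method names are invented for illustration and do not correspond to actual HDFS code):

```python
from collections import deque

class FifoSatisfierQueue:
    """FIFO queue of pending satisfier paths with duplicate suppression.
    Hypothetical sketch; not an actual HDFS class."""

    def __init__(self):
        self._queue = deque()
        self._pending = set()

    def submit(self, path):
        # Re-submitting an already-queued path is a no-op, so duplicate or
        # overlapping subdirectory requests cost almost nothing.
        if path in self._pending:
            return False
        self._pending.add(path)
        self._queue.append(path)
        return True

    def take(self):
        # Strict submission order: easy for admins to reason about.
        path = self._queue.popleft()
        self._pending.discard(path)
        return path
```

Anything beyond this (priorities, fair share) needs the resource-management discussion above first.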
bq. Here, even if the older C-DN comes back, on re-registration we send a
dropSPSWork request to the DNs, which will prevent two C-DNs from running.
What's the total SPS work timeout in minutes? The node is declared dead after
10.5 minutes, but if the network partition is shorter than that, it won't need
to re-register. 5 mins also seems kind of long for an IN_PROGRESS update, since
it should take a few seconds for each block movement.
Also, we can't depend on re-registration with NN for fencing the old C-DN,
since there could be a network partition that is just between the NN and old
C-DN, and the old C-DN can still talk to other DNs. I don't know how this
affects correctness, but having multiple C-DNs makes debugging harder.
bq. Actually, the NN will not track movement at the block level; we are
tracking at the file level. The NN will track only the inode id to be satisfied
fully. Also, with the above optimization to avoid keeping xattrs for each file,
the overhead should be fairly small as overlapping block scanning will happen
sequentially.
Even assuming we do the xattr optimization, I believe the NN still has a queue
of pending work items so they can be retried if the C-DNs fail. How many items
might be in this queue, for a large SPS request? Is it throttled?
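What I'd want to see is a hard bound on that queue plus a retry cap, so a huge recursive request can't grow NN memory without limit. A sketch of the shape I mean (invented names; the real NN data structures will certainly differ):

```python
from collections import deque

class BoundedRetryQueue:
    """Bounded queue of file-level pending work items with capped retries
    when the assigned C-DN fails. Sketch only, not actual NN code."""

    def __init__(self, capacity, max_retries=3):
        self._items = deque()
        self._capacity = capacity
        self._max_retries = max_retries

    def offer(self, inode_id):
        # Reject when full: the directory scan backs off and resumes later,
        # so NN memory stays bounded regardless of request size.
        if len(self._items) >= self._capacity:
            return False
        self._items.append((inode_id, 0))
        return True

    def take(self):
        return self._items.popleft()

    def retry(self, inode_id, attempts):
        # Re-enqueue work whose C-DN died, up to a retry cap.
        if attempts + 1 >= self._max_retries:
            return False
        self._items.append((inode_id, attempts + 1))
        return True
```

If the design already has bounds like these, documenting the numbers would answer most of my concern.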
At a higher-level, if we implement all the throttles to reduce NN overhead, is
there still a benefit to offloading work to DNs? The SPS workload isn't too
different from decommissioning, which we manage on the NN okay.
> Storage Policy Satisfier in Namenode
> ------------------------------------
>
> Key: HDFS-10285
> URL: https://issues.apache.org/jira/browse/HDFS-10285
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: datanode, namenode
> Affects Versions: HDFS-10285
> Reporter: Uma Maheswara Rao G
> Assignee: Uma Maheswara Rao G
> Attachments: HDFS-10285-consolidated-merge-patch-00.patch,
> HDFS-10285-consolidated-merge-patch-01.patch,
> HDFS-SPS-TestReport-20170708.pdf,
> Storage-Policy-Satisfier-in-HDFS-June-20-2017.pdf,
> Storage-Policy-Satisfier-in-HDFS-May10.pdf
>
>
> Heterogeneous storage in HDFS introduced the concept of storage policies. These
> policies can be set on a directory/file to specify the user's preference for
> where to store the physical blocks. When the user sets the storage policy
> before writing data, the blocks can take advantage of the storage policy
> preference and the physical blocks are stored accordingly.
> If the user sets the storage policy after writing and completing the file,
> the blocks would have been written with the default storage policy (i.e.
> DISK). The user has to run the ‘Mover tool’ explicitly, specifying all such
> file names as a list. In some distributed-system scenarios (e.g. HBase) it
> would be difficult to collect all the files and run the tool, as different
> nodes can write files separately and files can have different paths.
> Another scenario: when the user renames files from a directory with one
> effective storage policy (inherited from the parent directory) to a directory
> with another effective storage policy, the inherited storage policy is not
> copied from the source, so the destination file/dir takes the storage policy
> of its new parent. This rename operation is just a metadata change in the
> Namenode; the physical blocks still remain under the source storage policy.
> So, tracking all such business-logic-based file names from distributed nodes
> (e.g. region servers) and running the Mover tool could be difficult for
> admins. The proposal here is to provide an API in the Namenode itself to
> trigger storage policy satisfaction. A daemon thread inside the Namenode
> should track such calls and send movement commands to the DNs.
> Will post the detailed design thoughts document soon.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]