[ https://issues.apache.org/jira/browse/HDFS-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16107830#comment-16107830 ]

Andrew Wang commented on HDFS-10285:
------------------------------------

Hi Uma, thanks for the replies,

bq. One possible way for admins to notice the failures would be via metrics 
reporting. I am also thinking of providing an option in the fsck command to 
check the current pending/in-progress status. I understand this kind of status 
tracking may be useful for SSM-like systems to act upon, say raising alarm 
alerts etc. But an HBase-like system may not take any action in its business 
logic even if movement statuses are failures. Right now, HDFS itself will keep 
retrying until the policy is satisfied.

For the automatic use case, I agree that metrics are probably the best we can 
do. However, the API exposed here is for interactive use cases (e.g. a user 
calling the shell command and polling until it's done). I think we need to do 
more here to expose the status.

Even for the HBase use case, HBase would still want to know about satisfier 
status so it can bubble it up to an HBase admin.
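
As a rough sketch of the kind of interactive use I have in mind (the status 
query and the status values below are assumptions for illustration, not an 
existing HDFS API):

{code:java}
// Hypothetical polling loop for an interactive caller (shell command or
// HBase); getSatisfierStatus() and SatisfierStatus are made-up names.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class SatisfierStatusPoll {
  enum SatisfierStatus { PENDING, IN_PROGRESS, SUCCESS, FAILURE }

  public static void main(String[] args) throws Exception {
    Path dir = new Path("hdfs://nn:8020/hbase/data");
    DistributedFileSystem dfs =
        (DistributedFileSystem) dir.getFileSystem(new Configuration());
    SatisfierStatus status;
    do {
      Thread.sleep(10_000L);                    // poll every 10 seconds
      status = getSatisfierStatus(dfs, dir);    // hypothetical status query
      System.out.println(dir + ": " + status);
    } while (status == SatisfierStatus.PENDING
        || status == SatisfierStatus.IN_PROGRESS);
  }

  // Placeholder for whatever status RPC / fsck option ends up being exposed.
  private static SatisfierStatus getSatisfierStatus(
      DistributedFileSystem dfs, Path p) {
    return SatisfierStatus.SUCCESS;
  }
}
{code}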

bq. I agree, allowing recursive execution will make it much easier for users 
when they need it. The only constraint we had in mind was to keep the 
operation as lightweight as possible.

Can this be addressed by throttling? I think the SPS operations aren't too 
different from decommissioning, since they're both doing block placement and 
tracking data movement, and the decom throttles work okay.

We've also encountered directories with millions of files before, so there's a 
need for throttles anyway. Maybe we can do something generic here that can be 
shared with HDFS-10899.
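
Something generic could be as simple as an items-per-interval limiter that 
both features consult before scanning the next file or scheduling the next 
block move. A sketch, with made-up names and numbers:

{code:java}
// Generic throttle sketch: allow at most maxItemsPerInterval units of work
// per interval, then back off. The limits shown are hypothetical.
class IntervalThrottler {
  private final int maxItemsPerInterval;
  private final long intervalMs;
  private int itemsThisInterval = 0;
  private long intervalStartMs = System.currentTimeMillis();

  IntervalThrottler(int maxItemsPerInterval, long intervalMs) {
    this.maxItemsPerInterval = maxItemsPerInterval;
    this.intervalMs = intervalMs;
  }

  // Called once per file scanned / block movement scheduled.
  synchronized void acquire() throws InterruptedException {
    while (itemsThisInterval >= maxItemsPerInterval) {
      long elapsed = System.currentTimeMillis() - intervalStartMs;
      if (elapsed >= intervalMs) {
        itemsThisInterval = 0;            // new interval, reset the count
        intervalStartMs = System.currentTimeMillis();
      } else {
        wait(intervalMs - elapsed);       // sleep out the rest of the interval
      }
    }
    itemsThisInterval++;
  }
}
{code}

e.g. new IntervalThrottler(1000, 1000L) would cap the scan at roughly 1000 
files per second, regardless of how large the directory is.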

bq. The pain point with recursive SPS could be that... it may take a while to 
finish all data movements under that directory. Meanwhile, if a user attempts 
to change some policies again under some subdirectory (say /a/b) and wants to 
satisfy them, we can't block them because a previous large directory execution 
was in progress. Each file will have its own priority. In the re-encryption 
zone case, blocking may make sense as the overall operation may finish in a 
reasonable time. But SPS is data movement; it will definitely take a while 
depending on bandwidth, DN performance, etc. Sometimes, due to network 
glitches, ops could fail, and we retry those operations.

Re-encryption will be faster than SPS, but it's not fast since it needs to talk 
to the KMS. Xiao's benchmarks indicate that a re-encrypt operation will likely 
run for hours. On the upside, the benchmarks also show that scanning through an 
already-re-encrypted zone is quite fast (seconds). I expect it'll be similarly 
fast for SPS if a user submits subdir or duplicate requests. Would be good to 
benchmark this.

I also don't understand the aversion to FIFO execution. It reduces code 
complexity and is easy for admins to reason about. If we want to do something 
more fancy, there should be a broader question around the API for resource 
management. Is it fair share, priorities, limits, some combination? What are 
these applied to (users, files, directories, queues with ACLs)?
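
To make concrete what I mean by FIFO (purely a sketch with made-up names):

{code:java}
// FIFO sketch: satisfier requests are processed strictly in submission
// order, one directory tree at a time.
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class FifoSatisfier implements Runnable {
  private final BlockingQueue<Long> pendingInodeIds =
      new LinkedBlockingQueue<>();

  // e.g. called when a satisfy request is submitted for a file/directory.
  void submit(long inodeId) {
    pendingInodeIds.add(inodeId);
  }

  @Override
  public void run() {
    while (!Thread.currentThread().isInterrupted()) {
      try {
        long inodeId = pendingInodeIds.take();  // strict submission order
        processRecursively(inodeId);            // scan + schedule movements
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }
  }

  private void processRecursively(long inodeId) {
    // placeholder for the per-request scan and block movement scheduling
  }
}
{code}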

bq. Here, even if the older C-DN comes back, on re-registration we send a 
dropSPSWork request to the DNs, which will prevent 2 C-DNs running.

What's the total SPS work timeout in minutes? The node is declared dead after 
10.5 minutes, but if the network partition is shorter than that, it won't need 
to re-register. 5 mins also seems kind of long for an IN_PROGRESS update, since 
it should take a few seconds for each block movement.
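
For reference, the 10.5 minutes is just the usual NN liveness interval, i.e. 
2 * dfs.namenode.heartbeat.recheck-interval + 10 * dfs.heartbeat.interval 
with the defaults:

{code:java}
// Where the 10.5-minute dead-node figure comes from, using the default
// values of dfs.namenode.heartbeat.recheck-interval and dfs.heartbeat.interval.
public class DeadNodeInterval {
  public static void main(String[] args) {
    long recheckIntervalMs = 5 * 60 * 1000;   // recheck interval default: 300s
    long heartbeatIntervalSec = 3;            // heartbeat interval default: 3s
    long expireMs = 2 * recheckIntervalMs + 10 * 1000 * heartbeatIntervalSec;
    System.out.println(expireMs / 60000.0 + " minutes");  // 630000 ms = 10.5
  }
}
{code}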

Also, we can't depend on re-registration with NN for fencing the old C-DN, 
since there could be a network partition that is just between the NN and old 
C-DN, and the old C-DN can still talk to other DNs. I don't know how this 
affects correctness, but having multiple C-DNs makes debugging harder.

bq. Actually, the NN will not track movement at the block level. We are 
tracking at the file level; the NN will only track the inode id until it is 
fully satisfied. Also, with the above optimization (avoiding keeping xattrs 
for each file), the overhead should be pretty small as overlapping block 
scanning will happen sequentially.

Even assuming we do the xattr optimization, I believe the NN still has a queue 
of pending work items so they can be retried if the C-DNs fail. How many items 
might be in this queue, for a large SPS request? Is it throttled?
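
To illustrate the kind of bound I'm asking about (the cap and names are made 
up):

{code:java}
// Sketch of a bounded NN-side pending-work queue: the directory scanner
// backs off instead of letting a huge recursive request pin millions of
// entries in memory. The 100k cap is hypothetical.
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class BoundedPendingWork {
  // One entry per inode still waiting on block movement (or retry).
  private final BlockingQueue<Long> pendingInodeIds =
      new ArrayBlockingQueue<>(100_000);

  // Scanner side: blocks until the C-DNs / retry logic drain the queue.
  void add(long inodeId) throws InterruptedException {
    pendingInodeIds.put(inodeId);
  }

  // Dispatch side: next inode to hand out as movement work, or null if none.
  Long next() {
    return pendingInodeIds.poll();
  }
}
{code}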

At a higher-level, if we implement all the throttles to reduce NN overhead, is 
there still a benefit to offloading work to DNs? The SPS workload isn't too 
different from decommissioning, which we manage on the NN okay.

> Storage Policy Satisfier in Namenode
> ------------------------------------
>
>                 Key: HDFS-10285
>                 URL: https://issues.apache.org/jira/browse/HDFS-10285
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: datanode, namenode
>    Affects Versions: HDFS-10285
>            Reporter: Uma Maheswara Rao G
>            Assignee: Uma Maheswara Rao G
>         Attachments: HDFS-10285-consolidated-merge-patch-00.patch, 
> HDFS-10285-consolidated-merge-patch-01.patch, 
> HDFS-SPS-TestReport-20170708.pdf, 
> Storage-Policy-Satisfier-in-HDFS-June-20-2017.pdf, 
> Storage-Policy-Satisfier-in-HDFS-May10.pdf
>
>
> Heterogeneous storage in HDFS introduced the concept of storage policies. 
> These policies can be set on a directory/file to specify the user's 
> preference for where the physical blocks should be stored. When the user 
> sets the storage policy before writing data, the blocks can take advantage 
> of the storage policy preferences and the physical blocks are stored 
> accordingly. 
> If the user sets the storage policy after writing and completing the file, 
> then the blocks would have been written with the default storage policy 
> (namely DISK). The user has to run the ‘Mover tool’ explicitly, specifying 
> all such file names as a list. In some distributed system scenarios (e.g. 
> HBase) it would be difficult to collect all the files and run the tool, as 
> different nodes can write files separately and the files can have different 
> paths.
> Another scenario is when the user renames a file from a directory with one 
> effective storage policy (inherited from the parent directory) into a 
> directory with a different storage policy. The inherited storage policy is 
> not copied from the source, so the file takes its effective policy from the 
> destination parent directory. This rename operation is just a metadata 
> change in the Namenode; the physical blocks still remain on the storage 
> dictated by the source policy.
> So, tracking all such business-logic-based file names from distributed 
> nodes (e.g. region servers) and running the Mover tool could be difficult 
> for admins. The proposal here is to provide an API in the Namenode itself 
> to trigger storage policy satisfaction. A daemon thread inside the Namenode 
> should track such calls and send movement commands to the DNs. 
> Will post the detailed design thoughts document soon. 
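
For context, the gap described above in code form: setStoragePolicy() is a 
metadata-only change, so blocks written earlier (or moved under a new parent 
by rename) stay on their old storage. The satisfyStoragePolicy() call is the 
proposed trigger from this JIRA, shown here only as a sketch of the intended 
usage, not a shipped API.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class SatisfyExample {
  public static void main(String[] args) throws Exception {
    Path dir = new Path("hdfs://nn:8020/hbase/data/archive");
    DistributedFileSystem dfs =
        (DistributedFileSystem) dir.getFileSystem(new Configuration());
    dfs.setStoragePolicy(dir, "COLD");   // existing API: metadata change only
    // Proposed: ask the NN daemon to schedule the physical block movements,
    // instead of collecting paths and running the Mover tool explicitly.
    // dfs.satisfyStoragePolicy(dir);
  }
}
{code}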


