[
https://issues.apache.org/jira/browse/HDFS-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16108182#comment-16108182
]
Uma Maheswara Rao G commented on HDFS-10285:
--------------------------------------------
Hi [~andrew.wang], thank you so much for the thorough review.
Please find my replies below.
{quote}
For the automatic usecase, I agree that metrics are probably the best we can
do. However, the API exposed here is for interactive usecases (e.g. a user
calling the shell command and polling until it's done). I think we need to do
more here to expose the status.
Even for the HBase usecase, it'd still want to know about satisfier status so
it can bubble it up to an HBase admin.
{quote}
We have already filed HDFS-12228 for this.
Sure, we will think more about the status reporting part, and I will file a ticket to
track it as well. A quick question on your example above, "a user calling the shell
command and polling until it's done": do you mean the command should block and poll
internally, or that the user would call a status check periodically? And how long
should the server hold the status?
{quote}
Can this be addressed by throttling? I think the SPS operations aren't too
different from decommissioning, since they're both doing block placement and
tracking data movement, and the decom throttles work okay.
We've also encountered directories with millions of files before, so there's a
need for throttles anyway. Maybe we can do something generic here that can be
shared with HDFS-10899.
Re-encryption will be faster than SPS, but it's not fast since it needs to talk
to the KMS. Xiao's benchmarks indicate that a re-encrypt operation will likely
run for hours. On the upside, the benchmarks also show that scanning through an
already-re-encrypted zone is quite fast (seconds). I expect it'll be similarly
fast for SPS if a user submits subdir or duplicate requests. Would be good to
benchmark this.
I also don't understand the aversion to FIFO execution. It reduces code
complexity and is easy for admins to reason about. If we want to do something
more fancy, there should be a broader question around the API for resource
management. Is it fair share, priorities, limits, some combination? What are
these applied to (users, files, directories, queues with ACLs)?
{quote}
Throttling is one of the tasks we have already filed, HDFS-12227, but that focuses on
DN-level throttling; I will add a ticket to track NN-level throttling as well.
As of now, I think the FIFO model is a reasonable way to go: each directory root is the
main element picked first, and a sub-directory eventually gets the next priority if the
user calls on it while a higher-level directory is already in progress. A minimal
sketch of this ordering is below.
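Just to make the intended ordering concrete, here is a minimal sketch (hypothetical
class and method names, not the branch code), assuming each request is the path of a
directory root and a sub-directory request is deferred while an ancestor scan is still
in progress:
{code:java}
import java.util.ArrayDeque;
import java.util.Queue;

/** Minimal FIFO-ordering sketch for SPS directory requests (hypothetical names). */
public class FifoSatisfierSketch {
  private final Queue<String> pendingDirs = new ArrayDeque<>();

  /** A user (or HBase) asks to satisfy the policy for a directory tree. */
  public synchronized void addRequest(String dir) {
    pendingDirs.add(dir);
  }

  /**
   * Pick the next directory strictly in arrival order. A sub-directory of the
   * tree currently being traversed is left in place for now; the ancestor scan
   * covers it, and it is re-examined once that scan finishes.
   */
  public synchronized String nextDir(String inProgressDir) {
    String next = pendingDirs.peek();
    if (next == null) {
      return null;
    }
    if (inProgressDir != null && next.startsWith(inProgressDir + "/")) {
      return null; // defer: a higher-level directory is already in progress
    }
    return pendingDirs.poll();
  }
}
{code}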
{quote}
What's the total SPS work timeout in minutes? The node is declared dead after
10.5 minutes, but if the network partition is shorter than that, it won't need
to re-register. 5 mins also seems kind of long for an IN_PROGRESS update, since
it should take a few seconds for each block movement.
Also, we can't depend on re-registration with NN for fencing the old C-DN,
since there could be a network partition that is just between the NN and old
C-DN, and the old C-DN can still talk to other DNs. I don't know how this
affects correctness, but having multiple C-DNs makes debugging harder.
{quote}
Even if the old C-DN keeps working with other DNs to transfer blocks (this scenario
should be rare), the target DNs will accept only one copy of a block: whichever C-DN
transfers the block first wins, and the other gets a "block already exists" exception.
Since the NN tracks the file associated with that block, it has to remove its tracking
element. Example: in the worst case the old C-DN completes all movements successfully,
so the new C-DN's attempts fail and the NN receives a failure result from the new C-DN.
The NN then retries; by that time the blocks have already been satisfied by the old
C-DN, so the NN simply treats the file as finished and removes the xattr.
The IN_PROGRESS report is sent to indicate to the NN that the C-DN is still working on
the item; it is mainly a liveness signal and should be a very rare condition, as the
DN will normally transfer the blocks faster than the report interval.
Right now a file element is retried only after the self-retry timeout, and only for the
failure case where the C-DN has not reported anything at all (dead, or out of the
network); this applies only to the files assigned to that C-DN. The self-retry timeout
is currently configured as 20 minutes and can be tuned to >10 minutes. [We modeled this
configuration on PendingReplicationMonitor, which reassigns blocks to the
low-redundancy blocks list within approximately 10 minutes.] A rough sketch of this
handling follows.
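To make the retry and fencing behaviour concrete, here is a minimal sketch (hypothetical
names and structure, not the actual branch code) of how the NN could react to the
per-file results reported by a C-DN, including the case where the old C-DN has already
moved everything:
{code:java}
/** Hypothetical per-file result reported by a coordinator DN (C-DN). */
enum FileMovementResult { SUCCESS, FAILURE, IN_PROGRESS }

class SpsRetrySketch {
  /** Self-retry timeout described above (configurable; ~20 minutes today). */
  private static final long SELF_RETRY_TIMEOUT_MS = 20L * 60 * 1000;

  void handleResult(long inodeId, FileMovementResult result) {
    switch (result) {
      case SUCCESS:
        removeSpsXattr(inodeId);        // movement finished, stop tracking the file
        break;
      case IN_PROGRESS:
        touchLastReportTime(inodeId);   // liveness only; the C-DN is still working
        break;
      case FAILURE:
        // Retry the file. If the old C-DN already moved the blocks, the retry
        // finds every block on the expected storage, so the NN just removes the
        // xattr and treats the file as finished.
        requeueForRetry(inodeId);
        break;
    }
  }

  /**
   * Files whose C-DN never reported anything (dead, or partitioned away) are
   * retried only after the self-retry timeout, similar to how
   * PendingReplicationMonitor re-queues timed-out replication work.
   */
  void checkTimeout(long now, long lastReportTime, long inodeId) {
    if (now - lastReportTime > SELF_RETRY_TIMEOUT_MS) {
      requeueForRetry(inodeId);
    }
  }

  private void removeSpsXattr(long inodeId) { /* sketch only */ }
  private void touchLastReportTime(long inodeId) { /* sketch only */ }
  private void requeueForRetry(long inodeId) { /* sketch only */ }
}
{code}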
{quote}
Even assuming we do the xattr optimization, I believe the NN still has a queue
of pending work items so they can be retried if the C-DNs fail. How many items
might be in this queue, for a large SPS request? Is it throttled?
{quote}
The pending queue size depends on the number of files that the C-DNs failed to move.
The queue contains the inode IDs of files, not block IDs. So, if we are moving the data
blocks of a million files, the queue contains a million elements; no blocks are tracked
at the NN. A rough memory sketch is below.
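As a purely illustrative back-of-the-envelope (my own estimate, not a measurement):
because the queue holds only one inode ID per file, even a million-file request stays
small in NN memory.
{code:java}
import java.util.concurrent.ConcurrentLinkedQueue;

/** Sketch: the NN-side pending queue holds inode IDs only, never block IDs. */
public class PendingQueueSketch {
  private final ConcurrentLinkedQueue<Long> pendingInodeIds = new ConcurrentLinkedQueue<>();

  public void add(long inodeId) {
    pendingInodeIds.add(inodeId);
  }

  public static void main(String[] args) {
    // Assuming ~40 bytes per entry (boxed Long plus queue-node overhead),
    // one million queued files is on the order of a few tens of megabytes,
    // independent of how many blocks each file has.
    long bytes = 1_000_000L * 40;
    System.out.println(bytes / (1024 * 1024) + " MB (order of magnitude)");
  }
}
{code}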
{quote}
At a higher-level, if we implement all the throttles to reduce NN overhead, is
there still a benefit to offloading work to DNs? The SPS workload isn't too
different from decommissioning, which we manage on the NN okay.
{quote}
Our main motivation for offloading work was to avoid tracking block-level results at
the NN. Now the NN tracks only file-level results, while the C-DN tracks all
block-level movements and sends the aggregated result back. We are also thinking of
using this kind of model for converting regular files to EC, and HDFS-12090 is another
use case that wants to use it. I think keeping all of this monitoring logic in the NN
would definitely add overhead on the NN. My feeling is that we should offload as much
work as possible from the NN; a minimal sketch of the split is below.
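A minimal sketch of the intended division of labour (hypothetical interfaces, only to
illustrate the file-level vs block-level split described above):
{code:java}
import java.util.List;

/** Sketch of the split: the NN tracks files, the coordinator DN tracks blocks. */
interface NamenodeSide {
  /** The NN hands a whole file (by inode ID) to a C-DN... */
  void assignFileToCoordinator(long inodeId, String coordinatorDn);

  /** ...and later receives a single, aggregated file-level result back. */
  void onFileResult(long inodeId, boolean success);
}

interface CoordinatorDnSide {
  /**
   * The C-DN resolves the file into individual block moves, drives them against
   * the other DNs, and reports one result per file back to the NN.
   */
  boolean moveAllBlocks(long inodeId, List<String> sourceStorages,
      List<String> targetStorages);
}
{code}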
[~rakeshr] do you have any points to add?
> Storage Policy Satisfier in Namenode
> ------------------------------------
>
> Key: HDFS-10285
> URL: https://issues.apache.org/jira/browse/HDFS-10285
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: datanode, namenode
> Affects Versions: HDFS-10285
> Reporter: Uma Maheswara Rao G
> Assignee: Uma Maheswara Rao G
> Attachments: HDFS-10285-consolidated-merge-patch-00.patch,
> HDFS-10285-consolidated-merge-patch-01.patch,
> HDFS-SPS-TestReport-20170708.pdf,
> Storage-Policy-Satisfier-in-HDFS-June-20-2017.pdf,
> Storage-Policy-Satisfier-in-HDFS-May10.pdf
>
>
> Heterogeneous storage in HDFS introduced the concept of storage policies. These
> policies can be set on a directory/file to specify the user's preference for where
> the physical blocks should be stored. When the user sets the storage policy before
> writing data, the blocks can take advantage of the policy preferences and are placed
> on the preferred storage accordingly.
> If the user sets the storage policy after the file has been written and closed, the
> blocks will already have been written with the default storage policy (i.e. DISK).
> The user then has to run the 'Mover' tool explicitly, specifying all such file names
> as a list. In some distributed-system scenarios (e.g. HBase) it would be difficult to
> collect all the files and run the tool, as different nodes can write files
> independently and the files can have different paths.
> Another scenario is rename: when the user renames a file from a directory with one
> storage policy (inherited from the parent directory) into a directory with a
> different storage policy, the inherited policy is not copied from the source, so the
> file now takes its policy from the destination parent directory. This rename is just
> a metadata change in the Namenode; the physical blocks still remain placed according
> to the source storage policy.
> So, tracking all such business-logic-driven file names across distributed nodes
> (e.g. region servers) and running the Mover tool could be difficult for admins.
> The proposal here is to provide an API in the Namenode itself to trigger storage
> policy satisfaction. A daemon thread inside the Namenode would track such calls and
> send movement commands to the DNs.
> Will post the detailed design thoughts document soon.
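To illustrate the gap described above, here is a hedged sketch using the existing
FileSystem API; the path and policy are made up, and the satisfyStoragePolicy call is
only the proposed addition, not an API that exists at the time of this comment:
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SpsUsageSketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path dir = new Path("/hbase/archive");   // illustrative path only

    // Changing the policy after the data was written only updates metadata;
    // the existing blocks stay on their current storage (e.g. DISK).
    fs.setStoragePolicy(dir, "COLD");

    // Today the admin must collect the affected paths and run the Mover tool.
    // The proposal is an NN-side trigger along these lines (proposed API only):
    // fs.satisfyStoragePolicy(dir);
  }
}
{code}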