[ https://issues.apache.org/jira/browse/HDFS-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16284692#comment-16284692 ]

Uma Maheswara Rao G edited comment on HDFS-10285 at 12/9/17 10:33 AM:
----------------------------------------------------------------------

{quote}
Conceptually, blocks violating the HSM policy are a form of mis-replication 
that doesn't satisfy a placement policy – which would truly prevent performance 
issues if the feature isn't needed. The NN's repl monitor ignorantly handles 
the moves as a low priority transfer (if/since it's sufficiently replicated). 
The changes to the NN are minimalistic.
DNs need to support/honor storages in transfer requests. Transfers to itself 
become moves. Now HSM "just works", eventually, similar to increasing the repl 
factor.
{quote}
Thank you for the proposal. The scoping seems reasonable to me, if I understand it correctly.
To be clear, the current SPS is intended only to satisfy the basic HSM feature.

{quote}
An external SPS can provide fancier policies for accelerating the processing 
for those users like hbase.
{quote}
The fancier policy implementation proposals belong to HDFS-7343 (smart storage 
management), not to the SPS proposal.
 
So, if I understand correctly, you are saying that the Namenode can handle 
satisfying storages using an RM-like mechanism, the way it does for replication. 
Yes, the current SPS works in a similar fashion; the difference is how the 
movements are found when the policy changes: in this phase, SPS schedules the 
movements on the {{#satisfyStoragePolicy(path)}} API.
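To make the intended entry point concrete, here is a minimal client-side sketch. It assumes the new API eventually surfaces on {{DistributedFileSystem}} as {{satisfyStoragePolicy(Path)}}; the exact class, signature, and example path are assumptions for illustration, not the committed interface.
{code:java}
// Hypothetical client usage: change the policy on a directory, then ask the
// Namenode-side SPS to move the already-written blocks to matching storages.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class SatisfyPolicyExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path dir = new Path("/apps/hbase/archive");          // example path only
    try (DistributedFileSystem dfs =
             (DistributedFileSystem) dir.getFileSystem(conf)) {
      dfs.setStoragePolicy(dir, "COLD");                 // existing HSM API
      dfs.satisfyStoragePolicy(dir);                     // proposed SPS trigger
    }
  }
}
{code}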
 
In general, the mismatch logic could be kept in RM itself: RM would detect the 
storage policy mismatches and schedule block movement commands, just as it sends 
replication commands in the replication case. Since RM is a critical service, we 
preferred not to touch it and instead optimized a bit in SPS, considering its 
semantics. Let me explain that in a little more detail. For storage mismatches, 
block finding should happen collectively for the replica set, because the policy 
applies to a block's whole set of replicas.

Another point is how SPS collects {{to_be_storage_movement_needed_blocks}}: to 
simplify things, we plan to expose a new API where the user specifies a path, 
and internally we trigger satisfaction only for that path. To handle 
retries/cluster restarts, we persist an Xattr until SPS finishes its work; the 
Xattr overhead is kept minimal by some deduplication (1), which I will explain 
below. So, instead of loading all blocks into memory for the mismatch check, SPS 
loads blocks only when it is really checking them. When SPS is invoked to 
satisfy a path, we track only the file InodeID. In the replication case it is 
good to track at the block level, because any single replica can be missing and 
there is no need to check the other blocks in that file. In the SPS case, a 
policy change applies to all blocks in the file, so it makes sense to track just 
the file ID in the queues. The general usage would be that the directories on 
which the user changes the storage policy qualify for SPS processing. The 
recommendation is to set storage policies as optimally as possible: it may be 
more efficient to set them on directories instead of on individual files unless 
that is really necessary, which also avoids a larger number of Xattrs in HSM.
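A rough sketch of that bookkeeping, with hypothetical class and method names (this is not the actual SPS code, just an illustration of tracking inode IDs rather than blocks and persisting a marker Xattr until the work completes):
{code:java}
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

class SpsTracker {
  // Only inode IDs of the files/directories to satisfy are queued;
  // no per-block state is loaded at this point.
  private final Queue<Long> pendingInodeIds = new ConcurrentLinkedQueue<>();

  void onSatisfyStoragePolicy(long inodeId) {
    markWithSpsXattr(inodeId);     // survives NN restarts; removed once SPS finishes the inode
    pendingInodeIds.add(inodeId);  // blocks are resolved lazily when this ID is processed
  }

  private void markWithSpsXattr(long inodeId) {
    // Placeholder: set the shared SPS marker xattr on the inode (see (1) below).
  }
}
{code}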
 
Since SPS picks up the same directories to satisfy, in the directory Q we keep 
only a list of InodeIDs (longs) on which SPS intends to work to satisfy the 
mismatched blocks. It does not recursively load the files/blocks under that 
directory into memory immediately. The SPS thread picks elements (file Inodes) 
from an intermediate Q to process. This intermediate Q is bounded to 1000 
elements. The front-Q processor fills up the intermediate Q only when it has 
empty slots; otherwise it does not load the file Inodes under the directory. So 
we never unnecessarily load every file-Inode ID into memory.
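The two-level queue can be pictured roughly as follows (hypothetical names and simplified types; the real implementation lives inside the Namenode and walks the directory tree incrementally):
{code:java}
import java.util.ArrayDeque;
import java.util.Iterator;
import java.util.Queue;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class SpsQueues {
  private static final int INTERMEDIATE_CAPACITY = 1000;

  // Directory (or file) inode IDs the user asked to satisfy.
  private final Queue<Long> directoryQ = new ArrayDeque<>();
  // File inode IDs waiting for the actual block-level mismatch check.
  private final BlockingQueue<Long> intermediateQ =
      new ArrayBlockingQueue<>(INTERMEDIATE_CAPACITY);

  void addSatisfyRequest(long dirInodeId) {
    directoryQ.add(dirInodeId);
  }

  // Front-Q processor: expand children only into free slots, so the full
  // file listing of a large directory is never held in memory at once.
  void fillIntermediateQueue(Iterator<Long> filesUnderCurrentDir) {
    while (intermediateQ.remainingCapacity() > 0 && filesUnderCurrentDir.hasNext()) {
      intermediateQ.offer(filesUnderCurrentDir.next());
    }
    // If the queue is full we simply stop; the remaining children are
    // picked up on a later pass.
  }

  // SPS worker thread: blocks are resolved and checked only here.
  Long nextFileToProcess() throws InterruptedException {
    return intermediateQ.take();
  }
}
{code}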
"Once the mismatches identified, for the set of blocks in a file, it will add 
into Datanode descriptors as NN-to-DN commands, this is exactly same as 
ReplicationMonitor. Then DN receive this commands and move the blocks, similar 
to transfer. Same as you explained. So, conceptually the approach is exactly 
same as RM with little optimizations like throttling."
When assigning tasks to a DN, it respects the durability tasks: if replication 
tasks are already pending, they get preference, and SPS block movements are not 
assigned as high-priority tasks.
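As a hedged illustration of that scheduling preference (the interface and class names below are made up for the example, not the actual Namenode classes):
{code:java}
// Durability (replication) work already pending on a DataNode is handed out
// first; SPS block moves only consume whatever per-heartbeat budget is left.
interface DatanodeWork {
  int dispatchPendingReplicationTasks(int max);  // returns tasks actually handed out
  int dispatchPendingSpsBlockMoves(int max);
}

class HeartbeatWorkAssigner {
  int assign(DatanodeWork dn, int maxTasksPerHeartbeat) {
    int assigned = dn.dispatchPendingReplicationTasks(maxTasksPerHeartbeat);
    if (assigned < maxTasksPerHeartbeat) {
      // SPS moves never displace replication work; they fill the leftover slots.
      assigned += dn.dispatchPendingSpsBlockMoves(maxTasksPerHeartbeat - assigned);
    }
    return assigned;
  }
}
{code}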
 
*How can we keep the Xattr load minimal:*
(1) The Xattr we add simply marks a directory for SPS processing in case of 
restarts; otherwise it would be hard to scan the entire namespace to find 
mismatches.
The XAttr object consists of a NameSpace enum, a String name, and a byte[] value. 
In this case the enum and name are the same for every directory we mark, and the 
value is null, so it is effectively a constant object. We can therefore create 
just one XAttr object per NN and reuse the same object reference for all 
directories; it is essentially deduplication. It is only a matter of keeping that 
same object in each inode's xattr list, and the per-directory cost is one object 
reference to the single SPS xattr.
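A small sketch of that deduplication idea, using the public {{org.apache.hadoop.fs.XAttr}} builder; the xattr name used here is hypothetical, not the one chosen in the patch:
{code:java}
import org.apache.hadoop.fs.XAttr;

final class SpsXAttrs {
  // One shared, immutable marker instance for the whole NameNode: same
  // namespace, same name, null value for every marked directory, so each
  // inode's xattr list only stores a reference to this single object.
  static final XAttr SPS_MARKER = new XAttr.Builder()
      .setNameSpace(XAttr.NameSpace.SYSTEM)
      .setName("hdfs.sps")   // hypothetical name for illustration
      .build();

  private SpsXAttrs() {
  }
}
{code}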

*In summary, regarding your proposal:* SPS is done very much like the RM 
approach, except in how it finds the mismatches. As you said, RM is critical for 
the NN, so we kept the SPS code carefully out of the RM path, while making sure 
RM tasks get higher priority when scheduling. The second part you mentioned (the 
point about fancier policies) is not in the scope of the SPS feature; that 
belongs to HDFS-7343 (smart storage management), which is entirely about fancier 
policy implementations. It would be great to get that discussion going in that 
JIRA sooner rather than later.

{quote}
Have any benchmarks been run, particularly with the SPS disabled?
{quote}
Thank you, [~chris.douglas]. Sure, we will get these numbers to demonstrate the 
negligible/no impact on the NN.



> Storage Policy Satisfier in Namenode
> ------------------------------------
>
>                 Key: HDFS-10285
>                 URL: https://issues.apache.org/jira/browse/HDFS-10285
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: datanode, namenode
>    Affects Versions: HDFS-10285
>            Reporter: Uma Maheswara Rao G
>            Assignee: Uma Maheswara Rao G
>         Attachments: HDFS-10285-consolidated-merge-patch-00.patch, 
> HDFS-10285-consolidated-merge-patch-01.patch, 
> HDFS-10285-consolidated-merge-patch-02.patch, 
> HDFS-10285-consolidated-merge-patch-03.patch, 
> HDFS-SPS-TestReport-20170708.pdf, 
> Storage-Policy-Satisfier-in-HDFS-June-20-2017.pdf, 
> Storage-Policy-Satisfier-in-HDFS-May10.pdf, 
> Storage-Policy-Satisfier-in-HDFS-Oct-26-2017.pdf
>
>
> Heterogeneous storage in HDFS introduced the concept of storage policies. These 
> policies can be set on a directory/file to specify the user's preference for 
> where the physical blocks should be stored. When the user sets the storage 
> policy before writing data, the blocks can take advantage of the policy 
> preferences and be stored accordingly.
> If the user sets the storage policy after the file has been written and closed, 
> the blocks will already have been written with the default storage policy 
> (nothing but DISK). The user then has to run the ‘Mover tool’ explicitly, 
> specifying all such file names as a list. In some distributed-system scenarios 
> (ex: HBase) it would be difficult to collect all the files and run the tool, as 
> different nodes can write files separately and the files can have different 
> paths.
> Another scenario is when the user renames a file from a directory with one 
> effective storage policy (inherited from the parent directory) into a directory 
> with a different storage policy: the inherited storage policy is not copied from 
> the source, so the file takes its effective policy from the destination 
> file/dir's parent. This rename operation is just a metadata change in the 
> Namenode; the physical blocks still remain under the source storage policy.
> So, tracking all such business-logic-based file names from distributed nodes 
> (ex: region servers) and running the Mover tool could be difficult for admins. 
> Here the proposal is to provide an API from the Namenode itself to trigger 
> storage policy satisfaction. A daemon thread inside the Namenode should track 
> such calls and send movement commands to the DNs.
> Will post the detailed design thoughts document soon.


