[
https://issues.apache.org/jira/browse/HDFS-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16284692#comment-16284692
]
Uma Maheswara Rao G edited comment on HDFS-10285 at 12/9/17 10:33 AM:
----------------------------------------------------------------------
{quote}
Conceptually, blocks violating the HSM policy are a form of mis-replication
that doesn't satisfy a placement policy – which would truly prevent performance
issues if the feature isn't needed. The NN's repl monitor ignorantly handles
the moves as a low priority transfer (if/since it's sufficiently replicated).
The changes to the NN are minimalistic.
DNs need to support/honor storages in transfer requests. Transfers to itself
become moves. Now HSM "just works", eventually, similar to increasing the repl
factor.
{quote}
Thank you for the proposal. The scoping seems reasonable to me, if I understand
it correctly. To be clear, the current SPS is intended only to satisfy the
basic HSM feature.
{quote}
An external SPS can provide fancier policies for accelerating the processing
for those users like hbase.
{quote}
The fancier policy implementation proposals are at HDFS-7343 (smart storage
management); they are not part of the SPS proposal.
So, if I understand correctly, you are saying the Namenode can handle
satisfying storages with an RM-like mechanism, the same way it handles
replication. Yes, the current SPS works in a similar fashion, except for the
way it finds movements when the policy changes: in this phase, SPS schedules
the movements via the {{#satisfyStoragePolicy(path)}} API.
In general, the mismatch logic could live in RM. There, RM itself would detect
the storage policy mismatch and schedule the block movement command, just as it
sends replication commands in the replication case. But since RM is a critical
service, we chose not to touch it and instead optimized a bit in SPS,
considering its semantics. Let me try to explain a little more about that.
For storage mismatches, block checking should happen collectively over the
replica set, because a policy applies to the whole set of a block's replicas.
Another point is how SPS collects {{to_be_storage_movement_needed_blocks}}: to
simplify things, we plan to expose a new API where the user specifies a path
and internally we trigger satisfaction only for that path. To handle
retries/cluster restarts we save an Xattr until SPS finishes its work. The
Xattr overhead is kept minimal through some deduplication (1), which I explain
below. So, instead of loading all blocks into memory for the mismatch check,
SPS loads the blocks only when it actually checks them. When SPS is invoked to
satisfy a path, we track only the file InodeID. In the replication case, it is
good to track at the block level, because any single replica can go missing and
there is no need to check the other blocks in that file. In the SPS case, a
policy change applies to all blocks in the file, so it makes sense to track
just the file id in the queues. The general usage would be that the directories
where the user changes the storage policy are the ones qualified for SPS
processing. The recommendation is to set storage policies as optimally as
possible; it is usually more efficient to set them on directories rather than
on individual files unless really necessary. This also avoids a larger number
of Xattrs in HSM.
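To make the intended usage concrete, here is a minimal client-side sketch. It
assumes the new call is exposed on {{DistributedFileSystem}} as
{{satisfyStoragePolicy(Path)}}, matching the {{#satisfyStoragePolicy(path)}}
API mentioned above; the exact entry point and the example path are
illustrative, not final.
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class SatisfyPolicyExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    try (DistributedFileSystem dfs =
        (DistributedFileSystem) FileSystem.get(conf)) {
      Path dir = new Path("/data/archive");  // illustrative path

      // Changing the policy is a metadata-only operation; existing replicas
      // stay on their current storage types.
      dfs.setStoragePolicy(dir, "COLD");

      // Ask the NN-side SPS to move the existing replicas to match the policy.
      // Internally the NN tracks only the inode id of this path (plus an xattr
      // for restart safety) and schedules the block moves asynchronously.
      dfs.satisfyStoragePolicy(dir);
    }
  }
}
{code}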
Since SPS picks the same directories to satisfy, in the directory Q we keep
only the list of InodeIDs (longs) on which SPS intends to work for satisfying
the mismatched blocks. It does not recursively load the files/blocks under that
directory into memory immediately. The SPS thread picks elements (file Inodes
to process) from an intermediate Q whose capacity is bounded to 1000 elements.
The front-Q processor fills up the intermediate Q only when it has empty slots;
otherwise it does not load the file Inodes under the directory. So we do not
unnecessarily load every file-Inode id into memory.
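To illustrate the bounded handoff described above, here is a simplified sketch;
it is not the actual SPS classes, and the queue names and namespace-traversal
helper are made up for illustration.
{code:java}
import java.util.LinkedList;
import java.util.Queue;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Simplified sketch of the directory-Q to intermediate-Q handoff. */
class SpsQueueSketch {
  /** Directory Q: only the inode ids of paths submitted for satisfaction. */
  private final Queue<Long> directoryQ = new LinkedList<>();
  /** Intermediate Q: file inode ids, bounded so the whole tree is never held. */
  private final BlockingQueue<Long> intermediateQ = new ArrayBlockingQueue<>(1000);

  /** Front-Q processor: expand a directory only while the bounded Q has room. */
  void fillIntermediateQ() {
    while (intermediateQ.remainingCapacity() > 0 && !directoryQ.isEmpty()) {
      long dirInodeId = directoryQ.peek();
      for (long fileInodeId : nextBatchOfChildFiles(dirInodeId)) {
        if (!intermediateQ.offer(fileInodeId)) {
          return; // Q is full; this directory stays queued and resumes later.
        }
      }
      directoryQ.poll(); // directory fully expanded
    }
  }

  /** SPS worker: pick one file inode and only now look at its blocks. */
  void processNext() throws InterruptedException {
    long fileInodeId = intermediateQ.take();
    // ... load this file's blocks, compare storage types against the effective
    // policy, and queue NN-to-DN move commands for any mismatches.
  }

  /** Placeholder for the namespace traversal; details omitted in this sketch. */
  private Iterable<Long> nextBatchOfChildFiles(long dirInodeId) {
    return new LinkedList<>();
  }
}
{code}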
"Once the mismatches identified, for the set of blocks in a file, it will add
into Datanode descriptors as NN-to-DN commands, this is exactly same as
ReplicationMonitor. Then DN receive this commands and move the blocks, similar
to transfer. Same as you explained. So, conceptually the approach is exactly
same as RM with little optimizations like throttling."
When assigning the tasks to DN, it will respect the durability tasks. If
replication tasks are already pending then it will give preference to them, it
will not assign SPS block movements as high priority tasks.
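A rough sketch of that scheduling preference (illustrative only: the limit,
method, and class names below are made up, and the real check happens in the
NN's heartbeat handling):
{code:java}
/** Illustrative only: replication work is served before SPS block moves. */
class HeartbeatSchedulingSketch {
  private static final int MAX_TRANSFERS_PER_HEARTBEAT = 4; // illustrative limit

  /** Returns how many SPS moves may be handed to the DN in this heartbeat. */
  int scheduleSpsMoves(int pendingReplicationTasks, int pendingSpsMoves) {
    // Durability first: pending replication tasks consume the transfer budget
    // before any SPS movement is scheduled on the DN.
    int remaining = MAX_TRANSFERS_PER_HEARTBEAT - pendingReplicationTasks;
    if (remaining <= 0) {
      return 0; // DN busy with replication; defer SPS moves to a later heartbeat.
    }
    return Math.min(remaining, pendingSpsMoves);
  }
}
{code}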
*How can we keep the Xattr load minimal:*
(1) The Xattr we add only marks a directory for SPS processing so that work can
resume after a restart; without it, it would be hard to scan the entire
namespace to find mismatches.
The XAttr object has a NameSpace enum, a String name and a byte[] value.
In this case the enum and name are the same for every directory we set, and the
value is null, so it is effectively a constant object. We can therefore create
just one XAttr object per NN and use the same object reference for all
directories; it is essentially deduplication. It is only a matter of adding it
into the directory's xattr list, and the cost is just the object reference to
the single SPS xattr.
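A small sketch of that deduplication idea, assuming the standard
{{org.apache.hadoop.fs.XAttr}} builder; the namespace and name used below are
illustrative, not the final constants.
{code:java}
import org.apache.hadoop.fs.XAttr;

/** Sketch of the single shared XAttr object described above. */
class SpsXAttrSketch {
  // One immutable XAttr instance per NN: same namespace and name for every
  // marked directory, null value, so all directories can share this reference.
  // (The namespace/name here are illustrative, not the final constants.)
  static final XAttr SPS_XATTR = new XAttr.Builder()
      .setNameSpace(XAttr.NameSpace.SYSTEM)
      .setName("hdfs.sps.pending")
      .build();

  // Marking a directory then costs only one reference in its xattr list, e.g.:
  //   directoryXAttrs.add(SPS_XATTR);
}
{code}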
*In summary, regarding your proposal:* SPS is done very much like the RM
approach, except in the way it finds mismatches. As you said, RM is critical
for the NN, so we kept the SPS code carefully out of the RM path, while making
sure RM tasks get higher priority during scheduling. The second part you
mentioned (the point about fancier policies) is not in the scope of the SPS
feature; it belongs to HDFS-7343 (smart storage management). It would be great
to get discussions going in that JIRA sooner rather than later, as it is
entirely about fancier policy implementations.
{quote}
Have any benchmarks been run, particularly with the SPS disabled?
{quote}
Thank you, [~chris.douglas]. Sure, we will get these numbers to demonstrate the
negligible/no impact on the NN.
> Storage Policy Satisfier in Namenode
> ------------------------------------
>
> Key: HDFS-10285
> URL: https://issues.apache.org/jira/browse/HDFS-10285
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: datanode, namenode
> Affects Versions: HDFS-10285
> Reporter: Uma Maheswara Rao G
> Assignee: Uma Maheswara Rao G
> Attachments: HDFS-10285-consolidated-merge-patch-00.patch,
> HDFS-10285-consolidated-merge-patch-01.patch,
> HDFS-10285-consolidated-merge-patch-02.patch,
> HDFS-10285-consolidated-merge-patch-03.patch,
> HDFS-SPS-TestReport-20170708.pdf,
> Storage-Policy-Satisfier-in-HDFS-June-20-2017.pdf,
> Storage-Policy-Satisfier-in-HDFS-May10.pdf,
> Storage-Policy-Satisfier-in-HDFS-Oct-26-2017.pdf
>
>
> Heterogeneous storage in HDFS introduced the concept of storage policies. These
> policies can be set on a directory/file to specify the user's preference for
> where the physical blocks should be stored. When the user sets the storage
> policy before writing data, the blocks can take advantage of the storage
> policy preferences and the physical blocks are stored accordingly.
> If the user sets the storage policy after the file is written and completed,
> the blocks will already have been written with the default storage policy
> (i.e. DISK). The user then has to run the ‘Mover tool’ explicitly, specifying
> all such file names as a list. In some distributed system scenarios (e.g.
> HBase) it would be difficult to collect all the files and run the tool, as
> different nodes can write files independently and the files can live under
> different paths.
> Another scenario is when the user renames a file from a directory with one
> effective storage policy (inherited from the parent directory) into a
> directory with a different effective storage policy. The rename does not copy
> the inherited storage policy from the source, so the file takes the
> destination parent's storage policy. The rename operation is just a metadata
> change in the Namenode; the physical blocks still remain placed according to
> the source storage policy.
> So, tracking all such business-logic-based file names across distributed
> nodes (e.g. region servers) and running the Mover tool could be difficult for
> admins.
> Here the proposal is to provide an API from the Namenode itself to trigger
> storage policy satisfaction. A daemon thread inside the Namenode should track
> such calls and send movement commands to the DNs.
> Will post the detailed design thoughts document soon.