[ 
https://issues.apache.org/jira/browse/HDFS-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16400665#comment-16400665
 ] 

Daryn Sharp commented on HDFS-10285:
------------------------------------

To summarize main _implementation_ issues from an offline meeting:
* NN context abstraction is violated by having internal/external 
implementations.  There should be completely common implementations.  Only the 
context impl differs.
* No DN changes should be required.  DN should be “dumb” and just move blocks 
around.  It already has that support.
* A separate jira can add the transfer-block optimization to just move the 
block w/o a transceiver when the target is the node itself.  Not strictly 
required by SPS.

I also have _design_ issues.  In the same meeting we explored a better 
design that leverages existing NN replication behavior.  The SPS should not 
require so much code that it becomes a maintenance burden for future 
development.

Let’s understand what motivates this feature.  Replication monitoring is not 
working.  Why?  There are two distinct criteria for a block to be correctly 
replicated:
# Are there enough replicas?
# Are the replicas correctly placed?  I.e. rack placement.  Technically, the 
storage policy (SP) is no different.
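The two criteria can be sketched as a pair of predicates (hypothetical helper names, not HDFS internals); a block is correctly replicated only when both hold, and the storage-policy check is just another placement constraint alongside rack placement:

```java
import java.util.List;

// Hypothetical helpers, not HDFS internals: a block is correctly
// replicated only when BOTH predicates hold.
class ReplicationCheck {
    // Criterion #1: are there enough replicas?
    static boolean hasEnoughReplicas(int liveReplicas, int expected) {
        return liveReplicas >= expected;
    }

    // Criterion #2: are the replicas correctly placed?  The storage policy
    // is treated like rack placement: every replica must sit on a storage
    // type the block's policy allows.
    static boolean isCorrectlyPlaced(List<String> replicaStorageTypes,
                                     List<String> allowedByPolicy,
                                     int distinctRacks) {
        return distinctRacks > 1
                && allowedByPolicy.containsAll(replicaStorageTypes);
    }
}
```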

The NN already handles storage policies during placement decisions, i.e. 
when creating files and when correcting mis-replication (over/under).  If #1 
is true, #2 is “short-circuited” (beyond checking racks > 1) on the 
assumption that #2 was satisfied by the choices made to correct #1.  The 
“short-circuit” avoids a heavy performance penalty during FBRs, and it is 
why the NN fails to perform what should be a basic duty (always honoring the 
SP).
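A rough sketch of that short-circuit (assumed control flow with illustrative names, not the actual BlockManager code): once the replica count is healthy, placement is only re-verified up to the rack count, so a storage-policy violation is never noticed:

```java
// Assumed control flow, not the actual BlockManager code: when the replica
// count is healthy, only the cheap rack-count test runs; the expensive
// storage-policy check is skipped on the assumption it was satisfied when
// the replica count was corrected.
class MonitorSketch {
    static boolean needsWork(int live, int expected, int racks,
                             boolean policySatisfied) {
        if (live < expected) {
            return true;              // under-replicated: always schedule work
        }
        // short-circuit: only the rack spread is checked...
        return expected > 1 && racks <= 1;
        // ...policySatisfied is never consulted, so SP violations persist
    }
}
```

The third case in the assertions below is the failure mode: the policy is violated, yet no work is scheduled.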

So how can we leverage the replication monitor while maintaining the 
“short-circuit”?  I think it might be as simple as:
# Replication queue initialization should not short-circuit.  The performance 
penalty of checking the SP is absorbed by the background initialization 
thread.
# The replication monitor must not short-circuit when computing work.  It 
must assume “something” is wrong with a block if it’s in the queue, which 
allows both the queue init and the SPS to work.
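The two steps above might look roughly like this (illustrative names and Java 16+ records; this is a sketch of the idea, not a patch against the real NN classes):

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Illustrative names only; not the real NN classes.
class ProposedSketch {
    record BlockState(int live, int expected, boolean policySatisfied) {}

    static final Queue<BlockState> replicationQueue = new ArrayDeque<>();

    // Step 1: queue init no longer short-circuits -- the SP check runs on
    // the background initialization thread, which absorbs the cost.
    static void initQueue(Iterable<BlockState> allBlocks) {
        for (BlockState b : allBlocks) {
            if (b.live() < b.expected() || !b.policySatisfied()) {
                replicationQueue.add(b);
            }
        }
    }

    // Step 2: the monitor assumes "something" is wrong with any queued
    // block and rechecks both criteria instead of trusting the count.
    static String computeWork(BlockState b) {
        if (b.live() < b.expected())  return "REPLICATE";
        if (!b.policySatisfied())     return "MOVE_PER_POLICY";
        return "NONE";  // fixed in the meantime; drop from the queue
    }
}
```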

Benefits:
# No xattrs.  Replication queue init handles resumption after failover/restart.
# SPS simply scans the tree and adds blocks (with flow-control) to the 
replication queue.  That’s all.
# No split-brain between replication monitor and SPS.
# SP moves are scheduled with respect to normal replication instead of spliced 
into the node’s work queue.
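Benefit #2 in particular reduces SPS to little more than a flow-controlled scan; a bounded blocking queue is one way to get the flow control for free (again a purely illustrative sketch, with made-up names):

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustrative sketch: SPS reduces to a tree scan feeding block IDs into
// the replication queue; the bounded queue itself is the flow control.
class SpsScanSketch {
    // put() blocks when the monitor falls behind, throttling the scan
    static final BlockingQueue<Long> replicationQueue =
            new ArrayBlockingQueue<>(1000);

    static void scan(List<Long> blockIdsUnderSatisfiedPath) {
        for (long blockId : blockIdsUnderSatisfiedPath) {
            try {
                replicationQueue.put(blockId);  // flow-controlled hand-off
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
        // that's all -- the replication monitor performs the actual moves
    }
}
```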

I also think forcing users to use an explicit “satisfy” operation is broken.  
We don’t have a setReplication/satisfyReplication pair.  Deferring the 
satisfy to an indeterminate future time is a specious use case that burdens 
all callers.  We can’t expect users to implement special retry logic to 
ensure the satisfy occurs, persist pending satisfy operations to reissue 
after a crash/restart, etc.  Inevitably the path of least resistance will be 
scheduling a task (cron/oozie/whatever) to call satisfy on large trees, if 
not the whole namespace, and then complaining that hdfs performance sucks.

> Storage Policy Satisfier in Namenode
> ------------------------------------
>
>                 Key: HDFS-10285
>                 URL: https://issues.apache.org/jira/browse/HDFS-10285
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: datanode, namenode
>    Affects Versions: HDFS-10285
>            Reporter: Uma Maheswara Rao G
>            Assignee: Uma Maheswara Rao G
>            Priority: Major
>         Attachments: HDFS-10285-consolidated-merge-patch-00.patch, 
> HDFS-10285-consolidated-merge-patch-01.patch, 
> HDFS-10285-consolidated-merge-patch-02.patch, 
> HDFS-10285-consolidated-merge-patch-03.patch, 
> HDFS-10285-consolidated-merge-patch-04.patch, 
> HDFS-10285-consolidated-merge-patch-05.patch, 
> HDFS-SPS-TestReport-20170708.pdf, SPS Modularization.pdf, 
> Storage-Policy-Satisfier-in-HDFS-June-20-2017.pdf, 
> Storage-Policy-Satisfier-in-HDFS-May10.pdf, 
> Storage-Policy-Satisfier-in-HDFS-Oct-26-2017.pdf
>
>
> Heterogeneous storage in HDFS introduced the concept of storage policies. 
> These policies can be set on a directory/file to specify the user's 
> preference for where the physical blocks should be stored. When the user 
> sets the storage policy before writing data, the blocks can take 
> advantage of the policy preferences and be stored accordingly.
> If the user sets the storage policy after the file has been written and 
> completed, the blocks will already have been written with the default 
> storage policy (i.e. DISK). The user then has to run the ‘Mover tool’ 
> explicitly, specifying all such file names as a list. In some 
> distributed-system scenarios (e.g. HBase) it would be difficult to 
> collect all the files and run the tool, since different nodes can write 
> files separately and the files can have different paths.
> Another scenario: when the user renames a file from a directory with one 
> effective storage policy (inherited from the parent directory) into a 
> directory with a different policy, the inherited storage policy is not 
> copied from the source; the destination parent's storage policy takes 
> effect instead. The rename is just a metadata change in the Namenode; the 
> physical blocks still remain on storage chosen for the source policy.
> So, tracking all such files across distributed nodes (e.g. region 
> servers) and running the Mover tool on them could be difficult for 
> admins. The proposal here is to provide an API in the Namenode itself to 
> trigger storage policy satisfaction. A daemon thread inside the Namenode 
> would track such calls and send movement commands to the DNs.
> Will post the detailed design thoughts document soon.



