[ https://issues.apache.org/jira/browse/HDFS-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16277679#comment-16277679 ]

Andrew Wang commented on HDFS-10285:
------------------------------------

Responding to Anu's comment above inline:

bq. From a deployment point of view, all you will need to do is enable the SPS 
(which we have to do even now), and then SPS service can be automatically 
started.

Automatically started by whom?

Adding a new service requires adding support in management frameworks like 
Cloudera Manager or Ambari: deployment, configuration, monitoring, rolling 
upgrade, and log collection. It also adds administrative complexity for users: 
they need to understand what this thing is, decide which host runs it, and 
figure out how much memory, CPU, network, and storage it needs. It also adds 
system complexity: what happens when there's a partial failure or a failover, 
and how is state kept synchronized and operations kept safe when it falls 
out of sync?

The fact that this would be HA adds a new level of complexity. Even with ZK 
already on the cluster, that's still 2 new ZKFCs and 2 new SPS processes that 
each need to be deployed, monitored, upgraded, etc.

All of these things can be worked through and completed, but it's a huge 
amount of work to undertake, and most of it is downstream integration and 
testing.

bq. I don't agree, the fact that we have a large number of applications working 
against HDFS by reading the information from Namenode should be enough evidence 
that SPS can simply be another application that works against Namenode. There 
is no need for moving that application into Namenode.

To be clear, the current situation is that the SPS is part of the NN, so we're 
not discussing moving the application into the NN; we're discussing moving it 
out.

I'm also not arguing that the SPS can't be implemented outside of the NameNode. 
I'm arguing it shouldn't be, with one of the reasons being that it adds a lot 
of overhead and complexity to do it over an RPC interface.

bq. The current move logic in HDFS – This is not something done by SPS – is 
such that when a block is moved, it issues a block report with a hint to 
Namenode which tells namenode which block to delete. So there is no extra 
overhead from SPS.

Yes, I know how IBRs work in HDFS, but how does the SPS know where to resume 
after a failover? We could be in the middle of recursively processing a 
directory with millions of files when the failover happens. How does the newly 
active SPS know where to resume without rescanning the directory?
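
To make the concern concrete, below is the kind of scan cursor an external, HA 
SPS would have to maintain and share between instances to avoid a full rescan. 
This is purely illustrative (the class and fields are made up, not from the 
patches); the open question is where such state would live so a standby 
instance could pick it up.

{code:java}
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical resume cursor for a recursive policy-satisfaction scan.
class SatisfierScanCursor {
  private final String rootDir;  // directory whose policy is being satisfied
  // Highest inode id already processed; the resume point after a failover.
  private final AtomicLong lastInodeId = new AtomicLong();

  SatisfierScanCursor(String rootDir) {
    this.rootDir = rootDir;
  }

  void markProcessed(long inodeId) {
    lastInodeId.accumulateAndGet(inodeId, Math::max);
  }

  long resumePoint() {
    return lastInodeId.get();
  }
}
{code}

Unless something like this is persisted somewhere both SPS instances can see 
(which today would mean the NN or ZK), the newly active instance is back to 
rescanning.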

bq. I disagree. It is just your opinion that existing code is bad. Do you have 
any metric to prove that existing code is bad? if so, would you be kind enough 
to share them? Looks like the consistency of opinion is not a virtue that you 
share , https://issues.apache.org/jira/browse/HDFS-6382 Please look at the 
JIRA to see comments from lots of people, including you on why a simple process 
external to Namenode seems like a good idea.

This feature is not HDFS-6382, so I don't understand your point. As this is 
supposed to be a discussion of *technical* merits, I'd like to keep the 
discussion focused on *technical* issues rather than on my virtues.

Regarding the balancer, I have heard from our supporters for years about how 
difficult it is to configure and use. The fact that the balancer is a command 
that terminates is odd; users end up scripting around it with cron or bash to 
get it to run continuously. The balancer also doesn't expose any API or 
metrics that could be used for monitoring. Its throttles are similar to but 
different from replication's, leading to confusion about why the balancer is 
or isn't running fast. Since the balancer is separate from the NN, it's also 
more complicated to configure; as an example, the cluster can get really 
messed up if the balancer isn't using the same BlockPlacementPolicy as the NN.

bq. Coming back to technology decision from personal opinions; The list of work 
items that can be maintained NN can become large. Yes, we have introduced 
throttling – but that only cripples this feature. 

How is the size of the work queue improved by the SPS being a separate service? 
The SPS relies on the NN for persistence, so the NN still needs to know the 
work queue.

How does throttling cripple the feature? We want to throttle so this doesn't 
affect foreground work and other maintenance tasks running on the cluster.

I'd also appreciate a response to the concern raised earlier about the overhead 
of communicating over RPC vs. scanning in-process. If throttling cripples the 
feature, then scanning at a fraction of the speed and with much higher 
memory/CPU overhead should be extremely concerning.
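
To make that overhead concrete, here is roughly what an external SPS has to do 
using only public client APIs to find blocks that violate a policy. This is an 
illustrative sketch (the path and the placeholder policy check are made up); 
the point is that the external process has to pull the namespace and block 
locations out of the NameNode over RPC, while an in-NN satisfier walks the same 
inodes in-process.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;

public class ExternalScanSketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Recursive listing: one listing RPC per directory/batch of entries.
    RemoteIterator<LocatedFileStatus> it =
        fs.listFiles(new Path("/warehouse"), true);
    while (it.hasNext()) {
      LocatedFileStatus stat = it.next();
      for (BlockLocation loc : stat.getBlockLocations()) {
        // Where the replicas currently live (DISK, SSD, ARCHIVE, ...).
        StorageType[] current = loc.getStorageTypes();
        // ... compare against the desired policy and queue moves ...
      }
    }
  }
}
{code}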

bq. The policies of how and when these blocks are moved and also the 
possibility of making this a first-class HSM is too attractive to forgo. With 
that in mind, staying inside Namenode is going to hamper the freedom and things 
this feature can do. As I mentioned already several other features will benefit 
immensely from this work.

All of us are interested in a stronger HSM story in HDFS.

Uma provided a few references to other projects that are interested in these 
improved capabilities (HDFS-12090, HDFS-7343). All of these references were 
made based on the current HSM design, where it's part of the NameNode. So I 
don't see any requirements from these other projects that the SPS be a separate 
service.

I think I've made my points, so I'm happy to join a call if it would help move 
the discussion forward.

> Storage Policy Satisfier in Namenode
> ------------------------------------
>
>                 Key: HDFS-10285
>                 URL: https://issues.apache.org/jira/browse/HDFS-10285
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: datanode, namenode
>    Affects Versions: HDFS-10285
>            Reporter: Uma Maheswara Rao G
>            Assignee: Uma Maheswara Rao G
>         Attachments: HDFS-10285-consolidated-merge-patch-00.patch, 
> HDFS-10285-consolidated-merge-patch-01.patch, 
> HDFS-10285-consolidated-merge-patch-02.patch, 
> HDFS-10285-consolidated-merge-patch-03.patch, 
> HDFS-SPS-TestReport-20170708.pdf, 
> Storage-Policy-Satisfier-in-HDFS-June-20-2017.pdf, 
> Storage-Policy-Satisfier-in-HDFS-May10.pdf, 
> Storage-Policy-Satisfier-in-HDFS-Oct-26-2017.pdf
>
>
> Heterogeneous storage in HDFS introduced the concept of storage policies. 
> These policies can be set on a directory or file to specify the user's 
> preference for where the physical blocks should be stored. When the user 
> sets the storage policy before writing data, the blocks can take advantage 
> of the policy preference and are stored accordingly.
> If the user sets the storage policy after the file has been written and 
> closed, the blocks will already have been written with the default storage 
> policy (i.e. DISK). The user then has to run the ‘Mover tool’ explicitly, 
> specifying all such file names as a list. In some distributed-system 
> scenarios (e.g. HBase) it is difficult to collect all the files and run the 
> tool, since different nodes write files independently and the files can have 
> different paths.
> Another scenario is renaming a file from a directory with one effective 
> storage policy (inherited from its parent) into a directory with a different 
> policy. The rename does not copy the inherited storage policy from the 
> source, so the file takes the effective policy of the destination's parent. 
> The rename is just a metadata change in the Namenode; the physical blocks 
> still remain where the source storage policy placed them.
> So, tracking all such files across distributed nodes (e.g. region servers) 
> and running the Mover tool is difficult for admins.
> The proposal here is to provide an API in the Namenode itself to trigger 
> storage policy satisfaction. A daemon thread inside the Namenode would track 
> such calls and send movement commands to the DNs.
> Will post the detailed design thoughts document soon.
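
For concreteness, a minimal sketch of the flow described above: change the 
policy on data that is already written, then ask the Namenode to satisfy it 
instead of running the Mover by hand. The satisfyStoragePolicy trigger is what 
this issue proposes, so treat the exact API and the path as illustrative.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.client.HdfsAdmin;

public class SatisfyPolicySketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path dir = new Path("/hbase/archive");  // hypothetical path

    // Changing the policy after the fact is metadata-only; existing
    // blocks stay on their current storage.
    FileSystem fs = FileSystem.get(conf);
    fs.setStoragePolicy(dir, "COLD");

    // Proposed trigger: ask the NN to move existing blocks to match
    // the new policy, instead of running the Mover tool manually.
    HdfsAdmin admin = new HdfsAdmin(FileSystem.getDefaultUri(conf), conf);
    admin.satisfyStoragePolicy(dir);
  }
}
{code}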


