[
https://issues.apache.org/jira/browse/HDFS-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16277679#comment-16277679
]
Andrew Wang commented on HDFS-10285:
------------------------------------
Responding to Anu's comment above inline:
bq. From a deployment point of view, all you will need to do is enable the SPS
(which we have to do even now), and then SPS service can be automatically
started.
Automatically started by whom?
Adding a new service requires adding support in management frameworks like
Cloudera Manager or Ambari. This means support for deployment, configuration,
monitoring, rolling upgrade, and log collection. It also adds administrative
complexity for users: they need to understand what this thing is, decide which
host runs it, and work out how much memory, CPU, network, and storage it needs.
It also adds system complexity: what happens when there's a partial failure or
a failover, how is state synchronized, and how are operations kept safe when
the state is out of sync?
The fact that this would be HA adds a new level of complexity. Even with ZK
already on the cluster, that's still 2 new ZKFCs and 2 new SPS processes that
each need to be deployed, monitored, upgraded, etc.
These are all things that can be worked through and completed, but it's a huge
amount of work to undertake, most of it downstream integration and testing.
bq. I don't agree, the fact that we have a large number of applications working
against HDFS by reading the information from Namenode should be enough evidence
that SPS can simply be another application that works against Namenode. There
is no need for moving that application into Namenode.
To be clear, the current situation is that the SPS is part of the NN, so we're
not discussing moving the application into the NN; we're discussing moving it
out.
I'm also not arguing that the SPS can't be implemented outside of the NameNode.
I'm arguing it shouldn't be, with one of the reasons being that it adds a lot
of overhead and complexity to do it over an RPC interface.
bq. The current move logic in HDFS – This is not something done by SPS – is
such that when a block is moved, it issues a block report with a hint to
Namenode which tells namenode which block to delete. So there is no extra
overhead from SPS.
Yes, I know how incremental block reports (IBRs) work in HDFS, but how does
the SPS know where to resume from after a failover? We could be in the middle
of recursively processing a directory with millions of files when the failover
happens. How does the newly active SPS know where to resume without rescanning
the directory?
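To make that concern concrete, here is a rough sketch (hypothetical names, not
from any SPS patch) of the resume state an external SPS would have to persist
and keep in sync just to avoid rescanning after a failover:
{code:java}
// Hypothetical sketch of the resume state an out-of-NN SPS would need to
// persist across failover to avoid rescanning a huge directory tree.
// Class and method names are illustrative only, not from any SPS patch.
public class SpsScanCursor {
  // Directory tree currently being satisfied, e.g. "/warehouse/db1".
  private final String rootPath;
  // Last file whose block moves were fully scheduled; the traversal resumes
  // strictly after this path in listing order.
  private String lastScheduledPath;

  public SpsScanCursor(String rootPath) {
    this.rootPath = rootPath;
  }

  public synchronized void markScheduled(String filePath) {
    lastScheduledPath = filePath;
    // An external, HA SPS would have to write this to shared durable storage
    // (ZooKeeper, or the NN itself) on every update, or accept redoing the
    // un-persisted tail of work after a failover; that is exactly the extra
    // state synchronization being discussed above.
  }

  public synchronized String getResumePoint() {
    return lastScheduledPath == null ? rootPath : lastScheduledPath;
  }
}
{code}
Inside the NN, this bookkeeping can live next to the namespace it describes;
outside the NN, every update has to be made durable and consistent across an
SPS failover.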
bq. I disagree. It is just your opinion that existing code is bad. Do you have
any metric to prove that existing code is bad? If so, would you be kind enough
to share them? Looks like the consistency of opinion is not a virtue that you
share: https://issues.apache.org/jira/browse/HDFS-6382. Please look at the
JIRA to see comments from lots of people, including you, on why a simple
process external to Namenode seems like a good idea.
This feature is not HDFS-6382, so I don't understand your point. As this is
supposed to be a discussion of *technical* merits, I'd like to keep the
discussion focused on *technical* issues rather than on my virtues.
Regarding the balancer, I have heard from our support folks for years about
how difficult it is to configure and use. The fact that the balancer is a
command that terminates is odd; users end up scripting around it with cron or
bash to get it to effectively run continuously. The balancer also doesn't
expose any API or metrics that could be used for monitoring. Its throttles are
similar to, but different from, the replication throttles, leading to
confusion about why the balancer is or isn't running fast. Since the balancer
is separate from the NN, it's also more complicated to configure. As an
example, the cluster can get really messed up if the balancer isn't using the
same BlockPlacementPolicy as the NN.
bq. Coming back to the technology decision from personal opinions: the list of
work items that can be maintained in the NN can become large. Yes, we have
introduced throttling – but that only cripples this feature.
How is the size of the work queue improved by the SPS being a separate service?
The SPS relies on the NN for persistence, so the NN still needs to know the
work queue.
How does throttling cripple the feature? We want to throttle so this doesn't
affect foreground work and other maintenance tasks running on the cluster.
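For what it's worth, the intent of throttling is simply to bound concurrency.
A minimal sketch of the idea (purely illustrative, not the actual SPS code):
{code:java}
// Illustrative only: bound the number of concurrent block-move commands so
// background movement does not starve foreground reads/writes.
import java.util.concurrent.Semaphore;

public class BlockMoveThrottle {
  private final Semaphore inflightMoves;

  public BlockMoveThrottle(int maxConcurrentMoves) {
    this.inflightMoves = new Semaphore(maxConcurrentMoves);
  }

  /** Blocks until a slot is free, then runs the move task. */
  public void submit(Runnable moveTask) throws InterruptedException {
    inflightMoves.acquire();
    try {
      moveTask.run();
    } finally {
      inflightMoves.release();
    }
  }
}
{code}
Bounding in-flight moves this way trades peak speed for predictable impact on
the cluster, which is the same trade-off replication and the balancer make.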
I'd also appreciate a response to the concern raised earlier about the
overhead of communicating over RPC vs. scanning in-process. If throttling
cripples the feature, then scanning at a fraction of the speed and with much
higher memory/CPU overhead should be extremely concerning.
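To make the RPC-vs-in-process point concrete, below is a sketch of what an
out-of-NN SPS has to do just to discover candidate blocks (an assumed workflow
using public FileSystem APIs, not code from any patch). Every listing page and
per-file policy lookup here is a NameNode RPC, whereas an in-NN scanner walks
the same metadata in memory:
{code:java}
// Sketch of the per-file RPC cost an SPS outside the NameNode would pay just
// to find work. Assumed workflow, not code from any SPS patch.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class ExternalSpsScanSketch {
  public static void scan(Configuration conf, Path root) throws Exception {
    FileSystem fs = FileSystem.get(conf);
    // Recursive listing: paged listing RPCs against the NameNode.
    RemoteIterator<LocatedFileStatus> files = fs.listFiles(root, true);
    while (files.hasNext()) {
      LocatedFileStatus file = files.next();
      // Another RPC per file to learn the effective storage policy.
      String policy = fs.getStoragePolicy(file.getPath()).getName();
      for (BlockLocation loc : file.getBlockLocations()) {
        // Compare actual replica storage types against the policy and queue
        // a move for any replica on the wrong tier (comparison elided).
        System.out.println(file.getPath() + " policy=" + policy
            + " storages=" + java.util.Arrays.toString(loc.getStorageTypes()));
      }
    }
  }
}
{code}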
bq. The policies of how and when these blocks are moved and also the
possibility of making this a first-class HSM is too attractive to forgo. With
that in mind, staying inside Namenode is going to hamper the freedom and things
this feature can do. As I mentioned already several other features will benefit
immensely from this work.
All of us are interested in a stronger HSM story in HDFS.
Uma provided a few references to other projects that are interested in these
improved capabilities (HDFS-12090, HDFS-7343). All of these references were
made based on the current HSM design, where it's part of the NameNode. So I
don't see any requirements from these other projects that the SPS be a separate
service.
I think my points have been stated at this point, so I'm happy to join a call
if it'd help progress the discussion.
> Storage Policy Satisfier in Namenode
> ------------------------------------
>
> Key: HDFS-10285
> URL: https://issues.apache.org/jira/browse/HDFS-10285
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: datanode, namenode
> Affects Versions: HDFS-10285
> Reporter: Uma Maheswara Rao G
> Assignee: Uma Maheswara Rao G
> Attachments: HDFS-10285-consolidated-merge-patch-00.patch,
> HDFS-10285-consolidated-merge-patch-01.patch,
> HDFS-10285-consolidated-merge-patch-02.patch,
> HDFS-10285-consolidated-merge-patch-03.patch,
> HDFS-SPS-TestReport-20170708.pdf,
> Storage-Policy-Satisfier-in-HDFS-June-20-2017.pdf,
> Storage-Policy-Satisfier-in-HDFS-May10.pdf,
> Storage-Policy-Satisfier-in-HDFS-Oct-26-2017.pdf
>
>
> Heterogeneous storage in HDFS introduced the concept of storage policies. A
> policy can be set on a directory or file to specify where the physical
> blocks should be stored. If the user sets the storage policy before writing
> data, the blocks are placed according to that policy.
> If the user sets the storage policy after the file has been written and
> closed, the blocks will already have been written with the default storage
> policy (DISK). The user then has to run the Mover tool explicitly, passing
> all such file names as a list. In some distributed scenarios (e.g. HBase) it
> is difficult to collect all of the affected files and run the tool, because
> different nodes write files independently and the files can live under
> different paths.
> Another scenario is renames: when a file is renamed from a directory with
> one effective storage policy (inherited from its parent) into a directory
> with a different policy, the inherited policy is not copied from the source,
> so the file takes the effective policy of the destination parent. The rename
> is only a metadata change in the Namenode; the physical blocks still sit on
> storage matching the source policy.
> Tracking all such files across distributed nodes (e.g. region servers) and
> running the Mover tool on them is difficult for admins.
> The proposal here is to provide an API in the Namenode itself to trigger
> storage policy satisfaction. A daemon thread inside the Namenode would track
> such calls and send movement commands to the DataNodes.
> Will post the detailed design thoughts document soon.