[
https://issues.apache.org/jira/browse/HDFS-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16108182#comment-16108182
]
Uma Maheswara Rao G commented on HDFS-10285:
--------------------------------------------
Hi [~andrew.wang], thank you so much for the thorough review.
Please find my replies below.
{quote}
For the automatic usecase, I agree that metrics are probably the best we can
do. However, the API exposed here is for interactive usecases (e.g. a user
calling the shell command and polling until it's done). I think we need to do
more here to expose the status.
Even for the HBase usecase, it'd still want to know about satisfier status so
it can bubble it up to an HBase admin.
{quote}
We have already filed HDFS-12228 for this.
Sure, we will think more about the status reporting part, and I will file a ticket to
track it as well. A quick question on your example above, "a user calling the shell
command and polling until it's done": do you mean the command should block and poll
internally, or that the user would call a status check periodically? And how long
should the server hold the status?
{quote}
Can this be addressed by throttling? I think the SPS operations aren't too
different from decommissioning, since they're both doing block placement and
tracking data movement, and the decom throttles work okay.
We've also encountered directories with millions of files before, so there's a
need for throttles anyway. Maybe we can do something generic here that can be
shared with HDFS-10899.
Re-encryption will be faster than SPS, but it's not fast since it needs to talk
to the KMS. Xiao's benchmarks indicate that a re-encrypt operation will likely
run for hours. On the upside, the benchmarks also show that scanning through an
already-re-encrypted zone is quite fast (seconds). I expect it'll be similarly
fast for SPS if a user submits subdir or duplicate requests. Would be good to
benchmark this.
I also don't understand the aversion to FIFO execution. It reduces code
complexity and is easy for admins to reason about. If we want to do something
more fancy, there should be a broader question around the API for resource
management. Is it fair share, priorities, limits, some combination? What are
these applied to (users, files, directories, queues with ACLs)?
{quote}
Throttling is one of the tasks we have already filed, HDFS-12227, but that focuses on
DN-level throttling; I will add a ticket to track NN-level throttling as well.
As of now, I think the FIFO model is a reasonable way to go: each directory root is the
main element picked first, and a sub-directory eventually gets the next priority if the
user calls on it while a higher-level directory is already in progress. A minimal
sketch of this ordering is below.
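Just to make the intended ordering concrete, here is a minimal sketch (hypothetical
class and method names, not the branch code), assuming each request is the path of a
directory root and a sub-directory request is deferred while an ancestor scan is still
in progress:
{code:java}
import java.util.ArrayDeque;
import java.util.Queue;

/** Minimal FIFO-ordering sketch for SPS directory requests (hypothetical names). */
public class FifoSatisfierSketch {
  private final Queue<String> pendingDirs = new ArrayDeque<>();

  /** A user (or HBase) asks to satisfy the policy for a directory tree. */
  public synchronized void addRequest(String dir) {
    pendingDirs.add(dir);
  }

  /**
   * Pick the next directory strictly in arrival order. A sub-directory of the
   * tree currently being traversed is left in place for now; the ancestor scan
   * covers it, and it is re-examined once that scan finishes.
   */
  public synchronized String nextDir(String inProgressDir) {
    String next = pendingDirs.peek();
    if (next == null) {
      return null;
    }
    if (inProgressDir != null && next.startsWith(inProgressDir + "/")) {
      return null; // defer: a higher-level directory is already in progress
    }
    return pendingDirs.poll();
  }
}
{code}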
{quote}
What's the total SPS work timeout in minutes? The node is declared dead after
10.5 minutes, but if the network partition is shorter than that, it won't need
to re-register. 5 mins also seems kind of long for an IN_PROGRESS update, since
it should take a few seconds for each block movement.
Also, we can't depend on re-registration with NN for fencing the old C-DN,
since there could be a network partition that is just between the NN and old
C-DN, and the old C-DN can still talk to other DNs. I don't know how this
affects correctness, but having multiple C-DNs makes debugging harder.
{quote}
Even if the old C-DN keeps working with other DNs to transfer blocks (this scenario
should be rare), the target DNs will accept only one copy of a block: whichever C-DN
transfers the block first wins, and the other gets a "block already exists" exception.
Since the NN tracks the file associated with that block, it has to remove its tracking
element. Example: in the worst case the old C-DN completes all movements successfully,
so the new C-DN's attempts fail and the NN receives a failure result from the new C-DN.
The NN then retries; by that time the blocks have already been satisfied by the old
C-DN, so the NN simply treats the file as finished and removes the xattr.
The IN_PROGRESS report is sent to indicate to the NN that the C-DN is still working on
the item; it is mainly a liveness signal and should be a very rare condition, as the
DN will normally transfer the blocks faster than the report interval.
Right now a file element is retried only after the self-retry timeout, and only for the
failure case where the C-DN has not reported anything at all (dead, or out of the
network); this applies only to the files assigned to that C-DN. The self-retry timeout
is currently configured as 20 minutes and can be tuned to >10 minutes. [We modeled this
configuration on PendingReplicationMonitor, which reassigns blocks to the
low-redundancy blocks list within approximately 10 minutes.] A rough sketch of this
handling follows.
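To make the retry and fencing behaviour concrete, here is a minimal sketch (hypothetical
names and structure, not the actual branch code) of how the NN could react to the
per-file results reported by a C-DN, including the case where the old C-DN has already
moved everything:
{code:java}
/** Hypothetical per-file result reported by a coordinator DN (C-DN). */
enum FileMovementResult { SUCCESS, FAILURE, IN_PROGRESS }

class SpsRetrySketch {
  /** Self-retry timeout described above (configurable; ~20 minutes today). */
  private static final long SELF_RETRY_TIMEOUT_MS = 20L * 60 * 1000;

  void handleResult(long inodeId, FileMovementResult result) {
    switch (result) {
      case SUCCESS:
        removeSpsXattr(inodeId);        // movement finished, stop tracking the file
        break;
      case IN_PROGRESS:
        touchLastReportTime(inodeId);   // liveness only; the C-DN is still working
        break;
      case FAILURE:
        // Retry the file. If the old C-DN already moved the blocks, the retry
        // finds every block on the expected storage, so the NN just removes the
        // xattr and treats the file as finished.
        requeueForRetry(inodeId);
        break;
    }
  }

  /**
   * Files whose C-DN never reported anything (dead, or partitioned away) are
   * retried only after the self-retry timeout, similar to how
   * PendingReplicationMonitor re-queues timed-out replication work.
   */
  void checkTimeout(long now, long lastReportTime, long inodeId) {
    if (now - lastReportTime > SELF_RETRY_TIMEOUT_MS) {
      requeueForRetry(inodeId);
    }
  }

  private void removeSpsXattr(long inodeId) { /* sketch only */ }
  private void touchLastReportTime(long inodeId) { /* sketch only */ }
  private void requeueForRetry(long inodeId) { /* sketch only */ }
}
{code}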
{quote}
Even assuming we do the xattr optimization, I believe the NN still has a queue
of pending work items so they can be retried if the C-DNs fail. How many items
might be in this queue, for a large SPS request? Is it throttled?
{quote}
The pending queue size depends on the number of files that the C-DNs failed to move.
The queue contains the inode IDs of files, not block IDs. So, if we are moving the data
blocks of a million files, the queue contains a million elements; no blocks are tracked
at the NN. A rough memory sketch is below.
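As a purely illustrative back-of-the-envelope (my own estimate, not a measurement):
because the queue holds only one inode ID per file, even a million-file request stays
small in NN memory.
{code:java}
import java.util.concurrent.ConcurrentLinkedQueue;

/** Sketch: the NN-side pending queue holds inode IDs only, never block IDs. */
public class PendingQueueSketch {
  private final ConcurrentLinkedQueue<Long> pendingInodeIds = new ConcurrentLinkedQueue<>();

  public void add(long inodeId) {
    pendingInodeIds.add(inodeId);
  }

  public static void main(String[] args) {
    // Assuming ~40 bytes per entry (boxed Long plus queue-node overhead),
    // one million queued files is on the order of a few tens of megabytes,
    // independent of how many blocks each file has.
    long bytes = 1_000_000L * 40;
    System.out.println(bytes / (1024 * 1024) + " MB (order of magnitude)");
  }
}
{code}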
{quote}
At a higher-level, if we implement all the throttles to reduce NN overhead, is
there still a benefit to offloading work to DNs? The SPS workload isn't too
different from decommissioning, which we manage on the NN okay.
{quote}
Our main motivation for offloading work was to avoid tracking block-level results at
the NN. Now the NN tracks only file-level results, while the C-DN tracks all
block-level movements and sends the aggregated result back. We are also thinking of
using this kind of model for converting regular files to EC, and HDFS-12090 is another
use case that wants to use it. I think keeping all of this monitoring logic in the NN
would definitely add overhead on the NN. My feeling is that we should offload as much
work as possible from the NN; a minimal sketch of the split is below.
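A minimal sketch of the intended division of labour (hypothetical interfaces, only to
illustrate the file-level vs block-level split described above):
{code:java}
import java.util.List;

/** Sketch of the split: the NN tracks files, the coordinator DN tracks blocks. */
interface NamenodeSide {
  /** The NN hands a whole file (by inode ID) to a C-DN... */
  void assignFileToCoordinator(long inodeId, String coordinatorDn);

  /** ...and later receives a single, aggregated file-level result back. */
  void onFileResult(long inodeId, boolean success);
}

interface CoordinatorDnSide {
  /**
   * The C-DN resolves the file into individual block moves, drives them against
   * the other DNs, and reports one result per file back to the NN.
   */
  boolean moveAllBlocks(long inodeId, List<String> sourceStorages,
      List<String> targetStorages);
}
{code}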
[~rakeshr] do you have any points to add?
> Storage Policy Satisfier in Namenode
> ------------------------------------
>
> Key: HDFS-10285
> URL: https://issues.apache.org/jira/browse/HDFS-10285
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: datanode, namenode
> Affects Versions: HDFS-10285
> Reporter: Uma Maheswara Rao G
> Assignee: Uma Maheswara Rao G
> Attachments: HDFS-10285-consolidated-merge-patch-00.patch,
> HDFS-10285-consolidated-merge-patch-01.patch,
> HDFS-SPS-TestReport-20170708.pdf,
> Storage-Policy-Satisfier-in-HDFS-June-20-2017.pdf,
> Storage-Policy-Satisfier-in-HDFS-May10.pdf
>
>
> Heterogeneous storage in HDFS introduced the concept of storage policies. These
> policies can be set on a directory/file to specify the user's preference for where
> the physical blocks should be stored. When the user sets the storage policy before
> writing data, the blocks can take advantage of the policy preferences and are placed
> on the preferred storage accordingly.
> If the user sets the storage policy after the file has been written and closed, the
> blocks will already have been written with the default storage policy (i.e. DISK).
> The user then has to run the 'Mover' tool explicitly, specifying all such file names
> as a list. In some distributed-system scenarios (e.g. HBase) it would be difficult to
> collect all the files and run the tool, as different nodes can write files
> independently and the files can have different paths.
> Another scenario is rename: when the user renames a file from a directory with one
> storage policy (inherited from the parent directory) into a directory with a
> different storage policy, the inherited policy is not copied from the source, so the
> file now takes its policy from the destination parent directory. This rename is just
> a metadata change in the Namenode; the physical blocks still remain placed according
> to the source storage policy.
> So, tracking all such business-logic-driven file names across distributed nodes
> (e.g. region servers) and running the Mover tool could be difficult for admins.
> The proposal here is to provide an API in the Namenode itself to trigger storage
> policy satisfaction. A daemon thread inside the Namenode would track such calls and
> send movement commands to the DNs.
> Will post the detailed design thoughts document soon.
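To illustrate the gap described above, here is a hedged sketch using the existing
FileSystem API; the path and policy are made up, and the satisfyStoragePolicy call is
only the proposed addition, not an API that exists at the time of this comment:
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SpsUsageSketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path dir = new Path("/hbase/archive");   // illustrative path only

    // Changing the policy after the data was written only updates metadata;
    // the existing blocks stay on their current storage (e.g. DISK).
    fs.setStoragePolicy(dir, "COLD");

    // Today the admin must collect the affected paths and run the Mover tool.
    // The proposal is an NN-side trigger along these lines (proposed API only):
    // fs.satisfyStoragePolicy(dir);
  }
}
{code}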