[ https://issues.apache.org/jira/browse/HDFS-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16141060#comment-16141060 ]

Andrew Wang commented on HDFS-10285:
------------------------------------

Hi Uma,

bq. In fact isSpsRunning API was added just for users to make sure inbuilt SPS 
is not running if they want to run Mover tool explicitly.

Maybe I misunderstood this API then, since it isn't mentioned in the 
"Administrator notes" section that discusses the interaction with the Mover. 
Should this API instead be "isSpsEnabled"? The docs currently indicate that 
when the SPS is "activated" (enabled via configuration?), the Mover cannot be 
run, and vice versa.

The docs also say {{If a Mover instance is already triggered and running, SPS 
will be deactivated while starting.}} Does "starting" here mean enabling SPS 
dynamically via configuration, or triggering an SPS operation?

bq. I filed a JIRA HDFS-12310 to discuss more. I really don't know its a good 
idea to encourage users to periodically poll on the system for this status. 
IMO, if movements are really failing(probably some storages are unavailable or 
some storages failed etc), there is definitely an administrator actions 
required instead of user component knowing the status and taking actions itself.

bq. Also another argument is that, We already have async fashioned APIs, 
example delete or setReplication. Even for NN call perspective they may be sync 
calls, but for user perspective, still lot of work happens asynchronously. If 
we delete file, it does NN cleanup and add blocks for deletions. All the blocks 
deletions happens asynchronously. User believe HDFS that data will be cleaned, 
we don't have status reporting API. 

Delete removes the files from the namespace immediately, so the user-visible 
portion of the operation is completed as soon as the RPC returns. Block 
deletion is also simpler than block placement. I haven't seen block deletion 
fail much, while I've seen plenty of block placement failures since it's 
complicated and depends on the environment.

[setrep 
-w|https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html#setrep]
 waits for the replication change to complete; it's pretty common to call it like this:
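
(The replication factor and path here are just illustrative.)

{noformat}
hdfs dfs -setrep -w 3 /data/warm
{noformat}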

Metrics are not a great way of addressing this hole in the API. Since the 
operation is triggered via "hdfs storagepolicies", there should be another 
storagepolicies subcommand or flag to get the status.
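
Purely to illustrate the shape I have in mind (this is not an existing flag; the 
name is made up):

{noformat}
# hypothetical subcommand, for illustration only
hdfs storagepolicies -checkSatisfyStatus -path /data/warm
{noformat}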

bq. Do you have some use cases on how the application system(ex: Hbase, Anoop 
Sam John has provided some use cases above to use SPS) reacts on status results?

Knowing the return code is a pretty basic desire on the part of users, like 
with {{setrep -w}}. It's also important when migrating away from the Mover. If 
the Mover fails, it returns a non-zero exit code and an error message, and its 
log has useful debugging info. There's no analogue for SPS actions.
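
For example, a script driving the Mover can act on its exit status directly 
(the path is illustrative):

{noformat}
hdfs mover -p /data/warm || echo "migration failed, check the Mover log"
{noformat}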

bq. I don't see any extra latencies involved really.

For existing block commands:

* The NN selects a DN, sends it the block command on the heartbeat
* The DN does it, and then informs the NN on the next heartbeat (IBR).

For SPS work:

* The NN selects a C-DN and sends it a batch of work on the heartbeat
* The C-DN calls replaceBlock on the blocks
* The src and target DNs do the replaceBlock and inform the NN on their next 
heartbeat (IBR).
* The C-DN informs the NN that the batch is complete on its next heartbeat.

It's this last step that can add latency. Completion requires the IBRs of the 
src/target DNs, but also the batch status from the C-DN, which can arrive up to 
a full heartbeat interval later. That extra wait wouldn't be necessary if the NN 
tracked completion itself.
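
As a rough illustration of the arithmetic, here's a toy model of the two flows 
above. The 3s value is the default {{dfs.heartbeat.interval}}; everything else 
is an assumption, not a measurement:

{code:java}
public class SpsCompletionLatencySketch {
  public static void main(String[] args) {
    final double heartbeatSec = 3.0; // default dfs.heartbeat.interval

    // Existing block commands: the NN can mark the work done once the
    // src/target DNs report via IBR, i.e. within roughly one heartbeat.
    double existingWorstCaseSec = heartbeatSec;

    // C-DN scheme: after the last IBR arrives, the NN still waits for the
    // C-DN to report the batch complete on *its* next heartbeat.
    double cdnWorstCaseSec = heartbeatSec + heartbeatSec;

    System.out.printf("existing flow: ~%.0fs, C-DN flow: ~%.0fs worst case%n",
        existingWorstCaseSec, cdnWorstCaseSec);
  }
}
{code}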

Do we verify that we've gotten all the IBRs and block state is correct before 
marking SUCCESS? I didn't see it in BlockStorageMovementAttemptedItems.

bq.  Along with that, we send batch to one DN first, that DN does its work as 
well as ask other DNs to transfer the blocks.

I read the code to better understand this flow. The C-DN calls replaceBlock on 
the src and target DNs of the work batches.

I'm still unconvinced that we save much by moving block-level completion 
tracking to the DN. PendingReconstructionBlocks + LowRedundancyBlocks works 
pretty well with block-level tracking, and that's even when a ton of work gets 
queued up due to a failure. For SPS, we can do better since we can throttle the 
directory scan speed and thus limit the number of outstanding work items. This 
would make any file-level vs. block-level overheads marginal.
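
To make the throttling idea concrete, here's a minimal sketch. All class and 
method names are hypothetical, not existing SPS code:

{code:java}
import java.util.concurrent.Semaphore;

/**
 * Hypothetical sketch only: bound the number of files with outstanding SPS
 * moves so that NN-side, block-level completion tracking stays cheap.
 */
class ThrottledSpsScanner {
  private final Semaphore outstandingFiles;

  ThrottledSpsScanner(int maxOutstandingFiles) {
    this.outstandingFiles = new Semaphore(maxOutstandingFiles);
  }

  /** The directory scanner calls this for each file that needs movement. */
  void submit(String path) throws InterruptedException {
    outstandingFiles.acquire();   // the scanner blocks once the limit is reached
    scheduleBlockMoves(path);     // hand the file's block moves to the NN queues
  }

  /** Called once every block move for the file has been confirmed via IBRs. */
  void onFileCompleted(String path) {
    outstandingFiles.release();
  }

  private void scheduleBlockMoves(String path) {
    // placeholder: enqueue per-block move work items for this file
  }
}
{code}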

Could you also comment on how SPS work is prioritized against block work from 
LowRedundancyBlocks? SPS actions should be lower priority than maintaining durability.

One more question: block replication looks at the number of xmits in use on the 
DN to throttle appropriately. This doesn't work well with the C-DN scheme, since 
the C-DN is rarely the source or target DN and the work is sent in batches. 
Could you comment on this?
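
For reference, this is roughly what I mean by the existing throttle (a 
simplified sketch with illustrative names, not the actual DatanodeManager code):

{code:java}
/**
 * xmits-based throttling, simplified: only hand a DN as many new transfers
 * as it has headroom for, based on what it reported on its last heartbeat.
 */
static int transfersToSchedule(int xmitsInProgress, int maxStreams) {
  // maxStreams corresponds to dfs.namenode.replication.max-streams
  return Math.max(0, maxStreams - xmitsInProgress);
}
{code}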

> Storage Policy Satisfier in Namenode
> ------------------------------------
>
>                 Key: HDFS-10285
>                 URL: https://issues.apache.org/jira/browse/HDFS-10285
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: datanode, namenode
>    Affects Versions: HDFS-10285
>            Reporter: Uma Maheswara Rao G
>            Assignee: Uma Maheswara Rao G
>         Attachments: HDFS-10285-consolidated-merge-patch-00.patch, 
> HDFS-10285-consolidated-merge-patch-01.patch, 
> HDFS-SPS-TestReport-20170708.pdf, 
> Storage-Policy-Satisfier-in-HDFS-June-20-2017.pdf, 
> Storage-Policy-Satisfier-in-HDFS-May10.pdf
>
>
> Heterogeneous storage in HDFS introduced the concept of storage policies. These 
> policies can be set on a directory or file to specify the user's preference for 
> where the physical blocks should be stored. When the user sets the storage policy 
> before writing data, the blocks can take advantage of the policy preferences and 
> be placed accordingly.
> If the user sets the storage policy after the file has been written and closed, 
> the blocks will already have been written with the default storage policy (namely 
> DISK). The user then has to run the Mover tool explicitly, specifying all such 
> files as a list. In some distributed scenarios (ex: HBase) it would be difficult 
> to collect all the files and run the tool, since different nodes write files 
> independently and the files can have different paths.
> Another scenario: when the user renames a file from a directory with one effective 
> storage policy (inherited from the parent directory) into a directory with a 
> different policy, the inherited storage policy is not copied from the source; the 
> file takes the effective policy of the destination parent. The rename is just a 
> metadata change in the Namenode, so the physical blocks still remain placed 
> according to the source storage policy.
> Tracking all such files across distributed nodes (ex: region servers) and running 
> the Mover tool would be difficult for admins. The proposal here is to provide an 
> API in the Namenode itself to trigger storage policy satisfaction. A daemon thread 
> inside the Namenode would track such calls and send movement commands to the DNs.
> Will post the detailed design thoughts document soon.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
