[ https://issues.apache.org/jira/browse/HDFS-11334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15939447#comment-15939447 ]

Uma Maheswara Rao G commented on HDFS-11334:
--------------------------------------------

Hi Rakesh, thank you for working on this task.
Below is my feedback:
# .
{code}
   public static final int DFS_STORAGE_POLICY_SATISFIER_SELF_RETRY_TIMEOUT_MILLIS_DEFAULT =
-      30 * 60 * 1000;
+      20 * 60 * 1000;
+  public static final String DFS_DATANODE_STORAGE_POLICY_SATISFIER_WORKER_INPROGRESS_RECHECK_TIME_MILLIS_KEY =
+      "dfs.datanode.storage.policy.satisfier.worker.inprogress.recheck.time.millis";{code}
How about removing this configuration item and instead deriving the value from the heartbeat interval, say 2 * heartbeatInterval? Thoughts?
Also, the name looks too long; it would be good to make it shorter. A rough sketch of the idea is below.
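A sketch of what I mean, assuming we read the existing dfs.heartbeat.interval key (the 2x factor is just illustrative, not a proposal for an exact value):
{code}
    // Sketch: derive the in-progress recheck interval from the configured
    // DN heartbeat interval instead of introducing a dedicated config key.
    final long heartbeatIntervalMs = conf.getTimeDuration(
        DFSConfigKeys.DFS_HEARTBEAT_INTERVAL_KEY,
        DFSConfigKeys.DFS_HEARTBEAT_INTERVAL_DEFAULT, TimeUnit.SECONDS) * 1000;
    final long inprogressRecheckIntervalMs = 2 * heartbeatIntervalMs;
{code}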
# .
{code}
+          if (receivedCompletionStatus) {
+            storageMovementAttemptedItems
+                .remove(storageMovementAttemptedResult.getTrackId());
           }{code}
Maybe we can have a boolean param isInprogress and set it to true only when the result status says in progress. Then the above check becomes:
{code}
          if (!isInprogress) {
            storageMovementAttemptedItems
                .remove(storageMovementAttemptedResult.getTrackId());
          }{code}
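For instance, isInprogress could be derived from the reported status (assuming the patch adds an in-progress status value; the names below are a sketch, not the actual code):
{code}
      // Hypothetical sketch: set the flag only when the co-ordinator has
      // explicitly reported the block movement as still in progress.
      boolean isInprogress = storageMovementAttemptedResult
          .getStatus() == BlocksStorageMovementResult.Status.IN_PROGRESS;
{code}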
# Typo: "Didn’t dropped SPS work" -> "Didn’t drop ...".
# I would like to see a doc comment for the test testDeadDatanode.
# .
{code}
+          // add sps drop command
+          nodeS.setDropSPSWork(true);
{code}
I think we need to add a comment here explaining why we are asking the DN to drop the SPS work.
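Something along these lines, for example (the wording is just a suggestion, based on the rescheduling scenarios in the description):
{code}
+          // This DN re-registered after NN may have already rescheduled its
+          // pending SPS work to another co-ordinator, so ask it to drop any
+          // stale SPS work to avoid conflicting results for the same blocks.
+          nodeS.setDropSPSWork(true);
{code}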
# The header comments of BlockStorageMovementAttemptedItems should reflect the new changes.
# Should we rename lastAttemptedTimeStamp to lastAttemptedOrReportedTime?


> [SPS]: NN switch and rescheduling movements can lead to have more than one 
> coordinator for same file blocks
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-11334
>                 URL: https://issues.apache.org/jira/browse/HDFS-11334
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: datanode, namenode
>    Affects Versions: HDFS-10285
>            Reporter: Uma Maheswara Rao G
>            Assignee: Rakesh R
>             Fix For: HDFS-10285
>
>         Attachments: HDFS-11334-HDFS-10285-00.patch
>
>
> I am summarizing here the scenarios that Rakesh and I discussed offline:
> We need to handle a couple of cases:
> # NN switch - the new active NN will freshly start scheduling for all files.
>        At this time, old co-ordinators may continue the movement work and
> send results back. This could confuse the NN SPS about which result is the
> right one.
>   *NEED TO HANDLE*
> # DN disconnected beyond heartbeat expiry - if a DN is disconnected for a
> long time (more than the heartbeat expiry), NN will remove that node. After
> the SPS Monitor times out, it may retry the files which were scheduled to
> that DN by finding a new co-ordinator. But if the DN reconnects after NN
> reschedules, we may get different results from different co-ordinators.
> *NEED TO HANDLE*
> # NN restart - should be the same as point 1.
> # DN disconnect - when a DN simply disconnects and reconnects immediately
> (before heartbeat expiry), there should not be any issues.
> *NEED NOT HANDLE*, but we can think of more scenarios if anything is missing.
> # DN restart - if a DN restarts, it cannot send any results as it will lose
> everything. After the NN SPS Monitor timeout, NN will retry.
> *NEED NOT HANDLE*, but we can think of more scenarios if anything is missing.


