[
https://issues.apache.org/jira/browse/HDFS-11334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15939447#comment-15939447
]
Uma Maheswara Rao G commented on HDFS-11334:
--------------------------------------------
Hi Rakesh, Thank you for working on this task.
Below is my feedback:
# .
{code}
  public static final int DFS_STORAGE_POLICY_SATISFIER_SELF_RETRY_TIMEOUT_MILLIS_DEFAULT =
-      30 * 60 * 1000;
+      20 * 60 * 1000;
+  public static final String DFS_DATANODE_STORAGE_POLICY_SATISFIER_WORKER_INPROGRESS_RECHECK_TIME_MILLIS_KEY =
+      "dfs.datanode.storage.policy.satisfier.worker.inprogress.recheck.time.millis";{code}
How about removing this configuration item and deriving the value from the heartbeat
interval instead, say 2 * heartbeatInterval? Thoughts?
Also, the name looks too long; it would be good to make it shorter.
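A minimal sketch of that suggestion, assuming the existing dfs.heartbeat.interval key is
reused and the worker has a Configuration handle (the variable names below are only
illustrative, not from the patch):
{code}
// Hypothetical sketch: derive the in-progress recheck interval from the
// existing heartbeat interval instead of introducing a new config key.
long heartbeatIntervalSec = conf.getLong(
    DFSConfigKeys.DFS_HEARTBEAT_INTERVAL_KEY,
    DFSConfigKeys.DFS_HEARTBEAT_INTERVAL_DEFAULT);
// Re-check in-progress block movements every 2 * heartbeat interval.
long inprogressRecheckIntervalMs = 2 * heartbeatIntervalSec * 1000L;
{code}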
# .
{code}
+        if (receivedCompletionStatus) {
+          storageMovementAttemptedItems
+              .remove(storageMovementAttemptedResult.getTrackId());
+        }{code}
Maybe we can have a boolean param isInprogress and set it to true only when the result
status says in-progress. Then have the above check as below:
if (!isInprogress) {
  storageMovementAttemptedItems
      .remove(storageMovementAttemptedResult.getTrackId());
}
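A hedged sketch of that idea, assuming the result type is BlocksStorageMovementResult as
in the current code; the helper method name is only illustrative:
{code}
// Illustrative only: pass the in-progress flag explicitly so the intent is
// obvious at the removal site.
private void handleAttemptedResult(BlocksStorageMovementResult result,
    boolean isInprogress) {
  if (!isInprogress) {
    // Movement reached a terminal state (success/failure); stop tracking it.
    storageMovementAttemptedItems.remove(result.getTrackId());
  }
  // In-progress results keep the trackId so the monitor can re-check later.
}
{code}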
# Typo: "Didn't dropped SPS work" --> "Didn't drop ...."
# I would like to see a doc comment for the test testDeadDatanode.
# .
{code}
+ // add sps drop command
+ nodeS.setDropSPSWork(true);
{code}
I think we need to write a comment here explaining why we are asking the node to drop
its SPS work.
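For example, something along these lines (the surrounding line is from the patch; the
comment wording is only a suggestion based on the scenarios in the issue description):
{code}
+    // When a dead datanode re-registers, the NN may already have re-scheduled
+    // its pending SPS block movements to another coordinator. Ask the node to
+    // drop any stale SPS work so that two coordinators do not report results
+    // for the same file blocks.
+    nodeS.setDropSPSWork(true);
{code}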
# The header comments of BlockStorageMovementAttemptedItems should be updated to reflect
the new change.
# Should we rename lastAttemptedTimeStamp to lastAttemptedOrReportedTime?
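Whatever name is chosen, the intent is that the field records the last time we attempted
the move or received a report for it, and the monitor retries only after the self-retry
timeout elapses from that point. A rough sketch, with purely illustrative field and
method names (not from the patch):
{code}
// Illustrative monitor check tied to the rename above.
long now = Time.monotonicNow();
if (now - item.getLastAttemptedOrReportedTime() > selfRetryTimeoutMs) {
  // No terminal result within the timeout; re-queue so a new coordinator
  // can be chosen for the remaining block movements of this trackId.
  blockStorageMovementNeeded.add(item.getTrackId());
}
{code}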
> [SPS]: NN switch and rescheduling movements can lead to have more than one
> coordinator for same file blocks
> -----------------------------------------------------------------------------------------------------------
>
> Key: HDFS-11334
> URL: https://issues.apache.org/jira/browse/HDFS-11334
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: datanode, namenode
> Affects Versions: HDFS-10285
> Reporter: Uma Maheswara Rao G
> Assignee: Rakesh R
> Fix For: HDFS-10285
>
> Attachments: HDFS-11334-HDFS-10285-00.patch
>
>
> I am summarizing here the scenarios that Rakesh and I discussed offline:
> Here we need to handle a couple of cases:
> # NN switch - it will freshly start scheduling for all files.
> At this time, old co-ordinators may continue movement work and send
> results back. This could confuse the NN SPS about which result is the
> right one.
> *NEED TO HANDLE*
> # DN disconnected past heartbeat expiry - if a DN is disconnected for a long
> time (more than the heartbeat expiry), the NN will remove the node. After the
> SPS Monitor times out, it may retry the files which were scheduled to that DN
> by finding a new co-ordinator. But if the node reconnects after the NN
> reschedules, we may get different results from different co-ordinators.
> *NEED TO HANDLE*
> # NN Restart - should be the same as point 1.
> # DN disconnect - when the DN disconnects briefly and reconnects
> immediately (before heartbeat expiry), there should not be any issues.
> *NEED NOT HANDLE*, but we can think of more scenarios if anything is missing.
> # DN Restart - if the DN restarts, it cannot send any results as it will lose
> everything. After the NN SPS Monitor timeout, the NN will retry.
> *NEED NOT HANDLE*, but we can think of more scenarios if anything is missing.