[
https://issues.apache.org/jira/browse/HDFS-11493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Anu Engineer updated HDFS-11493:
--------------------------------
Attachment: HDFS-11493-HDFS-7240.003.patch
[~cheersyang] Thanks for the review comments. This patch addresses all issues.
Specifics inline ..
bq. getCommand() seems to always return DEFAULT_LIST.
Thanks for catching that, fixed.
bq. Looks like Commands#getCommands() evicts the commands for a datanode by
setting this.commands to a new list. Why not to let Commands#getCommands()
return Commands and modify the code to be something like,
Did exactly that; thanks for the suggestion.
bq. I don't understand the point of maintaining commandsInQueue, if you want to
count the number of commands in the queue, why not count the number of commands
in commandMap?
That is exactly what we are doing. We have a map with a list of commands per
datanode, so each time we add a command we just increment a running counter
instead of re-counting the map on every query.
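The counting pattern described above can be sketched roughly as follows. This is a minimal illustration only; the class and method names here are made up and are not the patch's actual code:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: a per-datanode command map that keeps a running count,
// so the queue-size lookup is O(1) instead of summing every list.
class CommandQueueSketch {
  private final Map<String, List<String>> commandMap = new HashMap<>();
  private int commandsInQueue = 0;

  synchronized void addCommand(String datanode, String command) {
    commandMap.computeIfAbsent(datanode, k -> new ArrayList<>()).add(command);
    commandsInQueue++; // running counter, updated on every add
  }

  // Evict and return all pending commands for a datanode.
  synchronized List<String> getCommands(String datanode) {
    List<String> cmds = commandMap.remove(datanode);
    if (cmds == null) {
      return new ArrayList<>();
    }
    commandsInQueue -= cmds.size(); // keep the counter consistent on eviction
    return cmds;
  }

  synchronized int getCommandsInQueue() {
    return commandsInQueue;
  }
}
```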
bq. The timeout argument in HadoopExecutors#shutdown() doesn't seem to work as
expected. Imagine passing 5s timeout to shutdown method, and line 117 waits 5s
but could not get tasks gracefully shutdown, then in line 121 it will wait for
another 5s.
Correct; this is an issue with the Java threading model. On one hand, the user
says wait for 5 seconds, expressing an intent to wait that long for any running
tasks that entered the queue just before we executed the shutdown. That is the
graceful-shutdown part. However, the threads are under no obligation to finish
executing within that time frame (it depends on how the task was written), so
if we hit the timeout, we call {{shutdownNow()}}, which actively tries to stop
the executing tasks. Even that is nothing but a best effort; if it fails, we
log an error.
The other option is to divide the time specified by the user by two and wait
only that long in each phase. But then the user expresses an intent of 5
seconds and we really only wait 2.5 seconds, whereas the task might have
finished in 3 seconds.
Bottom line: the Java threading model is not very user friendly, and
{{Thread.stop()}} was never a correct alternative to begin with.
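The two-phase sequence described above is the standard {{java.util.concurrent}} idiom; a minimal sketch (this is not the {{HadoopExecutors}} code itself, just an illustration of the pattern under discussion):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

public class ShutdownSketch {
  // Attempt a graceful shutdown within 'timeout'; on expiry, fall back
  // to shutdownNow(). Even that is best-effort, since tasks are free to
  // ignore interruption.
  static boolean shutdownGracefully(ExecutorService pool,
                                    long timeout, TimeUnit unit) {
    pool.shutdown(); // stop accepting new tasks
    try {
      if (pool.awaitTermination(timeout, unit)) {
        return true; // all queued tasks finished within the grace period
      }
      pool.shutdownNow(); // interrupt still-running tasks
      // Second wait: if this also times out, callers typically log an error.
      return pool.awaitTermination(timeout, unit);
    } catch (InterruptedException e) {
      pool.shutdownNow();
      Thread.currentThread().interrupt();
      return false;
    }
  }
}
```

Note that with this shape, a caller asking for 5 seconds can end up waiting up to 10 seconds across the two phases, which is exactly the trade-off discussed above.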
bq. Can we move these configuration properties to ScmConfigKeys?
Done
bq. line 144 seems unnecessary to call toString() as there is a constructor
Well, you are right. It was a lame attempt on my part to remind the reader
that {{getHostAddress}} returns a String and not an {{InetAddress}}.
Evidently, I have failed. I have removed the {{toString()}} and left
{{getHostAddress}} as is. Just for the record, we still pass a String and a
port to the {{InetAddress}}.
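For illustration, the point about {{getHostAddress}} looks roughly like this (the helper name is made up, not the patch's code):

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;

public class AddressSketch {
  // InetAddress.getHostAddress() returns the textual IP as a String,
  // not another InetAddress; that String plus a port is enough to
  // build an InetSocketAddress, so no extra toString() is needed.
  static InetSocketAddress toSocketAddress(InetAddress addr, int port) {
    String host = addr.getHostAddress(); // e.g. "127.0.0.1"
    return new InetSocketAddress(host, port);
  }
}
```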
bq. Is ContainerReplicationManager better to extend AbstractService? Currently
it starts the thread in the constructor, it's better to handle this in its own
serviceStart and serviceStop, what do you think?
Not sure that buys us anything. This patch does not show it yet, but once all
the parts of container replication are in, this will be invoked from the SCM.
SCM is already a service; this is just a layer that SCM calls on to do a job.
bq. startReconciliation() iterates each datanode in a pool on getNodestate().
getNodestate(), however, would wait and retry for at most 100s, this will let
rest of datanodes wait. Can we make this parallel?
We could, but I think we should do that only after we profile these code
paths. It is quite possible that the bottleneck in this code is not in the
start-processing part but in where we handle the container reports. So unless
we actually see {{getNodeState}} taking too long, I would suggest we assume it
is near-instantaneous in most cases.
bq. There seems to be a lot of variables such as nodeProcessed,
containerProcessedCount, nodeCount are not thread safe.
Thanks, fixed.
> Ozone: SCM: Add the ability to handle container reports
> ---------------------------------------------------------
>
> Key: HDFS-11493
> URL: https://issues.apache.org/jira/browse/HDFS-11493
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: ozone
> Affects Versions: HDFS-7240
> Reporter: Anu Engineer
> Assignee: Anu Engineer
> Attachments: container-replication-storage.pdf,
> HDFS-11493-HDFS-7240.001.patch, HDFS-11493-HDFS-7240.002.patch,
> HDFS-11493-HDFS-7240.003.patch
>
>
> Once a datanode sends the container report it is SCM's responsibility to
> determine if the replication levels are acceptable. If it is not, SCM should
> initiate a replication request to another datanode. This JIRA tracks how SCM
> handles a container report.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)