[ 
https://issues.apache.org/jira/browse/TEZ-1116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14005616#comment-14005616
 ] 

Hitesh Shah commented on TEZ-1116:
----------------------------------

Comments:

Overall comments:
  - is there a reason to have 2 reporters? After the refactor, we still have 
out-of-band heartbeats. Any chance they can be combined into a single polling 
model?

ContainerReporter:
  - on shutdown, a signal is to the condition. This wakes up the thread in the 
for loop. Even though, the shutdown has been invoked, the code will still end 
up trying to do an umbilical.getTask(). Seems like this should be avoided. 
  - I think the code could be made simpler in ContainerTask::call() by just 
doing umbilical.getTask directly and not even having the isNewGetTask var.

TezRuntimeChildJVM:
  - there seems to be quite some unused code. Should file a separate jira to 
clean this class up. 
  - "vargs.add(Environment.JAVA_HOME.$() + "/bin/java");" uses / instead of 
OS-independent path separators.

TaskReporter:
  - some documentation would be helpful for other devs down the line
       - especially around use of nonOobHeartbeatCounter and 
prevCounterSendHeartbeatNum 
  - does HeartbeatCallable need a shutdown flag in addition to the task 
done/fatal error check in its loop? What happens if the container is being 
shutdown without the task completing? Currently, shutting down the executor 
sends an interrupt to the condition which will just retry and invoke 
heartbeat()  
  - "LOG.info("Exiting TaskReporter therad with pending queue size=" + 
eventsToSend.size());" - is this going to create an issue? Should this be a 
warning/error log message?
  - If the plan is to use this to run multiple tasks, how is the 
request/response id in the heartbeat rpc supposed to work?
  - "maybeLogCounters()" - why not just log everytime we send counters to the 
AM or is there a need to log on a different cadence? 

{code}
+      nextHeartbeatNumToLog = (Math.max(1,
+          (int) (LOG_COUNTER_START_INTERVAL / (amPollInterval == 0 ? 0.000001f
+              : (float) amPollInterval))));
{code}
   - from a performance point, not sure if the floating point op is really 
useful - why not "if amPollInterval == 0, use 1 else do max(1, 
startinterval/amPollInterval )"  ? 

TezChild:
   - "containerTask = getTaskFuture.get();"  - can containerTask ever be null?  
 
  - typo - "requeted" - multiple places.
  - container reporter and task runner on the same executor with fixed thread 
pool size? This should work for a single task model but break with multiple 
tasks. 
 
{code}
+        // Effectively dead code. Must return something.
+        return true;
{code}
  - return false instead? ( same code change in 2 places )

{code}
+    } catch (ExecutionException e) {
+      // Exception thrown by the run() method itself.
+      Throwable cause = e.getCause();
+      if (cause instanceof FSError) {
+        failureCause = cause;
+      } else if (cause instanceof Error) {
{code}
  - may be good to add some comments on FSError is not an immediate fatal 
error? 

{code}
+                } catch (Exception ignored) {
+                  // Ignored since another cause is already known
+                }
{code}
  - please log the exception

{code}
+              if (firstException == null) {
+                task.setFrameworkCounters();
{code}
   - any reason the counters are not set immediately after task.close()?

Still need to review the tests.








> Refactor YarnTezDAGChild
> ------------------------
>
>                 Key: TEZ-1116
>                 URL: https://issues.apache.org/jira/browse/TEZ-1116
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Siddharth Seth
>            Assignee: Siddharth Seth
>         Attachments: TEZ-1116.1.wip.txt, TEZ-1116.2.txt
>
>
> To be more testable, and usable elsewhere - example Local mode.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Reply via email to