[ 
https://issues.apache.org/jira/browse/TEZ-1116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14006188#comment-14006188
 ] 

Siddharth Seth commented on TEZ-1116:
-------------------------------------

Thanks for the review.

bq. is there a reason to have 2 reporters? After the refactor, we still have 
out-of-band heartbeats. Any chance they can be combined into a single polling 
model?
Not changing any of the protocols in this patch, there shouldn't be any changes 
to the AM. This is just refactoring it to be testable and re-usable. In terms 
of moving to a single reporter at a future point, we should probably do this. 
Out of band heartbeats will be required in any case to send out events 
generated by tasks faster.

bq. ContainerReporter ...
Fixed both

bq. some documentation
done

bq. 

bq. What happens if the container is being shutdown without the task completing?
Container shutdown from the AM is currently sent via the heartbeat, or via 
getTask. During the heartbeat, the callable will just exit and return a false. 
During a getTask call - the heartbeat thread should not be running.

bq. "LOG.info("Exiting TaskReporter therad with pending queue size=" + 
eventsToSend.size());" - is this going to create an issue? Should this be a 
warning/error log message?
Change to log only if there are pending events.

bq. If the plan is to use this to run multiple tasks, how is the 
request/response id in the heartbeat rpc supposed to work?
This patch isn't targeting fixing multiple tasks - but should make it easier. 
Tasks belonging to the same AM will eventually share a single heartbeat thread, 
with task information specified on the call.

bq. "maybeLogCounters()" - why not just log everytime we send counters to the 
AM or is there a need to log on a different cadence?
That would be every second by default, which would produce fairly verbose logs.

bq. from a performance point, not sure if the floating point op is really 
useful - why not "if amPollInterval == 0, use 1 else do max(1, 
startinterval/amPollInterval )" ?
Setting it to 1, for a low poll interval will end up logging very often if the 
poll interval is 0. Using int arithmetic will do the same for a small heartbeat 
interval. This operation is done once per task - subsequent calculations do not 
using floating point ops.

bq. "containerTask = getTaskFuture.get();" - can containerTask ever be null?
Not as far as I know. It could be interrupted - added shutdown handling for 
that.

bq. typo - "requeted" - multiple places.
Fixed

bq. container reporter and task runner on the same executor with fixed thread 
pool size? This should work for a single task model but break with multiple 
tasks.
Not targeting multiple tasks at the moment.

bq. return false instead? ( same code change in 2 places )
Sure. Added an assertion as well.

bq. may be good to add some comments on FSError is not an immediate fatal error?
Done

bq. please log the exception
Done

bq. any reason the counters are not set immediately after task.close()?
Nope, likely inherited from the previous code. IAC, changed.

Will try adding a test for the getTask issue and upload a new patch.




> Refactor YarnTezDAGChild
> ------------------------
>
>                 Key: TEZ-1116
>                 URL: https://issues.apache.org/jira/browse/TEZ-1116
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Siddharth Seth
>            Assignee: Siddharth Seth
>         Attachments: TEZ-1116.1.wip.txt, TEZ-1116.2.txt
>
>
> To be more testable, and usable elsewhere - example Local mode.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Reply via email to