[ https://issues.apache.org/jira/browse/TEZ-1116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14006188#comment-14006188 ]
Siddharth Seth commented on TEZ-1116: ------------------------------------- Thanks for the review. bq. is there a reason to have 2 reporters? After the refactor, we still have out-of-band heartbeats. Any chance they can be combined into a single polling model? Not changing any of the protocols in this patch, there shouldn't be any changes to the AM. This is just refactoring it to be testable and re-usable. In terms of moving to a single reporter at a future point, we should probably do this. Out of band heartbeats will be required in any case to send out events generated by tasks faster. bq. ContainerReporter ... Fixed both bq. some documentation done bq. bq. What happens if the container is being shutdown without the task completing? Container shutdown from the AM is currently sent via the heartbeat, or via getTask. During the heartbeat, the callable will just exit and return a false. During a getTask call - the heartbeat thread should not be running. bq. "LOG.info("Exiting TaskReporter therad with pending queue size=" + eventsToSend.size());" - is this going to create an issue? Should this be a warning/error log message? Change to log only if there are pending events. bq. If the plan is to use this to run multiple tasks, how is the request/response id in the heartbeat rpc supposed to work? This patch isn't targeting fixing multiple tasks - but should make it easier. Tasks belonging to the same AM will eventually share a single heartbeat thread, with task information specified on the call. bq. "maybeLogCounters()" - why not just log everytime we send counters to the AM or is there a need to log on a different cadence? That would be every second by default, which would produce fairly verbose logs. bq. from a performance point, not sure if the floating point op is really useful - why not "if amPollInterval == 0, use 1 else do max(1, startinterval/amPollInterval )" ? Setting it to 1, for a low poll interval will end up logging very often if the poll interval is 0. Using int arithmetic will do the same for a small heartbeat interval. This operation is done once per task - subsequent calculations do not using floating point ops. bq. "containerTask = getTaskFuture.get();" - can containerTask ever be null? Not as far as I know. It could be interrupted - added shutdown handling for that. bq. typo - "requeted" - multiple places. Fixed bq. container reporter and task runner on the same executor with fixed thread pool size? This should work for a single task model but break with multiple tasks. Not targeting multiple tasks at the moment. bq. return false instead? ( same code change in 2 places ) Sure. Added an assertion as well. bq. may be good to add some comments on FSError is not an immediate fatal error? Done bq. please log the exception Done bq. any reason the counters are not set immediately after task.close()? Nope, likely inherited from the previous code. IAC, changed. Will try adding a test for the getTask issue and upload a new patch. > Refactor YarnTezDAGChild > ------------------------ > > Key: TEZ-1116 > URL: https://issues.apache.org/jira/browse/TEZ-1116 > Project: Apache Tez > Issue Type: Improvement > Reporter: Siddharth Seth > Assignee: Siddharth Seth > Attachments: TEZ-1116.1.wip.txt, TEZ-1116.2.txt > > > To be more testable, and usable elsewhere - example Local mode. -- This message was sent by Atlassian JIRA (v6.2#6252)