[
https://issues.apache.org/jira/browse/TEZ-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14521141#comment-14521141
]
Siddharth Seth commented on TEZ-1752:
-------------------------------------
Spoke to Rajesh offline about committing the patch into TEZ-2003 instead of
master for now, and hardening it there.
Looking at the patch, especially TaskRunner and LogicalIORuntimeTask changes, I
think it can be split into two - changes to LogicalIOProcessorRuntimeTask,
RuntimeTask, TaskReporter, TezTaskRunner go into the branch, and the rest into
the master. Rajesh, thoughts on splitting it ?
Comments (TaskReporter, LogicalIORuntimeTask etc changes) - mostly minor
- processorClosed=true, initializedInputs.remove, initializedOutputs.remove() -
should be fore the actual invocation of close. Otherwise if there's an error we
may try to invoke close twice.
- {code} LOG.info("Cleanup is complete. EventRouterThread is yet to be
interrupted"); {code} Nit: This may be misleading since the eventRouter may
have been interrupted during the close() finally block.
- {code}+ } catch (Throwable t) {
+ LOG.warn("Error in final cleanup of task. ", t);{code} This isn't
necessary. Exceptions are already being handled for individual close calls.
- pubilc enum State - Can be an isRunning method like the existing
hasInitialized method, instead of making the enum public
- In TaskRunnerCallable, is there a need to check for InterruptedExceptions in
the catch Throwable clause ?
- Nit: task will never be null in TezTaskRunner
- An exception from abortTask should probably be reported as a Failure. Can be
fixed later in the branch.
- Before taskFuture = executor.submit(callable); - checking the interrupt
status may be useful. Otherwise the task would start, and the get is
immediately interrupted.
- maybeInterruptWaitingThread - Don't think we should be interrupting the main
thread at this point. That can have several consequences. If task.run()
returned without an Exception, but due to an abort/interrupt invocation, and
the close succeeds after this - we'll try reporting the task as successful. If
there's an error from close, there's a possibility that the main thread would
have deregistered the task from the Reporter, and the TaskRunnerCallable thread
would fail while indicating an error. Throwing an InterruptedException here and
gracefully falling off the TaskRunnerCallable thread should take care of this.
After this, the main run() thread regains control and can shutdown.
- For a future jira. Inputs / Outputs could check for interrupts during
shutdown to prevent costly spill attempts. The rest of the patch probably
addresses this though.
For the rest, which can go into master
- Remove Exception from the abort() method signature ?
- MergeManager.close - should this be throwing an InterruptedException in case
of an Interrupt during Merge. Shuffle is changed to handle InterruptedException
cleanly.
- Some logs to be removed.
- In PipelinedSorter - isThreadInterrupted - should this cause the loop to exit
during the partition iteration.
- PipelinedSorter - there's a lot of changes. I think most of them are related
to moving the code into a try catch though ?
- For the cleanup - can we hide this behind a configuration, which defaults to
false for now. We don't support preemption of tasks without containers on
master yet, so the container will get killed and cleaned up.
> Inputs / Outputs in the Runtime library should be interruptable
> ---------------------------------------------------------------
>
> Key: TEZ-1752
> URL: https://issues.apache.org/jira/browse/TEZ-1752
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Siddharth Seth
> Assignee: Rajesh Balamohan
> Attachments: TEZ-1752.1.patch, TEZ-1752.2.patch, TEZ-1752.3.patch,
> TEZ-1752.4.patch, TEZ-1752.5.patch
>
>
> Not possible to preempt tasks without killing containers without this.
> There's still the problem of Processors not supporting interrupts. We may
> need API enhancements to either query IPOs on whether they are interrupbtible.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)