[
https://issues.apache.org/jira/browse/TEZ-1909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14366825#comment-14366825
]
Jeff Zhang edited comment on TEZ-1909 at 3/20/15 9:14 AM:
----------------------------------------------------------
Attach the new patch to address the review comment. [~hitesh] Please help
review
Apart from the issues in the review comments, I also found there's one issue
about RecoveryService. For the scenario of draining the events before
RecoverySerivce is stopped, previously I take the event queue's size equal to
zero as an indication of events are all consumed, but it is not true. Because
even if the event queue is empty, the event may still been processing. I fix
this bug in the new patch just like AsyncDispatcher did.
bq. the "if (skipAllOtherEvents) {" check is probably also needed at the top of
the loop to prevent new files from being opened and read ( in addition to
short-circuiting the read of all events in the given file ). Maybe just log a
message that other files were present and skipped
Fix it. also add unit test in TestRecoveryParser
bq. any reason why this is needed in the DAGAppMaster "Set<String> getDagIDs()"
?
Only for unit test. But in the new patch, I remove it and initialize the Set in
the setup method.
bq. also, we should add a test for adding corrupt data to the summary stream
and ensuring that its processing fails
Done.
bq. I do not see TEZ_AM_RECOVERY_HANDLE_REMAINING_EVENT_WHEN_STOPPED being used
anywhere apart from being set to true in one of the tests.
Fix it.
bq. please replace "import com.sun.tools.javac.util.List;" with java.lang.List
Fix it
bq. testCorruptedLastRecord should also verify that the dag submitted event was
seen.
Done. verify DAGAppMaster.createDAG is invoked.
was (Author: zjffdu):
Attach the new patch to address the review comment.
Apart from the issues in the review comments, I also found there's one issue
about RecoveryService. For the scenario of draining the events before
RecoverySerivce is stopped, previously I take the event queue's size equal to
zero as an indication of events are all consumed, but it is not true. Because
even if the event queue is empty, the event may still been processing. I fix
this bug in the new patch just like AsyncDispatcher did.
bq. the "if (skipAllOtherEvents) {" check is probably also needed at the top of
the loop to prevent new files from being opened and read ( in addition to
short-circuiting the read of all events in the given file ). Maybe just log a
message that other files were present and skipped
Fix it. also add unit test in TestRecoveryParser
bq. any reason why this is needed in the DAGAppMaster "Set<String> getDagIDs()"
?
Only for unit test. But in the new patch, I remove it and initialize the Set in
the setup method.
bq. also, we should add a test for adding corrupt data to the summary stream
and ensuring that its processing fails
Done.
bq. I do not see TEZ_AM_RECOVERY_HANDLE_REMAINING_EVENT_WHEN_STOPPED being used
anywhere apart from being set to true in one of the tests.
Fix it.
bq. please replace "import com.sun.tools.javac.util.List;" with java.lang.List
Fix it
bq. testCorruptedLastRecord should also verify that the dag submitted event was
seen.
Done. verify DAGAppMaster.createDAG is invoked.
> Remove need to copy over all events from attempt 1 to attempt 2 dir
> -------------------------------------------------------------------
>
> Key: TEZ-1909
> URL: https://issues.apache.org/jira/browse/TEZ-1909
> Project: Apache Tez
> Issue Type: Sub-task
> Reporter: Hitesh Shah
> Assignee: Jeff Zhang
> Attachments: TEZ-1909-1.patch, TEZ-1909-2.patch, TEZ-1909-3.patch
>
>
> Use of file versions should prevent the need for copying over data into a
> second attempt dir. Care needs to be taken to handle "last corrupt record"
> handling.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)