[
https://issues.apache.org/jira/browse/PIG-457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Shravan Matthur Narayanamurthy updated PIG-457:
-----------------------------------------------
Status: Patch Available (was: Open)
This patch tries to address two issues:
1) Exceptions and traces are printed even after a successful completion:
Currently, the success case and the failure case share the same code path for
getting & printing error messages. This fix splits that code path: failures
that occur in a run which still completes successfully (they were resolved by
retries) are logged at debug level, and failures in an unsuccessful run are
logged at error level.
2) Progress shows 100% even if there are failures:
This is a direct result of what Hadoop does. It marks the map and reduce tasks
as 100% complete irrespective of their success or failure. In some sense these
are unrelated dimensions. Since it's better to relate the two, we need to make
sure that we don't report 100% complete for a failed execution. This is a hack:
I check whether the progress has reached 100% and postpone displaying it until
I am sure that the job has completed successfully.
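A minimal sketch of the two decisions described above, not the code in the
patch. The class and method names are hypothetical, java.util.logging stands in
for whatever logger Pig uses (FINE plays the role of debug, SEVERE of error),
and -1 is an assumed marker for "display nothing yet".

```java
import java.util.logging.Level;

// Hypothetical helper; names and the -1 sentinel are assumptions for illustration.
class CompletionReporting {

    // Failures in a run that ended successfully were fixed by retries, so they
    // are only worth a debug-level message; otherwise they are real errors.
    static Level failureLogLevel(boolean jobSucceeded) {
        return jobSucceeded ? Level.FINE : Level.SEVERE;
    }

    // Returns the percentage to display, or -1 to hold the display back.
    // 100% is shown only once the job is known to have completed successfully.
    static int displayablePercent(int percent, boolean jobSucceeded) {
        if (percent < 100) {
            return percent;
        }
        return jobSucceeded ? 100 : -1;
    }
}
```

With this shape, a failed run never prints 100%, because the last value handed
to the display is whatever was reported before the tasks were marked complete.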
There are some other fixes to the logic that displays the completion
percentage. In this code we are chasing a moving target: when we assume that a
job is in a particular state & do some processing based on that assumption, we
might get spurious results. One example: we get the list of running jobs and
try to get the progress of each job. While we do this, the state of a job
might change from running to something else, and it's not easy to build all the
possible scenarios into the code. Thus when we try to fetch the progress of a
previously running job which has changed state, we get spurious results. To
mitigate this, we make the simple assumption that a job's progress can't
regress; if we see such a condition, we ignore it, as we know it's temporary.
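The "progress can't regress" assumption can be sketched as follows. This is an
illustration only; the class name and the use of a 0.0 to 1.0 fraction are
assumptions, not taken from the patch.

```java
// Hypothetical tracker: keeps the highest progress seen so far and ignores
// any lower reading, treating it as a temporary artifact of a state change.
class MonotonicProgress {
    private double last = 0.0;

    // observed: progress fraction just fetched for the job (0.0 to 1.0).
    // Returns the value that should be reported.
    double update(double observed) {
        if (observed > last) {
            last = observed;
        }
        return last;
    }
}
```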
Another thing that has been introduced into the logic is an exponential delay
scheme, which is useful when a job is not progressing, maybe due to bag
spilling or some UDF running. In this case each progress reported is the same
for some time. During this time, we can either impose a hard limit, i.e. if we
don't see progress we don't report it, or we can just keep reporting the same
progress. There are cons with both approaches. With the first, if we don't
display anything, it might seem like the job is stuck, or it is unclear whether
any processing is happening. With the second, it's surely going to fill the
screen with something that adds no information. So we introduce delays between
batches of identical progress displays, and these delays increase exponentially
with each completed batch. Currently the batch size is half the number of
retries, which comes to 6 with the current sleep time of 5 sec; the idea is to
report progress about every 30 sec, but to delay further displays of the same
progress using an exponential delay scheme.
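One possible reading of the batching scheme, sketched under assumptions: the
class name is hypothetical, the delay is counted in polls rather than seconds,
and the doubling factor is a guess at what "exponential" means here. The patch
itself may differ in all of these.

```java
// Hypothetical throttle for repeated, identical progress readings.
class RepeatThrottle {
    private final int batchSize;   // repeats shown before a delay kicks in
    private int lastPercent = -1;  // last reading seen
    private int shownInBatch = 0;  // repeats shown in the current batch
    private int batchesDone = 0;   // completed batches of the same reading
    private long pollsToSkip = 0;  // remaining polls to stay silent

    RepeatThrottle(int batchSize) {
        this.batchSize = batchSize;
    }

    // Called once per poll; returns true if this reading should be displayed.
    boolean shouldDisplay(int percent) {
        if (percent != lastPercent) {
            // Real progress: show it and reset all delay state.
            lastPercent = percent;
            shownInBatch = 0;
            batchesDone = 0;
            pollsToSkip = 0;
            return true;
        }
        if (pollsToSkip > 0) {
            pollsToSkip--;
            return false;
        }
        shownInBatch++;
        if (shownInBatch == batchSize) {
            // Batch finished: stay silent for 2, 4, 8, ... polls (capped).
            shownInBatch = 0;
            batchesDone++;
            pollsToSkip = 1L << Math.min(batchesDone, 20);
        }
        return true;
    }
}
```

With a batch size of 2, a stalled job prints the same value three times (the
first sighting plus one batch), goes quiet for 2 polls, prints a batch of 2,
goes quiet for 4 polls, and so on; any change in the value resets the scheme.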
> pig produces errors after a job is said to be 100% done
> -------------------------------------------------------
>
> Key: PIG-457
> URL: https://issues.apache.org/jira/browse/PIG-457
> Project: Pig
> Issue Type: Bug
> Affects Versions: types_branch
> Reporter: Olga Natkovich
> Assignee: Shravan Matthur Narayanamurthy
> Fix For: types_branch
>
> Attachments: 457-2.patch
>
>
> It is possible that we get errors for all tasks even the ones we retried.
> Need to look at the code that handles detecting end of processing and
> producing errors.