[ 
https://issues.apache.org/jira/browse/PIG-457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shravan Matthur Narayanamurthy updated PIG-457:
-----------------------------------------------

    Status: Patch Available  (was: Open)

There are two issues that this patch tries to address:
1) Exceptions and traces even after a successful completion:
Currently, we have the same code path for both the success case & failure case 
for getting & printing error messages. So this fix splits the code path: task 
failures in a run that completes successfully (because retries resolved them) 
are logged at debug level, and failures in an unsuccessful run are logged at 
error level.
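
The split described above can be sketched roughly as follows. This is only an 
illustration, not the code in the patch: the class and method names are 
hypothetical, and java.util.logging is used (FINE standing in for debug, 
SEVERE for error) just to keep the example self-contained.

```java
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper: picks the log level for task failure messages
// depending on how the whole job ended.
final class TaskFailureReporter {
    private static final Logger LOG =
            Logger.getLogger(TaskFailureReporter.class.getName());

    // Failures that retries recovered from are only interesting when
    // debugging; failures in a failed job are real errors.
    static Level levelFor(boolean jobSucceeded) {
        return jobSucceeded ? Level.FINE : Level.SEVERE;
    }

    // Logs every collected task failure message at the chosen level.
    static void report(boolean jobSucceeded, List<String> taskMessages) {
        Level level = levelFor(jobSucceeded);
        for (String message : taskMessages) {
            LOG.log(level, message);
        }
    }
}
```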

2) Shows 100% even if there are failures
This is a direct result of what hadoop does. It marks the map and reduce tasks 
as 100% complete irrespective of their success or failure. In some sense these 
are unrelated dimensions. Since it's better to relate these two, we need to make 
sure that we don't report 100% complete in case of a failed execution. This is 
a hack where I check if the progress has become 100% and postpone its display 
till I am sure that the job has completed successfully.
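
A minimal sketch of that postponement, assuming the caller knows whether the 
job has finished and whether it succeeded. The class name, method name and 
parameters are hypothetical and not taken from the patch.

```java
import java.util.OptionalInt;

// Hypothetical helper: decides what percentage, if any, may be shown.
final class CompletionGate {
    // Returns the percentage to display, or empty if the display should
    // be postponed (100% reported but success not yet confirmed).
    static OptionalInt displayable(int reportedPercent,
                                   boolean jobFinished,
                                   boolean jobSucceeded) {
        if (reportedPercent < 100) {
            return OptionalInt.of(reportedPercent);
        }
        // Hadoop reports 100% for failed tasks too, so hold it back
        // until the job is known to have completed successfully.
        if (jobFinished && jobSucceeded) {
            return OptionalInt.of(100);
        }
        return OptionalInt.empty();
    }
}
```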

There are some other fixes to the logic that displays the percentage 
completion. In the code we are chasing a moving target: when we assume that 
the job is in a particular state & try to do some processing based on that 
assumption, we might get spurious results. One example is we get the list of 
running jobs and try to get the progress for each job. While doing this, the 
state of a job might change from running to something else, and it's not easy 
to cover all the possible scenarios in the code. Thus when we try to fetch the 
progress of a previously running job which has changed state, we will get 
spurious results. To mitigate this, we make a simple assumption that a job's 
progress can't regress, and if we see such a condition, we ignore it as we 
know it's temporary.
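
The "can't regress" assumption amounts to keeping the highest progress seen so 
far and ignoring any lower sample. A small sketch of that idea follows; the 
class is hypothetical, and progress is assumed to be a fraction between 0.0 
and 1.0.

```java
// Hypothetical helper: remembers the highest progress seen so far.
final class MonotonicProgress {
    private double highest = 0.0;

    // Takes a freshly fetched sample and returns the value to display.
    // A sample lower than the highest seen (or NaN) is treated as a
    // temporary artifact of a state change and ignored.
    double accept(double sample) {
        if (sample > highest) {
            highest = sample;
        }
        return highest;
    }
}
```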

Another thing that has been introduced into the logic is an exponential delay 
scheme, which is useful when a job is not progressing, maybe due to bag 
spilling or some udf running. In this case the progress reported stays the 
same for some time. During this time, we can either implement a hard limit, 
where if we don't see progress we don't report it, or we can just keep 
reporting the same progress. There are cons with both approaches. With the 
first, if we don't display anything it is unclear whether the job is stuck or 
processing is happening. With the second, it's surely going to fill the 
screen with something that is not adding any more information. So we 
introduce delays between batches of displays of the same progress, and the 
delays increase exponentially with each batch completed. Currently the batch 
size is half the number of retries, which works out to 6 since the sleep time 
is 5 sec now; this is like trying to have progress reported every 30 sec, but 
delaying future displays of the same progress using an exponential delay 
scheme.
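
One way such a scheme could look is sketched below. It is not the code in the 
patch: the class name, the initial delay of one poll, the doubling factor and 
the cap on the delay are all assumptions made for illustration. The method is 
meant to be called once per poll (every 5 sec in the description above).

```java
// Hypothetical throttle for repeated, unchanged progress reports.
final class RepeatThrottle {
    private static final int MAX_SKIP = 1 << 16; // assumed cap on the delay

    private final int batchSize;
    private int lastPercent = Integer.MIN_VALUE;
    private int shownInBatch = 0;   // identical reports shown in this batch
    private int skipRemaining = 0;  // polls still to be suppressed
    private int nextSkip = 1;       // delay (in polls) after the next batch

    RepeatThrottle(int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.batchSize = batchSize;
    }

    // Returns true if the given percentage should be displayed now.
    boolean shouldDisplay(int percent) {
        if (percent != lastPercent) {
            // Real progress: reset everything and display immediately.
            lastPercent = percent;
            shownInBatch = 0;
            skipRemaining = 0;
            nextSkip = 1;
        } else if (skipRemaining > 0) {
            // Inside a delay between batches: stay quiet.
            skipRemaining--;
            return false;
        }
        shownInBatch++;
        if (shownInBatch >= batchSize) {
            // Batch complete: schedule a delay and double the next one.
            skipRemaining = nextSkip;
            if (nextSkip < MAX_SKIP) {
                nextSkip *= 2;
            }
            shownInBatch = 0;
        }
        return true;
    }
}
```

With a batch size of 6 this shows six identical reports, stays quiet for one 
poll, shows six more, stays quiet for two polls, and so on, until the 
percentage changes and the delays reset.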

> pig produces errors after a job is said to be 100% done
> -------------------------------------------------------
>
>                 Key: PIG-457
>                 URL: https://issues.apache.org/jira/browse/PIG-457
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: types_branch
>            Reporter: Olga Natkovich
>            Assignee: Shravan Matthur Narayanamurthy
>             Fix For: types_branch
>
>         Attachments: 457-2.patch
>
>
> It is possible that we get errors for all tasks even the ones we retried. 
> Need to look at the code that handles detecting end of processing and 
> producing errors.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
