Hello Jon,

If recovery is disabled, there is no clear way to know whether the previous
attempt was in the process of doing a commit and was aborted at that point.
Given that there is no clear way to safely re-start/re-process the work, I
believe the Tez client sets max attempts to 1 if recovery is disabled.
Furthermore, with sessions, DAGs are submitted over RPC and not via the
ApplicationSubmissionContext, so there will be no record of the DAG being
submitted if recovery is disabled. The second attempt in this case will
launch but will not do anything unless the client re-submits the DAG.
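
To make the reasoning above concrete, here is a rough sketch of the decision
logic as I described it. It is illustrative Python, not actual Tez code: the
function and parameter names are hypothetical, and the non-session case is my
assumption based on the DAG travelling with the ApplicationSubmissionContext.

```python
def effective_max_attempts(recovery_enabled: bool, configured_max_attempts: int) -> int:
    """Hypothetical helper: number of AM attempts the client would ask for."""
    if not recovery_enabled:
        # Without recovery data we cannot tell whether the previous attempt
        # died in the middle of a commit, so a retry cannot safely redo work.
        return 1
    # With recovery enabled, honour the configured value (at least one attempt).
    return max(1, configured_max_attempts)


def new_attempt_knows_dag(recovery_enabled: bool, session_mode: bool) -> bool:
    """Hypothetical helper: would a fresh AM attempt know which DAG to run?"""
    if recovery_enabled:
        # Recovery data records the submitted DAG.
        return True
    # Assumption: in non-session mode the DAG arrives with the submission
    # context, so a new attempt still sees it. In session mode the DAG came
    # over RPC and nothing recorded it, so the new attempt sits idle until
    # the client re-submits.
    return not session_mode
```

For the Pig case (session mode, recovery disabled), this gives a single attempt,
and even if a second attempt were allowed it would have no DAG to run.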

I think we should look at backporting all relevant recovery fixes to branch
0.7 if you would like to stabilize on that branch. Are there any known fixes on
master that we should backport?

Jeff has been driving a lot of changes for recovery, with a lot of fixes being
tracked off https://issues.apache.org/jira/browse/TEZ-2581. It would be good if
you could help review and test these patches. I believe Jeff was planning to do
a full rebase after TEZ-2003 got merged in but may not have done that yet.

thanks
— Hitesh 

On Sep 11, 2015, at 9:38 AM, Jonathan Eagles <[email protected]> wrote:

> Running pig on tez (0.7.1 pre-release) with recovery disabled and noticed
> that when the AM fails there are no other attempts. What is it about
> sessions versus non-sessions (what bad thing are we preventing) that keeps
> us from retrying when recovery is disabled?
> 
> (background) Pig only runs sessions even when only executing a single DAG
> and recovery is fragile in 0.7.1 where hangs are likely, fixed only in 0.8.
> I want pig on tez to be as stable as pig on mr, where AM failures are going
> to dissuade users from migrating to pig on tez.
> 
> Jon
