[ 
https://issues.apache.org/jira/browse/GOBBLIN-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apekshit Kumar updated GOBBLIN-1963:
------------------------------------
    Description: 
*Context:*

After a restart, the Gobblin service does not resume jobs that were in the 
RUNNING, LAUNCHED, or SUBMITTED state, so these jobs remain stuck.

*Acceptance Criteria:*
 # After the restart, the system should automatically resume jobs that were 
in the RUNNING/LAUNCHED/SUBMITTED state.

 # The solution should address lingering locks acquired in the previous run.

 # Care should be taken to avoid picking up jobs or cleaning locks that are 
currently being handled by other deployments as part of work stealing.
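
The acceptance criteria above could be sketched roughly as follows. This is only an illustration under stated assumptions, not Gobblin's actual implementation: the class RestartRecoverySketch, the JobRecord type, the lock map, and both method names are hypothetical, and treating the deployment id recorded on a job as its current owner is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class RestartRecoverySketch {

    // Hypothetical status values; the first three are the states named in this issue.
    enum Status { SUBMITTED, LAUNCHED, RUNNING, FAILED, COMPLETED }

    // Hypothetical in-memory view of one job record (not a Gobblin class).
    static final class JobRecord {
        final String queueId;
        final int deploymentId;
        final Status status;

        JobRecord(String queueId, int deploymentId, Status status) {
            this.queueId = queueId;
            this.deploymentId = deploymentId;
            this.status = status;
        }
    }

    // States that should be picked up again after a restart (criterion 1).
    static final Set<Status> RESUMABLE =
        EnumSet.of(Status.SUBMITTED, Status.LAUNCHED, Status.RUNNING);

    // Selects jobs in a resumable state that still belong to this deployment.
    // Jobs owned by another deployment (for example after work stealing) are skipped (criterion 3).
    static List<JobRecord> selectJobsToResume(List<JobRecord> jobs, int ownDeploymentId) {
        List<JobRecord> toResume = new ArrayList<>();
        for (JobRecord job : jobs) {
            if (RESUMABLE.contains(job.status) && job.deploymentId == ownDeploymentId) {
                toResume.add(job);
            }
        }
        return toResume;
    }

    // Removes locks left behind by this deployment's previous run (criterion 2)
    // and returns their queue ids. Locks held by other deployments are left untouched.
    static List<String> releaseOwnLocks(Map<String, Integer> lockOwners, int ownDeploymentId) {
        List<String> released = new ArrayList<>();
        lockOwners.entrySet().removeIf(entry -> {
            boolean ours = entry.getValue() == ownDeploymentId;
            if (ours) {
                released.add(entry.getKey());
            }
            return ours;
        });
        return released;
    }

    public static void main(String[] args) {
        List<JobRecord> jobs = new ArrayList<>();
        jobs.add(new JobRecord("job-a", 230, Status.RUNNING));
        jobs.add(new JobRecord("job-b", 231, Status.RUNNING)); // handled by another deployment
        jobs.add(new JobRecord("job-c", 230, Status.FAILED));  // not in a resumable state

        Map<String, Integer> lockOwners = new HashMap<>();
        lockOwners.put("job-a", 230);
        lockOwners.put("job-b", 231);

        System.out.println("Released locks: " + releaseOwnLocks(lockOwners, 230));
        for (JobRecord job : selectJobsToResume(jobs, 230)) {
            System.out.println("Resuming: " + job.queueId);
        }
    }
}
```

Releasing stale locks before selecting jobs keeps a resumed job from failing on a lock that its own previous run still holds. A real fix would also need to confirm that the recorded owner is current before touching a job or lock, which this sketch does not attempt.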
 
 
 

  was:
*Context:*

The data copy-back job failed because it was unable to acquire the job lock.

Gobblin job creation failed with a *NullPointerException*.

 

*Code ref:*

[https://github.com/apache/gobblin/blob/master/gobblin-runtime/src/main/java/org/apache/gobblin/runtime/AbstractJobLauncher.java#L945]

 

Error Stacktraces
{quote}
 at org.apache.gobblin.cluster.HelixRetriggeringJobCallable.runJobLauncherLoop(HelixRetriggeringJobCallable.java:214)
 at org.apache.gobblin.cluster.HelixRetriggeringJobCallable.call(HelixRetriggeringJobCallable.java:159)
 at org.apache.gobblin.cluster.GobblinHelixJobScheduler.runJob(GobblinHelixJobScheduler.java:251)
 at org.apache.gobblin.cluster.GobblinHelixJobScheduler$NonScheduledJobRunner.run(GobblinHelixJobScheduler.java:450)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
 at org.apache.gobblin.runtime.AbstractJobLauncher.tryLockJob(AbstractJobLauncher.java:887)
 at org.apache.gobblin.runtime.AbstractJobLauncher.<init>(AbstractJobLauncher.java:202)
 at org.apache.gobblin.runtime.AbstractJobLauncher.<init>(AbstractJobLauncher.java:176)
 at org.apache.gobblin.cluster.GobblinHelixJobLauncher.<init>(GobblinHelixJobLauncher.java:138)
 at org.apache.gobblin.cluster.GobblinHelixJobScheduler.buildJobLauncher(GobblinHelixJobScheduler.java:266)
 at org.apache.gobblin.cluster.HelixRetriggeringJobCallable.runJobLauncherLoop(HelixRetriggeringJobCallable.java:201)
 ... 6 more
{quote}
 

*Job run metastore failure details*

mysql> select * from gobblin_job_queue where queue_id like 'DM-JOB-FINANCIALPACKV2PRD.FPV2-SCORES_1632434443250' order by created_date desc;

The single row returned, listed one column per line:

queue_id: DM-JOB-FINANCIALPACKV2PRD.FPV2-SCORES_1632434443250
job_name: DM-JOB-FINANCIALPACKV2PRD.FPV2-SCORES
deployment_id: 230
failure_exception:
org.apache.gobblin.runtime.JobException: Failed to run job DM-JOB-FINANCIALPACKV2PRD.FPV2-SCORES
 at org.apache.gobblin.cluster.HelixRetriggeringJobCallable.runJobLauncherLoop(HelixRetriggeringJobCallable.java:214)
 at org.apache.gobblin.cluster.HelixRetriggeringJobCallable.call(HelixRetriggeringJobCallable.java:159)
 at org.apache.gobblin.cluster.GobblinHelixJobScheduler.runJob(GobblinHelixJobScheduler.java:251)
 at org.apache.gobblin.cluster.GobblinHelixJobScheduler$NonScheduledJobRunner.run(GobblinHelixJobScheduler.java:450)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
 at org.apache.gobblin.runtime.AbstractJobLauncher.tryLockJob(AbstractJobLauncher.java:887)
 at org.apache.gobblin.runtime.AbstractJobLauncher.<init>(AbstractJobLauncher.java:202)
 at org.apache.gobblin.runtime.AbstractJobLauncher.<init>(AbstractJobLauncher.java:176)
 at org.apache.gobblin.cluster.GobblinHelixJobLauncher.<init>(GobblinHelixJobLauncher.java:138)
 at org.apache.gobblin.cluster.GobblinHelixJobScheduler.buildJobLauncher(GobblinHelixJobScheduler.java:266)
 at org.apache.gobblin.cluster.HelixRetriggeringJobCallable.runJobLauncherLoop(HelixRetriggeringJobCallable.java:201)
 ... 6 more
configs: {"dataset":{"batch_id":"20210923150042","name":"financialpackv2prd.fpv2_scores","snapshot_id":"20210923150042"},"gobblin":{"deployment":{"name":"DMP230"}},"namespace":"Chunnel"}
status: FAILED
job_id: NULL
created_date: 2021-09-23 22:00:43
updated_date: 2021-09-23 22:40:07

1 row in set (0.00 sec)


>  Following the restart, jobs that were previously in the "RUNNING," 
> "LAUNCHED," or "SUBMITTED" state failed to resume.
> ----------------------------------------------------------------------------------------------------------------------
>
>                 Key: GOBBLIN-1963
>                 URL: https://issues.apache.org/jira/browse/GOBBLIN-1963
>             Project: Apache Gobblin
>          Issue Type: Bug
>          Components: misc
>    Affects Versions: 0.15.0
>            Reporter: Apekshit Kumar
>            Priority: Minor
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
