[ 
https://issues.apache.org/jira/browse/GOBBLIN-159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16114544#comment-16114544
 ] 

Joel Baranick commented on GOBBLIN-159:
---------------------------------------

Zhixiong Chen, are you actively working on this?

> Gobblin Cluster graceful shutdown of master and workers
> -------------------------------------------------------
>
>                 Key: GOBBLIN-159
>                 URL: https://issues.apache.org/jira/browse/GOBBLIN-159
>             Project: Apache Gobblin
>          Issue Type: Bug
>            Reporter: Abhishek Tiwari
>            Assignee: Zhixiong Chen
>
> Relevant chat from Gitter channel: 
> *Joel Baranick @kadaan Jun 30 10:47*
> Up scaling seems to work great. But down scaling caused problems with the 
> cluster.
> Basically, once the cpu dropped enough to start down scaling, something broke 
> where it stopped processing jobs.
> I’m concerned that the down scaling is not graceful and that the cluster 
> doesn’t respond nicely to workers leaving the cluster in the middle of 
> processing.
> There are a couple problems I see. One is that the workers don't gracefully 
> stop running tasks and allow them to be picked up by other nodes.
> The other is that if task publishing is used, partial data might be published 
> when the node goes away. How does the task get completed without possibly 
> duplicating data?
> *Joel Baranick @kadaan Jun 30 12:07*
> @abti What I'm wondering is how we can shutdown a worker node and have it 
> gracefully stop working.
> *Joel Baranick @kadaan Jun 30 12:52*
> Also, seems like .../taskstates/... as well as the job...job.state file in 
> NFS don't get purged.
> Our NFS is experiencing unbounded growth. Are we missing a setting or service?
> *Abhishek Tiwari @abti Jun 30 15:36*
> I didn’t fully understand the issue. Did you see the workers abruptly cancel 
> the task or did they wait for it to finish before shutting down? If the 
> worker waits around enough for Task to finish, the task level publish should 
> be fine?
> *Joel Baranick @kadaan Jun 30 15:37*
> The workers never shut down.
> *Abhishek Tiwari @abti Jun 30 15:38*
> Could be because they wait for graceful shutdown but do not leave the cluster 
> and are assigned new tasks by Helix?
> *Joel Baranick @kadaan Jun 30 15:39*
> I think one issue is that there is an 
> org.quartz.UnableToInterruptJobException in JobScheduler.shutDown which 
> causes it to never run 
> ExecutorsUtils.shutdownExecutorService(this.jobExecutor, Optional.of(LOG));
> *Abhishek Tiwari @abti Jun 30 15:40*
> also taskstates should get cleaned up, check with @htran1 too .. only wu 
> probably should be left around
> we need to add some cleaning mechanism for that
> we don't recall seeing the lurking state files
> *Joel Baranick @kadaan Jun 30 15:47*
> In my EFS/NFS, I have tons (> 6000) of files remaining under 
> .../_taskstates/... for jobs/tasks that have been completed for ages.
> *Abhishek Tiwari @abti Jun 30 16:29*
> wow, that's unexpected, did master switch while several jobs were going on?
> *Joel Baranick @kadaan Jun 30 17:23*
> There isn't a way for master to switch without jobs running as they don't 
> cancel correctly.
> *Joel Baranick @kadaan Jul 05 14:22*
> @abti I was looking at fixing the cancellation problem.
> From what I can tell, GobblinHelixJob needs to implement InterruptableJob.
> And it needs to call jobLauncher.cancelJob(jobListener); when it is invoked.
> Does this seem right? Anything I'm missing?
> *Abhishek Tiwari @abti Jul 06 00:34*
> looks about right
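
The two fixes discussed above can be sketched together. This is a hypothetical, self-contained illustration, not the actual Gobblin code: the real change would live in GobblinHelixJob (implementing org.quartz.InterruptableJob and delegating to jobLauncher.cancelJob(jobListener)) and in JobScheduler.shutDown (so that a failed interrupt can no longer skip ExecutorsUtils.shutdownExecutorService). All class and method names below are stand-ins.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicBoolean;

// Stand-in for the Gobblin job launcher; the real fix would call
// jobLauncher.cancelJob(jobListener) here.
class StubJobLauncher {
    final AtomicBoolean cancelled = new AtomicBoolean(false);
    void cancelJob() { cancelled.set(true); }
}

// Stand-in for org.quartz.InterruptableJob. Quartz only delivers interrupts
// to jobs implementing this interface; for any other job,
// Scheduler.interrupt(...) throws UnableToInterruptJobException, which is
// what was derailing JobScheduler.shutDown.
interface InterruptableJobLike {
    void interrupt();
}

class GracefulHelixJob implements InterruptableJobLike {
    private final StubJobLauncher jobLauncher;

    GracefulHelixJob(StubJobLauncher launcher) { this.jobLauncher = launcher; }

    // execute() would launch the Gobblin job as before (omitted).

    @Override
    public void interrupt() {
        // Delegate cancellation to the launcher so in-flight tasks stop
        // cleanly instead of the worker hanging at shutdown.
        jobLauncher.cancelJob();
    }
}

class SchedulerShutdownSketch {
    // Even with the interrupt fix, shutdown should be defensive: shut the
    // executor down in a finally block so one uninterruptable job cannot
    // prevent the executor service from ever being stopped.
    static void shutDown(ExecutorService jobExecutor, Runnable interruptAllJobs) {
        try {
            interruptAllJobs.run(); // may throw for uninterruptable jobs
        } finally {
            jobExecutor.shutdownNow();
        }
    }
}
```

With GracefulHelixJob in place, Helix/Quartz can interrupt running jobs during down-scaling, and the finally block guarantees the worker's executor is torn down even if an interrupt still fails.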



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
