[ https://issues.apache.org/jira/browse/GOBBLIN-159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16114544#comment-16114544 ]
Joel Baranick commented on GOBBLIN-159:
---------------------------------------

Zhixiong Chen, are you actively working on this?

> Gobblin Cluster graceful shutdown of master and workers
> -------------------------------------------------------
>
>                 Key: GOBBLIN-159
>                 URL: https://issues.apache.org/jira/browse/GOBBLIN-159
>             Project: Apache Gobblin
>          Issue Type: Bug
>            Reporter: Abhishek Tiwari
>            Assignee: Zhixiong Chen
>
> Relevant chat from Gitter channel:
>
> *Joel Baranick @kadaan Jun 30 10:47*
> Up scaling seems to work great. But down scaling caused problems with the cluster.
> Basically, once the CPU dropped enough to start down scaling, something broke and the cluster stopped processing jobs.
> I'm concerned that the down scaling is not graceful and that the cluster doesn't respond nicely to workers leaving the cluster in the middle of processing.
> There are a couple of problems I see. One is that the workers don't gracefully stop running tasks and allow them to be picked up by other nodes.
> The other is that if task publishing is used, partial data might be published when the node goes away. How does the task get completed without possibly duplicating data?
> *Joel Baranick @kadaan Jun 30 12:07*
> @abti What I'm wondering is how we can shut down a worker node and have it gracefully stop working.
> *Joel Baranick @kadaan Jun 30 12:52*
> Also, it seems like the .../taskstates/... files as well as the job...job.state file in NFS don't get purged.
> Our NFS is experiencing unbounded growth. Are we missing a setting or service?
> *Abhishek Tiwari @abti Jun 30 15:36*
> I didn't fully understand the issue. Did you see the workers abruptly cancel the task, or did they wait for it to finish before shutting down? If the worker waits around long enough for the Task to finish, the task-level publish should be fine?
> *Joel Baranick @kadaan Jun 30 15:37*
> The workers never shut down.
> *Abhishek Tiwari @abti Jun 30 15:38*
> Could it be because they wait for graceful shutdown but do not leave the cluster and are assigned new tasks by Helix?
> *Joel Baranick @kadaan Jun 30 15:39*
> I think one issue is that there is an org.quartz.UnableToInterruptJobException in JobScheduler.shutDown which causes it to never run ExecutorsUtils.shutdownExecutorService(this.jobExecutor, Optional.of(LOG));
> *Abhishek Tiwari @abti Jun 30 15:40*
> Also, taskstates should get cleaned up. Check with @htran1 too .. only the work units should probably be left around.
> We need to add some cleaning mechanism for that.
> We don't recall seeing the lurking state files.
> *Joel Baranick @kadaan Jun 30 15:47*
> In my EFS/NFS, I have tons (> 6000) of files remaining under .../_taskstates/... for jobs/tasks that have been completed for ages.
> *Abhishek Tiwari @abti Jun 30 16:29*
> Wow, that's unexpected. Did the master switch while several jobs were going on?
> *Joel Baranick @kadaan Jun 30 17:23*
> There isn't a way for the master to switch without jobs running, as they don't cancel correctly.
> *Joel Baranick @kadaan Jul 05 14:22*
> @abti I was looking at fixing the cancellation problem.
> From what I can tell, GobblinHelixJob needs to implement InterruptableJob, and it needs to call jobLauncher.cancelJob(jobListener); when it is invoked.
> Does this seem right? Anything I'm missing?
> *Abhishek Tiwari @abti Jul 06 00:34*
> Looks about right.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
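The first bug discussed above is an ordering problem: the exception thrown while stopping the Quartz scheduler aborts `JobScheduler.shutDown` before `ExecutorsUtils.shutdownExecutorService(...)` ever runs. A minimal plain-Java sketch of that shape and the try/finally fix (the `SchedulerStopper` interface and class names here are illustrative stand-ins, not Gobblin's actual code):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch of the shutDown ordering bug: if the scheduler-stop step throws
// (e.g. org.quartz.UnableToInterruptJobException), the executor shutdown
// after it is silently skipped. try/finally guarantees it always runs.
public class ShutdownSketch {

    /** Stand-in for the Quartz scheduler shutdown call that may throw. */
    public interface SchedulerStopper {
        void stop() throws Exception;
    }

    private final ExecutorService jobExecutor = Executors.newFixedThreadPool(2);

    /** Buggy shape: an exception from stop() aborts before executor shutdown. */
    public void shutDownBuggy(SchedulerStopper stopper) throws Exception {
        stopper.stop();          // may throw and abort the method here
        shutdownExecutor();      // never reached on failure -> threads leak
    }

    /** Fixed shape: executor shutdown runs no matter what stop() does. */
    public void shutDownFixed(SchedulerStopper stopper) throws Exception {
        try {
            stopper.stop();
        } finally {
            shutdownExecutor();  // always runs, even if stop() threw
        }
    }

    private void shutdownExecutor() throws InterruptedException {
        jobExecutor.shutdown();
        jobExecutor.awaitTermination(5, TimeUnit.SECONDS);
    }

    public boolean executorIsShutdown() {
        return jobExecutor.isShutdown();
    }
}
```

With the buggy shape, a stopper that throws leaves the executor running; with the fixed shape, the executor is shut down either way.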
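The fix proposed at the end of the thread relies on Quartz's cooperative-cancellation contract: the scheduler only interrupts jobs whose class implements org.quartz.InterruptableJob, whose interrupt() method should forward the request to whatever is actually running (here, jobLauncher.cancelJob(...)). A plain-Java sketch of that pattern, with no Quartz dependency; FakeJobLauncher and all names below are hypothetical stand-ins for the real GobblinHelixJob wiring:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch of the InterruptableJob cancellation pattern: execute() blocks in
// a long-running launch, and interrupt() signals the launcher to cancel so
// that the blocked execute() can finish and the worker can shut down.
public class InterruptableJobSketch {

    /** Stand-in for the launcher whose cancelJob() the interrupt hook calls. */
    static class FakeJobLauncher {
        final AtomicBoolean cancelled = new AtomicBoolean(false);
        final CountDownLatch done = new CountDownLatch(1);

        void launchJob() throws InterruptedException {
            // Simulate a long-running job that only exits once cancelled.
            while (!cancelled.get()) {
                Thread.sleep(10);
            }
            done.countDown();
        }

        void cancelJob() {
            cancelled.set(true);  // cooperative cancellation signal
        }
    }

    private final FakeJobLauncher jobLauncher = new FakeJobLauncher();

    /** Analogous to Job.execute(JobExecutionContext): runs the launch. */
    public void execute() throws InterruptedException {
        jobLauncher.launchJob();
    }

    /** Analogous to InterruptableJob.interrupt(): forward to the launcher. */
    public void interrupt() {
        jobLauncher.cancelJob();
    }

    public boolean finished() {
        return jobLauncher.done.getCount() == 0;
    }
}
```

Without the interrupt() hook, the launch loop never observes the cancellation and the job blocks forever, which matches the "workers never shut down" symptom described above.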