Dear Suresh and Andrei,

Thanks for your help.
I have upgraded CloudStack from 4.9.3 to 4.11.2, but the problem still persists.
I then inspected the database tables and found that these three tables could be the root cause:
- op_ha_work
- op_lock
- vm_work_job
So I deleted all records in those tables and the problem was solved.
The contents of those tables are posted as a comment on the bug report in Jira:
https://issues.apache.org/jira/browse/CLOUDSTACK-10401

Suresh, could you tell me more about the role of those tables in CS?
I think CS has become more sensitive to concurrent jobs; previous versions worked better.

Regards
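
For anyone repeating the cleanup described above, here is a minimal sketch of a safer variant that copies each table before emptying it. It is an illustration, not a tested CloudStack procedure: the `_bak` suffix, the function names and the demo schema are assumptions, and the demo runs against an in-memory SQLite stand-in rather than the real MySQL database. As Andrei describes in the quoted thread, stopping the management server and taking a full SQL backup first would still be advisable.

```python
import sqlite3

# The three tables named in the message above.
TABLES = ("op_ha_work", "op_lock", "vm_work_job")


def backup_and_clear(conn, tables=TABLES, suffix="_bak"):
    """Copy each table to <table><suffix>, then delete its rows.

    Returns a dict mapping table name to the number of rows removed.
    """
    removed = {}
    cur = conn.cursor()
    for table in tables:
        # Table names cannot be bound as query parameters, so they are
        # interpolated; only pass trusted, fixed names such as TABLES.
        cur.execute(f"SELECT COUNT(*) FROM {table}")
        removed[table] = cur.fetchone()[0]
        cur.execute(f"CREATE TABLE {table}{suffix} AS SELECT * FROM {table}")
        cur.execute(f"DELETE FROM {table}")
    conn.commit()
    return removed


def make_demo_db():
    """Hypothetical stand-in: in-memory tables with a single id column."""
    conn = sqlite3.connect(":memory:")
    for table in TABLES:
        conn.execute(f"CREATE TABLE {table} (id INTEGER PRIMARY KEY)")
        conn.executemany(f"INSERT INTO {table} (id) VALUES (?)", [(1,), (2,)])
    conn.commit()
    return conn


if __name__ == "__main__":
    print(backup_and_clear(make_demo_db()))
```

The same statements (`CREATE TABLE ... AS SELECT` and `DELETE FROM`) are also valid MySQL, so the function should work with a DB-API connection to the CloudStack database, but that has not been verified here.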
On Wed, Jan 23, 2019 at 9:43 PM Suresh Kumar Anaparti <sureshkumar.anapa...@gmail.com> wrote:

> Hi Alireza,
>
> The *sync_queue* table is the actual VM sync queue, which holds a queue id
> for each VM (*sync_objtype*: VmWorkJobQueue, *sync_objid*: <VM-Id>), and
> the VM jobs reside in the *sync_queue_item* table against that queue id.
> Only one running job is allowed per VM queue (*queue_size_limit*: 1 in the
> *sync_queue* table). The active/running job has *queue_proc_id*,
> *queue_proc_number* and *queue_proc_time* set in the *sync_queue_item*
> table, and the remaining jobs with that queue id wait for the active job
> to complete. So, to delete pending jobs, the records in the
> *sync_queue_item* table have to be cleared for the respective VMs, not
> those in the *sync_queue* table.
>
> I think that, in your case, the snapshots are taking a long time and the
> other jobs for that VM stay pending for a long time because they are in
> the queue waiting for the snapshot job to complete. What are the config
> values set for "job.cancel.threshold.minutes", "job.expire.minutes" and
> "volume.snapshot.job.cancel.threshold"? Are the jobs cancelled after the
> threshold time?
>
> Thanks,
> Suresh
>
> On Wed, Jan 23, 2019 at 7:14 PM Andrei Mikhailovsky
> <and...@arhont.com.invalid> wrote:
>
> > Hi
> >
> > I've had this issue a few times in 2018 and managed to get it fixed
> > pretty easily, although I had spent a number of hours initially trying
> > to figure out WTF was going on. This issue looks like one of those
> > artefacts that crept in with one of the versions released in 2018 and
> > hasn't been addressed by the dev team.
> >
> > The way I fixed it was similar to what has been recommended earlier.
> > However, the difference was that I am sure I looked at more tables than
> > just the two suggested. Basically, I stopped the management server,
> > created an SQL backup, connected to the SQL db and listed all tables.
> > I grepped for words like job/schedule/queue/sync.
> > After that I went through all the tables and pretty much removed all
> > the past / active / awaiting-execution jobs. I started by looking at
> > the VM-related jobs (for the VM that I had tried to start but wasn't
> > able to). This worked once, but the second time I had to remove a lot
> > more jobs, which related to other VMs. After that I started the
> > management server and all went well from there.
> >
> > What I have also noticed is that my snapshot jobs (I use KVM and Ceph)
> > seem to be blocking jobs on the hypervisor hosts which are running these
> > snapshots. So, if I try to perform a VM-related job on a host server
> > which is currently running a snapshot process, that job will not be
> > executed until the snapshot process is done. I've tested this countless
> > times and it's still the case. Again, this issue appeared in one of the
> > 2018 releases, as I never saw it between 2012 and 2017.
> >
> > Both issues are annoying as hell!
> >
> > Cheers
> >
> > ----- Original Message -----
> > > From: "Alireza Eskandari" <astro.alir...@gmail.com>
> > > To: "dev" <dev@cloudstack.apache.org>
> > > Sent: Wednesday, 23 January, 2019 12:40:48
> > > Subject: Re: Help! Jobs stuck in pending state
> >
> > > I'm following this issue in github:
> > > https://github.com/apache/cloudstack/issues/3104
> > > Please leave your comments
> > > Thanks
> > >
> > > On Wed, Jan 23, 2019 at 12:39 PM Wei ZHOU <ustcweiz...@gmail.com> wrote:
> > >
> > >> Hi Alireza,
> > >>
> > >> Could you try again after restarting the mgt server?
> > >>
> > >> -Wei
> > >>
> > >> Alireza Eskandari <astro.alir...@gmail.com> wrote on Wed, Jan 23, 2019 at 6:22 AM:
> > >>
> > >> > First I deleted the two jobs that existed in the vm_work_job table
> > >> > and their related entries in the sync_queue table, but it didn't help.
> > >> > Then I deleted all the entries in the sync_queue table, again with
> > >> > no success.
> > >> > Any idea?
> > >> >
> > >> > On Wed, Jan 23, 2019 at 1:50 AM Wei ZHOU <ustcweiz...@gmail.com> wrote:
> > >> >
> > >> > > If you know the instance id and mysql password, it should work
> > >> > > after removing some records in mysql.
> > >> > >
> > >> > > ```
> > >> > > set @id=XXXXX;
> > >> > >
> > >> > > delete from vm_work_job where vm_instance_id=@id;
> > >> > > delete from sync_queue where sync_objid=@id;
> > >> > > ```
> > >> > >
> > >> > > Alireza Eskandari <astro.alir...@gmail.com> wrote on Tue, Jan 22, 2019 at 10:59 PM:
> > >> > >
> > >> > > > Hi guys
> > >> > > > I have opened a bug in jira about my problem in CS:
> > >> > > > https://issues.apache.org/jira/browse/CLOUDSTACK-10401
> > >> > > > CloudStack doesn't process jobs! My cloud is totally unusable.
> > >> > > > Thanks in advance for your help.
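
Suresh's reply earlier in this thread says that pending jobs live in sync_queue_item rather than sync_queue, which may be why the two deletes quoted above did not help. A minimal sketch of a per-VM cleanup along those lines follows. The column names `id` and `queue_id`, the function names and the demo data are assumptions for illustration (the thread only says items are stored "against that queue id"), and the demo uses an in-memory SQLite stand-in rather than the real schema.

```python
import sqlite3


def clear_vm_queue_items(conn, vm_id):
    """Delete all sync_queue_item rows for one VM; keep its sync_queue row.

    Returns the number of queue items removed.
    """
    # Subquery resolving the VM id to its queue id(s). The placeholder is
    # "?" for sqlite3; MySQL drivers typically use "%s" instead.
    queue_ids = (
        "SELECT id FROM sync_queue "
        "WHERE sync_objtype = 'VmWorkJobQueue' AND sync_objid = ?"
    )
    cur = conn.cursor()
    cur.execute(
        f"SELECT COUNT(*) FROM sync_queue_item WHERE queue_id IN ({queue_ids})",
        (vm_id,),
    )
    removed = cur.fetchone()[0]
    cur.execute(
        f"DELETE FROM sync_queue_item WHERE queue_id IN ({queue_ids})",
        (vm_id,),
    )
    conn.commit()
    return removed


def make_demo_queue_db():
    """Hypothetical data: VM 42 has three queued items, VM 7 has one."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sync_queue (id INTEGER PRIMARY KEY, sync_objtype TEXT,"
        " sync_objid INTEGER, queue_size_limit INTEGER)"
    )
    conn.execute(
        "CREATE TABLE sync_queue_item (id INTEGER PRIMARY KEY, queue_id INTEGER)"
    )
    conn.executemany(
        "INSERT INTO sync_queue VALUES (?, 'VmWorkJobQueue', ?, 1)",
        [(1, 42), (2, 7)],
    )
    conn.executemany(
        "INSERT INTO sync_queue_item (queue_id) VALUES (?)",
        [(1,), (1,), (1,), (2,)],
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    print(clear_vm_queue_items(make_demo_queue_db(), 42))
```

The sketch deliberately leaves the sync_queue row in place and touches only the items of the one VM, so jobs queued for other VMs are unaffected.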