Re: [rdo-dev] Long queue in RDO SF
On February 13, 2018 12:52 pm, Jakub Ruzicka wrote:
> On Mon, Feb 12, 2018 at 5:08 PM, Tristan Cacqueray wrote:
>> On February 12, 2018 8:59 am, Javier Pena wrote:
>> [snip]
>>> My only doubt is why this does not show up as "NOT_REGISTERED" in
>>> Zuul as it did before.
>>
>> This is because we changed check_job_registration to False in
>> zuul.conf to make Zuul always queue new jobs. We did that because
>> during the previous nodepool outage, Zuul would fail with
>> NOT_REGISTERED when no slaves were online (zuul(v2) only registers
>> jobs for available labels).
>>
>> Perhaps we could add a check for missing jjb jobs in zuul.yaml, or
>> revert check_job_registration back to true.
>
> I was previously confused by NOT_REGISTERED on wrong configuration
> too, but it's still better than having the job stuck. That said, I
> didn't know how to debug this error; someone with experience told me
> how to fix it based on guesswork.
>
> So do I understand it correctly that Zuul has no good way of
> communicating job configuration errors?

This is the design of the zuul(v2) gearman architecture: jobs only get
registered when the associated labels are available. So when nodepool or
jenkins gets restarted, it can take a few minutes before the slaves are
online, and any change queued in that period will get the
NOT_REGISTERED error.

> Isn't this possibly an issue to be solved in upstream Zuul? Something
> like returning CONFIG_ERROR that is clickable and leads to a log of
> config errors.

The only way would be to prevent adding unknown jobs to the pipeline in
the first place. Though this would be a temporary measure until the
migration to zuul(v3), which does exactly that by default.

Regards,
-Tristan

___
dev mailing list
dev@lists.rdoproject.org
http://lists.rdoproject.org/mailman/listinfo/dev
To unsubscribe: dev-unsubscr...@lists.rdoproject.org
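
As a rough illustration of the "check for missing jjb job" idea in the message above, here is a minimal Python sketch. It assumes the Zuul layout and the set of jjb-defined job names have already been loaded into plain Python structures, and it assumes a simplified layout shape (a "projects" list whose entries map pipeline names to job lists). The helper names, the sample project and the "example-*" jobs are hypothetical; nothing here is part of Zuul or Jenkins Job Builder.

```python
def collect_layout_jobs(layout, non_pipeline_keys=("name", "template")):
    """Return every job name referenced by the projects in a parsed layout.

    Assumed shape: each project is a dict mapping pipeline names to a list
    of jobs, where a job is a string or a dict {parent_job: child_jobs}.
    """
    jobs = set()

    def walk(entries):
        if isinstance(entries, str):
            entries = [entries]
        for entry in entries or []:
            if isinstance(entry, str):
                jobs.add(entry)
            elif isinstance(entry, dict):
                # A dict entry means "run the children after the parent".
                for parent, children in entry.items():
                    jobs.add(parent)
                    walk(children)

    for project in layout.get("projects", []):
        for key, value in project.items():
            if key not in non_pipeline_keys:
                walk(value)
    return jobs


def find_missing_jobs(layout, defined_jobs):
    """List jobs the layout references that have no definition."""
    return sorted(collect_layout_jobs(layout) - set(defined_jobs))


# Hypothetical sample data, for illustration only.
layout = {
    "projects": [
        {
            "name": "example/project",
            "openstack-check": [
                "gate-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset035-queens",
                {"example-build": ["example-test"]},
            ],
        }
    ]
}
defined_jobs = {"example-build", "example-test"}

missing = find_missing_jobs(layout, defined_jobs)
print(missing)
```

Run before a configuration change is merged, a check of this kind would flag a job that is referenced but never defined, instead of leaving a change queued on it indefinitely.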
Re: [rdo-dev] Long queue in RDO SF
On Mon, Feb 12, 2018 at 5:08 PM, Tristan Cacqueray wrote:
> On February 12, 2018 8:59 am, Javier Pena wrote:
> [snip]
>> My only doubt is why this does not show up as "NOT_REGISTERED" in
>> Zuul as it did before.
>
> This is because we changed check_job_registration to False in
> zuul.conf to make Zuul always queue new jobs. We did that because
> during the previous nodepool outage, Zuul would fail with
> NOT_REGISTERED when no slaves were online (zuul(v2) only registers
> jobs for available labels).
>
> Perhaps we could add a check for missing jjb jobs in zuul.yaml, or
> revert check_job_registration back to true.

I was previously confused by NOT_REGISTERED on wrong configuration too,
but it's still better than having the job stuck. That said, I didn't
know how to debug this error; someone with experience told me how to fix
it based on guesswork.

So do I understand it correctly that Zuul has no good way of
communicating job configuration errors?

Isn't this possibly an issue to be solved in upstream Zuul? Something
like returning CONFIG_ERROR that is clickable and leads to a log of
config errors.

Cheers,
Jakub
Re: [rdo-dev] Long queue in RDO SF
On 02/11/2018 05:57 PM, David Manchado Cuesta wrote:
> FWIW no alerts during the weekend and I have been able to spawn 10+
> instances without issue.
>
> Cheers
> David Manchado
> Senior Software Engineer - SysOps Team
> Red Hat
> dmanc...@redhat.com

Thanks David!
Just wanted to check :)

Regards,
H.

> On 11 February 2018 at 16:39, Paul Belanger wrote:
>> On Sun, Feb 11, 2018 at 12:44:52PM +0100, Haïkel Guémar wrote:
>>> On 02/11/2018 12:17 AM, Sagi Shnaidman wrote:
>>>> Hi,
>>>>
>>>> I see openstack-check has a 53-hour queue with only 1 job queued:
>>>> https://review.rdoproject.org/zuul/
>>>>
>>>> Seems like a problem with nodepool?
>>>>
>>>> Thanks
>>>>
>>>> --
>>>> Best regards
>>>> Sagi Shnaidman
>>>
>>> Ok, it looks bad enough that a simple nodepool list fails with this
>>> error:
>>> os_client_config.exceptions.OpenStackConfigException: Cloud
>>> rdo-cloud was not found.
>>>
>>> Although RDO Cloud looks up, there might be an outage or incident,
>>> hence copying David Manchado.
>>>
>>> Regards,
>>> H.
>>
>> Okay, I have to run, but this looks like a configuration issue. It is
>> hard to tell without debug logs for nodepool or zuul, but please
>> double check your node is set up properly.
>>
>> I have to run now.
Re: [rdo-dev] Long queue in RDO SF
Hi,

I see no issues in nodepool. Looking at the current Zuul queue, we have
a single job stuck for ~90 hours, queued on
"gate-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset035-queens". When
that happens, it's usually a configuration issue, and this is the case
here: we have no definition for the queens gate job for featureset035 in
https://github.com/rdo-infra/review.rdoproject.org-config/blob/master/jobs/tripleo-upstream.yml#L979-L990.

An easy way to troubleshoot this is:

- If we find one or more jobs queued, first check
  https://review.rdoproject.org/jenkins/ and see if there are nodes
  available to Jenkins.
- If there are, check the list of jobs available to Jenkins. If the
  queued job is not there, we need to double-check the jjb configuration
  and find what is missing.

My only doubt is why this does not show up as "NOT_REGISTERED" in Zuul
as it did before.

I have proposed https://review.rdoproject.org/r/12038 as a fix for this.

Regards,
Javier

----- Original Message -----
> FWIW no alerts during the weekend and I have been able to spawn 10+
> instances without issue.
>
> Cheers
> David Manchado
> Senior Software Engineer - SysOps Team
> Red Hat
> dmanc...@redhat.com
>
> On 11 February 2018 at 16:39, Paul Belanger wrote:
> > On Sun, Feb 11, 2018 at 12:44:52PM +0100, Haïkel Guémar wrote:
> >> On 02/11/2018 12:17 AM, Sagi Shnaidman wrote:
> >> > Hi,
> >> >
> >> > I see openstack-check has a 53-hour queue with only 1 job queued:
> >> > https://review.rdoproject.org/zuul/
> >> >
> >> > Seems like a problem with nodepool?
> >> >
> >> > Thanks
> >> >
> >> > --
> >> > Best regards
> >> > Sagi Shnaidman
> >>
> >> Ok, it looks bad enough that a simple nodepool list fails with this
> >> error:
> >> os_client_config.exceptions.OpenStackConfigException: Cloud
> >> rdo-cloud was not found.
> >>
> >> Although RDO Cloud looks up, there might be an outage or incident,
> >> hence copying David Manchado.
> >>
> >> Regards,
> >> H.
> >
> > Okay, I have to run, but this looks like a configuration issue. It is
> > hard to tell without debug logs for nodepool or zuul, but please
> > double check your node is set up properly.
> >
> > I have to run now.
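
The two troubleshooting checks from the message above can be written down as a small decision function. This is only a sketch: `diagnose_queued_job`, its result constants and the sample node name are hypothetical, and nothing here queries Jenkins or Zuul. The list of online nodes and the list of Jenkins jobs are assumed to have been collected separately, for example by looking at the Jenkins web UI.

```python
NO_NODES = "no-nodes"          # nothing available to Jenkins
JOB_MISSING = "job-missing"    # nodes exist, but Jenkins does not know the job
LOOK_ELSEWHERE = "other"       # both checks pass, cause is something else


def diagnose_queued_job(job_name, online_nodes, jenkins_jobs):
    """Apply the two manual checks, in order, to one stuck job.

    online_nodes: names of nodes currently available to Jenkins.
    jenkins_jobs: names of the jobs Jenkins has defined.
    """
    # Step 1: are there any nodes available to Jenkins at all?
    if not online_nodes:
        return NO_NODES
    # Step 2: is the queued job in the list of jobs Jenkins knows about?
    if job_name not in jenkins_jobs:
        return JOB_MISSING
    return LOOK_ELSEWHERE


# Hypothetical inputs mirroring the situation described in the thread.
stuck = "gate-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset035-queens"
result = diagnose_queued_job(stuck, ["example-node-1"], {"some-other-job"})
print(result)
```

A `JOB_MISSING` result corresponds to the case described here, where the next step is to look for the missing definition in the jjb configuration.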
Re: [rdo-dev] Long queue in RDO SF
FWIW no alerts during the weekend and I have been able to spawn 10+
instances without issue.

Cheers
David Manchado
Senior Software Engineer - SysOps Team
Red Hat
dmanc...@redhat.com

On 11 February 2018 at 16:39, Paul Belanger wrote:
> On Sun, Feb 11, 2018 at 12:44:52PM +0100, Haïkel Guémar wrote:
>> On 02/11/2018 12:17 AM, Sagi Shnaidman wrote:
>> > Hi,
>> >
>> > I see openstack-check has a 53-hour queue with only 1 job queued:
>> > https://review.rdoproject.org/zuul/
>> >
>> > Seems like a problem with nodepool?
>> >
>> > Thanks
>> >
>> > --
>> > Best regards
>> > Sagi Shnaidman
>>
>> Ok, it looks bad enough that a simple nodepool list fails with this
>> error:
>> os_client_config.exceptions.OpenStackConfigException: Cloud
>> rdo-cloud was not found.
>>
>> Although RDO Cloud looks up, there might be an outage or incident,
>> hence copying David Manchado.
>>
>> Regards,
>> H.
>
> Okay, I have to run, but this looks like a configuration issue. It is
> hard to tell without debug logs for nodepool or zuul, but please double
> check your node is set up properly.
>
> I have to run now.
Re: [rdo-dev] Long queue in RDO SF
On 02/11/2018 12:17 AM, Sagi Shnaidman wrote:
> Hi,
>
> I see openstack-check has a 53-hour queue with only 1 job queued:
> https://review.rdoproject.org/zuul/
>
> Seems like a problem with nodepool?
>
> Thanks
>
> --
> Best regards
> Sagi Shnaidman

Ok, it looks bad enough that a simple nodepool list fails with this
error:
os_client_config.exceptions.OpenStackConfigException: Cloud rdo-cloud
was not found.

Although RDO Cloud looks up, there might be an outage or incident, hence
copying David Manchado.

Regards,
H.
[rdo-dev] Long queue in RDO SF
Hi,

I see openstack-check has a 53-hour queue with only 1 job queued:
https://review.rdoproject.org/zuul/

Seems like a problem with nodepool?

Thanks

--
Best regards
Sagi Shnaidman