Re: [rdo-dev] Long queue in RDO SF

2018-02-13 Thread Tristan Cacqueray

On February 13, 2018 12:52 pm, Jakub Ruzicka wrote:
> On Mon, Feb 12, 2018 at 5:08 PM, Tristan Cacqueray wrote:
>> On February 12, 2018 8:59 am, Javier Pena wrote:
>> [snip]
>>
>>> My only doubt is why this does not show up as "NOT_REGISTERED" in Zuul as
>>> it did before.
>>
>> This is because we changed check_job_registration to False in zuul.conf
>> to make Zuul always queue new jobs. We did that because during the previous
>> nodepool outage, Zuul would fail with NOT_REGISTERED when no slaves
>> were online (zuul(v2) only registers jobs for available labels).
>>
>> Perhaps we could add a check for missing jjb jobs in zuul.yaml, or revert
>> check_job_registration back to True.
>
> I was previously confused by NOT_REGISTERED on wrong configuration too,
> but it's still better than having the job stuck. That said, I didn't know
> how to debug this error; someone with experience told me how to fix it based
> on guesswork.
>
> So do I understand it correctly that Zuul has no good way of communicating
> job configuration errors?

This is the design of the zuul(v2) gearman architecture: jobs only get
registered when the associated labels are available. So when nodepool or
jenkins gets restarted, it can take a few minutes before slaves are
online, and any change queued in that period will get the
NOT_REGISTERED error.
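
To make the trade-off concrete, here is a minimal toy model of the behaviour
described above. It is not Zuul's actual code: the function name
launch_result and the set of registered jobs are assumptions made up for
illustration, and only the check_job_registration option and the
NOT_REGISTERED result come from this thread.

```python
from typing import Optional, Set


def launch_result(job: str,
                  registered_jobs: Set[str],
                  check_job_registration: bool) -> Optional[str]:
    """Toy model of the zuul(v2) behaviour discussed here; not Zuul's code.

    Returns "NOT_REGISTERED" when registration is checked and no worker has
    registered the job. Returns None otherwise, meaning the build request
    simply stays queued until some worker picks it up.
    """
    if check_job_registration and job not in registered_jobs:
        return "NOT_REGISTERED"
    return None


# Workers only register jobs for labels that currently have slaves online,
# so right after a nodepool or Jenkins restart this set can be empty.
registered = set()
job = "gate-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset035-queens"

# With the check enabled the change fails fast, even during a short outage.
print(launch_result(job, registered, check_job_registration=True))
# With the check disabled the job waits, which hides a missing definition.
print(launch_result(job, registered, check_job_registration=False))
```

In this model a job that is never defined looks exactly like a job whose
slaves are not online yet, which is why it stays queued with the check
disabled.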


> Isn't this possibly an issue to be solved in upstream Zuul? Something like
> returning CONFIG_ERROR that is clickable and leads to a log of config errors.


The only way would be to prevent adding unknown jobs to the pipeline in
the first place. Though this would be a temporary measure until the
migration to zuul(v3), which does exactly that by default.
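
As a rough sketch of what such a check could look like: the snippet below
walks an already-parsed zuul(v2) layout and reports jobs that are referenced
in a pipeline but absent from a set of defined job names. The function names,
the example project and the short job names are hypothetical, the set of
defined jobs is assumed to be obtained separately from the jjb
configuration, and project templates and regex job matchers are not handled.

```python
from typing import Iterable, List, Set


def collect_jobs(entry) -> Set[str]:
    """Collect job names from a job tree made of strings, lists and dicts."""
    if isinstance(entry, str):
        return {entry}
    names: Set[str] = set()
    if isinstance(entry, dict):
        # A dict maps a parent job to the jobs that run after it.
        for parent, children in entry.items():
            names.add(parent)
            names |= collect_jobs(children)
    elif isinstance(entry, (list, tuple)):
        for item in entry:
            names |= collect_jobs(item)
    return names


def missing_jobs(layout: dict,
                 pipelines: Iterable[str],
                 defined_jobs: Set[str]) -> List[str]:
    """Return referenced job names that have no definition, sorted."""
    pipelines = list(pipelines)
    referenced: Set[str] = set()
    for project in layout.get("projects", []):
        for pipeline in pipelines:
            referenced |= collect_jobs(project.get(pipeline, []))
    return sorted(referenced - defined_jobs)


# Hypothetical, already-parsed layout; only the gate job name is taken
# from this thread.
layout = {
    "projects": [
        {
            "name": "example/project",
            "check": ["job-a", {"job-b": ["job-c"]}],
            "gate": [
                "gate-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset035-queens"
            ],
        }
    ]
}
defined = {"job-a", "job-b", "job-c"}

for name in missing_jobs(layout, ["check", "gate"], defined):
    print("referenced but not defined:", name)
```

Run against the config repository before merging, a check like this would
reject the change instead of leaving a job queued forever.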

Regards,
-Tristan


___
dev mailing list
dev@lists.rdoproject.org
http://lists.rdoproject.org/mailman/listinfo/dev

To unsubscribe: dev-unsubscr...@lists.rdoproject.org


Re: [rdo-dev] Long queue in RDO SF

2018-02-13 Thread Jakub Ruzicka
On Mon, Feb 12, 2018 at 5:08 PM, Tristan Cacqueray 
wrote:

> On February 12, 2018 8:59 am, Javier Pena wrote:
> [snip]
>
>> My only doubt is why this does not show up as "NOT_REGISTERED" in Zuul as
>> it did before.
>>
>
> This is because we changed check_job_registration to False in zuul.conf
> to make Zuul always queue new jobs. We did that because during the previous
> nodepool outage, Zuul would fail with NOT_REGISTERED when no slaves
> were online (zuul(v2) only registers jobs for available labels).
>
> Perhaps we could add a check for missing jjb jobs in zuul.yaml, or revert
> check_job_registration back to True.


I was previously confused by NOT_REGISTERED on wrong configuration too,
but it's still better than having the job stuck. That said, I didn't know
how to debug this error; someone with experience told me how to fix it based
on guesswork.

So do I understand it correctly that Zuul has no good way of communicating
job configuration errors? Isn't this possibly an issue to be solved in
upstream Zuul? Something like returning CONFIG_ERROR that is clickable and
leads to a log of config errors.


Cheers,
Jakub


Re: [rdo-dev] Long queue in RDO SF

2018-02-12 Thread Haïkel Guémar

On 02/11/2018 05:57 PM, David Manchado Cuesta wrote:
> FWIW no alerts during the weekend and I have been able to spawn 10+
> instances without issue.
>
> Cheers
> David Manchado
> Senior Software Engineer - SysOps Team
> Red Hat
> dmanc...@redhat.com

Thanks David!
Just wanted to check :)

Regards,
H.

> On 11 February 2018 at 16:39, Paul Belanger wrote:
>> On Sun, Feb 11, 2018 at 12:44:52PM +0100, Haïkel Guémar wrote:
>>> On 02/11/2018 12:17 AM, Sagi Shnaidman wrote:
>>>> Hi,
>>>>
>>>> I see openstack-check has a 53-hour queue with only 1 job queued:
>>>> https://review.rdoproject.org/zuul/
>>>>
>>>> Seems like a problem with nodepool?
>>>>
>>>> Thanks
>>>>
>>>> --
>>>> Best regards
>>>> Sagi Shnaidman
>>>
>>> Ok, it looks bad enough that a simple nodepool list fails with this error:
>>> os_client_config.exceptions.OpenStackConfigException: Cloud rdo-cloud was
>>> not found.
>>>
>>> Although RDO Cloud looks up, there might be an outage or incident, hence
>>> copying David Manchado.
>>>
>>> Regards,
>>> H.
>>
>> Okay, I have to run, but this looks like a configuration issue. It is hard to
>> tell without debug logs for nodepool or zuul, but please double check that
>> your node is set up properly.
>>
>> I have to run now.



Re: [rdo-dev] Long queue in RDO SF

2018-02-12 Thread Javier Pena
Hi,

I see no issues in nodepool. Looking at the current Zuul queue, we have a 
single job stuck for ~90 hours, queued on 
"gate-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset035-queens".

When that happens, it's usually a configuration issue, and this is the case 
here: we have no definition for the queens gate job for featureset035 in 
https://github.com/rdo-infra/review.rdoproject.org-config/blob/master/jobs/tripleo-upstream.yml#L979-L990.

An easy way to troubleshoot this is:

- If we find one or more jobs queued, first check 
https://review.rdoproject.org/jenkins/ and see if there are nodes available to 
Jenkins.

- If there are, check the list of jobs available to Jenkins. If the queued job 
is not there, we need to double-check the jjb configuration and find what is 
missing.
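
The two checks above can be sketched against the Jenkins JSON remote API
(/computer/api/json for nodes, /api/json for jobs). This is a minimal
sketch, not a supported tool: the helper names are made up, it assumes the
API is readable without authentication, and it assumes the master node is
listed under the display name "master" and should not count as an
available slave.

```python
import json
from typing import List, Set, Tuple
from urllib.request import urlopen

JENKINS_URL = "https://review.rdoproject.org/jenkins"


def fetch_json(url: str) -> dict:
    """Fetch and decode one JSON document from the Jenkins remote API."""
    with urlopen(url, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def online_nodes(computer_payload: dict,
                 ignore: Tuple[str, ...] = ("master",)) -> List[str]:
    """Names of nodes that are not offline, skipping the ignored ones."""
    return [
        node["displayName"]
        for node in computer_payload.get("computer", [])
        if not node.get("offline", True)
        and node.get("displayName") not in ignore
    ]


def job_names(root_payload: dict) -> Set[str]:
    """Names of all jobs that Jenkins currently knows about."""
    return {job["name"] for job in root_payload.get("jobs", [])}


def diagnose(queued_job: str, computer_payload: dict, root_payload: dict) -> str:
    """Apply the two troubleshooting steps to already-fetched payloads."""
    if not online_nodes(computer_payload):
        return "no nodes available to Jenkins: look at nodepool"
    if queued_job not in job_names(root_payload):
        return "job unknown to Jenkins: check the jjb configuration"
    return "nodes and job are both present: look elsewhere"


def diagnose_live(queued_job: str, base_url: str = JENKINS_URL) -> str:
    """Same as diagnose(), but fetches the payloads over the network."""
    computers = fetch_json(base_url + "/computer/api/json")
    root = fetch_json(base_url + "/api/json?tree=jobs[name]")
    return diagnose(queued_job, computers, root)
```

Calling diagnose_live() with the name of the stuck job would perform both
lookups; diagnose() is kept separate so the logic can be exercised on saved
payloads without network access.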

My only doubt is why this does not show up as "NOT_REGISTERED" in Zuul as it 
did before.

I have proposed https://review.rdoproject.org/r/12038 as a fix for this.

Regards,
Javier

- Original Message -
> FWIW no alerts during the weekend and I have been able to spawn 10+
> instances without issue.
> 
> Cheers
> David Manchado
> Senior Software Engineer - SysOps Team
> Red Hat
> dmanc...@redhat.com
> 
> 
> On 11 February 2018 at 16:39, Paul Belanger  wrote:
> > On Sun, Feb 11, 2018 at 12:44:52PM +0100, Haïkel Guémar wrote:
> >> On 02/11/2018 12:17 AM, Sagi Shnaidman wrote:
> >> > Hi,
> >> >
> >> > I see openstack-check has a 53-hour queue with only 1 job queued:
> >> > https://review.rdoproject.org/zuul/
> >> >
> >> > Seems like a problem with nodepool?
> >> >
> >> > Thanks
> >> >
> >> > --
> >> > Best regards
> >> > Sagi Shnaidman
> >> >
> >>
> >> Ok, it looks bad enough that a simple nodepool list fails with this error:
> >> os_client_config.exceptions.OpenStackConfigException: Cloud rdo-cloud was
> >> not found.
> >>
> >> Although RDO Cloud looks up, there might be an outage or incident, hence
> >> copying David Manchado.
> >>
> >> Regards,
> >> H.
> >>
> > Okay, I have to run, but this looks like a configuration issue. It is hard
> > to tell without debug logs for nodepool or zuul, but please double check
> > that your node is set up properly.
> >
> > I have to run now.
> >
> 


Re: [rdo-dev] Long queue in RDO SF

2018-02-11 Thread David Manchado Cuesta
FWIW no alerts during the weekend and I have been able to spawn 10+
instances without issue.

Cheers
David Manchado
Senior Software Engineer - SysOps Team
Red Hat
dmanc...@redhat.com


On 11 February 2018 at 16:39, Paul Belanger  wrote:
> On Sun, Feb 11, 2018 at 12:44:52PM +0100, Haïkel Guémar wrote:
>> On 02/11/2018 12:17 AM, Sagi Shnaidman wrote:
>> > Hi,
>> >
>> > I see openstack-check has a 53-hour queue with only 1 job queued:
>> > https://review.rdoproject.org/zuul/
>> >
>> > Seems like a problem with nodepool?
>> >
>> > Thanks
>> >
>> > --
>> > Best regards
>> > Sagi Shnaidman
>> >
>>
>> Ok, it looks bad enough that a simple nodepool list fails with this error:
>> os_client_config.exceptions.OpenStackConfigException: Cloud rdo-cloud was
>> not found.
>>
>> Although RDO Cloud looks up, there might be an outage or incident, hence
>> copying David Manchado.
>>
>> Regards,
>> H.
>>
> Okay, I have to run, but this looks like a configuration issue. It is hard to
> tell without debug logs for nodepool or zuul, but please double check that
> your node is set up properly.
>
> I have to run now.
>


Re: [rdo-dev] Long queue in RDO SF

2018-02-11 Thread Haïkel Guémar

On 02/11/2018 12:17 AM, Sagi Shnaidman wrote:
> Hi,
>
> I see openstack-check has a 53-hour queue with only 1 job queued:
> https://review.rdoproject.org/zuul/
>
> Seems like a problem with nodepool?
>
> Thanks
>
> --
> Best regards
> Sagi Shnaidman



Ok, it looks bad enough that a simple nodepool list fails with this error:
os_client_config.exceptions.OpenStackConfigException: Cloud rdo-cloud 
was not found.


Although RDO Cloud looks up, there might be an outage or incident, hence 
copying David Manchado.
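
One way to narrow down an error like this is to look at which clouds.yaml
the current user would pick up. The sketch below assumes the documented
os-client-config search locations (current directory, ~/.config/openstack,
/etc/openstack, first file found wins) and uses a crude text match instead of
a real YAML parser, so treat its answer as a hint only. All function names
are made up for illustration.

```python
import os
import re
from typing import Iterable, List, Optional


def candidate_cloud_files(cwd: str, home: str) -> List[str]:
    """Assumed search order for clouds.yaml, most specific first."""
    return [
        os.path.join(cwd, "clouds.yaml"),
        os.path.join(home, ".config", "openstack", "clouds.yaml"),
        os.path.join("/etc", "openstack", "clouds.yaml"),
    ]


def first_existing(paths: Iterable[str]) -> Optional[str]:
    """Return the first path that exists as a file, or None."""
    for path in paths:
        if os.path.isfile(path):
            return path
    return None


def mentions_cloud(text: str, cloud: str) -> bool:
    """Crude check: is there an indented '<cloud>:' key somewhere in text?"""
    pattern = r"^[ \t]+" + re.escape(cloud) + r"[ \t]*:"
    return re.search(pattern, text, flags=re.MULTILINE) is not None


def check_cloud(cloud: str, cwd: str, home: str) -> str:
    """Describe what the lookup for the given cloud name would find."""
    path = first_existing(candidate_cloud_files(cwd, home))
    if path is None:
        return "no clouds.yaml found in the assumed locations"
    with open(path, encoding="utf-8") as handle:
        found = mentions_cloud(handle.read(), cloud)
    return "%s: %s %s" % (path, cloud, "found" if found else "not found")


print(check_cloud("rdo-cloud", os.getcwd(), os.path.expanduser("~")))
```

Since one of the locations depends on the home directory, the result is only
meaningful when run as the same user that runs nodepool.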


Regards,
H.



[rdo-dev] Long queue in RDO SF

2018-02-10 Thread Sagi Shnaidman
Hi,

I see openstack-check has a 53-hour queue with only 1 job queued:
https://review.rdoproject.org/zuul/

Seems like a problem with nodepool?

Thanks

-- 
Best regards
Sagi Shnaidman