Hi Sean,

On 21 July 2014 22:53, Collins, Sean <sean_colli...@cable.comcast.com> wrote:

>   The fact that I tried to reach out to the person who was listed as the
> contact back in November to try and resolve the -1 that this CI system
> gave, and never received a response until the public mailing list thread
> about revoking voting rights for Tail-F, makes me believe that the Tail-F
> CI system is still not ready to have that kind of privilege. Especially if
> the account was idle from around February, until June – that is a huge gap,
> if I understand correctly?

I understand your frustration. It seems like the experience of bringing up
our CI has been miserable for all concerned. I am sad about that. It does
not seem that it should have worked out this way, since everybody concerned
is a competent person and acting in good faith.

I hope we can finally clear this up and then continue with contributing to
OpenStack on good terms with everybody.

Back in November we were feeling eager to be good citizens and we wanted to
be amongst the first to set up a 3rd party CI for Neutron. We were trying to
be proactive: our driver was already in Havana and the deadlines for us to
set up the CI were far in the future. My colleague Tobbe was also planning
to take the lead on development of our OpenStack code from me and we
thought the perfect first step would be to set up our CI system, since that
would get him familiar with the code and since neither of us had prior
experience operating an OpenStack CI.

We read through the 3rd Party CI setup instructions and created a CI. Our
initial setup ran Jenkins and would use a custom script to create a
one-shot VM and inside that it would run the Neutron unit tests together
with a patch that made our driver talk to our real external system. This
got quite good test coverage because the unit tests really exercise the ML2
interface quite well. (Likely we should have used Tempest instead, as
everybody does nowadays, including us, but we didn't know that back then.)

This seemed to work well and so we let it run. Honestly, we did not really
know what would happen with our results after they were posted, and we did
not have a definite goal for what service level we should uphold. That was
surely naive, but I think understandable. We were relatively new and minor
contributors to OpenStack and we were amongst the first wave of Neutron
people to set up a CI. We hadn't yet had the opportunity to learn from the
mistakes of others or see how reviews are used by the upstream people and
systems. We were also perhaps a little too relaxed because our total
contribution was around 150 lines of code that only run when explicitly
enabled, and we had our own test procedure in place separately from
OpenStack CI that we had been using since Havana, so it did not feel like
we had much potential to impact other OpenStack users and developers with
our code.

Anyway. The test runs started to fail unexpectedly, for a boring kind of
reason: OpenStack needed a newer version of a library and our CI script
lacked a "pip install --upgrade" step that would pick it up, so all tests
would fail until manual intervention.

So what happens when the CI falls down and needs help to come back up?
First of all, it creates a big problem for upstream developers and slows
down work on OpenStack (ouch). Second, you poor guys who are having
problems try to contact the person responsible, but all you have is one
work email address and IRC nick. In that case, you guys did not get a
response. I think that was for the very pedestrian reason that my colleague
who was responsible was on vacation and didn't appreciate that an
operational issue with our CI would create an urgent problem for other
people and must be attended to at all times.

This must have been bad for you guys since you were stuck waiting on us and
couldn't fix the problem on your side. I was also contacted by email, as
the previous contact person for that driver, but the message simply asked
me to confirm my colleague's email address and did not tell me that there
was a problem that we had to resolve. So eventually the problem boiled over
and when we started getting publicly flamed on the mailing list then I
finally saw that there was an issue and called up my colleague directly who
*then* jumped into action to sort it out (logging into gerrit and
reversing old negative votes, and so on).

So what do we take away from this first experience? To me it just looks
like processes to fix: people operating 3rd party CIs need to better
understand the required service level, there should be multiple contact
points to deal with mundane stuff like vacations and illness, and
people should operate their CI successfully for a while before voting is
enabled. It sucks that work was interrupted and people got mad, but at the
end of the day this happened with everybody acting in good faith, and it
shows us what kind of problems to prevent in the future.

This is where it became a bit sad on our side. The reaction we got from the
community was that the problem is not with the process but with the people.
That is, that we are lazy, incompetent, don't respect the community, don't
understand open source, and so on. My colleague got a really gut-wrenching
epic reprimand to this effect on IRC, and understandably decided to stop
contributing to OpenStack as a result. So then responsibility for the CI is
transferred back to me.

I decided to change priority: instead of getting CI running *early* I want
to get it set up *reliably* and still within the required timeframes. So I
wait a while to see how other people set up their CIs with the hope of
learning from their experiences and not making new mistakes.

End of Part One.

If you want to hear all of the details of my adventures with devstack-gate,
of how I have been operating and supervising our CI for Juno these past 6
weeks, and of the development work that I am doing in the hope of making CI
more robust both for myself and other members of the community, then I will
be happy to explain about that after I have a chance to catch my breath.

TL;DR: Driver developers are people too!
OpenStack-dev mailing list
