Thank you both for your perspectives. It is understood that we can do a better job of communicating, both during and after outages. Such events are convenient for absolutely no one, and we try very hard to avoid them in the first place. We feel your pain.
I do feel the need to remind folks of the Service Level Agreements for the numerous services provided by Eclipse Foundation IT: https://wiki.eclipse.org/IT_SLA

CBI is a Tier II - Best Effort service, and considering the Foundation's "shut down" status during that week, I think we fared pretty well.

Strategic members of the EF can alert IT staff directly, outside of business hours, using SMS text, to expedite resolution of Tier II & Tier III service outages. If CBI, or other parts of the Eclipse Foundation's Infra, are instrumental to your business, please consider Strategic Membership, as it has many benefits. https://www.eclipse.org/membership/

Denis

On 2021-08-13 9:38 a.m., Christoph Läubrich wrote:
> Thanks Ed for the detailed timeline. I can also confirm that (from a
> simple committer's POV) the outage was not over on Aug 2 (maybe for me
> 'core services' are just different ones than from the infra POV) but
> lasted as far as 4 Aug, when I could continue the work on my issues.
>
> So for me the summary "The outage was extensive, and for core
> services, lasted for approximately 18 hours. Non-core services were
> degraded for an additional 12 hours." does not feel quite right, but
> as said before I can't 'prove' that; it's just that I was actually only
> able to resume my work on Aug 4 (120hrs later!), at least until the
> tycho-ci server was restarted ...
>
> so for me it seems a check "are all build servers running and have
> executors" is missing from the status page.
>
> On 13.08.21 at 15:22, Ed Willink wrote:
>> Hi
>>
>> Thank you all for hitting problems quite quickly once you were
>> engaged. Perhaps this 'bystander's' perspective may help to
>> understand the need to communicate better.
>>
>> I first became aware of the problem after receiving notification a
>> little after 2:42 EDT 1-Aug that a weekly OCL rebuild had failed.
>> Investigation of the log pointed a finger at the GIT repo and
>> eclipsestatus.io indicated that a major outage was in progress with
>> an 'investigating' tweet. Clearly someone was on the case and so the
>> bystander effect took over and I didn't raise any reports or emails
>> to distract.
>>
>> 'investigating' status advanced to 'fix-in-progress' after an hour.
>>
>> But then nothing for a further 5 hours, at which point we got 'it
>> will take 13 hours'. On twitter someone asked when the 13 hours
>> started; one might have hoped that it would be from the
>> 'fix-in-progress' time. This tweet and an 'ETA?' tweet were never
>> answered.
>>
>> 17 hours later we got 'most websites' back, which might be true but
>> with important services down, it was misleading. It took a further
>> perhaps 4 hours for
>> https://download.eclipse.org/tools/orbit/downloads/latest-I to
>> return, and 50 hours before projects-storage.eclipse.org was back and
>> another couple of hours to get
>> /shared/common/apache-ant-latest/bin/ant back.
>>
>> IMHO the outage lasted until at least the restoration of
>> projects-storage.eclipse.org at Aug 4 8:50 and so one of the issues
>> to be addressed by the postmortem must be why the status page still
>> reports no incidents or outage on the whole of the 3rd Aug when, for
>> committers at least, there was no useable service all day.
>>
>> I must thank the team again for their hard work with a very difficult
>> problem, but must also stress that the communication was very poor.
>> So much so that at 3:07 EDT on 4th Aug I sent a private email to Ed
>> Merks speculating that:
>>
>> /The total silence from the team is now way beyond
>> incompetence/discourtesy/embarrassment; there must be another reason./
>>
>> /Paranoia sets in./
>>
>> /Is some government / hostile agency intervening to prevent
>> communication?/
>>
>> /Are the team voluntarily maintaining silence to contain a security
>> issue?/
>>
>> Please ensure that whenever possible the status updates are much more
>> informative.
>>
>> Regards
>>
>> Ed Willink
>>
>> On 09/08/2021 21:45, Denis Roy wrote:
>>> I very much appreciate the sympathy and the support. In the end, the
>>> Infra team can do better than this. We'll lick our wounds and go
>>> back to the drawing board to make sure we don't repeat the same
>>> mistakes twice.
>>>
>>> Postmortem is written, pending review with my team.
>>>
>>> Denis

--
Denis Roy
Director, IT Services | Eclipse Foundation
Eclipse Foundation <http://www.eclipse.org/>: The Community for Open Innovation and Collaboration
Twitter: @droy_eclipse
_______________________________________________ cross-project-issues-dev mailing list cross-project-issues-dev@eclipse.org To unsubscribe from this list, visit https://www.eclipse.org/mailman/listinfo/cross-project-issues-dev