On Friday, September 19th, we had an outage of loop-server.


The ElastiCache database we were using filled up, so it was no
longer possible to write new content to it. As a result, Loop
clients (Firefox and FirefoxOS) weren't working as expected.



We had initially scheduled the deployment of a bigger cluster for
Monday, September 22nd, because we knew we needed it, but the
database filled up faster than we expected last week.



On Friday night, just after the outage, the Ops team cleaned up
the database and managed to free some space, which was enough to
keep production working for the rest of the weekend.



Our database usage shows we're creating about 50MB of data per
day (~3GB for two months), and the previous database was sized
at 1GB. The attached graph shows the current data consumption.



Today, we've deployed a new stack with a larger database
(13.5GB of storage) which should be enough to cope with the
current load for the next few months.
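
As a rough check on these numbers, here is a minimal sketch of the
capacity arithmetic. It assumes growth stays constant at about 50MB
per day, that the database starts empty, and that 1GB = 1000MB; the
function and variable names are only illustrative and are not part
of loop-server.

```python
MB_PER_GB = 1000  # assumption: decimal units


def days_until_full(capacity_gb, growth_mb_per_day, used_gb=0.0):
    """Days before a database of the given size fills up at a constant growth rate."""
    remaining_mb = (capacity_gb - used_gb) * MB_PER_GB
    return remaining_mb / growth_mb_per_day


GROWTH_MB_PER_DAY = 50  # figure quoted in this message

old_days = days_until_full(1.0, GROWTH_MB_PER_DAY)    # previous 1GB database
new_days = days_until_full(13.5, GROWTH_MB_PER_DAY)   # new 13.5GB database
two_month_gb = GROWTH_MB_PER_DAY * 60 / MB_PER_GB     # data created in ~60 days

print(f"1GB database lasts about {old_days:.0f} days")
print(f"13.5GB database lasts about {new_days:.0f} days")
print(f"Two months of data is about {two_month_gb:.0f}GB")
```

Under these assumptions the old database had roughly 20 days of
headroom and the new one roughly 270 days. Real usage will differ
if the growth rate changes or if data is cleaned up along the way.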



In parallel, we are working on being more resilient to database
failures, by:

- using sharding on our Redis instances (so we can scale
better);

- having better reporting about the databases (so we can
anticipate failures sooner);

- making sure devs are part of the escalation path that's being
worked out for the loop server (with TokBox as well).
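
To illustrate the first two items, here is a minimal sketch of
key-based sharding and of a usage check that could feed reporting.
This is not the loop-server implementation: the shard names, the
75% warning threshold and both function names are assumptions made
for the example. The usage figure could come from something like
the `used_memory` value reported by the Redis `INFO` command.

```python
import zlib

# Hypothetical shard names, for illustration only.
SHARDS = ["redis-shard-0", "redis-shard-1", "redis-shard-2"]


def shard_for_key(key, shards=SHARDS):
    """Pick a shard deterministically from a CRC32 checksum of the key."""
    checksum = zlib.crc32(key.encode("utf-8"))
    return shards[checksum % len(shards)]


def usage_alert(used_bytes, max_bytes, warn_ratio=0.75):
    """Return True once memory use reaches the warning threshold."""
    return used_bytes / max_bytes >= warn_ratio


# The same key always lands on the same shard.
print(shard_for_key("call-url:example"))

# 800MB used out of 1GB is above the assumed 75% threshold.
print(usage_alert(800 * 1000**2, 1000**3))
```

With plain modulo sharding like this, changing the number of
shards moves most keys to a different shard; consistent hashing is
a common alternative when shards need to be added over time.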



Lastly, the Firefox client is generating too many call-urls at
the moment, and the client team is working on sending fewer
call-urls to the server [0].



We're sorry about the outage, and many thanks to the Ops team
(Dean/Bob) and QA team (James/Richard) for their help this
weekend.



[0] https://bugzilla.mozilla.org/show_bug.cgi?id=1030961



— Alexis, Lead of the Loop Server

_______________________________________________
dev-media mailing list
[email protected]
https://lists.mozilla.org/listinfo/dev-media
