Hi all, The proposed dates and times for the reboots are:
labsdb1001: Monday Oct 30 2017, 14:30 UTC labsdb1003: Tuesday Nov 07 2017, 14:30 UTC These servers are very old, and may run into hardware issues during the reboots, so we are not defining an outage window. I'll keep the list updated during and after the maintenance. As always, feel free to reach out with any questions or concerns here or on #wikimedia-cloud, or on task here - https://phabricator.wikimedia.org/T168584. Best, On Thu, Oct 19, 2017 at 12:44 AM, Jaime Crespo <[email protected]> wrote: > > Why not provide a fourth host and have those three servers act as slaves > of it? > > While that is a reasonable way to do it- that is exactly how we do it in > production; that is almost impossible to do it for all users. > > First, blind read-write split doesn't work; it needs to be curated for > each application, and we do not have control over when reads and writes > happen; so in the case there is lag, most applications will break. Now I > will explain why always there would be lag- > > Second, it creates a SPOF on replication, for each individual application. > While 90% of the data services users do reasonable queries and data > structure; a minority do extremely fast imports, create tables on MyISAM > format (which corrupts on every crash) or do other things that would make > the replication lag or break. **That means, because a single user does > heavy writes, all users, including of simple production replica reads would > be affected**. Also, being able to create tables and perform writes on the > wikireplicas caused lots of complains from other users as they often locked > the wiki data, causing lag. As Bryan says, we are open to other methods > (e.g. we could pregenerate/import summary tables from production), but > those should be puppetized and automatized and checked for sanity (InnoDB, > no blocking/breakage of replication, etc.). Creating summary tables through > replication means people can share that data instead of it being replicated > multiple times. Those are not theoretical, we had to blacklist some > databases from replication on toolsdb because they were causing similar > replication issues: https://phabricator.wikimedia.org/T127164 > > Another thing we could do is to experiment with FEDERATED/CONECTX tables > (make virtual tables from wikireplicas available from toolsdb), something > that was done in the past, but it has the same problem of userspace joins- > its performance is not perfect. > > It is important to note that to take this decision, we checked existing > users of user tables on wikireplicas and saw that a) the number of users of > local databases was much lower than the total users b) in most cases, they > were just used for summary tables INSERT...SELECT (which by the way, was > the main cause of lag on the replica servers), and is trivial to migrate to > SELECT -> user space -> INSERT c) in those cases where actual JOINs are > used, which we believe to be very few, we are open for solutions like the > ones commented above. > > Moving to tools db will mean more available resources in most cases, and a > better availability for both the replicas and the user dbs, as if the > master hardware crashes, there is a passive replica (unlike on local > tables). > > On Thu, Oct 19, 2017 at 4:01 AM, Platonides <[email protected]> wrote: > >> > It is not a happy thing for us to force anyone to change their >> > software, but as explained in the wiki page [0] we can not find a >> > reliable method to ensure that the same user created tables are >> > available on all three of the new backend servers >> >> Why not provide a fourth host and have those three servers act as slaves >> of it? >> >> Writes go to the first one, but reads and joins can go to the replicas. >> >> I'm afraid that "make the JOINs in user space" may end up on some >> cases with the app fetching all the rows of the tool table or the wiki >> replica in order to perform a JOIN that used to be straightforward. >> And the tools aren't really the place to implement all kinds of >> partial-fetch implementations and query estimates (without EXPLAIN, >> even!). Unless you know of such an efficient user space JOIN >> implementation, perhaps? >> >> Regards >> >> _______________________________________________ >> Wikimedia Cloud Services mailing list >> [email protected] (formerly [email protected]) >> https://lists.wikimedia.org/mailman/listinfo/cloud >> > > > > -- > Jaime Crespo > <http://wikimedia.org> > > _______________________________________________ > Wikimedia Cloud Services mailing list > [email protected] (formerly [email protected]) > https://lists.wikimedia.org/mailman/listinfo/cloud > -- Madhumitha Viswanathan Operations Engineer, Cloud Services
_______________________________________________ Wikimedia Cloud Services mailing list [email protected] (formerly [email protected]) https://lists.wikimedia.org/mailman/listinfo/cloud
