Hi, Joaquin! I was looking at the page the other day. I agree with your summary above, except I would add one more prominent case that somehow did not appear in the random query analysis but that I think is critical:
- Querying CentralAuth and another DB. An example is this query: <https://quarry.wmflabs.org/query/53685> (see the sketch below the quoted message for what such a query can look like).

In fact, I would argue that we should treat the Commons, Wikidata, and CentralAuth databases as "shared" databases that need to be joined with local DBs for many use cases, and treat them as one group. The "En WP + other DB" use cases are a second group in my opinion, and may not call for all En WP data to be available elsewhere; certain tables alone might be sufficient.

Thanks,
Huji

On Tue, Apr 27, 2021 at 2:34 PM Joaquin Oltra Hernandez <[email protected]> wrote:

> Hi!
>
> As I mentioned in the past, I've been cataloging input from developers
> and doing some analysis from the random query logger that Brooke coded in
> the old cluster, to get a better idea of the kinds of cross-DB queries
> being performed.
>
> Wiki Replicas Cross-DB Query Data
> <https://wikitech.wikimedia.org/wiki/News/Wikireplicas_2020_Redesign/Wiki_Replicas_Cross-DB_Query_Data>
>
> The input and the random query logger analysis showcase the tools and
> which wiki DBs are being used together, but they are not perfect, so if
> you know of high-impact tools that are going to suffer from the breaking
> changes, please let me know so that I can include them.
>
> The most prominent cases that this data shows are:
>
> - Querying Commons and another DB (as highlighted in many phab tasks
>   and conversations, no surprise)
> - Querying Wikidata and other DBs (Commons and the Wikipedias, for
>   example)
> - Querying En Wikipedia and other DBs (other Wikipedias)
>
> Particularly noticeable is the appearance of arwiki in the analysis. It
> is hard to know whether there is some bias in the random sampling or
> whether tools for this project really do use the replicas a lot for their
> features. Something to look into.
>
> Detailed analysis of the tables and fields joined or accessed will need
> to be performed on an individual basis, since the SQL queries can be very
> complex and do things in different ways, so automated detection can be
> faulty. For example, on manual inspection I've seen queries using
> subqueries on a different DB with `in` instead of joins.
>
> This is why the report looks at cross-DB queries rather than cross-DB
> JOINs: it captures all cross-DB queries, even if there can be some false
> positives, which can be worked around in the new architecture (the
> tools.guc queries, for example, I think).
>
> I have created T280152, "Mitigate breaking changes from the new Wiki
> Replicas architecture" <https://phabricator.wikimedia.org/T280152>, which
> lists the links and can be used to hang tasks for mitigation work if
> needed; it sits between the OLAP task and the new Wiki Replicas task.
>
> My hope is that this information can be used both by developers and by
> WMF teams to figure out ways to mitigate the breaking changes for the
> tools that editors and other users rely on. If you think the published
> CSV is not approachable enough and certain information on the wiki page
> would be useful to have, please let me know and I'll process the data and
> post the results on the page for ease of reading.
>
> For example, I was wondering if I should publish the "unique
> normalized/stripped" queries per tool, to make it possible to look at the
> SQL on that wiki page itself. Thoughts?
>
> --
> Joaquin Oltra Hernandez
> Developer Advocate - Wikimedia Foundation
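P.S. To make the CentralAuth case above concrete, here is a minimal sketch of the kind of cross-DB query involved. The database, table, and column names (centralauth_p.globaluser, enwiki_p.user, gu_name, gu_locked, user_name, user_editcount) are my recollection of the replica views and should be treated as assumptions; the linked Quarry query is the authoritative example.

    -- Both forms ask roughly: "which enwiki accounts share a name with a
    -- CentralAuth global account that has been locked?"
    -- (illustrative sketch; exact identifiers are assumptions)

    -- Cross-DB JOIN form: requires both databases on the same host.
    SELECT gu_name, user_editcount
    FROM centralauth_p.globaluser
    JOIN enwiki_p.user ON user_name = gu_name
    WHERE gu_locked = 1;

    -- Equivalent IN-subquery form: still a cross-DB query, but with no
    -- JOIN across databases, so scanning for cross-DB JOINs alone
    -- would miss it.
    SELECT user_name, user_editcount
    FROM enwiki_p.user
    WHERE user_name IN (
        SELECT gu_name
        FROM centralauth_p.globaluser
        WHERE gu_locked = 1
    );

The second form is the `in`-subquery pattern Joaquin describes in the quoted message, and it illustrates why the report counts cross-DB queries rather than only cross-DB JOINs.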
_______________________________________________
Wikimedia Cloud Services mailing list
[email protected] (formerly [email protected])
https://lists.wikimedia.org/mailman/listinfo/cloud
