Hi, Joaquin!

I was looking at the page the other day. I agree with your summary above,
except I would add one more prominent case, which somehow did not appear in
the random query analysis but which I think is critical:

   - Querying CentralAuth and another DB

An example is this query <https://quarry.wmflabs.org/query/53685>.
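
To make it concrete, here is a minimal sketch of the shape of such a query
(illustrative only, not the exact query behind that link; it assumes the
usual centralauth_p.localuser and enwiki_p.user replica schemas):

    -- For each local account, when was it attached to its global
    -- (CentralAuth) account? This requires reading two databases at once.
    SELECT user_name, user_registration, lu_attached_timestamp
    FROM enwiki_p.user
    JOIN centralauth_p.localuser
      ON lu_name = user_name
     AND lu_wiki = 'enwiki'
    LIMIT 10;

As I understand it, centralauth lives on a different section than enwiki, so
a join like this cannot run on the new per-section hosts.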

In fact, I would argue that we should treat the Commons, Wikidata, and
CentralAuth databases as "shared" databases that need to be joined with
local DBs for many use cases, and treat them as one group. The "En WP +
other DB" use cases are a second group in my opinion, and may not call for
all En WP data to be available elsewhere; certain tables alone might be
sufficient.

Thanks,
Huji


On Tue, Apr 27, 2021 at 2:34 PM Joaquin Oltra Hernandez <
[email protected]> wrote:

> Hi!
>
> As I mentioned in the past, I've been cataloging input from developers
> and analyzing data from the random query logger that Brooke coded in the
> old cluster, to get a better idea of the kinds of cross-DB queries being
> performed.
>
> Wiki Replicas Cross-DB Query Data
> <https://wikitech.wikimedia.org/wiki/News/Wikireplicas_2020_Redesign/Wiki_Replicas_Cross-DB_Query_Data>
>
>
> The input and the random query logger analysis show which tools are
> involved and which wiki DBs are being used together, but they are not
> perfect, so if you know of high-impact tools that will be hit by the
> breaking changes, please let me know so that I can include them.
>
> The most prominent cases that this data shows are:
>
>    - Querying Commons and another DB (as highlighted in many phab tasks
>    and conversations, no surprise; a sketch of this pattern follows the
>    list)
>    - Querying Wikidata and other DBs (Commons and Wikipedias, for
>    example)
>    - Querying En Wikipedia and other DBs (other Wikipedias)
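>
> To make the Commons case concrete, here is a minimal sketch of the kind
> of query I mean (a hypothetical example, not one taken from the logs; it
> assumes the standard enwiki_p and commonswiki_p replica schemas):
>
>     -- Files used on enwiki whose file page lives on Commons:
>     SELECT DISTINCT il_to
>     FROM enwiki_p.imagelinks
>     JOIN commonswiki_p.page
>       ON page_namespace = 6
>      AND page_title = il_to
>     LIMIT 10;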
>
> Particularly noticeable is the appearance of arwiki in the analysis. It
> is hard to know whether there is some bias in the random sampling or
> whether tools for this project really do use the replicas heavily for
> their features. Something to look into.
>
> Detailed analysis of the tables and fields joined or accessed will need
> to be performed on an individual basis, since the SQL queries can be very
> complex and do things in different ways, so automated detection can be
> faulty. For example, on manual inspection I've seen queries that, instead
> of joins, use subqueries on a different DB with `in`.
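>
> To illustrate with a hypothetical query (not one from the logs): the
> following reads two DBs through an `in` subquery, so a detector that only
> looks for cross-DB JOINs would miss it:
>
>     -- Local file pages that shadow a Commons file of the same name:
>     SELECT page_title
>     FROM enwiki_p.page
>     WHERE page_namespace = 6
>       AND page_title IN (
>         SELECT page_title
>         FROM commonswiki_p.page
>         WHERE page_namespace = 6
>       )
>     LIMIT 10;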
>
> This is the reason the report looks at cross-DB queries rather than only
> cross-DB JOIN queries: it accurately captures all cross-DB access, even
> if there can be some false positives which can be worked around in the
> new architecture (the tools.guc queries, for example, I think).
>
> I have created T280152: Mitigate breaking changes from the new Wiki
> Replicas architecture <https://phabricator.wikimedia.org/T280152>, which
> lists these links and can be used to hang mitigation tasks off of if
> needed; it sits between the OLAP task and the new Wiki Replicas task.
>
> My hope is that this information can be used by both developers and WMF
> teams to figure out ways to mitigate the breaking changes for the tools
> that editors and other users rely on. If you think the published CSV is
> not approachable enough and certain information on the wiki page would be
> useful to have, please let me know and I'll process the data and post the
> results on the page for ease of reading.
>
> For example, I was wondering if I should publish the "unique
> normalized/stripped" queries per tool, so the SQL can be reviewed on that
> wiki page itself. Thoughts?
> --
> Joaquin Oltra Hernandez
> Developer Advocate - Wikimedia Foundation
_______________________________________________
Wikimedia Cloud Services mailing list
[email protected] (formerly [email protected])
https://lists.wikimedia.org/mailman/listinfo/cloud
