Kimmo, while I can't directly answer your question on bottlenecks, I will try to provide a little background on existing issues for those who are new (like myself!).
Here are recent examples of replication issues with the current setup:

https://lists.wikimedia.org/pipermail/cloud-admin/2020-September/000409.html
https://lists.wikimedia.org/pipermail/cloud-admin/2020-October/000413.html

Replication lagged hours behind, and it's not the first time this has occurred. As per https://phabricator.wikimedia.org/T249188#6204681, capacity is full and it's not currently possible to upgrade as-is, despite the fact that the Wiki Replicas are affected by bugs in the current version. In addition, with the current setup any error recovery can take many days. See https://lists.wikimedia.org/pipermail/cloud-admin/2020-March/000387.html for further background on historical issues. If you'd rather see it in graphical form, you can look at the metrics directly: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=labsdb1011&var-port=9104&from=now-30d&to=now

I hope this helps!

On Fri, Nov 13, 2020 at 5:39 AM Kimmo Virtanen <[email protected]> wrote:

> As a follow-up comment.
>
> If I understand correctly, the main problems are a) databases are growing too big to be stored in single instances, and b) query complexity is growing.
>
> a) The growth of the data is not going away, as the major drivers for the growth are automated edits from Wikidata and Structured Data on Commons. They are generating new data with increasing speed, faster than humans ever could. So the longer-term answer is to store the data in separate instances and use something like federated queries. This is how access to the commonswiki replica was originally done when Toolserver moved to Tool Labs in 2014.[1] Another long-term solution to make databases smaller is to replicate only the current state of wikidatawiki/commonswiki and leave, for example, the revision history out.
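(An aside for readers new to the topic: the "federated queries" / application-level approach Kimmo describes can be sketched roughly as below. This is a minimal illustration only — the table layouts, data, and use of SQLite are assumptions standing in for the real MariaDB replica hosts, which a tool would reach with separate per-section connections.)

```python
import sqlite3

# Stand-ins for two separate replica hosts; on the real Wiki Replicas these
# would be two MariaDB connections (e.g. via a MySQL client library), one
# per database/section, since no cross-database JOIN is possible.
commons = sqlite3.connect(":memory:")
wikidata = sqlite3.connect(":memory:")

# Hypothetical, simplified schemas for demonstration.
commons.executescript("""
    CREATE TABLE page (page_title TEXT, wikidata_item TEXT);
    INSERT INTO page VALUES ('File:Cat.jpg', 'Q146'), ('File:Dog.jpg', 'Q144');
""")
wikidata.executescript("""
    CREATE TABLE item (item_id TEXT, label TEXT);
    INSERT INTO item VALUES ('Q146', 'house cat'), ('Q144', 'dog');
""")

# Step 1: query each database separately (one query per host).
pages = commons.execute("SELECT page_title, wikidata_item FROM page").fetchall()
labels = dict(wikidata.execute("SELECT item_id, label FROM item").fetchall())

# Step 2: perform the join in the application layer instead of in SQL.
joined = [(title, labels.get(item)) for title, item in pages]
print(joined)
```

The trade-off the thread debates is visible even in this toy version: what was one `JOIN` clause becomes two round trips plus client-side bookkeeping, which gets expensive once the intermediate result sets are large.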
> b) A major factor in query complexity, which affects query execution times, is AFAIK the actor migration and the data sanitization, which executes queries through multiple views.[2,3] I have no idea how bad the problem currently is, but one could think that replication could be implemented with lighter sanitization by leaving some of the problematic data out of replication altogether.
>
> Anyway, my question is: are there more detailed plans for the *Wiki Replicas 2020 Redesign* than what is on the wiki page[4] or the tickets linked from it? I guess there are, if the plan is to buy new hardware in October and we are now in the implementation phase? Also, is there information on the actual bottlenecks at the table level? I.e., which tables (in which databases) are too big, hard to keep up with in replication, and slow in terms of query time?
>
> [1] https://www.mediawiki.org/wiki/Wikimedia_Labs/Tool_Labs/Migration_of_Toolserver_tools#Will_the_commons_database_be_replicated_to_all_clusters,_like_it_is_on_the_Toolserver?
> [2] https://wikitech.wikimedia.org/wiki/News/Actor_storage_changes_on_the_Wiki_Replicas
> [3] https://phabricator.wikimedia.org/T215445
> [4] https://wikitech.wikimedia.org/wiki/News/Wiki_Replicas_2020_Redesign
>
> Br,
> -- Kimmo Virtanen, Zache
>
> On Fri, Nov 13, 2020 at 8:51 AM Kimmo Virtanen <[email protected]> wrote:
>
>> > Maarten: Having 6 servers with each one having a slice + s4 (Commons) + s8 (Wikidata) might be a good compromise.
>> > Martin: Another idea is to have the database structured as planned, but add a server with *all* databases that would be slower/less stable, but will provide a solution for those who really need cross database joins
>>
>> From the point of view of a person who uses cross-database joins in both tools and analysis queries, I would say that both ideas would be suitable. I think that 90% of my crosswiki queries are written against *wiki + wikidata/commons.
>> However, I would not say that it is only for those who really need it; rather, cross-database joins are an awesome feature for everybody, and it would be a loss if they were gone.
>>
>> In older times we also had the ability to do joins between user databases and replica databases, which was removed in 2017, if I googled correctly.[1] My guess is that one reason for the increasing query complexity is that there is no possibility of creating temp tables or joining against preselected data, so everything is done in single queries. In any case, if the solution is what Martin suggests, moving cross-joinable databases to a single server, and the original problem was that it was hard to keep multiple servers in sync, then we could reintroduce user database joins as well.
>>
>> [1] https://phabricator.wikimedia.org/phame/post/view/70/new_wiki_replica_servers_ready_for_use/
>>
>> Br,
>> -- Kimmo Virtanen, Zache
>>
>> On Fri, Nov 13, 2020 at 2:17 AM Martin Urbanec <[email protected]> wrote:
>>
>>> +1 to Maarten
>>>
>>> Another idea is to have the database structured as planned, but add a server with *all* databases that would be slower/less stable, but would provide a solution for those who really need cross-database joins.
>>>
>>> Martin
>>>
>>> On Fri, Nov 13, 2020 at 0:31, Maarten Dammers <[email protected]> wrote:
>>>
>>>> I recall some point in time (Toolserver maybe?) when all the slices (overview at https://tools-info.toolforge.org/?listmetap ) were on different servers, but the Commons slice (s4) was on every server. At some point new fancy database servers were introduced with all the slices on all servers. Having 6 servers with each one having a slice + s4 (Commons) + s8 (Wikidata) might be a good compromise.
>>>>
>>>> On 12-11-2020 00:58, John wrote:
>>>>
>>>> I'll throw my hat in this too. Moving it to the application layer will make a number of queries just not feasible any longer.
>>>> It might make sense from the administration side, but from the user perspective it breaks one of the biggest features that Toolforge has.
>>>>
>>>> On Wed, Nov 11, 2020 at 6:40 PM Martin Urbanec <[email protected]> wrote:
>>>>
>>>>> MusikAnimal is right; however, Wikidata and Commons either have a sui generis slice, or they share it with a few very large wikis. Tools that do any kind of crosswiki analysis would instantly break, as most of them utilise joining by Wikidata items at the very least.
>>>>>
>>>>> I second Maarten here. This would mean that a lot of things that currently require a (relatively simple) SQL query would need a full script, which would do the join at the application level.
>>>>>
>>>>> I fully understand the reasoning, but there needs to be some replacement. Intentionally introducing breaking changes while providing no "new standard" is a bad pattern in a community environment.
>>>>>
>>>>> Martin
>>>>>
>>>>> On Wed, Nov 11, 2020, 10:31 PM MusikAnimal <[email protected]> wrote:
>>>>>
>>>>>> Technically, cross-wiki joins aren't completely disallowed; you just have to make sure each of the db names is on the same slice/section, right?
>>>>>>
>>>>>> ~ MA
>>>>>>
>>>>>> On Wed, Nov 11, 2020 at 4:11 PM Maarten Dammers <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Joaquin,
>>>>>>>
>>>>>>> On 10-11-2020 21:26, Joaquin Oltra Hernandez wrote:
>>>>>>>
>>>>>>> TLDR: Wiki Replicas' architecture is being redesigned for stability and performance. Cross database JOINs will not be available and a host connection will only allow querying its associated DB. See [1] <https://wikitech.wikimedia.org/wiki/News/Wiki_Replicas_2020_Redesign> for more details.
>>>>>>>
>>>>>>> If you only think of Wikipedia, probably not a lot will break, but if you take into account Commons and Wikidata, a lot will break.
>>>>>>> A quick grep in my folder with Commons queries returns 123 lines with cross-database joins. So yes, stuff will break and tools will be abandoned. This follows the practice that seems to have become standard for the WMF these days: decisions are made by a small group within the WMF without any community involvement. Only after the decision has been made is it announced.
>>>>>>>
>>>>>>> Unhappy and disappointed,
>>>>>>>
>>>>>>> Maarten

--
*Nicholas Skaggs*
Engineering Manager, Cloud Services
Wikimedia Foundation <https://wikimediafoundation.org/>
_______________________________________________
Wikimedia Cloud Services mailing list
[email protected] (formerly [email protected])
https://lists.wikimedia.org/mailman/listinfo/cloud
