Yes. If you view the source of any of those pages, you will find a hidden
HTML comment (the <!-- --> kind) that contains the SQL code for that report.

Here is one example: [1]

  [1]
https://fa.wikipedia.org/w/index.php?title=%D9%88%DB%8C%DA%A9%DB%8C%E2%80%8C%D9%BE%D8%AF%DB%8C%D8%A7:%DA%AF%D8%B2%D8%A7%D8%B1%D8%B4_%D8%AF%DB%8C%D8%AA%D8%A7%D8%A8%DB%8C%D8%B3/%D8%AD%D8%B3%D8%A7%D8%A8%E2%80%8C%D9%87%D8%A7%DB%8C_%D8%A7%D8%B2_%D8%A2%D8%A8%E2%80%8C%D9%86%D9%85%DA%A9_%D8%AF%D8%B1%D8%A2%D9%85%D8%AF%D9%87&action=edit
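
If you want to pull those queries out programmatically, here is a minimal
sketch. The page_source string below is a made-up stand-in; in practice you
would fetch the page's wikitext first (e.g. with action=raw):

```python
import re

# Stand-in for the wikitext of a report page; the real thing would be
# fetched from the wiki (e.g. via index.php?...&action=raw).
page_source = """
Report table goes here...
<!--
SELECT page_title, COUNT(*) AS edits
FROM revision JOIN page ON rev_page = page_id
GROUP BY page_title;
-->
"""

def extract_sql(wikitext):
    """Return the contents of the first HTML comment, stripped, or None."""
    match = re.search(r"<!--(.*?)-->", wikitext, re.DOTALL)
    return match.group(1).strip() if match else None

print(extract_sql(page_source))
```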

On Mon, Apr 13, 2020 at 7:19 PM MusikAnimal <[email protected]> wrote:

> Is the source code public? Maybe the queries could be improved. I ran into
> many such issues too after the actor migration, but after taking advantage
> of specialized views[0] and join decomposition (get just the actor IDs,
> i.e. rev_actor, then the actor_names in a separate query), my tools are
> seemingly as fast as they were before.
>
> ~ MA
>
> [0]
> https://wikitech.wikimedia.org/wiki/News/Actor_storage_changes_on_the_Wiki_Replicas#Advanced_use_cases_of_specialized_views
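
For what it's worth, the join decomposition you describe can be sketched like
this; the in-memory tables here are toy stand-ins for the replicas' revision
and actor tables:

```python
import sqlite3

# Toy stand-ins for the wiki replica's revision and actor tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE revision (rev_id INTEGER, rev_actor INTEGER);
CREATE TABLE actor (actor_id INTEGER, actor_name TEXT);
INSERT INTO revision VALUES (1, 10), (2, 11), (3, 10);
INSERT INTO actor VALUES (10, 'Alice'), (11, 'Bob');
""")

# Query 1: get just the actor IDs, without joining against actor.
actor_ids = {row[0] for row in
             conn.execute("SELECT DISTINCT rev_actor FROM revision")}

# Query 2: resolve the names in a separate, small lookup.
placeholders = ",".join("?" * len(actor_ids))
names = dict(conn.execute(
    "SELECT actor_id, actor_name FROM actor "
    "WHERE actor_id IN (%s)" % placeholders,
    sorted(actor_ids)))

print(names)
```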
>
> On Mon, Apr 13, 2020 at 5:03 PM Huji Lee <[email protected]> wrote:
>
>> I understand. However, I think that the use case we are looking at is
>> relatively unique. I also think that the indexes we need may not be desirable
>> for all the Wiki Replicas (they would often be multi-column indexes geared
>> towards a specific set of queries) and I honestly don't want to go through
>> the several weeks (months?) of discussion to justify them.
>>
>> Note that if we open the can of "more indexes on Wiki Replicas" worms,
>> this would all of a sudden become an all-wiki discussion. I'm not sure if
>> there are more than a handful of wikis that do this level of page-level and
>> user-level analytics as fawiki does, which means for most wikis (and for
>> most Wiki Replica databases) those additional indexes may not even be
>> justified.
>>
>> Even if we were to generalize parts of this approach and bring it to Wiki
>> Replicas, I would still argue that doing it to a smaller extent (one wiki
>> DB for now) would be a reasonable starting point, no?
>>
>> On Mon, Apr 13, 2020 at 4:42 PM Bryan Davis <[email protected]> wrote:
>>
>>> On Sun, Apr 12, 2020 at 7:48 AM Huji Lee <[email protected]> wrote:
>>> >
>>> > One possible solution is to create a script which is scheduled to run
>>> once a month; the script would download the latest dump of the wiki
>>> database,[3] load it into MySQL/MariaDB, create some additional indexes
>>> that would make our desired queries run faster, and generate the reports
>>> using this database. A separate script can then purge the data a few days
>>> later.
>>>
>>> If I am understanding your proposal here, I think the main difference
>>> from the current Wiki Replicas would be "create some additional
>>> indexes that would make our desired queries run faster". We do have
>>> some indexes and views in the Wiki Replicas which are specifically
>>> designed to make common things faster today. If possible, adding to
>>> these rather than building a one-off process of moving lots of data
>>> around for your tool would be nice.
>>>
>>> I say this not because what you are proposing is a ridiculous
>>> solution, but because it is a unique solution for your current problem
>>> that will not help others who are having similar problems. Having 1
>>> tool use ToolsDB or a custom Cloud VPS project like this is possible,
>>> but having 100 tools try to follow that pattern themselves is not.
>>>
>>> > Out of an abundance of caution, I thought I should ask for permission
>>> now, rather than forgiveness later. Do we have a process for getting
>>> approval for projects that require gigabytes of storage and hours of
>>> computation, or is what I proposed not even remotely considered a "large"
>>> project, meaning I am being overly cautious?
>>>
>>> <https://phabricator.wikimedia.org/project/view/2875/>
>>>
>>> Bryan
>>> --
>>> Bryan Davis              Technical Engagement      Wikimedia Foundation
>>> Principal Software Engineer                               Boise, ID USA
>>> [[m:User:BDavis_(WMF)]]                                      irc: bd808
>>>
>>> _______________________________________________
>>> Wikimedia Cloud Services mailing list
>>> [email protected] (formerly [email protected])
>>> https://lists.wikimedia.org/mailman/listinfo/cloud
>>
>