The only reason I didn't break this down by namespace was because they
queries would have taken an order of magnitude longer to join the revision
and page tables.  The query I used didn't even need to read the revision or
archive tables.  It only read an index on those tables.  That made it go
pretty fast.  :)  I'd be interested in taking another pass if you guys
don't mind dealing with a heavier server load.

On Thu, Jun 12, 2014 at 7:05 PM, Dario Taraborelli <
[email protected]> wrote:

> Aaron – this is fantastic.
> Two quick questions:
>
> - was the decision not to break down the data by namespace (matching Erik
> Zachte’s master editor data dump) intentional?
> - are we expecting to refresh the archived revision count field every
> month?
>
> Dario
>
>
> On Jun 12, 2014, at 2:33 PM, Aaron Halfaker <[email protected]>
> wrote:
>
> +1
>
> For example, the last time I sent a similar email to the list, it was for
> the wiki_info table.  One of the tasks I have is to break the code for
> generating that table out of the analysis project it lives in and make it a
> separate repo so that Oliver can send pull requests to fix issues and/or
> maintain his own managed table.
>
> It would be great to work towards an architecture that allows us to keep
> these tables up-to-date without user-based cron jobs.
>
> -Aaron
>
> On Thu, Jun 12, 2014 at 4:24 PM, Dan Andreescu <[email protected]>
> wrote:
>
>> This is great.  I'd like to go on record saying that this is leaning
>> towards a data warehouse kind of approach - basically pre-aggregating
>> useful datasets.  So we might want to do this in a more organized way down
>> the line.
>>
>>
>> On Thu, Jun 12, 2014 at 2:57 PM, Oliver Keyes <[email protected]>
>> wrote:
>>
>>> This is fricking awesome!
>>>
>>>
>>>  On 12 June 2014 10:58, Aaron Halfaker <[email protected]> wrote:
>>>
>>>> I created a new table on analytics-store.eqiad.wmnet.  It contains the
>>>> monthly edit counts for all wikis.  See a brief overview below.
>>>>
>>>> Note that the "revisions" column contains a count of all revisions --
>>>> archived or not.  The "archived" column contains a count of archived
>>>> revisions.   So revisions - archived == non-archived revisions.
>>>>
>>>> analytics-store.eqiad.wmnet [staging]> explain editor_month;
>>>> +-------------------+----------------+------+-----+---------+-------+
>>>> | Field             | Type           | Null | Key | Default | Extra |
>>>> +-------------------+----------------+------+-----+---------+-------+
>>>> | wiki              | varbinary(50)  | NO   | PRI |         |       |
>>>> | month             | varbinary(7)   | NO   | PRI |         |       |
>>>> | user_id           | int(11)        | NO   | PRI | 0       |       |
>>>> | user_name         | varbinary(191) | YES  |     | NULL    |       |
>>>> | user_registration | varbinary(14)  | YES  |     | NULL    |       |
>>>> | archived          | int(11)        | YES  |     | NULL    |       |
>>>> | revisions         | int(11)        | YES  |     | NULL    |       |
>>>> +-------------------+----------------+------+-----+---------+-------+
>>>> 7 rows in set (0.01 sec)
>>>>
>>>> analytics-store.eqiad.wmnet [staging]> select * from editor_month limit
>>>> 3;
>>>>
>>>> +--------+---------+---------+------------+-------------------+----------+-----------+
>>>> | wiki   | month   | user_id | user_name  | user_registration |
>>>> archived | revisions |
>>>>
>>>> +--------+---------+---------+------------+-------------------+----------+-----------+
>>>> | enwiki | 2001-01 |      34 | WojPob     | 20010129110725    |
>>>>  0 |        13 |
>>>> | enwiki | 2001-01 |      99 | RoseParks  | 20010121021221    |
>>>>  0 |         7 |
>>>> | enwiki | 2001-01 |     479 | JimboWales | 20010123223416    |
>>>>  0 |        13 |
>>>>
>>>> +--------+---------+---------+------------+-------------------+----------+-----------+
>>>> 3 rows in set (0.03 sec)
>>>>
>>>> Feedback is welcome.   One of the next things, I'd like to do is remove
>>>> the "-" from the month column as it ruins comparison with MW timestamps.
>>>>
>>>> -Aaron
>>>>
>>>> _______________________________________________
>>>> wmfresearch mailing list
>>>> [email protected]
>>>> https://lists.wikimedia.org/mailman/listinfo/wmfresearch
>>>>
>>>>
>>>
>>>
>>> --
>>> Oliver Keyes
>>> Research Analyst
>>> Wikimedia Foundation
>>>
>>> _______________________________________________
>>> Analytics mailing list
>>> [email protected]
>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>
>>>
>>
> _______________________________________________
> Analytics mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>
>
> _______________________________________________
> Analytics mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>
_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics

Reply via email to