On 23 March 2018 at 07:02, Ahmed Fasih <[email protected]> wrote:

> Neil, thank you so much for your insightful comments!
>

No problem. It's always a good feeling when you know the answer to someone else's question :)

> I was able to use Quarry to get the number of edits on English
> Wikipedia yesterday, so I can indeed get recent data from it: hooray!!!
>
> I also used it to cross-check against the REST API for February 28th:
>
> https://quarry.wmflabs.org/query/25783
>
> and I see that Quarry reports 168668 while the REST API reports 169754
> edits for the same period (less than 1% error). I'll do some digging
> to see if the difference is from the denormalization you mentioned, or
> other reasons why they disagree.
>

The first thing to consider is that when a Wikipedia page is deleted, all the corresponding rows from the revision table are moved to a separate archive table <https://www.mediawiki.org/wiki/Manual:Archive_table> (probably for reasons that made much more sense years ago). However, in the Data Lake and therefore the REST API, there's no such separation.

This query is one way to get a combined count: https://quarry.wmflabs.org/query/25794

However, combining the two tables yields 171 346 edits, which makes the Data Lake count about 1% *lower* than the application database count. At the moment, I can't think of a good reason for that, but I'm sure others on this list know.

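For anyone who wants to check the percentages, here is a minimal Python sketch of the arithmetic. The three counts are the ones quoted in this thread; the helper name `percent_lower` and the variable names are made up for illustration, and the script does not reproduce either Quarry query.

```python
def percent_lower(smaller: int, larger: int) -> float:
    """Return how far `smaller` falls below `larger`, as a percentage of `larger`."""
    return (larger - smaller) / larger * 100


# Edit counts for English Wikipedia on February 28th, as quoted in this thread.
quarry_revision_only = 168668  # Quarry, revision table only
rest_api = 169754              # REST API (backed by the Data Lake)
quarry_combined = 171346       # Quarry, revision + archive tables combined

# Before adding the archive table, Quarry is below the REST API.
gap_before = percent_lower(quarry_revision_only, rest_api)

# After adding the archive table, the REST API is below Quarry.
gap_after = percent_lower(rest_api, quarry_combined)

print(f"revision only vs REST API: {gap_before:.2f}% lower")
print(f"REST API vs revision + archive: {gap_after:.2f}% lower")
```

So the sign of the discrepancy flips once the archive rows are included: the revision-only count is about 0.64% below the REST API, while the REST API is about 0.93% below the combined count.
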
_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
