Neil, thank you so much for your insightful comments! I was able to use Quarry to get yesterday's edit count for English Wikipedia, so I can indeed get recent data from it. Hooray!!!
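For concreteness, here's roughly how I'm pulling the same daily number
from both places. The SQL in the comment is just my guess at an
equivalent count over the replica's revision table (not the exact saved
Quarry query), and I'm assuming the AQS edits/aggregate endpoint with
daily counts nested under items[0]["results"] in the response:

    # Rough sketch of the cross-check. The Quarry side is something like
    # this SQL against the enwiki_p replica (my guess, not the exact
    # saved query):
    #
    #   SELECT COUNT(*) AS edits
    #   FROM revision
    #   WHERE rev_timestamp BETWEEN '20180228000000' AND '20180228235959';
    #
    import requests

    # AQS edits/aggregate endpoint; assuming start/end dates are
    # inclusive and the daily counts come back under items[0]["results"].
    URL = ("https://wikimedia.org/api/rest_v1/metrics/edits/aggregate/"
           "en.wikipedia.org/all-editor-types/all-page-types/daily/"
           "20180228/20180228")

    resp = requests.get(URL, headers={"User-Agent": "wikiatrisk cross-check (example)"})
    resp.raise_for_status()
    rest_count = resp.json()["items"][0]["results"][0]["edits"]

    # Count reported by the Quarry query linked below.
    quarry_count = 168668

    print(f"Quarry: {quarry_count}  REST: {rest_count}  "
          f"relative difference: {abs(rest_count - quarry_count) / rest_count:.2%}")

That's where the numbers in the next paragraph come from.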
The cross-check against the REST API for February 28th
(https://quarry.wmflabs.org/query/25783) shows Quarry reporting 168668
edits while the REST API reports 169754 for the same period, a
difference of less than 1%. I'll do some digging to see whether the
discrepancy comes from the denormalization you mentioned or from
something else.

Maybe one more question:

> the data requires some complex reconstruction and denormalization that takes
> several days to a week. This mostly affects the historical data, but the
> reconstruction currently has to be done for all history at once because
> historical data sometimes changes long after the fact in the MediaWiki
> databases. So the entire dataset is regenerated every month, which would be
> impossible to do daily.

A Wikipedian (hi Yurik!) guessed that this full scan is needed because
Wikipedia admins have the authority to alter an article's history (e.g.,
if a rogue editor posted copyrighted material that should never be
visible in the changelog). If articles' changelogs were append-only, the
reconstruction could pick up where it left off rather than starting from
scratch; since they aren't, a full scan is needed. Is this a good
understanding?

Again, many thanks!

Ahmed

PS. In case of an overabundance of curiosity, my little project is at
https://github.com/fasiha/wikiatrisk :)
