Neil, thank you so much for your insightful comments!

I was able to use Quarry to get the number of edits made on English
Wikipedia yesterday, so I can indeed get recent data from it -- hooray!!!

I also used it to cross-check against the REST API for February 28th:

https://quarry.wmflabs.org/query/25783

and I see that Quarry reports 168,668 edits while the REST API
reports 169,754 for the same period (a difference of under 1%). I'll
do some digging to see whether the gap comes from the denormalization
you mentioned or from something else.

Maybe one more question:

> the data requires some complex reconstruction and denormalization that takes
> several days to a week. This mostly affects the historical data, but the
> reconstruction currently has to be done for all history at once because
> historical data sometimes changes long after the fact in the MediaWiki
> databases. So the entire dataset is regenerated every month, which would be
> impossible to do daily.

A Wikipedian (hi Yurik!) guessed that this full scan over the data is
needed because Wikipedia admins have the authority to change an
article's history after the fact (e.g., if a rogue editor posted
copyrighted information that should never be visible in the
changelog). If articles' changelogs were append-only, the
reconstruction could pick up where it left off rather than starting
from scratch; since they aren't, a full scan is needed. Is that a
fair understanding?

Again, many thanks!

Ahmed

PS. In case of an overabundance of curiosity, my little project is at
https://github.com/fasiha/wikiatrisk :)
