> Wikistats 1 generates data on content pages with a delay of 10-15 days
> after the end of the month

This is true for full snapshots (for the reasons we have discussed before and that Dan has described on this thread). You can expect data to be available on the API soon after the 10th, but it is unlikely to be there before the 10th, as we do not start the process until the 5th.
Now, data, as you know, is streamed in real time, every second. So it is only the full reconstruction of events, the full snapshot, that takes several days to build. Have you looked into using the real-time events when the next month's snapshot is not yet available?

On Wed, Oct 10, 2018 at 7:48 PM Dan Andreescu <[email protected]> wrote:

> It should be updated soon, the jobs are all done successfully. But
> currently we do expect this kind of lag, I'll explain why.
>
> When we started we were sqooping at the beginning of the month and the
> processing takes something like 4 days total, most of it sqooping. But
> this put too much load on the database servers too close to the beginning
> of the month when a bunch of other stuff is running. So we had to move it
> back to the 5th of the month [1]. Add 4 days onto that and we end up
> finishing around the 9th of the month. We don't like this at all and we're
> trying to figure out a better way to import the data incrementally so we
> can just start processing when we have all of it. It's unfortunate but we
> couldn't foresee the infrastructure limitation, too much was up in the air
> about even where we would sqoop from when we started this work. Joseph and
> I have a weekly meeting to discuss moving towards a more incremental
> approach, and this task is the parent task to watch for now:
> https://phabricator.wikimedia.org/T193650 (priority is low because we
> have too many other commitments, but it's something I'd love to see before
> we call wikistats 2 "production" quality)
>
> [1]
> https://github.com/wikimedia/puppet/blob/28b78985d3612a6e19720be1fe8eef5f0dfc2ed7/modules/profile/manifests/analytics/refinery/job/sqoop_mediawiki.pp#L43
>
> On Wed, Oct 10, 2018 at 10:00 PM Neil Patel Quinn <[email protected]>
> wrote:
>
>> Hey there!
>>
>> I just wrote a script that fetches data from the AQS new pages endpoint
>> <https://wikimedia.org/api/rest_v1/#!/Edited_pages_data/get_metrics_edited_pages_new_project_editor_type_page_type_granularity_start_end>
>> in order to prepare our monthly health metrics (T199459
>> <https://phabricator.wikimedia.org/T199459>).
>>
>> However, it seems like that endpoint doesn't yet have monthly data for
>> September. For example, a query for Commons with a start of July 1 and
>> an end of October 1
>> <https://wikimedia.org/api/rest_v1/metrics/edited-pages/new/commons.wikimedia.org/all-editor-types/content/monthly/20180701/20181001>
>> returns only data for July and August. What's the schedule for updating
>> this data?
>>
>> To be honest, I feel pretty frustrated by this. Wikistats 1 generates
>> data on content pages with a delay of 10-15 days after the end of the
>> month, which has made it difficult for us to provide timely metrics to
>> executives and the board. I had assumed (to a degree that I didn't even
>> check) that by switching to this API, we would instead only have to deal
>> with the delay in generating the mediawiki_history snapshot (5-7 days
>> after the end of the month). But that doesn't seem to be the case.
>> --
>> Neil Patel Quinn <https://meta.wikimedia.org/wiki/User:Neil_P._Quinn-WMF>
>> (he/him/his)
>> product analyst, Wikimedia Foundation
>> _______________________________________________
>> Analytics mailing list
>> [email protected]
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
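For anyone scripting against this endpoint, here is a minimal sketch of how a script might detect that the latest monthly snapshot has not landed yet. It rebuilds the Commons URL quoted in the thread and lists the months that are absent from a response. The helper names (new_pages_url, fetch_json, missing_months), the User-Agent string, the treatment of the end date as exclusive, and the assumed response shape (an "items" list whose entries carry a "results" list with ISO "timestamp" fields) are all assumptions for illustration, not something confirmed in this thread.

```python
import json
import urllib.request
from datetime import date

AQS_BASE = "https://wikimedia.org/api/rest_v1/metrics/edited-pages/new"


def new_pages_url(project, start, end, editor_type="all-editor-types",
                  page_type="content", granularity="monthly"):
    """Build the new-pages URL in the same form as the one quoted in the thread."""
    return "/".join([
        AQS_BASE, project, editor_type, page_type, granularity,
        start.strftime("%Y%m%d"), end.strftime("%Y%m%d"),
    ])


def fetch_json(url):
    """Fetch and decode a JSON response (hypothetical helper, placeholder User-Agent)."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "monthly-metrics-check (placeholder contact)"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def expected_months(start, end):
    """List 'YYYY-MM' for every month from start up to, but not including, end."""
    months = []
    year, month = start.year, start.month
    while date(year, month, 1) < end:
        months.append(f"{year:04d}-{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return months


def missing_months(response, start, end):
    """Return the expected months with no result in the (assumed) response shape."""
    present = set()
    for item in response.get("items", []):
        for result in item.get("results", []):
            # Assumed timestamp format: '2018-07-01T00:00:00.000Z'
            present.add(result["timestamp"][:7])
    return [m for m in expected_months(start, end) if m not in present]
```

Calling missing_months(fetch_json(url), start, end) for July 1 through October 1 would return ["2018-09"] in the situation Neil describes, which a script could use to fall back to the real-time events or to retry after the 10th.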
