Hello Ahmed, nice to meet you! As a data analyst who constantly works with edit data, I would love to have it updated daily too. Unfortunately, there are serious infrastructural limitations that make that very difficult.
Both the edit data and the pageview data you're asking about come from the Hadoop-based Analytics Data Lake <https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake>. However, because of limitations in the underlying MediaWiki application databases <https://www.mediawiki.org/wiki/Manual:Database_layout> that Hive pulls edit data from, the data requires some complex reconstruction and denormalization <https://wikitech.wikimedia.org/wiki/Analytics/Systems/Data_Lake/Edits/Pipeline> that takes several days to a week. This mostly affects the historical data, but the reconstruction currently has to be done for all of history at once, because historical data sometimes changes long after the fact in the MediaWiki databases. As a result, the entire dataset is regenerated every month, which would be impossible to do daily. I'm sure there are strategies that could ultimately fix these problems, but I'm also sure they would take great effort to implement, so unfortunately that's unlikely to happen anytime soon.

In the meantime, you may be able to work around these issues by using the public replicas of the application databases <https://wikitech.wikimedia.org/wiki/Help:Toolforge/Database#Connecting_to_the_database_replicas>. Unlike with the API, you'd have to do the computation yourself, but the replicas are updated in (near) real time. Quarry <https://meta.wikimedia.org/wiki/Research:Quarry> is an excellent, easy-to-use tool for running SQL queries on them.

I'm not an expert on the Data Lake, but I'm pretty sure this is broadly accurate. Corrections from the Analytics team welcome :)

On 22 March 2018 at 08:21, Ahmed Fasih <[email protected]> wrote:
> Hello! I have some questions about the latency of some Wikipedia REST
> endpoints from
>
> https://wikimedia.org/api/rest_v1
>
> I see that I can get very recent pageviews data, e.g.
> https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/en.wikipedia/all-access/all-agents/hourly/2018032100/2018032300
>
> accessed now, on 2018/03/22, at 0249 UTC, gives me the hourly pageviews
> on the English Wikipedia at timestamp "2018032200", so with about 4
> hours of latency, very nice!
>
> In contrast, asking for the daily number of edits via
>
> https://wikimedia.org/api/rest_v1/metrics/edits/aggregate/en.wikipedia/all-editor-types/all-page-types/daily/20180225/20180321
>
> only gives me data up to the end of February, with no March data. This
> makes me think the daily datasets are generated only once a month? How
> might I gain access to more recent daily data from the
> "rest_v1/metrics/edits" endpoints?
>
> Thanks!
>
> _______________________________________________
> Analytics mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/analytics
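P.S. To make the replica workaround concrete, here is a small Python sketch. It rebuilds the AQS edits URL quoted above from its path segments, and it includes the kind of SQL you could run on the replicas (e.g. via Quarry) to count edits per day yourself. The URL layout comes straight from the endpoints quoted in this thread; the SQL assumes the standard MediaWiki `revision` table, whose `rev_timestamp` column is a YYYYMMDDHHMMSS string.

```python
# Base URL for the Wikimedia Analytics Query Service (AQS) REST API,
# as quoted in this thread.
AQS_BASE = "https://wikimedia.org/api/rest_v1/metrics"


def edits_url(project, start, end, granularity="daily",
              editor_types="all-editor-types", page_types="all-page-types"):
    """Build an AQS edits URL like the one quoted in the thread."""
    return (f"{AQS_BASE}/edits/aggregate/{project}/{editor_types}/"
            f"{page_types}/{granularity}/{start}/{end}")


# Near-real-time alternative: count edits per day directly on the
# database replicas (e.g. via Quarry). This assumes the standard
# MediaWiki `revision` table; the first 8 characters of rev_timestamp
# are the YYYYMMDD date.
DAILY_EDITS_SQL = """
SELECT LEFT(rev_timestamp, 8) AS day, COUNT(*) AS edits
FROM revision
WHERE rev_timestamp >= '20180301000000'
GROUP BY day
ORDER BY day;
"""

print(edits_url("en.wikipedia", "20180225", "20180321"))
# Matches the daily-edits URL quoted above.
```

Note the trade-off: the AQS endpoint gives you pre-aggregated numbers but lags by up to a month, while the SQL on the replicas is near real time but you aggregate yourself, one wiki at a time.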
