Hello Ahmed, nice to meet you!

As a data analyst who constantly works with the edit data, I would love to
have it updated daily too. But there are serious infrastructural
limitations that make that very difficult.

Both the edit data and pageview data that you're talking about come from
the Hadoop-based Analytics Data Lake
<https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake>. However, because
of limitations in the underlying MediaWiki application databases
<https://www.mediawiki.org/wiki/Manual:Database_layout> that Hive pulls
edit data from, the data requires some complex reconstruction and
denormalization
<https://wikitech.wikimedia.org/wiki/Analytics/Systems/Data_Lake/Edits/Pipeline>
that takes several days to a week. This mostly affects the historical data,
but the reconstruction currently has to be done for all history at once
because historical data sometimes changes long after the fact in the
MediaWiki databases. So the entire dataset is regenerated every month,
which would be impossible to do daily.

I'm sure there are strategies that could ultimately fix these problems, but
I'm also sure that they would take great effort to implement, so
unfortunately that's unlikely to happen anytime soon.

In the meantime, you may be able to work around these issues by using
the public replicas of the application databases
<https://wikitech.wikimedia.org/wiki/Help:Toolforge/Database#Connecting_to_the_database_replicas>.
Unlike with the API, you'd have to do the computation yourself, but the
data is updated in near real-time. Quarry
<https://meta.wikimedia.org/wiki/Research:Quarry> is an excellent,
easy-to-use tool for running SQL queries on those replicas.
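For example, a daily edit count (roughly what the edits endpoint
reports) can come from a single query against the standard MediaWiki
`revision` table, where rev_timestamp is a 14-digit string like
'20180321123456'. Here's a rough Python sketch that builds such a query
for pasting into Quarry; the helper name is mine, and I haven't run
this against the live replicas:

```python
from datetime import date, timedelta

def daily_edits_query(day):
    """Build a SQL query counting all edits made on `day`.

    Assumes the standard MediaWiki `revision` table, whose
    rev_timestamp column holds strings like '20180321123456'.
    Intended to be pasted into Quarry or run on the Toolforge
    database replicas.
    """
    start = day.strftime("%Y%m%d") + "000000"
    end = (day + timedelta(days=1)).strftime("%Y%m%d") + "000000"
    return (
        "SELECT COUNT(*) AS edits "
        "FROM revision "
        f"WHERE rev_timestamp >= '{start}' "
        f"AND rev_timestamp < '{end}';"
    )

print(daily_edits_query(date(2018, 3, 21)))
```

A half-open timestamp range like this avoids off-by-one-day errors and
lets the replica use the index on rev_timestamp.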

I'm not an expert on the Data Lake, but I'm pretty sure this is broadly
accurate. Corrections from the Analytics team welcome :)


On 22 March 2018 at 08:21, Ahmed Fasih <[email protected]> wrote:

> Hello! I have some questions about the latency of some Wikipedia REST
> endpoints from
>
> https://wikimedia.org/api/rest_v1
>
> I see that I can get very recent pageviews data, e.g.
>
> https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/en.wikipedia/all-access/all-agents/hourly/2018032100/2018032300
>
> accessed now, on 2018/03/22, at 0249 UTC, gives me hourly pageview
> counts for the English Wikipedia up to timestamp "2018032200", so
> with only ~4 hours of latency, very nice!
>
> In contrast, asking for the daily number of edits via
>
> https://wikimedia.org/api/rest_v1/metrics/edits/aggregate/en.wikipedia/all-editor-types/all-page-types/daily/20180225/20180321
>
> only gives me data up to the end of February, with no March data. This
> makes me think the daily datasets are generated only once a month? How
> might I gain access to more recent daily data like the
> "rest_v1/metrics/edits" endpoints?
>
> Thanks!
>
> _______________________________________________
> Analytics mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/analytics
>