>
> This is why we were interested in pageviews to add "popularity" to the
> score. Thanks for sharing this tool; it is very helpful for getting a
> quick look at how it would look.
>

Indeed, very interesting!  Here's another tool that calculates trending
articles in a variety of ways and was useful to peruse as I was thinking
about this same topic: https://www.vitribyte.com/ (free sign-up, worth
going through)

> I still don't know if pageviews can be the only score component or if we
> should combine them with other factors like "quality" and "authority".
> My concerns with pageviews are:
> - we certainly have outliers (caused by 3rd party tools/bots ...)
>

We are doing a better and better job of filtering that.  The data behind
the just-released pageview API [1] and the latest dumps dataset [2] uses
that filtering, and we'll keep improving the criteria over time, hopefully
finding and labeling most automata properly.


> - what's the coverage of pageviews: i.e. in one month how many pages get 0
> pageviews?
>

select count(distinct page_title)
from wmf.pageview_hourly
where agent_type = 'user'
  and year = 2015 and month = 10 and day = 15;

Result: 23,110,732

We have about 35 million total articles [3], so something like 66% of all
our articles get viewed daily.  This is probably quite inaccurate because
the 23 million number above includes views to redirects and doesn't use
exactly the same definition of "article" as the dataset behind that graph.
And it's daily instead of monthly, but hopefully it still informs this a
bit.
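Back-of-the-envelope, the coverage ratio from those two numbers works out
as follows (the 35 million total is the approximate figure from [3]):

```python
# Rough daily coverage estimate: distinct titles viewed by users in one
# day, divided by the approximate total article count.
viewed_daily = 23_110_732   # result of the Hive query above
total_articles = 35_000_000  # approximate figure from [3]

coverage = viewed_daily / total_articles
print(f"{coverage:.0%}")  # roughly 66%
```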


> Quality: we have a set of templates that are already used to flag
> good/featured articles. Cirrus uses it on enwiki only, I'd really like to
> extend this to other wikis. I'm also very interested in the tool behind
> http://ores.wmflabs.org/scores/enwiki/wp10/?revids=686575075 .
>

We're very interested in starting to link this type of data with pageview
data and making different combinations of it accessible via other
endpoints on the API (the pageview API is just a set of endpoints served
by what we hope will become a more generic Analytics Query Service).


> I'm wondering if this approach can work; I tend to think that by using
> only one factor (pageviews) we can have both very long tails with 1 or 0
> pageviews and big outliers caused by new bots/tools we failed to detect.
> Using other factors not related to pageviews might help to mitigate these
> problems.
> So the question about normalization is also interesting to compute a
> composite score between 3 different components.
>

+1 for using multiple factors.  We've been looking at Druid, and we think
it can be very useful for this type of big-data question where the answer
has to consider lots of dimensions.
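As a sketch of the composite score idea discussed above (the log transform
and the weights are my own assumptions, not a settled formula): log-scale
pageviews to tame outliers and the long tail, min-max normalize each
component into [0, 1], then take a weighted sum.

```python
import math

def normalize(xs):
    """Min-max scale a list of numbers into [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [0.0 if hi == lo else (x - lo) / (hi - lo) for x in xs]

def composite_scores(pageviews, quality, authority,
                     weights=(0.5, 0.3, 0.2)):
    """Weighted sum of three normalized components, one score per article.

    log1p on pageviews dampens bot-driven spikes and compresses the long
    tail of 0/1-view pages; the weights are illustrative only.
    """
    comps = [normalize([math.log1p(v) for v in pageviews]),
             normalize(quality),
             normalize(authority)]
    return [sum(w * c[i] for w, c in zip(weights, comps))
            for i in range(len(pageviews))]

# Three hypothetical articles: popular/high-quality, middling, long-tail.
scores = composite_scores(pageviews=[120_000, 45_000, 3],
                          quality=[0.9, 0.7, 0.1],
                          authority=[0.8, 0.5, 0.05])
```

The point of normalizing before summing is exactly the normalization
question raised above: without it, raw pageview counts would swamp the
other two components.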
_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
