>For percentile charts, my understanding is (thanks for the IRC advice, Nuria
>and Leila!) that they remain accurate, as long as the amount sampled is large
>enough; the best practice is to sample at least 1000 events per bucket (so
>10,000 altogether if we are looking for the 90th percentile, 100,000 if we
>are looking for the 99th percentile etc).
Correct, no adjustment is needed in this case: we are just reducing
the sample to the size we need to calculate a percentile with an
acceptable level of confidence. This is a simplification that should
work well in this case.
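
As a rough sketch of the rule of thumb quoted above (at least 1000
events in the bucket beyond the percentile of interest), the minimum
sample size could be computed as below. The function name and the
`events_per_bucket` parameter are made up for illustration, and the
rule itself is a simplification, not a formal confidence bound.

```python
import math
from fractions import Fraction


def min_sample_size(percentile, events_per_bucket=1000):
    """Smallest sample that leaves `events_per_bucket` events above `percentile`.

    Hypothetical helper illustrating the rule of thumb from this thread.
    """
    # Exact arithmetic avoids float rounding for inputs such as 99.9.
    tail = 1 - Fraction(str(percentile)) / 100
    if not 0 < tail < 1:
        raise ValueError("percentile must be strictly between 0 and 100")
    return math.ceil(events_per_bucket / tail)


print(min_sample_size(90))  # 10000
print(min_sample_size(99))  # 100000
```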


>I'm still looking for an answer on what effect sampling has on geometric means.
If the sampling we have is good enough to calculate a 90th or 99th
percentile (which it is), I do not see why you would need to adjust
your geometric mean in any way.
Anyone please correct me if I am wrong, but I believe that if you want
a measure of how spread out your values are, you can calculate the
geometric standard deviation.
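
For reference, a minimal sketch of both quantities, computed on the
logarithms of the values. The sample durations are made up, and using
the population form of the standard deviation (dividing by n) is an
assumption; dividing by n - 1 is equally defensible.

```python
import math


def geometric_mean(values):
    """exp of the mean of ln(value); all values must be positive."""
    logs = [math.log(v) for v in values]
    return math.exp(sum(logs) / len(logs))


def geometric_std(values):
    """exp of the standard deviation of ln(value) (population form)."""
    logs = [math.log(v) for v in values]
    mean_log = sum(logs) / len(logs)
    variance = sum((x - mean_log) ** 2 for x in logs) / len(logs)
    return math.exp(math.sqrt(variance))


# Made-up latencies in milliseconds, for illustration only.
durations_ms = [120, 250, 400, 900, 3000]
gm = geometric_mean(durations_ms)
gsd = geometric_std(durations_ms)
print(f"geometric mean: {gm:.0f} ms, spread factor: {gsd:.2f}")
```

The geometric standard deviation is a dimensionless factor of at
least 1: the values typically lie between gm / gsd and gm * gsd.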


>So to get a proper amount of data, we would probably need to vary sampling per 
>wiki or country, and also per action:
Correct. Every action you are inter-comparing should have a sample
size that lets you calculate, say, the 99th percentile with acceptable
confidence. Per our rule above, that means 100,000 samples or more
(this is, again, a simplification that should work well in this case).

Now, are you really interested in detailing user behavior of your
feature per wiki? Is the expectation that users from es.wikipedia have
a fundamentally different experience than users from fr.wikipedia? Or
are we studying "global" usage? If we need different sample sizes per
wiki, the most logical way to do it is to have a sampling configuration
deployed per wiki rather than changing the schemas. (We need to check
whether the MediaWiki config allows for this.)
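
To illustrate what a per-wiki sampling configuration might contain,
here is a sketch that derives a 1:N factor from the expected event
volume. Everything in it is hypothetical: the function name, the wiki
keys and the volumes are invented, and it says nothing about how
MediaWiki configuration actually works.

```python
def sampling_factor(expected_events, required_samples=100_000):
    """Largest N such that 1:N sampling still yields `required_samples`.

    Returns 1 (log everything) when even unsampled traffic falls short.
    """
    return max(1, expected_events // required_samples)


# Made-up event volumes per collection period, for illustration only.
expected_events_by_wiki = {
    "big_wiki": 250_000_000,
    "medium_wiki": 4_000_000,
    "small_wiki": 30_000,
}

sampling_config = {
    wiki: sampling_factor(events)
    for wiki, events in expected_events_by_wiki.items()
}
print(sampling_config)
```

A factor of 1 for the smallest wiki means that even logging every
event would not reach the target in one period, so the collection
period would have to be longer.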


>-whenever we display percentiles, we ignore sampling rates, they should not
>influence the result even if we consider data from multiple sources with
>mixed sampling rates (I'm not quite sure about this one)
This is only correct if you have a sufficient sample size in all
datasets to calculate percentiles with acceptable confidence. Example
(simplifying things a bunch to rules of thumb): you are interested in
the 90th percentile and you have dataset 1 with 100,000 points,
dataset 2 with 500,000 and dataset 3 with 1,000. You can inter-compare
the 90th percentile in datasets 1 and 2, but dataset 3 does not have
enough data to calculate the 90th percentile.
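
The same check, sketched in code with the sizes from the example
above. The helper name is made up, and the 10,000 threshold is the
rule-of-thumb minimum for a 90th percentile, not a formal bound.

```python
def comparable_datasets(sizes, min_points):
    """Split dataset names into (enough data, too little data)."""
    enough = [name for name, n in sizes.items() if n >= min_points]
    too_small = [name for name, n in sizes.items() if n < min_points]
    return enough, too_small


# 10,000 points = 1000 events per bucket for a 90th percentile.
sizes = {"dataset 1": 100_000, "dataset 2": 500_000, "dataset 3": 1_000}
enough, too_small = comparable_datasets(sizes, min_points=10_000)
print(enough)     # ['dataset 1', 'dataset 2']
print(too_small)  # ['dataset 3']
```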

On Sun, May 18, 2014 at 7:00 AM, Gergo Tisza <[email protected]> wrote:
> On Fri, May 16, 2014 at 9:34 AM, Ori Livneh <[email protected]> wrote:
>>
>> On Fri, May 16, 2014 at 9:17 AM, Federico Leva (Nemo) <[email protected]>
>> wrote:
>>>
>>> * From 40 to 260 events logged per second in a month: what's going on?
>>
>>
>> Eep, thanks for raising the alarm. MediaViewer is 170 events / sec,
>> MultimediaViewerDuration is 38 / sec.
>>
>> +CC Multimedia.
>
>
> After an IRC discussion we added 1:1000 sampling to both of those schemas.
> I'll need a little help fixing things on the data processing side; I'll give
> a short description of how we use the data first.
>
> A MediaViewer event represents a user action (e.g. clicking on a thumbnail,
> or using the back button in the browser while the lightbox is open). The
> most used actions are (were, before the sampling) logged a few million times
> a day; the least used ones less than a thousand times.
> We use the data to display graphs like this:
> http://multimedia-metrics.wmflabs.org/dashboards/mmv#actions-graphs-tab
> There are also per-wiki graphs; there is about three magnitudes of
> difference between the largest and the smallest wikis (will be more once we
> roll out on English).
>
> A MultimediaViewerDuration event contains data about how much the user had
> to wait (such as milliseconds between clicking the thumbnail and displaying
> the image). This is fairly new and we don't have graphs yet, but they will
> look something like these (which show the latency of our network requests):
> http://multimedia-metrics.wmflabs.org/dashboards/mmv#overall_network_performance-graphs-tab
> http://multimedia-metrics.wmflabs.org/dashboards/mmv#geographical_network_performance-graphs-tab
> that is, they are used to calculate a geometric mean and various
> percentiles, with per-wiki and per-country breakdown.
>
> What I would like to understand is: 1) how we need to modify these charts to
> account for the sampling, 2) how we can make sure the sampling does not
> result in loss of low-volume data (e.g. from wikis which have less traffic).
>
> == How to take the sampling into account ==
>
> For the activity charts which show total event counts, this is easy: we just
> need to multiply the count by the sampling ratio.
>
> For percentile charts, my understanding is (thanks for the IRC advice, Nuria
> and Leila!) that they remain accurate, as long as the amount sampled is
> large enough; the best practice is to sample at least 1000 events per bucket
> (so 10,000 altogether if we are looking for the 90th percentile, 100,000 if
> we are looking for the 99th percentile etc).
>
> I'm still looking for an answer on what effect sampling has on geometric
> means.
>
> == How to handle data sources with very different volumes ==
>
> As I said above, there are about three magnitudes of difference between data
> volume for frequent and rare user actions, and also between large and small
> wikis (probably even more for countries - if you look at the map linked
> above, you can see that some African countries are missing: we use 1:1000
> sampling and haven't collected a single data point there yet).
>
> So to get a proper amount of data, we would probably need to vary sampling
> per wiki or country, and also per action: 1:1000 sampling is fine for frwiki
> thumbnail clicks, but not for cawiki fullscreen button presses. The question
> is, how to mix different data sources? For example, we might decide to
> sample thumbnail clicks 1:1000 on enwiki but only 1:100 on dewiki, and then
> we want to show a graph of global clicks which includes both enwiki and
> dewiki counts.
>
> Here is what I came up with:
> - we add a "sampling rate" field to all our schemas
> - the rule to determine the sampling rate of a given event (i.e. the
> reciprocal of the probability of the event getting logged) can be as
> difficult as we like, as long as the logging code saves that number as well
> - whenever we display total counts, we use sum(sampling_rate) instead of
> count(*)
> - whenever we display percentiles, we ignore sampling rates, they should not
> influence the result even if we consider data from multiple sources with
> mixed sampling rates (I'm not quite sure about this one)
> - whenever we display geometric means, we weight by sampling rate
> (exp(sum(sampling_rate * ln(value)) / sum(sampling_rate)) instead of
> exp(avg(ln(value))))
>
> Do you think that would yield correct results?
>
> _______________________________________________
> Analytics mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/analytics
>

_______________________________________________
Multimedia mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/multimedia
