>>> [gerco] - whenever we display geometric means, we weight by sampling rate
>>> (exp(sum(sampling_rate * ln(value)) / sum(sampling_rate)) instead of
>>> exp(avg(ln(value))))
>> [gilles] I don't follow the logic here. Like percentiles, averages should be
>> unaffected by sampling, geometric or not.

> [gerco] Assume we have 10 duration logs with 1 sec time and 10 with 2 sec; the
> (arithmetic) mean is 1.5 sec. If the second group is sampled 1:10, and we
> take the average of that, that would give 1.1 sec; our one sample from the
> second group really represents 10 events, but only has the weight of one. The
> same logic should hold for geometric means.

What variable are we measuring with this data that we are averaging?


On Mon, May 19, 2014 at 11:40 AM, Gergo Tisza <[email protected]> wrote:

> On Sun, May 18, 2014 at 11:55 PM, Gilles Dubuc <[email protected]> wrote:
>>>
>>> 1:1000 sampling is fine for frwiki thumbnail clicks, but not for cawiki
>>> fullscreen button presses
>>
>>
>> Since the issue is the global load, I think it'd be resolved by changing
>> the sampling rate for the large wikis only. The small ones going back to 1:1
>> would be fine, as they contribute little to the global load.
>
>
> That solves part of the problem, but not all of it. For example, how do we
> display click-to-thumbnail time in Kenya on our map? Presumably most people
> there use the English or French Wikipedia, which are large ones, but the
> traffic from Kenya is small, so sampling will pretty much destroy it. Same for
> rare actions like clicking on the author name.
>
> Basically we should find the segments which are large in all dimensions (e.g.
> thumbnail clicks on enwiki from US), and only sample those.
>
>> Is there a way to set different PHP settings for small wikipedias than for
>> large ones, though?
>
>
> InitializeSettings.php can take wiki names directly, or any of the dblists
> from the operations/mediawiki-config repo root (s* and small/medium/large
> would be the helpful ones here).
>
>>> - whenever we display geometric means, we weight by sampling rate
>>> (exp(sum(sampling_rate * ln(value)) / sum(sampling_rate)) instead of
>>> exp(avg(ln(value))))
>>
>>
>> I don't follow the logic here. Like percentiles, averages should be
>> unaffected by sampling, geometric or not.
>
>
> Assume we have 10 duration logs with 1 sec time and 10 with 2 sec; the
> (arithmetic) mean is 1.5 sec. If the second group is sampled 1:10, and we
> take the average of that, that would give 1.1 sec; our one sample from the
> second group really represents 10 events, but only has the weight of one.
> The same logic should hold for geometric means.
>
> I think averages would be unaffected by uniform sampling; but we are not
> doing uniform sampling here; even if we are only doing per-wiki sampling, we
> might need to aggregate data from differently sampled groups for a
> cross-wiki comparison chart, for example.
>
> (I suspect percentiles would be affected by non-uniform sampling as well,
> but I don't really have an idea how.)
>
> _______________________________________________
> Analytics mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
_______________________________________________
Multimedia mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/multimedia
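
For reference, the 1 sec / 2 sec example from the thread can be checked with a few lines of Python. This is only a sketch: the function and variable names are made up for illustration, and it assumes "sampling_rate" in the formula means the number of real events each logged sample stands for (1 for unsampled logs, 10 for logs sampled 1:10).

```python
import math

def weighted_arithmetic_mean(values, weights):
    """sum(weight * value) / sum(weight)"""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)

def weighted_geometric_mean(values, weights):
    """exp(sum(weight * ln(value)) / sum(weight)); values must be > 0."""
    log_sum = sum(w * math.log(v) for v, w in zip(values, weights))
    return math.exp(log_sum / sum(weights))

# What ends up in the logs: all ten 1-sec events (sampled 1:1) and
# a single 2-sec event that stands for ten (sampled 1:10).
values = [1.0] * 10 + [2.0]
weights = [1] * 10 + [10]  # assumed meaning of sampling_rate

# Unweighted statistics over the logged samples only.
unweighted_mean = sum(values) / len(values)
unweighted_geo = math.exp(sum(math.log(v) for v in values) / len(values))

# Weighted statistics, which recover the values for the full population.
weighted_mean = weighted_arithmetic_mean(values, weights)
weighted_geo = weighted_geometric_mean(values, weights)

print(f"arithmetic: unweighted {unweighted_mean:.3f}, weighted {weighted_mean:.3f}")
print(f"geometric:  unweighted {unweighted_geo:.3f}, weighted {weighted_geo:.3f}")
```

The unweighted arithmetic mean comes out at 12/11 (about 1.1 sec) and the weighted one at 1.5 sec, matching the numbers quoted above. The weighted geometric mean is sqrt(2), which is what the unsampled 20 events would give.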
