I agree that there's no reason to re-weight the observations when everything is sampled at the same rate. The only reason I might re-weight based on the sampling rate would be if I were combining data collected at different sampling rates.
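
To make the mixed-rate case concrete, here is a minimal Python sketch of the weighting Gergo describes in the quoted thread, applied to his own example (ten 1-sec events logged at 1:1, ten 2-sec events logged at 1:10 so that only one survives). The function names and the reading of "sampling_rate" as the N in 1:N are my assumptions, not anything taken from the actual dashboard code.

```python
import math


def weighted_arithmetic_mean(values, weights):
    """Arithmetic mean where each value counts `weight` times."""
    total_weight = sum(weights)
    return sum(w * v for v, w in zip(values, weights)) / total_weight


def weighted_geometric_mean(values, weights):
    """exp(sum(w * ln(v)) / sum(w)), the formula from the thread.

    Assumes every value is strictly positive, as durations should be.
    """
    total_weight = sum(weights)
    log_sum = sum(w * math.log(v) for v, w in zip(values, weights))
    return math.exp(log_sum / total_weight)


# Logged rows after sampling: ten 1-sec rows (1:1) and one 2-sec row (1:10).
durations = [1.0] * 10 + [2.0]
# Hypothetical weights: the N of each row's 1:N sampling rate.
weights = [1] * 10 + [10]

# Ignoring the sampling rates under-represents the sampled group.
naive_mean = sum(durations) / len(durations)
naive_gmean = math.exp(sum(math.log(d) for d in durations) / len(durations))

# Weighting by the sampling rate recovers the mean of the 20 original events.
weighted_mean = weighted_arithmetic_mean(durations, weights)
weighted_gmean = weighted_geometric_mean(durations, weights)

print(round(naive_mean, 2), weighted_mean)
print(round(naive_gmean, 3), round(weighted_gmean, 3))
```

With a single sampling rate all the weights are equal and cancel out, so both weighted functions reduce to the plain means; the weighting only matters once differently sampled groups are pooled.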
-Aaron

On Tue, May 20, 2014 at 8:18 AM, Nuria Ruiz <[email protected]> wrote:
>
> >>> [gergo] - whenever we display geometric means, we weight by sampling
> >>> rate (exp(sum(sampling_rate * ln(value)) / sum(sampling_rate)) instead
> >>> of exp(avg(ln(value))))
>
> >> [gilles] I don't follow the logic here. Like percentiles, averages
> >> should be unaffected by sampling, geometric or not.
>
> > [gergo] Assume we have 10 duration logs with 1 sec time and 10 with 2
> > sec; the (arithmetic) mean is 1.5 sec. If the second group is sampled
> > 1:10, and we take the average of that, that would give 1.1 sec; our one
> > sample from the second group really represents 10 events, but only has
> > the weight of one. The same logic should hold for geometric means.
>
> What variable are we measuring with this data that we are averaging?
>
> On Mon, May 19, 2014 at 11:40 AM, Gergo Tisza <[email protected]> wrote:
> > On Sun, May 18, 2014 at 11:55 PM, Gilles Dubuc <[email protected]> wrote:
> >>> 1:1000 sampling is fine for frwiki thumbnail clicks, but not for cawiki
> >>> fullscreen button presses
> >>
> >> Since the issue is the global load, I think it'd be resolved by changing
> >> the sampling rate for the large wikis only. The small ones going back to
> >> 1:1 would be fine, as they contribute little to the global load.
> >
> > That solves part of the problem, but not all of it. For example, how do we
> > display click-to-thumbnail time in Kenya on our map? Presumably most people
> > there use the English or French Wikipedia, which are large ones, but the
> > traffic from Kenya is small, so sampling will pretty much destroy it. Same
> > for rare actions like clicking on the author name.
> >
> > Basically we should identify the segments which are large in all
> > dimensions (e.g. thumbnail clicks on enwiki from US), and only sample
> > those.
> >
> >> Is there a way to set different PHP settings for small wikipedias than
> >> for large ones, though?
> >
> > InitializeSettings.php can take wiki names directly, or any of the dblists
> > from the operations/mediawiki-config repo root (s* and small/medium/large
> > would be the helpful ones here).
> >
> >>> - whenever we display geometric means, we weight by sampling rate
> >>> (exp(sum(sampling_rate * ln(value)) / sum(sampling_rate)) instead of
> >>> exp(avg(ln(value))))
> >>
> >> I don't follow the logic here. Like percentiles, averages should be
> >> unaffected by sampling, geometric or not.
> >
> > Assume we have 10 duration logs with 1 sec time and 10 with 2 sec; the
> > (arithmetic) mean is 1.5 sec. If the second group is sampled 1:10, and we
> > take the average of that, that would give 1.1 sec; our one sample from the
> > second group really represents 10 events, but only has the weight of one.
> > The same logic should hold for geometric means.
> >
> > I think averages would be unaffected by uniform sampling; but we are not
> > doing uniform sampling here; even if we are only doing per-wiki sampling,
> > we might need to aggregate data from differently sampled groups for a
> > cross-wiki comparison chart, for example.
> >
> > (I suspect percentiles would be affected by non-uniform sampling as well,
> > but I don't really have an idea how.)
> >
> > _______________________________________________
> > Analytics mailing list
> > [email protected]
> > https://lists.wikimedia.org/mailman/listinfo/analytics
_______________________________________________
Multimedia mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/multimedia
