I agree that there's no reason to re-weight the observations under a
consistent sample.  The only reason I might re-weight based on the sample
would be if I were combining data with different sampling rates.
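
A minimal sketch of that re-weighting, using the numbers from Gergo's
example quoted below. It assumes each sample carries a weight equal to the
number of events it represents (10 for a 1:10 sample, 1 for an unsampled
log); the function names are hypothetical and this is an illustration, not
the code behind any existing dashboard.

```python
import math


def weighted_arithmetic_mean(samples):
    """samples: list of (value, weight) pairs; weight = events represented."""
    total_weight = sum(w for _, w in samples)
    return sum(v * w for v, w in samples) / total_weight


def weighted_geometric_mean(samples):
    """exp(sum(weight * ln(value)) / sum(weight)); values must be > 0."""
    total_weight = sum(w for _, w in samples)
    return math.exp(sum(w * math.log(v) for v, w in samples) / total_weight)


# Ten unsampled 1-second logs (weight 1 each) plus the single surviving
# 2-second log from a group of ten sampled at 1:10 (weight 10).
samples = [(1.0, 1)] * 10 + [(2.0, 10)]

# Ignoring the weights reproduces the biased ~1.1 sec figure.
naive_mean = sum(v for v, _ in samples) / len(samples)

print(naive_mean)                         # about 1.09
print(weighted_arithmetic_mean(samples))  # 1.5
print(weighted_geometric_mean(samples))   # about 1.41, i.e. sqrt(2)
```

When every sample has the same weight, both functions reduce to the plain
arithmetic and geometric means, which is why a consistent sample needs no
re-weighting.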

-Aaron


On Tue, May 20, 2014 at 8:18 AM, Nuria Ruiz <[email protected]> wrote:

> >>>[gergo] - whenever we display geometric means, we weight by sampling
> rate (exp(sum(sampling_rate * ln(value)) / sum(sampling_rate)) instead of
> exp(avg(ln(value))))
>
> >>[gilles] I don't follow the logic here. Like percentiles, averages
> should be unaffected by sampling, geometric or not.
>
> >[gergo] Assume we have 10 duration logs with 1 sec time and 10 with 2
> sec; the (arithmetic) mean is 1.5 sec. If the second group is sampled
> 1:10, and we take the average of that, that would give 1.1 sec; our one
> sample from the second group really represents 10 events, but only has the
> weight of one. The same logic should hold for geometric means.
> What variable are we measuring with this data that we are averaging?
>
>
>
> On Mon, May 19, 2014 at 11:40 AM, Gergo Tisza <[email protected]>
> wrote:
> > On Sun, May 18, 2014 at 11:55 PM, Gilles Dubuc <[email protected]>
> wrote:
> >>>
> >>> 1:1000 sampling is fine for frwiki thumbnail clicks, but not for cawiki
> >>> fullscreen button presses
> >>
> >>
> >> Since the issue is the global load, I think it'd be resolved by changing
> >> the sampling rate for the large wikis only. The small ones going back
> to 1:1
> >> would be fine, as they contribute little to the global load.
> >
> >
> > That solves part of the problem, but not all of it. For example, how do
> we
> > display click-to-thumbnail time in Kenya on our map? Presumably most
> people
> > there use the English or French Wikipedia, which are large ones, but the
> > traffic from Kenya is small, so sampling will pretty much destroy it. Same
> for
> > rare actions like clicking on the author name.
> >
> > Basically we should identify the segments which are large in all dimensions (e.g.
> > thumbnail clicks on enwiki from US), and only sample those.
> >
> >> Is there a way to set different PHP settings for small wikipedias than
> for
> >> large ones, though?
> >
> >
> > InitializeSettings.php can take wiki names directly, or any of the
> dblists
> > from the operations/mediawiki-config repo root (s* and small/medium/large
> > would be the helpful ones here).
> >
> >>> - whenever we display geometric means, we weight by sampling rate
> >>> (exp(sum(sampling_rate * ln(value)) / sum(sampling_rate)) instead of
> >>> exp(avg(ln(value))))
> >>
> >>
> >> I don't follow the logic here. Like percentiles, averages should be
> >> unaffected by sampling, geometric or not.
> >
> >
> > Assume we have 10 duration logs with 1 sec time and 10 with 2 sec; the
> > (arithmetic) mean is 1.5 sec. If the second group is sampled 1:10, and we
> > take the average of that, that would give 1.1 sec; our one sample from
> the
> > second group really represents 10 events, but only has the weight of one.
> > The same logic should hold for geometric means.
> >
> > I think averages would be unaffected by uniform sampling; but we are not
> > doing uniform sampling here; even if we are only doing per-wiki
> sampling, we
> > might need to aggregate data from differently sampled groups for a
> > cross-wiki comparison chart, for example.
> >
> > (I suspect percentiles would be affected by non-uniform sampling as well,
> > but I don't really have an idea how.)
> >
> > _______________________________________________
> > Analytics mailing list
> > [email protected]
> > https://lists.wikimedia.org/mailman/listinfo/analytics
> >
>
_______________________________________________
Multimedia mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/multimedia
