Re: [Multimedia] [Analytics] EventLogging ballooning

Nuria Ruiz Tue, 20 May 2014 06:18:54 -0700

>>>[gerco] - whenever we display geometric means, we weight by sampling rate 
>>>(exp(sum(sampling_rate * ln(value)) / sum(sampling_rate)) instead of 
>>>exp(avg(ln(value))))


>>[gilles] I don't follow the logic here. Like percentiles, averages should be 
>>unaffected by sampling, geometric or not.

>[gerco] Assume we have 10 duration logs with 1 sec time and 10 with 2 sec; the
>(arithmetic) mean is 1.5 sec. If the second group is sampled 1:10, and we
>take the average of that, that would give 1.1 sec; our one sample from the
>second group really represents 10 events, but only has the weight of one. The
>same logic should hold for geometric means.

What variable are we measuring with this data that we are averaging?
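
For reference, here is a minimal Python sketch of the weighting described in
the quoted example. It assumes the "sampling_rate" in the formula means the
number of events each logged sample stands for (10 for a group sampled 1:10,
1 for an unsampled group). The function names and the (value, weight) layout
are made up for illustration and are not taken from any existing code.

```python
import math

def weighted_arithmetic_mean(samples):
    """samples: list of (value, weight) pairs, where weight is the number
    of real events one logged sample represents (assumed interpretation)."""
    total_weight = sum(w for _, w in samples)
    return sum(v * w for v, w in samples) / total_weight

def weighted_geometric_mean(samples):
    """exp(sum(weight * ln(value)) / sum(weight)); values must be > 0."""
    total_weight = sum(w for _, w in samples)
    return math.exp(sum(w * math.log(v) for v, w in samples) / total_weight)

# The quoted example: 10 unsampled logs of 1 sec, plus the single surviving
# log of 2 sec from a group of 10 that was sampled 1:10.
samples = [(1.0, 1)] * 10 + [(2.0, 10)]

naive = sum(v for v, _ in samples) / len(samples)  # ignores the weights
weighted = weighted_arithmetic_mean(samples)
geometric = weighted_geometric_mean(samples)

print(round(naive, 2))      # 1.09, the "1.1 sec" from the example
print(round(weighted, 2))   # 1.5, matches the unsampled mean
print(round(geometric, 3))  # 1.414, the geometric mean of the full data
```

If every sample has the same weight, both functions reduce to the unweighted
means, which is consistent with the point that uniform sampling leaves
averages unaffected.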



On Mon, May 19, 2014 at 11:40 AM, Gergo Tisza <[email protected]> wrote:
> On Sun, May 18, 2014 at 11:55 PM, Gilles Dubuc <[email protected]> wrote:
>>>
>>> 1:1000 sampling is fine for frwiki thumbnail clicks, but not for cawiki
>>> fullscreen button presses
>>
>>
>> Since the issue is the global load, I think it'd be resolved by changing
>> the sampling rate for the large wikis only. The small ones going back to 1:1
>> would be fine, as they contribute little to the global load.
>
>
> That solves part of the problem, but not all of it. For example, how do we
> display click-to-thumbnail time in Kenya on our map? Presumably most people
> there use the English or French Wikipedia, which are large ones, but the
> traffic from Kenya is small, so sampling will pretty much destroy it. Same for
> rare actions like clicking on the author name.
>
> Basically we should identify the segments which are large in all dimensions
> (e.g. thumbnail clicks on enwiki from US), and only sample those.
>
>> Is there a way to set different PHP settings for small wikipedias than for
>> large ones, though?
>
>
> InitializeSettings.php can take wiki names directly, or any of the dblists
> from the operations/mediawiki-config repo root (s* and small/medium/large
> would be the helpful ones here).
>
>>> - whenever we display geometric means, we weight by sampling rate
>>> (exp(sum(sampling_rate * ln(value)) / sum(sampling_rate)) instead of
>>> exp(avg(ln(value))))
>>
>>
>> I don't follow the logic here. Like percentiles, averages should be
>> unaffected by sampling, geometric or not.
>
>
> Assume we have 10 duration logs with  1 sec time and 10 with 2 sec; the
> (arithmetic) mean is 1.5 sec. If the second group is sampled 1:10, and we
> take the average of that, that would give 1.1 sec; our one sample from the
> second group really represents 10 events, but only has the weight of one.
> The same logic should hold for geometric means.
>
> I think averages would be unaffected by uniform sampling; but we are not
> doing uniform sampling here; even if we are only doing per-wiki sampling, we
> might need to aggregate data from differently sampled groups for a
> cross-wiki comparison chart, for example.
>
> (I suspect percentiles would be affected by non-uniform sampling as well,
> but I don't really have an idea how.)
>
> _______________________________________________
> Analytics mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/analytics
>

_______________________________________________
Multimedia mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/multimedia
