That makes sense Gilles - the "local detection" seems too sophisticated in
this case, outliers and stuff like that can be figured out in SQL.

On Thu, Jan 29, 2015 at 2:17 PM, Gilles Dubuc <[email protected]> wrote:

> jsfiddle confirming my hunch about CORS and the misguided conditional:
> http://jsfiddle.net/2ycra7y0/4/
>
> On Thu, Jan 29, 2015 at 10:00 AM, Gilles Dubuc <[email protected]>
> wrote:
>
>> I think I've found where this is coming from. Nothing to do with EL:
>>
>> - When Media Viewer was launched to production, we forgot about our list
>> of beta sites in the config repo and as a result, wikis that had not been
>> through the beta (Chinese Wikipedia being one of them) ended up being
>> sampled 100 times more than all the other production wikis for this
>> particular metrics (network performance). Which made China, where network
>> performance is particularly bad, overrepresented in the global figures.
>> - On Jan 7th the Timing-Allow-Origin: * header started being served for
>> all images. In itself it should have only meant that our performance
>> logging had more complete records, since the presence of the timing data
>> wasn't a prerequisite in our code to recording a hit. So in itself that
>> doesn't explain the spike of sampled events for that metrics.
>> - In our code there was a line specifically filtering out hits based on
>> timing data, when it believed the image load to be a local cache hit. In
>> think that's the main source of the issue. I suspect that when CORS
>> policies aren't in place, this code (which checked that a certain timing
>> value was set to 0) wrongly believed those to be local cache hits and
>> didn't send them to EL when it should have. Essentially because of the CORS
>> situation in production we'd been accidentally suppressing all image
>> performance events for people with recent browsers, and only recording
>> browsers that don't support the timing APIs. Since we haven't really used
>> the detailed timing data yet, this went unnoticed.
>>
>> The header being served on 7th suddenly "unlocked" that timing data and
>> suddenly a large amount of events that were previously stopped at the gate
>> on the client side, based on a wrong condition, started flowing into EL.
>> And the imbalance we had because of the sampling rate config issue made
>> this very visible because China, with its very slow speed, was taking a
>> much larger share of the global stats than it should have.
>>
>> So basically 3 separate bugs combined caused this phenomenon. I think the
>> lesson learned, which we actually applied to the rest of that performance
>> code, is that we're better off not suppressing the client data as long as a
>> user/hit is sampled and we should rather filter the data based on the
>> recorded values later in SQL.
>>
>> Today I'm going to double-check that when CORS policies are missing
>> timing data does have a 0 value that looks like a local cache hit, if I
>> don't post an update to this thread, my assumptions were correct.
>>
>> On Wed, Jan 28, 2015 at 8:33 PM, Gilles Dubuc <[email protected]>
>> wrote:
>>
>>> Looking specifically at China,
>>>
>>> CN on Jan 3rd: 9103 in log file, 9415 in DB
>>> CN on Jan 20th: 13816 in log file, 13586 in DB
>>>
>>> This is further confirmation that EL seems to be fine. Thanks for
>>> pointing me in the direction of the raw logs, I didn't know where they
>>> were. I'll post an update when we've found the source of this issue.
>>>
>>> On Wed, Jan 28, 2015 at 6:20 PM, Gilles Dubuc <[email protected]>
>>> wrote:
>>>
>>>> Does this throughput rate match what you see on the database?
>>>>>
>>>>
>>>> It's hard to tell, because I can't infer how many events per day are
>>>> generated based on what you're showing me. It depends on the granularity of
>>>> the curve's rendering. Even if I knew what the numbers going up to 6.2 on
>>>> the left represent, it still wouldn't tell me how many per hour/per day are
>>>> recorded there.
>>>>
>>>> Are there a bigger number of events that might not be validating for
>>>>> other wikis versus China?
>>>>>
>>>>
>>>> Are you talking about syntactically incorrect events that don't pass
>>>> the schema check?
>>>>
>>>> Wouldn't this happen if your product is "more used" in china than
>>>>> elsewhere?
>>>>>
>>>>
>>>> Sure, but that doesn't really matter, because global trends would
>>>> evolve slowly. The point of tracking the network performance statistics
>>>> worldwide, despite their very uneven distribution, is to immediately spot
>>>> when something very wrong happens. We also track the same stats per wiki,
>>>> and per country (albeit not on a timeline at this point for the per-country
>>>> ones). The bottom line is that the per-country share of recorded EL events
>>>> shouldn't vary wildly like that.
>>>>
>>>> Logs are at: stat1002:/a/eventlogging/
>>>>>
>>>>
>>>> gilles@stat1003:~/20150120$ cat all-events.log-20150120 | grep
>>>> "\"clientValidated\": true" | grep -c "\"schema\":
>>>> \"MultimediaViewerNetworkPerformance\""
>>>> 330061
>>>>
>>>> And in the DB across the tables that have "active" versions, I get a
>>>> total of 31 + 327867 = 327898 events recorded
>>>>
>>>> A difference of 2000ish events, might not be a real difference since
>>>> the date cutoff probably differs between the log file and the timestamp
>>>> contained in its entries.
>>>>
>>>> Now, looking at an earlier date, when things seemed to be stable in
>>>> terms of country-based sampling:
>>>>
>>>> gilles@stat1003:~/20150103$ cat all-events.log-20150103  | grep
>>>> "\"clientValidated\": true" | grep -c "\"schema\":
>>>> \"MultimediaViewerNetworkPerformance\""
>>>> 287712
>>>>
>>>> And in the DB, a total of  5 + 289461 = 289466
>>>>
>>>> So far it does look like everything is making it to the DB, I'll keep
>>>> investigating tomorrow.
>>>>
>>>> On Wed, Jan 28, 2015 at 5:43 PM, Nuria Ruiz <[email protected]>
>>>> wrote:
>>>>
>>>>> Gilles:
>>>>>
>>>>> This event has a "pretty" constant rate of input:
>>>>> http://graphite.wikimedia.org/render/?width=588&height=311&_salt=1422494956.516&from=00%3A00_20141201&until=23%3A59_20150128&target=eventlogging.schema.MultimediaViewerNetworkPerformance.rate
>>>>>
>>>>> And as far as I can see it has not changed (greatly) before and after
>>>>> the 7th. Does this throughput rate match what you see on the database? If
>>>>> so, I would say db matches incoming stream. Now, if it doesn't (somehow 
>>>>> the
>>>>> db has less data) we might have a problem. However, dropped events when
>>>>> inserting on the db will affect all countries equally so I doubt these 
>>>>> will
>>>>> be the source of the discrepancy.
>>>>>
>>>>> Are there a bigger number of events that might not be validating for
>>>>> other wikis versus China?
>>>>> Logs are at: stat1002:/a/eventlogging/archive and the all-events log
>>>>> will have every single one of your events received. Please note that 
>>>>> events
>>>>> are not inserted right away when received, there is a buffer of couple
>>>>> minutes.
>>>>>
>>>>> >This balance shifting over time is really problematic for tracking
>>>>> Media Viewer client-side network performance, because Chinese >traffic
>>>>> suddenly accounting for a bigger or smaller share of the overall recorded
>>>>> events creates big swings in the global >averages/percentiles (since
>>>>> network performance in China is bad).
>>>>>
>>>>> Wouldn't this happen if your product is "more used" in china than
>>>>> elsewhere? If you are counting "absolute" values you will be always skewed
>>>>> to the biggest dataset unless you are doing some kind of scaling.
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Nuria
>>>>>
>>>>> On Wed, Jan 28, 2015 at 4:09 PM, Gilles Dubuc <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I've tracked down an unexplained EL phenomenon that surfaced in our
>>>>>> stats as a false trend in our global stats.
>>>>>>
>>>>>> The data I'm looking at specifically is coming from Media Viewer's
>>>>>> MultimediaViewerNetworkPerformance_* tables.
>>>>>>
>>>>>> Have a look at this graph:
>>>>>> https://docs.google.com/spreadsheets/d/1PJsyzAyj74dctGCl4-09L7LS4AMZRh57G56Xc2qZv6w/edit?usp=sharing
>>>>>> the big change is on Jan 7th/8th
>>>>>>
>>>>>> It shows how many EL events we've recorded, per client-reported
>>>>>> country, over the last 90 days. The sampling factor we use has been
>>>>>> constant for each wiki over that period. Thus, the distribution shouldn't
>>>>>> evolve drastically, aside from seasonal/local trends. Besides the Ukraine
>>>>>> spike on a particular date (probably related to world events), the graph
>>>>>> before Jan 7th looks like what you would expect. Then, following the 
>>>>>> outage
>>>>>> that happened on Jan 7th, not only the balance is completely changed, but
>>>>>> it evolves over time (the US and China are keeping "higher than normal"
>>>>>> levels, while the rest seems to slide down lower than pre-7th 
>>>>>> quantities),
>>>>>> showing me that something strange is happening and is probably 
>>>>>> unresolved.
>>>>>>
>>>>>> This balance shifting over time is really problematic for tracking
>>>>>> Media Viewer client-side network performance, because Chinese traffic
>>>>>> suddenly accounting for a bigger or smaller share of the overall recorded
>>>>>> events creates big swings in the global averages/percentiles (since 
>>>>>> network
>>>>>> performance in China is bad).
>>>>>>
>>>>>> _______________________________________________
>>>>>> Analytics mailing list
>>>>>> [email protected]
>>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>>>
>>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Analytics mailing list
>>>>> [email protected]
>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>>
>>>>>
>>>>
>>>
>>
>
> _______________________________________________
> Analytics mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>
_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics

Reply via email to