Super good job. It was probably not easy to figure out that perfect storm
of events.

>So basically 3 separate bugs combined caused this phenomenon. I think the
lesson learned, which we actually applied to the rest >of that performance
code, is that we're better off not suppressing the client data as long as a
user/hit is sampled and we should >rather filter the data based on the
recorded values later in SQL.
Much Agreed. When loading perf data the standard is to expect to distinct
distributions for clients with hot and cold cache.





On Thu, Jan 29, 2015 at 12:22 PM, Dan Andreescu <[email protected]>
wrote:

> That makes sense Gilles - the "local detection" seems too sophisticated in
> this case, outliers and stuff like that can be figured out in SQL.
>
> On Thu, Jan 29, 2015 at 2:17 PM, Gilles Dubuc <[email protected]>
> wrote:
>
>> jsfiddle confirming my hunch about CORS and the misguided conditional:
>> http://jsfiddle.net/2ycra7y0/4/
>>
>> On Thu, Jan 29, 2015 at 10:00 AM, Gilles Dubuc <[email protected]>
>> wrote:
>>
>>> I think I've found where this is coming from. Nothing to do with EL:
>>>
>>> - When Media Viewer was launched to production, we forgot about our list
>>> of beta sites in the config repo and as a result, wikis that had not been
>>> through the beta (Chinese Wikipedia being one of them) ended up being
>>> sampled 100 times more than all the other production wikis for this
>>> particular metrics (network performance). Which made China, where network
>>> performance is particularly bad, overrepresented in the global figures.
>>> - On Jan 7th the Timing-Allow-Origin: * header started being served for
>>> all images. In itself it should have only meant that our performance
>>> logging had more complete records, since the presence of the timing data
>>> wasn't a prerequisite in our code to recording a hit. So in itself that
>>> doesn't explain the spike of sampled events for that metrics.
>>> - In our code there was a line specifically filtering out hits based on
>>> timing data, when it believed the image load to be a local cache hit. In
>>> think that's the main source of the issue. I suspect that when CORS
>>> policies aren't in place, this code (which checked that a certain timing
>>> value was set to 0) wrongly believed those to be local cache hits and
>>> didn't send them to EL when it should have. Essentially because of the CORS
>>> situation in production we'd been accidentally suppressing all image
>>> performance events for people with recent browsers, and only recording
>>> browsers that don't support the timing APIs. Since we haven't really used
>>> the detailed timing data yet, this went unnoticed.
>>>
>>> The header being served on 7th suddenly "unlocked" that timing data and
>>> suddenly a large amount of events that were previously stopped at the gate
>>> on the client side, based on a wrong condition, started flowing into EL.
>>> And the imbalance we had because of the sampling rate config issue made
>>> this very visible because China, with its very slow speed, was taking a
>>> much larger share of the global stats than it should have.
>>>
>>> So basically 3 separate bugs combined caused this phenomenon. I think
>>> the lesson learned, which we actually applied to the rest of that
>>> performance code, is that we're better off not suppressing the client data
>>> as long as a user/hit is sampled and we should rather filter the data based
>>> on the recorded values later in SQL.
>>>
>>> Today I'm going to double-check that when CORS policies are missing
>>> timing data does have a 0 value that looks like a local cache hit, if I
>>> don't post an update to this thread, my assumptions were correct.
>>>
>>> On Wed, Jan 28, 2015 at 8:33 PM, Gilles Dubuc <[email protected]>
>>> wrote:
>>>
>>>> Looking specifically at China,
>>>>
>>>> CN on Jan 3rd: 9103 in log file, 9415 in DB
>>>> CN on Jan 20th: 13816 in log file, 13586 in DB
>>>>
>>>> This is further confirmation that EL seems to be fine. Thanks for
>>>> pointing me in the direction of the raw logs, I didn't know where they
>>>> were. I'll post an update when we've found the source of this issue.
>>>>
>>>> On Wed, Jan 28, 2015 at 6:20 PM, Gilles Dubuc <[email protected]>
>>>> wrote:
>>>>
>>>>> Does this throughput rate match what you see on the database?
>>>>>>
>>>>>
>>>>> It's hard to tell, because I can't infer how many events per day are
>>>>> generated based on what you're showing me. It depends on the granularity 
>>>>> of
>>>>> the curve's rendering. Even if I knew what the numbers going up to 6.2 on
>>>>> the left represent, it still wouldn't tell me how many per hour/per day 
>>>>> are
>>>>> recorded there.
>>>>>
>>>>> Are there a bigger number of events that might not be validating for
>>>>>> other wikis versus China?
>>>>>>
>>>>>
>>>>> Are you talking about syntactically incorrect events that don't pass
>>>>> the schema check?
>>>>>
>>>>> Wouldn't this happen if your product is "more used" in china than
>>>>>> elsewhere?
>>>>>>
>>>>>
>>>>> Sure, but that doesn't really matter, because global trends would
>>>>> evolve slowly. The point of tracking the network performance statistics
>>>>> worldwide, despite their very uneven distribution, is to immediately spot
>>>>> when something very wrong happens. We also track the same stats per wiki,
>>>>> and per country (albeit not on a timeline at this point for the 
>>>>> per-country
>>>>> ones). The bottom line is that the per-country share of recorded EL events
>>>>> shouldn't vary wildly like that.
>>>>>
>>>>> Logs are at: stat1002:/a/eventlogging/
>>>>>>
>>>>>
>>>>> gilles@stat1003:~/20150120$ cat all-events.log-20150120 | grep
>>>>> "\"clientValidated\": true" | grep -c "\"schema\":
>>>>> \"MultimediaViewerNetworkPerformance\""
>>>>> 330061
>>>>>
>>>>> And in the DB across the tables that have "active" versions, I get a
>>>>> total of 31 + 327867 = 327898 events recorded
>>>>>
>>>>> A difference of 2000ish events, might not be a real difference since
>>>>> the date cutoff probably differs between the log file and the timestamp
>>>>> contained in its entries.
>>>>>
>>>>> Now, looking at an earlier date, when things seemed to be stable in
>>>>> terms of country-based sampling:
>>>>>
>>>>> gilles@stat1003:~/20150103$ cat all-events.log-20150103  | grep
>>>>> "\"clientValidated\": true" | grep -c "\"schema\":
>>>>> \"MultimediaViewerNetworkPerformance\""
>>>>> 287712
>>>>>
>>>>> And in the DB, a total of  5 + 289461 = 289466
>>>>>
>>>>> So far it does look like everything is making it to the DB, I'll keep
>>>>> investigating tomorrow.
>>>>>
>>>>> On Wed, Jan 28, 2015 at 5:43 PM, Nuria Ruiz <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Gilles:
>>>>>>
>>>>>> This event has a "pretty" constant rate of input:
>>>>>> http://graphite.wikimedia.org/render/?width=588&height=311&_salt=1422494956.516&from=00%3A00_20141201&until=23%3A59_20150128&target=eventlogging.schema.MultimediaViewerNetworkPerformance.rate
>>>>>>
>>>>>> And as far as I can see it has not changed (greatly) before and after
>>>>>> the 7th. Does this throughput rate match what you see on the database? If
>>>>>> so, I would say db matches incoming stream. Now, if it doesn't (somehow 
>>>>>> the
>>>>>> db has less data) we might have a problem. However, dropped events when
>>>>>> inserting on the db will affect all countries equally so I doubt these 
>>>>>> will
>>>>>> be the source of the discrepancy.
>>>>>>
>>>>>> Are there a bigger number of events that might not be validating for
>>>>>> other wikis versus China?
>>>>>> Logs are at: stat1002:/a/eventlogging/archive and the all-events log
>>>>>> will have every single one of your events received. Please note that 
>>>>>> events
>>>>>> are not inserted right away when received, there is a buffer of couple
>>>>>> minutes.
>>>>>>
>>>>>> >This balance shifting over time is really problematic for tracking
>>>>>> Media Viewer client-side network performance, because Chinese >traffic
>>>>>> suddenly accounting for a bigger or smaller share of the overall recorded
>>>>>> events creates big swings in the global >averages/percentiles (since
>>>>>> network performance in China is bad).
>>>>>>
>>>>>> Wouldn't this happen if your product is "more used" in china than
>>>>>> elsewhere? If you are counting "absolute" values you will be always 
>>>>>> skewed
>>>>>> to the biggest dataset unless you are doing some kind of scaling.
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Nuria
>>>>>>
>>>>>> On Wed, Jan 28, 2015 at 4:09 PM, Gilles Dubuc <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I've tracked down an unexplained EL phenomenon that surfaced in our
>>>>>>> stats as a false trend in our global stats.
>>>>>>>
>>>>>>> The data I'm looking at specifically is coming from Media Viewer's
>>>>>>> MultimediaViewerNetworkPerformance_* tables.
>>>>>>>
>>>>>>> Have a look at this graph:
>>>>>>> https://docs.google.com/spreadsheets/d/1PJsyzAyj74dctGCl4-09L7LS4AMZRh57G56Xc2qZv6w/edit?usp=sharing
>>>>>>> the big change is on Jan 7th/8th
>>>>>>>
>>>>>>> It shows how many EL events we've recorded, per client-reported
>>>>>>> country, over the last 90 days. The sampling factor we use has been
>>>>>>> constant for each wiki over that period. Thus, the distribution 
>>>>>>> shouldn't
>>>>>>> evolve drastically, aside from seasonal/local trends. Besides the 
>>>>>>> Ukraine
>>>>>>> spike on a particular date (probably related to world events), the graph
>>>>>>> before Jan 7th looks like what you would expect. Then, following the 
>>>>>>> outage
>>>>>>> that happened on Jan 7th, not only the balance is completely changed, 
>>>>>>> but
>>>>>>> it evolves over time (the US and China are keeping "higher than normal"
>>>>>>> levels, while the rest seems to slide down lower than pre-7th 
>>>>>>> quantities),
>>>>>>> showing me that something strange is happening and is probably 
>>>>>>> unresolved.
>>>>>>>
>>>>>>> This balance shifting over time is really problematic for tracking
>>>>>>> Media Viewer client-side network performance, because Chinese traffic
>>>>>>> suddenly accounting for a bigger or smaller share of the overall 
>>>>>>> recorded
>>>>>>> events creates big swings in the global averages/percentiles (since 
>>>>>>> network
>>>>>>> performance in China is bad).
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Analytics mailing list
>>>>>>> [email protected]
>>>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Analytics mailing list
>>>>>> [email protected]
>>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>> _______________________________________________
>> Analytics mailing list
>> [email protected]
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>>
>
> _______________________________________________
> Analytics mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>
_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics

Reply via email to