jsfiddle confirming my hunch about CORS and the misguided conditional:
http://jsfiddle.net/2ycra7y0/4/

On Thu, Jan 29, 2015 at 10:00 AM, Gilles Dubuc <[email protected]> wrote:

> I think I've found where this is coming from. Nothing to do with EL:
>
> - When Media Viewer was launched to production, we forgot about our list
> of beta sites in the config repo and as a result, wikis that had not been
> through the beta (Chinese Wikipedia being one of them) ended up being
> sampled 100 times more than all the other production wikis for this
> particular metrics (network performance). Which made China, where network
> performance is particularly bad, overrepresented in the global figures.
> - On Jan 7th the Timing-Allow-Origin: * header started being served for
> all images. In itself it should have only meant that our performance
> logging had more complete records, since the presence of the timing data
> wasn't a prerequisite in our code to recording a hit. So in itself that
> doesn't explain the spike of sampled events for that metrics.
> - In our code there was a line specifically filtering out hits based on
> timing data, when it believed the image load to be a local cache hit. In
> think that's the main source of the issue. I suspect that when CORS
> policies aren't in place, this code (which checked that a certain timing
> value was set to 0) wrongly believed those to be local cache hits and
> didn't send them to EL when it should have. Essentially because of the CORS
> situation in production we'd been accidentally suppressing all image
> performance events for people with recent browsers, and only recording
> browsers that don't support the timing APIs. Since we haven't really used
> the detailed timing data yet, this went unnoticed.
>
> The header being served on 7th suddenly "unlocked" that timing data and
> suddenly a large amount of events that were previously stopped at the gate
> on the client side, based on a wrong condition, started flowing into EL.
> And the imbalance we had because of the sampling rate config issue made
> this very visible because China, with its very slow speed, was taking a
> much larger share of the global stats than it should have.
>
> So basically 3 separate bugs combined caused this phenomenon. I think the
> lesson learned, which we actually applied to the rest of that performance
> code, is that we're better off not suppressing the client data as long as a
> user/hit is sampled and we should rather filter the data based on the
> recorded values later in SQL.
>
> Today I'm going to double-check that when CORS policies are missing timing
> data does have a 0 value that looks like a local cache hit, if I don't post
> an update to this thread, my assumptions were correct.
>
> On Wed, Jan 28, 2015 at 8:33 PM, Gilles Dubuc <[email protected]>
> wrote:
>
>> Looking specifically at China,
>>
>> CN on Jan 3rd: 9103 in log file, 9415 in DB
>> CN on Jan 20th: 13816 in log file, 13586 in DB
>>
>> This is further confirmation that EL seems to be fine. Thanks for
>> pointing me in the direction of the raw logs, I didn't know where they
>> were. I'll post an update when we've found the source of this issue.
>>
>> On Wed, Jan 28, 2015 at 6:20 PM, Gilles Dubuc <[email protected]>
>> wrote:
>>
>>> Does this throughput rate match what you see on the database?
>>>>
>>>
>>> It's hard to tell, because I can't infer how many events per day are
>>> generated based on what you're showing me. It depends on the granularity of
>>> the curve's rendering. Even if I knew what the numbers going up to 6.2 on
>>> the left represent, it still wouldn't tell me how many per hour/per day are
>>> recorded there.
>>>
>>> Are there a bigger number of events that might not be validating for
>>>> other wikis versus China?
>>>>
>>>
>>> Are you talking about syntactically incorrect events that don't pass the
>>> schema check?
>>>
>>> Wouldn't this happen if your product is "more used" in china than
>>>> elsewhere?
>>>>
>>>
>>> Sure, but that doesn't really matter, because global trends would evolve
>>> slowly. The point of tracking the network performance statistics worldwide,
>>> despite their very uneven distribution, is to immediately spot when
>>> something very wrong happens. We also track the same stats per wiki, and
>>> per country (albeit not on a timeline at this point for the per-country
>>> ones). The bottom line is that the per-country share of recorded EL events
>>> shouldn't vary wildly like that.
>>>
>>> Logs are at: stat1002:/a/eventlogging/
>>>>
>>>
>>> gilles@stat1003:~/20150120$ cat all-events.log-20150120 | grep
>>> "\"clientValidated\": true" | grep -c "\"schema\":
>>> \"MultimediaViewerNetworkPerformance\""
>>> 330061
>>>
>>> And in the DB across the tables that have "active" versions, I get a
>>> total of 31 + 327867 = 327898 events recorded
>>>
>>> A difference of 2000ish events, might not be a real difference since the
>>> date cutoff probably differs between the log file and the timestamp
>>> contained in its entries.
>>>
>>> Now, looking at an earlier date, when things seemed to be stable in
>>> terms of country-based sampling:
>>>
>>> gilles@stat1003:~/20150103$ cat all-events.log-20150103  | grep
>>> "\"clientValidated\": true" | grep -c "\"schema\":
>>> \"MultimediaViewerNetworkPerformance\""
>>> 287712
>>>
>>> And in the DB, a total of  5 + 289461 = 289466
>>>
>>> So far it does look like everything is making it to the DB, I'll keep
>>> investigating tomorrow.
>>>
>>> On Wed, Jan 28, 2015 at 5:43 PM, Nuria Ruiz <[email protected]> wrote:
>>>
>>>> Gilles:
>>>>
>>>> This event has a "pretty" constant rate of input:
>>>> http://graphite.wikimedia.org/render/?width=588&height=311&_salt=1422494956.516&from=00%3A00_20141201&until=23%3A59_20150128&target=eventlogging.schema.MultimediaViewerNetworkPerformance.rate
>>>>
>>>> And as far as I can see it has not changed (greatly) before and after
>>>> the 7th. Does this throughput rate match what you see on the database? If
>>>> so, I would say db matches incoming stream. Now, if it doesn't (somehow the
>>>> db has less data) we might have a problem. However, dropped events when
>>>> inserting on the db will affect all countries equally so I doubt these will
>>>> be the source of the discrepancy.
>>>>
>>>> Are there a bigger number of events that might not be validating for
>>>> other wikis versus China?
>>>> Logs are at: stat1002:/a/eventlogging/archive and the all-events log
>>>> will have every single one of your events received. Please note that events
>>>> are not inserted right away when received, there is a buffer of couple
>>>> minutes.
>>>>
>>>> >This balance shifting over time is really problematic for tracking
>>>> Media Viewer client-side network performance, because Chinese >traffic
>>>> suddenly accounting for a bigger or smaller share of the overall recorded
>>>> events creates big swings in the global >averages/percentiles (since
>>>> network performance in China is bad).
>>>>
>>>> Wouldn't this happen if your product is "more used" in china than
>>>> elsewhere? If you are counting "absolute" values you will be always skewed
>>>> to the biggest dataset unless you are doing some kind of scaling.
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Nuria
>>>>
>>>> On Wed, Jan 28, 2015 at 4:09 PM, Gilles Dubuc <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I've tracked down an unexplained EL phenomenon that surfaced in our
>>>>> stats as a false trend in our global stats.
>>>>>
>>>>> The data I'm looking at specifically is coming from Media Viewer's
>>>>> MultimediaViewerNetworkPerformance_* tables.
>>>>>
>>>>> Have a look at this graph:
>>>>> https://docs.google.com/spreadsheets/d/1PJsyzAyj74dctGCl4-09L7LS4AMZRh57G56Xc2qZv6w/edit?usp=sharing
>>>>> the big change is on Jan 7th/8th
>>>>>
>>>>> It shows how many EL events we've recorded, per client-reported
>>>>> country, over the last 90 days. The sampling factor we use has been
>>>>> constant for each wiki over that period. Thus, the distribution shouldn't
>>>>> evolve drastically, aside from seasonal/local trends. Besides the Ukraine
>>>>> spike on a particular date (probably related to world events), the graph
>>>>> before Jan 7th looks like what you would expect. Then, following the 
>>>>> outage
>>>>> that happened on Jan 7th, not only the balance is completely changed, but
>>>>> it evolves over time (the US and China are keeping "higher than normal"
>>>>> levels, while the rest seems to slide down lower than pre-7th quantities),
>>>>> showing me that something strange is happening and is probably unresolved.
>>>>>
>>>>> This balance shifting over time is really problematic for tracking
>>>>> Media Viewer client-side network performance, because Chinese traffic
>>>>> suddenly accounting for a bigger or smaller share of the overall recorded
>>>>> events creates big swings in the global averages/percentiles (since 
>>>>> network
>>>>> performance in China is bad).
>>>>>
>>>>> _______________________________________________
>>>>> Analytics mailing list
>>>>> [email protected]
>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>>
>>>>>
>>>>
>>>> _______________________________________________
>>>> Analytics mailing list
>>>> [email protected]
>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>
>>>>
>>>
>>
>
_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics

Reply via email to