jsfiddle confirming my hunch about CORS and the misguided conditional: http://jsfiddle.net/2ycra7y0/4/
On Thu, Jan 29, 2015 at 10:00 AM, Gilles Dubuc <[email protected]> wrote: > I think I've found where this is coming from. Nothing to do with EL: > > - When Media Viewer was launched to production, we forgot about our list > of beta sites in the config repo and as a result, wikis that had not been > through the beta (Chinese Wikipedia being one of them) ended up being > sampled 100 times more than all the other production wikis for this > particular metrics (network performance). Which made China, where network > performance is particularly bad, overrepresented in the global figures. > - On Jan 7th the Timing-Allow-Origin: * header started being served for > all images. In itself it should have only meant that our performance > logging had more complete records, since the presence of the timing data > wasn't a prerequisite in our code to recording a hit. So in itself that > doesn't explain the spike of sampled events for that metrics. > - In our code there was a line specifically filtering out hits based on > timing data, when it believed the image load to be a local cache hit. In > think that's the main source of the issue. I suspect that when CORS > policies aren't in place, this code (which checked that a certain timing > value was set to 0) wrongly believed those to be local cache hits and > didn't send them to EL when it should have. Essentially because of the CORS > situation in production we'd been accidentally suppressing all image > performance events for people with recent browsers, and only recording > browsers that don't support the timing APIs. Since we haven't really used > the detailed timing data yet, this went unnoticed. > > The header being served on 7th suddenly "unlocked" that timing data and > suddenly a large amount of events that were previously stopped at the gate > on the client side, based on a wrong condition, started flowing into EL. > And the imbalance we had because of the sampling rate config issue made > this very visible because China, with its very slow speed, was taking a > much larger share of the global stats than it should have. > > So basically 3 separate bugs combined caused this phenomenon. I think the > lesson learned, which we actually applied to the rest of that performance > code, is that we're better off not suppressing the client data as long as a > user/hit is sampled and we should rather filter the data based on the > recorded values later in SQL. > > Today I'm going to double-check that when CORS policies are missing timing > data does have a 0 value that looks like a local cache hit, if I don't post > an update to this thread, my assumptions were correct. > > On Wed, Jan 28, 2015 at 8:33 PM, Gilles Dubuc <[email protected]> > wrote: > >> Looking specifically at China, >> >> CN on Jan 3rd: 9103 in log file, 9415 in DB >> CN on Jan 20th: 13816 in log file, 13586 in DB >> >> This is further confirmation that EL seems to be fine. Thanks for >> pointing me in the direction of the raw logs, I didn't know where they >> were. I'll post an update when we've found the source of this issue. >> >> On Wed, Jan 28, 2015 at 6:20 PM, Gilles Dubuc <[email protected]> >> wrote: >> >>> Does this throughput rate match what you see on the database? >>>> >>> >>> It's hard to tell, because I can't infer how many events per day are >>> generated based on what you're showing me. It depends on the granularity of >>> the curve's rendering. Even if I knew what the numbers going up to 6.2 on >>> the left represent, it still wouldn't tell me how many per hour/per day are >>> recorded there. >>> >>> Are there a bigger number of events that might not be validating for >>>> other wikis versus China? >>>> >>> >>> Are you talking about syntactically incorrect events that don't pass the >>> schema check? >>> >>> Wouldn't this happen if your product is "more used" in china than >>>> elsewhere? >>>> >>> >>> Sure, but that doesn't really matter, because global trends would evolve >>> slowly. The point of tracking the network performance statistics worldwide, >>> despite their very uneven distribution, is to immediately spot when >>> something very wrong happens. We also track the same stats per wiki, and >>> per country (albeit not on a timeline at this point for the per-country >>> ones). The bottom line is that the per-country share of recorded EL events >>> shouldn't vary wildly like that. >>> >>> Logs are at: stat1002:/a/eventlogging/ >>>> >>> >>> gilles@stat1003:~/20150120$ cat all-events.log-20150120 | grep >>> "\"clientValidated\": true" | grep -c "\"schema\": >>> \"MultimediaViewerNetworkPerformance\"" >>> 330061 >>> >>> And in the DB across the tables that have "active" versions, I get a >>> total of 31 + 327867 = 327898 events recorded >>> >>> A difference of 2000ish events, might not be a real difference since the >>> date cutoff probably differs between the log file and the timestamp >>> contained in its entries. >>> >>> Now, looking at an earlier date, when things seemed to be stable in >>> terms of country-based sampling: >>> >>> gilles@stat1003:~/20150103$ cat all-events.log-20150103 | grep >>> "\"clientValidated\": true" | grep -c "\"schema\": >>> \"MultimediaViewerNetworkPerformance\"" >>> 287712 >>> >>> And in the DB, a total of 5 + 289461 = 289466 >>> >>> So far it does look like everything is making it to the DB, I'll keep >>> investigating tomorrow. >>> >>> On Wed, Jan 28, 2015 at 5:43 PM, Nuria Ruiz <[email protected]> wrote: >>> >>>> Gilles: >>>> >>>> This event has a "pretty" constant rate of input: >>>> http://graphite.wikimedia.org/render/?width=588&height=311&_salt=1422494956.516&from=00%3A00_20141201&until=23%3A59_20150128&target=eventlogging.schema.MultimediaViewerNetworkPerformance.rate >>>> >>>> And as far as I can see it has not changed (greatly) before and after >>>> the 7th. Does this throughput rate match what you see on the database? If >>>> so, I would say db matches incoming stream. Now, if it doesn't (somehow the >>>> db has less data) we might have a problem. However, dropped events when >>>> inserting on the db will affect all countries equally so I doubt these will >>>> be the source of the discrepancy. >>>> >>>> Are there a bigger number of events that might not be validating for >>>> other wikis versus China? >>>> Logs are at: stat1002:/a/eventlogging/archive and the all-events log >>>> will have every single one of your events received. Please note that events >>>> are not inserted right away when received, there is a buffer of couple >>>> minutes. >>>> >>>> >This balance shifting over time is really problematic for tracking >>>> Media Viewer client-side network performance, because Chinese >traffic >>>> suddenly accounting for a bigger or smaller share of the overall recorded >>>> events creates big swings in the global >averages/percentiles (since >>>> network performance in China is bad). >>>> >>>> Wouldn't this happen if your product is "more used" in china than >>>> elsewhere? If you are counting "absolute" values you will be always skewed >>>> to the biggest dataset unless you are doing some kind of scaling. >>>> >>>> >>>> Thanks, >>>> >>>> Nuria >>>> >>>> On Wed, Jan 28, 2015 at 4:09 PM, Gilles Dubuc <[email protected]> >>>> wrote: >>>> >>>>> Hi all, >>>>> >>>>> I've tracked down an unexplained EL phenomenon that surfaced in our >>>>> stats as a false trend in our global stats. >>>>> >>>>> The data I'm looking at specifically is coming from Media Viewer's >>>>> MultimediaViewerNetworkPerformance_* tables. >>>>> >>>>> Have a look at this graph: >>>>> https://docs.google.com/spreadsheets/d/1PJsyzAyj74dctGCl4-09L7LS4AMZRh57G56Xc2qZv6w/edit?usp=sharing >>>>> the big change is on Jan 7th/8th >>>>> >>>>> It shows how many EL events we've recorded, per client-reported >>>>> country, over the last 90 days. The sampling factor we use has been >>>>> constant for each wiki over that period. Thus, the distribution shouldn't >>>>> evolve drastically, aside from seasonal/local trends. Besides the Ukraine >>>>> spike on a particular date (probably related to world events), the graph >>>>> before Jan 7th looks like what you would expect. Then, following the >>>>> outage >>>>> that happened on Jan 7th, not only the balance is completely changed, but >>>>> it evolves over time (the US and China are keeping "higher than normal" >>>>> levels, while the rest seems to slide down lower than pre-7th quantities), >>>>> showing me that something strange is happening and is probably unresolved. >>>>> >>>>> This balance shifting over time is really problematic for tracking >>>>> Media Viewer client-side network performance, because Chinese traffic >>>>> suddenly accounting for a bigger or smaller share of the overall recorded >>>>> events creates big swings in the global averages/percentiles (since >>>>> network >>>>> performance in China is bad). >>>>> >>>>> _______________________________________________ >>>>> Analytics mailing list >>>>> [email protected] >>>>> https://lists.wikimedia.org/mailman/listinfo/analytics >>>>> >>>>> >>>> >>>> _______________________________________________ >>>> Analytics mailing list >>>> [email protected] >>>> https://lists.wikimedia.org/mailman/listinfo/analytics >>>> >>>> >>> >> >
_______________________________________________ Analytics mailing list [email protected] https://lists.wikimedia.org/mailman/listinfo/analytics
