That makes sense Gilles - the "local detection" seems too sophisticated in this case, outliers and stuff like that can be figured out in SQL.
On Thu, Jan 29, 2015 at 2:17 PM, Gilles Dubuc <[email protected]> wrote: > jsfiddle confirming my hunch about CORS and the misguided conditional: > http://jsfiddle.net/2ycra7y0/4/ > > On Thu, Jan 29, 2015 at 10:00 AM, Gilles Dubuc <[email protected]> > wrote: > >> I think I've found where this is coming from. Nothing to do with EL: >> >> - When Media Viewer was launched to production, we forgot about our list >> of beta sites in the config repo and as a result, wikis that had not been >> through the beta (Chinese Wikipedia being one of them) ended up being >> sampled 100 times more than all the other production wikis for this >> particular metrics (network performance). Which made China, where network >> performance is particularly bad, overrepresented in the global figures. >> - On Jan 7th the Timing-Allow-Origin: * header started being served for >> all images. In itself it should have only meant that our performance >> logging had more complete records, since the presence of the timing data >> wasn't a prerequisite in our code to recording a hit. So in itself that >> doesn't explain the spike of sampled events for that metrics. >> - In our code there was a line specifically filtering out hits based on >> timing data, when it believed the image load to be a local cache hit. In >> think that's the main source of the issue. I suspect that when CORS >> policies aren't in place, this code (which checked that a certain timing >> value was set to 0) wrongly believed those to be local cache hits and >> didn't send them to EL when it should have. Essentially because of the CORS >> situation in production we'd been accidentally suppressing all image >> performance events for people with recent browsers, and only recording >> browsers that don't support the timing APIs. Since we haven't really used >> the detailed timing data yet, this went unnoticed. >> >> The header being served on 7th suddenly "unlocked" that timing data and >> suddenly a large amount of events that were previously stopped at the gate >> on the client side, based on a wrong condition, started flowing into EL. >> And the imbalance we had because of the sampling rate config issue made >> this very visible because China, with its very slow speed, was taking a >> much larger share of the global stats than it should have. >> >> So basically 3 separate bugs combined caused this phenomenon. I think the >> lesson learned, which we actually applied to the rest of that performance >> code, is that we're better off not suppressing the client data as long as a >> user/hit is sampled and we should rather filter the data based on the >> recorded values later in SQL. >> >> Today I'm going to double-check that when CORS policies are missing >> timing data does have a 0 value that looks like a local cache hit, if I >> don't post an update to this thread, my assumptions were correct. >> >> On Wed, Jan 28, 2015 at 8:33 PM, Gilles Dubuc <[email protected]> >> wrote: >> >>> Looking specifically at China, >>> >>> CN on Jan 3rd: 9103 in log file, 9415 in DB >>> CN on Jan 20th: 13816 in log file, 13586 in DB >>> >>> This is further confirmation that EL seems to be fine. Thanks for >>> pointing me in the direction of the raw logs, I didn't know where they >>> were. I'll post an update when we've found the source of this issue. >>> >>> On Wed, Jan 28, 2015 at 6:20 PM, Gilles Dubuc <[email protected]> >>> wrote: >>> >>>> Does this throughput rate match what you see on the database? >>>>> >>>> >>>> It's hard to tell, because I can't infer how many events per day are >>>> generated based on what you're showing me. It depends on the granularity of >>>> the curve's rendering. Even if I knew what the numbers going up to 6.2 on >>>> the left represent, it still wouldn't tell me how many per hour/per day are >>>> recorded there. >>>> >>>> Are there a bigger number of events that might not be validating for >>>>> other wikis versus China? >>>>> >>>> >>>> Are you talking about syntactically incorrect events that don't pass >>>> the schema check? >>>> >>>> Wouldn't this happen if your product is "more used" in china than >>>>> elsewhere? >>>>> >>>> >>>> Sure, but that doesn't really matter, because global trends would >>>> evolve slowly. The point of tracking the network performance statistics >>>> worldwide, despite their very uneven distribution, is to immediately spot >>>> when something very wrong happens. We also track the same stats per wiki, >>>> and per country (albeit not on a timeline at this point for the per-country >>>> ones). The bottom line is that the per-country share of recorded EL events >>>> shouldn't vary wildly like that. >>>> >>>> Logs are at: stat1002:/a/eventlogging/ >>>>> >>>> >>>> gilles@stat1003:~/20150120$ cat all-events.log-20150120 | grep >>>> "\"clientValidated\": true" | grep -c "\"schema\": >>>> \"MultimediaViewerNetworkPerformance\"" >>>> 330061 >>>> >>>> And in the DB across the tables that have "active" versions, I get a >>>> total of 31 + 327867 = 327898 events recorded >>>> >>>> A difference of 2000ish events, might not be a real difference since >>>> the date cutoff probably differs between the log file and the timestamp >>>> contained in its entries. >>>> >>>> Now, looking at an earlier date, when things seemed to be stable in >>>> terms of country-based sampling: >>>> >>>> gilles@stat1003:~/20150103$ cat all-events.log-20150103 | grep >>>> "\"clientValidated\": true" | grep -c "\"schema\": >>>> \"MultimediaViewerNetworkPerformance\"" >>>> 287712 >>>> >>>> And in the DB, a total of 5 + 289461 = 289466 >>>> >>>> So far it does look like everything is making it to the DB, I'll keep >>>> investigating tomorrow. >>>> >>>> On Wed, Jan 28, 2015 at 5:43 PM, Nuria Ruiz <[email protected]> >>>> wrote: >>>> >>>>> Gilles: >>>>> >>>>> This event has a "pretty" constant rate of input: >>>>> http://graphite.wikimedia.org/render/?width=588&height=311&_salt=1422494956.516&from=00%3A00_20141201&until=23%3A59_20150128&target=eventlogging.schema.MultimediaViewerNetworkPerformance.rate >>>>> >>>>> And as far as I can see it has not changed (greatly) before and after >>>>> the 7th. Does this throughput rate match what you see on the database? If >>>>> so, I would say db matches incoming stream. Now, if it doesn't (somehow >>>>> the >>>>> db has less data) we might have a problem. However, dropped events when >>>>> inserting on the db will affect all countries equally so I doubt these >>>>> will >>>>> be the source of the discrepancy. >>>>> >>>>> Are there a bigger number of events that might not be validating for >>>>> other wikis versus China? >>>>> Logs are at: stat1002:/a/eventlogging/archive and the all-events log >>>>> will have every single one of your events received. Please note that >>>>> events >>>>> are not inserted right away when received, there is a buffer of couple >>>>> minutes. >>>>> >>>>> >This balance shifting over time is really problematic for tracking >>>>> Media Viewer client-side network performance, because Chinese >traffic >>>>> suddenly accounting for a bigger or smaller share of the overall recorded >>>>> events creates big swings in the global >averages/percentiles (since >>>>> network performance in China is bad). >>>>> >>>>> Wouldn't this happen if your product is "more used" in china than >>>>> elsewhere? If you are counting "absolute" values you will be always skewed >>>>> to the biggest dataset unless you are doing some kind of scaling. >>>>> >>>>> >>>>> Thanks, >>>>> >>>>> Nuria >>>>> >>>>> On Wed, Jan 28, 2015 at 4:09 PM, Gilles Dubuc <[email protected]> >>>>> wrote: >>>>> >>>>>> Hi all, >>>>>> >>>>>> I've tracked down an unexplained EL phenomenon that surfaced in our >>>>>> stats as a false trend in our global stats. >>>>>> >>>>>> The data I'm looking at specifically is coming from Media Viewer's >>>>>> MultimediaViewerNetworkPerformance_* tables. >>>>>> >>>>>> Have a look at this graph: >>>>>> https://docs.google.com/spreadsheets/d/1PJsyzAyj74dctGCl4-09L7LS4AMZRh57G56Xc2qZv6w/edit?usp=sharing >>>>>> the big change is on Jan 7th/8th >>>>>> >>>>>> It shows how many EL events we've recorded, per client-reported >>>>>> country, over the last 90 days. The sampling factor we use has been >>>>>> constant for each wiki over that period. Thus, the distribution shouldn't >>>>>> evolve drastically, aside from seasonal/local trends. Besides the Ukraine >>>>>> spike on a particular date (probably related to world events), the graph >>>>>> before Jan 7th looks like what you would expect. Then, following the >>>>>> outage >>>>>> that happened on Jan 7th, not only the balance is completely changed, but >>>>>> it evolves over time (the US and China are keeping "higher than normal" >>>>>> levels, while the rest seems to slide down lower than pre-7th >>>>>> quantities), >>>>>> showing me that something strange is happening and is probably >>>>>> unresolved. >>>>>> >>>>>> This balance shifting over time is really problematic for tracking >>>>>> Media Viewer client-side network performance, because Chinese traffic >>>>>> suddenly accounting for a bigger or smaller share of the overall recorded >>>>>> events creates big swings in the global averages/percentiles (since >>>>>> network >>>>>> performance in China is bad). >>>>>> >>>>>> _______________________________________________ >>>>>> Analytics mailing list >>>>>> [email protected] >>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics >>>>>> >>>>>> >>>>> >>>>> _______________________________________________ >>>>> Analytics mailing list >>>>> [email protected] >>>>> https://lists.wikimedia.org/mailman/listinfo/analytics >>>>> >>>>> >>>> >>> >> > > _______________________________________________ > Analytics mailing list > [email protected] > https://lists.wikimedia.org/mailman/listinfo/analytics > >
_______________________________________________ Analytics mailing list [email protected] https://lists.wikimedia.org/mailman/listinfo/analytics
