I just linked your results from https://phabricator.wikimedia.org/T58575, but I really think they should be more widely known. Would you mind writing a mail to wikitech@ or engineering@ about this finding?
Gabriel

On Sun, Mar 1, 2015 at 6:24 PM, Nuria Ruiz <[email protected]> wrote:

> > Note that a couple of days' worth of traffic might be more than 1
> > billion requests for javascript on bits.
> Sorry, correction. A couple of days' worth of "javascript bits" requests
> comes to 100 million requests, not 1,000 million.
>
> On Sun, Mar 1, 2015 at 4:35 PM, Nuria Ruiz <[email protected]> wrote:
>
>> Thanks Timo for taking the time to write this.
>>
>> > The following requests are not part of our primary javascript payload
>> > and should be excluded when interpreting bits.wikimedia.org requests
>> > for purposes of javascript "support":
>> Correct. I think I excluded all those.
>> Note that under methodology I listed "bits javascript traffic", not
>> overall "bits traffic":
>> https://www.mediawiki.org/wiki/Analytics/Reports/ClientsWithoutJavascript#Metodology
>>
>> I will double check the startup module just to be safe.
>>
>> > There are also non-MediaWiki environments (ab)using bits.wikimedia.org
>> > and bypassing the startup module. As such these are loading javascript
>> > modules directly, regardless of browser. There are at least two of
>> > these that I know of:
>> I think our raw hive data probably does not include the traffic from
>> tools or wikipedia.org (need to confirm). But even if it did, the
>> traffic of tools on bits is not significant compared to that from
>> Wikipedia, so it does not affect the overall results, as we are throwing
>> away the long tail. Note that a couple of days' worth of traffic might be
>> more than 1 billion requests for javascript on bits.
>>
>> > Actually, there are probably about a dozen more exceptions I can think
>> > of. I don't believe it is feasibly possible to filter everything out.
>> Statistically I do not think you need to: given the volume of traffic on
>> Wikipedia versus the other sources, you just cannot report results with a
>> precision of, say, 0.001%.
>> Even very small wikis, whose traffic is insignificant compared to
>> English Wikipedia, are also being thrown away. That is to say that if on
>> the Basque Wikipedia everyone started using "browser X" w/o Javascript
>> support, it would not be counted, as it represents too small a
>> percentage of overall traffic. Results provided are an aggregation over
>> all Wikipedias' raw bits javascript traffic versus Wikipedias' overall
>> pageviews. Because we are throwing away the long tail, results come from
>> the most trafficked wikis (our disparity in pageviews among wikis is
>> huge). If you want to get per-wiki results you need to analyze the data
>> in a completely different fashion.
>>
>> On Sat, Feb 28, 2015 at 4:48 PM, Timo Tijhof <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> Here's a few thoughts about what may influence the data you're gathering.
>>>
>>> The decision of whether a browser has sufficient support for our Grade A
>>> runtime happens client-side based on a combination of feature tests and
>>> (unfortunately) user-agent sniffing.
>>>
>>> For this reason, our bootstrap script is written using only the most
>>> basic syntax and prototype methods (as any other methods would cause a
>>> run-time exception). For those familiar, this is somewhat similar to PHP
>>> version detection in MediaWiki. The file has to parse and run to a certain
>>> point in very old environments.
>>>
>>> The following requests are not part of our primary javascript payload
>>> and should be excluded when interpreting bits.wikimedia.org requests
>>> for purposes of javascript "support":
>>>
>>> * stylesheets (e.g. ".css" requests as well as load.php?...&only=styles
>>> requests)
>>> * images (e.g. ".png", ".svg" etc. as well as load.php?...&image=..
>>> requests)
>>> * favicons and apple-touch icons (e.g. bits.wikimedia.org/favicon/..,
>>> bits.wikimedia.org/apple-touch/..)
>>> * fonts (e.g. bits.wikimedia.org/static-../../fonts/..)
>>> * events (e.g.
>>> bits.wikimedia.org/event.gif, bits.wikimedia.org/statsv)
>>> * startup module (bits.wikimedia.org/../load.php?..modules=startup)
>>>
>>> There are also non-MediaWiki environments (ab)using bits.wikimedia.org
>>> and bypassing the startup module. As such these are loading javascript
>>> modules directly, regardless of browser. There are at least two of these
>>> that I know of:
>>>
>>> 1) Tool labs tools. Developers there may use bits.wikimedia.org to
>>> serve modules like jQuery UI. They may circumvent the startup module and
>>> unconditionally load those (which will cause errors in older browsers, but
>>> they don't care or are unaware of how this works).
>>>
>>> 2) Portals such as www.wikipedia.org and others.
>>>
>>> For the data to be as reliable as feasibly possible, one would want to
>>> filter out these "forged" requests not produced by MediaWiki. The best way
>>> to filter out requests that bypassed the startup module is to filter out
>>> requests with no version= query parameter, as well as requests with an
>>> outdated version parameter (since they can copy an old url and hardcode it
>>> in their app).
>>>
>>> Actually, there are probably about a dozen more exceptions I can think
>>> of. I don't believe it is feasibly possible to filter everything out.
>>> Perhaps focus your next data-gathering window on a specific payload url
>>> instead of trying to catch all javascript payloads with exclusions for
>>> wrong ones.
>>>
>>> For example, right now in MediaWiki 1.25wmf18 the jquery/mediawiki base
>>> payload has version 20150225T221331Z and is requested by the startup module
>>> from this url (grabbed from the Network tab in Chrome Dev Tools):
>>>
>>> https://bits.wikimedia.org/en.wikipedia.org/load.php?debug=false&lang=en&modules=jquery%2Cmediawiki&only=scripts&skin=vector&version=20150225T221331Z
>>>
>>> Using only a specific url like that to gather user agents that support
>>> javascript will have considerably fewer false positives.
>>>
>>> If you want to incorporate multiple wikis, it'll be a little more work
>>> to get all the right urls, but handpicking a dozen wikis will probably be
>>> good enough.
>>>
>>> This also has the advantage of not being biased by device cache size,
>>> because, unlike all other modules, the base module is not cached in
>>> LocalStorage. It will still benefit from HTTP 304 caching, however. It
>>> would help to have your window start simultaneously with the deployment
>>> of a new wmf branch to en.wikipedia.org (and other wikis you include in
>>> the experiment) so there's a fresh start with caching.
>>>
>>> </braindump>
>>>
>>> -- Timo
>>>
>>> On 18 Feb 2015, at 18:07, Nuria Ruiz <[email protected]> wrote:
>>>
>>> > Do you think it's worth getting the UA distribution for CSS requests
>>> > & correlate it with the distribution for page / JS loading?
>>> Yes, we can do that. I would need to gather a new dataset for it so I've
>>> made a new task for it (https://phabricator.wikimedia.org/T89847),
>>> marking this one as complete: https://phabricator.wikimedia.org/T88560
>>>
>>> I would also like to do some research regarding IE6/IE7, as we should see
>>> those (according to our code:
>>> https://github.com/wikimedia/mediawiki/blob/master/resources/src/startup.js)
>>> in the no-JS list but we only see some user agents there. There are
>>> definitely IE6/IE7 browsers to which we are serving javascript; I just
>>> have to look in detail at what it is we are serving there. Will report
>>> on this. Looks like this startup.js file is being served to all browsers
>>> regardless, so I might need to do some more fine-grained queries.
>>>
>>> Just consider the 3% as your approximate upper bound for overall
>>> traffic, big bots removed. If you just count mobile traffic, numbers in
>>> percentage are, of course, a lot higher.
>>>
>>> Thanks,
>>>
>>> Nuria
>>>
>>> On 17 Feb 2015, at 03:38, Nuria Ruiz <[email protected]> wrote:
>>>
>>> Gabriel:
>>>
>>> I have run through the data and have a rough estimate of how many of our
>>> pageviews are requested from browsers w/o strong javascript support. It is
>>> a preliminary rough estimate but I think it is pretty useful.
>>>
>>> TL;DR
>>> According to our new pageview definition
>>> (https://meta.wikimedia.org/wiki/Research:Page_view) about 10% of
>>> pageviews come from clients w/o much javascript support. But (BIG CAVEAT)
>>> this includes bot requests. If you remove the easy-to-spot big bots the
>>> percentage is <3%.
>>>
>>> Details here (still some homework to do regarding IE6 and IE7):
>>> https://www.mediawiki.org/wiki/Analytics/Reports/ClientsWithoutJavascript
>>>
>>> Thanks,
>>>
>>> Nuria
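
For reference, the exclusion rules Timo lists in the quoted thread, together with the version= check, could be sketched in Python roughly as follows. This is only an illustration under assumptions: the function name is_primary_js_payload, the constant names and the optional current_versions argument are hypothetical, the suffix list covers only the extensions named in the thread, and the actual analysis ran against Hive request data whose schema is not shown here.

```python
from urllib.parse import urlsplit, parse_qs

# Illustrative constants (assumptions), derived from the exclusion list in the thread.
STATIC_SUFFIXES = (".css", ".png", ".svg")            # stylesheets and images
EXCLUDED_PATH_PARTS = ("/favicon/", "/apple-touch/", "/fonts/")
EVENT_PATHS = ("/event.gif", "/statsv")               # event logging endpoints


def is_primary_js_payload(url, current_versions=None):
    """Return True if a bits URL looks like a javascript payload requested
    via the startup module, following the rules described in the thread.

    current_versions is an optional set of version strings considered
    current; when given, requests carrying any other version are rejected
    as hardcoded/outdated URLs.
    """
    parts = urlsplit(url)
    path = parts.path
    query = parse_qs(parts.query, keep_blank_values=True)

    if path in EVENT_PATHS:
        return False
    if any(part in path for part in EXCLUDED_PATH_PARTS):
        return False
    if path.endswith(STATIC_SUFFIXES):
        return False
    if not path.endswith("/load.php"):
        return False

    # load.php also serves images and stylesheets.
    if "image" in query or query.get("only") == ["styles"]:
        return False
    # The startup module itself is served to every browser.
    if query.get("modules", [""])[0] == "startup":
        return False

    # Requests that bypassed the startup module carry no version= parameter.
    version = query.get("version", [""])[0]
    if not version:
        return False
    if current_versions is not None and version not in current_versions:
        return False
    return True
```

Passing current_versions={"20150225T221331Z"} narrows the match to the base payload version mentioned in the thread; leaving it as None only checks that a version= parameter is present.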
_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
