I just linked your results from https://phabricator.wikimedia.org/T58575,
but I really think they should be more widely known. Do you mind writing
a mail to wikitech@ or engineering@ about this finding?

Gabriel

On Sun, Mar 1, 2015 at 6:24 PM, Nuria Ruiz <[email protected]> wrote:

> >Note that a couple of days' worth of traffic might be more than a billion
> requests for javascript on bits.
> Sorry, correction: a couple of days' worth of "javascript bits" requests
> comes to 100 million requests, not 1,000 million.
>
> On Sun, Mar 1, 2015 at 4:35 PM, Nuria Ruiz <[email protected]> wrote:
>
>> Thanks Timo for taking the time to write this.
>>
>> >The following requests are not part of our primary javascript payload
>> and should be excluded when interpreting bits.wikimedia.org requests
>> for purposes of javascript "support":
>> Correct. I think I excluded all those.
>> Note that I listed in the methodology "bits javascript traffic", not overall
>> "bits traffic":
>> https://www.mediawiki.org/wiki/Analytics/Reports/ClientsWithoutJavascript#Metodology
>>
>> I will double check the startup module just to be safe.
>>
>> >There are also non-MediaWiki environments (ab)using bits.wikimedia.org and
>> bypassing the startup module. As such, they load javascript modules
>> directly, regardless of browser. There are at least two of these that I
>> know of:
>> I think our raw hive data probably does not include the traffic from
>> tools or wikipedia.org (need to confirm). But even if it did, the
>> traffic from tools on bits is not significant compared to that from
>> wikipedia, and thus does not affect the overall results, as we are throwing
>> away the long tail. Note that a couple of days' worth of traffic might be
>> more than a billion requests for javascript on bits.
>>
>> >Actually, there are probably about a dozen more exceptions I can think
>> of. I don't believe it is feasible to filter everything out.
>> Statistically, I do not think you need to: given the volume of traffic on
>> wikipedia versus the other sources, you just cannot report results with a
>> precision of, say, 0.001%. Even very small wikis - whose traffic is
>> insignificant compared to English wikipedia's - are also being thrown away.
>> That is to say, if on the Basque wikipedia everyone started using
>> "browser X" w/o javascript support, it would not be counted, as it
>> represents too small a percentage of overall traffic. The results provided
>> are an aggregation of raw javascript bits traffic across all wikipedias
>> versus overall wikipedia pageviews. Because we are throwing away the long
>> tail, results come from the most trafficked wikis (the disparity in
>> pageviews among wikis is huge). If you want per-wiki results you need to
>> analyze the data in a completely different fashion.
>>
>> On Sat, Feb 28, 2015 at 4:48 PM, Timo Tijhof <[email protected]>
>> wrote:
>>
>>> Hi,
>>>
>>> Here are a few thoughts about what may influence the data you're gathering.
>>>
>>> The decision of whether a browser has sufficient support for our Grade A
>>> runtime happens client-side based on a combination of feature tests and
>>> (unfortunately) user-agent sniffing.
>>>
>>> For this reason, our bootstrap script is written using only the most
>>> basic syntax and prototype methods (as any other methods would cause a
>>> run-time exception). For those familiar, this is somewhat similar to PHP
>>> version detection in MediaWiki. The file has to parse and run to a certain
>>> point in very old environments.
>>>
>>> The following requests are not part of our primary javascript payload
>>> and should be excluded when interpreting bits.wikimedia.org requests
>>> for purposes of javascript "support":
>>>
>>> * stylesheets (e.g. ".css" requests as well as load.php?...&only=styles
>>> requests)
>>> * images (e.g. ".png", ".svg" etc. as well as load.php?...&image=..
>>> requests)
>>> * favicons and apple-touch icons (e.g. bits.wikimedia.org/favicon/..,
>>> bits.wikimedia.org/apple-touch/..)
>>> * fonts (e.g. bits.wikimedia.org/static-../../fonts/..)
>>> * events (e.g. bits.wikimedia.org/event.gif, bits.wikimedia.org/statsv)
>>> * startup module (bits.wikimedia.org/../load.php?..modules=startup)
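The exclusion list above translates fairly directly into a URL filter. A minimal Python sketch, where the path patterns are assumptions read off the list rather than verified bits.wikimedia.org routes:

```python
import re

# Hypothetical exclusion patterns mirroring the list above; the exact
# bits.wikimedia.org paths are assumptions, not verified routes.
EXCLUDE_PATTERNS = [
    r"\.css(\?|$)", r"only=styles",                  # stylesheets
    r"\.(png|svg|gif|jpe?g)(\?|$)", r"[?&]image=",   # images
    r"/favicon/", r"/apple-touch/",                  # favicons / touch icons
    r"/fonts/",                                      # fonts
    r"/event\.gif", r"/statsv",                      # event beacons
    r"modules=startup",                              # startup module
]
EXCLUDE_RE = re.compile("|".join(EXCLUDE_PATTERNS))

def is_js_payload(url):
    """True if the request looks like part of the primary javascript payload."""
    return EXCLUDE_RE.search(url) is None
```

For example, a `load.php?...&only=scripts` request for jquery/mediawiki passes the filter, while a favicon or `modules=startup` request does not.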
>>>
>>> There are also non-MediaWiki environments (ab)using bits.wikimedia.org
>>> and bypassing the startup module. As such, they load javascript
>>> modules directly, regardless of browser. There are at least two of these
>>> that I know of:
>>>
>>> 1) Tool labs tools. Developers there may use bits.wikimedia.org to
>>> serve modules like jQuery UI. They may circumvent the startup module and
>>> unconditionally load those (which will cause errors in older browsers, but
>>> they don't care or are unaware of how this works).
>>>
>>> 2) Portals such as www.wikipedia.org and others.
>>>
>>> For the data to be as reliable as feasible, one would want to filter out
>>> these "forged" requests not produced by MediaWiki. The best way to catch
>>> requests that bypassed the startup module is to filter out requests with
>>> no version= query parameter, as well as requests with an outdated version
>>> parameter (since developers can copy an old url and hardcode it in their
>>> app).
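That version= heuristic can be sketched as follows. CURRENT_VERSION here is the example timestamp quoted elsewhere in this mail; in practice it would come from the currently deployed branch:

```python
from urllib.parse import urlparse, parse_qs

# Example version timestamp from this thread, not a live value; in practice
# this would be looked up from the currently deployed wmf branch.
CURRENT_VERSION = "20150225T221331Z"

def made_by_startup(url, current_version=CURRENT_VERSION):
    """Heuristic: requests minted by the startup module carry a fresh version=.

    Requests with no version= parameter, or with an outdated one (e.g. an old
    url hardcoded in a tool), are treated as having bypassed the startup module.
    """
    versions = parse_qs(urlparse(url).query).get("version", [])
    return versions == [current_version]
```

A request with no version= or a stale one would be dropped; only urls carrying the current timestamp survive.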
>>>
>>> Actually, there are probably about a dozen more exceptions I can think
>>> of. I don't believe it is feasible to filter everything out.
>>> Perhaps focus your next data-gathering window on a specific payload url -
>>> instead of trying to catch all javascript payloads with exclusions for
>>> wrong ones.
>>>
>>> For example, right now in MediaWiki 1.25wmf18 the jquery/mediawiki base
>>> payload has version 20150225T221331Z and is requested by the startup module
>>> from url (grabbed from the Network tab in Chrome Dev Tools):
>>>
>>>
>>> https://bits.wikimedia.org/en.wikipedia.org/load.php?debug=false&lang=en&modules=jquery%2Cmediawiki&only=scripts&skin=vector&version=20150225T221331Z
>>>
>>> Using only a specific url like that to gather user agents that support
>>> javascript will yield considerably fewer false positives.
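In Python, tallying user agents for just that one payload url might look like this; the `(url, user_agent)` tuple shape for log records is an assumption about the input, not the actual hive schema:

```python
from collections import Counter

# The exact base-payload url quoted above (jquery/mediawiki, en.wikipedia.org).
PAYLOAD_URL = ("https://bits.wikimedia.org/en.wikipedia.org/load.php"
               "?debug=false&lang=en&modules=jquery%2Cmediawiki"
               "&only=scripts&skin=vector&version=20150225T221331Z")

def js_capable_agents(log_records):
    """Count user agents that fetched the exact jquery/mediawiki base payload.

    log_records is assumed to be an iterable of (url, user_agent) tuples;
    only exact matches on the payload url are tallied.
    """
    counts = Counter()
    for url, user_agent in log_records:
        if url == PAYLOAD_URL:
            counts[user_agent] += 1
    return counts
```

Extending this to a handpicked dozen wikis just means matching against a set of such urls instead of one.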
>>>
>>> If you want to incorporate multiple wikis, it'll be a little more work
>>> to get all the right urls, but handpicking a dozen wikis will probably be
>>> good enough.
>>>
>>> This also has the advantage of not being biased by a device's cache size,
>>> because, unlike all other modules, the base module is not cached in
>>> LocalStorage. It will still benefit from HTTP 304 caching, however. It
>>> would help to have your window start simultaneously with the deployment of
>>> a new wmf branch to en.wikipedia.org (and other wikis you include in the
>>> experiment) so there's a fresh start with caching.
>>>
>>> </braindump>
>>>
>>> — Timo
>>>
>>> On 18 Feb 2015, at 18:07, Nuria Ruiz <[email protected]> wrote:
>>>
>>> > Do you think it's worth getting the UA distribution for CSS requests
>>> & correlate it with the distribution for page / JS loading?
>>> Yes, we can do that. I would need to gather a new dataset for it so I've
>>> made a new task for it (https://phabricator.wikimedia.org/T89847),
>>> marking this one as complete: https://phabricator.wikimedia.org/T88560
>>>
>>>
>>> I'd also like to do some research regarding IE6/IE7, as we should see
>>> those (according to our code:
>>> https://github.com/wikimedia/mediawiki/blob/master/resources/src/startup.js)
>>> in the no-JS list, but we only see some UA agents there. There are
>>> definitely IE6/IE7 browsers to which we are serving javascript; I just have
>>> to look in detail at what we are serving there. Will report on this.
>>> It looks like this startup.js file is being served to all browsers
>>> regardless, so I might need to do some more fine-grained queries.
>>>
>>> Just consider the 3% as your approximate upper bound for overall
>>> traffic, with big bots removed. If you count only mobile traffic, the
>>> percentages are, of course, a lot higher.
>>>
>>> Thanks,
>>>
>>> Nuria
>>>
>>>
>>>
>>> On 17 Feb 2015, at 03:38, Nuria Ruiz <[email protected]> wrote:
>>>
>>> Gabriel:
>>>
>>> I have run through the data and have a rough estimate of how many of our
>>> pageviews are requested from browsers w/o strong javascript support. It is
>>> a preliminary rough estimate, but I think it is pretty useful.
>>>
>>> TL;DR
>>> According to our new pageview definition (
>>> https://meta.wikimedia.org/wiki/Research:Page_view) about 10% of
>>> pageviews come from clients w/o much javascript support. But - BIG CAVEAT -
>>> this includes bot requests. If you remove the easy-to-spot big bots, the
>>> percentage is <3%.
>>>
>>> Details here (still some homework to do regarding IE6 and IE7)
>>> https://www.mediawiki.org/wiki/Analytics/Reports/ClientsWithoutJavascript
>>>
>>>
>>> Thanks,
>>>
>>> Nuria
>>>
>>>
>>>
>>
>
_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
