Dr0ptp4kt has uploaded a new change for review.

  https://gerrit.wikimedia.org/r/89120


Change subject: Adjust for newer redirect formats, getting results faster & easier.
......................................................................

Adjust for newer redirect formats, getting results faster & easier.

Change-Id: Ib626ce29f6338de2bfffd12e8a1c1c811b337f46
---
M maintenance/phantom/README
M maintenance/phantom/zero_automated_tests.js
2 files changed, 182 insertions(+), 142 deletions(-)


  git pull ssh://gerrit.wikimedia.org:29418/mediawiki/extensions/ZeroRatedMobileAccess refs/changes/20/89120/1
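
Reviewer aid: a minimal sketch of how the updated patterns in this change behave. The three regular expressions are copied from the "+" lines of zero_automated_tests.js below; the helper names classifyHost and parseInterstitial are hypothetical and do not appear in the change. Note that the regexes also admit geoiplookup.wikimedia.org, which the README lists of "OKAY" domains do not mention.

```javascript
// Patterns copied verbatim from the "+" lines of this change.
var safeMdotDomains = /^(m\.wikipedia\.org|.+?\.m\.wikipedia\.org|.+?\.zero\.wikipedia\.org|zero\.wikipedia\.org|bits\.wikimedia\.org|commons\.wikimedia\.org|upload\.wikimedia\.org|meta\.wikimedia\.org|geoiplookup\.wikimedia\.org|meta\.m\.wikimedia\.org|commons\.m\.wikimedia\.org)$/;
var safeZerodotDomains = /^(.+?\.zero\.wikipedia\.org|zero\.wikipedia\.org|bits\.wikimedia\.org|commons\.wikimedia\.org|meta\.wikimedia\.org|geoiplookup\.wikimedia\.org|meta\.m\.wikimedia\.org|commons\.m\.wikimedia\.org)$/;
var acceptBillingRx = /Special:ZeroRatedMobileAccess[&?]from=([^&]+)&to=(.+)$/;

// Hypothetical helper: label a hostname the way the README describes,
// choosing the pattern by the subdomain ('zero' or 'm') under test.
function classifyHost(host, subdomain) {
    var safe = subdomain === 'zero' ? safeZerodotDomains : safeMdotDomains;
    return safe.test(host) ? 'OKAY' : 'UNSAFE';
}

// Hypothetical helper: pull the "from" and "to" values out of the newer
// interstitial URL format, or return null if the URL is not an interstitial.
function parseInterstitial(url) {
    var m = acceptBillingRx.exec(url);
    if (m === null) {
        return null;
    }
    return { from: m[1], to: decodeURIComponent(m[2]) };
}

// upload.wikimedia.org is acceptable for mdot access but not for zerodot.
console.log(classifyHost('upload.wikimedia.org', 'm'));
console.log(classifyHost('upload.wikimedia.org', 'zero'));
console.log(parseInterstitial(
    'http://en.zero.wikipedia.org/wiki/Special:ZeroRatedMobileAccess?from=Google&to=https%3A%2F%2Fgoogle.com'
));
```

Because the "to" group is greedy up to the end of the string, any parameter appended after it (for example a cachebuster) would be captured as part of the destination.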

diff --git a/maintenance/phantom/README b/maintenance/phantom/README
index 20c268c..d78f644 100644
--- a/maintenance/phantom/README
+++ b/maintenance/phantom/README
@@ -9,10 +9,10 @@
 
 PATH=/bin:/usr/bin
 
-5 20 * * * /home/dr0ptp4kt/job.sh
+5 20 * * * /home/dr0ptp4kt/job.sh faster
 50 23 * * * /home/dr0ptp4kt/diskusage.sh
 
-These scripts are executable. job.sh can also be run with an optional parameter or two:
+These scripts are executable. As is the case on zero-bdd, job.sh can be run with an optional parameter or two:
 
 $ ./job.sh faster
 
@@ -26,13 +26,17 @@
 
 To run the full battery of accessed pages, run with anything other than "faster". For example:
 
-$ ./job normal 250-99
+$ ./job.sh normal 250-99
 
 runs the full battery for an MCC/MNC of 250-99, or
 
-$ ./job normal
+$ ./job.sh normal
 
-runs the full battery for all configured MCC/MNCs. "normal" may be omitted in this example, as it's the default.
+or just
+
+$ ./job.sh
+
+runs the full battery for all configured MCC/MNCs. "normal" is the default.
 
 
 Be sure to establish appropriate permissions:
@@ -185,7 +189,9 @@
 * http://<domain>/wiki/Google?renderZeroRatedRedirect=true&returnto=https%3A%2F%2Fwww.google.com
 
 B.
-It additionally accesses the same URLs in 3A, tagging on an extra parameter to the URL, "cachebuster=<random value>".
+It additionally accesses the same URLs in 3A, tagging on an extra parameter to the URL, "cachebuster=<random value>". It also hits the following URL only as a cachebusted hit:
+
+* http://<domain>/wiki/Special:ZeroRatedMobileAccess?from=Google&to=https%3A%2F%2Fgoogle.com
 
 C.
 It also connects to the following URLs, even if one of them is not part of a 
partner's Wikipedia Zero configuration.
@@ -203,15 +209,15 @@
 
 4. For the URLs in 3A, 3B, and 3D, in addition to screenshots of the pages, 
hyperlinks from the pages and the URLs of subresources encountered during page 
load and redirects are logged. Hyperlinks and subresources that would violate 
the standard domain name patterns communicated to partners or that would land 
the user on hard-to-track encrypted HTTPS connections are flagged as "UNSAFE" 
whereas other hyperlinks and URLs are flagged as "OKAY". Hyperlinks found on 
domains that aren't part of the carrier configuration aren't considered for 
further spidering.
 
-"OKAY" domain names for zerodot access: zero.wikipedia.org, *.zero.wikipedia.org, bits.wikimedia.org
+"OKAY" domain names for zerodot access: zero.wikipedia.org, *.zero.wikipedia.org, bits.wikimedia.org, meta.wikimedia.org, meta.m.wikimedia.org, commons.wikimedia.org, commons.m.wikimedia.org
 
-"OKAY" domain names for mdot access: m.wikipedia.org, *.m.wikipedia.org, zero.wikipedia.org, *.zero.wikipedia.org, bits.wikimedia.org, commons.wikimedia.org, upload.wikimedia.org
+"OKAY" domain names for mdot access: m.wikipedia.org, *.m.wikipedia.org, zero.wikipedia.org, *.zero.wikipedia.org, bits.wikimedia.org, commons.wikimedia.org, commons.m.wikimedia.org, upload.wikimedia.org, meta.wikimedia.org, meta.m.wikimedia.org
 
-Note that upload.wikimedia.org is not logged as "OKAY" zerodot access. Note that in both the zerodot and the mdot cases meta.wikimedia.org and meta.m.wikimedia.org are not yet considered "SAFE", either.
+Note that upload.wikimedia.org is not logged as "OKAY" zerodot access.
 
 Additionally, note that a partner's webpage hyperlink is flagged as "OKAY" if 
warnings are suppressed for the carrier with the bannerWarning configuration 
setting of "false". Further link collection for pseudorandom selection is not 
performed if the top level subdomain is not in the carrier configuration.
 
-For the URLs in 3C, screenshots are taken and hyperlinks plus subresources are 
logged, although no attempt is currently made to classify UNSAFE or OKAY URLs 
because without X-Forwarded-For spoofing, proper redirection is not guaranteed. 
A Varnish change has been submitted for review 
(https://gerrit.wikimedia.org/r/74509) to allow the Wikipedia Zero automation 
testing server, zero-test.ptmpa.wmflabs (aliased at zero-test.wmflabs.org with 
a static IP address of 208.80.153.184), to be allowed to spoof the 
X-Forwarded-For header to mimic carrier access and enable the correct redirect 
behavior; once this is supported, pseudorandom selection of spoofed IP 
addresses to simulate carriers may be worth exploration.
+For the URLs in 3C, screenshots are taken and hyperlinks plus subresources are 
logged, although no attempt is currently made to classify UNSAFE or OKAY URLs 
because without X-Forwarded-For spoofing, proper redirection is not guaranteed.
 
 For the URL in 3E, a "screenshot" is taken, although PhantomJS is currently 
unable to actually render this, so it is blank. Further processing is not done 
with the results other than logging the WML response negotiation.
 
@@ -222,13 +228,11 @@
 
 WHAT job.sh DOES:
 
-A zip file with the logging data is emailed daily to [email protected].
+A zip file with the logging data is emailed daily.
 
 A zip file showing the diffs for bad resources and links for non-cachebusted 
pages, as well as improper ResourceLoader inclusion of Wikipedia Zero CSS and 
JavaScript, as well as W3C HTML Tidy "gripes", as well as basic WML outcomes 
(achieved with basic awk, BASH, and Curl scripting) are collected and emailed.
 
-An HTMl file showing production banners for each carrier is produced and 
emailed, in addition to being copied to the webserver.
-
-As the diffs and production banners scripts prove themselves, likely in early 
August 2013, other individuals will start receiving the files for routine 
review.
+An HTML file showing production banners for each carrier is produced and 
emailed, in addition to being copied to the webserver.
 
 The full set of files is also zipped and copied to the webserver for retrieval 
from the banners webpage.
 
diff --git a/maintenance/phantom/zero_automated_tests.js b/maintenance/phantom/zero_automated_tests.js
index 5971508..37ddc26 100644
--- a/maintenance/phantom/zero_automated_tests.js
+++ b/maintenance/phantom/zero_automated_tests.js
@@ -8,7 +8,7 @@
 
 Here's one feature that would be really nice to have.
 
- * X-Forwarded-For. Contingent on pending wikimedia.vcl.erb change submitted.
+ * X-Forwarded-For. Note that on zero-bdd.wmflabs.org a 10dot address is identified by Varnish.
 
 Here are some other nice-to-haves for rainy days.
 
@@ -442,9 +442,9 @@
 
                 var speed = ZRMA.getSpeed();
 
-                var safeZerodotDomains = /^(zero\.wikipedia\.org|.+?\.zero\.wikipedia\.org|bits\.wikimedia\.org)$/;
+                var safeMdotDomains = /^(m\.wikipedia\.org|.+?\.m\.wikipedia\.org|.+?\.zero\.wikipedia\.org|zero\.wikipedia\.org|bits\.wikimedia\.org|commons\.wikimedia\.org|upload\.wikimedia\.org|meta\.wikimedia\.org|geoiplookup\.wikimedia\.org|meta\.m\.wikimedia\.org|commons\.m\.wikimedia\.org)$/;
 
-                var safeMdotDomains =/^(m\.wikipedia\.org|.+?\.m\.wikipedia\.org|.+?\.zero\.wikipedia\.org|zero\.wikipedia\.org|bits\.wikimedia\.org|commons\.wikimedia\.org|upload\.wikimedia\.org)$/;
+                var safeZerodotDomains = /^(.+?\.zero\.wikipedia\.org|zero\.wikipedia\.org|bits\.wikimedia\.org|commons\.wikimedia\.org|meta\.wikimedia\.org|geoiplookup\.wikimedia\.org|meta\.m\.wikimedia\.org|commons\.m\.wikimedia\.org)$/;
 
                 var urls = [];
 
@@ -517,6 +517,25 @@
                                     subdomains
                                 )
                             );
+
+                            urls.push(
+                                ZRMA.makeUrl(
+                                    'http',
+                                    langs[l],
+                                    subdomains[s],
+                                    {'X-CS': xcs},
+                                    'Mozilla/5.0 (iPhone; U; CPU iPhone OS 3_0 like Mac OS X; en-us) AppleWebKit/528.18 (KHTML, like Gecko) Version/4.0 Mobile/7A341 Safari/528.16',
+                                    'Special:ZeroRatedMobileAccess?from=Google&to=https%3A%2F%2Fgoogle.com',
+                                    true,
+                                    safeDomains,
+                                    safePartnerUrl,
+                                    enabledForPair,
+                                    comment,
+                                    whitelistedLangs,
+                                    subdomains
+                                )
+                            );
+
 
                             // WAP!
 
@@ -789,7 +808,8 @@
                                 for (i = 0; i < list.length; i++) {
                                     var href = list[i].href;
                                     var linkClass = 
list[i].getAttribute('class');
-                                    hrefList.push( {'sourceUrl' : u['url']  
,'href': href, 'xcs' : u['headers']['X-CS'], 'enabled' : u['enabled'], 
'linkClass' : linkClass} );
+                                    var linkAllClasses = 
list[i].className.split(/\s+/);
+                                    hrefList.push( {'sourceUrl' : u['url']  
,'href': href, 'xcs' : u['headers']['X-CS'], 'enabled' : u['enabled'], 
'linkClass' : linkClass, 'linkAllClasses' : linkAllClasses} );
                                 }
                                 return hrefList;
                             }, url);
@@ -864,7 +884,8 @@
                                                                 'actualUrl' : 
page.url,
                                                                 
'destinationUrl' : elem['href'],
                                                                 
'destinationDomain' : domain[1],
-                                                                'linkClass' : 
elem['linkClass']
+                                                                'linkClass' : 
elem['linkClass'],
+                                                                
'linkAllClasses' : elem['linkAllClasses']
                                                             }
                                                         );
                                                     }
@@ -884,7 +905,7 @@
                                 fs.write(ZRMA.dt + '/' + fileName + 
'.content.html', page.content, 'w');
                                 console.log('SAVED FILE ' + fileName + 
'.content.html on ' + (new Date()).toISOString());
 
-                                page.clipRect = { top: 0, left: 0, width: 320, 
height: 100 };
+                                page.clipRect = { top: 0, left: 0, width: 320, 
height: 150 };
                                 page.render(ZRMA.dt + '/' + fileName + 
'-topclip.png');
                                 console.log('SAVED FILE ' + fileName + 
'-topclip.png on ' + (new Date()).toISOString());
                                 var wlLangs = url['whitelistedLangs'].length 
=== 0 ? 'all langs' : url['whitelistedLangs'].join(", ");
@@ -921,7 +942,7 @@
                                 fs.write(ZRMA.dt + '/' + fileName + 
'.content.html', page.content, 'w');
                                 console.log('SAVED FILE ' + fileName + 
'.content.html on ' + (new Date()).toISOString());
 
-                                page.clipRect = { top: 0, left: 0, width: 320, 
height: 100 };
+                                page.clipRect = { top: 0, left: 0, width: 320, 
height: 150 };
                                 page.render(ZRMA.dt + '/' + fileName + 
'-topclip.png');
                                 console.log('SAVED FILE ' + fileName + 
'-topclip.png on ' + (new Date()).toISOString());
 
@@ -950,112 +971,6 @@
                                 );
                             }
 
-                            patternCat = 
/^http:\/\/([a-zA-Z-]+)\.(m|zero).+Cat[?]cachebuster=(\d+)-([0-9*]+)-\d+$/;
-                            mC = patternCat.exec(url['url']);
-                            if (mC !== null) {
-                                fileName = mC[3] + '-' + mC[4] + '-' + mC[2] + 
'-' + mC[1] + '-Cat-Cachebusted';
-                                page.render(ZRMA.dt + '/' + fileName + '.pdf');
-                                console.log('SAVED FILE ' + fileName + '.pdf 
on ' + (new Date()).toISOString());
-                                fs.write(ZRMA.dt + '/' + fileName + 
'.content.html', page.content, 'w');
-                                console.log('SAVED FILE ' + fileName + 
'.content.html on ' + (new Date()).toISOString());
-
-                            }
-
-                            patternCatNobust = 
/^http:\/\/([a-zA-Z-]+)\.(m|zero).+Cat$/;
-                            mCN = patternCatNobust.exec(url['url']);
-                            if (mCN !== null) {
-                                fileName = url['headers']['X-CS'] + '-' + 
mCN[2] + '-' + mCN[1] + '-Cat-Nobust';
-                                page.render(ZRMA.dt + '/' + fileName + '.pdf');
-                                console.log('SAVED FILE ' + fileName + '.pdf 
on ' + (new Date()).toISOString());
-                                fs.write(ZRMA.dt + '/' + fileName + 
'.content.html', page.content, 'w');
-                                console.log('SAVED FILE ' + fileName + 
'.content.html on ' + (new Date()).toISOString());
-
-                            }
-
-                            patternSpecial = 
/^http:\/\/([a-zA-Z-]+)\.(m|zero).+Special:ZeroRatedMobileAccess[?]cachebuster=(\d+)-([0-9*]+)-\d+$/;
-                            mS = patternSpecial.exec(url['url']);
-                            if (mS !== null) {
-                                fileName = mS[3] + '-' + mS[4] + '-' + mS[2] + 
'-' + mS[1] + '-SpecialZRMA-Cachebusted';
-                                page.render(ZRMA.dt + '/' + fileName + '.pdf');
-                                console.log('SAVED FILE ' + fileName + '.pdf 
on ' + (new Date()).toISOString());
-                                fs.write(ZRMA.dt + '/' + fileName + 
'.content.html', page.content, 'w');
-                                console.log('SAVED FILE ' + fileName + 
'.content.html on ' + (new Date()).toISOString());
-                            }
-
-                            patternSpecialNobust = 
/^http:\/\/([a-zA-Z-]+)\.(m|zero).+Special:ZeroRatedMobileAccess$/;
-                            mSN = patternSpecialNobust.exec(url['url']);
-                            if (mSN !== null) {
-                                fileName = url['headers']['X-CS'] + '-' + 
mSN[2] + '-' + mSN[1] + '-SpecialZRMA-Nobust';
-                                page.render(ZRMA.dt + '/' + fileName + '.pdf');
-                                console.log('SAVED FILE ' + fileName + '.pdf 
on ' + (new Date()).toISOString());
-                                fs.write(ZRMA.dt + '/' + fileName + 
'.content.html', page.content, 'w');
-                                console.log('SAVED FILE ' + fileName + 
'.content.html on ' + (new Date()).toISOString());
-                            }
-
-                            patternFile = 
/^http:\/\/([a-zA-Z-]+)\.(m|zero).+File:Santa_Claus\.jpg[?]cachebuster=(\d+)-([0-9*]+)-\d+$/;
-                            mF = patternFile.exec(url['url']);
-                            if (mF !== null) {
-                                fileName = mF[3] + '-' + mF[4] + '-' + mF[2] + 
'-' + mF[1] + '-File_Santa_Claus_dot_jpg-Cachebusted';
-                                page.render(ZRMA.dt + '/' + fileName + '.pdf');
-                                console.log('SAVED FILE ' + fileName + '.pdf 
on ' + (new Date()).toISOString());
-                                fs.write(ZRMA.dt + '/' + fileName + 
'.content.html', page.content, 'w');
-                                console.log('SAVED FILE ' + fileName + 
'.content.html on ' + (new Date()).toISOString());
-                            }
-
-                            patternFileNobust = 
/^http:\/\/([a-zA-Z-]+)\.(m|zero).+File:Santa_Claus\.jpg$/;
-                            mFN = patternFileNobust.exec(url['url']);
-                            if (mFN !== null) {
-                                fileName = url['headers']['X-CS'] + '-' + 
mFN[2] + '-' + mFN[1] + '-File_Santa_Claus_dot_jpg-Nobust';
-                                page.render(ZRMA.dt + '/' + fileName + '.pdf');
-                                console.log('SAVED FILE ' + fileName + '.pdf 
on ' + (new Date()).toISOString());
-                                fs.write(ZRMA.dt + '/' + fileName + 
'.content.html', page.content, 'w');
-                                console.log('SAVED FILE ' + fileName + 
'.content.html on ' + (new Date()).toISOString());
-                            }
-
-                            patternExternalLink = 
/^http:\/\/([a-zA-Z-]+)\.(m|zero).+www\.google\.com[&]cachebuster=(\d+)-([0-9*]+)-\d+$/;
-                            mE = patternExternalLink.exec(url['url']);
-                            if (mE !== null) {
-                                fileName = mE[3] + '-' + mE[4] + '-' + mE[2] + 
'-' + mE[1] + '-External_Link-Cachebusted';
-                                page.render(ZRMA.dt + '/' + fileName + '.pdf');
-                                console.log('SAVED FILE ' + fileName + '.pdf 
on ' + (new Date()).toISOString());
-                                fs.write(ZRMA.dt + '/' + fileName + 
'.content.html', page.content, 'w');
-                                console.log('SAVED FILE ' + fileName + 
'.content.html on ' + (new Date()).toISOString());
-                            }
-
-                            patternExternalLinkNobust = 
/^http:\/\/([a-zA-Z-]+)\.(m|zero).+www\.google\.com$/;
-                            mEN = patternExternalLinkNobust.exec(url['url']);
-                            if (mEN !== null) {
-                                fileName = url['headers']['X-CS'] + '-' + 
mEN[2] + '-' + mEN[1] + '-External_Link-Nobust';
-                                page.render(ZRMA.dt + '/' + fileName + '.pdf');
-                                console.log('SAVED FILE ' + fileName + '.pdf 
on ' + (new Date()).toISOString());
-                                fs.write(ZRMA.dt + '/' + fileName + 
'.content.html', page.content, 'w');
-                                console.log('SAVED FILE ' + fileName + 
'.content.html on ' + (new Date()).toISOString());
-                            }
-
-
-                            // Special:Random isn't queued as a cachebusted 
URL, so only one file save here
-                            patternSpecialRandomNobust = 
/^http:\/\/([a-zA-Z-]+)\.(m|zero).+Special:Random$/;
-                            mSRN = patternSpecialRandomNobust.exec(url['url']);
-                            if (mSRN !== null) {
-                                fileName = url['headers']['X-CS'] + '-' + 
mSRN[2] + '-' + mSRN[1] + '-SpecialRandom-NobustOnly';
-                                page.render(ZRMA.dt + '/' + fileName + '.pdf');
-                                console.log('SAVED FILE ' + fileName + '.pdf 
on ' + (new Date()).toISOString());
-                                fs.write(ZRMA.dt + '/' + fileName + 
'.content.html', page.content, 'w');
-                                console.log('SAVED FILE ' + fileName + 
'.content.html on ' + (new Date()).toISOString());
-                            }
-
-
-                            // WAP
-
-                            patternSanFrancisco = 
/^http:\/\/([a-zA-Z-]+)\.(m|zero).+San_Francisco[?]cachebuster=(\d+)-([0-9*]+)-\d+$/;
-                            mW = patternSanFrancisco.exec(url['url']);
-                            if (mW !== null) {
-                                fileName = mW[3] + '-' + mW[4] + '-' + mW[2] + 
'-' + mW[1] + '-WAP-San_Francisco-Cachebusted';
-                                page.render(ZRMA.dt + '/' + fileName + '.pdf');
-                                console.log('SAVED FILE ' + fileName + '.pdf 
on ' + (new Date()).toISOString());
-                                fs.write(ZRMA.dt + '/' + fileName + 
'.content.html', page.content, 'w');
-                                console.log('SAVED FILE ' + fileName + 
'.content.html on ' + (new Date()).toISOString());
-                            }
 
                             patternSanFranciscoNobust = 
/^http:\/\/([a-zA-Z-]+)\.(m|zero).+San_Francisco$/;
                             mWN = patternSanFranciscoNobust.exec(url['url']);
@@ -1067,6 +982,117 @@
                                 console.log('SAVED FILE ' + fileName + 
'.content.html on ' + (new Date()).toISOString());
                             }
 
+
+                            if (speed === 'normal') {
+
+                                patternCat = 
/^http:\/\/([a-zA-Z-]+)\.(m|zero).+Cat[?]cachebuster=(\d+)-([0-9*]+)-\d+$/;
+                                mC = patternCat.exec(url['url']);
+                                if (mC !== null) {
+                                    fileName = mC[3] + '-' + mC[4] + '-' + 
mC[2] + '-' + mC[1] + '-Cat-Cachebusted';
+                                    page.render(ZRMA.dt + '/' + fileName + 
'.pdf');
+                                    console.log('SAVED FILE ' + fileName + 
'.pdf on ' + (new Date()).toISOString());
+                                    fs.write(ZRMA.dt + '/' + fileName + 
'.content.html', page.content, 'w');
+                                    console.log('SAVED FILE ' + fileName + 
'.content.html on ' + (new Date()).toISOString());
+
+                                }
+
+                                patternCatNobust = 
/^http:\/\/([a-zA-Z-]+)\.(m|zero).+Cat$/;
+                                mCN = patternCatNobust.exec(url['url']);
+                                if (mCN !== null) {
+                                    fileName = url['headers']['X-CS'] + '-' + 
mCN[2] + '-' + mCN[1] + '-Cat-Nobust';
+                                    page.render(ZRMA.dt + '/' + fileName + 
'.pdf');
+                                    console.log('SAVED FILE ' + fileName + 
'.pdf on ' + (new Date()).toISOString());
+                                    fs.write(ZRMA.dt + '/' + fileName + 
'.content.html', page.content, 'w');
+                                    console.log('SAVED FILE ' + fileName + 
'.content.html on ' + (new Date()).toISOString());
+
+                                }
+
+                                patternSpecial = 
/^http:\/\/([a-zA-Z-]+)\.(m|zero).+Special:ZeroRatedMobileAccess[?]cachebuster=(\d+)-([0-9*]+)-\d+$/;
+                                mS = patternSpecial.exec(url['url']);
+                                if (mS !== null) {
+                                    fileName = mS[3] + '-' + mS[4] + '-' + 
mS[2] + '-' + mS[1] + '-SpecialZRMA-Cachebusted';
+                                    page.render(ZRMA.dt + '/' + fileName + 
'.pdf');
+                                    console.log('SAVED FILE ' + fileName + 
'.pdf on ' + (new Date()).toISOString());
+                                    fs.write(ZRMA.dt + '/' + fileName + 
'.content.html', page.content, 'w');
+                                    console.log('SAVED FILE ' + fileName + 
'.content.html on ' + (new Date()).toISOString());
+                                }
+
+                                patternSpecialNobust = 
/^http:\/\/([a-zA-Z-]+)\.(m|zero).+Special:ZeroRatedMobileAccess$/;
+                                mSN = patternSpecialNobust.exec(url['url']);
+                                if (mSN !== null) {
+                                    fileName = url['headers']['X-CS'] + '-' + 
mSN[2] + '-' + mSN[1] + '-SpecialZRMA-Nobust';
+                                    page.render(ZRMA.dt + '/' + fileName + 
'.pdf');
+                                    console.log('SAVED FILE ' + fileName + 
'.pdf on ' + (new Date()).toISOString());
+                                    fs.write(ZRMA.dt + '/' + fileName + 
'.content.html', page.content, 'w');
+                                    console.log('SAVED FILE ' + fileName + 
'.content.html on ' + (new Date()).toISOString());
+                                }
+
+                                patternFile = 
/^http:\/\/([a-zA-Z-]+)\.(m|zero).+File:Santa_Claus\.jpg[?]cachebuster=(\d+)-([0-9*]+)-\d+$/;
+                                mF = patternFile.exec(url['url']);
+                                if (mF !== null) {
+                                    fileName = mF[3] + '-' + mF[4] + '-' + 
mF[2] + '-' + mF[1] + '-File_Santa_Claus_dot_jpg-Cachebusted';
+                                    page.render(ZRMA.dt + '/' + fileName + 
'.pdf');
+                                    console.log('SAVED FILE ' + fileName + 
'.pdf on ' + (new Date()).toISOString());
+                                    fs.write(ZRMA.dt + '/' + fileName + 
'.content.html', page.content, 'w');
+                                    console.log('SAVED FILE ' + fileName + 
'.content.html on ' + (new Date()).toISOString());
+                                }
+
+                                patternFileNobust = 
/^http:\/\/([a-zA-Z-]+)\.(m|zero).+File:Santa_Claus\.jpg$/;
+                                mFN = patternFileNobust.exec(url['url']);
+                                if (mFN !== null) {
+                                    fileName = url['headers']['X-CS'] + '-' + 
mFN[2] + '-' + mFN[1] + '-File_Santa_Claus_dot_jpg-Nobust';
+                                    page.render(ZRMA.dt + '/' + fileName + 
'.pdf');
+                                    console.log('SAVED FILE ' + fileName + 
'.pdf on ' + (new Date()).toISOString());
+                                    fs.write(ZRMA.dt + '/' + fileName + 
'.content.html', page.content, 'w');
+                                    console.log('SAVED FILE ' + fileName + 
'.content.html on ' + (new Date()).toISOString());
+                                }
+
+                                patternExternalLink = 
/^http:\/\/([a-zA-Z-]+)\.(m|zero).+www\.google\.com[&]cachebuster=(\d+)-([0-9*]+)-\d+$/;
+                                mE = patternExternalLink.exec(url['url']);
+                                if (mE !== null) {
+                                    fileName = mE[3] + '-' + mE[4] + '-' + 
mE[2] + '-' + mE[1] + '-External_Link-Cachebusted';
+                                    page.render(ZRMA.dt + '/' + fileName + 
'.pdf');
+                                    console.log('SAVED FILE ' + fileName + 
'.pdf on ' + (new Date()).toISOString());
+                                    fs.write(ZRMA.dt + '/' + fileName + 
'.content.html', page.content, 'w');
+                                    console.log('SAVED FILE ' + fileName + 
'.content.html on ' + (new Date()).toISOString());
+                                }
+
+                                patternExternalLinkNobust = 
/^http:\/\/([a-zA-Z-]+)\.(m|zero).+www\.google\.com$/;
+                                mEN = 
patternExternalLinkNobust.exec(url['url']);
+                                if (mEN !== null) {
+                                    fileName = url['headers']['X-CS'] + '-' + 
mEN[2] + '-' + mEN[1] + '-External_Link-Nobust';
+                                    page.render(ZRMA.dt + '/' + fileName + 
'.pdf');
+                                    console.log('SAVED FILE ' + fileName + 
'.pdf on ' + (new Date()).toISOString());
+                                    fs.write(ZRMA.dt + '/' + fileName + 
'.content.html', page.content, 'w');
+                                    console.log('SAVED FILE ' + fileName + 
'.content.html on ' + (new Date()).toISOString());
+                                }
+
+
+                                // Special:Random isn't queued as a 
cachebusted URL, so only one file save here
+                                patternSpecialRandomNobust = 
/^http:\/\/([a-zA-Z-]+)\.(m|zero).+Special:Random$/;
+                                mSRN = 
patternSpecialRandomNobust.exec(url['url']);
+                                if (mSRN !== null) {
+                                    fileName = url['headers']['X-CS'] + '-' + 
mSRN[2] + '-' + mSRN[1] + '-SpecialRandom-NobustOnly';
+                                    page.render(ZRMA.dt + '/' + fileName + 
'.pdf');
+                                    console.log('SAVED FILE ' + fileName + 
'.pdf on ' + (new Date()).toISOString());
+                                    fs.write(ZRMA.dt + '/' + fileName + 
'.content.html', page.content, 'w');
+                                    console.log('SAVED FILE ' + fileName + 
'.content.html on ' + (new Date()).toISOString());
+                                }
+
+
+                                // WAP
+
+                                patternSanFrancisco = 
/^http:\/\/([a-zA-Z-]+)\.(m|zero).+San_Francisco[?]cachebuster=(\d+)-([0-9*]+)-\d+$/;
+                                mW = patternSanFrancisco.exec(url['url']);
+                                if (mW !== null) {
+                                    fileName = mW[3] + '-' + mW[4] + '-' + 
mW[2] + '-' + mW[1] + '-WAP-San_Francisco-Cachebusted';
+                                    page.render(ZRMA.dt + '/' + fileName + 
'.pdf');
+                                    console.log('SAVED FILE ' + fileName + 
'.pdf on ' + (new Date()).toISOString());
+                                    fs.write(ZRMA.dt + '/' + fileName + 
'.content.html', page.content, 'w');
+                                    console.log('SAVED FILE ' + fileName + 
'.content.html on ' + (new Date()).toISOString());
+                                }
+
+                            }
 
 
 
@@ -1111,7 +1137,7 @@
                         var randIndexes = [];
                         var randBillingIndexes = [];
 
-                        var acceptBillingRx = /acceptbilling=yes/;
+                        var acceptBillingRx = /Special:ZeroRatedMobileAccess[&?]from=([^&]+)&to=(.+)$/;
                         for (var u = 0; u < targetLinks.length; u++) {
 
                             /*
@@ -1121,7 +1147,7 @@
                              */
 
 
-                            if ((targetLinks[u]['linkClass'] === 'image' && targetLinks[u]['urlObj']['subdomain'] === 'zero') || acceptBillingRx.test(targetLinks[u]['destinationUrl'])) {
+                            if ((targetLinks[u]['linkClass'] === 'image' && targetLinks[u]['urlObj']['subdomain'] === 'zero') || acceptBillingRx.test(targetLinks[u]['destinationUrl']) || targetLinks[u]['linkAllClasses'].indexOf('external') !== -1) {
                                 randBillingIndexes.push(u);
                             }
                         }
@@ -1157,6 +1183,8 @@
                             urlToAugment['url'] = 
targetLinks[randIndexes[r]]['destinationUrl'];
                             urlToAugment['destinationDomain'] = 
targetLinks[randIndexes[r]]['destinationDomain'];
                             urlToAugment['linkClass'] = 
targetLinks[randIndexes[r]]['linkClass'];
+                            urlToAugment['linkAllClasses'] = 
targetLinks[randIndexes[r]]['linkAllClasses'];
+                            urlToAugment['source'] = 
targetLinks[randIndexes[r]]['actualUrl'];
 
                             var redirRx = 
/^http:\/\/([a-zA-Z-]+)\.(m|zero)\.wikipedia\.org\/(.*)(\?|\?.+&)acceptbilling=yes(.*)$/;
 
@@ -1168,9 +1196,11 @@
                              because the user has accepted billing, so...
                             */
 
-                            if (redirRx.test(urlToAugment['url'])) {
+
+                            if (redirRx.test(urlToAugment['url']) || acceptBillingRx.test(urlToAugment['source'])) {
                                 urlToAugment['goAnywhere'] = true;
                             }
+
 
                             urlsToFollow.push(urlToAugment);
                         }
@@ -1224,12 +1254,17 @@
                                                 domain = hostExtractor.exec(elem['href']);
 
                                                 if (url['safeDomains']) {
+                                                    worthConsideration = false;
+
                                                     if (domain === null) {
                                                         emptyOrUrlIndicator = (elem['href'] === undefined || elem['href'] === '') ? 'with undefined or empty HREF' : decodeURI(elem['href']);
                                                         console.log('UNSAFE URL UNABLE TO EXTRACT PROPER SCHEME AND DOMAIN WITH X-CS ' + elem['xcs'] + ' (enabled=' + elem['enabled'] + ') : From ' + elem['sourceUrl'] + ' actually at ' + page.url + ' saw hyperlink ' + emptyOrUrlIndicator + ' using regex ' + hostExtractor.source + ' FROM SPIDERED PAGE on ' + (new Date()).toISOString())
                                                     } else {
                                                         domainMatch = url['safeDomains'].exec(domain[1]);
-                                                        if (domainMatch === null) {
+
+                                                        if (redirRx.test(elem['href']) || acceptBillingRx.test(urlToAugment['source'])) {
+                                                            worthConsideration = true;
+                                                        } else if (domainMatch === null) {
                                                             console.log('UNSAFE URL SCHEME CORRECT BUT DOMAIN NOT MATCHED WITH X-CS ' + elem['xcs'] + ' (enabled=' + elem['enabled'] + ') : From ' + elem['sourceUrl'] + ' actually at ' + page.url + ' saw hyperlink ' + decodeURI(elem['href']) + ' using regex ' + url['safeDomains'].source + ' after ' + hostExtractor.source + ' identified domain ' + domain[1] + ' FROM SPIDERED PAGE on ' + (new Date()).toISOString());
                                                         } else {
 
@@ -1266,17 +1301,18 @@
                                                                 worthConsideration = true;
                                                             }
 
-                                                            if (worthConsideration) {
-                                                                addToBucketFunc(
-                                                                    {
-                                                                        'urlObj' : url,
-                                                                        'actualUrl' : page.url,
-                                                                        'destinationUrl' : elem['href'],
-                                                                        'destinationDomain' : domain[1],
-                                                                        'linkClass' : elem['linkClass']
-                                                                    }
-                                                                );
-                                                            }
+                                                        }
+
+                                                        if (worthConsideration) {
+                                                            addToBucketFunc(
+                                                                {
+                                                                    'urlObj' : url,
+                                                                    'actualUrl' : page.url,
+                                                                    'destinationUrl' : elem['href'],
+                                                                    'destinationDomain' : domain[1],
+                                                                    'linkClass' : elem['linkClass']
+                                                                }
+                                                            );
                                                         }
                                                     }
                                                 }
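
For readers of this change, the widened billing-candidate check in the first hunks above can be sketched in isolation. This is only an illustration, not code from the patch: the regex and the three-way condition are copied from the diff, while the helper name isBillingCandidate and the sample link objects and URLs are hypothetical.

```javascript
// Regex copied from the patch: matches the newer interstitial redirect format.
var acceptBillingRx = /Special:ZeroRatedMobileAccess[&?]from=([^&]+)&to=(.+)$/;

// Hypothetical helper (not in the patch) wrapping the condition from the diff.
// A link is a billing candidate if it is a zero-subdomain image link, points
// at the interstitial redirect, or carries the 'external' class.
function isBillingCandidate(link) {
    return (link.linkClass === 'image' && link.urlObj.subdomain === 'zero') ||
        acceptBillingRx.test(link.destinationUrl) ||
        link.linkAllClasses.indexOf('external') !== -1;
}

// Hypothetical sample links, shaped like the objects passed to addToBucketFunc.
var interstitialLink = {
    linkClass: 'text',
    linkAllClasses: 'text',
    urlObj: { subdomain: 'zero' },
    destinationUrl: 'http://en.zero.wikipedia.org/wiki/Special:ZeroRatedMobileAccess' +
        '?from=Foo&to=https%3A%2F%2Fwww.example.org'
};
var plainLink = {
    linkClass: 'text',
    linkAllClasses: 'text',
    urlObj: { subdomain: 'zero' },
    destinationUrl: 'http://en.zero.wikipedia.org/wiki/Foo'
};

console.log(isBillingCandidate(interstitialLink));
console.log(isBillingCandidate(plainLink));
```

The capture groups of acceptBillingRx expose the "from" and "to" values, should a caller need them. The regex has no global flag, so repeated test() calls keep no state between links.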

-- 
To view, visit https://gerrit.wikimedia.org/r/89120
To unsubscribe, visit https://gerrit.wikimedia.org/r/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: Ib626ce29f6338de2bfffd12e8a1c1c811b337f46
Gerrit-PatchSet: 1
Gerrit-Project: mediawiki/extensions/ZeroRatedMobileAccess
Gerrit-Branch: master
Gerrit-Owner: Dr0ptp4kt <[email protected]>

_______________________________________________
MediaWiki-commits mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-commits
