On 15 December 2014 at 07:35, Christian Aistleitner < [email protected]> wrote:
>
> Hi,
>
> On Thu, Dec 11, 2014 at 06:27:02PM -0500, Oliver Keyes wrote:
> > > On Sun, Dec 07, 2014 at 12:59:27PM +0100, Christian Aistleitner wrote:
> > > > http://config-master.wikimedia.org/pybal/esams/text-https
> > [...]
> > I'm not sure how to interpret the pybal,
>
> The exemplary file linked above holds lines like
>
>   { 'host': 'amssq36.esams.wmnet', 'weight': 1, 'enabled': True }
>
> Such a line means:
>
> The host 'amssq36.esams.wmnet' [1] is [2] an SSL terminator for text
> cluster in esams [3], and has weight 1 [4].
>
> >
> > Essentially; we want
> > to be excluding internal IP spaces, because that contains a lot of
> > automatically-generated traffic (fundraising, I'm looking at you)
>
> Oliver, I do not like blaming games.
> You blamed Fundraising before to cause lots of internal requests.
> And I called you out on that before to please provide an example.
> However, you failed to provide an example. And yet you call out
> Fundraising again.
>
> Please provide an example [5] of such traffic, so we're all on the
> same page.
>

It's hard to pull out, but they're requests with a PhantomJS user-agent
that hit a large number of places to test banners. To be clear (my
initial email was not clear) this is not a serious "damn you
fundraising! Damn you all to heck!" but a joking one ;p. They do
fantastic work and the requests they make to test banner appearance are
part of that work. FWIW, I was informed that they were doing this...by
Fundraising ;p. If you'd like more confirmation than that, I can talk to
them and grab a specific example.

>
> >
> > So, we
> > exclude all requests from IPs within our ranges. Except, then we also
> > exclude all the SSL traffic, since that will appear to come from an
> > internal IP address, from the point of view of the request logs.
> >
> > So, do I interpret this pybal as: if it's tagged as HTTPS,
>
> Since you use 'tag' in different contexts around https, let me clarify
> how I read 'tag' here. I read it as “If a pybal *-https file lists a
> host as enabled with positive weight in a line that is not commented
> out”
>
> >
> > it's an SSL
> > terminator, [...]
>
> Yes.
>
> >
> > [...] and so requests from those machines, from internal IP
> > addresses, should be included?
>
> In the end “should be included” is something you have to decide.
>
> But if you see a request, whose ip column comes from a machine whose
> corresponding name has been listed in a pybal *-https file while the
> request was processed, it “typically” is a relayed request from the
> SSL terminator.
>
> (Note the distinction between my “typically is a relayed request from
> the SSL terminator” and your “should be included”.)
>

Awesome :). We'll never get certainty - getting "most of the time" is, I
think, Good Enough (tm).

>
> >
> > Or: those are the SSL machines, find out
> > their IP addresses and you find out the internal IPs that represent SSLd
> > requests, rather than internally-generated traffic?
>
> I cannot fully parse that sentence.
> But it sounds a bit like SSL traffic would not be internally-generated
> traffic.
> From the logging perspective, SSL traffic is internally-generated traffic:
>
> The SSL terminator performs a separate, genuinely fresh and new
> request to the caches.
>
> This separate, genuinely fresh and new request gets logged. And that's
> the log line you're after, if you want to look at https traffic from
> within Hive.
>
>

Gotcha. So, if we wanted to exclude internally-generated traffic most of
the time, without unduly punishing HTTPS traffic, we'd be looking at a
heuristic that looks something like:

* If the request comes from a WMF IP range;
** Exclude, unless;
*** The request is to a host listed as https=1 in the pybal file

If I'm reading right?

>
>
> Have fun,
> Christian
>
>
>
> [1] 'host' field
>
> [2] 'enabled' field
>
> [3] see URL
>
> [4] 'weight' field. You probably need not care about the weight. The
> weight tells you how much of the overall traffic a node gets. In the
> given file, all hosts have weight 1, so they all get a similar sized
> part of the overall traffic.
>
> [5] Either anonymized on-list, or else for example through a command
> that we can run on stat1002.
>
>
>
> --
> ---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
> Companies' registry: 360296y in Linz
> Christian Aistleitner
> Kefermarkterstrasze 6a/3 Email: [email protected]
> 4293 Gutau, Austria Phone: +43 7946 / 20 5 81
> Fax: +43 7946 / 20 5 81
> Homepage: http://quelltextlich.at/
> ---------------------------------------------------------------
>
> _______________________________________________
> Analytics mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>

--
Oliver Keyes
Research Analyst
Wikimedia Foundation
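
P.S. For reference, the exclusion rule discussed above could be sketched
roughly as follows in Python. This is only an illustration under stated
assumptions, not how any existing job does it: the function names
(enabled_hosts, exclude_request) are made up, the 10.0.0.0/8 range and the
IP addresses used with it are placeholders rather than real WMF ranges, and
every host in SAMPLE other than amssq36.esams.wmnet is invented. It follows
Christian's description, so it checks whether the request's source IP
belongs to an SSL terminator, and it leaves the hostname-to-IP lookup to
the caller.

```python
import ast
import ipaddress

# Sample pybal-style content. Only the first line is taken from the thread;
# the other two hosts are invented to show the commented-out and disabled cases.
SAMPLE = """
{ 'host': 'amssq36.esams.wmnet', 'weight': 1, 'enabled': True }
#{ 'host': 'hypothetical-a.esams.wmnet', 'weight': 1, 'enabled': True }
{ 'host': 'hypothetical-b.esams.wmnet', 'weight': 1, 'enabled': False }
"""


def enabled_hosts(pybal_text):
    """Return hosts listed as enabled, with positive weight, on a line
    that is not commented out."""
    hosts = set()
    for line in pybal_text.splitlines():
        line = line.strip().rstrip(",")
        if not line or line.startswith("#"):
            continue  # blank or commented-out line
        try:
            entry = ast.literal_eval(line)  # safe parse of a dict literal
        except (ValueError, SyntaxError):
            continue  # not a parsable entry; ignore it
        if (isinstance(entry, dict) and "host" in entry
                and entry.get("enabled") and entry.get("weight", 0) > 0):
            hosts.add(entry["host"])
    return hosts


def exclude_request(client_ip, internal_networks, terminator_ips):
    """Exclude requests from internal ranges, unless the source IP is one
    of the SSL terminators (those are typically relayed HTTPS requests)."""
    ip = ipaddress.ip_address(client_ip)
    is_internal = any(ip in net for net in internal_networks)
    return is_internal and ip not in terminator_ips
```

Here terminator_ips would be a set of ipaddress objects obtained by
resolving the names that enabled_hosts returns. Since the pybal files
change over time, the lookup would ideally use the file as it was while the
request was processed.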
