On 15 December 2014 at 07:35, Christian Aistleitner <
[email protected]> wrote:
>
> Hi,
>
> On Thu, Dec 11, 2014 at 06:27:02PM -0500, Oliver Keyes wrote:
> > > On Sun, Dec 07, 2014 at 12:59:27PM +0100, Christian Aistleitner wrote:
> > > >   http://config-master.wikimedia.org/pybal/esams/text-https
> > [...]
> > I'm not sure how to interpret the pybal,
>
> The example file linked above holds lines like
>
>   { 'host': 'amssq36.esams.wmnet', 'weight': 1, 'enabled': True }
>
> Such a line means:
>
>   The host 'amssq36.esams.wmnet' [1] is [2] an SSL terminator for the
>   text cluster in esams [3], and has weight 1 [4].
>
>
>
> > Essentially; we want
> > to be excluding internal IP spaces, because that contains a lot of
> > automatically-generated traffic (fundraising, I'm looking at you)
>
> Oliver, I do not like blame games.
> You have blamed Fundraising before for causing lots of internal
> requests, and I asked you then to please provide an example.
> However, you did not provide one. And yet you call out
> Fundraising again.
>
> Please provide an example [5] of such traffic, so we're all on the
> same page.
>

It's hard to pull out, but they're requests with a PhantomJS user-agent
that hit a large number of places to test banners. To be clear (my initial
email was not clear) this is not a serious "damn you fundraising! Damn you
all to heck!" but a joking one ;p. They do fantastic work and the requests
they make to test banner appearance are part of that work. FWIW, I was
informed that they were doing this...by Fundraising ;p. If you'd like more
confirmation than that, I can talk to them and grab a specific example.


>
>
>
> > So, we
> > exclude all requests from IPs within our ranges. Except, then we also
> > exclude all the SSL traffic, since that will appear to come from an
> > internal IP address, from the point of view of the request logs.
> >
> > So, do I interpret this pybal as: if it's tagged as HTTPS,
>
> Since you use 'tag' in different contexts around https, let me clarify
> how I read 'tag' here. I read it as “If a pybal *-https file lists a
> host as enabled with positive weight in a line that is not commented
> out”.
>
>
>
> > it's an SSL
> > terminator, [...]
>
> Yes.
>
>
>
> > [...] and so requests from those machines, from internal IP
> > addresses, should be included?
>
> In the end “should be included” is something you have to decide.
>
> But if you see a request whose ip column comes from a machine whose
> corresponding name has been listed in a pybal *-https file while the
> request was processed, it “typically” is a relayed request from the
> SSL terminator.
>
> (Note the distinction between my “typically is a relayed request from
> the SSL terminator” and your “should be included”.)
>

Awesome :). We'll never get certainty - getting "most of the time" is, I
think, Good Enough (tm).
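
To make the "listed as enabled with positive weight in a line that is not
commented out" reading concrete, here is a minimal sketch in Python. It
assumes each non-comment line of a pybal *-https file is a Python-style dict
literal like the one quoted above, and that commented-out lines start with
'#'. The function name enabled_hosts is made up for illustration; this is not
how pybal itself reads the file.

```python
import ast


def enabled_hosts(pybal_text):
    """Return the host names listed as enabled with positive weight.

    Assumes one dict literal per line, e.g.
      { 'host': 'amssq36.esams.wmnet', 'weight': 1, 'enabled': True }
    and that commented-out lines start with '#'.
    """
    hosts = set()
    for raw in pybal_text.splitlines():
        line = raw.strip()
        # Skip blank lines and lines that are commented out.
        if not line or line.startswith("#"):
            continue
        try:
            entry = ast.literal_eval(line)
        except (ValueError, SyntaxError):
            # Not a dict literal we understand; ignore rather than guess.
            continue
        if not isinstance(entry, dict) or "host" not in entry:
            continue
        weight = entry.get("weight", 0)
        if (entry.get("enabled") is True
                and isinstance(weight, (int, float))
                and weight > 0):
            hosts.add(entry["host"])
    return hosts
```

Fed the example line from the file, this returns a set containing only
'amssq36.esams.wmnet'. Since the list can change over time, the file contents
would need to match the period the requests were processed in.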


>
>
> > Or: those are the SSL machines, find out
> > their IP addresses and you find out the internal IPs that represent SSLd
> > requests, rather than internally-generated traffic?
>
> I cannot fully parse that sentence.
> But it sounds a bit like SSL traffic would not be internally-generated
> traffic.
> From the logging perspective, SSL traffic is internally-generated traffic:
>
>   The SSL terminator performs a separate, genuinely fresh and new
>   request to the caches.
>
> This separate, genuinely fresh and new request gets logged. And that's
> the log line you're after, if you want to look at https traffic from
> within Hive.
>
>
Gotcha. So, if we wanted to exclude internally-generated traffic most of
the time, without unduly punishing HTTPS traffic, we'd be looking at a
heuristic that looks something like:

*If the request comes from a WMF IP range;
**Exclude, unless;
***The request is to a host listed as https=1 in the pybal file

If I'm reading right?
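
As a rough sketch of that heuristic in Python (standard library only): it
reads the third bullet the way Christian describes it above, i.e. the
request's ip column belongs to a host listed in a pybal *-https file. The
function name keep_request and both parameters are hypothetical, and
resolving the SSL terminators' host names to IP addresses is assumed to
happen elsewhere.

```python
import ipaddress


def keep_request(ip, internal_ranges, ssl_terminator_ips):
    """Decide whether a logged request should be kept.

    ip                 -- the request's ip column, as a string
    internal_ranges    -- iterable of ipaddress networks treated as internal
    ssl_terminator_ips -- set of ipaddress addresses of the SSL terminators
    """
    addr = ipaddress.ip_address(ip)
    # Requests from outside the internal ranges are always kept.
    if not any(addr in net for net in internal_ranges):
        return True
    # Internal address: keep only if it is a known SSL terminator, since
    # that is "typically" a relayed HTTPS request.
    return addr in ssl_terminator_ips
```

For example, with internal_ranges set to [ipaddress.ip_network("10.0.0.0/8")]
as a placeholder (not the actual WMF ranges) and a terminator at 10.0.0.5, a
request from 10.0.0.5 is kept while one from 10.0.0.9 is dropped.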


>
>
> Have fun,
> Christian
>
>
>
> [1] 'host' field
>
> [2] 'enabled' field
>
> [3] see URL
>
> [4] 'weight' field. You probably need not care about the weight. The
> weight tells you how much of the overall traffic a node gets. In the
> given file, all hosts have weight 1, so they all get a similarly sized
> part of the overall traffic.
>
> [5] Either anonymized on-list, or else for example through a command
> that we can run on stat1002.
>
>
>
> --
> ---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
>                            Companies' registry: 360296y in Linz
> Christian Aistleitner
> Kefermarkterstrasze 6a/3     Email:  [email protected]
> 4293 Gutau, Austria          Phone:          +43 7946 / 20 5 81
>                              Fax:            +43 7946 / 20 5 81
>                              Homepage: http://quelltextlich.at/
> ---------------------------------------------------------------
>
> _______________________________________________
> Analytics mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>

-- 
Oliver Keyes
Research Analyst
Wikimedia Foundation
