Summary of IRC Meeting in #aurora at Mon Apr 13 18:00:24 2015:

Attendees: thalin, jcohen, wfarner, kts, jaybuff, mkhutornenko, bbrazil, 
zmanji, Floomi

- Preface
- 0.8.0 progress update
- CI flakiness
- static port assignment


IRC log follows:

## Preface ##
[Mon Apr 13 18:00:58 2015] <wfarner>: welcome to the weekly community meeting!  
everyone is welcome, so if you're in this channel please feel free to 
participate!
[Mon Apr 13 18:01:08 2015] <wfarner>: let's start with roll call
[Mon Apr 13 18:01:09 2015] <wfarner>: here
[Mon Apr 13 18:01:12 2015] <Floomi>: here
[Mon Apr 13 18:01:13 2015] <kts>: here
[Mon Apr 13 18:01:15 2015] <bbrazil>: here
[Mon Apr 13 18:01:20 2015] <jcohen>: ahoy ahoy
[Mon Apr 13 18:02:07 2015] <jaybuff>: here
[Mon Apr 13 18:02:33 2015] <mkhutornenko>: here
## 0.8.0 progress update ##
[Mon Apr 13 18:04:34 2015] <wfarner>: jira is failing me at the moment, but 
IIRC we only needed some final polish around auth
[Mon Apr 13 18:04:40 2015] <wfarner>: kts: how is that looking?
[Mon Apr 13 18:04:46 2015] <thalin>: here
[Mon Apr 13 18:05:21 2015] <kts>: wfarner: code's there, just needs docs
[Mon Apr 13 18:07:16 2015] <wfarner>: kts: is that to say you're working on the 
docs?
[Mon Apr 13 18:07:37 2015] <kts>: AURORA-817
[Mon Apr 13 18:07:38 2015] <wfarner>: i.e. think we'll be able to cut an RC by 
EOW?
[Mon Apr 13 18:07:47 2015] <bbrazil>: Should we add the fix for AURORA-1268 to 
0.8.0 considering it prevents backwards compatibility for some config schema 
changes?
[Mon Apr 13 18:08:03 2015] <wfarner>: bbrazil: that's a good point, i had 
forgotten about that
[Mon Apr 13 18:08:15 2015] <wfarner>: it would be nice to not further push that 
problem down the road
[Mon Apr 13 18:08:30 2015] <bbrazil>: I've 
https://github.com/wickman/pystachio/pull/15 which should fix it, but I've not 
tested it end-to-end
[Mon Apr 13 18:08:55 2015] <wfarner>: impromptu vote - add AURORA-1268 as an 
0.8.0 release blocker
[Mon Apr 13 18:09:00 2015] <wfarner>: +1
[Mon Apr 13 18:09:33 2015] <kts>: we would need to wait for a new pystachio 
release with that patch
[Mon Apr 13 18:09:40 2015] <bbrazil>: +1
[Mon Apr 13 18:09:46 2015] <wfarner>: yup, which i think we should push for
[Mon Apr 13 18:10:18 2015] <zmanji>: here
[Mon Apr 13 18:10:22 2015] <kts>: +1
[Mon Apr 13 18:10:42 2015] <zmanji>: +1
[Mon Apr 13 18:10:48 2015] <wfarner>: mkhutornenko jcohen, can you vote?
[Mon Apr 13 18:10:52 2015] <mkhutornenko>: +1
[Mon Apr 13 18:11:11 2015] <wfarner>: i can provide context if needed
[Mon Apr 13 18:11:29 2015] <jcohen>: +1
[Mon Apr 13 18:11:42 2015] <wfarner>: thanks.  i am adding this as a release 
blocker
## CI flakiness ##
[Mon Apr 13 18:14:18 2015] <wfarner>: mostly FYI - i put a small amount of 
effort over the weekend into squelching some sources of CI flakiness we've had 
over the last few months
[Mon Apr 13 18:14:24 2015] <wfarner>: one is here: 
https://reviews.apache.org/r/33103/
[Mon Apr 13 18:14:52 2015] <wfarner>: another item was removing a jenkins slave 
from our builds that was consistently producing pip install errors
[Mon Apr 13 18:15:07 2015] <wfarner>: so, hopefully we have a less noisy signal 
from CI
[Mon Apr 13 18:15:51 2015] <zmanji>: wfarner: is it possible for us to file 
tickets about these issues?
[Mon Apr 13 18:16:03 2015] <wfarner>: of course it's possible
[Mon Apr 13 18:16:21 2015] <wfarner>: the flaky tests all have tickets on our 
end
[Mon Apr 13 18:16:47 2015] <wfarner>: and we have AURORA-1238 to track the pip 
error
[Mon Apr 13 18:17:34 2015] <wfarner>: identifying reliable jenkins slaves has 
been a constant time drain, and i would love contribution there
[Mon Apr 13 18:18:32 2015] <wfarner>: that's all the topics i had in mind.  
floor is open for additional topics!
[Mon Apr 13 18:19:15 2015] <bbrazil>: AURORA-1212
## static port assignment ##
[Mon Apr 13 18:19:51 2015] <bbrazil>: I've sent mail about this on the list and 
got no responses, any opinions on whether this should/shouldn't go in?
[Mon Apr 13 18:21:21 2015] <bbrazil>: I've got a setup where I currently need 
this, as it seems like something that'll have to get full scheduler support at 
some point
[Mon Apr 13 18:22:48 2015] <wfarner>: +1 from me, so long as it's not default 
behavior
[Mon Apr 13 18:23:36 2015] <wfarner>: can others chime in?  we can't proceed on 
lazy consensus
[Mon Apr 13 18:24:49 2015] <kts>: how do you feel about an abstraction where 
they're accounted, like supporting a staticports resource on the mesos slave?
[Mon Apr 13 18:26:05 2015] <jaybuff>: ACTION reads the email on static ports
[Mon Apr 13 18:26:24 2015] <jaybuff>: yeah, this sounds like you want a 
constraint of "a consistent port for all instances of this job"
[Mon Apr 13 18:27:21 2015] <jaybuff>: would that meet your use case?  you 
wouldn't be able to specify the port, but aurora would guarantee that all 
instances were allocated the same port
[Mon Apr 13 18:27:31 2015] <wfarner>: seems like that would require 
coordination, right?  i.e. can't be accomplished with _only_ adding a slave 
attribute
[Mon Apr 13 18:27:40 2015] <wfarner>: the scheduler would need to own conflict 
avoidance
[Mon Apr 13 18:28:32 2015] <kts>: yes we'd need some support in the scheduler 
to know that the task needed a staticport==80 resource
[Mon Apr 13 18:29:17 2015] <wfarner>: well, i meant conflict across 
slaves...though i suppose that might not be necessary
[Mon Apr 13 18:29:29 2015] <wfarner>: (i'm mapping this specifically into 
bbrazil's use case)
[Mon Apr 13 18:29:33 2015] <bbrazil>: 
http://mail-archives.apache.org/mod_mbox/incubator-aurora-dev/201504.mbox/%3CCALG-N5Pgx4d0SY56dfxTgE4wQkrT-a4eC3B9pdUMt7Y4vJUjeQ%40mail.gmail.com%3E
[Mon Apr 13 18:29:33 2015] <bbrazil>: I think that could form part of a full 
scheduler solution, it's a lot of work though to remove a client-side only 
restriction so I'd prefer to just remove the restriction now
[Mon Apr 13 18:29:33 2015] <bbrazil>: I need to specify the port, as the other 
end doesn't have service discovery
[Mon Apr 13 18:30:09 2015] <jcohen>: The problem with removing the restriction 
is that it opens a fairly big hole for people to fall into
[Mon Apr 13 18:30:26 2015] <mkhutornenko>: +1 I was typing something along 
those lines
[Mon Apr 13 18:30:50 2015] <mkhutornenko>: I am afraid relying on hooks to 
enforce the current behavior is not safe enough
[Mon Apr 13 18:31:03 2015] <bbrazil>: I can see it going further and having 
ACLs around what's allowed to get to what static ports for security
[Mon Apr 13 18:31:15 2015] <jcohen>: I think if we were going to remove the 
restriction that the scheduler would have to get involved to ensure that two 
tasks requesting the same static port did not land on the same slave
[Mon Apr 13 18:32:19 2015] <jcohen>: I can potentially imagine a solution where 
this functionality can be conditionally enabled in the client (basically 
“operators beware!”), but that’s definitely a half measure and not 
something I’d be terribly comfortable with.
[Mon Apr 13 18:32:26 2015] <bbrazil>: I'm proposing for now to remove the 
restriction that static ports only work in dedicated roles, and it's up to the 
user to configure things to avoid conflicts - as they do with dedicated roles 
today
[Mon Apr 13 18:32:58 2015] <jcohen>: right, but what happens when you have two 
users of the cluster who are not aware of each other's static port requirements?
[Mon Apr 13 18:33:07 2015] <jcohen>: it seems difficult to manage in that way
[Mon Apr 13 18:33:18 2015] <bbrazil>: you'd have to have central port 
allocation (we've a wiki page)
[Mon Apr 13 18:33:19 2015] <jcohen>: (as you said, you guys require external 
coordination for this, but that’s not really scalable)
[Mon Apr 13 18:33:37 2015] <mkhutornenko>: the problem though that users may 
not be aware of underlying limitations and overuse the feature to the extent of 
killing the cluster
[Mon Apr 13 18:34:27 2015] <bbrazil>: yeah, long term using static ports for 
everything doesn't scale - but there's going to be transitions and edge cases 
where you need it
[Mon Apr 13 18:36:05 2015] <bbrazil>: as it stands someone can send an RPC to 
the scheduler and create such a job, the restriction is only in the client
[Mon Apr 13 18:37:09 2015] <wfarner>: vote: allow non-default configuration of 
the client to remove the barrier to static port assignment
[Mon Apr 13 18:37:15 2015] <wfarner>: +1
[Mon Apr 13 18:37:28 2015] <bbrazil>: +1
[Mon Apr 13 18:37:30 2015] <kts>: +0
[Mon Apr 13 18:38:47 2015] <zmanji>: +0
[Mon Apr 13 18:39:26 2015] <kts>: nothing against configuring a task to use a 
static port, just think mesos should know you're doing it, but agree that 
removing the client restriction is a reasonable short-term fix
[Mon Apr 13 18:40:05 2015] <bbrazil>: the docs should probably be expanded (I 
forget if they mention the caveats)
[Mon Apr 13 18:40:20 2015] <zmanji>: I also think the client is a good short 
term fix but really mesos should be dealing with this
[Mon Apr 13 18:40:52 2015] <bbrazil>: zmanji: agreed
[Mon Apr 13 18:41:37 2015] <mkhutornenko>: -1, I’d rather see this condition 
tightened than removed
[Mon Apr 13 18:41:39 2015] <jcohen>: I’m slightly on the negative side on 
this. It feels like opening up a pretty big hole, unless I’m misunderstanding.
[Mon Apr 13 18:42:03 2015] <jcohen>: Given that it’s possible with direct RPC 
is why I’m not fully against it
[Mon Apr 13 18:43:02 2015] <wfarner>: mkhutornenko jcohen: for the negative 
side, do you feel that bbrazil's use case is satisfied, or one that we should 
not aim to satisfy?
[Mon Apr 13 18:43:25 2015] <jcohen>: I think the use case is reasonable.
[Mon Apr 13 18:43:50 2015] <mkhutornenko>: same here I think though the use 
case is currently addressed by relying on dedicated ports
[Mon Apr 13 18:44:24 2015] <kts>: mkhutornenko: you need to dedicate the whole 
slave though
[Mon Apr 13 18:44:32 2015] <jcohen>: My understanding is that they’d prefer 
not to use dedicated resources just for this one task
[Mon Apr 13 18:44:44 2015] <kts>: bbrazil only wants to dedicate the ports, not 
all the resources of the slave
[Mon Apr 13 18:44:46 2015] <mkhutornenko>: agreed, that’s the limitation of 
the use case
[Mon Apr 13 18:45:48 2015] <mkhutornenko>: perhaps a different “static 
pool” attribute can be used to cluster similar jobs but I agree, it’s not as 
flexible
[Mon Apr 13 18:46:41 2015] <kts>: I don't follow - shouldn't the other 
resources of the slave be available to any other task that wants them?
[Mon Apr 13 18:47:17 2015] <bbrazil>: yes, I'll have 3-4 roles with many tasks 
each
[Mon Apr 13 18:47:51 2015] <mkhutornenko>: that’s the part I was referring to 
as “not as flexible”. Given enough of the jobs with static port reqs they 
could be scheduled onto the same set of hosts though without limiting resource 
utilization
[Mon Apr 13 18:48:15 2015] <mkhutornenko>: that’s provided jobs require 
different subsets of static ports
[Mon Apr 13 18:48:48 2015] <bbrazil>: in my use case, every job has a unique 
assigned static port
[Mon Apr 13 18:49:12 2015] <bbrazil>: and these are separate from the port 
range mesos assigns
[Mon Apr 13 18:49:44 2015] <mkhutornenko>: is it possible to co-locate those 
jobs on the same pool of machines then?
[Mon Apr 13 18:49:55 2015] <bbrazil>: yes, this will be all of my machines
[Mon Apr 13 18:50:12 2015] <jcohen>: bbrazil: right, but we’re talking about 
adding a feature to Aurora in general, we need to abstract from your use case 
to the general use case where people might be less diligent.
[Mon Apr 13 18:50:16 2015] <mkhutornenko>: sorry, meant sub-pool defined by an 
attribute :)
[Mon Apr 13 18:51:13 2015] <bbrazil>: mkhutornenko: I think that's a separate 
feature, you can kind of do that with mesos slave attributes and aurora 
constraints currently
[Mon Apr 13 18:51:46 2015] <mkhutornenko>: that was my point exactly
[Mon Apr 13 18:52:22 2015] <mkhutornenko>: any chance you could adapt to the 
current approach?
[Mon Apr 13 18:52:55 2015] <bbrazil>: for security/isolation/quote I want each 
of the 4 teams we have to have their own role
[Mon Apr 13 18:53:05 2015] <bbrazil>: but share all the machines
[Mon Apr 13 18:53:13 2015] <bbrazil>: *quota
[Mon Apr 13 18:54:13 2015] <mkhutornenko>: right, can’t you just assign a 
dedicated attribute to ALL of your machines and require your teams use a 
dedicated constraint?
[Mon Apr 13 18:54:44 2015] <wfarner>: mkhutornenko: who is that really helping?
[Mon Apr 13 18:55:02 2015] <bbrazil>: iirc, dedicated constraints allow a 
machine to only allow one role to use them
[Mon Apr 13 18:55:28 2015] <kts>: yeah they're limited to a single role
[Mon Apr 13 18:55:39 2015] <wfarner>: no, they can have multiple
[Mon Apr 13 18:55:48 2015] <mkhutornenko>: right, i just checked that
[Mon Apr 13 18:55:56 2015] <kts>: oh interesting
[Mon Apr 13 18:56:02 2015] <wfarner>: but it's not a pretty solution - you'd 
need to change slave attributes every time a new role is added, and thus reboot 
the cluster
[Mon Apr 13 18:56:11 2015] <wfarner>: and i still claim - to what end
[Mon Apr 13 18:57:12 2015] <mkhutornenko>: well, the alternative is to 
potentially open up the cluster for user misuse, which I am not sure is better
[Mon Apr 13 18:57:19 2015] <bbrazil>: if I can assign multiple users to a 
dedicated machine that'd work for me
[Mon Apr 13 18:57:50 2015] <kts>: bbrazil: if you go down that route you'll 
have to reboot the cluster to add a new role
[Mon Apr 13 18:59:26 2015] <bbrazil>: 
http://aurora.apache.org/documentation/latest/deploying-aurora-scheduler/#dedicated-attribute
 seems to indicate that isn't possible though - are the docs out of date?
[Mon Apr 13 19:00:08 2015] <bbrazil>: kts: I don't expect to add new roles 
often, I'm only expecting 4 as-is for 20-30 jobs
[Mon Apr 13 19:00:21 2015] <bbrazil>: not having to do that would be better of 
course
[Mon Apr 13 19:01:38 2015] <mkhutornenko>: bbrazil: mind exploring the 
dedicated constraint route first? may be easier to enhance that feature to 
address your needs instead?
[Mon Apr 13 19:01:46 2015] <bbrazil>: I note that we've been discussing this 
for 40m
[Mon Apr 13 19:02:05 2015] <mkhutornenko>: agree, let’s move to the email list
[Mon Apr 13 19:02:38 2015] <bbrazil>: If I can do dedicated=.* that'd work, not 
sure that's allowed though
[Mon Apr 13 19:02:56 2015] <jcohen>: +1 for moving to list
[Mon Apr 13 19:04:11 2015] <kts>: mkhutornenko: would you like to take the lead 
on updating the documentation regarding dedicated hosts?
[Mon Apr 13 19:05:01 2015] <mkhutornenko>: sure, I can double check if it’s 
still up-to-date
[Mon Apr 13 19:05:43 2015] <wfarner>: sounds like that wraps things up
[Mon Apr 13 19:05:58 2015] <wfarner>: thanks for the engaging discussions, 
everyone!
[Mon Apr 13 19:06:00 2015] <wfarner>: ASFBot: meeting stop


Meeting ended at Mon Apr 13 19:06:00 2015
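
Illustrative note on the static port discussion: jcohen suggested (18:31:15) that the scheduler would have to ensure two tasks requesting the same static port do not land on the same slave. The following is only a minimal sketch of that check, not Aurora or Mesos code. The names Slave, can_place, and place are hypothetical and were not mentioned in the meeting.

```python
from dataclasses import dataclass, field


@dataclass
class Slave:
    """Hypothetical stand-in for a Mesos slave; tracks static ports in use."""
    name: str
    static_ports_in_use: set = field(default_factory=set)


def can_place(slave, requested_ports):
    """Return True if none of the requested static ports is taken on this slave."""
    return slave.static_ports_in_use.isdisjoint(requested_ports)


def place(slaves, requested_ports):
    """Assign the task to the first slave without a port conflict, else return None."""
    for slave in slaves:
        if can_place(slave, requested_ports):
            slave.static_ports_in_use.update(requested_ports)
            return slave.name
    return None


# Three tasks all ask for static port 80 on a two-slave cluster (assumed setup).
slaves = [Slave("slave-a"), Slave("slave-b")]
first = place(slaves, {80})
second = place(slaves, {80})
third = place(slaves, {80})
```

Under these assumptions the first two tasks land on different slaves and the third cannot be placed. That is the accounting a client-side-only restriction cannot provide.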
