Summary of IRC Meeting in #aurora at Mon Apr 13 18:00:24 2015:

Attendees: thalin, jcohen, wfarner, kts, jaybuff, mkhutornenko, bbrazil, zmanji, Floomi

- Preface
- 0.8.0 progress update
- CI flakiness
- static port assignment

IRC log follows:

## Preface ##
[Mon Apr 13 18:00:58 2015] <wfarner>: welcome to the weekly community meeting! everyone is welcome, so if you're in this channel please feel free to participate!
[Mon Apr 13 18:01:08 2015] <wfarner>: let's start with roll call
[Mon Apr 13 18:01:09 2015] <wfarner>: here
[Mon Apr 13 18:01:12 2015] <Floomi>: here
[Mon Apr 13 18:01:13 2015] <kts>: here
[Mon Apr 13 18:01:15 2015] <bbrazil>: here
[Mon Apr 13 18:01:20 2015] <jcohen>: ahoy ahoy
[Mon Apr 13 18:02:07 2015] <jaybuff>: here
[Mon Apr 13 18:02:33 2015] <mkhutornenko>: here

## 0.8.0 progress update ##
[Mon Apr 13 18:04:34 2015] <wfarner>: jira is failing me at the moment, but IIRC we only needed some final polish around auth
[Mon Apr 13 18:04:40 2015] <wfarner>: kts: how is that looking?
[Mon Apr 13 18:04:46 2015] <thalin>: here
[Mon Apr 13 18:05:21 2015] <kts>: wfarner: code's there, just needs docs
[Mon Apr 13 18:07:16 2015] <wfarner>: kts: is that to say you're working on the docs?
[Mon Apr 13 18:07:37 2015] <kts>: AURORA-817
[Mon Apr 13 18:07:38 2015] <wfarner>: i.e. think we'll be able to cut an RC by EOW?
[Mon Apr 13 18:07:47 2015] <bbrazil>: Should we add the fix for AURORA-1268 to 0.8.0 considering it prevents backwards compatability for some config schema changes?
[Mon Apr 13 18:08:03 2015] <wfarner>: bbrazil: that's a good point, i had forgotten about that
[Mon Apr 13 18:08:15 2015] <wfarner>: it would be nice to not further push that problem down the road
[Mon Apr 13 18:08:30 2015] <bbrazil>: I've https://github.com/wickman/pystachio/pull/15 which should fix it, but I've not tested it end-to-end
[Mon Apr 13 18:08:55 2015] <wfarner>: impromptu vote - add AURORA-1268 as an 0.8.0 release blocker
[Mon Apr 13 18:09:00 2015] <wfarner>: +1
[Mon Apr 13 18:09:33 2015] <kts>: we would need to wait for a new pystachio release with that patch
[Mon Apr 13 18:09:40 2015] <bbrazil>: +1
[Mon Apr 13 18:09:46 2015] <wfarner>: yup, which i think we should push for
[Mon Apr 13 18:10:18 2015] <zmanji>: here
[Mon Apr 13 18:10:22 2015] <kts>: +1
[Mon Apr 13 18:10:42 2015] <zmanji>: +1
[Mon Apr 13 18:10:48 2015] <wfarner>: mkhutornenko jcohen, can you vote?
[Mon Apr 13 18:10:52 2015] <mkhutornenko>: +1
[Mon Apr 13 18:11:11 2015] <wfarner>: i can provide context if needed
[Mon Apr 13 18:11:29 2015] <jcohen>: +1
[Mon Apr 13 18:11:42 2015] <wfarner>: thanks. i am adding this as a release blocker

## CI flakiness ##
[Mon Apr 13 18:14:18 2015] <wfarner>: mostly FYI - i put a small amount of effort over the weekend into squelching some sources of CI flakiness we've had over the last few months
[Mon Apr 13 18:14:24 2015] <wfarner>: one is here: https://reviews.apache.org/r/33103/
[Mon Apr 13 18:14:52 2015] <wfarner>: another item was removing a jenkins slave from our builds that was consistently producing pip install errors
[Mon Apr 13 18:15:07 2015] <wfarner>: so, hopefully we have a less noisy signal from CI
[Mon Apr 13 18:15:51 2015] <zmanji>: wfarner: is it possible for us to file tickets about these issues?
[Mon Apr 13 18:16:03 2015] <wfarner>: of course it's possible
[Mon Apr 13 18:16:21 2015] <wfarner>: the flaky tests all have tickets on our end
[Mon Apr 13 18:16:47 2015] <wfarner>: and we have AURORA-1238 to track the pip error
[Mon Apr 13 18:17:34 2015] <wfarner>: identifying reliable jenkins slaves has been a constant time drain, and i would love contribution there
[Mon Apr 13 18:18:32 2015] <wfarner>: that's all the topics i had in mind. floor is open for additional topics!
[Mon Apr 13 18:19:15 2015] <bbrazil>: AURORA-1212

## static port assignment ##
[Mon Apr 13 18:19:51 2015] <bbrazil>: I've sent mail about this on the list and got no responses, any opinions on whether this shouldn/shouldn't go in?
[Mon Apr 13 18:21:21 2015] <bbrazil>: I've got a setup where I currently need this, at it seems like something that'll have to get full scheduler support at some point
[Mon Apr 13 18:22:48 2015] <wfarner>: +1 from me, so long as it's not default behavior
[Mon Apr 13 18:23:36 2015] <wfarner>: can others chime in? we can't proceed on lazy consensus
[Mon Apr 13 18:24:49 2015] <kts>: how do you feel about an abstraction where they're accounted, like supporting a staticports resource on the mesos slave?
[Mon Apr 13 18:26:05 2015] <jaybuff>: ACTION reads the email on static ports
[Mon Apr 13 18:26:24 2015] <jaybuff>: yeah, this sounds like you want a constraint of "a consistent port for all instnaces of this job"
[Mon Apr 13 18:27:21 2015] <jaybuff>: would that meet your use case? you wouldn't be able to specify the port, but aurora would gaurentee that all instances were allocated the same port
[Mon Apr 13 18:27:31 2015] <wfarner>: seems like that would require coordination, right? i.e. can't be accomplished with _only_ adding a slave attribute
[Mon Apr 13 18:27:40 2015] <wfarner>: the scheduler would need to own conflict avoidance
[Mon Apr 13 18:28:32 2015] <kts>: yes we'd need some support in the scheduler to know that the task needed a staticport==80 resource
[Mon Apr 13 18:29:17 2015] <wfarner>: well, i meant conflict across slaves...though i suppose that might not be necessary
[Mon Apr 13 18:29:29 2015] <wfarner>: (i'm mapping this specifically into bbrazil's use case)
[Mon Apr 13 18:29:33 2015] <bbrazil>: http://mail-archives.apache.org/mod_mbox/incubator-aurora-dev/201504.mbox/%3CCALG-N5Pgx4d0SY56dfxTgE4wQkrT-a4eC3B9pdUMt7Y4vJUjeQ%40mail.gmail.com%3E
[Mon Apr 13 18:29:33 2015] <bbrazil>: I think that could form part of a full scheduler solution, it's a lot of work though to remove a client-side only restriction so I'd prefer to just remove the restriction now
[Mon Apr 13 18:29:33 2015] <bbrazil>: I need to specify the port, as the other end doesn't has service discovery
[Mon Apr 13 18:30:09 2015] <jcohen>: The problem with removing the restriction is that it opens a fairly big hole for people to fall into
[Mon Apr 13 18:30:26 2015] <mkhutornenko>: +1 I was typing something along those lines
[Mon Apr 13 18:30:50 2015] <mkhutornenko>: I am afraid relying on hooks to enforce the current behavior is not safe enough
[Mon Apr 13 18:31:03 2015] <bbrazil>: I can see it going further and having ACLs around what's allowed to get to what static ports for security
[Mon Apr 13 18:31:15 2015] <jcohen>: I think if we were going to remove the restriction that the scheduler would have to get involved to ensure that two tasks requesting the same static port did not land on the same slave
[Mon Apr 13 18:32:19 2015] <jcohen>: I can potentially imagine a solution where this functionality can be conditionally enabled in the client (basically "operators beware!"), but that's definitely a half measure and not something I'd be terribly comfortable with.
[Mon Apr 13 18:32:26 2015] <bbrazil>: I'm proposing for now to remove the restriction that static ports only work in dedicated roles, and it's up to the user to configure things to avoid conflicts - as they do with dedicated roles today
[Mon Apr 13 18:32:58 2015] <jcohen>: right, but what happens when you have two users of the cluser who are not aware of each others static port requirements?
[Mon Apr 13 18:33:07 2015] <jcohen>: it seems difficult to manage in that way
[Mon Apr 13 18:33:18 2015] <bbrazil>: you'd have to have central port allocation (we've a wiki page)
[Mon Apr 13 18:33:19 2015] <jcohen>: (as you said, you guys require external coordination for this, but that's not really scalable)
[Mon Apr 13 18:33:37 2015] <mkhutornenko>: the problem though that users may not be aware of underlying limitations and overuse the feature to the extent of killing the cluster
[Mon Apr 13 18:34:27 2015] <bbrazil>: yeah, long term using static ports for everything doesn't scale - but there's going to be transitions and edge cases where you need it
[Mon Apr 13 18:36:05 2015] <bbrazil>: as it stands someone can send an RPC to the scheduler and create such a job, the restriction is only in the client
[Mon Apr 13 18:37:09 2015] <wfarner>: vote: allow non-default configuration of the client to remove the barrier to static port assignment
[Mon Apr 13 18:37:15 2015] <wfarner>: +1
[Mon Apr 13 18:37:28 2015] <bbrazil>: +1
[Mon Apr 13 18:37:30 2015] <kts>: +0
[Mon Apr 13 18:38:47 2015] <zmanji>: +0
[Mon Apr 13 18:39:26 2015] <kts>: nothing against configuring a task to use a static port, just think mesos should know you're doing it, but agree that removing the client restriction is a reasonable short-term fix
[Mon Apr 13 18:40:05 2015] <bbrazil>: the docs should probably be expanded (I forget if they mention the caveats)
[Mon Apr 13 18:40:20 2015] <zmanji>: I also think the client is a good short term fix but really mesos should be dealing with this
[Mon Apr 13 18:40:52 2015] <bbrazil>: zmanji: agreed
[Mon Apr 13 18:41:37 2015] <mkhutornenko>: -1, I'd rather see this condition tightened than removed
[Mon Apr 13 18:41:39 2015] <jcohen>: I'm slightly on the negative side on this. It feels like opening up a pretty big hole, unless I'm misunderstanding.
[Mon Apr 13 18:42:03 2015] <jcohen>: Given that it's possible with direct RPC is why I'm not fully against it
[Mon Apr 13 18:43:02 2015] <wfarner>: mkhutornenko jcohen: for the negative side, do you feel that bbrazil's use case is satisfied, or one that we should not aim to satisfy?
[Mon Apr 13 18:43:25 2015] <jcohen>: I think the use case is reasonable.
[Mon Apr 13 18:43:50 2015] <mkhutornenko>: same here I think though the use case is currently addressed by relying on dedicated ports
[Mon Apr 13 18:44:24 2015] <kts>: mkhutornenko: you need to dedicate the whole slave though
[Mon Apr 13 18:44:32 2015] <jcohen>: My understanding is that they'd prefer not to use dedicated resources just for this one task
[Mon Apr 13 18:44:44 2015] <kts>: bbrazil only wants to dedicate the ports, not all the resources of the slave
[Mon Apr 13 18:44:46 2015] <mkhutornenko>: agreed, that's the limitation of the use case
[Mon Apr 13 18:45:48 2015] <mkhutornenko>: perhaps a different "static pool" attribute can be used cluster similar jobs but I agree, it's not as flexible
[Mon Apr 13 18:46:41 2015] <kts>: I don't follow - shouldn't the other resources of the slave be available to any other task that wants them?
[Mon Apr 13 18:47:17 2015] <bbrazil>: yes, I'll have 3-4 roles with many tasks each
[Mon Apr 13 18:47:51 2015] <mkhutornenko>: that's the part I was referring to as "not as flexible". Given enough of the jobs with static port reqs they could be scheduled onto the same set of hosts though without limiting resource utilization
[Mon Apr 13 18:48:15 2015] <mkhutornenko>: that's provided jobs require different subsets of static ports
[Mon Apr 13 18:48:48 2015] <bbrazil>: in my use case, every job has a unique assigned static port
[Mon Apr 13 18:49:12 2015] <bbrazil>: and these are separate from the port range mesos assigns
[Mon Apr 13 18:49:44 2015] <mkhutornenko>: is it possible to co-locate those jobs on the same pool of machines then?
[Mon Apr 13 18:49:55 2015] <bbrazil>: yes, this will be all of my machines
[Mon Apr 13 18:50:12 2015] <jcohen>: bbrazil: right, but we're talking about adding a feature to Aurora in general, we need to abstract from your use case to the general use case where people might be less diligent.
[Mon Apr 13 18:50:16 2015] <mkhutornenko>: sorry, meant sub-pool defined by an attribute :)
[Mon Apr 13 18:51:13 2015] <bbrazil>: mkhutornenko: I think that's a separate feature, you can kind of do that with mesos slave atributes and aurora constraints currently
[Mon Apr 13 18:51:46 2015] <mkhutornenko>: that was my point exactly
[Mon Apr 13 18:52:22 2015] <mkhutornenko>: any chance you could adapt to the current approach?
[Mon Apr 13 18:52:55 2015] <bbrazil>: for security/isolation/quote I want each of the 4 teams we have to have their own role
[Mon Apr 13 18:53:05 2015] <bbrazil>: but share all the machines
[Mon Apr 13 18:53:13 2015] <bbrazil>: *quota
[Mon Apr 13 18:54:13 2015] <mkhutornenko>: right, can't you just assign a dedicated attribute to ALL of your machines and require your teams use a dedicated constraint?
[Mon Apr 13 18:54:44 2015] <wfarner>: mkhutornenko: who is that really helping?
[Mon Apr 13 18:55:02 2015] <bbrazil>: iirc, dedicated constraints allow a machine to only allow one role to use them
[Mon Apr 13 18:55:28 2015] <kts>: yeah they're limited to a single role
[Mon Apr 13 18:55:39 2015] <wfarner>: no, they can have multiple
[Mon Apr 13 18:55:48 2015] <mkhutornenko>: right, i just checked that
[Mon Apr 13 18:55:56 2015] <kts>: oh interesting
[Mon Apr 13 18:56:02 2015] <wfarner>: but it's not a pretty solution - you'd need to change slave attributes every time a new role is added, and thus reboot the cluster
[Mon Apr 13 18:56:11 2015] <wfarner>: and i still claim - to what end
[Mon Apr 13 18:57:12 2015] <mkhutornenko>: well, the alternative is to potentially open up the cluster for a user misuse, which I am not sure is better
[Mon Apr 13 18:57:19 2015] <bbrazil>: if I can assign multiple users to a dedicated machine that'd work for me
[Mon Apr 13 18:57:50 2015] <kts>: bbrazil: if you go down that route you'll have to reboot the cluster to add a new role
[Mon Apr 13 18:59:26 2015] <bbrazil>: http://aurora.apache.org/documentation/latest/deploying-aurora-scheduler/#dedicated-attribute seems to indicate that isn't possible though - are the docs out of date?
[Mon Apr 13 19:00:08 2015] <bbrazil>: kts: I don't expect to add new roles often, I'm only expecting 4 as-is for 20-30 jobs
[Mon Apr 13 19:00:21 2015] <bbrazil>: not having to do that would be better of course
[Mon Apr 13 19:01:38 2015] <mkhutornenko>: bbrazil: mind exploring the dedicated constraint route first? may be easier to enhance that feature to address your needs instead?
[Mon Apr 13 19:01:46 2015] <bbrazil>: I note that we've been discussing this for 40m
[Mon Apr 13 19:02:05 2015] <mkhutornenko>: agree, let's move to the email list
[Mon Apr 13 19:02:38 2015] <bbrazil>: If I can do dedicated=.* that'd work, not sure that's allowed though
[Mon Apr 13 19:02:56 2015] <jcohen>: +1 for moving to list
[Mon Apr 13 19:04:11 2015] <kts>: mkhutornenko: would you like to take the lead on updating the documentation regarding dedicated hosts?
[Mon Apr 13 19:05:01 2015] <mkhutornenko>: sure, I can double check if it's still up-to-date
[Mon Apr 13 19:05:43 2015] <wfarner>: sounds like that wraps thing up
[Mon Apr 13 19:05:58 2015] <wfarner>: thanks for the engaging discussions, everyone!
[Mon Apr 13 19:06:00 2015] <wfarner>: ASFBot: meeting stop

Meeting ended at Mon Apr 13 19:06:00 2015
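
Editor's note: the conflict-avoidance concern raised in the static port discussion (two tasks requesting the same static port must not land on the same slave) can be illustrated with a small sketch. This is not Aurora or Mesos code, and nothing like it was agreed in the meeting. The names `pick_slave`, `place`, and `ports_in_use`, and the slave names in the usage note, are hypothetical and exist only for illustration.

```python
from typing import Dict, Iterable, Optional, Set


def pick_slave(
    requested_ports: Iterable[int],
    ports_in_use: Dict[str, Set[int]],
) -> Optional[str]:
    """Return a slave on which none of the requested static ports is taken.

    Hypothetical helper: `ports_in_use` maps a slave name to the static
    ports already claimed on it. Slaves are checked in sorted order so the
    result is deterministic. Returns None when every slave conflicts.
    """
    wanted = set(requested_ports)
    for slave in sorted(ports_in_use):
        if wanted.isdisjoint(ports_in_use[slave]):
            return slave
    return None


def place(task_ports: Iterable[int], ports_in_use: Dict[str, Set[int]]) -> Optional[str]:
    """Pick a conflict-free slave and record the task's ports on it."""
    ports = set(task_ports)
    slave = pick_slave(ports, ports_in_use)
    if slave is not None:
        ports_in_use[slave].update(ports)
    return slave
```

With a cluster state such as `{"slave-a": {80}, "slave-b": set()}`, a task asking for port 80 would be steered to `slave-b`, and a second task asking for port 80 would find no eligible slave. A client-side-only change, as voted on above, provides no such check, which is the gap jcohen and mkhutornenko describe.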