we’re using cgroups, if that’s what you’re asking :)
> On Mar 30, 2017, at 3:21 PM, Zameer Manji <zma...@apache.org> wrote:
>
> What kind of isolation features are you using?
>
> I would like to probe a little deeper here, because this is not an ideal
> rationale for changing the placement algorithm. Ideally Mesos and Linux
> provide the right isolation technology to make this a non-problem.
>
> I understand the push for job anti-affinity (i.e. don't put too many kafka
> workers in general on one host), but I would imagine it would be for
> reliability reasons, not for performance reasons.
>
> On Thu, Mar 30, 2017 at 12:16 PM, Rick Mangi <r...@chartbeat.com> wrote:
>
>> Performance and utilization mostly. The kafka consumers are CPU bound (and
>> sometimes network) and the rest of our jobs are mostly memory bound. We’ve
>> found that if too many consumers wind up on the same EC2 instance they
>> don’t perform as well. It’s hard to prove this, but the gut feeling is
>> pretty strong.
>>
>>
>>> On Mar 30, 2017, at 2:35 PM, Zameer Manji <zma...@apache.org> wrote:
>>>
>>> Rick,
>>>
>>> Can you share why it would be nice to spread out these different jobs on
>>> different hosts? Is it for reliability, performance, utilization, etc?
>>>
>>> On Thu, Mar 30, 2017 at 11:31 AM, Rick Mangi <r...@chartbeat.com> wrote:
>>>
>>>> Yeah, we have a dozen or so kafka consumer jobs running in our cluster,
>>>> each having about 40 or so instances.
>>>>
>>>>
>>>>> On Mar 30, 2017, at 2:06 PM, David McLaughlin <da...@dmclaughlin.com> wrote:
>>>>>
>>>>> There is absolutely a need for custom hook points in the scheduler
>>>>> (injecting default constraints to running tasks, for example). I don't
>>>>> think users should be asked to write custom scheduling algorithms to
>>>>> solve the problems in this thread, though. There are also huge downsides
>>>>> to exposing the internals of scheduling as part of a plugin API.
>>>>>
>>>>> Out of curiosity, do your Kafka consumers span multiple jobs? Otherwise
>>>>> host constraints solve that problem, right?
>>>>>
>>>>>> On Mar 30, 2017, at 10:34 AM, Rick Mangi <r...@chartbeat.com> wrote:
>>>>>>
>>>>>> I think the complexity is a great rationale for having a pluggable
>>>>>> scheduling layer. Aurora is very flexible and people use it in many
>>>>>> different ways. Giving users more flexibility in how jobs are scheduled
>>>>>> seems like it would be a good direction for the project.
>>>>>>
>>>>>>
>>>>>>> On Mar 30, 2017, at 12:16 PM, David McLaughlin <dmclaugh...@apache.org> wrote:
>>>>>>>
>>>>>>> I think this is more complicated than multiple scheduling algorithms.
>>>>>>> The problem you'll end up having, if you try to solve this in the
>>>>>>> scheduling loop, is when resources are unavailable because there are
>>>>>>> preemptible tasks running on them, rather than hosts being down. Right
>>>>>>> now the fact that the task cannot be scheduled is important because it
>>>>>>> triggers preemption and will make room. An alternative algorithm that
>>>>>>> tries at all costs to schedule the task in the TaskAssigner could
>>>>>>> decide to place the task in a non-ideal slot and leave a preemptible
>>>>>>> task running instead.
>>>>>>>
>>>>>>> It's also important to think of the knock-on effects here when we move
>>>>>>> to offer affinity (i.e. the current Dynamic Reservation proposal). If
>>>>>>> you've made this non-ideal compromise to get things scheduled, that
>>>>>>> decision will basically be permanent until the host you're on goes
>>>>>>> down. At least with how things work now, with each scheduling attempt
>>>>>>> the job has a fresh chance of being put in an ideal slot.
>>>>>>>
>>>>>>>> On Thu, Mar 30, 2017 at 8:12 AM, Rick Mangi <r...@chartbeat.com> wrote:
>>>>>>>>
>>>>>>>> Sorry for the late reply, but I wanted to chime in here as wanting to
>>>>>>>> see this feature. We run a medium-size cluster (around 1000 cores) in
>>>>>>>> EC2 and I think we could get better usage of the cluster with more
>>>>>>>> control over the distribution of job instances. For example, it would
>>>>>>>> be nice to limit the number of kafka consumers running on the same
>>>>>>>> physical box.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>>
>>>>>>>> Rick
>>>>>>>>
>>>>>>>>
>>>>>>>>> On 2017-03-06 14:44 (-0400), Mauricio Garavaglia <m...@gmail.com> wrote:
>>>>>>>>> Hello!
>>>>>>>>>
>>>>>>>>> I have a job that has multiple instances (>100) that I'd like to
>>>>>>>>> spread across the hosts in a cluster. Using a constraint such as
>>>>>>>>> "limit=host:1" doesn't work quite well, as I have more instances
>>>>>>>>> than nodes.
>>>>>>>>>
>>>>>>>>> As a workaround I increased the limit value to something like
>>>>>>>>> ceil(instances/nodes). But now the problem happens if a bunch of
>>>>>>>>> nodes go down (think a whole rack dies), because the instances will
>>>>>>>>> not run until they are back, even though we may have spare capacity
>>>>>>>>> on the rest of the hosts that we'd like to use. In that scenario,
>>>>>>>>> the job availability may be affected because it's running with fewer
>>>>>>>>> instances than expected. On a smaller scale, the same approach would
>>>>>>>>> also apply if you want to spread tasks across racks or availability
>>>>>>>>> zones. I'd like to have one instance of a job per rack (failure
>>>>>>>>> domain), but if that rack goes down, the instance can be spawned on
>>>>>>>>> a different rack.
>>>>>>>>>
>>>>>>>>> I thought we could have a scheduling constraint to "spread" instances
>>>>>>>>> across a particular host attribute; instead of vetoing an offer right
>>>>>>>>> away, we check where the other instances of a task are running,
>>>>>>>>> looking at a particular attribute of the host. We try to maximize the
>>>>>>>>> number of distinct values of that attribute (rack, hostname, etc.)
>>>>>>>>> across the task instance assignments.
>>>>>>>>>
>>>>>>>>> What do you think? Did something like this come up in the past? Is it
>>>>>>>>> feasible?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Mauricio
>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>
>>>> --
>>>> Zameer Manji
>>>>
>>
>> --
>> Zameer Manji
>>
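For anyone following along, here is a minimal sketch of the limit-constraint workaround Mauricio describes, written against the standard .aurora DSL. The cluster/role/job names, the instance and node counts, and the `consumer_task` Task are all placeholders, not a real config, and `ceil(instances/nodes)` has to be recomputed whenever the cluster changes size, which is exactly the fragility being discussed.

```python
import math

INSTANCES = 120   # hypothetical instance count
NODES = 40        # hypothetical number of agents
per_host_limit = int(math.ceil(float(INSTANCES) / NODES))  # -> 3

jobs = [Job(
  cluster = 'example',        # placeholder cluster name
  role = 'kafka',             # placeholder role
  environment = 'prod',
  name = 'consumer',
  instances = INSTANCES,
  task = consumer_task,       # assumed to be defined earlier in the .aurora file
  # A limit constraint caps how many instances may share one value of the
  # 'host' attribute, i.e. at most per_host_limit instances per machine.
  constraints = {'host': 'limit:{}'.format(per_host_limit)},
)]
```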
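And a rough illustration of the "spread" idea itself, in plain Python rather than the scheduler's actual Java internals: instead of vetoing an offer outright, score the candidate offers so that the one whose attribute value currently hosts the fewest instances of the job wins. The offer shape below is invented for the example.

```python
from collections import Counter

def pick_offer(offers, placed_values, attribute='rack'):
    """Pick the offer that maximizes diversity of `attribute`.

    offers: list of dicts with an 'attributes' mapping (illustrative shape only).
    placed_values: attribute values of the hosts already running instances.
    """
    counts = Counter(placed_values)
    # Prefer the offer whose attribute value is least represented among the
    # already-placed instances; ties fall back to list order.
    return min(offers, key=lambda o: counts[o['attributes'][attribute]])

# Example: two instances already on rack r1, one on r2 -> prefer the r3 offer.
offers = [{'attributes': {'rack': 'r1'}},
          {'attributes': {'rack': 'r2'}},
          {'attributes': {'rack': 'r3'}}]
print(pick_offer(offers, ['r1', 'r1', 'r2']))  # {'attributes': {'rack': 'r3'}}
```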