Yeah, we have a dozen or so Kafka consumer jobs running in our cluster, each with about 40 instances.
> On Mar 30, 2017, at 2:06 PM, David McLaughlin <da...@dmclaughlin.com> wrote:
>
> There is absolutely a need for custom hook points in the scheduler (injecting
> default constraints into running tasks, for example). I don't think users should
> be asked to write custom scheduling algorithms to solve the problems in this
> thread, though. There are also huge downsides to exposing the internals of
> scheduling as part of a plugin API.
>
> Out of curiosity, do your Kafka consumers span multiple jobs? Otherwise host
> constraints solve that problem, right?
>
>> On Mar 30, 2017, at 10:34 AM, Rick Mangi <r...@chartbeat.com> wrote:
>>
>> I think the complexity is a great rationale for having a pluggable
>> scheduling layer. Aurora is very flexible and people use it in many
>> different ways. Giving users more flexibility in how jobs are scheduled
>> seems like it would be a good direction for the project.
>>
>>> On Mar 30, 2017, at 12:16 PM, David McLaughlin <dmclaugh...@apache.org> wrote:
>>>
>>> I think this is more complicated than multiple scheduling algorithms. The
>>> problem you'll end up having if you try to solve this in the scheduling
>>> loop is when resources are unavailable because there are preemptible tasks
>>> running on them, rather than hosts being down. Right now the fact that the
>>> task cannot be scheduled is important, because it triggers preemption and
>>> will make room. An alternative algorithm that tries at all costs to
>>> schedule the task in the TaskAssigner could decide to place the task in a
>>> non-ideal slot and leave a preemptible task running instead.
>>>
>>> It's also important to think of the knock-on effects here when we move to
>>> offer affinity (i.e. the current Dynamic Reservation proposal). If you've
>>> made this non-ideal compromise to get things scheduled, that decision will
>>> basically be permanent until the host you're on goes down. At least with
>>> how things work now, with each scheduling attempt the job has a fresh
>>> chance of being put in an ideal slot.
>>>
>>>> On Thu, Mar 30, 2017 at 8:12 AM, Rick Mangi <r...@chartbeat.com> wrote:
>>>>
>>>> Sorry for the late reply, but I wanted to chime in here as wanting to see
>>>> this feature. We run a medium-size cluster (around 1000 cores) in EC2, and I
>>>> think we could get better usage of the cluster with more control over the
>>>> distribution of job instances. For example, it would be nice to limit the
>>>> number of Kafka consumers running on the same physical box.
>>>>
>>>> Best,
>>>>
>>>> Rick
>>>>
>>>>> On 2017-03-06 14:44 (-0400), Mauricio Garavaglia <m...@gmail.com> wrote:
>>>>> Hello!
>>>>>
>>>>> I have a job that has multiple instances (>100) that I'd like to spread
>>>>> across the hosts in a cluster. Using a constraint such as "limit=host:1"
>>>>> doesn't work quite well, as I have more instances than nodes.
>>>>>
>>>>> As a workaround I increased the limit value to something like
>>>>> ceil(instances/nodes). But now the problem happens if a bunch of nodes go
>>>>> down (think a whole rack dies), because the instances will not run until
>>>>> they are back, even though we may have spare capacity on the rest of the
>>>>> hosts that we'd like to use. In that scenario, the job availability may be
>>>>> affected because it's running with fewer instances than expected. On a
>>>>> smaller scale, the same approach would also apply if you want to spread
>>>>> tasks across racks or availability zones.
>>>>> I'd like to have one instance of a
>>>>> job per rack (failure domain), but if that rack goes down, the
>>>>> instance can be spawned on a different rack.
>>>>>
>>>>> I thought we could have a scheduling constraint to "spread" instances
>>>>> across a particular host attribute; instead of vetoing an offer right away,
>>>>> we check where the other instances of a task are running, looking at a
>>>>> particular attribute of the host. We try to maximize the number of different
>>>>> values of a particular attribute (rack, hostname, etc.) across the task
>>>>> instance assignments.
>>>>>
>>>>> What do you think? Did something like this come up in the past? Is it
>>>>> feasible?
>>>>>
>>>>> Mauricio
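For anyone skimming the thread, the workaround Mauricio describes looks roughly like this in a .aurora file (the job name, command, resources, and the limit value below are made up for illustration; limit:3 stands in for ceil(100 instances / 40 hosts)):

consume = Process(name = 'consume', cmdline = './run_consumer.sh')

consumer_task = Task(
  processes = [consume],
  resources = Resources(cpu = 1.0, ram = 1*GB, disk = 1*GB))

jobs = [Service(
  cluster = 'example',
  role = 'kafka',
  environment = 'prod',
  name = 'consumer',
  instances = 100,
  task = consumer_task,
  # The per-host limit spreads instances while all hosts are up, but once a
  # host reaches the limit the scheduler will not place displaced instances
  # there, even if that's the only capacity left after a rack failure.
  constraints = {'host': 'limit:3'})]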
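And just to make the proposed "spread" semantics concrete, here is a toy sketch — not Aurora scheduler code, just plain Python with invented names and data shapes (pick_offer, the 'rack' attribute, and dict-shaped offers are all hypothetical). Instead of vetoing, it prefers the offer whose attribute value currently holds the fewest instances of the job, which maximizes the number of distinct values used:

from collections import Counter

def pick_offer(offers, running_instances, attribute='rack'):
    # Count how many instances of the job already run on each value of the
    # attribute (rack, host, availability zone, ...).
    load = Counter(inst[attribute] for inst in running_instances)
    # Prefer the offer on the least-loaded attribute value; ties are broken
    # arbitrarily here.
    return min(offers, key=lambda offer: load[offer[attribute]])

# Two instances already on rack-a, one on rack-b, none yet on rack-c:
running = [{'rack': 'rack-a'}, {'rack': 'rack-a'}, {'rack': 'rack-b'}]
offers = [{'rack': 'rack-a'}, {'rack': 'rack-c'}]
print(pick_offer(offers, running))  # -> {'rack': 'rack-c'}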