Paul -

Great write-up.

Your description of Llama and YARN is both informative and troubling for a
potential cluster administrator. Looking at this solution, it would appear
that to use YARN with Llama, the "citizen" (in this case Drill) would have
to be an extremely good citizen and honor all requests from Llama related
to deallocation and resource limits, while in reality there is no
enforcement mechanism.  Not that I don't think Drill is a great tool
written well by great people, but I don't know if I would want to leave my
cluster SLAs up to Drillbits doing the self-regulation.  Edge cases and the
like that cause a Drillbit to start taking more resources would be very
impactful to a cluster, and with more and more people moving to highly
concurrent, multi-tenant solutions, this becomes a HUGE challenge.

Obviously, dynamic allocation (flexing up and down to use "spare" cluster
resources) is very important to many cluster administrators and architects,
but if I had to guess, SLAs and workload guarantees would rank higher.

The Llama approach seems too much of a "house of cards" to me to be
viable, and I worry that long term it may not be best for a product like
Drill. I think our goal should be to play nice with others; if our core
integration philosophy is playing nice with others, it will only help
adoption and encourage people to give Drill a try.  So back to Drill on
YARN (natively)...

A few questions around this.  You mention that resource allocations are
mostly a gentleman's agreement. Can you explore that a bit more?  I do
believe there is cgroup support in YARN.  (I know the Myriad project is
looking to use cgroups.)  So is this gentleman's agreement more about when
cgroups is NOT enabled?  That is, it is only the word of the process
running in the container in YARN?  If this is the case, has there been any
research on the stability of cgroups and their implementation in YARN?
Basically, a poll: Are you using YARN? If so, are you using cgroups? If
not, why not? If you are using them, any issues?  This may be helpful in
what we are looking to do with Drill.
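
For anyone following along: my (possibly incomplete) understanding is that
CPU enforcement in YARN only kicks in when the NodeManager is configured to
use the LinuxContainerExecutor with the cgroups resources handler, roughly
like this in yarn-site.xml (property names as I read them in the Hadoop
docs; values are just an example, so please correct me if I have this
wrong):

  <!-- Example only: enable cgroup-based CPU enforcement in the NodeManager -->
  <property>
    <name>yarn.nodemanager.container-executor.class</name>
    <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
  </property>
  <property>
    <name>yarn.nodemanager.linux-container-executor.resources-handler.class</name>
    <value>org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler</value>
  </property>
  <property>
    <name>yarn.nodemanager.linux-container-executor.cgroups.hierarchy</name>
    <value>/hadoop-yarn</value>
  </property>
  <property>
    <name>yarn.nodemanager.linux-container-executor.cgroups.mount</name>
    <value>true</value>
  </property>

Without something like that, my reading is that a vcore allocation is just
bookkeeping, i.e. the gentleman's agreement you describe.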

"Hanifi’s work will allow us to increase or decrease the number of cores we
consume."  Do you have any JIRAs I can follow on this, I am very interested
in this. One of the benefits of CGroups in Mesos as it relates to CPU
shares is a sorta built in Dynamic allocation. And it would be interesting
to test a Yarn Cluster with Cgroups enabled (once a basic Yarn Aware Drill
bit is enabled) to see if Yarn reacts the same way.

Basically, when I run a Drillbit on a node with cgroup isolation enabled in
Marathon on Mesos, let's say I have 16 total cores on the node. For me, I
run my Mesos agent with "14" available vcores. Why? Static allocation of 2
vcores for MapR-FS.  Those 14 vcores are now available to tasks on the
agent.  When I start the Drillbit, let's say I allocate 8 vcores to the
Drillbit in Marathon.  Drill runs queries, and let's say the actual CPU
usage on this node is minimal at the time; Drill, because it is not
currently CPU aware, takes all the CPU it can (it will use all 16 cores).
When the query finishes it goes back to 0. But what happens if MapR-FS is
heavily using its 2 cores? Well, cgroups detects the contention and limits
Drill, because Drill is only allocated 8 shares of the 14 it is aware of;
this gives priority to the MapR-FS operations. Even more so, if there are
other Mesos tasks asking for CPU shares, Drill's CPU share is scaled back,
not by telling Drill it can't use cores, but by processing what Drill is
trying to do more slowly relative to the rest of the workloads.  I know I
am dumbing this down, but that's how I understand cgroups working.
Basically, I was very concerned when I first started running Drill queries
on Mesos, and I posted to the Mesos list, where some people smarter than I
took the time to explain things. (Vinode, you are lurking on this list,
thanks again!)

In a way, this is actually a nice side effect of cgroup isolation: from
Drill's perspective it gets all the CPU, and is only scaled back on
contention.  So, my long explanation here is to bring things back to the
YARN/cgroups/gentleman's agreement comment. I really want to understand
this. As a cluster administrator, I can guarantee a level of resources
with Mesos; can I get that same guarantee in YARN? Is it only with certain
settings?  I just want to be 100% clear that if we go this route and make
Drill work on YARN, our documentation/instructions are explicit about what
we are giving the user on YARN.  To me, a bad situation would occur when
someone thinks all will be well when they run Drill on YARN, and because
they are not aware of their own settings (say, not enabling cgroups), they
blame Drill for breaking something.

So that falls back to memory and scaling memory in Drill.  Memory, for
obvious reasons, can't operate like CPU with cgroups: you can't allocate
all the memory to all the things and then scale back on contention. Being
a complete neophyte on the inner workings of Drill memory, I wonder what
options would exist for allocating memory.  Could we trigger events that
adjust up and down the memory a given Drillbit can use, so it self-limits?
It currently self-limits because we set the memory settings in
drill-env.sh.  But what about changing that at a later time?  Is it easier
to change direct memory limits rather than heap?
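
For reference, the settings I mean in drill-env.sh look roughly like this
on my install (example values only):

  # Example values; heap is fixed when the Drillbit JVM starts, and
  # direct memory is an upper bound on what Drill may allocate off-heap.
  export DRILL_HEAP="8G"
  export DRILL_MAX_DIRECT_MEMORY="8G"

Both are read once when the Drillbit starts, which is exactly why I am
asking whether they can be changed at a later time.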

Hypothesis 1: If direct memory isn't allocated (i.e. a Drillbit is idle),
then setting what it could POTENTIALLY use if a query were to come in
would be easier than actually deallocating heap that's been used.

Hypothesis 2: Direct memory, if it's truly deallocated when not in use, is
more about the limit of what it could use than about allocating or
deallocating memory. Hence a nice step one may be to allow this limit to
change as needed by an Application Master in YARN (or a Mesos framework).
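
If I understand the Drill options correctly (an assumption on my part, so
please correct me), there is already a runtime-settable system option that
caps per-query direct memory on each node, which suggests the direct
memory side is already more adjustable than heap:

  -- Example only: raise the per-node, per-query direct memory cap (bytes)
  ALTER SYSTEM SET `planner.memory.max_query_memory_per_node` = 8589934592;

Something along those lines, but per-Drillbit and driven by an Application
Master, is the kind of knob I am imagining.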

If the work to change the limit on direct memory usage is easier, it may be
a good first step (assuming I am not completely wrong on memory
allocation). If we have to statically allocate heap, and it's a big change
in code to make that dynamic, but direct memory is easy to change, that's
a great first feature without boiling the ocean. Obviously lots of
assumptions here, but I am just thinking out loud.

Paul - When it comes to applications in YARN, and the containers that the
Application Master allocates, can containers be joined?  Let's say I am an
Application Master, and I allocated 4 CPU cores and 16 GB of RAM to a
Drillbit (8 for heap and 8 for direct).  Then at a later time I want to add
more memory to the Drillbit. If my assumptions about direct memory in Drill
hold, could my Application Master tell the Drillbit, "OK, you can use 16 GB
of direct memory now"? That is, the AM asks the RM for 8 more GB of RAM on
that node and the RM agrees; does it allocate another container, can the
existing one just be resized, or would that not work? I guess what I am
describing here is sort of what Llama is doing, except I am actually
talking about the ability to enforce the quotas. This may be a question
that fits into your discussion on resizing YARN containers more than
anything.

So I just tossed out a bunch of ideas here to keep the discussion running.
Drill devs, I would love a better understanding of the memory allocation
mechanisms within Drill (high level, neophyte here).  I do feel, as a
cluster admin, as I have said, that the Llama approach (now that I
understand it better) would worry me, especially in a multi-tenant
cluster.  And as you said, Paul, it "feels" hacky.

Thanks for this discussion; it's a great opportunity for Drill adoption as
clusters go more and more multi-tenant/multi-use.

John




On Fri, Mar 25, 2016 at 5:45 PM, Paul Rogers <prog...@maprtech.com> wrote:

> Hi Jacques,
>
> Llama is a very interesting approach; I read their paper [1] early on; just
> went back and read it again. Basically, Llama (as best as I can tell) has a
> two-part solution.
>
> First, Impala is run off-YARN (that is, not in a YARN container). Llama
> uses “dummy” containers to inform YARN of Impala’s resource usage. They can
> grow/shrink static allocations by launching more dummy containers. Each
> dummy container does nothing other than inform off-YARN Impala of the
> container resources. Rather clever, actually, even if it “abuses the
> software” a bit.
>
> Secondly, Llama is able to dynamically grab spare YARN resources on each
> node. Specifically, Llama runs a Node Manager (NM) plugin that watches
> actual node usage. The plugin detects the free NM resources and informs
> Impala of them. Impala then consumes the resources as needed. When the NM
> allocates a new container, the plugin informs Impala which relinquishes the
> resources. All this works because YARN allocations are mostly a gentleman’s
> agreement. Again, this is pretty clever, but only one app per node can play
> this game.
>
> The Llama approach could work for Drill. The benefit is that Drill runs as
> it does today. Hanifi’s work will allow us to increase or decrease the
> number of cores we consume. The draw-back is that Drill is not yet ready to
> play the memory game: it can’t release memory back to the OS when
> requested. Plus, the approach just smells like a hack.
>
> The “pure-YARN” approach would be to let YARN start/stop the Drill-bits.
> The user can grow/shrink Drill resources by starting/stopping Drill-bits.
> (This is simple to do if one ignores data locality and starts each
> Drill-bit on a separate node. It is a bit more work if one wants to
> preserve data locality by being rack-aware, or by running multiple
> drill-bits per node.)
>
> YARN has been working on the ability to resize running containers. (See
> YARN-1197 - Support changing resources of an allocated container [2]) Once
> that is available, we can grow/shrink existing Drill-bits (assuming that
> Drill itself is enhanced as discussed above.) The promise of resizable
> containers also suggests that the “pure-YARN” approach is workable.
>
> Once resizable containers are available, one more piece is needed to let
> Drill use free resources. Some cluster-wide component must detect free
> resources and offer them to applications that want them, deciding how to
> divvy up the resources between, say, Drill and Impala. The same piece would
> revoke resources when paying YARN customers need them.
>
> Of course, if the resizable container feature comes too late, or does not
> work well, we still have the option of going off-YARN using the Llama
> trick. But the Llama trick does nothing to do the cluster-wide coordination
> discussed above.
>
> So, the thought is: start simple with a “stock” YARN app. Then, we can add
> bells and whistles as we gain experience and as YARN offers more
> capabilities.
>
> The nice thing about this approach is that the same idea plays well with
> Mesos (though the implementation is different).
>
> Thanks,
>
> - Paul
>
> [1] http://cloudera.github.io/llama/ <http://cloudera.github.io/llama/>
> [2] https://issues.apache.org/jira/browse/YARN-1197 <
> https://issues.apache.org/jira/browse/YARN-1197>
>
> > On Mar 24, 2016, at 2:34 PM, Jacques Nadeau <jacq...@dremio.com> wrote:
> >
> > Your proposed allocation approach makes a lot of sense. I think it will
> > solve a large number of use cases. Thanks for giving an overview of the
> > different frameworks. I wonder if they got too focused on the simple use
> > case....
> >
> > Have you looked at LLama to see whether we could extend it for our needs?
> > Its Apache licensed and probably has at least a start at a bunch of
> things
> > we're trying to do.
> >
> > https://github.com/cloudera/llama
> >
> > --
> > Jacques Nadeau
> > CTO and Co-Founder, Dremio
> >
> > On Tue, Mar 22, 2016 at 7:42 PM, Paul Rogers <prog...@maprtech.com>
> wrote:
> >
> >> Hi Jacques,
> >>
> >> I’m thinking of “semi-static” allocation at first. Spin up a cluster of
> >> Drill-bits, after which the user can add or remove nodes while the
> cluster
> >> runs. (The add part is easy, the remove part is a bit tricky since we
> don’t
> >> yet have a way to gracefully shut down a Drill-bit.) Once we get the
> basics
> >> to work, we can incrementally try out dynamics. For example, someone
> could
> >> whip up a script to look at load and use the proposed YARN client app to
> >> adjust resources. Later, we can fold dynamic load management into the
> >> solution once we’re sure what folks want.
> >>
> >> I did look at Slider, Twill, Kitten and REEF. Kitten is too basic. I had
> >> great hope for Slider. But, it turns out that Slider and Weave have each
> >> built an elaborate framework to isolate us from YARN. The Slider
> framework
> >> (written in Python) seems harder to understand than YARN itself. At
> least,
> >> one has to be an expert in YARN to understand what all that Python code
> >> does. And, just looking at the class count in the Twill Javadoc was
> >> overwhelming. Slider and Twill have to solve the general case. If we
> build
> >> our own Java solution, we only have to solve the Drill case, which is
> >> likely much simpler.
> >>
> >> A bespoke solution would seem to offer some other advantages. It lets us
> >> do things like integrate ZK monitoring so we can learn of zombie drill
> bits
> >> (haven’t exited, but not sending heartbeat messages.) We can also gather
> >> metrics and historical data about the cluster as a whole. We can try out
> >> different cluster topologies. (Run Drill-bits on x of y nodes on a rack,
> >> say.) And, we can eventually do the dynamic load management we discussed
> >> earlier.
> >>
> >> But first, I look forward to hearing what others have tried and what
> we’ve
> >> learned about how people want to use Drill in a production YARN cluster.
> >>
> >> Thanks,
> >>
> >> - Paul
> >>
> >>
> >>> On Mar 22, 2016, at 5:45 PM, Jacques Nadeau <jacq...@dremio.com>
> wrote:
> >>>
> >>> This is great news, welcome!
> >>>
> >>> What are you thinking in regards to static versus dynamic resource
> >>> allocation? We have some conversations going regarding workload
> >> management
> >>> but they are still early so it seems like starting with user-controlled
> >>> allocation makes sense initially.
> >>>
> >>> Also, have you spent much time evaluating whether one of the existing
> >> YARN
> >>> frameworks such as Slider would be useful? Does anyone on the list have
> >> any
> >>> feedback on the relative merits of these technologies?
> >>>
> >>> Again, glad to see someone picking this up.
> >>>
> >>> Jacques
> >>>
> >>>
> >>> --
> >>> Jacques Nadeau
> >>> CTO and Co-Founder, Dremio
> >>>
> >>> On Tue, Mar 22, 2016 at 4:58 PM, Paul Rogers <prog...@maprtech.com>
> >> wrote:
> >>>
> >>>> Hi All,
> >>>>
> >>>> I’m a new member of the Drill Team here at MapR. We’d like to take a
> >> look
> >>>> at running Drill on YARN for production customers. JIRA suggests some
> >> early
> >>>> work may have been done (DRILL-142 <
> >>>> https://issues.apache.org/jira/browse/DRILL-142>, DRILL-1170 <
> >>>> https://issues.apache.org/jira/browse/DRILL-1170>, DRILL-3675 <
> >>>> https://issues.apache.org/jira/browse/DRILL-3675>).
> >>>>
> >>>> YARN is a complex beast and the Drill community is large and growing.
> >> So,
> >>>> a good place to start is to ask if anyone has already done work on
> >>>> integrating Drill with YARN (see DRILL-142)?  Or has thought about
> what
> >>>> might be needed?
> >>>>
> >>>> DRILL-1170 (YARN support for Drill) seems a good place to gather
> >>>> requirements, designs and so on. I’ve posted a “starter set” of
> >>>> requirements to spur discussion.
> >>>>
> >>>> Thanks,
> >>>>
> >>>> - Paul
> >>>>
> >>>>
> >>
> >>
>
>
