Hey Paul and Jacques,

Great discussion here. Paul, I believe we met a week or two ago on a call.

I have been running Drill successfully and powerfully (multi-tenant, etc.) using Apache Mesos and Marathon. While I didn't write a Mesos framework for Drill, Marathon does give me some very nice capabilities for managing my Drill bits. Some of the features talked about here would be extremely helpful in my Mesos/Marathon work as well if they were built into Drill; however, I don't want to muddy the waters by taking away from the YARN discussion. I can go into details on my Mesos setup to show how I am approaching things but, like I said, I will do so only on request so as not to clutter the conversation.

My initial "ya, I want that" reaction is to a way to send a signal, perhaps via a REST API, to a specific Drill bit telling it to enter "Drain Mode". In this mode, all currently running queries and fragments continue to execute as expected, but the bit won't be the foreman for any new queries, nor will it accept new fragments. This basically allows the graceful shutdown Paul spoke of, and it would be extremely helpful in shutting nodes down with a minimum of user impact.

On Tue, Mar 22, 2016 at 9:42 PM, Paul Rogers <prog...@maprtech.com> wrote:

> Hi Jacques,
>
> I’m thinking of “semi-static” allocation at first. Spin up a cluster of
> Drill-bits, after which the user can add or remove nodes while the cluster
> runs. (The add part is easy, the remove part is a bit tricky since we don’t
> yet have a way to gracefully shut down a Drill-bit.) Once we get the basics
> to work, we can incrementally try out dynamics. For example, someone could
> whip up a script to look at load and use the proposed YARN client app to
> adjust resources. Later, we can fold dynamic load management into the
> solution once we’re sure what folks want.
>
> I did look at Slider, Twill, Kitten and REEF. Kitten is too basic. I had
> great hope for Slider. But, it turns out that Slider and Weave have each
> built an elaborate framework to isolate us from YARN.
> The Slider framework (written in Python) seems harder to understand than
> YARN itself. At least, one has to be an expert in YARN to understand what
> all that Python code does. And, just looking at the class count in the
> Twill Javadoc was overwhelming. Slider and Twill have to solve the general
> case. If we build our own Java solution, we only have to solve the Drill
> case, which is likely much simpler.
>
> A bespoke solution would seem to offer some other advantages. It lets us
> do things like integrate ZK monitoring so we can learn of zombie drill bits
> (haven’t exited, but not sending heartbeat messages.) We can also gather
> metrics and historical data about the cluster as a whole. We can try out
> different cluster topologies. (Run Drill-bits on x of y nodes on a rack,
> say.) And, we can eventually do the dynamic load management we discussed
> earlier.
>
> But first, I look forward to hearing what others have tried and what we’ve
> learned about how people want to use Drill in a production YARN cluster.
>
> Thanks,
>
> - Paul
>
> On Mar 22, 2016, at 5:45 PM, Jacques Nadeau <jacq...@dremio.com> wrote:
> >
> > This is great news, welcome!
> >
> > What are you thinking in regards to static versus dynamic resource
> > allocation? We have some conversations going regarding workload management
> > but they are still early so it seems like starting with user-controlled
> > allocation makes sense initially.
> >
> > Also, have you spent much time evaluating whether one of the existing YARN
> > frameworks such as Slider would be useful? Does anyone on the list have any
> > feedback on the relative merits of these technologies?
> >
> > Again, glad to see someone picking this up.
> >
> > Jacques
> >
> > --
> > Jacques Nadeau
> > CTO and Co-Founder, Dremio
> >
> > On Tue, Mar 22, 2016 at 4:58 PM, Paul Rogers <prog...@maprtech.com> wrote:
> >
> >> Hi All,
> >>
> >> I’m a new member of the Drill Team here at MapR. We’d like to take a look
> >> at running Drill on YARN for production customers. JIRA suggests some early
> >> work may have been done (DRILL-142
> >> <https://issues.apache.org/jira/browse/DRILL-142>, DRILL-1170
> >> <https://issues.apache.org/jira/browse/DRILL-1170>, DRILL-3675
> >> <https://issues.apache.org/jira/browse/DRILL-3675>).
> >>
> >> YARN is a complex beast and the Drill community is large and growing. So,
> >> a good place to start is to ask if anyone has already done work on
> >> integrating Drill with YARN (see DRILL-142)? Or has thought about what
> >> might be needed?
> >>
> >> DRILL-1170 (YARN support for Drill) seems a good place to gather
> >> requirements, designs and so on. I’ve posted a “starter set” of
> >> requirements to spur discussion.
> >>
> >> Thanks,
> >>
> >> - Paul
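
P.S. To make the "Drain Mode" idea above a bit more concrete, here is a rough sketch in Python of the admission logic I have in mind. To be clear, none of this is existing Drill code: the class `DrainableBit` and every method name on it are hypothetical, and a real version would live inside the Drill bit itself (in Java) behind whatever REST endpoint or signal the project settles on.

```python
import threading


class DrainableBit:
    """Hypothetical model of one Drill bit's admission logic (not Drill code)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._draining = False
        self._running = set()  # ids of fragments currently executing here

    def enter_drain_mode(self):
        """What an (assumed) REST handler or signal handler would call."""
        with self._lock:
            self._draining = True

    def can_be_foreman(self):
        """A draining bit should not become foreman for new queries."""
        with self._lock:
            return not self._draining

    def accept_fragment(self, fragment_id):
        """Admit a new fragment unless draining; return True if admitted."""
        with self._lock:
            if self._draining:
                return False
            self._running.add(fragment_id)
            return True

    def fragment_finished(self, fragment_id):
        """Record that a previously admitted fragment has completed."""
        with self._lock:
            self._running.discard(fragment_id)

    def safe_to_shut_down(self):
        """True once draining was requested and no fragments remain."""
        with self._lock:
            return self._draining and not self._running
```

The way I would expect a scheduler (Marathon, or a YARN application master) to use something like this: ask the bit to drain, poll until `safe_to_shut_down()` reports true, and only then kill the process. Work that was already running is never interrupted; the bit simply stops taking on anything new.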