I've been under the weather the last few days and haven't made much progress. Let me see if I can get you something tomorrow.
On Mar 15, 2013, at 2:36 PM, David Alves <[email protected]> wrote:

> Hi Jacques
>
> Is there any chance we could get a preview of this physical plan stuff and the basic plumbing for distributed execution before the weekend? Maybe in a GitHub branch somewhere?
> It doesn't have to be complete or even running; I'd just like to make some progress with other stuff, and keeping it in line with whichever plumbing you already have would be great.
>
> Best
> David
>
> On Mar 13, 2013, at 3:12 PM, Jacques Nadeau <[email protected]> wrote:
>
>> I'm working on some physical plan stuff as well as some basic plumbing for distributed execution. It's very much in progress, so I need to clean things up a bit before we could collaborate / divide and conquer on it. Depending on your timing and availability, maybe I could put some of this together in the next couple of days so that you could plug in rather than reinvent. In the meantime, pushing forward the builder stuff, additional test cases on the reference interpreter, and/or thinking through the logical plan storage engine pushdown/rewrite could be very useful.
>>
>> Let me know your thoughts.
>>
>> thanks,
>> Jacques
>>
>> On Wed, Mar 13, 2013 at 9:47 AM, David Alves <[email protected]> wrote:
>>
>>> Hi Jacques
>>>
>>> I can assign issues to myself now, thanks.
>>> What you say wrt the logical/physical/execution layers sounds good.
>>> My main concern for the moment is to have something working as fast as possible, i.e. some daemons that I'd be able to deploy to a working HBase cluster and send work to in some form (the first step would be to treat it as a non-distributed engine where each daemon runs an instance of the prototype).
>>> Here's where I'd like to go next:
>>> - lay the groundwork for the daemons (scripts/RPC iface/wire protocol).
>>> - create an execution engine iface that abstracts over future implementations, and make it available through the RPC iface. This would sit in front of the ref impl for now and would be replaced by C++ down the line.
>>>
>>> I think we can probably concentrate on the capabilities iface a bit further down the line but, as a first approach, I see it simply providing the set of ops the engine is able to run internally.
>>> How to abstract locality/partitioning/schema capabilities is still not clear to me though; thoughts?
>>>
>>> David
>>>
>>> On Mar 13, 2013, at 11:12 AM, Jacques Nadeau <[email protected]> wrote:
>>>
>>>> I'm working on a presentation that will better illustrate the layers. There are actually three key plans. The thinking to date has been to break the plans down into logical, physical and execution. The third hasn't been expressed well here and is entirely an internal domain of the execution engine. Following some classic methods: Logical expresses what we want to do; Physical expresses how we want to do it (adding points of parallelization but not specifying particular amounts of parallelization or node-by-node assignments). The execution engine is then responsible for determining the amount of parallelization of a particular plan, taking into account system load (likely leveraging Berkeley's Sparrow work), task priority and specific data locality information, building sub-DAGs to be assigned to individual nodes, and executing the plan.
>>>>
>>>> So at the higher logical and physical levels, a single Scan and subsequent ScanPOP should be okay... (ScanROPs have a separate problem, since they ignore the level of separation we're planning for the real execution layer. This is why the current ref impl turns a single Scan into potentially a union of ScanROPs... not elegant, but logically correct.)
>>>>
>>>> The capabilities interface still needs to be defined for how a storage engine reveals its logical capabilities and thus consumes part of the plan.
>>>>
>>>> J
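To make the capabilities discussion above a bit more concrete, here is a minimal Java sketch of what such an interface might look like, roughly along the lines of David's "set of ops the engine is able to run internally". Everything in it is hypothetical: the StorageEngineCapabilities and HBaseCapabilities names, the LogicalOp enum and the canPushDown method are invented for illustration and are not part of the Drill prototype.

    // Hypothetical sketch only -- not part of the Drill prototype.
    // A storage engine advertises which logical operators it can absorb,
    // so the planner can decide how much of the logical plan to hand over.
    import java.util.EnumSet;
    import java.util.Set;

    public interface StorageEngineCapabilities {

      // Logical operator kinds a storage engine might run internally.
      enum LogicalOp { SCAN, FILTER, PROJECT, PARTIAL_AGGREGATE, JOIN }

      // The set of operators this engine can execute on its own
      // (e.g. an HBase engine might return only SCAN and FILTER).
      Set<LogicalOp> supportedOps();

      // Whether the engine can push down this particular operator instance,
      // e.g. a filter that maps onto a row-key range.
      boolean canPushDown(LogicalOp op, String operatorConfigJson);
    }

    // Example of how one engine might describe itself.
    class HBaseCapabilities implements StorageEngineCapabilities {
      @Override
      public Set<LogicalOp> supportedOps() {
        return EnumSet.of(LogicalOp.SCAN, LogicalOp.FILTER);
      }

      @Override
      public boolean canPushDown(LogicalOp op, String operatorConfigJson) {
        // A real implementation would inspect the operator; here we only
        // accept filters, assuming they translate to a key range.
        return op == LogicalOp.FILTER;
      }
    }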
>>>> On Tue, Mar 12, 2013 at 10:19 PM, David Alves <[email protected]> wrote:
>>>>
>>>>> Hi Lisen
>>>>>
>>>>> Some of what you are saying, like pushdown of ops such as filter, projection or partial aggregation below the storage engine scanner level, or sub-tree execution, is actively being discussed in issues DRILL-13 (Storage Engine Interface) and DRILL-15 (HBase storage engine); your input on these issues is most welcome.
>>>>>
>>>>> HBase in particular has the notion of endpoints/coprocessors/filters that allow pushing this down easily (this is also in line with what other parallel-database-over-NoSQL implementations like Tajo do).
>>>>> A possible approach is to have the optimizer change the order of the ops to place them below the storage engine scanner and let the SE impl deal with them internally.
>>>>>
>>>>> There are also some other pieces missing at the moment AFAIK, like a distributed metadata store, the Drill daemons, wiring, etc.
>>>>>
>>>>> So in summary, you're absolutely right, and if you're particularly interested in the HBase SE impl (as I am, for the moment) I'd be interested in collaborating.
>>>>>
>>>>> Best
>>>>> David
>>>>>
>>>>> On Mar 12, 2013, at 11:44 PM, Lisen Mu <[email protected]> wrote:
>>>>>
>>>>>> Hi David,
>>>>>>
>>>>>> Very nice to see your effort on this.
>>>>>>
>>>>>> Hi Jacques,
>>>>>>
>>>>>> We are also extending the Drill prototype, to see if there is any chance it can meet our production needs. However, we find that implementing a performant HBase storage engine is not such straightforward work, and requires some workarounds. The problem is in the Scan interface.
>>>>>>
>>>>>> In Drill's physical plan model, ScanROP is in charge of the table scan. The storage engine provides output for a whole data source, a CSV file for example. That's sufficient for an input source like a plain file, but for HBase it's not very efficient, if not impossible, to let ScanROP retrieve a whole HTable into Drill. Storage engines like HBase should have some ability to do part of the DrQL query, like Filter, if a filter can be performed by specifying startRowKey and endRowKey. A storage engine like MySQL could do more, even Join.
>>>>>>
>>>>>> Generally, it would be clearer if a ScanROP were mapped to a sub-DAG of the logical plan DAG instead of a single Scan node in the logical plan. If so, more implementation-specific information would couple into the plan optimization & transformation phase. I guess that's the price to pay when optimization comes, or is there another way I failed to see?
>>>>>>
>>>>>> Please correct me if anything is wrong.
>>>>>>
>>>>>> thanks,
>>>>>>
>>>>>> Lisen
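Lisen's point about pushing a filter down as a row-key range is easy to picture with the stock HBase client API of that era. A minimal sketch follows; the table name, column family/qualifier and key bounds are invented for illustration, and how a ScanROP would actually drive such a scan is exactly what is still open in this thread.

    // Sketch of the kind of scan a storage engine could issue once a
    // DrQL filter has been translated into a row-key range.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class KeyRangeScanExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "events");            // hypothetical table

        Scan scan = new Scan();
        scan.setStartRow(Bytes.toBytes("user123|20130301"));  // pushed-down filter becomes
        scan.setStopRow(Bytes.toBytes("user123|20130315"));   // a key range, not a full scan
        scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("value"));

        ResultScanner scanner = table.getScanner(scan);
        try {
          for (Result row : scanner) {
            // Hand each row to the downstream operator instead of
            // materializing the whole HTable inside Drill.
            System.out.println(Bytes.toString(row.getRow()));
          }
        } finally {
          scanner.close();
          table.close();
        }
      }
    }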
>>>>>> On Wed, Mar 13, 2013 at 9:33 AM, David Alves <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Jacques
>>>>>>>
>>>>>>> I've submitted a first-pass patch to DRILL-15.
>>>>>>> I did this mostly because HBase will be my main target and because I wanted to get a feel for what would be a nice interface for DRILL-13. I have some thoughts that I will post soon.
>>>>>>> BTW: I still can't assign issues to myself in JIRA, did you forget to add me as a contributor?
>>>>>>>
>>>>>>> Best
>>>>>>> David
>>>>>>>
>>>>>>> On Mar 11, 2013, at 2:13 PM, Jacques Nadeau <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hey David,
>>>>>>>>
>>>>>>>> These sound good. I've added you as a contributor on JIRA so you can assign tasks to yourself. I think 45 and 46 are good places to start. 15 depends on 13, and working on the two hand in hand would probably be a good idea. Maybe we could have a design discussion on 15 and 13 here once you have some time to focus on it.
>>>>>>>>
>>>>>>>> Jacques
>>>>>>>>
>>>>>>>> On Mon, Mar 11, 2013 at 3:02 AM, David Alves <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi All
>>>>>>>>>
>>>>>>>>> I have a new academic project for which I'd like to use Drill, since none of the other parallel-database-over-Hadoop/NoSQL implementations fit just right.
>>>>>>>>> To this end I've been tinkering with the prototype, trying to find where I'd be most useful.
>>>>>>>>>
>>>>>>>>> Here's where I'd like to start, if you agree:
>>>>>>>>> - implement the HBase storage engine (DRILL-15)
>>>>>>>>> - start with simple scanning and pushdown of selection/projection
>>>>>>>>> - implement the LogicalPlanBuilder (DRILL-45)
>>>>>>>>> - set up coding style in the wiki (formatting/imports etc., DRILL-46)
>>>>>>>>> - create builders for all logical plan elements / make logical plans immutable (there's no issue for this; I'd like to hear your thoughts first).
>>>>>>>>>
>>>>>>>>> Please let me know your thoughts, and if you agree please assign the issues to me (it seems that I can't assign them myself).
>>>>>>>>>
>>>>>>>>> Best
>>>>>>>>> David Alves
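As a rough illustration of the last bullet in David's list above (builders for logical plan elements, with the built elements immutable), here is a minimal sketch for a hypothetical Scan element. The field names and layout are invented and are not the prototype's actual logical plan classes.

    // Hypothetical sketch of an immutable logical-plan element with a builder.
    // Field names (storageEngine, selection) are invented for illustration.
    public final class Scan {
      private final String storageEngine;
      private final String selection; // engine-specific selection, kept opaque here

      private Scan(Builder b) {
        this.storageEngine = b.storageEngine;
        this.selection = b.selection;
      }

      public String getStorageEngine() { return storageEngine; }
      public String getSelection() { return selection; }

      public static Builder builder() { return new Builder(); }

      public static final class Builder {
        private String storageEngine;
        private String selection;

        public Builder storageEngine(String engine) { this.storageEngine = engine; return this; }
        public Builder selection(String selection) { this.selection = selection; return this; }

        public Scan build() {
          if (storageEngine == null) {
            throw new IllegalStateException("storageEngine is required");
          }
          return new Scan(this); // the built element is immutable
        }
      }
    }

    // Usage: Scan scan = Scan.builder().storageEngine("hbase").selection("{...}").build();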
