that'd be great thanks. -david
On Mar 15, 2013, at 8:51 PM, Jacques Nadeau <[email protected]> wrote:
> I've been under the weather the last few days and haven't made much
> progress. Let me see if I can get you something tomorrow.
>
> On Mar 15, 2013, at 2:36 PM, David Alves <[email protected]> wrote:
>
>> Hi Jacques
>>
>> Is there any chance we could get a preview of this physical plan stuff and
>> basic plumbing for distributed execution before the weekend? Maybe in a
>> github branch somewhere?
>> I mean it doesn't have to be complete or even running, I'd just like to
>> make some progress with other stuff and keeping it in line with whichever
>> plumbing you already have would be great.
>>
>> Best
>> David
>>
>> On Mar 13, 2013, at 3:12 PM, Jacques Nadeau <[email protected]> wrote:
>>
>>> I'm working on some physical plan stuff as well as some basic plumbing for
>>> distributed execution. It's very much in progress so I need to clean things up a
>>> bit before we could collaborate/divide and conquer on it. Depending on
>>> your timing and availability, maybe I could put some of this together in
>>> the next couple days so that you could plug in rather than reinvent. In
>>> the meantime, pushing forward the builder stuff, additional test cases on
>>> the reference interpreter and/or thinking through the logical plan storage
>>> engine pushdown/rewrite could be very useful.
>>>
>>> Let me know your thoughts.
>>>
>>> thanks,
>>> Jacques
>>>
>>> On Wed, Mar 13, 2013 at 9:47 AM, David Alves <[email protected]> wrote:
>>>
>>>> Hi Jacques
>>>>
>>>> I can assign issues to me now, thanks.
>>>> What you say wrt the logical/physical/execution layers sounds
>>>> good.
>>>> My main concern, for the moment, is to have something working as
>>>> fast as possible, i.e. some daemons that I'd be able to deploy to a working
>>>> hbase cluster and send them work to do in some form (first step would be to
>>>> treat it as a non distributed engine where each daemon runs an instance of
>>>> the prototype).
>>>> Here's where I'd like to go next:
>>>> - lay the groundwork for the daemons (scripts/rpc iface/wiring
>>>> protocol).
>>>> - create an execution engine iface that allows abstracting future
>>>> implementations, and make it available through the rpc iface. This would
>>>> sit in front of the ref impl for now and would be replaced by cpp down the
>>>> line.
>>>>
>>>> I think we can probably concentrate on the capabilities iface a
>>>> bit down the line but, as a first approach, I see it simply providing a
>>>> simple set of ops that it is able to run internally.
>>>> How to abstract locality/partitioning/schema capabilities is still
>>>> not clear to me though, thoughts?
>>>>
>>>> David
>>>>
>>>> On Mar 13, 2013, at 11:12 AM, Jacques Nadeau <[email protected]> wrote:
>>>>
>>>>> I'm working on a presentation that will better illustrate the layers.
>>>>> There are actually three key plans. Thinking to date has been to break
>>>>> the plans down into logical, physical and execution. The third hasn't been
>>>>> expressed well here and is entirely an internal domain to the execution
>>>>> engine. Following some classic methods: Logical expresses what we want to
>>>>> do, Physical expresses how we want to do it (adding points of
>>>>> parallelization but not specifying particular amounts of parallelization or
>>>>> node by node assignments). The execution engine is then responsible for
>>>>> determining the amount of parallelization of a particular plan along with
>>>>> system load (likely leveraging Berkeley's Sparrow work), task priority and
>>>>> specific data locality information, building sub-dags to be assigned to
>>>>> individual nodes and executing the plan.
>>>>>
>>>>> So in the higher logical and physical levels, a single Scan and subsequent
>>>>> ScanPOP should be okay... (ScanROPs have a separate problem since they
>>>>> ignore the level of separation we're planning for the real execution layer.
>>>>> This is why the current ref impl turns a single Scan into potentially
>>>>> a union of ScanROPs... not elegant but logically correct.)
>>>>>
>>>>> The capabilities interface still needs to be defined for how a storage
>>>>> engine reveals its logical capabilities and thus consumes part of the plan.
>>>>>
>>>>> J
>>>>>
>>>>>
>>>>> On Tue, Mar 12, 2013 at 10:19 PM, David Alves <[email protected]> wrote:
>>>>>
>>>>>> Hi Lisen
>>>>>>
>>>>>> Some of what you are saying, like push down of ops like filter,
>>>>>> projection or partial aggregation below the storage engine scanner level,
>>>>>> or sub tree execution, are actively being discussed in issues DRILL-13
>>>>>> (Storage Engine Interface) and DRILL-15 (HBase storage engine); your input
>>>>>> in these issues is most welcome.
>>>>>>
>>>>>> HBase in particular has the notion of
>>>>>> endpoints/coprocessors/filters that allow pushing this down easily (this is
>>>>>> also in line with what other parallel database over nosql implementations
>>>>>> like tajo do).
>>>>>> A possible approach is to have the optimizer change the order of
>>>>>> the ops to place them below the storage engine scanner and let the SE impl
>>>>>> deal with it internally.
>>>>>>
>>>>>> There are also some other pieces missing at the moment AFAIK, like
>>>>>> a distributed metadata store, the drill daemons, wiring, etc.
>>>>>>
>>>>>> So in summary, you're absolutely right, and if you're particularly
>>>>>> interested in the HBase SE impl (as I am, for the moment) I'd be interested
>>>>>> in collaborating.
>>>>>>
>>>>>> Best
>>>>>> David
>>>>>>
>>>>>>
>>>>>> On Mar 12, 2013, at 11:44 PM, Lisen Mu <[email protected]> wrote:
>>>>>>
>>>>>>> Hi David,
>>>>>>>
>>>>>>> Very nice to see your effort on this.
>>>>>>>
>>>>>>> Hi Jacques,
>>>>>>>
>>>>>>> we are also extending drill prototype, to see if there is any chance to
>>>>>>> meet our production need.
>>>>>>> However, we find that implementing a performant
>>>>>>> HBase storage engine is not so straightforward, and requires some
>>>>>>> workarounds. The problem is in the Scan interface.
>>>>>>>
>>>>>>> In drill's physical plan model, ScanROP is in charge of table scan. Storage
>>>>>>> engine provides output for a whole data source, a csv file for example.
>>>>>>> It's sufficient for input source like plain file, but for hbase, it's not
>>>>>>> very efficient, if not impossible, to let ScanROP retrieve a whole htable
>>>>>>> into drill. Storage engines like HBase should have some ability to do part
>>>>>>> of the DrQL query, like Filter, if a filter can be performed by specifying
>>>>>>> startRowKey and endRowKey. Storage engine like mysql could do more, even
>>>>>>> Join.
>>>>>>>
>>>>>>> Generally, it would be more clear if a ScanROP is mapped to a sub-DAG of
>>>>>>> logical plan DAG instead of a single Scan node in logical plan. If so, more
>>>>>>> implementation-specific information would couple into the plan optimization
>>>>>>> & transformation phase. I guess that's the price to pay when optimization
>>>>>>> comes, or is there other way I failed to see?
>>>>>>>
>>>>>>> Please correct me if anything is wrong.
>>>>>>>
>>>>>>> thanks,
>>>>>>>
>>>>>>> Lisen
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Mar 13, 2013 at 9:33 AM, David Alves <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Jacques
>>>>>>>>
>>>>>>>> I've submitted a first pass patch to DRILL-15.
>>>>>>>> I did this mostly because HBase will be my main target and because
>>>>>>>> I wanted to get a feel of what would be a nice interface for DRILL-13. Have
>>>>>>>> some thoughts that I will post soon.
>>>>>>>> btw: I still can't assign issues to myself in JIRA, did you forget
>>>>>>>> to add me as a contributor?
>>>>>>>>
>>>>>>>> Best
>>>>>>>> David
>>>>>>>>
>>>>>>>> On Mar 11, 2013, at 2:13 PM, Jacques Nadeau <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hey David,
>>>>>>>>>
>>>>>>>>> These sound good. I've added you as a contributor on jira so you can assign
>>>>>>>>> tasks to yourself. I think 45 and 46 are good places to start. 15 depends
>>>>>>>>> on 13 and working on the two hand in hand would probably be a good idea.
>>>>>>>>> Maybe we could do a design discussion on 15 and 13 here once you have some
>>>>>>>>> time to focus on it.
>>>>>>>>>
>>>>>>>>> Jacques
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Mar 11, 2013 at 3:02 AM, David Alves <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi All
>>>>>>>>>>
>>>>>>>>>> I have a new academic project for which I'd like to use drill
>>>>>>>>>> since none of the other parallel database over hadoop/nosql implementations
>>>>>>>>>> fit just right.
>>>>>>>>>> To this goal I've been tinkering with the prototype trying to find
>>>>>>>>>> where I'd be most useful.
>>>>>>>>>>
>>>>>>>>>> Here's where I'd like to start, if you agree:
>>>>>>>>>> - implement HBase storage engine (DRILL-15)
>>>>>>>>>>   - start with simple scanning and push down of
>>>>>>>>>>     selection/projection
>>>>>>>>>> - implement the LogicalPlanBuilder (DRILL-45)
>>>>>>>>>> - setup coding style in the wiki (formatting/imports etc, DRILL-46)
>>>>>>>>>> - create builders for all logical plan elements/make logical plans
>>>>>>>>>>   immutable (no issue for this, I'd like to hear your thoughts first).
>>>>>>>>>>
>>>>>>>>>> Please let me know your thoughts, and if you agree please assign
>>>>>>>>>> the issues to me (it seems that I can't assign them myself).
>>>>>>>>>>
>>>>>>>>>> Best
>>>>>>>>>> David Alves
