I've been under the weather the last few days and haven't made much progress. Let me see if I can get you something tomorrow.
On Mar 15, 2013, at 2:36 PM, David Alves <[email protected]> wrote:

> Hi Jacques
>
> Is there any chance we could get a preview of this physical plan stuff and the basic plumbing for distributed execution before the weekend? Maybe in a GitHub branch somewhere?
> It doesn't have to be complete or even running; I'd just like to make some progress with other stuff, and keeping it in line with whichever plumbing you already have would be great.
>
> Best
> David
>
> On Mar 13, 2013, at 3:12 PM, Jacques Nadeau <[email protected]> wrote:
>
>> I'm working on some physical plan stuff as well as some basic plumbing for distributed execution. It's very much in progress, so I need to clean things up a bit before we could collaborate / divide and conquer on it. Depending on your timing and availability, maybe I could put some of this together in the next couple of days so that you could plug in rather than reinvent. In the meantime, pushing forward the builder stuff, additional test cases on the reference interpreter, and/or thinking through the logical plan storage engine pushdown/rewrite could be very useful.
>>
>> Let me know your thoughts.
>>
>> thanks,
>> Jacques
>>
>> On Wed, Mar 13, 2013 at 9:47 AM, David Alves <[email protected]> wrote:
>>
>>> Hi Jacques
>>>
>>> I can assign issues to myself now, thanks.
>>> What you say wrt the logical/physical/execution layers sounds good.
>>> My main concern for the moment is to have something working as fast as possible, i.e. some daemons that I'd be able to deploy to a working HBase cluster and send work to in some form (the first step would be to treat it as a non-distributed engine where each daemon runs an instance of the prototype).
>>> Here's where I'd like to go next:
>>> - lay the groundwork for the daemons (scripts/RPC iface/wire protocol).
>>> - create an execution engine iface that abstracts over future implementations, and make it available through the RPC iface. This would sit in front of the ref impl for now and would be replaced by C++ down the line.
>>>
>>> I think we can probably concentrate on the capabilities iface a bit further down the line but, as a first approach, I see it simply providing the set of ops the engine is able to run internally.
>>> How to abstract locality/partitioning/schema capabilities is still not clear to me though; thoughts?
>>>
>>> David
>>>
>>> On Mar 13, 2013, at 11:12 AM, Jacques Nadeau <[email protected]> wrote:
>>>
>>>> I'm working on a presentation that will better illustrate the layers. There are actually three key plans. The thinking to date has been to break the plans down into logical, physical and execution. The third hasn't been expressed well here and is entirely an internal domain of the execution engine. Following some classic methods: Logical expresses what we want to do; Physical expresses how we want to do it (adding points of parallelization but not specifying particular amounts of parallelization or node-by-node assignments). The execution engine is then responsible for determining the amount of parallelization of a particular plan, taking into account system load (likely leveraging Berkeley's Sparrow work), task priority and specific data locality information, building sub-DAGs to be assigned to individual nodes, and executing the plan.
>>>>
>>>> So at the higher logical and physical levels, a single Scan and subsequent ScanPOP should be okay... (ScanROPs have a separate problem, since they ignore the level of separation we're planning for the real execution layer. This is why the current ref impl turns a single Scan into potentially a union of ScanROPs... not elegant, but logically correct.)
>>>>
>>>> The capabilities interface still needs to be defined for how a storage engine reveals its logical capabilities and thus consumes part of the plan.
>>>>
>>>> J
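To make the capabilities discussion above a bit more concrete, here is a minimal Java sketch of what such an interface might look like, roughly along the lines of David's "set of ops the engine is able to run internally". Everything in it is hypothetical: the StorageEngineCapabilities and HBaseCapabilities names, the LogicalOp enum and the canPushDown method are invented for illustration and are not part of the Drill prototype.

    // Hypothetical sketch only -- not part of the Drill prototype.
    // A storage engine advertises which logical operators it can absorb,
    // so the planner can decide how much of the logical plan to hand over.
    import java.util.EnumSet;
    import java.util.Set;

    public interface StorageEngineCapabilities {

      // Logical operator kinds a storage engine might run internally.
      enum LogicalOp { SCAN, FILTER, PROJECT, PARTIAL_AGGREGATE, JOIN }

      // The set of operators this engine can execute on its own
      // (e.g. an HBase engine might return only SCAN and FILTER).
      Set<LogicalOp> supportedOps();

      // Whether the engine can push down this particular operator instance,
      // e.g. a filter that maps onto a row-key range.
      boolean canPushDown(LogicalOp op, String operatorConfigJson);
    }

    // Example of how one engine might describe itself.
    class HBaseCapabilities implements StorageEngineCapabilities {
      @Override
      public Set<LogicalOp> supportedOps() {
        return EnumSet.of(LogicalOp.SCAN, LogicalOp.FILTER);
      }

      @Override
      public boolean canPushDown(LogicalOp op, String operatorConfigJson) {
        // A real implementation would inspect the operator; here we only
        // accept filters, assuming they translate to a key range.
        return op == LogicalOp.FILTER;
      }
    }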
>>>> On Tue, Mar 12, 2013 at 10:19 PM, David Alves <[email protected]> wrote:
>>>>
>>>>> Hi Lisen
>>>>>
>>>>> Some of what you are saying, like pushdown of ops such as filter, projection or partial aggregation below the storage engine scanner level, or sub-tree execution, is actively being discussed in issues DRILL-13 (Storage Engine Interface) and DRILL-15 (HBase storage engine); your input on these issues is most welcome.
>>>>>
>>>>> HBase in particular has the notion of endpoints/coprocessors/filters that allow pushing this down easily (this is also in line with what other parallel-database-over-NoSQL implementations like Tajo do).
>>>>> A possible approach is to have the optimizer change the order of the ops to place them below the storage engine scanner and let the SE impl deal with them internally.
>>>>>
>>>>> There are also some other pieces missing at the moment AFAIK, like a distributed metadata store, the Drill daemons, wiring, etc.
>>>>>
>>>>> So in summary, you're absolutely right, and if you're particularly interested in the HBase SE impl (as I am, for the moment) I'd be interested in collaborating.
>>>>>
>>>>> Best
>>>>> David
>>>>>
>>>>> On Mar 12, 2013, at 11:44 PM, Lisen Mu <[email protected]> wrote:
>>>>>
>>>>>> Hi David,
>>>>>>
>>>>>> Very nice to see your effort on this.
>>>>>>
>>>>>> Hi Jacques,
>>>>>>
>>>>>> We are also extending the Drill prototype, to see if there is any chance it can meet our production needs. However, we find that implementing a performant HBase storage engine is not such straightforward work, and requires some workarounds. The problem is in the Scan interface.
>>>>>>
>>>>>> In Drill's physical plan model, ScanROP is in charge of the table scan. The storage engine provides output for a whole data source, a CSV file for example. That's sufficient for an input source like a plain file, but for HBase it's not very efficient, if not impossible, to let ScanROP retrieve a whole HTable into Drill. Storage engines like HBase should have some ability to do part of the DrQL query, like Filter, if a filter can be performed by specifying startRowKey and endRowKey. A storage engine like MySQL could do more, even Join.
>>>>>>
>>>>>> Generally, it would be clearer if a ScanROP were mapped to a sub-DAG of the logical plan DAG instead of a single Scan node in the logical plan. If so, more implementation-specific information would couple into the plan optimization & transformation phase. I guess that's the price to pay when optimization comes, or is there another way I failed to see?
>>>>>>
>>>>>> Please correct me if anything is wrong.
>>>>>>
>>>>>> thanks,
>>>>>>
>>>>>> Lisen
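Lisen's point about pushing a filter down as a row-key range is easy to picture with the stock HBase client API of that era. A minimal sketch follows; the table name, column family/qualifier and key bounds are invented for illustration, and how a ScanROP would actually drive such a scan is exactly what is still open in this thread.

    // Sketch of the kind of scan a storage engine could issue once a
    // DrQL filter has been translated into a row-key range.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class KeyRangeScanExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "events");            // hypothetical table

        Scan scan = new Scan();
        scan.setStartRow(Bytes.toBytes("user123|20130301"));  // pushed-down filter becomes
        scan.setStopRow(Bytes.toBytes("user123|20130315"));   // a key range, not a full scan
        scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("value"));

        ResultScanner scanner = table.getScanner(scan);
        try {
          for (Result row : scanner) {
            // Hand each row to the downstream operator instead of
            // materializing the whole HTable inside Drill.
            System.out.println(Bytes.toString(row.getRow()));
          }
        } finally {
          scanner.close();
          table.close();
        }
      }
    }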
>>>>>> On Wed, Mar 13, 2013 at 9:33 AM, David Alves <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Jacques
>>>>>>>
>>>>>>> I've submitted a first-pass patch to DRILL-15.
>>>>>>> I did this mostly because HBase will be my main target and because I wanted to get a feel for what would be a nice interface for DRILL-13. I have some thoughts that I will post soon.
>>>>>>> BTW: I still can't assign issues to myself in JIRA, did you forget to add me as a contributor?
>>>>>>>
>>>>>>> Best
>>>>>>> David
>>>>>>>
>>>>>>> On Mar 11, 2013, at 2:13 PM, Jacques Nadeau <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hey David,
>>>>>>>>
>>>>>>>> These sound good. I've added you as a contributor on JIRA so you can assign tasks to yourself. I think 45 and 46 are good places to start. 15 depends on 13, and working on the two hand in hand would probably be a good idea. Maybe we could have a design discussion on 15 and 13 here once you have some time to focus on it.
>>>>>>>>
>>>>>>>> Jacques
>>>>>>>>
>>>>>>>> On Mon, Mar 11, 2013 at 3:02 AM, David Alves <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi All
>>>>>>>>>
>>>>>>>>> I have a new academic project for which I'd like to use Drill, since none of the other parallel-database-over-Hadoop/NoSQL implementations fit just right.
>>>>>>>>> To this end I've been tinkering with the prototype, trying to find where I'd be most useful.
>>>>>>>>>
>>>>>>>>> Here's where I'd like to start, if you agree:
>>>>>>>>> - implement the HBase storage engine (DRILL-15)
>>>>>>>>> - start with simple scanning and pushdown of selection/projection
>>>>>>>>> - implement the LogicalPlanBuilder (DRILL-45)
>>>>>>>>> - set up coding style in the wiki (formatting/imports etc., DRILL-46)
>>>>>>>>> - create builders for all logical plan elements / make logical plans immutable (there's no issue for this; I'd like to hear your thoughts first).
>>>>>>>>>
>>>>>>>>> Please let me know your thoughts, and if you agree please assign the issues to me (it seems that I can't assign them myself).
>>>>>>>>>
>>>>>>>>> Best
>>>>>>>>> David Alves
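As a rough illustration of the last bullet in David's list above (builders for logical plan elements, with the built elements immutable), here is a minimal sketch for a hypothetical Scan element. The field names and layout are invented and are not the prototype's actual logical plan classes.

    // Hypothetical sketch of an immutable logical-plan element with a builder.
    // Field names (storageEngine, selection) are invented for illustration.
    public final class Scan {
      private final String storageEngine;
      private final String selection; // engine-specific selection, kept opaque here

      private Scan(Builder b) {
        this.storageEngine = b.storageEngine;
        this.selection = b.selection;
      }

      public String getStorageEngine() { return storageEngine; }
      public String getSelection() { return selection; }

      public static Builder builder() { return new Builder(); }

      public static final class Builder {
        private String storageEngine;
        private String selection;

        public Builder storageEngine(String engine) { this.storageEngine = engine; return this; }
        public Builder selection(String selection) { this.selection = selection; return this; }

        public Scan build() {
          if (storageEngine == null) {
            throw new IllegalStateException("storageEngine is required");
          }
          return new Scan(this); // the built element is immutable
        }
      }
    }

    // Usage: Scan scan = Scan.builder().storageEngine("hbase").selection("{...}").build();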
