Re: contribution

David Alves Fri, 15 Mar 2013 18:59:56 -0700

that'd be great thanks.

-david


On Mar 15, 2013, at 8:51 PM, Jacques Nadeau <[email protected]> wrote:

> I've been under the weather the last few days and haven't made much
> progress. Let me see if I can get you something tomorrow.
> 
> On Mar 15, 2013, at 2:36 PM, David Alves <[email protected]> wrote:
> 
>> Hi Jacques
>> 
>>   Is there any chance we could get a preview of this physical plan stuff and 
>> basic plumbing for distributed execution before the weekend? maybe in a 
>> github branch somewhere?
>>   I mean it doesn't have to be complete or even running, I'd just like to 
>> make some progress with other stuff and keeping it in line with whichever 
>> plumbing you already have would be great.
>> 
>> Best
>> David
>> 
>> On Mar 13, 2013, at 3:12 PM, Jacques Nadeau <[email protected]> wrote:
>> 
>>> I'm working on some physical plan stuff as well as some basic plumbing for
>>> distributed execution.  It's very much in progress so I need to clean things
>>> up a bit before we could collaborate/divide and conquer on it.  Depending on
>>> your timing and availability, maybe I could put some of this together in
>>> the next couple days so that you could plug in rather than reinvent.  In
>>> the meantime, pushing forward the builder stuff, additional test cases on
>>> the reference interpreter and/or thinking through the logical plan storage
>>> engine pushdown/rewrite could be very useful.
>>> 
>>> Let me know your thoughts.
>>> 
>>> thanks,
>>> Jacques
>>> 
>>> On Wed, Mar 13, 2013 at 9:47 AM, David Alves <[email protected]> wrote:
>>> 
>>>> Hi Jacques
>>>> 
>>>>      I can assign issues to me now, thanks.
>>>>      What you say wrt the logical/physical/execution layers sounds
>>>> good.
>>>>      My main concern, for the moment, is to have something working as
>>>> fast as possible, i.e. some daemons that I'd be able to deploy to a working
>>>> hbase cluster and send them work to do in some form (first step would be to
>>>> treat it as a non-distributed engine where each daemon runs an instance of
>>>> the prototype).
>>>>      Here's where I'd like to go next:
>>>>      - lay the ground work for the daemons (scripts/rpc iface/wiring
>>>> protocol).
>>>>      - create an execution engine iface that allows us to abstract future
>>>> implementations, and make it available through the rpc iface. This would
>>>> sit in front of the ref impl for now and would be replaced by cpp down the
>>>> line.
>>>> 
>>>>      I think we can probably concentrate on the capabilities iface a
>>>> bit down the line but, as a first approach, I see it simply providing a
>>>> simple set of ops that it is able to run internally.
>>>>      How to abstract locality/partitioning/schema capabilities is still
>>>> not clear to me though, thoughts?
>>>> 
>>>> David
>>>> 
>>>> On Mar 13, 2013, at 11:12 AM, Jacques Nadeau <[email protected]> wrote:
>>>> 
>>>>> I'm working on a presentation that will better illustrate the layers.
>>>>> There are actually three key plans.  Thinking to date has been to break
>>>>> the plans down into logical, physical and execution.  The third hasn't
>>>>> been expressed well here and is entirely an internal domain to the
>>>>> execution engine.  Following some classic methods: Logical expresses what
>>>>> we want to do, Physical expresses how we want to do it (adding points of
>>>>> parallelization but not specifying particular amounts of parallelization
>>>>> or node by node assignments).  The execution engine is then responsible
>>>>> for determining the amount of parallelization of a particular plan along
>>>>> with system load (likely leveraging Berkeley's Sparrow work), task
>>>>> priority and specific data locality information, building sub-dags to be
>>>>> assigned to individual nodes and executing the plan.
>>>>> 
>>>>> So in the higher logical and physical levels, a single Scan and
>>>>> subsequent ScanPOP should be okay...  (ScanROPs have a separate problem
>>>>> since they ignore the level of separation we're planning for the real
>>>>> execution layer.  This is why the current ref impl turns a single Scan
>>>>> into potentially a union of ScanROPs... not elegant but logically
>>>>> correct.)
>>>>> 
>>>>> The capabilities interface still needs to be defined for how a storage
>>>>> engine reveals its logical capabilities and thus consumes part of the
>>>>> plan.
>>>>> 
>>>>> J
>>>>> 
>>>>> 
>>>>> On Tue, Mar 12, 2013 at 10:19 PM, David Alves <[email protected]> wrote:
>>>>> 
>>>>>> Hi Lisen
>>>>>> 
>>>>>>     Some of what you are saying, like push down of ops such as filter,
>>>>>> projection or partial aggregation below the storage engine scanner
>>>>>> level, or sub tree execution, is actively being discussed in issues
>>>>>> DRILL-13 (Storage Engine Interface) and DRILL-15 (HBase storage engine);
>>>>>> your input in these issues is most welcome.
>>>>>> 
>>>>>>     HBase in particular has the notion of
>>>>>> endpoints/coprocessors/filters that allow pushing this down easily (this
>>>>>> is also in line with what other parallel database over nosql
>>>>>> implementations like tajo do).
>>>>>>     A possible approach is to have the optimizer change the order of
>>>>>> the ops to place them below the storage engine scanner and let the SE
>>>>>> impl deal with it internally.
>>>>>> 
>>>>>>     There are also some other pieces missing at the moment AFAIK, like
>>>>>> a distributed metadata store, the drill daemons, wiring, etc.
>>>>>> 
>>>>>>     So in summary, you're absolutely right, and if you're particularly
>>>>>> interested in the HBase SE impl (as I am, for the moment) I'd be
>>>>>> interested in collaborating.
>>>>>> 
>>>>>> Best
>>>>>> David
>>>>>> 
>>>>>> 
>>>>>> On Mar 12, 2013, at 11:44 PM, Lisen Mu <[email protected]> wrote:
>>>>>> 
>>>>>>> Hi David,
>>>>>>> 
>>>>>>> Very nice to see your effort on this.
>>>>>>> 
>>>>>>> Hi Jacques,
>>>>>>> 
>>>>>>> We are also extending the drill prototype to see if there is any chance
>>>>>>> to meet our production needs. However, we find that implementing a
>>>>>>> performant HBase storage engine is not straightforward work and
>>>>>>> requires some workarounds. The problem is in the Scan interface.
>>>>>>> 
>>>>>>> In drill's physical plan model, ScanROP is in charge of table scan.
>>>>>>> The storage engine provides output for a whole data source, a csv file
>>>>>>> for example. That's sufficient for an input source like a plain file,
>>>>>>> but for hbase it's not very efficient, if not impossible, to let ScanROP
>>>>>>> retrieve a whole htable into drill. Storage engines like HBase should
>>>>>>> have some ability to do part of the DrQL query, like Filter, if a filter
>>>>>>> can be performed by specifying startRowKey and endRowKey. A storage
>>>>>>> engine like mysql could do more, even Join.
>>>>>>> 
>>>>>>> Generally, it would be clearer if a ScanROP were mapped to a sub-DAG of
>>>>>>> the logical plan DAG instead of a single Scan node in the logical plan.
>>>>>>> If so, more implementation-specific information would be coupled into
>>>>>>> the plan optimization & transformation phase. I guess that's the price
>>>>>>> to pay when optimization comes, or is there another way I failed to
>>>>>>> see?
>>>>>>> 
>>>>>>> Please correct me if anything is wrong.
>>>>>>> 
>>>>>>> thanks,
>>>>>>> 
>>>>>>> Lisen
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Wed, Mar 13, 2013 at 9:33 AM, David Alves <[email protected]> wrote:
>>>>>>> 
>>>>>>>> Hi Jacques
>>>>>>>> 
>>>>>>>>    I've submitted a first pass patch to DRILL-15.
>>>>>>>>    I did this mostly because HBase will be my main target and because
>>>>>>>> I wanted to get a feel of what would be a nice interface for DRILL-13.
>>>>>>>> I have some thoughts that I will post soon.
>>>>>>>>    btw: I still can't assign issues to myself in JIRA, did you forget
>>>>>>>> to add me as a contributor?
>>>>>>>> 
>>>>>>>> Best
>>>>>>>> David
>>>>>>>> 
>>>>>>>> On Mar 11, 2013, at 2:13 PM, Jacques Nadeau <[email protected]> wrote:
>>>>>>>> 
>>>>>>>>> Hey David,
>>>>>>>>> 
>>>>>>>>> These sound good.  I've added you as a contributor on jira so you can
>>>>>>>>> assign tasks to yourself.  I think 45 and 46 are good places to start.
>>>>>>>>> 15 depends on 13 and working on the two hand in hand would probably be
>>>>>>>>> a good idea.  Maybe we could do a design discussion on 15 and 13 here
>>>>>>>>> once you have some time to focus on it.
>>>>>>>>> 
>>>>>>>>> Jacques
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Mon, Mar 11, 2013 at 3:02 AM, David Alves <[email protected]> wrote:
>>>>>>>>> 
>>>>>>>>>> Hi All
>>>>>>>>>> 
>>>>>>>>>>   I have a new academic project for which I'd like to use drill
>>>>>>>>>> since none of the other parallel database over hadoop/nosql
>>>>>>>>>> implementations fit just right.
>>>>>>>>>>   To this end I've been tinkering with the prototype trying to find
>>>>>>>>>> where I'd be most useful.
>>>>>>>>>> 
>>>>>>>>>>   Here's where I'd like to start, if you agree:
>>>>>>>>>>   - implement HBase storage engine (DRILL-15)
>>>>>>>>>>           - start with simple scanning and push down of
>>>>>>>>>> selection/projection
>>>>>>>>>>   - implement the LogicalPlanBuilder (DRILL-45)
>>>>>>>>>>   - setup coding style in the wiki (formatting/imports etc, DRILL-46)
>>>>>>>>>>   - create builders for all logical plan elements/make logical plans
>>>>>>>>>> immutable (no issue for this, I'd like to hear your thoughts first).
>>>>>>>>>> 
>>>>>>>>>>   Please let me know your thoughts, and if you agree please assign
>>>>>>>>>> the issues to me (it seems that I can't assign them myself).
>>>>>>>>>> 
>>>>>>>>>> Best
>>>>>>>>>> David Alves
>>
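
The pushdown idea discussed in this thread (a Filter that can be expressed as
a startRowKey/endRowKey range being absorbed by the storage engine's scan)
could look roughly like the sketch below. Everything here is an assumption
made for illustration: Scan, Filter, RowKeyRange and push_down are
hypothetical names, not Drill's actual classes or optimizer API. The
inclusive-start/exclusive-stop convention mirrors how HBase scans treat
their row bounds.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class Scan:
    """Hypothetical scan node; None means the bound is open."""
    table: str
    start_row: Optional[str] = None  # inclusive lower bound
    stop_row: Optional[str] = None   # exclusive upper bound


@dataclass(frozen=True)
class RowKeyRange:
    """Hypothetical predicate the storage engine can evaluate itself."""
    start: Optional[str] = None
    stop: Optional[str] = None


@dataclass(frozen=True)
class Filter:
    """Hypothetical filter node; predicate may be a RowKeyRange or opaque."""
    child: Any
    predicate: Any


def _tighter(a, b, pick):
    # Combine two optional bounds; None means "unbounded".
    if a is None:
        return b
    if b is None:
        return a
    return pick(a, b)


def push_down(plan):
    """Fold row-key range filters sitting directly above a Scan into it.

    Filters with any other predicate stay in the plan, so the engine
    still evaluates them after the scan.
    """
    if not isinstance(plan, Filter):
        return plan
    child = push_down(plan.child)
    pred = plan.predicate
    if isinstance(child, Scan) and isinstance(pred, RowKeyRange):
        # Intersect the scan's existing range with the filter's range.
        return Scan(
            child.table,
            _tighter(child.start_row, pred.start, max),
            _tighter(child.stop_row, pred.stop, min),
        )
    return Filter(child, pred)


if __name__ == "__main__":
    plan = Filter(Scan("users"), RowKeyRange("a", "m"))
    print(push_down(plan))
```

In this sketch the optimizer only needs to know which predicates the storage
engine claims to handle; that is the part a capabilities interface would have
to describe.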
