Re: contribution

David Alves Fri, 22 Mar 2013 12:07:03 -0700

Hey Jacques

        Sorry to be a nag, but is there any chance to take a sneak peek at the 
protobuf rpc stuff?
        I'd really like to hack something together wrt the daemon this weekend.
        Also, wrt configuration management (zk/helix) maybe you could post 
the iface so that it'd be possible to hack something static (i.e. non-ft, 
properties file based) just to make dist execution work.


Thanks
David

On Mar 16, 2013, at 8:34 PM, Jacques Nadeau <[email protected]> wrote:

> Hey David,
> 
> The java-exec framework is not far enough along that it makes sense for me
> to push it externally yet.  However, I did push my initial wip physical
> plan approach.  You can find it here:
> https://github.com/jacques-n/incubator-drill/tree/physical_plan_updates
> 
> Hopefully, I will get further along on the java-exec stuff soon.
> 
> I'd suggest that you focus your energy on the StorageEngine API and HBase
> implementation.  If you're up for it, let's do a quick skype chat to sync
> up.  Let me know your availability over the next few days.
> 
> Thanks,
> Jacques
> 
> 
> 
> On Fri, Mar 15, 2013 at 6:59 PM, David Alves <[email protected]> wrote:
> 
>> that'd be great thanks.
>> 
>> -david
>> 
>> On Mar 15, 2013, at 8:51 PM, Jacques Nadeau <[email protected]>
>> wrote:
>> 
>>> I've been under the weather the last few days and haven't made much
>>> progress. Let me see if I can get you something tomorrow.
>>> 
>>> On Mar 15, 2013, at 2:36 PM, David Alves <[email protected]> wrote:
>>> 
>>>> Hi Jacques
>>>> 
>>>>  Is there any chance we could get a preview of this physical plan
>> stuff and basic plumbing for distributed execution before the weekend?
>> maybe in a github branch somewhere?
>>>>  I mean it doesn't have to be complete or even running, I'd just like
>> to make some progress with other stuff and keeping it in line with
>> whichever plumbing you already have would be great.
>>>> 
>>>> Best
>>>> David
>>>> 
>>>> On Mar 13, 2013, at 3:12 PM, Jacques Nadeau <[email protected]> wrote:
>>>> 
>>>>> I'm working on some physical plan stuff as well as some basic plumbing
>> for
>>>>> distributed execution.  It's very much in progress so I need to clean things
>> up a
>>>>> bit before we could collaborate/ divide and conquer on it.  Depending
>> on
>>>>> your timing and availability, maybe I could put some of this together
>> in
>>>>> the next couple days so that you could plug in rather than reinvent.
>> In
>>>>> the meantime, pushing forward the builder stuff, additional test cases
>> on
>>>>> the reference interpreter and/or thinking through the logical plan
>> storage
>>>>> engine pushdown/rewrite could be very useful.
>>>>> 
>>>>> Let me know your thoughts.
>>>>> 
>>>>> thanks,
>>>>> Jacques
>>>>> 
>>>>> On Wed, Mar 13, 2013 at 9:47 AM, David Alves <[email protected]>
>> wrote:
>>>>> 
>>>>>> Hi Jacques
>>>>>> 
>>>>>>     I can assign issues to me now, thanks.
>>>>>>     What you say wrt the logical/physical/execution layers sounds
>>>>>> good.
>>>>>>     My main concern, for the moment is to have something working as
>>>>>> fast as possible, i.e. some daemons that I'd be able to deploy to a
>> working
>>>>>> hbase cluster and send them work to do in some form (first step would
>> be to
>>>>>> treat it as a non-distributed engine where each daemon runs an
>> instance of
>>>>>> the prototype).
>>>>>>     Here's where I'd like to go next:
>>>>>>     - lay the ground work for the daemons (scripts/rpc iface/wiring
>>>>>> protocol).
>>>>>>     - create an execution engine iface that allows to abstract future
>>>>>> implementations, and make it available through the rpc iface. this
>> would
>>>>>> sit in front of the ref impl for now and would be replaced by cpp
>> down the
>>>>>> line.
>>>>>> 
>>>>>>     I think we can probably concentrate on the capabilities iface a
>>>>>> bit down the line but, as a first approach, I see it simply providing
>> a
>>>>>> simple set of ops that it is able to run internally.
>>>>>>     How to abstract locality/partitioning/schema capabilities is still
>>>>>> not clear to me though, thoughts?
>>>>>> 
>>>>>> David
>>>>>> 
>>>>>> On Mar 13, 2013, at 11:12 AM, Jacques Nadeau <[email protected]>
>> wrote:
>>>>>> 
>>>>>>> I'm working on a presentation that will better illustrate the layers.
>>>>>>> There are actually three key plans.  Thinking to date has been to
>> break
>>>>>>> the plans down into logical, physical and execution.  The third
>> hasn't
>>>>>> been
>>>>>>> expressed well here and is entirely an internal domain to the
>> execution
>>>>>>> engine.  Following some classic methods: Logical expresses what we
>> want
>>>>>> to
>>>>>>> do, Physical expresses how we want to do it (adding points of
>>>>>>> parallelization but not specifying particular amounts of
>> parallelization
>>>>>> or
>>>>>>> node by node assignments).  The execution engine is then responsible
>> for
>>>>>>> determining the amount of parallelization of a particular plan along
>> with
>>>>>>> system load (likely leveraging Berkeley's Sparrow work), task
>> priority
>>>>>> and
>>>>>>> specific data locality information, building sub-dags to be assigned
>> to
>>>>>>> individual nodes and execute the plan.
>>>>>>> 
>>>>>>> So in the higher logical and physical levels, a single Scan and
>>>>>> subsequent
>>>>>>> ScanPOP should be okay...  (ScanROPs have a separate problem since
>> they
>>>>>>> ignore the level of separation we're planning for the real execution
>>>>>> layer.
>>>>>>> This is why the current ref impl turns a single Scan into
>> potentially
>>>>>>> a union of ScanROPs... not elegant but logically correct.)
>>>>>>> 
>>>>>>> The capabilities interface still needs to be defined for how a
>> storage
>>>>>>> engine reveals its logical capabilities and thus consumes part of the
>>>>>> plan.
>>>>>>> 
>>>>>>> J
>>>>>>> 
>>>>>>> 
>>>>>>> On Tue, Mar 12, 2013 at 10:19 PM, David Alves <[email protected]
>>> 
>>>>>> wrote:
>>>>>>> 
>>>>>>>> Hi Linsen
>>>>>>>> 
>>>>>>>>    Some of what you are saying like push down of ops like filter,
>>>>>>>> projection or partial aggregation below the storage engine scanner
>>>>>> level,
>>>>>>>> or sub tree execution are actively being discussed in issues
>> DRILL-13
>>>>>>>> (Storage Engine Interface) and DRILL-15 (HBase storage engine),
>> your
>>>>>> input
>>>>>>>> in these issues is most welcome.
>>>>>>>> 
>>>>>>>>    HBase in particular has the notion of
>>>>>>>> endpoints/coprocessors/filters that allow pushing this down easily
>> (this
>>>>>> is
>>>>>>>> also in line with what other parallel database over nosql
>>>>>> implementations
>>>>>>>> like tajo do).
>>>>>>>>    A possible approach is to have the optimizer change the order of
>>>>>>>> the ops to place them below the storage engine scanner and let the
>> SE
>>>>>> impl
>>>>>>>> deal with it internally.
>>>>>>>> 
>>>>>>>>    There are also some other pieces missing at the moment AFAIK,
>>>>>> like
>>>>>>>> a distributed metadata store, the drill daemons, wiring, etc.
>>>>>>>> 
>>>>>>>>    So in summary, you're absolutely right, and if you're
>>>>>> particularly
>>>>>>>> interested in the HBase SE impl (as I am, for the moment) I'd be
>>>>>> interested
>>>>>>>> in collaborating.
>>>>>>>> 
>>>>>>>> Best
>>>>>>>> David
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Mar 12, 2013, at 11:44 PM, Lisen Mu <[email protected]> wrote:
>>>>>>>> 
>>>>>>>>> Hi David,
>>>>>>>>> 
>>>>>>>>> Very nice to see your effort on this.
>>>>>>>>> 
>>>>>>>>> Hi Jacques,
>>>>>>>>> 
>>>>>>>>> we are also extending drill prototype, to see if there is any
>> chance to
>>>>>>>>> meet our production need. However, we find that implementing a
>>>>>> performant
>>>>>>>>> HBase storage engine is not so straightforward, and
>> requires
>>>>>> some
>>>>>>>>> workaround. The problem is in Scan interface.
>>>>>>>>> 
>>>>>>>>> In drill's physical plan model, ScanROP is in charge of table scan.
>>>>>>>> Storage
>>>>>>>>> engine provides output for a whole data source, a csv file for
>> example.
>>>>>>>>> It's sufficient for input source like plain file, but for hbase,
>> it's
>>>>>> not
>>>>>>>>> very efficient, if not impossible, to let ScanROP retrieve a whole
>>>>>> htable
>>>>>>>>> into drill. Storage engines like HBase should have some ability
>> to do
>>>>>>>> part
>>>>>>>>> of the DrQL query, like Filter, if a filter can be performed by
>>>>>>>> specifying
>>>>>>>>> startRowKey and endRowKey. Storage engine like mysql could do more,
>>>>>> even
>>>>>>>>> Join.
>>>>>>>>> 
>>>>>>>>> Generally, it would be more clear if a ScanROP is mapped to a
>> sub-DAG
>>>>>> of
>>>>>>>>> logical plan DAG instead of a single Scan node in logical plan. If
>> so,
>>>>>>>> more
>>>>>>>>> implementation-specific information would be coupled into the plan
>>>>>>>> optimization
>>>>>>>>> & transformation phase. I guess that's the price to pay when
>>>>>> optimization
>>>>>>>>> comes, or is there another way I failed to see?
>>>>>>>>> 
>>>>>>>>> Please correct me if anything is wrong.
>>>>>>>>> 
>>>>>>>>> thanks,
>>>>>>>>> 
>>>>>>>>> Lisen
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Wed, Mar 13, 2013 at 9:33 AM, David Alves <
>> [email protected]>
>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> Hi Jacques
>>>>>>>>>> 
>>>>>>>>>>   I've submitted a first-pass patch to DRILL-15.
>>>>>>>>>>   I did this mostly because HBase will be my main target and
>>>>>>>> because
>>>>>>>>>> I wanted to get a feel of what would be a nice interface for
>> DRILL-13.
>>>>>>>> Have
>>>>>>>>>> some thoughts that I will post soon.
>>>>>>>>>>   btw: I still can't assign issues to myself in JIRA, did you
>>>>>>>> forget
>>>>>>>>>> to add me as a contributor?
>>>>>>>>>> 
>>>>>>>>>> Best
>>>>>>>>>> David
>>>>>>>>>> 
>>>>>>>>>> On Mar 11, 2013, at 2:13 PM, Jacques Nadeau <[email protected]>
>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Hey David,
>>>>>>>>>>> 
>>>>>>>>>>> These sound good.  I've added you as a contributor on jira so you
>> can
>>>>>>>>>> assign
>>>>>>>>>>> tasks to yourself.  I think 45 and 46 are good places to start.
>> 15
>>>>>>>>>> depends
>>>>>>>>>>> on 13 and working on the two hand in hand would probably be a
>> good
>>>>>>>> idea.
>>>>>>>>>>> Maybe we could do a design discussion on 15 and 13 here once you
>> have
>>>>>>>>>> some
>>>>>>>>>>> time to focus on it.
>>>>>>>>>>> 
>>>>>>>>>>> Jacques
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Mon, Mar 11, 2013 at 3:02 AM, David Alves <
>> [email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Hi All
>>>>>>>>>>>> 
>>>>>>>>>>>>  I have a new academic project for which I'd like to use drill
>>>>>>>>>>>> since none of the other parallel database over hadoop/nosql
>>>>>>>>>> implementations
>>>>>>>>>>>> fit just right.
>>>>>>>>>>>>  To this goal I've been tinkering with the prototype trying to
>>>>>>>>>> find
>>>>>>>>>>>> where I'd be most useful.
>>>>>>>>>>>> 
>>>>>>>>>>>>  Here's where I'd like to start, if you agree:
>>>>>>>>>>>>  - implement HBase storage engine (DRILL-15)
>>>>>>>>>>>>          - start with simple scanning and push down of
>>>>>>>>>>>> selection/projection
>>>>>>>>>>>>  - implement the LogicalPlanBuilder (DRILL-45)
>>>>>>>>>>>>  - setup coding style in the wiki (formatting/imports etc,
>>>>>>>>>> DRILL-46)
>>>>>>>>>>>>  - create builders for all logical plan elements/make logical
>>>>>>>>>> plans
>>>>>>>>>>>> immutable (no issue for this, I'd like to hear your thoughts
>> first).
>>>>>>>>>>>> 
>>>>>>>>>>>>  Please let me know your thoughts, and if you agree please
>>>>>> assign
>>>>>>>>>>>> the issues to me (it seems that I can't assign them myself).
>>>>>>>>>>>> 
>>>>>>>>>>>> Best
>>>>>>>>>>>> David Alves
>>>> 
>> 
>>
