Hey Jacques
Sorry to be a nag, but is there any chance to take a sneak peek at the
protobuf rpc stuff?
I'd really like to hack something together wrt the daemon this weekend.
Also, wrt configuration management (zk/helix), maybe you could post
the iface so that it'd be possible to hack something static (i.e. non-ft,
properties file based) just to make dist execution work.
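To make the ask concrete, here is roughly the kind of static stand-in I have in mind. It is only a sketch: ClusterCoordinator, StaticClusterCoordinator and the drill.daemons key are names I made up, since I haven't seen the real iface, and the zk/helix version would presumably look quite different.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Properties;

/** Hypothetical view of cluster membership; the real iface is not published yet. */
interface ClusterCoordinator {
  /** Returns the known daemons as "host:port" strings. */
  List<String> getDaemonEndpoints();
}

/** Static, non-fault-tolerant implementation backed by properties. */
final class StaticClusterCoordinator implements ClusterCoordinator {
  private final List<String> endpoints;

  StaticClusterCoordinator(Properties props) {
    // "drill.daemons" is a made-up key holding a comma-separated host:port list.
    String raw = props.getProperty("drill.daemons", "");
    List<String> parsed = new ArrayList<>();
    for (String entry : raw.split(",")) {
      String trimmed = entry.trim();
      if (!trimmed.isEmpty()) {
        parsed.add(trimmed);
      }
    }
    this.endpoints = Collections.unmodifiableList(parsed);
  }

  /** Parses properties-file text; in practice this would read a file instead. */
  static StaticClusterCoordinator fromString(String text) {
    Properties props = new Properties();
    try {
      props.load(new StringReader(text));
    } catch (IOException e) {
      throw new IllegalStateException("could not parse cluster properties", e);
    }
    return new StaticClusterCoordinator(props);
  }

  @Override
  public List<String> getDaemonEndpoints() {
    return endpoints;
  }
}
```

With something like this, a daemon could call StaticClusterCoordinator.fromString("drill.daemons=node1:9000,node2:9000") at startup and the zk/helix backed version could replace it later behind the same iface.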
Thanks
David
On Mar 16, 2013, at 8:34 PM, Jacques Nadeau <[email protected]> wrote:
> Hey David,
>
> The java-exec framework is not far enough along that it makes sense for me
> to push it externally yet. However, I did push my initial wip physical
> plan approach. You can find it here:
> https://github.com/jacques-n/incubator-drill/tree/physical_plan_updates
>
> Hopefully, I will get further along on the java-exec stuff soon.
>
> I'd suggest that you focus your energy on the StorageEngine API and HBase
> implementation. If you're up for it, let's do a quick skype chat to sync
> up. Let me know your availability over the next few days.
>
> Thanks,
> Jacques
>
>
>
> On Fri, Mar 15, 2013 at 6:59 PM, David Alves <[email protected]> wrote:
>
>> that'd be great thanks.
>>
>> -david
>>
>> On Mar 15, 2013, at 8:51 PM, Jacques Nadeau <[email protected]>
>> wrote:
>>
>>> I've been under the weather the last few days and haven't made much
>>> progress. Let me see if I can get you something tomorrow.
>>>
>>> On Mar 15, 2013, at 2:36 PM, David Alves <[email protected]> wrote:
>>>
>>>> Hi Jacques
>>>>
>>>> Is there any chance we could get a preview of this physical plan
>>>> stuff and basic plumbing for distributed execution before the weekend?
>>>> Maybe in a github branch somewhere?
>>>> I mean it doesn't have to be complete or even running, I'd just like
>>>> to make some progress with other stuff and keeping it in line with
>>>> whichever plumbing you already have would be great.
>>>>
>>>> Best
>>>> David
>>>>
>>>> On Mar 13, 2013, at 3:12 PM, Jacques Nadeau <[email protected]> wrote:
>>>>
>>>>> I'm working on some physical plan stuff as well as some basic plumbing
>>>>> for distributed execution. It's very much in progress, so I need to clean
>>>>> things up a bit before we could collaborate/divide and conquer on it.
>>>>> Depending on your timing and availability, maybe I could put some of this
>>>>> together in the next couple of days so that you could plug in rather than
>>>>> reinvent. In the meantime, pushing forward the builder stuff, additional
>>>>> test cases on the reference interpreter and/or thinking through the
>>>>> logical plan storage engine pushdown/rewrite could be very useful.
>>>>>
>>>>> Let me know your thoughts.
>>>>>
>>>>> thanks,
>>>>> Jacques
>>>>>
>>>>> On Wed, Mar 13, 2013 at 9:47 AM, David Alves <[email protected]> wrote:
>>>>>
>>>>>> Hi Jacques
>>>>>>
>>>>>> I can assign issues to myself now, thanks.
>>>>>> What you say wrt the logical/physical/execution layers sounds
>>>>>> good.
>>>>>> My main concern, for the moment, is to have something working as
>>>>>> fast as possible, i.e. some daemons that I'd be able to deploy to a
>>>>>> working hbase cluster and send them work to do in some form (first step
>>>>>> would be to treat it as a non-distributed engine where each daemon runs
>>>>>> an instance of the prototype).
>>>>>> Here's where I'd like to go next:
>>>>>> - lay the groundwork for the daemons (scripts/rpc iface/wiring
>>>>>> protocol).
>>>>>> - create an execution engine iface that lets us abstract future
>>>>>> implementations, and make it available through the rpc iface. This
>>>>>> would sit in front of the ref impl for now and would be replaced by cpp
>>>>>> down the line.
>>>>>>
>>>>>> I think we can probably concentrate on the capabilities iface a
>>>>>> bit down the line but, as a first approach, I see it simply providing
>>>>>> a simple set of ops that it is able to run internally.
>>>>>> How to abstract locality/partitioning/schema capabilities is still
>>>>>> not clear to me though, thoughts?
>>>>>>
>>>>>> David
>>>>>>
>>>>>> On Mar 13, 2013, at 11:12 AM, Jacques Nadeau <[email protected]> wrote:
>>>>>>
>>>>>>> I'm working on a presentation that will better illustrate the layers.
>>>>>>> There are actually three key plans. Thinking to date has been to break
>>>>>>> the plans down into logical, physical and execution. The third hasn't
>>>>>>> been expressed well here and is entirely an internal domain of the
>>>>>>> execution engine. Following some classic methods: Logical expresses what
>>>>>>> we want to do, Physical expresses how we want to do it (adding points of
>>>>>>> parallelization but not specifying particular amounts of parallelization
>>>>>>> or node by node assignments). The execution engine is then responsible
>>>>>>> for determining the amount of parallelization of a particular plan along
>>>>>>> with system load (likely leveraging Berkeley's Sparrow work), task
>>>>>>> priority and specific data locality information, building sub-dags to be
>>>>>>> assigned to individual nodes and executing the plan.
>>>>>>>
>>>>>>> So in the higher logical and physical levels, a single Scan and
>>>>>>> subsequent ScanPOP should be okay... (ScanROPs have a separate problem
>>>>>>> since they ignore the level of separation we're planning for the real
>>>>>>> execution layer. This is why the current ref impl turns a single Scan
>>>>>>> into potentially a union of ScanROPs... not elegant but logically
>>>>>>> correct.)
>>>>>>>
>>>>>>> The capabilities interface still needs to be defined for how a storage
>>>>>>> engine reveals its logical capabilities and thus consumes part of the
>>>>>>> plan.
>>>>>>>
>>>>>>> J
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Mar 12, 2013 at 10:19 PM, David Alves <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Lisen
>>>>>>>>
>>>>>>>> Some of what you are saying, like push down of ops like filter,
>>>>>>>> projection or partial aggregation below the storage engine scanner
>>>>>>>> level, or sub tree execution, is actively being discussed in issues
>>>>>>>> DRILL-13 (Storage Engine Interface) and DRILL-15 (HBase storage engine);
>>>>>>>> your input in these issues is most welcome.
>>>>>>>>
>>>>>>>> HBase in particular has the notion of
>>>>>>>> endpoints/coprocessors/filters that allow pushing this down easily (this
>>>>>>>> is also in line with what other parallel database over nosql
>>>>>>>> implementations like tajo do).
>>>>>>>> A possible approach is to have the optimizer change the order of
>>>>>>>> the ops to place them below the storage engine scanner and let the SE
>>>>>>>> impl deal with it internally.
>>>>>>>>
>>>>>>>> There are also some other pieces missing at the moment AFAIK, like
>>>>>>>> a distributed metadata store, the drill daemons, wiring, etc.
>>>>>>>>
>>>>>>>> So in summary, you're absolutely right, and if you're particularly
>>>>>>>> interested in the HBase SE impl (as I am, for the moment) I'd be
>>>>>>>> interested in collaborating.
>>>>>>>>
>>>>>>>> Best
>>>>>>>> David
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mar 12, 2013, at 11:44 PM, Lisen Mu <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi David,
>>>>>>>>>
>>>>>>>>> Very nice to see your effort on this.
>>>>>>>>>
>>>>>>>>> Hi Jacques,
>>>>>>>>>
>>>>>>>>> We are also extending the drill prototype, to see if there is any
>>>>>>>>> chance to meet our production needs. However, we find that implementing
>>>>>>>>> a performant HBase storage engine is not so straightforward, and
>>>>>>>>> requires some workarounds. The problem is in the Scan interface.
>>>>>>>>>
>>>>>>>>> In drill's physical plan model, ScanROP is in charge of table scan.
>>>>>>>>> The storage engine provides output for a whole data source, a csv file
>>>>>>>>> for example. That's sufficient for an input source like a plain file,
>>>>>>>>> but for hbase it's not very efficient, if not impossible, to let
>>>>>>>>> ScanROP retrieve a whole htable into drill. Storage engines like HBase
>>>>>>>>> should have some ability to do part of the DrQL query, like Filter, if
>>>>>>>>> a filter can be performed by specifying startRowKey and endRowKey.
>>>>>>>>> Storage engines like mysql could do more, even Join.
>>>>>>>>>
>>>>>>>>> Generally, it would be clearer if a ScanROP were mapped to a sub-DAG
>>>>>>>>> of the logical plan DAG instead of a single Scan node in the logical
>>>>>>>>> plan. If so, more implementation-specific information would be coupled
>>>>>>>>> into the plan optimization & transformation phase. I guess that's the
>>>>>>>>> price to pay when optimization comes, or is there another way I failed
>>>>>>>>> to see?
>>>>>>>>>
>>>>>>>>> Please correct me if anything is wrong.
>>>>>>>>>
>>>>>>>>> thanks,
>>>>>>>>>
>>>>>>>>> Lisen
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Mar 13, 2013 at 9:33 AM, David Alves <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Jacques
>>>>>>>>>>
>>>>>>>>>> I've submitted a first pass patch to DRILL-15.
>>>>>>>>>> I did this mostly because HBase will be my main target and
>>>>>>>>>> because I wanted to get a feel for what would be a nice interface for
>>>>>>>>>> DRILL-13. I have some thoughts that I will post soon.
>>>>>>>>>> btw: I still can't assign issues to myself in JIRA, did you
>>>>>>>>>> forget to add me as a contributor?
>>>>>>>>>>
>>>>>>>>>> Best
>>>>>>>>>> David
>>>>>>>>>>
>>>>>>>>>> On Mar 11, 2013, at 2:13 PM, Jacques Nadeau <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hey David,
>>>>>>>>>>>
>>>>>>>>>>> These sound good. I've added you as a contributor on jira so you can
>>>>>>>>>>> assign tasks to yourself. I think 45 and 46 are good places to start.
>>>>>>>>>>> 15 depends on 13 and working on the two hand in hand would probably be
>>>>>>>>>>> a good idea. Maybe we could do a design discussion on 15 and 13 here
>>>>>>>>>>> once you have some time to focus on it.
>>>>>>>>>>>
>>>>>>>>>>> Jacques
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Mar 11, 2013 at 3:02 AM, David Alves <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi All
>>>>>>>>>>>>
>>>>>>>>>>>> I have a new academic project for which I'd like to use drill
>>>>>>>>>>>> since none of the other parallel database over hadoop/nosql
>>>>>>>>>>>> implementations fit just right.
>>>>>>>>>>>> To this end I've been tinkering with the prototype trying to
>>>>>>>>>>>> find where I'd be most useful.
>>>>>>>>>>>>
>>>>>>>>>>>> Here's where I'd like to start, if you agree:
>>>>>>>>>>>> - implement HBase storage engine (DRILL-15)
>>>>>>>>>>>> - start with simple scanning and push down of
>>>>>>>>>>>> selection/projection
>>>>>>>>>>>> - implement the LogicalPlanBuilder (DRILL-45)
>>>>>>>>>>>> - set up coding style in the wiki (formatting/imports etc,
>>>>>>>>>>>> DRILL-46)
>>>>>>>>>>>> - create builders for all logical plan elements/make logical
>>>>>>>>>>>> plans immutable (no issue for this, I'd like to hear your thoughts
>>>>>>>>>>>> first).
>>>>>>>>>>>>
>>>>>>>>>>>> Please let me know your thoughts, and if you agree please
>>>>>>>>>>>> assign the issues to me (it seems that I can't assign them myself).
>>>>>>>>>>>>
>>>>>>>>>>>> Best
>>>>>>>>>>>> David Alves