Re: question about schema

Lisen Mu Mon, 22 Apr 2013 01:52:04 -0700

David,

Thanks, I'm willing to help.


Sorry, I missed the conclusion in the jira. Thanks for the explanation; I guess
a further push from you and Jacques will make things clearer.





On Mon, Apr 22, 2013 at 12:47 PM, David Alves <[email protected]> wrote:

> Lisen
>
>         Ah, got what you mean by encoding multiple fields into the rowkey.
>         Well that makes projection trickier, but still definitely possible
> to do with Filters.
>         As soon as I get something reasonable working I'll push it and I
> welcome your help in dealing with that particular situation and any others
> you can come up with.
>
>         With regard to pushdown, after a bit of discussion in the SE
> jira (I forget the number), the consensus seems to be that the SE advertises
> opaque OptimizerRules that the optimizer runs.
>         These can, for instance, push the project in Jacques' example inside
> the scan, or change the order of ops.
>         In general I can see the case where a typical RDBMS would publish
> multiple rules (for agg, proj, select, even join) which, when run by the
> optimizer, would go through the ops directly above the scan and keep pushing
> most of them inside the scan until there is either nothing left but the sink
> and the scan (and not even the sink if it goes into the same data source) or
> there's a multi-branch multi-data source op such as union or join.
>         All of these are inside the Scan physical op (and are SE agnostic
> up to this point).
>         So the physical plan portion to be executed by the SE is actually
> inside the scan op.
>         At least this is how I'm thinking about it right now…
>
> Best
> David
>
>
> On Apr 21, 2013, at 11:29 PM, Lisen Mu <[email protected]> wrote:
>
> > David,
> >
> > Suppose we have planned to use domainId+uid+timestamp as my HTable
> > rowkey.
> >
> > I wish to retrieve the uid portion from my rowkey, like:
> >
> >  SELECT distinct(uid) from `my_table` where xxx
> >
> > Or, I wish I can do:
> >
> >  a) SELECT xxx from `my_table` where domainId='a'
> >  b) SELECT xxx from `my_table` where uid='[email protected]'
> >
> > And HBase SE would determine the best startKey and endKey according to
> > rowkey definition info, so a) and b) would get different performance.
> >
> >> about selection/Filter & aggregation:
> >
> > I have so many questions that I feel it would be better to wait for your
> > HBase SE first... However:
> >
> > How do we push down aggregation and selection into the scan pop?
> >
> > @Jacques, It seems to me that your idea is to use a scan pop node to
> > describe what SE would do in a query, right?
> >
> > Would the scan pop become a little too complicated if it stays SE
> > independent, since MySQL and MongoDB need more from the scan pop?
> >
> > Previously I thought you would provide something like
> >
> >  RecordReader getReader(PhysicalPlan subPlan)
> >
> > The SE advertises its abilities back to Drill; Drill pushes part of the
> > physical plan to the SE and lets the SE figure out how to deal with the
> > subdag, as long as the SE can provide a correct RecordBatch.
> >
> >
> >
> >
> >
> > On Mon, Apr 22, 2013 at 12:06 PM, David Alves <[email protected]> wrote:
> >
> >> Hi Lisen
> >>
> >>        Phoenix has been a good source of inspiration.
> >>        Had it not been for license issues (non-standard license) and the
> >> fact it is designed to run locally, I would have used it directly instead
> >> of coding my own.
> >>        Not completely sure what you mean wrt "map fields in the query
> >> into portion of rowkey in HBase" but here's what I'm doing with regard to
> >> the operations that are pushed to HBase:
> >>
> >>        Projection comes from setting the interesting CF's and CQ's in the
> >> Scan prior to starting it (where those come from in Drill was the reason
> >> for my previous email).
> >>        Selection comes from setting Filters that are created directly
> >> from expressions in Drill and are submitted with the scan.
> >>        Partial Aggregation (which I'm not doing right now but will do
> >> soon) will come from co-processors.
> >>        Joins: I'm investigating a couple of ways of pushing some of the
> >> work to HBase.
> >>
> >>        All the remaining operations will happen within drill itself.
> >>
> >> Best
> >> David
> >>
> >> On Apr 21, 2013, at 10:45 PM, Lisen Mu <[email protected]> wrote:
> >>
> >>> David,
> >>>
> >>> Another case about schema: how to map fields in the query into portion
> >>> of rowkey in HBase? Like phoenix does.
> >>> http://files.meetup.com/1350427/IntelPhoenixHBaseMeetup.ppt
> >>>
> >>> I think it might be common in HBase schema design that several logical
> >>> parts form rowkey in a particular order for the most frequent access
> >>> pattern.
> >>>
> >>>
> >>>
> >>>
> >>> On Sun, Apr 21, 2013 at 1:45 PM, David Alves <[email protected]> wrote:
> >>>
> >>>> had a "duh" moment, realizing that, of course, I don't need a
> >>>> ProjectFilter as I can set the relevant cq's and cf's on HBase's Scan.
> >>>> the question of how to get the names of the columns the query is asking
> >>>> for or even "*" if that is the case, still stands though…
> >>>>
> >>>> -david
> >>>>
> >>>> On Apr 20, 2013, at 10:39 PM, David Alves <[email protected]> wrote:
> >>>>
> >>>>> Hi Jacques
> >>>>>
> >>>>>     I'm implementing a ProjectFilter for HBase and I got to the point
> >>>>> where I need to pass to HBase the fields that are required (even if it's
> >>>>> simply "all" as in *).
> >>>>>     How to know which fields to scan in the SE and their expected type?
> >>>>>     There's a bunch of schema stuff in the
> >>>>> org/apache/drill/exec/schema but I can't figure how SE uses that.
> >>>>>     Will this info come inside the scan logical op in
> >>>>> getReadEntries(Scan scan) (in the arbitrary "selection" section)?
> >>>>>     Is this method still going to receive a logical Scan op or is this
> >>>>> just legacy stuff that you didn't have the chance to get to yet?
> >>>>>     BatchSchema seems to only refer to field ids…
> >>>>>
> >>>>>     I'm thinking this is most likely because the work is still very
> >>>>> much in progress but as I browse the code I can see you have put a lot of
> >>>>> thought into almost everything even when it's not being used right now and
> >>>>> I don't want to make any stupid assumption.
> >>>>>     I can definitely make that info get to the SE iface myself just
> >>>>> wondering how do you envision it should get there…
> >>>>>
> >>>>> Best
> >>>>> David
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>
> >>>>
> >>
> >>
>
>
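
For reference, the composite rowkey case raised in the thread
(domainId+uid+timestamp, with queries a) and b) getting different scan
ranges) could look roughly like the sketch below. This is only an
illustration under assumptions that are not from the thread: the fields are
fixed-width and zero-padded, the widths are made up, and RowkeyRange,
fixed, stopKeyForPrefix and rangeFor are hypothetical names, not Drill or
HBase API. Only plain JDK classes are used.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper: derives a scan range from predicates on rowkey fields.
final class RowkeyRange {
    // Assumed fixed field widths; a real table would declare its own layout.
    static final int DOMAIN_WIDTH = 8;
    static final int UID_WIDTH = 32;

    /** Pads (or truncates) a string field to a fixed width with zero bytes. */
    static byte[] fixed(String value, int width) {
        return Arrays.copyOf(value.getBytes(StandardCharsets.UTF_8), width);
    }

    /** Smallest key greater than every key starting with prefix; empty = unbounded. */
    static byte[] stopKeyForPrefix(byte[] prefix) {
        byte[] stop = prefix.clone();
        for (int i = stop.length - 1; i >= 0; i--) {
            if (stop[i] != (byte) 0xFF) {
                stop[i]++;                        // bump the last byte that can grow
                return Arrays.copyOf(stop, i + 1); // drop the trailing 0xFF bytes
            }
        }
        return new byte[0]; // prefix was all 0xFF: no upper bound
    }

    /** Returns {startKey, stopKey}; empty arrays mean "unbounded". */
    static byte[][] rangeFor(String domainId, String uid) {
        if (domainId == null) {
            // Case b): no predicate on the leading field, so the range cannot be
            // narrowed and uid has to be checked by a filter during a full scan.
            return new byte[][] { new byte[0], new byte[0] };
        }
        byte[] prefix = fixed(domainId, DOMAIN_WIDTH);
        if (uid != null) {
            // Both leading fields are known: extend the prefix with uid.
            byte[] both = Arrays.copyOf(prefix, DOMAIN_WIDTH + UID_WIDTH);
            System.arraycopy(fixed(uid, UID_WIDTH), 0, both, DOMAIN_WIDTH, UID_WIDTH);
            prefix = both;
        }
        // Case a): a prefix on the leading field(s) gives a bounded range.
        return new byte[][] { prefix, stopKeyForPrefix(prefix) };
    }

    public static void main(String[] args) {
        byte[][] a = rangeFor("a", null);
        byte[][] b = rangeFor(null, "u1");
        System.out.println("a) start=" + Arrays.toString(a[0]) + " stop=" + Arrays.toString(a[1]));
        System.out.println("b) start=" + Arrays.toString(b[0]) + " stop=" + Arrays.toString(b[1]));
    }
}
```

With this layout, a predicate on domainId alone yields a bounded start/stop
pair, while a predicate on uid alone yields an unbounded range, which is one
way to see why a) and b) would perform differently.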
