Lisen

        Ah, got what you mean by encoding multiple fields into the rowkey.
        Well, that makes projection trickier, but it's still definitely possible
to do with Filters.
        As soon as I get something reasonable working I'll push it, and I'd
welcome your help in dealing with that particular situation and any others you
can come up with.
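
        For instance, for your "SELECT distinct(uid)" case, roughly what I'm
picturing (the domainId|uid|timestamp layout and the '|' delimiter are just
assumptions for illustration): only row keys need to travel back, and the uid
part gets sliced out of the key in the SE's reader:

            // (org.apache.hadoop.hbase.client.Scan, org.apache.hadoop.hbase.filter.*,
            //  java.util.Arrays)
            // ship only row keys, no cell values -- the uid is then parsed out
            // of the key in the SE's RecordReader
            Scan scan = new Scan();
            scan.setFilter(new FilterList(Arrays.<Filter>asList(
                new FirstKeyOnlyFilter(),   // one KeyValue per row is enough
                new KeyOnlyFilter())));     // strip the values, keep the keys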
        
        With regard to pushdown: after a bit of discussion in the SE jira (I
forget the number), the consensus seems to be that the SE advertises opaque
OptimizerRules that the optimizer runs.
        These can, for instance, push the project in Jacques' example inside the
scan, or change the order of ops.
        In general I can see a typical RDBMS publishing multiple rules (for agg,
proj, select, even join) which, when run by the optimizer, walk the ops directly
above the scan and keep pushing most of them inside the scan until either
nothing is left but the sink and the scan (and not even the sink, if it writes
to the same data source), or a multi-branch, multi-data-source op such as a
union or join is reached.
        All of these pushed ops end up inside the Scan physical op (and remain
SE-agnostic up to this point).
        So the portion of the physical plan to be executed by the SE actually
lives inside the scan op.
        At least this is how I'm thinking about it right now…
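
        To make that concrete, a rough sketch of the kind of rule I mean (the
interfaces and names below are made up for illustration, not an existing drill
API):

            // hypothetical shapes, for illustration only
            interface OptimizerRule {
              PhysicalOperator apply(PhysicalOperator op);
            }

            class PushProjectIntoScan implements OptimizerRule {
              public PhysicalOperator apply(PhysicalOperator op) {
                if (op instanceof ProjectPOP && ((ProjectPOP) op).getChild() instanceof ScanPOP) {
                  ProjectPOP project = (ProjectPOP) op;
                  ScanPOP scan = (ScanPOP) project.getChild();
                  // fold the projected columns into the scan's (SE-specific)
                  // read entries, dropping the now-redundant project
                  return scan.withColumns(project.getColumns());
                }
                return op; // anything else is left to other rules / to drill itself
              }
            }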

Best
David
        

On Apr 21, 2013, at 11:29 PM, Lisen Mu <[email protected]> wrote:

> David,
> 
> Suppose we plan to use domainId+uid+timestamp as the HTable rowkey.
> 
> I wish to retrieve the uid portion from my rowkey, like:
> 
>  SELECT distinct(uid) from `my_table` where xxx
> 
> Or, I wish I could do:
> 
>  a) SELECT xxx from `my_table` where domainId='a'
>  b) SELECT xxx from `my_table` where uid='[email protected]'
> 
> And the HBase SE would determine the best startKey and endKey according to
> the rowkey definition info, so a) and b) would get different performance.
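> 
> For illustration (assuming the key parts are '|'-delimited strings -- the
> delimiter and values here are made up), the kind of thing I'd expect the SE
> to derive:
> 
>   // (org.apache.hadoop.hbase.client.Scan, .filter.*, .util.Bytes, java.util.Arrays)
>   // a) leading component (domainId) known: a tight [start, stop) key range
>   byte[] start = Bytes.toBytes("a|");
>   byte[] stop = Arrays.copyOf(start, start.length);
>   stop[stop.length - 1]++;                 // next possible prefix after "a|"
>   Scan scanA = new Scan(start, stop);
> 
>   // b) only a middle component (uid) known: full key range, a row filter
>   // prunes region-side instead
>   Scan scanB = new Scan();
>   scanB.setFilter(new RowFilter(CompareFilter.CompareOp.EQUAL,
>       new RegexStringComparator("^[^|]+\\|someuser\\|")));  // placeholder uid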
> 
>> about selection/Filter & aggregation:
> 
> I have so many questions that I feel it's better to wait for your HBase SE
> first... However:
> 
> How do you push down aggregation and selection into the scan pop?
> 
> @Jacques, it seems to me that your idea is to use a scan pop node to
> describe what the SE would do in a query, right?
> 
> Wouldn't the scan pop become a little too complicated if it stays
> SE-independent, since MySQL & Mongo need more from a scan pop?
> 
> Previously I thought you would provide something like
> 
>  RecordReader getReader(PhysicalPlan subPlan)
> 
> The SE advertises its abilities back to Drill, Drill pushes part of the
> physical plan to the SE, and the SE figures out how to deal with the sub-DAG,
> as long as it can provide correct RecordBatches.
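> 
> Just to sketch the shape of that idea (none of these types exist as such,
> it's only what I imagine the contract could look like):
> 
>   public interface StorageEngine {
>     // what the SE can take over (projection, selection, partial agg, ...)
>     Set<Capability> getCapabilities();
> 
>     // drill hands over the sub-DAG it decided the SE should run; the SE
>     // only has to return correct RecordBatches for it
>     RecordReader getReader(PhysicalPlan subPlan) throws IOException;
>   }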
> 
> 
> 
> 
> 
> On Mon, Apr 22, 2013 at 12:06 PM, David Alves <[email protected]> wrote:
> 
>> Hi Lisen
>> 
>>        Phoenix has been a good source of inspiration.
>>        Had it not been for license issues (a non-standard license) and the
>> fact that it is designed to run locally, I would have used it directly
>> instead of coding my own.
>>        Not completely sure what you mean wrt "map fields in the query
>> into portion of rowkey in HBase", but here's what I'm doing with regard to
>> the operations that are pushed to HBase:
>> 
>>        Projection comes from setting the interesting CFs and CQs on the
>> Scan prior to starting it (where those come from in Drill was the reason
>> for my previous email).
>>        Selection comes from setting Filters that are created directly
>> from expressions in Drill and are submitted with the scan.
>>        Partial aggregation (which I'm not doing right now but will do
>> soon) will come from coprocessors.
>>        Joins: I'm investigating a couple of ways of pushing some of the
>> work to HBase.
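>> 
>>        Concretely, the kind of Scan I'm building for projection + selection
>> (the family/qualifier/value names below are made up for illustration):
>> 
>>          // (org.apache.hadoop.hbase.client.Scan, .filter.*, .util.Bytes)
>>          Scan scan = new Scan();
>>          // projection: only the CFs/CQs the query touches get shipped
>>          scan.addColumn(Bytes.toBytes("f"), Bytes.toBytes("event_type"));
>>          scan.addColumn(Bytes.toBytes("f"), Bytes.toBytes("amount"));
>>          // selection: a Filter built from the drill expression, evaluated
>>          // region-side
>>          scan.setFilter(new SingleColumnValueFilter(
>>              Bytes.toBytes("f"), Bytes.toBytes("event_type"),
>>              CompareFilter.CompareOp.EQUAL, Bytes.toBytes("click")));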
>> 
>>        All the remaining operations will happen within drill itself.
>> 
>> Best
>> David
>> 
>> On Apr 21, 2013, at 10:45 PM, Lisen Mu <[email protected]> wrote:
>> 
>>> David,
>>> 
>>> Another case about schema: how to map fields in the query onto portions of
>>> the rowkey in HBase, like Phoenix does?
>>> http://files.meetup.com/1350427/IntelPhoenixHBaseMeetup.ppt
>>> 
>>> I think it might be common in HBase schema design for several logical
>>> parts to form the rowkey in a particular order, matching the most frequent
>>> access pattern.
>>> 
>>> 
>>> 
>>> 
>>> On Sun, Apr 21, 2013 at 1:45 PM, David Alves <[email protected]>
>> wrote:
>>> 
>>>> Had a "duh" moment, realizing that, of course, I don't need a
>>>> ProjectFilter as I can set the relevant CQs and CFs on HBase's Scan.
>>>> The question of how to get the names of the columns the query is asking
>>>> for, or even "*" if that is the case, still stands though…
>>>> 
>>>> -david
>>>> 
>>>> On Apr 20, 2013, at 10:39 PM, David Alves <[email protected]>
>> wrote:
>>>> 
>>>>> Hi Jacques
>>>>> 
>>>>>     I'm implementing a ProjectFilter for HBase and I got to the point
>>>> where I need to pass to HBase the fields that are required (even if it's
>>>> simply "all" as in *).
>>>>>     How do I know, in the SE, which fields to scan and what their
>>>>> expected types are?
>>>>>     There's a bunch of schema stuff in org/apache/drill/exec/schema,
>>>>> but I can't figure out how the SE uses it.
>>>>>     Will this info come inside the scan logical op in
>>>> getReadEntries(Scan scan) (in the arbitrary "selection" section)?
>>>>>     Is this method still going to receive a logical Scan op, or is this
>>>>> just legacy stuff that you didn't have the chance to get to yet?
>>>>>     BatchSchema seems to only refer to field ids…
>>>>> 
>>>>>     I'm thinking this is most likely because the work is still very
>>>>> much in progress, but as I browse the code I can see you have put a lot
>>>>> of thought into almost everything, even when it's not being used right
>>>>> now, and I don't want to make any stupid assumptions.
>>>>>     I can definitely make that info get to the SE iface myself; I'm just
>>>>> wondering how you envision it getting there…
>>>>> 
>>>>> Best
>>>>> David
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>> 
>> 
