[ 
https://issues.apache.org/jira/browse/PHOENIX-4757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16682960#comment-16682960
 ] 

James Taylor commented on PHOENIX-4757:
---------------------------------------

For this query today, Phoenix will cause only regions prefixed with 2 or 3 to 
be scanned:
{code:java}
Select id_2, sum(val)
>From relationship
Where id_1 in (2,3)
Group by id_2{code}
If the table is salted, then for each salt bucket, the above is done. For 
example, with SALT_BUCKET=16, there will be 16 parallel scans all looking for 
data with the second byte being 2 or 3. This sets an upper bound on the size of 
the merge sort.

How would this be handled with this proposal? Would every region need to be 
scanned and merge sorted? In general, how would this impact scans for id_1? 
What's the difference between your proposal and what Phoenix would do if the PK 
was (id_2, id_1)?

Would another, simpler alternative given a PK of (id_2, id_1) and a filter of 
id_1 be to improve the filtering by using a skip scan? Given the above query, 
Phoenix would aggregate in-place since you're grouping by the leading PK 
column. What it's not doing automatically is skipping to id_1 of 2 or 3 during 
the aggregation. It might just work if you add a /*+ SKIP_SCAN */ hint to the 
above query. If the cardinality of id_2 is low then you can get better 
performance by using this hint for queries of specific id_1 values.

 

> composite key salt_buckets
> --------------------------
>
>                 Key: PHOENIX-4757
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-4757
>             Project: Phoenix
>          Issue Type: Improvement
>    Affects Versions: 4.11.0
>            Reporter: cmd
>            Priority: Major
>             Fix For: 4.11.0
>
>
> CREATE TABLE IF NOT EXISTS user_events (
>  user_id VARCHAR NOT NULL,
>  event_type VARCHAR NOT NULL,
>  event_time VARCHAR NOT NULL
>  event_msg VARCHAR NOT NULL
>  event_status VARCHAR NOT NULL
>  event_opt VARCHAR NOT NULL
>  CONSTRAINT my_pk PRIMARY KEY (user_id,event_type,event_time)) 
> SALT_BUCKETS=128;
> and my query is:
>  1.select event_type,count(0) from us_population where user_id='xxxx' group 
> by event_type
>  2.select count(0) from us_population where user_id='xxxx' and 
> event_type='0101'
>  3.select * from us_population where user_id='xxxx' and event_type='0101' and 
> event_time>'20180101' and event_time<'20180201' order by event_time limit 
> 50,100
> Concurrency query ratio:
>  1:80%
>  2:10%
>  3:10% 
>  user_events data :50billion
>  It can be a field/some fileds of the primary key salted by hash
>  grammar with "SALT_BUCKETS(user_id)=4" or 
> "SALT_BUCKETS(user_id,event_type)=4"
> ref:
>  
> [https://www.safaribooksonline.com/library/view/greenplum-architecture/9781940540337/xhtml/chapter03.xhtml]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to