[
https://issues.apache.org/jira/browse/PHOENIX-4757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16682960#comment-16682960
]
James Taylor commented on PHOENIX-4757:
---------------------------------------
For this query today, Phoenix will cause only regions prefixed with 2 or 3 to
be scanned:
{code:java}
Select id_2, sum(val)
>From relationship
Where id_1 in (2,3)
Group by id_2{code}
If the table is salted, then for each salt bucket, the above is done. For
example, with SALT_BUCKET=16, there will be 16 parallel scans all looking for
data with the second byte being 2 or 3. This sets an upper bound on the size of
the merge sort.
How would this be handled with this proposal? Would every region need to be
scanned and merge sorted? In general, how would this impact scans for id_1?
What's the difference between your proposal and what Phoenix would do if the PK
was (id_2, id_1)?
Would another, simpler alternative given a PK of (id_2, id_1) and a filter of
id_1 be to improve the filtering by using a skip scan? Given the above query,
Phoenix would aggregate in-place since you're grouping by the leading PK
column. What it's not doing automatically is skipping to id_1 of 2 or 3 during
the aggregation. It might just work if you add a /*+ SKIP_SCAN */ hint to the
above query. If the cardinality of id_2 is low then you can get better
performance by using this hint for queries of specific id_1 values.
> composite key salt_buckets
> --------------------------
>
> Key: PHOENIX-4757
> URL: https://issues.apache.org/jira/browse/PHOENIX-4757
> Project: Phoenix
> Issue Type: Improvement
> Affects Versions: 4.11.0
> Reporter: cmd
> Priority: Major
> Fix For: 4.11.0
>
>
> CREATE TABLE IF NOT EXISTS user_events (
> user_id VARCHAR NOT NULL,
> event_type VARCHAR NOT NULL,
> event_time VARCHAR NOT NULL
> event_msg VARCHAR NOT NULL
> event_status VARCHAR NOT NULL
> event_opt VARCHAR NOT NULL
> CONSTRAINT my_pk PRIMARY KEY (user_id,event_type,event_time))
> SALT_BUCKETS=128;
> and my query is:
> 1.select event_type,count(0) from us_population where user_id='xxxx' group
> by event_type
> 2.select count(0) from us_population where user_id='xxxx' and
> event_type='0101'
> 3.select * from us_population where user_id='xxxx' and event_type='0101' and
> event_time>'20180101' and event_time<'20180201' order by event_time limit
> 50,100
> Concurrency query ratio:
> 1:80%
> 2:10%
> 3:10%
> user_events data :50billion
> It can be a field/some fileds of the primary key salted by hash
> grammar with "SALT_BUCKETS(user_id)=4" or
> "SALT_BUCKETS(user_id,event_type)=4"
> ref:
>
> [https://www.safaribooksonline.com/library/view/greenplum-architecture/9781940540337/xhtml/chapter03.xhtml]
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)