In case the text formatting is lost below, I also added it as a comment in the JIRA ticket:
https://issues.apache.org/jira/browse/PHOENIX-4757 On Thu, Sep 13, 2018 at 3:24 PM, Gerald Sangudi <gsang...@23andme.com> wrote: > Sorry I missed Josh's reply; I've subscribed to the dev list now. > > Below is a copy-and-paste from our internal document. Thanks in advance > for your review and additional feedback on this. > > Gerald > > > > > > > > > > > > > > > > > > > > *BackgroundWe make extensive use of multi-column rowkeys and salting > <https://phoenix.apache.org/salted.html> in our different apache phoenix > deployments. We frequently perform group-by aggregations on these data > along a specific dimension that would benefit from predictably partitioning > the data along that dimension. Proposal:We propose to add table metadata to > allow schema designers to constrain salting to a subset of the rowkey, > rather than the full rowkey as it is today. This will introduce a mechanism > to partition data on a per-table basis along a single dimension without > application changes or much change to the phoenix runtime logic. We expect > this will result in substantially faster group-by’s along the salted > dimension and negligible penalties elsewhere. This feature has also been > proposed in PHOENIX-4757 > <https://issues.apache.org/jira/browse/PHOENIX-4757> where it was pointed > out that partitioning and sorting data along different dimensions is a > common pattern in other datastores as well.Theoretically, it could cause > hotspotting when querying along the salted dimension without the leading > rowkey - that would be an anti-pattern.Usage ExampleCurrent:Schema:CREATE > TABLE relationship (id_1 BIGINT NOT NULL,id_2 BIGINT NOT NULL,other_key > BIGINT NOT NULL,val SMALLINT,CONSTRAINT pk PRIMARY KEY (id_1, id_2, > other_key))SALT_BUCKETS=60;Query:Select id_2, sum(val)From > relationshipWhere id_1 in (2,3)Group by id_2Explain:0: jdbc:phoenix:> > EXPLAIN Select id_2, sum(val) From relationship Where id_1 in (2,3) Group > by id_2 > ;+-----------------------------------------------------------------------------------------+--------+| > PLAN | EST_BY > |+-----------------------------------------------------------------------------------------+--------+| > CLIENT 60-CHUNK PARALLEL 60-WAY SKIP SCAN ON 120 KEYS OVER RELATIONSHIP > [0,2] - [59,3] | null || SERVER AGGREGATE INTO DISTINCT ROWS BY [ID_2] > | null || CLIENT MERGE SORT > | > null > |+-----------------------------------------------------------------------------------------+--------+3 > rows selected (0.048 seconds)In this case, although the group by is > performed on both the client and regionserver, almost all of the actual > grouping happens on the client because the id_2’s are randomly distributed > across the regionservers. As a result, a lot of unnecessary data is > serialized to the client and grouped serially there. This can become quite > material with large resultsets.Proposed:Schema:CREATE TABLE relationship > (id_1 BIGINT NOT NULL,id_2 BIGINT NOT NULL,other_key BIGINT NOT NULL,val > SMALLINT,CONSTRAINT pk PRIMARY KEY (id_1, id_2, > other_key),SALT_BUCKETS=60,SALT_COLUMN = id_2);Query (unchanged):Select > id_2, sum(val)From relationshipWhere id_1 in (2,3)Group by id_2Explain > (unchanged)Under the proposal, the data are merely partitioned so that all > rows containing the same id_2 are on the same regionserver, the above query > will perform almost all of the grouping in parallel on the regionservers. > No special hint or changes to the query plan would be required to benefit. > Tables would need to be re-salted to take advantage of the new > functionality.Technical changes proposed to phoenix: - Create a new piece > of table-level metadata: SALT_COLUMN. SALT_COLUMN will instruct the salting > logic to generate a salt-byte based only on the specified column. If > unspecified, it will behave as it does today and default to salting the > entire rowkey. This metadata may be specified only when the table is > created and may not be modified. The specified column must be part of the > rowkey. - Modify all callers of getSaltingByte > <https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/schema/SaltingUtil.java#L77>(byte[] > value, int offset, int length, int bucketNum) to consistently leverage the > new metadata.- Tests- DocsDesign points:One salt column vs multiple salt > columns: Based on the existing signature for getSaltingByte, it seems > simpler to only support a single SALT_COLUMN rather than multiple arbitrary > SALT_COLUMNS. Known use-cases are completely supported by a single > column.Syntax: PHOENIX-4757 > <https://issues.apache.org/jira/browse/PHOENIX-4757> suggests an alternate, > less verbose syntax for defining the salt bucket. The SALT_COLUMN syntax is > suggested for clarity and consistency with other Phoenix table > options.Future Enhancements (not in scope)Different aspects of the query > execution runtime could take advantage of new metadata and implied > knowledge that the data are partitioned in a predictable manner. For > example: - It could be that client side grouping is completely unnecessary > in cases where the SALT_COLUMN is part of the group-by expression.- A query > that contains a literal equality predicate for the SALT_COLUMN can be > isolated to a single regionserver, rather than broadcast to all > regionservers.- A client-side merge-sort-join based on the SALT_COLUMN > could optimize organization of merges. - Similarly, a server-side hash join > could distribute only ‘necessary’ portions of the hash table to each > regionserver.If additional advantages of these types come for free, then > that’s great but can be follow on enhancements from the initial commit.* > > On Thu, Sep 13, 2018 at 9:33 AM, Thomas D'Silva <tdsi...@salesforce.com> > wrote: > >> Gerald, >> >> I think you missed Josh's reply here : https://lists.apache.org/thr >> ead.html/c5145461805429622a410c23c1199d578e146a5c94511b2d583 >> 3438b@%3Cdev.phoenix.apache.org%3E >> >> Could you explain how using a subset of the pk columns to generate the >> salt byte helps with partitioning, aggregations etc? >> >> Thanks, >> Thomas >> >> On Thu, Sep 13, 2018 at 8:32 AM, Gerald Sangudi <gsang...@23andme.com> >> wrote: >> >>> Hi folks, >>> >>> Any thoughts or feedback on this? >>> >>> Thanks, >>> Gerald >>> >>> On Mon, Sep 10, 2018 at 1:56 PM, Gerald Sangudi <gsang...@23andme.com> >>> wrote: >>> >>>> Hello folks, >>>> >>>> We have a requirement for salting based on partial, rather than full, >>>> rowkeys. My colleague Mike Polcari has identified the requirement and >>>> proposed an approach. >>>> >>>> I found an already-open JIRA ticket for the same issue: >>>> https://issues.apache.org/jira/browse/PHOENIX-4757. I can provide more >>>> details from the proposal. >>>> >>>> The JIRA proposes a syntax of SALT_BUCKETS(col, ...) = N, whereas Mike >>>> proposes SALT_COLUMN=col or SALT_COLUMNS=col, ... . >>>> >>>> The benefit at issue is that users gain more control over partitioning, >>>> and this can be used to push some additional aggregations and hash joins >>>> down to region servers. >>>> >>>> I would appreciate any go-ahead / thoughts / guidance / objections / >>>> feedback. I'd like to be sure that the concept at least is not >>>> objectionable. We would like to work on this and submit a patch down the >>>> road. I'll also add a note to the JIRA ticket. >>>> >>>> Thanks, >>>> Gerald >>>> >>>> >>> >> >