Ahh, I get you now.

For a composite primary key made up of columns 1 through N, you want similar controls so that the salt value is computed from a prefix of those columns, 1 through M where M <= N, instead of always from all N columns.

For large numbers of salt buckets and a scan over a facet, you prune your search space considerably. Makes sense to me!
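
Concretely, with the 60-bucket example from the proposal below: a facet scan constrained to a single value of the salt column could be routed to 1 bucket instead of being skip-scanned across all 60.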

On 9/13/18 6:37 PM, Gerald Sangudi wrote:
In case the text formatting is lost below, I also added it as a comment in
the JIRA ticket:

https://issues.apache.org/jira/browse/PHOENIX-4757


On Thu, Sep 13, 2018 at 3:24 PM, Gerald Sangudi <gsang...@23andme.com>
wrote:

Sorry I missed Josh's reply; I've subscribed to the dev list now.

Below is a copy-and-paste from our internal document. Thanks in advance
for your review and additional feedback on this.

Gerald



















Background

We make extensive use of multi-column rowkeys and salting
<https://phoenix.apache.org/salted.html> in our various Apache Phoenix
deployments. We frequently perform group-by aggregations on these data
along a specific dimension, and those aggregations would benefit from
predictably partitioning the data along that dimension.

Proposal

We propose to add table metadata that allows schema designers to
constrain salting to a subset of the rowkey, rather than the full rowkey
as it is today. This introduces a mechanism to partition data on a
per-table basis along a single dimension, without application changes or
much change to the Phoenix runtime logic. We expect this to result in
substantially faster group-bys along the salted dimension and negligible
penalties elsewhere. This feature has also been proposed in PHOENIX-4757
<https://issues.apache.org/jira/browse/PHOENIX-4757>, where it was
pointed out that partitioning and sorting data along different
dimensions is a common pattern in other datastores as well. One caveat:
querying along the salted dimension without the leading rowkey column
could theoretically cause hotspotting; that would be an anti-pattern.
Usage Example

Current:

Schema:

  CREATE TABLE relationship (
    id_1      BIGINT NOT NULL,
    id_2      BIGINT NOT NULL,
    other_key BIGINT NOT NULL,
    val       SMALLINT,
    CONSTRAINT pk PRIMARY KEY (id_1, id_2, other_key)
  ) SALT_BUCKETS = 60;

Query:

  Select id_2, sum(val)
  From relationship
  Where id_1 in (2, 3)
  Group by id_2;

Explain:

  0: jdbc:phoenix:> EXPLAIN Select id_2, sum(val) From relationship Where id_1 in (2,3) Group by id_2;
  +-----------------------------------------------------------------------------------------+--------+
  |                                           PLAN                                           | EST_BY |
  +-----------------------------------------------------------------------------------------+--------+
  | CLIENT 60-CHUNK PARALLEL 60-WAY SKIP SCAN ON 120 KEYS OVER RELATIONSHIP [0,2] - [59,3]   | null   |
  |     SERVER AGGREGATE INTO DISTINCT ROWS BY [ID_2]                                        | null   |
  | CLIENT MERGE SORT                                                                        | null   |
  +-----------------------------------------------------------------------------------------+--------+
  3 rows selected (0.048 seconds)
In this case, although the group-by is performed on both the client and
the regionservers, almost all of the actual grouping happens on the
client, because the id_2 values are randomly distributed across the
regionservers. As a result, a lot of unnecessary data is serialized to
the client and grouped serially there. This can become quite material
with large resultsets.

Proposed:

Schema:
  CREATE TABLE relationship (
    id_1      BIGINT NOT NULL,
    id_2      BIGINT NOT NULL,
    other_key BIGINT NOT NULL,
    val       SMALLINT,
    CONSTRAINT pk PRIMARY KEY (id_1, id_2, other_key)
  ) SALT_BUCKETS = 60, SALT_COLUMN = id_2;

Query (unchanged):

  Select id_2, sum(val)
  From relationship
  Where id_1 in (2, 3)
  Group by id_2;

Explain (unchanged).

Under the proposal the data are merely partitioned differently: because
all rows containing the same id_2 are on the same regionserver, the
above query will perform almost all of the grouping in parallel on the
regionservers. No special hint or change to the query plan would be
required to benefit. Existing tables would need to be re-salted to take
advantage of the new functionality.
Technical changes proposed to Phoenix:

- Create a new piece of table-level metadata: SALT_COLUMN. SALT_COLUMN
  instructs the salting logic to generate the salt byte based only on
  the specified column. If unspecified, salting behaves as it does today
  and defaults to the entire rowkey. This metadata may be specified only
  when the table is created and may not be modified afterwards. The
  specified column must be part of the rowkey.
- Modify all callers of getSaltingByte
  <https://github.com/apache/phoenix/blob/master/phoenix-core/src/main/java/org/apache/phoenix/schema/SaltingUtil.java#L77>
  (byte[] value, int offset, int length, int bucketNum) to consistently
  leverage the new metadata; a sketch of the idea follows this list.
- Tests
- Docs
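
To make the second bullet concrete, here is a minimal, hypothetical
sketch of the idea; it is not the actual patch. Only the signature of
getSaltingByte mirrors SaltingUtil; the hash is a stand-in, the column
offsets are hard-coded, and Phoenix's real BIGINT rowkey encoding is
ignored:

  import java.nio.ByteBuffer;

  public class SaltColumnSketch {

      // Stand-in for SaltingUtil.getSaltingByte(byte[], int, int, int):
      // hash a byte range of the rowkey and map it into [0, bucketNum).
      static byte getSaltingByte(byte[] value, int offset, int length, int bucketNum) {
          int hash = 1;
          for (int i = offset; i < offset + length; i++) {
              hash = 31 * hash + value[i];
          }
          return (byte) Math.abs(hash % bucketNum);
      }

      public static void main(String[] args) {
          int buckets = 60;
          // Rowkey for (id_1, id_2, other_key) = (2, 42, 7): three 8-byte BIGINTs.
          byte[] rowKey = ByteBuffer.allocate(24).putLong(2L).putLong(42L).putLong(7L).array();

          // Today: the salt byte is computed over the entire rowkey, so rows
          // sharing an id_2 scatter across buckets.
          byte fullKeySalt = getSaltingByte(rowKey, 0, rowKey.length, buckets);

          // Proposed: compute the salt byte over the id_2 slice only (offset 8,
          // length 8), so every row with the same id_2 lands in the same
          // bucket, and hence on the same regionserver.
          byte saltColumnSalt = getSaltingByte(rowKey, 8, 8, buckets);

          System.out.println("full-key salt: " + fullKeySalt
              + ", salt-column salt: " + saltColumnSalt);
      }
  }

In a real implementation the offset and length of the SALT_COLUMN's
slice would come from the table's rowkey schema rather than being
hard-coded.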
Design points:

- One salt column vs. multiple salt columns: based on the existing
  signature of getSaltingByte, it seems simpler to support only a single
  SALT_COLUMN rather than multiple arbitrary SALT_COLUMNS. Known use
  cases are completely supported by a single column.
- Syntax: PHOENIX-4757
  <https://issues.apache.org/jira/browse/PHOENIX-4757> suggests an
  alternate, less verbose syntax for defining the salt buckets. The
  SALT_COLUMN syntax is suggested here for clarity and consistency with
  other Phoenix table options.
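
For comparison (illustrative only; neither syntax is final), the example
table above would be declared under the PHOENIX-4757 suggestion with
SALT_BUCKETS(id_2) = 60, and under this proposal with SALT_BUCKETS = 60,
SALT_COLUMN = id_2.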
Future Enhancements (not in scope)

Different aspects of the query execution runtime could take advantage of
the new metadata and the implied knowledge that the data are partitioned
in a predictable manner. For example:

- Client-side grouping may be completely unnecessary when the
  SALT_COLUMN is part of the group-by expression.
- A query that contains a literal equality predicate on the SALT_COLUMN
  could be isolated to a single regionserver, rather than broadcast to
  all regionservers (see the example at the end of this section).
- A client-side merge-sort join based on the SALT_COLUMN could optimize
  the organization of merges.
- Similarly, a server-side hash join could distribute only the
  ‘necessary’ portions of the hash table to each regionserver.

If additional advantages of these types come for free, that's great, but
they can be follow-on enhancements after the initial commit.
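
To illustrate the equality-predicate bullet with the example schema
above (hypothetical behavior, not something Phoenix does today): with
SALT_COLUMN = id_2, a query such as SELECT val FROM relationship WHERE
id_1 = 2 AND id_2 = 42 fixes the salt byte, so the client could route
the scan to the single bucket, and hence the single regionserver, that
holds id_2 = 42, instead of skip-scanning all 60 buckets.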

On Thu, Sep 13, 2018 at 9:33 AM, Thomas D'Silva <tdsi...@salesforce.com>
wrote:

Gerald,

I think you missed Josh's reply here:
https://lists.apache.org/thread.html/c5145461805429622a410c23c1199d578e146a5c94511b2d5833438b@%3Cdev.phoenix.apache.org%3E

Could you explain how using a subset of the pk columns to generate the
salt byte helps with partitioning, aggregations etc?

Thanks,
Thomas

On Thu, Sep 13, 2018 at 8:32 AM, Gerald Sangudi <gsang...@23andme.com>
wrote:

Hi folks,

Any thoughts or feedback on this?

Thanks,
Gerald

On Mon, Sep 10, 2018 at 1:56 PM, Gerald Sangudi <gsang...@23andme.com>
wrote:

Hello folks,

We have a requirement for salting based on partial, rather than full,
rowkeys. My colleague Mike Polcari has identified the requirement and
proposed an approach.

I found an already-open JIRA ticket for the same issue:
https://issues.apache.org/jira/browse/PHOENIX-4757. I can provide more
details from the proposal.

The JIRA proposes a syntax of SALT_BUCKETS(col, ...) = N, whereas Mike
proposes SALT_COLUMN = col or SALT_COLUMNS = col, ...

The benefit at issue is that users gain more control over partitioning,
and this can be used to push some additional aggregations and hash joins
down to region servers.

I would appreciate any go-ahead / thoughts / guidance / objections /
feedback. I'd like to be sure that the concept at least is not
objectionable. We would like to work on this and submit a patch down the
road. I'll also add a note to the JIRA ticket.

Thanks,
Gerald





