GitHub user mairbek opened a pull request:

    https://github.com/apache/beam/pull/3729

    [BEAM-1542] Added a preprocessing step to the Cloud Spanner sink.

    The general intuition we follow here: if mutations are presorted by the 
primary key before batching, the mutations in a batch are more likely to 
end up in the same partition. This reduces the number of participants in the 
distributed transaction on the Cloud Spanner side and leads to better 
throughput.
    
    Mutations are encoded before the other steps run to avoid paying the 
serialization price repeatedly. Primary keys are encoded using the OrderedCode 
library, and the ApproximateQuantiles transform is used to sample keys.
    
    Once the primary keys are sampled, we assign each mutation the index of 
the closest sampled primary key as its key and group the mutations by that key. 
Range deletes are submitted separately.
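
    The assignment step can be illustrated with a minimal, standalone sketch. 
It is not the code in this pull request: the class and method names 
(KeyBucketing, bucketFor, groupByBucket) are hypothetical, keys are plain byte 
arrays rather than OrderedCode output, and "closest" is assumed to mean the 
first sampled key that is not less than the mutation's key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not part of the Beam Spanner sink.
public class KeyBucketing {

  // Compares two encoded keys byte by byte, treating bytes as unsigned,
  // so that the order of the encodings matches the order of the keys.
  static int compareBytes(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      int cmp = Integer.compare(a[i] & 0xff, b[i] & 0xff);
      if (cmp != 0) {
        return cmp;
      }
    }
    return Integer.compare(a.length, b.length);
  }

  // Returns the index of the first sampled key that is >= the given key
  // (assumption: this is what "closest" means). Keys larger than every
  // sample get the index samples.size(). The samples must be sorted.
  static int bucketFor(List<byte[]> samples, byte[] key) {
    int lo = 0;
    int hi = samples.size();
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (compareBytes(samples.get(mid), key) < 0) {
        lo = mid + 1;
      } else {
        hi = mid;
      }
    }
    return lo;
  }

  // Groups encoded keys by bucket index. In a pipeline this would be a
  // keyed grouping step; here it is an in-memory map for illustration.
  static Map<Integer, List<byte[]>> groupByBucket(
      List<byte[]> samples, List<byte[]> keys) {
    Map<Integer, List<byte[]>> groups = new TreeMap<>();
    for (byte[] key : keys) {
      groups.computeIfAbsent(bucketFor(samples, key), k -> new ArrayList<>())
          .add(key);
    }
    return groups;
  }
}
```

    With sampled keys "g" and "p", the keys "a" and "b" land in bucket 0, "h" 
in bucket 1, and "z" in bucket 2, so each group covers a contiguous key range.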

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mairbek/beam prepro-pr

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/beam/pull/3729.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3729
    
----
commit 7aeb0b0d02c308c690fb598b69a7aec649e4bb89
Author: Mairbek Khadikov <[email protected]>
Date:   2017-07-20T23:22:04Z

    Added a preprocessing step to the Cloud Spanner sink.
    
    The general intuition we follow here: if mutations are presorted by the 
primary key before batching, the mutations in a batch are more likely to 
end up in the same partition. This reduces the number of participants in the 
distributed transaction on the Cloud Spanner side and leads to better 
throughput.
    
    Mutations are encoded before the other steps run to avoid paying the 
serialization price repeatedly. Primary keys are encoded using the OrderedCode 
library, and the ApproximateQuantiles transform is used to sample keys.
    
    Once the primary keys are sampled, we assign each mutation the index of 
the closest sampled primary key as its key and group the mutations by that key. 
Range deletes are submitted separately.

----


