Hey.

My organization is working on implementing a generic sharding mechanism
where teams can configure mapping of keys to a set of shards, and shards to
a set of containers, for a given service. This is analogous to Facebook's
shard manager [1]. As far as the routing layer is concerned, the idea is
that:

- a control plane component provides information for selecting a shard
based on information contained in the incoming request (e.g. a mapping from
keys to a shard identifier, or from key ranges to a shard identifier).
Assignments of keys to shards do not have to change very frequently (e.g.
monthly), but this explicit mapping can be quite large (e.g. millions of
entries).
- a control plane component provides information on how to map shards to
running containers (network endpoints).
- the data plane (gRPC or Envoy) uses that information to send requests to
the appropriate endpoint.
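To make the two mappings concrete, here is a minimal Go sketch of the lookup the routing layer would perform; all type and field names are illustrative, not part of any existing gRPC or Envoy API:

```go
package main

import "fmt"

// ShardID identifies a shard; a hypothetical name for illustration.
type ShardID string

// Routing holds the two tables the control plane would serve.
type Routing struct {
	// Explicit key -> shard assignment (can run to millions of entries).
	KeyToShard map[string]ShardID
	// Shard -> network endpoints of the containers currently hosting it.
	ShardToEndpoints map[ShardID][]string
}

// Resolve returns the candidate endpoints for a request key, or false
// if the key or its shard has no assignment.
func (r *Routing) Resolve(key string) ([]string, bool) {
	shard, ok := r.KeyToShard[key]
	if !ok {
		return nil, false
	}
	eps, ok := r.ShardToEndpoints[shard]
	return eps, ok
}

func main() {
	r := &Routing{
		KeyToShard: map[string]ShardID{"user-42": "shard-7"},
		ShardToEndpoints: map[ShardID][]string{
			"shard-7": {"10.0.0.5:443", "10.0.0.6:443"},
		},
	}
	eps, ok := r.Resolve("user-42")
	fmt.Println(eps, ok)
}
```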

I'm looking for guidance on how to implement the routing layer in gRPC,
potentially using xDS. We also use Envoy for the same purpose, so ideally,
the solution is similar for both. In our initial design for Envoy, we use:

1. A custom filter to map requests to shards. In gRPC, this could be done
via an interceptor that injects the shard identifier into the request
context.
2. Load Balancer Subsets [2] to route requests to individual endpoints
within a CDS cluster. This assumes that all shards are part of the same CDS
cluster, and identifiers are communicated from the control plane to the
clients via the EDS config.endpoint.v3.LbEndpoint metadata field [3]. The
problem here is that, as far as I can see, Load Balancer Subsets are not
implemented in gRPC.

For 2., an alternative design would use a separate CDS cluster for each
shard. This has some advantages, such as more configuration options and
detailed per-shard metrics, but it is more costly in terms of Envoy memory
footprint, metrics volume (Envoy generates a lot of metrics for each
cluster -- though we could work around that), and the number of CDS and EDS
subscriptions. So first, I'd like to know whether support for Load Balancer
Subsets is something the gRPC team has considered and rejected, or whether
it has simply not been implemented because it is low priority (I do not see
any mention of it in the xDS-related gRFCs).

If the team does not plan on supporting load balancing subsets, would you
have recommendations on an alternative way to implement sharding within or
on top of gRPC? There are a few ways I can think of:

1. implement sharding entirely on top of the stack, using multiple gRPC
channels. Exposing an interface that looks like a single gRPC client is
some amount of work on the data plane/library side, but it is possible.
2. put each shard in a separate CDS cluster rather than using load
balancer subsets, simply using the mechanisms described in A28 - gRPC xDS
traffic splitting and routing [4] to route to the correct endpoint. This
would increase the number of xDS subscriptions by a factor of ~1000 for
sharded services, which may become a problem. I think this would also
increase the number of connections to each backend, since each container
typically hosts multiple shards.
3. implement a custom load balancing policy for gRPC that does the
mapping, plugged in via the cluster's LoadBalancingPolicy extension point
[5]. I haven't dug too much into what that would look like in practice.
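For option 1., a minimal sketch of the "multiple channels behind one client" idea. Here `conn` and `connFor` are stand-ins for `*grpc.ClientConn` and dialing; the shard-to-endpoint table is simplified to one address per shard, and all names are hypothetical:

```go
package main

import (
	"fmt"
	"sync"
)

// conn is a placeholder for *grpc.ClientConn.
type conn struct{ target string }

// shardedClient keeps one lazily-created channel per shard behind a
// single client-like interface, fed by control-plane tables.
type shardedClient struct {
	mu         sync.Mutex
	keyToShard map[string]string // key -> shard, from the control plane
	shardAddr  map[string]string // shard -> endpoint (simplified)
	conns      map[string]*conn  // per-shard channels, created on demand
}

// connFor returns (and caches) the channel for the shard owning key.
func (c *shardedClient) connFor(key string) (*conn, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	shard, ok := c.keyToShard[key]
	if !ok {
		return nil, fmt.Errorf("no shard for key %q", key)
	}
	if cc, ok := c.conns[shard]; ok {
		return cc, nil
	}
	addr, ok := c.shardAddr[shard]
	if !ok {
		return nil, fmt.Errorf("no endpoint for shard %q", shard)
	}
	cc := &conn{target: addr} // in real code: grpc.NewClient(addr, ...)
	c.conns[shard] = cc
	return cc, nil
}

func main() {
	c := &shardedClient{
		keyToShard: map[string]string{"user-42": "shard-7"},
		shardAddr:  map[string]string{"shard-7": "10.0.0.5:443"},
		conns:      map[string]*conn{},
	}
	cc, err := c.connFor("user-42")
	fmt.Println(cc.target, err)
}
```

One design note: because containers typically host multiple shards, a real implementation would probably key the connection cache by endpoint rather than by shard, so that co-located shards share a channel.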

If it matters, we plan to use Go and Java (we also use Python, but
typically behind an egress Envoy proxy in that case). Has anyone
successfully done any of these? (Some teams here have done 1. to some
extent, but we are aiming to make sharding more transparent and less work
for users.)

Thanks,
Antoine.

[1]
https://research.facebook.com/publications/shard-manager-a-generic-shard-management-framework-for-geo-distributed-applications/
[2]
https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/load_balancing/subsets
[3]
https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/endpoint/v3/endpoint_components.proto#config-endpoint-v3-lbendpoint
[4]
https://github.com/grpc/proposal/blob/master/A28-xds-traffic-splitting-and-routing.md
[5]
https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/cluster/v3/cluster.proto#config-cluster-v3-loadbalancingpolicy
