[
https://issues.apache.org/jira/browse/KUDU-3135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Grant Henke reassigned KUDU-3135:
---------------------------------
Assignee: Grant Henke
> Add Client Metadata Tokens
> --------------------------
>
> Key: KUDU-3135
> URL: https://issues.apache.org/jira/browse/KUDU-3135
> Project: Kudu
> Issue Type: Improvement
> Components: client
> Affects Versions: 1.12.0
> Reporter: Grant Henke
> Assignee: Grant Henke
> Priority: Major
> Labels: roadmap-candidate, scalability
>
> Currently when a distributed task is done using the Kudu client, the
> driver/coordinator client needs to open the table to request its current
> metadata and locations. Then it can distribute the work to tasks/executors on
> remote nodes. In the case of reading data, often ScanTokens are used to
> distribute the work, and in the case of writing data perhaps just the table
> name is required.
> The problem is that each parallel task then also needs to open the table to
> request the metadata for the table. Using Spark as an example, this happens
> when deserializing the scan tokens in KuduRDD
> ([here|https://github.com/apache/kudu/blob/master/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduRDD.scala#L107-L108])
> or when writing rows using the KuduContext
> ([here|https://github.com/apache/kudu/blob/master/java/kudu-spark/src/main/scala/org/apache/kudu/spark/kudu/KuduContext.scala#L466]).
> This results in a large burst of metadata requests to the leader Kudu master
> all at once. Given the Kudu master is only a single server and requests can't
> be served from the follower masters, this effectively limits the amount of
> parallel tasks that can run in a large Kudu deployment. Even if the follower
> masters could service the requests, that still limits scalability in very
> large clusters given most deployments would only have 3-5 masters.
> Adding a metadata token, similar to a scan token, would be a useful way to
> allow the single driver to fetch all the metadata required for the parallel
> tasks. The tokens can be serialized and then passed to each task in a similar
> fashion to scan tokens.
> Of course in a pessimistic case, something may change between generation of
> the token and the start of the task. In that case a request would need to be
> sent to get the updated metadata. However, that scenario should be rare and
> likely would not result in all of the requests happening at the same time.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)