rschwagercharter opened a new pull request, #44954:
URL: https://github.com/apache/spark/pull/44954

   ### What changes were proposed in this pull request?
   Add support for custom partition location assignment for Kafka sources in 
Structured Streaming.
   
   
   ### Why are the changes needed?
   
   [Please see the design doc for greater detail and further 
discussion](https://docs.google.com/document/d/1RoEk_mt8AUh9sTQZ1NfzIuuYKf1zx6BP1K3IlJ2b8iM/edit#heading=h.pbt6pdb2jt5c)
   
   [SPARK-15406](https://issues.apache.org/jira/browse/SPARK-15406) Added Kafka 
consumer support to Spark Structured Streaming, but it did not add custom 
partition location assignment as a feature. The Structured Streaming Kafka 
consumer as it exists today evenly allocates Kafka topic partitions to 
executors without regard to Kafka broker rack information or executor location. 
This behavior can drive large cross-AZ networking costs in large deployments.
   
   [The design doc for 
SPARK-15406](https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit#heading=h.k36c6oyz89xw)
 described the ability to assign Kafka partitions to particular executors (a 
feature which would enable rack awareness), but it seems that feature was never 
implemented.
   
   For DStreams users, there does seem to be a way to assign Kafka partitions 
to Spark executors in a custom fashion with 
[LocationStrategies.PreferFixed](https://github.com/apache/spark/blob/master/connector/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/LocationStrategy.scala#L69),
 so this sort of functionality has a precedent.
   
   
   ### Does this PR introduce _any_ user-facing change?
   An additional parameter will be accepted on the Kafka source provider. This 
parameter is provisionally named `partitionlocationassigner`. The parameter 
takes a class name, which when instantiated gives the Kafka source with 
user-provided Kafka partition location suggestions. The class should implement 
a new trait defined in this PR and described in the [design 
document](https://docs.google.com/document/d/1RoEk_mt8AUh9sTQZ1NfzIuuYKf1zx6BP1K3IlJ2b8iM/edit#heading=h.pbt6pdb2jt5c).
   
   ### How was this patch tested?
   Unit tests are forthcoming. 
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to