[ 
https://issues.apache.org/jira/browse/SPARK-46798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Randall Schwager updated SPARK-46798:
-------------------------------------
    Description: 
I'd like to propose, and implement if approved, support for custom partition 
location assignment. [Please find the design doc for SPARK-46798 describing the 
change 
here.|https://docs.google.com/document/d/1RoEk_mt8AUh9sTQZ1NfzIuuYKf1zx6BP1K3IlJ2b8iM/edit#heading=h.pbt6pdb2jt5c]

SPARK-15406 added Kafka consumer support to Spark Structured Streaming, but it 
did not add custom partition location assignment as a feature. The Structured 
Streaming Kafka consumer as it exists today evenly allocates Kafka topic 
partitions to executors without regard to Kafka broker rack information or 
executor location. This behavior can drive large cross-AZ networking costs in 
large deployments.
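To make the proposal concrete: a rack-aware strategy would prefer executors in the same rack (or AZ) as each partition's leader broker, falling back to today's even spread when no co-located executor exists. The sketch below is illustrative only — the function and parameter names are hypothetical, not an existing Spark API:

```python
from itertools import cycle

def assign_partitions(partitions, executors, leader_rack, executor_rack):
    """Toy model of rack-aware partition assignment (hypothetical, not Spark's API).

    partitions:    list of (topic, partition) tuples
    executors:     list of executor host names
    leader_rack:   maps (topic, partition) -> rack of its Kafka leader broker
    executor_rack: maps executor host -> rack
    """
    fallback = cycle(executors)  # today's behavior: even, rack-oblivious spread
    assignment = {}
    for tp in partitions:
        rack = leader_rack.get(tp)
        same_rack = [e for e in executors
                     if rack is not None and executor_rack.get(e) == rack]
        if same_rack:
            # Prefer an executor co-located with the partition leader,
            # spreading load across co-located executors; this is what
            # avoids the cross-AZ traffic described above.
            assignment[tp] = same_rack[len(assignment) % len(same_rack)]
        else:
            # No rack info or no co-located executor: round-robin as today.
            assignment[tp] = next(fallback)
    return assignment
```

With two executors in different AZs and one partition leader in each, this maps each partition onto the executor in its own AZ, while partitions with unknown leader racks still get spread evenly.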

[The design doc for 
SPARK-15406|https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit#heading=h.k36c6oyz89xw]
 described the ability to assign Kafka partitions to particular executors (a 
feature which would enable rack awareness), but it seems that feature was never 
implemented.

For DStreams users, there does seem to be a way to assign Kafka partitions to 
Spark executors in a custom fashion with 
[LocationStrategies.PreferFixed|https://github.com/apache/spark/blob/master/connector/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/LocationStrategy.scala#L69],
 so this sort of functionality has a precedent.
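For context, PreferFixed lets the user supply a fixed partition-to-host map up front, with unmapped partitions spread evenly across executors (as PreferConsistent does). A toy model of that lookup — names hypothetical, not the actual Scala implementation — looks like:

```python
def preferred_location(fixed_map, tp, executors):
    """Model of a PreferFixed-style lookup (illustrative only, not the Scala API).

    fixed_map: user-supplied {(topic, partition): host}
    tp:        the (topic, partition) being scheduled
    executors: available executor hosts, used for unmapped partitions
    """
    if tp in fixed_map:
        # User pinned this partition to a specific host.
        return fixed_map[tp]
    # Unmapped partitions fall back to a deterministic even spread,
    # mirroring the PreferConsistent behavior.
    return executors[tp[1] % len(executors)]
```

A custom-assignment API for Structured Streaming could expose an analogous user-supplied mapping.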

 

  was:
SPARK-15406 added Kafka consumer support to Spark Structured Streaming, but it 
did not add custom partition location assignment as a feature. The Structured 
Streaming Kafka consumer as it exists today evenly allocates Kafka topic 
partitions to executors without regard to Kafka broker rack information or 
executor location. This behavior can drive large cross-AZ networking costs in 
large deployments.

In the [Design 
Doc|https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit#heading=h.k36c6oyz89xw]
 for SPARK-15406, the ability to assign Kafka partitions to particular 
executors (a feature which would enable rack awareness) was discussed, but 
never implemented.

For DStreams users, there does seem to be a way to assign Kafka partitions to 
Spark executors in a custom fashion: 
[LocationStrategies.PreferFixed|https://github.com/apache/spark/blob/master/connector/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/LocationStrategy.scala#L69].

I'd like to propose, and implement if approved, support for custom partition 
location assignment. Please find the design doc describing the change 
[here|https://docs.google.com/document/d/1RoEk_mt8AUh9sTQZ1NfzIuuYKf1zx6BP1K3IlJ2b8iM/edit#heading=h.pbt6pdb2jt5c]
 





> Kafka custom partition location assignment in Spark Structured Streaming 
> (rack awareness)
> -----------------------------------------------------------------------------------------
>
>                 Key: SPARK-46798
>                 URL: https://issues.apache.org/jira/browse/SPARK-46798
>             Project: Spark
>          Issue Type: New Feature
>          Components: Structured Streaming
>    Affects Versions: 3.1.0, 3.2.0, 3.3.0, 3.4.0, 3.5.0
>            Reporter: Randall Schwager
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
