Rodrigo Boavida created SPARK-37536:
---------------------------------------
Summary: Allow for API user to disable Shuffle Operations while
running locally
Key: SPARK-37536
URL: https://issues.apache.org/jira/browse/SPARK-37536
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 3.3.0
Environment: Spark running in local mode
Reporter: Rodrigo Boavida
We have been using Spark in local mode as a small, embedded, in-memory SQL DB
for our microservice.
Spark's powerful SQL features and flexibility enable developers to build
efficient data querying solutions. Because our solution deals with small
datasets that must be queried through SQL at very low latencies, we found the
embedded approach to be a very good model.
We found through experimentation that Spark in local mode gains a significant
performance improvement (20-30% on average) when shuffling is disabled for
aggregation operations. Shuffling is introduced when the query execution plan
is expanded with ShuffleExchangeExec or BroadcastExchangeExec, so disabling it
means leaving the plan without those nodes.
I will be raising a PR proposing a new configuration variable:
*spark.shuffle.local.enabled*
This variable will default to true and will be checked when QueryExecution
creates the EnsureRequirements rule. If the value is false and Spark is
running in local mode, the execution plan will be kept unchanged.
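To make the intended behaviour concrete, here is a minimal sketch of the
decision described above, written as plain Python rather than Spark code. The
key spark.shuffle.local.enabled is only proposed in this ticket and does not
exist in Spark today. The helper names is_local_master and
should_add_exchanges are hypothetical and are not part of any Spark API. The
local-mode test assumes master URLs of the form "local" or "local[...]".

```python
# Hypothetical model of the proposed check; this is not Spark's actual code.
PROPOSED_KEY = "spark.shuffle.local.enabled"  # proposed here, not an existing Spark setting


def is_local_master(master: str) -> bool:
    """Return True for master URLs such as "local", "local[4]" or "local[*]"."""
    return master == "local" or master.startswith("local[")


def should_add_exchanges(conf: dict, master: str) -> bool:
    """Decide whether the planner would still add exchange nodes to the plan."""
    # The proposed default is true, which keeps the current behaviour.
    enabled = str(conf.get(PROPOSED_KEY, "true")).strip().lower() == "true"
    # Exchanges are skipped only when the flag is false AND Spark runs locally.
    return enabled or not is_local_master(master)


if __name__ == "__main__":
    print(should_add_exchanges({}, "local[*]"))                       # True
    print(should_add_exchanges({PROPOSED_KEY: "false"}, "local[*]"))  # False
    print(should_add_exchanges({PROPOSED_KEY: "false"}, "spark://host:7077"))  # True
```

Under this sketch, setting the flag to false has no effect on a cluster
master, so a configuration shared between local and cluster deployments would
keep its shuffles when running on the cluster.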
Looking forward to any comments and feedback.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)