mosche commented on code in PR #22157:
URL: https://github.com/apache/beam/pull/22157#discussion_r923226348
##########
runners/spark/src/test/java/org/apache/beam/runners/spark/structuredstreaming/SparkSessionRule.java:
##########
@@ -37,23 +39,43 @@ public SparkSessionRule(String sparkMaster, Map<String, String> sparkConfig) {
     builder = SparkSession.builder();
     sparkConfig.forEach(builder::config);
     builder.master(sparkMaster);
+    builder.config("spark.sql.shuffle.partitions", numDriverCores(sparkMaster));
   }

   public SparkSessionRule(KV<String, String>... sparkConfig) {
-    this("local", sparkConfig);
+    this("local[2]", sparkConfig);
Review Comment:
If you prefer using `local[4]` I'm fine with changing it. `local[2]` is just a value taken from past experience working a lot on improving the performance of Spark tests.
For tests it's generally a misconception that more cores (e.g. all available cores) mean more performance. Tests typically run in parallel in multiple forks, so more cores also means more contention. Also, using more partitions means more overhead on the processing side (always assuming tests operate on rather tiny datasets). That's why it's important to reduce the number of shuffle partitions for tests before Spark 3.2 (later Spark versions appear to be smarter about this).
On the other hand, using just 1 core / 1 partition can be dangerous, as it might obfuscate problems in the code (e.g. invalid assumptions about ordering).
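To make the trade-off concrete, here is a minimal, self-contained sketch (not the PR's actual code): it builds a test `SparkSession` on `local[2]` and lowers `spark.sql.shuffle.partitions` to the core count. The `coresFromMaster` helper and the class name are hypothetical stand-ins for the rule's `numDriverCores`:

```java
import org.apache.spark.sql.SparkSession;

public class LocalTestSessionSketch {

  // Hypothetical helper: extract the core count from a local master URL such as
  // "local[2]"; fall back to 1 for plain "local". The PR's numDriverCores may
  // differ, this only sketches the concept.
  static int coresFromMaster(String master) {
    if (master.startsWith("local[") && master.endsWith("]")) {
      String n = master.substring("local[".length(), master.length() - 1);
      return "*".equals(n) ? Runtime.getRuntime().availableProcessors() : Integer.parseInt(n);
    }
    return 1;
  }

  public static void main(String[] args) {
    String master = "local[2]"; // 2 cores: enough to surface invalid ordering assumptions
    SparkSession session =
        SparkSession.builder()
            .master(master)
            // Keep shuffle partitions in line with the core count so tiny test
            // datasets are not split into 200 near-empty partitions (Spark's
            // default before 3.2's adaptive coalescing).
            .config("spark.sql.shuffle.partitions", coresFromMaster(master))
            .appName("local-test-session-sketch")
            .getOrCreate();
    try {
      // A tiny dataset with a shuffle (distinct): 2 partitions keep overhead low
      // while still exercising more than one partition.
      long distinct = session.range(0, 100).distinct().count();
      System.out.println("distinct = " + distinct);
    } finally {
      session.stop();
    }
  }
}
```

With two partitions the tiny test dataset avoids the per-partition overhead of the default 200 shuffle partitions, yet still has more than one partition, so ordering assumptions can fail in tests rather than only in production.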