mosche commented on code in PR #22157:
URL: https://github.com/apache/beam/pull/22157#discussion_r923226348


##########
runners/spark/src/test/java/org/apache/beam/runners/spark/structuredstreaming/SparkSessionRule.java:
##########
@@ -37,23 +39,43 @@ public SparkSessionRule(String sparkMaster, Map<String, String> sparkConfig) {
     builder = SparkSession.builder();
     sparkConfig.forEach(builder::config);
     builder.master(sparkMaster);
+    builder.config("spark.sql.shuffle.partitions", 
numDriverCores(sparkMaster));
   }
 
   public SparkSessionRule(KV<String, String>... sparkConfig) {
-    this("local", sparkConfig);
+    this("local[2]", sparkConfig);

Review Comment:
   If you prefer using `local[4]`, I'm fine changing it. `local[2]` is just a value taken from past experience working a lot on improving the performance of Spark tests.
   For tests, it's generally a misconception that more cores (e.g. all available cores) mean more performance. Tests typically run in parallel in multiple forks, so more cores also mean more congestion. Likewise, more partitions mean more overhead on the processing side (always assuming tests operate on rather tiny datasets). That's why it's important to reduce the number of shuffle partitions for tests before Spark 3.2 (later Spark versions appear to be smarter about this).
   On the other hand, using just 1 core / 1 partition can be dangerous, as it might mask problems in the code (e.g. invalid assumptions about ordering).
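
   As a concrete illustration of the trade-off described above, a test session configured along these lines keeps shuffle overhead low while still using more than one partition (the values and app name below are illustrative, not taken from this PR):

   ```java
   import org.apache.spark.sql.SparkSession;

   // Minimal sketch of a test-friendly session: 2 local threads and 2 shuffle
   // partitions give enough parallelism to surface ordering bugs without the
   // overhead of the default 200 shuffle partitions.
   SparkSession spark =
       SparkSession.builder()
           .master("local[2]")
           .config("spark.sql.shuffle.partitions", "2")
           .appName("beam-spark-test") // hypothetical app name
           .getOrCreate();
   ```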



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
