aokolnychyi commented on code in PR #8621:
URL: https://github.com/apache/iceberg/pull/8621#discussion_r1337601468
##########
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkWriteConf.java:
##########
@@ -174,12 +174,17 @@ public long targetDataFileSize() {
.parse();
}
- public boolean fanoutWriterEnabled() {
+ public boolean useFanoutWriter(SparkWriteRequirements writeRequirements) {
+ boolean defaultValue = !writeRequirements.hasOrdering();
Review Comment:
Actually, we may want to keep it this way.
Another use case that would benefit from the current approach is
storage-partitioned joins (SPJ). There, users set the distribution mode to
`none`. Without explicitly enabling fanout writers, they would get a super
expensive local sort and spill. I think that's a more realistic scenario than
setting the mode to `none` and generating tons of files per task. If users set
the mode to `none` explicitly, they are probably OK with the number of
produced files, which hints it is not a crazy number. So why do a local sort
for them?
We have generally preferred safe options over more performant ones, which
meant more configs were required to improve performance. I'd say let's turn
that around.
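To make the direction concrete, here is a minimal sketch of the decision being discussed: an explicit user setting always wins, and otherwise fanout defaults to on whenever the write has no required ordering. The helper below is hypothetical (a stand-in for `SparkWriteConf.useFanoutWriter` and `SparkWriteRequirements.hasOrdering()`), not the actual Iceberg implementation:

```java
// Hypothetical sketch of the proposed fanout default, NOT the actual
// Iceberg implementation. An explicit user setting always wins; otherwise
// fanout defaults to on whenever the write has no required ordering, so a
// distribution mode of `none` (e.g. for SPJ) avoids a local sort and spill.
public class FanoutWriterDefaultSketch {

  static boolean useFanoutWriter(boolean hasOrdering, Boolean explicitFanoutSetting) {
    // no required ordering -> fanout on by default
    boolean defaultValue = !hasOrdering;
    return explicitFanoutSetting != null ? explicitFanoutSetting : defaultValue;
  }

  public static void main(String[] args) {
    // SPJ-style write: no ordering, no explicit config -> fanout on, no sort
    System.out.println(useFanoutWriter(false, null)); // prints true
    // Ordered write, no explicit config -> keep the sorted (clustered) writer
    System.out.println(useFanoutWriter(true, null)); // prints false
    // An explicit user setting overrides the default either way
    System.out.println(useFanoutWriter(true, true)); // prints true
  }
}
```

Under this scheme, users who explicitly set the distribution mode to `none` get the cheap path by default, and the older sort-by-default behavior remains available via the explicit fanout config.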
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]