[GitHub] [iceberg] sunchao commented on pull request #2276: Core: Add option to combine tasks by partition

GitBox Tue, 18 Oct 2022 14:03:12 -0700


sunchao commented on PR #2276:
URL: https://github.com/apache/iceberg/pull/2276#issuecomment-1283000536


   Thanks @aokolnychyi! Let me fix the API compatibility check too.
   
   > I think it is reasonable to not combine files across partitions for 
partitioned tables by default in Spark, hoping we can benefit from 
storage-partitioned joins. However, I worry the new behavior may cause 
performance regressions in some cases as we will generate more splits (even 
though we may not benefit from any join optimizations). Do we want to expose a 
way to force combining files across partitions (i.e. old behavior)? There are 
two ways to support that: either add a read option in Iceberg or try checking 
if storage-partitioned joins are enabled in Spark SQL (if not, we can safely 
combine). Since Spark will pass join attributes in the future, adding a read 
option does not seem preferable. Any thoughts?
   
   As discussed offline, this adds a Spark SQL conf, 
`spark.sql.iceberg.splits-by-partition`, that dictates whether we should combine 
splits across partition boundaries in Iceberg. There's work planned on the Spark 
side to add APIs and pass this info to Iceberg, which is a better solution and 
will eventually supersede this approach.
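
   To illustrate the trade-off raised in the quoted comment (more splits when 
files are not combined across partitions), here is a minimal, self-contained 
sketch. It is not Iceberg's planning code: `pack`, `plan_splits`, the 
`splits_by_partition` flag, and the file sizes are hypothetical names and values 
chosen for illustration, and it assumes that enabling the conf means splits stay 
within partition boundaries.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical stand-in for a file scan task: (partition value, size in bytes).
FileTask = Tuple[str, int]


def pack(tasks: List[FileTask], target_size: int) -> List[List[FileTask]]:
    """Greedily pack tasks, in order, into splits of at most target_size."""
    splits: List[List[FileTask]] = []
    current: List[FileTask] = []
    current_size = 0
    for task in tasks:
        size = task[1]
        # Close the current split if adding this task would exceed the target.
        if current and current_size + size > target_size:
            splits.append(current)
            current, current_size = [], 0
        current.append(task)
        current_size += size
    if current:
        splits.append(current)
    return splits


def plan_splits(
    tasks: List[FileTask], target_size: int, splits_by_partition: bool
) -> List[List[FileTask]]:
    """Pack tasks either across all partitions or separately per partition."""
    if not splits_by_partition:
        # Old behavior: files from different partitions may share a split.
        return pack(tasks, target_size)
    # Assumed new behavior: a split never spans more than one partition.
    grouped: Dict[str, List[FileTask]] = defaultdict(list)
    for task in tasks:
        grouped[task[0]].append(task)
    splits: List[List[FileTask]] = []
    for partition_tasks in grouped.values():
        splits.extend(pack(partition_tasks, target_size))
    return splits


tasks = [("a", 40), ("a", 40), ("b", 40), ("c", 40)]
combined = plan_splits(tasks, target_size=128, splits_by_partition=False)
per_partition = plan_splits(tasks, target_size=128, splits_by_partition=True)
print(len(combined), len(per_partition))
```

   With these made-up inputs, packing per partition yields more splits than 
packing across partitions, which is the potential regression mentioned above 
when no storage-partitioned join is there to benefit from the grouping.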
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


