aokolnychyi commented on PR #2276:
URL: https://github.com/apache/iceberg/pull/2276#issuecomment-1273967116

   I think this is an essential PR to support storage-partitioned joins in 
Spark 3.3. It would be great to rebase it.
   
   Here is what I noted based on our discussion earlier:
   - We want an ability to avoid combining tasks across particular partition 
columns in our `Scan` API.
   - We do NOT want a table property as the config does not really belong there 
and it would affect other engines.
   - We want Spark to propagate join attributes in the future. Until then, we 
should have a reasonable default behavior.
   
   I think it is reasonable to not combine files across partitions for 
partitioned tables by default in Spark, hoping we can benefit from 
storage-partitioned joins. However, I worry the new behavior may cause 
performance regressions in some cases as we will generate more splits (even 
though we may not benefit from any join optimizations). Do we want to expose a 
way to force combining files across partitions (i.e. old behavior)? There are 
two ways to support that: either add a read option in Iceberg or try checking 
if storage-partitioned joins are enabled in Spark SQL (if not, we can safely 
combine). Since Spark will pass join attributes in the future, adding a read 
option does not seem to make a lot of sense. Any thoughts?
   
   cc @sunchao @rdblue 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Reply via email to