Himadri Pal created SPARK-47094:
-----------------------------------

             Summary: SPJ : Dynamically rebalance number of buckets when they 
are not equal
                 Key: SPARK-47094
                 URL: https://issues.apache.org/jira/browse/SPARK-47094
             Project: Spark
          Issue Type: New Feature
          Components: Spark Core
    Affects Versions: 3.4.0, 3.3.0
            Reporter: Himadri Pal


SPJ: Storage Partition Join works with Iceberg tables when both the tables have 
same number of buckets. As part of this feature request, we would like spark to 
gather the number of buckets information from both the tables and dynamically 
rebalance the number of buckets by coalesce or repartition so that SPJ will 
work fine. In this case, we would still have to shuffle but would be better 
than no SPJ.

Use Case : 

Many times we do not have control of the input tables, hence it's not possible 
to change partitioning scheme on those tables. As a consumer, we would still 
like them to be used with SPJ when used with other tables and output tables 
which has different number of buckets.

In these scenario, we would need to read those tables rewrite them with 
matching number of buckets for the SPJ to work, this extra step could outweigh 
the benefits of less shuffle via SPJ. Also when there are multiple different 
tables being joined, each tables need to be rewritten with matching number of 
buckets. 

If this feature is implemented, SPJ functionality will be more powerful.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to