Himadri Pal created SPARK-47094:
-----------------------------------
Summary: SPJ : Dynamically rebalance number of buckets when they
are not equal
Key: SPARK-47094
URL: https://issues.apache.org/jira/browse/SPARK-47094
Project: Spark
Issue Type: New Feature
Components: Spark Core
Affects Versions: 3.4.0, 3.3.0
Reporter: Himadri Pal
SPJ: Storage Partition Join works with Iceberg tables when both the tables have
same number of buckets. As part of this feature request, we would like spark to
gather the number of buckets information from both the tables and dynamically
rebalance the number of buckets by coalesce or repartition so that SPJ will
work fine. In this case, we would still have to shuffle but would be better
than no SPJ.
Use Case :
Many times we do not have control of the input tables, hence it's not possible
to change partitioning scheme on those tables. As a consumer, we would still
like them to be used with SPJ when used with other tables and output tables
which has different number of buckets.
In these scenario, we would need to read those tables rewrite them with
matching number of buckets for the SPJ to work, this extra step could outweigh
the benefits of less shuffle via SPJ. Also when there are multiple different
tables being joined, each tables need to be rewritten with matching number of
buckets.
If this feature is implemented, SPJ functionality will be more powerful.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]