[
https://issues.apache.org/jira/browse/SPARK-26305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dongjoon Hyun updated SPARK-26305:
----------------------------------
Affects Version/s: (was: 3.0.0)
3.1.0
> Breakthrough the memory limitation of broadcast join
> ----------------------------------------------------
>
> Key: SPARK-26305
> URL: https://issues.apache.org/jira/browse/SPARK-26305
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: Lantao Jin
> Priority: Major
>
> If the join between a big table and a small one faces data skewing issue, we
> usually use a broadcast hint in SQL to resolve it. However, current broadcast
> join has many limitations. The primary restriction is memory. The small table
> which is broadcasted must be fulfilled to memory in driver/executors side.
> Although it will spill to disk when the memory is insufficient, it still
> causes OOM if the small table actually is not absolutely small, it's
> relatively small. In our company, we have many real big data SQL analysis
> jobs which handle dozens of hundreds terabytes join and shuffle. For example,
> the size of large table is 100TB, and the small one is 10000 times less,
> still 10GB. In this case, broadcast join couldn't be finished since the small
> one is still larger than expected. If the join is data skewing, the sortmerge
> join always failed.
> Hive has a skew join hint which could trigger two-stage task to handle the
> skew key and normal key separately. I guess Databricks Runtime has the
> similar implementation. However, the skew join hint needs SQL users know the
> data in table like their children. They must know which key is skewing in a
> join. It's very hard to know since the data is changing day by day and the
> join key isn't fixed in different queries. The users have to set a huge
> partition number to try their luck.
> So, do we have a simple, rude and efficient way to resolve it? Back to the
> limitation, if the broadcasted table no needs to fill to memory, in other
> words, driver/executor stores the broadcasted table to disk only. The problem
> mentioned above could be resolved.
> A new hint like BROADCAST_DISK or an additional parameter in original
> BROADCAST hint will be introduced to cover this case. The original broadcast
> behavior won’t be changed.
> I will offer a design doc if you have same feeling about it.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]