[ https://issues.apache.org/jira/browse/SPARK-30399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-30399.
----------------------------------
Resolution: Invalid
[~shay_elbaz], please ask questions on the mailing lists (see
https://spark.apache.org/community.html). I think you would get a better
answer there.
> Bucketing is not compatible with partitioning in practice
> ---------------------------------------------------------
>
> Key: SPARK-30399
> URL: https://issues.apache.org/jira/browse/SPARK-30399
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.0
> Environment: HDP 2.7
> Reporter: Shay Elbaz
> Priority: Minor
>
> When using a Spark bucketed table, Spark uses as many partitions as the
> number of buckets for the map-side join
> (_FileSourceScanExec.createBucketedReadRDD_). This works great for "static"
> tables, but is quite disastrous for _time-partitioned_ tables. In our use
> case, a daily-partitioned key-value table grows by 100 GB every day, so
> after 100 days there are 10 TB of data we want to join with. For this
> scenario we would need thousands of buckets if we want every task to
> successfully *read and sort* all of its data in a map-side join. But in
> that case, every daily increment would emit thousands of small files,
> leading to other big issues.
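> For illustration, a minimal sketch of the write path described above
> (spark-shell style, where spark is the predefined SparkSession; the paths,
> table and column names are hypothetical). With 1000 buckets, each daily
> append writes at least one file per non-empty bucket into that day's
> partition, which is where the small-file explosion comes from:
>
>   // Hypothetical daily increment, ~100 GB of key-value rows.
>   val dailyIncrement = spark.read.parquet("/data/incoming/dt=2020-01-01")
>
>   dailyIncrement.write
>     .partitionBy("dt")        // one directory per day
>     .bucketBy(1000, "key")    // bucket count sized for the full 10 TB history
>     .sortBy("key")
>     .format("parquet")
>     .mode("append")
>     .saveAsTable("kv_table")  // bucketing metadata requires saveAsTable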
> In practice, and hoping for some hidden optimization, we set the number of
> buckets to 1000 and backfilled such a table with 10 TB. When we tried to
> join it with the smallest input, every executor was killed by Yarn for
> over-allocating memory in the sort phase. Even without such failures, it
> would take every executor an unreasonable amount of time to locally sort
> all its data.
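> For reference, the kind of join that hit this looks roughly like the
> sketch below (again spark-shell style, hypothetical table names), assuming
> both sides are bucketed on the join key with the same bucket count and the
> small side is above the broadcast threshold, so Spark plans a sort-merge
> join without a shuffle and splits the scan into exactly one task per
> bucket (_FileSourceScanExec.createBucketedReadRDD_). With 1000 buckets
> over 10 TB, each task has to read and sort roughly 10 GB gathered from all
> the date partitions, which matches the memory pressure we saw in the sort
> phase:
>
>   val big   = spark.table("kv_table")        // 10 TB, 1000 buckets, partitioned by dt
>   val small = spark.table("small_kv_table")  // the "smallest input", also bucketed on key
>
>   val joined = big.join(small, Seq("key"))
>   joined.explain()  // SortMergeJoin with no Exchange under the bucketed scans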
> A question on SO has remained unanswered for a while, so I thought I would
> ask here - is it by design that buckets cannot be used with
> time-partitioned tables, or am I doing something wrong?