[ https://issues.apache.org/jira/browse/SPARK-30399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-30399.
----------------------------------
Resolution: Invalid
[~shay_elbaz], please ask questions on the mailing lists (see
https://spark.apache.org/community.html). I think you would get a better
answer there.
> Bucketing is not compatible with partitioning in practice
> ---------------------------------------------------------
>
> Key: SPARK-30399
> URL: https://issues.apache.org/jira/browse/SPARK-30399
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.0
> Environment: HDP 2.7
> Reporter: Shay Elbaz
> Priority: Minor
>
> When using a Spark bucketed table, Spark uses as many partitions as the
> number of buckets for the map-side join
> (_FileSourceScanExec.createBucketedReadRDD_). This works great for "static"
> tables, but is quite disastrous for _time-partitioned_ tables. In our use
> case, a daily-partitioned key-value table grows by 100 GB every day, so
> after 100 days there are 10 TB of data we want to join with. For this
> scenario we would need thousands of buckets if we want every task to
> successfully *read and sort* all of its data in a map-side join. But in
> that case, every daily increment would emit thousands of small files,
> leading to other big issues.
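> For illustration, a minimal sketch of the write path described above
> (spark-shell style, where spark is the predefined SparkSession; the paths,
> table and column names are hypothetical). With 1000 buckets, each daily
> append writes at least one file per non-empty bucket into that day's
> partition, which is where the small-file explosion comes from:
>
>   // Hypothetical daily increment, ~100 GB of key-value rows.
>   val dailyIncrement = spark.read.parquet("/data/incoming/dt=2020-01-01")
>
>   dailyIncrement.write
>     .partitionBy("dt")        // one directory per day
>     .bucketBy(1000, "key")    // bucket count sized for the full 10 TB history
>     .sortBy("key")
>     .format("parquet")
>     .mode("append")
>     .saveAsTable("kv_table")  // bucketing metadata requires saveAsTable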
> In practice, and hoping for some hidden optimization, we set the number of
> buckets to 1000 and backfilled such a table with 10 TB. When we tried to
> join it with the smallest input, every executor was killed by Yarn for
> over-allocating memory in the sort phase. Even without such failures, it
> would take every executor an unreasonable amount of time to locally sort
> all its data.
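> For reference, the kind of join that hit this looks roughly like the
> sketch below (again spark-shell style, hypothetical table names), assuming
> both sides are bucketed on the join key with the same bucket count and the
> small side is above the broadcast threshold, so Spark plans a sort-merge
> join without a shuffle and splits the scan into exactly one task per
> bucket (_FileSourceScanExec.createBucketedReadRDD_). With 1000 buckets
> over 10 TB, each task has to read and sort roughly 10 GB gathered from all
> the date partitions, which matches the memory pressure we saw in the sort
> phase:
>
>   val big   = spark.table("kv_table")        // 10 TB, 1000 buckets, partitioned by dt
>   val small = spark.table("small_kv_table")  // the "smallest input", also bucketed on key
>
>   val joined = big.join(small, Seq("key"))
>   joined.explain()  // SortMergeJoin with no Exchange under the bucketed scans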
> A question on SO has remained unanswered for a while, so I thought I would
> ask here - is it by design that buckets cannot be used with
> time-partitioned tables, or am I doing something wrong?