[ https://issues.apache.org/jira/browse/SPARK-25411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wang, Gang updated SPARK-25411:
-------------------------------
Attachment: (was: range partition design doc.pdf)
> Implement range partition in Spark
> ----------------------------------
>
> Key: SPARK-25411
> URL: https://issues.apache.org/jira/browse/SPARK-25411
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 2.3.0
> Reporter: Wang, Gang
> Priority: Major
> Attachments: range partition design doc.pdf
>
>
> In our production environment, there are some partitioned fact tables, which
> are all quite huge. To accelerate join execution, we need to make them
> bucketed as well. Then comes the problem: if the bucket number is large
> enough, there may be too many files (file count = bucket number * partition
> count), which may put pressure on HDFS. And if the bucket number is small,
> Spark will launch only an equal number of tasks to read/write them.
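> For example (a minimal sketch; the table name, columns, and counts are
> hypothetical, and {{df}} stands for any DataFrame holding the fact data),
> bucketing with the standard DataFrameWriter API multiplies files across
> every partition:
> {code:scala}
> // With 365 date partitions and 256 buckets, this layout can produce up to
> // 365 * 256 = 93,440 files -- the HDFS pressure described above.
> df.write
>   .partitionBy("dt")      // one directory per distinct date value
>   .bucketBy(256, "id")    // 256 bucket files within each partition
>   .sortBy("id")
>   .saveAsTable("sales_fact")
> {code}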
>
> So, can we implement a new partition type that supports range values, just
> like range partitioning in Oracle/MySQL
> ([https://docs.oracle.com/cd/E17952_01/mysql-5.7-en/partitioning-range.html])?
> Say, we could partition by a date column, making every two months a
> partition, or by an integer column, making every interval of 10000 a
> partition. A workaround sketch follows below.
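> Until such native support exists, the intended semantics can be approximated
> by deriving a coarse partition column (a hedged sketch; the column names,
> paths, and interval are hypothetical, and Spark does not natively validate
> or prune these ranges):
> {code:scala}
> import org.apache.spark.sql.functions._
>
> // Group an integer key into ranges of 10,000 per partition:
> // id 0-9999 -> 0, 10000-19999 -> 1, ...
> val byIdRange = df.withColumn("id_range", floor(col("id") / 10000))
> byIdRange.write.partitionBy("id_range").parquet("/path/fact_by_id_range")
>
> // Group a date column into two-month partitions within each year,
> // e.g. 2018-01/02 -> "2018-0", 2018-03/04 -> "2018-1", ...
> val byDateRange = df.withColumn(
>   "dt_range",
>   concat_ws("-", year(col("dt")), floor((month(col("dt")) - 1) / 2)))
> byDateRange.write.partitionBy("dt_range").parquet("/path/fact_by_dt_range")
> {code}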
>
> Ideally, a feature like range partitioning should be implemented in Hive.
> However, it is always hard to upgrade the Hive version in a production
> environment, and it would be much more lightweight and flexible to implement
> it in Spark.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]