[
https://issues.apache.org/jira/browse/SPARK-54838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dorjee Tsering updated SPARK-54838:
-----------------------------------
Description:
I propose adding a new function to the Dataset class to address the small-file
problem. We have noticed that when the source data a Spark job reads consists
of many small files (KB size), the job creates a large number of partitions.
This PR adds a new function named optimizePartition which, when used, creates
partitions of 128MB each by default. You can pass your own desired partition
size instead.
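To make the sizing idea concrete, here is a minimal sketch of how a partition count could be derived from a target partition size. It is not the code from the PR: the helper name estimate_partition_count and its parameters are hypothetical, and it assumes the total input size in bytes is already known.

```python
# Default target taken from the description above: 128 MB per partition.
DEFAULT_TARGET_BYTES = 128 * 1024 * 1024


def estimate_partition_count(total_bytes, target_bytes=DEFAULT_TARGET_BYTES):
    """Hypothetical helper: how many partitions are needed so that each
    holds at most roughly target_bytes of data."""
    if target_bytes <= 0:
        raise ValueError("target_bytes must be positive")
    if total_bytes <= 0:
        # Empty input still needs one partition.
        return 1
    # Integer ceiling division avoids floating-point rounding.
    return (total_bytes + target_bytes - 1) // target_bytes


# 10,000 files of 50 KB each is about 488 MB in total.
small_files_total = 10_000 * 50 * 1024
print(estimate_partition_count(small_files_total))  # prints 4
```

In a PySpark job, a count like this could be passed to the existing DataFrame.coalesce or DataFrame.repartition methods. How the total input size is measured is left out of the sketch.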
I published
[this|https://medium.com/@dotsering/stop-drowning-in-small-files-how-i-built-a-smart-partition-optimizer-for-pyspark-8b064c742667]
article on Medium, where I describe the tests I ran and the results I achieved.
was:
I am proposing to add a new function in Dataset class to fix small file
problem. We have noticed that if the source data our spark job reads have many
small files (KB size), it creates lot of partitions. This PR adds a new
function named optimizePartition which when used creates partitions of size
128MB if no desired partition's size passed. You can pass your own desired
partition size.
I published
[this|http://example.com]https://medium.com/@dotsering/stop-drowning-in-small-files-how-i-built-a-smart-partition-optimizer-for-pyspark-8b064c742667
article on Medium where I detailed the test I did and result I achieved.
> Optimize spark partition size
> -----------------------------
>
> Key: SPARK-54838
> URL: https://issues.apache.org/jira/browse/SPARK-54838
> Project: Spark
> Issue Type: New Feature
> Components: Spark Core
> Affects Versions: 4.2.0
> Reporter: Dorjee Tsering
> Priority: Minor