[ 
https://issues.apache.org/jira/browse/SPARK-54838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dorjee Tsering updated SPARK-54838:
-----------------------------------
    Description: 
I propose adding a new function to the Dataset class to address the small-file 
problem. We have noticed that when the source data a Spark job reads consists 
of many small files (KB-sized), Spark creates a large number of partitions. 
This PR adds a function named optimizePartition that, when called, creates 
partitions of 128 MB by default. You can also pass your own desired partition 
size.
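
To illustrate the sizing idea only: the sketch below is not the code from the 
PR, and the helper name target_partition_count is hypothetical. It assumes the 
partition count is derived by dividing the total input size by the desired 
partition size (128 MB unless another size is given) and rounding up.

```python
DEFAULT_TARGET_BYTES = 128 * 1024 * 1024  # 128 MB, the default named in the proposal


def target_partition_count(total_bytes: int,
                           target_bytes: int = DEFAULT_TARGET_BYTES) -> int:
    """Hypothetical helper: partitions of about target_bytes needed for total_bytes."""
    if total_bytes < 0:
        raise ValueError("total_bytes must be non-negative")
    if target_bytes <= 0:
        raise ValueError("target_bytes must be positive")
    # Integer ceiling division; keep at least one partition for empty input.
    return max(1, (total_bytes + target_bytes - 1) // target_bytes)


# Example: 10,000 files of 50 KB each (about 488 MB in total).
small_files_total = 10_000 * 50 * 1024
print(target_partition_count(small_files_total))  # prints 4
```

With a real DataFrame, a count computed this way could be passed to 
df.repartition(n) or df.coalesce(n). How optimizePartition itself measures the 
input size is not described in this issue.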

 

I published an article on Medium detailing the tests I ran and the results I 
achieved:
https://medium.com/@dotsering/stop-drowning-in-small-files-how-i-built-a-smart-partition-optimizer-for-pyspark-8b064c742667

  was:
I am proposing to add a new function in Dataset class to fix small file 
problem. We have noticed that if the source data our spark job reads have many 
small files (KB size), it creates lot of partitions. This PR adds a new 
function named optimizePartition which when used creates partitions of size 
128MB if no desired partition's size passed. You can pass your own desired 
partition size.

 

I published this article on Medium where I detailed the test I did and result I 
achieved.
https://medium.com/p/8b064c742667/edit


> Optimize spark partition size
> -----------------------------
>
>                 Key: SPARK-54838
>                 URL: https://issues.apache.org/jira/browse/SPARK-54838
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>    Affects Versions: 4.2.0
>            Reporter: Dorjee Tsering
>            Priority: Minor
>
> I propose adding a new function to the Dataset class to address the 
> small-file problem. We have noticed that when the source data a Spark job 
> reads consists of many small files (KB-sized), Spark creates a large number 
> of partitions. This PR adds a function named optimizePartition that, when 
> called, creates partitions of 128 MB by default. You can also pass your own 
> desired partition size.
>  
> I published an article on Medium detailing the tests I ran and the results I 
> achieved:
> https://medium.com/@dotsering/stop-drowning-in-small-files-how-i-built-a-smart-partition-optimizer-for-pyspark-8b064c742667



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

