[ https://issues.apache.org/jira/browse/SPARK-18556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-18556.
----------------------------------
Resolution: Incomplete
> Suboptimal number of tasks when writing partitioned data with desired number of files per directory
> ---------------------------------------------------------------------------------------------------
>
> Key: SPARK-18556
> URL: https://issues.apache.org/jira/browse/SPARK-18556
> Project: Spark
> Issue Type: Improvement
> Affects Versions: 2.0.0, 2.0.1, 2.0.2
> Reporter: Damian Momot
> Priority: Major
> Labels: bulk-closed
>
> It is not possible to get an optimal number of write tasks when the desired
> number of files per directory is known upfront. Example, when saving data to HDFS:
> 1. The data is supposed to be partitioned by a column (for example date) and
> contains, say, 90 different dates.
> 2. It is known upfront that each date should be written into X files (for
> example 4, because of the recommended HDFS/Parquet block size etc.).
> 3. During processing the dataset ended up with 200 partitions (for example
> because of some grouping operations).
> Currently we can do:
> {code}
> val data: Dataset[Row] = ???
> data
>   .write
>   .partitionBy("date")
>   .parquet("/xyz")
> {code}
> This correctly writes data into 90 date directories (see point 1), but each
> directory will contain 200 files (see point 3).
> We can force the number of files by using repartition/coalesce:
> {code}
> val data: Dataset[Row] = ???
> data
>   .repartition(4)
>   .write
>   .partitionBy("date")
>   .parquet("/xyz")
> {code}
> This correctly saves 90 directories with 4 files each... but the write is
> performed by only 4 tasks, which is far too slow: the 360 files could be
> written in parallel by 360 tasks.
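> A common workaround (a minimal sketch only, assuming the Spark 2.x Dataset
> API; the "salt" column name and the file/date counts are illustrative, taken
> from the example above) is to repartition by the partition column plus a
> random salt, so each date is spread across roughly the desired number of
> files while keeping many parallel write tasks:
> {code}
> import org.apache.spark.sql.functions.{col, floor, rand}
>
> val filesPerDate = 4   // desired files per date directory (point '2')
> val dates = 90         // number of distinct dates (point '1')
>
> val data: Dataset[Row] = ???
> data
>   // hypothetical "salt" column: a random bucket id in [0, filesPerDate)
>   .withColumn("salt", floor(rand() * filesPerDate))
>   // shuffle into roughly one partition per (date, salt) pair -> ~360 write tasks
>   .repartition(dates * filesPerDate, col("date"), col("salt"))
>   .drop("salt")
>   .write
>   .partitionBy("date")
>   .parquet("/xyz")
> {code}
> Because the (date, salt) pairs are hash-partitioned, some partitions may
> receive more than one pair (or none), so the per-directory file count is only
> approximately 4; this is a workaround rather than the exact behaviour
> requested in this ticket.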