[ https://issues.apache.org/jira/browse/SPARK-18556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-18556.
----------------------------------
Resolution: Incomplete
> Suboptimal number of tasks when writing partitioned data with desired number of files per directory
> ---------------------------------------------------------------------------------------------------
>
> Key: SPARK-18556
> URL: https://issues.apache.org/jira/browse/SPARK-18556
> Project: Spark
> Issue Type: Improvement
> Affects Versions: 2.0.0, 2.0.1, 2.0.2
> Reporter: Damian Momot
> Priority: Major
> Labels: bulk-closed
>
> It is not possible to get an optimal number of write tasks when the desired
> number of files per directory is known upfront. Example, when saving data to HDFS:
> 1. The data is supposed to be partitioned by a column (for example date) and
> contains, say, 90 different dates.
> 2. It is known upfront that each date should be written into X files (for
> example 4, because of the recommended HDFS/Parquet block size etc.).
> 3. During processing the dataset ended up with 200 partitions (for example
> because of some grouping operations).
> Currently we can do:
> {code}
> val data: Dataset[Row] = ???
> data
>   .write
>   .partitionBy("date")
>   .parquet("/xyz")
> {code}
> This correctly writes data into 90 date directories (see point 1), but each
> directory will contain 200 files (see point 3).
> We can force the number of files by using repartition/coalesce:
> {code}
> val data: Dataset[Row] = ???
> data
>   .repartition(4)
>   .write
>   .partitionBy("date")
>   .parquet("/xyz")
> {code}
> This correctly saves 90 directories with 4 files each... but the write is
> performed by only 4 tasks, which is far too slow: the 360 files could be
> written in parallel by 360 tasks.
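> A common workaround (a minimal sketch only, assuming the Spark 2.x Dataset
> API; the "salt" column name and the file/date counts are illustrative, taken
> from the example above) is to repartition by the partition column plus a
> random salt, so each date is spread across roughly the desired number of
> files while keeping many parallel write tasks:
> {code}
> import org.apache.spark.sql.functions.{col, floor, rand}
>
> val filesPerDate = 4   // desired files per date directory (point '2')
> val dates = 90         // number of distinct dates (point '1')
>
> val data: Dataset[Row] = ???
> data
>   // hypothetical "salt" column: a random bucket id in [0, filesPerDate)
>   .withColumn("salt", floor(rand() * filesPerDate))
>   // shuffle into roughly one partition per (date, salt) pair -> ~360 write tasks
>   .repartition(dates * filesPerDate, col("date"), col("salt"))
>   .drop("salt")
>   .write
>   .partitionBy("date")
>   .parquet("/xyz")
> {code}
> Because the (date, salt) pairs are hash-partitioned, some partitions may
> receive more than one pair (or none), so the per-directory file count is only
> approximately 4; this is a workaround rather than the exact behaviour
> requested in this ticket.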