[
https://issues.apache.org/jira/browse/SPARK-20049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16776677#comment-16776677
]
Yuming Wang commented on SPARK-20049:
-------------------------------------
Could you try to set
{{spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2}}?
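For context, algorithm version 2 commits each task's output directly into the destination directory as the task finishes, so the slow sequential rename out of {{dest_dir/_temporary}} at job-commit time is largely avoided (the trade-off is that a failed job can leave partial output behind). A minimal PySpark sketch of setting it per session; the same setting can equally be passed via {{spark-defaults.conf}} or a {{--conf}} flag on {{spark-submit}}, and {{df}} here stands for the DataFrame from the report:

{code}
from pyspark.sql import SparkSession

# Use the v2 file output committer, which moves task output into place
# as each task commits instead of renaming everything at job commit.
spark = (SparkSession.builder
         .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
         .getOrCreate())

# The write itself is unchanged; only the commit phase behaves differently.
df.write.partitionBy("date").parquet("dest_dir")
{code}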
> Writing data to Parquet with partitions takes very long after the job finishes
> ------------------------------------------------------------------------------
>
> Key: SPARK-20049
> URL: https://issues.apache.org/jira/browse/SPARK-20049
> Project: Spark
> Issue Type: Improvement
> Components: Input/Output, PySpark, SQL
> Affects Versions: 2.1.0
> Environment: Spark 2.1.0, CDH 5.8, Python 3.4, Java 8, Debian
> GNU/Linux 8.7 (jessie)
> Reporter: Jakub Nowacki
> Priority: Minor
>
> I was testing writing a DataFrame to partitioned Parquet files. The command is
> quite straightforward and the data set is really a sample from a larger data
> set in Parquet; the job is run in PySpark on YARN and written to HDFS:
> {code}
> # there is column 'date' in df
> df.write.partitionBy("date").parquet("dest_dir")
> {code}
> The reading part took as long as usual, but after the job had been marked as
> finished in PySpark and the UI, the Python interpreter was still showing it as
> busy. Indeed, when I checked the HDFS folder I noticed that files were still
> being moved from {{dest_dir/_temporary}} to all the {{dest_dir/date=*}}
> folders.
> First, this takes much longer than saving the same data set without
> partitioning. Second, it happens in the background, without visible progress
> of any kind.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)