[
https://issues.apache.org/jira/browse/SPARK-20049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16776677#comment-16776677
]
Yuming Wang commented on SPARK-20049:
-------------------------------------
Could you try to set
{{spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2}}?
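For context, algorithm version 2 commits each task's output directly into the destination directory as the task finishes, so the slow sequential rename out of {{dest_dir/_temporary}} at job-commit time is largely avoided (the trade-off is that a failed job can leave partial output behind). A minimal PySpark sketch of setting it per session; the same setting can equally be passed via {{spark-defaults.conf}} or a {{--conf}} flag on {{spark-submit}}, and {{df}} here stands for the DataFrame from the report:

{code}
from pyspark.sql import SparkSession

# Use the v2 file output committer, which moves task output into place
# as each task commits instead of renaming everything at job commit.
spark = (SparkSession.builder
         .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
         .getOrCreate())

# The write itself is unchanged; only the commit phase behaves differently.
df.write.partitionBy("date").parquet("dest_dir")
{code}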
> Writing data to Parquet with partitions takes very long after the job finishes
> ------------------------------------------------------------------------------
>
> Key: SPARK-20049
> URL: https://issues.apache.org/jira/browse/SPARK-20049
> Project: Spark
> Issue Type: Improvement
> Components: Input/Output, PySpark, SQL
> Affects Versions: 2.1.0
> Environment: Spark 2.1.0, CDH 5.8, Python 3.4, Java 8, Debian
> GNU/Linux 8.7 (jessie)
> Reporter: Jakub Nowacki
> Priority: Minor
>
> I was testing writing a DataFrame to partitioned Parquet files. The command is
> quite straightforward and the data set is really a sample from a larger data
> set in Parquet; the job is run in PySpark on YARN and written to HDFS:
> {code}
> # there is column 'date' in df
> df.write.partitionBy("date").parquet("dest_dir")
> {code}
> The reading part took as long as usual, but after the job had been marked as
> finished in PySpark and the UI, the Python interpreter was still showing it as
> busy. Indeed, when I checked the HDFS folder I noticed that files were still
> being moved from {{dest_dir/_temporary}} to all the {{dest_dir/date=*}}
> folders.
> First, this takes much longer than saving the same data set without
> partitioning. Second, it happens in the background, without visible progress
> of any kind.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)