Jakub Nowacki created SPARK-20049:
-------------------------------------
Summary: Writing data to Parquet with partitions takes very long
after the job finishes
Key: SPARK-20049
URL: https://issues.apache.org/jira/browse/SPARK-20049
Project: Spark
Issue Type: Bug
Components: Input/Output, PySpark, SQL
Affects Versions: 2.1.0
Environment: Spark 2.1.0, CDH 5.8, Python 3.4, Java 8, Debian
GNU/Linux 8.7 (jessie)
Reporter: Jakub Nowacki
I was testing writing a DataFrame to partitioned Parquet files. The command is
quite straightforward, and the data set is a sample from a larger data set
in Parquet; the job is run in PySpark on YARN and writes to HDFS:
{code}
# there is column 'date' in df
df.write.partitionBy("date").parquet("dest_dir")
{code}
The reading part took as long as usual, but after the job had been marked as
finished in PySpark and the UI, the Python interpreter still showed it as
busy. Indeed, when I checked the HDFS folder I noticed that the files were still
being moved from {{dest_dir/_temporary}} to all the {{dest_dir/date=*}}
folders.
First of all, this takes much longer than saving the same data set without
partitioning. Second, it is done in the background, without visible progress of
any kind.
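Since there is no built-in progress indicator, one rough way to watch the move is to count the Parquet files still under {{_temporary}} against those already in the {{date=*}} folders. The following is only a minimal sketch under assumptions that differ from the setup above: it assumes the destination is reachable as a local path (for example a local-mode run or a mounted file system), not plain HDFS, and the helper name {{commit_progress}} is hypothetical.

```python
import os


def commit_progress(dest_dir):
    """Return (pending, committed) counts of .parquet files under dest_dir.

    pending   -- files still somewhere below dest_dir/_temporary
    committed -- files already below a top-level dest_dir/date=* folder
    Assumes dest_dir is a local path; this is not an HDFS client.
    """
    pending = 0
    committed = 0
    for root, _dirs, files in os.walk(dest_dir):
        # The first path component relative to dest_dir decides the bucket.
        top = os.path.relpath(root, dest_dir).split(os.sep)[0]
        count = sum(1 for name in files if name.endswith(".parquet"))
        if top == "_temporary":
            pending += count
        elif top.startswith("date="):
            committed += count
    return pending, committed
```

Calling {{commit_progress("dest_dir")}} repeatedly while the interpreter is busy would show the first number shrinking and the second growing. On HDFS itself the same counts would have to come from an HDFS client or the {{hdfs dfs}} command line instead.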