Apologies if this appears a second time! 

I'm writing a ~100 GB PySpark DataFrame with a few hundred partitions to a
Parquet file on HDFS. The cluster has a few hundred nodes, so for a file of
this size it's well over-provisioned (I've tried fewer partitions and fewer
nodes, with no obvious effect). I was expecting the dump to disk to be very
fast -- the DataFrame is cached in memory and has just 14 columns (13 floats
and one string). Writing it out in JSON format is indeed reasonably fast
(though it still takes a few minutes, which is longer than I'd expect).
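
For reference, here's a minimal sketch of the write path (output locations
are placeholders, df is the already-built DataFrame from my PySpark session,
and I've omitted how it gets constructed):

    df = df.cache()
    df.count()  # force materialization so the writes below read from memory

    # JSON dump -- reasonably fast, a few minutes:
    df.write.json("hdfs:///tmp/output_json")

    # Parquet dump -- this is the slow one, over half an hour:
    df.write.parquet("hdfs:///tmp/output_parquet")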

However, writing a Parquet file takes far longer -- the first set of tasks
finishes in a few minutes, but the subsequent tasks take at least twice as
long, and in the end the write takes over half an hour. Looking at disk I/O
and CPU usage on the compute nodes, the processors are fully loaded while
disk I/O is essentially zero for long stretches. I don't see any obvious
garbage collection issues, and there are no memory problems.
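
The only idea I have is that the CPU time is going into Parquet
compression/encoding -- that's just a guess on my part. If that's it, I
assume I could try a lighter codec via spark.sql.parquet.compression.codec,
something like (untested; sqlContext is from the PySpark shell):

    # Guess: switch the Parquet compression codec to snappy, or turn it off
    sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
    # or: sqlContext.setConf("spark.sql.parquet.compression.codec", "uncompressed")
    df.write.parquet("hdfs:///tmp/output_parquet_snappy")

But I'd rather understand what's actually going on than blindly twiddle
knobs.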

Any ideas on how to debug/fix this? 

Thanks!


