On 11/6/15 10:53 PM, Rok Roskar wrote:
Yes, I was expecting that too, because of all the metadata generation and compression. But I haven't seen performance this bad for other Parquet files I've written, and I was wondering if there could be something obvious (and wrong) in how I've specified the schema, etc. It's a very simple schema: a StructType with a few float StructFields and a string. I'm using all the Spark defaults for I/O compression.
I'll see what I can do about running a profiler -- can you point me to
a resource/example?
This link is probably helpful:
https://cwiki.apache.org/confluence/display/SPARK/Profiling+Spark+Applications+Using+YourKit
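The page linked above essentially amounts to attaching the YourKit agent to the executor JVMs through the executor Java options; roughly something like the following in `spark-defaults.conf` (the agent path is installation-specific and shown here only as an assumption):

```
spark.executor.extraJavaOptions  -agentpath:/path/to/yourkit/bin/linux-x86-64/libyjpagent.so=sampling
```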
Thanks,
Rok
P.S.: my post is still listed as not accepted by the mailing list:
http://apache-spark-user-list.1001560.n3.nabble.com/very-slow-parquet-file-write-td25295.html
None of your responses are there either. I am definitely subscribed to the list, though (I get the daily digests). Any clue how to fix it?
Sorry, no idea :-/
On Nov 6, 2015, at 9:26 AM, Cheng Lian <lian.cs....@gmail.com> wrote:
I'd expect writing Parquet files to be slower than writing JSON files, since Parquet involves more complicated encoders, but maybe not that slow. Would you mind profiling one Spark executor with a tool like YJP (YourKit Java Profiler) to see what the hotspot is?
Cheng
On 11/6/15 7:34 AM, rok wrote:
Apologies if this appears a second time!
I'm writing a ~100 GB PySpark DataFrame with a few hundred partitions into a Parquet file on HDFS. I've got a few hundred nodes in the cluster, so for a file of this size that is way over-provisioned (I've tried it with fewer partitions and fewer nodes, with no obvious effect). I was expecting the dump to disk to be very fast: the DataFrame is cached in memory and contains just 14 columns (13 floats and one string). When I write it out in JSON format, this is indeed reasonably fast (though it still takes a few minutes, which is longer than I would expect).
However, when I try to write a Parquet file it takes far longer: the first set of tasks finishes in a few minutes, but the subsequent tasks take more than twice as long. In the end it takes over half an hour to write the file. I've looked at the disk I/O and CPU usage on the compute nodes, and it looks like the processors are fully loaded while the disk I/O is essentially zero for long periods of time. I don't see any obvious garbage collection issues, and there are no problems with memory.
Any ideas on how to debug/fix this?
Thanks!
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/very-slow-parquet-file-write-tp25295.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org