Have you used any partitioned columns when writing as JSON or Parquet? (A quick sketch of what I mean is at the end of this message.)

On Fri, Nov 6, 2015 at 6:53 AM, Rok Roskar <rokros...@gmail.com> wrote:

> Yes, I was expecting that too because of all the metadata generation and
> compression. But I have not seen performance this bad for other Parquet
> files I've written, and was wondering if there could be something obvious
> (and wrong) in how I've specified the schema etc. It's a very simple schema
> consisting of a StructType with a few StructField floats and a string. I'm
> using all the Spark defaults for I/O compression.
>
> I'll see what I can do about running a profiler -- can you point me to a
> resource/example?
>
> Thanks,
>
> Rok
>
> ps: my post on the mailing list is still listed as not accepted by the
> mailing list:
> http://apache-spark-user-list.1001560.n3.nabble.com/very-slow-parquet-file-write-td25295.html
> -- none of your responses are there either. I am definitely subscribed to
> the list, though (I get daily digests). Any clue how to fix it?
>
>
> On Nov 6, 2015, at 9:26 AM, Cheng Lian <lian.cs....@gmail.com> wrote:
>
> I'd expect writing Parquet files to be slower than writing JSON files, since
> Parquet involves more complicated encoders, but maybe not that slow. Would
> you mind trying to profile one Spark executor with a tool like YJP to see
> where the hotspot is?
>
> Cheng
>
> On 11/6/15 7:34 AM, rok wrote:
>
> Apologies if this appears a second time!
>
> I'm writing a ~100 GB PySpark DataFrame with a few hundred partitions into a
> Parquet file on HDFS. I've got a few hundred nodes in the cluster, so for a
> file of this size that is heavily over-provisioned (I've tried it with fewer
> partitions and fewer nodes, with no obvious effect). I was expecting the dump
> to disk to be very fast -- the DataFrame is cached in memory and contains
> just 14 columns (13 floats and one string). When I write it out in JSON
> format, this is indeed reasonably fast (though it still takes a few minutes,
> which is longer than I would expect).
>
> However, when I try to write a Parquet file it takes far longer -- the first
> set of tasks finishes in a few minutes, but the subsequent tasks take more
> than twice as long. In the end it takes over half an hour to write the file.
> I've looked at the disk I/O and CPU usage on the compute nodes, and it looks
> like the processors are fully loaded while the disk I/O is essentially zero
> for long periods of time. I don't see any obvious garbage collection issues
> and there are no problems with memory.
>
> Any ideas on how to debug/fix this?
>
> Thanks!
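
In case it helps to compare, here is a rough, self-contained sketch of the kind
of experiment I have in mind. It is not your job -- just a small synthetic
DataFrame with the shape you describe (13 float columns plus one string),
written unpartitioned and then partitioned by a hypothetical "key" column. The
row count, column names, and output paths are all placeholders:

import time

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, FloatType, StringType

sc = SparkContext(appName="parquet-write-test")
sqlContext = SQLContext(sc)

# Schema like the one described: a StructType of float fields plus one string.
fields = [StructField("f%d" % i, FloatType()) for i in range(13)]
fields.append(StructField("key", StringType()))  # hypothetical partition column
schema = StructType(fields)

# Small synthetic DataFrame (the real one is ~100 GB); cache and materialize it
# so the write timings below measure only the output path.
rows = sc.parallelize(range(2000000), 200).map(
    lambda i: tuple([float(i)] * 13 + ["key_%d" % (i % 100)]))
df = sqlContext.createDataFrame(rows, schema).cache()
df.count()

def timed(label, fn):
    start = time.time()
    fn()
    print("%s: %.1f s" % (label, time.time() - start))

# Unpartitioned writes with Spark's default compression settings.
timed("json write", lambda: df.write.json("/tmp/write_test_json"))
timed("parquet write", lambda: df.write.parquet("/tmp/write_test_parquet"))

# Partitioned write -- this is what I was asking about above. With many distinct
# values in the partition column, each task keeps an open writer per value it
# sees, which can slow Parquet writes down considerably.
timed("partitioned parquet write",
      lambda: df.write.partitionBy("key").parquet("/tmp/write_test_parquet_by_key"))

If the partitioned write is the slow case for you, a high-cardinality partition
column would be my first suspect; otherwise the profiler output Cheng suggested
should show where the executors are spending their CPU time.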