Hello, try using the Parquet format with compression (e.g. Snappy or LZ4) so the produced files are smaller and generate less I/O. Parquet is also usually faster than CSV to read back for further operations. Another possible format is ORC.
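For example, here is a minimal sketch in PySpark (assuming a DataFrame named myDF as in your snippet, and an active SparkSession called spark; the codec name is passed via the "compression" write option):

# Write as Parquet compressed with Snappy (the default codec in Spark 2.x)
myDF.write.option("compression", "snappy").parquet("data/file.parquet")

# ORC is another compressed columnar option
myDF.write.orc("data/file.orc")

# Reading it back later is typically much faster than reading CSV
df = spark.read.parquet("data/file.parquet")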
Kind Regards
Matteo

2018-03-09 11:23 GMT+01:00 Md. Rezaul Karim <rezaul.ka...@insight-centre.org>:
> Dear All,
>
> I have a tiny CSV file, which is around 250MB. There are only 30 columns
> in the DataFrame. Now I'm trying to save the pre-processed DataFrame as
> another CSV file on disk for later usage.
>
> However, I'm getting pissed off as writing the resultant DataFrame is
> taking too long, about 4 to 5 hours. Moreover, the size of the file
> written to disk is about 58GB!
>
> Here's the sample code that I tried:
>
> # Using repartition()
> myDF.repartition(1).write.format("com.databricks.spark.csv").save("data/file.csv")
>
> # Using coalesce()
> myDF.coalesce(1).write.format("com.databricks.spark.csv").save("data/file.csv")
>
> Any better suggestion?
>
> ----
> Md. Rezaul Karim, BSc, MSc
> Research Scientist, Fraunhofer FIT, Germany
>
> Ph.D. Researcher, Information Systems, RWTH Aachen University, Germany
>
> eMail: rezaul.ka...@fit.fraunhofer.de
> Tel: +49 241 80-21527