Hello, try to use parquet format with compression ( like snappy or lz4 ) so
the produced files will be smaller and it will generate less i/o. Moreover
normally parquet is more faster than csv format in reading for further
operations .
Another possible format is ORC file.

Kind Regards

Matteo


2018-03-09 11:23 GMT+01:00 Md. Rezaul Karim <rezaul.ka...@insight-centre.org
>:

> Dear All,
>
> I have a tiny CSV file, which is around 250MB. There are only 30 columns
> in the DataFrame. Now I'm trying to save the pre-processed DataFrame as an
> another CSV file on disk for later usage.
>
> However, I'm getting pissed off as writing the resultant DataFrame is
> taking too long, which is about 4 to 5 hours. Nevertheless, the size of the
> file written on the disk is about 58GB!
>
> Here's the sample code that I tried:
>
> # Using repartition()
> myDF.repartition(1).write.format("com.databricks.spark.
> csv").save("data/file.csv")
>
> # Using coalesce()
> myDF. coalesce(1).write.format("com.databricks.spark.csv").save("
> data/file.csv")
>
>
> Any better suggestion?
>
>
>
> ----
> Md. Rezaul Karim, BSc, MSc
> Research Scientist, Fraunhofer FIT, Germany
>
> Ph.D. Researcher, Information Systems, RWTH Aachen University, Germany
>
> eMail: rezaul.ka...@fit.fraunhofer.de <andrea.berna...@fit.fraunhofer.de>
> Tel: +49 241 80-21527 <+49%20241%208021527>
>

Reply via email to