Is there a way to control how large the part files are for a Parquet
dataset? I'm currently using e.g.

results.toDF.coalesce(60).write.mode("append").parquet(outputdir)

to manually reduce the number of parts, but the coalesce value doesn't map
predictably onto part size: coalescing to 30 actually gave me smaller parts
than coalescing to 60. I'd like to be able to specify the part size directly
rather than guess-and-check which coalesce value to use.
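For concreteness, here's roughly what I have in mind (just a sketch; the
total size and target part size are placeholders I'd have to estimate
myself, and Parquet compression means the files would come out smaller than
the raw partition sizes):

// Sketch: derive the coalesce value from an estimated total output size
// and a target part size, instead of guessing the partition count.
val estimatedTotalBytes = 3L * 1024 * 1024 * 1024 * 1024  // ~3 TB, per the numbers below
val targetPartBytes = 1024L * 1024 * 1024                 // aim for parts of ~1 GB
val numParts = math.max(1, (estimatedTotalBytes / targetPartBytes).toInt)

results.toDF
  .coalesce(numParts)
  .write
  .mode("append")
  .parquet(outputdir)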

Why I care: my data is ~3 TB in Parquet form, spread across about 16
thousand files of around 200 MB each. Based on the transfer rate I
calculated from the YARN web UI's progress indicator, copying this from HDFS
on EC2 to S3 will take more than 4 hours. By way of comparison, when I
transferred 3.8 TB from S3 to HDFS on EC2, it only took about 1.5 hours;
there the files were 1.7 GB each.
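Rough arithmetic (my own estimate from the numbers above, not something I've
measured): to get parts around the 1.7 GB size that transferred quickly,
~3 TB of output would need on the order of 3 * 1024 / 1.7, i.e. roughly 1800
partitions, instead of the ~16 thousand files I have today.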

Minimizing the transfer time is important because I'll be taking the dataset
out of S3 many times.
