RE: CSV to parquet preserving partitioning

neil90 Wed, 16 Nov 2016 09:42:29 -0800

All you need to do is load all the files into a single DataFrame at once, then
save that DataFrame using partitionBy:


df.write.format("parquet").partitionBy("directoryCol").save("hdfs://path")

If you then look at the new folder, it should have the layout you want, with
one subdirectory per value of the partition column, i.e.:
hdfs://path/directoryCol=dir1/part-r-xxx.gz.parquet
hdfs://path/directoryCol=dir2/part-r-yyy.gz.parquet
hdfs://path/directoryCol=dir3/part-r-zzz.gz.parquet
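
For the "load all the files at once" step, here is a rough PySpark sketch of
one way it could be done. It assumes Spark 2.x, that the CSV files have a
header row, and that the partition value is simply the name of the directory
each file sits in. The function names and the input glob are hypothetical,
only directoryCol and hdfs://path come from the snippet above.

```python
import re

# Captures the name of the directory that directly contains a file.
# The pattern is simple enough to behave the same way in Python's re
# module and in Spark's regexp_extract.
PARENT_DIR_PATTERN = r"/([^/]+)/[^/]+$"


def parent_dir(path):
    """Local check of the pattern: return the parent directory name of path."""
    match = re.search(PARENT_DIR_PATTERN, path)
    return match.group(1) if match else ""


def csv_to_partitioned_parquet(spark, input_glob, output_path):
    """Read every CSV matching input_glob and write Parquet partitioned by source directory."""
    # Imported here so parent_dir can be tried without a Spark installation.
    from pyspark.sql.functions import input_file_name, regexp_extract

    df = (
        spark.read
        .option("header", "true")  # assumption: the CSV files have a header row
        .csv(input_glob)
        # Tag each row with the directory its source file came from.
        .withColumn(
            "directoryCol",
            regexp_extract(input_file_name(), PARENT_DIR_PATTERN, 1),
        )
    )
    # Same write call as in the reply above.
    df.write.format("parquet").partitionBy("directoryCol").save(output_path)
```

With an existing SparkSession named spark, a call might look like
csv_to_partitioned_parquet(spark, "hdfs://input/*/*.csv", "hdfs://path"),
where the input glob is a placeholder for wherever the CSV directories live.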



