Good point, thanks!

That does the job, provided the dataset corresponding to each input
directory contains a single partition.

The question now is how to achieve this without shuffling the data?
I'm using the Databricks CSV reader on Spark 1.6 and I don't think there is a
way to control the partitioning.
As far as I can see, it creates one partition per CSV file, so the data from
one input directory can end up scattered across the nodes ...
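
For illustration, a quick way to see this (Spark 1.6 with the spark-csv
package, sqlContext as provided by the spark-shell; the partition count is
what I'd expect for small files, not a guarantee):

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("/path/dir3/*.csv")          // dir3 holds three csv files

    // Typically one partition per small input file, so 3 here,
    // and those partitions can be scheduled on different nodes.
    println(df.rdd.partitions.length)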

From: Daniel Siegmann [mailto:dsiegm...@securityscorecard.io]
Sent: Tuesday, November 15, 2016 18:57
To: Drooghaag, Benoit (Nokia - BE) <benoit.droogh...@nokia.com>
Cc: user <user@spark.apache.org>
Subject: Re: CSV to parquet preserving partitioning

Did you try unioning the datasets for each CSV into a single dataset? You may 
need to put the directory name into a column so you can partition by it.
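
Roughly like this, untested against your data (Scala on Spark 1.6 with the
spark-csv package; the dirs list and the "dir" column name are placeholders):

    import org.apache.spark.sql.functions.lit

    val dirs = Seq("dir1", "dir2", "dir3")   // placeholder: your input directories

    val perDir = dirs.map { dir =>
      sqlContext.read
        .format("com.databricks.spark.csv")
        .option("header", "true")
        .load(s"/path/$dir/*.csv")
        .withColumn("dir", lit(dir))         // directory name as a column
    }

    // unionAll is the Spark 1.6 name (renamed union in 2.x)
    perDir.reduce(_ unionAll _)
      .write
      .partitionBy("dir")                    // yields hdfs://path/dir=dir1/... etc.
      .parquet("hdfs://path")
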
On Tue, Nov 15, 2016 at 8:44 AM, benoitdr
<benoit.droogh...@nokia.com> wrote:
Hello,

I'm trying to convert a bunch of CSV files to parquet, with the interesting
twist that the input CSV files are already "partitioned" by directory.
All the input files have the same set of columns.
The input file structure looks like:

/path/dir1/file1.csv
/path/dir1/file2.csv
/path/dir2/file3.csv
/path/dir3/file4.csv
/path/dir3/file5.csv
/path/dir3/file6.csv

I'd like to read those files and write their data to a parquet table on
HDFS, preserving the partitioning (partitioned by input directory), such
that there is a single output file per partition.
The output file structure should look like:

hdfs://path/dir=dir1/part-r-xxx.gz.parquet
hdfs://path/dir=dir2/part-r-yyy.gz.parquet
hdfs://path/dir=dir3/part-r-zzz.gz.parquet


The best solution I have found so far is to loop over the input
directories, loading the CSV files into a dataframe and writing the
dataframe into its target partition.
But this is not efficient: since I want a single output file per partition,
the write to HDFS is a single task that blocks the loop.
I wonder how to achieve this with maximum parallelism (and without
shuffling the data in the cluster).
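
For reference, a minimal sketch of the loop I'm describing (Scala, Spark 1.6
with the spark-csv package; paths and the directory list are placeholders):

    // sqlContext as provided by the spark-shell
    val dirs = Seq("dir1", "dir2", "dir3")   // one entry per input directory

    dirs.foreach { dir =>
      val df = sqlContext.read
        .format("com.databricks.spark.csv")
        .option("header", "true")
        .load(s"/path/$dir/*.csv")

      // coalesce(1) gives the single output file I want, but it also
      // makes the write a single task, and foreach blocks on each write.
      df.coalesce(1)
        .write
        .parquet(s"hdfs://path/dir=$dir")
    }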

Thanks!




