Why not let Spark SQL handle the parallelism? When using Spark SQL data
sources you can control parallelism by setting mapred.min.split.size
and mapred.max.split.size in your Hadoop configuration. You can then
repartition the data as you wish and save it as Parquet.
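A minimal sketch of that suggestion, assuming Spark 1.x with the spark-csv package on the classpath; the paths, split sizes, and partition count below are placeholders, not values from this thread:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("csv-to-parquet"))

// Control input split sizes (in bytes) so Spark SQL picks the read
// parallelism itself: here min 64MB, max 128MB per split (example values).
val minSplit = 64L * 1024 * 1024
val maxSplit = 128L * 1024 * 1024
sc.hadoopConfiguration.set("mapred.min.split.size", minSplit.toString)
sc.hadoopConfiguration.set("mapred.max.split.size", maxSplit.toString)

val sqlContext = new SQLContext(sc)

// Read all CSVs in one pass via the spark-csv data source.
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("hdfs:///data/csv/*.csv")

// Repartition as desired, then persist once as Parquet.
df.repartition(200).write.parquet("hdfs:///data/parquet/unified")
```

Subsequent jobs can then read the Parquet output directly, which is both columnar and splittable, so the expensive CSV parse happens only once.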
--Hossein
On Thu, May
I am using Spark-CSV to load about 50GB spread across roughly 10,000 CSV
files into a couple of unified DataFrames. Since this process is slow, I
wrote this snippet:
targetList.foreach { target =>
  // get the list of CSV files for this target, then load them
  // with sqlContext.load according to the matching schema file
}