Why not let Spark SQL handle the parallelism? When using Spark SQL data
sources you can control parallelism by setting mapred.min.split.size
and mapred.max.split.size in your Hadoop configuration. You can then
repartition the data as you wish and save it as Parquet.
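A minimal sketch of that suggestion, assuming Spark 1.x with the spark-csv package on the classpath; the paths, split sizes, and partition count below are placeholders, not values from this thread:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("csv-to-parquet"))

// Control input split sizes (in bytes) so Spark SQL picks the read
// parallelism itself: here min 64MB, max 128MB per split (example values).
val minSplit = 64L * 1024 * 1024
val maxSplit = 128L * 1024 * 1024
sc.hadoopConfiguration.set("mapred.min.split.size", minSplit.toString)
sc.hadoopConfiguration.set("mapred.max.split.size", maxSplit.toString)

val sqlContext = new SQLContext(sc)

// Read all CSVs in one pass via the spark-csv data source.
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("hdfs:///data/csv/*.csv")

// Repartition as desired, then persist once as Parquet.
df.repartition(200).write.parquet("hdfs:///data/parquet/unified")
```

Subsequent jobs can then read the Parquet output directly, which is both columnar and splittable, so the expensive CSV parse happens only once.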
--Hossein
On Thu, May
I am using Spark-CSV to load about 50GB spread across roughly 10,000 CSV
files into a couple of unified DataFrames. Since this process is slow, I
wrote this snippet:
targetList.foreach { target =>
  // get the list of CSV files for this target, then load them
  // with sqlContext.load according to the matching schema file
}