Why not let Spark SQL deal with parallelism? When using Spark SQL data
sources, you can control parallelism by setting mapred.min.split.size and
mapred.max.split.size in your Hadoop configuration. You can then repartition
your data as you wish and save it as Parquet.
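
For example, a minimal sketch reusing the names from the snippet below
(assuming Spark 1.3.x with the spark-csv package; the split sizes, glob
path, and partition count are illustrative assumptions, not tuned values):

    // Illustrative split sizes: a smaller max split means more input tasks.
    sc.hadoopConfiguration.set("mapred.min.split.size", (16 * 1024 * 1024).toString)
    sc.hadoopConfiguration.set("mapred.max.split.size", (64 * 1024 * 1024).toString)

    // Load the whole folder in one call via a glob instead of one load per
    // file (assumes every file under the folder matches the same schema).
    val df = sqlContext.load(
      "com.databricks.spark.csv",
      schemaSelect(schemaFile, target, sqlContext),
      Map("path" -> (mainLogFolder + target + "/*"),
          "header" -> "false", "delimiter" -> ","))

    // Repartition so the Parquet write is spread across the cluster.
    df.repartition(sc.defaultParallelism).saveAsParquetFile(processedTraces + target)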

--Hossein

On Thu, May 28, 2015 at 8:32 AM, M Rez <mmrez...@gmail.com> wrote:

> I am using Spark-CSV to load 50GB of around 10,000 CSV files into a couple
> of unified DataFrames. Since this process is slow, I have written this
> snippet:
>
>     targetList.foreach { target =>
>       // getTrace loads each file in the target's folder with sqlContext.load,
>       // using the StructType built beforehand from the schema files
>       getTrace(target, sqlContext)
>         .reduce(_ unionAll _)
>         .registerTempTable(target.toUpperCase())
>       sqlContext.sql("SELECT * FROM " + target.toUpperCase())
>         .saveAsParquetFile(processedTraces + target)
>     }
>
> to load the CSV files, union all the files that share a schema, and write
> them into a single Parquet file with its parts. The problem is that my CPU
> (not all cores are busy) and disk (an SSD, at 1 MB/s at most) are barely
> utilized. I wonder what I am doing wrong?!
>
> snippet for getTrace:
>
> def getTrace(target: String, sqlContext: SQLContext): Seq[DataFrame] = {
>   logFiles(mainLogFolder + target).map { file =>
>     sqlContext.load(
>       driver,
>       // schemaSelect builds the StructType once
>       schemaSelect(schemaFile, target, sqlContext),
>       Map("path" -> file, "header" -> "false", "delimiter" -> ","))
>   }
> }
>
> thanks for any help
>
> -----
> regards,
> mohamad
