Can you confirm that file1df("COLUMN2") and file2df("COLUMN10") appeared in the output of joineddf.collect.foreach(println)?
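
In case it helps, here is a minimal sketch of that check. It assumes it is run in the same spark-shell session as the quoted code, so that joineddf and selecteddata are already defined. printSchema, show and count are standard DataFrame methods; the row limit of 10 is an arbitrary choice.

```scala
// Assumes the same spark-shell session as the quoted code, so that
// joineddf and selecteddata are already defined.

// Confirm the joined frame exposes both columns used in the select.
joineddf.printSchema()

// Preview a few rows of the projection that is written to CSV.
selecteddata.show(10)

// A count of 0 would mean the projection itself is empty.
println(selecteddata.count())
```

If the count is non-zero but the saved output still holds only the header, the write step is the more likely place to look than the join.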
Thanks

On Sun, Dec 27, 2015 at 6:32 PM, Divya Gehlot <divya.htco...@gmail.com> wrote:

> Hi,
> I am trying to join two dataframes and am able to display the results in
> the console after the join. I am then saving the joined data in CSV format
> using the spark-csv API. It is saving only the column names, with no data
> at all.
>
> Below is the sample code for reference:
>
> spark-shell --packages com.databricks:spark-csv_2.10:1.1.0 --master yarn-client --driver-memory 512m --executor-memory 512m
>
> import org.apache.spark.sql.hive.HiveContext
> import org.apache.spark.sql.hive.orc._
> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
> import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType, FloatType, LongType, TimestampType, NullType}
>
> val firstSchema = StructType(Seq(
>   StructField("COLUMN1", StringType, true),
>   StructField("COLUMN2", StringType, true),
>   StructField("COLUMN2", StringType, true),
>   StructField("COLUMN3", StringType, true),
>   StructField("COLUMN4", StringType, true),
>   StructField("COLUMN5", StringType, true)))
> val file1df = hiveContext.read.format("com.databricks.spark.csv").option("header", "true").schema(firstSchema).load("/tmp/File1.csv")
>
> val secondSchema = StructType(Seq(
>   StructField("COLUMN1", StringType, true),
>   StructField("COLUMN2", NullType, true),
>   StructField("COLUMN3", TimestampType, true),
>   StructField("COLUMN4", TimestampType, true),
>   StructField("COLUMN5", NullType, true),
>   StructField("COLUMN6", StringType, true),
>   StructField("COLUMN7", IntegerType, true),
>   StructField("COLUMN8", IntegerType, true),
>   StructField("COLUMN9", StringType, true),
>   StructField("COLUMN10", IntegerType, true),
>   StructField("COLUMN11", IntegerType, true),
>   StructField("COLUMN12", IntegerType, true)))
> val file2df = hiveContext.read.format("com.databricks.spark.csv").option("header", "false").schema(secondSchema).load("/tmp/file2.csv")
>
> val joineddf = file1df.join(file2df, file1df("COLUMN1") === file2df("COLUMN6"))
> val selecteddata = joineddf.select(file1df("COLUMN2"), file2df("COLUMN10"))
>
> // the statement below prints the joined data
> joineddf.collect.foreach(println)
>
> // this statement saves the CSV file, but only the column names mentioned
> // in the select above are being saved
> selecteddata.write.format("com.databricks.spark.csv").option("header", "true").save("/tmp/JoinedData.csv")
>
> Would really appreciate any pointers or help.
>
> Thanks,
> Divya