Can you confirm that file1df("COLUMN2") and file2df("COLUMN10") appeared in
the output of joineddf.collect.foreach(println)?
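
One quick way to check, as a rough sketch to paste into the same spark-shell
session: it assumes joineddf and selecteddata from your code below are already
defined, and inspect is just a hypothetical helper name, not part of Spark or
spark-csv. It prints the schema, a few rows and the row count, so an empty or
all-null selection shows up before anything is written out.

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helper (not from the original post): print the schema, the
// first few rows and the row count of a DataFrame, and return the count.
def inspect(df: DataFrame, label: String): Long = {
  println(s"--- $label ---")
  df.printSchema()    // column names and types actually present
  df.show(5)          // first 5 rows in tabular form
  val n = df.count()  // number of rows in the DataFrame
  println(s"$label row count: $n")
  n
}

// Assumes these two DataFrames exist in the current session.
inspect(joineddf, "joineddf")
inspect(selecteddata, "selecteddata")
```

If selecteddata reports zero rows, or only nulls in the two columns, the
problem is probably in the join or the schemas rather than in the CSV write.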

Thanks

On Sun, Dec 27, 2015 at 6:32 PM, Divya Gehlot <divya.htco...@gmail.com>
wrote:

> Hi,
> I am trying to join two dataframes and am able to display the results in the
> console after the join. I am then saving the joined data in CSV format using
> the spark-csv API. It is saving only the column names and no data at all.
>
> Below is the sample code for the reference:
>
> spark-shell   --packages com.databricks:spark-csv_2.10:1.1.0  --master
>> yarn-client --driver-memory 512m --executor-memory 512m
>>
>> import org.apache.spark.sql.hive.HiveContext
>> import org.apache.spark.sql.hive.orc._
>> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>> import org.apache.spark.sql.types.{StructType, StructField, StringType,
>> IntegerType, FloatType, LongType, TimestampType, NullType}
>>
>> val firstSchema = StructType(Seq(StructField("COLUMN1", StringType,
>> true),StructField("COLUMN2", StringType, true),StructField("COLUMN2",
>> StringType, true),StructField("COLUMN3", StringType, true),
>> StructField("COLUMN4", StringType, true),StructField("COLUMN5",
>> StringType, true)))
>> val file1df =
>> hiveContext.read.format("com.databricks.spark.csv").option("header",
>> "true").schema(firstSchema).load("/tmp/File1.csv")
>>
>>
>> val secondSchema = StructType(Seq(
>> StructField("COLUMN1", StringType, true),
>> StructField("COLUMN2", NullType  , true),
>> StructField("COLUMN3", TimestampType , true),
>> StructField("COLUMN4", TimestampType , true),
>> StructField("COLUMN5", NullType , true),
>> StructField("COLUMN6", StringType, true),
>> StructField("COLUMN7", IntegerType, true),
>> StructField("COLUMN8", IntegerType, true),
>> StructField("COLUMN9", StringType, true),
>> StructField("COLUMN10", IntegerType, true),
>> StructField("COLUMN11", IntegerType, true),
>> StructField("COLUMN12", IntegerType, true)))
>>
>>
>> val file2df =
>> hiveContext.read.format("com.databricks.spark.csv").option("header",
>> "false").schema(secondSchema).load("/tmp/file2.csv")
>> val joineddf = file1df.join(file2df, file1df("COLUMN1") ===
>> file2df("COLUMN6"))
>> val selecteddata = joineddf.select(file1df("COLUMN2"),file2df("COLUMN10"))
>>
> //the below statement is printing the joined data
>
>> joineddf.collect.foreach(println)
>>
>
>
>> // this statement saves the CSV file, but only the column names mentioned
>> // above in the select are being saved
>> selecteddata.write.format("com.databricks.spark.csv").option("header",
>> "true").save("/tmp/JoinedData.csv")
>>
>
>
> Would really appreciate the pointers /help.
>
> Thanks,
> Divya
>
>
>
>
>
