asheeshgarg commented on issue #621: Broadcast Join Failure
URL: 
https://github.com/apache/incubator-iceberg/issues/621#issuecomment-552485021
 
 
   def writeData(dataFrame: DataFrame, path: String, format: String, mode: SaveMode): Unit = {
     val tables = new HadoopTables(spark.sparkContext.hadoopConfiguration)
     // Convert the Spark schema once and reuse it for both the table schema and the partition spec
     val schema = SparkSchemaUtil.convert(dataFrame.schema)
     val partitionSpec = PartitionSpec.builderFor(schema).build()
     tables.create(schema, partitionSpec, path)
     dataFrame
       .write
       .format(format)
       .mode(mode)
       .partitionBy("DATE")
       .save(path)
   }
   val refTableLoc = s"${storeLocation}/iceberg/eqty/reference"
   writeData(refDf, refTableLoc, "iceberg", SaveMode.Append)

   val pricingTableLoc = s"${storeLocation}/iceberg/eqty/pricing"
   writeData(pricingDf, pricingTableLoc, "iceberg", SaveMode.Append)
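   One thing worth noting about the snippet above: `PartitionSpec.builderFor(schema).build()` produces an unpartitioned Iceberg spec, even though the DataFrame write uses `partitionBy("DATE")`. If the table is meant to be partitioned by DATE on the Iceberg side as well, the spec would need to declare it explicitly. A minimal sketch (assuming DATE is a column in the converted schema):

   ```scala
   import org.apache.iceberg.PartitionSpec

   // Declare an identity partition on the DATE column so the Iceberg
   // table spec matches the layout produced by partitionBy("DATE").
   val partitionSpec = PartitionSpec.builderFor(schema)
     .identity("DATE")
     .build()
   ```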
   
   The method above is used to generate 30 days of reference and pricing data.
   The data is partitioned by date, so I see roughly equally sized Parquet files
   in the reference and pricing tables created by Iceberg.
   
   The read operation is performed using:

   spark.read.format("iceberg").load("iceberg/eqty/reference")
     .join(spark.read.format("iceberg").load("iceberg/eqty/pricing"),
       Seq("ID_BB_GLOBAL", "DATE"))
     .count()
   
   As soon as you run this join, you see the error I mentioned in the original report.
   If you repartition the data into more partitions, it works.
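
   As an alternative to repartitioning, broadcast-join failures like this can often be avoided by disabling Spark's automatic broadcast join so it falls back to a sort-merge join. A sketch (the threshold property is Spark's standard setting, not something specific to this issue):

   ```scala
   // Setting the auto-broadcast threshold to -1 disables automatic
   // broadcast joins; Spark will use a sort-merge join instead.
   spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

   spark.read.format("iceberg").load("iceberg/eqty/reference")
     .join(spark.read.format("iceberg").load("iceberg/eqty/pricing"),
       Seq("ID_BB_GLOBAL", "DATE"))
     .count()
   ```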
   
   As mentioned, the same join works directly against raw Parquet, and a similar
   join using Delta Lake also works, so the size of the data is fine.
   We don't want to arbitrarily repartition the data.
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
