asheeshgarg commented on issue #621: Broadcast Join Failure
URL: https://github.com/apache/incubator-iceberg/issues/621#issuecomment-552485021

    def writeData(dataFrame: DataFrame, path: String, format: String, mode: SaveMode): Unit = {
      val tables = new HadoopTables(spark.sparkContext.hadoopConfiguration)
      val schema = SparkSchemaUtil.convert(dataFrame.schema)
      val partitionSpec = PartitionSpec.builderFor(schema).build()
      // Create the Iceberg table at `path` before writing to it
      tables.create(schema, partitionSpec, path)
      dataFrame
        .write
        .format(format)
        .mode(mode)
        .partitionBy("DATE")
        .save(path)
    }

    val icebergTableLoc = s"${storeLocation}/iceberg/eqty/reference"
    writeData(refDf, icebergTableLoc, "iceberg", SaveMode.Append)

    val icebergTableLoc = s"${storeLocation}/iceberg/eqty/pricing"
    writeData(pricingDf, icebergTableLoc, "iceberg", SaveMode.Append)

The method above is used to generate 30 days of reference and pricing data. The loaded data is partitioned by date, so I see roughly equally sized data in the reference and pricing stores in the Parquet files created by Iceberg.

The read operation is performed using:

    spark.read.format("iceberg").load("iceberg/eqty/reference")
      .join(spark.read.format("iceberg").load("iceberg/eqty/pricing"), Seq("ID_BB_GLOBAL", "DATE"))
      .count()

As soon as you do this, you see the error I mentioned in the original request. If you repartition the data into more partitions, it works. As mentioned, the same join worked directly against raw Parquet, and I also tried a similar join using Delta Lake and it worked. So the size of the data is really fine, and we don't want to arbitrarily repartition the data.
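
For reference, the repartition workaround mentioned above looks roughly like the sketch below. It is only an illustration of the workaround, not a proposed fix: `numPartitions` is a hypothetical value picked for the example, and the table paths are the same ones used in the read shown earlier.

```scala
import org.apache.spark.sql.SparkSession

// Assumes an existing Spark session with the Iceberg runtime on the classpath.
val spark = SparkSession.builder().getOrCreate()

// Hypothetical partition count, chosen only for illustration.
val numPartitions = 200

// Repartition both sides before joining; with more partitions the join succeeded.
val reference = spark.read.format("iceberg")
  .load("iceberg/eqty/reference")
  .repartition(numPartitions)

val pricing = spark.read.format("iceberg")
  .load("iceberg/eqty/pricing")
  .repartition(numPartitions)

reference.join(pricing, Seq("ID_BB_GLOBAL", "DATE")).count()
```

This adds a shuffle on both inputs, which is why we would rather not rely on it.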
