wypoon commented on issue #4217:
URL: https://github.com/apache/iceberg/issues/4217#issuecomment-1058730372


   I have run some experiments along these lines too. I created the Iceberg 
tables from the TPC-DS Hive tables by using the `snapshot` procedure to 
create them in a different database:
   ```
   spark.sql("create database tpcds_iceberg")
   spark.sql("use tpcds")
   val tables = spark.sql("show tables")
   tables.collect().map(r => r(1).toString).foreach(t =>
     spark.sql(s"call spark_catalog.system.snapshot('tpcds.$t', 'tpcds_iceberg.$t')")
   )
   ```
   This creates Iceberg tables backed by the same underlying data. As we're not 
writing to the tables, this does not create any problems.
   For the same value of `spark.sql.autoBroadcastJoinThreshold`, Spark will use 
a broadcast join in some of the TPC-DS queries on the Hive tables but a 
SortMergeJoin on the Iceberg tables. This is because Spark estimates the size 
of the relation differently for native tables than Iceberg does. For native 
tables, Spark bases its estimate on file size. File size can be a significant 
underestimate, as in your case, where you are using Snappy-compressed Parquet 
files, because most folks don't set `spark.sql.sources.fileCompressionFactor` 
(which defaults to 1.0). Iceberg estimates the size of the relation by 
multiplying the estimated width of the requested columns by the number of 
rows. In my original commit for https://github.com/apache/iceberg/pull/3038, 
I estimated the size of the relation the same way Spark does for `FileScan`s, 
but @rdblue suggested the approach that was eventually adopted.
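   To make the difference concrete, here is a rough sketch of the two 
estimates in plain Scala. It is not the actual Spark or Iceberg code; the 
object name, the helper names, and all of the numbers are hypothetical and 
chosen only to show how the same data can land on either side of the 
threshold:
   ```scala
   // Illustrative sketch only; none of these numbers come from the benchmark in this issue.
   object SizeEstimateSketch {
     // File-size-based estimate: on-disk bytes scaled by the compression factor
     // (spark.sql.sources.fileCompressionFactor, which defaults to 1.0).
     def fileBasedEstimate(totalFileBytes: Long, compressionFactor: Double): Long =
       (totalFileBytes * compressionFactor).toLong

     // Row-based estimate: estimated width of the requested columns times the row count.
     def rowBasedEstimate(columnWidths: Seq[Int], rowCount: Long): Long =
       columnWidths.map(_.toLong).sum * rowCount

     def main(args: Array[String]): Unit = {
       val threshold = 10L * 1024 * 1024 // default autoBroadcastJoinThreshold (10 MB)
       val fileBytes = 4L * 1024 * 1024  // hypothetical: 4 MB of Snappy-compressed Parquet
       val rows      = 1000000L          // hypothetical row count
       val widths    = Seq(8, 8, 4, 20)  // hypothetical per-column widths in bytes

       val byFile = fileBasedEstimate(fileBytes, 1.0)
       val byRows = rowBasedEstimate(widths, rows)
       println(s"file-based: $byFile bytes, under threshold = ${byFile <= threshold}")
       println(s"row-based:  $byRows bytes, under threshold = ${byRows <= threshold}")
     }
   }
   ```
   With these made-up inputs the file-based estimate stays under the 10 MB 
threshold while the row-based estimate does not, which is the kind of gap 
that would flip a broadcast join to a SortMergeJoin.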
   To get Spark to use broadcast joins for the Iceberg tables, you can set a 
higher value for `spark.sql.autoBroadcastJoinThreshold`. The difficulty is 
that the two size estimates are not directly comparable, so it is hard to 
compare apples to apples in this case.
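   As a sketch of what raising the threshold might look like in a 
spark-shell session (the 100 MB value is arbitrary, and the query is just an 
example join against two of the snapshotted tables, not one of the benchmark 
queries):
   ```scala
   // Assumes a spark-shell session with the tpcds_iceberg tables created as above.
   // 100 MB is an arbitrary illustrative value; Spark's default is 10 MB.
   spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 100L * 1024 * 1024)

   // Print the physical plan to see which join strategy the planner picks.
   spark.sql("""
     select count(*)
     from tpcds_iceberg.store_sales ss
     join tpcds_iceberg.date_dim d on ss.ss_sold_date_sk = d.d_date_sk
   """).explain()
   ```
   Looking for `BroadcastHashJoin` versus `SortMergeJoin` in the printed plan 
should show whether the new threshold changed the join strategy.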


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]