wypoon commented on issue #4217:
URL: https://github.com/apache/iceberg/issues/4217#issuecomment-1058730372
I have done some experiments along these lines too. I created the Iceberg
tables from the TPC-DS Hive tables by using the snapshot procedure to create
them in a different database:
```
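spark.sql("create database tpcds_iceberg")
spark.sql("use tpcds")
val tables = spark.sql("show tables")
// Column 1 of the SHOW TABLES output is the table name.
tables.collect().map(r => r(1).toString).foreach { t =>
  // Snapshot each Hive table into tpcds_iceberg, reusing the existing data files.
  spark.sql(s"call spark_catalog.system.snapshot('tpcds.$t', 'tpcds_iceberg.$t')")
}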
```
This creates Iceberg tables backed by the same underlying data. As we're not
writing to the tables, this does not create any problems.
For the same value of `spark.sql.autoBroadcastJoinThreshold`, Spark will use
a broadcast join in some of the TPC-DS queries on Hive tables but a
SortMergeJoin on the Iceberg tables. This is because Spark estimates the size
of the relation for a native table differently from the way Iceberg estimates
it. For the native table case, Spark uses file size in its estimation. File
size can be a significant underestimate, as in your case, where you are using
Snappy-compressed Parquet files, because most folks don't set
`spark.sql.sources.fileCompressionFactor` (which defaults to 1.0). Iceberg
estimates the size of the relation by multiplying the estimated width of the
requested columns by the number of rows. In my original commit for
https://github.com/apache/iceberg/pull/3038, I used the same approach to
estimating the size of the relation that Spark uses for `FileScan`s, but
@rdblue suggested using the approach that was ultimately adopted.
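To make the difference concrete, here is a rough sketch of the two estimates
side by side. The object and method names are hypothetical, the column widths
and table sizes are made-up numbers, and this is not the actual Spark or
Iceberg code; it only mirrors the two formulas described above.

```scala
// Illustrative only: hypothetical helpers, not the actual Spark or Iceberg code.
object SizeEstimates {
  // Default value of spark.sql.autoBroadcastJoinThreshold (10 MB).
  val DefaultBroadcastThreshold: Long = 10L * 1024 * 1024

  // File-size-based estimate: on-disk bytes scaled by the compression factor.
  def fileBasedEstimate(totalFileBytes: Long, compressionFactor: Double = 1.0): Long =
    (totalFileBytes * compressionFactor).toLong

  // Row-width-based estimate: summed per-column widths times the row count.
  def rowBasedEstimate(columnWidths: Seq[Int], rowCount: Long): Long =
    columnWidths.map(_.toLong).sum * rowCount

  // A relation is a broadcast candidate when its estimate is within the threshold.
  def canBroadcast(estimatedBytes: Long, threshold: Long = DefaultBroadcastThreshold): Boolean =
    estimatedBytes <= threshold

  def main(args: Array[String]): Unit = {
    // Made-up table: 6 MB of compressed Parquet, 1M rows, three requested columns.
    val fileEstimate = fileBasedEstimate(6L * 1000 * 1000)
    val rowEstimate = rowBasedEstimate(Seq(8, 8, 20), 1000000L)
    println(s"file-based: $fileEstimate bytes, broadcast = ${canBroadcast(fileEstimate)}")
    println(s"row-based:  $rowEstimate bytes, broadcast = ${canBroadcast(rowEstimate)}")
  }
}
```

With these assumed numbers, the same data falls under the default threshold
by the file-based estimate and well over it by the row-based one, which is
the kind of gap that changes the join strategy.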
To get Spark to use broadcast joins for the Iceberg tables, you can set a
higher value for `spark.sql.autoBroadcastJoinThreshold`. The problem is that it
is hard to compare apples with apples in this case.
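For example, something like the following raises the threshold for the
current session. It assumes the same `spark` session as in the snippet
above, and the 100 MB value is an arbitrary illustration rather than a
recommendation; a suitable value depends on the size estimates for your
tables.

```scala
// Assumes an existing SparkSession named `spark`.
// 100 MB is an arbitrary illustrative value, not a tuned recommendation.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", (100L * 1024 * 1024).toString)
```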