ConeyLiu commented on issue #4217: URL: https://github.com/apache/iceberg/issues/4217#issuecomment-1061735573
We also benchmarked Iceberg with TPC-DS and found the following:

1. As @wypoon said, Spark estimates relation size differently when reading Parquet and when reading Iceberg, which leads to different query plans, such as a BroadcastJoin changing to a SortMergeJoin.
2. Spark reads Parquet with vectorized reading by default, whereas Iceberg enabled it by default only recently. You can enable it yourself, which can improve read performance a lot. I also have a PR (https://github.com/apache/iceberg/pull/3249) that optimizes Iceberg's vectorized reading of Parquet decimals.
3. Spark supports DPP (dynamic partition pruning) for built-in data sources, but it has supported Iceberg-like data sources only since Spark 3.2. This can affect TPC-DS performance a lot.
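As a rough sketch of how the three points above map to settings, the snippet below builds the relevant statements as plain strings, so it runs without a Spark installation. It assumes the Iceberg table property `read.parquet.vectorization.enabled` and the Spark settings `spark.sql.autoBroadcastJoinThreshold` and `spark.sql.optimizer.dynamicPartitionPruning.enabled`. The table name `db.sales` and the helper names are hypothetical, and the threshold shown is simply Spark's 10 MB default, not a tuned value.

```python
# Sketch only: emits SQL strings rather than talking to a live Spark session.
# `db.sales` and both helper functions are hypothetical, for illustration.

# Iceberg table property controlling vectorized Parquet reads (point 2).
ICEBERG_VECTORIZATION_PROP = "read.parquet.vectorization.enabled"

# Spark session settings related to join planning (point 1) and
# dynamic partition pruning (point 3). Values are examples, not advice.
SPARK_CONFS = {
    "spark.sql.autoBroadcastJoinThreshold": str(10 * 1024 * 1024),  # bytes
    "spark.sql.optimizer.dynamicPartitionPruning.enabled": "true",
}


def enable_vectorization_sql(table: str) -> str:
    """Return an ALTER TABLE statement turning on vectorized reads."""
    return (
        f"ALTER TABLE {table} SET TBLPROPERTIES "
        f"('{ICEBERG_VECTORIZATION_PROP}'='true')"
    )


def set_conf_sql(confs: dict) -> list:
    """Return one Spark SQL SET statement per configuration entry."""
    return [f"SET {key}={value}" for key, value in confs.items()]


if __name__ == "__main__":
    print(enable_vectorization_sql("db.sales"))
    for statement in set_conf_sql(SPARK_CONFS):
        print(statement)
```

Each generated statement could then be passed to `spark.sql(...)` in a session that has the Iceberg catalog configured. Whether raising the broadcast threshold actually restores a BroadcastJoin depends on the size estimate Spark gets from the Iceberg table, so comparing plans with `EXPLAIN` is the safer check.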
