smallx opened a new issue #2562:
URL: https://github.com/apache/iceberg/issues/2562
In our tests, Spark queries on an Iceberg table are slower than the same
queries on a Hive Parquet table: performance drops by about half. After tuning
some table properties, Iceberg's performance improved, but it is still slower
than plain Parquet. The properties we set are as follows.
```sql
ALTER TABLE iceberg_table SET TBLPROPERTIES
('write.parquet.compression-codec'='snappy');
ALTER TABLE iceberg_table SET TBLPROPERTIES
('read.parquet.vectorization.enabled'='true');
ALTER TABLE iceberg_table SET TBLPROPERTIES
('read.parquet.vectorization.batch-size'='100000');
ALTER TABLE iceberg_table SET TBLPROPERTIES
('commit.manifest.min-count-to-merge'='2');
```
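For anyone trying to reproduce this, the four statements above can probably be
collapsed into one, followed by a check of what the table reports. This is
only a sketch: it assumes `iceberg_table` resolves through an Iceberg catalog
in the current Spark SQL session, and that `SHOW TBLPROPERTIES` is supported
for that catalog in the Spark version in use.

```sql
-- Sketch: the same four properties as above, set in a single statement.
ALTER TABLE iceberg_table SET TBLPROPERTIES (
  'write.parquet.compression-codec' = 'snappy',
  'read.parquet.vectorization.enabled' = 'true',
  'read.parquet.vectorization.batch-size' = '100000',
  'commit.manifest.min-count-to-merge' = '2'
);

-- Assumption: SHOW TBLPROPERTIES works against this catalog; it lists the
-- properties currently stored on the table so the settings can be confirmed.
SHOW TBLPROPERTIES iceberg_table;
```

Note that `write.parquet.compression-codec` only affects files written after
the change, so existing data files keep whatever codec they were written with.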
Iceberg's vectorized read path seems to perform worse than Spark's. The
vectorized reading code contains many for loops and function calls; see the
following code.
https://github.com/apache/iceberg/blob/a2103b7131bb039c531a2c9f70c7c8b9fe9715ac/arrow/src/main/java/org/apache/iceberg/arrow/vectorized/parquet/VectorizedDictionaryEncodedParquetValuesReader.java#L42-L70
My guess may be wrong; please help analyze the cause. Thanks very much.
Version info:
- Spark 3.0.2
- Iceberg 0.11.1