[GitHub] [iceberg] smallx opened a new issue #2562: Spark on iceberg table is slower than spark on hive parquet table

GitBox Fri, 07 May 2021 01:50:33 -0700


smallx opened a new issue #2562:
URL: https://github.com/apache/iceberg/issues/2562

In our tests, Spark queries on an Iceberg table are slower than the same
queries on a Hive Parquet table; performance drops by about half. After tuning
some table properties, Iceberg's performance improved, but it is still slower
than plain Parquet. The properties we set are as follows.

```sql
ALTER TABLE iceberg_table SET TBLPROPERTIES
('write.parquet.compression-codec'='snappy');
ALTER TABLE iceberg_table SET TBLPROPERTIES
('read.parquet.vectorization.enabled'='true');
ALTER TABLE iceberg_table SET TBLPROPERTIES
('read.parquet.vectorization.batch-size'='100000');
ALTER TABLE iceberg_table SET TBLPROPERTIES
('commit.manifest.min-count-to-merge'='2');
```
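
As a sanity check, the properties can be listed to confirm that they were
applied. This is only a sketch: it assumes the table is named `iceberg_table`
as above and that the session resolves it through an Iceberg catalog.

```sql
-- List all table properties; the four values set above should appear here.
SHOW TBLPROPERTIES iceberg_table;
```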

It seems that Iceberg's vectorized read performance is not as good as
Spark's. The vectorized reading code contains many for loops and function
calls. See the following code.

https://github.com/apache/iceberg/blob/a2103b7131bb039c531a2c9f70c7c8b9fe9715ac/arrow/src/main/java/org/apache/iceberg/arrow/vectorized/parquet/VectorizedDictionaryEncodedParquetValuesReader.java#L42-L70

My guess may be wrong; please help analyze the cause. Thanks very much.

Version info:
- Spark 3.0.2
- Iceberg 0.11.1
