smallx opened a new issue #2562:
URL: https://github.com/apache/iceberg/issues/2562


   In our comparison tests, Spark queries on an Iceberg table are slower than 
the same queries on a Hive Parquet table: performance is reduced by about half. 
After tuning some table properties, Iceberg's performance improved, but it is 
still slower than plain Parquet. The properties we set are as follows.
   
   ```sql
   ALTER TABLE iceberg_table SET TBLPROPERTIES ('write.parquet.compression-codec'='snappy');
   ALTER TABLE iceberg_table SET TBLPROPERTIES ('read.parquet.vectorization.enabled'='true');
   ALTER TABLE iceberg_table SET TBLPROPERTIES ('read.parquet.vectorization.batch-size'='100000');
   ALTER TABLE iceberg_table SET TBLPROPERTIES ('commit.manifest.min-count-to-merge'='2');
   ```
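
   For reference, the same four properties can presumably be set in one 
statement and then checked on the table. This is only a sketch, assuming the 
same Spark SQL session and the `iceberg_table` name used above; it was not part 
of the original test run.

   ```sql
   -- Set all four properties at once (same keys and values as above).
   ALTER TABLE iceberg_table SET TBLPROPERTIES (
     'write.parquet.compression-codec' = 'snappy',
     'read.parquet.vectorization.enabled' = 'true',
     'read.parquet.vectorization.batch-size' = '100000',
     'commit.manifest.min-count-to-merge' = '2'
   );

   -- List the properties currently stored on the table to confirm they took effect.
   SHOW TBLPROPERTIES iceberg_table;
   ```

   Note that the compression codec applies only to data files written after the 
change, so files written earlier keep their original codec unless they are 
rewritten.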
   
   It seems that Iceberg's vectorized read performance is not as good as 
Spark's. The vectorized reading code appears to contain too many for loops and 
function calls. See the following code.
   
   
https://github.com/apache/iceberg/blob/a2103b7131bb039c531a2c9f70c7c8b9fe9715ac/arrow/src/main/java/org/apache/iceberg/arrow/vectorized/parquet/VectorizedDictionaryEncodedParquetValuesReader.java#L42-L70
   
   My guess may be wrong, so please help analyze the cause. Thanks very much.
   
   Version info:
   - Spark 3.0.2
   - Iceberg 0.11.1

