[GitHub] [iceberg] lirui-apache commented on pull request #4596: Use bounded queue to avoid consuming too much memory

GitBox Wed, 25 May 2022 20:04:14 -0700


lirui-apache commented on PR #4596:
URL: https://github.com/apache/iceberg/pull/4596#issuecomment-1138096307


   Let me clarify our use case. We have an Iceberg table partitioned by date, 
and we run a daily ETL job to sync data from an upstream Hive table into it. 
The ETL job basically just runs an `INSERT INTO` with SparkSQL, which adds a 
new partition to the Iceberg table. So we end up with one manifest per 
partition, and each partition has a lot of data files, roughly 50k to 130k.
   Then we have a Trino cluster where users submit ad-hoc queries. This cluster 
is multi-tenant and is not dedicated to the Iceberg table mentioned above.
   The problem we faced is that querying this huge Iceberg table can easily 
make the Trino coordinator unstable, or even crash it with an OOM.
   
   I'll check whether `commit.manifest.target-size-bytes` can mitigate our case. 
But IIUC, the Iceberg worker pool is static and shared among all jobs, so we 
probably won't want to change the pool size.
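
   As a rough illustration of the bounded-queue idea in the PR title, here is 
a minimal sketch. It is not the code in this PR and uses no Iceberg API: the 
class `BoundedQueueSketch`, the `run` method, the `DONE` marker and the fake 
file names are all hypothetical. Producer tasks on a small fixed pool stand in 
for manifest readers, and the calling thread stands in for the consumer that 
plans the scan. Because the queue has a fixed capacity, producers block once 
it is full, so the number of buffered entries stays bounded no matter how many 
data files each manifest lists.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class BoundedQueueSketch {
    // Hypothetical end-of-input marker; not an Iceberg concept.
    private static final String DONE = "__done__";

    /**
     * Produces manifests * filesPerManifest fake file names on a fixed pool and
     * consumes them on the calling thread through a queue that holds at most
     * `capacity` entries. Returns {consumedCount, maxObservedQueueSize}.
     */
    public static int[] run(int manifests, int filesPerManifest, int capacity)
            throws InterruptedException {
        if (manifests <= 0) {
            return new int[] {0, 0};
        }
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(capacity);
        ExecutorService pool = Executors.newFixedThreadPool(2);
        AtomicInteger remaining = new AtomicInteger(manifests);

        for (int m = 0; m < manifests; m++) {
            final int manifest = m;
            pool.submit(() -> {
                try {
                    for (int f = 0; f < filesPerManifest; f++) {
                        // put() blocks while the queue is full; this is what
                        // keeps the buffered entries bounded.
                        queue.put("manifest-" + manifest + "/file-" + f);
                    }
                    // The last producer to finish signals end of input.
                    if (remaining.decrementAndGet() == 0) {
                        queue.put(DONE);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        int consumed = 0;
        int maxSize = 0;
        while (true) {
            maxSize = Math.max(maxSize, queue.size());
            String item = queue.take();
            if (item.equals(DONE)) {
                break;
            }
            consumed++;
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return new int[] {consumed, maxSize};
    }
}
```

   For example, `BoundedQueueSketch.run(4, 1000, 16)` consumes all 4000 
entries while never holding more than 16 in the queue. One trade-off of this 
approach is that a blocked producer keeps its pool thread occupied, which 
matters when the pool is shared among jobs.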


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

