Optimizations to GlueCatalog

Chetas Joshi Wed, 06 Mar 2024 11:52:36 -0800

Hi Community,

I am working on loading Iceberg data from S3 using Flink. I am using
GlueCatalog for storing the Iceberg table metadata. I found that the
GlueCatalog’s loadTable call (implemented
<https://github.com/apache/iceberg/blob/apache-iceberg-1.4.0/core/src/main/java/org/apache/iceberg/BaseMetastoreCatalog.java#L46>
in the abstract class BaseMetastoreCatalog) creates a new instance of
GlueTableOperations every time it is called for a Glue table identifier.
Each new instance is initialized with shouldRefresh = true, so it refreshes
the table metadata on every loadTable call for that table identifier, even
if loadTable was called for the same identifier in the recent past. I am
wondering why these TableOperations instances are not cached in the
catalog. I suggest the following change to the newTableOps method
<https://github.com/apache/iceberg/blob/apache-iceberg-1.4.0/aws/src/main/java/org/apache/iceberg/aws/glue/GlueCatalog.java#L205>
in GlueCatalog (and in other catalog implementations) and would really
appreciate the community's feedback on this.


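protected TableOperations newTableOps(TableIdentifier tableIdentifier) {
    // tableCache is a Cache with key=tableIdentifier and
    // value=GlueTableOperations object
    if (tableCache.containsKey(tableIdentifier)) {
        // Reuse the instance created by an earlier call
        return tableCache.get(tableIdentifier);
    } else {
        // First request for this table: create the instance and cache it
        GlueTableOperations ops = new GlueTableOperations(....);
        tableCache.put(tableIdentifier, ops);
        return ops;
    }
}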
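
To make the idea concrete, here is a minimal, self-contained sketch of the
same caching pattern. It does not use the real Iceberg classes:
FakeTableOps, OpsCache and TableOpsCacheSketch are hypothetical stand-ins,
a plain String plays the role of TableIdentifier, and a ConcurrentHashMap
is assumed as the cache. computeIfAbsent replaces the separate
containsKey/get/put steps so that concurrent callers do not create
duplicate instances for the same table.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Sketch only: FakeTableOps and OpsCache are hypothetical stand-ins,
// not Iceberg classes.
final class TableOpsCacheSketch {

    // Stand-in for an operations object that is costly to create and refresh.
    static final class FakeTableOps {
        final String tableIdentifier;

        FakeTableOps(String tableIdentifier) {
            this.tableIdentifier = tableIdentifier;
        }
    }

    // Keeps at most one operations object per table identifier.
    static final class OpsCache<K, V> {
        private final Map<K, V> cache = new ConcurrentHashMap<>();
        private final Function<K, V> factory;
        private final AtomicInteger creations = new AtomicInteger();

        OpsCache(Function<K, V> factory) {
            this.factory = factory;
        }

        // Returns the cached instance, creating it on the first request only.
        V get(K id) {
            return cache.computeIfAbsent(id, key -> {
                creations.incrementAndGet();
                return factory.apply(key);
            });
        }

        // Drops the cached instance, e.g. after the table is dropped or renamed.
        void invalidate(K id) {
            cache.remove(id);
        }

        // Number of times the factory has been invoked.
        int creations() {
            return creations.get();
        }
    }

    public static void main(String[] args) {
        OpsCache<String, FakeTableOps> cache = new OpsCache<>(FakeTableOps::new);
        FakeTableOps first = cache.get("db.events");
        FakeTableOps second = cache.get("db.events");
        System.out.println(first == second);    // the same instance is reused
        System.out.println(cache.creations());  // how often the factory ran
    }
}
```

One open design question the sketch does not answer: a cached operations
object would still need some refresh or expiry policy so that commits made
by other writers become visible, and entries would have to be invalidated
when a table is dropped or renamed.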

If you like the approach, I am happy to contribute the change. Let me
know.

Thank you
Chetas
