edgarRd opened a new issue #2319: URL: https://github.com/apache/iceberg/issues/2319
I've been experiencing a few issues with refreshing table metadata in Iceberg. I think caching in Iceberg is flawed: if we use Spark 3 via `SparkCatalog` with `cache-enabled=true` (the default), the Iceberg catalog is wrapped in a `CachingCatalog`, and cached tables stay stale until either:

1. They are evicted, or
2. There's a commit, which triggers a refresh.

This makes long-lived `TableOperations` objects dangerous, e.g. multiple long-lived Spark sessions reading the same table while it gets modified. I don't think `TableOperations` is cache-friendly unless we accept stale results in different sessions. For example, with `cache-enabled=true` we get the following behavior:

1. First session reads `table1`.
2. Second session reads `table1`. <-- up to this point, both sessions see the same data
3. First session commits to `table1`.
4. Second session reads `table1`. <-- this read is stale due to caching; the changes from step 3 are not reflected

For this flow to work as it does with Hive tables, with both sessions seeing up-to-date data, we can't use caching right now. Not checking for the up-to-date metadata location saves client calls, but I think `TableOperations` should check the metadata location and refresh the metadata when it changes. With that, we could cache the objects and still get fresh, correct data.

Caching is enabled by default in `SparkCatalog` (Spark 3). For now, I think the default should be `false`, especially since it can currently lead to data inconsistency.

What do you think @rdblue @aokolnychyi?
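The four-step flow from the issue, and the proposed check on the metadata location, can be sketched with a toy model. This is only an illustration under stated assumptions: `MetadataStore`, `CachedTable`, and `run_scenario` are hypothetical names invented for this sketch, not Iceberg or Spark APIs, and the real `CachingCatalog` and `TableOperations` are considerably more involved.

```python
# Toy model of the stale-read scenario. All names are hypothetical;
# nothing here is an Iceberg or Spark API.

class MetadataStore:
    """Stands in for the metastore: table name -> current metadata location."""

    def __init__(self):
        self._current = {}    # table name -> latest metadata location
        self._snapshots = {}  # metadata location -> rows at that version

    def commit(self, table, rows):
        location = f"{table}/metadata/v{len(self._snapshots) + 1}.json"
        self._snapshots[location] = list(rows)
        self._current[table] = location
        return location

    def current_location(self, table):
        return self._current[table]

    def load(self, location):
        return list(self._snapshots[location])


class CachedTable:
    """One session's cached view of a table."""

    def __init__(self, store, table, check_location):
        self._store = store
        self._table = table
        self._check_location = check_location
        self.refresh()

    def refresh(self):
        self._location = self._store.current_location(self._table)
        self._rows = self._store.load(self._location)

    def read(self):
        # Proposed behavior: compare the cached metadata location with the
        # current one and refresh only when it has changed.
        if (self._check_location
                and self._store.current_location(self._table) != self._location):
            self.refresh()
        return list(self._rows)

    def append(self, row):
        # A commit refreshes the committing session's own view.
        self.refresh()
        self._store.commit(self._table, self._rows + [row])
        self.refresh()


def run_scenario(check_location):
    store = MetadataStore()
    store.commit("table1", ["a"])
    first = CachedTable(store, "table1", check_location)
    second = CachedTable(store, "table1", check_location)
    first.read()       # step 1
    second.read()      # step 2: both sessions see the same data
    first.append("b")  # step 3: first session commits
    return first.read(), second.read()  # step 4


print(run_scenario(check_location=False))  # (['a', 'b'], ['a'])
print(run_scenario(check_location=True))   # (['a', 'b'], ['a', 'b'])
```

Without the location check, the second session keeps returning its cached rows after the first session's commit. With the check, each read costs one extra lookup of the current metadata location, which is the trade-off the issue describes between saving client calls and data freshness.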
