edgarRd opened a new issue #2319:
URL: https://github.com/apache/iceberg/issues/2319


   I've been running into problems with refreshing table metadata in 
Iceberg. I think caching is flawed: when we use Spark 3 via `SparkCatalog` 
with `cache-enabled=true` (the default), the Iceberg catalog is wrapped in a 
`CachingCatalog`, and cached tables stay stale until either:
   1. They are evicted, or
   2. There's a commit, which triggers a refresh.
   
   This makes long-lived `TableOperations` objects dangerous, e.g. multiple 
long-lived Spark sessions reading the same table while it is being modified. 
   
   I don't think `TableOperations` objects are cache-friendly unless we accept 
stale results across sessions. For example, with `cache-enabled=true` we 
get the following behavior:
   1. First session reads `table1`.
   2. Second session reads `table1`. <-- up to this point, both sessions see 
the same data
   3. First session commits to `table1`.
   4. Second session reads `table1`. <-- this read is stale due to caching; 
the changes from step 3 are not reflected
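   
   The four steps above can be sketched with a toy model. This is not Iceberg 
code: `ToyMetastore`, `ToyTable`, and `ToyCachingCatalog` are hypothetical 
stand-ins, and the metadata file names are made up. The sketch only assumes 
that each session keeps its own cache and that a commit refreshes nothing but 
the committing session's handle.

```python
class ToyMetastore:
    """Stands in for the shared metastore: table name -> current metadata location."""

    def __init__(self):
        self.locations = {}


class ToyTable:
    """A loaded table handle that remembers the metadata location it last read."""

    def __init__(self, store, name):
        self.store = store
        self.name = name
        self.location = store.locations[name]

    def commit(self, new_location):
        # A commit updates the shared pointer and refreshes this handle only.
        self.store.locations[self.name] = new_location
        self.location = new_location


class ToyCachingCatalog:
    """Per-session cache: a table is loaded once and then reused as-is."""

    def __init__(self, store):
        self.store = store
        self.cache = {}

    def load_table(self, name):
        if name not in self.cache:
            self.cache[name] = ToyTable(self.store, name)
        return self.cache[name]


store = ToyMetastore()
store.locations["table1"] = "v1.metadata.json"  # hypothetical file name
session1 = ToyCachingCatalog(store)
session2 = ToyCachingCatalog(store)

first_read = session1.load_table("table1").location   # step 1
second_read = session2.load_table("table1").location  # step 2
session1.load_table("table1").commit("v2.metadata.json")  # step 3
stale_read = session2.load_table("table1").location   # step 4: still the old location
fresh_read = session1.load_table("table1").location   # the committer sees its own commit
```

   In this model the second session keeps returning the old location until 
its cache entry is evicted or it commits itself, which matches the two 
conditions listed earlier.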
   
   For this flow to work as it does with Hive tables, with both sessions 
seeing up-to-date data, we can't use caching right now.
   Skipping the check for the current metadata location saves client calls, 
but I think `TableOperations` should check it and refresh the metadata when 
the location changes. With that, we could cache the objects and still 
return fresh data.
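   
   A minimal sketch of that idea, again as a toy model and not Iceberg's 
actual `TableOperations` interface: `ToyTableOperations.current()` and the 
counters are hypothetical names. The assumption is that looking up the 
metadata location is one cheap call, while re-reading the metadata itself is 
the expensive part and only happens when the location has changed.

```python
class ToyMetastore:
    """Shared pointer store that counts how often the pointer is looked up."""

    def __init__(self):
        self.locations = {}
        self.pointer_lookups = 0

    def current_location(self, name):
        self.pointer_lookups += 1
        return self.locations[name]


class ToyTableOperations:
    """Cached per-table object that revalidates its metadata location on each read."""

    def __init__(self, store, name):
        self.store = store
        self.name = name
        self.location = None
        self.metadata_loads = 0

    def current(self):
        latest = self.store.current_location(self.name)  # cheap check on every read
        if latest != self.location:
            # The location changed (or nothing is loaded yet): reload the metadata.
            self.location = latest
            self.metadata_loads += 1
        return self.location


store = ToyMetastore()
store.locations["table1"] = "v1.metadata.json"  # hypothetical file name
ops = ToyTableOperations(store, "table1")

ops.current()
ops.current()                       # unchanged location: no reload
loads_before_commit = ops.metadata_loads

store.locations["table1"] = "v2.metadata.json"  # a commit from another session
location_after_commit = ops.current()
loads_after_commit = ops.metadata_loads
```

   The trade-off is one extra pointer lookup per read in exchange for never 
serving metadata older than the last committed location.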
   
   Caching is enabled by default in `SparkCatalog` (Spark 3). For now, I think 
the default should be `false`, especially since it can currently lead to data 
inconsistency.
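   
   Until the default changes, caching can be turned off per catalog. The 
sketch below only builds the Spark configuration entries; the catalog name 
`my_catalog` and the helper `iceberg_catalog_conf` are made up for 
illustration, and it assumes the property is spelled `cache-enabled`, as in 
Iceberg's Spark configuration documentation.

```python
def iceberg_catalog_conf(catalog_name, cache_enabled):
    """Build Spark conf entries for an Iceberg SparkCatalog (hypothetical helper)."""
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.cache-enabled": str(cache_enabled).lower(),
    }


conf = iceberg_catalog_conf("my_catalog", cache_enabled=False)

# Flatten into `--conf key=value` arguments, e.g. for spark-sql or spark-submit.
conf_args = [arg for key, value in conf.items() for arg in ("--conf", f"{key}={value}")]
```

   A real catalog needs further settings (catalog type, warehouse or 
metastore URI) that are omitted here.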
   
   What do you think, @rdblue @aokolnychyi?




