pvary commented on issue #2319:
URL: https://github.com/apache/iceberg/issues/2319#issuecomment-796578997


   I found the same issue and started a 
[discussion](https://mail-archives.apache.org/mod_mbox/iceberg-dev/202102.mbox/%[email protected]%3e)
 about it on the dev list.
   The main points are:
   - Stale data
   - Table object is not thread safe
   
   I also chatted a bit about it with @rdblue, and he mentioned that in 
Spark the CachingCatalog is also used to make sure that the same version of 
the table is retrieved every time during the same session. So getting back 
stale data is a feature, not a bug.
   
   Based on this discussion, my feeling is that the best solution would be to 
create a metadata cache around `TableMetadataParser.read(FileIO io, InputFile 
file)`, where the cache key is `file.location()`.
   
   The snapshots are immutable, and my guess (no hard numbers on it yet) is that 
the most resource-intensive parts of table creation are fetching the metadata 
from S3 and parsing the file, so this cache would help the most while keeping 
the solution the least complicated.
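   To make the idea concrete, here is a minimal sketch of such a location-keyed cache. The names (`MetadataCache`, the `parser` function) are hypothetical stand-ins, not Iceberg API: in the real change the key would be `file.location()` and the value the immutable `TableMetadata` returned by `TableMetadataParser.read(FileIO io, InputFile file)`.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

// Hypothetical sketch of a metadata cache keyed by the metadata file location.
public class MetadataCache<T> {
  private final ConcurrentMap<String, T> cache = new ConcurrentHashMap<>();
  private final Function<String, T> parser; // stand-in for TableMetadataParser.read

  public MetadataCache(Function<String, T> parser) {
    this.parser = parser;
  }

  // Metadata files are immutable once written, so a hit can be returned
  // without re-fetching from object storage; a miss is parsed exactly once
  // and the result shared by all callers of the same location.
  public T get(String location) {
    return cache.computeIfAbsent(location, parser);
  }
}
```

   A real implementation would of course need a size/TTL bound (e.g. Caffeine) so old metadata versions can be evicted, but the immutability of the files is what makes this simple keying safe.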

