edgarRd commented on issue #2319:
URL: https://github.com/apache/iceberg/issues/2319#issuecomment-797103513


   I think caching tables indefinitely within a session (until garbage 
collection or a refresh happens) and returning stale results is unintuitive 
(non-obvious), and too inconsistent to be enabled by default. Multiple people 
have raised this behavior in the past - at least @aokolnychyi and @pvary - 
which signals how unintuitive returning stale results by default is. It's also 
inconsistent with how Hive and Presto handle Iceberg tables, and with how Spark 
handles queries against non-Iceberg tables.
   
   I'm not arguing for removing caching altogether - as @pvary mentioned, it 
can be a useful feature in some cases - but rather that the default should be 
the most intuitive and consistent behavior. If there are use cases that need to 
pin the state of a table at some point in time, the cache can be used 
explicitly rather than implicitly by default. In fact, Spark already provides 
such a construct with 
https://spark.apache.org/docs/3.0.1/sql-ref-syntax-aux-cache-cache-table.html - 
shouldn't that be more in line with Spark's expected handling of cached tables?
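   To illustrate, a sketch of what opting in explicitly could look like in 
Spark SQL, using the `CACHE TABLE` syntax from the page linked above (the 
table name `db.events` is hypothetical):

```sql
-- Pin the current state of an (assumed) Iceberg table explicitly;
-- subsequent queries in this session read the pinned copy.
CACHE TABLE cached_events AS SELECT * FROM db.events;

-- When fresh data is needed again, drop the pinned copy.
UNCACHE TABLE cached_events;
```

   With this, staleness is something the user asked for, not a surprise.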
   
   > My only worry about setting it as false is breaking self joins
   
   If a given environment depends on caching being enabled everywhere, 
`cache-enabled=true` can be set in `spark-defaults.conf` regardless of the 
default value - it's good practice to avoid depending on defaults anyway. I 
think that is better than the other way around, where users are expected to 
know they must refresh a table to see whether it has changed. If the default 
were changed, we'd definitely need to call it out in the docs/release notes.
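   As a sketch of such an environment-level override, assuming a catalog named 
`my_catalog` backed by the Iceberg `SparkCatalog` (the catalog name and `type` 
are illustrative), `spark-defaults.conf` could contain:

```
# Opt in to Iceberg's catalog-level table caching for this deployment,
# independent of whatever the library default is.
spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.my_catalog.type=hive
spark.sql.catalog.my_catalog.cache-enabled=true
```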
   
   > Based on this discussion my feeling is that the best solution would be to 
create a metadata cache around TableMetadataParser.read(FileIO io, InputFile 
file) where the cache key is the file.location().
   
   I agree, this would solve caching for the purpose of saving resources. 
However, it does not address the self-join concern mentioned before, since 
self-joins rely on both sides reading the same snapshot.
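   To make the self-join concern concrete, a hedged sketch (table and column 
names are hypothetical): if the two scans of the same table resolve it 
independently, they may plan against different snapshots when a commit lands 
mid-query.

```sql
-- Both sides reference the same table; without a shared cached Table
-- instance, each scan could see a different snapshot if a commit lands
-- while the query runs, producing an inconsistent join result.
SELECT cur.id, cur.value, prev.value
FROM db.events cur
JOIN db.events prev
  ON cur.id = prev.id;
```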
   
   I think @Parth-Brahmbhatt mentioned that there's a `refresh` stored 
procedure; however, I think that goes in the wrong direction as a way to 
support caching by default, i.e. users would need to know that tables are 
cached by default, which is problematic when the behavior is inconsistent with 
other compute engines or table formats. Instead, I think it's preferable to 
cache explicitly (either with 
https://spark.apache.org/docs/3.0.1/sql-ref-syntax-aux-cache-cache-table.html 
or a stored procedure); that makes the default behavior intuitive and 
consistent with other compute engines and with non-Iceberg tables in Spark.
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


