wypoon opened a new issue, #7474:
URL: https://github.com/apache/iceberg/issues/7474

   ### Feature Request / Improvement
   
   It is known that the CachingCatalog can cause stale data to be read from an 
Iceberg table in one Spark session when the table is updated in another Spark 
session: https://github.com/apache/iceberg/issues/2319, 
https://github.com/apache/iceberg/issues/3357. Within a single Spark session, a 
commit refreshes the metadata of the cached table, so subsequent reads should 
normally see writes right away. However, stale reads can also occur within a 
single Spark session.
   In a customer SQL workload, we discovered that queries used inconsistent 
case for database and table names: a table was read using an upper-case name 
and updated using a lower-case name, all within the same Spark session. This 
is not incorrect, as SQL is case insensitive for database, table, and column 
names. Normally the new snapshot should be read immediately after the write, 
but it was not, because the read loaded a different table object from the 
cache: the cache held two entries for the same table under different keys. As 
a result, stale data was read until the cache entry expired. (Repeated reads 
keep renewing the cache entry, which exacerbates the problem.)
   I opened https://github.com/apache/iceberg/pull/7469 to address this problem 
by adding a configuration option that controls the case sensitivity of the 
CachingCatalog.
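
   To illustrate the failure mode, here is a minimal sketch of a cache keyed by table identifier. It is a toy model, not Iceberg's CachingCatalog: `ToyTable`, `ToyCachingCatalog`, and the `case_sensitive` flag are hypothetical names made up for this example, and the underlying catalog is assumed to resolve names case-insensitively.

```python
from typing import Dict, Tuple

Identifier = Tuple[str, ...]  # e.g. ("db", "events"); stand-in for a table identifier


class ToyTable:
    """Stand-in for a loaded table; tracks only the snapshot it currently sees."""

    def __init__(self, snapshot_id: int) -> None:
        self.snapshot_id = snapshot_id


class ToyCachingCatalog:
    """Hypothetical cache in front of a catalog (not the real CachingCatalog)."""

    def __init__(self, case_sensitive: bool = True) -> None:
        self.case_sensitive = case_sensitive
        # Assumed source of truth: resolves names case-insensitively.
        self._committed: Dict[Identifier, int] = {}
        self._cache: Dict[Identifier, ToyTable] = {}

    @staticmethod
    def _canonical(ident: Identifier) -> Identifier:
        return tuple(part.lower() for part in ident)

    def _key(self, ident: Identifier) -> Identifier:
        # The cache key keeps the caller's casing unless told to normalize it.
        return ident if self.case_sensitive else self._canonical(ident)

    def load_table(self, ident: Identifier) -> ToyTable:
        key = self._key(ident)
        if key not in self._cache:
            current = self._committed.get(self._canonical(ident), 0)
            self._cache[key] = ToyTable(current)
        return self._cache[key]

    def commit(self, ident: Identifier, snapshot_id: int) -> None:
        self._committed[self._canonical(ident)] = snapshot_id
        # A commit refreshes only the cache entry it was made through.
        self.load_table(ident).snapshot_id = snapshot_id


def snapshot_seen_after_write(case_sensitive: bool) -> int:
    catalog = ToyCachingCatalog(case_sensitive)
    catalog.load_table(("DB", "EVENTS"))      # read with upper-case name
    catalog.commit(("db", "events"), 1)       # write with lower-case name
    return catalog.load_table(("DB", "EVENTS")).snapshot_id


print(snapshot_seen_after_write(case_sensitive=True))   # 0: stale entry under a second key
print(snapshot_seen_after_write(case_sensitive=False))  # 1: both names share one entry
```

   With case-sensitive keys, the write refreshes the entry for the lower-case name while the read keeps hitting the older entry stored under the upper-case name. Normalizing the key makes both names resolve to the same entry.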
   
   ### Query engine
   
   Spark


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]