Dear Apache Iceberg Community,

I hope this message finds you well. I’m writing to discuss a proposed
improvement to caching strategies in the Iceberg-Spark integration, as outlined
in Issue #14417.

Background and Problem:

The current caching behaviour in Iceberg’s Spark integration uses
expireAfterAccess semantics: every read resets an entry’s expiry timer. In
long-running Structured Streaming jobs that touch the same table on every
micro-batch, the entry therefore never expires, so Iceberg metadata and table
data are never refreshed. This poses challenges for workloads that require
up-to-date reference data, such as stream-to-static joins.

In such cases, frequently accessed tables remain in the cache indefinitely and
keep reflecting stale snapshots. The only workaround today is to disable
caching entirely, which adds significant overhead because metadata is reloaded
frequently during micro-batches.

For example, in a Spark Structured Streaming job with continuous Kafka input 
and joins against slowly evolving reference data, updates to Iceberg tables are 
not reflected unless caching is disabled. While this ensures data freshness, it 
introduces performance bottlenecks due to repeated table reloads.
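
To make the difference concrete, here is a minimal, self-contained sketch of
the two expiry policies. It is not Iceberg's cache: `CachePolicySketch`,
`TinyCache`, `ExpiryPolicy`, and `countLoads` are hypothetical names, and the
clock is simulated so the example runs without sleeping. It assumes a 5-second
TTL and one read per second, as a stand-in for a micro-batch hitting the same
reference table.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.function.LongSupplier;

// Illustrative only: every name here is hypothetical, not part of Iceberg.
final class CachePolicySketch {
  enum ExpiryPolicy { EXPIRE_AFTER_ACCESS, EXPIRE_AFTER_WRITE }

  static final class TinyCache<K, V> {
    private static final class Entry<V> {
      final V value;
      long deadlineMs;

      Entry(V value, long deadlineMs) {
        this.value = value;
        this.deadlineMs = deadlineMs;
      }
    }

    private final Map<K, Entry<V>> entries = new HashMap<>();
    private final ExpiryPolicy policy;
    private final long ttlMs;
    private final LongSupplier clockMs; // injectable clock for the simulation

    TinyCache(ExpiryPolicy policy, long ttlMs, LongSupplier clockMs) {
      this.policy = policy;
      this.ttlMs = ttlMs;
      this.clockMs = clockMs;
    }

    V get(K key, Function<K, V> loader) {
      long now = clockMs.getAsLong();
      Entry<V> entry = entries.get(key);
      if (entry == null || now >= entry.deadlineMs) {
        // Missing or expired: reload and start a fresh TTL window.
        entry = new Entry<>(loader.apply(key), now + ttlMs);
        entries.put(key, entry);
      } else if (policy == ExpiryPolicy.EXPIRE_AFTER_ACCESS) {
        // A read pushes the deadline out, so a hot key never expires.
        entry.deadlineMs = now + ttlMs;
      }
      return entry.value;
    }
  }

  // Reads the same key once per simulated second and counts table loads.
  static int countLoads(ExpiryPolicy policy, int reads) {
    long[] nowMs = {0L};
    int[] loads = {0};
    TinyCache<String, Integer> cache =
        new TinyCache<>(policy, 5_000L, () -> nowMs[0]);
    for (int i = 0; i < reads; i++) {
      cache.get("db.ref_table", key -> ++loads[0]);
      nowMs[0] += 1_000L;
    }
    return loads[0];
  }

  public static void main(String[] args) {
    System.out.println("access: " + countLoads(ExpiryPolicy.EXPIRE_AFTER_ACCESS, 20));
    System.out.println("write:  " + countLoads(ExpiryPolicy.EXPIRE_AFTER_WRITE, 20));
  }
}
```

Over 20 simulated seconds the access-based policy loads the table once and
never again, while the write-based policy reloads it every 5 seconds (4 loads
in total). That is the trade-off described above: no refresh at all versus a
bounded, predictable refresh cost.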

Proposed Solution:

To address this issue, I propose making the cache expiration strategy
configurable. This could include:

1. Allowing users to choose between expireAfterAccess and expireAfterWrite for 
both catalogue and executor caches.
2. Implementing a smarter refresh mechanism that detects changes in table 
metadata or snapshots and refreshes the cache.

This flexibility would enable users to balance performance and data freshness, 
aligning Iceberg’s caching capabilities with other data lake formats like Delta 
Lake.
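
As a rough illustration of the second option (change detection), the sketch
below reloads a cached table only when its current snapshot ID differs from
the cached one. All names (`SnapshotRefreshSketch`, `SnapshotAwareCache`,
`currentSnapshotId`, `fullLoad`) are hypothetical and not part of Iceberg or
of the pull request. It also assumes that a cheap way to look up the current
snapshot ID exists; in practice that check has its own cost.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Function;
import java.util.function.ToLongFunction;

// Illustrative only: every name here is hypothetical, not part of Iceberg.
final class SnapshotRefreshSketch {
  static final class SnapshotAwareCache<T> {
    private static final class Cached<T> {
      final long snapshotId;
      final T table;

      Cached(long snapshotId, T table) {
        this.snapshotId = snapshotId;
        this.table = table;
      }
    }

    private final Map<String, Cached<T>> entries = new ConcurrentHashMap<>();
    private final ToLongFunction<String> currentSnapshotId; // assumed cheap check
    private final Function<String, T> fullLoad;             // expensive reload

    SnapshotAwareCache(
        ToLongFunction<String> currentSnapshotId, Function<String, T> fullLoad) {
      this.currentSnapshotId = currentSnapshotId;
      this.fullLoad = fullLoad;
    }

    T get(String tableName) {
      long latest = currentSnapshotId.applyAsLong(tableName);
      Cached<T> cached = entries.get(tableName);
      if (cached == null || cached.snapshotId != latest) {
        // First access or snapshot changed: pay for a full reload.
        // The check-then-put is not atomic; acceptable for a sketch.
        cached = new Cached<>(latest, fullLoad.apply(tableName));
        entries.put(tableName, cached);
      }
      return cached.table;
    }
  }

  public static void main(String[] args) {
    AtomicLong snapshot = new AtomicLong(1L);
    AtomicLong loads = new AtomicLong();
    SnapshotAwareCache<String> cache = new SnapshotAwareCache<>(
        name -> snapshot.get(),
        name -> {
          loads.incrementAndGet();
          return name + "@" + snapshot.get();
        });

    cache.get("db.ref_table");
    cache.get("db.ref_table"); // same snapshot: served from cache
    snapshot.set(2L);          // simulate a commit to the table
    String refreshed = cache.get("db.ref_table");
    System.out.println(refreshed + " after " + loads.get() + " loads");
  }
}
```

Running `main` prints `db.ref_table@2 after 2 loads`: repeated reads of an
unchanged table cost only the snapshot check, and a new commit triggers
exactly one reload.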

Benefits:

- Improved support for long-running structured streaming jobs that rely on 
up-to-date reference data.
- Reduced overhead from unnecessary metadata fetches.
- Greater flexibility to meet diverse caching requirements.
- Enhanced user experience by addressing caching limitations in the current
implementation.

I’ve also submitted a corresponding pull request (#14440) to implement the
cachePolicy feature, which introduces support for `EXPIRE_AFTER_WRITE` and
`EXPIRE_AFTER_ACCESS` strategies.

I kindly invite the community to review and provide feedback on both the issue
and the pull request. Your insights and guidance would be invaluable in
refining and advancing this feature. Feel free to join the discussion on the
GitHub issue or reply to this email. I look forward to collaborating with you!


Kind regards,
Hossein
