Thanks for the thoughtful discussion. Given that no consistency issue and a rare memory leak, I think we could remove #761 as a 1.0 blocker. We can continue to work on #1339, which is a good improvement, as well as providing better structure and clear message that EntityCache and Resolver should be used together for correctness.
Yufei On Fri, May 2, 2025 at 1:25 AM Pierre Laporte <pie...@pingtimeout.fr> wrote: > On Fri, May 2, 2025 at 7:01 AM Dennis Huo <huoi...@gmail.com> wrote: > > > Hello all, > > > > Following up from today's community sync where we were discussing 1.0 > > blockers, I just wanted to finalize discussion on > > https://github.com/apache/polaris/issues/761 to check if everyone's > > aligned > > on the current state of things. > > > > My understanding is that the originally reported issues regarding > > concurrency problems for the EntityCache as a standalone class and its > > internal lack of enforcement of cohesion between the "byId" and "byName" > > lookups is not actually a concurrency issue for the overall system > because > > the design of EntityCache today is simply tightly coupled to > Resolver.java. > > > Hi Dennis > > Indeed, most of the discussion focused on the correctness of the > EntityCache under concurrency. In my experience working on the EntityCache > code, it is because that's where the reported issues happen. All those > issues arise from the lack of cohesion between the internal by id and by > name internal caches: > > - Returning a different version of the entity depending on how it is > accessed > - Returning a different entity using the by-name lookup, in the event of > renames > - Incorrectly removing a different entity by name in the event of > evictions/deletions and renames > - Leaking memory in the by-name cache in the event of > evictions/deletions and renames > > The core of the problem comes from the combination of using two distinct > primary keys to access an entity (id and name) and the second key being > mutable (the entity can be renamed). > > > > In particular, the main way to reason about where the Resolver gets its > own > > internal guarantees about monotonicity of increasing versions it sees is > in > > the "bulkValidate" method ( > > > > > https://github.com/apache/polaris/blob/3603fa317ad8313f178c09137965bc0e5c7f256d/polaris-core/src/main/java/org/apache/polaris/core/persistence/resolver/Resolver.java#L545 > > ) > > where the "loadEntitiesChangeTracking" comes directly from the underlying > > persistence store to supply the "minVersions" subsequently requested from > > the EntityCache. > > > > One maybe naive question though: the EntityCache cannot be trusted as a > regular Cache implementation, and all valid cache lookups must be preceded > by a database query. Doesn't it mean that the EntityCache brings none of > the benefits one could reasonably expect from a cache but all of its > downside (eviction concerns, staleness, memory usage)? > > I think we also agreed that using an IndexedCache to provide cohesion > > between byId and byName would be nice-to-have but in this situation not a > > strict requirement for correctness, and if we do encounter any challenges > > modifying the EntityCache to use an IndexedCache, we should probably take > > our time on it instead of rushing it. > > > > nit: Isn't it rather that correctness is not needed by the Resolver code? > > Because in my opinion, reworking the EntityCache is the only way to get > correctness. Either using an IndexedCache or a Cache+HashBiMap, as > currently is in https://github.com/apache/polaris/pull/1339. > > > > We discussed that a followup might be to change make EntityCache > > package-private in a separate package with Resolver to better make it > clear > > that the EntityCache isn't intended to be used independently. > > > > For now though, are there any remaining concerns about removing the > > "1.0-blocker" label from https://github.com/apache/polaris/issues/761 ? > > > > If the Resolver is able to deal with the correctness issues, then the only > concern that could possibly remain is the memory leak. If should be rare > though, so would it make this qualify just as a major issue and not a > blocker for 1.0? > > If the "blocker" qualifier remains, note that the proposed fix in > https://github.com/apache/polaris/pull/1339 is complete and passes all the > tests. I just need to rebase `main` because if does not yet include > https://github.com/apache/polaris/pull/1193/files. > > Cheers > -- > > Pierre >