> One maybe naive question though: the EntityCache cannot be trusted as a
> regular Cache implementation, and all valid cache lookups must be preceded
> by a database query.  Doesn't it mean that the EntityCache brings none of
> the benefits one could reasonably expect from a cache but all of its
> downside (eviction concerns, staleness, memory usage)?


Right, perhaps the class itself should be renamed, as it doesn't exactly
function as "pure" cache in the traditional sense where one might either
need staleness and cross-server decoherence tolerance, or else distributed
active-invalidation on write-through.

Instead, the behavior is more similar to something like "Freshness-aware
table loading" in Iceberg https://github.com/apache/iceberg/pull/11946

The main gains come from:

1. Lookup of entityVersions/grantRecordsVersions is more efficient than
fetching entire entities and their grants
    -Version lookups are uniformly sized records, so planning batches of
these is easy, compared to entity contents where the sizes are
user-dependent
    -Especially grant lookup will involve a range scan unless someone
implements grants as modifying a single monolithic "list" record
2. Lookup of entityVersions/grantRecordsVersions has batch APIs in
MetaStoreManager (we could also have bulk lookups of ResolvedPolarisEntity,
but this again relates to request-sizing issues of small uniform-sized
records vs large heterogeneous ones)

nit: Isn't it rather that correctness is not needed by the Resolver code?
>
> Because in my opinion, reworking the EntityCache is the only way to get
> correctness. Either using an IndexedCache or a Cache+HashBiMap, as
> currently is in https://github.com/apache/polaris/pull/1339.


I think the main distinction is between what behaviors are explicitly
documented in the interface vs which are implicit assumptions that may or
may not be true.

For example, the original EntityCache doesn't make any explicit claims
about monotonicity of versions from multiple calls of the same "get"
methods, or consistency beween byId/byName lookups. But perhaps the method
names inadvertently imply such things. And of course, it's not obvious to
the reader of the code *why* the weaker guarantees are used along with
other mechanisms (e.g. loadEntitiesChangeTracking) to achieve system-level
"correctness".

If we do tighten up the semantics of the cache though it's worth also
assessing the intended use cases that plan to depend on those semantics to
see if there are broader problems. In particular, the
loadEntitiesChangeTracking flow in the Resolver may become important for
getting a "snapshot read" behavior once we tighten up consistency of
hierarchical policies and grants that are inherited through the parent
namespace hierarchy.

If there are objections to [1339], I believe we can still clarify the
> situation by refactoring the cache code to avoid any doubt that it is used
> only within the Resolver calls paths. Perhaps we could rename it to
> ResolverCache and make it a single-purpose class without a public
> interface... These are just some ideas, I have not given this option a lot
> of thought yet as I personally prefer fixing the cache class [1339].


Yeah, it seems a worthy goal to have tighter guarantees in the cache. I
added some comments to https://github.com/apache/polaris/pull/1339 around
some behavioral changes we should talk about:

1. Getting rid of the "refreshResolvedEntity" partial-loading optimization
does have a vulnerability to a plausible real-world use case (Iceberg
tables with lots of grants on them that rarely change, but where the tables
themselves are frequently updated) - explained in
https://github.com/apache/polaris/pull/1339#discussion_r2072293174
2. Realm-wide lock contention (per-server) - normally I wouldn't worry
about even a big shared lock if it's just for doing some in-memory data
structure operations, but holding the lock across the MetaStoreManager call
has some pathological cases that impact certain distributed architectures
badly: https://github.com/apache/polaris/pull/1339#discussion_r2072295482

Those can probably just have some configurable options and can be fixed in
code directly.

One part that I'm a little less clear about is how we intend to preserve
cohesion between `byName` and `byId` as long as the Cache's native
"time-based" removal is capable of triggering a removal without cooperative
with the new `writeLock` introduced.

It seems to me we'd actually need a way to force the Caffeine Cache to
"cooperate" with our outer-layer locking for *anything* it does. Like if
the Cache defined something like "addAtomicRemovalTask" that cooperates
with a chosen lock to perform both the cache removal and the "extra task"
as a single unit.

I guess the latter is effectively just going back to trying to use
IndexedCache to force the separate lookup to be an inherent part of the
Cache interface.

On Fri, May 2, 2025 at 4:52 PM Dmitri Bourlatchkov <di...@apache.org> wrote:

> Thanks for the detailed analysis, Dennis!
>
> I tend to agree that issue [761] is not a blocker for 1.0.
>
> However, I'd prefer not to close [761] because, while it may not pose an
> immediate problem in runtime behaviour, it did cause quite a bit of
> misunderstanding and confusion for a few developers working with polaris
> code. So I believe we need to address it.
>
> I'd encourage reviews of Pierre's proposed fix [1339]. If it is acceptable
> to the community, I think it would be a good resolution for issue [761] as
> it makes the EntityCache class consistent in itself, without having to
> consider how it is used by the Resolver.
>
> If there are objections to [1339], I believe we can still clarify the
> situation by refactoring the cache code to avoid any doubt that it is used
> only within the Resolver calls paths. Perhaps we could rename it to
> ResolverCache and make it a single-purpose class without a public
> interface... These are just some ideas, I have not given this option a lot
> of thought yet as I personally prefer fixing the cache class [1339].
>
> Cheers,
> Dmitri.
>
> [761] https://github.com/apache/polaris/issues/761
> [1339] https://github.com/apache/polaris/pull/1339
>
> On Fri, May 2, 2025 at 1:02 AM Dennis Huo <huoi...@gmail.com> wrote:
>
> > Hello all,
> >
> > Following up from today's community sync where we were discussing 1.0
> > blockers, I just wanted to finalize discussion on
> > https://github.com/apache/polaris/issues/761 to check if everyone's
> > aligned
> > on the current state of things.
> >
> > My understanding is that the originally reported issues regarding
> > concurrency problems for the EntityCache as a standalone class and its
> > internal lack of enforcement of cohesion between the "byId" and "byName"
> > lookups is not actually a concurrency issue for the overall system
> because
> > the design of EntityCache today is simply tightly coupled to
> Resolver.java.
> >
> > In particular, the main way to reason about where the Resolver gets its
> own
> > internal guarantees about monotonicity of increasing versions it sees is
> in
> > the "bulkValidate" method (
> >
> >
> https://github.com/apache/polaris/blob/3603fa317ad8313f178c09137965bc0e5c7f256d/polaris-core/src/main/java/org/apache/polaris/core/persistence/resolver/Resolver.java#L545
> > )
> > where the "loadEntitiesChangeTracking" comes directly from the underlying
> > persistence store to supply the "minVersions" subsequently requested from
> > the EntityCache.
> >
> > I think we also agreed that using an IndexedCache to provide cohesion
> > between byId and byName would be nice-to-have but in this situation not a
> > strict requirement for correctness, and if we do encounter any challenges
> > modifying the EntityCache to use an IndexedCache, we should probably take
> > our time on it instead of rushing it.
> >
> > We discussed that a followup might be to change make EntityCache
> > package-private in a separate package with Resolver to better make it
> clear
> > that the EntityCache isn't intended to be used independently.
> >
> > For now though, are there any remaining concerns about removing the
> > "1.0-blocker" label from https://github.com/apache/polaris/issues/761 ?
> >
> > Cheers,
> > Dennis
> >
>

Reply via email to