[DocumentNodeStore] introduce new cache for non existence of properties in previous documents

Stefan Egli Thu, 24 Oct 2024 04:04:03 -0700

Hi,

I'm looking for opinions on OAK-11184 [0] and in particular the current draft PR#1779 [1] and its suggestion to introduce a new, tiny cache.

What happens in OAK-11184 is this: when reading a node to which an all-new property (revision) was just added, and that new property revision is either (a) not yet committed or (b) not yet visible (for a peer cluster instance), it causes a complete scan of previous documents. The reason is that once the read finds a property key, it eagerly looks for the newest visible revision, and if that's not in the main document, it needs to scan all previous documents. That's how it works. For a new property, however, no previous document will have any revision for it yet, hence that scan ultimately finds nothing (or, if the property existed in earlier days, it will have been removed; otherwise the main document would have two revisions now and we'd not be in this situation).

The case "a) not yet committed" is easy to avoid: the read can easily detect such a situation (by checking the commit value) and then simply not scan previous documents, as the only possible case of a not yet committed revision being the only revision on the main document is exactly that of a new property (hence a previous document scan would find nothing and is not needed). So a) is easy.
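To illustrate the check for case a), here is a minimal sketch on a deliberately simplified model. The class, method and parameter names are hypothetical and are not Oak's actual API: revisions are plain strings, and "has a commit value" is modelled as membership in a set.

```java
import java.util.Set;
import java.util.SortedMap;

/** Hypothetical, simplified model; none of these names are Oak's real API. */
class NewPropertyShortcut {

    /**
     * Returns true if scanning previous documents can be skipped for a property:
     * the main document holds exactly one revision for it, and that revision
     * has no commit value yet (modelled here as "not in committedRevisions").
     */
    static boolean canSkipPreviousDocs(SortedMap<String, String> mainDocRevisions,
                                       Set<String> committedRevisions) {
        if (mainDocRevisions.size() != 1) {
            // zero or several revisions: not the "brand-new property" situation
            return false;
        }
        String onlyRevision = mainDocRevisions.firstKey();
        // the single revision is uncommitted, so per the reasoning above the
        // property must be new and no previous document can contain it
        return !committedRevisions.contains(onlyRevision);
    }
}
```

The sketch only covers the uncommitted case; it deliberately says nothing about case b), where the single revision is committed but not visible to the reader.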

The difficult case is b), as b) can also happen when reading with a checkpoint. So in this case it can't statically know that it is a new property case, and there is no easy way out of this.


So without static detection, the currently known solutions are:

1) introduce a (tiny) cache that records when a scan of previous documents for a particular property found no revisions whatsoever (not even invisible, uncommitted ones). That means the scan still has to be done once, unfortunately, and that might still take a certain time - but after that, any subsequent similar lookup is avoidable, as the cache can be used. (The cache keys and values are actually tiny: it can use just the previous document id and the property key, that's it.)

2) another approach is to always add two revisions when adding a new property: the first revision would be a newly introduced one, an always-visible revision mapping to null. The second revision would be the same as before: the current revision mapping to the actual new value. When a cluster node then reads that second revision (which it does first, as it orders in reverse) and notices that it's not yet visible, it would fall back to the first revision, which would always be visible and map to null (kind of like "_deleted:false"). Thus it can avoid previous document scans. This solution seems faster - but it is a change in the stored data and uses a bit more space, which ultimately also requires garbage collection. But it is somewhat intriguing.
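As a rough illustration of approach 1), the following sketch shows what such a negative cache could look like. It is not the code from PR#1779: the class name, method names and the LRU eviction policy are assumptions made for the example, and it assumes a cached "no revisions" fact stays valid for as long as the entry lives.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; names and eviction policy are not taken from Oak. */
class PrevNoPropCache {

    // key: (previous document id, property key); the value is only a marker
    private final Map<List<String>, Boolean> entries;

    PrevNoPropCache(final int maxEntries) {
        // access-ordered LinkedHashMap that evicts the least recently used entry
        this.entries = new LinkedHashMap<List<String>, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<List<String>, Boolean> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /** Remember that a scan of this previous document found no revision at all. */
    synchronized void recordNoRevisions(String prevDocId, String propertyKey) {
        entries.put(List.of(prevDocId, propertyKey), Boolean.TRUE);
    }

    /** True if an earlier scan already established that nothing is there. */
    synchronized boolean knownToHaveNoRevisions(String prevDocId, String propertyKey) {
        return entries.get(List.of(prevDocId, propertyKey)) != null;
    }
}
```

A reader would consult knownToHaveNoRevisions before scanning a previous document and call recordNoRevisions only after a scan that found nothing, not even an uncommitted or invisible revision.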


For now in my PR [1] I went for approach 1)

But knowing the delicacy of introducing a new cache, I wanted to hear and discuss other opinions from this group.


Thanks,
Cheers,
Stefan
--
[0] https://issues.apache.org/jira/browse/OAK-11184
[1] https://github.com/apache/jackrabbit-oak/pull/1779
