Stefan Egli commented on OAK-4581:

I'd like to move this ticket forward and believe we need a few decisions on the 

h4. I - Who to persist for
There are different possibilities as to where the persisted queue should sit:
h5. A - BackgroundObserver
In this class of solutions the BackgroundObserver's queue is persisted, which 
can logically only be based on {{NodeState}}. It thus supports any type of 
downstream Observer, including NodeObserver etc. Being based on Observer, it 
requires GC prevention.
Here's a list of concrete subvariants:
h6. 1 - store serialized root state
This seamlessly serializes and stores {{NodeState}} objects. Later on they are 
read and used for diffing, which means the data must still be available to do 
the actual diff. This can be achieved by increasing the GC retention period one 
way or another. What's also important here is that the caches aren't polluted 
with these late diffs, i.e. they should probably not be stored in the cache in 
this late-delivery case.
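
To illustrate the GC dependency of this variant, here is a minimal sketch in plain Java. It does not use any Oak API: {{RootStateQueue}}, the string root ids and the {{resolver}} function are all hypothetical stand-ins for a persisted queue of serialized root states and for the NodeStore lookup that turns an id back into a state.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Optional;
import java.util.function.Function;

/**
 * Hypothetical queue that keeps only root-state identifiers (not an Oak class).
 * S stands in for whatever the store returns when an id is resolved.
 */
final class RootStateQueue<S> {
    // In a real implementation this queue would be persisted; here it is in memory.
    private final Deque<String> pendingIds = new ArrayDeque<>();
    // Assumed lookup: returns empty if the state behind the id was garbage collected.
    private final Function<String, Optional<S>> resolver;

    RootStateQueue(Function<String, Optional<S>> resolver) {
        this.resolver = resolver;
    }

    /** Records a new root state by id only. */
    void enqueue(String rootId) {
        pendingIds.addLast(rootId);
    }

    int size() {
        return pendingIds.size();
    }

    /**
     * Resolves the oldest queued id for diffing.
     * Throws NoSuchElementException if the queue is empty and
     * IllegalStateException if the store no longer has the state.
     */
    S poll() {
        String id = pendingIds.removeFirst();
        return resolver.apply(id).orElseThrow(() -> new IllegalStateException(
                "Root state " + id + " is no longer available (garbage collected?)"));
    }
}
```

The second failure mode is the reason the retention period has to cover the oldest entry still in the queue.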
h6. 2 - store serialized diff (and root state)
Besides serializing the {{NodeState}} this variant (also) stores the diff. This 
speeds up later diffing as the diff is then already there (it probably must be 
stored in 'a cache' temporarily, but only temporarily as it will only be used 
for one event, likely). This variant is still dependent on preventing GC 
though, as we're still on the Observer level, which works on {{NodeState}}.
h6. 3 - base it on the journal
Alternatively the journal is equipped with more diff-like information (perhaps 
the full diff, perhaps only part of it); also see OAK-4586. Otherwise this 
has the same characteristics as I-A-1 and I-A-2: GC must still be prevented, and 
we're still on the Observation/NodeState level. It will be implementation 
dependent, as Segment doesn't have the same type of journal as Document has.
h5. B - ChangeProcessor
In this class of solutions the queue is handled on the ChangeProcessor level 
(not in BackgroundObserver). It is thus no longer based on NodeState but 
independent of it; the format just must be suitable for calculation and later 
delivery via onEvent. Being independent of NodeState allows it to become 
independent of GC and cache-hotness issues. However, it's important to note 
that this class of solutions targets concrete EventListeners, not Observers in 
general!
h6. 1 - store serialized events
The ChangeProcessor calculates events as if for delivery to onEvent, but just 
persists the events as-is. This will bloat the amount of data stored and 
increase I/O. However, later delivery is trivial as all the events are already 
there: they just have to be read and passed to onEvent.
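
A rough sketch of what "persist the events, replay them later" could look like, in plain Java without Oak or JCR types. {{FlatEvent}}, {{FlatEventJournal}} and the in-memory byte buffer are hypothetical; a real implementation would write to durable storage and carry more fields (identifier, date, info map, ...).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.function.Consumer;

/** Hypothetical flat event, standing in for what onEvent would receive. */
final class FlatEvent {
    final int type;       // event type code (assumed to be an int constant)
    final String path;    // affected path
    final String userId;  // user taken from the commit

    FlatEvent(int type, String path, String userId) {
        this.type = type;
        this.path = path;
        this.userId = userId;
    }
}

/** Hypothetical append-only journal; the byte buffer stands in for a file. */
final class FlatEventJournal {
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    private final DataOutputStream out = new DataOutputStream(buffer);

    /** Appends one already-calculated event in a simple binary form. */
    void append(FlatEvent e) {
        try {
            out.writeInt(e.type);
            out.writeUTF(e.path);
            out.writeUTF(e.userId);
            out.flush();
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    /** Reads all stored events in order, hands each to the listener, returns the count. */
    int replay(Consumer<FlatEvent> listener) {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(buffer.toByteArray()));
        int count = 0;
        try {
            while (in.available() > 0) {
                int type = in.readInt();
                String path = in.readUTF();
                String userId = in.readUTF();
                listener.accept(new FlatEvent(type, path, userId));
                count++;
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return count;
    }
}
```

Note that every event repeats its full path and user id, which is where the extra storage and I/O of this variant comes from.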
h6. 2 - store serialized diff
The ChangeProcessor stores the serialized diff in a form that can later be 
processed by the EventFilter and result in events for delivery to 
EventListener.onEvent. (This would then be independent of the NodeState.)
h6. 3 - base it on the journal
If the journal contains the complete diff such that ChangeProcessor can 
evaluate the filters and deliver, then the journal could be enough (however 
that might be tricky to achieve). Also, this will be implementation dependent, 
as Segment doesn't have the same type of journal as Document has.
In any case, additionally the CommitInfo must be stored somewhere, either also 
in the Journal or per ChangeProcessor.
h4. II - Serializing CommitInfo
Not sure if we have many options here; I think it's just something we have to 
do. If Oak code prevents serialization, then we can fix it. If it's 
upper/application-layer code that causes problems, we can't do much other than 
log a warning.
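
As a sketch of the "serialize what we can, warn about the rest" approach: the following plain-Java helper takes an info map as it might be attached to a commit and keeps only the entries whose values implement {{Serializable}}. {{CommitInfoSanitizer}} is a hypothetical name, the map is an assumption about how the info would be exposed, and the warnings are collected in a list rather than sent to a real logger.

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not an Oak class) that filters a commit's info map. */
final class CommitInfoSanitizer {
    // Collected instead of logged so the sketch stays self-contained.
    final List<String> warnings = new ArrayList<>();

    /**
     * Returns the entries that can be persisted.
     * Values that do not implement Serializable (including null) are dropped
     * and a warning is recorded for each of them.
     */
    Map<String, Serializable> serializableEntries(Map<String, Object> info) {
        Map<String, Serializable> result = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : info.entrySet()) {
            Object value = entry.getValue();
            if (value instanceof Serializable) {
                result.put(entry.getKey(), (Serializable) value);
            } else {
                warnings.add("Dropping non-serializable CommitInfo entry: " + entry.getKey());
            }
        }
        return result;
    }
}
```

The {{instanceof}} check is only a first filter: a value can implement {{Serializable}} and still fail at write time if it references non-serializable objects, so the actual write would need the same warn-and-skip handling.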
h4. III - Storage Layer
This depends a bit on the actual solution chosen. If we base it e.g. on the 
journal, then a lot comes from there already. If we persist flat events, then 
surely extra storage is needed.
h5. A - Use a SegmentNodeStore
* would be straightforward but has issues as mentioned by Michael.

h5. B - Use internals of SegmentNodeStore, e.g. SegmentWriter
* might be much more optimal, but adds dependencies on internals of tarmk.

h5. C - store as JSON in a flat file

[~mduerig], [~chetanm], [~catholicon], [~tmueller], which variant should we 

> Persistent local journal for more reliable event generation
> -----------------------------------------------------------
>                 Key: OAK-4581
>                 URL: https://issues.apache.org/jira/browse/OAK-4581
>             Project: Jackrabbit Oak
>          Issue Type: New Feature
>          Components: core
>            Reporter: Chetan Mehrotra
>            Assignee: Stefan Egli
>              Labels: observation
>             Fix For: 1.6
>         Attachments: OAK-4581.v0.patch
> As discussed in OAK-2683 "hitting the observation queue limit" has multiple 
> drawbacks. Quite a bit of work is done to make diff generation faster. 
> However there are still chances of event queue getting filled up. 
> This issue is meant to implement a persistent event journal. Idea here being
> # NodeStore would push the diff into a persistent store via a synchronous 
> observer
> # Observers which are meant to handle such events in async way (by virtue of 
> being wrapped in BackgroundObserver) would instead pull the events from this 
> persisted journal
> h3. A - What is persisted
> h4. 1 - Serialized Root States and CommitInfo
> In this approach we just persist the root states in serialized form. 
> * DocumentNodeStore - This means storing the root revision vector
> * SegmentNodeStore - {color:red}Q1 - What does the serialized form of the 
> SegmentNodeStore root state look like{color} - Possibly the RecordId of the 
> "root" state
> Note that with OAK-4528 DocumentNodeStore can rely on persisted remote 
> journal to determine the affected paths. Which reduces the need for 
> persisting complete diff locally.
> Event generation logic would then "deserialize" the persisted root states and 
> then generate the diff as currently done via NodeState comparison
> h4. 2 - Serialized commit diff and CommitInfo
> In this approach we can save the diff in JSOP form. The diff only contains 
> information about affected paths, similar to what is currently stored in the 
> DocumentNodeStore journal
> h4. CommitInfo
> The commit info would also need to be serialized. So it needs to be ensured 
> that whatever is stored there can be serialized or recalculated
> h3. B - How it is persisted
> h4. 1 - Use a secondary segment NodeStore
> OAK-4180 makes use of SegmentNodeStore as a secondary store for caching. 
> [~mreutegg] suggested that for persisted local journal we can also utilize a 
> SegmentNodeStore instance. Care needs to be taken for compaction. Either via 
> generation approach or relying on online compaction
> h4. 2 - Make use of write ahead log implementations
> [~ianeboston] suggested that we can make use of some write ahead log 
> implementation like [1], [2] or [3]
> h3. C - How changes get pulled
> Some points to consider for event generation logic
> # Would need a way to keep pointers to journal entries on a per-listener basis. 
> This would allow each Listener to "pull" content changes and generate the diff 
> at its own speed while keeping in-memory overhead low
> # The journal should survive restarts
> [1] http://www.mapdb.org/javadoc/latest/mapdb/org/mapdb/WriteAheadLog.html
> [2] 
> https://github.com/apache/activemq/tree/master/activemq-kahadb-store/src/main/java/org/apache/activemq/store/kahadb/disk/journal
> [3] 
> https://github.com/elastic/elasticsearch/tree/master/core/src/main/java/org/elasticsearch/index/translog
