[ https://issues.apache.org/jira/browse/OAK-4581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15506514#comment-15506514 ]
Stefan Egli commented on OAK-4581:
----------------------------------

I'd like to move this ticket forward and believe we need a few decisions on the approach:

h4. I - Who to persist for

There are different possibilities as to where the persisted queue should sit:

h5. A - BackgroundObserver

In this class of solutions the BackgroundObserver's queue is persisted, and that can logically only be based on {{NodeState}}. This will thus support any type of downstream Observer, including NodeObserver etc. Being based on Observer, it requires GC-prevention. Here's a list of concrete subvariants:

h6. 1 - store serialized root state

This seamlessly serializes and stores {{NodeState}} objects. Later on they are read and used for diffing, which means the data must still be available to do the actual diff. This can be achieved by increasing the GC retention period one way or another. What's also important here is that the caches aren't polluted with these late diffs, i.e. they should probably not be stored in the cache in this late-delivery case.

h6. 2 - store serialized diff (and root state)

Besides serializing the {{NodeState}}, this variant also stores the diff. This speeds up later diffing as the diff is then already there (it probably must be stored in 'a cache' temporarily, but only temporarily, as it will likely be used for only one event). This variant is still dependent on preventing GC though, as we're still on the Observer level, which works on {{NodeState}}.

h6. 3 - base it on the journal

Alternatively the journal is equipped with more diff-like information (perhaps the full diff, perhaps only part of it); also see OAK-4586. Otherwise this has the same characteristics as I-A-1 and I-A-2: GC must still be prevented, and we're still on the Observation/NodeState level. It will be implementation dependent, as Segment doesn't have the same type of journal as Document has.

h5. B - ChangeProcessor

In this class of solutions the queue is handled on the ChangeProcessor level (not in BackgroundObserver), thus no longer based on NodeState but independent of it; the format just must be suitable for calculation and later delivery via onEvent. Being independent of NodeState allows us to become independent of GC and cache-hotness issues. However, it's important to note that this class of solutions targets concrete EventListeners, not Observers in general!

h6. 1 - store serialized events

The ChangeProcessor calculates events as if for delivery to onEvent, but just persists the events as is. This will bloat the amount of data stored and increase I/O. However, later delivery is trivial as all the events are already there; they just have to be read and onEvent called.

h6. 2 - store serialized diff

The ChangeProcessor stores the serialized diff in a form that can later be processed by the EventFilter and result in events for delivery to EventListener.onEvent. (This would then be independent of the NodeState.)

h6. 3 - base it on the journal

If the journal contains the complete diff such that ChangeProcessor can evaluate the filters and deliver, then the journal could be enough (however that might be tricky to achieve). Also, this will be implementation dependent, as Segment doesn't have the same type of journal as Document has.

In any case, additionally the CommitInfo must be stored somewhere, either also in the Journal or per ChangeProcessor.

h4. II - Serializing CommitInfo

Not sure if we have many options here, I think it's just something we have to do. And if Oak code prevents serialization, then we can fix it. If it's upper/application-layer code that causes problems, we can't do much other than issue a warning.

h4. III - Storage Layer

This depends a bit on the actual solution chosen. If we base it e.g. on the journal, then a lot comes from there already. If we persist flat events, then surely an extra storage is needed.

h5. A - Use a SegmentNodeStore
* would be straightforward but has issues as mentioned by Michael.

h5. B - Use internals of SegmentNodeStore, e.g. SegmentWriter
* might be much more optimal, but adds dependencies on internals of tarmk.

h5. C - store as JSON in a flat file

[~mduerig], [~chetanm], [~catholicon], [~tmueller], which variant should we implement?

> Persistent local journal for more reliable event generation
> -----------------------------------------------------------
>
> Key: OAK-4581
> URL: https://issues.apache.org/jira/browse/OAK-4581
> Project: Jackrabbit Oak
> Issue Type: New Feature
> Components: core
> Reporter: Chetan Mehrotra
> Assignee: Stefan Egli
> Labels: observation
> Fix For: 1.6
>
> Attachments: OAK-4581.v0.patch
>
>
> As discussed in OAK-2683, "hitting the observation queue limit" has multiple drawbacks. Quite a bit of work has been done to make diff generation faster. However, there is still a chance of the event queue filling up.
> This issue is meant to implement a persistent event journal. The idea being:
> # NodeStore would push the diff into a persistent store via a synchronous observer
> # Observers which are meant to handle such events asynchronously (by virtue of being wrapped in BackgroundObserver) would instead pull the events from this persisted journal
> h3. A - What is persisted
> h4. 1 - Serialized Root States and CommitInfo
> In this approach we just persist the root states in serialized form.
> * DocumentNodeStore - This means storing the root revision vector
> * SegmentNodeStore - {color:red}Q1 - What does the serialized form of a SegmentNodeStore root state look like{color} - Possibly the RecordId of the "root" state
> Note that with OAK-4528 DocumentNodeStore can rely on the persisted remote journal to determine the affected paths, which reduces the need for persisting the complete diff locally.
> Event generation logic would then "deserialize" the persisted root states and then generate the diff as currently done via NodeState comparison
> h4. 2 - Serialized commit diff and CommitInfo
> In this approach we can save the diff in JSOP form. The diff only contains information about the affected paths, similar to what is currently being stored in the DocumentNodeStore journal
> h4. CommitInfo
> The commit info would also need to be serialized. So it needs to be ensured that whatever is stored there can be serialized or recalculated
> h3. B - How it is persisted
> h4. 1 - Use a secondary segment NodeStore
> OAK-4180 makes use of SegmentNodeStore as a secondary store for caching. [~mreutegg] suggested that for the persisted local journal we can also utilize a SegmentNodeStore instance. Care needs to be taken for compaction, either via a generation approach or by relying on online compaction
> h4. 2 - Make use of write ahead log implementations
> [~ianeboston] suggested that we can make use of some write ahead log implementation like [1], [2] or [3]
> h3. C - How changes get pulled
> Some points to consider for event generation logic
> # Would need a way to keep pointers to journal entries on a per-listener basis. This would allow each Listener to "pull" content changes and generate the diff at its own speed while keeping in-memory overhead low
> # The journal should survive restarts
> [1] http://www.mapdb.org/javadoc/latest/mapdb/org/mapdb/WriteAheadLog.html
> [2] https://github.com/apache/activemq/tree/master/activemq-kahadb-store/src/main/java/org/apache/activemq/store/kahadb/disk/journal
> [3] https://github.com/elastic/elasticsearch/tree/master/core/src/main/java/org/elasticsearch/index/translog
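To make the "How changes get pulled" points from the issue description a bit more concrete, here is a minimal sketch of a journal that survives restarts and keeps one pointer per listener. Everything in it is an assumption for illustration only: the class name LocalJournalSketch, the methods append/pull/ack, the flat-file layout (one entry per line, one cursor file per listener) and the use of plain strings as entries are hypothetical and not Oak API. It also re-reads the whole journal file on every pull, which a real implementation would avoid.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch, not Oak API: an append-only journal file plus one
// persisted cursor file per listener, both kept under a single directory.
class LocalJournalSketch {
    private final Path journalFile;
    private final Path cursorDir;

    LocalJournalSketch(Path dir) {
        try {
            cursorDir = Files.createDirectories(dir.resolve("cursors"));
            journalFile = dir.resolve("journal.log");
            if (!Files.exists(journalFile)) {
                Files.createFile(journalFile);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Would be called by the synchronous observer: appends one entry
    // (e.g. a serialized diff or root state reference) as a single line.
    synchronized void append(String entry) {
        if (entry.indexOf('\n') >= 0) {
            throw new IllegalArgumentException("entry must be a single line");
        }
        try {
            Files.write(journalFile, Collections.singletonList(entry),
                    StandardCharsets.UTF_8, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns up to max entries that the given listener has not acknowledged yet.
    // Does not move the listener's cursor.
    synchronized List<String> pull(String listenerId, int max) {
        try {
            List<String> all = Files.readAllLines(journalFile, StandardCharsets.UTF_8);
            int from = Math.min(readCursor(listenerId), all.size());
            int to = Math.min(from + max, all.size());
            return new ArrayList<>(all.subList(from, to));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Advances and persists the listener's cursor after count entries were processed.
    synchronized void ack(String listenerId, int count) {
        try {
            int next = readCursor(listenerId) + count;
            Files.write(cursorDir.resolve(listenerId),
                    Integer.toString(next).getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // A listener without a cursor file starts at the beginning of the journal.
    private int readCursor(String listenerId) throws IOException {
        Path file = cursorDir.resolve(listenerId);
        if (!Files.exists(file)) {
            return 0;
        }
        return Integer.parseInt(
                new String(Files.readAllBytes(file), StandardCharsets.UTF_8).trim());
    }
}
```

In this sketch the cursor only moves on ack, so a listener that stops between pull and ack would see the same entries again on the next pull. Whether redelivery like that is acceptable, and what exactly an entry contains (root state, diff or events), depends on the variant chosen above.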