On 18.6.12 23:16, Jukka Zitting wrote:
- ChangeSet is just a container carrying the trees as they were before and
after the change. So this is very close to the diffing approach you
describe, only a bit more explicit. Also, ChangeSet is the place where
additional information like change set meta data could live. I'm close to
certain that we will need something along these lines (i.e. userData,
timestamps, user who initiated that change, session id of the originating
session).
The reason why I worry about the ChangeSet concept is that it implies
that each commit() produces a separate ChangeSet that then gets
delivered to each observation listener for processing. This is
troublesome for two key reasons:
1) Performance: Consider a large cluster that supports lots of
concurrent writes hitting all cluster nodes. We should be able to
support at least hundreds or thousands of commits per second on such
systems, and ideally the only limit here would be the amount of
available hardware. With the ChangeSet concept each of those commits
would result in a separate waitForChanges() return value, which would
cause event queues to start growing indefinitely if any one of the
listeners can't keep up with the stream of incoming changes. The
poll+diff approach avoids that problem since a listener only sees the
combined set of changes across the polling interval.
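
To make the poll+diff idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration only: a tree is modelled as a plain path -> value map, and PollDiffSketch and diff() are hypothetical names, not Oak or MicroKernel API. It only shows that several commits between two polls collapse into a single diff.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: a "tree" is modelled as a path -> value map.
final class PollDiffSketch {

    // Describe how 'after' differs from 'before', one entry per changed path.
    static Map<String, String> diff(Map<String, String> before, Map<String, String> after) {
        Map<String, String> changes = new TreeMap<>();
        for (Map.Entry<String, String> e : after.entrySet()) {
            String old = before.get(e.getKey());
            if (old == null) {
                changes.put(e.getKey(), "added");
            } else if (!old.equals(e.getValue())) {
                changes.put(e.getKey(), "changed");
            }
        }
        for (String path : before.keySet()) {
            if (!after.containsKey(path)) {
                changes.put(path, "removed");
            }
        }
        return changes;
    }

    public static void main(String[] args) {
        // State the listener saw at its last poll.
        Map<String, String> lastSeen = Map.of("/a", "1", "/b", "1");

        // Three commits land before the next poll; only the head state matters.
        Map<String, String> head = new TreeMap<>(lastSeen);
        head.put("/a", "2");   // commit 1
        head.put("/c", "1");   // commit 2
        head.remove("/b");     // commit 3

        // The listener sees one combined diff, not three separate change sets.
        System.out.println(diff(lastSeen, head));
    }
}
```

The listener's work here depends on how much the tree changed between polls, not on how many commits produced that change. The cost is that per-commit information is no longer visible to the listener.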
There is nothing which implies "creating" a ChangeSet instance for each
commit. The change sets are just implicitly there and can be retrieved
with instances of ChangeSet created on the fly by calling
waitForChanges(). So no queues. Consumers which use the blocking feature
would just process a backlog which they'd consume at their own pace. The
backlog is *not* represented by a queue but by a position (the previous
parameter). Just like polling, except that the call would block if there
is no next change set yet.
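
A rough sketch of what "a position instead of a queue" could look like. All names and types here are assumptions for illustration: JournalSketch is hypothetical, trees are stand-in strings, and the position is a plain int revision number rather than the previous ChangeSet used in the actual proposal. This is not the Oak or MicroKernel API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the Oak or MicroKernel API.
final class JournalSketch {

    // A change set is only a view: the tree before and after one revision.
    static final class ChangeSet {
        final int revision;
        final String before;
        final String after;

        ChangeSet(int revision, String before, String after) {
            this.revision = revision;
            this.before = before;
            this.after = after;
        }
    }

    // Tree state per revision; the list index is the revision number.
    private final List<String> trees = new ArrayList<>();

    JournalSketch(String initialTree) {
        trees.add(initialTree);
    }

    // Record a new revision and wake up any blocked consumers.
    synchronized void commit(String newTree) {
        trees.add(newTree);
        notifyAll();
    }

    // Block until a revision after 'previous' exists, then build its
    // ChangeSet on the fly. The caller's backlog is just 'previous'.
    synchronized ChangeSet waitForChanges(int previous) {
        try {
            while (previous + 1 >= trees.size()) {
                wait();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return new ChangeSet(previous + 1, trees.get(previous), trees.get(previous + 1));
    }

    public static void main(String[] args) {
        JournalSketch journal = new JournalSketch("tree@0");
        journal.commit("tree@1");
        journal.commit("tree@2");

        int position = 0; // a slow consumer catching up at its own pace
        for (int i = 0; i < 2; i++) {
            ChangeSet cs = journal.waitForChanges(position);
            System.out.println(cs.before + " -> " + cs.after);
            position = cs.revision;
        }
    }
}
```

Nothing is stored per listener in this sketch: a consumer that falls behind simply holds an older position and gets each ChangeSet built when it asks for it.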
2) Linearity: Our overall design explicitly allows concurrent commits
that are only later merged together. This makes the concept of a
"previous" or "following" ChangeSet somewhat troublesome. You could
avoid that trouble by interpreting all concurrent commits from another
cluster node as a single merge ChangeSet, but then you already lose
per-commit metadata. Again the poll+diff approach avoids this problem
since it doesn't care how and from where changes entered the latest
visible state of the tree.
I see. In your scenario you would return all changes (i.e. the diff of
the trees) between the last poll and this poll. In my scenario polling
would just follow the entries in the Microkernel journal and return the
changes (again as diff of the trees) of the revisions therein.
My reasoning for this is - as I said earlier - that it allows us to
implement JCR journals (which have the concept of change sets through the
persist event) and also allows us to thread through userData and related
information.
In the clustered case I'd handle changes from a cluster sync like
changes from any other session. So to the end user changes occurring due
to cluster synchronisation do not look any different than changes made
by another session on the same instance.
Different cluster nodes would see a different linear order of events and
even different events. The end result however would be the same for all
of them.
Michael
- The approach aligns neatly with the JCR features: implement observation
using blocking calls and implement journalling by using non-blocking calls.
There's no concept of blocking calls for observation in JCR.
BR,
Jukka Zitting