On Sun, Feb 5, 2012 at 6:25 PM, Michael Dürig <mdue...@apache.org> wrote: > > > On 4.2.12 20:07, Stefan Guggisberg wrote: >> >> On Fri, Feb 3, 2012 at 7:44 PM, Michael Dürig<mdue...@apache.org> wrote: >>> >>> >>> Hi, >>> >>> While working on a new transient space implementation [1] I discovered >>> that >>> recreating a list of changes from transient modifications to the >>> hierarchy >>> is like opening a can of worms. Rethinking things it turned out, that it >>> is >>> much simpler to directly consolidate the change log. This approach is >>> much >>> more flexible and versatile since knowledge of the persisted hierarchy is >>> not necessary at all. Thus this approach should work equally well with >>> change logs for jr3, the Microkernel, Jackrabbit core, JCR and SPI. >>> >>> The algorithm is capable of creating minimal change logs from any valid >>> change log (that is from any change log that can be applied to some >>> hierarchy). >>> >>> I extensively tested the algorithm on randomly created change logs [3]. >>> This >>> resulted in some interesting numbers: consolidation results in an >>> decrease >>> of about 45% of these randomly generated change logs. Furthermore the >>> resulting size of the H2 database files *decreased up to 200%*! >>> >>> The implementation [2] is currently part of the jackrabbit-microkernel >>> module in the sandbox. However it does not have any crucial dependencies >>> so >>> it should be easy to adapt for other usages. >>> >>> The gory details of the algorithm are based on some algebraic properties >>> of >>> move operations on paths as detailed in the following paragraphs. >>> >>> The change log consolidation algorithm implemented in the reduce method >>> [2] >>> is based on algebraic properties of move operations on paths. The other >>> operations (add node, remove node and set property) are generalized to >>> move >>> operations. Consolidation relies on reduction and commutation rules of >>> move >>> operations. >>> >>> A move operation resembles a map on a hierarchy (of nodes and >>> properties). A >>> change log consisting of k move operations m_1 to m_k is thus the >>> composition of the individual moves: m_1 *, ..., * m_k (using * for >>> function >>> composition: f(g(x)) = (f * g)(x)). >>> >>> Some definitions, notation and propositions: >>> >>> * Let NIL denote a path which never occurs in any hierarchy. >>> >>> * Order on paths: let p, q be paths. >>> - p< q iff p != NIL and q != NIL and p is an ancestor of q. >>> - p<= q iff p< q or p == q != NIL >>> >>> * Conflict of paths: let p, q be paths. >>> - p ~ q (p conflicts with q) iff either p<= q or q<= p >>> >>> * Substitution in paths: let p, q, r be paths. >>> - [p -> q]r = r if p is not an ancestor of r and >>> - [p -> q]r = s where s is the path resulting from replacing >>> the ancestor p in r with q otherwise. >>> >>> * Let p, q be paths. Then p:q denotes a move operation where the >>> node at p is moved to a new node at q. >>> >>> * Valid moves: leq p, q be paths. >>> - p:q is valid iff p !~ q or p = q >>> - if p:q is not valid, it is invalid >>> >>> Invalid moves are exactly those which would result in a node being moved >>> to >>> an ancestor/descendant of itself. >>> >>> * Identity on moves: let p, q be paths. >>> - p:q = ID iff p = q. >>> >>> * Conflict on moves: let p, q, r, s be paths. >>> - p:q ~ r:s (p:q conflicts with r:s) iff either p ~ r or p ~ s >>> or q ~ r or q ~ s. >>> >>> * Strict commutativity of moves: let m, n be moves. >>> - m * n = n * m iff m !~ n >>> >>> * Substitutions in moves: let p, q, r, s be paths. >>> - [p -> q]r:s = [p -> q]r:[p -> q]s >>> >>> * Let p be a path and let +p denote an add node operation and let -p >>> denote a remove node operation for a node at path p. >>> - +p = NIL:p That is, adding a node is represented by a move from a >>> unknown source. >>> - p = p:NIL. That is, removing a node is represented by a move to an >>> unknown sink. >>> >>> >>> Let m = p:q, n = r:s with p, q, r, s != NIL be valid moves with m != ID >>> and >>> n != ID. Then the following reduction and commutation rules apply: >>> >>> 1. p!~ r: m * n = n * m >>> 2. p< r: illegal (since this implies q<= r which implies p ~ q and >>> thus m invalid) >>> 3. p = r: illegal (since this implies q<= r which implies p ~ q and >>> this m invalid) >>> 4. p> r: does not commute if q< s. Otherwise m * n = n * [r -> s]m >>> 5. p!~ s: m * n = n * m >>> 6. p< s: illegal (since this implies p ~ q and thus m invalid) >>> 7. p = s: does not commute >>> 8. p> s: illegal (since p> s implies there is an s already which >>> will conflict with r:s) >>> 9. q!~ r: m * n = n * m >>> 10. q< r: m * n = [q -> p]n * m >>> 11. q = r: m * n = p:s (transitivity of moves) >>> 12. q> r: m * n = n * [r -> s]m >>> 13. q!~ s: m * n = n * m >>> 14. q< s: does not commute if p> r. Otherwise m * n = [q -> p]n * m >>> 15. q = s: illegal (since s conflicts with r:s) >>> 16. q> s: illegal (since s conflicts with r:s) >>> >>> Allowing add node and remove node operations the following additional >>> conditions apply: >>> >>> Let m = p:q, n = r:s be valid moves with m != ID and n != ID. Then the >>> reduction and commutations rules 1. to 16. apply with extra conditions on >>> 4., 10., 12. and 14.: >>> >>> 4'. if s = NIL and q = NIL then m * n = -r. Otherwise if s = NIL then >>> m, n do not commute. >>> 10'. illegal if p = NIL >>> 12'. if s = NIL then m * n = -r * -p >>> 14'. illegal if p = NIL >> >> >> sounds very complicated ;) i admit that i don't >> fully understand it (ok, i didn't try very hard >> and gave up somewhere in the middle). > > > It is actually much simpler than it looks. Have a look at the code and the > inline comments for additional hints and my explanations below. > > >> could you please provide a simple example? > > > What I basically did is identify all cases of how moves, adds and removes > interact. This turned out to be much simpler when conceptually handling adds > as moves from 'nowhere' and deletes as moves to 'nowhere' (aka Trash) and > then specialising interaction rules for adds and deletes from there. > > One kind of interaction is reductions like > >>/a:/b, >/b:/c = >/a:/c > or > -/a, -/a/b = -/a > ... > > The other kind of interaction is commutation like > >>/a:/b, >/c:/d = >/c:/d, >/a:/b > or >> /a:/x, +/x/y:{} = +/a/y:{}, >/a:/x > ... > > With a complete list of such interactions (see 1. - 16. in my initial post) > it is possible to reduce a change log to its minimal equivalent by > subsequent application of reductions and commutations: Given a reduced > change list and a new operation inserted at index k, try to reduce it with > the operation at k - 1. If this fails commute it with the operation at k - 1 > and repeat at k - 1 until either a reduction has been found or commutation > fails. In the latter case do the same going forward (to k + 1). If still no > reduction could be found, we are done and the change list is minimized. > Otherwise, recursively apply the reduction algorithm treating the result of > the reduction as the new operation inserted into the rest of the change > list. > This is exactly what is implemented in reduce(List<Operation>, int index). > See > http://svn.apache.org/viewvc/jackrabbit/sandbox/jackrabbit-microkernel/src/main/java/org/apache/jackrabbit/state/ChangeLog.java?view=markup > > Example: >>/a:/x +/x/y:{} >/x:/b = (commutation of 1 and 2, rule 14.) > +/a/y:{} >/a:/x >/x:/b = (reduction of 2 and 3, rule 11.) > +/a/y:{} >/a:/b > > Some key observations which make reduce() work: > - The interaction rules do not depend on the persisted hierarchy: If a > change log can be applied to a hierarchy, its minimized equivalent can be > applied to the same hierarchy. More over the observable effect on that > hierarchy will be the same. > > - The reduce algorithm is minimizing since whether a reduction can occur or > not does not depend on 'where it occurs'. Above example could as well reduce > as follows: > >>/a:/x +/x/y:{} >/x:/b = (commutation of 2 and 3, rule 12.) >>/a:/x >/x:/b +/b/y = (reduction of 1 and 2, rule 11.) >>/a:/b +/b/y:{} > > >> what is the benefit of this approach compared >> to the imlementation in jackrabbit-core >> (which does consolidate transient changes)? > > > - Versatility: it is not bound to any implementation, nor does it need the > actual hierarchy (i.e. it is stateless). It is basically a realisation of a > map from change logs to change logs.
ok, agreed. > > - Self contained: everything in one single quite small class. No > dependencies, no clutter. > > - Much simpler. erm, i beg to differ ... ;) > > - Provably sound and complete. > > - Provably minimizing on the change logs. > > - Portable since it doesn't depend on a specific underlying implementation. > > - Reduction of up to 45% of the change log communicated from the upper > layers to the Microkernel. great! however, to be fair, you're comparing 2 different change log-based transient space implementations in the sandbox. transient space implementations don't necessarily need to be change log-based (see e.g. jackrabbit-core). i assume you would see little (if not no) reduction in the number of persisted changes per save operation when comparing to the jackrabbit-core transient space implementation. OTOH, i am not saying that the jackrabbit-core approach is suitable when using a microkernel-based backend and i am not questioning the usefulness of consolidated change logs in general. > > - In the case of H2, reduction of up to 200% of database size. see my previous comment > > Apart from the benefits, it is a plain necessity for transient space > implementations on top of the Microkernel: without proper consolidation > users could end up not being able to save their (valid) transient > modifications. Consider a first user doing > >>/a:/t/x +/t/x/y:{} >/t/x:/b > > if in the meanwhile another user removes x the save operation for the first > user would fail. how could another user remove /t/x? it only exists in the first user's transient space. cheers stefan > > After consolidation, above change log would look like > > +/a/y:{} >/a:/b > > and can be saved. > > Michael > > >> >> cheers >> stefan >> >>> >>> Michael >>> >>> >>> [1] http://markmail.org/message/qkkcvtmtapas2cx4 >>> [2] >>> >>> http://svn.apache.org/viewvc/jackrabbit/sandbox/jackrabbit-microkernel/src/main/java/org/apache/jackrabbit/state/ChangeLog.java?view=markup >>> [3] >>> >>> http://svn.apache.org/viewvc/jackrabbit/sandbox/jackrabbit-microkernel/src/test/java/org/apache/jackrabbit/state/ChangeLogFuzzTest.java?view=markup