On 6.5.14 2:59, Davide Giannella wrote:
On 02/05/2014 10:38, Michael Dürig wrote:
Stay with the linked list / skip list approach but defer inserting new
keys to a background operation. Instead new keys are put into a
"pending-non-conflicting-id" container until picked up by a background
thread and merged into the linked list. Conflicts are simply handled
by a first committer wins strategy in the same way we handle the
entire index update currently.

When such an index is used, the "pending-non-conflicting-id" container
would have to be considered and its content would have to be merged
into the linked list on the fly. This should be rather easy by first
sorting these extra keys and then on the fly merging the linked list
and the sorted extra keys.
Interesting idea. We could have the index with a structure like

:index : {
     _pending : {
         keyA : {...},
         keyB : {...},
         ...
         keyN : {...}
     },
     :start : {...},
     ...
     keyN : {...}
}

where, under `_pending`, we keep, as you said, a list of keys plus
documents, but unlinked. This is maintained with a simple
NodeBuilder.child(). The async agent will be the only one taking care of
calling NodeBuilder.remove() once the keys have been merged.
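As a sketch of the on-the-fly merge of `_pending` into the linked list at query time (plain Java lists stand in for the index nodes here; `mergeSorted` is illustrative, not Oak API):

```java
import java.util.*;

// Hypothetical sketch: at query time, merge the already-sorted linked
// list of index keys with the (freshly sorted) contents of the _pending
// container. A key may appear in both places, so duplicates are skipped.
public class PendingMerge {

    // Classic two-pointer merge of two sorted key lists.
    public static List<String> mergeSorted(List<String> linked, List<String> pending) {
        List<String> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < linked.size() || j < pending.size()) {
            String next;
            if (j >= pending.size()
                    || (i < linked.size() && linked.get(i).compareTo(pending.get(j)) <= 0)) {
                next = linked.get(i++);
            } else {
                next = pending.get(j++);
            }
            // Emit each key only once, even if present in both inputs.
            if (out.isEmpty() || !out.get(out.size() - 1).equals(next)) {
                out.add(next);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // The _pending keys are sorted first, then merged on the fly.
        List<String> linked = Arrays.asList("keyA", "keyC", "keyF");
        List<String> pending = new ArrayList<>(Arrays.asList("keyD", "keyB"));
        Collections.sort(pending);
        System.out.println(mergeSorted(linked, pending));
        // [keyA, keyB, keyC, keyD, keyF]
    }
}
```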

Would it be an unmanageable conflict if the agent on ClusterA removes
the key "foobar" while, in the same cycle, ClusterB adds a new document
under "foobar"?

I'm still not sure how I would manage deletions and edits (documents
whose key changed) in this scenario.

I think we can. At least if we can rely on the conflict resolution being implemented as I described earlier [1]. You'd need to double check this for the DocumentNS though. @Marcel, could you shed some light on this?

The keyXXX nodes from above would need to contain properties with unique names that in turn point to the respective nodes for that key. When such a node is removed, the property would be removed, but not the parent node, even if it happens to be empty. That would be left to the async maintenance task.

For example say we initially have

keyA : { p5254=/foo/bar }

now someone adds /bar/baz, someone else adds /quz/qoz, and someone else again removes /foo/bar for the same keyA. This would result in:

keyA : { p6475=/bar/baz, p7489=/quz/qoz }

This would even work without conflict if two or more cluster nodes removed /foo/bar at the same time.

A special case is when the same node for keyA, say /x/y, is added concurrently by multiple cluster nodes. This would then result in

keyA : { p6475=/bar/baz, p7489=/quz/qoz, p9873=/x/y, p1255=/x/y }

I.e. /x/y gets duplicated, which needs to be handled properly by the async maintenance task, and also when the index is used before it has been maintained.
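A reader (or the async maintenance task) could collapse such duplicates simply by looking at the distinct property values. A minimal sketch, with a plain `Map` standing in for the key node's properties (class and method names are illustrative, not Oak code):

```java
import java.util.*;

// Hypothetical sketch: a key node stores unique-named properties whose
// values are the indexed paths. Concurrent adds of the same path by
// different cluster nodes leave duplicate values, which a reader (or
// the async maintenance task) must collapse.
public class KeyNodeReader {

    // Collect the distinct paths referenced by a key node's properties.
    public static Set<String> distinctPaths(Map<String, String> properties) {
        return new TreeSet<>(properties.values());
    }

    public static void main(String[] args) {
        Map<String, String> keyA = new HashMap<>();
        keyA.put("p6475", "/bar/baz");
        keyA.put("p7489", "/quz/qoz");
        keyA.put("p9873", "/x/y"); // added by one cluster node
        keyA.put("p1255", "/x/y"); // added concurrently by another
        System.out.println(distinctPaths(keyA));
        // [/bar/baz, /quz/qoz, /x/y]
    }
}
```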


[1] https://issues.apache.org/jira/browse/OAK-1553?focusedCommentId=13940731&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13940731


I think such an approach trades off very well: the index is always
consistent; however, depending on the number of keys in the
"pending-non-conflicting-id" container, performance might degrade a bit
until the background process has caught up again. Additionally, this
approach is only slightly more complicated than what we already have.
Well, we could limit the sorting of the _pending area to a configurable
number of keys, say 1000, to avoid an OOM. That could be a decent
compromise.

I think it is a decent compromise and we should not worry about it too much at this point.
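Such a limit could be enforced at write time. A hypothetical sketch (the limit value, the `tryAdd` name and the fall-back behaviour are all assumptions, not existing Oak code):

```java
import java.util.*;

// Hypothetical sketch: cap the _pending container to avoid unbounded
// growth (and an OOM when sorting it at query time). What to do at the
// limit -- block, reject, or merge synchronously -- is an open design
// choice; here we just signal the caller to take the slow path.
public class PendingLimit {
    static final int MAX_PENDING = 1000; // assumed configurable limit

    private final Map<String, String> pending = new TreeMap<>();

    // Returns false when the pending area is full and the caller must
    // fall back (e.g. merge into the linked list synchronously).
    public boolean tryAdd(String key, String document) {
        if (pending.size() >= MAX_PENDING) {
            return false;
        }
        pending.put(key, document);
        return true;
    }

    public int size() {
        return pending.size();
    }
}
```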


In the meanwhile I was investigating the B-Tree (BT). It would work on
paper, of course, and it has the same asymptotic complexity as the
SkipList (SL). As an advantage, it could serve both ascending and
descending order from the same index. On the other hand, it sounds like
it will be less efficient at serving range queries. In the case of the
SL, all we have to do is seek the correct start and then just "browse
through", while to navigate the BT we would have to continuously go up
and down the tree, sorting the current bucket in memory and seeking
where to go next. Moreover, in the case of `lastModified` dates, where
we always append a greater element, it would require updating all the
buckets from root to leaf.
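The range-scan difference can be illustrated with a minimal, hypothetical sketch of the SL side (the `Node` type is illustrative, not Oak's actual index node format): once the start key is found, serving the range is a straight walk along the next links, with no re-descent into a tree.

```java
import java.util.*;

// Hypothetical sketch of the linked-list range scan: seek the first key
// >= the lower bound, then just follow the "next" links until the upper
// bound is reached.
public class LinkedRangeScan {

    static final class Node {
        final String key;
        Node next;
        Node(String key) { this.key = key; }
    }

    // Collect keys in [from, to) by walking the list once.
    public static List<String> range(Node head, String from, String to) {
        List<String> out = new ArrayList<>();
        Node n = head;
        // Seek: in the real index the skip-list levels would make this
        // logarithmic rather than a linear walk.
        while (n != null && n.key.compareTo(from) < 0) {
            n = n.next;
        }
        // Browse through: no tree navigation, just next pointers.
        while (n != null && n.key.compareTo(to) < 0) {
            out.add(n.key);
            n = n.next;
        }
        return out;
    }

    public static void main(String[] args) {
        Node a = new Node("keyA"), c = new Node("keyC"),
             d = new Node("keyD"), f = new Node("keyF");
        a.next = c; c.next = d; d.next = f;
        System.out.println(range(a, "keyB", "keyF"));
        // [keyC, keyD]
    }
}
```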

What I'm not sure about with the BT is its ability to manage conflicts.
Let's say we have

A Z
|  \
.. MQZ
     / | \
       OQ XYZ

if at the same instant ClusterA deletes O and Q, the result will be

A Z
|  \
.. MZ
     /  \
        XYZ

while ClusterB adds P, with the result

A Z
|  \
.. MQZ
     / | \
      OPQ XYZ

How do we expect Mongo to manage this situation during the cluster sync?
(Each letter is a node and the lines point to subnodes.)

My initial idea was that the B-tree maintenance would only be done by the async task, while added keys would be kept in a "pending" bucket... at which point I realised that we could do the very same thing with the current linked/skip list approach, which probably performs better, is easier to implement, and is closer to what we already have.

Michael


Any thoughts on this?

D.
