Hi Adam,

You are right on those 3 bullet points; that is what ebtree is and does.
You are not right about your second paragraph, at least as it pertains to ebtree itself. ebtree stores the inserted keys, values, links between btree nodes, and intermediate reduction values in the fdb value (the fdb key is always constructed as {IndexPrefix, NodeId}, where NodeId is currently an auto-incrementing integer). Some of those items are unused (stored as a single-byte NIL) if no reduce_fun is specified (there's a rough sketch of this layout in the P.S. below). In Garren's work (https://github.com/apache/couchdb/pull/3018) that integrates ebtree into the couch_views application you are right, in that ebtree is only used for the reduce storage. I've commented on this topic on that PR as I think that's a missed opportunity.

Through the ebtree development I considered whether the tree should be stored separately from the user-defined (and therefore potentially large and non-uniform in size) values or not. As you've noted, the "order" in ebtree is implemented correctly (on cardinality, not size) and this is key to predictable performance (as is the rebalancing when deleting, which couch_btree has never attempted). Without the user-defined values, we could choose a high value for order (say, 100 or 200, perhaps even higher) and have confidence that nodes will fit within fdb value size limits. I deliberately went simple here and deferred the debate till I had something worth discussing. And so I would love to continue that discussion here; this thread seems the appropriate place. My thoughts so far are:

1) We could introduce new limits on what we allow users to reduce on, with a view to capturing 95%+ of the intended use cases, e.g. restricting the emitted values to scalars, which we know will reduce without growing excessively (and, yes, I would include lists or objects of scalars). The only thing I think we'd want to preclude is what reduce_limit also, and naively, tried to prevent: an ever-growing reduction value.

2) As you've suggested independently, the emitted key/values, and the intermediate reductions, are stored in their own k-v entries, leaving the k-v's of the btree structure itself within predictable size bounds (there's a rough sketch of this too in the P.S. below).

3) Splitting an encoded node over multiple, adjacent k-v's, exactly as we do with a large document body today.

4) A hybrid of the current ebtree code and 2 above, externalising keys/values/reductions where warranted on a per-node basis. Well-behaved reduce functions over appropriate emitted data would never need this, but we'd not be surprised if large values or large reductions happened on occasion in production deployments.

I do think it's a good time to define what an appropriate use of reduce in CouchDB is, whatever the mechanism for calculating and storing it. I don't think we should support egregious cases like "function(ks, vs, rr) { return vs }", for example.

Finally, I note that I have a local branch called "immutable" which changes node ids whenever the "members" attribute of an ebtree node changes. It changes node ids to uuids and adds a cache callback. The intention is to eliminate the vast majority of inner node lookups _outside_ of any fdb transaction (it is always necessary to read the root node, and we can't cache leaf nodes as they are linked for efficient forward and reverse ordered traversal). This code works as expected, but I have been unable to prove its benefit, as ebtree performs very well under the test scenarios I've been able to bring to bear so far. I will post that work as a branch to couchdb later today.

B.
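P.S. For concreteness, here is a very rough sketch of the node layout described above. The record and function names are illustrative, not the actual ebtree code; the point is only that everything, ordering included, lives in the fdb value:

    %% Illustrative only -- not the real ebtree records or encoding.
    -module(ebtree_layout_sketch).
    -export([node_key/2, node_value/1]).

    -record(node, {
        id,          % auto-incrementing integer, unique within the tree
        level,       % 0 for leaf nodes
        prev, next,  % leaf sibling links for forward/reverse ordered traversal
        members      % keys, values, child links, intermediate reductions;
                     % values/reductions are a single-byte NIL when no
                     % reduce_fun is specified
    }).

    %% The fdb key is always {IndexPrefix, NodeId} ...
    node_key(IndexPrefix, #node{id = Id}) ->
        erlfdb_tuple:pack({Id}, IndexPrefix).

    %% ... and everything else is packed into the fdb value, which is why
    %% FDB's own key ordering goes unused.
    node_value(#node{level = Level, prev = Prev, next = Next, members = Members}) ->
        term_to_binary({Level, Prev, Next, Members}).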
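And an equally rough sketch of option 2, again with made-up names: the btree members carry fixed-size references, and the potentially large emitted values live in their own k-v entries, so node size depends only on the order and not on what the user emitted:

    %% Illustrative only. A leaf member becomes {EmittedKey, ValueId};
    %% the emitted value itself is written under a separate fdb key.
    -module(ebtree_external_values_sketch).
    -export([value_key/2, store_member/4]).

    value_key(IndexPrefix, ValueId) ->
        erlfdb_tuple:pack({<<"v">>, ValueId}, IndexPrefix).

    store_member(Tx, IndexPrefix, EmittedKey, EmittedValue) ->
        ValueId = crypto:strong_rand_bytes(16), % any unique id would do
        erlfdb:set(Tx, value_key(IndexPrefix, ValueId), EmittedValue),
        {EmittedKey, ValueId}.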
> On 23 Jul 2020, at 01:54, Adam Kocoloski <kocol...@apache.org> wrote:
>
> This is pretty interesting. I find myself doing a mental double-take when I
> realize that it is
>
> - storing an ordered set of KVs,
> - in an ordered KV storage engine,
> - without relying on the ordering of the storage engine!
>
> If I’m understanding things correctly (and that’s a big if), adding a reduce
> function to a view completely changes how the “map” portion of that view is
> stored. A view with no reduce function stores the rows of the view as
> individual KVs in FDB, but the same map output when paired with a reduce
> function would be chunked up and stored in the leaf nodes of this b+tree. Has
> anyone compared the performance of those two different data models, e.g. by
> querying a map-only view and an mr-view with reduce=false?
>
> If it turns out that there’s a significant difference, is it worth
> considering a model where the leaf nodes of the b+tree just point to KV
> ranges in FDB, rather than holding the actual user-emitted KV data
> themselves? Then the inner nodes of the b+tree would just provide the data
> structure to incrementally maintain the aggregations. Inserts to that range
> of KVs during indexing would still go through the b+tree code path, but
> queries for the map portion of the view could go directly to the relevant
> range of KVs in FDB, skipping the traversal through the inner nodes and
> limiting the data transfer only to the rows requested by the client.
>
> Conversely, if the b+tree approach is actually better even without a
> user-supplied reduce function, shouldn’t we use it for all views?
>
> As an aside, I’m very glad to see a departure from couch_btree’s approach of
> dynamically modifying the “order” of the “b+tree” based on the size of the
> reduction. Such an ugly edge case there ...
>
> Cheers, Adam
>
>> On Jul 21, 2020, at 8:48 AM, Robert Newson <rnew...@apache.org> wrote:
>>
>> Thank you for those kind words.
>>
>> --
>> Robert Samuel Newson
>> rnew...@apache.org
>>
>> On Tue, 21 Jul 2020, at 13:45, Jan Lehnardt wrote:
>>> Heya Garren and Bob,
>>>
>>> this looks really nice. I remember when this was a twinkle in our
>>> planning eyes. Seeing the full thing realised is very very cool.
>>>
>>> I’m additionally impressed by the rather pretty and clean code.
>>> This doesn’t have to be hard :)
>>>
>>> Looking forward to seeing this in action.
>>>
>>> Best
>>> Jan
>>> —
>>>
>>>> On 21. Jul 2020, at 14:01, Garren Smith <gar...@apache.org> wrote:
>>>>
>>>> Hi All
>>>>
>>>> We have a new reduce design for FoundationDB and we think this one will
>>>> work. Recently I proposed a simpler reduce design [1] and at the same
>>>> time, Bob (rnewson) looked at implementing a B+tree [2], called ebtree,
>>>> on top of FoundationDB. The b+tree implementation has turned out really
>>>> nicely; the code is quite readable and works really well. I would like
>>>> to propose that instead of using the simpler reduce design I mentioned
>>>> in the previous email, we rather go with a reduce implementation on top
>>>> of ebtree. The big advantage of ebtree is that it allows us to keep the
>>>> behaviour of CouchDB 3.x.
>>>>
>>>> We have run some basic performance tests on the Cloudant performance
>>>> clusters and so far the performance is looking quite good and performs
>>>> very similarly to my simpler reduce work.
>>>>
>>>> There is an unknown around the ebtree Order value. The Order is the
>>>> number of key/values stored for a node.
>>>> We need to determine the optimal order value for ebtree so that it
>>>> doesn't exceed FoundationDB's key/value limits and still performs well.
>>>> This is something we will be looking at as we finish up the reduce work.
>>>> The work in progress for the reduce PR is
>>>> https://github.com/apache/couchdb/pull/3018.
>>>>
>>>> A great thanks to Bob for implementing the B+tree. I would love to hear
>>>> your thoughts or questions around this?
>>>>
>>>> Cheers
>>>> Garren
>>>>
>>>> [1]
>>>> https://lists.apache.org/thread.html/r1d77cf9bb9c86eddec57ca6ea2aad90f396ee5f0dfe43450f730b1cf%40%3Cdev.couchdb.apache.org%3E
>>>>
>>>> [2] https://github.com/apache/couchdb/pull/3017
>>>> [3] https://github.com/apache/couchdb/pull/3018
>>>
>>>
>