This is a little tricky now since we're having similar conversations on
the PR and here. But to summarise my response from the PR
https://github.com/apache/couchdb/pull/3018#issuecomment-662925449
I'm only storing the reduce values in ebtree. We haven't done any
performance testing to see how ebtree would compare to the current map
index so we don't know if it would be usable for that. I would like to keep
ebtree just for the reduce index.

Cheers
Garren

On Thu, Jul 23, 2020 at 11:17 AM Robert Samuel Newson <rnew...@apache.org>
wrote:

> Hi Adam,
>
> You are right on those 3 bullet points, that is what ebtree is/does.
>
> You are not right on your second paragraph, at least as it pertains to
> ebtree itself. ebtree stores the inserted keys, values, links between btree
> nodes and intermediate reduction values in the fdb value (the fdb key is
> always constructed as {IndexPrefix, NodeId} where NodeId is currently an
> auto-incrementing integer). Some of those items are unused (stored as a
> single-byte NIL) if no reduce_fun is specified. In Garren's work (
> https://github.com/apache/couchdb/pull/3018) that integrates ebtree into
> the couch_views application you are right, in that ebtree is only used for
> the reduce storage. I've commented on this topic on that PR as I think that's
> a missed opportunity.
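>
> To make that concrete, here is a minimal sketch of the "everything in
> one fdb value per node" layout (field names and encoding are
> illustrative only, not the actual ebtree format; term_to_binary stands
> in for the real encoding):
>
>     %% ebtree_layout_sketch.erl - illustrative, not the real ebtree encoding
>     -module(ebtree_layout_sketch).
>     -export([node_kv/4]).
>
>     node_kv(IndexPrefix, NodeId, Members, Reduction0) ->
>         %% fdb key: {IndexPrefix, NodeId}, NodeId an auto-incrementing integer
>         Key = term_to_binary({IndexPrefix, NodeId}),
>         %% the real code stores a single-byte NIL when there is no
>         %% reduce_fun; an atom stands in for that here
>         Reduction = case Reduction0 of undefined -> nil; R -> R end,
>         %% members, leaf links and the reduction all live in the fdb value
>         Value = term_to_binary(#{members => Members,
>                                  prev => undefined,
>                                  next => undefined,
>                                  reduction => Reduction}),
>         {Key, Value}.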
>
> Through the ebtree development I had considered whether the tree should be
> stored separately from the user-defined (and therefore potentially large
> and non-uniform in size) values or not. As you've noted the "order" in
> ebtree is implemented correctly (on cardinality not size) and this is key
> to predictable performance (as is the rebalancing when deleting, which
> couch_btree has never attempted).
>
> Without the user-defined values, we could choose a high value for order
> (say, 100 or 200, perhaps even higher) and have confidence that they will
> fit within fdb value size limits. I deliberately went simple here and
> deferred the debate till I had something worth discussing. And so I would
> love to continue that discussion here; this thread seems the appropriate
> place.
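>
> To put rough numbers on that, a back-of-envelope sketch (the byte
> counts are assumptions, not measurements; fdb values are capped at
> roughly 100 KB):
>
>     %% ebtree_order_sketch.erl - rough arithmetic, not a benchmark
>     -module(ebtree_order_sketch).
>     -export([node_bytes/3, fits/3]).
>
>     -define(FDB_VALUE_LIMIT, 100000).  %% fdb value size limit, in bytes
>
>     %% Order = members per node; KeyBytes/ReduceBytes are assumed averages.
>     node_bytes(Order, KeyBytes, ReduceBytes) ->
>         Overhead = 64,  %% node id, links, encoding overhead (a guess)
>         Order * (KeyBytes + ReduceBytes) + Overhead.
>
>     fits(Order, KeyBytes, ReduceBytes) ->
>         node_bytes(Order, KeyBytes, ReduceBytes) =< ?FDB_VALUE_LIMIT.
>
>     %% e.g. fits(200, 100, 50) -> true, but fits(200, 100, 5000) -> false,
>     %% which is why unbounded reduction values are the worry.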
>
> My thoughts so far are;
>
> 1) we could introduce new limits on what we allow users to reduce on, with
> a view to capturing 95%+ of the intended use case, e.g. restricting the
> emitted values to scalars, which we know will reduce without growing
> excessively (and, yes, I would include lists or objects of scalars). The
> only thing I think we'd want to preclude is what reduce_limit also, and
> naively, tried to prevent: an ever-growing reduction value.
>
> 2) As you've suggested independently, the emitted key/values, and the
> intermediate reductions, are stored in their own k-v entries, leaving the
> k-v's of the btree structure itself within predictable size bounds (roughly
> sketched just after this list).
>
> 3) Splitting an encoded node over multiple, adjacent k-v's exactly like we
> do with a large document body today.
>
> 4) A hybrid approach of the current ebtree code and 2 above, externalising
> key/value/reductions if warranted on a per-node basis. Well-behaved reduce
> functions over appropriate emitted data would not need to do so, but we'd
> not be surprised in production deployments if large values or large
> reductions happened on occasion.
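>
> To make option 2 a little more concrete, a sketch with an invented key
> layout (the actual subspace design would be part of this discussion):
> the node k-v carries only structure, and each member and each
> intermediate reduction gets its own k-v.
>
>     %% ebtree_external_sketch.erl - option 2 only, invented key layout
>     -module(ebtree_external_sketch).
>     -export([node_kvs/4]).
>
>     node_kvs(Prefix, NodeId, Members, Reduction) ->
>         %% the node value holds only member ids, so its size is bounded by Order
>         StructKey = term_to_binary({Prefix, node, NodeId}),
>         StructVal = term_to_binary([Id || {Id, _K, _V} <- Members]),
>         %% user-emitted keys/values are externalised, one k-v per member
>         MemberKVs = [{term_to_binary({Prefix, member, NodeId, Id}),
>                       term_to_binary({K, V})} || {Id, K, V} <- Members],
>         %% the intermediate reduction also gets its own k-v
>         ReduceKV = {term_to_binary({Prefix, reduce, NodeId}),
>                     term_to_binary(Reduction)},
>         [{StructKey, StructVal}, ReduceKV | MemberKVs].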
>
> I do think it's a good time to define what an appropriate use of reduce in
> CouchDB is, whatever the mechanism for calculating and storing it. I don't
> think we should support egregious cases like "function(ks, vs, rr) { return
> vs}", for example.
>
> Finally, I note that I have a local branch called "immutable" which
> changes node ids whenever the "members" attribute of an ebtree node
> changes. It changes node ids to uuids and adds a cache callback. The
> intention is to eliminate the vast majority of inner node lookups _outside_
> of any fdb transaction (it is always necessary to read the root node and we
> can't cache leaf nodes as they are linked for efficient forward and reverse
> ordered traversal). This code works as expected but I have been unable to
> prove its benefit as ebtree performs very well under the test scenarios
> I've been able to bring to bear so far. I will post that work as a branch
> to couchdb later today.
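>
> To illustrate the shape of that cache (the actual callback API in the
> branch may well differ; this is just the idea of an ets table keyed by
> the now-immutable node id, consulted before reading from fdb):
>
>     %% ebtree_cache_sketch.erl - illustrative only
>     -module(ebtree_cache_sketch).
>     -export([init/0, cache/3]).
>
>     init() ->
>         ets:new(ebtree_node_cache, [named_table, public, set]).
>
>     %% Immutable node ids make this safe: a cached node is never stale,
>     %% it is simply no longer referenced once its members change.
>     cache(get, Id, undefined) ->
>         case ets:lookup(ebtree_node_cache, Id) of
>             [{Id, Node}] -> Node;
>             [] -> not_found
>         end;
>     cache(put, Id, Node) ->
>         true = ets:insert(ebtree_node_cache, {Id, Node}),
>         ok.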
>
> B.
>
>
> > On 23 Jul 2020, at 01:54, Adam Kocoloski <kocol...@apache.org> wrote:
> >
> > This is pretty interesting. I find myself doing a mental double-take
> when I realize that it is
> >
> > - storing an ordered set of KVs,
> > - in an ordered KV storage engine,
> > - without relying on the ordering of the storage engine!
> >
> > If I’m understanding things correctly (and that’s a big if), adding a
> reduce function to a view completely changes how the “map” portion of that
> view is stored. A view with no reduce function stores the rows of the view
> as individual KVs in FDB, but the same map output when paired with a reduce
> function would be chunked up and stored in the leaf nodes of this b+tree.
> Has anyone compared the performance of those two different data models,
> e.g. by querying a map-only view and an mr-view with reduce=false?
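> >
> > For concreteness, the two layouts being compared look roughly like this
> > (the key shapes are purely illustrative, not the actual couch_views
> > schema):
> >
> >     %% view_layout_sketch.erl - illustrative key shapes only
> >     -module(view_layout_sketch).
> >     -export([map_only_kv/4, ebtree_leaf_kv/3]).
> >
> >     %% map-only view: one fdb k-v per emitted row, range-readable directly
> >     map_only_kv(ViewPrefix, EmittedKey, DocId, EmittedValue) ->
> >         {term_to_binary({ViewPrefix, EmittedKey, DocId}),
> >          term_to_binary(EmittedValue)}.
> >
> >     %% with a reduce: many rows packed into one ebtree leaf-node k-v,
> >     %% reached by traversing the tree rather than by a plain range read
> >     ebtree_leaf_kv(TreePrefix, NodeId, Rows) ->
> >         {term_to_binary({TreePrefix, NodeId}), term_to_binary(Rows)}.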
> >
> > If it turns out that there’s a significant difference, is it worth
> considering a model where the leaf nodes of the b+tree just point to KV
> ranges in FDB, rather than holding the actual user-emitted KV data
> themselves? Then the inner nodes of the b+tree would just provide the data
> structure to incrementally maintain the aggregations. Inserts to that range
> of KVs during indexing would still go through the b+tree code path, but
> queries for the map portion of the view could go directly to the relevant
> range of KVs in FDB, skipping the traversal through the inner nodes and
> limiting the data transfer only to the rows requested by the client.
> >
> > Conversely, if the b+tree approach is actually better even without a
> user-supplied reduce function, shouldn’t we use it for all views?
> >
> > As an aside, I’m very glad to see a departure from couch_btree’s
> approach of dynamically modifying the “order” of the “b+tree” based on the
> size of the reduction. Such an ugly edge case there ...
> >
> > Cheers, Adam
> >
> >> On Jul 21, 2020, at 8:48 AM, Robert Newson <rnew...@apache.org> wrote:
> >>
> >> Thank you for those kind words.
> >>
> >> --
> >> Robert Samuel Newson
> >> rnew...@apache.org
> >>
> >> On Tue, 21 Jul 2020, at 13:45, Jan Lehnardt wrote:
> >>> Heya Garren and Bob,
> >>>
> >>> this looks really nice. I remember when this was a twinkle in our
> >>> planning eyes. Seeing the full thing realised is very very cool.
> >>>
> >>> I’m additionally impressed by the rather pretty and clean code.
> >>> This doesn’t have to be hard :)
> >>>
> >>> Looking forward to seeing this in action.
> >>>
> >>> Best
> >>> Jan
> >>> —
> >>>
> >>>> On 21. Jul 2020, at 14:01, Garren Smith <gar...@apache.org> wrote:
> >>>>
> >>>> Hi All
> >>>>
> >>>> We have a new reduce design for FoundationDB and we think this one
> >>>> will work. Recently I proposed a simpler reduce design [1] and, at
> >>>> the same time, Bob (rnewson) looked at implementing a B+tree [2],
> >>>> called ebtree, on top of FoundationDB. The b+tree implementation has
> >>>> turned out really nicely: the code is quite readable and works really
> >>>> well. I would like to propose that instead of using the simpler
> >>>> reduce design I mentioned in the previous email, we rather go with a
> >>>> reduce implementation on top of ebtree. The big advantage of ebtree
> >>>> is that it allows us to keep the behaviour of CouchDB 3.x.
> >>>>
> >>>> We have run some basic performance tests on the Cloudant performance
> >>>> clusters and so far the performance is looking quite good, very
> >>>> similar to my simpler reduce work.
> >>>>
> >>>> There is an unknown around the ebtree Order value. The Order is the
> >>>> number of key/values stored for a node. We need to determine the
> >>>> optimal order value for ebtree so that it doesn't exceed
> >>>> FoundationDB's key/value limits and still performs well. This is
> >>>> something we will be looking at as we finish up the reduce work. The
> >>>> work in progress for the reduce PR is
> >>>> https://github.com/apache/couchdb/pull/3018.
> >>>>
> >>>> A great thanks to Bob for implementing the B+tree. I would love to
> >>>> hear your thoughts or questions around this.
> >>>>
> >>>> Cheers
> >>>> Garren
> >>>>
> >>>> [1]
> >>>>
> https://lists.apache.org/thread.html/r1d77cf9bb9c86eddec57ca6ea2aad90f396ee5f0dfe43450f730b1cf%40%3Cdev.couchdb.apache.org%3E
> >>>>
> >>>> [2] https://github.com/apache/couchdb/pull/3017
> >>>> [3] https://github.com/apache/couchdb/pull/3018
> >>>
> >>>
> >
>
>
