Re: [DISCUSS] Reduce on FDB take 3

Adam Kocoloski Mon, 27 Jul 2020 19:36:09 -0700

Sounds good to me.

Adam


On Sun, Jul 26, 2020, at 8:13 AM, Robert Samuel Newson wrote:
> I'd like to find consensus on a path forward and, with that in mind, I 
> have a suggestion.
> 
> Before that, I think it's useful to lay out the points of contention, 
> which I've ranked in I hope an uncontroversial order;
> 
> The principal concern with using ebtree for map as well as reduce is 
> that it precludes the use of foundationdb's ordering, on which much 
> performance and scalability likely hinges. I agree with this on 
> principle (that is, its demonstrably true) though none of us yet know 
> the extent to which it matters in practice, or whether any gap could be 
> closed with optimization.
> 
> The secondary concern is the proposed use of ebtree for reduce-only 
> where it receives only the reduce values from a document. I understand 
> garren's stated desire to avoid duplication with the map side but I 
> don't think it works out. Users of the reduce function are using it to 
> get the reductions _across_ documents. It is rare (that is, I don't 
> recall seeing an example by a cloudant customer) to emit multiple rows 
> in a map function for the same key. In the cases where it doesn't, the 
> value inserted into ebtree is the same as the map value anyway. Where 
> it does happen, we're still duplicating the key and _a_ document level 
> value. It seems a very modest saving of space for a more complicated 
> design (two separate levels of reduce calculated in two separate 
> modules).
> 
> The tertiary concern is that the map-only indexes place user data 
> outside of aegis (native database encryption) as, necessarily, the 
> order of keys is important and a straightforward encryption there would 
> not preserve it.
> 
> Looking at all the options available I don't think there is a single 
> path that addresses everything. There are also significant unknowns 
> around practical performance for all of this.
> 
> So, I suggest that the map-only code (specifically, the code we will 
> use for an index definition that does not specify a reduce function) 
> uses the existing couch_views approach, where the fdb keys are suffixed 
> with the emitted key from the map function, and the fdb values are the 
> emitted values. This vertical index should support the fastest ?key= 
> and startkey/endkey= queries that fdb is capable of. There might be 
> simplifications we can make to this code if we accept that it only 
> needs to handle map-only indexes.
> 
> For index definitions that _do_ specify a reduce function, I suggest 
> this is done entirely with ebtree.We would call ebtree:insert(Db, Tree, 
> Key, Value) where Key and Value are the emitted key and value from the 
> map function. ebtree:open is handed a reduce_fun which handles the 
> reduce and rereduce for the users reduction and, like couchdb 1/2/3, we 
> also mix in a system reduction that calculates the view size (paul's 
> diff above appears to do just this portion and looks broadly right).
> 
> This avoids the duplication that garren's idea of storing the reduces 
> in ebtree was trying to avoid, but does so in all cases. This approach 
> allows us to compare the performance of map queries against map-only 
> and map-reduce indexes, it allows users to opt in or out of encrypting 
> their emitted view keys, and it doesn't prevent anyone from choosing to 
> do both (You can define a view with map function A and another view 
> with map function A and reduce function B, the former will be the 
> couch_views vertical index, the latter an ebtree "horizontal" index).
> 
> As we learn more over time we could enhance one type of index or the 
> other and reach a point where we eliminate one, or the two could 
> coexist indefinitely.
> 
> > On 24 Jul 2020, at 20:00, Paul Davis <paul.joseph.da...@gmail.com> wrote:
> > 
> > FWIW, a first pass at views entirely on ebtree turned out to be fairly
> > straightforward. Almost surprisingly simple in some cases.
> > 
> > https://github.com/apache/couchdb/compare/prototype/fdb-layer...prototype/fdb-layer-ebtree-views
> > 
> > Its currently passing all tests in `couch_views_map_test.erl` but is
> > failing on other suites. I only did a quick skim on the failures but
> > they all look superficial around some APIs I changed.
> > 
> > I haven't added the APIs to query reduce functions via HTTP but the
> > reduce functions are being executed to calculate row counts and KV
> > sizes. Adding the builtin reduce functions and extending those to user
> > defined reduce functions should be straightforward.
> > 
> > On Fri, Jul 24, 2020 at 9:39 AM Robert Newson <rnew...@apache.org> wrote:
> >> 
> >> 
> >> I’m happy to restrict my PR comments to the actual diff, yes. So I’m not 
> >> +1 yet.
> >> 
> >> I fixed the spurious conflicts at 
> >> https://github.com/apache/couchdb/pull/3033.
> >> 
> >> --
> >>  Robert Samuel Newson
> >>  rnew...@apache.org
> >> 
> >> On Fri, 24 Jul 2020, at 14:59, Garren Smith wrote:
> >>> Ok so just to confirm, we keep my PR as-is with ebtree only for reduce. We
> >>> can get that ready to merge into fdb master. We can then use that to 
> >>> battle
> >>> test ebtree and then look at using it for the map side as well. At that
> >>> point we would combine the reduce and map index into a ebtree index. Are
> >>> you happy with that?
> >>> 
> >>> Cheers
> >>> Garren
> >>> 
> >>> On Fri, Jul 24, 2020 at 3:48 PM Robert Newson <rnew...@apache.org> wrote:
> >>> 
> >>>> Hi,
> >>>> 
> >>>> It’s not as unknown as you think but certainly we need empirical data to
> >>>> guide us on the reduce side. I’m also fine with continuing with the
> >>>> map-only code as it stands today until such time as we demonstrate ebtree
> >>>> meets or exceeds our needs (and I freely accept the possibility that it
> >>>> might not).
> >>>> 
> >>>> I think the principal enhancement to ebtree that would address most
> >>>> concerns is if it could store the leaf entries vertically (as they are
> >>>> currently). I have some thoughts which I’ll try to realise as working 
> >>>> code.
> >>>> 
> >>>> I’ve confirmed that I do create spurious conflicts and will have a PR up
> >>>> today to fix that.
> >>>> 
> >>>> --
> >>>>  Robert Samuel Newson
> >>>>  rnew...@apache.org
> >>>> 
> >>>> On Fri, 24 Jul 2020, at 12:43, Garren Smith wrote:
> >>>>> Hi Bob,
> >>>>> 
> >>>>> Thanks for that explanation, that is really helpful and it is good we
> >>>> have
> >>>>> some options but it also does highlight a lot of unknowns. Whereas our
> >>>>> current map indexes are really simple and we know its behaviour. There
> >>>> are
> >>>>> no real unknowns. Adding ebtree here could make map indexes a fair bit
> >>>> more
> >>>>> complicated since we don't know the effect of managing different node
> >>>>> sizes, concurrent doc updates, and querying performance.
> >>>>> 
> >>>>> Could we do a compromise here, could we look at using ebtree only for
> >>>>> reduces now? That means that we can have reduce indexes working quite
> >>>> soon
> >>>>> on CouchDB on FDB. At the same time, we can work on ebtree and run
> >>>>> performance tests on ebtree for map indexes. Some interesting tests we
> >>>> can
> >>>>> do is see if a user emits KV's near the limits (8KB for keys and 50KB 
> >>>>> for
> >>>>> values) how does ebtree handle that? How it handles being updated in the
> >>>>> doc update transaction. And general query performance. What do you 
> >>>>> think?
> >>>>> 
> >>>>> Cheers
> >>>>> Garren
> >>>>> 
> >>>>> 
> >>>>> On Fri, Jul 24, 2020 at 10:06 AM Robert Samuel Newson <
> >>>> rnew...@apache.org>
> >>>>> wrote:
> >>>>> 
> >>>>>> Hi,
> >>>>>> 
> >>>>>> A short preface at this stage is needed I think:
> >>>>>> 
> >>>>>> My goal with ebtree was to implement a complete and correct b+tree that
> >>>>>> also calculated and maintained inner reductions. I consciously chose
> >>>> not to
> >>>>>> go further than that before presenting it to the group for wider debate
> >>>>>> (indeed, the very debate we're having in this thread).
> >>>>>> 
> >>>>>> --
> >>>>>> 
> >>>>>> Ebtree is not single writer, at least not inherently. Two updates to
> >>>> the
> >>>>>> same ebtree should both succeed as long as they don't modify the same
> >>>> nodes.
> >>>>>> 
> >>>>>> Modifying the same node is likely where there's a reduce function,
> >>>> though
> >>>>>> only if the reduction value actually changes, as that percolates up the
> >>>>>> tree. Where the reduce value does not change and when neither
> >>>> transaction
> >>>>>> causes a split, rebalance, or merge that affects the other, they should
> >>>>>> both commit without conflict. There is a rich history of optimizations
> >>>> in
> >>>>>> this space specifically around btrees. ebtree might cause spurious
> >>>>>> conflicts today, I will investigate and propose fixes if so (e.g, I
> >>>> think I
> >>>>>> probably call erlfdb:set on nodes that have not changed).
> >>>>>> 
> >>>>>> I certainly envisaged updating ebtree within the same txn as a doc
> >>>> update,
> >>>>>> which is why the first argument to all the public functions can be
> >>>> either
> >>>>>> an erlfdb Db or open Tx.
> >>>>>> 
> >>>>>> Parallelising the initial build is more difficult with ebtree than
> >>>>>> parallelising the existing couch_views map-only code, though it's worth
> >>>>>> noting that ebtree:insert benefits a great deal from batching (multiple
> >>>>>> calls to :insert in the same transaction). For an offline build (i.e,
> >>>> an
> >>>>>> index that the client cannot see until the entire build is complete)
> >>>> the
> >>>>>> batch size can be maximised. That is still a serial process in that
> >>>> there
> >>>>>> is only one transaction at a time updating the ebtree. I can't say
> >>>> offhand
> >>>>>> how fast that is in practice, but it is clearly less powerful than a
> >>>> fully
> >>>>>> parallelisable approach could be.
> >>>>>> 
> >>>>>> Any parallel build would require a way to divide the database into
> >>>>>> non-overlapping subsets of emitted keys. This is easy and natural if
> >>>> the
> >>>>>> fdb key is the emitted key, which is the case for the couch_views
> >>>> map-only
> >>>>>> code. For ebtree it might be enough to simply grab a large chunk of
> >>>>>> documents, perform the map transform, and then issues multiple
> >>>> transactions
> >>>>>> on subsets of those.
> >>>>>> 
> >>>>>> Another common technique for btrees is bulk loading (more or less
> >>>>>> literally constructing the btree nodes directly from the source, as
> >>>> long as
> >>>>>> you can sort it), which might be an option as well.
> >>>>>> 
> >>>>>> Parallelising a build _with_ a reduce function seems hard however we do
> >>>>>> it. The non-ebtree approach is parallelisable by virtue of paring down
> >>>> the
> >>>>>> reduce functionality itself (only whole key groups, only those
> >>>> functions
> >>>>>> that fdb has atomic operations for).
> >>>>>> 
> >>>>>> I will first of all verify the multi-writer nature of ebtree as it
> >>>> stands
> >>>>>> today and make a PR which fixes any spurious conflicts, and then ponder
> >>>>>> further  how true parallel builds might be possible.
> >>>>>> 
> >>>>>> 
> >>>>>>> On 24 Jul 2020, at 07:30, Garren Smith <gar...@apache.org> wrote:
> >>>>>>> 
> >>>>>>> We haven't spoken much about updates with ebtree.
> >>>>>>> From my understanding of ebtree it can only do single writer, is that
> >>>>>>> correct?. If that is true it means we would not be able to
> >>>> parallelize
> >>>>>> the
> >>>>>>> background building of views to speed up view builds.
> >>>>>>> The other thing that would mean is that we cannot use it for mango
> >>>> where
> >>>>>> we
> >>>>>>> update the view in the doc transaction.
> >>>>>>> 
> >>>>>>> Cheers
> >>>>>>> Garren
> >>>>>>> 
> >>>>>>> On Fri, Jul 24, 2020 at 2:35 AM Kyle Snavely <kjsnav...@gmail.com>
> >>>>>> wrote:
> >>>>>>> 
> >>>>>>>> When in doubt, throw us a build at Cloudant with ebtree maps and
> >>>> we'll
> >>>>>> see
> >>>>>>>> if it comes close to the crazy fast KV map queries.
> >>>>>>>> 
> >>>>>>>> Kyle
> >>>>>>>> 
> >>>>>>>> On Thu, Jul 23, 2020, 2:17 PM Robert Samuel Newson <
> >>>> rnew...@apache.org>
> >>>>>>>> wrote:
> >>>>>>>> 
> >>>>>>>>> I (perhaps obviously) don't agree that I'm tying myself to old
> >>>> CouchDB
> >>>>>> or
> >>>>>>>>> failing to embrace FDB. FDB is a toolkit and does not, to my mind,
> >>>>>> force
> >>>>>>>> us
> >>>>>>>>> down any particular path.
> >>>>>>>>> 
> >>>>>>>>> I haven't sat down to modify couch_views in the manner I've
> >>>> suggested
> >>>>>>>>> (where ebtree is used as designed; being fed the output of the
> >>>> emit()
> >>>>>>>> calls
> >>>>>>>>> and calculating reductions as it does so) but I think it's a
> >>>> worthwhile
> >>>>>>>>> exercise. I'd be surprised if performance of map-only traversals
> >>>> would
> >>>>>> be
> >>>>>>>>> disappointing but who knows? I also expect it would allow for
> >>>>>> significant
> >>>>>>>>> simplification of the code, which is one of the higher virtues.
> >>>>>>>>> 
> >>>>>>>>> Adam, can you describe in a little more detail how you picture
> >>>> "b+tree
> >>>>>> is
> >>>>>>>>> only used for incremental aggregations,"? It's implied in your
> >>>> reply
> >>>>>> that
> >>>>>>>>> it would preserve the "interesting property" of keeping user data
> >>>> out
> >>>>>> of
> >>>>>>>>> FDB Keys (for casual readers: the new native database encryption,
> >>>>>> called
> >>>>>>>>> "aegis", only encrypts the FDB value. It can't encrypt the key as
> >>>> this
> >>>>>>>>> would change the order of keys, which the current code depends on).
> >>>>>> Did I
> >>>>>>>>> misread you?
> >>>>>>>>> 
> >>>>>>>>> B.
> >>>>>>>>> 
> >>>>>>>>>> On 23 Jul 2020, at 20:11, Adam Kocoloski <kocol...@apache.org>
> >>>> wrote:
> >>>>>>>>>> 
> >>>>>>>>>> OK thanks for the clarification. As I said I wasn’t all that
> >>>> confident
> >>>>>>>> I
> >>>>>>>>> understood the design :)
> >>>>>>>>>> 
> >>>>>>>>>> I like the idea that the b+tree is only used for incremental
> >>>>>>>>> aggregations, rather than storing the entire materialized view,
> >>>> for the
> >>>>>>>>> same reasons that Garren stated.
> >>>>>>>>>> 
> >>>>>>>>>> An index maintained entirely in ebtree has the interesting
> >>>> property
> >>>>>>>> that
> >>>>>>>>> it does not leak any user data into FDB Keys, which could be
> >>>> attractive
> >>>>>>>> for
> >>>>>>>>> security reasons.
> >>>>>>>>>> 
> >>>>>>>>>> Adam
> >>>>>>>>>> 
> >>>>>>>>>>> On Jul 23, 2020, at 1:54 PM, Garren Smith <gar...@apache.org>
> >>>> wrote:
> >>>>>>>>>>> 
> >>>>>>>>>>> On Thu, Jul 23, 2020 at 6:55 PM Paul Davis <
> >>>>>>>> paul.joseph.da...@gmail.com
> >>>>>>>>>> 
> >>>>>>>>>>> wrote:
> >>>>>>>>>>> 
> >>>>>>>>>>>>> I would like to keep ebtree to use just for the reduce index.
> >>>>>>>>>>>> 
> >>>>>>>>>>>> Could you expand on your reasoning here, Garren? I haven't done
> >>>> any
> >>>>>>>>>>>> experiments on my own to understand if I'm missing something
> >>>>>>>>>>>> important. My initial reaction is to not diverge too far from
> >>>> the
> >>>>>>>>>>>> previous shape of the implementation since we have a decent
> >>>> idea of
> >>>>>>>>>>>> how that behaves already but perhaps you've seen or measured
> >>>>>>>> something
> >>>>>>>>>>>> I'm not thinking of?
> >>>>>>>>>>>> 
> >>>>>>>>>>> 
> >>>>>>>>>>> I think this must have been a misunderstanding on my part. I
> >>>> always
> >>>>>>>>> thought
> >>>>>>>>>>> of using ebtree to solve reduce and wasn't planning to use it
> >>>> for the
> >>>>>>>>> map
> >>>>>>>>>>> index.
> >>>>>>>>>>> I don't like the idea that we have ordered distributed key/value
> >>>>>> store
> >>>>>>>>> and
> >>>>>>>>>>> then implementing a b-tree on top of that for indexing.
> >>>> Especially
> >>>>>>>>> since we
> >>>>>>>>>>> know that the current map index is fast,
> >>>>>>>>>>> easy to use, and follows recommended practices in the FDB
> >>>> community
> >>>>>> on
> >>>>>>>>> how
> >>>>>>>>>>> to do indexing. ebtree makes sense for reduce and is a means to
> >>>> an
> >>>>>> end
> >>>>>>>>> to
> >>>>>>>>>>> give us CouchDB's reduce api, which is heavily reliant on a
> >>>> b-tree,
> >>>>>>>> with
> >>>>>>>>>>> CouchDB on FDB. This feels like a step backward and I worry we
> >>>> are
> >>>>>>>> tying
> >>>>>>>>>>> ourselves heavily to old CouchDB instead of using the fact that
> >>>> we
> >>>>>>>>> moving
> >>>>>>>>>>> to FDB which then allows us to design new api's and
> >>>> functionality.
> >>>>>>>>>> 
> >>>>>>>>> 
> >>>>>>>>> 
> >>>>>>>> 
> >>>>>> 
> >>>>>> 
> >>>>> 
> >>>> 
> >>> 
> 
>

Re: [DISCUSS] Reduce on FDB take 3

Reply via email to