Sounds good to me. Adam
On Sun, Jul 26, 2020, at 8:13 AM, Robert Samuel Newson wrote: > I'd like to find consensus on a path forward and, with that in mind, I > have a suggestion. > > Before that, I think it's useful to lay out the points of contention, > which I've ranked in I hope an uncontroversial order; > > The principal concern with using ebtree for map as well as reduce is > that it precludes the use of foundationdb's ordering, on which much > performance and scalability likely hinges. I agree with this on > principle (that is, its demonstrably true) though none of us yet know > the extent to which it matters in practice, or whether any gap could be > closed with optimization. > > The secondary concern is the proposed use of ebtree for reduce-only > where it receives only the reduce values from a document. I understand > garren's stated desire to avoid duplication with the map side but I > don't think it works out. Users of the reduce function are using it to > get the reductions _across_ documents. It is rare (that is, I don't > recall seeing an example by a cloudant customer) to emit multiple rows > in a map function for the same key. In the cases where it doesn't, the > value inserted into ebtree is the same as the map value anyway. Where > it does happen, we're still duplicating the key and _a_ document level > value. It seems a very modest saving of space for a more complicated > design (two separate levels of reduce calculated in two separate > modules). > > The tertiary concern is that the map-only indexes place user data > outside of aegis (native database encryption) as, necessarily, the > order of keys is important and a straightforward encryption there would > not preserve it. > > Looking at all the options available I don't think there is a single > path that addresses everything. There are also significant unknowns > around practical performance for all of this. > > So, I suggest that the map-only code (specifically, the code we will > use for an index definition that does not specify a reduce function) > uses the existing couch_views approach, where the fdb keys are suffixed > with the emitted key from the map function, and the fdb values are the > emitted values. This vertical index should support the fastest ?key= > and startkey/endkey= queries that fdb is capable of. There might be > simplifications we can make to this code if we accept that it only > needs to handle map-only indexes. > > For index definitions that _do_ specify a reduce function, I suggest > this is done entirely with ebtree.We would call ebtree:insert(Db, Tree, > Key, Value) where Key and Value are the emitted key and value from the > map function. ebtree:open is handed a reduce_fun which handles the > reduce and rereduce for the users reduction and, like couchdb 1/2/3, we > also mix in a system reduction that calculates the view size (paul's > diff above appears to do just this portion and looks broadly right). > > This avoids the duplication that garren's idea of storing the reduces > in ebtree was trying to avoid, but does so in all cases. This approach > allows us to compare the performance of map queries against map-only > and map-reduce indexes, it allows users to opt in or out of encrypting > their emitted view keys, and it doesn't prevent anyone from choosing to > do both (You can define a view with map function A and another view > with map function A and reduce function B, the former will be the > couch_views vertical index, the latter an ebtree "horizontal" index). > > As we learn more over time we could enhance one type of index or the > other and reach a point where we eliminate one, or the two could > coexist indefinitely. > > > On 24 Jul 2020, at 20:00, Paul Davis <paul.joseph.da...@gmail.com> wrote: > > > > FWIW, a first pass at views entirely on ebtree turned out to be fairly > > straightforward. Almost surprisingly simple in some cases. > > > > https://github.com/apache/couchdb/compare/prototype/fdb-layer...prototype/fdb-layer-ebtree-views > > > > Its currently passing all tests in `couch_views_map_test.erl` but is > > failing on other suites. I only did a quick skim on the failures but > > they all look superficial around some APIs I changed. > > > > I haven't added the APIs to query reduce functions via HTTP but the > > reduce functions are being executed to calculate row counts and KV > > sizes. Adding the builtin reduce functions and extending those to user > > defined reduce functions should be straightforward. > > > > On Fri, Jul 24, 2020 at 9:39 AM Robert Newson <rnew...@apache.org> wrote: > >> > >> > >> I’m happy to restrict my PR comments to the actual diff, yes. So I’m not > >> +1 yet. > >> > >> I fixed the spurious conflicts at > >> https://github.com/apache/couchdb/pull/3033. > >> > >> -- > >> Robert Samuel Newson > >> rnew...@apache.org > >> > >> On Fri, 24 Jul 2020, at 14:59, Garren Smith wrote: > >>> Ok so just to confirm, we keep my PR as-is with ebtree only for reduce. We > >>> can get that ready to merge into fdb master. We can then use that to > >>> battle > >>> test ebtree and then look at using it for the map side as well. At that > >>> point we would combine the reduce and map index into a ebtree index. Are > >>> you happy with that? > >>> > >>> Cheers > >>> Garren > >>> > >>> On Fri, Jul 24, 2020 at 3:48 PM Robert Newson <rnew...@apache.org> wrote: > >>> > >>>> Hi, > >>>> > >>>> It’s not as unknown as you think but certainly we need empirical data to > >>>> guide us on the reduce side. I’m also fine with continuing with the > >>>> map-only code as it stands today until such time as we demonstrate ebtree > >>>> meets or exceeds our needs (and I freely accept the possibility that it > >>>> might not). > >>>> > >>>> I think the principal enhancement to ebtree that would address most > >>>> concerns is if it could store the leaf entries vertically (as they are > >>>> currently). I have some thoughts which I’ll try to realise as working > >>>> code. > >>>> > >>>> I’ve confirmed that I do create spurious conflicts and will have a PR up > >>>> today to fix that. > >>>> > >>>> -- > >>>> Robert Samuel Newson > >>>> rnew...@apache.org > >>>> > >>>> On Fri, 24 Jul 2020, at 12:43, Garren Smith wrote: > >>>>> Hi Bob, > >>>>> > >>>>> Thanks for that explanation, that is really helpful and it is good we > >>>> have > >>>>> some options but it also does highlight a lot of unknowns. Whereas our > >>>>> current map indexes are really simple and we know its behaviour. There > >>>> are > >>>>> no real unknowns. Adding ebtree here could make map indexes a fair bit > >>>> more > >>>>> complicated since we don't know the effect of managing different node > >>>>> sizes, concurrent doc updates, and querying performance. > >>>>> > >>>>> Could we do a compromise here, could we look at using ebtree only for > >>>>> reduces now? That means that we can have reduce indexes working quite > >>>> soon > >>>>> on CouchDB on FDB. At the same time, we can work on ebtree and run > >>>>> performance tests on ebtree for map indexes. Some interesting tests we > >>>> can > >>>>> do is see if a user emits KV's near the limits (8KB for keys and 50KB > >>>>> for > >>>>> values) how does ebtree handle that? How it handles being updated in the > >>>>> doc update transaction. And general query performance. What do you > >>>>> think? > >>>>> > >>>>> Cheers > >>>>> Garren > >>>>> > >>>>> > >>>>> On Fri, Jul 24, 2020 at 10:06 AM Robert Samuel Newson < > >>>> rnew...@apache.org> > >>>>> wrote: > >>>>> > >>>>>> Hi, > >>>>>> > >>>>>> A short preface at this stage is needed I think: > >>>>>> > >>>>>> My goal with ebtree was to implement a complete and correct b+tree that > >>>>>> also calculated and maintained inner reductions. I consciously chose > >>>> not to > >>>>>> go further than that before presenting it to the group for wider debate > >>>>>> (indeed, the very debate we're having in this thread). > >>>>>> > >>>>>> -- > >>>>>> > >>>>>> Ebtree is not single writer, at least not inherently. Two updates to > >>>> the > >>>>>> same ebtree should both succeed as long as they don't modify the same > >>>> nodes. > >>>>>> > >>>>>> Modifying the same node is likely where there's a reduce function, > >>>> though > >>>>>> only if the reduction value actually changes, as that percolates up the > >>>>>> tree. Where the reduce value does not change and when neither > >>>> transaction > >>>>>> causes a split, rebalance, or merge that affects the other, they should > >>>>>> both commit without conflict. There is a rich history of optimizations > >>>> in > >>>>>> this space specifically around btrees. ebtree might cause spurious > >>>>>> conflicts today, I will investigate and propose fixes if so (e.g, I > >>>> think I > >>>>>> probably call erlfdb:set on nodes that have not changed). > >>>>>> > >>>>>> I certainly envisaged updating ebtree within the same txn as a doc > >>>> update, > >>>>>> which is why the first argument to all the public functions can be > >>>> either > >>>>>> an erlfdb Db or open Tx. > >>>>>> > >>>>>> Parallelising the initial build is more difficult with ebtree than > >>>>>> parallelising the existing couch_views map-only code, though it's worth > >>>>>> noting that ebtree:insert benefits a great deal from batching (multiple > >>>>>> calls to :insert in the same transaction). For an offline build (i.e, > >>>> an > >>>>>> index that the client cannot see until the entire build is complete) > >>>> the > >>>>>> batch size can be maximised. That is still a serial process in that > >>>> there > >>>>>> is only one transaction at a time updating the ebtree. I can't say > >>>> offhand > >>>>>> how fast that is in practice, but it is clearly less powerful than a > >>>> fully > >>>>>> parallelisable approach could be. > >>>>>> > >>>>>> Any parallel build would require a way to divide the database into > >>>>>> non-overlapping subsets of emitted keys. This is easy and natural if > >>>> the > >>>>>> fdb key is the emitted key, which is the case for the couch_views > >>>> map-only > >>>>>> code. For ebtree it might be enough to simply grab a large chunk of > >>>>>> documents, perform the map transform, and then issues multiple > >>>> transactions > >>>>>> on subsets of those. > >>>>>> > >>>>>> Another common technique for btrees is bulk loading (more or less > >>>>>> literally constructing the btree nodes directly from the source, as > >>>> long as > >>>>>> you can sort it), which might be an option as well. > >>>>>> > >>>>>> Parallelising a build _with_ a reduce function seems hard however we do > >>>>>> it. The non-ebtree approach is parallelisable by virtue of paring down > >>>> the > >>>>>> reduce functionality itself (only whole key groups, only those > >>>> functions > >>>>>> that fdb has atomic operations for). > >>>>>> > >>>>>> I will first of all verify the multi-writer nature of ebtree as it > >>>> stands > >>>>>> today and make a PR which fixes any spurious conflicts, and then ponder > >>>>>> further how true parallel builds might be possible. > >>>>>> > >>>>>> > >>>>>>> On 24 Jul 2020, at 07:30, Garren Smith <gar...@apache.org> wrote: > >>>>>>> > >>>>>>> We haven't spoken much about updates with ebtree. > >>>>>>> From my understanding of ebtree it can only do single writer, is that > >>>>>>> correct?. If that is true it means we would not be able to > >>>> parallelize > >>>>>> the > >>>>>>> background building of views to speed up view builds. > >>>>>>> The other thing that would mean is that we cannot use it for mango > >>>> where > >>>>>> we > >>>>>>> update the view in the doc transaction. > >>>>>>> > >>>>>>> Cheers > >>>>>>> Garren > >>>>>>> > >>>>>>> On Fri, Jul 24, 2020 at 2:35 AM Kyle Snavely <kjsnav...@gmail.com> > >>>>>> wrote: > >>>>>>> > >>>>>>>> When in doubt, throw us a build at Cloudant with ebtree maps and > >>>> we'll > >>>>>> see > >>>>>>>> if it comes close to the crazy fast KV map queries. > >>>>>>>> > >>>>>>>> Kyle > >>>>>>>> > >>>>>>>> On Thu, Jul 23, 2020, 2:17 PM Robert Samuel Newson < > >>>> rnew...@apache.org> > >>>>>>>> wrote: > >>>>>>>> > >>>>>>>>> I (perhaps obviously) don't agree that I'm tying myself to old > >>>> CouchDB > >>>>>> or > >>>>>>>>> failing to embrace FDB. FDB is a toolkit and does not, to my mind, > >>>>>> force > >>>>>>>> us > >>>>>>>>> down any particular path. > >>>>>>>>> > >>>>>>>>> I haven't sat down to modify couch_views in the manner I've > >>>> suggested > >>>>>>>>> (where ebtree is used as designed; being fed the output of the > >>>> emit() > >>>>>>>> calls > >>>>>>>>> and calculating reductions as it does so) but I think it's a > >>>> worthwhile > >>>>>>>>> exercise. I'd be surprised if performance of map-only traversals > >>>> would > >>>>>> be > >>>>>>>>> disappointing but who knows? I also expect it would allow for > >>>>>> significant > >>>>>>>>> simplification of the code, which is one of the higher virtues. > >>>>>>>>> > >>>>>>>>> Adam, can you describe in a little more detail how you picture > >>>> "b+tree > >>>>>> is > >>>>>>>>> only used for incremental aggregations,"? It's implied in your > >>>> reply > >>>>>> that > >>>>>>>>> it would preserve the "interesting property" of keeping user data > >>>> out > >>>>>> of > >>>>>>>>> FDB Keys (for casual readers: the new native database encryption, > >>>>>> called > >>>>>>>>> "aegis", only encrypts the FDB value. It can't encrypt the key as > >>>> this > >>>>>>>>> would change the order of keys, which the current code depends on). > >>>>>> Did I > >>>>>>>>> misread you? > >>>>>>>>> > >>>>>>>>> B. > >>>>>>>>> > >>>>>>>>>> On 23 Jul 2020, at 20:11, Adam Kocoloski <kocol...@apache.org> > >>>> wrote: > >>>>>>>>>> > >>>>>>>>>> OK thanks for the clarification. As I said I wasn’t all that > >>>> confident > >>>>>>>> I > >>>>>>>>> understood the design :) > >>>>>>>>>> > >>>>>>>>>> I like the idea that the b+tree is only used for incremental > >>>>>>>>> aggregations, rather than storing the entire materialized view, > >>>> for the > >>>>>>>>> same reasons that Garren stated. > >>>>>>>>>> > >>>>>>>>>> An index maintained entirely in ebtree has the interesting > >>>> property > >>>>>>>> that > >>>>>>>>> it does not leak any user data into FDB Keys, which could be > >>>> attractive > >>>>>>>> for > >>>>>>>>> security reasons. > >>>>>>>>>> > >>>>>>>>>> Adam > >>>>>>>>>> > >>>>>>>>>>> On Jul 23, 2020, at 1:54 PM, Garren Smith <gar...@apache.org> > >>>> wrote: > >>>>>>>>>>> > >>>>>>>>>>> On Thu, Jul 23, 2020 at 6:55 PM Paul Davis < > >>>>>>>> paul.joseph.da...@gmail.com > >>>>>>>>>> > >>>>>>>>>>> wrote: > >>>>>>>>>>> > >>>>>>>>>>>>> I would like to keep ebtree to use just for the reduce index. > >>>>>>>>>>>> > >>>>>>>>>>>> Could you expand on your reasoning here, Garren? I haven't done > >>>> any > >>>>>>>>>>>> experiments on my own to understand if I'm missing something > >>>>>>>>>>>> important. My initial reaction is to not diverge too far from > >>>> the > >>>>>>>>>>>> previous shape of the implementation since we have a decent > >>>> idea of > >>>>>>>>>>>> how that behaves already but perhaps you've seen or measured > >>>>>>>> something > >>>>>>>>>>>> I'm not thinking of? > >>>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> I think this must have been a misunderstanding on my part. I > >>>> always > >>>>>>>>> thought > >>>>>>>>>>> of using ebtree to solve reduce and wasn't planning to use it > >>>> for the > >>>>>>>>> map > >>>>>>>>>>> index. > >>>>>>>>>>> I don't like the idea that we have ordered distributed key/value > >>>>>> store > >>>>>>>>> and > >>>>>>>>>>> then implementing a b-tree on top of that for indexing. > >>>> Especially > >>>>>>>>> since we > >>>>>>>>>>> know that the current map index is fast, > >>>>>>>>>>> easy to use, and follows recommended practices in the FDB > >>>> community > >>>>>> on > >>>>>>>>> how > >>>>>>>>>>> to do indexing. ebtree makes sense for reduce and is a means to > >>>> an > >>>>>> end > >>>>>>>>> to > >>>>>>>>>>> give us CouchDB's reduce api, which is heavily reliant on a > >>>> b-tree, > >>>>>>>> with > >>>>>>>>>>> CouchDB on FDB. This feels like a step backward and I worry we > >>>> are > >>>>>>>> tying > >>>>>>>>>>> ourselves heavily to old CouchDB instead of using the fact that > >>>> we > >>>>>>>>> moving > >>>>>>>>>>> to FDB which then allows us to design new api's and > >>>> functionality. > >>>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>> > >>>>>> > >>>>>> > >>>>> > >>>> > >>> > >