Hi Bob, That makes sense.
Cheers Garren On Sun, Jul 26, 2020 at 2:57 PM Robert Newson <rnew...@apache.org> wrote: > > Both approaches will collate keys the same way, it’s a showstopping bug if > they don’t match existing couch collation order. > > I’d probably not even document that right now. We can document what aegis > does and doesn’t cover when we make the first 4.0 RC. > -- > Robert Samuel Newson > rnew...@apache.org > > On Sun, 26 Jul 2020, at 13:27, Jan Lehnardt wrote: > > Heya Bob, > > > > this sounds like a good path forward that allows us to validate the > > remaining pieces of this. > > > > Just two side notes: > > > > - do we expect that the map-keys-in-fdb approach would sort differently > > from map-keys-in-ebtree? (I assume no) > > - I further assume that we’ll find a nicer UX for “make sure my views > > are encrypted” than “make sure your view has a reduce fun”, no details > > or drafts necessary at this point. > > > > Best > > Jan > > — > > > > > On 26. Jul 2020, at 14:13, Robert Samuel Newson <rnew...@apache.org> > wrote: > > > > > > I'd like to find consensus on a path forward and, with that in mind, I > have a suggestion. > > > > > > Before that, I think it's useful to lay out the points of contention, > which I've ranked in I hope an uncontroversial order; > > > > > > The principal concern with using ebtree for map as well as reduce is > that it precludes the use of foundationdb's ordering, on which much > performance and scalability likely hinges. I agree with this on principle > (that is, its demonstrably true) though none of us yet know the extent to > which it matters in practice, or whether any gap could be closed with > optimization. > > > > > > The secondary concern is the proposed use of ebtree for reduce-only > where it receives only the reduce values from a document. I understand > garren's stated desire to avoid duplication with the map side but I don't > think it works out. Users of the reduce function are using it to get the > reductions _across_ documents. It is rare (that is, I don't recall seeing > an example by a cloudant customer) to emit multiple rows in a map function > for the same key. In the cases where it doesn't, the value inserted into > ebtree is the same as the map value anyway. Where it does happen, we're > still duplicating the key and _a_ document level value. It seems a very > modest saving of space for a more complicated design (two separate levels > of reduce calculated in two separate modules). > > > > > > The tertiary concern is that the map-only indexes place user data > outside of aegis (native database encryption) as, necessarily, the order of > keys is important and a straightforward encryption there would not preserve > it. > > > > > > Looking at all the options available I don't think there is a single > path that addresses everything. There are also significant unknowns around > practical performance for all of this. > > > > > > So, I suggest that the map-only code (specifically, the code we will > use for an index definition that does not specify a reduce function) uses > the existing couch_views approach, where the fdb keys are suffixed with the > emitted key from the map function, and the fdb values are the emitted > values. This vertical index should support the fastest ?key= and > startkey/endkey= queries that fdb is capable of. There might be > simplifications we can make to this code if we accept that it only needs to > handle map-only indexes. > > > > > > For index definitions that _do_ specify a reduce function, I suggest > this is done entirely with ebtree.We would call ebtree:insert(Db, Tree, > Key, Value) where Key and Value are the emitted key and value from the map > function. ebtree:open is handed a reduce_fun which handles the reduce and > rereduce for the users reduction and, like couchdb 1/2/3, we also mix in a > system reduction that calculates the view size (paul's diff above appears > to do just this portion and looks broadly right). > > > > > > This avoids the duplication that garren's idea of storing the reduces > in ebtree was trying to avoid, but does so in all cases. This approach > allows us to compare the performance of map queries against map-only and > map-reduce indexes, it allows users to opt in or out of encrypting their > emitted view keys, and it doesn't prevent anyone from choosing to do both > (You can define a view with map function A and another view with map > function A and reduce function B, the former will be the couch_views > vertical index, the latter an ebtree "horizontal" index). > > > > > > As we learn more over time we could enhance one type of index or the > other and reach a point where we eliminate one, or the two could coexist > indefinitely. > > > > > >> On 24 Jul 2020, at 20:00, Paul Davis <paul.joseph.da...@gmail.com> > wrote: > > >> > > >> FWIW, a first pass at views entirely on ebtree turned out to be fairly > > >> straightforward. Almost surprisingly simple in some cases. > > >> > > >> > https://github.com/apache/couchdb/compare/prototype/fdb-layer...prototype/fdb-layer-ebtree-views > > >> > > >> Its currently passing all tests in `couch_views_map_test.erl` but is > > >> failing on other suites. I only did a quick skim on the failures but > > >> they all look superficial around some APIs I changed. > > >> > > >> I haven't added the APIs to query reduce functions via HTTP but the > > >> reduce functions are being executed to calculate row counts and KV > > >> sizes. Adding the builtin reduce functions and extending those to user > > >> defined reduce functions should be straightforward. > > >> > > >> On Fri, Jul 24, 2020 at 9:39 AM Robert Newson <rnew...@apache.org> > wrote: > > >>> > > >>> > > >>> I’m happy to restrict my PR comments to the actual diff, yes. So I’m > not +1 yet. > > >>> > > >>> I fixed the spurious conflicts at > https://github.com/apache/couchdb/pull/3033. > > >>> > > >>> -- > > >>> Robert Samuel Newson > > >>> rnew...@apache.org > > >>> > > >>> On Fri, 24 Jul 2020, at 14:59, Garren Smith wrote: > > >>>> Ok so just to confirm, we keep my PR as-is with ebtree only for > reduce. We > > >>>> can get that ready to merge into fdb master. We can then use that > to battle > > >>>> test ebtree and then look at using it for the map side as well. At > that > > >>>> point we would combine the reduce and map index into a ebtree > index. Are > > >>>> you happy with that? > > >>>> > > >>>> Cheers > > >>>> Garren > > >>>> > > >>>> On Fri, Jul 24, 2020 at 3:48 PM Robert Newson <rnew...@apache.org> > wrote: > > >>>> > > >>>>> Hi, > > >>>>> > > >>>>> It’s not as unknown as you think but certainly we need empirical > data to > > >>>>> guide us on the reduce side. I’m also fine with continuing with the > > >>>>> map-only code as it stands today until such time as we demonstrate > ebtree > > >>>>> meets or exceeds our needs (and I freely accept the possibility > that it > > >>>>> might not). > > >>>>> > > >>>>> I think the principal enhancement to ebtree that would address most > > >>>>> concerns is if it could store the leaf entries vertically (as they > are > > >>>>> currently). I have some thoughts which I’ll try to realise as > working code. > > >>>>> > > >>>>> I’ve confirmed that I do create spurious conflicts and will have a > PR up > > >>>>> today to fix that. > > >>>>> > > >>>>> -- > > >>>>> Robert Samuel Newson > > >>>>> rnew...@apache.org > > >>>>> > > >>>>> On Fri, 24 Jul 2020, at 12:43, Garren Smith wrote: > > >>>>>> Hi Bob, > > >>>>>> > > >>>>>> Thanks for that explanation, that is really helpful and it is > good we > > >>>>> have > > >>>>>> some options but it also does highlight a lot of unknowns. > Whereas our > > >>>>>> current map indexes are really simple and we know its behaviour. > There > > >>>>> are > > >>>>>> no real unknowns. Adding ebtree here could make map indexes a > fair bit > > >>>>> more > > >>>>>> complicated since we don't know the effect of managing different > node > > >>>>>> sizes, concurrent doc updates, and querying performance. > > >>>>>> > > >>>>>> Could we do a compromise here, could we look at using ebtree only > for > > >>>>>> reduces now? That means that we can have reduce indexes working > quite > > >>>>> soon > > >>>>>> on CouchDB on FDB. At the same time, we can work on ebtree and run > > >>>>>> performance tests on ebtree for map indexes. Some interesting > tests we > > >>>>> can > > >>>>>> do is see if a user emits KV's near the limits (8KB for keys and > 50KB for > > >>>>>> values) how does ebtree handle that? How it handles being updated > in the > > >>>>>> doc update transaction. And general query performance. What do > you think? > > >>>>>> > > >>>>>> Cheers > > >>>>>> Garren > > >>>>>> > > >>>>>> > > >>>>>> On Fri, Jul 24, 2020 at 10:06 AM Robert Samuel Newson < > > >>>>> rnew...@apache.org> > > >>>>>> wrote: > > >>>>>> > > >>>>>>> Hi, > > >>>>>>> > > >>>>>>> A short preface at this stage is needed I think: > > >>>>>>> > > >>>>>>> My goal with ebtree was to implement a complete and correct > b+tree that > > >>>>>>> also calculated and maintained inner reductions. I consciously > chose > > >>>>> not to > > >>>>>>> go further than that before presenting it to the group for wider > debate > > >>>>>>> (indeed, the very debate we're having in this thread). > > >>>>>>> > > >>>>>>> -- > > >>>>>>> > > >>>>>>> Ebtree is not single writer, at least not inherently. Two > updates to > > >>>>> the > > >>>>>>> same ebtree should both succeed as long as they don't modify the > same > > >>>>> nodes. > > >>>>>>> > > >>>>>>> Modifying the same node is likely where there's a reduce > function, > > >>>>> though > > >>>>>>> only if the reduction value actually changes, as that percolates > up the > > >>>>>>> tree. Where the reduce value does not change and when neither > > >>>>> transaction > > >>>>>>> causes a split, rebalance, or merge that affects the other, they > should > > >>>>>>> both commit without conflict. There is a rich history of > optimizations > > >>>>> in > > >>>>>>> this space specifically around btrees. ebtree might cause > spurious > > >>>>>>> conflicts today, I will investigate and propose fixes if so > (e.g, I > > >>>>> think I > > >>>>>>> probably call erlfdb:set on nodes that have not changed). > > >>>>>>> > > >>>>>>> I certainly envisaged updating ebtree within the same txn as a > doc > > >>>>> update, > > >>>>>>> which is why the first argument to all the public functions can > be > > >>>>> either > > >>>>>>> an erlfdb Db or open Tx. > > >>>>>>> > > >>>>>>> Parallelising the initial build is more difficult with ebtree > than > > >>>>>>> parallelising the existing couch_views map-only code, though > it's worth > > >>>>>>> noting that ebtree:insert benefits a great deal from batching > (multiple > > >>>>>>> calls to :insert in the same transaction). For an offline build > (i.e, > > >>>>> an > > >>>>>>> index that the client cannot see until the entire build is > complete) > > >>>>> the > > >>>>>>> batch size can be maximised. That is still a serial process in > that > > >>>>> there > > >>>>>>> is only one transaction at a time updating the ebtree. I can't > say > > >>>>> offhand > > >>>>>>> how fast that is in practice, but it is clearly less powerful > than a > > >>>>> fully > > >>>>>>> parallelisable approach could be. > > >>>>>>> > > >>>>>>> Any parallel build would require a way to divide the database > into > > >>>>>>> non-overlapping subsets of emitted keys. This is easy and > natural if > > >>>>> the > > >>>>>>> fdb key is the emitted key, which is the case for the couch_views > > >>>>> map-only > > >>>>>>> code. For ebtree it might be enough to simply grab a large chunk > of > > >>>>>>> documents, perform the map transform, and then issues multiple > > >>>>> transactions > > >>>>>>> on subsets of those. > > >>>>>>> > > >>>>>>> Another common technique for btrees is bulk loading (more or less > > >>>>>>> literally constructing the btree nodes directly from the source, > as > > >>>>> long as > > >>>>>>> you can sort it), which might be an option as well. > > >>>>>>> > > >>>>>>> Parallelising a build _with_ a reduce function seems hard > however we do > > >>>>>>> it. The non-ebtree approach is parallelisable by virtue of > paring down > > >>>>> the > > >>>>>>> reduce functionality itself (only whole key groups, only those > > >>>>> functions > > >>>>>>> that fdb has atomic operations for). > > >>>>>>> > > >>>>>>> I will first of all verify the multi-writer nature of ebtree as > it > > >>>>> stands > > >>>>>>> today and make a PR which fixes any spurious conflicts, and then > ponder > > >>>>>>> further how true parallel builds might be possible. > > >>>>>>> > > >>>>>>> > > >>>>>>>> On 24 Jul 2020, at 07:30, Garren Smith <gar...@apache.org> > wrote: > > >>>>>>>> > > >>>>>>>> We haven't spoken much about updates with ebtree. > > >>>>>>>> From my understanding of ebtree it can only do single writer, > is that > > >>>>>>>> correct?. If that is true it means we would not be able to > > >>>>> parallelize > > >>>>>>> the > > >>>>>>>> background building of views to speed up view builds. > > >>>>>>>> The other thing that would mean is that we cannot use it for > mango > > >>>>> where > > >>>>>>> we > > >>>>>>>> update the view in the doc transaction. > > >>>>>>>> > > >>>>>>>> Cheers > > >>>>>>>> Garren > > >>>>>>>> > > >>>>>>>> On Fri, Jul 24, 2020 at 2:35 AM Kyle Snavely < > kjsnav...@gmail.com> > > >>>>>>> wrote: > > >>>>>>>> > > >>>>>>>>> When in doubt, throw us a build at Cloudant with ebtree maps > and > > >>>>> we'll > > >>>>>>> see > > >>>>>>>>> if it comes close to the crazy fast KV map queries. > > >>>>>>>>> > > >>>>>>>>> Kyle > > >>>>>>>>> > > >>>>>>>>> On Thu, Jul 23, 2020, 2:17 PM Robert Samuel Newson < > > >>>>> rnew...@apache.org> > > >>>>>>>>> wrote: > > >>>>>>>>> > > >>>>>>>>>> I (perhaps obviously) don't agree that I'm tying myself to old > > >>>>> CouchDB > > >>>>>>> or > > >>>>>>>>>> failing to embrace FDB. FDB is a toolkit and does not, to my > mind, > > >>>>>>> force > > >>>>>>>>> us > > >>>>>>>>>> down any particular path. > > >>>>>>>>>> > > >>>>>>>>>> I haven't sat down to modify couch_views in the manner I've > > >>>>> suggested > > >>>>>>>>>> (where ebtree is used as designed; being fed the output of the > > >>>>> emit() > > >>>>>>>>> calls > > >>>>>>>>>> and calculating reductions as it does so) but I think it's a > > >>>>> worthwhile > > >>>>>>>>>> exercise. I'd be surprised if performance of map-only > traversals > > >>>>> would > > >>>>>>> be > > >>>>>>>>>> disappointing but who knows? I also expect it would allow for > > >>>>>>> significant > > >>>>>>>>>> simplification of the code, which is one of the higher > virtues. > > >>>>>>>>>> > > >>>>>>>>>> Adam, can you describe in a little more detail how you picture > > >>>>> "b+tree > > >>>>>>> is > > >>>>>>>>>> only used for incremental aggregations,"? It's implied in your > > >>>>> reply > > >>>>>>> that > > >>>>>>>>>> it would preserve the "interesting property" of keeping user > data > > >>>>> out > > >>>>>>> of > > >>>>>>>>>> FDB Keys (for casual readers: the new native database > encryption, > > >>>>>>> called > > >>>>>>>>>> "aegis", only encrypts the FDB value. It can't encrypt the > key as > > >>>>> this > > >>>>>>>>>> would change the order of keys, which the current code > depends on). > > >>>>>>> Did I > > >>>>>>>>>> misread you? > > >>>>>>>>>> > > >>>>>>>>>> B. > > >>>>>>>>>> > > >>>>>>>>>>> On 23 Jul 2020, at 20:11, Adam Kocoloski < > kocol...@apache.org> > > >>>>> wrote: > > >>>>>>>>>>> > > >>>>>>>>>>> OK thanks for the clarification. As I said I wasn’t all that > > >>>>> confident > > >>>>>>>>> I > > >>>>>>>>>> understood the design :) > > >>>>>>>>>>> > > >>>>>>>>>>> I like the idea that the b+tree is only used for incremental > > >>>>>>>>>> aggregations, rather than storing the entire materialized > view, > > >>>>> for the > > >>>>>>>>>> same reasons that Garren stated. > > >>>>>>>>>>> > > >>>>>>>>>>> An index maintained entirely in ebtree has the interesting > > >>>>> property > > >>>>>>>>> that > > >>>>>>>>>> it does not leak any user data into FDB Keys, which could be > > >>>>> attractive > > >>>>>>>>> for > > >>>>>>>>>> security reasons. > > >>>>>>>>>>> > > >>>>>>>>>>> Adam > > >>>>>>>>>>> > > >>>>>>>>>>>> On Jul 23, 2020, at 1:54 PM, Garren Smith < > gar...@apache.org> > > >>>>> wrote: > > >>>>>>>>>>>> > > >>>>>>>>>>>> On Thu, Jul 23, 2020 at 6:55 PM Paul Davis < > > >>>>>>>>> paul.joseph.da...@gmail.com > > >>>>>>>>>>> > > >>>>>>>>>>>> wrote: > > >>>>>>>>>>>> > > >>>>>>>>>>>>>> I would like to keep ebtree to use just for the reduce > index. > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> Could you expand on your reasoning here, Garren? I haven't > done > > >>>>> any > > >>>>>>>>>>>>> experiments on my own to understand if I'm missing > something > > >>>>>>>>>>>>> important. My initial reaction is to not diverge too far > from > > >>>>> the > > >>>>>>>>>>>>> previous shape of the implementation since we have a decent > > >>>>> idea of > > >>>>>>>>>>>>> how that behaves already but perhaps you've seen or > measured > > >>>>>>>>> something > > >>>>>>>>>>>>> I'm not thinking of? > > >>>>>>>>>>>>> > > >>>>>>>>>>>> > > >>>>>>>>>>>> I think this must have been a misunderstanding on my part. I > > >>>>> always > > >>>>>>>>>> thought > > >>>>>>>>>>>> of using ebtree to solve reduce and wasn't planning to use > it > > >>>>> for the > > >>>>>>>>>> map > > >>>>>>>>>>>> index. > > >>>>>>>>>>>> I don't like the idea that we have ordered distributed > key/value > > >>>>>>> store > > >>>>>>>>>> and > > >>>>>>>>>>>> then implementing a b-tree on top of that for indexing. > > >>>>> Especially > > >>>>>>>>>> since we > > >>>>>>>>>>>> know that the current map index is fast, > > >>>>>>>>>>>> easy to use, and follows recommended practices in the FDB > > >>>>> community > > >>>>>>> on > > >>>>>>>>>> how > > >>>>>>>>>>>> to do indexing. ebtree makes sense for reduce and is a > means to > > >>>>> an > > >>>>>>> end > > >>>>>>>>>> to > > >>>>>>>>>>>> give us CouchDB's reduce api, which is heavily reliant on a > > >>>>> b-tree, > > >>>>>>>>> with > > >>>>>>>>>>>> CouchDB on FDB. This feels like a step backward and I worry > we > > >>>>> are > > >>>>>>>>> tying > > >>>>>>>>>>>> ourselves heavily to old CouchDB instead of using the fact > that > > >>>>> we > > >>>>>>>>>> moving > > >>>>>>>>>>>> to FDB which then allows us to design new api's and > > >>>>> functionality. > > >>>>>>>>>>> > > >>>>>>>>>> > > >>>>>>>>>> > > >>>>>>>>> > > >>>>>>> > > >>>>>>> > > >>>>>> > > >>>>> > > >>>> > > > > > > > >