Re: [DISCUSS] Reduce on FDB take 3

Jan Lehnardt Mon, 27 Jul 2020 07:36:59 -0700

Thanks for the additional context. Given that our JS-keys are already
when it comes to floats/precision vs. Erlang numbers, I’m not concerned :D


> On 27. Jul 2020, at 16:35, Paul Davis <paul.joseph.da...@gmail.com> wrote:
> 
> In terms of collation there will be differences in some edge cases.
> Numbers in particular are all converted to floats which means that
> extreme values that don't fit into a float (though Erlang floats are
> doubles) will have a truncation that reduces resolution. Other than
> that I'm pretty sure it'd be the same. There might be some issues
> around the ICU library used for sort keys although both
> implementations would suffer from issues there, but I think it'd be
> slightly different issues.
> 
> Paul
> 
> On Mon, Jul 27, 2020 at 2:28 AM Garren Smith <gar...@apache.org> wrote:
>> 
>> Hi Bob,
>> 
>> That makes sense.
>> 
>> Cheers
>> Garren
>> 
>> On Sun, Jul 26, 2020 at 2:57 PM Robert Newson <rnew...@apache.org> wrote:
>> 
>>> 
>>> Both approaches will collate keys the same way, it’s a showstopping bug if
>>> they don’t match existing couch collation order.
>>> 
>>> I’d probably not even document that right now. We can document what aegis
>>> does and doesn’t cover when we make the first 4.0 RC.
>>> --
>>>  Robert Samuel Newson
>>>  rnew...@apache.org
>>> 
>>> On Sun, 26 Jul 2020, at 13:27, Jan Lehnardt wrote:
>>>> Heya Bob,
>>>> 
>>>> this sounds like a good path forward that allows us to validate the
>>>> remaining pieces of this.
>>>> 
>>>> Just two side notes:
>>>> 
>>>> - do we expect that the map-keys-in-fdb approach would sort differently
>>>> from map-keys-in-ebtree? (I assume no)
>>>> - I further assume that we’ll find a nicer UX for “make sure my views
>>>> are encrypted” than “make sure your view has a reduce fun”, no details
>>>> or drafts necessary at this point.
>>>> 
>>>> Best
>>>> Jan
>>>> —
>>>> 
>>>>> On 26. Jul 2020, at 14:13, Robert Samuel Newson <rnew...@apache.org>
>>> wrote:
>>>>> 
>>>>> I'd like to find consensus on a path forward and, with that in mind, I
>>> have a suggestion.
>>>>> 
>>>>> Before that, I think it's useful to lay out the points of contention,
>>> which I've ranked in I hope an uncontroversial order;
>>>>> 
>>>>> The principal concern with using ebtree for map as well as reduce is
>>> that it precludes the use of foundationdb's ordering, on which much
>>> performance and scalability likely hinges. I agree with this on principle
>>> (that is, its demonstrably true) though none of us yet know the extent to
>>> which it matters in practice, or whether any gap could be closed with
>>> optimization.
>>>>> 
>>>>> The secondary concern is the proposed use of ebtree for reduce-only
>>> where it receives only the reduce values from a document. I understand
>>> garren's stated desire to avoid duplication with the map side but I don't
>>> think it works out. Users of the reduce function are using it to get the
>>> reductions _across_ documents. It is rare (that is, I don't recall seeing
>>> an example by a cloudant customer) to emit multiple rows in a map function
>>> for the same key. In the cases where it doesn't, the value inserted into
>>> ebtree is the same as the map value anyway. Where it does happen, we're
>>> still duplicating the key and _a_ document level value. It seems a very
>>> modest saving of space for a more complicated design (two separate levels
>>> of reduce calculated in two separate modules).
>>>>> 
>>>>> The tertiary concern is that the map-only indexes place user data
>>> outside of aegis (native database encryption) as, necessarily, the order of
>>> keys is important and a straightforward encryption there would not preserve
>>> it.
>>>>> 
>>>>> Looking at all the options available I don't think there is a single
>>> path that addresses everything. There are also significant unknowns around
>>> practical performance for all of this.
>>>>> 
>>>>> So, I suggest that the map-only code (specifically, the code we will
>>> use for an index definition that does not specify a reduce function) uses
>>> the existing couch_views approach, where the fdb keys are suffixed with the
>>> emitted key from the map function, and the fdb values are the emitted
>>> values. This vertical index should support the fastest ?key= and
>>> startkey/endkey= queries that fdb is capable of. There might be
>>> simplifications we can make to this code if we accept that it only needs to
>>> handle map-only indexes.
>>>>> 
>>>>> For index definitions that _do_ specify a reduce function, I suggest
>>> this is done entirely with ebtree.We would call ebtree:insert(Db, Tree,
>>> Key, Value) where Key and Value are the emitted key and value from the map
>>> function. ebtree:open is handed a reduce_fun which handles the reduce and
>>> rereduce for the users reduction and, like couchdb 1/2/3, we also mix in a
>>> system reduction that calculates the view size (paul's diff above appears
>>> to do just this portion and looks broadly right).
>>>>> 
>>>>> This avoids the duplication that garren's idea of storing the reduces
>>> in ebtree was trying to avoid, but does so in all cases. This approach
>>> allows us to compare the performance of map queries against map-only and
>>> map-reduce indexes, it allows users to opt in or out of encrypting their
>>> emitted view keys, and it doesn't prevent anyone from choosing to do both
>>> (You can define a view with map function A and another view with map
>>> function A and reduce function B, the former will be the couch_views
>>> vertical index, the latter an ebtree "horizontal" index).
>>>>> 
>>>>> As we learn more over time we could enhance one type of index or the
>>> other and reach a point where we eliminate one, or the two could coexist
>>> indefinitely.
>>>>> 
>>>>>> On 24 Jul 2020, at 20:00, Paul Davis <paul.joseph.da...@gmail.com>
>>> wrote:
>>>>>> 
>>>>>> FWIW, a first pass at views entirely on ebtree turned out to be fairly
>>>>>> straightforward. Almost surprisingly simple in some cases.
>>>>>> 
>>>>>> 
>>> https://github.com/apache/couchdb/compare/prototype/fdb-layer...prototype/fdb-layer-ebtree-views
>>>>>> 
>>>>>> Its currently passing all tests in `couch_views_map_test.erl` but is
>>>>>> failing on other suites. I only did a quick skim on the failures but
>>>>>> they all look superficial around some APIs I changed.
>>>>>> 
>>>>>> I haven't added the APIs to query reduce functions via HTTP but the
>>>>>> reduce functions are being executed to calculate row counts and KV
>>>>>> sizes. Adding the builtin reduce functions and extending those to user
>>>>>> defined reduce functions should be straightforward.
>>>>>> 
>>>>>> On Fri, Jul 24, 2020 at 9:39 AM Robert Newson <rnew...@apache.org>
>>> wrote:
>>>>>>> 
>>>>>>> 
>>>>>>> I’m happy to restrict my PR comments to the actual diff, yes. So I’m
>>> not +1 yet.
>>>>>>> 
>>>>>>> I fixed the spurious conflicts at
>>> https://github.com/apache/couchdb/pull/3033.
>>>>>>> 
>>>>>>> --
>>>>>>> Robert Samuel Newson
>>>>>>> rnew...@apache.org
>>>>>>> 
>>>>>>> On Fri, 24 Jul 2020, at 14:59, Garren Smith wrote:
>>>>>>>> Ok so just to confirm, we keep my PR as-is with ebtree only for
>>> reduce. We
>>>>>>>> can get that ready to merge into fdb master. We can then use that
>>> to battle
>>>>>>>> test ebtree and then look at using it for the map side as well. At
>>> that
>>>>>>>> point we would combine the reduce and map index into a ebtree
>>> index. Are
>>>>>>>> you happy with that?
>>>>>>>> 
>>>>>>>> Cheers
>>>>>>>> Garren
>>>>>>>> 
>>>>>>>> On Fri, Jul 24, 2020 at 3:48 PM Robert Newson <rnew...@apache.org>
>>> wrote:
>>>>>>>> 
>>>>>>>>> Hi,
>>>>>>>>> 
>>>>>>>>> It’s not as unknown as you think but certainly we need empirical
>>> data to
>>>>>>>>> guide us on the reduce side. I’m also fine with continuing with the
>>>>>>>>> map-only code as it stands today until such time as we demonstrate
>>> ebtree
>>>>>>>>> meets or exceeds our needs (and I freely accept the possibility
>>> that it
>>>>>>>>> might not).
>>>>>>>>> 
>>>>>>>>> I think the principal enhancement to ebtree that would address most
>>>>>>>>> concerns is if it could store the leaf entries vertically (as they
>>> are
>>>>>>>>> currently). I have some thoughts which I’ll try to realise as
>>> working code.
>>>>>>>>> 
>>>>>>>>> I’ve confirmed that I do create spurious conflicts and will have a
>>> PR up
>>>>>>>>> today to fix that.
>>>>>>>>> 
>>>>>>>>> --
>>>>>>>>> Robert Samuel Newson
>>>>>>>>> rnew...@apache.org
>>>>>>>>> 
>>>>>>>>> On Fri, 24 Jul 2020, at 12:43, Garren Smith wrote:
>>>>>>>>>> Hi Bob,
>>>>>>>>>> 
>>>>>>>>>> Thanks for that explanation, that is really helpful and it is
>>> good we
>>>>>>>>> have
>>>>>>>>>> some options but it also does highlight a lot of unknowns.
>>> Whereas our
>>>>>>>>>> current map indexes are really simple and we know its behaviour.
>>> There
>>>>>>>>> are
>>>>>>>>>> no real unknowns. Adding ebtree here could make map indexes a
>>> fair bit
>>>>>>>>> more
>>>>>>>>>> complicated since we don't know the effect of managing different
>>> node
>>>>>>>>>> sizes, concurrent doc updates, and querying performance.
>>>>>>>>>> 
>>>>>>>>>> Could we do a compromise here, could we look at using ebtree only
>>> for
>>>>>>>>>> reduces now? That means that we can have reduce indexes working
>>> quite
>>>>>>>>> soon
>>>>>>>>>> on CouchDB on FDB. At the same time, we can work on ebtree and run
>>>>>>>>>> performance tests on ebtree for map indexes. Some interesting
>>> tests we
>>>>>>>>> can
>>>>>>>>>> do is see if a user emits KV's near the limits (8KB for keys and
>>> 50KB for
>>>>>>>>>> values) how does ebtree handle that? How it handles being updated
>>> in the
>>>>>>>>>> doc update transaction. And general query performance. What do
>>> you think?
>>>>>>>>>> 
>>>>>>>>>> Cheers
>>>>>>>>>> Garren
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Fri, Jul 24, 2020 at 10:06 AM Robert Samuel Newson <
>>>>>>>>> rnew...@apache.org>
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Hi,
>>>>>>>>>>> 
>>>>>>>>>>> A short preface at this stage is needed I think:
>>>>>>>>>>> 
>>>>>>>>>>> My goal with ebtree was to implement a complete and correct
>>> b+tree that
>>>>>>>>>>> also calculated and maintained inner reductions. I consciously
>>> chose
>>>>>>>>> not to
>>>>>>>>>>> go further than that before presenting it to the group for wider
>>> debate
>>>>>>>>>>> (indeed, the very debate we're having in this thread).
>>>>>>>>>>> 
>>>>>>>>>>> --
>>>>>>>>>>> 
>>>>>>>>>>> Ebtree is not single writer, at least not inherently. Two
>>> updates to
>>>>>>>>> the
>>>>>>>>>>> same ebtree should both succeed as long as they don't modify the
>>> same
>>>>>>>>> nodes.
>>>>>>>>>>> 
>>>>>>>>>>> Modifying the same node is likely where there's a reduce
>>> function,
>>>>>>>>> though
>>>>>>>>>>> only if the reduction value actually changes, as that percolates
>>> up the
>>>>>>>>>>> tree. Where the reduce value does not change and when neither
>>>>>>>>> transaction
>>>>>>>>>>> causes a split, rebalance, or merge that affects the other, they
>>> should
>>>>>>>>>>> both commit without conflict. There is a rich history of
>>> optimizations
>>>>>>>>> in
>>>>>>>>>>> this space specifically around btrees. ebtree might cause
>>> spurious
>>>>>>>>>>> conflicts today, I will investigate and propose fixes if so
>>> (e.g, I
>>>>>>>>> think I
>>>>>>>>>>> probably call erlfdb:set on nodes that have not changed).
>>>>>>>>>>> 
>>>>>>>>>>> I certainly envisaged updating ebtree within the same txn as a
>>> doc
>>>>>>>>> update,
>>>>>>>>>>> which is why the first argument to all the public functions can
>>> be
>>>>>>>>> either
>>>>>>>>>>> an erlfdb Db or open Tx.
>>>>>>>>>>> 
>>>>>>>>>>> Parallelising the initial build is more difficult with ebtree
>>> than
>>>>>>>>>>> parallelising the existing couch_views map-only code, though
>>> it's worth
>>>>>>>>>>> noting that ebtree:insert benefits a great deal from batching
>>> (multiple
>>>>>>>>>>> calls to :insert in the same transaction). For an offline build
>>> (i.e,
>>>>>>>>> an
>>>>>>>>>>> index that the client cannot see until the entire build is
>>> complete)
>>>>>>>>> the
>>>>>>>>>>> batch size can be maximised. That is still a serial process in
>>> that
>>>>>>>>> there
>>>>>>>>>>> is only one transaction at a time updating the ebtree. I can't
>>> say
>>>>>>>>> offhand
>>>>>>>>>>> how fast that is in practice, but it is clearly less powerful
>>> than a
>>>>>>>>> fully
>>>>>>>>>>> parallelisable approach could be.
>>>>>>>>>>> 
>>>>>>>>>>> Any parallel build would require a way to divide the database
>>> into
>>>>>>>>>>> non-overlapping subsets of emitted keys. This is easy and
>>> natural if
>>>>>>>>> the
>>>>>>>>>>> fdb key is the emitted key, which is the case for the couch_views
>>>>>>>>> map-only
>>>>>>>>>>> code. For ebtree it might be enough to simply grab a large chunk
>>> of
>>>>>>>>>>> documents, perform the map transform, and then issues multiple
>>>>>>>>> transactions
>>>>>>>>>>> on subsets of those.
>>>>>>>>>>> 
>>>>>>>>>>> Another common technique for btrees is bulk loading (more or less
>>>>>>>>>>> literally constructing the btree nodes directly from the source,
>>> as
>>>>>>>>> long as
>>>>>>>>>>> you can sort it), which might be an option as well.
>>>>>>>>>>> 
>>>>>>>>>>> Parallelising a build _with_ a reduce function seems hard
>>> however we do
>>>>>>>>>>> it. The non-ebtree approach is parallelisable by virtue of
>>> paring down
>>>>>>>>> the
>>>>>>>>>>> reduce functionality itself (only whole key groups, only those
>>>>>>>>> functions
>>>>>>>>>>> that fdb has atomic operations for).
>>>>>>>>>>> 
>>>>>>>>>>> I will first of all verify the multi-writer nature of ebtree as
>>> it
>>>>>>>>> stands
>>>>>>>>>>> today and make a PR which fixes any spurious conflicts, and then
>>> ponder
>>>>>>>>>>> further  how true parallel builds might be possible.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> On 24 Jul 2020, at 07:30, Garren Smith <gar...@apache.org>
>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> We haven't spoken much about updates with ebtree.
>>>>>>>>>>>> From my understanding of ebtree it can only do single writer,
>>> is that
>>>>>>>>>>>> correct?. If that is true it means we would not be able to
>>>>>>>>> parallelize
>>>>>>>>>>> the
>>>>>>>>>>>> background building of views to speed up view builds.
>>>>>>>>>>>> The other thing that would mean is that we cannot use it for
>>> mango
>>>>>>>>> where
>>>>>>>>>>> we
>>>>>>>>>>>> update the view in the doc transaction.
>>>>>>>>>>>> 
>>>>>>>>>>>> Cheers
>>>>>>>>>>>> Garren
>>>>>>>>>>>> 
>>>>>>>>>>>> On Fri, Jul 24, 2020 at 2:35 AM Kyle Snavely <
>>> kjsnav...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> When in doubt, throw us a build at Cloudant with ebtree maps
>>> and
>>>>>>>>> we'll
>>>>>>>>>>> see
>>>>>>>>>>>>> if it comes close to the crazy fast KV map queries.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Kyle
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Thu, Jul 23, 2020, 2:17 PM Robert Samuel Newson <
>>>>>>>>> rnew...@apache.org>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I (perhaps obviously) don't agree that I'm tying myself to old
>>>>>>>>> CouchDB
>>>>>>>>>>> or
>>>>>>>>>>>>>> failing to embrace FDB. FDB is a toolkit and does not, to my
>>> mind,
>>>>>>>>>>> force
>>>>>>>>>>>>> us
>>>>>>>>>>>>>> down any particular path.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I haven't sat down to modify couch_views in the manner I've
>>>>>>>>> suggested
>>>>>>>>>>>>>> (where ebtree is used as designed; being fed the output of the
>>>>>>>>> emit()
>>>>>>>>>>>>> calls
>>>>>>>>>>>>>> and calculating reductions as it does so) but I think it's a
>>>>>>>>> worthwhile
>>>>>>>>>>>>>> exercise. I'd be surprised if performance of map-only
>>> traversals
>>>>>>>>> would
>>>>>>>>>>> be
>>>>>>>>>>>>>> disappointing but who knows? I also expect it would allow for
>>>>>>>>>>> significant
>>>>>>>>>>>>>> simplification of the code, which is one of the higher
>>> virtues.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Adam, can you describe in a little more detail how you picture
>>>>>>>>> "b+tree
>>>>>>>>>>> is
>>>>>>>>>>>>>> only used for incremental aggregations,"? It's implied in your
>>>>>>>>> reply
>>>>>>>>>>> that
>>>>>>>>>>>>>> it would preserve the "interesting property" of keeping user
>>> data
>>>>>>>>> out
>>>>>>>>>>> of
>>>>>>>>>>>>>> FDB Keys (for casual readers: the new native database
>>> encryption,
>>>>>>>>>>> called
>>>>>>>>>>>>>> "aegis", only encrypts the FDB value. It can't encrypt the
>>> key as
>>>>>>>>> this
>>>>>>>>>>>>>> would change the order of keys, which the current code
>>> depends on).
>>>>>>>>>>> Did I
>>>>>>>>>>>>>> misread you?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> B.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On 23 Jul 2020, at 20:11, Adam Kocoloski <
>>> kocol...@apache.org>
>>>>>>>>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> OK thanks for the clarification. As I said I wasn’t all that
>>>>>>>>> confident
>>>>>>>>>>>>> I
>>>>>>>>>>>>>> understood the design :)
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I like the idea that the b+tree is only used for incremental
>>>>>>>>>>>>>> aggregations, rather than storing the entire materialized
>>> view,
>>>>>>>>> for the
>>>>>>>>>>>>>> same reasons that Garren stated.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> An index maintained entirely in ebtree has the interesting
>>>>>>>>> property
>>>>>>>>>>>>> that
>>>>>>>>>>>>>> it does not leak any user data into FDB Keys, which could be
>>>>>>>>> attractive
>>>>>>>>>>>>> for
>>>>>>>>>>>>>> security reasons.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Adam
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Jul 23, 2020, at 1:54 PM, Garren Smith <
>>> gar...@apache.org>
>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Thu, Jul 23, 2020 at 6:55 PM Paul Davis <
>>>>>>>>>>>>> paul.joseph.da...@gmail.com
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> I would like to keep ebtree to use just for the reduce
>>> index.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Could you expand on your reasoning here, Garren? I haven't
>>> done
>>>>>>>>> any
>>>>>>>>>>>>>>>>> experiments on my own to understand if I'm missing
>>> something
>>>>>>>>>>>>>>>>> important. My initial reaction is to not diverge too far
>>> from
>>>>>>>>> the
>>>>>>>>>>>>>>>>> previous shape of the implementation since we have a decent
>>>>>>>>> idea of
>>>>>>>>>>>>>>>>> how that behaves already but perhaps you've seen or
>>> measured
>>>>>>>>>>>>> something
>>>>>>>>>>>>>>>>> I'm not thinking of?
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I think this must have been a misunderstanding on my part. I
>>>>>>>>> always
>>>>>>>>>>>>>> thought
>>>>>>>>>>>>>>>> of using ebtree to solve reduce and wasn't planning to use
>>> it
>>>>>>>>> for the
>>>>>>>>>>>>>> map
>>>>>>>>>>>>>>>> index.
>>>>>>>>>>>>>>>> I don't like the idea that we have ordered distributed
>>> key/value
>>>>>>>>>>> store
>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>> then implementing a b-tree on top of that for indexing.
>>>>>>>>> Especially
>>>>>>>>>>>>>> since we
>>>>>>>>>>>>>>>> know that the current map index is fast,
>>>>>>>>>>>>>>>> easy to use, and follows recommended practices in the FDB
>>>>>>>>> community
>>>>>>>>>>> on
>>>>>>>>>>>>>> how
>>>>>>>>>>>>>>>> to do indexing. ebtree makes sense for reduce and is a
>>> means to
>>>>>>>>> an
>>>>>>>>>>> end
>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>> give us CouchDB's reduce api, which is heavily reliant on a
>>>>>>>>> b-tree,
>>>>>>>>>>>>> with
>>>>>>>>>>>>>>>> CouchDB on FDB. This feels like a step backward and I worry
>>> we
>>>>>>>>> are
>>>>>>>>>>>>> tying
>>>>>>>>>>>>>>>> ourselves heavily to old CouchDB instead of using the fact
>>> that
>>>>>>>>> we
>>>>>>>>>>>>>> moving
>>>>>>>>>>>>>>>> to FDB which then allows us to design new api's and
>>>>>>>>> functionality.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>> 
>>>> 
>>>> 
>>>

Re: [DISCUSS] Reduce on FDB take 3

Reply via email to