Re: [DISCUSS] Map Indexes

Jan Lehnardt Fri, 10 May 2019 04:17:32 -0700

> On 15. Apr 2019, at 16:24, Garren Smith <[email protected]> wrote:
> 
> Hi Everyone,
> 
> I want to start a discussion around creating map/reduce view indexes. One
> way to get views indexes to work with FoundationDB is to break up a view
> index into indexes for the map functions and indexes for the reduce
> functions. Along those lines, I’m going to break the discussions into two,
> this discussion around map functions and indexes and then a another one on
> reduce functions and the indexes that go with those.
> 
> ## Data model
> For a map function, we need to store the emitted keys and the emitted
> values:
> 
> {?DATABASE, ?VIEWS, ?VIEW_SIGNATURE, ?VIEWS, <view_id>, ?MAP, <keys>,
> <_id>} -> <emitted_value>

in line with the “why keep view groups” question, the key path here include 
both the signature and the doc id and view id which already is a unique 
identifier, what purpose does the ?VIEW_SIGNATURE key have?

> 
> To briefly explain what the above means, it creates a views subspace in a
> database subspace, then every view defined on a design doc is grouped via
> the design doc’s view signature. The view_id is the name of the view in the
> design doc - we can look at ways to make that smaller to save some key
> space. The ?MAP groups the key/value into the view’s map index subspace,
> then we have the keys that were emitted for the map function and finally
> the _id field of the document used to create the keys for this row.
> 
> ## Emitted Value
> The value stored for the row is the emitted value from the map function.
> Because we have a limitation on the size of the value field one caveat
> around this design is that a user will run into issues if they emit a
> document that exceeds 100KB. In CouchDB we don’t recommend users emitting
> the doc, but there are some nice speed optimisations you get by emitting
> the document as the value. With CouchDB on FDB that performance
> optimisation won’t be required and so we will have to actively discourage
> users from doing that.
> 
> Just to note, a user would experience the same issue if they emit a value
> exceeding 100KB.
> 
> ## Key ordering
> There are some changes to how we will manage keys emitted from a map
> function. For strings we will need to generate a ICU sort string upfront
> instead of using the ICU comparison. To maintain the way CouchDB currently
> does view collation [1], we need to prepend a type value to each key so
> that we get the correct sort order of null < boolean < numbers < strings <
> arrays < objects. CouchDB currently allows duplicate keys to be emitted for
> an index, to allow for that a counter value will be added to the end of the
> keys.
> 
> ## Index Key Management
> For every document that needs to be processed for an index, we have to run
> the document through the javascript process to get the emitted keys and
> values. This means that it won’t be possible to update a map/reduce index
> in the same transaction that a document is updated. To account for this, we
> will need to keep an `id index` similar to the `id tree` we current keep.
> This index will hold the document id as the key and the value would be the
> keys that were emitted. We would then use this information to know which
> fields need to be updated or removed from the index when a document is
> changed.  A data model for this would be:
> 
> {?DATABASE, ?VIEWS, ?VIEW_SIGNATURE, ?VIEWS, <view_id>, ?ID_INDEX, <_id>,
> <view_id>} -> [emitted keys]
> 
> ## Updating an index
> To help in knowing which documents have changed since a view was last
> updated, we will need to keep the latest update sequence. This will change
> from the really long string we currently have, to using the sequence value
> defined in the _changes RFC [2].
> 
> Based on all of that, a slightly expanded data model for map functions
> inside a database subspace would look like:
> 
> * views
>    * <signature>
>        * update_seq
>        * idtree
>            * (<_id>, <viewid>) -> [keys]
>        * views
>            * <viewid>
>                * map
>                    * (<key>, <_id>) -> <value>
> 
> ## Size limits
> There are some size limits that are worth listing and keeping in mind.
> 
> * Emitted keys will not be able to exceed 10 KB
> * Values cannot exceed 100 KB
> * Following from Alex’s email on how transaction sizes are calculated [3],
> there could be rare cases where the number of key-value pairs emitted for a
> map function could lead to a transaction either exceeding 10 MB which isn’t
> allowed or exceeding 5 MB which impacts the performance of the cluster. We
> will have to detect for those situations and split the transaction into
> smaller transactions
> 
> What do you think of that? Any questions or thoughts on this? Once again a
> big acknowledgment to Adam who did the initial investigation and design
> ideas around this.
> 
> Cheers
> Garren
> 
> [1]
> http://docs.couchdb.org/en/stable/ddocs/views/collation.html#collation-specification
> [2] https://github.com/apache/couchdb-documentation/pull/401
> [3]
> https://lists.apache.org/thread.html/4976e0b7e3df89c5d64f37b5299b04c2ed01088f357be8aceaeedec1@%3Cdev.couchdb.apache.org%3E

-- 
Professional Support for Apache CouchDB:
https://neighbourhood.ie/couchdb-support/
Re: [DISCUSS] Map Indexes

Reply via email to