Re: [DISCUSS] Map Indexes

Adam Kocoloski Mon, 15 Apr 2019 14:51:11 -0700

I could certainly see handling `emit(key, doc)` as a special case. I would feel 
even better pursuing that special-case handling if we could characterize the 
performance delta between a) executing the first range operation on the view 
index + N additional range operations in parallel to retrieve the document data 
and b) a single, larger range operation against a view index containing all the 
needed data.
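
To make the two access patterns concrete, here is a rough sketch in Python that only counts range operations. Everything in it is an assumption for illustration: `FakeKV` is an in-memory stand-in for an ordered key-value store, the `"doc"`, `"view"` and `"view_with_doc"` prefixes are invented, and the N lookups in option a) run sequentially rather than in parallel. It says nothing about real latency, which is what we would actually need to measure.

```python
from bisect import bisect_left


class FakeKV:
    """Tiny in-memory stand-in for an ordered key-value store (hypothetical)."""

    def __init__(self, items):
        self._items = sorted(items.items())
        self._keys = [key for key, _ in self._items]
        self.range_ops = 0  # how many range reads were issued

    def get_range(self, prefix):
        """Return all (key, value) pairs whose tuple key starts with prefix."""
        self.range_ops += 1
        start = bisect_left(self._keys, prefix)
        rows = []
        for key, value in self._items[start:]:
            if key[:len(prefix)] != prefix:
                break
            rows.append((key, value))
        return rows


# Two documents stored as chunks, a view that emits no value, and a
# view that carries the whole document as its value.
SAMPLE = {
    ("doc", "d1", 0): '{"n":',
    ("doc", "d1", 1): "1}",
    ("doc", "d2", 0): '{"n":2}',
    ("view", "k1", "d1"): None,
    ("view", "k2", "d2"): None,
    ("view_with_doc", "k1", "d1"): '{"n":1}',
    ("view_with_doc", "k2", "d2"): '{"n":2}',
}


def option_a(kv):
    """One range read on the view, then one range read per document."""
    docs = []
    for key, _value in kv.get_range(("view",)):
        doc_id = key[2]
        chunks = kv.get_range(("doc", doc_id))
        docs.append("".join(chunk for _, chunk in chunks))
    return docs


def option_b(kv):
    """A single, larger range read on a view that already holds the docs."""
    return [value for _key, value in kv.get_range(("view_with_doc",))]
```

With the sample data, option a) issues 1 + N range reads (3 for two rows) and option b) issues one, and both return the same documents.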


This is almost exactly the discussion about include_docs=true that we were 
having on the _all_docs thread, but with the added fun of trying to avoid value 
size limitations due to the view data model sticking to a single value per 
emit() statement.

Adam

> On Apr 15, 2019, at 11:15 AM, Will Holley <willhol...@gmail.com> wrote:
> 
> Thanks Garren,
> 
> As usual, a few questions :)
> 
> 1. The data model suggests the idea of view groups gets carried over to
> fdb. Are there API / behaviour reasons to keep them? Would an index update
> transaction scope to a view group rather than a single view?
> 2. Regarding emitting doc as the value in a view function, this is so
> common that I wonder if it's worth handling as a special case. It sounds
> like there wouldn't be a solution for customers who use this technique to
> ensure they can retrieve the version of the document that is consistent
> with the emitted key?
> 3. When you say "Emitted keys will not be able to exceed 10 KB", do you
> mean any single emitted key cannot exceed 10KB? The "id index" proposal
> suggests there would also be a 100KB limit on the combined emitted key
> length.
> 
> Cheers,
> 
> Will
> 
> On Mon, 15 Apr 2019 at 15:25, Garren Smith <gar...@apache.org> wrote:
> 
>> Hi Everyone,
>> 
>> I want to start a discussion around creating map/reduce view indexes. One
>> way to get view indexes to work with FoundationDB is to break up a view
>> index into indexes for the map functions and indexes for the reduce
>> functions. Along those lines, I’m going to break the discussions into two,
>> this discussion around map functions and indexes and then another one on
>> reduce functions and the indexes that go with those.
>> 
>> ## Data model
>> For a map function, we need to store the emitted keys and the emitted
>> values:
>> 
>> {?DATABASE, ?VIEWS, ?VIEW_SIGNATURE, ?VIEWS, <view_id>, ?MAP, <keys>,
>> <_id>} -> <emitted_value>
>> 
>> To briefly explain what the above means, it creates a views subspace in a
>> database subspace, then every view defined on a design doc is grouped via
>> the design doc’s view signature. The view_id is the name of the view in the
>> design doc - we can look at ways to make that smaller to save some key
>> space. The ?MAP groups the key/value into the view’s map index subspace,
>> then we have the keys that were emitted for the map function and finally
>> the _id field of the document used to create the keys for this row.
>> 
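As a rough illustration of the key layout described above, the sketch below builds the row key as a Python tuple. The marker values `"views"` and `"map"` and the function names are assumptions made for this example; the thread does not say what `?VIEWS` or `?MAP` expand to, nor how the tuple is packed into bytes.

```python
# Hypothetical subspace markers; the real values are not given in the thread.
VIEWS = "views"
MAP = "map"


def view_prefix(database, signature, view_id):
    """Prefix shared by every map row of one view."""
    return (database, VIEWS, signature, VIEWS, view_id, MAP)


def map_row_key(database, signature, view_id, emitted_key, doc_id):
    """Full key for one emitted row: view prefix, emitted key, then _id."""
    return view_prefix(database, signature, view_id) + (emitted_key, doc_id)


rows = sorted([
    map_row_key("db", "sig1", "by_name", "bob", "doc2"),
    map_row_key("db", "sig1", "by_name", "alice", "doc9"),
    map_row_key("db", "sig1", "by_name", "alice", "doc1"),
])
```

Because tuples compare element by element, all rows of a view share one prefix and sort by emitted key first and `_id` second, which is the property a range read over the view relies on.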
>> ## Emitted Value
>> The value stored for the row is the emitted value from the map function.
>> Because we have a limitation on the size of the value field one caveat
>> around this design is that a user will run into issues if they emit a
>> document that exceeds 100KB. In CouchDB we don’t recommend users emitting
>> the doc, but there are some nice speed optimisations you get by emitting
>> the document as the value. With CouchDB on FDB that performance
>> optimisation won’t be required and so we will have to actively discourage
>> users from doing that.
>> 
>> Just to note, a user would experience the same issue if they emit a value
>> exceeding 100KB.
>> 
>> ## Key ordering
>> There are some changes to how we will manage keys emitted from a map
>> function. For strings we will need to generate an ICU sort string upfront
>> instead of using the ICU comparison. To maintain the way CouchDB currently
>> does view collation [1], we need to prepend a type value to each key so
>> that we get the correct sort order of null < boolean < numbers < strings <
>> arrays < objects. CouchDB currently allows duplicate keys to be emitted for
>> an index, to allow for that a counter value will be added to the end of the
>> keys.
>> 
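A minimal sketch of the type-tag idea, assuming emitted keys arrive as plain JSON values. The tag numbers and function names are invented for this example. `sort_string` is only a stand-in: it uses `str.casefold()` instead of a real ICU sort key. Giving every emit a counter (rather than only duplicates) is a simplification chosen here, not something the proposal specifies.

```python
# Hypothetical type tags giving null < boolean < numbers < strings < arrays < objects.
TYPE_NULL, TYPE_BOOL, TYPE_NUMBER, TYPE_STRING, TYPE_ARRAY, TYPE_OBJECT = range(6)


def sort_string(text):
    """Stand-in for an ICU sort key; a real one would come from an ICU collator."""
    return text.casefold()


def encode_key(value):
    """Prefix each value with a type tag so mixed-type keys sort by type first."""
    if value is None:
        return (TYPE_NULL, 0)
    if isinstance(value, bool):  # checked before numbers: bool is a subclass of int
        return (TYPE_BOOL, value)
    if isinstance(value, (int, float)):
        return (TYPE_NUMBER, value)
    if isinstance(value, str):
        return (TYPE_STRING, sort_string(value))
    if isinstance(value, list):
        return (TYPE_ARRAY, tuple(encode_key(item) for item in value))
    if isinstance(value, dict):
        pairs = tuple((encode_key(k), encode_key(v)) for k, v in value.items())
        return (TYPE_OBJECT, pairs)
    raise TypeError(f"not a JSON value: {value!r}")


def row_keys(doc_id, emitted_keys):
    """Append a per-document counter so duplicate emitted keys stay distinct."""
    return [(encode_key(key), doc_id, n) for n, key in enumerate(emitted_keys)]
```

Sorting with `key=encode_key` then places `None` before booleans, booleans before numbers, and so on, and two emits of the same key from one document yield two different row keys.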
>> ## Index Key Management
>> For every document that needs to be processed for an index, we have to run
>> the document through the javascript process to get the emitted keys and
>> values. This means that it won’t be possible to update a map/reduce index
>> in the same transaction that a document is updated. To account for this, we
>> will need to keep an `id index` similar to the `id tree` we currently keep.
>> This index will hold the document id as the key and the value would be the
>> keys that were emitted. We would then use this information to know which
>> fields need to be updated or removed from the index when a document is
>> changed.  A data model for this would be:
>> 
>> {?DATABASE, ?VIEWS, ?VIEW_SIGNATURE, ?VIEWS, <view_id>, ?ID_INDEX, <_id>,
>> <view_id>} -> [emitted keys]
>> 
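The bookkeeping described above might look roughly like this, with plain dicts standing in for the map index and the id index of a single view. The function name and the clear-then-set strategy are assumptions for illustration, and the duplicate-key counter is left out to keep the sketch short.

```python
def update_index(map_index, id_index, doc_id, new_rows):
    """Replace the rows one document contributes to one view.

    map_index: dict mapping (emitted_key, doc_id) -> emitted value
    id_index:  dict mapping doc_id -> list of previously emitted keys
    new_rows:  list of (emitted_key, value); empty if the doc was deleted
    """
    # Use the id index to find and clear the rows written last time.
    for old_key in id_index.get(doc_id, []):
        map_index.pop((old_key, doc_id), None)

    # Write the rows produced by the latest run of the map function.
    for key, value in new_rows:
        map_index[(key, doc_id)] = value

    # Remember what was emitted so the next update can clean up.
    if new_rows:
        id_index[doc_id] = [key for key, _ in new_rows]
    else:
        id_index.pop(doc_id, None)
```

Without the id index, the updater would have no way to find the stale rows for a changed document short of scanning the whole view.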
>> ## Updating an index
>> To help in knowing which documents have changed since a view was last
>> updated, we will need to keep the latest update sequence. This will change
>> from the really long string we currently have, to using the sequence value
>> defined in the _changes RFC [2].
>> 
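A small sketch of how a stored update sequence could drive an incremental build. Integers stand in for the sequence values defined in the _changes RFC, and `pending_changes`, `build` and `process_doc` are hypothetical names used only here.

```python
def pending_changes(changes, update_seq):
    """Changes the view has not seen yet; changes is a list of (seq, doc_id)."""
    return [(seq, doc_id) for seq, doc_id in changes if seq > update_seq]


def build(changes, view_state, process_doc):
    """Process unseen changes in order and advance the stored update_seq."""
    for seq, doc_id in pending_changes(changes, view_state["update_seq"]):
        process_doc(doc_id)
        view_state["update_seq"] = seq
```

For example, with `view_state = {"update_seq": 1}` and changes at sequences 1, 2 and 3, only the last two documents are processed and `update_seq` ends at 3.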
>> Based on all of that, a slightly expanded data model for map functions
>> inside a database subspace would look like:
>> 
>> * views
>>    * <signature>
>>        * update_seq
>>        * idtree
>>            * (<_id>, <viewid>) -> [keys]
>>        * views
>>            * <viewid>
>>                * map
>>                    * (<key>, <_id>) -> <value>
>> 
>> ## Size limits
>> There are some size limits that are worth listing and keeping in mind.
>> 
>> * Emitted keys will not be able to exceed 10 KB
>> * Values cannot exceed 100 KB
>> * Following from Alex’s email on how transaction sizes are calculated [3],
>> there could be rare cases where the number of key-value pairs emitted for a
>> map function could lead to a transaction either exceeding 10 MB which isn’t
>> allowed or exceeding 5 MB which impacts the performance of the cluster. We
>> will have to detect those situations and split the transaction into
>> smaller transactions.
>> 
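One way the transaction splitting could be approached is a greedy batcher like the sketch below. It approximates a transaction's size as the sum of key and value lengths, which is an assumption: the calculation in Alex's email [3] counts more than that. `SOFT_LIMIT_BYTES` reads the 5 MB figure as 5,000,000 bytes, also an assumption. Note that splitting gives up atomicity across the batches, which the sketch does not address.

```python
SOFT_LIMIT_BYTES = 5_000_000  # assumed reading of the 5 MB figure above


def split_into_batches(pairs, max_bytes=SOFT_LIMIT_BYTES):
    """Greedily group (key, value) byte pairs so each batch stays within max_bytes."""
    batches, current, size = [], [], 0
    for key, value in pairs:
        pair_size = len(key) + len(value)
        if pair_size > max_bytes:
            raise ValueError("a single pair exceeds the batch budget")
        # Start a new batch when adding this pair would exceed the budget.
        if current and size + pair_size > max_bytes:
            batches.append(current)
            current, size = [], 0
        current.append((key, value))
        size += pair_size
    if current:
        batches.append(current)
    return batches
```

Each returned batch would then be written in its own transaction.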
>> What do you think of that? Any questions or thoughts on this? Once again a
>> big acknowledgment to Adam who did the initial investigation and design
>> ideas around this.
>> 
>> Cheers
>> Garren
>> 
>> [1] http://docs.couchdb.org/en/stable/ddocs/views/collation.html#collation-specification
>> [2] https://github.com/apache/couchdb-documentation/pull/401
>> [3] https://lists.apache.org/thread.html/4976e0b7e3df89c5d64f37b5299b04c2ed01088f357be8aceaeedec1@%3Cdev.couchdb.apache.org%3E
>>
