Re: [DISCUSS] Map Indexes

Adam Kocoloski Wed, 15 May 2019 17:52:08 -0700

You make a really good point on the potential for inconsistency between the 
current version of the document and the version of the document that 
contributed the rows in the stale view. I hadn’t been thinking about that when 
I had made my earlier comment in this thread.


Adam

> On May 10, 2019, at 7:06 AM, Jan Lehnardt <j...@apache.org> wrote:
> 
> 
> 
>> On 15. Apr 2019, at 17:15, Will Holley <willhol...@gmail.com 
>> <mailto:willhol...@gmail.com>> wrote:
>> 
>> Thanks Garren,
>> 
>> As usual, a few questions :)
>> 
>> 1. The data model suggests the idea of view groups gets carried over to
>> fdb. Are there API / behaviour reasons to keep them? Would an index update
>> transaction scope to a view group rather than a single view?
>> 2. Regarding emitting doc as the value in a view function, this is so
>> common that I wonder if it's worth handling as a special case. It sounds
>> like there wouldn't be a solution for customers who use this technique to
>> ensure they can retrieve the version of the document that is consistent
>> with the emitted key?
> 
> 
> How would we implement this special case? I have some JS AST-walking code 
> that can pull out instances of “doc is used as emit value”, so we could 
> detect this case, but what then? do we put in a “foreign key include” marker 
> like it exists today, and hope that doc body is still available in the 
> revisions subspace, and what’s the desired failure scenario for when the body 
> is gone for good, or even purged? Or do we keep emitted doc bodies somewhere 
> else, or add some refcounting to the revisions store, even if a rev is 
> deleted=true?
> 
> I see some options, but I’m not clear on a good path.
> 
>> 3. When you say "Emitted keys will not be able to exceed 10 KB", do you
>> mean any single emitted key cannot exceed 10KB? The "id index" proposal
>> suggests there would also be a 100KB limit on the combined emitted key
>> length.
>> 
>> Cheers,
>> 
>> Will
>> 
>> On Mon, 15 Apr 2019 at 15:25, Garren Smith <gar...@apache.org> wrote:
>> 
>>> Hi Everyone,
>>> 
>>> I want to start a discussion around creating map/reduce view indexes. One
>>> way to get views indexes to work with FoundationDB is to break up a view
>>> index into indexes for the map functions and indexes for the reduce
>>> functions. Along those lines, I’m going to break the discussions into two,
>>> this discussion around map functions and indexes and then a another one on
>>> reduce functions and the indexes that go with those.
>>> 
>>> ## Data model
>>> For a map function, we need to store the emitted keys and the emitted
>>> values:
>>> 
>>> {?DATABASE, ?VIEWS, ?VIEW_SIGNATURE, ?VIEWS, <view_id>, ?MAP, <keys>,
>>> <_id>} -> <emitted_value>
>>> 
>>> To briefly explain what the above means, it creates a views subspace in a
>>> database subspace, then every view defined on a design doc is grouped via
>>> the design doc’s view signature. The view_id is the name of the view in the
>>> design doc - we can look at ways to make that smaller to save some key
>>> space. The ?MAP groups the key/value into the view’s map index subspace,
>>> then we have the keys that were emitted for the map function and finally
>>> the _id field of the document used to create the keys for this row.
>>> 
>>> ## Emitted Value
>>> The value stored for the row is the emitted value from the map function.
>>> Because we have a limitation on the size of the value field one caveat
>>> around this design is that a user will run into issues if they emit a
>>> document that exceeds 100KB. In CouchDB we don’t recommend users emitting
>>> the doc, but there are some nice speed optimisations you get by emitting
>>> the document as the value. With CouchDB on FDB that performance
>>> optimisation won’t be required and so we will have to actively discourage
>>> users from doing that.
>>> 
>>> Just to note, a user would experience the same issue if they emit a value
>>> exceeding 100KB.
>>> 
>>> ## Key ordering
>>> There are some changes to how we will manage keys emitted from a map
>>> function. For strings we will need to generate a ICU sort string upfront
>>> instead of using the ICU comparison. To maintain the way CouchDB currently
>>> does view collation [1], we need to prepend a type value to each key so
>>> that we get the correct sort order of null < boolean < numbers < strings <
>>> arrays < objects. CouchDB currently allows duplicate keys to be emitted for
>>> an index, to allow for that a counter value will be added to the end of the
>>> keys.
>>> 
>>> ## Index Key Management
>>> For every document that needs to be processed for an index, we have to run
>>> the document through the javascript process to get the emitted keys and
>>> values. This means that it won’t be possible to update a map/reduce index
>>> in the same transaction that a document is updated. To account for this, we
>>> will need to keep an `id index` similar to the `id tree` we current keep.
>>> This index will hold the document id as the key and the value would be the
>>> keys that were emitted. We would then use this information to know which
>>> fields need to be updated or removed from the index when a document is
>>> changed.  A data model for this would be:
>>> 
>>> {?DATABASE, ?VIEWS, ?VIEW_SIGNATURE, ?VIEWS, <view_id>, ?ID_INDEX, <_id>,
>>> <view_id>} -> [emitted keys]
>>> 
>>> ## Updating an index
>>> To help in knowing which documents have changed since a view was last
>>> updated, we will need to keep the latest update sequence. This will change
>>> from the really long string we currently have, to using the sequence value
>>> defined in the _changes RFC [2].
>>> 
>>> Based on all of that, a slightly expanded data model for map functions
>>> inside a database subspace would look like:
>>> 
>>> * views
>>>   * <signature>
>>>       * update_seq
>>>       * idtree
>>>           * (<_id>, <viewid>) -> [keys]
>>>       * views
>>>           * <viewid>
>>>               * map
>>>                   * (<key>, <_id>) -> <value>
>>> 
>>> ## Size limits
>>> There are some size limits that are worth listing and keeping in mind.
>>> 
>>> * Emitted keys will not be able to exceed 10 KB
>>> * Values cannot exceed 100 KB
>>> * Following from Alex’s email on how transaction sizes are calculated [3],
>>> there could be rare cases where the number of key-value pairs emitted for a
>>> map function could lead to a transaction either exceeding 10 MB which isn’t
>>> allowed or exceeding 5 MB which impacts the performance of the cluster. We
>>> will have to detect for those situations and split the transaction into
>>> smaller transactions
>>> 
>>> What do you think of that? Any questions or thoughts on this? Once again a
>>> big acknowledgment to Adam who did the initial investigation and design
>>> ideas around this.
>>> 
>>> Cheers
>>> Garren
>>> 
>>> [1]
>>> 
>>> http://docs.couchdb.org/en/stable/ddocs/views/collation.html#collation-specification
>>> [2] https://github.com/apache/couchdb-documentation/pull/401
>>> [3]
>>> 
>>> https://lists.apache.org/thread.html/4976e0b7e3df89c5d64f37b5299b04c2ed01088f357be8aceaeedec1@%3Cdev.couchdb.apache.org%3E
>>> 
> 
> -- 
> Professional Support for Apache CouchDB:
> https://neighbourhood.ie/couchdb-support/ 
> <https://neighbourhood.ie/couchdb-support/>

Re: [DISCUSS] Map Indexes

Reply via email to