You make a really good point about the potential for inconsistency between the current version of the document and the version of the document that contributed the rows in the stale view. I hadn’t considered that when I made my earlier comment in this thread.

Adam

> On May 10, 2019, at 7:06 AM, Jan Lehnardt <j...@apache.org> wrote:
>
>> On 15. Apr 2019, at 17:15, Will Holley <willhol...@gmail.com> wrote:
>>
>> Thanks Garren,
>>
>> As usual, a few questions :)
>>
>> 1. The data model suggests the idea of view groups gets carried over to
>> fdb. Are there API / behaviour reasons to keep them? Would an index update
>> transaction scope to a view group rather than a single view?
>> 2. Regarding emitting doc as the value in a view function, this is so
>> common that I wonder if it's worth handling as a special case. It sounds
>> like there wouldn't be a solution for customers who use this technique to
>> ensure they can retrieve the version of the document that is consistent
>> with the emitted key?
>
> How would we implement this special case? I have some JS AST-walking code
> that can pull out instances of “doc is used as emit value”, so we could
> detect this case, but what then? Do we put in a “foreign key include” marker
> like the one that exists today and hope that the doc body is still available
> in the revisions subspace? What’s the desired failure scenario when the body
> is gone for good, or even purged? Or do we keep emitted doc bodies somewhere
> else, or add some refcounting to the revisions store, even if a rev is
> deleted=true?
>
> I see some options, but I’m not clear on a good path.
>
>> 3. When you say "Emitted keys will not be able to exceed 10 KB", do you
>> mean any single emitted key cannot exceed 10KB? The "id index" proposal
>> suggests there would also be a 100KB limit on the combined emitted key
>> length.
>>
>> Cheers,
>>
>> Will
>>
>> On Mon, 15 Apr 2019 at 15:25, Garren Smith <gar...@apache.org> wrote:
>>
>>> Hi Everyone,
>>>
>>> I want to start a discussion around creating map/reduce view indexes. One
>>> way to get view indexes to work with FoundationDB is to break up a view
>>> index into indexes for the map functions and indexes for the reduce
>>> functions. Along those lines, I’m going to break the discussion into two:
>>> this discussion around map functions and their indexes, and then another
>>> one on reduce functions and the indexes that go with those.
>>>
>>> ## Data model
>>> For a map function, we need to store the emitted keys and the emitted
>>> values:
>>>
>>> {?DATABASE, ?VIEWS, ?VIEW_SIGNATURE, ?VIEWS, <view_id>, ?MAP, <keys>,
>>> <_id>} -> <emitted_value>
>>>
>>> To briefly explain what the above means, it creates a views subspace in a
>>> database subspace, then every view defined on a design doc is grouped via
>>> the design doc’s view signature. The view_id is the name of the view in the
>>> design doc - we can look at ways to make that smaller to save some key
>>> space. The ?MAP groups the key/value into the view’s map index subspace,
>>> then we have the keys that were emitted for the map function and finally
>>> the _id field of the document used to create the keys for this row.
>>>
>>> ## Emitted Value
>>> The value stored for the row is the emitted value from the map function.
>>> Because we have a limitation on the size of the value field, one caveat
>>> around this design is that a user will run into issues if they emit a
>>> document that exceeds 100KB. In CouchDB we don’t recommend users emitting
>>> the doc, but there are some nice speed optimisations you get by emitting
>>> the document as the value. With CouchDB on FDB that performance
>>> optimisation won’t be required and so we will have to actively discourage
>>> users from doing that.
>>>
>>> Just to note, a user would experience the same issue if they emit a value
>>> exceeding 100KB.
>>>
>>> ## Key ordering
>>> There are some changes to how we will manage keys emitted from a map
>>> function. For strings we will need to generate an ICU sort string upfront
>>> instead of using the ICU comparison. To maintain the way CouchDB currently
>>> does view collation [1], we need to prepend a type value to each key so
>>> that we get the correct sort order of null < boolean < numbers < strings <
>>> arrays < objects. CouchDB currently allows duplicate keys to be emitted for
>>> an index; to allow for that, a counter value will be added to the end of the
>>> keys.
>>>
>>> ## Index Key Management
>>> For every document that needs to be processed for an index, we have to run
>>> the document through the javascript process to get the emitted keys and
>>> values. This means that it won’t be possible to update a map/reduce index
>>> in the same transaction that a document is updated. To account for this, we
>>> will need to keep an `id index` similar to the `id tree` we currently keep.
>>> This index will hold the document id as the key and the value would be the
>>> keys that were emitted. We would then use this information to know which
>>> fields need to be updated or removed from the index when a document is
>>> changed. A data model for this would be:
>>>
>>> {?DATABASE, ?VIEWS, ?VIEW_SIGNATURE, ?VIEWS, <view_id>, ?ID_INDEX, <_id>,
>>> <view_id>} -> [emitted keys]
>>>
>>> ## Updating an index
>>> To help in knowing which documents have changed since a view was last
>>> updated, we will need to keep the latest update sequence. This will change
>>> from the really long string we currently have, to using the sequence value
>>> defined in the _changes RFC [2].
>>>
>>> Based on all of that, a slightly expanded data model for map functions
>>> inside a database subspace would look like:
>>>
>>> * views
>>>   * <signature>
>>>     * update_seq
>>>     * idtree
>>>       * (<_id>, <viewid>) -> [keys]
>>>     * views
>>>       * <viewid>
>>>         * map
>>>           * (<key>, <_id>) -> <value>
>>>
>>> ## Size limits
>>> There are some size limits that are worth listing and keeping in mind.
>>>
>>> * Emitted keys will not be able to exceed 10 KB
>>> * Values cannot exceed 100 KB
>>> * Following from Alex’s email on how transaction sizes are calculated [3],
>>> there could be rare cases where the number of key-value pairs emitted for a
>>> map function could lead to a transaction either exceeding 10 MB, which isn’t
>>> allowed, or exceeding 5 MB, which impacts the performance of the cluster. We
>>> will have to detect those situations and split the transaction into
>>> smaller transactions.
>>>
>>> What do you think of that? Any questions or thoughts on this? Once again a
>>> big acknowledgment to Adam who did the initial investigation and design
>>> ideas around this.
>>>
>>> Cheers
>>> Garren
>>>
>>> [1] http://docs.couchdb.org/en/stable/ddocs/views/collation.html#collation-specification
>>> [2] https://github.com/apache/couchdb-documentation/pull/401
>>> [3] https://lists.apache.org/thread.html/4976e0b7e3df89c5d64f37b5299b04c2ed01088f357be8aceaeedec1@%3Cdev.couchdb.apache.org%3E
>>>
>
> --
> Professional Support for Apache CouchDB:
> https://neighbourhood.ie/couchdb-support/
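
For reference, here is a rough sketch of the "detect and split" step from the size limits list in Garren's proposal. It is only an illustration under stated assumptions, not the planned implementation: the function name `split_into_batches` and the constants are made up for this example, and a row's cost is approximated as key length plus value length, which is simpler than the transaction size accounting described in [3].

```python
# Limits quoted in the proposal. The names are hypothetical, chosen for this sketch.
MAX_KEY_BYTES = 10 * 1024          # an emitted key may not exceed 10 KB
MAX_VALUE_BYTES = 100 * 1024       # an emitted value may not exceed 100 KB
SOFT_TXN_BYTES = 5 * 1024 * 1024   # above 5 MB, cluster performance suffers


def split_into_batches(rows, max_batch_bytes=SOFT_TXN_BYTES):
    """Group (key, value) byte pairs so each group fits in one transaction.

    Assumption: a row costs len(key) + len(value) bytes. The real accounting
    includes additional overhead, so this estimate is a lower bound.
    """
    batches = []
    current = []
    current_bytes = 0
    for key, value in rows:
        # Reject rows that could never be stored, whatever the batching.
        if len(key) > MAX_KEY_BYTES:
            raise ValueError("emitted key exceeds 10 KB")
        if len(value) > MAX_VALUE_BYTES:
            raise ValueError("emitted value exceeds 100 KB")
        row_bytes = len(key) + len(value)
        # Close the current batch before it would cross the soft limit.
        if current and current_bytes + row_bytes > max_batch_bytes:
            batches.append(current)
            current = []
            current_bytes = 0
        current.append((key, value))
        current_bytes += row_bytes
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    # 120 rows of roughly 100 KB each cannot fit in one 5 MB transaction.
    rows = [(b"k%d" % i, b"v" * 100_000) for i in range(120)]
    for n, batch in enumerate(split_into_batches(rows)):
        print(n, len(batch))
```

Each returned batch would then be written in its own transaction. Because the estimate ignores overhead, a real implementation would probably want a limit somewhat below 5 MB.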