I could certainly see handling `emit(key, doc)` as a special case. I would feel even better about pursuing that special-case handling if we could characterize the performance delta between a) executing the first range operation on the view index plus N additional range operations in parallel to retrieve the document data, and b) a single, larger range operation against a view index containing all the needed data.
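To make the a) versus b) comparison concrete, here is a rough Python sketch of the two access patterns against a toy in-memory ordered store. Everything in it is an assumption for illustration: `OrderedKV`, the `"view"` / `"cview"` / `"doc"` key prefixes, and the two-chunk document layout are hypothetical and are not the FoundationDB API or the proposed key layout. It only counts range operations; it says nothing about real latency or about how well the N lookups parallelize.

```python
from bisect import bisect_left


class OrderedKV:
    """Toy ordered key-value store (hypothetical stand-in for a range-read API)."""

    def __init__(self, items):
        self._items = sorted(items.items())
        self._keys = [k for k, _ in self._items]
        self.range_ops = 0  # number of range reads issued so far

    def get_range(self, start, end):
        """Return (key, value) pairs with start <= key < end as one range op."""
        self.range_ops += 1
        lo = bisect_left(self._keys, start)
        hi = bisect_left(self._keys, end)
        return self._items[lo:hi]


def build_sample():
    """Three small docs, indexed both without and with the doc body as value."""
    docs = {"d1": "alpha-body", "d2": "beta-body", "d3": "gamma-body"}
    data = {}
    for doc_id, body in docs.items():
        key = body.split("-")[0]
        data[("view", key, doc_id)] = None    # a) index row, no doc body
        data[("cview", key, doc_id)] = body   # b) index row carrying the doc
        half = len(body) // 2                 # doc stored as two chunks
        data[("doc", doc_id, 0)] = body[:half]
        data[("doc", doc_id, 1)] = body[half:]
    return data


def read_with_lookups(kv, start, end):
    """Option a): one range read on the index, then one range read per doc."""
    rows = kv.get_range(("view", start), ("view", end))
    out = []
    for (_tag, key, doc_id), _value in rows:
        chunks = kv.get_range(("doc", doc_id), ("doc", doc_id + "\x00"))
        out.append((key, "".join(chunk for _, chunk in chunks)))
    return out


def read_covering(kv, start, end):
    """Option b): a single, larger range read on an index that holds the docs."""
    rows = kv.get_range(("cview", start), ("cview", end))
    return [(key, body) for (_tag, key, _doc_id), body in rows]
```

Both functions return the same rows for the same key range; the difference is in `range_ops`, which is 1 + N for the first and 1 for the second. A real measurement would need to run both shapes against an actual cluster with realistic document sizes.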
This is almost exactly the discussion about include_docs=true that we were having on the _all_docs thread, but with the added fun of trying to avoid value size limitations due to the view data model sticking to a single value per emit() statement.

Adam

> On Apr 15, 2019, at 11:15 AM, Will Holley <willhol...@gmail.com> wrote:
>
> Thanks Garren,
>
> As usual, a few questions :)
>
> 1. The data model suggests the idea of view groups gets carried over to
> fdb. Are there API / behaviour reasons to keep them? Would an index update
> transaction scope to a view group rather than a single view?
> 2. Regarding emitting doc as the value in a view function, this is so
> common that I wonder if it's worth handling as a special case. It sounds
> like there wouldn't be a solution for customers who use this technique to
> ensure they can retrieve the version of the document that is consistent
> with the emitted key?
> 3. When you say "Emitted keys will not be able to exceed 10 KB", do you
> mean any single emitted key cannot exceed 10KB? The "id index" proposal
> suggests there would also be a 100KB limit on the combined emitted key
> length.
>
> Cheers,
>
> Will
>
> On Mon, 15 Apr 2019 at 15:25, Garren Smith <gar...@apache.org> wrote:
>
>> Hi Everyone,
>>
>> I want to start a discussion around creating map/reduce view indexes. One
>> way to get view indexes to work with FoundationDB is to break up a view
>> index into indexes for the map functions and indexes for the reduce
>> functions. Along those lines, I’m going to break the discussions into two:
>> this discussion around map functions and indexes, and then another one on
>> reduce functions and the indexes that go with those.
>>
>> ## Data model
>> For a map function, we need to store the emitted keys and the emitted
>> values:
>>
>> {?DATABASE, ?VIEWS, ?VIEW_SIGNATURE, ?VIEWS, <view_id>, ?MAP, <keys>,
>> <_id>} -> <emitted_value>
>>
>> To briefly explain what the above means, it creates a views subspace in a
>> database subspace, then every view defined on a design doc is grouped via
>> the design doc’s view signature. The view_id is the name of the view in the
>> design doc - we can look at ways to make that smaller to save some key
>> space. The ?MAP groups the key/value into the view’s map index subspace,
>> then we have the keys that were emitted for the map function and finally
>> the _id field of the document used to create the keys for this row.
>>
>> ## Emitted Value
>> The value stored for the row is the emitted value from the map function.
>> Because we have a limitation on the size of the value field, one caveat
>> around this design is that a user will run into issues if they emit a
>> document that exceeds 100KB. In CouchDB we don’t recommend users emitting
>> the doc, but there are some nice speed optimisations you get by emitting
>> the document as the value. With CouchDB on FDB that performance
>> optimisation won’t be required and so we will have to actively discourage
>> users from doing that.
>>
>> Just to note, a user would experience the same issue if they emit a value
>> exceeding 100KB.
>>
>> ## Key ordering
>> There are some changes to how we will manage keys emitted from a map
>> function. For strings we will need to generate an ICU sort string upfront
>> instead of using the ICU comparison. To maintain the way CouchDB currently
>> does view collation [1], we need to prepend a type value to each key so
>> that we get the correct sort order of null < boolean < numbers < strings <
>> arrays < objects. CouchDB currently allows duplicate keys to be emitted for
>> an index; to allow for that, a counter value will be added to the end of the
>> keys.
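
As an aside on the key ordering section above, this is roughly how I picture the type-value prefix and the duplicate counter, sketched in Python. The tag numbers, `encode_key`, and `row_key` are hypothetical names and values of my own, not anything from the proposal. Strings are compared by code point here, which is an assumption made to keep the sketch short; the proposal calls for an ICU sort string, which this does not generate.

```python
# Hypothetical type tags; only their relative order matters.
TAG_NULL, TAG_BOOL, TAG_NUMBER, TAG_STRING, TAG_ARRAY, TAG_OBJECT = range(6)


def encode_key(value):
    """Prefix a JSON-like value with a type tag so tuples sort by type first."""
    if value is None:
        return (TAG_NULL,)
    if isinstance(value, bool):  # must precede the number check: bool is an int
        return (TAG_BOOL, value)
    if isinstance(value, (int, float)):
        return (TAG_NUMBER, value)
    if isinstance(value, str):
        # Assumption: plain code-point order, NOT an ICU sort key.
        return (TAG_STRING, value)
    if isinstance(value, list):
        return (TAG_ARRAY, tuple(encode_key(v) for v in value))
    if isinstance(value, dict):
        pairs = tuple((encode_key(k), encode_key(v)) for k, v in value.items())
        return (TAG_OBJECT, pairs)
    raise TypeError(f"unsupported key type: {type(value).__name__}")


def row_key(emitted_key, doc_id, counter):
    """Build an index row key; the trailing counter keeps duplicate emits distinct."""
    return (encode_key(emitted_key), doc_id, counter)
```

Sorting `[{"a": 1}, "b", [1], 3, True, None, False, "a", 1.5]` with `key=encode_key` gives null, then false and true, then the numbers, strings, the array, and the object, which matches the order listed above. Two emits of the same key from the same document get different row keys because the counter differs.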
>>
>> ## Index Key Management
>> For every document that needs to be processed for an index, we have to run
>> the document through the javascript process to get the emitted keys and
>> values. This means that it won’t be possible to update a map/reduce index
>> in the same transaction that a document is updated. To account for this, we
>> will need to keep an `id index` similar to the `id tree` we currently keep.
>> This index will hold the document id as the key and the value would be the
>> keys that were emitted. We would then use this information to know which
>> fields need to be updated or removed from the index when a document is
>> changed. A data model for this would be:
>>
>> {?DATABASE, ?VIEWS, ?VIEW_SIGNATURE, ?VIEWS, <view_id>, ?ID_INDEX, <_id>,
>> <view_id>} -> [emitted keys]
>>
>> ## Updating an index
>> To help in knowing which documents have changed since a view was last
>> updated, we will need to keep the latest update sequence. This will change
>> from the really long string we currently have, to using the sequence value
>> defined in the _changes RFC [2].
>>
>> Based on all of that, a slightly expanded data model for map functions
>> inside a database subspace would look like:
>>
>> * views
>>   * <signature>
>>     * update_seq
>>     * idtree
>>       * (<_id>, <viewid>) -> [keys]
>>     * views
>>       * <viewid>
>>         * map
>>           * (<key>, <_id>) -> <value>
>>
>> ## Size limits
>> There are some size limits that are worth listing and keeping in mind.
>>
>> * Emitted keys will not be able to exceed 10 KB
>> * Values cannot exceed 100 KB
>> * Following from Alex’s email on how transaction sizes are calculated [3],
>> there could be rare cases where the number of key-value pairs emitted for a
>> map function could lead to a transaction either exceeding 10 MB which isn’t
>> allowed or exceeding 5 MB which impacts the performance of the cluster.
>> We will have to detect those situations and split the transaction into
>> smaller transactions.
>>
>> What do you think of that? Any questions or thoughts on this? Once again a
>> big acknowledgment to Adam who did the initial investigation and design
>> ideas around this.
>>
>> Cheers
>> Garren
>>
>> [1] http://docs.couchdb.org/en/stable/ddocs/views/collation.html#collation-specification
>> [2] https://github.com/apache/couchdb-documentation/pull/401
>> [3] https://lists.apache.org/thread.html/4976e0b7e3df89c5d64f37b5299b04c2ed01088f357be8aceaeedec1@%3Cdev.couchdb.apache.org%3E
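
On the last size-limit point (detecting oversized transactions and splitting them), a minimal Python sketch of the batching step might look like the following. `split_into_batches` is a hypothetical helper, and counting only `len(key) + len(value)` is a simplifying assumption: the transaction size accounting described in [3] includes more than that, so a real threshold would need headroom.

```python
def split_into_batches(rows, max_bytes):
    """Group (key, value) byte pairs so each batch's summed size is <= max_bytes.

    Assumption: a row's size is len(key) + len(value); real accounting differs.
    """
    batches = []
    current = []
    current_size = 0
    for key, value in rows:
        size = len(key) + len(value)
        if size > max_bytes:
            # A single row over the limit cannot be fixed by splitting batches.
            raise ValueError("single row exceeds the batch limit")
        if current and current_size + size > max_bytes:
            batches.append(current)  # close the full batch, start a new one
            current, current_size = [], 0
        current.append((key, value))
        current_size += size
    if current:
        batches.append(current)
    return batches
```

Each returned batch would then be written in its own transaction. One thing this leaves open is that the rows emitted for a single document may end up spread over several transactions, so the index for that document would not be updated atomically.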