As is evident from the length of this thread, there's a pretty big design space to cover here, and it seems unlikely we'll have arrived at a "correct" solution even by the time this thing ships. Perhaps it would be worthwhile to treat the in-FDB representation of data as a first-class abstraction and support multiple representations simultaneously?
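To make that a bit more concrete, here is a minimal sketch of what such a first-class representation abstraction could look like. It is plain Python with a dict standing in for the FDB keyspace; the names (DocRepresentation, encode/decode, BlobChunks) are purely illustrative and not an existing CouchDB or FoundationDB API. The chunked-blob layout discussed further down the thread is shown as one possible implementation.

# Hypothetical sketch: none of these names are existing CouchDB/FDB APIs.
# A plain dict of tuple-keys -> bytes stands in for the FDB keyspace.
import json
from abc import ABC, abstractmethod
from typing import Any, Dict, Tuple

KV = Dict[Tuple, bytes]


class DocRepresentation(ABC):
    """One of possibly several simultaneously supported in-FDB layouts."""

    @abstractmethod
    def encode(self, doc_id: str, body: Any) -> KV:
        """Turn a JSON body into the KV pairs written under the doc's prefix."""

    @abstractmethod
    def decode(self, doc_id: str, kvs: KV) -> Any:
        """Reassemble the JSON body from the KV pairs read back."""


class BlobChunks(DocRepresentation):
    """The chunked-binary layout discussed below: <=100KB value segments."""

    CHUNK = 100_000

    def encode(self, doc_id, body):
        blob = json.dumps(body).encode()
        return {(doc_id, "body", i // self.CHUNK): blob[i:i + self.CHUNK]
                for i in range(0, len(blob), self.CHUNK)}

    def decode(self, doc_id, kvs):
        return json.loads(b"".join(v for _, v in sorted(kvs.items())))


doc = {"owner": "bob", "mylist": [1, 3, 5]}
rep = BlobChunks()
assert rep.decode("foo", rep.encode("foo", doc)) == doc

An exploded-path layout would simply be a second implementation of the same interface, and which one a database (or even a document) uses could in principle be recorded in its own subspace.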
Obviously there's no such thing as a zero-cost abstraction - and I've not thought very hard about how far up the stack the document representation would need to leak - but supporting different layouts (primarily, as Adam points out, on the document body itself) might prove interesting and useful. I'm sure there are folks interested in a column-shaped CouchDB, for example. -- b On Tue, Feb 19, 2019 at 11:39 AM Robert Newson <rnew...@apache.org> wrote: > > Good points on revtree, I agree with you we should store that intelligently > to gain the benefits you mentioned. > > -- > Robert Samuel Newson > rnew...@apache.org > > On Tue, 19 Feb 2019, at 18:41, Adam Kocoloski wrote: > > I do not think we should store the revtree as a blob. The design where > > each edit branch is its own KV should save on network IO and CPU cycles > > for normal updates. We’ve performed too many heroics to keep > > couch_key_tree from stalling entire databases when trying to update a > > single document with a wide revision tree, I would much prefer to ignore > > other edit branches entirely when all we’re doing is extending one of > > them. > > > > I also do not think we should store JSON documents as blobs, but it’s a > > closer call. Some of my reasoning for preferring the exploded path > > design: > > > > - it lends itself nicely to sub-document operations, for which Jan > > crafted an RFC last year: https://github.com/apache/couchdb/issues/1559 > > - it optimizes the creation of Mango indexes on existing databases since > > we only need to retrieve the value(s) we want to index > > - it optimizes Mango queries that use field selectors > > - anyone who wanted to try their hand at GraphQL will find it very > > handy: https://github.com/apache/couchdb/issues/1499 > > - looking further ahead, it lets us play with smarter leaf value types > > like Counters (yes I’m still on the CRDT bandwagon, sorry) > > > > A few comments on the thread: > > > > >>> * Most documents bodies are probably going to be smaller than 100k. So > > >>> in > > >>> the majority of case it would be one write / one read to update and > > >>> fetch > > >>> the document body. > > > > We should test, but I expect reading 50KB of data in a range query is > > almost as efficient as reading a single 50 KB value. Similarly, writes > > to a contiguous set of keys should be quite efficient. > > > > I am concerned about the overhead of the repeated field paths in the > > keys with the exploded path option in the absence of key prefix > > compression. That would be my main reason to acquiesce and throw away > > all the document structure. > > > > Adam > > > > > On Feb 19, 2019, at 12:04 PM, Robert Newson <rnew...@apache.org> wrote: > > > > > > I like the idea that we'd reuse the same pattern (but perhaps not the > > > same _code_) for doc bodies, revtree and attachments. > > > > > > I hope we still get to delete couch_key_tree.erl, though. > > > > > > -- > > > Robert Samuel Newson > > > rnew...@apache.org > > > > > > On Tue, 19 Feb 2019, at 17:03, Jan Lehnardt wrote: > > >> I like the idea from a “trying a simple thing first” perspective, but > > >> Nick’s points below are especially convincing to with this for now. > > >> > > >> Best > > >> Jan > > >> — > > >> > > >>> On 19. Feb 2019, at 17:53, Nick Vatamaniuc <vatam...@gmail.com> wrote: > > >>> > > >>> Hi, > > >>> > > >>> Sorry for jumping in so late, I was following from the sidelines > > >>> mostly. A > > >>> lot of good discussion happening and am excited about the possibilities > > >>> here. 
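A quick, unscientific way to eyeball the overhead Adam raises above (repeated field paths in every key when the exploded layout is used without prefix compression). This is plain Python on a toy document; real FDB tuple encoding adds a few type/length bytes per element, so treat the numbers as indicative only.

# Toy check only: str() of key parts approximates tuple-encoded key size.
import json

doc = {
    "owner": "bob",
    "mylist": [1, 3, 5],
    "mymap": {"blue": "#0000FF", "red": "#FF0000"},
}


def explode(prefix, node):
    """Yield (key_tuple, scalar) for every leaf under a JSON value."""
    if isinstance(node, dict):
        for k, v in node.items():
            yield from explode(prefix + (k,), v)
    elif isinstance(node, list):
        for i, v in enumerate(node):
            yield from explode(prefix + (i,), v)
    else:
        yield prefix, node


kvs = list(explode(("foo",), doc))
key_bytes = sum(len(str(part)) for key, _ in kvs for part in key)
value_bytes = sum(len(str(val)) for _, val in kvs)
blob_bytes = len(json.dumps(doc, separators=(",", ":")).encode())
print(len(kvs), "exploded KVs,", key_bytes, "key bytes vs",
      value_bytes, "value bytes, against a", blob_bytes, "byte blob")

On deeply nested documents with long field names and short scalar values the key bytes can easily dominate the value bytes, which is exactly the concern.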
> > >>> > > >>> I do like the simpler "chunking" approach for a few reasons: > > >>> > > >>> * Most documents bodies are probably going to be smaller than 100k. So > > >>> in > > >>> the majority of case it would be one write / one read to update and > > >>> fetch > > >>> the document body. > > >>> > > >>> * We could reuse the chunking code for attachment handling and possibly > > >>> revision key trees. So it's the general pattern of upload chunks to some > > >>> prefix, and when finished flip an atomic toggle to make it current. > > >>> > > >>> * Do the same thing with revision trees and we could re-use the revision > > >>> tree manipulation logic. That is, the key tree in most cases would be > > >>> small > > >>> enough to fit in 100k but if they get huge, they'd get chunked. This > > >>> would > > >>> allow us to reuse all the battle tested couch_key_tree code mostly as > > >>> is. > > >>> We even have property tests for it > > >>> https://github.com/apache/couchdb/blob/master/src/couch/test/couch_key_tree_prop_tests.erl > > >>> > > >>> * It removes the need to explain the max exploded path length > > >>> limitation to > > >>> customers. > > >>> > > >>> Cheers, > > >>> -Nick > > >>> > > >>> > > >>> On Tue, Feb 19, 2019 at 11:18 AM Robert Newson <rnew...@apache.org> > > >>> wrote: > > >>> > > >>>> Hi, > > >>>> > > >>>> An alternative storage model that we should seriously consider is to > > >>>> follow our current approach in couch_file et al. Specifically, that the > > >>>> document _body_ is stored as an uninterpreted binary value. This would > > >>>> be > > >>>> much like the obvious plan for attachment storage; a key prefix that > > >>>> identifies the database and document, with the final item of that key > > >>>> tuple > > >>>> is an incrementing integer. Each of those keys has a binary value of > > >>>> up to > > >>>> 100k. Fetching all values with that key prefix, in fdb's natural > > >>>> ordering, > > >>>> will yield the full document body, which can be JSON decoded for > > >>>> further > > >>>> processing. > > >>>> > > >>>> I like this idea, and I like Adam's original proposal to explode > > >>>> documents > > >>>> into property paths. I have a slight preference for the simplicity of > > >>>> the > > >>>> idea in the previous paragraph, not least because it's close to what > > >>>> we do > > >>>> today. I also think it will be possible to migrate to alternative > > >>>> storage > > >>>> models in future, and foundationdb's transaction supports means we can > > >>>> do > > >>>> this migration seamlessly should we come to it. > > >>>> > > >>>> I'm very interested in knowing if anyone else is interested in going > > >>>> this > > >>>> simple, or considers it a wasted opportunity relative to the 'exploded' > > >>>> path. > > >>>> > > >>>> B. > > >>>> > > >>>> -- > > >>>> Robert Samuel Newson > > >>>> rnew...@apache.org > > >>>> > > >>>> On Mon, 4 Feb 2019, at 19:59, Robert Newson wrote: > > >>>>> I've been remiss here in not posting the data model ideas that IBM > > >>>>> worked up while we were thinking about using FoundationDB so I'm > > >>>>> posting > > >>>>> it now. This is Adam' Kocoloski's original work, I am just > > >>>>> transcribing > > >>>>> it, and this is the context that the folks from the IBM side came in > > >>>>> with, for full disclosure. > > >>>>> > > >>>>> Basics > > >>>>> > > >>>>> 1. All CouchDB databases are inside a Directory > > >>>>> 2. Each CouchDB database is a Directory within that Directory > > >>>>> 3. 
It's possible to list all subdirectories of a Directory, so > > >>>>> `_all_dbs` is the list of directories from 1. > > >>>>> 4. Each Directory representing a CouchdB database has several > > >>>>> Subspaces; > > >>>>> 4a. by_id/ doc subspace: actual document contents > > >>>>> 4b. by_seq/versionstamp subspace: for the _changes feed > > >>>>> 4c. index_definitions, indexes, ... > > >>>>> > > >>>>> JSON Mapping > > >>>>> > > >>>>> A hierarchical JSON object naturally maps to multiple KV pairs in FDB: > > >>>>> > > >>>>> { > > >>>>> “_id”: “foo”, > > >>>>> “owner”: “bob”, > > >>>>> “mylist”: [1,3,5], > > >>>>> “mymap”: { > > >>>>> “blue”: “#0000FF”, > > >>>>> “red”: “#FF0000” > > >>>>> } > > >>>>> } > > >>>>> > > >>>>> maps to > > >>>>> > > >>>>> (“foo”, “owner”) = “bob” > > >>>>> (“foo”, “mylist”, 0) = 1 > > >>>>> (“foo”, “mylist”, 1) = 3 > > >>>>> (“foo”, “mylist”, 2) = 5 > > >>>>> (“foo”, “mymap”, “blue”) = “#0000FF” > > >>>>> (“foo”, “mymap”, “red”) = “#FF0000” > > >>>>> > > >>>>> NB: this means that the 100KB limit applies to individual leafs in the > > >>>>> JSON object, not the entire doc > > >>>>> > > >>>>> Edit Conflicts > > >>>>> > > >>>>> We need to account for the presence of conflicts in various levels of > > >>>>> the doc due to replication. > > >>>>> > > >>>>> Proposal is to create a special value indicating that the subtree > > >>>>> below > > >>>>> our current cursor position is in an unresolvable conflict. Then add > > >>>>> additional KV pairs below to describe the conflicting entries. > > >>>>> > > >>>>> KV data model allows us to store these efficiently and minimize > > >>>>> duplication of data: > > >>>>> > > >>>>> A document with these two conflicts: > > >>>>> > > >>>>> { > > >>>>> “_id”: “foo”, > > >>>>> “_rev”: “1-abc”, > > >>>>> “owner”: “alice”, > > >>>>> “active”: true > > >>>>> } > > >>>>> { > > >>>>> “_id”: “foo”, > > >>>>> “_rev”: “1-def”, > > >>>>> “owner”: “bob”, > > >>>>> “active”: true > > >>>>> } > > >>>>> > > >>>>> could be stored thus: > > >>>>> > > >>>>> (“foo”, “active”) = true > > >>>>> (“foo”, “owner”) = kCONFLICT > > >>>>> (“foo”, “owner”, “1-abc”) = “alice” > > >>>>> (“foo”, “owner”, “1-def”) = “bob” > > >>>>> > > >>>>> So long as `kCONFLICT` is set at the top of the conflicting subtree > > >>>>> this > > >>>>> representation can handle conflicts of different data types as well. > > >>>>> > > >>>>> Missing fields need to be handled explicitly: > > >>>>> > > >>>>> { > > >>>>> “_id”: “foo”, > > >>>>> “_rev”: “1-abc”, > > >>>>> “owner”: “alice”, > > >>>>> “active”: true > > >>>>> } > > >>>>> > > >>>>> { > > >>>>> “_id”: “foo”, > > >>>>> “_rev”: “1-def”, > > >>>>> “owner”: { > > >>>>> “name”: “bob”, > > >>>>> “email”: “ > > >>>>> b...@example.com > > >>>>> " > > >>>>> } > > >>>>> } > > >>>>> > > >>>>> could be stored thus: > > >>>>> > > >>>>> (“foo”, “active”) = kCONFLICT > > >>>>> (“foo”, “active”, “1-abc”) = true > > >>>>> (“foo”, “active”, “1-def”) = kMISSING > > >>>>> (“foo”, “owner”) = kCONFLICT > > >>>>> (“foo”, “owner”, “1-abc”) = “alice” > > >>>>> (“foo”, “owner”, “1-def”, “name”) = “bob” > > >>>>> (“foo”, “owner”, “1-def”, “email”) = ... 
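A sketch of how the kCONFLICT / kMISSING encoding just described could be produced from two conflicting revisions. It assumes each revision has already been flattened into {path tuple: scalar} pairs, uses placeholder sentinel values, and hoists kCONFLICT to the top-level field (a simplification of "top of the conflicting subtree"). None of this is a settled format; it is plain Python for illustration.

# Sketch only: placeholder sentinels, not an agreed on-disk encoding.
kCONFLICT = "__kCONFLICT__"
kMISSING = "__kMISSING__"


def merge_revisions(doc_id, revs):
    """revs: {rev_id: {path_tuple: scalar}} -> {key_tuple: value}."""
    all_paths = {p for flat in revs.values() for p in flat}
    agreed, conflicted = {}, set()
    for path in all_paths:
        vals = {flat.get(path, kMISSING) for flat in revs.values()}
        if len(vals) == 1:
            agreed[path] = vals.pop()
        else:
            conflicted.add(path[:1])           # hoist to the top-level field
    out = {}
    for path, val in agreed.items():
        if path[:1] not in conflicted:         # untouched fields stored once
            out[(doc_id,) + path] = val
    for prefix in conflicted:
        out[(doc_id,) + prefix] = kCONFLICT    # marker at top of the subtree
        for rev, flat in revs.items():
            sub = {p: v for p, v in flat.items() if p[:1] == prefix}
            if not sub:                        # field absent in this revision
                out[(doc_id,) + prefix + (rev,)] = kMISSING
            for p, v in sub.items():
                out[(doc_id,) + prefix + (rev,) + p[1:]] = v
    return out


# Reproduces the second example above: "owner" and "active" both conflict.
kvs = merge_revisions("foo", {
    "1-abc": {("owner",): "alice", ("active",): True},
    "1-def": {("owner", "name"): "bob", ("owner", "email"): "b...@example.com"},
})

A reader that hits kCONFLICT at ("foo", "owner") can treat everything below it as per-revision data and pick a winner without touching the rest of the document.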
> > >>>>> > > >>>>> Revision Metadata > > >>>>> > > >>>>> * CouchDB uses a hash history for revisions > > >>>>> ** Each edit is identified by the hash of the content of the edit > > >>>>> including the base revision against which it was applied > > >>>>> ** Individual edit branches are bounded in length but the number of > > >>>>> branches is potentially unbounded > > >>>>> > > >>>>> * Size limits preclude us from storing the entire key tree as a single > > >>>>> value; in pathological situations > > >>>>> the tree could exceed 100KB (each entry is > 16 bytes) > > >>>>> > > >>>>> * Store each edit branch as a separate KV including deleted status in > > >>>>> a > > >>>>> special subspace > > >>>>> > > >>>>> * Structure key representation so that “winning” revision can be > > >>>>> automatically retrieved in a limit=1 > > >>>>> key range operation > > >>>>> > > >>>>> (“foo”, “_meta”, “deleted=false”, 1, “def”) = [] > > >>>>> (“foo”, “_meta”, “deleted=false”, 4, “bif”) = > > >>>>> [“3-baz”,”2-bar”,”1-foo”] > > >>>>> <-- winner > > >>>>> (“foo”, “_meta”, “deleted=true”, 3, “abc”) = [“2-bar”, “1-foo”] > > >>>>> > > >>>>> Changes Feed > > >>>>> > > >>>>> * FDB supports a concept called a versionstamp — a 10 byte, unique, > > >>>>> monotonically (but not sequentially) increasing value for each > > >>>>> committed > > >>>>> transaction. The first 8 bytes are the committed version of the > > >>>>> database. The last 2 bytes are monotonic in the serialization order > > >>>>> for > > >>>>> transactions. > > >>>>> > > >>>>> * A transaction can specify a particular index into a key where the > > >>>>> following 10 bytes will be overwritten by the versionstamp at commit > > >>>>> time > > >>>>> > > >>>>> * A subspace keyed on versionstamp naturally yields a _changes feed > > >>>>> > > >>>>> by_seq subspace > > >>>>> (“versionstamp1”) = (“foo”, “1-abc”) > > >>>>> (“versionstamp4”) = (“bar”, “4-def”) > > >>>>> > > >>>>> by_id subspace > > >>>>> (“bar”, “_vsn”) = “versionstamp4” > > >>>>> ... > > >>>>> (“foo”, “_vsn”) = “versionstamp1” > > >>>>> > > >>>>> JSON Indexes > > >>>>> > > >>>>> * “Mango” JSON indexes are defined by > > >>>>> ** a list of field names, each of which may be nested, > > >>>>> ** an optional partial_filter_selector which constrains the set of > > >>>>> docs > > >>>>> that contribute > > >>>>> ** an optional name defined by the ddoc field (the name is auto- > > >>>>> generated if not supplied) > > >>>>> > > >>>>> * Store index definitions in a single subspace to aid query planning > > >>>>> ** ((person,name), title, email) = (“name-title-email”, “{“student”: > > >>>>> true}”) > > >>>>> ** Store the values for each index in a dedicated subspace, adding the > > >>>>> document ID as the last element in the tuple > > >>>>> *** (“rosie revere”, “engineer”, “ro...@example.com", “foo”) = null > > >>>>> > > >>>>> B. > > >>>>> > > >>>>> -- > > >>>>> Robert Samuel Newson > > >>>>> rnew...@apache.org > > >>>>> > > >>>>> On Mon, 4 Feb 2019, at 19:13, Ilya Khlopotov wrote: > > >>>>>> > > >>>>>> I want to fix previous mistakes. I did two mistakes in previous > > >>>>>> calculations: > > >>>>>> - I used 1Kb as base size for calculating expansion factor (although > > >>>> we > > >>>>>> don't know exact size of original document) > > >>>>>> - The expansion factor calculation included number of revisions (it > > >>>>>> shouldn't) > > >>>>>> > > >>>>>> I'll focus on flattened JSON docs model > > >>>>>> > > >>>>>> The following formula is used in previous calculation. 
> > >>>>>> storage_size_per_document=mapping_table_size*number_of_revisions + > > >>>>>> depth*number_of_paths*number_of_revisions + > > >>>>>> number_of_paths*value_size*number_of_revisions > > >>>>>> > > >>>>>> To clarify things a little bit I want to calculate space requirement > > >>>> for > > >>>>>> single revision this time. > > >>>>>> mapping_table_size=field_name_size*(field_name_length+4(integer > > >>>>>> size))=100 * (20 + 4(integer size)) = 2400 bytes > > >>>>>> storage_size_per_document_per_revision_per_replica=mapping_table_size > > >>>> + > > >>>>>> depth*number_of_paths + value_size*number_of_paths = > > >>>>>> 2400bytes + 10*1000+1000*100=112400bytes~=110 Kb > > >>>>>> > > >>>>>> We definitely can reduce requirement for mapping table by adopting > > >>>>>> rnewson's idea of a schema. > > >>>>>> > > >>>>>> On 2019/02/04 11:08:16, Ilya Khlopotov <iil...@apache.org> wrote: > > >>>>>>> Hi Michael, > > >>>>>>> > > >>>>>>>> For example, hears a crazy thought: > > >>>>>>>> Map every distinct occurence of a key/value instance through a > > >>>> crypto hash > > >>>>>>>> function to get a set of hashes. > > >>>>>>>> > > >>>>>>>> These can be be precomputed by Couch without any lookups in FDB. > > >>>> These > > >>>>>>>> will be spread all over kingdom come in FDB and not lend > > >>>> themselves to > > >>>>>>>> range search well. > > >>>>>>>> > > >>>>>>>> So what you do is index them for frequency of occurring in the > > >>>> same set. > > >>>>>>>> In essence, you 'bucket them' statistically, and that bucket id > > >>>> becomes a > > >>>>>>>> key prefix. A crypto hash value can be copied into more than one > > >>>> bucket. > > >>>>>>>> The {bucket_id}/{cryptohash} becomes a {val_id} > > >>>>>>> > > >>>>>>>> When writing a document, Couch submits the list/array of > > >>>> cryptohash values > > >>>>>>>> it computed to FDB and gets back the corresponding {val_id} (the > > >>>> id with > > >>>>>>>> the bucket prefixed). This can get somewhat expensive if there's > > >>>> always a > > >>>>>>>> lot of app local cache misses. > > >>>>>>>> > > >>>>>>>> A document's value is then a series of {val_id} arrays up to 100k > > >>>> per > > >>>>>>>> segment. > > >>>>>>>> > > >>>>>>>> When retrieving a document, you get the val_ids, find the distinct > > >>>> buckets > > >>>>>>>> and min/max entries for this doc, and then parallel query each > > >>>> bucket while > > >>>>>>>> reconstructing the document. > > >>>>>>> > > >>>>>>> Interesting idea. Let's try to think it through to see if we can > > >>>> make it viable. > > >>>>>>> Let's go through hypothetical example. 
Input data for the example: > > >>>>>>> - 1M of documents > > >>>>>>> - each document is around 10Kb > > >>>>>>> - each document consists of 1K of unique JSON paths > > >>>>>>> - each document has 100 unique JSON field names > > >>>>>>> - every scalar value is 100 bytes > > >>>>>>> - 10% of unique JSON paths for every document already stored in > > >>>> database under different doc or different revision of the current one > > >>>>>>> - we assume 3 independent copies for every key-value pair in FDB > > >>>>>>> - our hash key size is 32 bytes > > >>>>>>> - let's assume we can determine if key is already on the storage > > >>>> without doing query > > >>>>>>> - 1% of paths is in cache (unrealistic value, in real live the > > >>>> percentage is lower) > > >>>>>>> - every JSON field name is 20 bytes > > >>>>>>> - every JSON path is 10 levels deep > > >>>>>>> - document key prefix length is 50 > > >>>>>>> - every document has 10 revisions > > >>>>>>> Let's estimate the storage requirements and size of data we need to > > >>>> transmit. The calculations are not exact. > > >>>>>>> 1. storage_size_per_document (we cannot estimate exact numbers since > > >>>> we don't know how FDB stores it) > > >>>>>>> - 10 * ((10Kb - (10Kb * 10%)) + (1K - (1K * 10%)) * 32 bytes) = > > >>>> 38Kb * 10 * 3 = 1140 Kb (11x) > > >>>>>>> 2. number of independent keys to retrieve on document read > > >>>> (non-range queries) per document > > >>>>>>> - 1K - (1K * 1%) = 990 > > >>>>>>> 3. number of range queries: 0 > > >>>>>>> 4. data to transmit on read: (1K - (1K * 1%)) * (100 bytes + 32 > > >>>> bytes) = 102 Kb (10x) > > >>>>>>> 5. read latency (we use 2ms per read based on numbers from > > >>>> https://apple.github.io/foundationdb/performance.html) > > >>>>>>> - sequential: 990*2ms = 1980ms > > >>>>>>> - range: 0 > > >>>>>>> Let's compare these numbers with initial proposal (flattened JSON > > >>>> docs without global schema and without cache) > > >>>>>>> 1. storage_size_per_document > > >>>>>>> - mapping table size: 100 * (20 + 4(integer size)) = 2400 bytes > > >>>>>>> - key size: (10 * (4 + 1(delimiter))) + 50 = 100 bytes > > >>>>>>> - storage_size_per_document: 2.4K*10 + 100*1K*10 + 1K*100*10 = > > >>>> 2024K = 1976 Kb * 3 = 5930 Kb (59.3x) > > >>>>>>> 2. number of independent keys to retrieve: 0-2 (depending on index > > >>>> structure) > > >>>>>>> 3. number of range queries: 1 (1001 of keys in result) > > >>>>>>> 4. data to transmit on read: 24K + 1000*100 + 1000*100 = 23.6 Kb > > >>>> (2.4x) > > >>>>>>> 5. read latency (we use 2ms per read based on numbers from > > >>>> https://apple.github.io/foundationdb/performance.html and estimate > > >>>> range > > >>>> read performance based on numbers from > > >>>> https://apple.github.io/foundationdb/benchmarking.html#single-core-read-test > > >>>> ) > > >>>>>>> - range read performance: Given read performance is about 305,000 > > >>>> reads/second and range performance 3,600,000 keys/second we estimate > > >>>> range > > >>>> performance to be 11.8x compared to read performance. If read > > >>>> performance > > >>>> is 2ms than range performance is 0.169ms (which is hard to believe). 
> > >>>>>>> - sequential: 2 * 2 = 4ms > > >>>>>>> - range: 0.169 > > >>>>>>> > > >>>>>>> It looks like we are dealing with a tradeoff: > > >>>>>>> - Map every distinct occurrence of a key/value instance through a > > >>>> crypto hash: > > >>>>>>> - 5.39x more disk space efficient > > >>>>>>> - 474x slower > > >>>>>>> - flattened JSON model > > >>>>>>> - 5.39x less efficient in disk space > > >>>>>>> - 474x faster > > >>>>>>> > > >>>>>>> In any case this unscientific exercise was very helpful. Since it > > >>>> uncovered the high cost in terms of disk space. 59.3x of original disk > > >>>> size > > >>>> is too much IMO. > > >>>>>>> > > >>>>>>> Are the any ways we can make Michael's model more performant? > > >>>>>>> > > >>>>>>> Also I don't quite understand few aspects of the global hash table > > >>>> proposal: > > >>>>>>> > > >>>>>>> 1. > - Map every distinct occurence of a key/value instance through > > >>>> a crypto hash function to get a set of hashes. > > >>>>>>> I think we are talking only about scalar values here? I.e. > > >>>> `"#/foo.bar.baz": 123` > > >>>>>>> Since I don't know how we can make it work for all possible JSON > > >>>> paths `{"foo": {"bar": {"size": 12, "baz": 123}}}": > > >>>>>>> - foo > > >>>>>>> - foo.bar > > >>>>>>> - foo.bar.baz > > >>>>>>> > > >>>>>>> 2. how to delete documents > > >>>>>>> > > >>>>>>> Best regards, > > >>>>>>> ILYA > > >>>>>>> > > >>>>>>> > > >>>>>>> On 2019/01/30 23:33:22, Michael Fair <mich...@daclubhouse.net> > > >>>> wrote: > > >>>>>>>> On Wed, Jan 30, 2019, 12:57 PM Adam Kocoloski <kocol...@apache.org > > >>>> wrote: > > >>>>>>>> > > >>>>>>>>> Hi Michael, > > >>>>>>>>> > > >>>>>>>>>> The trivial fix is to use DOCID/REVISIONID as DOC_KEY. > > >>>>>>>>> > > >>>>>>>>> Yes that’s definitely one way to address storage of edit > > >>>> conflicts. I > > >>>>>>>>> think there are other, more compact representations that we can > > >>>> explore if > > >>>>>>>>> we have this “exploded” data model where each scalar value maps > > >>>> to an > > >>>>>>>>> individual KV pair. > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> I agree, as I mentioned on the original thread, I see a scheme, > > >>>> that > > >>>>>>>> handles both conflicts and revisions, where you only have to store > > >>>> the most > > >>>>>>>> recent change to a field. Like you suggested, multiple revisions > > >>>> can share > > >>>>>>>> a key. Which in my mind's eye further begs the conflicts/revisions > > >>>>>>>> discussion along with the working within the limits discussion > > >>>> because it > > >>>>>>>> seems to me they are all intrinsically related as a "feature". > > >>>>>>>> > > >>>>>>>> Saying 'We'll break documents up into roughly 80k segments', then > > >>>> trying to > > >>>>>>>> overlay some kind of field sharing scheme for revisions/conflicts > > >>>> doesn't > > >>>>>>>> seem like it will work. > > >>>>>>>> > > >>>>>>>> I probably should have left out the trivial fix proposal as I > > >>>> don't think > > >>>>>>>> it's a feasible solution to actually use. > > >>>>>>>> > > >>>>>>>> The comment is more regarding that I do not see how this thread > > >>>> can escape > > >>>>>>>> including how to store/retrieve conflicts/revisions. > > >>>>>>>> > > >>>>>>>> For instance, the 'doc as individual fields' proposal lends itself > > >>>> to value > > >>>>>>>> sharing across mutiple documents (and I don't just mean revisions > > >>>> of the > > >>>>>>>> same doc, I mean the same key/value instance could be shared for > > >>>> every > > >>>>>>>> document). 
> > >>>>>>>> However that's not really relevant if we're not considering the > > >>>> amount of > > >>>>>>>> shared information across documents in the storage scheme. > > >>>>>>>> > > >>>>>>>> Simply storing documents in <100k segments (perhaps in some kind of > > >>>>>>>> compressed binary representation) to deal with that FDB limit > > >>>> seems fine. > > >>>>>>>> The only reason to consider doing something else is because of its > > >>>> impact > > >>>>>>>> to indexing, searches, reduce functions, revisions, on-disk size > > >>>> impact, > > >>>>>>>> etc. > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> > > >>>>>>>>>> I'm assuming the process will flatten the key paths of the > > >>>> document into > > >>>>>>>>> an array and then request the value of each key as multiple > > >>>> parallel > > >>>>>>>>> queries against FDB at once > > >>>>>>>>> > > >>>>>>>>> Ah, I think this is not one of Ilya’s assumptions. He’s trying > > >>>> to design a > > >>>>>>>>> model which allows the retrieval of a document with a single > > >>>> range read, > > >>>>>>>>> which is a good goal in my opinion. > > >>>>>>>>> > > >>>>>>>> > > >>>>>>>> I am not sure I agree. > > >>>>>>>> > > >>>>>>>> Think of bitTorrent, a single range read should pull back the > > >>>> structure of > > >>>>>>>> the document (the pieces to fetch), but not necessarily the whole > > >>>> document. > > >>>>>>>> > > >>>>>>>> What if you already have a bunch of pieces in common with other > > >>>> documents > > >>>>>>>> locally (a repeated header/footer/ or type for example); and you > > >>>> only need > > >>>>>>>> to get a few pieces of data you don't already have? > > >>>>>>>> > > >>>>>>>> The real goal to Couch I see is to treat your document set like the > > >>>>>>>> collection of structured information that it is. In some respects > > >>>> like an > > >>>>>>>> extension of your application's heap space for structured objects > > >>>> and > > >>>>>>>> efficiently querying that collection to get back subsets of the > > >>>> data. > > >>>>>>>> > > >>>>>>>> Otherwise it seems more like a slightly upgraded file system plus > > >>>> a fancy > > >>>>>>>> grep/find like feature... > > >>>>>>>> > > >>>>>>>> The best way I see to unlock more features/power is to a move > > >>>> towards a > > >>>>>>>> more granular and efficient way to store and retrieve the scalar > > >>>> values... > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> For example, hears a crazy thought: > > >>>>>>>> Map every distinct occurence of a key/value instance through a > > >>>> crypto hash > > >>>>>>>> function to get a set of hashes. > > >>>>>>>> > > >>>>>>>> These can be be precomputed by Couch without any lookups in FDB. > > >>>> These > > >>>>>>>> will be spread all over kingdom come in FDB and not lend > > >>>> themselves to > > >>>>>>>> range search well. > > >>>>>>>> > > >>>>>>>> So what you do is index them for frequency of occurring in the > > >>>> same set. > > >>>>>>>> In essence, you 'bucket them' statistically, and that bucket id > > >>>> becomes a > > >>>>>>>> key prefix. A crypto hash value can be copied into more than one > > >>>> bucket. > > >>>>>>>> The {bucket_id}/{cryptohash} becomes a {val_id} > > >>>>>>>> > > >>>>>>>> When writing a document, Couch submits the list/array of > > >>>> cryptohash values > > >>>>>>>> it computed to FDB and gets back the corresponding {val_id} (the > > >>>> id with > > >>>>>>>> the bucket prefixed). This can get somewhat expensive if there's > > >>>> always a > > >>>>>>>> lot of app local cache misses. 
> > >>>>>>>> > > >>>>>>>> > > >>>>>>>> A document's value is then a series of {val_id} arrays up to 100k > > >>>> per > > >>>>>>>> segment. > > >>>>>>>> > > >>>>>>>> When retrieving a document, you get the val_ids, find the distinct > > >>>> buckets > > >>>>>>>> and min/max entries for this doc, and then parallel query each > > >>>> bucket while > > >>>>>>>> reconstructing the document. > > >>>>>>>> > > >>>>>>>> The values returned from the buckets query are the key/value > > >>>> strings > > >>>>>>>> required to reassemble this document. > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> ---------- > > >>>>>>>> I put this forward primarily to hilite the idea that trying to > > >>>> match the > > >>>>>>>> storage representation of documents in a straight forward way to > > >>>> FDB keys > > >>>>>>>> to reduce query count might not be the most performance oriented > > >>>> approach. > > >>>>>>>> > > >>>>>>>> I'd much prefer a storage approach that reduced data duplication > > >>>> and > > >>>>>>>> enabled fast sub-document queries. > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> This clearly falls in the realm of what people want the 'use case' > > >>>> of Couch > > >>>>>>>> to be/become. By giving Couch more access to sub-document > > >>>> queries, I could > > >>>>>>>> eventually see queries as complicated as GraphQL submitted to > > >>>> Couch and > > >>>>>>>> pulling back ad-hoc aggregated data across multiple documents in a > > >>>> single > > >>>>>>>> application layer request. > > >>>>>>>> > > >>>>>>>> Hehe - one way to look at the database of Couch documents is that > > >>>> they are > > >>>>>>>> all conflict revisions of the single root empty document. What I > > >>>> mean be > > >>>>>>>> this is consider thinking of the entire document store as one > > >>>> giant DAG of > > >>>>>>>> key/value pairs. How even separate documents are still typically > > >>>> related to > > >>>>>>>> each other. For most applications there is a tremendous amount of > > >>>> data > > >>>>>>>> redundancy between docs and especially between revisions of those > > >>>> docs... > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> And all this is a long way of saying "I think there could be a lot > > >>>> of value > > >>>>>>>> in assuming documents are 'assembled' from multiple queries to > > >>>> FDB, with > > >>>>>>>> local caching, instead of simply retrieved" > > >>>>>>>> > > >>>>>>>> Thanks, I hope I'm not the only outlier here thinking this way!? > > >>>>>>>> > > >>>>>>>> Mike :-) > > >>>>>>>> > > >>>>>>> > > >>>> > > >> > >
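For completeness, a toy sketch of the content-addressed scheme Michael describes above: every (path, value) pair is hashed, stored once in a shared values space, and a document becomes an ordered list of val_ids. The bucketing here is a naive hash-prefix stand-in for the co-occurrence-based bucketing he proposes, dicts stand in for FDB subspaces, and all names are hypothetical.

# Toy model: dicts stand in for FDB subspaces; hashing/bucketing is naive.
import hashlib
import json

values_space = {}   # (bucket_id, digest) -> serialized (path, value) pair
docs_space = {}     # doc_id -> ordered list of val_ids (the "manifest")


def val_id(path, value, buckets=256):
    """Hash one (path, value) instance; bucket by digest prefix (stand-in)."""
    blob = json.dumps([path, value], sort_keys=True).encode()
    digest = hashlib.sha256(blob).hexdigest()
    return (int(digest[:2], 16) % buckets, digest), blob


def write_doc(doc_id, flat_doc):
    """flat_doc: {path_tuple: scalar}. Store shared values plus a manifest."""
    manifest = []
    for path, value in flat_doc.items():
        vid, blob = val_id(list(path), value)
        values_space.setdefault(vid, blob)   # identical pairs stored once
        manifest.append(vid)
    docs_space[doc_id] = manifest            # would be <=100k segments in FDB


def read_doc(doc_id):
    """Fetch every referenced value (grouped per bucket in the real scheme)."""
    doc = {}
    for vid in docs_space[doc_id]:
        path, value = json.loads(values_space[vid])
        doc[tuple(path)] = value
    return doc


write_doc("foo", {("owner",): "bob", ("mylist", 0): 1})
write_doc("bar", {("owner",): "bob"})        # re-uses the existing "owner" value
assert read_doc("foo") == {("owner",): "bob", ("mylist", 0): 1}

The dedup via setdefault is where the space savings come from; the price is an extra lookup per val_id on reads (per bucket, in the real thing), which is the latency cost Ilya's estimates earlier in the thread try to put a number on.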