I went and collected this discussion into an RFC proposing the “exploded KV” approach:
https://github.com/apache/couchdb-documentation/pull/403

Cheers, Adam

> On Feb 20, 2019, at 10:47 AM, Paul Davis <paul.joseph.da...@gmail.com> wrote:
>
> Strongly agree that we very much don't want to have Erlang-isms being pushed into fdb. Regardless of what we end up with, I'd like to see a very strong (de)?serialization layer with some significant test coverage.
>
> On Tue, Feb 19, 2019 at 6:54 PM Adam Kocoloski <kocol...@apache.org> wrote:
>>
>> Yes, that sort of versioning has been omitted from the various concrete proposals but we definitely want to have it. We've seen the alternative in some of the Erlang records that we serialize to disk today and it ain't pretty.
>>
>> I can imagine that we'll want to have the codebase laid out in a way that allows us to upgrade to a smarter KV encoding over time without major surgery, which I think is a good "layer of abstraction". I would be nervous if we started having abstract containers of data structures pushed down into FDB itself :)
>>
>> Adam
>>
>>> On Feb 19, 2019, at 5:41 PM, Paul Davis <paul.joseph.da...@gmail.com> wrote:
>>>
>>> A simple doc storage version number would likely be enough for future us to do fancier things.
>>>
>>> On Tue, Feb 19, 2019 at 4:16 PM Benjamin Anderson <banjie...@apache.org> wrote:
>>>
>>>>> I don't think adding a layer of abstraction is the right move just yet, I think we should continue to find consensus on one answer to this question
>>>>
>>>> Agree that the theorycrafting stage is not optimal for making abstraction decisions, but I suspect it would be worthwhile somewhere between prototyping and releasing. Adam's proposal does seem to me the most appealing approach on the surface, and I don't see anyone signing up to do the work to deliver an alternative concurrently.
>>>>
>>>> --
>>>> ba
>>>>
>>>> On Tue, Feb 19, 2019 at 1:43 PM Robert Samuel Newson <rnew...@apache.org> wrote:
>>>>>
>>>>> Addendum: By "directory aliasing" I meant within a document (either the actual Directory thing or something equivalent of our own making). The directory aliasing for each database is a good way to reduce key size without a significant cost. Though if Redwood lands in time, even this would become an inutile obfuscation.
>>>>>
>>>>>> On 19 Feb 2019, at 21:39, Robert Samuel Newson <rnew...@apache.org> wrote:
>>>>>>
>>>>>> Interesting suggestion, obviously the details might get the wrong kind of fun.
>>>>>>
>>>>>> Somewhere above I suggested this would be something we could change over time and even use different approaches for different documents within the same database. This is the long way of saying there are multiple ways to do this, each with advantages and none without disadvantages.
>>>>>>
>>>>>> I don't think adding a layer of abstraction is the right move just yet, I think we should continue to find consensus on one answer to this question (and the related ones in other threads) for the first release. It's easy to say "we can change it later", of course. We can, though it would be a chunk of work in the context of something that already works; I've rarely seen anyone sign up for that.
>>>>>>
>>>>>> I'm fine with the first proposal from Adam, where the keys are tuples of key parts pointing at terminal values.
>>>>>> To make it easier for the first version, I would exclude optimisations like deduplication or the Directory aliasing or the schema thing that I suggested and that Ilya incorporated a variant of in a follow-up post. We'd accept that there are limits on the sizes of documents, including the awkward-to-express one about property depth.
>>>>>>
>>>>>> Stepping back, I'm not seeing any essential improvement over Adam's original proposal besides the few corrections and clarifications made by various authors. Could we start an RFC based on Adam's original proposal on document body, revision tree and index storage? We could then have PRs against that for each additional optimisation (one person's optimisation is another person's needless complication)?
>>>>>>
>>>>>> If I've missed some genuine advance on the original proposal in this long thread, please call it out for me.
>>>>>>
>>>>>> B.
>>>>>>
>>>>>>> On 19 Feb 2019, at 21:15, Benjamin Anderson <banjie...@apache.org> wrote:
>>>>>>>
>>>>>>> As is evident from the length of this thread, there's a pretty big design space to cover here, and it seems unlikely we'll have arrived at a "correct" solution even by the time this thing ships. Perhaps it would be worthwhile to treat the in-FDB representation of data as a first-class abstraction and support multiple representations simultaneously?
>>>>>>>
>>>>>>> Obviously there's no such thing as a zero-cost abstraction - and I've not thought very hard about how far up the stack the document representation would need to leak - but supporting different layouts (primarily, as Adam points out, on the document body itself) might prove interesting and useful. I'm sure there are folks interested in a column-shaped CouchDB, for example.
>>>>>>>
>>>>>>> --
>>>>>>> b
>>>>>>>
>>>>>>> On Tue, Feb 19, 2019 at 11:39 AM Robert Newson <rnew...@apache.org> wrote:
>>>>>>>>
>>>>>>>> Good points on revtree, I agree with you that we should store that intelligently to gain the benefits you mentioned.
>>>>>>>>
>>>>>>>> --
>>>>>>>> Robert Samuel Newson
>>>>>>>> rnew...@apache.org
>>>>>>>>
>>>>>>>> On Tue, 19 Feb 2019, at 18:41, Adam Kocoloski wrote:
>>>>>>>>> I do not think we should store the revtree as a blob. The design where each edit branch is its own KV should save on network IO and CPU cycles for normal updates. We've performed too many heroics to keep couch_key_tree from stalling entire databases when trying to update a single document with a wide revision tree; I would much prefer to ignore other edit branches entirely when all we're doing is extending one of them.
>>>>>>>>>
>>>>>>>>> I also do not think we should store JSON documents as blobs, but it's a closer call.
>>>>>>>>> Some of my reasoning for preferring the exploded path design:
>>>>>>>>>
>>>>>>>>> - it lends itself nicely to sub-document operations, for which Jan crafted an RFC last year: https://github.com/apache/couchdb/issues/1559
>>>>>>>>> - it optimizes the creation of Mango indexes on existing databases since we only need to retrieve the value(s) we want to index
>>>>>>>>> - it optimizes Mango queries that use field selectors
>>>>>>>>> - anyone who wanted to try their hand at GraphQL will find it very handy: https://github.com/apache/couchdb/issues/1499
>>>>>>>>> - looking further ahead, it lets us play with smarter leaf value types like Counters (yes I'm still on the CRDT bandwagon, sorry)
>>>>>>>>>
>>>>>>>>> A few comments on the thread:
>>>>>>>>>
>>>>>>>>>>>> * Most document bodies are probably going to be smaller than 100k. So in the majority of cases it would be one write / one read to update and fetch the document body.
>>>>>>>>>
>>>>>>>>> We should test, but I expect reading 50 KB of data in a range query is almost as efficient as reading a single 50 KB value. Similarly, writes to a contiguous set of keys should be quite efficient.
>>>>>>>>>
>>>>>>>>> I am concerned about the overhead of the repeated field paths in the keys with the exploded path option in the absence of key prefix compression. That would be my main reason to acquiesce and throw away all the document structure.
>>>>>>>>>
>>>>>>>>> Adam
>>>>>>>>>
>>>>>>>>>> On Feb 19, 2019, at 12:04 PM, Robert Newson <rnew...@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>> I like the idea that we'd reuse the same pattern (but perhaps not the same _code_) for doc bodies, revtree and attachments.
>>>>>>>>>>
>>>>>>>>>> I hope we still get to delete couch_key_tree.erl, though.
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Robert Samuel Newson
>>>>>>>>>> rnew...@apache.org
>>>>>>>>>>
>>>>>>>>>> On Tue, 19 Feb 2019, at 17:03, Jan Lehnardt wrote:
>>>>>>>>>>> I like the idea from a "trying a simple thing first" perspective, but Nick's points below are especially convincing to go with this for now.
>>>>>>>>>>>
>>>>>>>>>>> Best
>>>>>>>>>>> Jan
>>>>>>>>>>> —
>>>>>>>>>>>
>>>>>>>>>>>> On 19. Feb 2019, at 17:53, Nick Vatamaniuc <vatam...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> Sorry for jumping in so late, I was following from the sidelines mostly. A lot of good discussion happening and am excited about the possibilities here.
>>>>>>>>>>>>
>>>>>>>>>>>> I do like the simpler "chunking" approach for a few reasons:
>>>>>>>>>>>>
>>>>>>>>>>>> * Most document bodies are probably going to be smaller than 100k. So in the majority of cases it would be one write / one read to update and fetch the document body.
>>>>>>>>>>>>
>>>>>>>>>>>> * We could reuse the chunking code for attachment handling and possibly revision key trees. So it's the general pattern of upload chunks to some prefix, and when finished flip an atomic toggle to make it current.
>>>>>>>>>>>>
>>>>>>>>>>>> * Do the same thing with revision trees and we could re-use the revision tree manipulation logic. That is, the key tree in most cases would be small enough to fit in 100k but if they get huge, they'd get chunked. This would allow us to reuse all the battle-tested couch_key_tree code mostly as is. We even have property tests for it: https://github.com/apache/couchdb/blob/master/src/couch/test/couch_key_tree_prop_tests.erl
>>>>>>>>>>>>
>>>>>>>>>>>> * It removes the need to explain the max exploded path length limitation to customers.
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>> -Nick
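A minimal sketch of this chunked-body pattern, using the Python fdb bindings purely for illustration; the subspace prefix, CHUNK_SIZE constant and function names are assumptions, not part of any proposal. Each body is split into values of at most 100 KB under a per-document prefix and reassembled with a single range read:

    import fdb
    import json

    fdb.api_version(620)
    db = fdb.open()

    CHUNK_SIZE = 100_000                                  # stay under FDB's 100 KB value limit
    doc_bodies = fdb.Subspace(('mydb', 'doc_bodies'))     # hypothetical key prefix

    @fdb.transactional
    def write_doc_body(tr, docid, body):
        blob = json.dumps(body).encode('utf-8')
        # Clear any previous chunks for this doc, then write the new ones in order.
        tr.clear_range_startswith(doc_bodies.pack((docid,)))
        for n, i in enumerate(range(0, len(blob), CHUNK_SIZE)):
            tr[doc_bodies.pack((docid, n))] = blob[i:i + CHUNK_SIZE]

    @fdb.transactional
    def read_doc_body(tr, docid):
        # One range read returns the chunks in key order; concatenate and decode.
        chunks = [kv.value for kv in tr.get_range_startswith(doc_bodies.pack((docid,)))]
        return json.loads(b''.join(chunks)) if chunks else None

The "flip an atomic toggle to make it current" step Nick mentions is omitted here; a real implementation would write the chunks under a staging prefix and switch a pointer key in the same transaction.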
>>>>>>>>>>>> On Tue, Feb 19, 2019 at 11:18 AM Robert Newson <rnew...@apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> An alternative storage model that we should seriously consider is to follow our current approach in couch_file et al. Specifically, that the document _body_ is stored as an uninterpreted binary value. This would be much like the obvious plan for attachment storage; a key prefix that identifies the database and document, with the final item of that key tuple being an incrementing integer. Each of those keys has a binary value of up to 100k. Fetching all values with that key prefix, in fdb's natural ordering, will yield the full document body, which can be JSON decoded for further processing.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I like this idea, and I like Adam's original proposal to explode documents into property paths. I have a slight preference for the simplicity of the idea in the previous paragraph, not least because it's close to what we do today. I also think it will be possible to migrate to alternative storage models in future, and FoundationDB's transaction support means we can do this migration seamlessly should we come to it.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm very interested in knowing if anyone else is interested in going this simple, or considers it a wasted opportunity relative to the 'exploded' path.
>>>>>>>>>>>>>
>>>>>>>>>>>>> B.
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Robert Samuel Newson
>>>>>>>>>>>>> rnew...@apache.org
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, 4 Feb 2019, at 19:59, Robert Newson wrote:
>>>>>>>>>>>>>> I've been remiss here in not posting the data model ideas that IBM worked up while we were thinking about using FoundationDB, so I'm posting it now. This is Adam Kocoloski's original work, I am just transcribing it, and this is the context that the folks from the IBM side came in with, for full disclosure.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Basics
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1. All CouchDB databases are inside a Directory
>>>>>>>>>>>>>> 2. Each CouchDB database is a Directory within that Directory
>>>>>>>>>>>>>> 3. It's possible to list all subdirectories of a Directory, so `_all_dbs` is the list of directories from 1.
>>>>>>>>>>>>>> 4. Each Directory representing a CouchDB database has several Subspaces:
>>>>>>>>>>>>>> 4a. by_id/ doc subspace: actual document contents
>>>>>>>>>>>>>> 4b. by_seq/versionstamp subspace: for the _changes feed
>>>>>>>>>>>>>> 4c. index_definitions, indexes, ...
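For concreteness, a sketch of that layout using the Python fdb bindings and the Directory layer, for illustration only; the ('couchdb',) root path, the 'mydb' name and the subspace names are assumptions:

    import fdb

    fdb.api_version(620)
    db = fdb.open()

    couch = fdb.directory.create_or_open(db, ('couchdb',))   # 1. all databases live under one Directory
    dbdir = couch.create_or_open(db, ('mydb',))              # 2. one Directory per CouchDB database
    all_dbs = couch.list(db)                                  # 3. `_all_dbs` is a directory listing

    by_id = dbdir['by_id']                                    # 4a. document contents
    by_seq = dbdir['by_seq']                                  # 4b. versionstamp subspace for _changes
    index_defs = dbdir['index_definitions']                   # 4c. index definitions, indexes, ...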
>>>>>>>>>>>>>> JSON Mapping
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> A hierarchical JSON object naturally maps to multiple KV pairs in FDB:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>   “_id”: “foo”,
>>>>>>>>>>>>>>   “owner”: “bob”,
>>>>>>>>>>>>>>   “mylist”: [1,3,5],
>>>>>>>>>>>>>>   “mymap”: {
>>>>>>>>>>>>>>     “blue”: “#0000FF”,
>>>>>>>>>>>>>>     “red”: “#FF0000”
>>>>>>>>>>>>>>   }
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> maps to
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> (“foo”, “owner”) = “bob”
>>>>>>>>>>>>>> (“foo”, “mylist”, 0) = 1
>>>>>>>>>>>>>> (“foo”, “mylist”, 1) = 3
>>>>>>>>>>>>>> (“foo”, “mylist”, 2) = 5
>>>>>>>>>>>>>> (“foo”, “mymap”, “blue”) = “#0000FF”
>>>>>>>>>>>>>> (“foo”, “mymap”, “red”) = “#FF0000”
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> NB: this means that the 100KB limit applies to individual leafs in the JSON object, not the entire doc
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Edit Conflicts
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We need to account for the presence of conflicts in various levels of the doc due to replication.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Proposal is to create a special value indicating that the subtree below our current cursor position is in an unresolvable conflict. Then add additional KV pairs below to describe the conflicting entries.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> KV data model allows us to store these efficiently and minimize duplication of data:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> A document with these two conflicts:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>   “_id”: “foo”,
>>>>>>>>>>>>>>   “_rev”: “1-abc”,
>>>>>>>>>>>>>>   “owner”: “alice”,
>>>>>>>>>>>>>>   “active”: true
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>   “_id”: “foo”,
>>>>>>>>>>>>>>   “_rev”: “1-def”,
>>>>>>>>>>>>>>   “owner”: “bob”,
>>>>>>>>>>>>>>   “active”: true
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> could be stored thus:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> (“foo”, “active”) = true
>>>>>>>>>>>>>> (“foo”, “owner”) = kCONFLICT
>>>>>>>>>>>>>> (“foo”, “owner”, “1-abc”) = “alice”
>>>>>>>>>>>>>> (“foo”, “owner”, “1-def”) = “bob”
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So long as `kCONFLICT` is set at the top of the conflicting subtree this representation can handle conflicts of different data types as well.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Missing fields need to be handled explicitly:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>   “_id”: “foo”,
>>>>>>>>>>>>>>   “_rev”: “1-abc”,
>>>>>>>>>>>>>>   “owner”: “alice”,
>>>>>>>>>>>>>>   “active”: true
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>   “_id”: “foo”,
>>>>>>>>>>>>>>   “_rev”: “1-def”,
>>>>>>>>>>>>>>   “owner”: {
>>>>>>>>>>>>>>     “name”: “bob”,
>>>>>>>>>>>>>>     “email”: “b...@example.com”
>>>>>>>>>>>>>>   }
>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> could be stored thus:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> (“foo”, “active”) = kCONFLICT
>>>>>>>>>>>>>> (“foo”, “active”, “1-abc”) = true
>>>>>>>>>>>>>> (“foo”, “active”, “1-def”) = kMISSING
>>>>>>>>>>>>>> (“foo”, “owner”) = kCONFLICT
>>>>>>>>>>>>>> (“foo”, “owner”, “1-abc”) = “alice”
>>>>>>>>>>>>>> (“foo”, “owner”, “1-def”, “name”) = “bob”
>>>>>>>>>>>>>> (“foo”, “owner”, “1-def”, “email”) = ...
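A rough sketch of this exploded mapping in plain Python (no FDB calls): flatten each edit branch into path-tuple/value pairs and emit kCONFLICT / kMISSING markers at the top of any diverging subtree. The sentinel encodings and function names are made up for illustration and are not part of the proposal:

    KCONFLICT = 'kCONFLICT'   # placeholder sentinels; a real encoding would use
    KMISSING = 'kMISSING'     # values that cannot collide with user data

    def explode(node, path=()):
        # Flatten one (sub)tree into (path, scalar) pairs.
        if isinstance(node, dict):
            for k, v in node.items():
                yield from explode(v, path + (k,))
        elif isinstance(node, list):
            for i, v in enumerate(node):
                yield from explode(v, path + (i,))
        else:
            yield path, node

    def merge_branches(branches, path=()):
        # branches: {rev_id: subtree or KMISSING}. Where all edit branches agree,
        # emit the subtree once; where they diverge, emit kCONFLICT at the top of
        # the diverging subtree plus one entry per revision below it.
        subtrees = list(branches.values())
        if all(v == subtrees[0] for v in subtrees[1:]):
            if subtrees[0] != KMISSING:
                yield from explode(subtrees[0], path)
        elif all(isinstance(v, dict) for v in subtrees):
            for k in set().union(*(v.keys() for v in subtrees)):
                yield from merge_branches(
                    {rev: v.get(k, KMISSING) for rev, v in branches.items()},
                    path + (k,))
        else:
            yield path, KCONFLICT
            for rev, v in branches.items():
                if v == KMISSING:
                    yield path + (rev,), KMISSING
                else:
                    yield from explode(v, path + (rev,))

    # The second conflict example above comes out as expected:
    # sorted(merge_branches({
    #     '1-abc': {'owner': 'alice', 'active': True},
    #     '1-def': {'owner': {'name': 'bob', 'email': '...'}},
    # })) yields
    #   (('active',), 'kCONFLICT'), (('active', '1-abc'), True), (('active', '1-def'), 'kMISSING'),
    #   (('owner',), 'kCONFLICT'), (('owner', '1-abc'), 'alice'),
    #   (('owner', '1-def', 'email'), '...'), (('owner', '1-def', 'name'), 'bob')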
>>>>>>>>>>>>>> Revision Metadata
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> * CouchDB uses a hash history for revisions
>>>>>>>>>>>>>> ** Each edit is identified by the hash of the content of the edit including the base revision against which it was applied
>>>>>>>>>>>>>> ** Individual edit branches are bounded in length but the number of branches is potentially unbounded
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> * Size limits preclude us from storing the entire key tree as a single value; in pathological situations the tree could exceed 100KB (each entry is > 16 bytes)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> * Store each edit branch as a separate KV including deleted status in a special subspace
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> * Structure key representation so that “winning” revision can be automatically retrieved in a limit=1 key range operation
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> (“foo”, “_meta”, “deleted=false”, 1, “def”) = []
>>>>>>>>>>>>>> (“foo”, “_meta”, “deleted=false”, 4, “bif”) = [“3-baz”,”2-bar”,”1-foo”]  <-- winner
>>>>>>>>>>>>>> (“foo”, “_meta”, “deleted=true”, 3, “abc”) = [“2-bar”, “1-foo”]
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Changes Feed
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> * FDB supports a concept called a versionstamp — a 10 byte, unique, monotonically (but not sequentially) increasing value for each committed transaction. The first 8 bytes are the committed version of the database. The last 2 bytes are monotonic in the serialization order for transactions.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> * A transaction can specify a particular index into a key where the following 10 bytes will be overwritten by the versionstamp at commit time
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> * A subspace keyed on versionstamp naturally yields a _changes feed
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> by_seq subspace
>>>>>>>>>>>>>> (“versionstamp1”) = (“foo”, “1-abc”)
>>>>>>>>>>>>>> (“versionstamp4”) = (“bar”, “4-def”)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> by_id subspace
>>>>>>>>>>>>>> (“bar”, “_vsn”) = “versionstamp4”
>>>>>>>>>>>>>> ...
>>>>>>>>>>>>>> (“foo”, “_vsn”) = “versionstamp1”
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> JSON Indexes
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> * “Mango” JSON indexes are defined by
>>>>>>>>>>>>>> ** a list of field names, each of which may be nested,
>>>>>>>>>>>>>> ** an optional partial_filter_selector which constrains the set of docs that contribute
>>>>>>>>>>>>>> ** an optional name defined by the ddoc field (the name is auto-generated if not supplied)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> * Store index definitions in a single subspace to aid query planning
>>>>>>>>>>>>>> ** ((person,name), title, email) = (“name-title-email”, “{“student”: true}”)
>>>>>>>>>>>>>> ** Store the values for each index in a dedicated subspace, adding the document ID as the last element in the tuple
>>>>>>>>>>>>>> *** (“rosie revere”, “engineer”, “ro...@example.com”, “foo”) = null
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> B.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Robert Samuel Newson
>>>>>>>>>>>>>> rnew...@apache.org
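A small sketch of that by_seq idea with versionstamps, again using the Python fdb bindings for illustration only; the key prefix and function names are assumptions, and a real implementation would also clear the document's previous by_seq entry and mirror the versionstamp into by_id:

    import fdb

    fdb.api_version(620)
    db = fdb.open()

    BY_SEQ = ('mydb', 'by_seq')   # hypothetical key prefix

    @fdb.transactional
    def record_update(tr, docid, rev):
        # The incomplete Versionstamp is filled in by FDB at commit time, so each
        # committed update gets a unique, monotonically increasing sequence key.
        key = fdb.tuple.pack_with_versionstamp(BY_SEQ + (fdb.tuple.Versionstamp(),))
        tr.set_versionstamped_key(key, fdb.tuple.pack((docid, rev)))

    @fdb.transactional
    def changes(tr):
        # A single ordered range read over the by_seq subspace is the _changes feed.
        r = fdb.tuple.range(BY_SEQ)
        return [fdb.tuple.unpack(kv.value) for kv in tr.get_range(r.start, r.stop)]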
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, 4 Feb 2019, at 19:13, Ilya Khlopotov wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I want to fix previous mistakes. I made two mistakes in the previous calculations:
>>>>>>>>>>>>>>> - I used 1Kb as the base size for calculating the expansion factor (although we don't know the exact size of the original document)
>>>>>>>>>>>>>>> - The expansion factor calculation included the number of revisions (it shouldn't)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'll focus on the flattened JSON docs model.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The following formula was used in the previous calculation:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> storage_size_per_document = mapping_table_size*number_of_revisions + depth*number_of_paths*number_of_revisions + number_of_paths*value_size*number_of_revisions
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> To clarify things a little bit, I want to calculate the space requirement for a single revision this time.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> mapping_table_size = number_of_field_names*(field_name_length + 4 (integer size)) = 100 * (20 + 4) = 2400 bytes
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> storage_size_per_document_per_revision_per_replica = mapping_table_size + depth*number_of_paths + value_size*number_of_paths = 2400 bytes + 10*1000 + 1000*100 = 112400 bytes ~= 110 Kb
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We definitely can reduce the requirement for the mapping table by adopting rnewson's idea of a schema.
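One way to picture that mapping table, as a plain-Python illustration only (nothing here is from the proposal): intern every distinct field name as a small integer so exploded keys carry 4-byte ids rather than repeated 20-byte names.

    def build_mapping(doc):
        # Collect every distinct field name in the document and assign it a
        # small integer id.
        names = set()
        def walk(node):
            if isinstance(node, dict):
                for k, v in node.items():
                    names.add(k)
                    walk(v)
            elif isinstance(node, list):
                for v in node:
                    walk(v)
        walk(doc)
        return {name: i for i, name in enumerate(sorted(names))}

    # mapping_table_size is then roughly
    #   len(mapping) * (avg_name_length + 4)  ->  100 * (20 + 4) = 2,400 bytes here,
    # and each path component in a key costs ~4 bytes instead of ~20.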
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 2019/02/04 11:08:16, Ilya Khlopotov <iil...@apache.org> wrote:
>>>>>>>>>>>>>>>> Hi Michael,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> For example, here's a crazy thought:
>>>>>>>>>>>>>>>>> Map every distinct occurrence of a key/value instance through a crypto hash function to get a set of hashes.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> These can be precomputed by Couch without any lookups in FDB. These will be spread all over kingdom come in FDB and not lend themselves to range search well.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> So what you do is index them for frequency of occurring in the same set. In essence, you 'bucket them' statistically, and that bucket id becomes a key prefix. A crypto hash value can be copied into more than one bucket. The {bucket_id}/{cryptohash} becomes a {val_id}.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> When writing a document, Couch submits the list/array of cryptohash values it computed to FDB and gets back the corresponding {val_id} (the id with the bucket prefixed). This can get somewhat expensive if there's always a lot of app local cache misses.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> A document's value is then a series of {val_id} arrays up to 100k per segment.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> When retrieving a document, you get the val_ids, find the distinct buckets and min/max entries for this doc, and then parallel query each bucket while reconstructing the document.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Interesting idea. Let's try to think it through to see if we can make it viable. Let's go through a hypothetical example.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Input data for the example:
>>>>>>>>>>>>>>>> - 1M documents
>>>>>>>>>>>>>>>> - each document is around 10Kb
>>>>>>>>>>>>>>>> - each document consists of 1K unique JSON paths
>>>>>>>>>>>>>>>> - each document has 100 unique JSON field names
>>>>>>>>>>>>>>>> - every scalar value is 100 bytes
>>>>>>>>>>>>>>>> - 10% of unique JSON paths for every document are already stored in the database under a different doc or a different revision of the current one
>>>>>>>>>>>>>>>> - we assume 3 independent copies for every key-value pair in FDB
>>>>>>>>>>>>>>>> - our hash key size is 32 bytes
>>>>>>>>>>>>>>>> - let's assume we can determine if a key is already in storage without doing a query
>>>>>>>>>>>>>>>> - 1% of paths is in cache (unrealistic value, in real life the percentage is lower)
>>>>>>>>>>>>>>>> - every JSON field name is 20 bytes
>>>>>>>>>>>>>>>> - every JSON path is 10 levels deep
>>>>>>>>>>>>>>>> - document key prefix length is 50
>>>>>>>>>>>>>>>> - every document has 10 revisions
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Let's estimate the storage requirements and the size of data we need to transmit. The calculations are not exact.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 1. storage_size_per_document (we cannot estimate exact numbers since we don't know how FDB stores it)
>>>>>>>>>>>>>>>>    - 10 * ((10Kb - (10Kb * 10%)) + (1K - (1K * 10%)) * 32 bytes) = 38Kb * 10 * 3 = 1140 Kb (11x)
>>>>>>>>>>>>>>>> 2. number of independent keys to retrieve on document read (non-range queries) per document
>>>>>>>>>>>>>>>>    - 1K - (1K * 1%) = 990
>>>>>>>>>>>>>>>> 3. number of range queries: 0
>>>>>>>>>>>>>>>> 4. data to transmit on read: (1K - (1K * 1%)) * (100 bytes + 32 bytes) = 102 Kb (10x)
>>>>>>>>>>>>>>>> 5. read latency (we use 2ms per read based on numbers from https://apple.github.io/foundationdb/performance.html)
>>>>>>>>>>>>>>>>    - sequential: 990*2ms = 1980ms
>>>>>>>>>>>>>>>>    - range: 0
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Let's compare these numbers with the initial proposal (flattened JSON docs without a global schema and without a cache):
>>>>>>>>>>>>>>>> 1. storage_size_per_document
>>>>>>>>>>>>>>>>    - mapping table size: 100 * (20 + 4(integer size)) = 2400 bytes
>>>>>>>>>>>>>>>>    - key size: (10 * (4 + 1(delimiter))) + 50 = 100 bytes
>>>>>>>>>>>>>>>>    - storage_size_per_document: 2.4K*10 + 100*1K*10 + 1K*100*10 = 2024K = 1976 Kb * 3 = 5930 Kb (59.3x)
>>>>>>>>>>>>>>>> 2. number of independent keys to retrieve: 0-2 (depending on index structure)
>>>>>>>>>>>>>>>> 3. number of range queries: 1 (1001 keys in the result)
>>>>>>>>>>>>>>>> 4. data to transmit on read: 24K + 1000*100 + 1000*100 = 23.6 Kb (2.4x)
>>>>>>>>>>>>>>>> 5. read latency (we use 2ms per read based on numbers from https://apple.github.io/foundationdb/performance.html and estimate range read performance based on numbers from https://apple.github.io/foundationdb/benchmarking.html#single-core-read-test)
>>>>>>>>>>>>>>>>    - range read performance: Given that read performance is about 305,000 reads/second and range performance 3,600,000 keys/second, we estimate range performance to be 11.8x compared to read performance. If read performance is 2ms then range performance is 0.169ms (which is hard to believe).
>>>>>>>>>>>>>>>>    - sequential: 2 * 2 = 4ms
>>>>>>>>>>>>>>>>    - range: 0.169ms
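For what it's worth, a quick back-of-the-envelope check of those figures under the same assumptions (units in the thread are mixed loosely, so this lands near, not exactly on, the quoted numbers):

    paths, value_size, key_size, revs, replicas = 1000, 100, 100, 10, 3
    mapping_table = 100 * (20 + 4)                                   # 2,400 bytes

    # content-hash model: ~90% of each 10 KB body is new, plus a 32-byte hash per new path
    hash_bytes = revs * replicas * (0.9 * 10_000 + 0.9 * paths * 32)
    # flattened-path model: mapping table + 100-byte keys + 100-byte values, per revision, 3 replicas
    flat_bytes = replicas * revs * (mapping_table + key_size * paths + value_size * paths)

    seq_ms = 990 * 2                  # 990 point reads at 2 ms each = 1,980 ms
    range_ms = 2 * 2 + 0.169          # 2 point reads plus one range read = ~4.2 ms

    print(hash_bytes, flat_bytes)     # ~1,134,000 and ~6,072,000 bytes ("1140 Kb" vs "5930 Kb")
    print(flat_bytes / hash_bytes)    # ~5.4x more disk for the flattened model
    print(seq_ms / range_ms)          # ~475x slower for the content-hash model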
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> It looks like we are dealing with a tradeoff:
>>>>>>>>>>>>>>>> - Map every distinct occurrence of a key/value instance through a crypto hash:
>>>>>>>>>>>>>>>>   - 5.39x more disk space efficient
>>>>>>>>>>>>>>>>   - 474x slower
>>>>>>>>>>>>>>>> - flattened JSON model:
>>>>>>>>>>>>>>>>   - 5.39x less efficient in disk space
>>>>>>>>>>>>>>>>   - 474x faster
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> In any case this unscientific exercise was very helpful, since it uncovered the high cost in terms of disk space. 59.3x of the original disk size is too much IMO.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Are there any ways we can make Michael's model more performant?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Also, I don't quite understand a few aspects of the global hash table proposal:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 1. > - Map every distinct occurrence of a key/value instance through a crypto hash function to get a set of hashes.
>>>>>>>>>>>>>>>> I think we are talking only about scalar values here? I.e. `"#/foo.bar.baz": 123`
>>>>>>>>>>>>>>>> Since I don't know how we can make it work for all possible JSON paths of `{"foo": {"bar": {"size": 12, "baz": 123}}}`:
>>>>>>>>>>>>>>>> - foo
>>>>>>>>>>>>>>>> - foo.bar
>>>>>>>>>>>>>>>> - foo.bar.baz
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 2. how to delete documents
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>>> ILYA
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 2019/01/30 23:33:22, Michael Fair <mich...@daclubhouse.net> wrote:
>>>>>>>>>>>>>>>>> On Wed, Jan 30, 2019, 12:57 PM Adam Kocoloski <kocol...@apache.org> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi Michael,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> The trivial fix is to use DOCID/REVISIONID as DOC_KEY.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Yes, that's definitely one way to address storage of edit conflicts. I think there are other, more compact representations that we can explore if we have this "exploded" data model where each scalar value maps to an individual KV pair.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I agree, as I mentioned on the original thread, I see a scheme that handles both conflicts and revisions where you only have to store the most recent change to a field. Like you suggested, multiple revisions can share a key. Which in my mind's eye further begs the conflicts/revisions discussion along with the working-within-the-limits discussion, because it seems to me they are all intrinsically related as a "feature".
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Saying 'We'll break documents up into roughly 80k segments', then trying to overlay some kind of field sharing scheme for revisions/conflicts, doesn't seem like it will work.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I probably should have left out the trivial fix proposal as I don't think it's a feasible solution to actually use.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The comment is more regarding that I do not see how this thread can escape including how to store/retrieve conflicts/revisions.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> For instance, the 'doc as individual fields' proposal lends itself to value sharing across multiple documents (and I don't just mean revisions of the same doc, I mean the same key/value instance could be shared for every document). However, that's not really relevant if we're not considering the amount of shared information across documents in the storage scheme.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Simply storing documents in <100k segments (perhaps in some kind of compressed binary representation) to deal with that FDB limit seems fine. The only reason to consider doing something else is because of its impact on indexing, searches, reduce functions, revisions, on-disk size, etc.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I'm assuming the process will flatten the key paths of the document into an array and then request the value of each key as multiple parallel queries against FDB at once
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Ah, I think this is not one of Ilya's assumptions. He's trying to design a model which allows the retrieval of a document with a single range read, which is a good goal in my opinion.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I am not sure I agree.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Think of BitTorrent: a single range read should pull back the structure of the document (the pieces to fetch), but not necessarily the whole document.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> What if you already have a bunch of pieces in common with other documents locally (a repeated header/footer or type, for example) and you only need to get a few pieces of data you don't already have?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The real goal for Couch, as I see it, is to treat your document set like the collection of structured information that it is: in some respects, like an extension of your application's heap space for structured objects, and to efficiently query that collection to get back subsets of the data.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Otherwise it seems more like a slightly upgraded file system plus a fancy grep/find-like feature...
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The best way I see to unlock more features/power is to move towards a more granular and efficient way to store and retrieve the scalar values...
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> For example, here's a crazy thought:
>>>>>>>>>>>>>>>>> Map every distinct occurrence of a key/value instance through a crypto hash function to get a set of hashes.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> These can be precomputed by Couch without any lookups in FDB. These will be spread all over kingdom come in FDB and not lend themselves to range search well.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> So what you do is index them for frequency of occurring in the same set. In essence, you 'bucket them' statistically, and that bucket id becomes a key prefix. A crypto hash value can be copied into more than one bucket. The {bucket_id}/{cryptohash} becomes a {val_id}.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> When writing a document, Couch submits the list/array of cryptohash values it computed to FDB and gets back the corresponding {val_id} (the id with the bucket prefixed). This can get somewhat expensive if there's always a lot of app local cache misses.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> A document's value is then a series of {val_id} arrays up to 100k per segment.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> When retrieving a document, you get the val_ids, find the distinct buckets and min/max entries for this doc, and then parallel query each bucket while reconstructing the document.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The values returned from the buckets query are the key/value strings required to reassemble this document.
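A rough sketch of this bucketed-hash idea, for illustration only: the bucket assignment below is a trivial placeholder (the idea as described would bucket by co-occurrence frequency), and every name and constant here is made up rather than part of the proposal.

    import hashlib
    import json

    NUM_BUCKETS = 64   # arbitrary for the sketch

    def val_hash(path, value):
        # One crypto hash per distinct key/value instance.
        return hashlib.sha256(json.dumps([list(path), value]).encode()).digest()

    def val_id(h):
        # {bucket_id}/{cryptohash}; a real scheme would pick buckets statistically,
        # and could place the same hash in more than one bucket.
        return bytes([h[0] % NUM_BUCKETS]) + h

    def doc_to_val_ids(flattened):
        # flattened: iterable of (path, scalar) pairs, e.g. from an exploded mapping.
        return [val_id(val_hash(p, v)) for p, v in flattened]

    # Writing: store each val_id -> (path, value) entry once, deduplicated across
    # documents, and store the document itself as segments of val_id arrays
    # (each under 100 KB). Reading: fetch the doc's val_id list, group the ids by
    # bucket prefix, then query each bucket (in parallel) and reassemble the doc.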
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> ----------
>>>>>>>>>>>>>>>>> I put this forward primarily to highlight the idea that trying to match the storage representation of documents in a straightforward way to FDB keys to reduce query count might not be the most performance-oriented approach.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I'd much prefer a storage approach that reduced data duplication and enabled fast sub-document queries.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> This clearly falls in the realm of what people want the 'use case' of Couch to be/become. By giving Couch more access to sub-document queries, I could eventually see queries as complicated as GraphQL submitted to Couch and pulling back ad-hoc aggregated data across multiple documents in a single application layer request.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hehe - one way to look at the database of Couch documents is that they are all conflict revisions of the single root empty document. What I mean by this is to consider thinking of the entire document store as one giant DAG of key/value pairs, where even separate documents are still typically related to each other. For most applications there is a tremendous amount of data redundancy between docs and especially between revisions of those docs...
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> And all this is a long way of saying "I think there could be a lot of value in assuming documents are 'assembled' from multiple queries to FDB, with local caching, instead of simply retrieved".
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks, I hope I'm not the only outlier here thinking this way!?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Mike :-)