I like the idea that we'd reuse the same pattern (but perhaps not the same _code_) for doc bodies, revtree and attachments.
I hope we still get to delete couch_key_tree.erl, though.

--
Robert Samuel Newson
rnew...@apache.org

On Tue, 19 Feb 2019, at 17:03, Jan Lehnardt wrote:
> I like the idea from a “trying a simple thing first” perspective, but Nick’s points below are especially convincing to go with this for now.
>
> Best
> Jan
> —
>
> > On 19. Feb 2019, at 17:53, Nick Vatamaniuc <vatam...@gmail.com> wrote:
> >
> > Hi,
> >
> > Sorry for jumping in so late, I was following from the sidelines mostly. A lot of good discussion is happening and I am excited about the possibilities here.
> >
> > I do like the simpler "chunking" approach for a few reasons:
> >
> > * Most document bodies are probably going to be smaller than 100k, so in the majority of cases it would be one write / one read to update and fetch the document body.
> >
> > * We could reuse the chunking code for attachment handling and possibly revision key trees. So it's the general pattern of uploading chunks to some prefix, and when finished flipping an atomic toggle to make it current.
> >
> > * Do the same thing with revision trees and we could re-use the revision tree manipulation logic. That is, the key tree in most cases would be small enough to fit in 100k, but if it got huge it would get chunked. This would allow us to reuse all the battle-tested couch_key_tree code mostly as is. We even have property tests for it:
> > https://github.com/apache/couchdb/blob/master/src/couch/test/couch_key_tree_prop_tests.erl
> >
> > * It removes the need to explain the max exploded path length limitation to customers.
> >
> > Cheers,
> > -Nick
> >
> > On Tue, Feb 19, 2019 at 11:18 AM Robert Newson <rnew...@apache.org> wrote:
> >
> >> Hi,
> >>
> >> An alternative storage model that we should seriously consider is to follow our current approach in couch_file et al. Specifically, that the document _body_ is stored as an uninterpreted binary value. This would be much like the obvious plan for attachment storage: a key prefix that identifies the database and document, with the final item of that key tuple being an incrementing integer. Each of those keys has a binary value of up to 100k. Fetching all values with that key prefix, in fdb's natural ordering, will yield the full document body, which can be JSON decoded for further processing.
> >>
> >> I like this idea, and I like Adam's original proposal to explode documents into property paths. I have a slight preference for the simplicity of the idea in the previous paragraph, not least because it's close to what we do today. I also think it will be possible to migrate to alternative storage models in future, and FoundationDB's transaction support means we can do this migration seamlessly should we come to it.
> >>
> >> I'm very interested in knowing if anyone else is interested in going this simple, or considers it a wasted opportunity relative to the 'exploded' path.
> >>
> >> B.
> >>
> >> --
> >> Robert Samuel Newson
> >> rnew...@apache.org
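To make the chunked-binary layout above concrete, here is a minimal sketch (illustrative only, not project code) of the split/reassemble step: the encoded body is cut into values of at most 100,000 bytes, each destined for a key ending in an incrementing chunk index, and a prefix range read returns the chunks in order for concatenation. The module name, the chunk-size constant, and the exact key layout are assumptions.

    %% Sketch: split an encoded document body into <=100k binary chunks and
    %% join them back together. In FDB each chunk would be written under a
    %% tuple-encoded key such as {DocSubspace, DocId, ChunkIx}, so a single
    %% prefix range read yields the chunks in index order for reassembly.
    -module(doc_chunks).
    -export([split/1, join/1]).

    -define(CHUNK_SIZE, 100000). %% stay at or under FDB's 100KB value limit

    %% Body is the JSON-encoded document body as a binary.
    %% Returns [{ChunkIx, ChunkBin}] ready to be written as separate KVs.
    split(Body) when is_binary(Body) ->
        split(Body, 0, []).

    split(Body, Ix, Acc) when byte_size(Body) =< ?CHUNK_SIZE ->
        lists:reverse([{Ix, Body} | Acc]);
    split(Body, Ix, Acc) ->
        <<Chunk:?CHUNK_SIZE/binary, Rest/binary>> = Body,
        split(Rest, Ix + 1, [{Ix, Chunk} | Acc]).

    %% Chunks as returned by a range read over the document's chunk prefix,
    %% already sorted by chunk index.
    join(Chunks) ->
        iolist_to_binary([Bin || {_Ix, Bin} <- Chunks]).

For typical document bodies the chunk writes would fit in a single transaction; for larger blobs such as attachments, Nick's "upload to a prefix, then flip an atomic toggle" pattern covers the multi-transaction case.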
> >>
> >> On Mon, 4 Feb 2019, at 19:59, Robert Newson wrote:
> >>> I've been remiss here in not posting the data model ideas that IBM worked up while we were thinking about using FoundationDB, so I'm posting them now. This is Adam Kocoloski's original work, I am just transcribing it, and this is the context that the folks from the IBM side came in with, for full disclosure.
> >>>
> >>> Basics
> >>>
> >>> 1. All CouchDB databases are inside a Directory
> >>> 2. Each CouchDB database is a Directory within that Directory
> >>> 3. It's possible to list all subdirectories of a Directory, so `_all_dbs` is the list of directories from 1.
> >>> 4. Each Directory representing a CouchDB database has several Subspaces:
> >>> 4a. by_id/ doc subspace: actual document contents
> >>> 4b. by_seq/versionstamp subspace: for the _changes feed
> >>> 4c. index_definitions, indexes, ...
> >>>
> >>> JSON Mapping
> >>>
> >>> A hierarchical JSON object naturally maps to multiple KV pairs in FDB:
> >>>
> >>> {
> >>>   “_id”: “foo”,
> >>>   “owner”: “bob”,
> >>>   “mylist”: [1,3,5],
> >>>   “mymap”: {
> >>>     “blue”: “#0000FF”,
> >>>     “red”: “#FF0000”
> >>>   }
> >>> }
> >>>
> >>> maps to
> >>>
> >>> (“foo”, “owner”) = “bob”
> >>> (“foo”, “mylist”, 0) = 1
> >>> (“foo”, “mylist”, 1) = 3
> >>> (“foo”, “mylist”, 2) = 5
> >>> (“foo”, “mymap”, “blue”) = “#0000FF”
> >>> (“foo”, “mymap”, “red”) = “#FF0000”
> >>>
> >>> NB: this means that the 100KB limit applies to individual leaves in the JSON object, not the entire doc.
> >>>
> >>> Edit Conflicts
> >>>
> >>> We need to account for the presence of conflicts at various levels of the doc due to replication.
> >>>
> >>> The proposal is to create a special value indicating that the subtree below our current cursor position is in an unresolvable conflict, then add additional KV pairs below it to describe the conflicting entries.
> >>>
> >>> The KV data model allows us to store these efficiently and minimize duplication of data. A document with these two conflicts:
> >>>
> >>> {
> >>>   “_id”: “foo”,
> >>>   “_rev”: “1-abc”,
> >>>   “owner”: “alice”,
> >>>   “active”: true
> >>> }
> >>> {
> >>>   “_id”: “foo”,
> >>>   “_rev”: “1-def”,
> >>>   “owner”: “bob”,
> >>>   “active”: true
> >>> }
> >>>
> >>> could be stored thus:
> >>>
> >>> (“foo”, “active”) = true
> >>> (“foo”, “owner”) = kCONFLICT
> >>> (“foo”, “owner”, “1-abc”) = “alice”
> >>> (“foo”, “owner”, “1-def”) = “bob”
> >>>
> >>> So long as `kCONFLICT` is set at the top of the conflicting subtree, this representation can handle conflicts of different data types as well.
> >>>
> >>> Missing fields need to be handled explicitly:
> >>>
> >>> {
> >>>   “_id”: “foo”,
> >>>   “_rev”: “1-abc”,
> >>>   “owner”: “alice”,
> >>>   “active”: true
> >>> }
> >>>
> >>> {
> >>>   “_id”: “foo”,
> >>>   “_rev”: “1-def”,
> >>>   “owner”: {
> >>>     “name”: “bob”,
> >>>     “email”: “b...@example.com”
> >>>   }
> >>> }
> >>>
> >>> could be stored thus:
> >>>
> >>> (“foo”, “active”) = kCONFLICT
> >>> (“foo”, “active”, “1-abc”) = true
> >>> (“foo”, “active”, “1-def”) = kMISSING
> >>> (“foo”, “owner”) = kCONFLICT
> >>> (“foo”, “owner”, “1-abc”) = “alice”
> >>> (“foo”, “owner”, “1-def”, “name”) = “bob”
> >>> (“foo”, “owner”, “1-def”, “email”) = ...
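For comparison, a minimal sketch of the "exploded" mapping just described, flattening a decoded JSON body into {Path, ScalarValue} pairs where each Path would become part of a tuple key in the by_id subspace. It assumes jiffy-style terms (objects as {Proplist}, strings as binaries); the module and function names are illustrative, not CouchDB code.

    %% Sketch only: flatten a decoded JSON document into {KeyPath, Value} pairs.
    %% Assumes jiffy-style terms: objects are {[{Key, Val}, ...]}, arrays are
    %% lists, strings are binaries, and true/false/null are atoms (scalars here).
    -module(doc_paths).
    -export([explode/2]).

    explode(DocId, Body) ->
        lists:reverse(explode(Body, [DocId], [])).

    %% JSON object: recurse into each member, extending the path with the key.
    explode({Props}, Path, Acc) when is_list(Props) ->
        lists:foldl(fun({K, V}, A) -> explode(V, [K | Path], A) end, Acc, Props);
    %% JSON array: recurse into each element, extending the path with its index.
    explode(List, Path, Acc) when is_list(List) ->
        {_, Acc1} = lists:foldl(
                      fun(V, {Ix, A}) -> {Ix + 1, explode(V, [Ix | Path], A)} end,
                      {0, Acc}, List),
        Acc1;
    %% Scalar leaf: emit one KV pair for the accumulated path.
    explode(Scalar, Path, Acc) ->
        [{lists:reverse(Path), Scalar} | Acc].

Calling doc_paths:explode(<<"foo">>, Body) on the members other than “_id” in the example above yields [{[<<"foo">>, <<"owner">>], <<"bob">>}, {[<<"foo">>, <<"mylist">>, 0], 1}, ...], matching the (“foo”, “owner”) = “bob” rows sketched above.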
> >>>
> >>> Revision Metadata
> >>>
> >>> * CouchDB uses a hash history for revisions
> >>> ** Each edit is identified by the hash of the content of the edit, including the base revision against which it was applied
> >>> ** Individual edit branches are bounded in length but the number of branches is potentially unbounded
> >>>
> >>> * Size limits preclude us from storing the entire key tree as a single value; in pathological situations the tree could exceed 100KB (each entry is > 16 bytes)
> >>>
> >>> * Store each edit branch as a separate KV, including deleted status, in a special subspace
> >>>
> >>> * Structure the key representation so that the “winning” revision can be automatically retrieved in a limit=1 key range operation
> >>>
> >>> (“foo”, “_meta”, “deleted=false”, 1, “def”) = []
> >>> (“foo”, “_meta”, “deleted=false”, 4, “bif”) = [“3-baz”, “2-bar”, “1-foo”]  <-- winner
> >>> (“foo”, “_meta”, “deleted=true”, 3, “abc”) = [“2-bar”, “1-foo”]
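One way to read the "limit=1 key range operation" bullet above, sketched under the assumption of an order-preserving key encoding (for example the FDB tuple layer) and with the deleted flag encoded as a boolean rather than the literal “deleted=…” strings shown: if each edit branch is keyed as (NotDeleted, RevPos, RevHash), the largest key in the subspace is the winner, so a reverse range read with limit=1 returns it. The module below only mirrors that comparison in plain Erlang; it is not the project's implementation.

    %% Sketch: choose the winning revision among edit-branch leaf entries the
    %% same way a reverse, limit=1 range read over (NotDeleted, RevPos, RevHash)
    %% keys would: deleted leaves sort first (false < true), then lower
    %% positions, so the maximum element is the winner.
    -module(rev_winner).
    -export([winner/1]).

    %% Branches is a non-empty list of {NotDeleted, RevPos, RevHash} tuples.
    winner([_ | _] = Branches) ->
        lists:max(Branches).

For the example above, winner([{true, 1, <<"def">>}, {true, 4, <<"bif">>}, {false, 3, <<"abc">>}]) returns {true, 4, <<"bif">>}, the row marked as the winner.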
> >>>
> >>> Changes Feed
> >>>
> >>> * FDB supports a concept called a versionstamp — a 10 byte, unique, monotonically (but not sequentially) increasing value for each committed transaction. The first 8 bytes are the committed version of the database. The last 2 bytes are monotonic in the serialization order for transactions.
> >>>
> >>> * A transaction can specify a particular index into a key where the following 10 bytes will be overwritten by the versionstamp at commit time
> >>>
> >>> * A subspace keyed on versionstamp naturally yields a _changes feed
> >>>
> >>> by_seq subspace
> >>> (“versionstamp1”) = (“foo”, “1-abc”)
> >>> (“versionstamp4”) = (“bar”, “4-def”)
> >>>
> >>> by_id subspace
> >>> (“bar”, “_vsn”) = “versionstamp4”
> >>> ...
> >>> (“foo”, “_vsn”) = “versionstamp1”
> >>>
> >>> JSON Indexes
> >>>
> >>> * “Mango” JSON indexes are defined by
> >>> ** a list of field names, each of which may be nested,
> >>> ** an optional partial_filter_selector which constrains the set of docs that contribute,
> >>> ** an optional name defined by the ddoc field (the name is auto-generated if not supplied)
> >>>
> >>> * Store index definitions in a single subspace to aid query planning
> >>> ** ((person,name), title, email) = (“name-title-email”, “{“student”: true}”)
> >>> ** Store the values for each index in a dedicated subspace, adding the document ID as the last element in the tuple
> >>> *** (“rosie revere”, “engineer”, “ro...@example.com”, “foo”) = null
> >>>
> >>> B.
> >>>
> >>> --
> >>> Robert Samuel Newson
> >>> rnew...@apache.org
> >>>
> >>> On Mon, 4 Feb 2019, at 19:13, Ilya Khlopotov wrote:
> >>>>
> >>>> I want to fix previous mistakes. I made two mistakes in the previous calculations:
> >>>> - I used 1Kb as the base size for calculating the expansion factor (although we don't know the exact size of the original document)
> >>>> - The expansion factor calculation included the number of revisions (it shouldn't)
> >>>>
> >>>> I'll focus on the flattened JSON docs model.
> >>>>
> >>>> The following formula was used in the previous calculation:
> >>>> storage_size_per_document = mapping_table_size*number_of_revisions + depth*number_of_paths*number_of_revisions + number_of_paths*value_size*number_of_revisions
> >>>>
> >>>> To clarify things a little bit, I want to calculate the space requirement for a single revision this time:
> >>>> mapping_table_size = number_of_field_names * (field_name_length + 4 (integer size)) = 100 * (20 + 4) = 2400 bytes
> >>>> storage_size_per_document_per_revision_per_replica = mapping_table_size + depth*number_of_paths + value_size*number_of_paths = 2400 bytes + 10*1000 + 1000*100 = 112400 bytes ~= 110 Kb
> >>>>
> >>>> We definitely can reduce the requirement for the mapping table by adopting rnewson's idea of a schema.
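Since the formula above is easy to misread when flattened into one line, here is the same per-revision arithmetic spelled out under the thread's assumptions (100 field names of 20 bytes each, 1K paths of depth 10, 100-byte values); the module and names are illustrative only.

    %% Sketch: re-derive the per-revision, per-replica estimate for the
    %% flattened model from the assumptions used in this thread.
    -module(size_estimate).
    -export([per_revision_bytes/0]).

    per_revision_bytes() ->
        NumFieldNames   = 100,  %% unique JSON field names per document
        FieldNameLength = 20,   %% bytes per field name
        IntSize         = 4,    %% integer id stored alongside each name
        Depth           = 10,   %% levels per JSON path
        NumPaths        = 1000, %% unique JSON paths per document
        ValueSize       = 100,  %% bytes per scalar value
        MappingTable = NumFieldNames * (FieldNameLength + IntSize), %% 2400
        MappingTable + Depth * NumPaths + ValueSize * NumPaths.     %% 112400

per_revision_bytes() returns 112400, i.e. roughly 110 KB per revision per replica, the figure quoted above.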
> >>>>
> >>>> On 2019/02/04 11:08:16, Ilya Khlopotov <iil...@apache.org> wrote:
> >>>>> Hi Michael,
> >>>>>
> >>>>>> For example, here's a crazy thought:
> >>>>>> Map every distinct occurrence of a key/value instance through a crypto hash function to get a set of hashes.
> >>>>>>
> >>>>>> These can be precomputed by Couch without any lookups in FDB. These will be spread all over kingdom come in FDB and not lend themselves to range search well.
> >>>>>>
> >>>>>> So what you do is index them for frequency of occurring in the same set. In essence, you 'bucket them' statistically, and that bucket id becomes a key prefix. A crypto hash value can be copied into more than one bucket. The {bucket_id}/{cryptohash} becomes a {val_id}.
> >>>>>>
> >>>>>> When writing a document, Couch submits the list/array of cryptohash values it computed to FDB and gets back the corresponding {val_id} (the id with the bucket prefixed). This can get somewhat expensive if there's always a lot of app-local cache misses.
> >>>>>>
> >>>>>> A document's value is then a series of {val_id} arrays up to 100k per segment.
> >>>>>>
> >>>>>> When retrieving a document, you get the val_ids, find the distinct buckets and min/max entries for this doc, and then parallel query each bucket while reconstructing the document.
> >>>>>
> >>>>> Interesting idea. Let's try to think it through to see if we can make it viable.
> >>>>> Let's go through a hypothetical example. Input data for the example:
> >>>>> - 1M documents
> >>>>> - each document is around 10Kb
> >>>>> - each document consists of 1K unique JSON paths
> >>>>> - each document has 100 unique JSON field names
> >>>>> - every scalar value is 100 bytes
> >>>>> - 10% of the unique JSON paths for every document are already stored in the database under a different doc or a different revision of the current one
> >>>>> - we assume 3 independent copies of every key-value pair in FDB
> >>>>> - our hash key size is 32 bytes
> >>>>> - let's assume we can determine whether a key is already in storage without doing a query
> >>>>> - 1% of paths are in cache (an unrealistic value; in real life the percentage is lower)
> >>>>> - every JSON field name is 20 bytes
> >>>>> - every JSON path is 10 levels deep
> >>>>> - the document key prefix length is 50
> >>>>> - every document has 10 revisions
> >>>>> Let's estimate the storage requirements and the size of data we need to transmit. The calculations are not exact.
> >>>>> 1. storage_size_per_document (we cannot estimate exact numbers since we don't know how FDB stores it)
> >>>>> - 10 * ((10Kb - (10Kb * 10%)) + (1K - (1K * 10%)) * 32 bytes) = 38Kb * 10 * 3 = 1140 Kb (11x)
> >>>>> 2. number of independent keys to retrieve on document read (non-range queries) per document
> >>>>> - 1K - (1K * 1%) = 990
> >>>>> 3. number of range queries: 0
> >>>>> 4. data to transmit on read: (1K - (1K * 1%)) * (100 bytes + 32 bytes) = 102 Kb (10x)
> >>>>> 5. read latency (we use 2ms per read based on numbers from https://apple.github.io/foundationdb/performance.html)
> >>>>> - sequential: 990*2ms = 1980ms
> >>>>> - range: 0
> >>>>> Let's compare these numbers with the initial proposal (flattened JSON docs without a global schema and without a cache):
> >>>>> 1. storage_size_per_document
> >>>>> - mapping table size: 100 * (20 + 4(integer size)) = 2400 bytes
> >>>>> - key size: (10 * (4 + 1(delimiter))) + 50 = 100 bytes
> >>>>> - storage_size_per_document: 2.4K*10 + 100*1K*10 + 1K*100*10 = 2024K = 1976 Kb * 3 = 5930 Kb (59.3x)
> >>>>> 2. number of independent keys to retrieve: 0-2 (depending on index structure)
> >>>>> 3. number of range queries: 1 (1001 keys in the result)
> >>>>> 4. data to transmit on read: 24K + 1000*100 + 1000*100 = 23.6 Kb (2.4x)
> >>>>> 5. read latency (we use 2ms per read based on numbers from https://apple.github.io/foundationdb/performance.html and estimate range read performance based on numbers from https://apple.github.io/foundationdb/benchmarking.html#single-core-read-test)
> >>>>> - range read performance: Given that read performance is about 305,000 reads/second and range performance is 3,600,000 keys/second, we estimate range performance to be 11.8x read performance. If read performance is 2ms, then range performance is 0.169ms (which is hard to believe).
> >>>>> - sequential: 2 * 2 = 4ms
> >>>>> - range: 0.169
> >>>>>
> >>>>> It looks like we are dealing with a tradeoff:
> >>>>> - Map every distinct occurrence of a key/value instance through a crypto hash:
> >>>>>   - 5.39x more disk space efficient
> >>>>>   - 474x slower
> >>>>> - flattened JSON model:
> >>>>>   - 5.39x less efficient in disk space
> >>>>>   - 474x faster
> >>>>>
> >>>>> In any case this unscientific exercise was very helpful, since it uncovered the high cost in terms of disk space: 59.3x the original disk size is too much IMO.
> >>>>>
> >>>>> Are there any ways we can make Michael's model more performant?
> >>>>>
> >>>>> Also, I don't quite understand a few aspects of the global hash table proposal:
> >>>>>
> >>>>> 1. > - Map every distinct occurrence of a key/value instance through a crypto hash function to get a set of hashes.
> >>>>> I think we are talking only about scalar values here, i.e. `"#/foo.bar.baz": 123`, since I don't know how we could make it work for all possible JSON paths of `{"foo": {"bar": {"size": 12, "baz": 123}}}`:
> >>>>> - foo
> >>>>> - foo.bar
> >>>>> - foo.bar.baz
> >>>>>
> >>>>> 2. how to delete documents
> >>>>>
> >>>>> Best regards,
> >>>>> ILYA
> >>>>>
> >>>>> On 2019/01/30 23:33:22, Michael Fair <mich...@daclubhouse.net> wrote:
> >>>>>> On Wed, Jan 30, 2019, 12:57 PM Adam Kocoloski <kocol...@apache.org> wrote:
> >>>>>>
> >>>>>>> Hi Michael,
> >>>>>>>
> >>>>>>>> The trivial fix is to use DOCID/REVISIONID as DOC_KEY.
> >>>>>>>
> >>>>>>> Yes, that’s definitely one way to address storage of edit conflicts. I think there are other, more compact representations that we can explore if we have this “exploded” data model where each scalar value maps to an individual KV pair.
> >>>>>>
> >>>>>> I agree. As I mentioned on the original thread, I see a scheme that handles both conflicts and revisions where you only have to store the most recent change to a field. Like you suggested, multiple revisions can share a key. Which, in my mind's eye, further invites the conflicts/revisions discussion along with the working-within-the-limits discussion, because it seems to me they are all intrinsically related as a "feature".
> >>>>>>
> >>>>>> Saying 'We'll break documents up into roughly 80k segments', then trying to overlay some kind of field sharing scheme for revisions/conflicts, doesn't seem like it will work.
> >>>>>>
> >>>>>> I probably should have left out the trivial fix proposal as I don't think it's a feasible solution to actually use.
> >>>>>>
> >>>>>> The comment is more regarding that I do not see how this thread can escape including how to store/retrieve conflicts/revisions.
> >>>>>>
> >>>>>> For instance, the 'doc as individual fields' proposal lends itself to value sharing across multiple documents (and I don't just mean revisions of the same doc, I mean the same key/value instance could be shared for every document). However, that's not really relevant if we're not considering the amount of shared information across documents in the storage scheme.
> >>>>>>
> >>>>>> Simply storing documents in <100k segments (perhaps in some kind of compressed binary representation) to deal with that FDB limit seems fine. The only reason to consider doing something else is because of its impact on indexing, searches, reduce functions, revisions, on-disk size, etc.
> >>>>>>
> >>>>>>>> I'm assuming the process will flatten the key paths of the document into an array and then request the value of each key as multiple parallel queries against FDB at once
> >>>>>>>
> >>>>>>> Ah, I think this is not one of Ilya’s assumptions. He’s trying to design a model which allows the retrieval of a document with a single range read, which is a good goal in my opinion.
> >>>>>>
> >>>>>> I am not sure I agree.
> >>>>>>
> >>>>>> Think of BitTorrent: a single range read should pull back the structure of the document (the pieces to fetch), but not necessarily the whole document.
> >>>>>>
> >>>>>> What if you already have a bunch of pieces in common with other documents locally (a repeated header/footer or type, for example) and you only need to get a few pieces of data you don't already have?
> >>>>>>
> >>>>>> The real goal for Couch, as I see it, is to treat your document set like the collection of structured information that it is, in some respects like an extension of your application's heap space for structured objects, and to efficiently query that collection to get back subsets of the data.
> >>>>>>
> >>>>>> Otherwise it seems more like a slightly upgraded file system plus a fancy grep/find-like feature...
> >>>>>>
> >>>>>> The best way I see to unlock more features/power is to move towards a more granular and efficient way to store and retrieve the scalar values...
> >>>>>>
> >>>>>> For example, here's a crazy thought:
> >>>>>> Map every distinct occurrence of a key/value instance through a crypto hash function to get a set of hashes.
> >>>>>>
> >>>>>> These can be precomputed by Couch without any lookups in FDB. These will be spread all over kingdom come in FDB and not lend themselves to range search well.
> >>>>>>
> >>>>>> So what you do is index them for frequency of occurring in the same set. In essence, you 'bucket them' statistically, and that bucket id becomes a key prefix. A crypto hash value can be copied into more than one bucket. The {bucket_id}/{cryptohash} becomes a {val_id}.
> >>>>>>
> >>>>>> When writing a document, Couch submits the list/array of cryptohash values it computed to FDB and gets back the corresponding {val_id} (the id with the bucket prefixed). This can get somewhat expensive if there's always a lot of app-local cache misses.
> >>>>>>
> >>>>>> A document's value is then a series of {val_id} arrays, up to 100k per segment.
> >>>>>>
> >>>>>> When retrieving a document, you get the val_ids, find the distinct buckets and min/max entries for this doc, and then parallel-query each bucket while reconstructing the document.
> >>>>>>
> >>>>>> The values returned from the buckets query are the key/value strings required to reassemble this document.
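To make the first step of this proposal concrete, here is a minimal sketch of the content-addressing part only: hashing each exploded key/value leaf into a 32-byte id. The statistical bucketing by co-occurrence is the open question in the idea, so it is stubbed out with a placeholder (the first byte of the hash standing in for a bucket id); everything here is an illustrative assumption, not a worked-out design.

    %% Sketch: the hashing step of the content-addressed "val_id" idea.
    %% Bucket assignment is deliberately a stub; a real scheme would pick
    %% buckets based on which values tend to occur together.
    -module(val_ids).
    -export([val_id/2]).

    val_id(Path, Value) ->
        %% term_to_binary is a stand-in serialization for the sketch; a real
        %% design would need a stable, canonical encoding across nodes.
        Hash = crypto:hash(sha256, term_to_binary({Path, Value})),
        <<Bucket:8, _Rest/binary>> = Hash,
        {Bucket, Hash}.

A document would then be stored as ordered segments of these val_ids plus one row per distinct value, which is where both the deduplication and the extra round trips in Ilya's estimate come from.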
> >>>>>>
> >>>>>> ----------
> >>>>>> I put this forward primarily to highlight the idea that trying to match the storage representation of documents in a straightforward way to FDB keys, to reduce query count, might not be the most performance-oriented approach.
> >>>>>>
> >>>>>> I'd much prefer a storage approach that reduced data duplication and enabled fast sub-document queries.
> >>>>>>
> >>>>>> This clearly falls in the realm of what people want the 'use case' of Couch to be/become. By giving Couch more access to sub-document queries, I could eventually see queries as complicated as GraphQL submitted to Couch and pulling back ad-hoc aggregated data across multiple documents in a single application-layer request.
> >>>>>>
> >>>>>> Hehe - one way to look at the database of Couch documents is that they are all conflict revisions of the single root empty document. What I mean by this is: consider thinking of the entire document store as one giant DAG of key/value pairs, where even separate documents are still typically related to each other. For most applications there is a tremendous amount of data redundancy between docs, and especially between revisions of those docs...
> >>>>>>
> >>>>>> And all this is a long way of saying "I think there could be a lot of value in assuming documents are 'assembled' from multiple queries to FDB, with local caching, instead of simply retrieved".
> >>>>>>
> >>>>>> Thanks, I hope I'm not the only outlier here thinking this way!?
> >>>>>>
> >>>>>> Mike :-)