Addendum: By “directory aliasing” I meant within a document (either the actual
Directory thing or something equivalent of our own making). The directory
aliasing for each database is a good way to reduce key size without significant
cost, though if Redwood lands in time even this would become a pointless
obfuscation.

> On 19 Feb 2019, at 21:39, Robert Samuel Newson <rnew...@apache.org> wrote:
> 
> Interesting suggestion; obviously the details might be the wrong kind of fun.
> 
> Somewhere above I suggested this would be something we could change over time 
> and even use different approaches for different documents within the same 
> database. This is the long way of saying there are multiple ways to do this 
> each with advantages and none without disadvantages.
> 
> I don’t think adding a layer of abstraction is the right move just yet; I 
> think we should continue to find consensus on one answer to this question 
> (and the related ones in other threads) for the first release. It’s easy to 
> say “we can change it later”, of course. We can, though it would be a chunk 
> of work in the context of something that already works, and I’ve rarely seen 
> anyone sign up for that.
> 
> I’m fine with the first proposal from Adam, where the keys are tuples of key 
> parts pointing at terminal values. To make it easier for the first version, I 
> would exclude optimisations like deduplication or the Directory aliasing or 
> the schema thing that I suggested and that Ilya incorporated a variant of in 
> a follow-up post. We’d accept that there are limits on the sizes of 
> documents, including the awkward-to-express one about property depth.
> 
> Stepping back, I’m not seeing any essential improvement over Adam’s original 
> proposal besides the few corrections and clarifications made by various 
> authors. Could we start an RFC based on Adam’s original proposal on document 
> body, revision tree and index storage? We could then have PRs against that 
> for each additional optimisation (one person’s optimisation is another 
> person’s needless complication).
> 
> If I’ve missed some genuine advance on the original proposal in this long 
> thread, please call it out for me.
> 
> B.
> 
>> On 19 Feb 2019, at 21:15, Benjamin Anderson <banjie...@apache.org> wrote:
>> 
>> As is evident from the length of this thread, there's a pretty big
>> design space to cover here, and it seems unlikely we'll have arrived
>> at a "correct" solution even by the time this thing ships. Perhaps it
>> would be worthwhile to treat the in-FDB representation of data as a
>> first-class abstraction and support multiple representations
>> simultaneously?
>> 
>> Obviously there's no such thing as a zero-cost abstraction - and I've
>> not thought very hard about how far up the stack the document
>> representation would need to leak - but supporting different layouts
>> (primarily, as Adam points out, on the document body itself) might
>> prove interesting and useful. I'm sure there are folks interested in a
>> column-shaped CouchDB, for example.
>> 
>> --
>> b
>> 
>> On Tue, Feb 19, 2019 at 11:39 AM Robert Newson <rnew...@apache.org> wrote:
>>> 
>>> Good points on revtree; I agree with you that we should store it intelligently 
>>> to gain the benefits you mentioned.
>>> 
>>> --
>>> Robert Samuel Newson
>>> rnew...@apache.org
>>> 
>>> On Tue, 19 Feb 2019, at 18:41, Adam Kocoloski wrote:
>>>> I do not think we should store the revtree as a blob. The design where
>>>> each edit branch is its own KV should save on network IO and CPU cycles
>>>> for normal updates. We’ve performed too many heroics to keep
>>>> couch_key_tree from stalling entire databases when trying to update a
>>>> single document with a wide revision tree; I would much prefer to ignore
>>>> other edit branches entirely when all we’re doing is extending one of
>>>> them.
>>>> 
>>>> I also do not think we should store JSON documents as blobs, but it’s a
>>>> closer call. Some of my reasoning for preferring the exploded path
>>>> design:
>>>> 
>>>> - it lends itself nicely to sub-document operations, for which Jan
>>>> crafted an RFC last year: https://github.com/apache/couchdb/issues/1559
>>>> - it optimizes the creation of Mango indexes on existing databases since
>>>> we only need to retrieve the value(s) we want to index
>>>> - it optimizes Mango queries that use field selectors
>>>> - anyone who wanted to try their hand at GraphQL will find it very
>>>> handy: https://github.com/apache/couchdb/issues/1499
>>>> - looking further ahead, it lets us play with smarter leaf value types
>>>> like Counters (yes I’m still on the CRDT bandwagon, sorry)
>>>> 
>>>> A few comments on the thread:
>>>> 
>>>>>>> * Most document bodies are probably going to be smaller than 100k. So
>>>>>>> in the majority of cases it would be one write / one read to update
>>>>>>> and fetch the document body.
>>>> 
>>>> We should test, but I expect reading 50KB of data in a range query is
>>>> almost as efficient as reading a single 50 KB value. Similarly, writes
>>>> to a contiguous set of keys should be quite efficient.
>>>> 
>>>> I am concerned about the overhead of the repeated field paths in the
>>>> keys with the exploded path option in the absence of key prefix
>>>> compression. That would be my main reason to acquiesce and throw away
>>>> all the document structure.
>>>> 
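>>>> As a rough, back-of-the-envelope illustration of that overhead (byte counts
>>>> are approximate, real keys would be tuple-encoded, and FDB's own per-key
>>>> overhead is ignored), here is a quick Python sketch using the example doc
>>>> from the original proposal:
>>>> 
>>>> keys = [
>>>>     ("foo", "owner"),
>>>>     ("foo", "mylist", 0),
>>>>     ("foo", "mylist", 1),
>>>>     ("foo", "mylist", 2),
>>>>     ("foo", "mymap", "blue"),
>>>>     ("foo", "mymap", "red"),
>>>> ]
>>>> # every key repeats the doc id plus the field path above it
>>>> key_bytes = sum(len(str(part)) for key in keys for part in key)
>>>> value_bytes = len("bob") + 3 + len("#0000FF") + len("#FF0000")  # 1, 3, 5 are one byte each
>>>> print(key_bytes, value_bytes)  # 61 vs 20: the repeated prefixes outweigh the values
>>>> 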
>>>> Adam
>>>> 
>>>>> On Feb 19, 2019, at 12:04 PM, Robert Newson <rnew...@apache.org> wrote:
>>>>> 
>>>>> I like the idea that we'd reuse the same pattern (but perhaps not the 
>>>>> same _code_) for doc bodies, revtree and attachments.
>>>>> 
>>>>> I hope we still get to delete couch_key_tree.erl, though.
>>>>> 
>>>>> --
>>>>> Robert Samuel Newson
>>>>> rnew...@apache.org
>>>>> 
>>>>> On Tue, 19 Feb 2019, at 17:03, Jan Lehnardt wrote:
>>>>>> I like the idea from a “trying a simple thing first” perspective, but
>>>>>> Nick’s points below are especially convincing to go with this for now.
>>>>>> 
>>>>>> Best
>>>>>> Jan
>>>>>> —
>>>>>> 
>>>>>>> On 19. Feb 2019, at 17:53, Nick Vatamaniuc <vatam...@gmail.com> wrote:
>>>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> Sorry for jumping in so late; I was mostly following from the sidelines.
>>>>>>> A lot of good discussion is happening and I am excited about the
>>>>>>> possibilities here.
>>>>>>> 
>>>>>>> I do like the simpler "chunking" approach for a few reasons:
>>>>>>> 
>>>>>>> * Most document bodies are probably going to be smaller than 100k. So
>>>>>>> in the majority of cases it would be one write / one read to update
>>>>>>> and fetch the document body.
>>>>>>> 
>>>>>>> * We could reuse the chunking code for attachment handling and possibly
>>>>>>> revision key trees. So it's the general pattern of uploading chunks to
>>>>>>> some prefix and, when finished, flipping an atomic toggle to make them
>>>>>>> current (sketched after this list).
>>>>>>> 
>>>>>>> * Do the same thing with revision trees and we could re-use the revision
>>>>>>> tree manipulation logic. That is, the key tree in most cases would be
>>>>>>> small enough to fit in 100k, but if it gets huge it would get chunked.
>>>>>>> This would allow us to reuse all the battle-tested couch_key_tree code
>>>>>>> mostly as is.
>>>>>>> We even have property tests for it
>>>>>>> https://github.com/apache/couchdb/blob/master/src/couch/test/couch_key_tree_prop_tests.erl
>>>>>>> 
>>>>>>> * It removes the need to explain the max exploded path length 
>>>>>>> limitation to
>>>>>>> customers.
>>>>>>> 
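>>>>>>> A minimal sketch of that chunk-and-flip pattern with the FDB Python
>>>>>>> bindings (the layout and names here are made up for illustration, not a
>>>>>>> worked-out design):
>>>>>>> 
>>>>>>> import uuid
>>>>>>> import fdb
>>>>>>> fdb.api_version(620)
>>>>>>> db = fdb.open()
>>>>>>> 
>>>>>>> CHUNK = 100_000  # stay under FDB's 100KB value limit
>>>>>>> 
>>>>>>> @fdb.transactional
>>>>>>> def put_chunk(tr, doc_prefix, gen, n, data):
>>>>>>>     # chunks live under a per-upload generation id until the toggle flips
>>>>>>>     tr[fdb.tuple.pack(doc_prefix + (gen, n))] = data
>>>>>>> 
>>>>>>> @fdb.transactional
>>>>>>> def make_current(tr, doc_prefix, gen):
>>>>>>>     # the atomic "toggle": point the document at the uploaded generation
>>>>>>>     tr[fdb.tuple.pack(doc_prefix + ("current",))] = fdb.tuple.pack((gen,))
>>>>>>> 
>>>>>>> def upload(db, doc_prefix, body):
>>>>>>>     gen = uuid.uuid4().hex
>>>>>>>     for n, i in enumerate(range(0, len(body), CHUNK)):
>>>>>>>         put_chunk(db, doc_prefix, gen, n, body[i:i + CHUNK])
>>>>>>>     make_current(db, doc_prefix, gen)
>>>>>>> 
>>>>>>> # upload(db, ("db1", "foo"), document_body_bytes)
>>>>>>> 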
>>>>>>> Cheers,
>>>>>>> -Nick
>>>>>>> 
>>>>>>> 
>>>>>>> On Tue, Feb 19, 2019 at 11:18 AM Robert Newson <rnew...@apache.org> 
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> An alternative storage model that we should seriously consider is to
>>>>>>>> follow our current approach in couch_file et al. Specifically, the
>>>>>>>> document _body_ is stored as an uninterpreted binary value. This would
>>>>>>>> be much like the obvious plan for attachment storage: a key prefix that
>>>>>>>> identifies the database and document, with the final item of that key
>>>>>>>> tuple being an incrementing integer. Each of those keys has a binary
>>>>>>>> value of up to 100k. Fetching all values with that key prefix, in fdb's
>>>>>>>> natural ordering, will yield the full document body, which can be JSON
>>>>>>>> decoded for further processing.
>>>>>>>> 
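>>>>>>>> For illustration only, reading such a body back might look roughly like
>>>>>>>> this with the FDB Python bindings (the prefix layout is invented for the
>>>>>>>> sketch):
>>>>>>>> 
>>>>>>>> import json
>>>>>>>> import fdb
>>>>>>>> fdb.api_version(620)
>>>>>>>> 
>>>>>>>> @fdb.transactional
>>>>>>>> def read_body(tr, dbname, docid):
>>>>>>>>     # fetch every chunk under the (dbname, docid) prefix in key order
>>>>>>>>     prefix = fdb.tuple.pack((dbname, docid))
>>>>>>>>     chunks = [kv.value for kv in tr.get_range_startswith(prefix)]
>>>>>>>>     # concatenate in fdb's natural ordering and decode the JSON body
>>>>>>>>     return json.loads(b"".join(chunks))
>>>>>>>> 
>>>>>>>> # body = read_body(fdb.open(), "db1", "foo")
>>>>>>>> 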
>>>>>>>> I like this idea, and I like Adam's original proposal to explode 
>>>>>>>> documents
>>>>>>>> into property paths. I have a slight preference for the simplicity of 
>>>>>>>> the
>>>>>>>> idea in the previous paragraph, not least because it's close to what 
>>>>>>>> we do
>>>>>>>> today. I also think it will be possible to migrate to alternative
>>>>>>>> storage models in future, and FoundationDB's transaction support means
>>>>>>>> we can do this migration seamlessly should we come to it.
>>>>>>>> 
>>>>>>>> I'm very interested in knowing if anyone else is interested in going 
>>>>>>>> this
>>>>>>>> simple, or considers it a wasted opportunity relative to the 'exploded'
>>>>>>>> path.
>>>>>>>> 
>>>>>>>> B.
>>>>>>>> 
>>>>>>>> --
>>>>>>>> Robert Samuel Newson
>>>>>>>> rnew...@apache.org
>>>>>>>> 
>>>>>>>> On Mon, 4 Feb 2019, at 19:59, Robert Newson wrote:
>>>>>>>>> I've been remiss here in not posting the data model ideas that IBM
>>>>>>>>> worked up while we were thinking about using FoundationDB, so I'm
>>>>>>>>> posting them now. This is Adam Kocoloski's original work; I am just
>>>>>>>>> transcribing it, and this is the context that the folks from the IBM
>>>>>>>>> side came in with, for full disclosure.
>>>>>>>>> 
>>>>>>>>> Basics
>>>>>>>>> 
>>>>>>>>> 1. All CouchDB databases are inside a Directory
>>>>>>>>> 2. Each CouchDB database is a Directory within that Directory
>>>>>>>>> 3. It's possible to list all subdirectories of a Directory, so
>>>>>>>>> `_all_dbs` is the list of directories from 1.
>>>>>>>>> 4. Each Directory representing a CouchDB database has several
>>>>>>>>> Subspaces:
>>>>>>>>> 4a. by_id/ doc subspace: actual document contents
>>>>>>>>> 4b. by_seq/versionstamp subspace: for the _changes feed
>>>>>>>>> 4c. index_definitions, indexes, ...
>>>>>>>>> 
>>>>>>>>> JSON Mapping
>>>>>>>>> 
>>>>>>>>> A hierarchical JSON object naturally maps to multiple KV pairs in FDB:
>>>>>>>>> 
>>>>>>>>> {
>>>>>>>>> “_id”: “foo”,
>>>>>>>>> “owner”: “bob”,
>>>>>>>>> “mylist”: [1,3,5],
>>>>>>>>> “mymap”: {
>>>>>>>>>     “blue”: “#0000FF”,
>>>>>>>>>     “red”: “#FF0000”
>>>>>>>>> }
>>>>>>>>> }
>>>>>>>>> 
>>>>>>>>> maps to
>>>>>>>>> 
>>>>>>>>> (“foo”, “owner”) = “bob”
>>>>>>>>> (“foo”, “mylist”, 0) = 1
>>>>>>>>> (“foo”, “mylist”, 1) = 3
>>>>>>>>> (“foo”, “mylist”, 2) = 5
>>>>>>>>> (“foo”, “mymap”, “blue”) = “#0000FF”
>>>>>>>>> (“foo”, “mymap”, “red”) = “#FF0000”
>>>>>>>>> 
>>>>>>>>> NB: this means that the 100KB limit applies to individual leaves in the
>>>>>>>>> JSON object, not the entire doc
>>>>>>>>> 
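>>>>>>>>> A small Python sketch of this flattening, for illustration (plain tuples
>>>>>>>>> here; the real thing would go through the FDB tuple layer):
>>>>>>>>> 
>>>>>>>>> def explode(value, path=()):
>>>>>>>>>     """Flatten a JSON value into (path, scalar) pairs."""
>>>>>>>>>     if isinstance(value, dict):
>>>>>>>>>         for k, v in value.items():
>>>>>>>>>             yield from explode(v, path + (k,))
>>>>>>>>>     elif isinstance(value, list):
>>>>>>>>>         for i, v in enumerate(value):
>>>>>>>>>             yield from explode(v, path + (i,))
>>>>>>>>>     else:
>>>>>>>>>         yield path, value
>>>>>>>>> 
>>>>>>>>> doc = {"_id": "foo", "owner": "bob", "mylist": [1, 3, 5],
>>>>>>>>>        "mymap": {"blue": "#0000FF", "red": "#FF0000"}}
>>>>>>>>> 
>>>>>>>>> for path, value in explode({k: v for k, v in doc.items() if k != "_id"}):
>>>>>>>>>     print((doc["_id"],) + path, "=", value)
>>>>>>>>> # prints exactly the six KV pairs listed above
>>>>>>>>> 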
>>>>>>>>> Edit Conflicts
>>>>>>>>> 
>>>>>>>>> We need to account for the presence of conflicts in various levels of
>>>>>>>>> the doc due to replication.
>>>>>>>>> 
>>>>>>>>> Proposal is to create a special value indicating that the subtree 
>>>>>>>>> below
>>>>>>>>> our current cursor position is in an unresolvable conflict. Then add
>>>>>>>>> additional KV pairs below to describe the conflicting entries.
>>>>>>>>> 
>>>>>>>>> KV data model allows us to store these efficiently and minimize
>>>>>>>>> duplication of data:
>>>>>>>>> 
>>>>>>>>> A document with these two conflicts:
>>>>>>>>> 
>>>>>>>>> {
>>>>>>>>> “_id”: “foo”,
>>>>>>>>> “_rev”: “1-abc”,
>>>>>>>>> “owner”: “alice”,
>>>>>>>>> “active”: true
>>>>>>>>> }
>>>>>>>>> {
>>>>>>>>> “_id”: “foo”,
>>>>>>>>> “_rev”: “1-def”,
>>>>>>>>> “owner”: “bob”,
>>>>>>>>> “active”: true
>>>>>>>>> }
>>>>>>>>> 
>>>>>>>>> could be stored thus:
>>>>>>>>> 
>>>>>>>>> (“foo”, “active”) = true
>>>>>>>>> (“foo”, “owner”) = kCONFLICT
>>>>>>>>> (“foo”, “owner”, “1-abc”) = “alice”
>>>>>>>>> (“foo”, “owner”, “1-def”) = “bob”
>>>>>>>>> 
>>>>>>>>> So long as `kCONFLICT` is set at the top of the conflicting subtree 
>>>>>>>>> this
>>>>>>>>> representation can handle conflicts of different data types as well.
>>>>>>>>> 
>>>>>>>>> Missing fields need to be handled explicitly:
>>>>>>>>> 
>>>>>>>>> {
>>>>>>>>> “_id”: “foo”,
>>>>>>>>> “_rev”: “1-abc”,
>>>>>>>>> “owner”: “alice”,
>>>>>>>>> “active”: true
>>>>>>>>> }
>>>>>>>>> 
>>>>>>>>> {
>>>>>>>>> “_id”: “foo”,
>>>>>>>>> “_rev”: “1-def”,
>>>>>>>>> “owner”: {
>>>>>>>>> “name”: “bob”,
>>>>>>>>> “email”: “b...@example.com”
>>>>>>>>> }
>>>>>>>>> }
>>>>>>>>> 
>>>>>>>>> could be stored thus:
>>>>>>>>> 
>>>>>>>>> (“foo”, “active”) = kCONFLICT
>>>>>>>>> (“foo”, “active”, “1-abc”) = true
>>>>>>>>> (“foo”, “active”, “1-def”) = kMISSING
>>>>>>>>> (“foo”, “owner”) = kCONFLICT
>>>>>>>>> (“foo”, “owner”, “1-abc”) = “alice”
>>>>>>>>> (“foo”, “owner”, “1-def”, “name”) = “bob”
>>>>>>>>> (“foo”, “owner”, “1-def”, “email”) = ...
>>>>>>>>> 
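>>>>>>>>> A toy Python sketch of deriving those rows from two conflicting
>>>>>>>>> revisions (kCONFLICT/kMISSING are stand-in sentinels; lists and nested
>>>>>>>>> conflict placement below the top level are omitted for brevity):
>>>>>>>>> 
>>>>>>>>> kCONFLICT, kMISSING = "kCONFLICT", "kMISSING"
>>>>>>>>> 
>>>>>>>>> def explode(value, path=()):
>>>>>>>>>     # flatten nested objects into (path, scalar) pairs
>>>>>>>>>     if isinstance(value, dict):
>>>>>>>>>         for k, v in value.items():
>>>>>>>>>             yield from explode(v, path + (k,))
>>>>>>>>>     else:
>>>>>>>>>         yield path, value
>>>>>>>>> 
>>>>>>>>> def merge(docid, revs):
>>>>>>>>>     # revs maps rev id -> document body (without the underscore fields)
>>>>>>>>>     fields = {f for body in revs.values() for f in body}
>>>>>>>>>     for f in sorted(fields):
>>>>>>>>>         values = {rev: body.get(f, kMISSING) for rev, body in revs.items()}
>>>>>>>>>         if len(set(map(repr, values.values()))) == 1:
>>>>>>>>>             yield (docid, f), next(iter(values.values()))  # no conflict here
>>>>>>>>>         else:
>>>>>>>>>             yield (docid, f), kCONFLICT                    # mark the subtree
>>>>>>>>>             for rev, v in sorted(values.items()):
>>>>>>>>>                 for path, scalar in explode(v):
>>>>>>>>>                     yield (docid, f, rev) + path, scalar
>>>>>>>>> 
>>>>>>>>> revs = {"1-abc": {"owner": "alice", "active": True},
>>>>>>>>>         "1-def": {"owner": {"name": "bob", "email": "b...@example.com"}}}
>>>>>>>>> for key, value in merge("foo", revs):
>>>>>>>>>     print(key, "=", value)
>>>>>>>>> 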
>>>>>>>>> Revision Metadata
>>>>>>>>> 
>>>>>>>>> * CouchDB uses a hash history for revisions
>>>>>>>>> ** Each edit is identified by the hash of the content of the edit
>>>>>>>>> including the base revision against which it was applied
>>>>>>>>> ** Individual edit branches are bounded in length but the number of
>>>>>>>>> branches is potentially unbounded
>>>>>>>>> 
>>>>>>>>> * Size limits preclude us from storing the entire key tree as a single
>>>>>>>>> value; in pathological situations
>>>>>>>>> the tree could exceed 100KB (each entry is > 16 bytes)
>>>>>>>>> 
>>>>>>>>> * Store each edit branch as a separate KV including deleted status in 
>>>>>>>>> a
>>>>>>>>> special subspace
>>>>>>>>> 
>>>>>>>>> * Structure key representation so that “winning” revision can be
>>>>>>>>> automatically retrieved in a limit=1
>>>>>>>>> key range operation
>>>>>>>>> 
>>>>>>>>> (“foo”, “_meta”, “deleted=false”, 1, “def”) = []
>>>>>>>>> (“foo”, “_meta”, “deleted=false”, 4, “bif”) = [“3-baz”, “2-bar”, “1-foo”]  <-- winner
>>>>>>>>> (“foo”, “_meta”, “deleted=true”, 3, “abc”) = [“2-bar”, “1-foo”]
>>>>>>>>> 
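>>>>>>>>> One way to read the winner under that layout, sketched with plain tuple
>>>>>>>>> sorting (a real implementation would be a reverse range read with limit=1
>>>>>>>>> against the deleted=false sub-range, falling back to deleted=true if it
>>>>>>>>> is empty):
>>>>>>>>> 
>>>>>>>>> branches = [
>>>>>>>>>     ("foo", "_meta", "deleted=false", 1, "def"),
>>>>>>>>>     ("foo", "_meta", "deleted=false", 4, "bif"),
>>>>>>>>>     ("foo", "_meta", "deleted=true", 3, "abc"),
>>>>>>>>> ]
>>>>>>>>> live = [k for k in sorted(branches) if k[2] == "deleted=false"]
>>>>>>>>> winner = (live or sorted(branches))[-1]  # highest position in the group wins
>>>>>>>>> print(winner)  # ('foo', '_meta', 'deleted=false', 4, 'bif')
>>>>>>>>> 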
>>>>>>>>> Changes Feed
>>>>>>>>> 
>>>>>>>>> * FDB supports a concept called a versionstamp — a 10 byte, unique,
>>>>>>>>> monotonically (but not sequentially) increasing value for each 
>>>>>>>>> committed
>>>>>>>>> transaction. The first 8 bytes are the committed version of the
>>>>>>>>> database. The last 2 bytes are monotonic in the serialization order 
>>>>>>>>> for
>>>>>>>>> transactions.
>>>>>>>>> 
>>>>>>>>> * A transaction can specify a particular index into a key where the
>>>>>>>>> following 10 bytes will be overwritten by the versionstamp at commit
>>>>>>>>> time
>>>>>>>>> 
>>>>>>>>> * A subspace keyed on versionstamp naturally yields a _changes feed
>>>>>>>>> 
>>>>>>>>> by_seq subspace
>>>>>>>>> (“versionstamp1”) = (“foo”, “1-abc”)
>>>>>>>>> (“versionstamp4”) = (“bar”, “4-def”)
>>>>>>>>> 
>>>>>>>>> by_id subspace
>>>>>>>>> (“bar”, “_vsn”) = “versionstamp4”
>>>>>>>>> ...
>>>>>>>>> (“foo”, “_vsn”) = “versionstamp1”
>>>>>>>>> 
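>>>>>>>>> A sketch of the by_seq write using the Python bindings' versionstamp
>>>>>>>>> helpers (the subspace naming is invented here; the by_id back-pointer and
>>>>>>>>> the cleanup of the previous by_seq entry are left out):
>>>>>>>>> 
>>>>>>>>> import fdb
>>>>>>>>> fdb.api_version(620)
>>>>>>>>> db = fdb.open()
>>>>>>>>> 
>>>>>>>>> @fdb.transactional
>>>>>>>>> def record_change(tr, dbname, docid, rev):
>>>>>>>>>     # the incomplete versionstamp is filled in with the 10-byte commit
>>>>>>>>>     # versionstamp when the transaction commits
>>>>>>>>>     seq_key = fdb.tuple.pack_with_versionstamp(
>>>>>>>>>         (dbname, "by_seq", fdb.tuple.Versionstamp()))
>>>>>>>>>     tr.set_versionstamped_key(seq_key, fdb.tuple.pack((docid, rev)))
>>>>>>>>> 
>>>>>>>>> # record_change(db, "db1", "foo", "1-abc")
>>>>>>>>> 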
>>>>>>>>> JSON Indexes
>>>>>>>>> 
>>>>>>>>> * “Mango” JSON indexes are defined by
>>>>>>>>> ** a list of field names, each of which may be nested,
>>>>>>>>> ** an optional partial_filter_selector which constrains the set of 
>>>>>>>>> docs
>>>>>>>>> that contribute
>>>>>>>>> ** an optional name defined by the ddoc field (the name is auto-
>>>>>>>>> generated if not supplied)
>>>>>>>>> 
>>>>>>>>> * Store index definitions in a single subspace to aid query planning
>>>>>>>>> ** ((person,name), title, email) = (“name-title-email”, “{“student”:
>>>>>>>>> true}”)
>>>>>>>>> ** Store the values for each index in a dedicated subspace, adding the
>>>>>>>>> document ID as the last element in the tuple
>>>>>>>>> *** (“rosie revere”, “engineer”, “ro...@example.com", “foo”) = null
>>>>>>>>> 
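>>>>>>>>> For illustration, emitting the index row for one doc might look like this
>>>>>>>>> (plain tuples, top-level fields only; a real version would pack through
>>>>>>>>> the tuple layer and handle nested and missing fields):
>>>>>>>>> 
>>>>>>>>> def index_row(index_prefix, fields, doc):
>>>>>>>>>     # indexed field values in order, with the doc id as the last element
>>>>>>>>>     return index_prefix + tuple(doc[f] for f in fields) + (doc["_id"],)
>>>>>>>>> 
>>>>>>>>> doc = {"_id": "foo", "name": "rosie revere", "title": "engineer",
>>>>>>>>>        "email": "ro...@example.com"}
>>>>>>>>> print(index_row(("name-title-email",), ["name", "title", "email"], doc))
>>>>>>>>> # ('name-title-email', 'rosie revere', 'engineer', 'ro...@example.com', 'foo')
>>>>>>>>> 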
>>>>>>>>> B.
>>>>>>>>> 
>>>>>>>>> --
>>>>>>>>> Robert Samuel Newson
>>>>>>>>> rnew...@apache.org
>>>>>>>>> 
>>>>>>>>> On Mon, 4 Feb 2019, at 19:13, Ilya Khlopotov wrote:
>>>>>>>>>> 
>>>>>>>>>> I want to fix previous mistakes. I made two mistakes in the previous
>>>>>>>>>> calculations:
>>>>>>>>>> - I used 1Kb as the base size for calculating the expansion factor
>>>>>>>>>>   (although we don't know the exact size of the original document)
>>>>>>>>>> - The expansion factor calculation included the number of revisions
>>>>>>>>>>   (it shouldn't)
>>>>>>>>>> 
>>>>>>>>>> I'll focus on the flattened JSON docs model.
>>>>>>>>>> 
>>>>>>>>>> The following formula was used in the previous calculation:
>>>>>>>>>> storage_size_per_document = mapping_table_size*number_of_revisions +
>>>>>>>>>>   depth*number_of_paths*number_of_revisions +
>>>>>>>>>>   number_of_paths*value_size*number_of_revisions
>>>>>>>>>> 
>>>>>>>>>> To clarify things a little bit, I want to calculate the space requirement
>>>>>>>>>> for a single revision this time.
>>>>>>>>>> mapping_table_size = number_of_field_names * (field_name_length + 4 (integer size))
>>>>>>>>>>   = 100 * (20 + 4) = 2400 bytes
>>>>>>>>>> storage_size_per_document_per_revision_per_replica = mapping_table_size +
>>>>>>>>>>   depth*number_of_paths + value_size*number_of_paths
>>>>>>>>>>   = 2400 bytes + 10*1000 + 100*1000 = 112400 bytes ≈ 110 KB
>>>>>>>>>> 
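>>>>>>>>>> The same estimate in code form, using the numbers from the assumptions
>>>>>>>>>> above:
>>>>>>>>>> 
>>>>>>>>>> field_names, field_name_len, int_size = 100, 20, 4
>>>>>>>>>> paths, depth, value_size = 1000, 10, 100
>>>>>>>>>> mapping_table = field_names * (field_name_len + int_size)     # 2400 bytes
>>>>>>>>>> per_rev = mapping_table + depth * paths + value_size * paths  # 112400 bytes
>>>>>>>>>> print(per_rev / 1024)  # ~110 KB per revision per replica
>>>>>>>>>> 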
>>>>>>>>>> We can definitely reduce the mapping table requirement by adopting
>>>>>>>>>> rnewson's idea of a schema.
>>>>>>>>>> 
>>>>>>>>>> On 2019/02/04 11:08:16, Ilya Khlopotov <iil...@apache.org> wrote:
>>>>>>>>>>> Hi Michael,
>>>>>>>>>>> 
>>>>>>>>>>>> For example, here's a crazy thought:
>>>>>>>>>>>> Map every distinct occurrence of a key/value instance through a
>>>>>>>>>>>> crypto hash function to get a set of hashes.
>>>>>>>>>>>> 
>>>>>>>>>>>> These can be precomputed by Couch without any lookups in FDB. These
>>>>>>>>>>>> will be spread all over kingdom come in FDB and not lend themselves
>>>>>>>>>>>> to range search well.
>>>>>>>>>>>> 
>>>>>>>>>>>> So what you do is index them for frequency of occurring in the
>>>>>>>> same set.
>>>>>>>>>>>> In essence, you 'bucket them' statistically, and that bucket id
>>>>>>>> becomes a
>>>>>>>>>>>> key prefix. A crypto hash value can be copied into more than one
>>>>>>>> bucket.
>>>>>>>>>>>> The {bucket_id}/{cryptohash} becomes a {val_id}
>>>>>>>>>>> 
>>>>>>>>>>>> When writing a document, Couch submits the list/array of
>>>>>>>> cryptohash values
>>>>>>>>>>>> it computed to FDB and gets back the corresponding  {val_id} (the
>>>>>>>> id with
>>>>>>>>>>>> the bucket prefixed).  This can get somewhat expensive if there's
>>>>>>>> always a
>>>>>>>>>>>> lot of app local cache misses.
>>>>>>>>>>>> 
>>>>>>>>>>>> A document's value is then a series of {val_id} arrays up to 100k
>>>>>>>> per
>>>>>>>>>>>> segment.
>>>>>>>>>>>> 
>>>>>>>>>>>> When retrieving a document, you get the val_ids, find the distinct
>>>>>>>> buckets
>>>>>>>>>>>> and min/max entries for this doc, and then parallel query each
>>>>>>>> bucket while
>>>>>>>>>>>> reconstructing the document.
>>>>>>>>>>> 
>>>>>>>>>>> Interesting idea. Let's try to think it through to see if we can
>>>>>>>> make it viable.
>>>>>>>>>>> Let's go through hypothetical example. Input data for the example:
>>>>>>>>>>> - 1M of documents
>>>>>>>>>>> - each document is around 10Kb
>>>>>>>>>>> - each document consists of 1K of unique JSON paths
>>>>>>>>>>> - each document has 100 unique JSON field names
>>>>>>>>>>> - every scalar value is 100 bytes
>>>>>>>>>>> - 10% of unique JSON paths for every document are already stored in the
>>>>>>>>>>>   database under a different doc or a different revision of the current one
>>>>>>>>>>> - we assume 3 independent copies for every key-value pair in FDB
>>>>>>>>>>> - our hash key size is 32 bytes
>>>>>>>>>>> - let's assume we can determine if a key is already in storage without
>>>>>>>>>>>   doing a query
>>>>>>>>>>> - 1% of paths is in cache (an unrealistic value; in real life the
>>>>>>>>>>>   percentage is lower)
>>>>>>>>>>> - every JSON field name is 20 bytes
>>>>>>>>>>> - every JSON path is 10 levels deep
>>>>>>>>>>> - document key prefix length is 50
>>>>>>>>>>> - every document has 10 revisions
>>>>>>>>>>> Let's estimate the storage requirements and size of data we need to
>>>>>>>> transmit. The calculations are not exact.
>>>>>>>>>>> 1. storage_size_per_document (we cannot estimate exact numbers since
>>>>>>>> we don't know how FDB stores it)
>>>>>>>>>>> - 10 * ((10Kb - (10Kb * 10%)) + (1K - (1K * 10%)) * 32 bytes) =
>>>>>>>> 38Kb * 10 * 3 = 1140 Kb (11x)
>>>>>>>>>>> 2. number of independent keys to retrieve on document read
>>>>>>>> (non-range queries) per document
>>>>>>>>>>> - 1K - (1K * 1%) = 990
>>>>>>>>>>> 3. number of range queries: 0
>>>>>>>>>>> 4. data to transmit on read: (1K - (1K * 1%)) * (100 bytes + 32
>>>>>>>> bytes) = 102 Kb (10x)
>>>>>>>>>>> 5. read latency (we use 2ms per read based on numbers from
>>>>>>>> https://apple.github.io/foundationdb/performance.html)
>>>>>>>>>>> - sequential: 990*2ms = 1980ms
>>>>>>>>>>> - range: 0
>>>>>>>>>>> Let's compare these numbers with initial proposal (flattened JSON
>>>>>>>> docs without global schema and without cache)
>>>>>>>>>>> 1. storage_size_per_document
>>>>>>>>>>> - mapping table size: 100 * (20 + 4(integer size)) = 2400 bytes
>>>>>>>>>>> - key size: (10 * (4 + 1(delimiter))) + 50 = 100 bytes
>>>>>>>>>>> - storage_size_per_document: 2.4K*10 + 100*1K*10 + 1K*100*10 =
>>>>>>>> 2024K = 1976 Kb * 3 = 5930 Kb (59.3x)
>>>>>>>>>>> 2. number of independent keys to retrieve: 0-2 (depending on index
>>>>>>>> structure)
>>>>>>>>>>> 3. number of range queries: 1 (1001 of keys in result)
>>>>>>>>>>> 4. data to transmit on read: 24K + 1000*100 + 1000*100 = 23.6 Kb
>>>>>>>> (2.4x)
>>>>>>>>>>> 5. read latency (we use 2ms per read based on numbers from
>>>>>>>> https://apple.github.io/foundationdb/performance.html and estimate 
>>>>>>>> range
>>>>>>>> read performance based on numbers from
>>>>>>>> https://apple.github.io/foundationdb/benchmarking.html#single-core-read-test
>>>>>>>> )
>>>>>>>>>>> - range read performance: Given read performance is about 305,000
>>>>>>>>>>>   reads/second and range performance is 3,600,000 keys/second, we estimate
>>>>>>>>>>>   range performance to be 11.8x read performance. If a read is 2ms then a
>>>>>>>>>>>   range read is 0.169ms (which is hard to believe).
>>>>>>>>>>> - sequential: 2 * 2 = 4ms
>>>>>>>>>>> - range: 0.169
>>>>>>>>>>> 
>>>>>>>>>>> It looks like we are dealing with a tradeoff:
>>>>>>>>>>> - Map every distinct occurrence of a key/value instance through a crypto hash:
>>>>>>>>>>>   - 5.39x more disk space efficient
>>>>>>>>>>>   - 474x slower
>>>>>>>>>>> - flattened JSON model:
>>>>>>>>>>>   - 5.39x less efficient in disk space
>>>>>>>>>>>   - 474x faster
>>>>>>>>>>> 
>>>>>>>>>>> In any case this unscientific exercise was very helpful, since it
>>>>>>>>>>> uncovered the high cost in terms of disk space. 59.3x the original disk
>>>>>>>>>>> size is too much IMO.
>>>>>>>>>>> 
>>>>>>>>>>> Are there any ways we can make Michael's model more performant?
>>>>>>>>>>> 
>>>>>>>>>>> Also, I don't quite understand a few aspects of the global hash table
>>>>>>>>>>> proposal:
>>>>>>>>>>> 
>>>>>>>>>>> 1. > - Map every distinct occurrence of a key/value instance through
>>>>>>>>>>> a crypto hash function to get a set of hashes.
>>>>>>>>>>> I think we are talking only about scalar values here, i.e.
>>>>>>>>>>> `"#/foo.bar.baz": 123`, since I don't know how we could make it work
>>>>>>>>>>> for all possible JSON paths of `{"foo": {"bar": {"size": 12, "baz": 123}}}`:
>>>>>>>>>>> - foo
>>>>>>>>>>> - foo.bar
>>>>>>>>>>> - foo.bar.baz
>>>>>>>>>>> 
>>>>>>>>>>> 2. how to delete documents
>>>>>>>>>>> 
>>>>>>>>>>> Best regards,
>>>>>>>>>>> ILYA
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On 2019/01/30 23:33:22, Michael Fair <mich...@daclubhouse.net> wrote:
>>>>>>>>>>>> On Wed, Jan 30, 2019, 12:57 PM Adam Kocoloski <kocol...@apache.org> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Hi Michael,
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> The trivial fix is to use DOCID/REVISIONID as DOC_KEY.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Yes that’s definitely one way to address storage of edit
>>>>>>>> conflicts. I
>>>>>>>>>>>>> think there are other, more compact representations that we can
>>>>>>>> explore if
>>>>>>>>>>>>> we have this “exploded” data model where each scalar value maps
>>>>>>>> to an
>>>>>>>>>>>>> individual KV pair.
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> I agree. As I mentioned on the original thread, I see a scheme that
>>>>>>>>>>>> handles both conflicts and revisions where you only have to store the
>>>>>>>>>>>> most recent change to a field. Like you suggested, multiple revisions
>>>>>>>>>>>> can share a key, which in my mind's eye further invites the
>>>>>>>>>>>> conflicts/revisions discussion along with the working-within-the-limits
>>>>>>>>>>>> discussion, because it seems to me they are all intrinsically related
>>>>>>>>>>>> as a "feature".
>>>>>>>>>>>> 
>>>>>>>>>>>> Saying 'We'll break documents up into roughly 80k segments', then
>>>>>>>> trying to
>>>>>>>>>>>> overlay some kind of field sharing scheme for revisions/conflicts
>>>>>>>> doesn't
>>>>>>>>>>>> seem like it will work.
>>>>>>>>>>>> 
>>>>>>>>>>>> I probably should have left out the trivial fix proposal as I
>>>>>>>> don't think
>>>>>>>>>>>> it's a feasible solution to actually use.
>>>>>>>>>>>> 
>>>>>>>>>>>> The comment is more regarding that I do not see how this thread
>>>>>>>> can escape
>>>>>>>>>>>> including how to store/retrieve conflicts/revisions.
>>>>>>>>>>>> 
>>>>>>>>>>>> For instance, the 'doc as individual fields' proposal lends itself to
>>>>>>>>>>>> value sharing across multiple documents (and I don't just mean revisions
>>>>>>>>>>>> of the same doc; I mean the same key/value instance could be shared by
>>>>>>>>>>>> every document).
>>>>>>>>>>>> However that's not really relevant if we're not considering the
>>>>>>>> amount of
>>>>>>>>>>>> shared information across documents in the storage scheme.
>>>>>>>>>>>> 
>>>>>>>>>>>> Simply storing documents in <100k segments (perhaps in some kind of
>>>>>>>>>>>> compressed binary representation) to deal with that FDB limit
>>>>>>>> seems fine.
>>>>>>>>>>>> The only reason to consider doing something else is because of its
>>>>>>>> impact
>>>>>>>>>>>> to indexing, searches, reduce functions, revisions, on-disk size
>>>>>>>> impact,
>>>>>>>>>>>> etc.
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>>> I'm assuming the process will flatten the key paths of the
>>>>>>>> document into
>>>>>>>>>>>>> an array and then request the value of each key as multiple
>>>>>>>> parallel
>>>>>>>>>>>>> queries against FDB at once
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Ah, I think this is not one of Ilya’s assumptions. He’s trying
>>>>>>>> to design a
>>>>>>>>>>>>> model which allows the retrieval of a document with a single
>>>>>>>> range read,
>>>>>>>>>>>>> which is a good goal in my opinion.
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> I am not sure I agree.
>>>>>>>>>>>> 
>>>>>>>>>>>> Think of BitTorrent: a single range read should pull back the structure
>>>>>>>>>>>> of the document (the pieces to fetch), but not necessarily the whole
>>>>>>>>>>>> document.
>>>>>>>>>>>> 
>>>>>>>>>>>> What if you already have a bunch of pieces in common with other
>>>>>>>> documents
>>>>>>>>>>>> locally (a repeated header/footer/ or type for example); and you
>>>>>>>> only need
>>>>>>>>>>>> to get a few pieces of data you don't already have?
>>>>>>>>>>>> 
>>>>>>>>>>>> The real goal for Couch, as I see it, is to treat your document set like
>>>>>>>>>>>> the collection of structured information that it is: in some respects an
>>>>>>>>>>>> extension of your application's heap space for structured objects, with
>>>>>>>>>>>> efficient querying of that collection to get back subsets of the data.
>>>>>>>>>>>> 
>>>>>>>>>>>> Otherwise it seems more like a slightly upgraded file system plus a fancy
>>>>>>>>>>>> grep/find-like feature...
>>>>>>>>>>>> 
>>>>>>>>>>>> The best way I see to unlock more features/power is a move towards a
>>>>>>>>>>>> more granular and efficient way to store and retrieve the scalar
>>>>>>>>>>>> values...
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> For example, here's a crazy thought:
>>>>>>>>>>>> Map every distinct occurrence of a key/value instance through a
>>>>>>>>>>>> crypto hash function to get a set of hashes.
>>>>>>>>>>>> 
>>>>>>>>>>>> These can be precomputed by Couch without any lookups in FDB. These
>>>>>>>>>>>> will be spread all over kingdom come in FDB and not lend themselves
>>>>>>>>>>>> to range search well.
>>>>>>>>>>>> 
>>>>>>>>>>>> So what you do is index them for frequency of occurring in the
>>>>>>>> same set.
>>>>>>>>>>>> In essence, you 'bucket them' statistically, and that bucket id
>>>>>>>> becomes a
>>>>>>>>>>>> key prefix. A crypto hash value can be copied into more than one
>>>>>>>> bucket.
>>>>>>>>>>>> The {bucket_id}/{cryptohash} becomes a {val_id}
>>>>>>>>>>>> 
>>>>>>>>>>>> When writing a document, Couch submits the list/array of
>>>>>>>> cryptohash values
>>>>>>>>>>>> it computed to FDB and gets back the corresponding  {val_id} (the
>>>>>>>> id with
>>>>>>>>>>>> the bucket prefixed).  This can get somewhat expensive if there's
>>>>>>>> always a
>>>>>>>>>>>> lot of app local cache misses.
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> A document's value is then a series of {val_id} arrays up to 100k
>>>>>>>> per
>>>>>>>>>>>> segment.
>>>>>>>>>>>> 
>>>>>>>>>>>> When retrieving a document, you get the val_ids, find the distinct
>>>>>>>> buckets
>>>>>>>>>>>> and min/max entries for this doc, and then parallel query each
>>>>>>>> bucket while
>>>>>>>>>>>> reconstructing the document.
>>>>>>>>>>>> 
>>>>>>>>>>>> The values returned from the buckets query are the key/value
>>>>>>>> strings
>>>>>>>>>>>> required to reassemble this document.
>>>>>>>>>>>> 
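>>>>>>>>>>>> A toy, in-memory sketch of the idea (a dict stands in for FDB, the
>>>>>>>>>>>> bucket assignment is a trivial hash prefix, and frequency-based
>>>>>>>>>>>> bucketing is skipped entirely):
>>>>>>>>>>>> 
>>>>>>>>>>>> import hashlib
>>>>>>>>>>>> import json
>>>>>>>>>>>> 
>>>>>>>>>>>> store = {}  # {(bucket_id, cryptohash): (path, value)} pretend-FDB
>>>>>>>>>>>> 
>>>>>>>>>>>> def val_id(path, value):
>>>>>>>>>>>>     # hash one key/value instance; the bucket id becomes the key prefix
>>>>>>>>>>>>     h = hashlib.sha256(json.dumps([path, value]).encode()).hexdigest()
>>>>>>>>>>>>     return (h[:2], h)  # first byte as a stand-in bucket id
>>>>>>>>>>>> 
>>>>>>>>>>>> def write_doc(doc):
>>>>>>>>>>>>     ids = []
>>>>>>>>>>>>     for path, value in doc.items():  # flat docs only, for brevity
>>>>>>>>>>>>         vid = val_id(path, value)
>>>>>>>>>>>>         store[vid] = (path, value)   # shared by every doc with this pair
>>>>>>>>>>>>         ids.append(vid)
>>>>>>>>>>>>     return ids                       # stored as the document's "value"
>>>>>>>>>>>> 
>>>>>>>>>>>> def read_doc(ids):
>>>>>>>>>>>>     # group by bucket, then "query" each bucket and reassemble
>>>>>>>>>>>>     return dict(store[vid] for vid in ids)
>>>>>>>>>>>> 
>>>>>>>>>>>> ids = write_doc({"owner": "bob", "active": True})
>>>>>>>>>>>> print(read_doc(ids))  # {'owner': 'bob', 'active': True}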
>>>>>>>>>>>> 
>>>>>>>>>>>> ----------
>>>>>>>>>>>> I put this forward primarily to highlight the idea that trying to match
>>>>>>>>>>>> the storage representation of documents in a straightforward way to FDB
>>>>>>>>>>>> keys to reduce query count might not be the most performance-oriented
>>>>>>>>>>>> approach.
>>>>>>>>>>>> 
>>>>>>>>>>>> I'd much prefer a storage approach that reduced data duplication
>>>>>>>> and
>>>>>>>>>>>> enabled fast sub-document queries.
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> This clearly falls in the realm of what people want the 'use case'
>>>>>>>> of Couch
>>>>>>>>>>>> to be/become.  By giving Couch more access to sub-document
>>>>>>>> queries, I could
>>>>>>>>>>>> eventually see queries as complicated as GraphQL submitted to
>>>>>>>> Couch and
>>>>>>>>>>>> pulling back ad-hoc aggregated data across multiple documents in a
>>>>>>>> single
>>>>>>>>>>>> application layer request.
>>>>>>>>>>>> 
>>>>>>>>>>>> Hehe - one way to look at the database of Couch documents is that they
>>>>>>>>>>>> are all conflict revisions of the single root empty document. What I mean
>>>>>>>>>>>> by this is: consider the entire document store as one giant DAG of
>>>>>>>>>>>> key/value pairs, and how even separate documents are still typically
>>>>>>>>>>>> related to each other. For most applications there is a tremendous amount
>>>>>>>>>>>> of data redundancy between docs and especially between revisions of those
>>>>>>>>>>>> docs...
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> And all this is a long way of saying "I think there could be a lot
>>>>>>>> of value
>>>>>>>>>>>> in assuming documents are 'assembled' from multiple queries to
>>>>>>>> FDB, with
>>>>>>>>>>>> local caching, instead of simply retrieved"
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks, I hope I'm not the only outlier here thinking this way!?
>>>>>>>>>>>> 
>>>>>>>>>>>> Mike :-)
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>> 
> 
