As is evident from the length of this thread, there's a pretty big
design space to cover here, and it seems unlikely we'll have arrived
at a "correct" solution even by the time this thing ships. Perhaps it
would be worthwhile to treat the in-FDB representation of data as a
first-class abstraction and support multiple representations
simultaneously?

Obviously there's no such thing as a zero-cost abstraction - and I've
not thought very hard about how far up the stack the document
representation would need to leak - but supporting different layouts
(primarily, as Adam points out, on the document body itself) might
prove interesting and useful. I'm sure there are folks interested in a
column-shaped CouchDB, for example.
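
To make that a bit more concrete, here's the rough shape of the abstraction
I'm imagining. This is purely an illustrative Python-style sketch; none of
these module or class names exist today, and the real thing would live in
Erlang:

    from abc import ABC, abstractmethod

    class DocRepresentation(ABC):
        """Encapsulates how a document body is laid out in FDB keyspace."""

        @abstractmethod
        def write_body(self, tr, db_subspace, docid, body):
            """Store `body` (a dict) under the doc's subspace in txn `tr`."""

        @abstractmethod
        def read_body(self, tr, db_subspace, docid):
            """Reassemble and return the document body."""

    class BlobChunks(DocRepresentation):
        """Body JSON-encoded and split into values under the 100KB limit."""

    class ExplodedPaths(DocRepresentation):
        """One KV per scalar leaf, tuple-encoded field paths as keys."""

The representation in use could be recorded per database so readers know
how to reassemble what they find.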

--
b

On Tue, Feb 19, 2019 at 11:39 AM Robert Newson <rnew...@apache.org> wrote:
>
> Good points on the revtree; I agree with you that we should store it
> intelligently to gain the benefits you mentioned.
>
> --
>   Robert Samuel Newson
>   rnew...@apache.org
>
> On Tue, 19 Feb 2019, at 18:41, Adam Kocoloski wrote:
> > I do not think we should store the revtree as a blob. The design where
> > each edit branch is its own KV should save on network IO and CPU cycles
> > for normal updates. We’ve performed too many heroics to keep
> > couch_key_tree from stalling entire databases when trying to update a
> > single document with a wide revision tree; I would much prefer to ignore
> > other edit branches entirely when all we’re doing is extending one of
> > them.
> >
> > I also do not think we should store JSON documents as blobs, but it’s a
> > closer call. Some of my reasoning for preferring the exploded path
> > design:
> >
> > - it lends itself nicely to sub-document operations, for which Jan
> > crafted an RFC last year: https://github.com/apache/couchdb/issues/1559
> > - it optimizes the creation of Mango indexes on existing databases, since
> > we only need to retrieve the value(s) we want to index (see the sketch
> > after this list)
> > - it optimizes Mango queries that use field selectors
> > - anyone who wanted to try their hand at GraphQL will find it very
> > handy: https://github.com/apache/couchdb/issues/1499
> > - looking further ahead, it lets us play with smarter leaf value types
> > like Counters (yes I’m still on the CRDT bandwagon, sorry)
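> >
> > As a rough illustration of the Mango point (an untested sketch against the
> > FoundationDB Python bindings; it assumes a by_id subspace keyed by doc ID
> > plus field path, as in the mapping further down the thread), building an
> > index on one field only needs a small prefix read instead of fetching the
> > whole body:
> >
> > import fdb, fdb.tuple
> > fdb.api_version(600)
> >
> > @fdb.transactional
> > def read_field(tr, docid, path):
> >     # path like ("owner",) or ("mymap", "blue"); only these KVs travel
> >     # over the network, not the rest of the document
> >     prefix = fdb.tuple.pack(("by_id", docid) + tuple(path))
> >     return [(fdb.tuple.unpack(kv.key), kv.value)
> >             for kv in tr.get_range_startswith(prefix)]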
> >
> > A few comments on the thread:
> >
> > >>> * Most document bodies are probably going to be smaller than 100k, so in
> > >>> the majority of cases it would be one write / one read to update and fetch
> > >>> the document body.
> >
> > We should test, but I expect reading 50KB of data in a range query is
> > almost as efficient as reading a single 50 KB value. Similarly, writes
> > to a contiguous set of keys should be quite efficient.
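> >
> > Something along these lines would settle it quickly (a throwaway
> > micro-benchmark sketch with the Python bindings, not real test code):
> >
> > import fdb, fdb.tuple, os, time
> > fdb.api_version(600)
> > db = fdb.open()
> >
> > @fdb.transactional
> > def load(tr):
> >     tr[fdb.tuple.pack(("blob", 0))] = os.urandom(50 * 1024)  # one 50 KB value
> >     for i in range(50):                                      # fifty 1 KB values
> >         tr[fdb.tuple.pack(("range", i))] = os.urandom(1024)
> >
> > @fdb.transactional
> > def read_blob(tr):
> >     return tr[fdb.tuple.pack(("blob", 0))].present()  # force the read
> >
> > @fdb.transactional
> > def read_range(tr):
> >     return list(tr.get_range_startswith(fdb.tuple.pack(("range",))))
> >
> > load(db)
> > for fn in (read_blob, read_range):
> >     start = time.time()
> >     for _ in range(100):
> >         fn(db)
> >     print(fn.__name__, (time.time() - start) / 100)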
> >
> > I am concerned about the overhead of the repeated field paths in the
> > keys with the exploded path option in the absence of key prefix
> > compression. That would be my main reason to acquiesce and throw away
> > all the document structure.
> >
> > Adam
> >
> > > On Feb 19, 2019, at 12:04 PM, Robert Newson <rnew...@apache.org> wrote:
> > >
> > > I like the idea that we'd reuse the same pattern (but perhaps not the 
> > > same _code_) for doc bodies, revtree and attachments.
> > >
> > > I hope we still get to delete couch_key_tree.erl, though.
> > >
> > > --
> > >  Robert Samuel Newson
> > >  rnew...@apache.org
> > >
> > > On Tue, 19 Feb 2019, at 17:03, Jan Lehnardt wrote:
> > >> I like the idea from a “trying a simple thing first” perspective, but
> > >> Nick’s points below are especially convincing to go with this for now.
> > >>
> > >> Best
> > >> Jan
> > >> —
> > >>
> > >>> On 19. Feb 2019, at 17:53, Nick Vatamaniuc <vatam...@gmail.com> wrote:
> > >>>
> > >>> Hi,
> > >>>
> > >>> Sorry for jumping in so late; I was mostly following from the sidelines.
> > >>> A lot of good discussion is happening and I'm excited about the
> > >>> possibilities here.
> > >>>
> > >>> I do like the simpler "chunking" approach for a few reasons:
> > >>>
> > >>> * Most document bodies are probably going to be smaller than 100k, so in
> > >>> the majority of cases it would be one write / one read to update and fetch
> > >>> the document body.
> > >>>
> > >>> * We could reuse the chunking code for attachment handling and possibly
> > >>> revision key trees. It's the general pattern of uploading chunks to some
> > >>> prefix and, when finished, flipping an atomic toggle to make them current
> > >>> (see the sketch after this list).
> > >>>
> > >>> * Do the same thing with revision trees and we could re-use the revision
> > >>> tree manipulation logic. That is, the key tree in most cases would be
> > >>> small enough to fit in 100k, but if it gets huge it'd get chunked. This
> > >>> would allow us to reuse all the battle-tested couch_key_tree code mostly
> > >>> as is. We even have property tests for it:
> > >>> https://github.com/apache/couchdb/blob/master/src/couch/test/couch_key_tree_prop_tests.erl
> > >>>
> > >>> * It removes the need to explain the max exploded path length limitation
> > >>> to customers.
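> > >>>
> > >>> Roughly the shape I have in mind for the chunk-then-flip pattern
> > >>> (illustrative sketch only, using the Python bindings; the subspace names
> > >>> and the "current" pointer are made up):
> > >>>
> > >>> import fdb, fdb.tuple, uuid
> > >>> fdb.api_version(600)
> > >>>
> > >>> CHUNK = 90 * 1024  # stay safely under the 100 KB value limit
> > >>>
> > >>> def upload(db, docid, blob):
> > >>>     upload_id = uuid.uuid4().hex
> > >>>     # 1. write chunks under a staging id; this can span many transactions
> > >>>     #    if the blob is bigger than the 10 MB transaction limit
> > >>>     for i in range(0, len(blob), CHUNK):
> > >>>         write_chunk(db, docid, upload_id, i // CHUNK, blob[i:i + CHUNK])
> > >>>     # 2. flip the atomic toggle: point "current" at the finished upload
> > >>>     make_current(db, docid, upload_id)
> > >>>
> > >>> @fdb.transactional
> > >>> def write_chunk(tr, docid, upload_id, n, data):
> > >>>     tr[fdb.tuple.pack(("att", docid, upload_id, n))] = data
> > >>>
> > >>> @fdb.transactional
> > >>> def make_current(tr, docid, upload_id):
> > >>>     # readers follow this pointer, so the switch is atomic; superseded
> > >>>     # uploads get cleaned up separately
> > >>>     tr[fdb.tuple.pack(("att", docid, "current"))] = upload_id.encode()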
> > >>>
> > >>> Cheers,
> > >>> -Nick
> > >>>
> > >>>
> > >>> On Tue, Feb 19, 2019 at 11:18 AM Robert Newson <rnew...@apache.org> 
> > >>> wrote:
> > >>>
> > >>>> Hi,
> > >>>>
> > >>>> An alternative storage model that we should seriously consider is to
> > >>>> follow our current approach in couch_file et al. Specifically, the
> > >>>> document _body_ is stored as an uninterpreted binary value. This would be
> > >>>> much like the obvious plan for attachment storage: a key prefix that
> > >>>> identifies the database and document, with the final item of that key
> > >>>> tuple being an incrementing integer. Each of those keys has a binary
> > >>>> value of up to 100k. Fetching all values with that key prefix, in fdb's
> > >>>> natural ordering, will yield the full document body, which can be JSON
> > >>>> decoded for further processing.
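> > >>>>
> > >>>> In code, something like this (a sketch only, written against the Python
> > >>>> bindings rather than anything we'd actually ship, and untested):
> > >>>>
> > >>>> import fdb, fdb.tuple, json
> > >>>> fdb.api_version(600)
> > >>>>
> > >>>> CHUNK = 90 * 1024  # keep every value under fdb's 100 KB limit
> > >>>>
> > >>>> @fdb.transactional
> > >>>> def write_doc(tr, dbname, docid, body):
> > >>>>     tr.clear_range_startswith(fdb.tuple.pack((dbname, "docs", docid)))
> > >>>>     blob = json.dumps(body).encode()
> > >>>>     for n in range(0, len(blob), CHUNK):
> > >>>>         key = fdb.tuple.pack((dbname, "docs", docid, n // CHUNK))
> > >>>>         tr[key] = blob[n:n + CHUNK]
> > >>>>
> > >>>> @fdb.transactional
> > >>>> def read_doc(tr, dbname, docid):
> > >>>>     prefix = fdb.tuple.pack((dbname, "docs", docid))
> > >>>>     return json.loads(b"".join(
> > >>>>         kv.value for kv in tr.get_range_startswith(prefix)))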
> > >>>>
> > >>>> I like this idea, and I like Adam's original proposal to explode
> > >>>> documents into property paths. I have a slight preference for the
> > >>>> simplicity of the idea in the previous paragraph, not least because it's
> > >>>> close to what we do today. I also think it will be possible to migrate to
> > >>>> alternative storage models in future, and foundationdb's transaction
> > >>>> support means we can do this migration seamlessly should we come to it.
> > >>>>
> > >>>> I'm very interested in knowing if anyone else is interested in going
> > >>>> this simple, or considers it a wasted opportunity relative to the
> > >>>> 'exploded' path.
> > >>>>
> > >>>> B.
> > >>>>
> > >>>> --
> > >>>> Robert Samuel Newson
> > >>>> rnew...@apache.org
> > >>>>
> > >>>> On Mon, 4 Feb 2019, at 19:59, Robert Newson wrote:
> > >>>>> I've been remiss here in not posting the data model ideas that IBM
> > >>>>> worked up while we were thinking about using FoundationDB, so I'm
> > >>>>> posting it now. This is Adam Kocoloski's original work; I am just
> > >>>>> transcribing it, and this is the context that the folks from the IBM
> > >>>>> side came in with, for full disclosure.
> > >>>>>
> > >>>>> Basics
> > >>>>>
> > >>>>> 1. All CouchDB databases are inside a Directory
> > >>>>> 2. Each CouchDB database is a Directory within that Directory
> > >>>>> 3. It's possible to list all subdirectories of a Directory, so
> > >>>>> `_all_dbs` is the list of directories from 1.
> > >>>>> 4. Each Directory representing a CouchDB database has several Subspaces:
> > >>>>> 4a. by_id/ doc subspace: actual document contents
> > >>>>> 4b. by_seq/versionstamp subspace: for the _changes feed
> > >>>>> 4c. index_definitions, indexes, ...
> > >>>>>
> > >>>>> JSON Mapping
> > >>>>>
> > >>>>> A hierarchical JSON object naturally maps to multiple KV pairs in FDB:
> > >>>>>
> > >>>>> {
> > >>>>>   “_id”: “foo”,
> > >>>>>   “owner”: “bob”,
> > >>>>>   “mylist”: [1,3,5],
> > >>>>>   “mymap”: {
> > >>>>>       “blue”: “#0000FF”,
> > >>>>>       “red”: “#FF0000”
> > >>>>>   }
> > >>>>> }
> > >>>>>
> > >>>>> maps to
> > >>>>>
> > >>>>> (“foo”, “owner”) = “bob”
> > >>>>> (“foo”, “mylist”, 0) = 1
> > >>>>> (“foo”, “mylist”, 1) = 3
> > >>>>> (“foo”, “mylist”, 2) = 5
> > >>>>> (“foo”, “mymap”, “blue”) = “#0000FF”
> > >>>>> (“foo”, “mymap”, “red”) = “#FF0000”
> > >>>>>
> > >>>>> NB: this means that the 100KB limit applies to individual leaves in the
> > >>>>> JSON object, not the entire doc
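> > >>>>>
> > >>>>> For illustration, the flattening itself is just a small recursive walk
> > >>>>> (a Python-style sketch of the idea, not the eventual implementation;
> > >>>>> values would still need their own encoding before being written):
> > >>>>>
> > >>>>> import fdb, fdb.tuple
> > >>>>> fdb.api_version(600)
> > >>>>>
> > >>>>> def explode(docid, node, path=()):
> > >>>>>     # yield one (packed key, scalar value) pair per leaf of the doc;
> > >>>>>     # the 100KB value limit now applies per leaf, not per document
> > >>>>>     if isinstance(node, dict):
> > >>>>>         for k, v in node.items():
> > >>>>>             yield from explode(docid, v, path + (k,))
> > >>>>>     elif isinstance(node, list):
> > >>>>>         for i, v in enumerate(node):
> > >>>>>             yield from explode(docid, v, path + (i,))
> > >>>>>     else:
> > >>>>>         yield fdb.tuple.pack((docid,) + path), node
> > >>>>>
> > >>>>> doc = {"owner": "bob", "mylist": [1, 3, 5],
> > >>>>>        "mymap": {"blue": "#0000FF", "red": "#FF0000"}}
> > >>>>> kvs = list(explode("foo", doc))  # the six pairs shown above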
> > >>>>>
> > >>>>> Edit Conflicts
> > >>>>>
> > >>>>> We need to account for the presence of conflicts in various levels of
> > >>>>> the doc due to replication.
> > >>>>>
> > >>>>> The proposal is to create a special value indicating that the subtree
> > >>>>> below our current cursor position is in an unresolvable conflict, then
> > >>>>> add additional KV pairs below it to describe the conflicting entries.
> > >>>>>
> > >>>>> KV data model allows us to store these efficiently and minimize
> > >>>>> duplication of data:
> > >>>>>
> > >>>>> A document with these two conflicts:
> > >>>>>
> > >>>>> {
> > >>>>>   “_id”: “foo”,
> > >>>>>   “_rev”: “1-abc”,
> > >>>>>   “owner”: “alice”,
> > >>>>>   “active”: true
> > >>>>> }
> > >>>>> {
> > >>>>>   “_id”: “foo”,
> > >>>>>   “_rev”: “1-def”,
> > >>>>>   “owner”: “bob”,
> > >>>>>   “active”: true
> > >>>>> }
> > >>>>>
> > >>>>> could be stored thus:
> > >>>>>
> > >>>>> (“foo”, “active”) = true
> > >>>>> (“foo”, “owner”) = kCONFLICT
> > >>>>> (“foo”, “owner”, “1-abc”) = “alice”
> > >>>>> (“foo”, “owner”, “1-def”) = “bob”
> > >>>>>
> > >>>>> So long as `kCONFLICT` is set at the top of the conflicting subtree,
> > >>>>> this representation can handle conflicts of different data types as
> > >>>>> well.
> > >>>>>
> > >>>>> Missing fields need to be handled explicitly:
> > >>>>>
> > >>>>> {
> > >>>>> “_id”: “foo”,
> > >>>>> “_rev”: “1-abc”,
> > >>>>> “owner”: “alice”,
> > >>>>> “active”: true
> > >>>>> }
> > >>>>>
> > >>>>> {
> > >>>>> “_id”: “foo”,
> > >>>>> “_rev”: “1-def”,
> > >>>>> “owner”: {
> > >>>>>   “name”: “bob”,
> > >>>>>   “email”: “
> > >>>>> b...@example.com
> > >>>>> "
> > >>>>> }
> > >>>>> }
> > >>>>>
> > >>>>> could be stored thus:
> > >>>>>
> > >>>>> (“foo”, “active”) = kCONFLICT
> > >>>>> (“foo”, “active”, “1-abc”) = true
> > >>>>> (“foo”, “active”, “1-def”) = kMISSING
> > >>>>> (“foo”, “owner”) = kCONFLICT
> > >>>>> (“foo”, “owner”, “1-abc”) = “alice”
> > >>>>> (“foo”, “owner”, “1-def”, “name”) = “bob”
> > >>>>> (“foo”, “owner”, “1-def”, “email”) = ...
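> > >>>>>
> > >>>>> Reading one revision's view of a field back out of that layout could
> > >>>>> look roughly like this (a toy sketch over an in-memory dict; the
> > >>>>> kCONFLICT/kMISSING sentinels are placeholders and the nested-subtree
> > >>>>> case is skipped):
> > >>>>>
> > >>>>> KCONFLICT = object()  # placeholder sentinels, not a concrete proposal
> > >>>>> KMISSING = object()
> > >>>>>
> > >>>>> def value_for_rev(stored, path, rev):
> > >>>>>     # `stored` maps key tuples to values, as in the examples above
> > >>>>>     v = stored.get(path)
> > >>>>>     if v is not KCONFLICT:
> > >>>>>         return v                     # field is not in conflict
> > >>>>>     v = stored.get(path + (rev,))    # per-revision value under the marker
> > >>>>>     return None if v is KMISSING else v
> > >>>>>
> > >>>>> stored = {
> > >>>>>     ("foo", "active"): True,
> > >>>>>     ("foo", "owner"): KCONFLICT,
> > >>>>>     ("foo", "owner", "1-abc"): "alice",
> > >>>>>     ("foo", "owner", "1-def"): "bob",
> > >>>>> }
> > >>>>> assert value_for_rev(stored, ("foo", "owner"), "1-def") == "bob"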
> > >>>>>
> > >>>>> Revision Metadata
> > >>>>>
> > >>>>> * CouchDB uses a hash history for revisions
> > >>>>> ** Each edit is identified by the hash of the content of the edit
> > >>>>> including the base revision against which it was applied
> > >>>>> ** Individual edit branches are bounded in length but the number of
> > >>>>> branches is potentially unbounded
> > >>>>>
> > >>>>> * Size limits preclude us from storing the entire key tree as a single
> > >>>>> value; in pathological situations
> > >>>>> the tree could exceed 100KB (each entry is > 16 bytes)
> > >>>>>
> > >>>>> * Store each edit branch as a separate KV, including deleted status,
> > >>>>> in a special subspace
> > >>>>>
> > >>>>> * Structure key representation so that “winning” revision can be
> > >>>>> automatically retrieved in a limit=1
> > >>>>> key range operation
> > >>>>>
> > >>>>> (“foo”, “_meta”, “deleted=false”, 1, “def”) = []
> > >>>>> (“foo”, “_meta”, “deleted=false”, 4, “bif”) = [“3-baz”, “2-bar”, “1-foo”]   <-- winner
> > >>>>> (“foo”, “_meta”, “deleted=true”, 3, “abc”) = [“2-bar”, “1-foo”]
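> > >>>>>
> > >>>>> i.e. roughly (an untested sketch with the Python bindings; it assumes
> > >>>>> the deleted flag is encoded so that live edit branches sort after
> > >>>>> deleted ones, making the winner the last key in the range):
> > >>>>>
> > >>>>> import fdb, fdb.tuple
> > >>>>> fdb.api_version(600)
> > >>>>>
> > >>>>> @fdb.transactional
> > >>>>> def winning_rev(tr, docid):
> > >>>>>     prefix = fdb.tuple.pack((docid, "_meta"))
> > >>>>>     for kv in tr.get_range_startswith(prefix, limit=1, reverse=True):
> > >>>>>         _docid, _meta, live, pos, rev_hash = fdb.tuple.unpack(kv.key)
> > >>>>>         return pos, rev_hash, live
> > >>>>>     return None  # document does not exist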
> > >>>>>
> > >>>>> Changes Feed
> > >>>>>
> > >>>>> * FDB supports a concept called a versionstamp — a 10-byte, unique,
> > >>>>> monotonically (but not sequentially) increasing value for each committed
> > >>>>> transaction. The first 8 bytes are the committed version of the database.
> > >>>>> The last 2 bytes are monotonic in the serialization order for
> > >>>>> transactions.
> > >>>>>
> > >>>>> * A transaction can specify a particular index into a key where the
> > >>>>> following 10 bytes will be overwritten by the versionstamp at commit
> > >>>>> time
> > >>>>>
> > >>>>> * A subspace keyed on versionstamp naturally yields a _changes feed
> > >>>>>
> > >>>>> by_seq subspace
> > >>>>> (“versionstamp1”) = (“foo”, “1-abc”)
> > >>>>> (“versionstamp4”) = (“bar”, “4-def”)
> > >>>>>
> > >>>>> by_id subspace
> > >>>>> (“bar”, “_vsn”) = “versionstamp4”
> > >>>>> ...
> > >>>>> (“foo”, “_vsn”) = “versionstamp1”
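> > >>>>>
> > >>>>> Sketch of the write side (illustrative only, Python bindings; the
> > >>>>> by_id "_vsn" bookkeeping that clears a doc's previous by_seq entry is
> > >>>>> omitted here):
> > >>>>>
> > >>>>> import fdb, fdb.tuple
> > >>>>> fdb.api_version(600)
> > >>>>>
> > >>>>> @fdb.transactional
> > >>>>> def record_change(tr, dbname, docid, rev):
> > >>>>>     # the incomplete versionstamp in the key is filled in by FDB at
> > >>>>>     # commit time, giving a totally ordered by_seq entry for free
> > >>>>>     key = fdb.tuple.pack_with_versionstamp(
> > >>>>>         (dbname, "by_seq", fdb.tuple.Versionstamp()))
> > >>>>>     tr.set_versionstamped_key(key, fdb.tuple.pack((docid, rev)))
> > >>>>>
> > >>>>> @fdb.transactional
> > >>>>> def changes_since(tr, dbname, since_key):
> > >>>>>     # _changes?since=... becomes a plain range read in versionstamp order
> > >>>>>     end = fdb.tuple.range((dbname, "by_seq")).stop
> > >>>>>     return [(kv.key, fdb.tuple.unpack(kv.value))
> > >>>>>             for kv in tr.get_range(since_key, end)]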
> > >>>>>
> > >>>>> JSON Indexes
> > >>>>>
> > >>>>> * “Mango” JSON indexes are defined by
> > >>>>> ** a list of field names, each of which may be nested,
> > >>>>> ** an optional partial_filter_selector which constrains the set of 
> > >>>>> docs
> > >>>>> that contribute
> > >>>>> ** an optional name defined by the ddoc field (the name is auto-
> > >>>>> generated if not supplied)
> > >>>>>
> > >>>>> * Store index definitions in a single subspace to aid query planning
> > >>>>> ** ((person,name), title, email) = (“name-title-email”, “{“student”: true}”)
> > >>>>> ** Store the values for each index in a dedicated subspace, adding the
> > >>>>> document ID as the last element in the tuple (see the sketch below)
> > >>>>> *** (“rosie revere”, “engineer”, “ro...@example.com", “foo”) = null
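> > >>>>>
> > >>>>> Sketch of maintaining one such index entry on a doc write (illustrative
> > >>>>> Python only; the partial_filter_selector check and nested field paths
> > >>>>> are glossed over):
> > >>>>>
> > >>>>> import fdb, fdb.tuple
> > >>>>> fdb.api_version(600)
> > >>>>>
> > >>>>> @fdb.transactional
> > >>>>> def update_index(tr, dbname, idx_name, fields, doc):
> > >>>>>     # one entry per index: (field values..., doc id) -> null
> > >>>>>     values = tuple(doc.get(f) for f in fields)
> > >>>>>     if None in values:
> > >>>>>         return  # doc lacks an indexed field; skip it in this sketch
> > >>>>>     key = fdb.tuple.pack((dbname, "idx", idx_name) + values + (doc["_id"],))
> > >>>>>     tr[key] = b""
> > >>>>>
> > >>>>> # update_index(db, "mydb", "name-title-email",
> > >>>>> #              ("name", "title", "email"),
> > >>>>> #              {"_id": "foo", "name": "rosie revere",
> > >>>>> #               "title": "engineer", "email": "..."})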
> > >>>>>
> > >>>>> B.
> > >>>>>
> > >>>>> --
> > >>>>> Robert Samuel Newson
> > >>>>> rnew...@apache.org
> > >>>>>
> > >>>>> On Mon, 4 Feb 2019, at 19:13, Ilya Khlopotov wrote:
> > >>>>>>
> > >>>>>> I want to fix previous mistakes. I made two mistakes in the previous
> > >>>>>> calculations:
> > >>>>>> - I used 1Kb as the base size for calculating the expansion factor
> > >>>>>> (although we don't know the exact size of the original document)
> > >>>>>> - The expansion factor calculation included the number of revisions (it
> > >>>>>> shouldn't have)
> > >>>>>>
> > >>>>>> I'll focus on the flattened JSON docs model.
> > >>>>>>
> > >>>>>> The following formula was used in the previous calculation:
> > >>>>>>
> > >>>>>> storage_size_per_document = mapping_table_size*number_of_revisions +
> > >>>>>>   depth*number_of_paths*number_of_revisions +
> > >>>>>>   number_of_paths*value_size*number_of_revisions
> > >>>>>>
> > >>>>>> To clarify things a little bit, I want to calculate the space
> > >>>>>> requirement for a single revision this time:
> > >>>>>>
> > >>>>>> mapping_table_size = number_of_field_names*(field_name_length +
> > >>>>>>   4(integer size)) = 100 * (20 + 4) = 2400 bytes
> > >>>>>>
> > >>>>>> storage_size_per_document_per_revision_per_replica = mapping_table_size
> > >>>>>>   + depth*number_of_paths + value_size*number_of_paths
> > >>>>>>   = 2400 bytes + 10*1000 + 100*1000 = 112400 bytes ~= 110 Kb
> > >>>>>>
> > >>>>>> We can definitely reduce the requirement for the mapping table by
> > >>>>>> adopting rnewson's idea of a schema.
> > >>>>>>
> > >>>>>> On 2019/02/04 11:08:16, Ilya Khlopotov <iil...@apache.org> wrote:
> > >>>>>>> Hi Michael,
> > >>>>>>>
> > >>>>>>>> For example, here's a crazy thought:
> > >>>>>>>> Map every distinct occurrence of a key/value instance through a
> > >>>> crypto hash
> > >>>>>>>> function to get a set of hashes.
> > >>>>>>>>
> > >>>>>>>> These can be precomputed by Couch without any lookups in FDB.
> > >>>> These
> > >>>>>>>> will be spread all over kingdom come in FDB and not lend
> > >>>> themselves to
> > >>>>>>>> range search well.
> > >>>>>>>>
> > >>>>>>>> So what you do is index them for frequency of occurring in the
> > >>>> same set.
> > >>>>>>>> In essence, you 'bucket them' statistically, and that bucket id
> > >>>> becomes a
> > >>>>>>>> key prefix. A crypto hash value can be copied into more than one
> > >>>> bucket.
> > >>>>>>>> The {bucket_id}/{cryptohash} becomes a {val_id}
> > >>>>>>>
> > >>>>>>>> When writing a document, Couch submits the list/array of
> > >>>> cryptohash values
> > >>>>>>>> it computed to FDB and gets back the corresponding  {val_id} (the
> > >>>> id with
> > >>>>>>>> the bucket prefixed).  This can get somewhat expensive if there's
> > >>>> always a
> > >>>>>>>> lot of app local cache misses.
> > >>>>>>>>
> > >>>>>>>> A document's value is then a series of {val_id} arrays up to 100k
> > >>>> per
> > >>>>>>>> segment.
> > >>>>>>>>
> > >>>>>>>> When retrieving a document, you get the val_ids, find the distinct
> > >>>> buckets
> > >>>>>>>> and min/max entries for this doc, and then parallel query each
> > >>>> bucket while
> > >>>>>>>> reconstructing the document.
> > >>>>>>>
> > >>>>>>> Interesting idea. Let's try to think it through to see if we can
> > >>>> make it viable.
> > >>>>>>> Let's go through a hypothetical example. Input data for the example:
> > >>>>>>> - 1M of documents
> > >>>>>>> - each document is around 10Kb
> > >>>>>>> - each document consists of 1K of unique JSON paths
> > >>>>>>> - each document has 100 unique JSON field names
> > >>>>>>> - every scalar value is 100 bytes
> > >>>>>>> - 10% of unique JSON paths for every document already stored in
> > >>>> database under different doc or different revision of the current one
> > >>>>>>> - we assume 3 independent copies for every key-value pair in FDB
> > >>>>>>> - our hash key size is 32 bytes
> > >>>>>>> - let's assume we can determine if a key is already in storage
> > >>>>>>> without doing a query
> > >>>>>>> - 1% of paths are in cache (an unrealistic value; in real life the
> > >>>>>>> percentage is lower)
> > >>>>>>> - every JSON field name is 20 bytes
> > >>>>>>> - every JSON path is 10 levels deep
> > >>>>>>> - document key prefix length is 50
> > >>>>>>> - every document has 10 revisions
> > >>>>>>> Let's estimate the storage requirements and size of data we need to
> > >>>> transmit. The calculations are not exact.
> > >>>>>>> 1. storage_size_per_document (we cannot estimate exact numbers since
> > >>>> we don't know how FDB stores it)
> > >>>>>>> - 10 * ((10Kb - (10Kb * 10%)) + (1K - (1K * 10%)) * 32 bytes) =
> > >>>> 38Kb * 10 * 3 = 1140 Kb (11x)
> > >>>>>>> 2. number of independent keys to retrieve on document read
> > >>>> (non-range queries) per document
> > >>>>>>> - 1K - (1K * 1%) = 990
> > >>>>>>> 3. number of range queries: 0
> > >>>>>>> 4. data to transmit on read: (1K - (1K * 1%)) * (100 bytes + 32
> > >>>> bytes) = 102 Kb (10x)
> > >>>>>>> 5. read latency (we use 2ms per read based on numbers from
> > >>>> https://apple.github.io/foundationdb/performance.html)
> > >>>>>>>   - sequential: 990*2ms = 1980ms
> > >>>>>>>   - range: 0
> > >>>>>>> Let's compare these numbers with initial proposal (flattened JSON
> > >>>> docs without global schema and without cache)
> > >>>>>>> 1. storage_size_per_document
> > >>>>>>> - mapping table size: 100 * (20 + 4(integer size)) = 2400 bytes
> > >>>>>>> - key size: (10 * (4 + 1(delimiter))) + 50 = 100 bytes
> > >>>>>>> - storage_size_per_document: 2.4K*10 + 100*1K*10 + 1K*100*10 =
> > >>>> 2024K = 1976 Kb * 3 = 5930 Kb (59.3x)
> > >>>>>>> 2. number of independent keys to retrieve: 0-2 (depending on index
> > >>>> structure)
> > >>>>>>> 3. number of range queries: 1 (1001 of keys in result)
> > >>>>>>> 4. data to transmit on read: 24K + 1000*100 + 1000*100 = 23.6 Kb
> > >>>> (2.4x)
> > >>>>>>> 5. read latency (we use 2ms per read based on numbers from
> > >>>> https://apple.github.io/foundationdb/performance.html and estimate 
> > >>>> range
> > >>>> read performance based on numbers from
> > >>>> https://apple.github.io/foundationdb/benchmarking.html#single-core-read-test
> > >>>> )
> > >>>>>>> - range read performance: Given read performance is about 305,000
> > >>>> reads/second and range performance 3,600,000 keys/second we estimate 
> > >>>> range
> > >>>> performance to be 11.8x compared to read performance. If read 
> > >>>> performance
> > >>>> is 2ms, then range performance is 0.169ms (which is hard to believe).
> > >>>>>>> - sequential: 2 * 2 = 4ms
> > >>>>>>> - range: 0.169
> > >>>>>>>
> > >>>>>>> It looks like we are dealing with a tradeoff:
> > >>>>>>> - Map every distinct occurrence of a key/value instance through a
> > >>>> crypto hash:
> > >>>>>>> - 5.39x more disk space efficient
> > >>>>>>> - 474x slower
> > >>>>>>> - flattened JSON model
> > >>>>>>> - 5.39x less efficient in disk space
> > >>>>>>> - 474x faster
> > >>>>>>>
> > >>>>>>> In any case, this unscientific exercise was very helpful, since it
> > >>>>>>> uncovered the high cost in terms of disk space. 59.3x the original
> > >>>>>>> disk size is too much IMO.
> > >>>>>>>
> > >>>>>>> Are there any ways we can make Michael's model more performant?
> > >>>>>>>
> > >>>>>>> Also, I don't quite understand a few aspects of the global hash table
> > >>>>>>> proposal:
> > >>>>>>>
> > >>>>>>> 1. > - Map every distinct occurrence of a key/value instance through
> > >>>> a crypto hash function to get a set of hashes.
> > >>>>>>> I think we are talking only about scalar values here, i.e.
> > >>>>>>> `"#/foo.bar.baz": 123`, since I don't know how we can make it work for
> > >>>>>>> all possible JSON paths of `{"foo": {"bar": {"size": 12, "baz": 123}}}`:
> > >>>>>>> - foo
> > >>>>>>> - foo.bar
> > >>>>>>> - foo.bar.baz
> > >>>>>>>
> > >>>>>>> 2. how to delete documents
> > >>>>>>>
> > >>>>>>> Best regards,
> > >>>>>>> ILYA
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> On 2019/01/30 23:33:22, Michael Fair <mich...@daclubhouse.net>
> > >>>> wrote:
> > >>>>>>>> On Wed, Jan 30, 2019, 12:57 PM Adam Kocoloski <kocol...@apache.org>
> > >>>>>>>> wrote:
> > >>>>>>>>
> > >>>>>>>>> Hi Michael,
> > >>>>>>>>>
> > >>>>>>>>>> The trivial fix is to use DOCID/REVISIONID as DOC_KEY.
> > >>>>>>>>>
> > >>>>>>>>> Yes that’s definitely one way to address storage of edit
> > >>>> conflicts. I
> > >>>>>>>>> think there are other, more compact representations that we can
> > >>>> explore if
> > >>>>>>>>> we have this “exploded” data model where each scalar value maps
> > >>>> to an
> > >>>>>>>>> individual KV pair.
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> I agree. As I mentioned on the original thread, I see a scheme that
> > >>>>>>>> handles both conflicts and revisions where you only have to store the
> > >>>>>>>> most recent change to a field.  Like you suggested, multiple revisions
> > >>>>>>>> can share a key.  Which, in my mind's eye, further invites the
> > >>>>>>>> conflicts/revisions discussion along with the working-within-the-limits
> > >>>>>>>> discussion, because it seems to me they are all intrinsically related
> > >>>>>>>> as a "feature".
> > >>>>>>>>
> > >>>>>>>> Saying 'We'll break documents up into roughly 80k segments', then
> > >>>> trying to
> > >>>>>>>> overlay some kind of field sharing scheme for revisions/conflicts
> > >>>> doesn't
> > >>>>>>>> seem like it will work.
> > >>>>>>>>
> > >>>>>>>> I probably should have left out the trivial fix proposal as I
> > >>>> don't think
> > >>>>>>>> it's a feasible solution to actually use.
> > >>>>>>>>
> > >>>>>>>> The comment is more regarding that I do not see how this thread
> > >>>> can escape
> > >>>>>>>> including how to store/retrieve conflicts/revisions.
> > >>>>>>>>
> > >>>>>>>> For instance, the 'doc as individual fields' proposal lends itself to
> > >>>>>>>> value sharing across multiple documents (and I don't just mean
> > >>>>>>>> revisions of the same doc, I mean the same key/value instance could be
> > >>>>>>>> shared for every document).
> > >>>>>>>> However that's not really relevant if we're not considering the
> > >>>> amount of
> > >>>>>>>> shared information across documents in the storage scheme.
> > >>>>>>>>
> > >>>>>>>> Simply storing documents in <100k segments (perhaps in some kind of
> > >>>>>>>> compressed binary representation) to deal with that FDB limit
> > >>>> seems fine.
> > >>>>>>>> The only reason to consider doing something else is because of its
> > >>>> impact
> > >>>>>>>> to indexing, searches, reduce functions, revisions, on-disk size
> > >>>> impact,
> > >>>>>>>> etc.
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>>> I'm assuming the process will flatten the key paths of the
> > >>>> document into
> > >>>>>>>>> an array and then request the value of each key as multiple
> > >>>> parallel
> > >>>>>>>>> queries against FDB at once
> > >>>>>>>>>
> > >>>>>>>>> Ah, I think this is not one of Ilya’s assumptions. He’s trying
> > >>>> to design a
> > >>>>>>>>> model which allows the retrieval of a document with a single
> > >>>> range read,
> > >>>>>>>>> which is a good goal in my opinion.
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>> I am not sure I agree.
> > >>>>>>>>
> > >>>>>>>> Think of BitTorrent: a single range read should pull back the
> > >>>>>>>> structure of the document (the pieces to fetch), but not necessarily
> > >>>>>>>> the whole document.
> > >>>>>>>>
> > >>>>>>>> What if you already have a bunch of pieces in common with other
> > >>>> documents
> > >>>>>>>> locally (a repeated header/footer/ or type for example); and you
> > >>>> only need
> > >>>>>>>> to get a few pieces of data you don't already have?
> > >>>>>>>>
> > >>>>>>>> The real goal for Couch, as I see it, is to treat your document set
> > >>>>>>>> like the collection of structured information that it is.  In some
> > >>>>>>>> respects it is like an extension of your application's heap space for
> > >>>>>>>> structured objects, with efficient querying of that collection to get
> > >>>>>>>> back subsets of the data.
> > >>>>>>>>
> > >>>>>>>> Otherwise it seems more like a slightly upgraded file system plus
> > >>>> a fancy
> > >>>>>>>> grep/find like feature...
> > >>>>>>>>
> > >>>>>>>> The best way I see to unlock more features/power is to move towards a
> > >>>>>>>> more granular and efficient way to store and retrieve the scalar
> > >>>>>>>> values...
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> For example, here's a crazy thought:
> > >>>>>>>> Map every distinct occurrence of a key/value instance through a
> > >>>>>>>> crypto hash function to get a set of hashes.
> > >>>>>>>>
> > >>>>>>>> These can be precomputed by Couch without any lookups in FDB. They
> > >>>>>>>> will be spread all over kingdom come in FDB and not lend themselves
> > >>>>>>>> to range search well.
> > >>>>>>>>
> > >>>>>>>> So what you do is index them for frequency of occurring in the
> > >>>> same set.
> > >>>>>>>> In essence, you 'bucket them' statistically, and that bucket id
> > >>>> becomes a
> > >>>>>>>> key prefix. A crypto hash value can be copied into more than one
> > >>>> bucket.
> > >>>>>>>> The {bucket_id}/{cryptohash} becomes a {val_id}
> > >>>>>>>>
> > >>>>>>>> When writing a document, Couch submits the list/array of
> > >>>> cryptohash values
> > >>>>>>>> it computed to FDB and gets back the corresponding  {val_id} (the
> > >>>> id with
> > >>>>>>>> the bucket prefixed).  This can get somewhat expensive if there's
> > >>>> always a
> > >>>>>>>> lot of app local cache misses.
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> A document's value is then a series of {val_id} arrays up to 100k
> > >>>> per
> > >>>>>>>> segment.
> > >>>>>>>>
> > >>>>>>>> When retrieving a document, you get the val_ids, find the distinct
> > >>>> buckets
> > >>>>>>>> and min/max entries for this doc, and then parallel query each
> > >>>> bucket while
> > >>>>>>>> reconstructing the document.
> > >>>>>>>>
> > >>>>>>>> The values returned from the buckets query are the key/value
> > >>>> strings
> > >>>>>>>> required to reassemble this document.
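> > >>>>>>>>
> > >>>>>>>> To make that a little more concrete, here's a toy sketch of the write
> > >>>>>>>> path (Python, totally untested; the bucket assignment is hand-waved
> > >>>>>>>> as the first byte of the hash):
> > >>>>>>>>
> > >>>>>>>> import fdb, fdb.tuple, hashlib, json
> > >>>>>>>> fdb.api_version(600)
> > >>>>>>>>
> > >>>>>>>> def leaf_hash(path, value):
> > >>>>>>>>     # one crypto hash per key/value instance, computable with no
> > >>>>>>>>     # FDB lookups at all
> > >>>>>>>>     return hashlib.sha256(json.dumps([path, value]).encode()).digest()
> > >>>>>>>>
> > >>>>>>>> @fdb.transactional
> > >>>>>>>> def write_doc(tr, docid, leaves):
> > >>>>>>>>     val_ids = []
> > >>>>>>>>     for path, value in leaves:
> > >>>>>>>>         h = leaf_hash(path, value)
> > >>>>>>>>         bucket = h[0]  # toy bucketing; the real thing is statistical
> > >>>>>>>>         tr[fdb.tuple.pack(("vals", bucket, h))] = \
> > >>>>>>>>             json.dumps([path, value]).encode()
> > >>>>>>>>         val_ids.append((bucket, h))
> > >>>>>>>>     # the document itself is just the ordered list of val_ids
> > >>>>>>>>     # (it would be chunked if it grew past 100KB)
> > >>>>>>>>     tr[fdb.tuple.pack(("docs", docid, 0))] = \
> > >>>>>>>>         fdb.tuple.pack(sum(val_ids, ()))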
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> ----------
> > >>>>>>>> I put this forward primarily to highlight the idea that trying to
> > >>>>>>>> match the storage representation of documents in a straightforward
> > >>>>>>>> way to FDB keys to reduce query count might not be the most
> > >>>>>>>> performance-oriented approach.
> > >>>>>>>>
> > >>>>>>>> I'd much prefer a storage approach that reduced data duplication
> > >>>> and
> > >>>>>>>> enabled fast sub-document queries.
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> This clearly falls in the realm of what people want the 'use case'
> > >>>> of Couch
> > >>>>>>>> to be/become.  By giving Couch more access to sub-document
> > >>>> queries, I could
> > >>>>>>>> eventually see queries as complicated as GraphQL submitted to
> > >>>> Couch and
> > >>>>>>>> pulling back ad-hoc aggregated data across multiple documents in a
> > >>>> single
> > >>>>>>>> application layer request.
> > >>>>>>>>
> > >>>>>>>> Hehe - one way to look at the database of Couch documents is that
> > >>>>>>>> they are all conflict revisions of a single root empty document.  What
> > >>>>>>>> I mean by this is: consider the entire document store as one giant DAG
> > >>>>>>>> of key/value pairs, where even separate documents are still typically
> > >>>>>>>> related to each other.  For most applications there is a tremendous
> > >>>>>>>> amount of data redundancy between docs, and especially between
> > >>>>>>>> revisions of those docs...
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> And all this is a long way of saying "I think there could be a lot
> > >>>> of value
> > >>>>>>>> in assuming documents are 'assembled' from multiple queries to
> > >>>> FDB, with
> > >>>>>>>> local caching, instead of simply retrieved"
> > >>>>>>>>
> > >>>>>>>> Thanks, I hope I'm not the only outlier here thinking this way!?
> > >>>>>>>>
> > >>>>>>>> Mike :-)
> > >>>>>>>>
> > >>>>>>>
> > >>>>
> > >>
> >
