I like the idea that we'd reuse the same pattern (but perhaps not the same _code_) for doc bodies, revtree and attachments.
I hope we still get to delete couch_key_tree.erl, though.

--
Robert Samuel Newson
rnew...@apache.org

On Tue, 19 Feb 2019, at 17:03, Jan Lehnardt wrote:
> I like the idea from a “trying a simple thing first” perspective, but Nick’s points below are especially convincing to go with this for now.
>
> Best
> Jan
> —
>
> > On 19. Feb 2019, at 17:53, Nick Vatamaniuc <vatam...@gmail.com> wrote:
> >
> > Hi,
> >
> > Sorry for jumping in so late, I was following from the sidelines mostly. A lot of good discussion is happening and I am excited about the possibilities here.
> >
> > I do like the simpler "chunking" approach for a few reasons:
> >
> > * Most document bodies are probably going to be smaller than 100k, so in the majority of cases it would be one write / one read to update and fetch the document body.
> >
> > * We could reuse the chunking code for attachment handling and possibly revision key trees. So it's the general pattern of uploading chunks to some prefix, and when finished flipping an atomic toggle to make it current.
> >
> > * Do the same thing with revision trees and we could re-use the revision tree manipulation logic. That is, the key tree in most cases would be small enough to fit in 100k, but if it got huge it would get chunked. This would allow us to reuse all the battle-tested couch_key_tree code mostly as is. We even have property tests for it:
> > https://github.com/apache/couchdb/blob/master/src/couch/test/couch_key_tree_prop_tests.erl
> >
> > * It removes the need to explain the max exploded path length limitation to customers.
> >
> > Cheers,
> > -Nick
> >
> > On Tue, Feb 19, 2019 at 11:18 AM Robert Newson <rnew...@apache.org> wrote:
> >
> >> Hi,
> >>
> >> An alternative storage model that we should seriously consider is to follow our current approach in couch_file et al. Specifically, that the document _body_ is stored as an uninterpreted binary value. This would be much like the obvious plan for attachment storage: a key prefix that identifies the database and document, with the final item of that key tuple being an incrementing integer. Each of those keys has a binary value of up to 100k. Fetching all values with that key prefix, in fdb's natural ordering, will yield the full document body, which can be JSON decoded for further processing.
> >>
> >> I like this idea, and I like Adam's original proposal to explode documents into property paths. I have a slight preference for the simplicity of the idea in the previous paragraph, not least because it's close to what we do today. I also think it will be possible to migrate to alternative storage models in future, and FoundationDB's transaction support means we can do this migration seamlessly should we come to it.
> >>
> >> I'm very interested in knowing if anyone else is interested in going this simple, or considers it a wasted opportunity relative to the 'exploded' path.
> >>
> >> B.
> >>
> >> --
> >> Robert Samuel Newson
> >> rnew...@apache.org
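To make the chunked-binary layout above concrete, here is a minimal sketch (illustrative only, not project code) of the split/reassemble step: the encoded body is cut into values of at most 100,000 bytes, each destined for a key ending in an incrementing chunk index, and a prefix range read returns the chunks in order for concatenation. The module name, the chunk-size constant, and the exact key layout are assumptions.

    %% Sketch: split an encoded document body into <=100k binary chunks and
    %% join them back together. In FDB each chunk would be written under a
    %% tuple-encoded key such as {DocSubspace, DocId, ChunkIx}, so a single
    %% prefix range read yields the chunks in index order for reassembly.
    -module(doc_chunks).
    -export([split/1, join/1]).

    -define(CHUNK_SIZE, 100000). %% stay at or under FDB's 100KB value limit

    %% Body is the JSON-encoded document body as a binary.
    %% Returns [{ChunkIx, ChunkBin}] ready to be written as separate KVs.
    split(Body) when is_binary(Body) ->
        split(Body, 0, []).

    split(Body, Ix, Acc) when byte_size(Body) =< ?CHUNK_SIZE ->
        lists:reverse([{Ix, Body} | Acc]);
    split(Body, Ix, Acc) ->
        <<Chunk:?CHUNK_SIZE/binary, Rest/binary>> = Body,
        split(Rest, Ix + 1, [{Ix, Chunk} | Acc]).

    %% Chunks as returned by a range read over the document's chunk prefix,
    %% already sorted by chunk index.
    join(Chunks) ->
        iolist_to_binary([Bin || {_Ix, Bin} <- Chunks]).

For typical document bodies the chunk writes would fit in a single transaction; for larger blobs such as attachments, Nick's "upload to a prefix, then flip an atomic toggle" pattern covers the multi-transaction case.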
> >>
> >> On Mon, 4 Feb 2019, at 19:59, Robert Newson wrote:
> >>> I've been remiss here in not posting the data model ideas that IBM worked up while we were thinking about using FoundationDB, so I'm posting them now. This is Adam Kocoloski's original work, I am just transcribing it, and this is the context that the folks from the IBM side came in with, for full disclosure.
> >>>
> >>> Basics
> >>>
> >>> 1. All CouchDB databases are inside a Directory
> >>> 2. Each CouchDB database is a Directory within that Directory
> >>> 3. It's possible to list all subdirectories of a Directory, so `_all_dbs` is the list of directories from 1.
> >>> 4. Each Directory representing a CouchDB database has several Subspaces:
> >>> 4a. by_id/ doc subspace: actual document contents
> >>> 4b. by_seq/versionstamp subspace: for the _changes feed
> >>> 4c. index_definitions, indexes, ...
> >>>
> >>> JSON Mapping
> >>>
> >>> A hierarchical JSON object naturally maps to multiple KV pairs in FDB:
> >>>
> >>> {
> >>>   “_id”: “foo”,
> >>>   “owner”: “bob”,
> >>>   “mylist”: [1,3,5],
> >>>   “mymap”: {
> >>>     “blue”: “#0000FF”,
> >>>     “red”: “#FF0000”
> >>>   }
> >>> }
> >>>
> >>> maps to
> >>>
> >>> (“foo”, “owner”) = “bob”
> >>> (“foo”, “mylist”, 0) = 1
> >>> (“foo”, “mylist”, 1) = 3
> >>> (“foo”, “mylist”, 2) = 5
> >>> (“foo”, “mymap”, “blue”) = “#0000FF”
> >>> (“foo”, “mymap”, “red”) = “#FF0000”
> >>>
> >>> NB: this means that the 100KB limit applies to individual leaves in the JSON object, not the entire doc.
> >>>
> >>> Edit Conflicts
> >>>
> >>> We need to account for the presence of conflicts at various levels of the doc due to replication.
> >>>
> >>> The proposal is to create a special value indicating that the subtree below our current cursor position is in an unresolvable conflict, then add additional KV pairs below it to describe the conflicting entries.
> >>>
> >>> The KV data model allows us to store these efficiently and minimize duplication of data. A document with these two conflicts:
> >>>
> >>> {
> >>>   “_id”: “foo”,
> >>>   “_rev”: “1-abc”,
> >>>   “owner”: “alice”,
> >>>   “active”: true
> >>> }
> >>> {
> >>>   “_id”: “foo”,
> >>>   “_rev”: “1-def”,
> >>>   “owner”: “bob”,
> >>>   “active”: true
> >>> }
> >>>
> >>> could be stored thus:
> >>>
> >>> (“foo”, “active”) = true
> >>> (“foo”, “owner”) = kCONFLICT
> >>> (“foo”, “owner”, “1-abc”) = “alice”
> >>> (“foo”, “owner”, “1-def”) = “bob”
> >>>
> >>> So long as `kCONFLICT` is set at the top of the conflicting subtree, this representation can handle conflicts of different data types as well.
> >>>
> >>> Missing fields need to be handled explicitly:
> >>>
> >>> {
> >>>   “_id”: “foo”,
> >>>   “_rev”: “1-abc”,
> >>>   “owner”: “alice”,
> >>>   “active”: true
> >>> }
> >>>
> >>> {
> >>>   “_id”: “foo”,
> >>>   “_rev”: “1-def”,
> >>>   “owner”: {
> >>>     “name”: “bob”,
> >>>     “email”: “b...@example.com”
> >>>   }
> >>> }
> >>>
> >>> could be stored thus:
> >>>
> >>> (“foo”, “active”) = kCONFLICT
> >>> (“foo”, “active”, “1-abc”) = true
> >>> (“foo”, “active”, “1-def”) = kMISSING
> >>> (“foo”, “owner”) = kCONFLICT
> >>> (“foo”, “owner”, “1-abc”) = “alice”
> >>> (“foo”, “owner”, “1-def”, “name”) = “bob”
> >>> (“foo”, “owner”, “1-def”, “email”) = ...
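For comparison, a minimal sketch of the "exploded" mapping just described, flattening a decoded JSON body into {Path, ScalarValue} pairs where each Path would become part of a tuple key in the by_id subspace. It assumes jiffy-style terms (objects as {Proplist}, strings as binaries); the module and function names are illustrative, not CouchDB code.

    %% Sketch only: flatten a decoded JSON document into {KeyPath, Value} pairs.
    %% Assumes jiffy-style terms: objects are {[{Key, Val}, ...]}, arrays are
    %% lists, strings are binaries, and true/false/null are atoms (scalars here).
    -module(doc_paths).
    -export([explode/2]).

    explode(DocId, Body) ->
        lists:reverse(explode(Body, [DocId], [])).

    %% JSON object: recurse into each member, extending the path with the key.
    explode({Props}, Path, Acc) when is_list(Props) ->
        lists:foldl(fun({K, V}, A) -> explode(V, [K | Path], A) end, Acc, Props);
    %% JSON array: recurse into each element, extending the path with its index.
    explode(List, Path, Acc) when is_list(List) ->
        {_, Acc1} = lists:foldl(
                      fun(V, {Ix, A}) -> {Ix + 1, explode(V, [Ix | Path], A)} end,
                      {0, Acc}, List),
        Acc1;
    %% Scalar leaf: emit one KV pair for the accumulated path.
    explode(Scalar, Path, Acc) ->
        [{lists:reverse(Path), Scalar} | Acc].

Calling doc_paths:explode(<<"foo">>, Body) on the members other than “_id” in the example above yields [{[<<"foo">>, <<"owner">>], <<"bob">>}, {[<<"foo">>, <<"mylist">>, 0], 1}, ...], matching the (“foo”, “owner”) = “bob” rows sketched above.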
> >>>
> >>> Revision Metadata
> >>>
> >>> * CouchDB uses a hash history for revisions
> >>> ** Each edit is identified by the hash of the content of the edit, including the base revision against which it was applied
> >>> ** Individual edit branches are bounded in length but the number of branches is potentially unbounded
> >>>
> >>> * Size limits preclude us from storing the entire key tree as a single value; in pathological situations the tree could exceed 100KB (each entry is > 16 bytes)
> >>>
> >>> * Store each edit branch as a separate KV, including deleted status, in a special subspace
> >>>
> >>> * Structure the key representation so that the “winning” revision can be automatically retrieved in a limit=1 key range operation
> >>>
> >>> (“foo”, “_meta”, “deleted=false”, 1, “def”) = []
> >>> (“foo”, “_meta”, “deleted=false”, 4, “bif”) = [“3-baz”, “2-bar”, “1-foo”]  <-- winner
> >>> (“foo”, “_meta”, “deleted=true”, 3, “abc”) = [“2-bar”, “1-foo”]
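One way to read the "limit=1 key range operation" bullet above, sketched under the assumption of an order-preserving key encoding (for example the FDB tuple layer) and with the deleted flag encoded as a boolean rather than the literal “deleted=…” strings shown: if each edit branch is keyed as (NotDeleted, RevPos, RevHash), the largest key in the subspace is the winner, so a reverse range read with limit=1 returns it. The module below only mirrors that comparison in plain Erlang; it is not the project's implementation.

    %% Sketch: choose the winning revision among edit-branch leaf entries the
    %% same way a reverse, limit=1 range read over (NotDeleted, RevPos, RevHash)
    %% keys would: deleted leaves sort first (false < true), then lower
    %% positions, so the maximum element is the winner.
    -module(rev_winner).
    -export([winner/1]).

    %% Branches is a non-empty list of {NotDeleted, RevPos, RevHash} tuples.
    winner([_ | _] = Branches) ->
        lists:max(Branches).

For the example above, winner([{true, 1, <<"def">>}, {true, 4, <<"bif">>}, {false, 3, <<"abc">>}]) returns {true, 4, <<"bif">>}, the row marked as the winner.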
> >>>
> >>> Changes Feed
> >>>
> >>> * FDB supports a concept called a versionstamp — a 10 byte, unique, monotonically (but not sequentially) increasing value for each committed transaction. The first 8 bytes are the committed version of the database. The last 2 bytes are monotonic in the serialization order for transactions.
> >>>
> >>> * A transaction can specify a particular index into a key where the following 10 bytes will be overwritten by the versionstamp at commit time
> >>>
> >>> * A subspace keyed on versionstamp naturally yields a _changes feed
> >>>
> >>> by_seq subspace
> >>> (“versionstamp1”) = (“foo”, “1-abc”)
> >>> (“versionstamp4”) = (“bar”, “4-def”)
> >>>
> >>> by_id subspace
> >>> (“bar”, “_vsn”) = “versionstamp4”
> >>> ...
> >>> (“foo”, “_vsn”) = “versionstamp1”
> >>>
> >>> JSON Indexes
> >>>
> >>> * “Mango” JSON indexes are defined by
> >>> ** a list of field names, each of which may be nested,
> >>> ** an optional partial_filter_selector which constrains the set of docs that contribute,
> >>> ** an optional name defined by the ddoc field (the name is auto-generated if not supplied)
> >>>
> >>> * Store index definitions in a single subspace to aid query planning
> >>> ** ((person,name), title, email) = (“name-title-email”, “{“student”: true}”)
> >>> ** Store the values for each index in a dedicated subspace, adding the document ID as the last element in the tuple
> >>> *** (“rosie revere”, “engineer”, “ro...@example.com”, “foo”) = null
> >>>
> >>> B.
> >>>
> >>> --
> >>> Robert Samuel Newson
> >>> rnew...@apache.org
> >>>
> >>> On Mon, 4 Feb 2019, at 19:13, Ilya Khlopotov wrote:
> >>>>
> >>>> I want to fix previous mistakes. I made two mistakes in the previous calculations:
> >>>> - I used 1Kb as the base size for calculating the expansion factor (although we don't know the exact size of the original document)
> >>>> - The expansion factor calculation included the number of revisions (it shouldn't)
> >>>>
> >>>> I'll focus on the flattened JSON docs model.
> >>>>
> >>>> The following formula was used in the previous calculation:
> >>>> storage_size_per_document = mapping_table_size*number_of_revisions + depth*number_of_paths*number_of_revisions + number_of_paths*value_size*number_of_revisions
> >>>>
> >>>> To clarify things a little bit, I want to calculate the space requirement for a single revision this time:
> >>>> mapping_table_size = number_of_field_names * (field_name_length + 4 (integer size)) = 100 * (20 + 4) = 2400 bytes
> >>>> storage_size_per_document_per_revision_per_replica = mapping_table_size + depth*number_of_paths + value_size*number_of_paths = 2400 bytes + 10*1000 + 1000*100 = 112400 bytes ~= 110 Kb
> >>>>
> >>>> We definitely can reduce the requirement for the mapping table by adopting rnewson's idea of a schema.
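Since the formula above is easy to misread when flattened into one line, here is the same per-revision arithmetic spelled out under the thread's assumptions (100 field names of 20 bytes each, 1K paths of depth 10, 100-byte values); the module and names are illustrative only.

    %% Sketch: re-derive the per-revision, per-replica estimate for the
    %% flattened model from the assumptions used in this thread.
    -module(size_estimate).
    -export([per_revision_bytes/0]).

    per_revision_bytes() ->
        NumFieldNames   = 100,  %% unique JSON field names per document
        FieldNameLength = 20,   %% bytes per field name
        IntSize         = 4,    %% integer id stored alongside each name
        Depth           = 10,   %% levels per JSON path
        NumPaths        = 1000, %% unique JSON paths per document
        ValueSize       = 100,  %% bytes per scalar value
        MappingTable = NumFieldNames * (FieldNameLength + IntSize), %% 2400
        MappingTable + Depth * NumPaths + ValueSize * NumPaths.     %% 112400

per_revision_bytes() returns 112400, i.e. roughly 110 KB per revision per replica, the figure quoted above.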
> >>>>
> >>>> On 2019/02/04 11:08:16, Ilya Khlopotov <iil...@apache.org> wrote:
> >>>>> Hi Michael,
> >>>>>
> >>>>>> For example, here's a crazy thought:
> >>>>>> Map every distinct occurrence of a key/value instance through a crypto hash function to get a set of hashes.
> >>>>>>
> >>>>>> These can be precomputed by Couch without any lookups in FDB. These will be spread all over kingdom come in FDB and not lend themselves to range search well.
> >>>>>>
> >>>>>> So what you do is index them for frequency of occurring in the same set. In essence, you 'bucket them' statistically, and that bucket id becomes a key prefix. A crypto hash value can be copied into more than one bucket. The {bucket_id}/{cryptohash} becomes a {val_id}.
> >>>>>>
> >>>>>> When writing a document, Couch submits the list/array of cryptohash values it computed to FDB and gets back the corresponding {val_id} (the id with the bucket prefixed). This can get somewhat expensive if there's always a lot of app-local cache misses.
> >>>>>>
> >>>>>> A document's value is then a series of {val_id} arrays up to 100k per segment.
> >>>>>>
> >>>>>> When retrieving a document, you get the val_ids, find the distinct buckets and min/max entries for this doc, and then parallel query each bucket while reconstructing the document.
> >>>>>
> >>>>> Interesting idea. Let's try to think it through to see if we can make it viable.
> >>>>> Let's go through a hypothetical example. Input data for the example:
> >>>>> - 1M documents
> >>>>> - each document is around 10Kb
> >>>>> - each document consists of 1K unique JSON paths
> >>>>> - each document has 100 unique JSON field names
> >>>>> - every scalar value is 100 bytes
> >>>>> - 10% of the unique JSON paths for every document are already stored in the database under a different doc or a different revision of the current one
> >>>>> - we assume 3 independent copies of every key-value pair in FDB
> >>>>> - our hash key size is 32 bytes
> >>>>> - let's assume we can determine whether a key is already in storage without doing a query
> >>>>> - 1% of paths are in cache (an unrealistic value; in real life the percentage is lower)
> >>>>> - every JSON field name is 20 bytes
> >>>>> - every JSON path is 10 levels deep
> >>>>> - the document key prefix length is 50
> >>>>> - every document has 10 revisions
> >>>>> Let's estimate the storage requirements and the size of data we need to transmit. The calculations are not exact.
> >>>>> 1. storage_size_per_document (we cannot estimate exact numbers since we don't know how FDB stores it)
> >>>>> - 10 * ((10Kb - (10Kb * 10%)) + (1K - (1K * 10%)) * 32 bytes) = 38Kb * 10 * 3 = 1140 Kb (11x)
> >>>>> 2. number of independent keys to retrieve on document read (non-range queries) per document
> >>>>> - 1K - (1K * 1%) = 990
> >>>>> 3. number of range queries: 0
> >>>>> 4. data to transmit on read: (1K - (1K * 1%)) * (100 bytes + 32 bytes) = 102 Kb (10x)
> >>>>> 5. read latency (we use 2ms per read based on numbers from https://apple.github.io/foundationdb/performance.html)
> >>>>> - sequential: 990*2ms = 1980ms
> >>>>> - range: 0
> >>>>> Let's compare these numbers with the initial proposal (flattened JSON docs without a global schema and without a cache):
> >>>>> 1. storage_size_per_document
> >>>>> - mapping table size: 100 * (20 + 4(integer size)) = 2400 bytes
> >>>>> - key size: (10 * (4 + 1(delimiter))) + 50 = 100 bytes
> >>>>> - storage_size_per_document: 2.4K*10 + 100*1K*10 + 1K*100*10 = 2024K = 1976 Kb * 3 = 5930 Kb (59.3x)
> >>>>> 2. number of independent keys to retrieve: 0-2 (depending on index structure)
> >>>>> 3. number of range queries: 1 (1001 keys in the result)
> >>>>> 4. data to transmit on read: 24K + 1000*100 + 1000*100 = 23.6 Kb (2.4x)
> >>>>> 5. read latency (we use 2ms per read based on numbers from https://apple.github.io/foundationdb/performance.html and estimate range read performance based on numbers from https://apple.github.io/foundationdb/benchmarking.html#single-core-read-test)
> >>>>> - range read performance: Given that read performance is about 305,000 reads/second and range performance is 3,600,000 keys/second, we estimate range performance to be 11.8x read performance. If read performance is 2ms, then range performance is 0.169ms (which is hard to believe).
> >>>>> - sequential: 2 * 2 = 4ms
> >>>>> - range: 0.169
> >>>>>
> >>>>> It looks like we are dealing with a tradeoff:
> >>>>> - Map every distinct occurrence of a key/value instance through a crypto hash:
> >>>>>   - 5.39x more disk space efficient
> >>>>>   - 474x slower
> >>>>> - flattened JSON model:
> >>>>>   - 5.39x less efficient in disk space
> >>>>>   - 474x faster
> >>>>>
> >>>>> In any case this unscientific exercise was very helpful, since it uncovered the high cost in terms of disk space: 59.3x the original disk size is too much IMO.
> >>>>>
> >>>>> Are there any ways we can make Michael's model more performant?
> >>>>>
> >>>>> Also, I don't quite understand a few aspects of the global hash table proposal:
> >>>>>
> >>>>> 1. > - Map every distinct occurrence of a key/value instance through a crypto hash function to get a set of hashes.
> >>>>> I think we are talking only about scalar values here, i.e. `"#/foo.bar.baz": 123`, since I don't know how we could make it work for all possible JSON paths of `{"foo": {"bar": {"size": 12, "baz": 123}}}`:
> >>>>> - foo
> >>>>> - foo.bar
> >>>>> - foo.bar.baz
> >>>>>
> >>>>> 2. how to delete documents
> >>>>>
> >>>>> Best regards,
> >>>>> ILYA
> >>>>>
> >>>>> On 2019/01/30 23:33:22, Michael Fair <mich...@daclubhouse.net> wrote:
> >>>>>> On Wed, Jan 30, 2019, 12:57 PM Adam Kocoloski <kocol...@apache.org> wrote:
> >>>>>>
> >>>>>>> Hi Michael,
> >>>>>>>
> >>>>>>>> The trivial fix is to use DOCID/REVISIONID as DOC_KEY.
> >>>>>>>
> >>>>>>> Yes, that’s definitely one way to address storage of edit conflicts. I think there are other, more compact representations that we can explore if we have this “exploded” data model where each scalar value maps to an individual KV pair.
> >>>>>>
> >>>>>> I agree. As I mentioned on the original thread, I see a scheme that handles both conflicts and revisions where you only have to store the most recent change to a field. Like you suggested, multiple revisions can share a key. Which, in my mind's eye, further invites the conflicts/revisions discussion along with the working-within-the-limits discussion, because it seems to me they are all intrinsically related as a "feature".
> >>>>>>
> >>>>>> Saying 'We'll break documents up into roughly 80k segments', then trying to overlay some kind of field sharing scheme for revisions/conflicts, doesn't seem like it will work.
> >>>>>>
> >>>>>> I probably should have left out the trivial fix proposal as I don't think it's a feasible solution to actually use.
> >>>>>>
> >>>>>> The comment is more regarding that I do not see how this thread can escape including how to store/retrieve conflicts/revisions.
> >>>>>>
> >>>>>> For instance, the 'doc as individual fields' proposal lends itself to value sharing across multiple documents (and I don't just mean revisions of the same doc, I mean the same key/value instance could be shared for every document). However, that's not really relevant if we're not considering the amount of shared information across documents in the storage scheme.
> >>>>>>
> >>>>>> Simply storing documents in <100k segments (perhaps in some kind of compressed binary representation) to deal with that FDB limit seems fine. The only reason to consider doing something else is because of its impact on indexing, searches, reduce functions, revisions, on-disk size, etc.
> >>>>>>
> >>>>>>>> I'm assuming the process will flatten the key paths of the document into an array and then request the value of each key as multiple parallel queries against FDB at once
> >>>>>>>
> >>>>>>> Ah, I think this is not one of Ilya’s assumptions. He’s trying to design a model which allows the retrieval of a document with a single range read, which is a good goal in my opinion.
> >>>>>>
> >>>>>> I am not sure I agree.
> >>>>>>
> >>>>>> Think of BitTorrent: a single range read should pull back the structure of the document (the pieces to fetch), but not necessarily the whole document.
> >>>>>>
> >>>>>> What if you already have a bunch of pieces in common with other documents locally (a repeated header/footer or type, for example) and you only need to get a few pieces of data you don't already have?
> >>>>>>
> >>>>>> The real goal for Couch, as I see it, is to treat your document set like the collection of structured information that it is, in some respects like an extension of your application's heap space for structured objects, and to efficiently query that collection to get back subsets of the data.
> >>>>>>
> >>>>>> Otherwise it seems more like a slightly upgraded file system plus a fancy grep/find-like feature...
> >>>>>>
> >>>>>> The best way I see to unlock more features/power is to move towards a more granular and efficient way to store and retrieve the scalar values...
> >>>>>>
> >>>>>> For example, here's a crazy thought:
> >>>>>> Map every distinct occurrence of a key/value instance through a crypto hash function to get a set of hashes.
> >>>>>>
> >>>>>> These can be precomputed by Couch without any lookups in FDB. These will be spread all over kingdom come in FDB and not lend themselves to range search well.
> >>>>>>
> >>>>>> So what you do is index them for frequency of occurring in the same set. In essence, you 'bucket them' statistically, and that bucket id becomes a key prefix. A crypto hash value can be copied into more than one bucket. The {bucket_id}/{cryptohash} becomes a {val_id}.
> >>>>>>
> >>>>>> When writing a document, Couch submits the list/array of cryptohash values it computed to FDB and gets back the corresponding {val_id} (the id with the bucket prefixed). This can get somewhat expensive if there's always a lot of app-local cache misses.
> >>>>>>
> >>>>>> A document's value is then a series of {val_id} arrays, up to 100k per segment.
> >>>>>>
> >>>>>> When retrieving a document, you get the val_ids, find the distinct buckets and min/max entries for this doc, and then parallel-query each bucket while reconstructing the document.
> >>>>>>
> >>>>>> The values returned from the buckets query are the key/value strings required to reassemble this document.
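To make the first step of this proposal concrete, here is a minimal sketch of the content-addressing part only: hashing each exploded key/value leaf into a 32-byte id. The statistical bucketing by co-occurrence is the open question in the idea, so it is stubbed out with a placeholder (the first byte of the hash standing in for a bucket id); everything here is an illustrative assumption, not a worked-out design.

    %% Sketch: the hashing step of the content-addressed "val_id" idea.
    %% Bucket assignment is deliberately a stub; a real scheme would pick
    %% buckets based on which values tend to occur together.
    -module(val_ids).
    -export([val_id/2]).

    val_id(Path, Value) ->
        %% term_to_binary is a stand-in serialization for the sketch; a real
        %% design would need a stable, canonical encoding across nodes.
        Hash = crypto:hash(sha256, term_to_binary({Path, Value})),
        <<Bucket:8, _Rest/binary>> = Hash,
        {Bucket, Hash}.

A document would then be stored as ordered segments of these val_ids plus one row per distinct value, which is where both the deduplication and the extra round trips in Ilya's estimate come from.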
> >>>>>>
> >>>>>> ----------
> >>>>>> I put this forward primarily to highlight the idea that trying to match the storage representation of documents in a straightforward way to FDB keys, to reduce query count, might not be the most performance-oriented approach.
> >>>>>>
> >>>>>> I'd much prefer a storage approach that reduced data duplication and enabled fast sub-document queries.
> >>>>>>
> >>>>>> This clearly falls in the realm of what people want the 'use case' of Couch to be/become. By giving Couch more access to sub-document queries, I could eventually see queries as complicated as GraphQL submitted to Couch and pulling back ad-hoc aggregated data across multiple documents in a single application-layer request.
> >>>>>>
> >>>>>> Hehe - one way to look at the database of Couch documents is that they are all conflict revisions of the single root empty document. What I mean by this is: consider thinking of the entire document store as one giant DAG of key/value pairs, where even separate documents are still typically related to each other. For most applications there is a tremendous amount of data redundancy between docs, and especially between revisions of those docs...
> >>>>>>
> >>>>>> And all this is a long way of saying "I think there could be a lot of value in assuming documents are 'assembled' from multiple queries to FDB, with local caching, instead of simply retrieved".
> >>>>>>
> >>>>>> Thanks, I hope I'm not the only outlier here thinking this way!?
> >>>>>>
> >>>>>> Mike :-)