I am OK not doing term sharing. I would like to keep these two performance efficiencies discussed earlier:
1) A client can retrieve the “winning” revision of the document (and only that revision) in a single roundtrip knowing only the document ID 2) A client can update an edit branch of a document without retrieving the body of the current revision These are two very common access patterns and we should optimize for them. I don’t think we quite have that sorted out yet. Adam > On Feb 7, 2019, at 3:36 PM, Ilya Khlopotov <iil...@apache.org> wrote: > > We cannot do simple thing if we want to support sharing of JSON terms. I > think if we want the simplest path we should move sharing out of the scope. > The problem with sharing is we need to know the location of shared terms when > we do write. This means that we have to read full document on every write. > There might be tricks to replace full document read with some sort of > hierarchical signature or sketch of a document. However these tricks do not > fall into simplest solution category. We need to choose the design goals: > - simple > - performance > - reduced storage overhead > > best regards, > iilyak > > On 2019/02/07 12:45:34, Garren Smith <gar...@apache.org> wrote: >> I’m also in favor of keeping it really simple and then testing and >> measuring it. >> >> What is the best way to measure that we have something that works? I’m not >> sure just relying on our current tests will prove that? Should we define >> and build some more complex situations e.g docs with lots of conflicts or >> docs with wide revisions and make sure we can solve for those? >> >> On Thu, Feb 7, 2019 at 12:33 PM Jan Lehnardt <j...@apache.org> wrote: >> >>> I’m also very much in favour with starting with the simplest thing that >>> can possibly work and doesn’t go against the advertised best practices of >>> FoundationDB. Let’s get that going and get a feel for how it all works >>> together, before trying to optimise things we can’t measure yet. >>> >>> Best >>> Jan >>> — >>> >>>> On 6. Feb 2019, at 16:58, Robert Samuel Newson <rnew...@apache.org> >>> wrote: >>>> >>>> Hi, >>>> >>>> With the Redwood storage engine under development and with prefix >>> elision part of its design, I don’t think we should get too hung up on >>> adding complications and indirections in the key space just yet. We haven’t >>> written a line of code or run any tests, this is premature optimisation. >>>> >>>> I’d like to focus on the simplest solution that yields all required >>> properties. We can embellish later (if warranted). >>>> >>>> I am intrigued by all the ideas that might allow us cheaper inserts and >>> updates than the current code where there are multiple edit branches in the >>> stored document. >>>> >>>> B. >>>> >>>>> On 6 Feb 2019, at 02:18, Ilya Khlopotov <iil...@apache.org> wrote: >>>>> >>>>> While reading Adam's proposal I came to realize that: we don't have to >>> calculate winning revision at read time. >>>>> Since FDB's transactions are atomic we can calculate it when we write. >>> This means we can just write latest values into separate range. This makes >>> lookup of latest version fast. >>>>> Another realization is if we want to share values for some json paths >>> we would have to introduce a level of indirection. >>>>> Bellow is the data model inspired by Adam's idea to share json_paths. >>> In this model the json_path is stored in the revision where it was first >>> added (we call that revision an owner of a json_path). The values for >>> json_path key can be scalar values, parts of scalar values or pointers to >>> owner location. >>>>> The below snippets are sketches of transactions. >>>>> The transactions will include updates to other keys as needed >>> (`external_size`, `by_seq` and so on). The revision tree management is not >>> covered yet. >>>>> The `rev -> vsn` indirection is not strictly required. It is added >>> because it saves some space since `rev` is a long string and `vsn` is FDB >>> versionstamp of fixed size. >>>>> >>>>> - `{NS} / {docid} / _by_rev / {rev} = vsn` >>>>> - `{NS} / {docid} / _used_by / {json_path} / {another_vsn} = NIL` >>>>> - `{NS} / {docid} / _data / {json_path} = latest_value | part` >>>>> - `{NS} / {docid} / {vsn} / _data / {json_path} = value | part | >>> {another_vsn}` >>>>> >>>>> ``` >>>>> write(txn, doc_id, prev_rev, json): >>>>> txn.add_write_conflict_key("{NS} / {doc_id} / _rev") >>>>> rev = generate_new_rev() >>>>> txn["{NS} / {docid} / _by_rev / {rev}"] = vsn >>>>> for every json_path in flattened json >>>>> - {NS} / {docid} / _used_by / {json_path} / {another_vsn} = NIL >>>>> if rev is HEAD: >>>>> # this range contains values for all json paths for the latest >>> revision (read optimization) >>>>> - {NS} / {docid} / _data / {json_path} = latest_value | part >>>>> - {NS} / {docid} / {vsn} / _data / {json_path} = value | part | >>> {another_vsn} >>>>> txn["{NS} / {doc_id} / _rev"] = rev >>>>> >>>>> get_current(txn, doc_id): >>>>> # there is no sharing of json_paths in this range (read optimization) >>>>> txn.get_range("{NS} / {docid} / _data / 0x00", "{NS} / {docid} / _data >>> / 0xFF" ) >>>>> >>>>> get_revision(txn, doc_id, rev): >>>>> vsn = txn["{NS} / {docid} / _by_rev / {rev}"] >>>>> json_paths = txn.get_range("{NS} / {vsn} / {docid} / _data / 0x00", >>> "{NS} / {vsn} / {docid} / _data / 0xFF" ) >>>>> for every json_path in json_paths: >>>>> if value has type vsn: >>>>> another_vsn = value >>>>> value = txn["{NS} / {docid} / {another_vsn} / _data / >>> {json_path}"] >>>>> result[json_path] = value >>>>> >>>>> delete_revision(txn, doc_id, rev): >>>>> vsn = txn["{NS} / {docid} / _by_rev / {rev}"] >>>>> json_paths = txn.get_range("{NS} / {vsn} / {docid} / _data / 0x00", >>> "{NS} / {vsn} / {docid} / _data / 0xFF" ) >>>>> for every json_path in json_paths: >>>>> if value has type vsn: >>>>> # remove reference to deleted revision from the owner >>>>> del txn[{NS} / {docid} / _used_by / {json_path} / {vsn}] >>>>> # check if deleted revision of json_path is not used by anything else >>>>> if txn.get_range("{NS} / {docid} / _used_by / {json_path} / {vsn}", >>> limit=1) == []: >>>>> del txn["{NS} / {docid} / {vsn} / _data / {json_path}"] >>>>> if vsn is HEAD: >>>>> copy range for winning revision into "{NS} / {docid} / _data / >>> {json_path}" >>>>> ``` >>>>> >>>>> best regards, >>>>> iilyak >>>>> >>>>> On 2019/02/04 23:22:09, Adam Kocoloski <kocol...@apache.org> wrote: >>>>>> I think it’s fine to start a focused discussion here as it might help >>> inform some of the broader debate over in that thread. >>>>>> >>>>>> As a reminder, today CouchDB writes the entire body of each document >>> revision on disk as a separate blob. Edit conflicts that have common fields >>> between them do not share any storage on disk. The revision tree is encoded >>> into a compact format and a copy of it is stored directly in both the by_id >>> tree and the by_seq tree. Each leaf entry in the revision tree contain a >>> pointer to the position of the associated doc revision on disk. >>>>>> >>>>>> As a further reminder, CouchDB 2.x clusters can generate edit conflict >>> revisions just from multiple clients concurrently updating the same >>> document in a single cluster. This won’t happen when FoundationDB is >>> running under the hood, but users who deploy multiple CouchDB or PouchDB >>> servers and replicate between them can of course still produce conflicts >>> just like they could in CouchDB 1.x, so we need a solution. >>>>>> >>>>>> Let’s consider the two sub-topics separately: 1) storage of edit >>> conflict bodies and 2) revision trees >>>>>> >>>>>> ## Edit Conflict Storage >>>>>> >>>>>> The simplest possible solution would be to store each document >>> revision separately, like we do today. We could store document bodies with >>> (“docid”, “revid”) as the key prefix, and each transaction could clear the >>> key range associated with the base revision against which the edit is being >>> attempted. This would work, but I think we can try to be a bit more clever >>> and save on storage space given that we’re splitting JSON documents into >>> multiple KV pairs. >>>>>> >>>>>> One thought I’d had is to introduce a special enum Value which >>> indicates that the subtree “beneath” the given Key is in conflict. For >>> example, consider the documents >>>>>> >>>>>> { >>>>>> “_id”: “foo”, >>>>>> “_rev”: “1-abc”, >>>>>> “owner”: “alice”, >>>>>> “active”: true >>>>>> } >>>>>> >>>>>> and >>>>>> >>>>>> { >>>>>> “_id”: “foo”, >>>>>> “_rev”: “1-def”, >>>>>> “owner”: “bob”, >>>>>> “active”: true >>>>>> } >>>>>> >>>>>> We could represent these using the following set of KVs: >>>>>> >>>>>> (“foo”, “active”) = true >>>>>> (“foo”, “owner”) = kCONFLICT >>>>>> (“foo”, “owner”, “1-abc”) = “alice” >>>>>> (“foo”, “owner”, “1-def”) = “bob” >>>>>> >>>>>> This approach also extends to conflicts where the two versions have >>> different data types. Consider a more complicated example where bob dropped >>> the “active” field and changed the “owner” field to an object: >>>>>> >>>>>> { >>>>>> “_id”: “foo”, >>>>>> “_rev”: “1-def”, >>>>>> “owner”: { >>>>>> “name”: “bob”, >>>>>> “email”: “b...@example.com" >>>>>> } >>>>>> } >>>>>> >>>>>> Now the set of KVs for “foo” looks like this (note that a missing >>> field needs to be handled explicitly): >>>>>> >>>>>> (“foo”, “active”) = kCONFLICT >>>>>> (“foo”, “active”, “1-abc”) = true >>>>>> (“foo”, “active”, “1-def”) = kMISSING >>>>>> (“foo”, “owner”) = kCONFLICT >>>>>> (“foo”, “owner”, “1-abc”) = “alice” >>>>>> (“foo”, “owner”, “1-def”, “name”) = “bob” >>>>>> (“foo”, “owner”, “1-def”, “email”) = “b...@example.com” >>>>>> >>>>>> I like this approach for the common case where documents share most of >>> their data in common but have a conflict in a very specific field or set of >>> fields. >>>>>> >>>>>> I’ve encountered one important downside, though: an edit that >>> replicates in and conflicts with the entire document can cause a bit of a >>> data explosion. Consider a case where I have 10 conflicting versions of a >>> 100KB document, but the conflicts are all related to a single scalar value. >>> Now I replicate in an empty document, and suddenly I have a kCONFLICT at >>> the root. In this model I now need to list out every path of every one of >>> the 10 existing revisions and I end up with a 1MB update. Yuck. That’s >>> technically no worse in the end state than the “zero sharing” case above, >>> but one could easily imagine overrunning the transaction size limit this >>> way. >>>>>> >>>>>> I suspect there’s a smart path out of this. Maybe the system detects a >>> “default” value for each field and uses that instead of writing out the >>> value for every revision in a conflicted subtree. Worth some discussion. >>>>>> >>>>>> ## Revision Trees >>>>>> >>>>>> In CouchDB we currently represent revisions as a hash history tree; >>> each revision identifier is derived from the content of the revision >>> including the revision identifier of its parent. Individual edit branches >>> are bounded in *length* (I believe the default is 1000 entries), but the >>> number of edit branches is technically unbounded. >>>>>> >>>>>> The size limits in FoundationDB preclude us from storing the entire >>> key tree as a single value; in pathological situations the tree could >>> exceed 100KB. Rather, I think it would make sense to store each edit >>> *branch* as a separate KV. We stem the branch long before it hits the value >>> size limit, and in the happy case of no edit conflicts this means we store >>> the edit history metadata in a single KV. It also means that we can apply >>> an interactive edit without retrieving the entire conflicted revision tree; >>> we need only retrieve and modify the single branch against which the edit >>> is being applied. The downside is that we duplicate historical revision >>> identifiers shared by multiple edit branches, but I think this is a >>> worthwhile tradeoff. >>>>>> >>>>>> I would furthermore try to structure the keys so that it is possible >>> to retrieve the “winning” revision in a single limit=1 range query. Ideally >>> I’d like to proide the following properties: >>>>>> >>>>>> 1) a document read does not need to retrieve the revision tree at all, >>> just the winning revision identifier (which would be stored with the rest >>> of the doc) >>>>>> 2) a document update only needs to read the edit branch of the >>> revision tree against which the update is being applied, and it can read >>> that branch immediately knowing only the content of the edit that is being >>> attempted (i.e., it does not need to read the current version of the >>> document itself). >>>>>> >>>>>> So, I’d propose a separate subspace (maybe “_meta”?) for the revision >>> trees, with keys and values that look like >>>>>> >>>>>> (“_meta”, DocID, IsDeleted, RevPosition, RevHash) = [ParentRev, >>> GrandparentRev, …] >>>>>> >>>>>> The inclusion of IsDeleted, RevPosition and RevHash in the key should >>> be sufficient (with the right encoding) to create a range query that >>> automatically selects the “winner” according to CouchDB’s arcane rules, >>> which are something like >>>>>> >>>>>> 1) deleted=false beats deleted=true >>>>>> 2) longer paths (i.e. higher RevPosition) beat shorter ones >>>>>> 3) RevHashes with larger binary values beat ones with smaller values >>>>>> >>>>>> =========== >>>>>> >>>>>> OK, that’s all on this topic from me for now. I think this is a >>> particularly exciting area where we start to see the dividends of splitting >>> up data into multiple KV pairs in FoundationDB :) Cheers, >>>>>> >>>>>> Adam >>>>>> >>>>>> >>>>>>> On Feb 4, 2019, at 2:41 PM, Robert Newson <rnew...@apache.org> wrote: >>>>>>> >>>>>>> This one is quite tightly coupled to the other thread on data model, >>> should we start much conversation here before that one gets closer to a >>> solution? >>>>>>> >>>>>>> -- >>>>>>> Robert Samuel Newson >>>>>>> rnew...@apache.org >>>>>>> >>>>>>> On Mon, 4 Feb 2019, at 19:25, Ilya Khlopotov wrote: >>>>>>>> This is a beginning of a discussion thread about storage of edit >>>>>>>> conflicts and everything which relates to revisions. >>>>>>>> >>>>>>>> >>>>>> >>>>>> >>>> >>> >>> -- >>> Professional Support for Apache CouchDB: >>> https://neighbourhood.ie/couchdb-support/ >>> >>> >>