Re: # [DISCUSS] : things we need to solve/decide : storage of edit conflicts

Adam Kocoloski Thu, 07 Feb 2019 13:09:40 -0800

I am OK not doing term sharing. I would like to keep these two performance 
efficiencies discussed earlier:


1) A client can retrieve the “winning” revision of the document (and only that 
revision) in a single roundtrip knowing only the document ID
2) A client can update an edit branch of a document without retrieving the body 
of the current revision

These are two very common access patterns and we should optimize for them. I 
don’t think we quite have that sorted out yet.

Adam

> On Feb 7, 2019, at 3:36 PM, Ilya Khlopotov <iil...@apache.org> wrote:
> 
> We cannot do simple thing if we want to support sharing of JSON terms. I 
> think if we want the simplest path we should move sharing out of the scope. 
> The problem with sharing is we need to know the location of shared terms when 
> we do write. This means that we have to read full document on every write. 
> There might be tricks to replace full document read with some sort of 
> hierarchical signature or sketch of a document. However these tricks do not 
> fall into simplest solution category. We need to choose the design goals:
> - simple
> - performance
> - reduced storage overhead
> 
> best regards,
> iilyak
> 
> On 2019/02/07 12:45:34, Garren Smith <gar...@apache.org> wrote: 
>> I’m also in favor of keeping it really simple and then testing and
>> measuring it.
>> 
>> What is the best way to measure that we have something that works? I’m not
>> sure just relying on our current tests will prove that? Should we define
>> and build some more complex situations e.g docs with lots of conflicts or
>> docs with wide revisions and make sure we can solve for those?
>> 
>> On Thu, Feb 7, 2019 at 12:33 PM Jan Lehnardt <j...@apache.org> wrote:
>> 
>>> I’m also very much in favour with starting with the simplest thing that
>>> can possibly work and doesn’t go against the advertised best practices of
>>> FoundationDB. Let’s get that going and get a feel for how it all works
>>> together, before trying to optimise things we can’t measure yet.
>>> 
>>> Best
>>> Jan
>>> —
>>> 
>>>> On 6. Feb 2019, at 16:58, Robert Samuel Newson <rnew...@apache.org>
>>> wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> With the Redwood storage engine under development and with prefix
>>> elision part of its design, I don’t think we should get too hung up on
>>> adding complications and indirections in the key space just yet. We haven’t
>>> written a line of code or run any tests, this is premature optimisation.
>>>> 
>>>> I’d like to focus on the simplest solution that yields all required
>>> properties. We can embellish later (if warranted).
>>>> 
>>>> I am intrigued by all the ideas that might allow us cheaper inserts and
>>> updates than the current code where there are multiple edit branches in the
>>> stored document.
>>>> 
>>>> B.
>>>> 
>>>>> On 6 Feb 2019, at 02:18, Ilya Khlopotov <iil...@apache.org> wrote:
>>>>> 
>>>>> While reading Adam's proposal I came to realize that: we don't have to
>>> calculate winning revision at read time.
>>>>> Since FDB's transactions are atomic we can calculate it when we write.
>>> This means we can just write latest values into separate range. This makes
>>> lookup of latest version fast.
>>>>> Another realization is if we want to share values for some json paths
>>> we would have to introduce a level of indirection.
>>>>> Bellow is the data model inspired by Adam's idea to share json_paths.
>>> In this model the json_path is stored in the revision where it was first
>>> added (we call that revision an owner of a json_path). The values for
>>> json_path key can be scalar values, parts of scalar values or pointers to
>>> owner location.
>>>>> The below snippets are sketches of transactions.
>>>>> The transactions will include updates to other keys as needed
>>> (`external_size`, `by_seq` and so on).  The revision tree management is not
>>> covered yet.
>>>>> The `rev -> vsn` indirection is not strictly required. It is added
>>> because it saves some space since `rev` is a long string and `vsn` is FDB
>>> versionstamp of fixed size.
>>>>> 
>>>>> - `{NS} / {docid} / _by_rev / {rev} = vsn`
>>>>> - `{NS} / {docid} / _used_by / {json_path} / {another_vsn} = NIL`
>>>>> - `{NS} / {docid} / _data / {json_path} = latest_value | part`
>>>>> - `{NS} / {docid} / {vsn} / _data / {json_path} = value | part |
>>> {another_vsn}`
>>>>> 
>>>>> ```
>>>>> write(txn, doc_id, prev_rev, json):
>>>>> txn.add_write_conflict_key("{NS} / {doc_id} / _rev")
>>>>> rev = generate_new_rev()
>>>>> txn["{NS} / {docid} / _by_rev / {rev}"] = vsn
>>>>> for every json_path in flattened json
>>>>>   - {NS} / {docid} / _used_by / {json_path} / {another_vsn} = NIL
>>>>>   if rev is HEAD:
>>>>>     # this range contains values for all json paths for the latest
>>> revision (read optimization)
>>>>>     - {NS} / {docid} / _data / {json_path} = latest_value | part
>>>>>   - {NS} / {docid} / {vsn} / _data / {json_path} = value | part |
>>> {another_vsn}
>>>>> txn["{NS} / {doc_id} / _rev"] = rev
>>>>> 
>>>>> get_current(txn, doc_id):
>>>>> # there is no sharing of json_paths in this range (read optimization)
>>>>> txn.get_range("{NS} / {docid} / _data / 0x00", "{NS} / {docid} / _data
>>> / 0xFF" )
>>>>> 
>>>>> get_revision(txn, doc_id, rev):
>>>>> vsn = txn["{NS} / {docid} / _by_rev / {rev}"]
>>>>> json_paths = txn.get_range("{NS} / {vsn} / {docid} / _data / 0x00",
>>> "{NS} / {vsn} / {docid} / _data / 0xFF" )
>>>>> for every json_path in json_paths:
>>>>>  if value has type vsn:
>>>>>     another_vsn = value
>>>>>        value = txn["{NS} / {docid} / {another_vsn} / _data /
>>> {json_path}"]
>>>>>  result[json_path] = value
>>>>> 
>>>>> delete_revision(txn, doc_id, rev):
>>>>> vsn = txn["{NS} / {docid} / _by_rev / {rev}"]
>>>>> json_paths = txn.get_range("{NS} / {vsn} / {docid} / _data / 0x00",
>>> "{NS} / {vsn} / {docid} / _data / 0xFF" )
>>>>> for every json_path in json_paths:
>>>>>  if value has type vsn:
>>>>>    # remove reference to deleted revision from the owner
>>>>>     del txn[{NS} / {docid} / _used_by / {json_path} / {vsn}]
>>>>>  # check if deleted revision of json_path is not used by anything else
>>>>>  if txn.get_range("{NS} / {docid} / _used_by / {json_path} / {vsn}",
>>> limit=1) == []:
>>>>>     del txn["{NS} / {docid} / {vsn} / _data / {json_path}"]
>>>>>  if vsn is HEAD:
>>>>>     copy range for winning revision into "{NS} / {docid} / _data /
>>> {json_path}"
>>>>> ```
>>>>> 
>>>>> best regards,
>>>>> iilyak
>>>>> 
>>>>> On 2019/02/04 23:22:09, Adam Kocoloski <kocol...@apache.org> wrote:
>>>>>> I think it’s fine to start a focused discussion here as it might help
>>> inform some of the broader debate over in that thread.
>>>>>> 
>>>>>> As a reminder, today CouchDB writes the entire body of each document
>>> revision on disk as a separate blob. Edit conflicts that have common fields
>>> between them do not share any storage on disk. The revision tree is encoded
>>> into a compact format and a copy of it is stored directly in both the by_id
>>> tree and the by_seq tree. Each leaf entry in the revision tree contain a
>>> pointer to the position of the associated doc revision on disk.
>>>>>> 
>>>>>> As a further reminder, CouchDB 2.x clusters can generate edit conflict
>>> revisions just from multiple clients concurrently updating the same
>>> document in a single cluster. This won’t happen when FoundationDB is
>>> running under the hood, but users who deploy multiple CouchDB or PouchDB
>>> servers and replicate between them can of course still produce conflicts
>>> just like they could in CouchDB 1.x, so we need a solution.
>>>>>> 
>>>>>> Let’s consider the two sub-topics separately: 1) storage of edit
>>> conflict bodies and 2) revision trees
>>>>>> 
>>>>>> ## Edit Conflict Storage
>>>>>> 
>>>>>> The simplest possible solution would be to store each document
>>> revision separately, like we do today. We could store document bodies with
>>> (“docid”, “revid”) as the key prefix, and each transaction could clear the
>>> key range associated with the base revision against which the edit is being
>>> attempted. This would work, but I think we can try to be a bit more clever
>>> and save on storage space given that we’re splitting JSON documents into
>>> multiple KV pairs.
>>>>>> 
>>>>>> One thought I’d had is to introduce a special enum Value which
>>> indicates that the subtree “beneath” the given Key is in conflict. For
>>> example, consider the documents
>>>>>> 
>>>>>> {
>>>>>>  “_id”: “foo”,
>>>>>>  “_rev”: “1-abc”,
>>>>>>  “owner”: “alice”,
>>>>>>  “active”: true
>>>>>> }
>>>>>> 
>>>>>> and
>>>>>> 
>>>>>> {
>>>>>>  “_id”: “foo”,
>>>>>>  “_rev”: “1-def”,
>>>>>>  “owner”: “bob”,
>>>>>>  “active”: true
>>>>>> }
>>>>>> 
>>>>>> We could represent these using the following set of KVs:
>>>>>> 
>>>>>> (“foo”, “active”) = true
>>>>>> (“foo”, “owner”) = kCONFLICT
>>>>>> (“foo”, “owner”, “1-abc”) = “alice”
>>>>>> (“foo”, “owner”, “1-def”) = “bob”
>>>>>> 
>>>>>> This approach also extends to conflicts where the two versions have
>>> different data types. Consider a more complicated example where bob dropped
>>> the “active” field and changed the “owner” field to an object:
>>>>>> 
>>>>>> {
>>>>>> “_id”: “foo”,
>>>>>> “_rev”: “1-def”,
>>>>>> “owner”: {
>>>>>>  “name”: “bob”,
>>>>>>  “email”: “b...@example.com"
>>>>>> }
>>>>>> }
>>>>>> 
>>>>>> Now the set of KVs for “foo” looks like this (note that a missing
>>> field needs to be handled explicitly):
>>>>>> 
>>>>>> (“foo”, “active”) = kCONFLICT
>>>>>> (“foo”, “active”, “1-abc”) = true
>>>>>> (“foo”, “active”, “1-def”) = kMISSING
>>>>>> (“foo”, “owner”) = kCONFLICT
>>>>>> (“foo”, “owner”, “1-abc”) = “alice”
>>>>>> (“foo”, “owner”, “1-def”, “name”) = “bob”
>>>>>> (“foo”, “owner”, “1-def”, “email”) = “b...@example.com”
>>>>>> 
>>>>>> I like this approach for the common case where documents share most of
>>> their data in common but have a conflict in a very specific field or set of
>>> fields.
>>>>>> 
>>>>>> I’ve encountered one important downside, though: an edit that
>>> replicates in and conflicts with the entire document can cause a bit of a
>>> data explosion. Consider a case where I have 10 conflicting versions of a
>>> 100KB document, but the conflicts are all related to a single scalar value.
>>> Now I replicate in an empty document, and suddenly I have a kCONFLICT at
>>> the root. In this model I now need to list out every path of every one of
>>> the 10 existing revisions and I end up with a 1MB update. Yuck. That’s
>>> technically no worse in the end state than the “zero sharing” case above,
>>> but one could easily imagine overrunning the transaction size limit this
>>> way.
>>>>>> 
>>>>>> I suspect there’s a smart path out of this. Maybe the system detects a
>>> “default” value for each field and uses that instead of writing out the
>>> value for every revision in a conflicted subtree. Worth some discussion.
>>>>>> 
>>>>>> ## Revision Trees
>>>>>> 
>>>>>> In CouchDB we currently represent revisions as a hash history tree;
>>> each revision identifier is derived from the content of the revision
>>> including the revision identifier of its parent. Individual edit branches
>>> are bounded in *length* (I believe the default is 1000 entries), but the
>>> number of edit branches is technically unbounded.
>>>>>> 
>>>>>> The size limits in FoundationDB preclude us from storing the entire
>>> key tree as a single value; in pathological situations the tree could
>>> exceed 100KB. Rather, I think it would make sense to store each edit
>>> *branch* as a separate KV. We stem the branch long before it hits the value
>>> size limit, and in the happy case of no edit conflicts this means we store
>>> the edit history metadata in a single KV. It also means that we can apply
>>> an interactive edit without retrieving the entire conflicted revision tree;
>>> we need only retrieve and modify the single branch against which the edit
>>> is being applied. The downside is that we duplicate historical revision
>>> identifiers shared by multiple edit branches, but I think this is a
>>> worthwhile tradeoff.
>>>>>> 
>>>>>> I would furthermore try to structure the keys so that it is possible
>>> to retrieve the “winning” revision in a single limit=1 range query. Ideally
>>> I’d like to proide the following properties:
>>>>>> 
>>>>>> 1) a document read does not need to retrieve the revision tree at all,
>>> just the winning revision identifier (which would be stored with the rest
>>> of the doc)
>>>>>> 2) a document update only needs to read the edit branch of the
>>> revision tree against which the update is being applied, and it can read
>>> that branch immediately knowing only the content of the edit that is being
>>> attempted (i.e., it does not need to read the current version of the
>>> document itself).
>>>>>> 
>>>>>> So, I’d propose a separate subspace (maybe “_meta”?) for the revision
>>> trees, with keys and values that look like
>>>>>> 
>>>>>> (“_meta”, DocID, IsDeleted, RevPosition, RevHash) = [ParentRev,
>>> GrandparentRev, …]
>>>>>> 
>>>>>> The inclusion of IsDeleted, RevPosition and RevHash in the key should
>>> be sufficient (with the right encoding) to create a range query that
>>> automatically selects the “winner” according to CouchDB’s arcane rules,
>>> which are something like
>>>>>> 
>>>>>> 1) deleted=false beats deleted=true
>>>>>> 2) longer paths (i.e. higher RevPosition) beat shorter ones
>>>>>> 3) RevHashes with larger binary values beat ones with smaller values
>>>>>> 
>>>>>> ===========
>>>>>> 
>>>>>> OK, that’s all on this topic from me for now. I think this is a
>>> particularly exciting area where we start to see the dividends of splitting
>>> up data into multiple KV pairs in FoundationDB :) Cheers,
>>>>>> 
>>>>>> Adam
>>>>>> 
>>>>>> 
>>>>>>> On Feb 4, 2019, at 2:41 PM, Robert Newson <rnew...@apache.org> wrote:
>>>>>>> 
>>>>>>> This one is quite tightly coupled to the other thread on data model,
>>> should we start much conversation here before that one gets closer to a
>>> solution?
>>>>>>> 
>>>>>>> --
>>>>>>> Robert Samuel Newson
>>>>>>> rnew...@apache.org
>>>>>>> 
>>>>>>> On Mon, 4 Feb 2019, at 19:25, Ilya Khlopotov wrote:
>>>>>>>> This is a beginning of a discussion thread about storage of edit
>>>>>>>> conflicts and everything which relates to revisions.
>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>> 
>>> 
>>> --
>>> Professional Support for Apache CouchDB:
>>> https://neighbourhood.ie/couchdb-support/
>>> 
>>> 
>>

Re: # [DISCUSS] : things we need to solve/decide : storage of edit conflicts

Reply via email to