Agreed, that’s a good fit and is clearly related to the kind of work FDB is doing with its simulator. If you watch Will Wilson’s presentations on the topic, he calls out property-based testing as one isolated example of the broader theme he’s trying to drive here.
Adam > On Feb 8, 2019, at 8:56 AM, Paul Davis <paul.joseph.da...@gmail.com> wrote: > > Cheers to that Garren! > > Whatever we decide on for the data model I'd like to see a fairly > extensive property based test suite around it. I almost said for > anything above chunked based storage but even for that I'd think that > I'd still want property testing around various keys and tree > mutations. > > On Fri, Feb 8, 2019 at 12:45 AM Garren Smith <gar...@apache.org> wrote: >> >> Hi Adam, >> >> Thanks for the detailed email. In terms of the data model, that makes a lot >> of sense. >> >> I’m still playing a bit of catchup on understanding how fdb works, so I >> can’t comment on the best way to retrieve a document. >> >> From my side, I would like to see our decisions also driven by testing and >> validating that our data model works. I find the way that fdb was tested >> and built really impressive. I would love to see us apply some of that to >> the way we build our CouchDB layer. >> >> Cheers >> Garren >> >> On Fri, Feb 8, 2019 at 5:35 AM Adam Kocoloski <kocol...@apache.org> wrote: >> >>> Bob, Garren, Jan - heard you loud and clear, K.I.S.S. I do think it’s a >>> bit “simplistic" to exclusively choose simplicity over performance and >>> storage density. We’re (re)building a database here, one that has some >>> users with pretty demanding performance and scalability requirements. And >>> yes, we should certainly be testing and measuring. Kyle and team are >>> setting up infrastructure in IBM land to help with that now, but I also >>> believe we can design the data model and architecture with a basic >>> performance model of FoundationDB in mind: >>> >>> - reads cost 1ms >>> - short range reads are the same cost as a single lookup >>> - reads of independent parts of the keyspace can be parallelized for cheap >>> - writes are zero-cost until commit time >>> >>> We ought to be able to use these assumptions to drive some decisions about >>> data models ahead of any end-to-end performance test. >>> >>> If there are specific elements of the edit conflicts management where you >>> think greater simplicity is warranted, let’s get those called out. Ilya >>> noted (correctly, in my opinion) that the term sharing stuff is one of >>> those items. It’s relatively complex, potentially a performance hit, and >>> only saves on storage density in the corner case of lots of edit conflicts. >>> That’s a good one to drop. >>> >>> I’m relatively happy with the revision history data model at this point. >>> Hopefully folks find it easy to grok, and it’s efficient for both reads and >>> writes. It costs some extra storage for conflict revisions compared to the >>> current tree representation (up to 16K per edit branch, with default >>> _revs_limit) but knowing what we know about the performance death spiral >>> for wide revision trees today I’ll happily make a storage vs. performance >>> tradeoff here :) >>> >>> Setting the shared term approach aside, I’ve still been mulling over the >>> key structure for the actual document data: >>> >>> - I thought about trying to construct a special _conflicts subspace, but >>> I don’t like that approach because the choice of a “winning" revision can >>> flip back and forth very quickly with concurrent writers to different edit >>> branches. I think we really want to have a way for revisions to naturally >>> sort themselves so the winner is the first or last revision in a list. 
>>> >>> - Assuming we’re using key paths of the form (docid, revision-ish, path, >>> to, field), the goal here is to find an efficient way to get the last key >>> with prefix “docid” (assuming winner sorts last), and then all the keys >>> that share the same (docid, revision-ish) prefix as that one. I see two >>> possible approaches so far, neither perfect: >>> >>> Option 1: Execute a get_key() operation with a key selector that asks for >>> the last key less than “docid\xFF” (again assuming winner sorts last), and >>> then do a get_range_startswith() request setting the streaming mode to >>> “want_all” and the prefix to the docid plus whatever revision-ish we found >>> from the get_key() request. This is two roundtrips instead of one, but it >>> always retrieves exactly the right set of keys, and the second step is >>> executed as fast as possible. >>> >>> Option 2: Jump straight to get_range_startswith() request using only >>> “docid” as the prefix, then cancel the iteration once we reach a revision >>> not equal to the first one we see. We might transfer too much data, or we >>> might end up doing multiple roundtrips if the default “iterator” streaming >>> mode sends too little data to start (I haven’t checked what the default >>> iteration block is there), but in the typical case of zero edit conflicts >>> we have a good chance of retrieving the full document in one roundtrip. >>> >>> I don’t have a good sense of which option wins out here from a performance >>> perspective, but they’re both operating on the same data model so easy >>> enough to test the alternatives. The important bit is getting the >>> revision-ish things to sort correctly. I think we can do that by generating >>> something like >>> >>> revision-ish = NotDeleted/1bit : RevPos : RevHash >>> >>> with some suitable order-preserving encoding on the RevPos integer. >>> >>> Apologies for the long email. Happy for any comments, either here or over >>> on IRC. Cheers, >>> >>> Adam >>> >>>> On Feb 7, 2019, at 4:52 PM, Robert Newson <rnew...@apache.org> wrote: >>>> >>>> I think we should choose simple. We can then see if performance is too >>> low or storage overhead too high and then see what we can do about it. >>>> >>>> B. >>>> >>>> -- >>>> Robert Samuel Newson >>>> rnew...@apache.org >>>> >>>> On Thu, 7 Feb 2019, at 20:36, Ilya Khlopotov wrote: >>>>> We cannot do simple thing if we want to support sharing of JSON terms. >>> I >>>>> think if we want the simplest path we should move sharing out of the >>>>> scope. The problem with sharing is we need to know the location of >>>>> shared terms when we do write. This means that we have to read full >>>>> document on every write. There might be tricks to replace full document >>>>> read with some sort of hierarchical signature or sketch of a document. >>>>> However these tricks do not fall into simplest solution category. We >>>>> need to choose the design goals: >>>>> - simple >>>>> - performance >>>>> - reduced storage overhead >>>>> >>>>> best regards, >>>>> iilyak >>>>> >>>>> On 2019/02/07 12:45:34, Garren Smith <gar...@apache.org> wrote: >>>>>> I’m also in favor of keeping it really simple and then testing and >>>>>> measuring it. >>>>>> >>>>>> What is the best way to measure that we have something that works? I’m >>> not >>>>>> sure just relying on our current tests will prove that? Should we >>> define >>>>>> and build some more complex situations e.g docs with lots of conflicts >>> or >>>>>> docs with wide revisions and make sure we can solve for those? 
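To make Adam’s Option 1 / Option 2 above a bit more concrete, here is a minimal sketch using the FoundationDB Python bindings. The `docs` subspace and the `(docid, revision-ish, field path…)` key layout are illustrative assumptions only, with the winning revision sorting last as described:

```
import fdb
fdb.api_version(600)

# Hypothetical tuple-encoded subspace: (docid, revision-ish, path, to, field) = value
docs = fdb.Subspace(('docs',))

@fdb.transactional
def read_winner_option1(tr, docid):
    # Round trip 1: key selector for the last key in the docid prefix;
    # with winners sorting last this lands on the winning revision.
    last_key = tr.get_key(fdb.KeySelector.last_less_than(docs[docid].range().stop))
    if not docs[docid].contains(last_key):
        return None  # no such document
    revision = docs[docid].unpack(last_key)[0]
    # Round trip 2: fetch every field of that (docid, revision) prefix at once.
    kvs = tr.get_range_startswith(docs[docid][revision].key(),
                                  streaming_mode=fdb.StreamingMode.want_all)
    return [(docs[docid][revision].unpack(k), v) for k, v in kvs]

@fdb.transactional
def read_winner_option2(tr, docid):
    # Single range read over the whole docid prefix; iterate in reverse so the
    # winner (sorting last) arrives first, and stop once the revision changes.
    winner, fields = None, []
    for k, v in tr.get_range_startswith(docs[docid].key(), reverse=True):
        revision, *path = docs[docid].unpack(k)
        if winner is None:
            winner = revision
        elif revision != winner:
            break  # left the winning revision; cancel further transfer
        fields.append((tuple(path), v))
    fields.reverse()
    return fields or None
```

Option 1 always pays exactly two round trips but retrieves exactly the right keys; Option 2 can finish in one round trip when there are no conflicts, at the risk of transferring conflict revisions it then discards.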
>>>>>> >>>>>> On Thu, Feb 7, 2019 at 12:33 PM Jan Lehnardt <j...@apache.org> wrote: >>>>>> >>>>>>> I’m also very much in favour with starting with the simplest thing >>> that >>>>>>> can possibly work and doesn’t go against the advertised best >>> practices of >>>>>>> FoundationDB. Let’s get that going and get a feel for how it all works >>>>>>> together, before trying to optimise things we can’t measure yet. >>>>>>> >>>>>>> Best >>>>>>> Jan >>>>>>> — >>>>>>> >>>>>>>> On 6. Feb 2019, at 16:58, Robert Samuel Newson <rnew...@apache.org> >>>>>>> wrote: >>>>>>>> >>>>>>>> Hi, >>>>>>>> >>>>>>>> With the Redwood storage engine under development and with prefix >>>>>>> elision part of its design, I don’t think we should get too hung up on >>>>>>> adding complications and indirections in the key space just yet. We >>> haven’t >>>>>>> written a line of code or run any tests, this is premature >>> optimisation. >>>>>>>> >>>>>>>> I’d like to focus on the simplest solution that yields all required >>>>>>> properties. We can embellish later (if warranted). >>>>>>>> >>>>>>>> I am intrigued by all the ideas that might allow us cheaper inserts >>> and >>>>>>> updates than the current code where there are multiple edit branches >>> in the >>>>>>> stored document. >>>>>>>> >>>>>>>> B. >>>>>>>> >>>>>>>>> On 6 Feb 2019, at 02:18, Ilya Khlopotov <iil...@apache.org> wrote: >>>>>>>>> >>>>>>>>> While reading Adam's proposal I came to realize that: we don't have >>> to >>>>>>> calculate winning revision at read time. >>>>>>>>> Since FDB's transactions are atomic we can calculate it when we >>> write. >>>>>>> This means we can just write latest values into separate range. This >>> makes >>>>>>> lookup of latest version fast. >>>>>>>>> Another realization is if we want to share values for some json >>> paths >>>>>>> we would have to introduce a level of indirection. >>>>>>>>> Bellow is the data model inspired by Adam's idea to share >>> json_paths. >>>>>>> In this model the json_path is stored in the revision where it was >>> first >>>>>>> added (we call that revision an owner of a json_path). The values for >>>>>>> json_path key can be scalar values, parts of scalar values or >>> pointers to >>>>>>> owner location. >>>>>>>>> The below snippets are sketches of transactions. >>>>>>>>> The transactions will include updates to other keys as needed >>>>>>> (`external_size`, `by_seq` and so on). The revision tree management >>> is not >>>>>>> covered yet. >>>>>>>>> The `rev -> vsn` indirection is not strictly required. It is added >>>>>>> because it saves some space since `rev` is a long string and `vsn` is >>> FDB >>>>>>> versionstamp of fixed size. 
>>>>>>>>> >>>>>>>>> - `{NS} / {docid} / _by_rev / {rev} = vsn` >>>>>>>>> - `{NS} / {docid} / _used_by / {json_path} / {another_vsn} = NIL` >>>>>>>>> - `{NS} / {docid} / _data / {json_path} = latest_value | part` >>>>>>>>> - `{NS} / {docid} / {vsn} / _data / {json_path} = value | part | >>>>>>> {another_vsn}` >>>>>>>>> >>>>>>>>> ``` >>>>>>>>> write(txn, doc_id, prev_rev, json): >>>>>>>>> txn.add_write_conflict_key("{NS} / {doc_id} / _rev") >>>>>>>>> rev = generate_new_rev() >>>>>>>>> txn["{NS} / {docid} / _by_rev / {rev}"] = vsn >>>>>>>>> for every json_path in flattened json >>>>>>>>> - {NS} / {docid} / _used_by / {json_path} / {another_vsn} = NIL >>>>>>>>> if rev is HEAD: >>>>>>>>> # this range contains values for all json paths for the latest >>>>>>> revision (read optimization) >>>>>>>>> - {NS} / {docid} / _data / {json_path} = latest_value | part >>>>>>>>> - {NS} / {docid} / {vsn} / _data / {json_path} = value | part | >>>>>>> {another_vsn} >>>>>>>>> txn["{NS} / {doc_id} / _rev"] = rev >>>>>>>>> >>>>>>>>> get_current(txn, doc_id): >>>>>>>>> # there is no sharing of json_paths in this range (read >>> optimization) >>>>>>>>> txn.get_range("{NS} / {docid} / _data / 0x00", "{NS} / {docid} / >>> _data >>>>>>> / 0xFF" ) >>>>>>>>> >>>>>>>>> get_revision(txn, doc_id, rev): >>>>>>>>> vsn = txn["{NS} / {docid} / _by_rev / {rev}"] >>>>>>>>> json_paths = txn.get_range("{NS} / {vsn} / {docid} / _data / 0x00", >>>>>>> "{NS} / {vsn} / {docid} / _data / 0xFF" ) >>>>>>>>> for every json_path in json_paths: >>>>>>>>> if value has type vsn: >>>>>>>>> another_vsn = value >>>>>>>>> value = txn["{NS} / {docid} / {another_vsn} / _data / >>>>>>> {json_path}"] >>>>>>>>> result[json_path] = value >>>>>>>>> >>>>>>>>> delete_revision(txn, doc_id, rev): >>>>>>>>> vsn = txn["{NS} / {docid} / _by_rev / {rev}"] >>>>>>>>> json_paths = txn.get_range("{NS} / {vsn} / {docid} / _data / 0x00", >>>>>>> "{NS} / {vsn} / {docid} / _data / 0xFF" ) >>>>>>>>> for every json_path in json_paths: >>>>>>>>> if value has type vsn: >>>>>>>>> # remove reference to deleted revision from the owner >>>>>>>>> del txn[{NS} / {docid} / _used_by / {json_path} / {vsn}] >>>>>>>>> # check if deleted revision of json_path is not used by anything >>> else >>>>>>>>> if txn.get_range("{NS} / {docid} / _used_by / {json_path} / {vsn}", >>>>>>> limit=1) == []: >>>>>>>>> del txn["{NS} / {docid} / {vsn} / _data / {json_path}"] >>>>>>>>> if vsn is HEAD: >>>>>>>>> copy range for winning revision into "{NS} / {docid} / _data / >>>>>>> {json_path}" >>>>>>>>> ``` >>>>>>>>> >>>>>>>>> best regards, >>>>>>>>> iilyak >>>>>>>>> >>>>>>>>> On 2019/02/04 23:22:09, Adam Kocoloski <kocol...@apache.org> wrote: >>>>>>>>>> I think it’s fine to start a focused discussion here as it might >>> help >>>>>>> inform some of the broader debate over in that thread. >>>>>>>>>> >>>>>>>>>> As a reminder, today CouchDB writes the entire body of each >>> document >>>>>>> revision on disk as a separate blob. Edit conflicts that have common >>> fields >>>>>>> between them do not share any storage on disk. The revision tree is >>> encoded >>>>>>> into a compact format and a copy of it is stored directly in both the >>> by_id >>>>>>> tree and the by_seq tree. Each leaf entry in the revision tree >>> contain a >>>>>>> pointer to the position of the associated doc revision on disk. >>>>>>>>>> >>>>>>>>>> As a further reminder, CouchDB 2.x clusters can generate edit >>> conflict >>>>>>> revisions just from multiple clients concurrently updating the same >>>>>>> document in a single cluster. 
This won’t happen when FoundationDB is >>>>>>> running under the hood, but users who deploy multiple CouchDB or >>> PouchDB >>>>>>> servers and replicate between them can of course still produce >>> conflicts >>>>>>> just like they could in CouchDB 1.x, so we need a solution. >>>>>>>>>> >>>>>>>>>> Let’s consider the two sub-topics separately: 1) storage of edit >>>>>>> conflict bodies and 2) revision trees >>>>>>>>>> >>>>>>>>>> ## Edit Conflict Storage >>>>>>>>>> >>>>>>>>>> The simplest possible solution would be to store each document >>>>>>> revision separately, like we do today. We could store document bodies >>> with >>>>>>> (“docid”, “revid”) as the key prefix, and each transaction could >>> clear the >>>>>>> key range associated with the base revision against which the edit is >>> being >>>>>>> attempted. This would work, but I think we can try to be a bit more >>> clever >>>>>>> and save on storage space given that we’re splitting JSON documents >>> into >>>>>>> multiple KV pairs. >>>>>>>>>> >>>>>>>>>> One thought I’d had is to introduce a special enum Value which >>>>>>> indicates that the subtree “beneath” the given Key is in conflict. For >>>>>>> example, consider the documents >>>>>>>>>> >>>>>>>>>> { >>>>>>>>>> “_id”: “foo”, >>>>>>>>>> “_rev”: “1-abc”, >>>>>>>>>> “owner”: “alice”, >>>>>>>>>> “active”: true >>>>>>>>>> } >>>>>>>>>> >>>>>>>>>> and >>>>>>>>>> >>>>>>>>>> { >>>>>>>>>> “_id”: “foo”, >>>>>>>>>> “_rev”: “1-def”, >>>>>>>>>> “owner”: “bob”, >>>>>>>>>> “active”: true >>>>>>>>>> } >>>>>>>>>> >>>>>>>>>> We could represent these using the following set of KVs: >>>>>>>>>> >>>>>>>>>> (“foo”, “active”) = true >>>>>>>>>> (“foo”, “owner”) = kCONFLICT >>>>>>>>>> (“foo”, “owner”, “1-abc”) = “alice” >>>>>>>>>> (“foo”, “owner”, “1-def”) = “bob” >>>>>>>>>> >>>>>>>>>> This approach also extends to conflicts where the two versions have >>>>>>> different data types. Consider a more complicated example where bob >>> dropped >>>>>>> the “active” field and changed the “owner” field to an object: >>>>>>>>>> >>>>>>>>>> { >>>>>>>>>> “_id”: “foo”, >>>>>>>>>> “_rev”: “1-def”, >>>>>>>>>> “owner”: { >>>>>>>>>> “name”: “bob”, >>>>>>>>>> “email”: “b...@example.com" >>>>>>>>>> } >>>>>>>>>> } >>>>>>>>>> >>>>>>>>>> Now the set of KVs for “foo” looks like this (note that a missing >>>>>>> field needs to be handled explicitly): >>>>>>>>>> >>>>>>>>>> (“foo”, “active”) = kCONFLICT >>>>>>>>>> (“foo”, “active”, “1-abc”) = true >>>>>>>>>> (“foo”, “active”, “1-def”) = kMISSING >>>>>>>>>> (“foo”, “owner”) = kCONFLICT >>>>>>>>>> (“foo”, “owner”, “1-abc”) = “alice” >>>>>>>>>> (“foo”, “owner”, “1-def”, “name”) = “bob” >>>>>>>>>> (“foo”, “owner”, “1-def”, “email”) = “b...@example.com” >>>>>>>>>> >>>>>>>>>> I like this approach for the common case where documents share >>> most of >>>>>>> their data in common but have a conflict in a very specific field or >>> set of >>>>>>> fields. >>>>>>>>>> >>>>>>>>>> I’ve encountered one important downside, though: an edit that >>>>>>> replicates in and conflicts with the entire document can cause a bit >>> of a >>>>>>> data explosion. Consider a case where I have 10 conflicting versions >>> of a >>>>>>> 100KB document, but the conflicts are all related to a single scalar >>> value. >>>>>>> Now I replicate in an empty document, and suddenly I have a kCONFLICT >>> at >>>>>>> the root. In this model I now need to list out every path of every >>> one of >>>>>>> the 10 existing revisions and I end up with a 1MB update. Yuck. 
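As a toy illustration of the kCONFLICT layout quoted above, here is a flattening sketch limited to top-level fields; `kCONFLICT` and `kMISSING` are placeholder sentinels, and a real implementation would recurse into nested objects (extending the key path) and use reserved tuple values instead:

```
kCONFLICT = "__conflict__"   # placeholder sentinels only; a real layout would use
kMISSING = "__missing__"     # reserved values the tuple layer cannot confuse with data

def flatten(docid, revisions):
    """revisions: {rev_id: body} for every live edit branch of one document."""
    kvs = {}
    fields = set()
    for body in revisions.values():
        fields.update(body)
    for field in sorted(fields):
        per_rev = {rev: body.get(field, kMISSING) for rev, body in revisions.items()}
        vals = list(per_rev.values())
        if all(v == vals[0] for v in vals[1:]):
            # every revision agrees: store the field once, with no rev in the key
            kvs[(docid, field)] = vals[0]
        else:
            # revisions disagree: mark the subtree and fan out one key per revision
            kvs[(docid, field)] = kCONFLICT
            for rev, val in per_rev.items():
                kvs[(docid, field, rev)] = val
    return kvs

# flatten("foo", {"1-abc": {"owner": "alice", "active": True},
#                 "1-def": {"owner": "bob",   "active": True}})
# -> {("foo", "active"): True,
#     ("foo", "owner"): kCONFLICT,
#     ("foo", "owner", "1-abc"): "alice",
#     ("foo", "owner", "1-def"): "bob"}
```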
That’s >>>>>>> technically no worse in the end state than the “zero sharing” case >>> above, >>>>>>> but one could easily imagine overrunning the transaction size limit >>> this >>>>>>> way. >>>>>>>>>> >>>>>>>>>> I suspect there’s a smart path out of this. Maybe the system >>> detects a >>>>>>> “default” value for each field and uses that instead of writing out >>> the >>>>>>> value for every revision in a conflicted subtree. Worth some >>> discussion. >>>>>>>>>> >>>>>>>>>> ## Revision Trees >>>>>>>>>> >>>>>>>>>> In CouchDB we currently represent revisions as a hash history tree; >>>>>>> each revision identifier is derived from the content of the revision >>>>>>> including the revision identifier of its parent. Individual edit >>> branches >>>>>>> are bounded in *length* (I believe the default is 1000 entries), but >>> the >>>>>>> number of edit branches is technically unbounded. >>>>>>>>>> >>>>>>>>>> The size limits in FoundationDB preclude us from storing the entire >>>>>>> key tree as a single value; in pathological situations the tree could >>>>>>> exceed 100KB. Rather, I think it would make sense to store each edit >>>>>>> *branch* as a separate KV. We stem the branch long before it hits the >>> value >>>>>>> size limit, and in the happy case of no edit conflicts this means we >>> store >>>>>>> the edit history metadata in a single KV. It also means that we can >>> apply >>>>>>> an interactive edit without retrieving the entire conflicted revision >>> tree; >>>>>>> we need only retrieve and modify the single branch against which the >>> edit >>>>>>> is being applied. The downside is that we duplicate historical >>> revision >>>>>>> identifiers shared by multiple edit branches, but I think this is a >>>>>>> worthwhile tradeoff. >>>>>>>>>> >>>>>>>>>> I would furthermore try to structure the keys so that it is >>> possible >>>>>>> to retrieve the “winning” revision in a single limit=1 range query. >>> Ideally >>>>>>> I’d like to proide the following properties: >>>>>>>>>> >>>>>>>>>> 1) a document read does not need to retrieve the revision tree at >>> all, >>>>>>> just the winning revision identifier (which would be stored with the >>> rest >>>>>>> of the doc) >>>>>>>>>> 2) a document update only needs to read the edit branch of the >>>>>>> revision tree against which the update is being applied, and it can >>> read >>>>>>> that branch immediately knowing only the content of the edit that is >>> being >>>>>>> attempted (i.e., it does not need to read the current version of the >>>>>>> document itself). >>>>>>>>>> >>>>>>>>>> So, I’d propose a separate subspace (maybe “_meta”?) for the >>> revision >>>>>>> trees, with keys and values that look like >>>>>>>>>> >>>>>>>>>> (“_meta”, DocID, IsDeleted, RevPosition, RevHash) = [ParentRev, >>>>>>> GrandparentRev, …] >>>>>>>>>> >>>>>>>>>> The inclusion of IsDeleted, RevPosition and RevHash in the key >>> should >>>>>>> be sufficient (with the right encoding) to create a range query that >>>>>>> automatically selects the “winner” according to CouchDB’s arcane >>> rules, >>>>>>> which are something like >>>>>>>>>> >>>>>>>>>> 1) deleted=false beats deleted=true >>>>>>>>>> 2) longer paths (i.e. higher RevPosition) beat shorter ones >>>>>>>>>> 3) RevHashes with larger binary values beat ones with smaller >>> values >>>>>>>>>> >>>>>>>>>> =========== >>>>>>>>>> >>>>>>>>>> OK, that’s all on this topic from me for now. 
I think this is a >>>>>>> particularly exciting area where we start to see the dividends of >>> splitting >>>>>>> up data into multiple KV pairs in FoundationDB :) Cheers, >>>>>>>>>> >>>>>>>>>> Adam >>>>>>>>>> >>>>>>>>>> >>>>>>>>>>> On Feb 4, 2019, at 2:41 PM, Robert Newson <rnew...@apache.org> >>> wrote: >>>>>>>>>>> >>>>>>>>>>> This one is quite tightly coupled to the other thread on data >>> model, >>>>>>> should we start much conversation here before that one gets closer to >>> a >>>>>>> solution? >>>>>>>>>>> >>>>>>>>>>> -- >>>>>>>>>>> Robert Samuel Newson >>>>>>>>>>> rnew...@apache.org >>>>>>>>>>> >>>>>>>>>>> On Mon, 4 Feb 2019, at 19:25, Ilya Khlopotov wrote: >>>>>>>>>>>> This is a beginning of a discussion thread about storage of edit >>>>>>>>>>>> conflicts and everything which relates to revisions. >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>> >>>>>>> >>>>>>> -- >>>>>>> Professional Support for Apache CouchDB: >>>>>>> https://neighbourhood.ie/couchdb-support/ >>>>>>> >>>>>>> >>>>>> >>> >>>