Agreed, that’s a good fit and is clearly related to the kind of work FDB is doing with its simulator. If you watch Will Wilson’s presentations on the topic, he calls out property-based testing as one isolated example of the broader theme he’s trying to drive here.
Adam > On Feb 8, 2019, at 8:56 AM, Paul Davis <paul.joseph.da...@gmail.com> wrote: > > Cheers to that Garren! > > Whatever we decide on for the data model I'd like to see a fairly > extensive property based test suite around it. I almost said for > anything above chunked based storage but even for that I'd think that > I'd still want property testing around various keys and tree > mutations. > > On Fri, Feb 8, 2019 at 12:45 AM Garren Smith <gar...@apache.org> wrote: >> >> Hi Adam, >> >> Thanks for the detailed email. In terms of the data model, that makes a lot >> of sense. >> >> I’m still playing a bit of catchup on understanding how fdb works, so I >> can’t comment on the best way to retrieve a document. >> >> From my side, I would like to see our decisions also driven by testing and >> validating that our data model works. I find the way that fdb was tested >> and built really impressive. I would love to see us apply some of that to >> the way we build our CouchDB layer. >> >> Cheers >> Garren >> >> On Fri, Feb 8, 2019 at 5:35 AM Adam Kocoloski <kocol...@apache.org> wrote: >> >>> Bob, Garren, Jan - heard you loud and clear, K.I.S.S. I do think it’s a >>> bit “simplistic" to exclusively choose simplicity over performance and >>> storage density. We’re (re)building a database here, one that has some >>> users with pretty demanding performance and scalability requirements. And >>> yes, we should certainly be testing and measuring. Kyle and team are >>> setting up infrastructure in IBM land to help with that now, but I also >>> believe we can design the data model and architecture with a basic >>> performance model of FoundationDB in mind: >>> >>> - reads cost 1ms >>> - short range reads are the same cost as a single lookup >>> - reads of independent parts of the keyspace can be parallelized for cheap >>> - writes are zero-cost until commit time >>> >>> We ought to be able to use these assumptions to drive some decisions about >>> data models ahead of any end-to-end performance test. >>> >>> If there are specific elements of the edit conflicts management where you >>> think greater simplicity is warranted, let’s get those called out. Ilya >>> noted (correctly, in my opinion) that the term sharing stuff is one of >>> those items. It’s relatively complex, potentially a performance hit, and >>> only saves on storage density in the corner case of lots of edit conflicts. >>> That’s a good one to drop. >>> >>> I’m relatively happy with the revision history data model at this point. >>> Hopefully folks find it easy to grok, and it’s efficient for both reads and >>> writes. It costs some extra storage for conflict revisions compared to the >>> current tree representation (up to 16K per edit branch, with default >>> _revs_limit) but knowing what we know about the performance death spiral >>> for wide revision trees today I’ll happily make a storage vs. performance >>> tradeoff here :) >>> >>> Setting the shared term approach aside, I’ve still been mulling over the >>> key structure for the actual document data: >>> >>> - I thought about trying to construct a special _conflicts subspace, but >>> I don’t like that approach because the choice of a “winning" revision can >>> flip back and forth very quickly with concurrent writers to different edit >>> branches. I think we really want to have a way for revisions to naturally >>> sort themselves so the winner is the first or last revision in a list. 
>>> >>> - Assuming we’re using key paths of the form (docid, revision-ish, path, >>> to, field), the goal here is to find an efficient way to get the last key >>> with prefix “docid” (assuming winner sorts last), and then all the keys >>> that share the same (docid, revision-ish) prefix as that one. I see two >>> possible approaches so far, neither perfect: >>> >>> Option 1: Execute a get_key() operation with a key selector that asks for >>> the last key less than “docid\xFF” (again assuming winner sorts last), and >>> then do a get_range_startswith() request setting the streaming mode to >>> “want_all” and the prefix to the docid plus whatever revision-ish we found >>> from the get_key() request. This is two roundtrips instead of one, but it >>> always retrieves exactly the right set of keys, and the second step is >>> executed as fast as possible. >>> >>> Option 2: Jump straight to get_range_startswith() request using only >>> “docid” as the prefix, then cancel the iteration once we reach a revision >>> not equal to the first one we see. We might transfer too much data, or we >>> might end up doing multiple roundtrips if the default “iterator” streaming >>> mode sends too little data to start (I haven’t checked what the default >>> iteration block is there), but in the typical case of zero edit conflicts >>> we have a good chance of retrieving the full document in one roundtrip. >>> >>> I don’t have a good sense of which option wins out here from a performance >>> perspective, but they’re both operating on the same data model so easy >>> enough to test the alternatives. The important bit is getting the >>> revision-ish things to sort correctly. I think we can do that by generating >>> something like >>> >>> revision-ish = NotDeleted/1bit : RevPos : RevHash >>> >>> with some suitable order-preserving encoding on the RevPos integer. >>> >>> Apologies for the long email. Happy for any comments, either here or over >>> on IRC. Cheers, >>> >>> Adam >>> >>>> On Feb 7, 2019, at 4:52 PM, Robert Newson <rnew...@apache.org> wrote: >>>> >>>> I think we should choose simple. We can then see if performance is too >>> low or storage overhead too high and then see what we can do about it. >>>> >>>> B. >>>> >>>> -- >>>> Robert Samuel Newson >>>> rnew...@apache.org >>>> >>>> On Thu, 7 Feb 2019, at 20:36, Ilya Khlopotov wrote: >>>>> We cannot do simple thing if we want to support sharing of JSON terms. >>> I >>>>> think if we want the simplest path we should move sharing out of the >>>>> scope. The problem with sharing is we need to know the location of >>>>> shared terms when we do write. This means that we have to read full >>>>> document on every write. There might be tricks to replace full document >>>>> read with some sort of hierarchical signature or sketch of a document. >>>>> However these tricks do not fall into simplest solution category. We >>>>> need to choose the design goals: >>>>> - simple >>>>> - performance >>>>> - reduced storage overhead >>>>> >>>>> best regards, >>>>> iilyak >>>>> >>>>> On 2019/02/07 12:45:34, Garren Smith <gar...@apache.org> wrote: >>>>>> I’m also in favor of keeping it really simple and then testing and >>>>>> measuring it. >>>>>> >>>>>> What is the best way to measure that we have something that works? I’m >>> not >>>>>> sure just relying on our current tests will prove that? Should we >>> define >>>>>> and build some more complex situations e.g docs with lots of conflicts >>> or >>>>>> docs with wide revisions and make sure we can solve for those? 
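To make Adam’s Option 1 / Option 2 above a bit more concrete, here is a minimal sketch using the FoundationDB Python bindings. The `docs` subspace and the `(docid, revision-ish, field path…)` key layout are illustrative assumptions only, with the winning revision sorting last as described:

```
import fdb
fdb.api_version(600)

# Hypothetical tuple-encoded subspace: (docid, revision-ish, path, to, field) = value
docs = fdb.Subspace(('docs',))

@fdb.transactional
def read_winner_option1(tr, docid):
    # Round trip 1: key selector for the last key in the docid prefix;
    # with winners sorting last this lands on the winning revision.
    last_key = tr.get_key(fdb.KeySelector.last_less_than(docs[docid].range().stop))
    if not docs[docid].contains(last_key):
        return None  # no such document
    revision = docs[docid].unpack(last_key)[0]
    # Round trip 2: fetch every field of that (docid, revision) prefix at once.
    kvs = tr.get_range_startswith(docs[docid][revision].key(),
                                  streaming_mode=fdb.StreamingMode.want_all)
    return [(docs[docid][revision].unpack(k), v) for k, v in kvs]

@fdb.transactional
def read_winner_option2(tr, docid):
    # Single range read over the whole docid prefix; iterate in reverse so the
    # winner (sorting last) arrives first, and stop once the revision changes.
    winner, fields = None, []
    for k, v in tr.get_range_startswith(docs[docid].key(), reverse=True):
        revision, *path = docs[docid].unpack(k)
        if winner is None:
            winner = revision
        elif revision != winner:
            break  # left the winning revision; cancel further transfer
        fields.append((tuple(path), v))
    fields.reverse()
    return fields or None
```

Option 1 always pays exactly two round trips but retrieves exactly the right keys; Option 2 can finish in one round trip when there are no conflicts, at the risk of transferring conflict revisions it then discards.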
>>>>>> >>>>>> On Thu, Feb 7, 2019 at 12:33 PM Jan Lehnardt <j...@apache.org> wrote: >>>>>> >>>>>>> I’m also very much in favour with starting with the simplest thing >>> that >>>>>>> can possibly work and doesn’t go against the advertised best >>> practices of >>>>>>> FoundationDB. Let’s get that going and get a feel for how it all works >>>>>>> together, before trying to optimise things we can’t measure yet. >>>>>>> >>>>>>> Best >>>>>>> Jan >>>>>>> — >>>>>>> >>>>>>>> On 6. Feb 2019, at 16:58, Robert Samuel Newson <rnew...@apache.org> >>>>>>> wrote: >>>>>>>> >>>>>>>> Hi, >>>>>>>> >>>>>>>> With the Redwood storage engine under development and with prefix >>>>>>> elision part of its design, I don’t think we should get too hung up on >>>>>>> adding complications and indirections in the key space just yet. We >>> haven’t >>>>>>> written a line of code or run any tests, this is premature >>> optimisation. >>>>>>>> >>>>>>>> I’d like to focus on the simplest solution that yields all required >>>>>>> properties. We can embellish later (if warranted). >>>>>>>> >>>>>>>> I am intrigued by all the ideas that might allow us cheaper inserts >>> and >>>>>>> updates than the current code where there are multiple edit branches >>> in the >>>>>>> stored document. >>>>>>>> >>>>>>>> B. >>>>>>>> >>>>>>>>> On 6 Feb 2019, at 02:18, Ilya Khlopotov <iil...@apache.org> wrote: >>>>>>>>> >>>>>>>>> While reading Adam's proposal I came to realize that: we don't have >>> to >>>>>>> calculate winning revision at read time. >>>>>>>>> Since FDB's transactions are atomic we can calculate it when we >>> write. >>>>>>> This means we can just write latest values into separate range. This >>> makes >>>>>>> lookup of latest version fast. >>>>>>>>> Another realization is if we want to share values for some json >>> paths >>>>>>> we would have to introduce a level of indirection. >>>>>>>>> Bellow is the data model inspired by Adam's idea to share >>> json_paths. >>>>>>> In this model the json_path is stored in the revision where it was >>> first >>>>>>> added (we call that revision an owner of a json_path). The values for >>>>>>> json_path key can be scalar values, parts of scalar values or >>> pointers to >>>>>>> owner location. >>>>>>>>> The below snippets are sketches of transactions. >>>>>>>>> The transactions will include updates to other keys as needed >>>>>>> (`external_size`, `by_seq` and so on). The revision tree management >>> is not >>>>>>> covered yet. >>>>>>>>> The `rev -> vsn` indirection is not strictly required. It is added >>>>>>> because it saves some space since `rev` is a long string and `vsn` is >>> FDB >>>>>>> versionstamp of fixed size. 
>>>>>>>>> >>>>>>>>> - `{NS} / {docid} / _by_rev / {rev} = vsn` >>>>>>>>> - `{NS} / {docid} / _used_by / {json_path} / {another_vsn} = NIL` >>>>>>>>> - `{NS} / {docid} / _data / {json_path} = latest_value | part` >>>>>>>>> - `{NS} / {docid} / {vsn} / _data / {json_path} = value | part | >>>>>>> {another_vsn}` >>>>>>>>> >>>>>>>>> ``` >>>>>>>>> write(txn, doc_id, prev_rev, json): >>>>>>>>> txn.add_write_conflict_key("{NS} / {doc_id} / _rev") >>>>>>>>> rev = generate_new_rev() >>>>>>>>> txn["{NS} / {docid} / _by_rev / {rev}"] = vsn >>>>>>>>> for every json_path in flattened json >>>>>>>>> - {NS} / {docid} / _used_by / {json_path} / {another_vsn} = NIL >>>>>>>>> if rev is HEAD: >>>>>>>>> # this range contains values for all json paths for the latest >>>>>>> revision (read optimization) >>>>>>>>> - {NS} / {docid} / _data / {json_path} = latest_value | part >>>>>>>>> - {NS} / {docid} / {vsn} / _data / {json_path} = value | part | >>>>>>> {another_vsn} >>>>>>>>> txn["{NS} / {doc_id} / _rev"] = rev >>>>>>>>> >>>>>>>>> get_current(txn, doc_id): >>>>>>>>> # there is no sharing of json_paths in this range (read >>> optimization) >>>>>>>>> txn.get_range("{NS} / {docid} / _data / 0x00", "{NS} / {docid} / >>> _data >>>>>>> / 0xFF" ) >>>>>>>>> >>>>>>>>> get_revision(txn, doc_id, rev): >>>>>>>>> vsn = txn["{NS} / {docid} / _by_rev / {rev}"] >>>>>>>>> json_paths = txn.get_range("{NS} / {vsn} / {docid} / _data / 0x00", >>>>>>> "{NS} / {vsn} / {docid} / _data / 0xFF" ) >>>>>>>>> for every json_path in json_paths: >>>>>>>>> if value has type vsn: >>>>>>>>> another_vsn = value >>>>>>>>> value = txn["{NS} / {docid} / {another_vsn} / _data / >>>>>>> {json_path}"] >>>>>>>>> result[json_path] = value >>>>>>>>> >>>>>>>>> delete_revision(txn, doc_id, rev): >>>>>>>>> vsn = txn["{NS} / {docid} / _by_rev / {rev}"] >>>>>>>>> json_paths = txn.get_range("{NS} / {vsn} / {docid} / _data / 0x00", >>>>>>> "{NS} / {vsn} / {docid} / _data / 0xFF" ) >>>>>>>>> for every json_path in json_paths: >>>>>>>>> if value has type vsn: >>>>>>>>> # remove reference to deleted revision from the owner >>>>>>>>> del txn[{NS} / {docid} / _used_by / {json_path} / {vsn}] >>>>>>>>> # check if deleted revision of json_path is not used by anything >>> else >>>>>>>>> if txn.get_range("{NS} / {docid} / _used_by / {json_path} / {vsn}", >>>>>>> limit=1) == []: >>>>>>>>> del txn["{NS} / {docid} / {vsn} / _data / {json_path}"] >>>>>>>>> if vsn is HEAD: >>>>>>>>> copy range for winning revision into "{NS} / {docid} / _data / >>>>>>> {json_path}" >>>>>>>>> ``` >>>>>>>>> >>>>>>>>> best regards, >>>>>>>>> iilyak >>>>>>>>> >>>>>>>>> On 2019/02/04 23:22:09, Adam Kocoloski <kocol...@apache.org> wrote: >>>>>>>>>> I think it’s fine to start a focused discussion here as it might >>> help >>>>>>> inform some of the broader debate over in that thread. >>>>>>>>>> >>>>>>>>>> As a reminder, today CouchDB writes the entire body of each >>> document >>>>>>> revision on disk as a separate blob. Edit conflicts that have common >>> fields >>>>>>> between them do not share any storage on disk. The revision tree is >>> encoded >>>>>>> into a compact format and a copy of it is stored directly in both the >>> by_id >>>>>>> tree and the by_seq tree. Each leaf entry in the revision tree >>> contain a >>>>>>> pointer to the position of the associated doc revision on disk. >>>>>>>>>> >>>>>>>>>> As a further reminder, CouchDB 2.x clusters can generate edit >>> conflict >>>>>>> revisions just from multiple clients concurrently updating the same >>>>>>> document in a single cluster. 
This won’t happen when FoundationDB is >>>>>>> running under the hood, but users who deploy multiple CouchDB or >>> PouchDB >>>>>>> servers and replicate between them can of course still produce >>> conflicts >>>>>>> just like they could in CouchDB 1.x, so we need a solution. >>>>>>>>>> >>>>>>>>>> Let’s consider the two sub-topics separately: 1) storage of edit >>>>>>> conflict bodies and 2) revision trees >>>>>>>>>> >>>>>>>>>> ## Edit Conflict Storage >>>>>>>>>> >>>>>>>>>> The simplest possible solution would be to store each document >>>>>>> revision separately, like we do today. We could store document bodies >>> with >>>>>>> (“docid”, “revid”) as the key prefix, and each transaction could >>> clear the >>>>>>> key range associated with the base revision against which the edit is >>> being >>>>>>> attempted. This would work, but I think we can try to be a bit more >>> clever >>>>>>> and save on storage space given that we’re splitting JSON documents >>> into >>>>>>> multiple KV pairs. >>>>>>>>>> >>>>>>>>>> One thought I’d had is to introduce a special enum Value which >>>>>>> indicates that the subtree “beneath” the given Key is in conflict. For >>>>>>> example, consider the documents >>>>>>>>>> >>>>>>>>>> { >>>>>>>>>> “_id”: “foo”, >>>>>>>>>> “_rev”: “1-abc”, >>>>>>>>>> “owner”: “alice”, >>>>>>>>>> “active”: true >>>>>>>>>> } >>>>>>>>>> >>>>>>>>>> and >>>>>>>>>> >>>>>>>>>> { >>>>>>>>>> “_id”: “foo”, >>>>>>>>>> “_rev”: “1-def”, >>>>>>>>>> “owner”: “bob”, >>>>>>>>>> “active”: true >>>>>>>>>> } >>>>>>>>>> >>>>>>>>>> We could represent these using the following set of KVs: >>>>>>>>>> >>>>>>>>>> (“foo”, “active”) = true >>>>>>>>>> (“foo”, “owner”) = kCONFLICT >>>>>>>>>> (“foo”, “owner”, “1-abc”) = “alice” >>>>>>>>>> (“foo”, “owner”, “1-def”) = “bob” >>>>>>>>>> >>>>>>>>>> This approach also extends to conflicts where the two versions have >>>>>>> different data types. Consider a more complicated example where bob >>> dropped >>>>>>> the “active” field and changed the “owner” field to an object: >>>>>>>>>> >>>>>>>>>> { >>>>>>>>>> “_id”: “foo”, >>>>>>>>>> “_rev”: “1-def”, >>>>>>>>>> “owner”: { >>>>>>>>>> “name”: “bob”, >>>>>>>>>> “email”: “b...@example.com" >>>>>>>>>> } >>>>>>>>>> } >>>>>>>>>> >>>>>>>>>> Now the set of KVs for “foo” looks like this (note that a missing >>>>>>> field needs to be handled explicitly): >>>>>>>>>> >>>>>>>>>> (“foo”, “active”) = kCONFLICT >>>>>>>>>> (“foo”, “active”, “1-abc”) = true >>>>>>>>>> (“foo”, “active”, “1-def”) = kMISSING >>>>>>>>>> (“foo”, “owner”) = kCONFLICT >>>>>>>>>> (“foo”, “owner”, “1-abc”) = “alice” >>>>>>>>>> (“foo”, “owner”, “1-def”, “name”) = “bob” >>>>>>>>>> (“foo”, “owner”, “1-def”, “email”) = “b...@example.com” >>>>>>>>>> >>>>>>>>>> I like this approach for the common case where documents share >>> most of >>>>>>> their data in common but have a conflict in a very specific field or >>> set of >>>>>>> fields. >>>>>>>>>> >>>>>>>>>> I’ve encountered one important downside, though: an edit that >>>>>>> replicates in and conflicts with the entire document can cause a bit >>> of a >>>>>>> data explosion. Consider a case where I have 10 conflicting versions >>> of a >>>>>>> 100KB document, but the conflicts are all related to a single scalar >>> value. >>>>>>> Now I replicate in an empty document, and suddenly I have a kCONFLICT >>> at >>>>>>> the root. In this model I now need to list out every path of every >>> one of >>>>>>> the 10 existing revisions and I end up with a 1MB update. Yuck. 
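As a toy illustration of the kCONFLICT layout quoted above, here is a flattening sketch limited to top-level fields; `kCONFLICT` and `kMISSING` are placeholder sentinels, and a real implementation would recurse into nested objects (extending the key path) and use reserved tuple values instead:

```
kCONFLICT = "__conflict__"   # placeholder sentinels only; a real layout would use
kMISSING = "__missing__"     # reserved values the tuple layer cannot confuse with data

def flatten(docid, revisions):
    """revisions: {rev_id: body} for every live edit branch of one document."""
    kvs = {}
    fields = set()
    for body in revisions.values():
        fields.update(body)
    for field in sorted(fields):
        per_rev = {rev: body.get(field, kMISSING) for rev, body in revisions.items()}
        vals = list(per_rev.values())
        if all(v == vals[0] for v in vals[1:]):
            # every revision agrees: store the field once, with no rev in the key
            kvs[(docid, field)] = vals[0]
        else:
            # revisions disagree: mark the subtree and fan out one key per revision
            kvs[(docid, field)] = kCONFLICT
            for rev, val in per_rev.items():
                kvs[(docid, field, rev)] = val
    return kvs

# flatten("foo", {"1-abc": {"owner": "alice", "active": True},
#                 "1-def": {"owner": "bob",   "active": True}})
# -> {("foo", "active"): True,
#     ("foo", "owner"): kCONFLICT,
#     ("foo", "owner", "1-abc"): "alice",
#     ("foo", "owner", "1-def"): "bob"}
```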
That’s >>>>>>> technically no worse in the end state than the “zero sharing” case >>> above, >>>>>>> but one could easily imagine overrunning the transaction size limit >>> this >>>>>>> way. >>>>>>>>>> >>>>>>>>>> I suspect there’s a smart path out of this. Maybe the system >>> detects a >>>>>>> “default” value for each field and uses that instead of writing out >>> the >>>>>>> value for every revision in a conflicted subtree. Worth some >>> discussion. >>>>>>>>>> >>>>>>>>>> ## Revision Trees >>>>>>>>>> >>>>>>>>>> In CouchDB we currently represent revisions as a hash history tree; >>>>>>> each revision identifier is derived from the content of the revision >>>>>>> including the revision identifier of its parent. Individual edit >>> branches >>>>>>> are bounded in *length* (I believe the default is 1000 entries), but >>> the >>>>>>> number of edit branches is technically unbounded. >>>>>>>>>> >>>>>>>>>> The size limits in FoundationDB preclude us from storing the entire >>>>>>> key tree as a single value; in pathological situations the tree could >>>>>>> exceed 100KB. Rather, I think it would make sense to store each edit >>>>>>> *branch* as a separate KV. We stem the branch long before it hits the >>> value >>>>>>> size limit, and in the happy case of no edit conflicts this means we >>> store >>>>>>> the edit history metadata in a single KV. It also means that we can >>> apply >>>>>>> an interactive edit without retrieving the entire conflicted revision >>> tree; >>>>>>> we need only retrieve and modify the single branch against which the >>> edit >>>>>>> is being applied. The downside is that we duplicate historical >>> revision >>>>>>> identifiers shared by multiple edit branches, but I think this is a >>>>>>> worthwhile tradeoff. >>>>>>>>>> >>>>>>>>>> I would furthermore try to structure the keys so that it is >>> possible >>>>>>> to retrieve the “winning” revision in a single limit=1 range query. >>> Ideally >>>>>>> I’d like to proide the following properties: >>>>>>>>>> >>>>>>>>>> 1) a document read does not need to retrieve the revision tree at >>> all, >>>>>>> just the winning revision identifier (which would be stored with the >>> rest >>>>>>> of the doc) >>>>>>>>>> 2) a document update only needs to read the edit branch of the >>>>>>> revision tree against which the update is being applied, and it can >>> read >>>>>>> that branch immediately knowing only the content of the edit that is >>> being >>>>>>> attempted (i.e., it does not need to read the current version of the >>>>>>> document itself). >>>>>>>>>> >>>>>>>>>> So, I’d propose a separate subspace (maybe “_meta”?) for the >>> revision >>>>>>> trees, with keys and values that look like >>>>>>>>>> >>>>>>>>>> (“_meta”, DocID, IsDeleted, RevPosition, RevHash) = [ParentRev, >>>>>>> GrandparentRev, …] >>>>>>>>>> >>>>>>>>>> The inclusion of IsDeleted, RevPosition and RevHash in the key >>> should >>>>>>> be sufficient (with the right encoding) to create a range query that >>>>>>> automatically selects the “winner” according to CouchDB’s arcane >>> rules, >>>>>>> which are something like >>>>>>>>>> >>>>>>>>>> 1) deleted=false beats deleted=true >>>>>>>>>> 2) longer paths (i.e. higher RevPosition) beat shorter ones >>>>>>>>>> 3) RevHashes with larger binary values beat ones with smaller >>> values >>>>>>>>>> >>>>>>>>>> =========== >>>>>>>>>> >>>>>>>>>> OK, that’s all on this topic from me for now. 
I think this is a >>>>>>> particularly exciting area where we start to see the dividends of >>> splitting >>>>>>> up data into multiple KV pairs in FoundationDB :) Cheers, >>>>>>>>>> >>>>>>>>>> Adam >>>>>>>>>> >>>>>>>>>> >>>>>>>>>>> On Feb 4, 2019, at 2:41 PM, Robert Newson <rnew...@apache.org> >>> wrote: >>>>>>>>>>> >>>>>>>>>>> This one is quite tightly coupled to the other thread on data >>> model, >>>>>>> should we start much conversation here before that one gets closer to >>> a >>>>>>> solution? >>>>>>>>>>> >>>>>>>>>>> -- >>>>>>>>>>> Robert Samuel Newson >>>>>>>>>>> rnew...@apache.org >>>>>>>>>>> >>>>>>>>>>> On Mon, 4 Feb 2019, at 19:25, Ilya Khlopotov wrote: >>>>>>>>>>>> This is a beginning of a discussion thread about storage of edit >>>>>>>>>>>> conflicts and everything which relates to revisions. >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>> >>>>>>> >>>>>>> -- >>>>>>> Professional Support for Apache CouchDB: >>>>>>> https://neighbourhood.ie/couchdb-support/ >>>>>>> >>>>>>> >>>>>> >>> >>>