On Thu, Feb 7, 2019 at 9:35 PM Adam Kocoloski <kocol...@apache.org> wrote:
>
> Bob, Garren, Jan - heard you loud and clear, K.I.S.S. I do think it’s a
> bit “simplistic” to exclusively choose simplicity over performance and
> storage density. We’re (re)building a database here, one that has some
> users with pretty demanding performance and scalability requirements.
> And yes, we should certainly be testing and measuring. Kyle and team
> are setting up infrastructure in IBM land to help with that now, but I
> also believe we can design the data model and architecture with a basic
> performance model of FoundationDB in mind:
>
> - reads cost 1ms
> - short range reads are the same cost as a single lookup
> - reads of independent parts of the keyspace can be parallelized for cheap
> - writes are zero-cost until commit time
>
> We ought to be able to use these assumptions to drive some decisions
> about data models ahead of any end-to-end performance test.
>
> If there are specific elements of the edit conflicts management where
> you think greater simplicity is warranted, let’s get those called out.
> Ilya noted (correctly, in my opinion) that the term sharing stuff is
> one of those items. It’s relatively complex, potentially a performance
> hit, and only saves on storage density in the corner case of lots of
> edit conflicts. That’s a good one to drop.
>
> I’m relatively happy with the revision history data model at this
> point. Hopefully folks find it easy to grok, and it’s efficient for
> both reads and writes. It costs some extra storage for conflict
> revisions compared to the current tree representation (up to 16K per
> edit branch, with default _revs_limit) but knowing what we know about
> the performance death spiral for wide revision trees today I’ll happily
> make a storage vs. performance tradeoff here :)
>
> Setting the shared term approach aside, I’ve still been mulling over
> the key structure for the actual document data:
>
> - I thought about trying to construct a special _conflicts subspace,
> but I don’t like that approach because the choice of a “winning”
> revision can flip back and forth very quickly with concurrent writers
> to different edit branches. I think we really want to have a way for
> revisions to naturally sort themselves so the winner is the first or
> last revision in a list.
>
> - Assuming we’re using key paths of the form (docid, revision-ish,
> path, to, field), the goal here is to find an efficient way to get the
> last key with prefix “docid” (assuming winner sorts last), and then all
> the keys that share the same (docid, revision-ish) prefix as that one.
> I see two possible approaches so far, neither perfect:
>
> Option 1: Execute a get_key() operation with a key selector that asks
> for the last key less than “docid\xFF” (again assuming winner sorts
> last), and then do a get_range_startswith() request setting the
> streaming mode to “want_all” and the prefix to the docid plus whatever
> revision-ish we found from the get_key() request. This is two
> roundtrips instead of one, but it always retrieves exactly the right
> set of keys, and the second step is executed as fast as possible.
>
> Option 2: Jump straight to a get_range_startswith() request using only
> “docid” as the prefix, then cancel the iteration once we reach a
> revision not equal to the first one we see. We might transfer too much
> data, or we might end up doing multiple roundtrips if the default
> “iterator” streaming mode sends too little data to start (I haven’t
> checked what the default iteration block is there), but in the typical
> case of zero edit conflicts we have a good chance of retrieving the
> full document in one roundtrip.
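Before digging in, here's a rough sketch of how I'd picture the two
options with the Python bindings. To be clear: this is untested, the
read_winner_* names are mine, and I'm pretending revision-ish is a
single tuple element for simplicity:

    import fdb

    fdb.api_version(600)

    @fdb.transactional
    def read_winner_two_trips(tr, docid):
        # Option 1: get_key() to locate the winning revision, then a
        # want_all range read scoped to exactly (docid, revision-ish).
        doc_range = fdb.tuple.range((docid,))
        last = tr.get_key(fdb.KeySelector.last_less_than(doc_range.stop))
        if not last.startswith(fdb.tuple.pack((docid,))):
            return None  # no keys for this docid
        revish = fdb.tuple.unpack(last)[1]
        return list(tr.get_range_startswith(
            fdb.tuple.pack((docid, revish)),
            streaming_mode=fdb.StreamingMode.want_all))

    @fdb.transactional
    def read_winner_one_trip(tr, docid):
        # Option 2: scan the docid prefix in reverse (winner sorts
        # last) and cancel once we see a second revision-ish value.
        kvs, winner = [], None
        for kv in tr.get_range_startswith(fdb.tuple.pack((docid,)),
                                          reverse=True):
            revish = fdb.tuple.unpack(kv.key)[1]
            if winner is None:
                winner = revish
            elif revish != winner:
                break
            kvs.append(kv)
        kvs.reverse()  # put keys back in ascending order
        return kvs or None

    # Usage: pass the database handle; the decorator runs the
    # transaction and retries as needed.
    # doc_kvs = read_winner_two_trips(db, u'mydoc')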
I'm working through getting fdb bindings written and fully tested, so I
have a couple of notes on the range streaming bits. The first somewhat
surprising one was that the want_all streaming mode doesn't actually
return all rows in a single call. If the range is large enough, the
client has to invoke get_range a number of times, tweaking parameters
to know when to stop. I just ran into that last night but discovered
the logic isn't terribly complicated. The Python implementation is at
[1] for reference.

> I don’t have a good sense of which option wins out here from a
> performance perspective, but they’re both operating on the same data
> model so easy enough to test the alternatives. The important bit is
> getting the revision-ish things to sort correctly. I think we can do
> that by generating something like

There are a number of different streaming modes, and my wager is that
we'll have to play with them a bit to get an idea of which is best;
likely it'll be a question of which is best in which scenario. For
instance, when we have a defined range, want_all likely wins, but in
the "iterate until next revision found" situation I'd wager the
`iterator` mode would be best, as it appears to adaptively size row
batches.

> revision-ish = NotDeleted/1bit : RevPos : RevHash
>
> with some suitable order-preserving encoding on the RevPos integer.

The tuple layer already supports minimally sized order-preserving
encodings for integers [2], so reusing that is enough. Shrinking the
deleted flag into one bit may be a bit of a bother, as the tuple layer
encodes booleans as a single byte (though it's the single type byte,
not a type+value byte). A possibly simple approach to shrinking that
would be to encode the RevPos as a negative number for all deleted
revisions. That seems legit, though it's definitely registering on my
"too cute" meter, so I may be missing a reason that'd break before I'm
properly caffeinated.

Paul

[1] https://github.com/apple/foundationdb/blob/6.0.15/bindings/python/fdb/impl.py#L358-L395
[2] https://github.com/apple/foundationdb/blob/master/design/tuple.md#integer
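P.S. A tiny sketch of the negative-RevPos idea with the tuple layer, in
case it helps the discussion (pack_revish is just a name I made up, and
this is off the top of my head, not tested against a cluster):

    import fdb

    fdb.api_version(600)

    def pack_revish(deleted, rev_pos, rev_hash):
        # Fold the deleted flag into the sign of RevPos: deleted
        # revisions get a negative position, so they sort before all
        # live revisions without spending a byte on a boolean. (The
        # tuple layer's integer encoding is order-preserving across
        # negative values too.)
        pos = -rev_pos if deleted else rev_pos
        return fdb.tuple.pack((pos, rev_hash))

    # Live revisions sort by RevPos:
    assert pack_revish(False, 1, b'aaa') < pack_revish(False, 2, b'bbb')
    # Every deleted revision sorts before every live one:
    assert pack_revish(True, 9, b'ccc') < pack_revish(False, 1, b'aaa')
    # One wrinkle: among deleted revisions the RevPos order inverts
    # (rev 9 sorts before rev 2), which may or may not matter here.
    assert pack_revish(True, 9, b'ccc') < pack_revish(True, 2, b'ddd')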