On Thu, Mar 22, 2012 at 11:26 AM, Bela Ban <[email protected]> wrote:
> Thanks Dan,
>
> here are some comments / questions:
>
> "2. Each cache member receives the ownership information from the
> coordinator and starts rejecting all commands started in the old cache view"
>
> How do you know a command was started in the old cache view; does this
> mean you're shipping a cache view ID with every request ?

Yes, the plan is to ship the cache view id with every command - I mentioned
in the "Command Execution During State Transfer" section that we set the
view id in the command, but I should make it clear that we also ship it to
remote nodes.
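To make that concrete, this is roughly the kind of check I have in mind on
the receiving side (a minimal sketch, the class and exception names are made
up, not the actual interceptor code):

import java.util.concurrent.atomic.AtomicInteger;

// Sketch only - the real command/interceptor classes will look different.
final class CacheViewIdCheck {
   private final AtomicInteger installedViewId = new AtomicInteger();

   void installView(int newViewId) {
      installedViewId.set(newViewId);
   }

   /** Every remote command carries the view id it was started in. */
   void checkCommand(int commandViewId) {
      int current = installedViewId.get();
      if (commandViewId < current) {
         // Started in the old cache view - reject it and let the
         // originator retry against the owners in the new view.
         throw new IllegalStateException("Command started in cache view "
               + commandViewId + ", but view " + current + " is installed");
      }
   }
}

(The opposite case - a command carrying a newer view id than the one
installed locally - is the blocking case from point 2.1 below.)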
> "2.1. Commands with the new cache view id will be blocked until we have
> installed the new CH and we have received lock information from the
> previous owners"
>
> Doesn't this make this design *blocking* again ? Or do you queue
> requests with the new view-ID, return immediately and apply them when
> the new view-id is installed ? If the latter is the case, what do you
> return ? An OK (how do you know the request will apply OK) ?

You're right, it will be blocking. However, the in-flight transaction data
should be much smaller than the entire state, so I'm hoping that this
blocking phase won't be as painful as it is now. Furthermore, commands
waiting for a lock or for a sync RPC won't block state transfer anymore, so
deadlocks between writes and state transfer (that now end with the write
command timing out) should be impossible.

I am not planning for any queuing at this point - we could only queue if we
had the same order between the writes on all the nodes, otherwise we would
get inconsistencies. For async commands we will have an implicit queue in
the JGroups thread pool, but we won't have anything for sync commands.

> "A merge can be coalesced with multiple state transfers, one running in
> each partition. So in the general case a coalesced state transfer
> contains a tree of cache view changes."
>
> Hmm, this can make a state transfer message quite large. Are we trimming
> the modification list ? E.g. if we have 10 PUTs on K, 1 removal of K, and
> another 4 PUTs, do we just send the *last* PUT, or do we send a
> modification list of 15 ?

Yep, my idea was to keep only a set of keys that have been modified for
each cache view id. The second put to the same key wouldn't modify the set.
When we transfer the entries to the new owners, we iterate the data
container and only send each key once.

Before writing the key to this set (and the entry to the data container),
the primary owner will synchronously invalidate the key not only on the L1
requestors, but on all the owners in the previous cache views (that are no
longer owners in the latest view). This will ensure that the new owner will
only receive one copy of each key and won't have to buffer and sort the
state received from all the previous owners.
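In code, the bookkeeping would be something like this (a rough sketch with
invented names; the synchronous invalidation of the old owners would happen
in onWrite() before the entry goes into the data container, but I left it
out here):

import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Sketch only - tracks which keys were modified in each cache view, so that
// when we push state we can iterate the data container and send each key
// exactly once, instead of shipping a full modification list.
final class ModifiedKeyTracker<K> {
   private final ConcurrentMap<Integer, Set<K>> modifiedKeysByViewId =
         new ConcurrentHashMap<Integer, Set<K>>();

   /** Called by the primary owner before it writes the entry to the data container. */
   void onWrite(int cacheViewId, K key) {
      Set<K> keys = modifiedKeysByViewId.get(cacheViewId);
      if (keys == null) {
         Set<K> newSet = Collections.newSetFromMap(new ConcurrentHashMap<K, Boolean>());
         Set<K> existing = modifiedKeysByViewId.putIfAbsent(cacheViewId, newSet);
         keys = existing != null ? existing : newSet;
      }
      // A second put to the same key doesn't change the set.
      keys.add(key);
   }

   /** The keys to push to the new owners for a given cache view. */
   Set<K> keysModifiedIn(int cacheViewId) {
      Set<K> keys = modifiedKeysByViewId.get(cacheViewId);
      return keys != null ? keys : Collections.<K>emptySet();
   }
}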
> "Get commands can write to the data container as well when L1 is
> enabled, so we can't block just write commands."
>
> Another downside of the L1 cache being part of the regular cache. IMO it
> would be much better to separate the 2, as I wrote in previous emails
> yesterday.

I haven't read those yet, but I'm not sure moving the L1 cache to another
container would eliminate the problem completely. It's probably not clear
from the document, but if we empty the L1 on cache view changes then we
have to ensure that we don't write to L1 anything from an old owner.
Otherwise we're left with a value in L1 that none of the new owners knows
about and won't invalidate on a put to that key.

> "The new owners will start receiving commands before they have received
> all the state
>   * In order to handle those commands, the new owners will have to
> get the values from the old owners
>   * We know that the owners in the later views have a newer version
> of the entry (if they have it at all). So we need to go back on the
> cache views tree and ask all the nodes on one level at the same time -
> if we don't get any certain answer we go to the next level and repeat."
>
> How does a member C know that it *will* receive any state at all ? E.g.
> if we had key K on A and B, and now B crashed, then A would push a copy
> of K to C.
>
> So when C receives a request R from another member E, but hasn't
> received a state transfer from A yet, how does it know whether to apply
> or queue R ? Does C wait until it gets an END-OF-ST message from the
> coordinator ?

Yep, each node will signal to the coordinator that it has finished pushing
data, and when the coordinator gets the confirmation from all the nodes it
will broadcast the end-of-ST message to everyone.
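The coordinator side of that handshake would look roughly like this (sketch
only, the names are made up):

import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

// Sketch of the end-of-state-transfer handshake on the coordinator: every
// old owner confirms that it has finished pushing its data, and once all of
// them have confirmed we broadcast the end-of-ST message to the whole view.
final class StateTransferCompletionTracker<NODE> {
   private final Set<NODE> pendingPushers;
   private final Runnable broadcastEndOfStateTransfer;

   StateTransferCompletionTracker(Collection<NODE> pushers,
                                  Runnable broadcastEndOfStateTransfer) {
      this.pendingPushers = new HashSet<NODE>(pushers);
      this.broadcastEndOfStateTransfer = broadcastEndOfStateTransfer;
   }

   /** Called when a node signals that it has finished pushing its state. */
   synchronized void confirmPushCompleted(NODE node) {
      if (pendingPushers.remove(node) && pendingPushers.isEmpty()) {
         // All the old owners have pushed their data - tell everyone, so the
         // new owners know they won't receive any more state and can stop
         // going back to the previous owners for missing values.
         broadcastEndOfStateTransfer.run();
      }
   }
}

Until a new owner sees that message, it has to assume it may still receive
state (or it has to fetch the value from the previous owners, as above).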
> (skipping the rest due to exhaustion by complexity :-))
>
> Over the last couple of days, I've exchanged a couple of emails with the
> Cloud-TM guys, and I'm more and more convinced that their total order
> solution is the simpler approach to (1) transactional updates and (2)
> state transfer. They don't have a solution for (2) yet, but I believe
> this can be done as another totally ordered transaction, applied in the
> correct location within the update stream. Or, we could possibly use a
> flush: as we don't need to wait for pending TXs to complete and release
> their locks, this should be quite fast.

Sounds intriguing, but I'm not sure how making the entire state transfer a
single transaction would allow us to handle transactions while that state
is being transferred. Blocking state transfer works with 2PC already.
Well, at least as long as we don't have any merges...

> So my 5 cents:
>
> #1 We should focus on the total order approach, and get rid of the 2PC
> and locking business for transactional updates

The biggest problem I remember total order having is TM transactions that
have other participants (as opposed to cache-only transactions). I haven't
followed the TO discussion on the mailing list very closely, does that work
now?

Regarding state transfer in particular, remember, non-blocking state
transfer with 2PC sounded very easy as well, before we got into the
details. The proof-of-concept you had running on the 4.2 branch was much
simpler than what we have now, and even what we have now doesn't cover
everything.

> #2 Really focus on the eventual consistency approach

Again, it's very tempting, but I fear much of what makes eventual
consistency tempting is that we don't know enough of its implementation
details yet. Manik has been investigating eventual consistency for a while,
I wonder what he thinks...

Cheers
Dan

_______________________________________________
infinispan-dev mailing list
[email protected]
https://lists.jboss.org/mailman/listinfo/infinispan-dev