On 10/09/2014 04:41 PM, Dan Berindei wrote:
> On Thu, Oct 9, 2014 at 3:40 PM, William Burns <[email protected]> wrote:
>
> > Actually this was something I was hoping to get to, possibly in the
> > near future.
> >
> > I already have to do https://issues.jboss.org/browse/ISPN-4358, which
> > will require rewriting parts of the distributed entry iterator. In
> > doing so I was planning on breaking this out into a more generic
> > framework where you could run a given operation by segment,
> > guaranteeing it is run only once per entry. I was thinking I could
> > then move M/R on top of this, to allow it to also be resilient to
> > rehash events.
> >
> > Additional comments inline.
> >
> > On Thu, Oct 9, 2014 at 8:18 AM, Emmanuel Bernard <[email protected]> wrote:
> > > Pedro and I have been having discussions with the LEADS guys on
> > > their experience of Map/Reduce, especially around stability during
> > > topology changes.
> > >
> > > This ties to the .size() thread you guys have been exchanging on
> > > (I could only read it partially).
> > >
> > > On the requirements, theirs is pretty straightforward and, I think,
> > > expected by most users. They are fine with inconsistencies due to
> > > entries created/updated/deleted between the M/R start and the end.
> >
> > There is no way we can fix this without adding a very strict
> > isolation level like SERIALIZABLE.
> >
> > > They are *not* fine with seeing the same key/value several times
> > > for the duration of the M/R execution. This AFAIK can happen when
> > > a topology change occurs.
> >
> > This can happen if the entry was processed on one node and then a
> > rehash migrates it to another node, which runs it there as well.
> >
> > > Here is a proposal. Why not run the M/R job not per node but
> > > rather per segment? The point is that segments are stable across
> > > topology changes. The M/R tasks would then be about iterating over
> > > the keys in a given segment.
> > >
> > > The M/R request would send the task per segment to each node where
> > > the segment is primary.
> >
> > This is exactly what the iterator does today, but it also watches
> > for rehashes, to send the request to a new owner when the segment
> > moves between nodes.
> >
> > > (We can imagine interesting things like sending it to one of the
> > > backups for workload optimization purposes, or sending it to both
> > > primary and backups and doing comparisons.)
> > >
> > > The M/R requester would be in an interesting situation. It could
> > > detect that a segment's M/R never returns and trigger a new
> > > computation on a node other than the one it was initially sent to.
> > >
> > > One tricky question around that is when the M/R job stores data in
> > > an intermediary state. We need some sort of way to expose the user
> > > indirectly to segments, so that we can evict per-segment
> > > intermediary caches in case of failure or retry.
> >
> > This was one place I was thinking I would need to take special care
> > to look into when doing a conversion like this.
>
> I'd rather not expose this to the user. Instead, we could split the
> intermediary values for each key by the source segment, and do the
> invalidation of the retried segments in our M/R framework (e.g. when we
> detect that the primary owner at the start of the map/combine phase is
> not an owner at all at the end).
>
> I think we have another problem with the publishing of intermediary
> values not being idempotent. The default configuration for the
> intermediate cache is non-transactional, and retrying the put(delta)
> command after a topology change could add the same intermediate values
> twice.
> A transactional intermediary cache should be safe, though, because the
> tx won't commit on the old owner until the new owner knows about the
> tx.
Can you elaborate on it?

Anyway, I think the retry mechanism should solve it. If we detect a
topology change (during the iteration of segment _i_) and segment _i_ is
moved, then we can cancel the iteration, remove all the intermediate
values generated for segment _i_ and restart (on the new primary owner).
I've put a couple of rough sketches of what I mean at the end of this
mail.

> > > But before getting ahead of ourselves, what do you think of the
> > > general idea? Even without a retry framework, this approach would
> > > be more stable than our current per-node approach during topology
> > > changes and improve dependability.
> >
> > Doing it solely based on segment would remove the possibility of
> > having duplicates. However, without a mechanism to send a new request
> > on rehash it would be possible to only find a subset of values (if a
> > segment is removed while iterating on it).
> >
> > > Emmanuel
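For what it's worth, here is a rough sketch of the retry loop I have in
mind. None of these names exist in Infinispan today (SegmentTask,
IntermediateStore and SegmentMovedException are all made up for
illustration); it only shows the control flow: run the map/combine phase
for a segment on its current primary owner, and if a rebalance moves the
segment away mid-iteration, wipe that segment's intermediate values and
retry on the new owner.

import java.util.Set;

// All names below are made up for illustration; none of this is
// existing Infinispan API.

class SegmentMovedException extends Exception {
}

interface SegmentTask {
   // Runs the map/combine phase over every key of one segment on the
   // node that is currently its primary owner; throws if a rebalance
   // moves the segment away before the iteration finishes.
   void runOnPrimaryOwner(int segment) throws SegmentMovedException;
}

interface IntermediateStore {
   // Drops every intermediate value tagged with the given source
   // segment, so that a retry cannot publish the same value twice.
   void invalidateSegment(int segment);
}

class PerSegmentMapReduce {
   private final SegmentTask task;
   private final IntermediateStore intermediates;

   PerSegmentMapReduce(SegmentTask task, IntermediateStore intermediates) {
      this.task = task;
      this.intermediates = intermediates;
   }

   // Each segment is processed until it completes on a stable owner:
   // on a topology change we cancel, wipe the segment's intermediate
   // values and restart on whoever is the primary owner now.
   void run(Set<Integer> segments) {
      for (int segment : segments) {
         boolean done = false;
         while (!done) {
            try {
               task.runOnPrimaryOwner(segment);
               done = true;
            } catch (SegmentMovedException moved) {
               intermediates.invalidateSegment(segment);
            }
         }
      }
   }
}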

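And a sketch of Dan's idea of splitting the intermediary values by
source segment (again a made-up type, not an existing Infinispan class):
if the intermediate cache key carries the source segment next to the key
emitted by the mapper, then invalidateSegment() above becomes a removal
of all entries whose srcSegment matches, and a retried segment can never
contribute the same value twice.

import java.util.Objects;

// Made-up illustration type, not Infinispan API: an intermediate cache
// key that remembers which segment the mapped entry came from.
final class IntermediateKey<K> {
   final K reducerKey;   // the key emitted by the mapper
   final int srcSegment; // segment of the original cache entry

   IntermediateKey(K reducerKey, int srcSegment) {
      this.reducerKey = reducerKey;
      this.srcSegment = srcSegment;
   }

   @Override
   public boolean equals(Object o) {
      if (this == o) return true;
      if (!(o instanceof IntermediateKey)) return false;
      IntermediateKey<?> that = (IntermediateKey<?>) o;
      return srcSegment == that.srcSegment
            && Objects.equals(reducerKey, that.reducerKey);
   }

   @Override
   public int hashCode() {
      return Objects.hash(reducerKey, srcSegment);
   }
}

The cost is that the reduce phase has to merge the values of all
IntermediateKeys that share the same reducerKey, but that seems like a
fair price for cheap per-segment invalidation.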