On Tue, Feb 18, 2014 at 9:22 AM, Eric Grant <[email protected]> wrote:
> On Feb 17, 2014, at 10:50 AM, Dmitriy V'jukov wrote:
>
>> On Sun, Feb 16, 2014 at 6:58 AM, Eric Grant <[email protected]> wrote:
>>>
>>> <snip>
>>>
>>> 1: mov node, prev             // for 'xchg' below
>>>
>>> .rollback_enter
>>>
>>> 2: movq $0, NEXT(node)        // node->next = 0
>>> 3: mov TAIL(queue), scratch   // scratch = queue->tail
>>> 4: mov NEXT(scratch), scratch // scratch = scratch->next (ignore result)
>>> 5: xchg prev, (tail)          // prev = XCHG (&tail, node)
>>>
>>> .rollforward_enter
>>>
>>> 6: mov node, NEXT(prev)       // prev->next = node
>>>
>>> .rollforward_leave
>>
>> Hi!
>
> Hi Dmitriy,
>
>> How do you prevent the data from getting evicted from memory between 5 and 6?
>
> The data will not be evicted from memory between 5 and 6, because paging a
> block of data out from RAM to disk and replacing it with other data from disk
> is a Long Complicated Operation that will take the kernel's virtual memory
> subsystem *hundreds* of instructions to accomplish, involving the acquisition
> of one or more locks. By definition, the latency in our routine is at most
> *four* instructions that execute in sequence without interruption.
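(For reference, here is a rough C11-atomics rendering of the quoted enqueue
sequence; the type and function names are illustrative, not taken from Eric's
code. The window under discussion sits between instructions 5 and 6.)

```c
#include <stdatomic.h>
#include <stddef.h>

/* Illustrative C11 rendering of the quoted enqueue sequence.
 * Names (mpsc_node, mpsc_queue, mpsc_push) are assumed, not Eric's. */

typedef struct mpsc_node {
    _Atomic(struct mpsc_node *) next;
} mpsc_node;

typedef struct {
    _Atomic(mpsc_node *) tail;   /* most recently enqueued node */
} mpsc_queue;

void mpsc_push(mpsc_queue *q, mpsc_node *node)
{
    /* 2: node->next = 0 */
    atomic_store_explicit(&node->next, NULL, memory_order_relaxed);

    /* (Instructions 3-4 in the assembly are plain reads of queue->tail
     * and its next field; they change no state and are omitted here.) */

    /* 5: prev = XCHG(&tail, node) -- the node now owns the tail,
     * but is not yet reachable from its predecessor.            */
    mpsc_node *prev =
        atomic_exchange_explicit(&q->tail, node, memory_order_acq_rel);

    /* <-- the "window of inconsistency": preemption or a page
     *     fault here leaves consumers unable to reach `node`.  */

    /* 6: prev->next = node */
    atomic_store_explicit(&prev->next, node, memory_order_release);
}
```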
The page eviction can be initiated on another core and "take effect" exactly
between 5 and 6.

>> If you consider things like page faults and interrupts, I do not think
>> that you can solve the problem of forward progress in user space.
>
> You are certainly correct in this regard. We can instruct the kernel to do
> things like roll forward and roll back only if the kernel is prepared to heed
> these instructions. That means modifying the kernel source, or at least
> adding code to the kernel dynamically, e.g., a kernel module on Linux or a
> kernel extension on Mac OS.
>
>> Even if you solve the problem on the producer side, what if a consumer hits
>> a page fault right after reading an element? It will have essentially the
>> same effect: a lack of forward progress.
>
> I'm not sure what you mean by this. If a consumer thread faults or is
> otherwise preempted *before* executing the CAS (cmpxchg[16b] on x86-64) that
> dequeues a node, that preemption will not prevent other consumer threads from
> dequeuing the node while the first thread is suspended. If the consumer
> thread is preempted *after* the CAS that successfully dequeues a node, then
> again that preemption will not prevent other consumer threads from dequeuing
> whatever nodes remain. The concept I presented is not intended to guarantee
> any particular rate of forward progress; it is intended to prevent the
> "inconsistency" (as you yourself called it) that occurs if preemption happens
> at one critical location in the enqueue routine.

I meant that what matters in the end is user-visible properties like latency
and bandwidth, and that solving just the "window of inconsistency" problem,
by itself, buys you nothing.

>> If you are in kernel space, then you can ensure some of the properties by
>> disabling interrupts around the operations. Then you can be sure that each
>> operation finishes in N machine instructions.
>> But even then, if you have a simple MOV instruction and the data is in
>> memory, how long can it take? There is actually no upper bound. And this
>> gets us back to where we started.
>> So I argue that all that matters is common-case performance. And my
>> algorithm is already quite good from this point of view.
>
> I agree that your algorithm is already quite good. Wouldn't it be better if
> the window of inconsistency were eliminated?

It definitely would be better. The question that interests me is exactly how
much better it would be, and I suspect the answer is "not that much". I may
be wrong. What is your expectation?

>> If you go for hard real time, you need to start by designing and
>> manufacturing your own hardware, then your own OS. Only then can you have
>> guarantees that some operations will finish in a bounded amount of time.
>> Until then, lock-free/wait-free algorithms are illusory.
>
> I didn't offer any guarantee of hard real time or of a bounded amount of time
> to finish. I merely offered a (theoretical) improvement on an already
> efficient algorithm. You are correct, though, that (as noted above) the
> improvement is in the realm of the OS, not pure user-space code.
>
> If we can eliminate the window of inconsistency, then we can vastly simplify
> the dequeue routine. I will present that modification in another post.
>
> Cheers,
> Eric
>
> --
>
> ---
> You received this message because you are subscribed to the Google Groups
> "Scalable Synchronization Algorithms" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/lock-free/ACFA6EEA-D299-42A9-9191-006F1C5D6CAE%40eagrant.com.
> For more options, visit https://groups.google.com/groups/opt_out.
To view this discussion on the web visit
https://groups.google.com/d/msgid/lock-free/CAEeQi3u7SXeWSZ36SvMYQ55v2xYDWoiYcbHLSBL8kpgJsPdArA%40mail.gmail.com.
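(Postscript for readers following the thread: what a consumer actually
observes during the window can be simulated without threads. The sketch
below is self-contained and all names are again illustrative; it lets you
perform the producer's XCHG by hand, stop inside the window, and watch the
consumer read next == NULL even though tail has already moved.)

```c
#include <stdatomic.h>
#include <stddef.h>

/* Self-contained sketch of what a consumer sees during the window.
 * Names are illustrative. Single consumer, as in the base design. */

typedef struct node {
    _Atomic(struct node *) next;
} node;

typedef struct {
    _Atomic(node *) tail;  /* shared with producers */
    node           *head;  /* stub node; consumer-private */
} queue;

/* Returns NULL both when the queue is empty and when a producer is
 * parked inside the window; a full implementation distinguishes the
 * two cases by also examining tail. */
node *try_pop(queue *q)
{
    node *next =
        atomic_load_explicit(&q->head->next, memory_order_acquire);
    if (next == NULL)
        return NULL;
    q->head = next;
    return next;
}
```

Driving it by hand: initialize head and tail to a stub node, exchange tail
to a new node but delay the link store; try_pop reports "empty" (the window),
and succeeds only after the predecessor's next pointer is finally written.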
