Hi Adam,

On 06/15/2012 11:44 PM, Adam Hraska wrote:
> Visibility vs MBs
> -----------------
> I base the following discussion on [18, 19, 20].
>
> In order for a store on one cpu (CPU_1) to become
> visible to a load on another cpu (CPU_2):
> 1) CPU_1 must execute a MB after the store (wrt
> instruction order).
> 2) After CPU_1's MB completes (wrt cache-coherency
> bus traffic) CPU_2 must execute a MB. Then (wrt
> instruction order) CPU_2 can issue a load which
> is guaranteed to see the stored value.
>
> If either of the MBs is omitted, CPU_2 may never
> load the stored value (even if it loads it in a loop).
> In practice, CPU_2 will eventually see the new value.
> Due to having a store buffer and an invalidate queue
> limited in size, a cpu in the system would eventually
> have to stall if stores were to be hidden indefinitely
> [18].
>
> What I would like to know is how long it takes for
> CPU_2 to first see CPU_1's store if CPU_2 omits its
> MB, but CPU_1 does not. CPU_2's cache may be busy, so
> its invalidate queue may not get to be processed [18].
> However, CPU_2's performance is not affected. It
> can continue working with its cache without worrying
> about the inv. queue (that's why the queue is not
> being processed by the cache in the first place - the
> cpu is busy using its cache). Therefore, CPU_2 has no
> motivation to process the queue. Even if the queue
> becomes full, at worst some other cpu will stall, but
> not CPU_2.
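Just to check that I understand the scenario you are asking about: it is
the one below, with the barrier from point 2) dropped, right? (Only a
sketch of mine; mb() stands for whatever full-barrier primitive the given
architecture provides, and the variable names are made up.)

  /* Shared variables, both initially 0. */
  volatile int a = 0;
  volatile int b = 0;

  void cpu1(void)     /* runs on CPU_1 */
  {
      a = 1;
      mb();           /* point 1): barrier after the store */
      b = 1;
  }

  void cpu2(void)     /* runs on CPU_2; its barrier is omitted */
  {
      while (b == 0)
          ;           /* how long can this spin before the new b is seen? */
      /* the mb() from point 2) is intentionally missing here */
      int r = a;      /* may or may not observe a == 1 at this point */
      (void) r;
  }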
Interestingly, there are such examples in [18]:

  a = 0
  b = 0

  CPU_1: a = 1; mb(); b = 1;

  CPU_2: while (b == 0) continue; mb(); assert(a == 1);

The two mb()'s are essential so that the assertion is not hit, but note
that there is no mb() in front of the while loop, which essentially
monitors a change of state of a shared variable without issuing any
barrier. Therefore it seems to me that the barriers are only useful to
ensure that all processors observe the same ordering of two or more
memory operations.

From [18] and your logic, it would seem that CPU_2 could theoretically
loop indefinitely. I'd tend to think that on any reasonable CPU
architecture this would be either impossible (by forcing processing of
the invalidation queues every now and then) or highly unlikely
(comparable to the potentially unbounded looping on a spinlock).

Btw, [18] is an interesting read, thanks for sharing the link!

Jakub
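P.S. My understanding is that the two full mb()'s in the example above
could be replaced by an ordering-only pair - a write barrier on CPU_1 and
a read barrier on CPU_2 - which is another hint that their job is to
enforce an ordering rather than to "flush" anything. A sketch (again mine,
with wmb()/rmb() standing in for whatever write/read barrier primitives
the architecture provides):

  volatile int a = 0;
  volatile int b = 0;

  void cpu1(void)     /* producer */
  {
      a = 1;
      wmb();          /* order the store to a before the store to b */
      b = 1;
  }

  void cpu2(void)     /* consumer */
  {
      while (b == 0)
          ;           /* still no barrier needed in front of the loop */
      rmb();          /* order the load of b before the later load of a */
      assert(a == 1); /* holds once the loop has exited */
  }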
