Some points that may help with expectations and understanding:

- mprotect() works on a process as a whole: it applies to the process's 
(single) address space, and to all threads in that process that share that 
address space. There are no per-thread mprotect semantics (at least not in 
most OSs I know of).

- The effect of mprotect is not guaranteed to (and should never be expected 
to) appear atomically to all threads. The order in which protection changes 
apply can vary, and is virtually impossible to predict.

- TLB caches are [on all hardware I've ever played with] not coherent. TLB 
invalidates apply to logical CPU cores, not to threads.

- In practice, mprotect implementations typically impose their semantic 
changes by modifying a memory-resident page table, followed by 
TLB-invalidate request signals (an interrupt, typically) to all processor 
(logical) cores that are executing threads in the process. Once all 
involved cores respond to the TLB invalidate request, the change is known 
to be committed, as no thread in the process can observe the pre-change 
page table entry state.

- As any thread could be executing on any core, and one core typically has 
no idea which specific thread another core is running, the order of 
invalidation across threads can vary erratically.

So while I think that your more specific statements about case (2) above 
will hold, the (3) transitive thing between a thread that observed a fault 
and one that didn't is unlikely.

BTW, one of the interesting APIs we use for performance with the C4 
collector is a set of no-TLB-invalidating semantic parallels for 
mprotect(), mremap(), and munmap(). We separate TLB invalidation from 
address space mapping changes, and enforce TLB invalidation only at very 
coarse, explicitly requested boundaries. The collector accepts that the 
page table may be potentially inconsistent across most operations, and 
enforces consistency (via explicit TLB invalidate requests) only at points 
where it actually needs it. Since TLB invalidates represent the bulk of 
execution time cost for mprotect(), mremap(), and munmap() calls, this 
provides us with dramatically 
higher MBs-of-address-space-affected-per-second metrics. You can find some 
early discussion and old numbers in the C4 paper 
<http://www.azulsystems.com/sites/default/files/images/c4_paper_acm.pdf>, 
including some reasoning for why a high map-changing rate is needed for 
sustaining reasonable allocation rates in collectors that perform such 
changes in the main/common-case compaction paths (see section 5 in the 
paper).

On Tuesday, May 2, 2017 at 8:39:12 PM UTC-7, Yichao Yu wrote:
>
> >> 3. Does transitivity work? i.e. if there's a thread 3 that's also 
> >> loading from p, and if thread 2 faults on the load while thread 3 
> >> doesn't, can we say that thread 2 faults after the mprotect on thread 1 
> >> which is after the load on thread 3 and therefore the load (and fault) 
> >> on thread 2 happens after the load on thread 3? 
> > 
> > 
> > Same as (2) above. You'd need to describe how threads 1 and 2 observe 
> > whether or not thread 3's load has happened, and whether or not it has 
> > faulted. Then, depending on how that information is communicated, you 
> > would be able to establish some ordering. 
>
> The "safe assumption" 2 above also made it more clear to me why I feel 
> like this might not be true even if both of the previous ones might be 
> true. I'm now imagining a naive implementation of mprotect where the 
> access bit is sequentially flipped on all the threads from the thread 
> issuing mprotect, i.e. mprotect(p, PROT_NONE) is really 
>
> ``` 
> for (thread: all_threads) { 
>     mprotect_on_thread(thread, p, PROT_NONE) 
> } 
> ``` 
>
> In that case for three threads executing (starting `*ga == 0`) 
>
> ``` 
> // thread 1 
> mprotect(p, PROT_NONE) 
>
> // thread 2 
> *(volatile int*)p; 
> a = *ga; // in signal handler 
>
> // thread 3 
> *ga = 1; 
> *(volatile int*)p; 
> ``` 
>
> A possible "sequential consistent" execution could be 
>
> ``` 
> mprotect_on_thread(1, p, PROT_NONE) // Thread 1 
> mprotect_on_thread(2, p, PROT_NONE) // Thread 1 
> *(volatile int*)p; // Thread 2, this faults. 
> a = *ga; // in signal handler // Thread 2 
> *ga = 1; // Thread 3 
> *(volatile int*)p; // Thread 3, this doesn't fault. 
> mprotect_on_thread(3, p, PROT_NONE) // Thread 1 
> ``` 
>
> After all three threads finish execution we'll have `a == 0`, even 
> though the `a = *ga` appears to be executed after `mprotect` (though 
> not after `mprotect` returns) and `*ga = 1` appears to be executed 
> before `mprotect`. 
>
> Note that this naive model won't break 1 or 2 (both threads 2 and 3 
> will still be synchronized to thread 1 since thread 1 cannot do 
> anything before mprotect returns). Can this actually happen? (I'll 
> probably test this myself later, though I've never tested anything like 
> this that involves more than two threads yet....) 
>

-- 
You received this message because you are subscribed to the Google Groups 
"mechanical-sympathy" group.