Date: Wed, 13 Mar 2019 11:44:51 -0700 From: Jason Thorpe <thor...@me.com> Message-ID: <9a2a4a34-35b0-490e-9a92-aab44174f...@me.com>
| I would suggest instrumenting-with-printf the "new_pageable" | case of uvm_map_pageable() That didn't show much we didn't already know. Turns out the problem is in the other side, the mlock() case, that's where things go bad, then the munlock() falls foul of the corrupted setup. I haven't tested it yet, but I will, but I am fairly sure that simply doing munlock(addr, 0) when there has been no mlock() will cause no problems at all. I am less sure about the case mlock(addr, 100); munlock(addr+50, 0); but I will investigate that one too. What happens in the mlock(addr, 0) case is this: uvm_map_pageable() starts by validating the start and end addresses (ed6000 for both, since size == 0 ... that is the real address from the test, but with all the upper bits dropped for the e-mail, they're irrelevant). uvm_map_lookup_entry() finds the entry for the start address, which in this case is ecd000..eff000 (start..end) with wired_count == 0. Since this is the !new_pageable case we drop to pass 1 of the "do wiring" code. There entry != &map->header (the end ... that's impossible in the first iteration I think) and entry->start (ecd000) < end (ed6000) so we enter the loop. The map isn't wired (wired_count == 0) so we do the test to see if an amap_copy() is needed. It isn't, fortunately, or I'd have to go learn that that's all about... Then we get to the first interesting bit; In general, if we have a map entry covering addresses a..b and we want to lock a range x..y where a<x<y<b what happens is that the two calls UVM_MAP_CLIP_START(map, entry, start); UVM_MAP_CLIP_END(map, entry, end); split the map entry into three. The first line makes a..x and x..b with "entry" becoming x..b, and then the second divides that one, into x..y and y..b with entry becoming x..y [Aside here, we know a<=x that's what the initial uvm_map_lookup_entry() did for us, and we know a<b (a property of the map entries) and x<=y (I think we know that, I did not yet look to see if something checks for a negative length - or a very big one which would cause wrap around of the end=start+length addition. Nothing else is known.] [2nd aside added later: I did check for a wraparound, check, VM_MAP_RANGE_CHECK() forces start = end in that case. I think that's wrong, it should be an error, not a "wire nothing" which currently breaks things, but with the suggested changes way way down below, would result in a success return. If we were to keep what VM_MAP_RANGE_CHECK() does, it should be "end = start" rather than the opposite which it does now. It is "start" that the sys call requested, end was computed from the size.] In all of these ignore my fudging of the border addresses - nothing is really in two maps (or to be precise, that's how it is represented, but the start..end means addresses start .. end-1 -- none of this is relevant.) That's what is intended to happen - if it happened that we had a==x or y==b we'd just get two map entries, and if both a==x and y==b nothing would be split and the entry we started with is the one that needs wiring or unwiring. But nothing here considered the possibility that x==y which is what we have. In this case, the UVM_MAP_CLIP_START divides the entry into two, a..x and x..b just as above, as x != a, and that's fine. But the UVM_MAP_CLIP_END (just like UVM_MAP_CLIP_START) does if ((VA) > (ENTRY)->start && (VA) < (ENTRY)->end) { \ uvm_map_clip_end(MAP,ENTRY,VA); \ } \ (that's a macro definition, hence the \'s). Here ENTRY->start == x, ENTRY->end == b, and VA == y. But x == y, so VA == x too. Then since VA == ENTRY->start the condition fails, and no splitting (clipping) is done. So, now we have two entries [ecd000..ed6000] and [ed6000..eff000] with "entry" (the one we're going to work on) being the second of those (it is the one that contains the addresses we want to lock) whereas the first is now irrelevant, untouched, and we can (and will) simply forget about. The second one gets its wired_count incremented (which in our case moves it from 0 to 1.) That's where things really start to go wrong, as that's supposed to mean that the addresses [ed6000..eff000] are wired. That might be OK if we actually wired them (yet to come) but it is not what was requested. There is then a check for holes, not really sure what that's about, but if it happens we get an error (I'm guessing it indicates that we have the situation a<x<b<y and there is no later map entry covering the b..y address range). Never mind, that doesn't happen here, so we can ignore all of that as well. >From this we go onto the next map entry, which in our case isn't interesting (cannot possibly be: either there are no more, or the next has a value for its ->start which is >= eff000, which is > end (ed6000) so one way or the other, the loop that is pass1 terminates. On to pass 2. The entry we start at remains the [ed6000..eff000] which now has wired_count == 1. The pass2 loop is controlled by while (entry != &map->header && entry->start < end) { The first part of that is not an issue, and like in pass1, cannot ever happen the first time around. But: entry->start == ed6000, and end == ed6000, entry->start is not < end, and so we never enter the loop. Oops. That loop is what actually makes the pages wired, and we're doing none of that. Pass2 loop "done", rv remains 0 (it only gets set inside the loop, where we never go) so we simply return. mlock() "succeeded". And now, for the lead in to the panic (new KASSERT failed) when we come to munlock()... (but of course, it is not simple!) We enter uvm_map_pageable() again, with start==end==ed6000 since we are doing an munlock(same addr as locked, 0). As nothing else has affected this process, uvm_map_lookup_entry() finds the entry we were just working on [ed6000..eff000] - the entry that contains the start addr (just the same as last time). But now that entry has wired_count == 1 (which happened just a bit earlier in this message.) The new_pagable case starts with UVM_MAP_CLIP_START(map, entry, start); to strip off any leading section of the map entry, that we are not unlocking, but here start == entry->start so the unwire begins at the beginning of the map entry. That call changes nothing. As with the wiring case, there are 2 loops, loop 1 while ((entry != &map->header) && (entry->start < end)) { Just like before the first condition is impossible for the first iteration, but for the second, entry->start == end. Loop done before it started. Doesn't matter, all that loop does is validity checking on the range of addresses to be unlocked (just like the "hole" check above I think). Our address range is valid (well, kind of). On to the second loop. while ((entry != &map->header) && (entry->start < end)) { Well, that's boring. Another loop that does nothing. munlock() is finished, successfully. Note that nothing changed the entry's wired_count - it remains 1, even though it has nominally been unlocked. Oops. The wired count would have been set to 0 in uvm_map_entry_unwire() which is called in the second unwire loop, if the entry indicated it was wired (wired_count > 0). But we never went there. Note no panic yet... Next thing the test does is mlock(buf, 1024); munlock(buf, 1024). I did not get diagnostics from these (all my debug output was conditioned on start==end so I wouldn't be swamped by noise, I considered that an unlikely enough case that I'd get to see what I was looking for and nothing else .. which worked.) But even without seeing what actually happened, it is fairly obvious from reading the code and knowing what numbers are being used... mlock() calls uvm_map_pageable() again, this time start==ed6000 and end==ed7000 (not ed6400 as the size is rounded up to a page). uvm_map_lookup_entry() finds the entry containing the start addr, our good friend [ed6000..eff000] which has retained its wired_count==1 Pass 1 of the wired case. while ((entry != &map->header) && (entry->start < end)) { entry->start < end, so we enter the loop. This time the entry is marked as already wired (though it is not) so we skip that amap() required test completely. UVM_MAP_CLIP_START(map, entry, start); UVM_MAP_CLIP_END(map, entry, end); We do that again. This time we have a == x < y < b so the UVM_MAP_CLIP_START() does nothing, but the UVM_MAP_CLIP_END() does divide the entry into two [ed6000..ed7000] and [ed7000..eff000] Both have wired_count == 1 (that isn't changed by the map entry split.) "entry" is the first of them, that's the one that contains (in this case, *is*) the range of addresses we're wiring. Wiring those increases the wired_count wich becomes 2 for that entry. (Incidentally, I'm not sure wy we bother counting the number of wirings, it just goes 1 2 3 ... as we mlock() the region over and over again - and then drops to 0 as soon as we munlock() - one munlock() undoes any number of mlock() calls on the same address range. Given that I'd expect that all we'd need is a simple "is wired" flag, rather than a count (and avoid the issue of wired_count wraparound that was raised as a possibility - would only take 4 billion mlock() calls for that to happen (2 billion to become negative, but nothing ever checks for that, so we can consider that it is effectively unsigned) - that's probably only a few minutes of a for(;;) mlock(addr, size); loop... But I haven't examined all the code to see if there is anything anywhere which actually ever cares about wired_count being either 0 or 1. [See below]. (nb: this is different than the lower level page wiring counts, those are shared between processes I believe, and one process can munlock() a shared page while another keeps the same page wired down ... but to overflow one of those counters you'd need 64 thousand simultaneous maps referring to the same locked page - which seems like a scenario that would be difficult to fabricate.) Anyway, back to pass 1... After splitting the entry, and incrementing wired_count of the first one (the one we know we are wiring down - the other will be handled next iteration of the loop) we check for holes again, still don't find one, and go onto the next loop iteration. This time we know there's another map entry (we just made it) so we're not at the end of the list, but entry->start == ed7000 and end == ed7000 so !(entry->start < end) and the loop terminates. That's all correct. Pass 2. We revert to our start_entry (the [ed6000..ed7000] entry now), and while (entry != &map->header && entry->start < end) { end remains ed7000 so entry->start < end (and as always, on the first iteration, the initial test cannot be true) so we enter the loop. if (entry->wired_count == 1) { it's not, it is 2 now. OK, here is a case where wired_count > 1 is relevant. So there are 3 values of interest, 0 (unwired), 1 (just made wired) and 2 (was already wired down, but we have a request to do it again). Values > 2 don't seem interesting however. Anyway we don't enter that block of code -- that's the block that actually arranges for the pmap's to be marked as wired, so we have still done none of that. Oops. That's the whole of the pass2 loop, so we go onto the next entry. This one is [ed7000..eff000] (wired_count == 1). But here entry->start (ed7000) == end (ed7000) so is not < end, so the loop is done. That's it, the mlock() has succeeded again, and still has not actually wired down a single page (just made more of a mess of the uvm map entries). Next the test's call of munlock(addr, 1024); Again, the size is rounded up to a page, so we enter uvm_map_pageable() with start==ed6000 and end==ed7000 Again uvm_map_lookup_entry() finds the [ed6000..ed7000] entry, with write_count == 2 now. That's "correct". The new-pagable case (unwiring) starts with UVM_MAP_CLIP_START(map, entry, start); which has nothing to do this time, entry->start == start. Then we enter the 1st unwire loop while ((entry != &map->header) && (entry->start < end)) { and here entry->start < end so we actually get inside this loop this time. But just like last time, all this loop does is validate the args against the map, and they're fine. Next iteration, the [ed7000..eff000] entry, entry->start == end, so the loop is done. Loop 2. Back to the start_entry [ed6000..ed7000] while ((entry != &map->header) && (entry->start < end)) { same loop condition, so just like last time, we either run both loops, or neither, and this time it is both. UVM_MAP_CLIP_END(map, entry, end); nothing to clip (split) entry->end == end if (VM_MAPENT_ISWIRED(entry)) uvm_map_entry_unwire(map, entry); The entry is wired (wired_count == 2) so we call uvm_map_entry_unwire() this time. That function is entry->wired_count = 0; uvm_fault_unwire_locked(map, entry->start, entry->end); (that's all of it). So we set the entry's wired_count back to 0 and call uvm_fault_unwire_locked() uvm_fault_unwire_locked() does a lookup on the start addr (uvm_map_lookup_entry() same as earlier) whch will find our familiar [ed6000..ed7000] entry, and then loops ... for (va = start; va < end; va += PAGE_SIZE) { (that will loop exactly once, as our entry is 1 page long) if (pmap_extract(pmap, va, &pa) == false) continue; Find the physical addr. That should work OK, the page exists. Then we have some validity checking (KASSERT() - none of which fire, so they're all OK) and fiddly locking stuff (which for now we can assume "just works" - and is irrelevant to this essay.) After that if (VM_MAPENT_ISWIRED(entry) == 0) pmap_unwire(pmap, va); Since uvm_map_entry_unwire() set wired_count for our entry to 0, the condition is true and we call pmap_unwire(). That is, I believe, the source of the pmap_unwire: wiring for pmap 0xffffaa80028e4980 va 0x7f7ff7ed6000 did not change! message (note the ed6000) in the message. pmap_unwire() is very confused about being asked to make no changes. But of itself, that seems to be relatively harmless (noisy but harmless). Next: pg = PHYS_TO_VM_PAGE(pa); if (pg) uvm_pageunwire(pg); pg != 0 (which I know, only because I know that uvm_pageunwire() gets called) as that's where that new KASSERT() was added... In it: KASSERT(pg->wire_count != 0); pg->wire_count--; pg->wire_count == 0 - as in all of this, we have still never actually wired a single page. Before the KASSERT (or when running without DIAGNOSTIC defined) the wire_count goes negative (really, wraps to 65535 since it is a uint16_t) and we get an even bigger mess than before, which is what was leading to the consumption of all of the available VM in a !DIAGNOSTIC kernel, so no KASSERT() (after several more iterations of mlock/munlook loops continually working on bogus data). Anyway, that KASSERT is the panic(). I believe all of these problems go away if we simply do nothing in the size == 0 case. All the problems started with that first mlock(addr,0) call which made a uvm map entry marked as wired, but wired no pages. However, I think we can do better than nothing. We need 2 changes I believe. First in sys_mlock() and sys_munlock() we change: size += pageoff; size = (vsize_t)round_page(size); into if (size != 0) { size += pageoff; size = (vsize_t)round_page(size); } So, and attempt to mlock 0 bytes remains an attempt to mlock 0 bytes. (o is guaranteed to be aligned). If we're locking nothing in the page, it doesn't really matter which nothing it is. This change isn't really required to fix the problem but does avoid locking a page, when the program asked to lock nothing, in cases where the addr is not page aligned. I'd prefer doing it this way than if (size == 0) return XXX: where XXX is either EINVAL or 0, depending on your state of mind at the time, as (particularly if we decide size==0 is harmless, and just return 0) we would lose all the checks on the validity of addr. We should probably add (after that block) if (end < start) return EINVAL; (or some error code that this function is permitted to return). That will handle the mlock(addr, 0xffffff000); type problem (I didn't count the f's, but enough so that this is just 1 (or 2) pages < the maximum possible size_t) and is better than allowing VM_MAP_RANGE_CHECK() to simply reduce the range size to 0. Then in uvm_map_pageable() after the: if (uvm_map_lookup_entry(map, start, &start_entry) == false) { if ((lockflags & UVM_LK_EXIT) == 0) vm_map_unlock(map); UVMHIST_LOG(maphist,"<- done (fault)",0,0,0,0); return EFAULT; } block, add if (start == end) { if ((lockflags & UVM_LK_EXIT) == 0) vm_map_unlock(map); UVMHIST_LOG(maphist,"<- done (nothing)",0,0,0,0); return 0; } I am guessing at the UVMHIST_LOG() stuff (copy/paste/edit...) Does all of this sound plausible, and reasonable to do ? Please do remember, that before this, I have never been anywhere near uvm or anything pmap related, so have mercy! kre