> On Jun 11, 2019, at 11:26 AM, Thomas Hellström (VMware) <thellst...@vmwopensource.org> wrote:
> 
> Hi, Nadav,
> 
> On 6/11/19 7:21 PM, Nadav Amit wrote:
>>> On Jun 11, 2019, at 5:24 AM, Thomas Hellström (VMware) <thellst...@vmwopensource.org> wrote:
>>> 
>>> From: Thomas Hellstrom <thellst...@vmware.com>
>> [ snip ]
>> 
>>> +/**
>>> + * apply_pt_wrprotect - Leaf pte callback to write-protect a pte
>>> + * @pte: Pointer to the pte
>>> + * @token: Page table token, see apply_to_pfn_range()
>>> + * @addr: The virtual page address
>>> + * @closure: Pointer to a struct pfn_range_apply embedded in a
>>> + * struct apply_as
>>> + *
>>> + * The function write-protects a pte and records the range in
>>> + * virtual address space of touched ptes for efficient range TLB flushes.
>>> + *
>>> + * Return: Always zero.
>>> + */
>>> +static int apply_pt_wrprotect(pte_t *pte, pgtable_t token,
>>> +                              unsigned long addr,
>>> +                              struct pfn_range_apply *closure)
>>> +{
>>> +        struct apply_as *aas = container_of(closure, typeof(*aas), base);
>>> +        pte_t ptent = *pte;
>>> +
>>> +        if (pte_write(ptent)) {
>>> +                pte_t old_pte = ptep_modify_prot_start(aas->vma, addr, pte);
>>> +
>>> +                ptent = pte_wrprotect(old_pte);
>>> +                ptep_modify_prot_commit(aas->vma, addr, pte, old_pte, ptent);
>>> +                aas->total++;
>>> +                aas->start = min(aas->start, addr);
>>> +                aas->end = max(aas->end, addr + PAGE_SIZE);
>>> +        }
>>> +
>>> +        return 0;
>>> +}
>>> +
>>> +/**
>>> + * struct apply_as_clean - Closure structure for apply_as_clean
>>> + * @base: struct apply_as we derive from
>>> + * @bitmap_pgoff: Address_space Page offset of the first bit in @bitmap
>>> + * @bitmap: Bitmap with one bit for each page offset in the address_space range
>>> + * covered.
>>> + * @start: Address_space page offset of first modified pte relative
>>> + * to @bitmap_pgoff
>>> + * @end: Address_space page offset of last modified pte relative
>>> + * to @bitmap_pgoff
>>> + */
>>> +struct apply_as_clean {
>>> +        struct apply_as base;
>>> +        pgoff_t bitmap_pgoff;
>>> +        unsigned long *bitmap;
>>> +        pgoff_t start;
>>> +        pgoff_t end;
>>> +};
>>> +
>>> +/**
>>> + * apply_pt_clean - Leaf pte callback to clean a pte
>>> + * @pte: Pointer to the pte
>>> + * @token: Page table token, see apply_to_pfn_range()
>>> + * @addr: The virtual page address
>>> + * @closure: Pointer to a struct pfn_range_apply embedded in a
>>> + * struct apply_as_clean
>>> + *
>>> + * The function cleans a pte and records the range in
>>> + * virtual address space of touched ptes for efficient TLB flushes.
>>> + * It also records dirty ptes in a bitmap representing page offsets
>>> + * in the address_space, as well as the first and last of the bits
>>> + * touched.
>>> + *
>>> + * Return: Always zero.
>>> + */
>>> +static int apply_pt_clean(pte_t *pte, pgtable_t token,
>>> +                          unsigned long addr,
>>> +                          struct pfn_range_apply *closure)
>>> +{
>>> +        struct apply_as *aas = container_of(closure, typeof(*aas), base);
>>> +        struct apply_as_clean *clean = container_of(aas, typeof(*clean), base);
>>> +        pte_t ptent = *pte;
>>> +
>>> +        if (pte_dirty(ptent)) {
>>> +                pgoff_t pgoff = ((addr - aas->vma->vm_start) >> PAGE_SHIFT) +
>>> +                        aas->vma->vm_pgoff - clean->bitmap_pgoff;
>>> +                pte_t old_pte = ptep_modify_prot_start(aas->vma, addr, pte);
>>> +
>>> +                ptent = pte_mkclean(old_pte);
>>> +                ptep_modify_prot_commit(aas->vma, addr, pte, old_pte, ptent);
>>> +
>>> +                aas->total++;
>>> +                aas->start = min(aas->start, addr);
>>> +                aas->end = max(aas->end, addr + PAGE_SIZE);
>>> +
>>> +                __set_bit(pgoff, clean->bitmap);
>>> +                clean->start = min(clean->start, pgoff);
>>> +                clean->end = max(clean->end, pgoff + 1);
>>> +        }
>>> +
>>> +        return 0;
>> 
>> Usually, when a PTE is write-protected, or when a dirty-bit is cleared, the
>> TLB flush must be done while the page-table lock for that specific table is
>> taken (i.e., within apply_pt_clean() and apply_pt_wrprotect() in this case).
>> 
>> Otherwise, in the case of apply_pt_clean() for example, another core might
>> shortly after (before the TLB flush) write to the same page whose PTE was
>> changed. The dirty-bit in such case might not be set, and the change gets
>> lost.
> 
> Hmm. Let's assume that was the case; we have two possible situations:
> 
> A: pt_clean
> 
> 1. That core's TLB entry is invalid. It will set the PTE dirty bit and
> continue. The dirty bit will probably remain set after the TLB flush.

I guess you mean the PTE is not cached in the TLB.

> 2. That core's TLB entry is valid. It will just continue. The dirty bit will
> remain clear after the TLB flush.
> 
> But I fail to see how having the TLB flush within the page table lock would
> help in this case, since the writing core will never attempt to take it? In
> any case, if such a race occurs, the corresponding bit in the bitmap would
> have been set and we've recorded that the page is dirty.

I don't understand. What do you mean by "recorded that the page is dirty"?
IIUC, the PTE dirty bit is clear in this case - do you mean that PG_dirty is
set?

To clarify, this code may actually work correctly on Intel CPUs, based on a
recent discussion with Dave Hansen. Apparently, most Intel CPUs set the dirty
bit in memory atomically when a page is first written. But this is generic
code, not arch-specific. My concern is that a certain page might be written
to, yet end up marked as dirty in neither the bitmap nor the PTE. That is what
the practice of flushing cleaned/write-protected PTEs while holding the
page-table lock is about (sorry for my earlier confusion).
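Just to illustrate the ordering I have in mind - an untested sketch of your
apply_pt_clean() with the per-page flush pulled inside the leaf callback. It
assumes the callback still runs with the pte lock held, as it does for
apply_to_page_range(); I'm not suggesting this as the final form:

static int apply_pt_clean(pte_t *pte, pgtable_t token,
                          unsigned long addr,
                          struct pfn_range_apply *closure)
{
        struct apply_as *aas = container_of(closure, typeof(*aas), base);
        struct apply_as_clean *clean = container_of(aas, typeof(*clean), base);
        pte_t ptent = *pte;

        if (pte_dirty(ptent)) {
                pgoff_t pgoff = ((addr - aas->vma->vm_start) >> PAGE_SHIFT) +
                        aas->vma->vm_pgoff - clean->bitmap_pgoff;
                pte_t old_pte = ptep_modify_prot_start(aas->vma, addr, pte);

                ptent = pte_mkclean(old_pte);
                ptep_modify_prot_commit(aas->vma, addr, pte, old_pte, ptent);

                /*
                 * Flush this entry while the pte lock is still held, rather
                 * than deferring it to a batched range flush, so the window
                 * in which another core can keep writing through a stale
                 * writable+dirty TLB entry after the dirty bit was cleared
                 * is confined to this critical section.
                 */
                flush_tlb_page(aas->vma, addr);

                __set_bit(pgoff, clean->bitmap);
                clean->start = min(clean->start, pgoff);
                clean->end = max(clean->end, pgoff + 1);
        }

        return 0;
}

Whether that is actually needed here, or whether the external address space
lock you mention already covers it, is exactly what I would like the
documentation to spell out.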
> B: The wrprotect situation is a bit different:
> 
> 1. That core's TLB entry is invalid. It will read the PTE, cause a fault and
> block in mkwrite() on an external address space lock which is held over this
> operation. (Is it this situation that is your main concern?)
> 2. That core's TLB entry is valid. It will just continue regardless of any
> locks.
> 
> For both mkwrite() and dirty(), if we act on the recorded pages *after* the
> TLB flush, we're OK. The difference is that just after the TLB flush there
> should be no write-enabled PTEs in the write-protect case, but there may be
> dirty PTEs in the pt_clean case. Something that is mentioned in the docs
> already.

The wrprotect case might work correctly, I guess. It does work for mprotect()
(again, sorry for the confusion).

>> Does this function regard a certain use-case in which deferring the TLB
>> flushes is fine? If so, assertions and documentation of the related
>> assumption would be useful.
> 
> If I understand your comment correctly, the page table lock is sometimes used
> as the lock in B1, blocking a possible software fault until the TLB flush has
> happened. Here we assume an external address space lock taken both around
> the wrprotect operation and in mkwrite(). Would it be OK if I add comments
> about the necessity of an external lock to the doc? OK with a follow-up patch?

I think the patch should explain itself. I think the comment:

> + * WARNING: This function should only be used for address spaces that
> + * completely own the pages / memory the page table points to. Typically a
> + * device file.

... should be more concrete (define what it means for an address space to
completely own the memory), and possibly backed by a (debug) assertion to
ensure that it is only used correctly.
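For example, something along the following lines - completely hypothetical,
since the exact condition for "completely owns the pages / memory" is yours
to define; VM_PFNMAP/VM_MIXEDMAP on a file-backed mapping is only my guess at
what a device-file VMA would look like:

/* Hypothetical sanity check; the exact condition is up for discussion. */
static void apply_as_assert_owned(struct vm_area_struct *vma)
{
        /*
         * Warn (once, on debug builds) if the VMA is not a file-backed
         * special mapping, i.e. if the pages could be shared with the
         * page cache or anonymous memory that this interface must not
         * touch.
         */
        VM_WARN_ON_ONCE(!vma->vm_file ||
                        !(vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP)));
}

Even a loose check like that would both document the requirement and catch
obvious misuse in debug builds.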