On Mon, 9 Nov 2015 11:44:54 -0800 Shaohua Li <[email protected]> wrote:
> In jemalloc, a free(3) doesn't immediately return the memory to the OS, even
> when the memory is page aligned and page sized, in the hope that the memory
> can be reused soon. Later the virtual address space becomes fragmented, and
> more and more free memory accumulates. If the amount of free memory is
> large, jemalloc uses madvise(MADV_DONTNEED) to actually free the memory
> back to the OS.
>
> The madvise has significant overhead, particularly because of the TLB
> flush. jemalloc issues madvise for several virtual address ranges at a
> time. Instead of calling madvise for each of the ranges, we introduce a
> new syscall to purge memory for several ranges at once. In this way, we
> can merge the TLB flushes for the ranges into one big TLB flush. This
> also reduces mmap_sem locking and kernel/userspace switching.
>
> I'm running a simple memory allocation benchmark: 32 threads do random
> malloc/free/realloc. The corresponding jemalloc patch to utilize this API
> is attached.
> Without patch:
> real 0m18.923s
> user 1m11.819s
> sys 7m44.626s
> each cpu gets around 3000K/s TLB flush interrupts. Perf shows TLB flush
> is the hottest function. mmap_sem read locking (because of page faults)
> is also heavy.
>
> with patch:
> real 0m15.026s
> user 0m48.548s
> sys 6m41.153s
> each cpu gets around 140k/s TLB flush interrupts. TLB flush isn't hot at
> all. mmap_sem read locking (still because of page faults) becomes the
> sole hot spot.
>
> Another test mallocs a bunch of memory in 48 threads, then all threads
> free the memory. I measure the time taken by the free phase.
> Without patch: 34.332s
> With patch: 17.429s
>
> The current implementation only supports MADV_DONTNEED. It should be
> trivial to support MADV_FREE later if necessary.
I'd like to see a full description of the proposed userspace interface:
arguments, data structures, return values, etc. A prototype manpage,
basically.
I'd also like to see an analysis of which other userspace allocators
will benefit from this. glibc? tcmalloc?
>
> ...
>
> +/*
> + * The vector madvise(). Like madvise except running for a vector of virtual
> + * address ranges
> + */
> +SYSCALL_DEFINE3(madvisev, const struct iovec __user *, uvector,
> + unsigned long, nr_segs, int, behavior)
> +{
> + struct iovec iovstack[UIO_FASTIOV];
> + struct iovec *iov = NULL;
> + unsigned long start, end = 0;
> + int unmapped_error = 0;
> + size_t len;
> + struct mmu_gather tlb;
> + int error;
> + int i;
> +
> + if (behavior != MADV_DONTNEED)
> + return -EINVAL;
> +
> + error = rw_copy_check_uvector(CHECK_IOVEC_ONLY, uvector, nr_segs,
> + UIO_FASTIOV, iovstack, &iov);
> + if (error <= 0)
> + goto out;
> + /* Make sure addresses are in ascending order */
> + sort(iov, nr_segs, sizeof(struct iovec), iov_cmp_func, NULL);
Do we really need to sort the addresses? That's something which can be
done in userspace, and we could easily add a check for sortedness to the
loop below.
It depends on whether userspace can easily generate a sorted array. If
basically all userspace will always need to run sort() then it doesn't
matter much whether it's done in the kernel or in userspace. But if
*some* userspace can naturally generate its array in sorted form then
neither userspace nor the kernel needs to run sort() and we should take
this out.
--
To unsubscribe from this list: send the line "unsubscribe linux-api" in
the body of a message to [email protected]
More majordomo info at http://vger.kernel.org/majordomo-info.html