I have added a new issue https://github.com/cloudius-systems/osv/issues/1079 that captures all the suggestions for improvements we have talked about here.
On Monday, March 30, 2020 at 5:10:44 PM UTC-4, Nadav Har'El wrote:

> On Mon, Mar 30, 2020 at 10:06 PM Waldek Kozaczuk <[email protected]> wrote:
>
>> On Monday, March 30, 2020 at 4:33:05 AM UTC-4, Nadav Har'El wrote:
>>
>>> On Mon, Mar 30, 2020 at 7:38 AM Waldek Kozaczuk <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> I came up with this test trying to replicate the fragmentation scenario Rick Payne was experiencing:
>>>>
>>>> int main() {
>>>>     size_t size = 0x4C1000;
>>>>
>>>>     void* mem = malloc(size);
>>>>     auto start = std::chrono::high_resolution_clock::now();
>>>>
>>>>     for (int i = 0; i < 100; i++) {
>>>>         size = (size * 105) / 100;
>>>>         malloc(0x4000);
>>>>         printf("%d allocation of %ld bytes\n", i + 1, size);
>>>>         mem = realloc(mem, size);
>>>>     }
>>>>
>>>>     auto end = std::chrono::high_resolution_clock::now();
>>>>     std::chrono::duration<double> elapsed = end - start;
>>>>     std::cout << "Elapsed time: " << elapsed.count() << " s\n";
>>>>     printf("DONE - last size: %ld!\n", size);
>>>>
>>>>     sleep(1000);
>>>>     free(mem);
>>>> }
>>>>
>>>> Before the latest patch that implements the mmap-based malloc_large(), this test needed at least 1.6GB of memory to pass - the last realloc() would try to allocate ~625MB of memory. However, I did not really see the fragmentation I expected to see in physical memory, so I do not think this test is very representative.
>>>>
>>>> The latest version of the code with all patches applied needs a minimum of 1.2GB - which is not surprising, given that realloc() needs two copies of similar size to copy the data from one place to the other until it frees the smaller one. But the good thing is, it no longer needs a contiguous 625MB of physical memory to do it.
>>>>
>>>> However, with this experiment I noticed that the new malloc_large() is slower - this test takes 3 seconds vs 2 seconds before the patch. As I suspected, the culprit is that mapped_malloc_large() calls map_anon() with mmap_populate, which pre-faults the entire area - something I thought was necessary. But it turns out that when I replaced mmap_populate with mmap_uninitialized, all the unit tests and a couple of other apps still worked just fine, and the test above took a similar amount of time as before - around 2 seconds. So maybe we should use mmap_uninitialized. The only concern would be kernel code running with preemption disabled and trying to access memory that was allocated with mapped_malloc_large(). Could it happen?
>>>
>>> Although malloc() cannot be called in preemption-disabled mode (it uses locks), you're right that all preemption-disabled kernel code that *uses* previously allocated memory today silently assumes that this memory is already populated ("swapped in"). There are even more delicate issues, like the lazy TLB flush explained in commit c9e5af6deef6ef9420bdf6437f8c34371c8df4e9. So this sort of preemption-disabled kernel code today relies on the old-style pre-populated and visible-on-all-cores malloc(), and may not work correctly with mmap(). However, I don't think this sort of preemption-disabled code in the kernel ever works with very large allocations, for which you fall back to mmap()?
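For reference, here is the test quoted above as a self-contained program - only the headers the snippet already assumes are added, and the comments just restate what was measured in this thread (compile with, for example, g++ -std=c++11 -O2 realloc_test.cc):

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <unistd.h>

int main() {
    size_t size = 0x4C1000;                 // ~4.8MB starting size

    void* mem = malloc(size);
    auto start = std::chrono::high_resolution_clock::now();

    for (int i = 0; i < 100; i++) {
        size = (size * 105) / 100;          // grow by 5% per iteration, ~625MB at the end
        malloc(0x4000);                     // deliberately leaked 16K allocation to fragment memory
        printf("%d allocation of %ld bytes\n", i + 1, size);
        mem = realloc(mem, size);           // OSv currently copies here; Linux can mremap() instead
    }

    auto end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> elapsed = end - start;
    std::cout << "Elapsed time: " << elapsed.count() << " s\n";
    printf("DONE - last size: %ld!\n", size);

    sleep(1000);                            // keep the guest alive so memory usage can be inspected
    free(mem);
}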
>> With my patches, malloc_large() uses map_anon() for any allocation >= 2MB (larger than the so-called "huge page" size), and for allocations > 4K and < 2MB ONLY if we cannot find a large enough page range that would fit ("fallback"). So do you think it is safe to change mmap_populate to mmap_uninitialized for the 1st case but not for the other? That is based on your thinking that the kernel does not operate on such large allocated memory areas?

> Maybe that's indeed a good idea. Also, for small regions the population part would be relatively cheap?

>> Also, as I understand it, each CPU has its own TLB, so it has to be flushed to see any changes made to the global page table structures, right? So the commit ("mmu: flush tlb lazily for cpus that are running system threads" - https://github.com/cloudius-systems/osv/commit/7e38453390d6c0164a72e30b2616b0f3c3025349 by Gleb) optimizes the flushing logic by making it lazy for system (?) threads (reading the code in this commit, I think we are really optimizing for application threads). This somehow affected the scheduler code, which would sometimes end up operating on mmapped areas of memory where thread info used to be stored and lead to a page fault while preemption was disabled, right? So your commit ("sched: only allocate sched::thread objects on the heap" - https://github.com/cloudius-systems/osv/commit/c9e5af6deef6ef9420bdf6437f8c34371c8df4e9) addressed it by making sure "new thread" is hidden behind thread::make(), which makes sure to use the regular non-mmapped way of allocating memory. So we hope that my commit did not break any of that. Also, even if we change to lazy population for >= 2MB allocations, we should be fine, as the allocation made by thread::make() should not try to allocate more than 4K of memory and call malloc_large(), right?

> I don't remember what sizeof(thread) is; I think it is around 20K, not less than 4KB (you can check).
>
> I think that indeed, we need to make sure that thread::make() returns old-style malloc()ed memory, not mmap()ed memory. Perhaps thread::make() should use, instead of aligned_alloc(), a special function which ensures the old-style behavior? Maybe, like we have alloc_phys_contiguous_aligned(), we should have another function like alloc_populated_and_visible_on_all() (or some other name) which ensures the memory is pre-populated and visible *now* (not later) on all CPUs. Alternatively, I wonder if we could add an mmap() option which would ensure a full TLB flush (with no "laziness") and use it only from the malloc()-doing-mmap() code.
>
> I think that outside the scheduler and thread::make(), we don't need to worry about Gleb's lazy TLB flush optimization:
>
> The goal of this optimization is to reduce unnecessary IPIs to CPUs currently running kernel threads - such as I/O threads and even idle threads. The thinking is that kernel threads would not be holding pointers into the user's mmap()ed memory, since the kernel has no control over when this memory may be unmapped by the user. So if some CPU is currently running a kernel thread, there is no reason to interrupt it (via an IPI), which is pretty slow and even more so on VMs, and flush the TLB on every mmap() call; we can flush the TLB only when returning to an application thread, which may want to use the newly-mapped area (via a pointer it gets from some shared memory area).
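To illustrate the suggestion above, here is a rough sketch of what such a helper and the corresponding thread::make() change might look like. alloc_populated_and_visible_on_all() is only the name floated in this thread, not an existing OSv function; delegating to alloc_phys_contiguous_aligned() is just one way to guarantee pre-populated, already-visible memory (physical contiguity is stronger than strictly needed), and the thread class below is only a stand-in for the real sched::thread:

#include <cstddef>
#include <new>
#include <utility>

namespace memory {

// Assumed existing OSv allocator entry point (pre-populated, mapped on all CPUs
// at allocation time); declared here only so the sketch stands on its own.
void* alloc_phys_contiguous_aligned(size_t size, size_t align);

// Hypothetical helper with the semantics discussed above: never hand out memory
// from the lazily-populated / lazily-TLB-flushed mmap() path.
inline void* alloc_populated_and_visible_on_all(size_t size, size_t align)
{
    return alloc_phys_contiguous_aligned(size, align);
}

} // namespace memory

// Stand-in for sched::thread, only to show where make() would change.
class thread {
public:
    template<typename... Args>
    static thread* make(Args&&... args)
    {
        // Construct the thread object in pre-populated, globally visible memory
        // instead of plain new. The matching destroy path, which must call the
        // destructor and free the buffer explicitly, is omitted from this sketch.
        void* buf = memory::alloc_populated_and_visible_on_all(sizeof(thread), alignof(thread));
        return new (buf) thread(std::forward<Args>(args)...);
    }
private:
    thread() = default;   // placeholder constructor for the stand-in only
};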
> I think the same thinking also follows for malloc() - I think kernel threads would also not be keeping pointers to user malloc()ed memory, because the user may free it at any time. So I think the TLB flush optimization is still fine for malloc(), or in other words: it's fine that malloc() actually calls mmap(), which has this TLB flush optimization.

>> static thread* make(Args&&... args) {
>>     return new thread(std::forward<Args>(args)...);
>> }
>>
>> Lastly, do you think the changes to malloc_large() affect the patch I sent to make application threads use a lazily populated stack - https://groups.google.com/d/msg/osv-dev/tZnwiScmjZY/GkY0hV9EAwAJ ?

>>>> Finally, when I ran the same test on a Linux host it completed in under 2ms (milliseconds) - so 1000 times faster. Why is that? Linux obviously uses mremap() to implement realloc(), while OSv copies the data using memcpy(). So if we want to make realloc() faster (it could also be useful to speed up ramfs), we should implement mremap() - here is an open issue for that - https://github.com/cloudius-systems/osv/issues/184 - and supposedly Pekka sent a patch a while ago which we can build upon.

>>> As I noted a few weeks ago on this mailing list, Linux has had hundreds of people optimizing every tiny detail. The question is whether this optimization is worth it. Is it common for people to realloc() huge allocations? If they do, do they do it many times, or exponentially (i.e., double the size each time)? If you think it's worth it, then yes - your new malloc() code now knows it used mmap(), and can use mremap() to do the actual work. I think the question is just whether you want to optimize this specific use case.

>>>> Waldek
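To make the mremap() idea above a bit more concrete, here is a rough sketch of the fast path realloc() could take for mmap()-backed large allocations once OSv has an mremap()-like primitive (issue #184). None of the helper names below exist in OSv today - they are placeholders for whatever the allocator already knows about the object:

#include <algorithm>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical helpers, standing in for allocator bookkeeping and a future mremap():
size_t object_size(void* obj);                   // size the allocator recorded for obj
bool is_mapped_large(void* obj);                 // true if obj came from the map_anon()-based malloc_large()
void* mremap_large(void* obj, size_t old_size, size_t new_size); // grow/shrink/move the mapping without copying data

// Sketch of the branch OSv's realloc() could take instead of always memcpy()ing.
void* realloc_with_mremap(void* obj, size_t new_size)
{
    if (!obj) {
        return malloc(new_size);
    }
    size_t old_size = object_size(obj);
    if (is_mapped_large(obj)) {
        // Fast path: only page-table entries change; the ~625MB payload from the
        // test above is never copied, which is where Linux gets its ~1000x advantage.
        return mremap_large(obj, old_size, new_size);
    }
    // Existing slow path for page-range backed allocations: allocate, copy, free.
    void* new_obj = malloc(new_size);
    if (new_obj) {
        memcpy(new_obj, obj, std::min(old_size, new_size));
        free(obj);
    }
    return new_obj;
}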
