I have added a new issue https://github.com/cloudius-systems/osv/issues/1079 that captures all the suggestions for improvements we have talked about here.
On Monday, March 30, 2020 at 5:10:44 PM UTC-4, Nadav Har'El wrote:

> On Mon, Mar 30, 2020 at 10:06 PM Waldek Kozaczuk <[email protected]> wrote:
>
>> On Monday, March 30, 2020 at 4:33:05 AM UTC-4, Nadav Har'El wrote:
>>
>>> On Mon, Mar 30, 2020 at 7:38 AM Waldek Kozaczuk <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> I came up with this test trying to replicate the fragmentation scenario Rick Payne was experiencing:
>>>>
>>>> int main() {
>>>>     size_t size = 0x4C1000;
>>>>
>>>>     void* mem = malloc(size);
>>>>     auto start = std::chrono::high_resolution_clock::now();
>>>>
>>>>     for (int i = 0; i < 100; i++) {
>>>>         size = (size * 105) / 100;
>>>>         malloc(0x4000);
>>>>         printf("%d allocation of %ld bytes\n", i + 1, size);
>>>>         mem = realloc(mem, size);
>>>>     }
>>>>
>>>>     auto end = std::chrono::high_resolution_clock::now();
>>>>     std::chrono::duration<double> elapsed = end - start;
>>>>     std::cout << "Elapsed time: " << elapsed.count() << " s\n";
>>>>     printf("DONE - last size: %ld!\n", size);
>>>>
>>>>     sleep(1000);
>>>>     free(mem);
>>>> }
>>>>
>>>> Before the latest patch that implements the mmap-based malloc_large(), this test needed at least 1.6GB of memory to pass - the last realloc() would try to allocate ~625MB of memory. However, I did not really see the fragmentation I expected to see in physical memory, so I do not think this test is very representative.
>>>>
>>>> The latest version of the code with all patches applied needs a minimum of 1.2GB - which is not surprising, given that realloc() needs two copies of similar size to copy the data from one place to the other until it frees the smaller one. But the good thing is, it no longer needs a contiguous 625MB of physical memory to do it.
>>>>
>>>> However, with this experiment I noticed that the new malloc_large() is slower - this test takes 3 seconds vs 2 seconds before the patch. As I suspected, the culprit is that mapped_malloc_large() calls map_anon() with mmap_populate, which pre-faults the entire area - something I thought was necessary. But it turns out that when I replaced mmap_populate with mmap_uninitialized, all the unit tests and a couple of other apps still worked just fine, and the test above took a similar amount of time as before - around 2 seconds. So maybe we should use mmap_uninitialized. The only concern would be kernel code running with preemption disabled and trying to access memory that was allocated with mapped_malloc_large(). Could it happen?
>>>
>>> Although malloc() cannot be called in preemption-disabled mode (it uses locks), you're right that all preemption-disabled kernel code that *uses* previously allocated memory today silently assumes that this memory is already populated ("swapped in"). There are even more delicate issues, like the lazy TLB flush explained in commit c9e5af6deef6ef9420bdf6437f8c34371c8df4e9. So this sort of preemption-disabled kernel code today relies on the old-style pre-populated and visible-on-all-cores malloc(), and may not work correctly with mmap(). However, I don't think this sort of preemption-disabled code in the kernel ever works with very large allocations, for which you fall back to mmap()?
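For reference, here is the test quoted above as a self-contained program - only the headers the snippet already assumes are added, and the comments just restate what was measured in this thread (compile with, for example, g++ -std=c++11 -O2 realloc_test.cc):

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <unistd.h>

int main() {
    size_t size = 0x4C1000;                 // ~4.8MB starting size

    void* mem = malloc(size);
    auto start = std::chrono::high_resolution_clock::now();

    for (int i = 0; i < 100; i++) {
        size = (size * 105) / 100;          // grow by 5% per iteration, ~625MB at the end
        malloc(0x4000);                     // deliberately leaked 16K allocation to fragment memory
        printf("%d allocation of %ld bytes\n", i + 1, size);
        mem = realloc(mem, size);           // OSv currently copies here; Linux can mremap() instead
    }

    auto end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> elapsed = end - start;
    std::cout << "Elapsed time: " << elapsed.count() << " s\n";
    printf("DONE - last size: %ld!\n", size);

    sleep(1000);                            // keep the guest alive so memory usage can be inspected
    free(mem);
}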
>> With my patches, malloc_large() uses map_anon() for any allocation >= 2MB (larger than the so-called "huge page" size), and for allocations > 4K and < 2MB ONLY if we cannot find a large enough page range that would fit ("fallback"). So do you think it is safe to change mmap_populate to mmap_uninitialized for the 1st case but not for the other? That is based on your thinking that the kernel does not operate on such large allocated memory areas?

> Maybe that's indeed a good idea. Also, for small regions the population part would be relatively cheap?

>> Also, as I understand it, each CPU has its own TLB, so it has to be flushed to see any changes made to the global page table structures, right? So the commit ("mmu: flush tlb lazily for cpus that are running system threads" - https://github.com/cloudius-systems/osv/commit/7e38453390d6c0164a72e30b2616b0f3c3025349 by Gleb) optimizes the flushing logic by making it lazy for system (?) threads (reading the code in this commit, I think we are really optimizing for application threads). This somehow affected the scheduler code, which would sometimes end up operating on mmapped areas of memory where thread info used to be stored and lead to a page fault while preemption was disabled, right? So your commit ("sched: only allocate sched::thread objects on the heap" - https://github.com/cloudius-systems/osv/commit/c9e5af6deef6ef9420bdf6437f8c34371c8df4e9) addressed it by making sure "new thread" is hidden behind thread::make(), which makes sure to use the regular non-mmapped way of allocating memory. So we hope that my commit did not break any of that. Also, even if we change to lazy population for >= 2MB allocations, we should be fine, as the allocation made by thread::make() should not try to allocate more than 4K of memory and call malloc_large(), right?

> I don't remember what sizeof(thread) is; I think it is around 20K, not less than 4KB (you can check).
>
> I think that indeed, we need to make sure that thread::make() returns old-style malloc()ed memory, not mmap()ed memory. Perhaps thread::make() should use, instead of aligned_alloc(), a special function which ensures the old-style behavior? Maybe, like we have alloc_phys_contiguous_aligned(), we should have another function like alloc_populated_and_visible_on_all() (or some other name) which ensures the memory is pre-populated and visible *now* (not later) on all CPUs. Alternatively, I wonder if we could add an mmap() option which would ensure a full TLB flush (with no "laziness") and use it only from the malloc()-doing-mmap() code.
>
> I think that outside the scheduler and thread::make(), we don't need to worry about Gleb's lazy TLB flush optimization:
>
> The goal of this optimization is to reduce unnecessary IPIs to CPUs currently running kernel threads - such as I/O threads and even idle threads. The thinking is that kernel threads would not be holding pointers into the user's mmap()ed memory, since the kernel has no control over when this memory may be unmapped by the user. So if some CPU is currently running a kernel thread, there is no reason to interrupt it (via an IPI), which is pretty slow and even more so on VMs, and flush the TLB on every mmap() call; we can flush the TLB only when returning to an application thread, which may want to use the newly-mapped area (via a pointer it gets from some shared memory area).
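To illustrate the suggestion above, here is a rough sketch of what such a helper and the corresponding thread::make() change might look like. alloc_populated_and_visible_on_all() is only the name floated in this thread, not an existing OSv function; delegating to alloc_phys_contiguous_aligned() is just one way to guarantee pre-populated, already-visible memory (physical contiguity is stronger than strictly needed), and the thread class below is only a stand-in for the real sched::thread:

#include <cstddef>
#include <new>
#include <utility>

namespace memory {

// Assumed existing OSv allocator entry point (pre-populated, mapped on all CPUs
// at allocation time); declared here only so the sketch stands on its own.
void* alloc_phys_contiguous_aligned(size_t size, size_t align);

// Hypothetical helper with the semantics discussed above: never hand out memory
// from the lazily-populated / lazily-TLB-flushed mmap() path.
inline void* alloc_populated_and_visible_on_all(size_t size, size_t align)
{
    return alloc_phys_contiguous_aligned(size, align);
}

} // namespace memory

// Stand-in for sched::thread, only to show where make() would change.
class thread {
public:
    template<typename... Args>
    static thread* make(Args&&... args)
    {
        // Construct the thread object in pre-populated, globally visible memory
        // instead of plain new. The matching destroy path, which must call the
        // destructor and free the buffer explicitly, is omitted from this sketch.
        void* buf = memory::alloc_populated_and_visible_on_all(sizeof(thread), alignof(thread));
        return new (buf) thread(std::forward<Args>(args)...);
    }
private:
    thread() = default;   // placeholder constructor for the stand-in only
};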
> I think the same thinking also follows for malloc() - I think kernel threads would also not be keeping pointers to user malloc()ed memory, because the user may free it at any time. So I think the TLB flush optimization is still fine for malloc(), or in other words: it's fine that malloc() actually calls mmap(), which has this TLB flush optimization.

>> static thread* make(Args&&... args) {
>>     return new thread(std::forward<Args>(args)...);
>> }
>>
>> Lastly, do you think the changes to malloc_large() affect the patch I sent to make application threads use a lazily populated stack - https://groups.google.com/d/msg/osv-dev/tZnwiScmjZY/GkY0hV9EAwAJ ?

>>>> Finally, when I ran the same test on a Linux host it completed in under 2ms (milliseconds) - so 1000 times faster. Why is that? Linux obviously uses mremap() to implement realloc(), while OSv copies the data using memcpy(). So if we want to make realloc() faster (it could also be useful to speed up ramfs), we should implement mremap() - here is an open issue for that - https://github.com/cloudius-systems/osv/issues/184 - and supposedly Pekka sent a patch a while ago which we can build upon.

>>> As I noted a few weeks ago on this mailing list, Linux has had hundreds of people optimizing every tiny detail. The question is whether this optimization is worth it. Is it common for people to realloc() huge allocations? If they do, do they do it many times, or exponentially (i.e., double the size each time)? If you think it's worth it, then yes - your new malloc() code now knows it used mmap(), and can use mremap() to do the actual work. I think the question is just whether you want to optimize this specific use case.

>>>> Waldek
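To make the mremap() idea above a bit more concrete, here is a rough sketch of the fast path realloc() could take for mmap()-backed large allocations once OSv has an mremap()-like primitive (issue #184). None of the helper names below exist in OSv today - they are placeholders for whatever the allocator already knows about the object:

#include <algorithm>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical helpers, standing in for allocator bookkeeping and a future mremap():
size_t object_size(void* obj);                   // size the allocator recorded for obj
bool is_mapped_large(void* obj);                 // true if obj came from the map_anon()-based malloc_large()
void* mremap_large(void* obj, size_t old_size, size_t new_size); // grow/shrink/move the mapping without copying data

// Sketch of the branch OSv's realloc() could take instead of always memcpy()ing.
void* realloc_with_mremap(void* obj, size_t new_size)
{
    if (!obj) {
        return malloc(new_size);
    }
    size_t old_size = object_size(obj);
    if (is_mapped_large(obj)) {
        // Fast path: only page-table entries change; the ~625MB payload from the
        // test above is never copied, which is where Linux gets its ~1000x advantage.
        return mremap_large(obj, old_size, new_size);
    }
    // Existing slow path for page-range backed allocations: allocate, copy, free.
    void* new_obj = malloc(new_size);
    if (new_obj) {
        memcpy(new_obj, obj, std::min(old_size, new_size));
        free(obj);
    }
    return new_obj;
}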
