On Wed, Oct 3, 2018 at 12:38 AM Waldek Kozaczuk <jwkozac...@gmail.com>
wrote:

> For the last week or so I have been trying to learn and understand how
> memory is allocated in OSv during its lifecycle, starting from the moment
> it boots. The motivating force was to understand why OSv needs 27 MB
> (after the recent "almost-in-place" kernel decompression patch) to run a
> simple hello world example in C, and to identify ideas for potential
> improvements.
>

Great. Don't forget that memory allocation is probably part of the story,
but not all of it. We also have the issue that our entire kernel is kept in
memory, and that is already 7 MB (after your recent shaving of 2MB off it),
so it will be hard to reach the IncludeOS number you reported (2MB). We
also have another silly megabyte wasted at the beginning, we have various
threads, stacks, buffers, queues, etc., and of course the filesystem cache,
all taking memory. We have issues open for some of those things too. But
it's good you're starting somewhere.

> For comparison, IncludeOS needs only 2MB to run a simple example (I have
> not verified it myself though). Also, eventually I hope to create a new
> wiki page describing memory allocation that is more up-to-date than this
> one -
> https://github.com/cloudius-systems/osv/wiki/Managing-Memory-Pages.
>

Good.


> Before I start describing what I have noticed during my experiments and
> suggesting potential improvements, I want to briefly explain how I think
> memory allocation works in OSv. Feel free to correct me where I am wrong
> or am missing some important detail.
>
> At boot time, in arch_setup_free_memory(), OSv discovers how much physical
> memory is available by reading the e820 entries and then linearly maps the
> identified memory ranges by calling memory::free_initial_memory_range().
> So for example, given 100MB from the host, OSv would find a single memory
> range starting wherever loader.elf ends (roughly at the 12MB offset) and
> ending at 100MB, so roughly ~88MB in size. For more details see
> https://github.com/cloudius-systems/osv/blob/186779b2e477815bbcea8ccff6ba26a7e21cea09/arch/x64/arch-setup.cc#L120-L205.
> Once memory is set up, all memory ranges are ultimately registered in
> memory::*free_page_ranges* (of type page_range_allocator - see
> https://github.com/cloudius-systems/osv/blob/186779b2e477815bbcea8ccff6ba26a7e21cea09/core/mempool.cc#L541-L820),
> which effectively tracks all used/free physical memory and implements the
> lowest-level memory allocation logic. At this level memory is
> tracked/allocated/freed in 4K chunks (pages), where each page starts at a
> 0x???000 address.
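>
> To make this concrete, here is a toy, standalone sketch of how I picture
> that first step (this is NOT the actual OSv code - the e820 layout, the
> end_of_kernel value and all names are made up just to mirror the 100MB
> example above):
>
>     #include <algorithm>
>     #include <cstdint>
>     #include <cstdio>
>     #include <vector>
>
>     // toy model of an e820 entry: a physical range plus a "usable RAM" flag
>     struct e820_entry { uint64_t addr, size; bool usable; };
>
>     int main() {
>         // hypothetical layout: 640K of low memory, a hole, then RAM up to 100MB
>         std::vector<e820_entry> e820 = {
>             {0x0,      0xa0000,     true},
>             {0xa0000,  0x60000,     false},
>             {0x100000, 99ull << 20, true},
>         };
>         uint64_t end_of_kernel = 12ull << 20; // pretend loader.elf ends at ~12MB
>         uint64_t free_bytes = 0;
>         for (auto& e : e820) {
>             if (!e.usable)
>                 continue;
>             // clip off whatever the kernel image itself occupies; the rest is
>             // what OSv would hand to memory::free_initial_memory_range()
>             uint64_t start = std::max(e.addr, end_of_kernel);
>             uint64_t end = e.addr + e.size;
>             if (start < end)
>                 free_bytes += end - start;
>         }
>         std::printf("left for the page allocator: %llu MB\n",
>                     (unsigned long long)(free_bytes >> 20));
>     }
>
> With these made-up numbers it prints 88 MB, matching the rough estimate
> above.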
>
> From this point on OSv is ready to handle the "malloc/free" family and the
> memory::alloc_page()/free_page() calls by drawing/releasing memory from/to
> free_page_ranges in the form of page_range objects (see the methods
> page_range_allocator::alloc(), alloc_aligned() and free()) and mapping
> them to virtual address ranges. However, until much later when SMP is
> enabled (multiple vCPUs are fully activated), allocations are handled at a
> different granularity than after SMP is on. In addition, in the first
> (pre-SMP) phase the allocations draw pages directly from the
> *free_page_ranges* object, whereas after SMP is enabled they draw memory
> from the L1/L2 pools. There are as many L1 pools as vCPUs (a per-cpu
> construct) and a single global L2 pool. The L1 pools draw pages from the
> L2 pool, which in turn draws page ranges from *free_page_ranges*. Both L1
> and L2 pools operate at page-size granularity and implement a low/high
> watermark algorithm (for example, L1 pools keep at least 128 pages
> available).
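>
> To make the L1/L2 interaction concrete, here is a minimal, self-contained
> sketch of the low/high watermark idea as I understand it. This is NOT the
> real code - core/mempool.cc implements it per-cpu with dedicated page
> buffer structures - and the constants are only illustrative (a 512-page
> buffer kept between 128 and 384 pages):
>
>     #include <cstddef>
>     #include <new>
>     #include <vector>
>
>     struct toy_l2 {
>         std::vector<void*> pages;      // stands in for the global L2 pool
>         void* pop() {
>             if (pages.empty()) {
>                 // this is where the real L2 pool would carve a fresh
>                 // page_range out of free_page_ranges; here we just fake it
>                 return ::operator new(4096);
>             }
>             void* p = pages.back();
>             pages.pop_back();
>             return p;
>         }
>         void push(void* p) { pages.push_back(p); }
>     };
>
>     struct toy_l1 {
>         static constexpr size_t max = 512;               // buffer capacity
>         static constexpr size_t watermark_lo = max / 4;  // 128 pages
>         static constexpr size_t watermark_hi = max * 3 / 4;
>         std::vector<void*> pages;                        // per-cpu cache
>         toy_l2& l2;
>         explicit toy_l1(toy_l2& pool) : l2(pool) {}
>
>         void* alloc_page() {
>             while (pages.size() < watermark_lo)  // refill from L2 when low
>                 pages.push_back(l2.pop());
>             void* p = pages.back();
>             pages.pop_back();
>             return p;
>         }
>         void free_page(void* p) {
>             pages.push_back(p);
>             while (pages.size() > watermark_hi) { // drain excess back to L2
>                 l2.push(pages.back());
>                 pages.pop_back();
>             }
>         }
>     };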
>
> It is also worth noting that most malloc functions (except malloc_large())
> end up calling std_malloc() (see
> https://github.com/cloudius-systems/osv/blob/186779b2e477815bbcea8ccff6ba26a7e21cea09/core/mempool.cc#L1544-L1565),
> which allocates virtual memory in different ways depending on whether we
> are in pre- or post-SMP-enabled mode and on the size of the request. The
> size ranges are: x <= 1024 (page size/4), 1024 < x <= 4096, and x > 4096.
> If we are in SMP-enabled mode and the requested size is less than or equal
> to 1024 bytes, the allocation is delegated to the malloc pools (see
> https://github.com/cloudius-systems/osv/blob/186779b2e477815bbcea8ccff6ba26a7e21cea09/core/mempool.cc#L177-L353).
> Malloc pools are set up per-CPU, each dedicated to a specific size range
> (2^(k-1) < x <= 2^k, where k is less than or equal to 10). The way
> std_malloc() handles <= 4K allocations directly impacts how well the
> underlying physical memory is utilized - for example, any request above
> 1024 bytes will use a whole page and in the worst case waste 3K of
> physical memory. Similarly, malloc pool allocations may in the worst case
> waste almost half of the 2^k segment size. In the following paragraphs I
> will try to suggest some ideas for how this waste can be mitigated.
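>
> A toy calculation to illustrate the waste (again not OSv code, and it
> ignores the per-page bookkeeping the real pools add, so the numbers are
> only approximate):
>
>     #include <cstddef>
>     #include <cstdio>
>
>     // size class picked by the malloc pools: the smallest power of two
>     // that fits the request
>     static size_t pool_class(size_t size) {
>         size_t k = 1;
>         while (k < size)
>             k <<= 1;
>         return k;
>     }
>
>     int main() {
>         for (size_t req : {17, 513, 1024, 1025, 4000}) {
>             // above 1024 bytes std_malloc falls back to a whole 4K page
>             size_t used = req <= 1024 ? pool_class(req) : 4096;
>             std::printf("request %4zu -> uses %4zu bytes, wastes %4zu\n",
>                         req, used, used - req);
>         }
>     }
>
> For example a 513-byte request occupies a 1024-byte segment (511 bytes
> wasted) and a 1025-byte request occupies a whole page (3071 bytes wasted).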
>

See also https://github.com/cloudius-systems/osv/issues/1000


> The malloc_large()/free_large() calls draw memory directly from
> free_page_ranges in both the pre- and post-SMP-enabled phases.
>
> Most of my explanation above describes the logic in core/mempool.cc. It
> does not really touch on what happens in mmu.cc, which (as far as I could
> tell) mostly deals with setting up page tables (PTEs) and updating the
> TLB. Unfortunately a lot of the code in mmu.cc is heavily based on complex
> templates, so if someone else could explain this part I would really
> appreciate it.
>

Gleb (CCed) is the only one who understands these templates ;-)
The good news is that understanding them is not necessary for understanding
the memory allocation. They just deal with the page tables, which is an
implementation detail (all we need to know is that x86 can map memory at
page granularity).
Moreover, for "normal" malloc(), not mmap(), we don't need to worry about
page tables at all - we have a static mapping for the entire memory, and
malloc() returns addresses from that mapping, and doesn't need to modify
the page tables.
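
Roughly, the idea is just this (an illustrative snippet, not the actual OSv
code, though mmu.hh has phys/virt conversion helpers in this spirit):

    #include <cstdint>

    // all of physical memory is mapped once, at boot, at a fixed virtual
    // offset, so converting between physical and virtual addresses is plain
    // arithmetic and never touches the page tables
    constexpr uintptr_t phys_mem_base = 0xffff800000000000ull; // example value

    inline void* phys_to_virt(uintptr_t pa) {
        return reinterpret_cast<void*>(pa + phys_mem_base);
    }
    inline uintptr_t virt_to_phys(void* va) {
        return reinterpret_cast<uintptr_t>(va) - phys_mem_base;
    }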


> To measure memory usage I added atomic counters at various points in
> mempool.cc and used the alloc tracker mechanism (core/alloctracker.cc,
> "osv leak") to capture individual allocations. The OSv alloc tracker is a
> wonderful tool for debugging memory usage; however, during my experiments
> I encountered some crashes when capturing allocations in the pre-SMP phase
> (I could narrow it down to some issue in backtrace_safe()/safe_load()).
>

Is this the bug you fixed with the memcpy and old_size?


> I also had to use the debug version of loader.elf, as stack traces
> captured with the release version would be malformed due to compiler
> optimizations. Finally, I limited myself to the simplest image with
> native-example (the C hello world app) running on QEMU/KVM with a single
> vCPU and 50MB of memory.
>
> I have divided my measurements and findings into two parts - the
> pre-SMP-enabled and post-SMP-enabled phases.
>
> *Pre-SMP-enabled phase:*
> -----------------------
>
>    - counted 1884 whole-page std_malloc allocations and 249 deallocations
>    to satisfy <=4K requests totaling 120,116 bytes (~120 KB) -> used and
>    retained a net total of 1582 pages (6328 KB) from free_page_ranges;
>    almost all allocations were far below 4096 bytes, mostly in the 64-byte
>    range -> please see
>    https://github.com/cloudius-systems/osv/issues/270#issuecomment-424811320
>    for more details
>
>
>    - counted 27 malloc_large() (>4K) allocations (all retained), to
>    satisfy a total of 1288 KB
>
> ----------
> *TOTAL: allocated and retained a total of 7616 KB*
> *Per the free stats, allocated physical memory at this point is 7620 KB*
>
> !!! *It would be great to save 6MB of memory* - issue 270 !!!
>

Indeed :-) It's even worse - 10 times worse than I estimated when I created
this issue. So either I mis-counted, or we are now allocating a lot more
crap during initialization :-(


> *Post-SMP-enabled phase:*
> ---------------------------------------------------
>
>    - used 345,888 bytes (~340K) to satisfy the 309,953 bytes requested by
>    the < 1024 byte allocations
>
>
I think this is good enough.

>
>    - used 176 pages (704K) in order to allocate 330,543 bytes for the 1K <
>    x < 4K allocations
>       - 40 of those full-page allocations were for requests of <16 bytes,
>       which seems incorrect (please see the if condition at
>       https://github.com/cloudius-systems/osv/blob/186779b2e477815bbcea8ccff6ba26a7e21cea09/core/mempool.cc#L1549
>       - it needs to take into account that the alignment of anything under
>       16 bytes will lead to this condition being false -> a possible
>       saving of 160K!!!)
>
>
Very interesting! Can you please open an issue for this? When size is tiny,
we anyway allocate not "size" but rather  memory::pool::min_object_size
(see the "max" code below), so we need the alignment to be less than that,
not the size.

I think you can easily create a unit test for this - do malloc(1) a hundred
times. If all of them are 0 modulo 4096, our malloc() did something wrong.
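
Something along these lines (an untested sketch, not an existing test in
tests/ - the output and exit code are just illustrative):

    #include <cstdint>
    #include <cstdio>
    #include <cstdlib>

    int main() {
        void* ptrs[100];
        int page_aligned = 0;
        for (auto& p : ptrs) {
            p = std::malloc(1);
            if (reinterpret_cast<uintptr_t>(p) % 4096 == 0) {
                page_aligned++;
            }
        }
        // tiny allocations should be served by the small-object pools rather
        // than whole pages, so it would be very suspicious if every single
        // malloc(1) result were page aligned
        std::printf("%d of 100 malloc(1) results were page aligned\n",
                    page_aligned);
        for (auto& p : ptrs) {
            std::free(p);
        }
        return page_aligned == 100 ? 1 : 0;
    }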


>    - allocated 650 pages (2600 KB) and released 107 pages (428 KB) to
>    satisfy malloc_large()/free_large() calls; net total retained - 2.2 MB
>
>
To say if this is good or bad, you are missing the total desired size of
these allocated objects.


>    - counted 512 full-page allocations from mmu::map_anon() and 256
>    full-page allocations from virtio::net::fill_rx_ring() - net total of 3MB
>
> --------------
> *TOTAL: allocated and retained 704K + 340K + 2.2 MB + 3 MB ~ 6.2 MB*
> *Per the free stats, free physical memory is 6312 pages (25,853,952
> bytes) and used memory is 7028 KB*
> -- The difference may come from the fact that each L1 pool maintains a
> buffer of 128 pages (0.5MB), plus some pages are held in L2 - see below
>

> *L1/L2 pools:*
>
>    - L2 pool filled 1248 pages (4992 KB) and unfilled 256 pages (1024 KB)
>    - per-cpu L1 pools filled 1152 pages (4608 KB) and unfilled 480 pages
>    (1920 KB)
>    - L1 users (allocation <= 4K) grabbed 1113 pages (4452 KB) and
>    returned 686 pages (2744 KB) to L1
>    - L1/L2 observation -> each L1 pool needs to stay above its low
>    watermark (128 pages = 512K per cpu), so that much memory is
>    potentially over-allocated
>
>
Maybe we should have an option, or a case depending on physical memory
size, or something, where these pools are resized depending on the memory
size. There's nothing holy about the existing sizes; we may make do with
less if we have a tiny amount of physical memory.


> All in all, to run a simple hello world app OSv needs around 27 MB = 12 MB
> (kernel) + 7.5 MB (pre-SMP) + 7 MB (post-SMP-enabled).
>
>
Thanks. Very good analysis. I think we can do (and have been doing) work on
reducing *all* these numbers.
I think much of the last number is various buffers and queues with
arbitrary sizes which we can rethink.


> I mostly used the counter values to describe my findings above, but I am
> also attaching the memory leak output I captured for the post-SMP-enabled
> phase.
>
> *Here are some unvarnished ideas for improvements besides what is
> described in #270 (the biggest potential saving anyway):*
>
>    - fix what seems like a bug that causes allocations of less than 16
>    bytes to be handled as full-page allocations (see my note above); in
>    the case of the Java and Python images I counted around 500 allocations
>    like these (2MB)
>
>
Indeed. I suggested above you open an issue about it (and fix it ;-)).

>
>    - utilize the physical memory under the 2MB offset (currently 1MB +
>    640KB is wasted)
>
True. I don't remember why we ever did this.


>
>    - improve the malloc pool allocation logic by introducing more size
>    levels; so besides 2^2, 2^3 ... 2^10, add 756, 1024 + 256, 1024 + 512
>    and 1024 + 756, which would allow us to pack 2 or more allocations per
>    page instead of using a full page
>
>
I'm not sure it's worthwhile to add a lot of levels and complicate the
already complicated code in mempool.cc, but in
https://github.com/cloudius-systems/osv/issues/1000 I suggested that adding
just one more level, for 1025-2038 bytes, should be easy and provide
"enough" improvement.

>
>
>    - malloc pools could be further improved by adding a second list of
>    smaller segments that would better utilize the unused parts of pages
>    for sizes greater than 1024 -> this may be quite cumbersome to
>    implement
>    - look into whether we really need 1MB for the virtio ring and possibly
>    make it configurable
>
>
>    - improve the alloc tracker by adding the ability to capture how much
>    memory was used to satisfy a given allocation request vs. what was
>    requested, and to allow some categorization, like non-SMP/SMP, malloc
>    pool/alloc_page/std_malloc, and purpose (generic malloc/mmap anon/mmap
>    fault handling, etc.) -> this would make it possible to collect the
>    same information without having to add these special counters
>    - for non-ZFS images, disable zfs_init(), which consumes around 0.3 MB
>
> Before I create any new issues I would like to run what I have written
> here by others to validate my findings and ideas. I hope some of these
> ideas can be implemented as part of the next 0.53 release.
>

Very good analysis. Thanks!


> My regards,
> Waldek
>
