On Thu, Jan 28, 2021 at 03:25:46PM +0000, Marek Klein wrote:

> > On Wed, Jan 27, 2021 at 08:39:46AM +0100, Otto Moerbeek wrote:
> > 
> > > On Tue, Jan 26, 2021 at 04:08:40PM +0000, Marek Klein wrote:
> > >
> > > > Hi,
> > > >
> > > > We are working on an appliance like product that is based on OpenBSD.
> > > > Recently we found out that our performance critical C++ program is
> > > > ~2.5 times slower on OpenBSD compared to Ubuntu 20.04.
> > > >
> > > > The program basically just reads data from stdin, transforms the
> > > > data, and writes the result to stdout; it performs no further I/O
> > > > and does not interact with other programs. We make extensive use of
> > > > the C++ standard library string class for manipulating the data.
> > > >
> > > > We started searching for the reason and eliminated I/O as a factor.
> > > > During some experiments we found that one factor, perhaps not the
> > > > only one, is OpenBSD's memory management. To test this assumption we
> > > > wrote a simple program that allocates and frees memory in a loop.
> > > > Something like:
> > > >
> > > > for (...) {
> > > >   void *buffer = malloc(...);
> > > >   ...
> > > >   free(buffer);
> > > > }
> > > >
> > > > We compiled it on OpenBSD with clang
> > > > $ /usr/bin/c++ --version
> > > > OpenBSD clang version 10.0.1
> > > > Target: amd64-unknown-openbsd6.8
> > > > Thread model: posix
> > > > InstalledDir: /usr/bin
> > > >
> > > > using options '-O3 -DNDEBUG -std=gnu++11' and ran it without memory
> > > > junking.
> > > >
> > > > $ time MALLOC_OPTIONS=jj ./memory_allocs --cycles 123456789 --size 1024
> > > >
> > > > real    0m27.218s
> > > > user    0m27.220s
> > > > sys     0m0.020s
> > > >
> > > > We compiled the same program on Ubuntu 20.04 with g++
> > > > $ /usr/bin/c++ --version
> > > > c++ (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
> > > >
> > > > using the same options '-O3 -DNDEBUG -std=gnu++11'
> > > >
> > > > $ time ./memory_allocs --cycles 123456789 --size 1024
> > > >
> > > > real    0m1,920s
> > > > user    0m1,915s
> > > > sys     0m0,004s
> > > >
> > > > Both systems were tested in the same virtualized environment (VSphere),
> > > > thus we can assume the "hardware" is the same.
> > > >
> > > > Given the virtual environment, the tests might not be scientifically
> > > > rigorous, but they illustrate the observation well enough. We
> > > > actually ruled out virtualization as a cause in other tests.
> > >
> > > Short story: the slowness is because you get more security.
> > >
> > > Somewhat longer story: depending on the size of the allocation, actual
> > > unmaps take place on free. This will always catch use-after-free. For
> > > smaller allocations, caching takes place; sadly, you did not tell us
> > > how big the total of your allocations is, so I cannot predict whether
> > > enlarging the cache will help you.
> > >
> > > Now the difference is quite big, so I would like to know exactly what
> > > you are doing in your test program.  Please provide the full test
> > > program so I can take a look.
> > >
> > > >
> > > > What other options are there we could try in order to speed the memory
> > > > management up?
> > >
> > > Some hints: allocate/free less, and use better algorithms that do not
> > > allocate as much.  With C++, make sure your code uses moves of objects
> > > instead of copies whenever possible. Use reserve() wisely. If all else
> > > fails you might go for custom allocators, but you will lose security
> > > features.
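> > >
> > > For instance, a rough, untested sketch of what I mean by reserve()
> > > and moves (make_line() and collect() are just made-up names):
> > >
> > > #include <string>
> > > #include <utility>
> > > #include <vector>
> > >
> > > // Build a line with one up-front allocation, then move it into the
> > > // container instead of copying it.
> > > std::string make_line(const std::string &key, const std::string &value)
> > > {
> > >     std::string line;
> > >     line.reserve(key.size() + value.size() + 1);
> > >     line += key;
> > >     line += '=';
> > >     line += value;
> > >     return line;                       // moved (or elided), not copied
> > > }
> > >
> > > void collect(std::vector<std::string> &out, std::string line)
> > > {
> > >     out.push_back(std::move(line));    // move, no extra allocation
> > > }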
> > >
> > >   -Otto
> > >
> > > >
> > > > Also, are there any other known areas of CPU-bound processing where
> > > > OpenBSD performs worse than other "common" platforms?
> > > >
> > > > Cheers,
> > > > Marek
> > > >
> > >
> > 
> > To reply to myself.
> > 
> > Be VERY careful when drawing conclusions from these kinds of test
> > programs. To demonstrate, the loop in the test program below gets
> > compiled out by some compilers with some settings.
> > 
> > So again, please provide your test program.
> > 
> >     -Otto
> > 
> > #include <err.h>
> > #include <limits.h>
> > #include <stdio.h>
> > #include <stdlib.h>
> > 
> > int
> > main(int argc, char *argv[])
> > {
> >     size_t count, sz, i;
> >     char *p;
> >     const char *errstr;
> > 
> >     if (argc != 3)
> >             errx(1, "usage: %s count size", argv[0]);
> > 
> >     count = strtonum(argv[1], 0, LONG_MAX, &errstr);
> >     if (errstr)
> >             errx(1, "%s: %s", argv[1], errstr);
> >     sz = strtonum(argv[2], 0, LONG_MAX, &errstr);
> >     if (errstr)
> >             errx(1, "%s: %s", argv[2], errstr);
> > 
> >     printf("Run with %zu %zu\n", count, sz);
> > 
> >     for (i = 0; i < count; i++) {
> >             p = malloc(sz);
> >             if (p == NULL)
> >                     err(1, NULL);
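> >             /* touch the block so the malloc/free pair is not optimized away */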
> >             *p = 1;
> >             free(p);
> >     }
> > }
> > 
> > 
> 
> Hi Otto,
> 
> My test program does something very similar.
> 
> As stated before I compile with
> 1. OpenBSD: clang version 10.0.1 and
> 2. Ubuntu: g++ (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
> with the same options '-O3 -DNDEBUG -std=gnu++11'.
> 
> The execution time grows with the number of cycles and also with
> the size of allocated memory on both platforms, thus I think the loop
> is not optimized out.
> 
> OpenBSD needs consistently ~10x longer to finish the test compared to
> Ubuntu. Regarding the size of allocations, we operate on relatively
> short strings, e.g., 25 bytes long.

OK, some observations

1. malloc on OpenBSD is indeed slower than glibc's. That is no
surprise.

2. Your test (and mine) emphasizes (or rather exaggerates!) the
difference in speed by doing only malloc and free, with virtually no
real work in between. That makes the overhead look much bigger than it
is in real-life applications.

3. There are no obvious changes to make to malloc that would make it
faster without sacrificing security features. I verified that by
profiling and reviewing my code. The extra costs are mainly due to
randomization and to keeping all meta-data away from the chunks
returned to the application.

4. The loops in the test programs can trivially be made quicker by
moving the allocation outside the loop and re-using the allocated
memory. The speed difference then becomes a non-issue.
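
For example, your inner loop (using the dynamic_buffer class and the
variables from your program below) could reuse one buffer; an untested
sketch:

// Allocate once, reuse the same buffer every iteration.
dynamic_buffer buffer(size_of_buffer);
for (int i = 0; i < number_of_cycles; i++) {
  // Read the buffer so the loop still does observable work.
  if (*reinterpret_cast<unsigned int*>(buffer.raw_memory()) == 0xDEADBEEF) {
    std::cout << "Bingo!" << std::endl;
  }
}

Of course that changes what the benchmark measures: it now times the
loop itself instead of malloc/free.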

5. If the difference in speed is indeed an issue in your application,
I would profile (compile with -pg -static), run gprof(1), and hunt
for places where you can optimize. Sometimes it is indeed as easy as
moving an allocation/deallocation pair outside a loop.

6. If you are not able to find optimizations that satisfy your
performance requirements, consider using custom allocators, but
realize you will lose security features. If that is not possible,
maybe OpenBSD is not the right platform for you.
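
To make "custom allocator" concrete: I mean something like a simple
arena that hands out pieces of one big malloc'd block and releases
everything at once. A rough, untested sketch (the arena class below is
made up for illustration); note it gives up malloc's randomization,
junking and use-after-free detection entirely:

#include <cstddef>
#include <cstdlib>
#include <new>

// Minimal bump/arena allocator: one big block, pointer-bump allocation,
// everything released in one go.  No per-allocation free, no security
// features.
class arena {
public:
  explicit arena(size_t size)
    : m_base(static_cast<char *>(malloc(size))), m_size(size), m_used(0) {
    if (m_base == NULL)
      throw std::bad_alloc();
  }
  ~arena() { free(m_base); }

  arena(const arena &) = delete;
  arena &operator=(const arena &) = delete;

  void *alloc(size_t n) {
    n = (n + 15) & ~static_cast<size_t>(15);   // keep 16-byte alignment
    if (n > m_size - m_used)
      return NULL;                             // caller handles exhaustion
    void *p = m_base + m_used;
    m_used += n;
    return p;
  }

  void reset() { m_used = 0; }                 // "free" everything at once

private:
  char *m_base;
  size_t m_size;
  size_t m_used;
};

In the test loop that would mean one arena, an alloc() per iteration
and an occasional reset(), instead of a malloc/free pair per iteration.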

        -Otto

> 
> Cheers,
> Marek
> 
> #include <iostream>
> #include <memory>
> #include <sstream>
> #include <stdexcept>
> #include <cstdlib>
> #include <string.h>
> 
> class dynamic_buffer {
> public:
>   dynamic_buffer(size_t size)
>     : m_memory(NULL) {
>     m_memory = (char *)malloc(size);
>     if (m_memory == NULL) {
>       throw std::runtime_error("out of memory");
>     }
>   }
> 
>   dynamic_buffer() = delete;
>   dynamic_buffer(const dynamic_buffer&) = delete;
>   dynamic_buffer(dynamic_buffer&&) noexcept = delete;
>   dynamic_buffer& operator=(const dynamic_buffer&) = delete;
>   dynamic_buffer& operator=(dynamic_buffer&&) noexcept = delete;
> 
>   char* raw_memory() {
>     return m_memory;
>   }
> 
>   ~dynamic_buffer() {
>     if (m_memory != NULL) {
>       free(m_memory);
>     }
>   }
> private:
>   char *m_memory;
> };
> 
> static std::string help(const std::string &program_name) {
>   std::stringstream help;
>   help << program_name
>        << " --cycles <number of cycles> --size <size of buffer>"
>        << std::endl;
> 
>   return help.str();
> }
> 
> int main(int argc, const char *argv[]) {
>   try {
>     if (argc != 5) {
>       throw std::logic_error(help(std::string(argv[0])));
>     }
> 
>     int number_of_cycles = atoi(argv[2]);
>     int size_of_buffer = atoi(argv[4]);
> 
>     for (int i = 0; i < number_of_cycles; i++) {
>       dynamic_buffer buffer(size_of_buffer);
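>     // Read the fresh allocation so the compiler cannot drop the loop.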
>     if (*reinterpret_cast<unsigned int*>(buffer.raw_memory()) == 0xDEADBEEF) {
>         std::cout << "Bingo!" << std::endl;
>       }
>     }
>     return 0;
>   } catch (const std::exception &e) {
>     std::cerr << e.what() << std::endl;
>   } catch (...) {
>     std::cerr << "Something went really wrong" << std::endl;
>   }
> 
>   return 1;
> }
