> On Wed, Jan 27, 2021 at 08:39:46AM +0100, Otto Moerbeek wrote:
> 
> > On Tue, Jan 26, 2021 at 04:08:40PM +0000, Marek Klein wrote:
> >
> > > Hi,
> > >
> > > We are working on an appliance-like product that is based on OpenBSD.
> > > Recently we found out that our performance-critical C++ program is
> > > ~2.5 times slower on OpenBSD compared to Ubuntu 20.04.
> > >
> > > The program basically just reads data from stdin, does some
> > > transformation of the data, and returns the result on stdout, thus
> > > the program does not perform any further I/O operations nor interacts
> > > with other programs. We extensively use the C++ standard library string
> > > class for manipulation of data.
> > >
> > > We started searching for the reason, and eliminated I/O as a factor.
> > > During some experiments we found that one factor, perhaps not the
> > > only one, is OpenBSD's memory management. To test this assumption we
> > > wrote a simple program that allocates and frees memory in a loop.
> > > Something like:
> > >
> > > for (...) {
> > >   void *buffer = malloc(...);
> > >   ...
> > >   free(buffer);
> > > }
> > >
> > > We compiled it on OpenBSD with clang
> > > $ /usr/bin/c++ --version
> > > OpenBSD clang version 10.0.1
> > > Target: amd64-unknown-openbsd6.8
> > > Thread model: posix
> > > InstalledDir: /usr/bin
> > >
> > > using options '-O3 -DNDEBUG -std=gnu++11' and ran it without memory
> > > junking.
> > >
> > > $ time MALLOC_OPTIONS=jj ./memory_allocs --cycles 123456789 --size 1024
> > >
> > > real      0m27.218s
> > > user      0m27.220s
> > > sys       0m0.020s
> > >
> > > We compiled the same program on Ubuntu 20.04 with g++
> > > $ /usr/bin/c++ --version
> > > c++ (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
> > >
> > > using the same options '-O3 -DNDEBUG -std=gnu++11'
> > >
> > > $ time ./memory_allocs --cycles 123456789 --size 1024
> > >
> > > real      0m1,920s
> > > user      0m1,915s
> > > sys       0m0,004s
> > >
> > > Both systems were tested in the same virtualized environment (VSphere),
> > > thus we can assume the "hardware" is the same.
> > >
> > > Given the virtual environment, the tests might not be scientifically
> > > the best choice, but they serve the observation well enough. We
> > > actually ruled out virtualization as a cause in other tests.
> >
> > Short story: the slowness is because you get more security.
> >
> > Somewhat longer story: depending on the size of the allocation, actual
> > unmaps take place on free. This always catches use-after-free. For
> > smaller allocations, caching takes place. Sadly, you did not tell us
> > how big your allocations are in total, so I cannot predict whether
> > enlarging the cache will help you.
> >
> > Now, the difference is quite big, so I would like to know what you are
> > doing exactly in your test program. Please provide the full test program
> > so I can take a look.
> >
> > >
> > > What other options are there we could try in order to speed the memory
> > > management up?
> >
> > Some hints: allocate/free less, and use better algorithms that do not
> > allocate as much. With C++, make sure your code moves objects
> > instead of copying them whenever possible. Use reserve() wisely. If all
> > else fails you might go for custom allocators, but you will lose security
> > features.
> >
> >     -Otto
> >
> > >
> > > Also are there any other known areas, for CPU bound processing, where
> > > OpenBSD performs worse than other "common" platforms?
> > >
> > > Cheers,
> > > Marek
> > >
> >
> 
> To reply to myself.
> 
> Be VERY careful when drawing conclusions from these kinds of test
> programs. To demonstrate, the loop in the test program below gets
> compiled out by some compilers with some settings.
> 
> So again, please provide your test program.
> 
>       -Otto
> 
> #include <err.h>
> #include <limits.h>
> #include <stdio.h>
> #include <stdlib.h>
> 
> int
> main(int argc, char *argv[])
> {
>       size_t count, sz, i;
>       char *p;
>       const char *errstr;
> 
>       count = strtonum(argv[1], 0, LONG_MAX, &errstr);
>       if (errstr)
>               errx(1, "%s: %s", argv[1], errstr);
>       sz = strtonum(argv[2], 0, LONG_MAX, &errstr);
>       if (errstr)
>               errx(1, "%s: %s", argv[2], errstr);
> 
>       printf("Run with %zu %zu\n", count, sz);
> 
>       for (i = 0; i < count; i++) {
>               p = malloc(sz);
>               if (p == NULL)
>                       err(1, NULL);
>               *p = 1;
>               free(p);
>       }
> }
> 
> 

Hi Otto,

My test program does something very similar.

As stated before I compile with
1. OpenBSD: clang version 10.0.1 and
2. Ubuntu: g++ (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
with the same options '-O3 -DNDEBUG -std=gnu++11'.

The execution time grows with the number of cycles and also with
the size of allocated memory on both platforms, thus I think the loop
is not optimized out.

OpenBSD needs consistently ~10x longer to finish the test compared to
Ubuntu. Regarding the size of allocations, we operate on relatively
short strings, e.g., 25 bytes long.
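
Since the strings are this short, one way to follow the hint about
allocating less could be to reuse a single std::string as scratch space
instead of constructing a new one per record. As far as I know, 25 bytes
is larger than the small-string buffer of both libc++ and libstdc++, so
every fresh string of that length typically costs one malloc/free pair.
A minimal sketch, assuming a made-up transformation (to_upper_into) and
a made-up helper (count_buffer_moves) that are not part of our program:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical transformation: copies `in` into `out`, upper-casing ASCII
// letters. `out` is cleared and refilled, so its heap buffer can be reused.
void to_upper_into(const std::string &in, std::string &out) {
  out.clear();  // drops the contents; common implementations keep the capacity
  for (char c : in) {
    out.push_back(c >= 'a' && c <= 'z' ? static_cast<char>(c - 'a' + 'A') : c);
  }
}

// Hypothetical helper: processes all inputs with one scratch string and
// counts how often the scratch buffer changed address (i.e. was reallocated).
std::size_t count_buffer_moves(const std::vector<std::string> &inputs) {
  std::string scratch;
  scratch.reserve(64);  // assumption: records stay well below 64 bytes
  const char *before = scratch.data();
  std::size_t moves = 0;
  for (const std::string &in : inputs) {
    to_upper_into(in, scratch);
    if (scratch.data() != before) {
      moves++;
      before = scratch.data();
    }
  }
  return moves;
}
```

If count_buffer_moves() stays at zero for representative input, the loop
itself is not allocating, whatever the allocator costs per call.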

Cheers,
Marek

#include <iostream>
#include <memory>
#include <sstream>
#include <stdexcept>
#include <stdlib.h>  // malloc, free, atoi
#include <string>    // std::string

class dynamic_buffer {
public:
  dynamic_buffer(size_t size)
    : m_memory(NULL) {
    m_memory = (char *)malloc(size);
    if (m_memory == NULL) {
      throw std::runtime_error("out of memory");
    }
  }

  dynamic_buffer() = delete;
  dynamic_buffer(const dynamic_buffer&) = delete;
  dynamic_buffer(dynamic_buffer&&) noexcept = delete;
  dynamic_buffer& operator=(const dynamic_buffer&) = delete;
  dynamic_buffer& operator=(dynamic_buffer&&) noexcept = delete;

  char* raw_memory() {
    return m_memory;
  }

  ~dynamic_buffer() {
    if (m_memory != NULL) {
      free(m_memory);
    }
  }
private:
  char *m_memory;
};

static std::string help(const std::string &program_name) {
  std::stringstream help;
  help << program_name
       << " --cycles <number of cycles> --size <size of buffer>"
       << std::endl;

  return help.str();
}

int main(int argc, const char *argv[]) {
  try {
    if (argc != 5) {
      throw std::logic_error(help(std::string(argv[0])));
    }

    int number_of_cycles = atoi(argv[2]);
    int size_of_buffer = atoi(argv[4]);

    for (int i = 0; i < number_of_cycles; i++) {
      dynamic_buffer buffer(size_of_buffer);
      if (*reinterpret_cast<unsigned int*>(buffer.raw_memory()) == 0xDEADBEEF) {
        std::cout << "Bingo!" << std::endl;
      }
    }
    return 0;
  } catch (const std::exception &e) {
    std::cerr << e.what() << std::endl;
  } catch (...) {
    std::cerr << "Something went really wrong" << std::endl;
  }

  return 1;
}
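
The loop above keeps the allocation alive by reading uninitialized
memory, which is not well defined. A variant that should be harder for
the compiler to remove stores the pointer in a volatile object and
writes one byte before freeing. A minimal sketch; run_cycles and
time_cycles are hypothetical names, not part of the program above:

```cpp
#include <chrono>
#include <cstddef>
#include <cstdlib>

// Hypothetical helper: performs `cycles` malloc/free pairs of `size` bytes
// and returns how many allocations succeeded.
std::size_t run_cycles(std::size_t cycles, std::size_t size) {
  std::size_t done = 0;
  for (std::size_t i = 0; i < cycles; i++) {
    // The pointer object itself is volatile, so every store to and load
    // from `p` is an observable access the optimizer has to keep.
    char *volatile p = static_cast<char *>(std::malloc(size));
    if (p == nullptr) {
      break;  // out of memory: stop early and report what was done
    }
    if (size > 0) {
      p[0] = 1;  // touch the buffer so the allocation is actually used
    }
    std::free(p);
    done++;
  }
  return done;
}

// Hypothetical helper: wall-clock seconds spent in run_cycles().
double time_cycles(std::size_t cycles, std::size_t size) {
  const auto start = std::chrono::steady_clock::now();
  run_cycles(cycles, size);
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(stop - start).count();
}
```

Calling time_cycles() with the same arguments on both systems gives
numbers that do not include process startup, which `time` does include.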
