Thanks Greg!
Will test and report back on Monday.

Cheers,

Charles


On Fri, 14 Sep 2018, 20:32 Gregory Farnum, <[email protected]> wrote:

> [Adding ceph-devel]
>
> On Fri, Sep 14, 2018 at 5:22 AM, Charles-François Natali
> <[email protected]> wrote:
> > See
> >
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/029780.html
> > for the original thread.
> >
> > Here is a trivial reproducer not using any aio or dynamically allocated
> > memory to store the objects read.
> > It simply reads 20,000 1MB large objects sequentially: when run, instead
> of
> > using a roughly fixed amount of memory, the RSS increases up to 400 MB.
> >
> > <code>
> > #include <assert.h>
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <string.h>
> > #include <rados/librados.h>
> >
> > #define SIZE 1000000
> >
> > void readobj(rados_ioctx_t* io, char objname[], unsigned long size) {
> >     static char data[SIZE];
> >     unsigned long bytes_read;
> >     int retval;
> >
> >     rados_read_op_t read_op = rados_create_read_op();
> >     rados_read_op_read(read_op, 0, size, data, &bytes_read, &retval);
> >     retval = rados_read_op_operate(read_op, *io, objname, 0);
> >     assert(retval == 0);
> >     rados_release_read_op(read_op);
> > }
> >
> > int main() {
> >     rados_t cluster;
> >     rados_create(&cluster, "test");
> >     rados_conf_read_file(cluster, "/etc/ceph/ceph.conf");
> >     rados_connect(cluster);
> >
> >     rados_ioctx_t io;
> >     rados_ioctx_create(cluster, "test", &io);
> >
> >     char smallobj_name[16];
> >     int i;
> >     long long total_bytes_read = 0;
> >
> >     for (i = 0; i < 20000; i++) {
> >         sprintf(smallobj_name, "1mobj_%d", i);
> >         readobj(&io, smallobj_name, SIZE);
> >         readobj(&io, smallobj_name, 1);
> >
> >         total_bytes_read += SIZE;
> >         printf("Read %s for total %lld\n", smallobj_name,
> total_bytes_read);
> >     }
> >
> >     return 0;
> > }
> > </code>
> >
> >
> > Running it under valgrind with --show-reachable=yes shows where the
> problem
> > is:
> >
> > ==10253== 423,755,776 bytes in 424 blocks are still reachable in loss
> record
> > 882 of 882
> > ==10253==    at 0x4C2FEA6: memalign (in
> > /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
> > ==10253==    by 0x4C2FFB1: posix_memalign (in
> > /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
> > ==10253==    by 0x57A3C06:
> ceph::buffer::create_aligned_in_mempool(unsigned
> > int, unsigned int, int) (in /usr/lib/ceph/libceph-common.so.0)
> > ==10253==    by 0x5984B49: AsyncConnection::process() (in
> > /usr/lib/ceph/libceph-common.so.0)
> > ==10253==    by 0x5994C97: EventCenter::process_events(int,
> > std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*) (in
> > /usr/lib/ceph/libceph-common.so.0)
> > ==10253==    by 0x5999817: ??? (in /usr/lib/ceph/libceph-common.so.0)
> > ==10253==    by 0xE960C7F: ??? (in
> > /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
> > ==10253==    by 0xE6926B9: start_thread (pthread_create.c:333)
> > ==10253==    by 0x52A041C: clone (clone.S:109)
> >
> > We can see that the memory is allocated in AsyncConnection::process,
> > precisely here:
> >
> https://github.com/ceph/ceph/blob/master/src/msg/async/AsyncConnection.cc#L614
> >
> > From my understanding - I have never read ceph code before, so my
> > understanding might be flawed - this code is allocating a buffer to store
> > the data contained in the message received from a ceph node and adding
> it to
> > a buffers list `data_buf`.
> > Note that this list of buffers is per AsyncConnection, that's important.
> >
> > Unfortunately, looking through the file shows that `data_buf` only gets
> > cleared here:
> >
> https://github.com/ceph/ceph/blob/master/src/msg/async/AsyncConnection.cc#L477
> > when it transitions to the STATE_OPEN_MESSAGE_HEADER state, IIUC only
> occurs
> > when it starts receiving a new CEPH_MSGR_TAG_MSG message.
> >
> > Which means that basically the following happens - the below is for a
> single
> > AsyncConnection:
> >
> > repeat:
> >    0. wait for message
> >    1. parse message header
> >    2. realize it's an CEPH_MSGR_TAG_MSG message (not a heartbeat, etc):
> > transition to STATE_OPEN_MESSAGE_HEADER
> >    3. clear all buffers
> > (
> https://github.com/ceph/ceph/blob/master/src/msg/async/AsyncConnection.cc#L477
> )
> >    4. start parsing message, allocate memory buffer
> > (
> https://github.com/ceph/ceph/blob/master/src/msg/async/AsyncConnection.cc#L614
> )
> >    ...
> >    n. done processing message
> >
> > Which means that the memory allocated when processing a message - at
> step 4
> > - is only released when a subsequent CEPH_MSGR_TAG_MSG message is
> received.
> > For a single connection that's not too bad - memory is just freed later
> than
> > it could, but with a large OSD pool, it adds up.
> > Our cluster is made of over 40 ceph OSDs, which results in around 400
> > connections. Which means that if reading for example 1MB objects, this
> > delayed freeing can use up to 400MB (1MB per connection), which is
> exactly
> > what we're seeing.
> >
> > It would also explains:
> > - why reading a small subset of objects over and over again doesn't
> result
> > in a memory increase: because they are distributed over few ceph OSDs,
> hence
> > use few connections
> > - the shape of the memory consumption over time: if it were a leak (even
> > with reachable objects), it would grow linearly, whereas we expect it to
> > grow slower and slower as we've connected all ceph OSDs in the pool
> > - why interspersing small reads helps: if one does a large read follows
> by a
> > small read over the same connection, the large read buffer will be
> cleared
> > before the small read buffer is allocated
> >
> > The followingg is enough to keep the memory usage under control:
> > <diff>
> > --- a.c     2018-09-14 12:56:13.770350595 +0100
> > +++ b.c        2018-09-14 12:55:32.921856548 +0100
> > @@ -34,7 +34,6 @@
> >      for (i = 0; i < 20000; i++) {
> >          sprintf(smallobj_name, "1mobj_%d", i);
> >          readobj(&io, smallobj_name, SIZE);
> > +        readobj(&io, smallobj_name, 1);
> >
> >          total_bytes_read += SIZE;
> >          printf("Read %s for total %lld\n", smallobj_name,
> > total_bytes_read);
> > </diff>
> >
> > Reading just 1 byte of the same object after having read the object will
> > make the AsyncConnection clear the buffer allocated for processing the
> large
> > read result, and allocate a tiny buffer for the 1 byte read result.
> >
> >
> > TLDR: It seems that the buffers allocated while processing a message in
> > AsyncConnection are not cleared when the message has been processed, but
> > only when a subsequent message is received, which yields to a significant
> > memory usage if there are many connections in use (objects stored on many
> > different OSDs).
>
> Nice sleuthing! This does look correct to me. I've created a tracker
> issue (http://tracker.ceph.com/issues/35987) and pull request
> (https://github.com/ceph/ceph/pull/24111).
>
> If you wanted to try testing that branch (or cherry-picking the patch,
> if you're building it yourself) with your librados client to see if
> that resolves the issue, it'd be helpful! The packages should be
> available shortly from the dev branch repository (see
> http://docs.ceph.com/docs/master/install/get-packages/#add-ceph-development
> ).
> -Greg
>
> >
> >
> > _______________________________________________
> > ceph-users mailing list
> > [email protected]
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Reply via email to