Thanks Greg! Will test and report back on Monday. Cheers,
Charles On Fri, 14 Sep 2018, 20:32 Gregory Farnum, <[email protected]> wrote: > [Adding ceph-devel] > > On Fri, Sep 14, 2018 at 5:22 AM, Charles-François Natali > <[email protected]> wrote: > > See > > > http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/029780.html > > for the original thread. > > > > Here is a trivial reproducer not using any aio or dynamically allocated > > memory to store the objects read. > > It simply reads 20,000 1MB large objects sequentially: when run, instead > of > > using a roughly fixed amount of memory, the RSS increases up to 400 MB. > > > > <code> > > #include <assert.h> > > #include <stdio.h> > > #include <stdlib.h> > > #include <string.h> > > #include <rados/librados.h> > > > > #define SIZE 1000000 > > > > void readobj(rados_ioctx_t* io, char objname[], unsigned long size) { > > static char data[SIZE]; > > unsigned long bytes_read; > > int retval; > > > > rados_read_op_t read_op = rados_create_read_op(); > > rados_read_op_read(read_op, 0, size, data, &bytes_read, &retval); > > retval = rados_read_op_operate(read_op, *io, objname, 0); > > assert(retval == 0); > > rados_release_read_op(read_op); > > } > > > > int main() { > > rados_t cluster; > > rados_create(&cluster, "test"); > > rados_conf_read_file(cluster, "/etc/ceph/ceph.conf"); > > rados_connect(cluster); > > > > rados_ioctx_t io; > > rados_ioctx_create(cluster, "test", &io); > > > > char smallobj_name[16]; > > int i; > > long long total_bytes_read = 0; > > > > for (i = 0; i < 20000; i++) { > > sprintf(smallobj_name, "1mobj_%d", i); > > readobj(&io, smallobj_name, SIZE); > > readobj(&io, smallobj_name, 1); > > > > total_bytes_read += SIZE; > > printf("Read %s for total %lld\n", smallobj_name, > total_bytes_read); > > } > > > > return 0; > > } > > </code> > > > > > > Running it under valgrind with --show-reachable=yes shows where the > problem > > is: > > > > ==10253== 423,755,776 bytes in 424 blocks are still reachable in loss > record > > 882 of 882 > > ==10253== at 0x4C2FEA6: memalign (in > > /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so) > > ==10253== by 0x4C2FFB1: posix_memalign (in > > /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so) > > ==10253== by 0x57A3C06: > ceph::buffer::create_aligned_in_mempool(unsigned > > int, unsigned int, int) (in /usr/lib/ceph/libceph-common.so.0) > > ==10253== by 0x5984B49: AsyncConnection::process() (in > > /usr/lib/ceph/libceph-common.so.0) > > ==10253== by 0x5994C97: EventCenter::process_events(int, > > std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*) (in > > /usr/lib/ceph/libceph-common.so.0) > > ==10253== by 0x5999817: ??? (in /usr/lib/ceph/libceph-common.so.0) > > ==10253== by 0xE960C7F: ??? (in > > /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21) > > ==10253== by 0xE6926B9: start_thread (pthread_create.c:333) > > ==10253== by 0x52A041C: clone (clone.S:109) > > > > We can see that the memory is allocated in AsyncConnection::process, > > precisely here: > > > https://github.com/ceph/ceph/blob/master/src/msg/async/AsyncConnection.cc#L614 > > > > From my understanding - I have never read ceph code before, so my > > understanding might be flawed - this code is allocating a buffer to store > > the data contained in the message received from a ceph node and adding > it to > > a buffers list `data_buf`. > > Note that this list of buffers is per AsyncConnection, that's important. > > > > Unfortunately, looking through the file shows that `data_buf` only gets > > cleared here: > > > https://github.com/ceph/ceph/blob/master/src/msg/async/AsyncConnection.cc#L477 > > when it transitions to the STATE_OPEN_MESSAGE_HEADER state, IIUC only > occurs > > when it starts receiving a new CEPH_MSGR_TAG_MSG message. > > > > Which means that basically the following happens - the below is for a > single > > AsyncConnection: > > > > repeat: > > 0. wait for message > > 1. parse message header > > 2. realize it's an CEPH_MSGR_TAG_MSG message (not a heartbeat, etc): > > transition to STATE_OPEN_MESSAGE_HEADER > > 3. clear all buffers > > ( > https://github.com/ceph/ceph/blob/master/src/msg/async/AsyncConnection.cc#L477 > ) > > 4. start parsing message, allocate memory buffer > > ( > https://github.com/ceph/ceph/blob/master/src/msg/async/AsyncConnection.cc#L614 > ) > > ... > > n. done processing message > > > > Which means that the memory allocated when processing a message - at > step 4 > > - is only released when a subsequent CEPH_MSGR_TAG_MSG message is > received. > > For a single connection that's not too bad - memory is just freed later > than > > it could, but with a large OSD pool, it adds up. > > Our cluster is made of over 40 ceph OSDs, which results in around 400 > > connections. Which means that if reading for example 1MB objects, this > > delayed freeing can use up to 400MB (1MB per connection), which is > exactly > > what we're seeing. > > > > It would also explains: > > - why reading a small subset of objects over and over again doesn't > result > > in a memory increase: because they are distributed over few ceph OSDs, > hence > > use few connections > > - the shape of the memory consumption over time: if it were a leak (even > > with reachable objects), it would grow linearly, whereas we expect it to > > grow slower and slower as we've connected all ceph OSDs in the pool > > - why interspersing small reads helps: if one does a large read follows > by a > > small read over the same connection, the large read buffer will be > cleared > > before the small read buffer is allocated > > > > The followingg is enough to keep the memory usage under control: > > <diff> > > --- a.c 2018-09-14 12:56:13.770350595 +0100 > > +++ b.c 2018-09-14 12:55:32.921856548 +0100 > > @@ -34,7 +34,6 @@ > > for (i = 0; i < 20000; i++) { > > sprintf(smallobj_name, "1mobj_%d", i); > > readobj(&io, smallobj_name, SIZE); > > + readobj(&io, smallobj_name, 1); > > > > total_bytes_read += SIZE; > > printf("Read %s for total %lld\n", smallobj_name, > > total_bytes_read); > > </diff> > > > > Reading just 1 byte of the same object after having read the object will > > make the AsyncConnection clear the buffer allocated for processing the > large > > read result, and allocate a tiny buffer for the 1 byte read result. > > > > > > TLDR: It seems that the buffers allocated while processing a message in > > AsyncConnection are not cleared when the message has been processed, but > > only when a subsequent message is received, which yields to a significant > > memory usage if there are many connections in use (objects stored on many > > different OSDs). > > Nice sleuthing! This does look correct to me. I've created a tracker > issue (http://tracker.ceph.com/issues/35987) and pull request > (https://github.com/ceph/ceph/pull/24111). > > If you wanted to try testing that branch (or cherry-picking the patch, > if you're building it yourself) with your librados client to see if > that resolves the issue, it'd be helpful! The packages should be > available shortly from the dev branch repository (see > http://docs.ceph.com/docs/master/install/get-packages/#add-ceph-development > ). > -Greg > > > > > > > _______________________________________________ > > ceph-users mailing list > > [email protected] > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > >
_______________________________________________ ceph-users mailing list [email protected] http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
