Yep, IIRC those completions hold bufferlist references, so they're
definitely keeping the memory buffers alive until they're released!
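For reference, here's roughly what the readobj() from the example below
looks like with the cleanup calls added. This is an untested sketch (it
needs a live cluster and a pool named "test" to actually run), but
rados_release_read_op() and rados_aio_release() at the end are the
relevant calls:

```c
#include <assert.h>
#include <rados/librados.h>

/* Same as the quoted readobj(), plus cleanup at the end:
 * rados_release_read_op() frees the read op, and rados_aio_release()
 * drops the completion's reference so its internal buffers can be freed. */
void readobj(rados_ioctx_t *io, char objname[]) {
    char data[1000000];
    unsigned long bytes_read;
    rados_completion_t completion;
    int retval;

    rados_read_op_t read_op = rados_create_read_op();
    rados_read_op_read(read_op, 0, 10000, data, &bytes_read, &retval);
    retval = rados_aio_create_completion(NULL, NULL, NULL, &completion);
    assert(retval == 0);

    retval = rados_aio_read_op_operate(read_op, *io, completion, objname, 0);
    assert(retval == 0);

    rados_aio_wait_for_complete(completion);
    rados_aio_get_return_value(completion);

    /* release the op and the completion; without these, each call
     * leaks the completion and whatever buffers it pins */
    rados_release_read_op(read_op);
    rados_aio_release(completion);
}
```

With those two calls in place per read, RSS should stay flat instead of
growing with the amount of data read.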
On Wed, Sep 12, 2018 at 7:04 AM Casey Bodley <[email protected]> wrote:
>
>
> On 09/12/2018 05:29 AM, Daniel Goldbach wrote:
> > Hi all,
> >
> > We're reading from a Ceph Luminous pool using the librados asynchronous
> > I/O API. We're seeing some concerning memory usage patterns when we
> > read many objects in sequence.
> >
> > The expected behaviour is that our memory usage stabilises at a small
> > amount, since we're just fetching objects and ignoring their data.
> > What we instead find is that the memory usage of our program grows
> > linearly with the amount of data read for an interval of time, and
> > then continues to grow at a much slower but still consistent pace.
> > This memory is not freed until program termination. My guess is that
> > this is an issue with Ceph's memory allocator.
> >
> > To demonstrate, we create 20000 objects each of size 10KB, 100KB,
> > and 1MB:
> >
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <string.h>
> > #include <rados/librados.h>
> >
> > int main() {
> >     rados_t cluster;
> >     rados_create(&cluster, "test");
> >     rados_conf_read_file(cluster, "/etc/ceph/ceph.conf");
> >     rados_connect(cluster);
> >
> >     rados_ioctx_t io;
> >     rados_ioctx_create(cluster, "test", &io);
> >
> >     char data[1000000];
> >     memset(data, 'a', 1000000);
> >
> >     char smallobj_name[16], mediumobj_name[16], largeobj_name[16];
> >     int i;
> >     for (i = 0; i < 20000; i++) {
> >         sprintf(smallobj_name, "10kobj_%d", i);
> >         rados_write(io, smallobj_name, data, 10000, 0);
> >
> >         sprintf(mediumobj_name, "100kobj_%d", i);
> >         rados_write(io, mediumobj_name, data, 100000, 0);
> >
> >         sprintf(largeobj_name, "1mobj_%d", i);
> >         rados_write(io, largeobj_name, data, 1000000, 0);
> >
> >         printf("wrote %s of size 10000, %s of size 100000, %s of size 1000000\n",
> >                smallobj_name, mediumobj_name, largeobj_name);
> >     }
> >
> >     return 0;
> > }
> >
> > $ gcc create.c -lrados -o create
> > $ ./create
> > wrote 10kobj_0 of size 10000, 100kobj_0 of size 100000, 1mobj_0 of
> > size 1000000
> > wrote 10kobj_1 of size 10000, 100kobj_1 of size 100000, 1mobj_1 of
> > size 1000000
> > [...]
> > wrote 10kobj_19998 of size 10000, 100kobj_19998 of size 100000,
> > 1mobj_19998 of size 1000000
> > wrote 10kobj_19999 of size 10000, 100kobj_19999 of size 100000,
> > 1mobj_19999 of size 1000000
> >
> > Now we read each of these objects with the async API, into the same
> > buffer. First, we read just the 10KB objects:
> >
> > #include <assert.h>
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <string.h>
> > #include <rados/librados.h>
> >
> > void readobj(rados_ioctx_t* io, char objname[]);
> >
> > int main() {
> >     rados_t cluster;
> >     rados_create(&cluster, "test");
> >     rados_conf_read_file(cluster, "/etc/ceph/ceph.conf");
> >     rados_connect(cluster);
> >
> >     rados_ioctx_t io;
> >     rados_ioctx_create(cluster, "test", &io);
> >
> >     char smallobj_name[16];
> >     int i, total_bytes_read = 0;
> >
> >     for (i = 0; i < 20000; i++) {
> >         sprintf(smallobj_name, "10kobj_%d", i);
> >         readobj(&io, smallobj_name);
> >
> >         total_bytes_read += 10000;
> >         printf("Read %s for total %d\n", smallobj_name, total_bytes_read);
> >     }
> >
> >     getchar();
> >     return 0;
> > }
> >
> > void readobj(rados_ioctx_t* io, char objname[]) {
> >     char data[1000000];
> >     unsigned long bytes_read;
> >     rados_completion_t completion;
> >     int retval;
> >
> >     rados_read_op_t read_op = rados_create_read_op();
> >     rados_read_op_read(read_op, 0, 10000, data, &bytes_read, &retval);
> >     retval = rados_aio_create_completion(NULL, NULL, NULL, &completion);
> >     assert(retval == 0);
> >
> >     retval = rados_aio_read_op_operate(read_op, *io, completion,
> >                                        objname, 0);
> >     assert(retval == 0);
> >
> >     rados_aio_wait_for_complete(completion);
> >     rados_aio_get_return_value(completion);
> > }
> >
> > $ gcc read.c -lrados -o read_small -Wall -g && ./read_small
> > Read 10kobj_0 for total 10000
> > Read 10kobj_1 for total 20000
> > [...]
> > Read 10kobj_19998 for total 199990000
> > Read 10kobj_19999 for total 200000000
> >
> > We read 200MB in total. A graph of the program's resident set size is
> > attached as mem-graph-10k.png, with seconds on the x axis and KB on
> > the y axis. You can see that memory usage increases throughout, which
> > is itself unexpected: that memory should be freed over time, and we
> > should only hold 10KB of object data in memory at any one time. The
> > rate of growth decreases and eventually stabilises, and by the end
> > we've used 60MB of RAM.
> >
> > We repeat this experiment for the 100KB and 1MB objects and find that
> > after all reads they use 140MB and 500MB of RAM respectively; memory
> > usage presumably would continue to grow if there were more objects.
> > This is orders of magnitude more memory than I would expect these
> > programs to use.
> >
> > * We do not get this behaviour with the synchronous API, and the
> > memory usage remains stable at just a few MB.
> > * We've found that for some reason, this doesn't happen (or doesn't
> > happen as severely) if we intersperse large reads with much
> > smaller reads. In this case, the memory usage seems to stabilise
> > at a reasonable number.
> > * Valgrind only reports a trivial amount of unreachable memory.
> > * Memory usage doesn't increase in this manner if we repeatedly read
> > the same object over and over again. It hovers around 20MB.
> > * In other experiments we've done, with different object data and
> > distributions of object sizes, we've seen memory usage grow even
> > larger in proportion to the amount of data read.
> >
> > We maintain long-running (order of weeks) services that read objects
> > from Ceph and send them elsewhere. Over time, the memory usage of some
> > of these services has grown to more than 6GB, which is unreasonable.
> >
> > --
> > Regards,
> > Dan G
> >
> >
> > _______________________________________________
> > ceph-users mailing list
> > [email protected]
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> It looks like the async example is missing calls to rados_aio_release()
> to clean up the completions. I'm not sure that would account for all of
> the memory growth, but that's where I would start. Past that, running
> the client under valgrind massif should help with further investigation.
>
> Casey