I am using librados in an application to read and write many small files
(<128MB) concurrently, both in the same process and in different processes
(across many nodes). The application is built on TensorFlow (the read and
write operations are custom kernels I wrote).

I'm having an issue with this application where, after a few minutes, all
of my processes stop reading and writing to RADOS. In the debugger I can
see that they're all waiting, with some variation of the following stack
trace (edited for brevity), for various stat/read/write/write_full
operations:

#0 in pthread_cond_wait@@GLIBC_2.3.2 () from
/lib/x86_64-linux-gnu/libpthread.so.0
#1 in Cond::Wait (this=this@entry=0x7f7f977dce20, mutex=...) at
./common/Cond.h:56
#2 in librados::IoCtxImpl::operate_read (this=this@entry=0x7f7ed40b4190,
oid=..., o=o@entry=0x7f7f977dd050, pbl=pbl@entry=0x0, flags=flags@entry=0)
    at librados/IoCtxImpl.cc:725
#3 in librados::IoCtxImpl::stat (this=0x7f7ed40b4190, oid=...,
psize=psize@entry=0x7f7f977dd198, pmtime=pmtime@entry=0x7f7f977dd1a0) at
librados/IoCtxImpl.cc:1238
#4 in librados::IoCtx::stat (this=0x7f7f977dd290, oid=...,
psize=0x7f7f977dd198, pmtime=0x7f7f977dd1a0) at librados/librados.cc:1260

The application then proceeds to complete requests at a glacial pace (~3-5
per hour) indefinitely.

When I run the application with a very low level of concurrency, it works
properly and this lock-up doesn't happen.

All reads and writes go to a single pool as the same user. No objects are
modified concurrently by different requests (i.e. the workload in my app is
completely independent / embarrassingly parallel).

How might I go about troubleshooting this? I'm not sure which logs to look
at and what I might be looking for (if it is even logged).
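For what it's worth, one thing I've been considering (a sketch under assumptions; the socket path and PID are illustrative, not from my actual deployment) is enabling the client admin socket and client-side debug logging, then dumping the in-flight Objecter requests while the processes are hung, to see which OSDs the requests are stuck on:

```shell
# Illustrative [client] section for ceph.conf on the client nodes
# (paths are assumptions -- adjust to your deployment):
#
# [client]
#     admin socket = /var/run/ceph/$cluster-$type.$pid.asok
#     log file = /var/log/ceph/$cluster-$type.$pid.log
#     debug rados = 20
#     debug objecter = 20
#     debug ms = 1

# While the application is hung, query the client's admin socket to
# list outstanding Objecter requests and the OSDs they target
# (the .asok filename below is hypothetical):
ceph --admin-daemon /var/run/ceph/ceph-client.12345.asok objecter_requests
```

Is this the right direction, or is there somewhere else I should be looking?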

I'm running Ceph 12.2.2, all machines running Ubuntu 16.04.

--
Sam Whitlock
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
