Thank you! I will try what you suggested.
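For the archives, this is roughly what I plan to run once I can reach the client admin sockets (the socket path below is just a placeholder for my setup):

  # in-flight requests from the client's Objecter
  ceph --admin-daemon /var/run/ceph/client.admin.asok objecter_requests

  # perfcounters, including the objecter throttle values
  ceph --admin-daemon /var/run/ceph/client.admin.asok perf dump

  # and watching for slow request warnings from the OSDs
  ceph -w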

Here is the full backtrace:
#0  0x00007f816529b360 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007f80c716bb82 in Cond::Wait (this=this@entry=0x7f7f977dce20, mutex=...) at ./common/Cond.h:56
#2  0x00007f80c717656f in librados::IoCtxImpl::operate_read (this=this@entry=0x7f7ed40b4190, oid=..., o=o@entry=0x7f7f977dd050, pbl=pbl@entry=0x0, flags=flags@entry=0)
    at librados/IoCtxImpl.cc:725
#3  0x00007f80c7180032 in librados::IoCtxImpl::stat (this=0x7f7ed40b4190, oid=..., psize=psize@entry=0x7f7f977dd198, pmtime=pmtime@entry=0x7f7f977dd1a0) at librados/IoCtxImpl.cc:1238
#4  0x00007f80c7142606 in librados::IoCtx::stat (this=0x7f7f977dd290, oid=..., psize=0x7f7f977dd198, pmtime=0x7f7f977dd1a0) at librados/librados.cc:1260

My custom TensorFlow kernels

#5  0x00007f80d0e8fd23 in tensorflow::CephReaderOp::CephReadObject (this=0x7f7e9c0607d0, file_key="ERR174324_1_05_20600000", name_space="ERR174324_1_05", ref_buffer=0x7f7f18000cd0,
    io_ctx=...) at tensorflow/contrib/ceph_src/kernels/ceph_reader_op.cc:104
#6  0x00007f80d0e8f854 in tensorflow::CephReaderOp::Compute (this=0x7f7e9c0607d0, ctx=0x7f7f977dd8f0) at tensorflow/contrib/ceph_src/kernels/ceph_reader_op.cc:64

TensorFlow runtime

#7  0x00007f80d873d5c1 in tensorflow::ThreadPoolDevice::Compute (this=0x7f812077a270, op_kernel=0x7f7e9c0607d0, context=0x7f7f977dd8f0)
    at tensorflow/core/common_runtime/threadpool_device.cc:60
#8  0x00007f80d86c80e4 in tensorflow::(anonymous namespace)::ExecutorState::Process (this=0x7f78d80014b0, tagged_node=..., scheduled_usec=0)
    at tensorflow/core/common_runtime/executor.cc:1658
#9  0x00007f80d86ca634 in tensorflow::(anonymous namespace)::ExecutorState::<lambda()>::operator()(void) const (__closure=0x7f7ed0000ee0) at tensorflow/core/common_runtime/executor.cc:2094
#10 0x00007f80d86d1882 in std::_Function_handler<void(), tensorflow::(anonymous namespace)::ExecutorState::ScheduleReady(const TaggedNodeSeq&, tensorflow::(anonymous namespace)::ExecutorState::TaggedNodeReadyQueue*)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/include/c++/5/functional:1871
#11 0x00007f80d7f449f6 in std::function<void ()>::operator()() const (this=0x7f7ed0030ef0) at /usr/include/c++/5/functional:2267

TensorFlow threadpool

#12 0x00007f80d8188c86 in tensorflow::thread::EigenEnvironment::ExecuteTask (this=0x7f81209e1778, t=...) at tensorflow/core/lib/core/threadpool.cc:83
#13 0x00007f80d818b6f4 in Eigen::NonBlockingThreadPoolTempl<tensorflow::thread::EigenEnvironment>::WorkerLoop (this=0x7f81209e1770, thread_id=46)
    at external/eigen_archive/unsupported/Eigen/CXX11/src/ThreadPool/NonBlockingThreadPool.h:232
#14 0x00007f80d8189b46 in Eigen::NonBlockingThreadPoolTempl<tensorflow::thread::EigenEnvironment>::NonBlockingThreadPoolTempl(int, bool, tensorflow::thread::EigenEnvironment)::{lambda()#1}::operator()() const () at external/eigen_archive/unsupported/Eigen/CXX11/src/ThreadPool/NonBlockingThreadPool.h:65
#15 0x00007f80d818cd14 in std::_Function_handler<void (), Eigen::NonBlockingThreadPoolTempl<tensorflow::thread::EigenEnvironment>::NonBlockingThreadPoolTempl(int, bool, tensorflow::thread::EigenEnvironment)::{lambda()#1}>::_M_invoke(std::_Any_data const&) (__functor=...) at /usr/include/c++/5/functional:1871
#16 0x00007f80d7f449f6 in std::function<void ()>::operator()() const (this=0x7f8120aac960) at /usr/include/c++/5/functional:2267
#17 0x00007f80d81889d3 in tensorflow::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}::operator()() const (__closure=0x7f8120aac960)
    at tensorflow/core/lib/core/threadpool.cc:56
#18 0x00007f80d818a970 in std::_Function_handler<void (), tensorflow::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) (
    __functor=...) at /usr/include/c++/5/functional:1871
#19 0x00007f80d7f449f6 in std::function<void ()>::operator()() const (this=0x7f8120aac9b8) at /usr/include/c++/5/functional:2267
#20 0x00007f80d81d1446 in std::_Bind_simple<std::function<void ()> ()>::_M_invoke<>(std::_Index_tuple<>) (this=0x7f8120aac9b8) at /usr/include/c++/5/functional:1531
#21 0x00007f80d81d13af in std::_Bind_simple<std::function<void ()> ()>::operator()() (this=0x7f8120aac9b8) at /usr/include/c++/5/functional:1520
#22 0x00007f80d81d134e in std::thread::_Impl<std::_Bind_simple<std::function<void ()> ()> >::_M_run() (this=0x7f8120aac9a0) at /usr/include/c++/5/thread:115
#23 0x00007f80d6c66c80 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#24 0x00007f81652956ba in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#25 0x00007f8164fcb41d in clone () from /lib/x86_64-linux-gnu/libc.so.6
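As far as I understand, plain librados client processes like mine don't create an admin socket by default, so I'll first enable one in ceph.conf, along these lines (the path is just an assumption for my setup):

  [client]
      admin socket = /var/run/ceph/$cluster-$name.$pid.asok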

On Tue, May 1, 2018 at 7:47 PM Gregory Farnum <[email protected]> wrote:

> Can you provide the full backtrace? It kinda looks like you've left
> something out.
>
> In general though, a Wait inside of an operate call just means the thread
> has submitted its request and is waiting for the answer to come back. It
> could be blocked locally or remotely. If it's blocked remotely, the OSDs
> should be reporting to the mon/mgr that they have slow requests, which you
> can observe in "ceph -w" or whatever. If it's local, hrmm, not sure the
> easiest way to debug without just cranking up logging.
> Generically, I'd use the admin socket on your clients to look at the
> status of in-flight requests, and to check the values of the throttle
> limits in the perfcounters. If the requests are being handled slowly on the
> OSD, do the same there. That will probably give you some clues.
> -Greg
>
> On Tue, May 1, 2018 at 5:20 AM Sam Whitlock <[email protected]> wrote:
>
>> I am using librados in an application to read and write many small files
>> (<128MB) concurrently, both in the same process and in different processes
>> (across many nodes). The application is built on TensorFlow (the read and
>> write operations are custom kernels I wrote).
>>
>> I'm having an issue with this application where, after a few minutes, all
>> of my processes stop reading and writing to RADOS. In the debugger I can
>> see that they're all waiting, with some variation of the following stack
>> trace (edited for brevity), for various stat/read/write/write_full
>> operations:
>>
>> #0 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
>> #1 in Cond::Wait (this=this@entry=0x7f7f977dce20, mutex=...) at ./common/Cond.h:56
>> #2 in librados::IoCtxImpl::operate_read (this=this@entry=0x7f7ed40b4190, oid=..., o=o@entry=0x7f7f977dd050, pbl=pbl@entry=0x0, flags=flags@entry=0)
>>     at librados/IoCtxImpl.cc:725
>> #3 in librados::IoCtxImpl::stat (this=0x7f7ed40b4190, oid=..., psize=psize@entry=0x7f7f977dd198, pmtime=pmtime@entry=0x7f7f977dd1a0) at librados/IoCtxImpl.cc:1238
>> #4 in librados::IoCtx::stat (this=0x7f7f977dd290, oid=..., psize=0x7f7f977dd198, pmtime=0x7f7f977dd1a0) at librados/librados.cc:1260
>>
>> The application then proceeds to complete requests at a glacial pace
>> (~3-5 per hour) indefinitely.
>>
>> When I run the application with a very low level of concurrency, it works
>> properly and this "lock-up" doesn't happen.
>>
>> All reads and writes are to a single pool from the same user. No files
>> are concurrently modified by different requests (i.e. completely
>> independent / embarrassingly parallel architecture in my app).
>>
>> How might I go about troubleshooting this? I'm not sure which logs to
>> look at and what I might be looking for (if it is even logged).
>>
>> I'm running Ceph 12.2.2, with all machines running Ubuntu 16.04.
>>
>> --
>> Sam Whitlock
>>
>
