Thanks Jake, can you confirm which Ceph version you were testing when you
noticed the out-of-memory errors? There is already a memory leak issue
reported against Kraken v11.2.0, which is addressed in this tracker:
http://tracker.ceph.com/issues/18924
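
If you want to check whether the OSD heap itself keeps growing before the
OOM killer fires, something like the commands below should help (a rough
sketch; "heap stats" needs a tcmalloc build, and osd.1 is just an example
id):

# tcmalloc heap usage for a running OSD
ceph tell osd.1 heap stats

# per-pool memory breakdown via the admin socket, run on the OSD's host
ceph daemon osd.1 dump_mempools

If those numbers keep climbing while the workload stays steady, that would
point at the same kind of leak as the Kraken tracker above.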
# ceph -v

OK, so you are mounting/mapping Ceph as an RBD and writing into it. We are
discussing the Luminous v12.0.3 issue here, so I think we are all on the
same page.

Thanks,
Jayaram

On Thu, Jun 8, 2017 at 8:13 PM, Jake Grimmett <j...@mrc-lmb.cam.ac.uk> wrote:
> Hi Mark / Jayaram,
>
> After running the cluster last night, I noticed lots of "Out Of Memory"
> errors in /var/log/messages; many of these correlate to dead OSDs. If this
> is the problem, this might now be another case of the high memory use
> issues reported in Kraken.
>
> e.g. my script logs:
> Thu 8 Jun 08:26:37 BST 2017 restart OSD 1
>
> and /var/log/messages states...
>
> Jun 8 08:26:35 ceph1 kernel: Out of memory: Kill process 7899 (ceph-osd)
> score 113 or sacrifice child
> Jun 8 08:26:35 ceph1 kernel: Killed process 7899 (ceph-osd)
> total-vm:8569516kB, anon-rss:7518836kB, file-rss:0kB, shmem-rss:0kB
> Jun 8 08:26:36 ceph1 systemd: ceph-osd@1.service: main process exited,
> code=killed, status=9/KILL
> Jun 8 08:26:36 ceph1 systemd: Unit ceph-osd@1.service entered failed state.
>
> The OSD nodes have 64GB RAM, presumably enough RAM for 10 OSDs doing
> 4+1 EC?
>
> I've added "bluestore_cache_size = 104857600" to ceph.conf, and am
> retesting. I will see if OSD problems still occur, and report back.
>
> As to loading the cluster, I run an rsync job on each node, pulling data
> from an NFS-mounted Isilon. A single node pulls ~200MB/s; with all 7 nodes
> running, ceph -w reports between 700 and 1500MB/s of writes.
>
> As requested, here is my "restart_OSD_and_log-this.sh" script:
>
> ************************************************************************
> #!/bin/bash
> # catches single failed OSDs, logs and restarts them
> while : ; do
>     # numeric id of any OSD currently marked "down" in the osd tree
>     OSD=`ceph osd tree 2> /dev/null | grep down | \
>          awk '{ print $3}' | awk -F "." '{print $2 }'`
>     if [ "$OSD" != "" ] ; then
>         DATE=`date`
>         echo $DATE " restart OSD " $OSD >> /root/osd_restart_log
>         echo "OSD" $OSD "is down, restarting.."
>         # find which host the OSD lives on, then restart it over ssh
>         OSDHOST=`ceph osd find $OSD | grep host | awk -F '"' '{print $4}'`
>         ssh $OSDHOST systemctl restart ceph-osd@$OSD
>         sleep 30
>     else
>         echo -ne "\r\033[K"
>         echo -ne "all OSD OK"
>     fi
>     sleep 1
> done
> ************************************************************************
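>
> A slightly more robust way to pull the id(s) of down OSDs than the
> grep/awk above (untested sketch, assuming jq is installed) would be to
> parse the JSON form of the osd tree instead:
>
> ceph osd tree -f json 2> /dev/null | \
>     jq -r '.nodes[] | select(.type == "osd" and .status == "down") | .id'
>
> That prints the bare numeric ids, one per line, so it would also cope
> with several OSDs being down at the same time.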
>
> thanks again,
>
> Jake
>
> On 08/06/17 12:08, nokia ceph wrote:
> > Hello Mark,
> >
> > Raised tracker for the issue -- http://tracker.ceph.com/issues/20222
> >
> > Jake, can you share the restart_OSD_and_log-this.sh script?
> >
> > Thanks
> > Jayaram
> >
> > On Wed, Jun 7, 2017 at 9:40 PM, Jake Grimmett <j...@mrc-lmb.cam.ac.uk> wrote:
> >
> > Hi Mark & List,
> >
> > Unfortunately, even when using yesterday's master version of Ceph,
> > I'm still seeing OSDs go down, with the same error as before.
> >
> > The OSD log shows lots of entries like this:
> >
> > (osd38)
> > 2017-06-07 16:48:46.070564 7f90b58c3700  1 heartbeat_map is_healthy
> > 'tp_osd_tp thread tp_osd_tp' had timed out after 60
> >
> > (osd3)
> > 2017-06-07 17:01:25.391075 7f62de6c3700  1 heartbeat_map is_healthy
> > 'tp_osd_tp thread tp_osd_tp' had timed out after 60
> > 2017-06-07 17:01:26.276881 7f62dbe86700 -1 osd.3 6165 heartbeat_check:
> > no reply from 10.1.0.86:6811 osd.2 since
> > back 2017-06-07 17:00:19.640002
> > front 2017-06-07 17:01:21.950160 (cutoff 2017-06-07 17:01:06.276881)
> >
> > [root@ceph4 ceph]# ceph -v
> > ceph version 12.0.2-2399-ge38ca14
> > (e38ca14914340d65ea8001c7bd6e0ff769f3eb2e) luminous (dev)
> >
> > I'll continue running the cluster with my "restart_OSD_and_log-this.sh"
> > workaround...
> >
> > thanks again for your help,
> >
> > Jake
> >
> > On 06/06/17 15:52, Jake Grimmett wrote:
> > > Hi Mark,
> > >
> > > OK, I'll upgrade to the current master and retest...
> > >
> > > best,
> > >
> > > Jake
> > >
> > > On 06/06/17 15:46, Mark Nelson wrote:
> > >> Hi Jake,
> > >>
> > >> I just happened to notice this was on 12.0.3. Would it be possible to
> > >> test this out with current master and see if it still is a problem?
> > >>
> > >> Mark
> > >>
> > >> On 06/06/2017 09:10 AM, Mark Nelson wrote:
> > >>> Hi Jake,
> > >>>
> > >>> Thanks much. I'm guessing at this point this is probably a bug. Would
> > >>> you (or nokiauser) mind creating a bug in the tracker with a short
> > >>> description of what's going on and the collectl sample showing this is
> > >>> not IOs backing up on the disk?
> > >>>
> > >>> If you want to try it, we have a gdb-based wallclock profiler that
> > >>> might be interesting to run while it's in the process of timing out.
> > >>> It tries to grab 2000 samples from the osd process, which typically
> > >>> takes about 10 minutes or so. You'll need to either change the number
> > >>> of samples to be lower in the python code (maybe like 50-100), or
> > >>> change the timeout to be something longer.
> > >>>
> > >>> You can find the code here:
> > >>>
> > >>> https://github.com/markhpc/gdbprof
> > >>>
> > >>> and invoke it like:
> > >>>
> > >>> sudo gdb -ex 'set pagination off' -ex 'attach 27962' -ex 'source
> > >>> ./gdbprof.py' -ex 'profile begin' -ex 'quit'
> > >>>
> > >>> where 27962 in this case is the PID of the ceph-osd process. You'll
> > >>> need gdb with the python bindings and the ceph debug symbols for it to
> > >>> work.
> > >>>
> > >>> This might tell us over time if the tp_osd_tp processes are just
> > >>> sitting on pg::locks.
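> > >>>
> > >>> If you need the PID of a particular OSD to attach to, on a
> > >>> systemd-managed node something like this should work (just a sketch,
> > >>> substitute the right OSD id):
> > >>>
> > >>> systemctl show -p MainPID ceph-osd@38
> > >>>
> > >>> or "pidof ceph-osd" if there is only one OSD on the box.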
> > >>>
> > >>> Mark
> > >>>
> > >>> On 06/06/2017 05:34 AM, Jake Grimmett wrote:
> > >>>> Hi Mark,
> > >>>>
> > >>>> Thanks again for looking into this problem.
> > >>>>
> > >>>> I ran the cluster overnight, with a script checking for dead OSDs
> > >>>> every second, and restarting them.
> > >>>>
> > >>>> 40 OSD failures occurred in 12 hours; some OSDs failed multiple times
> > >>>> (there are 50 OSDs in the EC tier).
> > >>>>
> > >>>> Unfortunately, the output of collectl doesn't appear to show any
> > >>>> increase in disk queue depth and service times before the OSDs die.
> > >>>>
> > >>>> I've put a couple of examples of collectl output for the disks
> > >>>> associated with the OSDs here:
> > >>>>
> > >>>> https://hastebin.com/icuvotemot.scala
> > >>>>
> > >>>> please let me know if you need more info...
> > >>>>
> > >>>> best regards,
> > >>>>
> > >>>> Jake
> >
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
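
On the collectl side: if you want per-disk queue depth and service times
with timestamps while the rsync load is running, something along these
lines should do it (a sketch from memory, adjust the interval to taste):

collectl -sD -oT -i 1

-sD prints the detailed per-disk stats (queue length, wait and service
times), -oT prefixes every sample with the time of day so it can be lined
up against the OSD logs, and -i 1 samples once per second.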
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com