Re: osd crash when deep-scrubbing

2015-10-19 Thread changtao381
Jiaying Ren  gmail.com> writes:

> 
> Hi, cephers:
> 
> I've encountered a problem that a pg stuck in inconsistent status:
> 
> $ ceph -s
> cluster 27d39faa-48ae-4356-a8e3-19d5b81e179e
>  health HEALTH_ERR 1 pgs inconsistent; 34 near full osd(s); 1
> scrub errors; noout flag(s) set
>  monmap e4: 3 mons at
> {server-61.0..x.in=10.8.0.61:6789/0,server-62.0..x.in=10.8.0.62:6789/0,server-63.0..x.in=10.8.0.63:6789/0},
> election epoch 6706, quorum 0,1,2
> server-61.0..x.in,server-62.0..x.in,server-63.0..x.in
>  osdmap e87808: 180 osds: 180 up, 180 in
> flags noout
>   pgmap v29322850: 35026 pgs, 15 pools, 27768 GB data, 1905 kobjects
> 83575 GB used, 114 TB / 196 TB avail
>35025 active+clean
>1 active+clean+inconsistent
>   client io 120 kB/s rd, 216 MB/s wr, 6398 op/s
> 
> The `pg repair` command doesn't work, so I manually repaired an inconsistent
> object (the pool size is 3; I removed the copy that differed from the other
> two). After that the pg is still in inconsistent status:
> 
> $ ceph pg dump | grep active+clean+inconsistent
> dumped all in format plain
> 3.d70   290 0   0   0   4600869888  30503050
>   stale+active+clean+inconsistent 2015-10-18 13:05:43.320451
>   87798'7631234   87798:10758311[131,119,132]   131
>   [131,119,132]   131 85161'7599152   2015-10-16 14:34:21.283303
>   85161'7599152   2015-10-16 14:34:21.283303
> 
> And after restarting osd.131, the primary osd (osd.131) crashes. The stack
> trace:
> 
>  1: /usr/bin/ceph-osd() [0x9c6de1]
>  2: (()+0xf790) [0x7f384b6b8790]
>  3: (gsignal()+0x35) [0x7f384a58a625]
>  4: (abort()+0x175) [0x7f384a58be05]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x12d) [0x7f384ae44a5d]
>  6: (()+0xbcbe6) [0x7f384ae42be6]
>  7: (()+0xbcc13) [0x7f384ae42c13]
>  8: (()+0xbcd0e) [0x7f384ae42d0e]
>  9: (ceph::buffer::list::iterator::copy(unsigned int, char*)+0x13e)
[0x9cd0de]
>  10: (object_info_t::decode(ceph::buffer::list::iterator&)+0x81)
[0x7dfaf1]
>  11: (PG::_scan_snaps(ScrubMap&)+0x394) [0x84b8c4]
>  12: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool,
> ThreadPool::TPHandle&)+0x27b) [0x84cdab]
>  13: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x5c4) [0x85c1b4]
>  14: (PG::scrub(ThreadPool::TPHandle&)+0x181) [0x85d691]
>  15: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle&)+0x1c) [0x6737cc]
>  16: (ThreadPool::worker(ThreadPool::WorkThread*)+0x53d) [0x9e05dd]
>  17: (ThreadPool::WorkThread::entry()+0x10) [0x9e1760]
>  18: (()+0x7a51) [0x7f384b6b0a51]
>  19: (clone()+0x6d) [0x7f384a6409ad]
> 
> The ceph version is v0.80.9. Manually running `ceph pg deep-scrub 3.d70` also
> causes the osd to crash.
> 
> Any ideas? Or did I miss some logs necessary for further investigation?
> 
> Thx.
> 
> --
> Best Regards!
> Jiaying Ren(mikulely)

I have met a similar problem when running the 'ceph pg deep-scrub' command: it
also caused an osd crash. I eventually found that some sectors of the disk had
become corrupted, so please check the dmesg output to see whether there are any
disk errors.
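For example, something like the following (the device path is a placeholder for
whichever disk backs the crashing osd, not something taken from the report
above):

$ dmesg | grep -iE 'i/o error|medium error|sector'     # kernel-level read failures
$ smartctl -a /dev/sdX | grep -iE 'reallocated|pending|uncorrectable'   # SMART counters for the osd's disk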




osd crash when deep-scrubbing

2015-10-18 Thread Jiaying Ren
Hi, cephers:

I've encountered a problem that a pg stuck in inconsistent status:

$ ceph -s
cluster 27d39faa-48ae-4356-a8e3-19d5b81e179e
 health HEALTH_ERR 1 pgs inconsistent; 34 near full osd(s); 1
scrub errors; noout flag(s) set
 monmap e4: 3 mons at
{server-61.0..x.in=10.8.0.61:6789/0,server-62.0..x.in=10.8.0.62:6789/0,server-63.0..x.in=10.8.0.63:6789/0},
election epoch 6706, quorum 0,1,2
server-61.0..x.in,server-62.0..x.in,server-63.0..x.in
 osdmap e87808: 180 osds: 180 up, 180 in
flags noout
  pgmap v29322850: 35026 pgs, 15 pools, 27768 GB data, 1905 kobjects
83575 GB used, 114 TB / 196 TB avail
   35025 active+clean
   1 active+clean+inconsistent
  client io 120 kB/s rd, 216 MB/s wr, 6398 op/s

The `pg repair` command doesn't work, so I manually repaired an inconsistent
object (the pool size is 3; I removed the copy that differed from the other two).
After that the pg is still in inconsistent status:

$ ceph pg dump | grep active+clean+inconsistent
dumped all in format plain
3.d70   290 0   0   0   4600869888  30503050
  stale+active+clean+inconsistent 2015-10-18 13:05:43.320451
  87798'7631234   87798:10758311[131,119,132]   131
  [131,119,132]   131 85161'7599152   2015-10-16 14:34:21.283303
  85161'7599152   2015-10-16 14:34:21.283303
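The manual removal was along these general lines (the object name, backup path,
and osd data path below are placeholders assuming the default layout, not the
actual values from this cluster):

$ cd /var/lib/ceph/osd/ceph-131/current/3.d70_head   # primary osd's copy of the pg
$ find . -name '*OBJECTNAME*'                        # locate the on-disk file for the divergent object
$ mv ./DIR_X/OBJECTNAME__head_HASH__3 /root/backup/  # move the odd copy aside rather than deleting it
$ ceph pg repair 3.d70                               # let the two matching replicas restore it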

And after restarting osd.131, the primary osd (osd.131) crashes. The stack trace:

 1: /usr/bin/ceph-osd() [0x9c6de1]
 2: (()+0xf790) [0x7f384b6b8790]
 3: (gsignal()+0x35) [0x7f384a58a625]
 4: (abort()+0x175) [0x7f384a58be05]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x12d) [0x7f384ae44a5d]
 6: (()+0xbcbe6) [0x7f384ae42be6]
 7: (()+0xbcc13) [0x7f384ae42c13]
 8: (()+0xbcd0e) [0x7f384ae42d0e]
 9: (ceph::buffer::list::iterator::copy(unsigned int, char*)+0x13e) [0x9cd0de]
 10: (object_info_t::decode(ceph::buffer::list::iterator&)+0x81) [0x7dfaf1]
 11: (PG::_scan_snaps(ScrubMap&)+0x394) [0x84b8c4]
 12: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool,
ThreadPool::TPHandle&)+0x27b) [0x84cdab]
 13: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x5c4) [0x85c1b4]
 14: (PG::scrub(ThreadPool::TPHandle&)+0x181) [0x85d691]
 15: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle&)+0x1c) [0x6737cc]
 16: (ThreadPool::worker(ThreadPool::WorkThread*)+0x53d) [0x9e05dd]
 17: (ThreadPool::WorkThread::entry()+0x10) [0x9e1760]
 18: (()+0x7a51) [0x7f384b6b0a51]
 19: (clone()+0x6d) [0x7f384a6409ad]

The ceph version is v0.80.9. Manually running `ceph pg deep-scrub 3.d70` also
causes the osd to crash.

Any ideas? Or did I miss some logs necessary for further investigation?

Thx.

--
Best Regards!
Jiaying Ren(mikulely)


Re: osd crash with object store set to newstore

2015-06-05 Thread Srikanth Madugundi
Hi Sage,

Did you get a chance to look at the crash?

Regards
Srikanth

On Wed, Jun 3, 2015 at 1:38 PM, Srikanth Madugundi
srikanth.madugu...@gmail.com wrote:
 Hi Sage,

 I saw the crash again here is the output after adding the debug
 message from wip-newstore-debuglist


-31 2015-06-03 20:28:18.864496 7fd95976b700 -1
 newstore(/var/lib/ceph/osd/ceph-19) start is -1/0//0/0 ... k is
 --.7fff..!!!.


 Here is the id of the file I posted.

 ceph-post-file: ddfcf940-8c13-4913-a7b9-436c1a7d0804

 Let me know if you need anything else.

 Regards
 Srikanth


 On Mon, Jun 1, 2015 at 10:25 PM, Srikanth Madugundi
 srikanth.madugu...@gmail.com wrote:
 Hi Sage,

 Unfortunately I purged the cluster yesterday and restarted the
 backfill tool. I did not see the osd crash yet on the cluster. I am
 monitoring the OSDs and will update you once I see the crash.

 With the new backfill run I have reduced the rps by half, not sure if
 this is the reason for not seeing the crash yet.

 Regards
 Srikanth


 On Mon, Jun 1, 2015 at 10:06 PM, Sage Weil s...@newdream.net wrote:
 I pushed a commit to wip-newstore-debuglist.. can you reproduce the crash
 with that branch with 'debug newstore = 20' and send us the log?
 (You can just do 'ceph-post-file filename'.)

 Thanks!
 sage

 On Mon, 1 Jun 2015, Srikanth Madugundi wrote:

 Hi Sage,

 The assertion failed at line 1639, here is the log message


 2015-05-30 23:17:55.141388 7f0891be0700 -1 os/newstore/NewStore.cc: In
 function 'virtual int NewStore::collection_list_partial(coll_t,
 ghobject_t, int, int, snapid_t, std::vectorghobject_t*,
 ghobject_t*)' thread 7f0891be0700 time 2015-05-30 23:17:55.137174

 os/newstore/NewStore.cc: 1639: FAILED assert(k = start_key  k  end_key)


 Just before the crash the here are the debug statements printed by the
 method (collection_list_partial)

 2015-05-30 22:49:23.607232 7f1681934700 15
 newstore(/var/lib/ceph/osd/ceph-7) collection_list_partial 75.0_head
 start -1/0//0/0 min/max 1024/1024 snap head
 2015-05-30 22:49:23.607251 7f1681934700 20
 newstore(/var/lib/ceph/osd/ceph-7) collection_list_partial range
 --.7fb4.. to --.7fb4.0800. and
 --.804b.. to --.804b.0800. start
 -1/0//0/0


 Regards
 Srikanth

 On Mon, Jun 1, 2015 at 8:54 PM, Sage Weil s...@newdream.net wrote:
  On Mon, 1 Jun 2015, Srikanth Madugundi wrote:
  Hi Sage and all,
 
  I build ceph code from wip-newstore on RHEL7 and running performance
  tests to compare with filestore. After few hours of running the tests
  the osd daemons started to crash. Here is the stack trace, the osd
  crashes immediately after the restart. So I could not get the osd up
  and running.
 
  ceph version b8e22893f44979613738dfcdd40dada2b513118
  (eb8e22893f44979613738dfcdd40dada2b513118)
  1: /usr/bin/ceph-osd() [0xb84652]
  2: (()+0xf130) [0x7f915f84f130]
  3: (gsignal()+0x39) [0x7f915e2695c9]
  4: (abort()+0x148) [0x7f915e26acd8]
  5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f915eb6d9d5]
  6: (()+0x5e946) [0x7f915eb6b946]
  7: (()+0x5e973) [0x7f915eb6b973]
  8: (()+0x5eb9f) [0x7f915eb6bb9f]
  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
  const*)+0x27a) [0xc84c5a]
  10: (NewStore::collection_list_partial(coll_t, ghobject_t, int, int,
  snapid_t, std::vectorghobject_t, std::allocatorghobject_t *,
  ghobject_t*)+0x13c9) [0xa08639]
  11: (PGBackend::objects_list_partial(hobject_t const, int, int,
  snapid_t, std::vectorhobject_t, std::allocatorhobject_t *,
  hobject_t*)+0x352) [0x918a02]
  12: (ReplicatedPG::do_pg_op(std::tr1::shared_ptrOpRequest)+0x1066) 
  [0x8aa906]
  13: (ReplicatedPG::do_op(std::tr1::shared_ptrOpRequest)+0x1eb) 
  [0x8cd06b]
  14: (ReplicatedPG::do_request(std::tr1::shared_ptrOpRequest,
  ThreadPool::TPHandle)+0x68a) [0x85dbea]
  15: (OSD::dequeue_op(boost::intrusive_ptrPG,
  std::tr1::shared_ptrOpRequest, ThreadPool::TPHandle)+0x3ed)
  [0x6c3f5d]
  16: (OSD::ShardedOpWQ::_process(unsigned int,
  ceph::heartbeat_handle_d*)+0x2e9) [0x6c4449]
  17: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) 
  [0xc746bf]
  18: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xc767f0]
  19: (()+0x7df3) [0x7f915f847df3]
  20: (clone()+0x6d) [0x7f915e32a01d]
  NOTE: a copy of the executable, or `objdump -rdS executable` is
  needed to interpret this.
 
  Please let me know the cause of this crash, when this crash happens I
  noticed that two osds on separate machines are down. I can bring one
  osd up but restarting the other osd causes both OSDs to crash. My
  understanding is the crash seems to happen when two OSDs try to
  communicate and replicate a particular PG.
 
  Can you include the log lines that preceed the dump above?  In 
  particular,
  there should be a line that tells you what assertion failed in what
  function and at what line number.  I haven't seen this crash so I'm not
  sure offhand what

Re: osd crash with object store set to newstore

2015-06-05 Thread Sage Weil
On Fri, 5 Jun 2015, Srikanth Madugundi wrote:
 Hi Sage,
 
 Did you get a chance to look at the crash?

Not yet--I am still focusing on getting wip-temp (and other newstore 
prerequisite code) working before turning back to newstore.  I'll look at 
this once I get back to newstore... hopefully in the next week or so!

sage


 
 Regards
 Srikanth
 
 On Wed, Jun 3, 2015 at 1:38 PM, Srikanth Madugundi
 srikanth.madugu...@gmail.com wrote:
  Hi Sage,
 
  I saw the crash again here is the output after adding the debug
  message from wip-newstore-debuglist
 
 
 -31 2015-06-03 20:28:18.864496 7fd95976b700 -1
  newstore(/var/lib/ceph/osd/ceph-19) start is -1/0//0/0 ... k is
  --.7fff..!!!.
 
 
  Here is the id of the file I posted.
 
  ceph-post-file: ddfcf940-8c13-4913-a7b9-436c1a7d0804
 
  Let me know if you need anything else.
 
  Regards
  Srikanth
 
 
  On Mon, Jun 1, 2015 at 10:25 PM, Srikanth Madugundi
  srikanth.madugu...@gmail.com wrote:
  Hi Sage,
 
  Unfortunately I purged the cluster yesterday and restarted the
  backfill tool. I did not see the osd crash yet on the cluster. I am
  monitoring the OSDs and will update you once I see the crash.
 
  With the new backfill run I have reduced the rps by half, not sure if
  this is the reason for not seeing the crash yet.
 
  Regards
  Srikanth
 
 
  On Mon, Jun 1, 2015 at 10:06 PM, Sage Weil s...@newdream.net wrote:
  I pushed a commit to wip-newstore-debuglist.. can you reproduce the crash
  with that branch with 'debug newstore = 20' and send us the log?
  (You can just do 'ceph-post-file filename'.)
 
  Thanks!
  sage
 
  On Mon, 1 Jun 2015, Srikanth Madugundi wrote:
 
  Hi Sage,
 
  The assertion failed at line 1639, here is the log message
 
 
  2015-05-30 23:17:55.141388 7f0891be0700 -1 os/newstore/NewStore.cc: In
  function 'virtual int NewStore::collection_list_partial(coll_t,
  ghobject_t, int, int, snapid_t, std::vectorghobject_t*,
  ghobject_t*)' thread 7f0891be0700 time 2015-05-30 23:17:55.137174
 
  os/newstore/NewStore.cc: 1639: FAILED assert(k = start_key  k  
  end_key)
 
 
  Just before the crash the here are the debug statements printed by the
  method (collection_list_partial)
 
  2015-05-30 22:49:23.607232 7f1681934700 15
  newstore(/var/lib/ceph/osd/ceph-7) collection_list_partial 75.0_head
  start -1/0//0/0 min/max 1024/1024 snap head
  2015-05-30 22:49:23.607251 7f1681934700 20
  newstore(/var/lib/ceph/osd/ceph-7) collection_list_partial range
  --.7fb4.. to --.7fb4.0800. and
  --.804b.. to --.804b.0800. start
  -1/0//0/0
 
 
  Regards
  Srikanth
 
  On Mon, Jun 1, 2015 at 8:54 PM, Sage Weil s...@newdream.net wrote:
   On Mon, 1 Jun 2015, Srikanth Madugundi wrote:
   Hi Sage and all,
  
   I build ceph code from wip-newstore on RHEL7 and running performance
   tests to compare with filestore. After few hours of running the tests
   the osd daemons started to crash. Here is the stack trace, the osd
   crashes immediately after the restart. So I could not get the osd up
   and running.
  
   ceph version b8e22893f44979613738dfcdd40dada2b513118
   (eb8e22893f44979613738dfcdd40dada2b513118)
   1: /usr/bin/ceph-osd() [0xb84652]
   2: (()+0xf130) [0x7f915f84f130]
   3: (gsignal()+0x39) [0x7f915e2695c9]
   4: (abort()+0x148) [0x7f915e26acd8]
   5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f915eb6d9d5]
   6: (()+0x5e946) [0x7f915eb6b946]
   7: (()+0x5e973) [0x7f915eb6b973]
   8: (()+0x5eb9f) [0x7f915eb6bb9f]
   9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
   const*)+0x27a) [0xc84c5a]
   10: (NewStore::collection_list_partial(coll_t, ghobject_t, int, int,
   snapid_t, std::vectorghobject_t, std::allocatorghobject_t *,
   ghobject_t*)+0x13c9) [0xa08639]
   11: (PGBackend::objects_list_partial(hobject_t const, int, int,
   snapid_t, std::vectorhobject_t, std::allocatorhobject_t *,
   hobject_t*)+0x352) [0x918a02]
   12: (ReplicatedPG::do_pg_op(std::tr1::shared_ptrOpRequest)+0x1066) 
   [0x8aa906]
   13: (ReplicatedPG::do_op(std::tr1::shared_ptrOpRequest)+0x1eb) 
   [0x8cd06b]
   14: (ReplicatedPG::do_request(std::tr1::shared_ptrOpRequest,
   ThreadPool::TPHandle)+0x68a) [0x85dbea]
   15: (OSD::dequeue_op(boost::intrusive_ptrPG,
   std::tr1::shared_ptrOpRequest, ThreadPool::TPHandle)+0x3ed)
   [0x6c3f5d]
   16: (OSD::ShardedOpWQ::_process(unsigned int,
   ceph::heartbeat_handle_d*)+0x2e9) [0x6c4449]
   17: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) 
   [0xc746bf]
   18: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xc767f0]
   19: (()+0x7df3) [0x7f915f847df3]
   20: (clone()+0x6d) [0x7f915e32a01d]
   NOTE: a copy of the executable, or `objdump -rdS executable` is
   needed to interpret this.
  
   Please let me know the cause of this crash, when this crash happens I
   noticed that two osds on separate machines are down. I can bring one
   osd up but restarting

Re: osd crash with object store set to newstore

2015-06-03 Thread Srikanth Madugundi
Hi Sage,

I saw the crash again; here is the output after adding the debug
message from wip-newstore-debuglist:


   -31 2015-06-03 20:28:18.864496 7fd95976b700 -1
newstore(/var/lib/ceph/osd/ceph-19) start is -1/0//0/0 ... k is
--.7fff..!!!.


Here is the id of the file I posted.

ceph-post-file: ddfcf940-8c13-4913-a7b9-436c1a7d0804

Let me know if you need anything else.

Regards
Srikanth


On Mon, Jun 1, 2015 at 10:25 PM, Srikanth Madugundi
srikanth.madugu...@gmail.com wrote:
 Hi Sage,

 Unfortunately I purged the cluster yesterday and restarted the
 backfill tool. I did not see the osd crash yet on the cluster. I am
 monitoring the OSDs and will update you once I see the crash.

 With the new backfill run I have reduced the rps by half, not sure if
 this is the reason for not seeing the crash yet.

 Regards
 Srikanth


 On Mon, Jun 1, 2015 at 10:06 PM, Sage Weil s...@newdream.net wrote:
 I pushed a commit to wip-newstore-debuglist.. can you reproduce the crash
 with that branch with 'debug newstore = 20' and send us the log?
 (You can just do 'ceph-post-file filename'.)

 Thanks!
 sage

 On Mon, 1 Jun 2015, Srikanth Madugundi wrote:

 Hi Sage,

 The assertion failed at line 1639, here is the log message


 2015-05-30 23:17:55.141388 7f0891be0700 -1 os/newstore/NewStore.cc: In
 function 'virtual int NewStore::collection_list_partial(coll_t,
 ghobject_t, int, int, snapid_t, std::vectorghobject_t*,
 ghobject_t*)' thread 7f0891be0700 time 2015-05-30 23:17:55.137174

 os/newstore/NewStore.cc: 1639: FAILED assert(k = start_key  k  end_key)


 Just before the crash the here are the debug statements printed by the
 method (collection_list_partial)

 2015-05-30 22:49:23.607232 7f1681934700 15
 newstore(/var/lib/ceph/osd/ceph-7) collection_list_partial 75.0_head
 start -1/0//0/0 min/max 1024/1024 snap head
 2015-05-30 22:49:23.607251 7f1681934700 20
 newstore(/var/lib/ceph/osd/ceph-7) collection_list_partial range
 --.7fb4.. to --.7fb4.0800. and
 --.804b.. to --.804b.0800. start
 -1/0//0/0


 Regards
 Srikanth

 On Mon, Jun 1, 2015 at 8:54 PM, Sage Weil s...@newdream.net wrote:
  On Mon, 1 Jun 2015, Srikanth Madugundi wrote:
  Hi Sage and all,
 
  I build ceph code from wip-newstore on RHEL7 and running performance
  tests to compare with filestore. After few hours of running the tests
  the osd daemons started to crash. Here is the stack trace, the osd
  crashes immediately after the restart. So I could not get the osd up
  and running.
 
  ceph version b8e22893f44979613738dfcdd40dada2b513118
  (eb8e22893f44979613738dfcdd40dada2b513118)
  1: /usr/bin/ceph-osd() [0xb84652]
  2: (()+0xf130) [0x7f915f84f130]
  3: (gsignal()+0x39) [0x7f915e2695c9]
  4: (abort()+0x148) [0x7f915e26acd8]
  5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f915eb6d9d5]
  6: (()+0x5e946) [0x7f915eb6b946]
  7: (()+0x5e973) [0x7f915eb6b973]
  8: (()+0x5eb9f) [0x7f915eb6bb9f]
  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
  const*)+0x27a) [0xc84c5a]
  10: (NewStore::collection_list_partial(coll_t, ghobject_t, int, int,
  snapid_t, std::vectorghobject_t, std::allocatorghobject_t *,
  ghobject_t*)+0x13c9) [0xa08639]
  11: (PGBackend::objects_list_partial(hobject_t const, int, int,
  snapid_t, std::vectorhobject_t, std::allocatorhobject_t *,
  hobject_t*)+0x352) [0x918a02]
  12: (ReplicatedPG::do_pg_op(std::tr1::shared_ptrOpRequest)+0x1066) 
  [0x8aa906]
  13: (ReplicatedPG::do_op(std::tr1::shared_ptrOpRequest)+0x1eb) 
  [0x8cd06b]
  14: (ReplicatedPG::do_request(std::tr1::shared_ptrOpRequest,
  ThreadPool::TPHandle)+0x68a) [0x85dbea]
  15: (OSD::dequeue_op(boost::intrusive_ptrPG,
  std::tr1::shared_ptrOpRequest, ThreadPool::TPHandle)+0x3ed)
  [0x6c3f5d]
  16: (OSD::ShardedOpWQ::_process(unsigned int,
  ceph::heartbeat_handle_d*)+0x2e9) [0x6c4449]
  17: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) 
  [0xc746bf]
  18: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xc767f0]
  19: (()+0x7df3) [0x7f915f847df3]
  20: (clone()+0x6d) [0x7f915e32a01d]
  NOTE: a copy of the executable, or `objdump -rdS executable` is
  needed to interpret this.
 
  Please let me know the cause of this crash, when this crash happens I
  noticed that two osds on separate machines are down. I can bring one
  osd up but restarting the other osd causes both OSDs to crash. My
  understanding is the crash seems to happen when two OSDs try to
  communicate and replicate a particular PG.
 
  Can you include the log lines that preceed the dump above?  In particular,
  there should be a line that tells you what assertion failed in what
  function and at what line number.  I haven't seen this crash so I'm not
  sure offhand what it is.
 
  Thanks!
  sage

Re: osd crash with object store set to newstore

2015-06-01 Thread Sage Weil
I pushed a commit to wip-newstore-debuglist. Can you reproduce the crash 
with that branch with 'debug newstore = 20' and send us the log?  
(You can just do 'ceph-post-file filename'.)
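For reference, one way to do that (the osd id and log path are placeholders;
the level can also be set as 'debug newstore = 20' under [osd] in ceph.conf
before restarting the daemon):

$ ceph tell osd.N injectargs '--debug_newstore 20'   # bump newstore logging on the affected osd
# ...reproduce the crash...
$ ceph-post-file /var/log/ceph/ceph-osd.N.log        # upload the resulting log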

Thanks!
sage

On Mon, 1 Jun 2015, Srikanth Madugundi wrote:

 Hi Sage,
 
 The assertion failed at line 1639, here is the log message
 
 
 2015-05-30 23:17:55.141388 7f0891be0700 -1 os/newstore/NewStore.cc: In
 function 'virtual int NewStore::collection_list_partial(coll_t,
 ghobject_t, int, int, snapid_t, std::vectorghobject_t*,
 ghobject_t*)' thread 7f0891be0700 time 2015-05-30 23:17:55.137174
 
 os/newstore/NewStore.cc: 1639: FAILED assert(k = start_key  k  end_key)
 
 
 Just before the crash the here are the debug statements printed by the
 method (collection_list_partial)
 
 2015-05-30 22:49:23.607232 7f1681934700 15
 newstore(/var/lib/ceph/osd/ceph-7) collection_list_partial 75.0_head
 start -1/0//0/0 min/max 1024/1024 snap head
 2015-05-30 22:49:23.607251 7f1681934700 20
 newstore(/var/lib/ceph/osd/ceph-7) collection_list_partial range
 --.7fb4.. to --.7fb4.0800. and
 --.804b.. to --.804b.0800. start
 -1/0//0/0
 
 
 Regards
 Srikanth
 
 On Mon, Jun 1, 2015 at 8:54 PM, Sage Weil s...@newdream.net wrote:
  On Mon, 1 Jun 2015, Srikanth Madugundi wrote:
  Hi Sage and all,
 
  I build ceph code from wip-newstore on RHEL7 and running performance
  tests to compare with filestore. After few hours of running the tests
  the osd daemons started to crash. Here is the stack trace, the osd
  crashes immediately after the restart. So I could not get the osd up
  and running.
 
  ceph version b8e22893f44979613738dfcdd40dada2b513118
  (eb8e22893f44979613738dfcdd40dada2b513118)
  1: /usr/bin/ceph-osd() [0xb84652]
  2: (()+0xf130) [0x7f915f84f130]
  3: (gsignal()+0x39) [0x7f915e2695c9]
  4: (abort()+0x148) [0x7f915e26acd8]
  5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f915eb6d9d5]
  6: (()+0x5e946) [0x7f915eb6b946]
  7: (()+0x5e973) [0x7f915eb6b973]
  8: (()+0x5eb9f) [0x7f915eb6bb9f]
  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
  const*)+0x27a) [0xc84c5a]
  10: (NewStore::collection_list_partial(coll_t, ghobject_t, int, int,
  snapid_t, std::vectorghobject_t, std::allocatorghobject_t *,
  ghobject_t*)+0x13c9) [0xa08639]
  11: (PGBackend::objects_list_partial(hobject_t const, int, int,
  snapid_t, std::vectorhobject_t, std::allocatorhobject_t *,
  hobject_t*)+0x352) [0x918a02]
  12: (ReplicatedPG::do_pg_op(std::tr1::shared_ptrOpRequest)+0x1066) 
  [0x8aa906]
  13: (ReplicatedPG::do_op(std::tr1::shared_ptrOpRequest)+0x1eb) 
  [0x8cd06b]
  14: (ReplicatedPG::do_request(std::tr1::shared_ptrOpRequest,
  ThreadPool::TPHandle)+0x68a) [0x85dbea]
  15: (OSD::dequeue_op(boost::intrusive_ptrPG,
  std::tr1::shared_ptrOpRequest, ThreadPool::TPHandle)+0x3ed)
  [0x6c3f5d]
  16: (OSD::ShardedOpWQ::_process(unsigned int,
  ceph::heartbeat_handle_d*)+0x2e9) [0x6c4449]
  17: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) 
  [0xc746bf]
  18: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xc767f0]
  19: (()+0x7df3) [0x7f915f847df3]
  20: (clone()+0x6d) [0x7f915e32a01d]
  NOTE: a copy of the executable, or `objdump -rdS executable` is
  needed to interpret this.
 
  Please let me know the cause of this crash, when this crash happens I
  noticed that two osds on separate machines are down. I can bring one
  osd up but restarting the other osd causes both OSDs to crash. My
  understanding is the crash seems to happen when two OSDs try to
  communicate and replicate a particular PG.
 
  Can you include the log lines that preceed the dump above?  In particular,
  there should be a line that tells you what assertion failed in what
  function and at what line number.  I haven't seen this crash so I'm not
  sure offhand what it is.
 
  Thanks!
  sage
 
 


osd crash with object store set to newstore

2015-06-01 Thread Srikanth Madugundi
Hi Sage and all,

I built ceph from the wip-newstore branch on RHEL7 and am running performance
tests to compare it with filestore. After a few hours of running the tests the
osd daemons started to crash. Here is the stack trace; the osd crashes
immediately after a restart, so I could not get the osd up and running.

ceph version eb8e22893f44979613738dfcdd40dada2b513118
(eb8e22893f44979613738dfcdd40dada2b513118)
1: /usr/bin/ceph-osd() [0xb84652]
2: (()+0xf130) [0x7f915f84f130]
3: (gsignal()+0x39) [0x7f915e2695c9]
4: (abort()+0x148) [0x7f915e26acd8]
5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f915eb6d9d5]
6: (()+0x5e946) [0x7f915eb6b946]
7: (()+0x5e973) [0x7f915eb6b973]
8: (()+0x5eb9f) [0x7f915eb6bb9f]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x27a) [0xc84c5a]
10: (NewStore::collection_list_partial(coll_t, ghobject_t, int, int,
snapid_t, std::vector<ghobject_t, std::allocator<ghobject_t> >*,
ghobject_t*)+0x13c9) [0xa08639]
11: (PGBackend::objects_list_partial(hobject_t const&, int, int,
snapid_t, std::vector<hobject_t, std::allocator<hobject_t> >*,
hobject_t*)+0x352) [0x918a02]
12: (ReplicatedPG::do_pg_op(std::tr1::shared_ptr<OpRequest>)+0x1066) [0x8aa906]
13: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>)+0x1eb) [0x8cd06b]
14: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>,
ThreadPool::TPHandle&)+0x68a) [0x85dbea]
15: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3ed)
[0x6c3f5d]
16: (OSD::ShardedOpWQ::_process(unsigned int,
ceph::heartbeat_handle_d*)+0x2e9) [0x6c4449]
17: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) [0xc746bf]
18: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xc767f0]
19: (()+0x7df3) [0x7f915f847df3]
20: (clone()+0x6d) [0x7f915e32a01d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.

Please let me know the cause of this crash. When it happens I notice that two
osds on separate machines are down. I can bring one osd up, but restarting the
other osd causes both OSDs to crash. My understanding is that the crash happens
when the two OSDs try to communicate and replicate a particular PG.

Regards
Srikanth


Re: osd crash with object store set to newstore

2015-06-01 Thread Srikanth Madugundi
Hi Sage,

The assertion failed at line 1639; here is the log message:


2015-05-30 23:17:55.141388 7f0891be0700 -1 os/newstore/NewStore.cc: In
function 'virtual int NewStore::collection_list_partial(coll_t,
ghobject_t, int, int, snapid_t, std::vector<ghobject_t>*,
ghobject_t*)' thread 7f0891be0700 time 2015-05-30 23:17:55.137174

os/newstore/NewStore.cc: 1639: FAILED assert(k >= start_key && k < end_key)


Just before the crash, here are the debug statements printed by the
method (collection_list_partial):

2015-05-30 22:49:23.607232 7f1681934700 15
newstore(/var/lib/ceph/osd/ceph-7) collection_list_partial 75.0_head
start -1/0//0/0 min/max 1024/1024 snap head
2015-05-30 22:49:23.607251 7f1681934700 20
newstore(/var/lib/ceph/osd/ceph-7) collection_list_partial range
--.7fb4.. to --.7fb4.0800. and
--.804b.. to --.804b.0800. start
-1/0//0/0


Regards
Srikanth

On Mon, Jun 1, 2015 at 8:54 PM, Sage Weil s...@newdream.net wrote:
 On Mon, 1 Jun 2015, Srikanth Madugundi wrote:
 Hi Sage and all,

 I build ceph code from wip-newstore on RHEL7 and running performance
 tests to compare with filestore. After few hours of running the tests
 the osd daemons started to crash. Here is the stack trace, the osd
 crashes immediately after the restart. So I could not get the osd up
 and running.

 ceph version b8e22893f44979613738dfcdd40dada2b513118
 (eb8e22893f44979613738dfcdd40dada2b513118)
 1: /usr/bin/ceph-osd() [0xb84652]
 2: (()+0xf130) [0x7f915f84f130]
 3: (gsignal()+0x39) [0x7f915e2695c9]
 4: (abort()+0x148) [0x7f915e26acd8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f915eb6d9d5]
 6: (()+0x5e946) [0x7f915eb6b946]
 7: (()+0x5e973) [0x7f915eb6b973]
 8: (()+0x5eb9f) [0x7f915eb6bb9f]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
 const*)+0x27a) [0xc84c5a]
 10: (NewStore::collection_list_partial(coll_t, ghobject_t, int, int,
 snapid_t, std::vectorghobject_t, std::allocatorghobject_t *,
 ghobject_t*)+0x13c9) [0xa08639]
 11: (PGBackend::objects_list_partial(hobject_t const, int, int,
 snapid_t, std::vectorhobject_t, std::allocatorhobject_t *,
 hobject_t*)+0x352) [0x918a02]
 12: (ReplicatedPG::do_pg_op(std::tr1::shared_ptrOpRequest)+0x1066) 
 [0x8aa906]
 13: (ReplicatedPG::do_op(std::tr1::shared_ptrOpRequest)+0x1eb) [0x8cd06b]
 14: (ReplicatedPG::do_request(std::tr1::shared_ptrOpRequest,
 ThreadPool::TPHandle)+0x68a) [0x85dbea]
 15: (OSD::dequeue_op(boost::intrusive_ptrPG,
 std::tr1::shared_ptrOpRequest, ThreadPool::TPHandle)+0x3ed)
 [0x6c3f5d]
 16: (OSD::ShardedOpWQ::_process(unsigned int,
 ceph::heartbeat_handle_d*)+0x2e9) [0x6c4449]
 17: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) 
 [0xc746bf]
 18: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xc767f0]
 19: (()+0x7df3) [0x7f915f847df3]
 20: (clone()+0x6d) [0x7f915e32a01d]
 NOTE: a copy of the executable, or `objdump -rdS executable` is
 needed to interpret this.

 Please let me know the cause of this crash, when this crash happens I
 noticed that two osds on separate machines are down. I can bring one
 osd up but restarting the other osd causes both OSDs to crash. My
 understanding is the crash seems to happen when two OSDs try to
 communicate and replicate a particular PG.

 Can you include the log lines that preceed the dump above?  In particular,
 there should be a line that tells you what assertion failed in what
 function and at what line number.  I haven't seen this crash so I'm not
 sure offhand what it is.

 Thanks!
 sage


Re: osd crash with object store set to newstore

2015-06-01 Thread Sage Weil
On Mon, 1 Jun 2015, Srikanth Madugundi wrote:
 Hi Sage and all,
 
 I build ceph code from wip-newstore on RHEL7 and running performance
 tests to compare with filestore. After few hours of running the tests
 the osd daemons started to crash. Here is the stack trace, the osd
 crashes immediately after the restart. So I could not get the osd up
 and running.
 
 ceph version b8e22893f44979613738dfcdd40dada2b513118
 (eb8e22893f44979613738dfcdd40dada2b513118)
 1: /usr/bin/ceph-osd() [0xb84652]
 2: (()+0xf130) [0x7f915f84f130]
 3: (gsignal()+0x39) [0x7f915e2695c9]
 4: (abort()+0x148) [0x7f915e26acd8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f915eb6d9d5]
 6: (()+0x5e946) [0x7f915eb6b946]
 7: (()+0x5e973) [0x7f915eb6b973]
 8: (()+0x5eb9f) [0x7f915eb6bb9f]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
 const*)+0x27a) [0xc84c5a]
 10: (NewStore::collection_list_partial(coll_t, ghobject_t, int, int,
 snapid_t, std::vectorghobject_t, std::allocatorghobject_t *,
 ghobject_t*)+0x13c9) [0xa08639]
 11: (PGBackend::objects_list_partial(hobject_t const, int, int,
 snapid_t, std::vectorhobject_t, std::allocatorhobject_t *,
 hobject_t*)+0x352) [0x918a02]
 12: (ReplicatedPG::do_pg_op(std::tr1::shared_ptrOpRequest)+0x1066) 
 [0x8aa906]
 13: (ReplicatedPG::do_op(std::tr1::shared_ptrOpRequest)+0x1eb) [0x8cd06b]
 14: (ReplicatedPG::do_request(std::tr1::shared_ptrOpRequest,
 ThreadPool::TPHandle)+0x68a) [0x85dbea]
 15: (OSD::dequeue_op(boost::intrusive_ptrPG,
 std::tr1::shared_ptrOpRequest, ThreadPool::TPHandle)+0x3ed)
 [0x6c3f5d]
 16: (OSD::ShardedOpWQ::_process(unsigned int,
 ceph::heartbeat_handle_d*)+0x2e9) [0x6c4449]
 17: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) 
 [0xc746bf]
 18: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xc767f0]
 19: (()+0x7df3) [0x7f915f847df3]
 20: (clone()+0x6d) [0x7f915e32a01d]
 NOTE: a copy of the executable, or `objdump -rdS executable` is
 needed to interpret this.
 
 Please let me know the cause of this crash, when this crash happens I
 noticed that two osds on separate machines are down. I can bring one
 osd up but restarting the other osd causes both OSDs to crash. My
 understanding is the crash seems to happen when two OSDs try to
 communicate and replicate a particular PG.

Can you include the log lines that precede the dump above?  In particular, 
there should be a line that tells you what assertion failed in what 
function and at what line number.  I haven't seen this crash so I'm not 
sure offhand what it is.

Thanks!
sage


Re: osd crash with object store set to newstore

2015-06-01 Thread Srikanth Madugundi
Hi Sage,

Unfortunately I purged the cluster yesterday and restarted the
backfill tool. I did not see the osd crash yet on the cluster. I am
monitoring the OSDs and will update you once I see the crash.

With the new backfill run I have reduced the rps by half; I'm not sure if
that is the reason for not seeing the crash yet.

Regards
Srikanth


On Mon, Jun 1, 2015 at 10:06 PM, Sage Weil s...@newdream.net wrote:
 I pushed a commit to wip-newstore-debuglist.. can you reproduce the crash
 with that branch with 'debug newstore = 20' and send us the log?
 (You can just do 'ceph-post-file filename'.)

 Thanks!
 sage

 On Mon, 1 Jun 2015, Srikanth Madugundi wrote:

 Hi Sage,

 The assertion failed at line 1639, here is the log message


 2015-05-30 23:17:55.141388 7f0891be0700 -1 os/newstore/NewStore.cc: In
 function 'virtual int NewStore::collection_list_partial(coll_t,
 ghobject_t, int, int, snapid_t, std::vectorghobject_t*,
 ghobject_t*)' thread 7f0891be0700 time 2015-05-30 23:17:55.137174

 os/newstore/NewStore.cc: 1639: FAILED assert(k = start_key  k  end_key)


 Just before the crash the here are the debug statements printed by the
 method (collection_list_partial)

 2015-05-30 22:49:23.607232 7f1681934700 15
 newstore(/var/lib/ceph/osd/ceph-7) collection_list_partial 75.0_head
 start -1/0//0/0 min/max 1024/1024 snap head
 2015-05-30 22:49:23.607251 7f1681934700 20
 newstore(/var/lib/ceph/osd/ceph-7) collection_list_partial range
 --.7fb4.. to --.7fb4.0800. and
 --.804b.. to --.804b.0800. start
 -1/0//0/0


 Regards
 Srikanth

 On Mon, Jun 1, 2015 at 8:54 PM, Sage Weil s...@newdream.net wrote:
  On Mon, 1 Jun 2015, Srikanth Madugundi wrote:
  Hi Sage and all,
 
  I build ceph code from wip-newstore on RHEL7 and running performance
  tests to compare with filestore. After few hours of running the tests
  the osd daemons started to crash. Here is the stack trace, the osd
  crashes immediately after the restart. So I could not get the osd up
  and running.
 
  ceph version b8e22893f44979613738dfcdd40dada2b513118
  (eb8e22893f44979613738dfcdd40dada2b513118)
  1: /usr/bin/ceph-osd() [0xb84652]
  2: (()+0xf130) [0x7f915f84f130]
  3: (gsignal()+0x39) [0x7f915e2695c9]
  4: (abort()+0x148) [0x7f915e26acd8]
  5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f915eb6d9d5]
  6: (()+0x5e946) [0x7f915eb6b946]
  7: (()+0x5e973) [0x7f915eb6b973]
  8: (()+0x5eb9f) [0x7f915eb6bb9f]
  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
  const*)+0x27a) [0xc84c5a]
  10: (NewStore::collection_list_partial(coll_t, ghobject_t, int, int,
  snapid_t, std::vectorghobject_t, std::allocatorghobject_t *,
  ghobject_t*)+0x13c9) [0xa08639]
  11: (PGBackend::objects_list_partial(hobject_t const, int, int,
  snapid_t, std::vectorhobject_t, std::allocatorhobject_t *,
  hobject_t*)+0x352) [0x918a02]
  12: (ReplicatedPG::do_pg_op(std::tr1::shared_ptrOpRequest)+0x1066) 
  [0x8aa906]
  13: (ReplicatedPG::do_op(std::tr1::shared_ptrOpRequest)+0x1eb) 
  [0x8cd06b]
  14: (ReplicatedPG::do_request(std::tr1::shared_ptrOpRequest,
  ThreadPool::TPHandle)+0x68a) [0x85dbea]
  15: (OSD::dequeue_op(boost::intrusive_ptrPG,
  std::tr1::shared_ptrOpRequest, ThreadPool::TPHandle)+0x3ed)
  [0x6c3f5d]
  16: (OSD::ShardedOpWQ::_process(unsigned int,
  ceph::heartbeat_handle_d*)+0x2e9) [0x6c4449]
  17: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) 
  [0xc746bf]
  18: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xc767f0]
  19: (()+0x7df3) [0x7f915f847df3]
  20: (clone()+0x6d) [0x7f915e32a01d]
  NOTE: a copy of the executable, or `objdump -rdS executable` is
  needed to interpret this.
 
  Please let me know the cause of this crash, when this crash happens I
  noticed that two osds on separate machines are down. I can bring one
  osd up but restarting the other osd causes both OSDs to crash. My
  understanding is the crash seems to happen when two OSDs try to
  communicate and replicate a particular PG.
 
  Can you include the log lines that preceed the dump above?  In particular,
  there should be a line that tells you what assertion failed in what
  function and at what line number.  I haven't seen this crash so I'm not
  sure offhand what it is.
 
  Thanks!
  sage




OSD Crash for xattr _ absent issue.

2014-11-26 Thread Wenjunh

 Hi, Samuel  Sage
 
 In our current production environment, OSDs crash because of data 
 inconsistency when reading the “_” xattr, which is described in this 
 issue:
 
 http://tracker.ceph.com/issues/10117.
 
 I also found a two-year-old issue that describes the same bug:
 
 http://tracker.ceph.com/issues/3676.
 
 I think there is an apparent flaw in the related code. Could you help review 
 my last comment, which describes a way to fix the bug?
 
 I prefer the second way: just delete the object if we can’t get the “_” xattr 
 instead of crashing the osd. The object has two other replicas, which can 
 serve the client’s requests, and the next time the self-healing process 
 (scrub, deep scrub) runs, the object can be recovered from its peers.
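 A quick way to spot replicas with an unreadable “_” xattr would be a sketch 
 like the following (the OSD id and PG directory are placeholders, not taken 
 from a real cluster):
 
 $ cd /var/lib/ceph/osd/ceph-N/current/PGID_head
 $ find . -type f -exec sh -c 'getfattr -n user.ceph._ "$1" >/dev/null 2>&1 || echo "missing _ xattr: $1"' _ {} \;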
 
 Because I am not so familiar with the source code, I don’t know whether this 
 repair approach has any other side effects on the ceph cluster.
 
 If you have any idea about the bug, please feel free to let me know.
 
 Thanks
 
 Wenjunh
 
 
 
 



Bobtail to dumpling (was: OSD crash during repair)

2013-09-10 Thread Chris Dunlop
On Fri, Sep 06, 2013 at 08:21:07AM -0700, Sage Weil wrote:
 On Fri, 6 Sep 2013, Chris Dunlop wrote:
 On Thu, Sep 05, 2013 at 07:55:52PM -0700, Sage Weil wrote:
 Also, you should upgrade to dumpling.  :)
 
 I've been considering it. It was initially a little scary with
 the various issues that were cropping up but that all seems to
 have quietened down.
 
 Of course I'd like my cluster to be clean before attempting an upgrade!
 
 Definitely.  Let us know how it goes! :)

Upgraded, directly from bobtail to dumpling.

Well, that was a mite more traumatic than I expected. I had two
issues, both my fault...

Firstly, I didn't realise I should have restarted the osds one
at a time rather than doing 'service ceph restart' on each host
quickly in succession. Restarting them all at once meant
everything was offline whilst PGs were upgrading.
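(A sketch of the one-osd-at-a-time pattern, with placeholder osd ids; waiting
for HEALTH_OK between restarts is just one possible check, not what I actually
ran:)

$ for id in 0 7; do service ceph restart osd.$id; while ! ceph health | grep -q HEALTH_OK; do sleep 10; done; done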

Secondly, whilst I saw the 'osd crush update on start' issue in
the release notes, and checked that my crush map hostnames match
the actual hostnames, I have two separate pools (for fast SAS vs
bulk SATA disks) and I stupidly only noticed the one which
matched, but not the other which didn't match. So on restart all
the osds moved into the one pool, and started rebalancing.

The two issues at the same time produced quite the adrenaline
rush! :-)

My current crush configuration is below (host b2 is recently
added and I haven't added it into the pools yet). Is there a
better/recommended way of using the crush map to support
separate pools to avoid setting 'osd crush update on start =
false'? It doesn't seem that I can use the same 'host' names
under the separate 'sas' and 'default' roots?

Cheers,

Chris

--
# ceph osd tree
# id    weight  type name       up/down reweight
-8  2   root sas
-7  2   rack sas-rack-1
-5  1   host b4-sas
4   0.5 osd.4   up  1   
5   0.5 osd.5   up  1   
-6  1   host b5-sas
2   0.5 osd.2   up  1   
3   0.5 osd.3   up  1   
-1  12.66   root default
-3  8   rack unknownrack
-2  4   host b4
0   2   osd.0   up  1   
7   2   osd.7   up  1   
-4  4   host b5
1   2   osd.1   up  1   
6   2   osd.6   up  1   
-9  4.66host b2
10  1.82osd.10  up  1   
11  1.82osd.11  up  1   
8   0.51osd.8   up  1   
9   0.51osd.9   up  1   

--
# begin crush map

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8
device 9 osd.9
device 10 osd.10
device 11 osd.11

# types
type 0 osd
type 1 host
type 2 rack
type 3 row
type 4 room
type 5 datacenter
type 6 root

# buckets
host b4 {
id -2   # do not change unnecessarily
# weight 4.000
alg straw
hash 0  # rjenkins1
item osd.0 weight 2.000
item osd.7 weight 2.000
}
host b5 {
id -4   # do not change unnecessarily
# weight 4.000
alg straw
hash 0  # rjenkins1
item osd.1 weight 2.000
item osd.6 weight 2.000
}
rack unknownrack {
id -3   # do not change unnecessarily
# weight 8.000
alg straw
hash 0  # rjenkins1
item b4 weight 4.000
item b5 weight 4.000
}
host b2 {
id -9   # do not change unnecessarily
# weight 4.660
alg straw
hash 0  # rjenkins1
item osd.10 weight 1.820
item osd.11 weight 1.820
item osd.8 weight 0.510
item osd.9 weight 0.510
}
root default {
id -1   # do not change unnecessarily
# weight 12.660
alg straw
hash 0  # rjenkins1
item unknownrack weight 8.000
item b2 weight 4.660
}
host b4-sas {
id -5   # do not change unnecessarily
# weight 1.000
alg straw
hash 0  # rjenkins1
item osd.4 weight 0.500
item osd.5 weight 0.500
}
host b5-sas {
id -6   # do not change unnecessarily
# weight 1.000
alg straw
hash 0  # rjenkins1
item osd.2 weight 0.500
item osd.3 weight 0.500
}
rack sas-rack-1 {
id -7   # do not change unnecessarily
# weight 2.000
alg straw
hash 0  # rjenkins1
item b4-sas weight 1.000
item b5-sas weight 1.000
}
root sas {
id -8   # 

Re: Bobtail to dumpling (was: OSD crash during repair)

2013-09-10 Thread Sage Weil
On Wed, 11 Sep 2013, Chris Dunlop wrote:
 On Fri, Sep 06, 2013 at 08:21:07AM -0700, Sage Weil wrote:
  On Fri, 6 Sep 2013, Chris Dunlop wrote:
  On Thu, Sep 05, 2013 at 07:55:52PM -0700, Sage Weil wrote:
  Also, you should upgrade to dumpling.  :)
  
  I've been considering it. It was initially a little scary with
  the various issues that were cropping up but that all seems to
  have quietened down.
  
  Of course I'd like my cluster to be clean before attempting an upgrade!
  
  Definitely.  Let us know how it goes! :)
 
 Upgraded, directly from bobtail to dumpling.
 
 Well, that was a mite more traumatic than I expected. I had two
 issues, both my fault...
 
 Firstly, I didn't realise I should have restarted the osds one
 at a time rather than doing 'service ceph restart' on each host
 quickly in succession. Restarting them all at once meant
 everything was offline whilst PGs are upgrading.
 
 Secondly, whilst I saw the 'osd crush update on start' issue in
 the release notes, and checked that my crush map hostnames match
 the actual hostnames, I have two separate pools (for fast SAS vs
 bulk SATA disks) and I stupidly only noticed the one which
 matched, but not the other which didn't match. So on restart all
 the osds moved into the one pool, and started rebalancing.
 
 The two issues at the same time produced quite the adrenaline
 rush! :-)

I can imagine!

 My current crush configuration is below (host b2 is recently
 added and I haven't added it into the pools yet). Is there a
 better/recommended way of using the crush map to support
 separate pools to avoid setting 'osd crush update on start =
 false'? It doesn't seem that I can use the same 'host' names
 under the separate 'sas' and 'default' roots?

For now we don't have a better solution than setting 'osd crush update on 
start = false'.  Sorry!  I'm guessing that it is pretty uncommon for 
disks to switch hosts, at least.  :/
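For reference, one way to apply that on each OSD host is a snippet like the
following (the config path assumes the usual default location):

$ cat >> /etc/ceph/ceph.conf <<'EOF'
[osd]
    osd crush update on start = false
EOF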

We could come up with a 'standard' way of structuring these sorts of maps 
with prefixes or suffixes on the bucket names; I'm open to suggestions.

However, I'm also wondering if we should take the next step at the same 
time and embed another dimension in the CRUSH tree so that CRUSH itself 
understands that it is host=b4 (say) but it is only looking at the sas or 
ssd items.  This would (help) allow rules along the lines of pick 3 
hosts; choose the ssd from the first and sas disks from the other two.  
I'm not convinced that is an especially good idea for most users, but it's 
probably worth considering.

sage


 
 Cheers,
 
 Chris
 
 --
 # ceph osd tree
 # idweight  type name   up/down reweight
 -8  2   root sas
 -7  2   rack sas-rack-1
 -5  1   host b4-sas
 4   0.5 osd.4   up  1   
 5   0.5 osd.5   up  1   
 -6  1   host b5-sas
 2   0.5 osd.2   up  1   
 3   0.5 osd.3   up  1   
 -1  12.66   root default
 -3  8   rack unknownrack
 -2  4   host b4
 0   2   osd.0   up  1   
 7   2   osd.7   up  1   
 -4  4   host b5
 1   2   osd.1   up  1   
 6   2   osd.6   up  1   
 -9  4.66host b2
 10  1.82osd.10  up  1   
 11  1.82osd.11  up  1   
 8   0.51osd.8   up  1   
 9   0.51osd.9   up  1   
 
 --
 # begin crush map
 
 # devices
 device 0 osd.0
 device 1 osd.1
 device 2 osd.2
 device 3 osd.3
 device 4 osd.4
 device 5 osd.5
 device 6 osd.6
 device 7 osd.7
 device 8 osd.8
 device 9 osd.9
 device 10 osd.10
 device 11 osd.11
 
 # types
 type 0 osd
 type 1 host
 type 2 rack
 type 3 row
 type 4 room
 type 5 datacenter
 type 6 root
 
 # buckets
 host b4 {
   id -2   # do not change unnecessarily
   # weight 4.000
   alg straw
   hash 0  # rjenkins1
   item osd.0 weight 2.000
   item osd.7 weight 2.000
 }
 host b5 {
   id -4   # do not change unnecessarily
   # weight 4.000
   alg straw
   hash 0  # rjenkins1
   item osd.1 weight 2.000
   item osd.6 weight 2.000
 }
 rack unknownrack {
   id -3   # do not change unnecessarily
   # weight 8.000
   alg straw
   hash 0  # rjenkins1
   item b4 weight 4.000
   item b5 weight 4.000
 }
 host b2 {
   id -9   # do not change unnecessarily
   # weight 4.660
   alg straw
   hash 0  # rjenkins1
   item osd.10 weight 1.820
   item 

Re: OSD crash during repair

2013-09-06 Thread Sage Weil
On Fri, 6 Sep 2013, Chris Dunlop wrote:
 On Fri, Sep 06, 2013 at 01:12:21PM +1000, Chris Dunlop wrote:
  On Thu, Sep 05, 2013 at 07:55:52PM -0700, Sage Weil wrote:
  On Fri, 6 Sep 2013, Chris Dunlop wrote:
  Hi Sage,
  
  Does this answer your question?
  
  2013-09-06 09:30:19.813811 7f0ae8cbc700  0 log [INF] : applying 
  configuration change: internal_safe_to_start_threads = 'true'
  2013-09-06 09:33:28.303658 7f0ae94bd700  0 log [ERR] : 2.12 osd.7: soid 
  56987a12/rb.0.17d9b.2ae8944a.1e11/head//2 extra attr _, extra 
  attr snapset
  2013-09-06 09:33:28.303685 7f0ae94bd700  0 log [ERR] : repair 2.12 
  56987a12/rb.0.17d9b.2ae8944a.1e11/head//2 no 'snapset' attr
  2013-09-06 09:34:45.138468 7f0ae94bd700  0 log [ERR] : 2.12 repair stat 
  mismatch, got 2722/2723 objects, 339/339 clones, 11307104768/11311299072 
  bytes.
  2013-09-06 09:34:45.142215 7f0ae94bd700  0 log [ERR] : 2.12 repair 0 
  missing, 1 inconsistent objects
  2013-09-06 09:34:45.206621 7f0ae94bd700 -1 *** Caught signal (Aborted) **
  
  I've just attached the full 'debug_osd 0/10' log to the bug report.
  
  This suggests to me that the object on osd.6 is missing those xattrs; can 
  you confirm with getfattr -d on the in osd.6's data directory?
  
  I haven't yet wrapped my head around how to translate an oid
  like those above into a underlying file system object. What 
  directory should I be looking at?
 
 Found it:
 
 b5# cd /var/lib/ceph/osd/ceph-6/current
 b5# find 2.12* | grep -i 17d9b.2ae8944a.1e11
 2.12_head/DIR_2/DIR_1/DIR_A/rb.0.17d9b.2ae8944a.1e11__head_56987A12__2
 b5# getfattr -d 
 2.12_head/DIR_2/DIR_1/DIR_A/rb.0.17d9b.2ae8944a.1e11__head_56987A12__2
  ...crickets... 
 
 vs.
 
 b4# cd /var/lib/ceph/osd/ceph-7/current
 b4# getfattr -d 
 2.12_head/DIR_2/DIR_1/DIR_A/rb.0.17d9b.2ae8944a.1e11__head_56987A12__2
 # file: 
 2.12_head/DIR_2/DIR_1/DIR_A/rb.0.17d9b.2ae8944a.1e11__head_56987A12__2
 user.ceph._=0sCgjhBANBACByYi4wLjE3ZDliLjJhZTg5NDRhLjAwMDAwMDAwMWUxMf7/EnqYVgAAAgAEAxACAP8AAEInCgAAuEsAAEEnCgAAuEsAAAICFQgTmwEAAHD1AgAAQAAAyY4dUpjCTSACAhUAAABCJwoAALhL
 user.ceph.snapset=0sAgIZAAABAA==
 
  If that is indeed the case, you should be able to move the object out of 
  the way (don't delete it, just in case) and then do the repair.  The osd.6 
  should recover by copying the object from osd.7 (which has the needed 
  xattrs).  Bobtail is smart enough to recover missing objects but not to 
  recover just missing xattrs.
  
  Do you want me to hold off on any repairs to allow tracking down
  the crash, or is the current code sufficiently different that
  there's little point?
 
 Repaired! ...but why does it take multiple rounds?

Excellent!

It's because the first round repairs the object, but doesn't take its own 
change into account when verifying/recalculating the PG stats (object 
count, byte sum).  The second pass just fixes up that arithmetic.

sage

 
 b5# mv 
 2.12_head/DIR_2/DIR_1/DIR_A/rb.0.17d9b.2ae8944a.1e11__head_56987A12__2
  ..
 
 b5# ceph pg repair 2.12
 b5# while ceph -s | grep -q scrubbing; do sleep 60; done
 b5# tail /var/log/ceph/ceph-osd.6.log
 2013-09-06 15:02:13.751160 7f6ccc5ae700  0 log [ERR] : 2.12 osd.6 missing 
 56987a12/rb.0.17d9b.2ae8944a.1e11/head//2
 2013-09-06 15:04:15.286711 7f6ccc5ae700  0 log [ERR] : 2.12 repair stat 
 mismatch, got 2723/2724 objects, 339/339 clones, 11311299072/11315493376 
 bytes.
 2013-09-06 15:04:15.286766 7f6ccc5ae700  0 log [ERR] : 2.12 repair 1 missing, 
 0 inconsistent objects
 2013-09-06 15:04:15.286823 7f6ccc5ae700  0 log [ERR] : 2.12 repair 2 errors, 
 2 fixed
 2013-09-06 15:04:20.778377 7f6ccc5ae700  0 log [ERR] : 2.12 scrub stat 
 mismatch, got 2724/2723 objects, 339/339 clones, 11315493376/11311299072 
 bytes.
 2013-09-06 15:04:20.778383 7f6ccc5ae700  0 log [ERR] : 2.12 scrub 1 errors
 
 b5# ceph pg dump | grep inconsistent
 2.1227230   0   0   11311299072 159103  159103  
 active+clean+inconsistent   2013-09-06 15:04:20.778413  20121'690883  
   20128'7941893   [6,7]   [6,7]   20121'6908832013-09-06 15:04:20.778387  
 20121'6908832013-09-06 15:04:15.286835
 
 b5# ceph pg repair 2.12
 b5# while ceph -s | grep -q scrubbing; do sleep 60; done
 b5# tail /var/log/ceph/ceph-osd.6.log
 2013-09-06 15:07:30.461959 7f6ccc5ae700  0 log [ERR] : 2.12 repair stat 
 mismatch, got 2724/2723 objects, 339/339 clones, 11315493376/11311299072 
 bytes.
 2013-09-06 15:07:30.461991 7f6ccc5ae700  0 log [ERR] : 2.12 repair 1 errors, 
 1 fixed
 
 b5# ceph pg dump | grep inconsistent
 2.1227240   0   0   11315493376 159580  159580  
 active+clean+inconsistent   2013-09-06 15:07:30.462039  20129'690886  
   20128'7942171   [6,7]   [6,7]   20129'690886

Re: OSD crash during repair

2013-09-06 Thread Sage Weil
On Fri, 6 Sep 2013, Chris Dunlop wrote:
 On Thu, Sep 05, 2013 at 07:55:52PM -0700, Sage Weil wrote:
  On Fri, 6 Sep 2013, Chris Dunlop wrote:
  Hi Sage,
  
  Does this answer your question?
  
  2013-09-06 09:30:19.813811 7f0ae8cbc700  0 log [INF] : applying 
  configuration change: internal_safe_to_start_threads = 'true'
  2013-09-06 09:33:28.303658 7f0ae94bd700  0 log [ERR] : 2.12 osd.7: soid 
  56987a12/rb.0.17d9b.2ae8944a.1e11/head//2 extra attr _, extra attr 
  snapset
  2013-09-06 09:33:28.303685 7f0ae94bd700  0 log [ERR] : repair 2.12 
  56987a12/rb.0.17d9b.2ae8944a.1e11/head//2 no 'snapset' attr
  2013-09-06 09:34:45.138468 7f0ae94bd700  0 log [ERR] : 2.12 repair stat 
  mismatch, got 2722/2723 objects, 339/339 clones, 11307104768/11311299072 
  bytes.
  2013-09-06 09:34:45.142215 7f0ae94bd700  0 log [ERR] : 2.12 repair 0 
  missing, 1 inconsistent objects
  2013-09-06 09:34:45.206621 7f0ae94bd700 -1 *** Caught signal (Aborted) **
  
  I've just attached the full 'debug_osd 0/10' log to the bug report.
  
  This suggests to me that the object on osd.6 is missing those xattrs; can 
  you confirm with getfattr -d on the in osd.6's data directory?
 
 I haven't yet wrapped my head around how to translate an oid
 like those above into a underlying file system object. What 
 directory should I be looking at?

It's the osd.6 data directory (maybe /var/lib/ceph/osd/ceph-6, or 
whatever you configured), 
/current/$pgid_head/.../*rb.0.17d9b.2ae8944a.1e11*.
In your case $pgid is 2.12.  Do a 

 find . | grep rb.0.17d9b.2ae8944a.1e11

and you will see it pop up (with head in there along with some other 
stuff).  getfattr -d $file to confirm the user.ceph._ and 
user.ceph.snapset xattrs are missing.  I would also confirm that they are 
present on the same file in osd.7's data directory.  Maybe do a sanity 
check to make sure the objects otherwise look like they match (file size, 
md5sum, etc.).  Assuming osd.7 doesn't look obviously wrong (e.g., 0 
bytes or something), rename the bad osd.6 copy out of the way and let 
repair recover it for you.

Note that you might have to do repair twice to make the pg stats number 
reflect the just-repaired object.
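Putting those steps together, a rough sketch (the backup destination is a 
placeholder; the object path is the one that turns up in this thread):

$ cd /var/lib/ceph/osd/ceph-6/current
$ find . -name '*rb.0.17d9b.2ae8944a.1e11*'   # locate osd.6's on-disk copy in pg 2.12
$ getfattr -d 2.12_head/DIR_2/DIR_1/DIR_A/rb.0.17d9b.2ae8944a.1e11__head_56987A12__2   # confirm user.ceph._ / user.ceph.snapset are absent
$ mv 2.12_head/DIR_2/DIR_1/DIR_A/rb.0.17d9b.2ae8944a.1e11__head_56987A12__2 /root/ceph-backup/   # move the bad copy aside, don't delete it
$ ceph pg repair 2.12                         # may need a second run to fix up the pg stats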

  If that is indeed the case, you should be able to move the object out of 
  the way (don't delete it, just in case) and then do the repair.  The osd.6 
  should recover by copying the object from osd.7 (which has the needed 
  xattrs).  Bobtail is smart enough to recover missing objects but not to 
  recover just missing xattrs.
 
 Do you want me to hold off on any repairs to allow tracking down
 the crash, or is the current code sufficiently different that
 there's little point?

There is little point with bobtail.
 
  Also, you should upgrade to dumpling.  :)
 
 I've been considering it. It was initially a little scary with
 the various issues that were cropping up but that all seems to
 have quietened down.
 
 Of course I'd like my cluster to be clean before attempting an upgrade!

Definitely.  Let us know how it goes! :)

sage


OSD crash during repair

2013-09-05 Thread Chris Dunlop
G'day,

I'm getting an OSD crash on 0.56.7-1~bpo70+1 whilst trying to repair an OSD:

http://tracker.ceph.com/issues/6233


ceph version 0.56.7 (14f23ab86b0058a8651895b3dc972a29459f3a33)
 1: /usr/bin/ceph-osd() [0x8530a2]
 2: (()+0xf030) [0x7f541ca39030]
 3: (gsignal()+0x35) [0x7f541b132475]
 4: (abort()+0x180) [0x7f541b1356f0]
 5: (_gnu_cxx::_verbose_terminate_handler()+0x11d) [0x7f541b98789d]
 6: (()+0x63996) [0x7f541b985996]
 7: (()+0x639c3) [0x7f541b9859c3]
 8: (()+0x63bee) [0x7f541b985bee]
 9: (ceph::buffer::list::iterator::copy(unsigned int, char*)+0x127) [0x8fa9a7]
 10: (object_info_t::decode(ceph::buffer::list::iterator)+0x29) [0x95b579]
 11: (object_info_t::object_info_t(ceph::buffer::list)+0x180) [0x695ec0]
 12: (PG::repair_object(hobject_t const, ScrubMap::object*, int, int)+0xc7) 
[0x7646b7]
 13: (PG::scrub_process_inconsistent()+0x9bd) [0x76534d]
 14: (PG::scrub_finish()+0x4f) [0x76587f]
 15: (PG::chunky_scrub(ThreadPool::TPHandle)+0x10d6) [0x76cb96]
 16: (PG::scrub(ThreadPool::TPHandle)+0x138) [0x76d7e8]
 17: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle)+0xf) [0x70515f]
 18: (ThreadPool::worker(ThreadPool::WorkThread*)+0x992) [0x8f0542]
 19: (ThreadPool::WorkThread::entry()+0x10) [0x8f14d0]
 20: (()+0x6b50) [0x7f541ca30b50]
 21: (clone()+0x6d) [0x7f541b1daa7d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed 
to interpret this.


This occurs as a result of:

# ceph pg dump | grep inconsistent
2.1227230   0   0   11311299072 159189  159189  
active+clean+inconsistent   2013-09-06 09:35:47.512119  20117'690441
20120'7914185   [6,7]   [6,7]   20021'6759672013-09-03 15:58:12.459188  
19384'6654042013-08-28 12:42:07.490877
# ceph pg repair 2.12

Looking at PG::repair_object per line 12 of the backtrace, I can see a
dout(10) which should tell me the problem object:


src/osd/PG.cc:
void PG::repair_object(const hobject_t& soid, ScrubMap::object *po, int
bad_peer, int ok_peer)
{
  dout(10) << "repair_object " << soid << " bad_peer osd." << bad_peer
           << " ok_peer osd." << ok_peer << dendl;
  ...
}


The 'ceph pg dump' output above tells me the primary osd is '6', so I
can increase the logging level to 10 on osd.6 to get the debug output,
and repair again:

# ceph osd tell 6 injectargs '--debug_osd 0/10'
# ceph pg repair 2.12

I get the same OSD crash, but this time it logs the dout from above,
which shows the problem object:

-1 2013-09-06 09:34:45.142224 7f0ae94bd700 10 osd.6 pg_epoch: 20117 
pg[2.12( v 20117'690441 (20117'689440,20117'690441] local-les=20115 n=2722 ec=1 
les/c 20115/20115 20108/20112/20112) [6,7] r=0 lpr=20112 mlcod 20117'690440 
active+scrubbing+deep+repair] repair_object 
56987a12/rb.0.17d9b.2ae8944a.1e11/head//2 bad_peer osd.7 ok_peer osd.6
 0 2013-09-06 09:34:45.206621 7f0ae94bd700 -1 *** Caught signal (Aborted) 
**

So...

Firstly, is anyone interested in further investigating the problem to
fix the crash behaviour?

And, what's the best way to fix the pool?

Cheers,

Chris
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: OSD crash during repair

2013-09-05 Thread Sage Weil
Hi Chris,

What is the inconsistency that scrub reports in the log?  My guess is that 
the simplest way to resolve this is to remove whichever copy you decide is 
invalid, but it depends on what the inconsistency it is trying/failing to 
repair is.

Thanks!
sage


On Fri, 6 Sep 2013, Chris Dunlop wrote:

 G'day,
 
 I'm getting an OSD crash on 0.56.7-1~bpo70+1 whilst trying to repair an OSD:
 
 http://tracker.ceph.com/issues/6233
 
 
 ceph version 0.56.7 (14f23ab86b0058a8651895b3dc972a29459f3a33)
  1: /usr/bin/ceph-osd() [0x8530a2]
  2: (()+0xf030) [0x7f541ca39030]
  3: (gsignal()+0x35) [0x7f541b132475]
  4: (abort()+0x180) [0x7f541b1356f0]
  5: (_gnu_cxx::_verbose_terminate_handler()+0x11d) [0x7f541b98789d]
  6: (()+0x63996) [0x7f541b985996]
  7: (()+0x639c3) [0x7f541b9859c3]
  8: (()+0x63bee) [0x7f541b985bee]
  9: (ceph::buffer::list::iterator::copy(unsigned int, char*)+0x127) [0x8fa9a7]
  10: (object_info_t::decode(ceph::buffer::list::iterator)+0x29) [0x95b579]
  11: (object_info_t::object_info_t(ceph::buffer::list)+0x180) [0x695ec0]
  12: (PG::repair_object(hobject_t const, ScrubMap::object*, int, int)+0xc7) 
 [0x7646b7]
  13: (PG::scrub_process_inconsistent()+0x9bd) [0x76534d]
  14: (PG::scrub_finish()+0x4f) [0x76587f]
  15: (PG::chunky_scrub(ThreadPool::TPHandle)+0x10d6) [0x76cb96]
  16: (PG::scrub(ThreadPool::TPHandle)+0x138) [0x76d7e8]
  17: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle)+0xf) [0x70515f]
  18: (ThreadPool::worker(ThreadPool::WorkThread*)+0x992) [0x8f0542]
  19: (ThreadPool::WorkThread::entry()+0x10) [0x8f14d0]
  20: (()+0x6b50) [0x7f541ca30b50]
  21: (clone()+0x6d) [0x7f541b1daa7d]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is 
 needed to interpret this.
 
 
 This occurs as a result of:
 
 # ceph pg dump | grep inconsistent
 2.1227230   0   0   11311299072 159189  159189  
 active+clean+inconsistent   2013-09-06 09:35:47.512119  20117'690441  
   20120'7914185   [6,7]   [6,7]   20021'6759672013-09-03 15:58:12.459188  
 19384'6654042013-08-28 12:42:07.490877
 # ceph pg repair 2.12
 
 Looking at PG::repair_object per line 12 of the backtrace, I can see a
 dout(10) which should tell me the problem object:
 
 
 src/osd/PG.cc:
 void PG::repair_object(const hobject_t& soid, ScrubMap::object *po, int
 bad_peer, int ok_peer)
 {
   dout(10) << "repair_object " << soid << " bad_peer osd." << bad_peer
            << " ok_peer osd." << ok_peer << dendl;
   ...
 }
 
 
 The 'ceph pg dump' output above tells me the primary osd is '6', so I
 can increase the logging level to 10 on osd.6 to get the debug output,
 and repair again:
 
 # ceph osd tell 6 injectargs '--debug_osd 0/10'
 # ceph pg repair 2.12
 
 I get the same OSD crash, but this time it logs the dout from above,
 which shows the problem object:
 
 -1 2013-09-06 09:34:45.142224 7f0ae94bd700 10 osd.6 pg_epoch: 20117 
 pg[2.12( v 20117'690441 (20117'689440,20117'690441] local-les=20115 n=2722 
 ec=1 les/c 20115/20115 20108/20112/20112) [6,7] r=0 lpr=20112 mlcod 
 20117'690440 active+scrubbing+deep+repair] repair_object 
 56987a12/rb.0.17d9b.2ae8944a.1e11/head//2 bad_peer osd.7 ok_peer osd.6
  0 2013-09-06 09:34:45.206621 7f0ae94bd700 -1 *** Caught signal 
 (Aborted) **
 
 So...
 
 Firstly, is anyone interested in further investigating the problem to
 fix the crash behaviour?
 
 And, what's the best way to fix the pool?
 
 Cheers,
 
 Chris
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 
 
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: OSD crash during repair

2013-09-05 Thread Chris Dunlop
Hi Sage,

Does this answer your question?

2013-09-06 09:30:19.813811 7f0ae8cbc700  0 log [INF] : applying configuration 
change: internal_safe_to_start_threads = 'true'
2013-09-06 09:33:28.303658 7f0ae94bd700  0 log [ERR] : 2.12 osd.7: soid 
56987a12/rb.0.17d9b.2ae8944a.1e11/head//2 extra attr _, extra attr 
snapset
2013-09-06 09:33:28.303685 7f0ae94bd700  0 log [ERR] : repair 2.12 
56987a12/rb.0.17d9b.2ae8944a.1e11/head//2 no 'snapset' attr
2013-09-06 09:34:45.138468 7f0ae94bd700  0 log [ERR] : 2.12 repair stat 
mismatch, got 2722/2723 objects, 339/339 clones, 11307104768/11311299072 bytes.
2013-09-06 09:34:45.142215 7f0ae94bd700  0 log [ERR] : 2.12 repair 0 missing, 1 
inconsistent objects
2013-09-06 09:34:45.206621 7f0ae94bd700 -1 *** Caught signal (Aborted) **

I've just attached the full 'debug_osd 0/10' log to the bug report.

Thanks,

Chris

On Thu, Sep 05, 2013 at 07:38:47PM -0700, Sage Weil wrote:
 Hi Chris,
 
 What is the inconsistency that scrub reports in the log?  My guess is that 
 the simplest way to resolve this is to remove whichever copy you decide is 
 invalid, but it depends on what the inconsistency it is trying/failing to 
 repair is.
 
 Thanks!
 sage
 
 
 On Fri, 6 Sep 2013, Chris Dunlop wrote:
 
  G'day,
  
  I'm getting an OSD crash on 0.56.7-1~bpo70+1 whilst trying to repair an OSD:
  
  http://tracker.ceph.com/issues/6233
  
  
  ceph version 0.56.7 (14f23ab86b0058a8651895b3dc972a29459f3a33)
   1: /usr/bin/ceph-osd() [0x8530a2]
   2: (()+0xf030) [0x7f541ca39030]
   3: (gsignal()+0x35) [0x7f541b132475]
   4: (abort()+0x180) [0x7f541b1356f0]
   5: (_gnu_cxx::_verbose_terminate_handler()+0x11d) [0x7f541b98789d]
   6: (()+0x63996) [0x7f541b985996]
   7: (()+0x639c3) [0x7f541b9859c3]
   8: (()+0x63bee) [0x7f541b985bee]
   9: (ceph::buffer::list::iterator::copy(unsigned int, char*)+0x127) 
  [0x8fa9a7]
   10: (object_info_t::decode(ceph::buffer::list::iterator)+0x29) [0x95b579]
   11: (object_info_t::object_info_t(ceph::buffer::list)+0x180) [0x695ec0]
   12: (PG::repair_object(hobject_t const, ScrubMap::object*, int, 
  int)+0xc7) [0x7646b7]
   13: (PG::scrub_process_inconsistent()+0x9bd) [0x76534d]
   14: (PG::scrub_finish()+0x4f) [0x76587f]
   15: (PG::chunky_scrub(ThreadPool::TPHandle)+0x10d6) [0x76cb96]
   16: (PG::scrub(ThreadPool::TPHandle)+0x138) [0x76d7e8]
   17: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle)+0xf) [0x70515f]
   18: (ThreadPool::worker(ThreadPool::WorkThread*)+0x992) [0x8f0542]
   19: (ThreadPool::WorkThread::entry()+0x10) [0x8f14d0]
   20: (()+0x6b50) [0x7f541ca30b50]
   21: (clone()+0x6d) [0x7f541b1daa7d]
   NOTE: a copy of the executable, or `objdump -rdS <executable>` is 
  needed to interpret this.
  
  
  This occurs as a result of:
  
  # ceph pg dump | grep inconsistent
  2.1227230   0   0   11311299072 159189  159189  
  active+clean+inconsistent   2013-09-06 09:35:47.512119  
  20117'69044120120'7914185   [6,7]   [6,7]   20021'6759672013-09-03 
  15:58:12.459188  19384'6654042013-08-28 12:42:07.490877
  # ceph pg repair 2.12
  
  Looking at PG::repair_object per line 12 of the backtrace, I can see a
  dout(10) which should tell me the problem object:
  
  
  src/osd/PG.cc:
  void PG::repair_object(const hobject_t& soid, ScrubMap::object *po, int
  bad_peer, int ok_peer)
  {
    dout(10) << "repair_object " << soid << " bad_peer osd." << bad_peer
             << " ok_peer osd." << ok_peer << dendl;
    ...
  }
  
  
  The 'ceph pg dump' output above tells me the primary osd is '6', so I
  can increase the logging level to 10 on osd.6 to get the debug output,
  and repair again:
  
  # ceph osd tell 6 injectargs '--debug_osd 0/10'
  # ceph pg repair 2.12
  
  I get the same OSD crash, but this time it logs the dout from above,
  which shows the problem object:
  
  -1 2013-09-06 09:34:45.142224 7f0ae94bd700 10 osd.6 pg_epoch: 20117 
  pg[2.12( v 20117'690441 (20117'689440,20117'690441] local-les=20115 n=2722 
  ec=1 les/c 20115/20115 20108/20112/20112) [6,7] r=0 lpr=20112 mlcod 
  20117'690440 active+scrubbing+deep+repair] repair_object 
  56987a12/rb.0.17d9b.2ae8944a.1e11/head//2 bad_peer osd.7 ok_peer 
  osd.6
   0 2013-09-06 09:34:45.206621 7f0ae94bd700 -1 *** Caught signal 
  (Aborted) **
  
  So...
  
  Firstly, is anyone interested in further investigating the problem to
  fix the crash behaviour?
  
  And, what's the best way to fix the pool?
  
  Cheers,
  
  Chris
  --
  To unsubscribe from this list: send the line unsubscribe ceph-devel in
  the body of a message to majord...@vger.kernel.org
  More majordomo info at  http://vger.kernel.org/majordomo-info.html
  
  
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: OSD crash during repair

2013-09-05 Thread Sage Weil
On Fri, 6 Sep 2013, Chris Dunlop wrote:
 Hi Sage,
 
 Does this answer your question?
 
 2013-09-06 09:30:19.813811 7f0ae8cbc700  0 log [INF] : applying configuration 
 change: internal_safe_to_start_threads = 'true'
 2013-09-06 09:33:28.303658 7f0ae94bd700  0 log [ERR] : 2.12 osd.7: soid 
 56987a12/rb.0.17d9b.2ae8944a.1e11/head//2 extra attr _, extra attr 
 snapset
 2013-09-06 09:33:28.303685 7f0ae94bd700  0 log [ERR] : repair 2.12 
 56987a12/rb.0.17d9b.2ae8944a.1e11/head//2 no 'snapset' attr
 2013-09-06 09:34:45.138468 7f0ae94bd700  0 log [ERR] : 2.12 repair stat 
 mismatch, got 2722/2723 objects, 339/339 clones, 11307104768/11311299072 
 bytes.
 2013-09-06 09:34:45.142215 7f0ae94bd700  0 log [ERR] : 2.12 repair 0 missing, 
 1 inconsistent objects
 2013-09-06 09:34:45.206621 7f0ae94bd700 -1 *** Caught signal (Aborted) **
 
 I've just attached the full 'debug_osd 0/10' log to the bug report.

This suggests to me that the object on osd.6 is missing those xattrs; can 
you confirm with getfattr -d on the file in osd.6's data directory?

If that is indeed the case, you should be able to move the object out of 
the way (don't delete it, just in case) and then do the repair.  The osd.6 
should recover by copying the object from osd.7 (which has the needed 
xattrs).  Bobtail is smart enough to recover missing objects but not to 
recover just missing xattrs.

Also, you should upgrade to dumpling.  :)

sage



 
 Thanks,
 
 Chris
 
 On Thu, Sep 05, 2013 at 07:38:47PM -0700, Sage Weil wrote:
  Hi Chris,
  
  What is the inconsistency that scrub reports in the log?  My guess is that 
  the simplest way to resolve this is to remove whichever copy you decide is 
  invalid, but it depends on what the inconsistency it is trying/failing to 
  repair is.
  
  Thanks!
  sage
  
  
  On Fri, 6 Sep 2013, Chris Dunlop wrote:
  
   G'day,
   
   I'm getting an OSD crash on 0.56.7-1~bpo70+1 whilst trying to repair an 
   OSD:
   
   http://tracker.ceph.com/issues/6233
   
   
   ceph version 0.56.7 (14f23ab86b0058a8651895b3dc972a29459f3a33)
1: /usr/bin/ceph-osd() [0x8530a2]
2: (()+0xf030) [0x7f541ca39030]
3: (gsignal()+0x35) [0x7f541b132475]
4: (abort()+0x180) [0x7f541b1356f0]
5: (_gnu_cxx::_verbose_terminate_handler()+0x11d) [0x7f541b98789d]
6: (()+0x63996) [0x7f541b985996]
7: (()+0x639c3) [0x7f541b9859c3]
8: (()+0x63bee) [0x7f541b985bee]
9: (ceph::buffer::list::iterator::copy(unsigned int, char*)+0x127) 
   [0x8fa9a7]
10: (object_info_t::decode(ceph::buffer::list::iterator)+0x29) 
   [0x95b579]
11: (object_info_t::object_info_t(ceph::buffer::list)+0x180) [0x695ec0]
12: (PG::repair_object(hobject_t const, ScrubMap::object*, int, 
   int)+0xc7) [0x7646b7]
13: (PG::scrub_process_inconsistent()+0x9bd) [0x76534d]
14: (PG::scrub_finish()+0x4f) [0x76587f]
15: (PG::chunky_scrub(ThreadPool::TPHandle)+0x10d6) [0x76cb96]
16: (PG::scrub(ThreadPool::TPHandle)+0x138) [0x76d7e8]
17: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle)+0xf) [0x70515f]
18: (ThreadPool::worker(ThreadPool::WorkThread*)+0x992) [0x8f0542]
19: (ThreadPool::WorkThread::entry()+0x10) [0x8f14d0]
20: (()+0x6b50) [0x7f541ca30b50]
21: (clone()+0x6d) [0x7f541b1daa7d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is 
   needed to interpret this.
   
   
   This occurs as a result of:
   
   # ceph pg dump | grep inconsistent
   2.1227230   0   0   11311299072 159189  159189  
   active+clean+inconsistent   2013-09-06 09:35:47.512119  
   20117'69044120120'7914185   [6,7]   [6,7]   20021'675967
   2013-09-03 15:58:12.459188  19384'6654042013-08-28 12:42:07.490877
   # ceph pg repair 2.12
   
   Looking at PG::repair_object per line 12 of the backtrace, I can see a
   dout(10) which should tell me the problem object:
   
   
   src/osd/PG.cc:
   void PG::repair_object(const hobject_t& soid, ScrubMap::object *po, int
   bad_peer, int ok_peer)
   {
     dout(10) << "repair_object " << soid << " bad_peer osd." << bad_peer
              << " ok_peer osd." << ok_peer << dendl;
     ...
   }
   
   
   The 'ceph pg dump' output above tells me the primary osd is '6', so I
   can increase the logging level to 10 on osd.6 to get the debug output,
   and repair again:
   
   # ceph osd tell 6 injectargs '--debug_osd 0/10'
   # ceph pg repair 2.12
   
   I get the same OSD crash, but this time it logs the dout from above,
   which shows the problem object:
   
   -1 2013-09-06 09:34:45.142224 7f0ae94bd700 10 osd.6 pg_epoch: 20117 
   pg[2.12( v 20117'690441 (20117'689440,20117'690441] local-les=20115 
   n=2722 ec=1 les/c 20115/20115 20108/20112/20112) [6,7] r=0 lpr=20112 
   mlcod 20117'690440 active+scrubbing+deep+repair] repair_object 
   56987a12/rb.0.17d9b.2ae8944a.1e11/head//2 bad_peer osd.7 ok_peer 
   osd.6
0 2013-09-06 09:34:45.206621 7f0ae94bd700 -1 *** Caught signal 
   (Aborted) **
   
   So

Re: OSD crash during repair

2013-09-05 Thread Chris Dunlop
On Thu, Sep 05, 2013 at 07:55:52PM -0700, Sage Weil wrote:
 On Fri, 6 Sep 2013, Chris Dunlop wrote:
 Hi Sage,
 
 Does this answer your question?
 
 2013-09-06 09:30:19.813811 7f0ae8cbc700  0 log [INF] : applying 
 configuration change: internal_safe_to_start_threads = 'true'
 2013-09-06 09:33:28.303658 7f0ae94bd700  0 log [ERR] : 2.12 osd.7: soid 
 56987a12/rb.0.17d9b.2ae8944a.1e11/head//2 extra attr _, extra attr 
 snapset
 2013-09-06 09:33:28.303685 7f0ae94bd700  0 log [ERR] : repair 2.12 
 56987a12/rb.0.17d9b.2ae8944a.1e11/head//2 no 'snapset' attr
 2013-09-06 09:34:45.138468 7f0ae94bd700  0 log [ERR] : 2.12 repair stat 
 mismatch, got 2722/2723 objects, 339/339 clones, 11307104768/11311299072 
 bytes.
 2013-09-06 09:34:45.142215 7f0ae94bd700  0 log [ERR] : 2.12 repair 0 
 missing, 1 inconsistent objects
 2013-09-06 09:34:45.206621 7f0ae94bd700 -1 *** Caught signal (Aborted) **
 
 I've just attached the full 'debug_osd 0/10' log to the bug report.
 
 This suggests to me that the object on osd.6 is missing those xattrs; can 
  you confirm with getfattr -d on the file in osd.6's data directory?

I haven't yet wrapped my head around how to translate an oid
 like those above into an underlying file system object. What 
directory should I be looking at?

 If that is indeed the case, you should be able to move the object out of 
 the way (don't delete it, just in case) and then do the repair.  The osd.6 
 should recover by copying the object from osd.7 (which has the needed 
 xattrs).  Bobtail is smart enough to recover missing objects but not to 
 recover just missing xattrs.

Do you want me to hold off on any repairs to allow tracking down
the crash, or is the current code sufficiently different that
there's little point?

 Also, you should upgrade to dumpling.  :)

I've been considering it. It was initially a little scary with
the various issues that were cropping up but that all seems to
have quietened down.

Of course I'd like my cluster to be clean before attempting an upgrade!

Chris
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: OSD crash during repair

2013-09-05 Thread Chris Dunlop
On Fri, Sep 06, 2013 at 01:12:21PM +1000, Chris Dunlop wrote:
 On Thu, Sep 05, 2013 at 07:55:52PM -0700, Sage Weil wrote:
 On Fri, 6 Sep 2013, Chris Dunlop wrote:
 Hi Sage,
 
 Does this answer your question?
 
 2013-09-06 09:30:19.813811 7f0ae8cbc700  0 log [INF] : applying 
 configuration change: internal_safe_to_start_threads = 'true'
 2013-09-06 09:33:28.303658 7f0ae94bd700  0 log [ERR] : 2.12 osd.7: soid 
 56987a12/rb.0.17d9b.2ae8944a.1e11/head//2 extra attr _, extra attr 
 snapset
 2013-09-06 09:33:28.303685 7f0ae94bd700  0 log [ERR] : repair 2.12 
 56987a12/rb.0.17d9b.2ae8944a.1e11/head//2 no 'snapset' attr
 2013-09-06 09:34:45.138468 7f0ae94bd700  0 log [ERR] : 2.12 repair stat 
 mismatch, got 2722/2723 objects, 339/339 clones, 11307104768/11311299072 
 bytes.
 2013-09-06 09:34:45.142215 7f0ae94bd700  0 log [ERR] : 2.12 repair 0 
 missing, 1 inconsistent objects
 2013-09-06 09:34:45.206621 7f0ae94bd700 -1 *** Caught signal (Aborted) **
 
 I've just attached the full 'debug_osd 0/10' log to the bug report.
 
 This suggests to me that the object on osd.6 is missing those xattrs; can 
  you confirm with getfattr -d on the file in osd.6's data directory?
 
 I haven't yet wrapped my head around how to translate an oid
 like those above into an underlying file system object. What 
 directory should I be looking at?

Found it:

b5# cd /var/lib/ceph/osd/ceph-6/current
b5# find 2.12* | grep -i 17d9b.2ae8944a.1e11
2.12_head/DIR_2/DIR_1/DIR_A/rb.0.17d9b.2ae8944a.1e11__head_56987A12__2
b5# getfattr -d 
2.12_head/DIR_2/DIR_1/DIR_A/rb.0.17d9b.2ae8944a.1e11__head_56987A12__2
 ...crickets... 

vs.

b4# cd /var/lib/ceph/osd/ceph-7/current
b4# getfattr -d 
2.12_head/DIR_2/DIR_1/DIR_A/rb.0.17d9b.2ae8944a.1e11__head_56987A12__2
# file: 
2.12_head/DIR_2/DIR_1/DIR_A/rb.0.17d9b.2ae8944a.1e11__head_56987A12__2
user.ceph._=0sCgjhBANBACByYi4wLjE3ZDliLjJhZTg5NDRhLjAwMDAwMDAwMWUxMf7/EnqYVgAAAgAEAxACAP8AAEInCgAAuEsAAEEnCgAAuEsAAAICFQgTmwEAAHD1AgAAQAAAyY4dUpjCTSACAhUAAABCJwoAALhL
user.ceph.snapset=0sAgIZAAABAA==

 If that is indeed the case, you should be able to move the object out of 
 the way (don't delete it, just in case) and then do the repair.  The osd.6 
 should recover by copying the object from osd.7 (which has the needed 
 xattrs).  Bobtail is smart enough to recover missing objects but not to 
 recover just missing xattrs.
 
 Do you want me to hold off on any repairs to allow tracking down
 the crash, or is the current code sufficiently different that
 there's little point?

Repaired! ...but why does it take multiple rounds?

b5# mv 
2.12_head/DIR_2/DIR_1/DIR_A/rb.0.17d9b.2ae8944a.1e11__head_56987A12__2 
..

b5# ceph pg repair 2.12
b5# while ceph -s | grep -q scrubbing; do sleep 60; done
b5# tail /var/log/ceph/ceph-osd.6.log
2013-09-06 15:02:13.751160 7f6ccc5ae700  0 log [ERR] : 2.12 osd.6 missing 
56987a12/rb.0.17d9b.2ae8944a.1e11/head//2
2013-09-06 15:04:15.286711 7f6ccc5ae700  0 log [ERR] : 2.12 repair stat 
mismatch, got 2723/2724 objects, 339/339 clones, 11311299072/11315493376 bytes.
2013-09-06 15:04:15.286766 7f6ccc5ae700  0 log [ERR] : 2.12 repair 1 missing, 0 
inconsistent objects
2013-09-06 15:04:15.286823 7f6ccc5ae700  0 log [ERR] : 2.12 repair 2 errors, 2 
fixed
2013-09-06 15:04:20.778377 7f6ccc5ae700  0 log [ERR] : 2.12 scrub stat 
mismatch, got 2724/2723 objects, 339/339 clones, 11315493376/11311299072 bytes.
2013-09-06 15:04:20.778383 7f6ccc5ae700  0 log [ERR] : 2.12 scrub 1 errors

b5# ceph pg dump | grep inconsistent
2.1227230   0   0   11311299072 159103  159103  
active+clean+inconsistent   2013-09-06 15:04:20.778413  20121'690883
20128'7941893   [6,7]   [6,7]   20121'6908832013-09-06 15:04:20.778387  
20121'6908832013-09-06 15:04:15.286835

b5# ceph pg repair 2.12
b5# while ceph -s | grep -q scrubbing; do sleep 60; done
b5# tail /var/log/ceph/ceph-osd.6.log
2013-09-06 15:07:30.461959 7f6ccc5ae700  0 log [ERR] : 2.12 repair stat 
mismatch, got 2724/2723 objects, 339/339 clones, 11315493376/11311299072 bytes.
2013-09-06 15:07:30.461991 7f6ccc5ae700  0 log [ERR] : 2.12 repair 1 errors, 1 
fixed

b5# ceph pg dump | grep inconsistent
2.1227240   0   0   11315493376 159580  159580  
active+clean+inconsistent   2013-09-06 15:07:30.462039  20129'690886
20128'7942171   [6,7]   [6,7]   20129'6908862013-09-06 15:07:30.461995  
20129'6908862013-09-06 15:07:30.461995

b5# ceph pg repair 2.12
b5# while ceph -s | grep -q scrubbing; do sleep 60; done
b5# tail /var/log/ceph/ceph-osd.6.log
2013-09-06 15:09:36.993049 7f6ccc5ae700  0 log [INF] : 2.12 repair ok, 0 fixed

# ceph pg dump | grep inconsistent
 ...crickets... 


Chris

OSD crash upon pool creation

2013-07-15 Thread Andrey Korolyov
Hello,

Using db2bb270e93ed44f9252d65d1d4c9b36875d0ea5 I observed some
disaster-like behavior after the ``pool create'' command - every osd
daemon in the cluster died at least once (some crashed several times in
a row after being brought back). Please take a look at the
backtraces (almost identical) below. Issue #5637 has been created in the
tracker.

Thanks!

http://xdel.ru/downloads/poolcreate.txt.gz
http://xdel.ru/downloads/poolcreate2.txt.gz
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


OSD Crash

2013-03-04 Thread Dave Spano
I had one of my OSDs crash yesterday. I'm using ceph version 0.56.3 
(6eb7e15a4783b122e9b0c85ea9ba064145958aa5). 

The part of the log file where the crash happened is attached. Not really sure 
what led up to it, but I did get an alert from my server monitor telling me my 
swap space got really low around the time it crashed. 

The OSD reconnected after restarting the service. Currently, I'm waiting 
patiently as 1 of my 400 pgs gets out of active+clean+scrubbing status. 

Dave Spano 
Optogenics 
Systems Administrator 


   -17 2013-03-03 13:02:13.478152 7f5d5a9b5700  5 --OSD::tracker-- reqid: client.13039.0:6860359, seq: 5393222, time: 2013-03-03 13:02:13.478134, event: write_thread_in_journal_buffer, request: osd_sub_op(client.13039.0:6860359 3.0 a10c17c8/rb.0.2dd7.16d28c4f.002f/head//3 [] v 411'1980074 snapset=0=[]:[] snapc=0=[]) v7
   -16 2013-03-03 13:02:13.478153 7f5d559ab700  1 -- 192.168.3.11:6801/4500 -- osd.1 192.168.3.12:6802/2467 -- osd_sub_op_reply(client.14000.1:570700 0.16 5e01a96/13797f2./head//0 [] ondisk, result = 0) v1 -- ?+0 0xc45cc80
   -15 2013-03-03 13:02:13.478184 7f5d5a9b5700  5 --OSD::tracker-- reqid: client.14000.1:570701, seq: 5393223, time: 2013-03-03 13:02:13.478184, event: write_thread_in_journal_buffer, request: osd_sub_op(client.14000.1:570701 0.22 40dccca2/11164ca.0002/head//0 [] v 411'447369 snapset=0=[]:[] snapc=0=[]) v7
   -14 2013-03-03 13:02:13.478209 7f5d5a9b5700  5 --OSD::tracker-- reqid: client.11755.0:2625658, seq: 5393225, time: 2013-03-03 13:02:13.478209, event: write_thread_in_journal_buffer, request: osd_sub_op(client.11755.0:2625658 3.7 2cb006a7/rb.0.2ea4.614c277f.103d/head//3 [] v 411'6095529 snapset=0=[]:[] snapc=0=[]) v7
   -13 2013-03-03 13:02:13.478234 7f5d5a9b5700  5 --OSD::tracker-- reqid: client.11755.0:2625659, seq: 5393226, time: 2013-03-03 13:02:13.478234, event: write_thread_in_journal_buffer, request: osd_sub_op(client.11755.0:2625659 3.7 2cb006a7/rb.0.2ea4.614c277f.103d/head//3 [] v 411'6095530 snapset=0=[]:[] snapc=0=[]) v7
   -12 2013-03-03 13:02:13.484696 7f5d549a9700  1 -- 192.168.3.11:6800/4500 == client.11755 192.168.1.64:0/1062411 90128  ping v1  0+0+0 (0 0 0) 0xff4e000 con 0x307a6e0
   -11 2013-03-03 13:02:13.489457 7f5d4f99f700  5 --OSD::tracker-- reqid: client.11755.0:2625660, seq: 5393227, time: 2013-03-03 13:02:13.489457, event: started, request: osd_sub_op(client.11755.0:2625660 3.7 2cb006a7/rb.0.2ea4.614c277f.103d/head//3 [] v 411'6095531 snapset=0=[]:[] snapc=0=[]) v7
   -10 2013-03-03 13:02:13.489503 7f5d4f99f700  5 --OSD::tracker-- reqid: client.11755.0:2625660, seq: 5393227, time: 2013-03-03 13:02:13.489503, event: commit_queued_for_journal_write, request: osd_sub_op(client.11755.0:2625660 3.7 2cb006a7/rb.0.2ea4.614c277f.103d/head//3 [] v 411'6095531 snapset=0=[]:[] snapc=0=[]) v7
-9 2013-03-03 13:02:13.571632 7f5d501a0700  5 --OSD::tracker-- reqid: client.11755.0:2625657, seq: 5393224, time: 2013-03-03 13:02:13.571631, event: started, request: osd_op(client.11755.0:2625657 rb.0.2ea4.614c277f.003d [write 1253376~4096] 3.c7bd6ff1) v4
-8 2013-03-03 13:02:13.571661 7f5d501a0700  5 --OSD::tracker-- reqid: client.11755.0:2625657, seq: 5393224, time: 2013-03-03 13:02:13.571661, event: started, request: osd_op(client.11755.0:2625657 rb.0.2ea4.614c277f.003d [write 1253376~4096] 3.c7bd6ff1) v4
-7 2013-03-03 13:02:13.571733 7f5d501a0700  5 --OSD::tracker-- reqid: client.11755.0:2625657, seq: 5393224, time: 2013-03-03 13:02:13.571733, event: waiting for subops from [1], request: osd_op(client.11755.0:2625657 rb.0.2ea4.614c277f.003d [write 1253376~4096] 3.c7bd6ff1) v4
-6 2013-03-03 13:02:13.598028 7f5d5a9b5700  5 --OSD::tracker-- reqid: client.13039.0:6860359, seq: 5393222, time: 2013-03-03 13:02:13.598027, event: journaled_completion_queued, request: osd_sub_op(client.13039.0:6860359 3.0 a10c17c8/rb.0.2dd7.16d28c4f.002f/head//3 [] v 411'1980074 snapset=0=[]:[] snapc=0=[]) v7
-5 2013-03-03 13:02:13.598061 7f5d5a9b5700  5 --OSD::tracker-- reqid: client.14000.1:570701, seq: 5393223, time: 2013-03-03 13:02:13.598061, event: journaled_completion_queued, request: osd_sub_op(client.14000.1:570701 0.22 40dccca2/11164ca.0002/head//0 [] v 411'447369 snapset=0=[]:[] snapc=0=[]) v7
-4 2013-03-03 13:02:13.598081 7f5d5a9b5700  5 --OSD::tracker-- reqid: client.11755.0:2625658, seq: 5393225, time: 2013-03-03 13:02:13.598081, event: journaled_completion_queued, request: osd_sub_op(client.11755.0:2625658 3.7 2cb006a7/rb.0.2ea4.614c277f.103d/head//3 [] v 411'6095529 snapset=0=[]:[] snapc=0=[]) v7
-3 2013-03-03 13:02:13.598098 7f5d5a9b5700  5 --OSD::tracker-- reqid: client.11755.0:2625659, seq: 5393226, time: 2013-03-03 13:02:13.598098, event: journaled_completion_queued, request: osd_sub_op(client.11755.0:2625659 3.7 2cb006a7/rb.0.2ea4.614c277f.103d/head//3 [] v 411'6095530 snapset=0=[]:[] snapc=0=[]) v7
 

Re: OSD crash, ceph version 0.56.1

2013-01-09 Thread Ian Pye
On Wed, Jan 9, 2013 at 4:38 PM, Sage Weil s...@inktank.com wrote:
 On Wed, 9 Jan 2013, Ian Pye wrote:
 Hi,

 Every time I try to bring up an OSD, it crashes and I get the
 following: error (121) Remote I/O error not handled on operation 20

 This error code (EREMOTEIO) is not used by Ceph.  What fs are you using?
 Which kernel version?  Anything else unusual happen with your hardware
 recently that might have wreaked havoc on your underlying fs?

3.7.1 kernel with XFS. It's a demo box from a vendor, so it should be brand new.

I'm going to say it's a disk error, given the following:

mkfs.xfs: read failed: Input/output error

Interestingly, running an osd and btrfs worked fine on the same disk.
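
For completeness, a quick sketch of the checks I'd run to confirm a failing
disk (/dev/sdX is a placeholder for the actual device):

dmesg | grep -iE 'i/o error|sector'   # kernel-level read errors
smartctl -a /dev/sdX                  # SMART health and error counters (smartmontools)
badblocks -sv /dev/sdX                # read-only surface scan; slow on large disks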

Thanks for the help,

Ian


 sage



 The cluster is new and only has a little bit of data on it. Any ideas
 what is going on? Does Remote I/O mean a network error? Full log
 below:

-9 2013-01-10 00:00:20.182237 7f2ddde8f910  0
 filestore(/mnt/dist_j/ceph)  error (121) Remote I/O error not handled
 on operation 20 (12.0.0, or op 0, counting from 0)
 -8 2013-01-10 00:00:20.182275 7f2ddde8f910  0
 filestore(/mnt/dist_j/ceph) unexpected error code
 -7 2013-01-10 00:00:20.182285 7f2ddde8f910  0
 filestore(/mnt/dist_j/ceph)  transaction dump:
 { ops: [
 { op_num: 0,
   op_name: mkcoll,
   collection: 0.2c0_head},
 { op_num: 1,
   op_name: collection_setattr,
   collection: 0.2c0_head,
   name: info,
   length: 5},
 { op_num: 2,
   op_name: truncate,
   collection: meta,
   oid: a04c46e9\/pginfo_0.2c0\/0\/\/-1,
   offset: 0},
 { op_num: 3,
   op_name: write,
   collection: meta,
   oid: a04c46e9\/pginfo_0.2c0\/0\/\/-1,
   length: 531,
   offset: 0,
   bufferlist length: 531},
 { op_num: 4,
   op_name: remove,
   collection: meta,
   oid: 1f9ede85\/pglog_0.2c0\/0\/\/-1},
 { op_num: 5,
   op_name: write,
   collection: meta,
   oid: 1f9ede85\/pglog_0.2c0\/0\/\/-1,
   length: 0,
   offset: 0,
   bufferlist length: 0},
 { op_num: 6,
   op_name: collection_setattr,
   collection: 0.2c0_head,
   name: ondisklog,
   length: 34},
 { op_num: 7,
   op_name: nop}]}
 -6 2013-01-10 00:00:20.183085 7f2dd5e7f910 10 monclient:
 _send_mon_message to mon.a at 108.162.209.120:6789/0
 -5 2013-01-10 00:00:20.183108 7f2dd5e7f910  1 --
 108.162.209.120:6834/6359 -- 108.162.209.120:6789/0 -- osd_pgtemp(e22
 {0.110=[8,9],0.147=[3,9],0.155=[1,9],0.171=[0,9],0.194=[3,9],0.1ad=[10,9],0.1c2=[1,9],0.1cb=[7,9],0.1df=[6,9],0.1e8=[7,9],0.1e9=[11,9],0.1f1=[7,9]}
 v22) v1 -- ?+0 0x5b15600 con 0x34629a0
 -4 2013-01-10 00:00:20.183772 7f2dd6680910 10 monclient:
 _send_mon_message to mon.a at 108.162.209.120:6789/0
 -3 2013-01-10 00:00:20.183797 7f2dd6680910  1 --
 108.162.209.120:6834/6359 -- 108.162.209.120:6789/0 -- osd_pgtemp(e22
 {0.110=[8,9],0.147=[3,9],0.155=[1,9],0.171=[0,9],0.194=[3,9],0.1ad=[10,9],0.1c2=[1,9],0.1cb=[7,9],0.1df=[6,9],0.1e8=[7,9],0.1e9=[11,9],0.1f1=[7,9]}
 v22) v1 -- ?+0 0x5f75600 con 0x34629a0
 -2 2013-01-10 00:00:20.184315 7f2dd5e7f910 10 monclient:
 _send_mon_message to mon.a at 108.162.209.120:6789/0
 -1 2013-01-10 00:00:20.184338 7f2dd5e7f910  1 --
 108.162.209.120:6834/6359 -- 108.162.209.120:6789/0 -- osd_pgtemp(e22
 {0.110=[8,9],0.147=[3,9],0.155=[1,9],0.171=[0,9],0.194=[3,9],0.1ad=[10,9],0.1c2=[1,9],0.1cb=[7,9],0.1df=[6,9],0.1e8=[7,9],0.1e9=[11,9],0.1f1=[7,9]}
 v22) v1 -- ?+0 0x5b15400 con 0x34629a0
  0 2013-01-10 00:00:20.184755 7f2ddde8f910 -1 os/FileStore.cc: In
 function 'unsigned int
 FileStore::_do_transaction(ObjectStore::Transaction, uint64_t, int)'
 thread 7f2ddde8f910 time 2013-01-10 00:00:20.182422
 os/FileStore.cc: 2681: FAILED assert(0 == unexpected error)

  ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)
  1: (FileStore::_do_transaction(ObjectStore::Transaction, unsigned
 long, int)+0x90a) [0x73e14a]
 2: (FileStore::do_transactions(std::list<ObjectStore::Transaction*,
 std::allocator<ObjectStore::Transaction*> >&, unsigned long)+0x4c)
 [0x7455dc]
  3: (FileStore::_do_op(FileStore::OpSequencer*)+0xab) [0x72428b]
  4: (ThreadPool::worker(ThreadPool::WorkThread*)+0x82b) [0x894feb]
  5: (ThreadPool::WorkThread::entry()+0x10) [0x8977d0]
  6: /lib/libpthread.so.0 [0x7f2de6d087aa]
  7: (clone()+0x6d) [0x7f2de518159d]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
 needed to interpret this.

 --- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 journaler
0/ 5 objectcacher
0/ 5 

osd crash after reboot

2012-12-14 Thread Stefan Priebe

Hello list,

after a reboot of my node i see this on all OSDs of this node after the 
reboot:


2012-12-14 09:03:20.393224 7f8e652f8780 -1 osd/OSD.cc: In function 
'OSDMapRef OSDService::get_map(epoch_t)' thread 7f8e652f8780 time 
2012-12-14 09:03:20.392528

osd/OSD.cc: 4385: FAILED assert(_get_map_bl(epoch, bl))

 ceph version 0.55-239-gc951c27 (c951c270a42b94b6f269992c9001d90f70a2b824)
 1: (OSDService::get_map(unsigned int)+0x918) [0x607f78]
 2: (OSD::load_pgs()+0x13ed) [0x6168ad]
 3: (OSD::init()+0xaff) [0x617a5f]
 4: (main()+0x2de6) [0x55a416]
 5: (__libc_start_main()+0xfd) [0x7f8e63093c8d]
 6: /usr/bin/ceph-osd() [0x557269]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is 
needed to interpret this.


--- begin dump of recent events ---
   -29 2012-12-14 09:03:20.266349 7f8e652f8780  5 asok(0x285c000) 
register_command perfcounters_dump hook 0x2850010
   -28 2012-12-14 09:03:20.266366 7f8e652f8780  5 asok(0x285c000) 
register_command 1 hook 0x2850010
   -27 2012-12-14 09:03:20.266369 7f8e652f8780  5 asok(0x285c000) 
register_command perf dump hook 0x2850010
   -26 2012-12-14 09:03:20.266379 7f8e652f8780  5 asok(0x285c000) 
register_command perfcounters_schema hook 0x2850010
   -25 2012-12-14 09:03:20.266383 7f8e652f8780  5 asok(0x285c000) 
register_command 2 hook 0x2850010
   -24 2012-12-14 09:03:20.266386 7f8e652f8780  5 asok(0x285c000) 
register_command perf schema hook 0x2850010
   -23 2012-12-14 09:03:20.266389 7f8e652f8780  5 asok(0x285c000) 
register_command config show hook 0x2850010
   -22 2012-12-14 09:03:20.266392 7f8e652f8780  5 asok(0x285c000) 
register_command config set hook 0x2850010
   -21 2012-12-14 09:03:20.266396 7f8e652f8780  5 asok(0x285c000) 
register_command log flush hook 0x2850010
   -20 2012-12-14 09:03:20.266398 7f8e652f8780  5 asok(0x285c000) 
register_command log dump hook 0x2850010
   -19 2012-12-14 09:03:20.266401 7f8e652f8780  5 asok(0x285c000) 
register_command log reopen hook 0x2850010
   -18 2012-12-14 09:03:20.267686 7f8e652f8780  0 ceph version 
0.55-239-gc951c27 (c951c270a42b94b6f269992c9001d90f70a2b824), process 
ceph-osd, pid 7212
   -17 2012-12-14 09:03:20.268738 7f8e652f8780  1 finished 
global_init_daemonize
   -16 2012-12-14 09:03:20.275957 7f8e652f8780  0 
filestore(/ceph/osd.1/) mount FIEMAP ioctl is supported and appears to work
   -15 2012-12-14 09:03:20.275968 7f8e652f8780  0 
filestore(/ceph/osd.1/) mount FIEMAP ioctl is disabled via 'filestore 
fiemap' config option
   -14 2012-12-14 09:03:20.276177 7f8e652f8780  0 
filestore(/ceph/osd.1/) mount did NOT detect btrfs
   -13 2012-12-14 09:03:20.277051 7f8e652f8780  0 
filestore(/ceph/osd.1/) mount syscall(__NR_syncfs, fd) fully supported
   -12 2012-12-14 09:03:20.277585 7f8e652f8780  0 
filestore(/ceph/osd.1/) mount found snaps 
   -11 2012-12-14 09:03:20.278899 7f8e652f8780  0 
filestore(/ceph/osd.1/) mount: enabling WRITEAHEAD journal mode: btrfs 
not detected
   -10 2012-12-14 09:03:20.290745 7f8e652f8780  0 journal  kernel 
version is 3.6.10
-9 2012-12-14 09:03:20.320728 7f8e652f8780  0 journal  kernel 
version is 3.6.10
-8 2012-12-14 09:03:20.328381 7f8e652f8780  0 
filestore(/ceph/osd.1/) mount FIEMAP ioctl is supported and appears to work
-7 2012-12-14 09:03:20.328391 7f8e652f8780  0 
filestore(/ceph/osd.1/) mount FIEMAP ioctl is disabled via 'filestore 
fiemap' config option
-6 2012-12-14 09:03:20.328574 7f8e652f8780  0 
filestore(/ceph/osd.1/) mount did NOT detect btrfs
-5 2012-12-14 09:03:20.329579 7f8e652f8780  0 
filestore(/ceph/osd.1/) mount syscall(__NR_syncfs, fd) fully supported
-4 2012-12-14 09:03:20.329612 7f8e652f8780  0 
filestore(/ceph/osd.1/) mount found snaps 
-3 2012-12-14 09:03:20.330786 7f8e652f8780  0 
filestore(/ceph/osd.1/) mount: enabling WRITEAHEAD journal mode: btrfs 
not detected
-2 2012-12-14 09:03:20.340711 7f8e652f8780  0 journal  kernel 
version is 3.6.10
-1 2012-12-14 09:03:20.370707 7f8e652f8780  0 journal  kernel 
version is 3.6.10
 0 2012-12-14 09:03:20.393224 7f8e652f8780 -1 osd/OSD.cc: In 
function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7f8e652f8780 
time 2012-12-14 09:03:20.392528

osd/OSD.cc: 4385: FAILED assert(_get_map_bl(epoch, bl))

 ceph version 0.55-239-gc951c27 (c951c270a42b94b6f269992c9001d90f70a2b824)
 1: (OSDService::get_map(unsigned int)+0x918) [0x607f78]
 2: (OSD::load_pgs()+0x13ed) [0x6168ad]
 3: (OSD::init()+0xaff) [0x617a5f]
 4: (main()+0x2de6) [0x55a416]
 5: (__libc_start_main()+0xfd) [0x7f8e63093c8d]
 6: /usr/bin/ceph-osd() [0x557269]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is 
needed to interpret this.


Stefan
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: osd crash after reboot

2012-12-14 Thread Stefan Priebe

same log more verbose:
11 ec=10 les/c 3307/3307 3306/3306/3306) [] r=0 lpr=0 lcod 0'0 mlcod 0'0 
inactive] read_log done
   -11 2012-12-14 09:17:50.648572 7fb6e0d6b780 10 osd.3 pg_epoch: 3996 
pg[3.44b( v 3988'3969 (1379'2968,3988'3969] local-les=3307 n=11 ec=10 
les/c 3307/3307 3306/3306/3306) [3,12] r=0 lpr=0 lcod 0'0 mlcod 0'0 
inactive] handle_loaded
   -10 2012-12-14 09:17:50.648581 7fb6e0d6b780 20 osd.3 pg_epoch: 3996 
pg[3.44b( v 3988'3969 (1379'2968,3988'3969] local-les=3307 n=11 ec=10 
les/c 3307/3307 3306/3306/3306) [3,12] r=0 lpr=0 lcod 0'0 mlcod 0'0 
inactive] exit Initial 0.015080 0 0.00
-9 2012-12-14 09:17:50.648591 7fb6e0d6b780 20 osd.3 pg_epoch: 3996 
pg[3.44b( v 3988'3969 (1379'2968,3988'3969] local-les=3307 n=11 ec=10 
les/c 3307/3307 3306/3306/3306) [3,12] r=0 lpr=0 lcod 0'0 mlcod 0'0 
inactive] enter Reset
-8 2012-12-14 09:17:50.648599 7fb6e0d6b780 20 osd.3 pg_epoch: 3996 
pg[3.44b( v 3988'3969 (1379'2968,3988'3969] local-les=3307 n=11 ec=10 
les/c 3307/3307 3306/3306/3306) [3,12] r=0 lpr=0 lcod 0'0 mlcod 0'0 
inactive] set_last_peering_reset 3996
-7 2012-12-14 09:17:50.648609 7fb6e0d6b780 10 osd.3 4233 load_pgs 
loaded pg[3.44b( v 3988'3969 (1379'2968,3988'3969] local-les=3307 n=11 
ec=10 les/c 3307/3307 3306/3306/3306) [3,12] r=0 lpr=3996 lcod 0'0 mlcod 
0'0 inactive] log(1379'2968,3988'3969]
-6 2012-12-14 09:17:50.648649 7fb6e0d6b780 15 
filestore(/ceph/osd.3/) collection_getattr /ceph/osd.3//current/0.1_head 
'info'
-5 2012-12-14 09:17:50.648664 7fb6e0d6b780 10 
filestore(/ceph/osd.3/) collection_getattr /ceph/osd.3//current/0.1_head 
'info' = 5
-4 2012-12-14 09:17:50.648672 7fb6e0d6b780 20 osd.3 0 get_map 3316 
- loading and decoding 0x2943e00
-3 2012-12-14 09:17:50.648678 7fb6e0d6b780 15 
filestore(/ceph/osd.3/) read meta/a09ec88/osdmap.3316/0//-1 0~0
-2 2012-12-14 09:17:50.648705 7fb6e0d6b780 10 
filestore(/ceph/osd.3/) error opening file 
/ceph/osd.3//current/meta/DIR_8/DIR_8/osdmap.3316__0_0A09EC88__none with 
flags=0 and mode=0: (2) No such file or directory
-1 2012-12-14 09:17:50.648722 7fb6e0d6b780 10 
filestore(/ceph/osd.3/) FileStore::read(meta/a09ec88/osdmap.3316/0//-1) 
open error: (2) No such file or directory
 0 2012-12-14 09:17:50.649586 7fb6e0d6b780 -1 osd/OSD.cc: In 
function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7fb6e0d6b780 
time 2012-12-14 09:17:50.648733

osd/OSD.cc: 4385: FAILED assert(_get_map_bl(epoch, bl))

 ceph version 0.55-239-gc951c27 (c951c270a42b94b6f269992c9001d90f70a2b824)
 1: (OSDService::get_map(unsigned int)+0x918) [0x607f78]
 2: (OSD::load_pgs()+0x13ed) [0x6168ad]
 3: (OSD::init()+0xaff) [0x617a5f]
 4: (main()+0x2de6) [0x55a416]
 5: (__libc_start_main()+0xfd) [0x7fb6deb06c8d]
 6: /usr/bin/ceph-osd() [0x557269]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is 
needed to interpret this.


--- logging levels ---
   0/ 5 none
   0/ 0 lockdep
   0/ 0 context
   0/ 0 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 0 buffer
   0/ 0 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/20 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/20 osd
   0/ 0 optracker
   0/ 0 objclass
   0/20 filestore
   0/20 journal
   0/ 0 ms
   1/ 5 mon
   0/ 0 monc
   0/ 5 paxos
   0/ 0 tp
   0/ 0 auth
   1/ 5 crypto
   0/ 0 finisher
   0/ 0 heartbeatmap
   0/ 0 perfcounter
   1/ 5 rgw
   1/ 5 hadoop
   1/ 5 javaclient
   0/ 0 asok
   0/ 0 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent10
  max_new 1000
  log_file /var/log/ceph/ceph-osd.3.log
--- end dump of recent events ---
2012-12-14 09:17:50.714676 7fb6e0d6b780 -1 *** Caught signal (Aborted) **
 in thread 7fb6e0d6b780

 ceph version 0.55-239-gc951c27 (c951c270a42b94b6f269992c9001d90f70a2b824)
 1: /usr/bin/ceph-osd() [0x7a1889]
 2: (()+0xeff0) [0x7fb6e0750ff0]
 3: (gsignal()+0x35) [0x7fb6deb1a1b5]
 4: (abort()+0x180) [0x7fb6deb1cfc0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7fb6df3aedc5]
 6: (()+0xcb166) [0x7fb6df3ad166]
 7: (()+0xcb193) [0x7fb6df3ad193]
 8: (()+0xcb28e) [0x7fb6df3ad28e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x7c9) [0x805659]

 10: (OSDService::get_map(unsigned int)+0x918) [0x607f78]
 11: (OSD::load_pgs()+0x13ed) [0x6168ad]
 12: (OSD::init()+0xaff) [0x617a5f]
 13: (main()+0x2de6) [0x55a416]
 14: (__libc_start_main()+0xfd) [0x7fb6deb06c8d]
 15: /usr/bin/ceph-osd() [0x557269]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is 
needed to interpret this.


--- begin dump of recent events ---
 0 2012-12-14 09:17:50.714676 7fb6e0d6b780 -1 *** Caught signal 
(Aborted) **

 in thread 7fb6e0d6b780

 ceph version 0.55-239-gc951c27 (c951c270a42b94b6f269992c9001d90f70a2b824)
 1: /usr/bin/ceph-osd() [0x7a1889]
 2: (()+0xeff0) [0x7fb6e0750ff0]
 3: (gsignal()+0x35) [0x7fb6deb1a1b5]
 4: (abort()+0x180) [0x7fb6deb1cfc0]
 5: 

Re: osd crash after reboot

2012-12-14 Thread Dennis Jacobfeuerborn
On 12/14/2012 10:14 AM, Stefan Priebe wrote:
 One more IMPORTANT note. This might happen due to the fact that a disk was
 missing (disk failure) after the reboot.
 
 fstab and mountpoint are working with UUIDs so they match but the journal
 block device:
 osd journal  = /dev/sde1
 
 didn't match anymore - as the numbers got renumbered due to the failed disk.
 Is there a way to use some kind of UUIDs here too for journal?

You should be able to use /dev/disk/by-uuid/* instead. That should give you
a stable view of the filesystems.

Regards,
  Dennis

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: osd crash after reboot

2012-12-14 Thread Mark Nelson

On 12/14/2012 08:52 AM, Dennis Jacobfeuerborn wrote:

On 12/14/2012 10:14 AM, Stefan Priebe wrote:

One more IMPORTANT note. This might happen due to the fact that a disk was
missing (disk failure) after the reboot.

fstab and mountpoint are working with UUIDs so they match but the journal
block device:
osd journal  = /dev/sde1

didn't match anymore - as the numbers got renumbered due to the failed disk.
Is there a way to use some kind of UUIDs here too for journal?


You should be able to use /dev/disk/by-uuid/* instead. That should give you
a stable view of the filesystems.


I often map partitions to something in /dev/disk/by-partlabel and use 
those in my ceph.conf files.  that way disks can be remapped behind the 
scenes and the ceph configuration doesn't have to change even if disks 
get replaced.
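
For illustration, a ceph.conf fragment along those lines - the label 
osd-device-1-journal is only an example and has to match whatever name the 
partition was given when it was created:

[osd.1]
    osd data = /var/lib/ceph/osd/ceph-1
    osd journal = /dev/disk/by-partlabel/osd-device-1-journal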




Regards,
   Dennis

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html



--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: osd crash after reboot

2012-12-14 Thread Stefan Priebe - Profihost AG

Hello Dennis,

Am 14.12.2012 15:52, schrieb Dennis Jacobfeuerborn:

didn't match anymore - as the numbers got renumbered due to the failed disk.
Is there a way to use some kind of UUIDs here too for journal?


You should be able to use /dev/disk/by-uuid/* instead. That should give you
a stable view of the filesystems.


Good idea, but only partitions containing filesystems are listed there by 
UUID. When the journal uses the partition directly, it does not have a 
filesystem UUID.


But this reminded me of /dev/disk/by-id, and that works fine. I'm now 
using the wwn number.


Greets,
Stefan
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: osd crash after reboot

2012-12-14 Thread Mark Nelson

Hi Stefan,

Here's what I often do when I have a journal and data partition sharing 
a disk:


sudo parted -s -a optimal /dev/$DEV mklabel gpt
sudo parted -s -a optimal /dev/$DEV mkpart osd-device-$i-journal 0% 10G
sudo parted -s -a optimal /dev/$DEV mkpart osd-device-$i-data 10G 100%

Mark

On 12/14/2012 09:11 AM, Stefan Priebe - Profihost AG wrote:

Hi Mark,

but how do I set a label for a partition without a FS, like the journal blockdev?
Am 14.12.2012 16:01, schrieb Mark Nelson:

I often map partitions to something in /dev/disk/by-partlabel and use
those in my ceph.conf files.  that way disks can be remapped behind the
scenes and the ceph configuration doesn't have to change even if disks
get replaced.


Greets,
Stefan


--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: osd crash after reboot

2012-12-14 Thread Stefan Priebe - Profihost AG

Hi Mark,

Am 14.12.2012 16:20, schrieb Mark Nelson:

sudo parted -s -a optimal /dev/$DEV mklabel gpt
sudo parted -s -a optimal /dev/$DEV mkpart osd-device-$i-journal 0% 10G
sudo parted -s -a optimal /dev/$DEV mkpart osd-device-$i-data 10G 100%


My disks are GPT too and I'm also using parted. But I don't want to 
recreate my partitions. I haven't seen a way in parted to set such a 
label later.
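
As far as I can tell, on a GPT disk parted treats the first mkpart argument as 
the partition name (which is why those partitions show up under 
/dev/disk/by-partlabel), and an existing partition can be renamed in place - a 
sketch, with the partition number and label as examples only:

parted /dev/sde name 1 osd-device-1-journal
partprobe /dev/sde     # or udevadm trigger, so the by-partlabel symlink appears
ls -l /dev/disk/by-partlabel/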


Greets,
Stefan
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: osd crash after reboot

2012-12-14 Thread Stefan Priebe - Profihost AG

Hello Mark,

Am 14.12.2012 16:20, schrieb Mark Nelson:

sudo parted -s -a optimal /dev/$DEV mklabel gpt
sudo parted -s -a optimal /dev/$DEV mkpart osd-device-$i-journal 0% 10G
sudo parted -s -a optimal /dev/$DEV mkpart osd-device-$i-data 10G 100%


Isn't that the part type you're using?
mkpart part-type start-mb end-mb

I like your idea and I think it's a good one, but I want to know why this 
works. part-type isn't a FS label...


Greets,
Stefan
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: osd crash after reboot

2012-12-14 Thread Sage Weil
On Fri, 14 Dec 2012, Stefan Priebe wrote:
 One more IMPORTANT note. This might happen due to the fact that a disk was
 missing (disk failure) after the reboot.
 
 fstab and mountpoint are working with UUIDs so they match but the journal
 block device:
 osd journal  = /dev/sde1
 
 didn't match anymore - as the numbers got renumbered due to the failed disk. Is
 there a way to use some kind of UUIDs here too for journal?

I think others have addressed the uuid question, but one note:

The ceph-osd process has an internal uuid/fingerprint on the journal and 
data dir, and will refuse to start if they don't match.
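
A rough way to look at the data-dir side of that fingerprint (path assumed; I 
believe newer ceph-osd builds also have a --get-journal-fsid option to read 
the uuid out of the journal header, but I'm not sure it exists in 0.55):

cat /var/lib/ceph/osd/ceph-3/fsid    # uuid the osd expects its journal to carry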

sage


 
 Stefan
 
 Am 14.12.2012 09:22, schrieb Stefan Priebe:
  same log more verbose:
  11 ec=10 les/c 3307/3307 3306/3306/3306) [] r=0 lpr=0 lcod 0'0 mlcod 0'0
  inactive] read_log done
  -11 2012-12-14 09:17:50.648572 7fb6e0d6b780 10 osd.3 pg_epoch: 3996
  pg[3.44b( v 3988'3969 (1379'2968,3988'3969] local-les=3307 n=11 ec=10
  les/c 3307/3307 3306/3306/3306) [3,12] r=0 lpr=0 lcod 0'0 mlcod 0'0
  inactive] handle_loaded
  -10 2012-12-14 09:17:50.648581 7fb6e0d6b780 20 osd.3 pg_epoch: 3996
  pg[3.44b( v 3988'3969 (1379'2968,3988'3969] local-les=3307 n=11 ec=10
  les/c 3307/3307 3306/3306/3306) [3,12] r=0 lpr=0 lcod 0'0 mlcod 0'0
  inactive] exit Initial 0.015080 0 0.00
   -9 2012-12-14 09:17:50.648591 7fb6e0d6b780 20 osd.3 pg_epoch: 3996
  pg[3.44b( v 3988'3969 (1379'2968,3988'3969] local-les=3307 n=11 ec=10
  les/c 3307/3307 3306/3306/3306) [3,12] r=0 lpr=0 lcod 0'0 mlcod 0'0
  inactive] enter Reset
   -8 2012-12-14 09:17:50.648599 7fb6e0d6b780 20 osd.3 pg_epoch: 3996
  pg[3.44b( v 3988'3969 (1379'2968,3988'3969] local-les=3307 n=11 ec=10
  les/c 3307/3307 3306/3306/3306) [3,12] r=0 lpr=0 lcod 0'0 mlcod 0'0
  inactive] set_last_peering_reset 3996
   -7 2012-12-14 09:17:50.648609 7fb6e0d6b780 10 osd.3 4233 load_pgs
  loaded pg[3.44b( v 3988'3969 (1379'2968,3988'3969] local-les=3307 n=11
  ec=10 les/c 3307/3307 3306/3306/3306) [3,12] r=0 lpr=3996 lcod 0'0 mlcod
  0'0 inactive] log(1379'2968,3988'3969]
   -6 2012-12-14 09:17:50.648649 7fb6e0d6b780 15
  filestore(/ceph/osd.3/) collection_getattr /ceph/osd.3//current/0.1_head
  'info'
   -5 2012-12-14 09:17:50.648664 7fb6e0d6b780 10
  filestore(/ceph/osd.3/) collection_getattr /ceph/osd.3//current/0.1_head
  'info' = 5
   -4 2012-12-14 09:17:50.648672 7fb6e0d6b780 20 osd.3 0 get_map 3316
  - loading and decoding 0x2943e00
   -3 2012-12-14 09:17:50.648678 7fb6e0d6b780 15
  filestore(/ceph/osd.3/) read meta/a09ec88/osdmap.3316/0//-1 0~0
   -2 2012-12-14 09:17:50.648705 7fb6e0d6b780 10
  filestore(/ceph/osd.3/) error opening file
  /ceph/osd.3//current/meta/DIR_8/DIR_8/osdmap.3316__0_0A09EC88__none with
  flags=0 and mode=0: (2) No such file or directory
   -1 2012-12-14 09:17:50.648722 7fb6e0d6b780 10
  filestore(/ceph/osd.3/) FileStore::read(meta/a09ec88/osdmap.3316/0//-1)
  open error: (2) No such file or directory
0 2012-12-14 09:17:50.649586 7fb6e0d6b780 -1 osd/OSD.cc: In
  function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7fb6e0d6b780
  time 2012-12-14 09:17:50.648733
  osd/OSD.cc: 4385: FAILED assert(_get_map_bl(epoch, bl))
  
ceph version 0.55-239-gc951c27 (c951c270a42b94b6f269992c9001d90f70a2b824)
1: (OSDService::get_map(unsigned int)+0x918) [0x607f78]
2: (OSD::load_pgs()+0x13ed) [0x6168ad]
3: (OSD::init()+0xaff) [0x617a5f]
4: (main()+0x2de6) [0x55a416]
5: (__libc_start_main()+0xfd) [0x7fb6deb06c8d]
6: /usr/bin/ceph-osd() [0x557269]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
  needed to interpret this.
  
  --- logging levels ---
  0/ 5 none
  0/ 0 lockdep
  0/ 0 context
  0/ 0 crush
  1/ 5 mds
  1/ 5 mds_balancer
  1/ 5 mds_locker
  1/ 5 mds_log
  1/ 5 mds_log_expire
  1/ 5 mds_migrator
  0/ 0 buffer
  0/ 0 timer
  0/ 1 filer
  0/ 1 striper
  0/ 1 objecter
  0/ 5 rados
  0/ 5 rbd
  0/20 journaler
  0/ 5 objectcacher
  0/ 5 client
  0/20 osd
  0/ 0 optracker
  0/ 0 objclass
  0/20 filestore
  0/20 journal
  0/ 0 ms
  1/ 5 mon
  0/ 0 monc
  0/ 5 paxos
  0/ 0 tp
  0/ 0 auth
  1/ 5 crypto
  0/ 0 finisher
  0/ 0 heartbeatmap
  0/ 0 perfcounter
  1/ 5 rgw
  1/ 5 hadoop
  1/ 5 javaclient
  0/ 0 asok
  0/ 0 throttle
 -2/-2 (syslog threshold)
 -1/-1 (stderr threshold)
 max_recent10
 max_new 1000
 log_file /var/log/ceph/ceph-osd.3.log
  --- end dump of recent events ---
  2012-12-14 09:17:50.714676 7fb6e0d6b780 -1 *** Caught signal (Aborted) **
in thread 7fb6e0d6b780
  
ceph version 0.55-239-gc951c27 (c951c270a42b94b6f269992c9001d90f70a2b824)
1: /usr/bin/ceph-osd() [0x7a1889]
2: (()+0xeff0) [0x7fb6e0750ff0]
3: (gsignal()+0x35) [0x7fb6deb1a1b5]
4: (abort()+0x180) [0x7fb6deb1cfc0]
5: 

Re: osd crash after reboot

2012-12-14 Thread Stefan Priebe

Hi Sage,

This was just an idea, and I need to fix MY uuid problem. But the 
crash is still a problem in ceph. Have you looked into my log?

Am 14.12.2012 20:42, schrieb Sage Weil:

On Fri, 14 Dec 2012, Stefan Priebe wrote:

One more IMPORTANT note. This might happen due to the fact that a disk was
missing (disk failure) after the reboot.

fstab and mountpoint are working with UUIDs so they match but the journal
block device:
osd journal  = /dev/sde1

didn't match anymore - as the numbers got renumbered due to the failed disk. Is
there a way to use some kind of UUIDs here too for journal?


I think others have addressed the uuid question, but one note:

The ceph-osd process has an internal uuid/fingerprint on the journal and
data dir, and will refuse to start if they don't match.


Stefan
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: OSD crash on 0.48.2argonaut

2012-11-15 Thread Josh Durgin

On 11/14/2012 11:31 PM, eric_yh_c...@wiwynn.com wrote:

Dear All:

I met this issue on one of the osd nodes. Is this a known issue? Thanks!

ceph version 0.48.2argonaut (commit:3e02b2fad88c2a95d9c0c86878f10d1beb780bfe)
  1: /usr/bin/ceph-osd() [0x6edaba]
  2: (()+0xfcb0) [0x7f08b112dcb0]
  3: (gsignal()+0x35) [0x7f08afd09445]
  4: (abort()+0x17b) [0x7f08afd0cbab]
  5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f08b065769d]
  6: (()+0xb5846) [0x7f08b0655846]
  7: (()+0xb5873) [0x7f08b0655873]
  8: (()+0xb596e) [0x7f08b065596e]
  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x1de) [0x7a82fe]
  10: (ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)+0x693) [0x530f83]
  11: (ReplicatedPG::repop_ack(ReplicatedPG::RepGather*, int, int, int, 
eversion_t)+0x159) [0x531ac9]
  12: 
(ReplicatedPG::sub_op_modify_reply(std::tr1::shared_ptr<OpRequest>)+0x15c) 
[0x53251c]
  13: (ReplicatedPG::do_sub_op_reply(std::tr1::shared_ptr<OpRequest>)+0x81) 
[0x54d241]
  14: (PG::do_request(std::tr1::shared_ptr<OpRequest>)+0x1e3) [0x600883]
  15: (OSD::dequeue_op(PG*)+0x238) [0x5bfaf8]
  16: (ThreadPool::worker()+0x4d5) [0x79f835]
  17: (ThreadPool::WorkThread::entry()+0xd) [0x5d87cd]
  18: (()+0x7e9a) [0x7f08b1125e9a]
  19: (clone()+0x6d) [0x7f08afdc54bd]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to 
interpret this.


The log of the crashed osd should show which assert actually failed.
It could be this bug, but I can't tell without knowing which
assert was triggered:

http://tracker.newdream.net/issues/2956
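
For example, something like this pulls the failed assert plus a little context 
out of the log (default log path assumed; replace N with the osd id):

grep -B 5 'FAILED assert' /var/log/ceph/ceph-osd.N.log | tail -n 30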

Josh
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


OSD crash on 0.48.2argonaut

2012-11-14 Thread Eric_YH_Chen
Dear All:

I met this issue on one of the osd nodes. Is this a known issue? Thanks!

ceph version 0.48.2argonaut (commit:3e02b2fad88c2a95d9c0c86878f10d1beb780bfe)
 1: /usr/bin/ceph-osd() [0x6edaba]
 2: (()+0xfcb0) [0x7f08b112dcb0]
 3: (gsignal()+0x35) [0x7f08afd09445]
 4: (abort()+0x17b) [0x7f08afd0cbab]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f08b065769d]
 6: (()+0xb5846) [0x7f08b0655846]
 7: (()+0xb5873) [0x7f08b0655873]
 8: (()+0xb596e) [0x7f08b065596e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x1de) [0x7a82fe]
 10: (ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)+0x693) [0x530f83]
 11: (ReplicatedPG::repop_ack(ReplicatedPG::RepGather*, int, int, int, 
eversion_t)+0x159) [0x531ac9]
 12: (ReplicatedPG::sub_op_modify_reply(std::tr1::shared_ptr<OpRequest>)+0x15c) [0x53251c]
 13: (ReplicatedPG::do_sub_op_reply(std::tr1::shared_ptr<OpRequest>)+0x81) [0x54d241]
 14: (PG::do_request(std::tr1::shared_ptr<OpRequest>)+0x1e3) [0x600883]
 15: (OSD::dequeue_op(PG*)+0x238) [0x5bfaf8]
 16: (ThreadPool::worker()+0x4d5) [0x79f835]
 17: (ThreadPool::WorkThread::entry()+0xd) [0x5d87cd]
 18: (()+0x7e9a) [0x7f08b1125e9a]
 19: (clone()+0x6d) [0x7f08afdc54bd]
 NOTE: a copy of the executable, or `objdump -rdS executable` is needed to 
interpret this.




Re: osd crash in ReplicatedPG::add_object_context_to_pg_stat(ReplicatedPG::ObjectContext*, pg_stat_t*)

2012-10-15 Thread Samuel Just
Do you have a coredump for the crash?  Can you reproduce the crash with:

debug filestore = 20
debug osd = 20

and post the logs?

As far as the incomplete pg goes, can you post the output of

ceph pg pgid query

where pgid is the pgid of the incomplete pg (e.g. 1.34)?

Thanks
-Sam
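
A concrete sketch of the above, with placeholder names (the osd id and the pgid 
1.34 are only illustrative, and injectargs syntax varies a bit across versions):

# in ceph.conf on the affected host, then restart that ceph-osd:
[osd]
    debug osd = 20
    debug filestore = 20

# or injected into a running daemon:
ceph tell osd.<id> injectargs '--debug-osd 20 --debug-filestore 20'

# query the incomplete pg:
ceph pg 1.34 query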

On Thu, Oct 11, 2012 at 3:17 PM, Yann Dupont yann.dup...@univ-nantes.fr wrote:
 Hello everybody.

 I'm currently having problem with 1 of my OSD, crashing with  this trace :

 ceph version 0.52 (commit:e48859474c4944d4ff201ddc9f5fd400e8898173)
  1: /usr/bin/ceph-osd() [0x737879]
  2: (()+0xf030) [0x7f43f0af0030]
  3:
 (ReplicatedPG::add_object_context_to_pg_stat(ReplicatedPG::ObjectContext*,
 pg_stat_t*)+0x292) [0x555262]
  4: (ReplicatedPG::recover_backfill(int)+0x1c1a) [0x55c93a]
  5: (ReplicatedPG::start_recovery_ops(int, PG::RecoveryCtx*)+0x26a)
 [0x563c1a]
  6: (OSD::do_recovery(PG*)+0x39d) [0x5d3c9d]
  7: (OSD::RecoveryWQ::_process(PG*)+0xd) [0x6119fd]
  8: (ThreadPool::worker()+0x82b) [0x7c176b]
  9: (ThreadPool::WorkThread::entry()+0xd) [0x5f609d]
  10: (()+0x6b50) [0x7f43f0ae7b50]
  11: (clone()+0x6d) [0x7f43ef81b78d]

 Restarting gives the same message after some seconds.
 I've been watching the bug tracker but I don't see something related.

 Some informations : kernel is 3.6.1, with standard debian packages from
 ceph.com

 My ceph cluster was running well and stable on 6 osd since june (3
 datacenters, 2 with 2 nodes, 1 with 4 nodes, a replication of 2, and
 adjusted weight to try to balance data evenly). Beginned with the
 then-up-to-date version, then 0.48, 49,50,51... Data store is on XFS.

 I'm currently in the process of growing my ceph from 6 nodes to 12 nodes. 11
 nodes are currently in ceph, for a 130 TB total. Declaring new osd was OK,
 the data has moved quite ok (in fact I had some OSD crash - not
 definitive, the osd restart ok-, maybe related to an error in my new nodes
 network configuration that I discovered after. More on that later, I can
 find the traces, but I'm not sure it's related)

 When ceph was finally stable again, with HEALTH_OK, I decided to reweight
 the osd (that was tuesday). Operation went quite OK, but near the end of
 operation (0,085% left), 1 of my OSD crashed, and won't start again.

 More problematic, with this osd down, I have 1 incomplete PG :

 ceph -s
health HEALTH_WARN 86 pgs backfill; 231 pgs degraded; 4 pgs down; 15 pgs
 incomplete; 4 pgs peering; 134 pgs recovering; 19 pgs stuck inactive; 455
 pgs stuck unclean; recovery 2122878/23181946 degraded (9.157%);
 2321/11590973 unfound (0.020%); 1 near full osd(s)
monmap e1: 3 mons at
 {chichibu=172.20.14.130:6789/0,glenesk=172.20.14.131:6789/0,karuizawa=172.20.14.133:6789/0},
 election epoch 20, quorum 0,1,2 chichibu,glenesk,karuizawa
osdmap e13184: 11 osds: 10 up, 10 in
 pgmap v2399093: 1728 pgs: 165 active, 1270 active+clean, 8
 active+recovering+degraded, 41 active+recovering+degraded+remapped+backfill,
 4 down+peering, 137 active+degraded, 3 active+clean+scrubbing, 15
 incomplete, 40 active+recovering, 45 active+recovering+degraded+backfill;
 44119 GB data, 84824 GB used, 37643 GB / 119 TB avail; 2122878/23181946
 degraded (9.157%); 2321/11590973 unfound (0.020%)
mdsmap e321: 1/1/1 up {0=karuizawa=up:active}, 2 up:standby

 how is it possible as I have a replication of 2  ?

 Is it a known problem ?

 Cheers,



osd crash in ReplicatedPG::add_object_context_to_pg_stat(ReplicatedPG::ObjectContext*, pg_stat_t*)

2012-10-11 Thread Yann Dupont

Hello everybody.

I'm currently having problem with 1 of my OSD, crashing with  this trace :

ceph version 0.52 (commit:e48859474c4944d4ff201ddc9f5fd400e8898173)
 1: /usr/bin/ceph-osd() [0x737879]
 2: (()+0xf030) [0x7f43f0af0030]
 3: 
(ReplicatedPG::add_object_context_to_pg_stat(ReplicatedPG::ObjectContext*, 
pg_stat_t*)+0x292) [0x555262]

 4: (ReplicatedPG::recover_backfill(int)+0x1c1a) [0x55c93a]
 5: (ReplicatedPG::start_recovery_ops(int, PG::RecoveryCtx*)+0x26a) 
[0x563c1a]

 6: (OSD::do_recovery(PG*)+0x39d) [0x5d3c9d]
 7: (OSD::RecoveryWQ::_process(PG*)+0xd) [0x6119fd]
 8: (ThreadPool::worker()+0x82b) [0x7c176b]
 9: (ThreadPool::WorkThread::entry()+0xd) [0x5f609d]
 10: (()+0x6b50) [0x7f43f0ae7b50]
 11: (clone()+0x6d) [0x7f43ef81b78d]

Restarting gives the same message after some seconds.
I've been watching the bug tracker but I don't see something related.

Some information: kernel is 3.6.1, with standard debian packages 
from ceph.com


My ceph cluster had been running well and stable on 6 OSDs since June (3 
datacenters, 2 with 2 nodes, 1 with 4 nodes, a replication of 2, and 
adjusted weights to try to balance data evenly). It began with the 
then-up-to-date version, then 0.48, 49, 50, 51... The data store is on XFS.


I'm currently in the process of growing my ceph from 6 nodes to 12 
nodes. 11 nodes are currently in ceph, for a 130 TB total. Declaring the new 
osds went OK, and the data moved over fine (in fact I had some OSD crashes 
- not permanent, the osds restarted OK - maybe related to an error in my 
new nodes' network configuration that I discovered afterwards. More on that 
later; I can find the traces, but I'm not sure it's related)


When ceph was finally stable again, with HEALTH_OK, I decided to 
reweight the osds (that was Tuesday). The operation went quite OK, but near 
the end (0.085% left), one of my OSDs crashed, and it won't start 
again.


More problematic, with this osd down, I have 1 incomplete PG :

ceph -s
   health HEALTH_WARN 86 pgs backfill; 231 pgs degraded; 4 pgs down; 15 
pgs incomplete; 4 pgs peering; 134 pgs recovering; 19 pgs stuck 
inactive; 455 pgs stuck unclean; recovery 2122878/23181946 degraded 
(9.157%); 2321/11590973 unfound (0.020%); 1 near full osd(s)
   monmap e1: 3 mons at 
{chichibu=172.20.14.130:6789/0,glenesk=172.20.14.131:6789/0,karuizawa=172.20.14.133:6789/0}, 
election epoch 20, quorum 0,1,2 chichibu,glenesk,karuizawa

   osdmap e13184: 11 osds: 10 up, 10 in
pgmap v2399093: 1728 pgs: 165 active, 1270 active+clean, 8 
active+recovering+degraded, 41 
active+recovering+degraded+remapped+backfill, 4 down+peering, 137 
active+degraded, 3 active+clean+scrubbing, 15 incomplete, 40 
active+recovering, 45 active+recovering+degraded+backfill; 44119 GB 
data, 84824 GB used, 37643 GB / 119 TB avail; 2122878/23181946 degraded 
(9.157%); 2321/11590973 unfound (0.020%)

   mdsmap e321: 1/1/1 up {0=karuizawa=up:active}, 2 up:standby

How is that possible, as I have a replication factor of 2?

Is it a known problem ?

Cheers,



OSD-crash on 0.48.1argonout, error void ReplicatedPG::recover_got(hobject_t, eversion_t) not seen on list

2012-09-19 Thread Oliver Francke

Hi all,

after adding a new node into our ceph-cluster yesterday, we had a crash 
of one OSD.


I have found this kind of message in the bugtracker as being solved 
(http://tracker.newdream.net/issues/2075);
I will update that one for my convenience and attach the corresponding log 
(since this is a production site, no more verbose debug output is 
available, sorry).

Other than that, everything went almost smoothly, except for the annoying 
slow requests, which are hopefully not only fixed in 0.51, ... when do we 
expect the next stable release, btw?


The replication was fast, due to a SSD-cached LSI-controller, 4 OSDs per 
node, one per HDD,

1Gbit was completely saturated, time for next step towards 10Gbit ;)

Regards,

Oliver.

--

Oliver Francke

filoo GmbH
Moltkestraße 25a
0 Gütersloh
HRB4355 AG Gütersloh

Geschäftsführer: S.Grewing | J.Rehpöhler | C.Kunz

Folgen Sie uns auf Twitter: http://twitter.com/filoogmbh



Re: OSD crash

2012-09-04 Thread Andrey Korolyov
Hi,

Almost always, one or more osds die when doing overlapped recovery -
e.g. adding a new crushmap and then removing some newly added osds from the
cluster a few minutes later during the remap, or injecting two slightly
different crushmaps within a short time (while surely preserving at least one
of the replicas online). It seems the osds are dying from an excessive amount
of operations in the queue, because under a normal test, e.g. rados, iowait
does not break the one percent barrier, but during recovery it may rise up to
ten percent (2108 w/ cache, disks split as R0 each).

#0  0x7f62f193a445 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x7f62f193db9b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x7f62f2236665 in __gnu_cxx::__verbose_terminate_handler() ()
from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x7f62f2234796 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x7f62f22347c3 in std::terminate() () from
/usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x7f62f22349ee in __cxa_throw () from
/usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00844e11 in ceph::__ceph_assert_fail(char const*, char
const*, int, char const*) ()
#7  0x0073148f in
FileStore::_do_transaction(ObjectStore::Transaction, unsigned long,
int) ()
#8  0x0073484e in
FileStore::do_transactions(std::listObjectStore::Transaction*,
std::allocatorObjectStore::Transaction* , unsigned long) ()
#9  0x0070c680 in FileStore::_do_op(FileStore::OpSequencer*) ()
#10 0x0083ce01 in ThreadPool::worker() ()
#11 0x006823ed in ThreadPool::WorkThread::entry() ()
#12 0x7f62f345ee9a in start_thread () from
/lib/x86_64-linux-gnu/libpthread.so.0
#13 0x7f62f19f64cd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#14 0x in ?? ()
ceph version 0.48.1argonaut (commit:a7ad701b9bd479f20429f19e6fea7373ca6bba7c)

On Sun, Aug 26, 2012 at 8:52 PM, Andrey Korolyov and...@xdel.ru wrote:
 During recovery, following crash happens(simular to
 http://tracker.newdream.net/issues/2126 which marked resolved long
 ago):

 http://xdel.ru/downloads/ceph-log/osd-2012-08-26.txt

 On Sat, Aug 25, 2012 at 12:30 PM, Andrey Korolyov and...@xdel.ru wrote:
 On Thu, Aug 23, 2012 at 4:09 AM, Gregory Farnum g...@inktank.com wrote:
 The tcmalloc backtrace on the OSD suggests this may be unrelated, but
 what's the fd limit on your monitor process? You may be approaching
 that limit if you've got 500 OSDs and a similar number of clients.


 Thanks! I didn`t measured a # of connection because of bearing in mind
 1 conn per client, raising limit did the thing. Previously mentioned
 qemu-kvm zombie does not related to rbd itself - it can be created by
 destroying libvirt domain which is in saving state or vice-versa, so
 I`ll put a workaround on this. Right now I am faced different problem
 - osds dying silently, e.g. not leaving a core, I`ll check logs on the
 next testing phase.

 On Wed, Aug 22, 2012 at 6:55 PM, Andrey Korolyov and...@xdel.ru wrote:
 On Thu, Aug 23, 2012 at 2:33 AM, Sage Weil s...@inktank.com wrote:
 On Thu, 23 Aug 2012, Andrey Korolyov wrote:
 Hi,

 today during heavy test a pair of osds and one mon died, resulting to
 hard lockup of some kvm processes - they went unresponsible and was
 killed leaving zombie processes ([kvm] defunct). Entire cluster
 contain sixteen osd on eight nodes and three mons, on first and last
 node and on vm outside cluster.

 osd bt:
 #0  0x7fc37d490be3 in
 tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*,
 unsigned long, int) () from /usr/lib/libtcmalloc.so.4
 (gdb) bt
 #0  0x7fc37d490be3 in
 tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*,
 unsigned long, int) () from /usr/lib/libtcmalloc.so.4
 #1  0x7fc37d490eb4 in tcmalloc::ThreadCache::Scavenge() () from
 /usr/lib/libtcmalloc.so.4
 #2  0x7fc37d4a2287 in tc_delete () from /usr/lib/libtcmalloc.so.4
 #3  0x008b1224 in _M_dispose (__a=..., this=0x6266d80) at
 /usr/include/c++/4.7/bits/basic_string.h:246
 #4  ~basic_string (this=0x7fc3736639d0, __in_chrg=optimized out) at
 /usr/include/c++/4.7/bits/basic_string.h:536
 #5  ~basic_stringbuf (this=0x7fc373663988, __in_chrg=optimized out)
 at /usr/include/c++/4.7/sstream:60
 #6  ~basic_ostringstream (this=0x7fc373663980, __in_chrg=optimized
 out, __vtt_parm=optimized out) at /usr/include/c++/4.7/sstream:439
 #7  pretty_version_to_str () at common/version.cc:40
 #8  0x00791630 in ceph::BackTrace::print (this=0x7fc373663d10,
 out=...) at common/BackTrace.cc:19
 #9  0x0078f450 in handle_fatal_signal (signum=11) at
 global/signal_handler.cc:91
 #10 signal handler called
 #11 0x7fc37d490be3 in
 tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*,
 unsigned long, int) () from /usr/lib/libtcmalloc.so.4
 #12 0x7fc37d490eb4 in tcmalloc::ThreadCache::Scavenge() () from
 /usr/lib/libtcmalloc.so.4
 #13 0x7fc37d49eb97 in tc_free () from /usr/lib/libtcmalloc.so.4
 #14 

Re: OSD crash

2012-09-04 Thread Sage Weil
On Tue, 4 Sep 2012, Andrey Korolyov wrote:
 Hi,
 
 Almost always one or more osd dies when doing overlapped recovery -
 e.g. add new crushmap and remove some newly added osds from cluster
 some minutes later during remap or inject two slightly different
 crushmaps after a short time(surely preserving at least one of
 replicas online). Seems that osd dying on excessive amount of
 operations in queue because under normal test, e.g. rados, iowait does
 not break one percent barrier but during recovery it may raise up to
 ten percents(2108 w/ cache, splitted disks as R0 each).
 
 #0  0x7f62f193a445 in raise () from /lib/x86_64-linux-gnu/libc.so.6
 #1  0x7f62f193db9b in abort () from /lib/x86_64-linux-gnu/libc.so.6
 #2  0x7f62f2236665 in __gnu_cxx::__verbose_terminate_handler() ()
 from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 #3  0x7f62f2234796 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 #4  0x7f62f22347c3 in std::terminate() () from
 /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 #5  0x7f62f22349ee in __cxa_throw () from
 /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 #6  0x00844e11 in ceph::__ceph_assert_fail(char const*, char
 const*, int, char const*) ()
 #7  0x0073148f in
 FileStore::_do_transaction(ObjectStore::Transaction, unsigned long,
 int) ()

Can you install debug symbols to see what line number this one is (e.g. 
apt-get install ceph-dbg), or check the log file to see what the assert 
failure is?
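
For example, a minimal sequence, assuming the dbg package matches the running 
binary and a core file was kept (the core path is illustrative):

apt-get install ceph-dbg
gdb /usr/bin/ceph-osd /path/to/core
(gdb) bt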

Thanks!
sage


 #8  0x0073484e in
 FileStore::do_transactions(std::listObjectStore::Transaction*,
 std::allocatorObjectStore::Transaction* , unsigned long) ()
 #9  0x0070c680 in FileStore::_do_op(FileStore::OpSequencer*) ()
 #10 0x0083ce01 in ThreadPool::worker() ()
 #11 0x006823ed in ThreadPool::WorkThread::entry() ()
 #12 0x7f62f345ee9a in start_thread () from
 /lib/x86_64-linux-gnu/libpthread.so.0
 #13 0x7f62f19f64cd in clone () from /lib/x86_64-linux-gnu/libc.so.6
 #14 0x in ?? ()
 ceph version 0.48.1argonaut (commit:a7ad701b9bd479f20429f19e6fea7373ca6bba7c)
 
 On Sun, Aug 26, 2012 at 8:52 PM, Andrey Korolyov and...@xdel.ru wrote:
  During recovery, following crash happens(simular to
  http://tracker.newdream.net/issues/2126 which marked resolved long
  ago):
 
  http://xdel.ru/downloads/ceph-log/osd-2012-08-26.txt
 
  On Sat, Aug 25, 2012 at 12:30 PM, Andrey Korolyov and...@xdel.ru wrote:
  On Thu, Aug 23, 2012 at 4:09 AM, Gregory Farnum g...@inktank.com wrote:
  The tcmalloc backtrace on the OSD suggests this may be unrelated, but
  what's the fd limit on your monitor process? You may be approaching
  that limit if you've got 500 OSDs and a similar number of clients.
 
 
  Thanks! I didn`t measured a # of connection because of bearing in mind
  1 conn per client, raising limit did the thing. Previously mentioned
  qemu-kvm zombie does not related to rbd itself - it can be created by
  destroying libvirt domain which is in saving state or vice-versa, so
  I`ll put a workaround on this. Right now I am faced different problem
  - osds dying silently, e.g. not leaving a core, I`ll check logs on the
  next testing phase.
 
  On Wed, Aug 22, 2012 at 6:55 PM, Andrey Korolyov and...@xdel.ru wrote:
  On Thu, Aug 23, 2012 at 2:33 AM, Sage Weil s...@inktank.com wrote:
  On Thu, 23 Aug 2012, Andrey Korolyov wrote:
  Hi,
 
  today during heavy test a pair of osds and one mon died, resulting to
  hard lockup of some kvm processes - they went unresponsible and was
  killed leaving zombie processes ([kvm] defunct). Entire cluster
  contain sixteen osd on eight nodes and three mons, on first and last
  node and on vm outside cluster.
 
  osd bt:
  #0  0x7fc37d490be3 in
  tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*,
  unsigned long, int) () from /usr/lib/libtcmalloc.so.4
  (gdb) bt
  #0  0x7fc37d490be3 in
  tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*,
  unsigned long, int) () from /usr/lib/libtcmalloc.so.4
  #1  0x7fc37d490eb4 in tcmalloc::ThreadCache::Scavenge() () from
  /usr/lib/libtcmalloc.so.4
  #2  0x7fc37d4a2287 in tc_delete () from /usr/lib/libtcmalloc.so.4
  #3  0x008b1224 in _M_dispose (__a=..., this=0x6266d80) at
  /usr/include/c++/4.7/bits/basic_string.h:246
  #4  ~basic_string (this=0x7fc3736639d0, __in_chrg=optimized out) at
  /usr/include/c++/4.7/bits/basic_string.h:536
  #5  ~basic_stringbuf (this=0x7fc373663988, __in_chrg=optimized out)
  at /usr/include/c++/4.7/sstream:60
  #6  ~basic_ostringstream (this=0x7fc373663980, __in_chrg=optimized
  out, __vtt_parm=optimized out) at /usr/include/c++/4.7/sstream:439
  #7  pretty_version_to_str () at common/version.cc:40
  #8  0x00791630 in ceph::BackTrace::print (this=0x7fc373663d10,
  out=...) at common/BackTrace.cc:19
  #9  0x0078f450 in handle_fatal_signal (signum=11) at
  global/signal_handler.cc:91
  #10 signal handler called
  #11 

Re: OSD crash

2012-08-22 Thread Sage Weil
On Thu, 23 Aug 2012, Andrey Korolyov wrote:
 Hi,
 
 today during heavy test a pair of osds and one mon died, resulting to
 hard lockup of some kvm processes - they went unresponsible and was
 killed leaving zombie processes ([kvm] defunct). Entire cluster
 contain sixteen osd on eight nodes and three mons, on first and last
 node and on vm outside cluster.
 
 osd bt:
 #0  0x7fc37d490be3 in
 tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*,
 unsigned long, int) () from /usr/lib/libtcmalloc.so.4
 (gdb) bt
 #0  0x7fc37d490be3 in
 tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*,
 unsigned long, int) () from /usr/lib/libtcmalloc.so.4
 #1  0x7fc37d490eb4 in tcmalloc::ThreadCache::Scavenge() () from
 /usr/lib/libtcmalloc.so.4
 #2  0x7fc37d4a2287 in tc_delete () from /usr/lib/libtcmalloc.so.4
 #3  0x008b1224 in _M_dispose (__a=..., this=0x6266d80) at
 /usr/include/c++/4.7/bits/basic_string.h:246
 #4  ~basic_string (this=0x7fc3736639d0, __in_chrg=optimized out) at
 /usr/include/c++/4.7/bits/basic_string.h:536
 #5  ~basic_stringbuf (this=0x7fc373663988, __in_chrg=optimized out)
 at /usr/include/c++/4.7/sstream:60
 #6  ~basic_ostringstream (this=0x7fc373663980, __in_chrg=optimized
 out, __vtt_parm=optimized out) at /usr/include/c++/4.7/sstream:439
 #7  pretty_version_to_str () at common/version.cc:40
 #8  0x00791630 in ceph::BackTrace::print (this=0x7fc373663d10,
 out=...) at common/BackTrace.cc:19
 #9  0x0078f450 in handle_fatal_signal (signum=11) at
 global/signal_handler.cc:91
 #10 signal handler called
 #11 0x7fc37d490be3 in
 tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*,
 unsigned long, int) () from /usr/lib/libtcmalloc.so.4
 #12 0x7fc37d490eb4 in tcmalloc::ThreadCache::Scavenge() () from
 /usr/lib/libtcmalloc.so.4
 #13 0x7fc37d49eb97 in tc_free () from /usr/lib/libtcmalloc.so.4
 #14 0x7fc37d1c6670 in __gnu_cxx::__verbose_terminate_handler() ()
 from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 #15 0x7fc37d1c4796 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 #16 0x7fc37d1c47c3 in std::terminate() () from
 /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 #17 0x7fc37d1c49ee in __cxa_throw () from
 /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 #18 0x00844e11 in ceph::__ceph_assert_fail (assertion=0x90c01c
 0 == \unexpected error\, file=optimized out, line=3007,
 func=0x90ef80 unsigned int
 FileStore::_do_transaction(ObjectStore::Transaction, uint64_t, int))
 at common/assert.cc:77

This means it got an unexpected error when talking to the file system.  If 
you look in the osd log, it may tell you what that was.  (It may 
not--there isn't usually the other tcmalloc stuff triggered from the 
assert handler.)

What happens if you restart that ceph-osd daemon?
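
For example (the log path here is just the usual default and may differ):

grep -B 5 -E 'FAILED assert|error' /var/log/ceph/ceph-osd.*.log | tail -n 50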

sage


 #19 0x0073148f in FileStore::_do_transaction
 (this=this@entry=0x2cde000, t=..., op_seq=op_seq@entry=429545,
 trans_num=trans_num@entry=0) at os/FileStore.cc:3007
 #20 0x0073484e in FileStore::do_transactions (this=0x2cde000,
 tls=..., op_seq=429545) at os/FileStore.cc:2436
 #21 0x0070c680 in FileStore::_do_op (this=0x2cde000,
 osr=optimized out) at os/FileStore.cc:2259
 #22 0x0083ce01 in ThreadPool::worker (this=0x2cde828) at
 common/WorkQueue.cc:54
 #23 0x006823ed in ThreadPool::WorkThread::entry
 (this=optimized out) at ./common/WorkQueue.h:126
 #24 0x7fc37e3eee9a in start_thread () from
 /lib/x86_64-linux-gnu/libpthread.so.0
 #25 0x7fc37c9864cd in clone () from /lib/x86_64-linux-gnu/libc.so.6
 #26 0x in ?? ()
 
 mon bt was exactly the same as in http://tracker.newdream.net/issues/2762


Re: OSD crash

2012-08-22 Thread Andrey Korolyov
On Thu, Aug 23, 2012 at 2:33 AM, Sage Weil s...@inktank.com wrote:
 On Thu, 23 Aug 2012, Andrey Korolyov wrote:
 Hi,

 today during heavy test a pair of osds and one mon died, resulting to
 hard lockup of some kvm processes - they went unresponsible and was
 killed leaving zombie processes ([kvm] defunct). Entire cluster
 contain sixteen osd on eight nodes and three mons, on first and last
 node and on vm outside cluster.

 osd bt:
 #0  0x7fc37d490be3 in
 tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*,
 unsigned long, int) () from /usr/lib/libtcmalloc.so.4
 (gdb) bt
 #0  0x7fc37d490be3 in
 tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*,
 unsigned long, int) () from /usr/lib/libtcmalloc.so.4
 #1  0x7fc37d490eb4 in tcmalloc::ThreadCache::Scavenge() () from
 /usr/lib/libtcmalloc.so.4
 #2  0x7fc37d4a2287 in tc_delete () from /usr/lib/libtcmalloc.so.4
 #3  0x008b1224 in _M_dispose (__a=..., this=0x6266d80) at
 /usr/include/c++/4.7/bits/basic_string.h:246
 #4  ~basic_string (this=0x7fc3736639d0, __in_chrg=optimized out) at
 /usr/include/c++/4.7/bits/basic_string.h:536
 #5  ~basic_stringbuf (this=0x7fc373663988, __in_chrg=optimized out)
 at /usr/include/c++/4.7/sstream:60
 #6  ~basic_ostringstream (this=0x7fc373663980, __in_chrg=optimized
 out, __vtt_parm=optimized out) at /usr/include/c++/4.7/sstream:439
 #7  pretty_version_to_str () at common/version.cc:40
 #8  0x00791630 in ceph::BackTrace::print (this=0x7fc373663d10,
 out=...) at common/BackTrace.cc:19
 #9  0x0078f450 in handle_fatal_signal (signum=11) at
 global/signal_handler.cc:91
 #10 signal handler called
 #11 0x7fc37d490be3 in
 tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*,
 unsigned long, int) () from /usr/lib/libtcmalloc.so.4
 #12 0x7fc37d490eb4 in tcmalloc::ThreadCache::Scavenge() () from
 /usr/lib/libtcmalloc.so.4
 #13 0x7fc37d49eb97 in tc_free () from /usr/lib/libtcmalloc.so.4
 #14 0x7fc37d1c6670 in __gnu_cxx::__verbose_terminate_handler() ()
 from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 #15 0x7fc37d1c4796 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 #16 0x7fc37d1c47c3 in std::terminate() () from
 /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 #17 0x7fc37d1c49ee in __cxa_throw () from
 /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 #18 0x00844e11 in ceph::__ceph_assert_fail (assertion=0x90c01c
 0 == \unexpected error\, file=optimized out, line=3007,
 func=0x90ef80 unsigned int
 FileStore::_do_transaction(ObjectStore::Transaction, uint64_t, int))
 at common/assert.cc:77

 This means it got an unexpected error when talking to the file system.  If
 you look in the osd log, it may tell you what that was.  (It may
 not--there isn't usually the other tcmalloc stuff triggered from the
 assert handler.)

 What happens if you restart that ceph-osd daemon?

 sage



Unfortunately I had completely disabled logs during the test, so there
is no hint about the assert failure. The main problem was revealed -
the created VMs were pointed at one monitor instead of the set of three, so
there may be some unusual effects (btw, the crashed mon isn't one of those
above, but a neighbor of the crashed osds on the first node). After an IPMI
reset the node came back fine and the cluster behavior seems to be okay - the
stuck kvm I/O somehow prevented even module load/unload on this node, so I
finally decided to do a hard reset. Although I'm using an almost generic
wheezy, glibc was updated to 2.15; maybe that is why this trace appeared for
the first time ever. I'm almost sure the fs did not trigger this crash and
mainly suspect the stuck kvm processes. I'll rerun the test with the same
conditions tomorrow (~500 vms pointed at one mon and very high I/O, but with
osd logging).
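
For reference, a client-side config that lists all three monitors might look 
roughly like this (the addresses are placeholders):

[global]
    mon host = 192.168.0.1:6789, 192.168.0.2:6789, 192.168.0.3:6789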

 #19 0x0073148f in FileStore::_do_transaction
 (this=this@entry=0x2cde000, t=..., op_seq=op_seq@entry=429545,
 trans_num=trans_num@entry=0) at os/FileStore.cc:3007
 #20 0x0073484e in FileStore::do_transactions (this=0x2cde000,
 tls=..., op_seq=429545) at os/FileStore.cc:2436
 #21 0x0070c680 in FileStore::_do_op (this=0x2cde000,
 osr=optimized out) at os/FileStore.cc:2259
 #22 0x0083ce01 in ThreadPool::worker (this=0x2cde828) at
 common/WorkQueue.cc:54
 #23 0x006823ed in ThreadPool::WorkThread::entry
 (this=optimized out) at ./common/WorkQueue.h:126
 #24 0x7fc37e3eee9a in start_thread () from
 /lib/x86_64-linux-gnu/libpthread.so.0
 #25 0x7fc37c9864cd in clone () from /lib/x86_64-linux-gnu/libc.so.6
 #26 0x in ?? ()

 mon bt was exactly the same as in http://tracker.newdream.net/issues/2762

Re: OSD crash

2012-08-22 Thread Gregory Farnum
The tcmalloc backtrace on the OSD suggests this may be unrelated, but
what's the fd limit on your monitor process? You may be approaching
that limit if you've got 500 OSDs and a similar number of clients.
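
A quick way to check and raise it (a sketch; the number is arbitrary, and 
`max open files` only takes effect when the daemon is started by the init script):

cat /proc/$(pidof ceph-mon)/limits | grep 'Max open files'

# in ceph.conf:
[global]
    max open files = 131072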

On Wed, Aug 22, 2012 at 6:55 PM, Andrey Korolyov and...@xdel.ru wrote:
 On Thu, Aug 23, 2012 at 2:33 AM, Sage Weil s...@inktank.com wrote:
 On Thu, 23 Aug 2012, Andrey Korolyov wrote:
 Hi,

 today during heavy test a pair of osds and one mon died, resulting to
 hard lockup of some kvm processes - they went unresponsible and was
 killed leaving zombie processes ([kvm] defunct). Entire cluster
 contain sixteen osd on eight nodes and three mons, on first and last
 node and on vm outside cluster.

 osd bt:
 #0  0x7fc37d490be3 in
 tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*,
 unsigned long, int) () from /usr/lib/libtcmalloc.so.4
 (gdb) bt
 #0  0x7fc37d490be3 in
 tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*,
 unsigned long, int) () from /usr/lib/libtcmalloc.so.4
 #1  0x7fc37d490eb4 in tcmalloc::ThreadCache::Scavenge() () from
 /usr/lib/libtcmalloc.so.4
 #2  0x7fc37d4a2287 in tc_delete () from /usr/lib/libtcmalloc.so.4
 #3  0x008b1224 in _M_dispose (__a=..., this=0x6266d80) at
 /usr/include/c++/4.7/bits/basic_string.h:246
 #4  ~basic_string (this=0x7fc3736639d0, __in_chrg=optimized out) at
 /usr/include/c++/4.7/bits/basic_string.h:536
 #5  ~basic_stringbuf (this=0x7fc373663988, __in_chrg=optimized out)
 at /usr/include/c++/4.7/sstream:60
 #6  ~basic_ostringstream (this=0x7fc373663980, __in_chrg=optimized
 out, __vtt_parm=optimized out) at /usr/include/c++/4.7/sstream:439
 #7  pretty_version_to_str () at common/version.cc:40
 #8  0x00791630 in ceph::BackTrace::print (this=0x7fc373663d10,
 out=...) at common/BackTrace.cc:19
 #9  0x0078f450 in handle_fatal_signal (signum=11) at
 global/signal_handler.cc:91
 #10 signal handler called
 #11 0x7fc37d490be3 in
 tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*,
 unsigned long, int) () from /usr/lib/libtcmalloc.so.4
 #12 0x7fc37d490eb4 in tcmalloc::ThreadCache::Scavenge() () from
 /usr/lib/libtcmalloc.so.4
 #13 0x7fc37d49eb97 in tc_free () from /usr/lib/libtcmalloc.so.4
 #14 0x7fc37d1c6670 in __gnu_cxx::__verbose_terminate_handler() ()
 from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 #15 0x7fc37d1c4796 in ?? () from 
 /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 #16 0x7fc37d1c47c3 in std::terminate() () from
 /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 #17 0x7fc37d1c49ee in __cxa_throw () from
 /usr/lib/x86_64-linux-gnu/libstdc++.so.6
 #18 0x00844e11 in ceph::__ceph_assert_fail (assertion=0x90c01c
 0 == \unexpected error\, file=optimized out, line=3007,
 func=0x90ef80 unsigned int
 FileStore::_do_transaction(ObjectStore::Transaction, uint64_t, int))
 at common/assert.cc:77

 This means it got an unexpected error when talking to the file system.  If
 you look in the osd log, it may tell you what that was.  (It may
 not--there isn't usually the other tcmalloc stuff triggered from the
 assert handler.)

 What happens if you restart that ceph-osd daemon?

 sage



 Unfortunately I have completely disabled logs during test, so there
 are no suggestion of assert_fail. The main problem was revealed -
 created VMs was pointed to one monitor instead set of three, so there
 may be some unusual things(btw, crashed mon isn`t one from above, but
 a neighbor of crashed osds on first node). After IPMI reset node
 returns back well and cluster behavior seems to be okay - stuck kvm
 I/O somehow prevented even other module load|unload on this node, so I
 finally decided to do hard reset. Despite I`m using almost generic
 wheezy, glibc was updated to 2.15, may be because of this my trace
 appears first time ever. I`m almost sure that fs does not triggered
 this crash and mainly suspecting stuck kvm processes. I`ll rerun test
 with same conditions tomorrow(~500 vms pointed to one mon and very
 high I/O, but with osd logging).

 #19 0x0073148f in FileStore::_do_transaction
 (this=this@entry=0x2cde000, t=..., op_seq=op_seq@entry=429545,
 trans_num=trans_num@entry=0) at os/FileStore.cc:3007
 #20 0x0073484e in FileStore::do_transactions (this=0x2cde000,
 tls=..., op_seq=429545) at os/FileStore.cc:2436
 #21 0x0070c680 in FileStore::_do_op (this=0x2cde000,
 osr=optimized out) at os/FileStore.cc:2259
 #22 0x0083ce01 in ThreadPool::worker (this=0x2cde828) at
 common/WorkQueue.cc:54
 #23 0x006823ed in ThreadPool::WorkThread::entry
 (this=optimized out) at ./common/WorkQueue.h:126
 #24 0x7fc37e3eee9a in start_thread () from
 /lib/x86_64-linux-gnu/libpthread.so.0
 #25 0x7fc37c9864cd in clone () from /lib/x86_64-linux-gnu/libc.so.6
 #26 0x in ?? ()

 mon bt was exactly the same as in http://tracker.newdream.net/issues/2762

Re: domino-style OSD crash

2012-07-10 Thread Yann Dupont

On 09/07/2012 19:14, Samuel Just wrote:

Can you restart the node that failed to complete the upgrade with


Well, it's a little bit complicated; I now run those nodes with XFS, 
and I have long-running jobs on them right now, so I can't stop the ceph 
cluster at the moment.


As I've kept the original broken btrfs volumes, I tried this morning 
to run the old osds in parallel, using the $cluster variable. I only 
have partial success.
I tried using different ports for the mons, but ceph wants to use the old 
mon map. I can edit it (epoch 1) but it seems to use 'latest' instead; 
the format isn't compatible with monmaptool and I don't know how to 
inject the modified map into a non-running cluster.
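
For reference, the usual shape of that operation is something like the following 
sketch (mon id, address and paths are placeholders, and option availability 
depends on the ceph version):

ceph-mon -i a --extract-monmap /tmp/monmap     # with the mon stopped
monmaptool --print /tmp/monmap
monmaptool --rm a /tmp/monmap
monmaptool --add a 192.168.0.1:6790 /tmp/monmap
ceph-mon -i a --inject-monmap /tmp/monmap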


Anyway, the osd seems to start fine, and I can reproduce the bug:

debug filestore = 20
debug osd = 20



I've put it in [global]; is that sufficient?



and post the log after an hour or so of running?  The upgrade process
might legitimately take a while.
-Sam
Only 15 minutes running, but ceph-osd is consuming lots of CPU, and an 
strace shows lots of pread calls.


Here is the log :

[..]
2012-07-10 11:33:29.560052 7f3e615ac780  0 
filestore(/CEPH-PROD/data/osd.1) mount syncfs(2) syscall not support by 
glibc
2012-07-10 11:33:29.560062 7f3e615ac780  0 
filestore(/CEPH-PROD/data/osd.1) mount no syncfs(2), but the btrfs SYNC 
ioctl will suffice
2012-07-10 11:33:29.560172 7f3e615ac780 -1 
filestore(/CEPH-PROD/data/osd.1) FileStore::mount : stale version stamp 
detected: 2. Proceeding, do_update is set, performing disk format upgrade.
2012-07-10 11:33:29.560233 7f3e615ac780  0 
filestore(/CEPH-PROD/data/osd.1) mount found snaps 3744666,3746725
2012-07-10 11:33:29.560263 7f3e615ac780 10 
filestore(/CEPH-PROD/data/osd.1)  current/ seq was 3746725
2012-07-10 11:33:29.560267 7f3e615ac780 10 
filestore(/CEPH-PROD/data/osd.1)  most recent snap from 
3744666,3746725 is 3746725
2012-07-10 11:33:29.560280 7f3e615ac780 10 
filestore(/CEPH-PROD/data/osd.1) mount rolling back to consistent snap 
3746725
2012-07-10 11:33:29.839281 7f3e615ac780  5 
filestore(/CEPH-PROD/data/osd.1) mount op_seq is 3746725



... and nothing more.

I'll let it run for 3 hours. If I get another message, I'll let 
you know.


Cheers,

--
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : yann.dup...@univ-nantes.fr



Re: domino-style OSD crash

2012-07-10 Thread Tommi Virtanen
On Tue, Jul 10, 2012 at 2:46 AM, Yann Dupont yann.dup...@univ-nantes.fr wrote:
 As I've keeped the original broken btrfs volumes, I tried this morning to
 run the old osd in parrallel, using the $cluster variable. I only have
 partial success.

The cluster mechanism was never intended for moving existing osds to
other clusters. Trying that might not be a good idea.


Re: domino-style OSD crash

2012-07-10 Thread Yann Dupont

On 10/07/2012 17:56, Tommi Virtanen wrote:

On Tue, Jul 10, 2012 at 2:46 AM, Yann Dupont yann.dup...@univ-nantes.fr wrote:

As I've keeped the original broken btrfs volumes, I tried this morning to
run the old osd in parrallel, using the $cluster variable. I only have
partial success.

The cluster mechanism was never intended for moving existing osds to
other clusters. Trying that might not be a good idea.
Ok, good to know. I saw that the leftover maps could lead to problems, 
but in two words, what are the other associated risks? Basically, if I use 
2 distinct config files,
with different, non-overlapping paths, and different ports for OSD, 
MDS & MON, do we basically have 2 distinct and independent instances?


By the way, is running 2 mon instances with different ports supported?

Cheers,

--
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : yann.dup...@univ-nantes.fr



Re: domino-style OSD crash

2012-07-10 Thread Tommi Virtanen
On Tue, Jul 10, 2012 at 9:39 AM, Yann Dupont yann.dup...@univ-nantes.fr wrote:
 The cluster mechanism was never intended for moving existing osds to
 other clusters. Trying that might not be a good idea.
 Ok, good to know. I saw that the remaining maps could lead to problem, but
 in 2 words, what are the other associated risks ? Basically If I use 2
 distincts config files,
 with differents  non-overlapping paths, and different ports for OSD, MDS 
 MON, we basically have 2 distincts and independant instances ?

Fundamentally, it comes down to this: the two clusters will still have
the same fsid, and you won't be isolated from configuration errors or
leftover state (such as the monmap) in any way. There's a high chance
that your "let's poke around and debug" cluster wrecks your healthy
cluster.

 By the way, is using 2 mon instance with different ports supported ?

Monitors are identified by ip:port. You can have multiple monitors bind to the
same IP address, as long as they get separate ports.
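
As a sketch, the second cluster's monitors could simply sit on a spare port 
(the name and address are placeholders):

[mon.a]
    host = node1
    mon addr = 192.168.0.1:6790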

Naturally, this practically means giving up on high availability.


Re: domino-style OSD crash

2012-07-10 Thread Yann Dupont

On 10/07/2012 19:11, Tommi Virtanen wrote:

On Tue, Jul 10, 2012 at 9:39 AM, Yann Dupont yann.dup...@univ-nantes.fr wrote:

The cluster mechanism was never intended for moving existing osds to
other clusters. Trying that might not be a good idea.

Ok, good to know. I saw that the remaining maps could lead to problem, but
in 2 words, what are the other associated risks ? Basically If I use 2
distincts config files,
with differents  non-overlapping paths, and different ports for OSD, MDS 
MON, we basically have 2 distincts and independant instances ?

Fundamentally, it comes down to this: the two clusters will still have
the same fsid, and you won't be isolated from configuration errors or


Ah, I understand. This is not the case; see:

root@chichibu:~# cat /CEPH/data/osd.0/fsid
f00139fe-478e-4c50-80e2-f7cb359100d4
root@chichibu:~# cat /CEPH-PROD/data/osd.0/fsid
43afd025-330e-4aa8-9324-3e9b0afce794

(CEPH-PROD is the old btrfs volume). /CEPH is the new xfs volume, 
completely redone & reformatted with mkcephfs. The volumes are totally 
independent:


if you want the gory details:

root@chichibu:~# lvs
  LV         VG             Attr   LSize   Origin Snap%  Move Log Copy%  Convert

  ceph-osd   LocalDisk      -wi-a- 225,00g
  mon-btrfs  LocalDisk      -wi-ao  10,00g
  mon-xfs    LocalDisk      -wi-ao  10,00g
  data       ceph-chichibu  -wi-ao   5,00t   <- OLD btrfs, mounted on /CEPH-PROD
  datax      ceph-chichibu  -wi-ao   4,50t   <- NEW xfs, mounted on /CEPH



leftover state (such as the monmap) in any way. There's a high chance
that your let's poke around and debug cluster wrecks your healthy
cluster.


Yes I understand the risk.


By the way, is using 2 mon instance with different ports supported ?

Monitors are identified by ip:port. You can have multiple bind to the
same IP address, as long as they get separate ports.

Naturally, this practically means giving up on high availability.


The idea is not just having 2 mons. I'll still use 3 different machines 
for the mons, but with 2 mon instances on each: one for the current ceph, the 
other for the old ceph.

2x3 Mon.

Cheers,

--
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : yann.dup...@univ-nantes.fr



Re: domino-style OSD crash

2012-07-10 Thread Tommi Virtanen
On Tue, Jul 10, 2012 at 10:36 AM, Yann Dupont
yann.dup...@univ-nantes.fr wrote:
 Fundamentally, it comes down to this: the two clusters will still have
 the same fsid, and you won't be isolated from configuration errors or
 (CEPH-PROD is the old btrfs volume ). /CEPH is new xfs volume, completely
 redone  reformatted with mkcephfs. The volumes are totally independant :

Ahh, you re-created the monitors too. That changes things; then you
have a new random fsid. I understood you had only re-mkfsed the osds.

Doing it like that, your real worry is just the remembered state of
monmaps, osdmaps etc. If the daemons accidentally talk to the wrong
cluster, the fsid *should* protect you from damage; they should get
rejected. Similarly, if you use cephx authentication, the keys won't
match either.

 Naturally, this practically means giving up on high availability.
 The idea is not just having 2 mon. I'll still use 3 differents machines for
 mon, but with 2 mon instance on each. One for the current ceph, the other
 for the old ceph.
 2x3 Mon.

That should be perfectly doable.


Re: domino-style OSD crash

2012-07-09 Thread Samuel Just
Can you restart the node that failed to complete the upgrade with

debug filestore = 20
debug osd = 20

and post the log after an hour or so of running?  The upgrade process
might legitimately take a while.
-Sam

On Sat, Jul 7, 2012 at 1:19 AM, Yann Dupont yann.dup...@univ-nantes.fr wrote:
 On 06/07/2012 19:01, Gregory Farnum wrote:

 On Fri, Jul 6, 2012 at 12:19 AM, Yann Dupont yann.dup...@univ-nantes.fr
 wrote:

 On 05/07/2012 23:32, Gregory Farnum wrote:

 [...]

 ok, so as all nodes were identical, I probably have hit a btrfs bug
 (like
 a
 erroneous out of space ) in more or less the same time. And when 1 osd
 was
 out,


 OH , I didn't finish the sentence... When 1 osd was out, missing data was
 copied on another nodes, probably speeding btrfs problem on those nodes
 (I
 suspect erroneous out of space conditions)

 Ah. How full are/were the disks?


 The OSD nodes were below 50 % (all are 5 To volumes):

 osd.0 : 31%
 osd.1 : 31%
 osd.2 : 39%
 osd.3 : 65%
 no osd.4 :)
 osd.5 : 35%
 osd.6 : 60%
 osd.7 : 42%
 osd.8 : 34%

 all the volumes were using btrfs with lzo compress.

 [...]


 Oh, interesting. Are the broken nodes all on the same set of arrays?


 No. There are 4 completely independant raid arrays, in 4 different
 locations. They are similar (same brand  model, but slighltly different
 disks, and 1 different firmware), all arrays are multipathed. I don't
 think
 the raid array is the problem. We use those particular models since 2/3
 years, and in the logs I don't see any problem that can be caused by the
 storage itself (like scsi or multipath errors)

 I must have misunderstood then. What did you mean by 1 Array for 2 OSD
 nodes?


 I have 8 osd nodes, in 4 different locations (several km away). In each
 location I have 2 nodes and 1 raid Array.
 On each location, each raid array has 16 2To disks, 2 controllers with 4x 8
 Gb FC channels each. The 16 disks are organized in Raid 5 (8 disks for one,
 7 disks for the orher). Each raid set is primary attached to 1 controller,
 and each osd node on the location has acces to the controller with 2
 distinct paths.

 There were no correlation between failed nodes  raid array.


 Cheers,

 --
 Yann Dupont - Service IRTS, DSI Université de Nantes
 Tel : 02.53.48.49.20 - Mail/Jabber : yann.dup...@univ-nantes.fr



Re: domino-style OSD crash

2012-07-09 Thread Tommi Virtanen
On Wed, Jul 4, 2012 at 1:06 AM, Yann Dupont yann.dup...@univ-nantes.fr wrote:
 Well, I probably wasn't clear enough. I talked about crashed FS, but i was
 talking about ceph. The underlying FS (btrfs in that case) of 1 node (and
 only one) has PROBABLY crashed in the past, causing corruption in ceph data
 on this node, and then the subsequent crash of other nodes.

 RIGHT now btrfs on this node is OK. I can access the filesystem without
 errors.

But the LevelDB isn't. Its contents got corrupted, somehow somewhere,
and it really is up to the LevelDB library to tolerate those errors;
we have a simple get/put interface we use, and LevelDB is triggering
an internal error.

 One node had problem with btrfs, leading first to kernel problem , probably
 corruption (in disk/ in memory maybe ?) ,and ultimately to a kernel oops.
 Before that ultimate kernel oops, bad data has been transmitted to other
 (sane) nodes, leading to ceph-osd crash on thoses nodes.

The LevelDB binary contents are not transferred over to other nodes;
this kind of corruption would not spread over the Ceph clustering
mechanisms. It's more likely that you have 4 independently corrupted
LevelDBs. Something in the workload Ceph runs makes that corruption
quite likely.

The information here isn't enough to say whether the cause of the
corruption is btrfs or LevelDB, but the recovery needs to handled by
LevelDB -- and upstream is working on making it more robust:
http://code.google.com/p/leveldb/issues/detail?id=97
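
For what it's worth, LevelDB already exposes a repair entry point; below is a 
hedged C++ sketch of calling it offline against a copy of an OSD's omap 
directory. The path convention is an assumption, this is not a supported ceph 
recovery procedure, and it should only ever be run on a backup copy:

// repair_omap.cc -- build with: g++ repair_omap.cc -o repair_omap -lleveldb
#include <iostream>
#include <leveldb/db.h>

int main(int argc, char** argv) {
    if (argc != 2) {
        std::cerr << "usage: repair_omap <leveldb-dir>" << std::endl;
        return 1;
    }
    leveldb::Options options;
    // RepairDB rewrites the store in place, salvaging whatever it can read.
    leveldb::Status s = leveldb::RepairDB(argv[1], options);
    std::cout << (s.ok() ? "repair finished" : s.ToString()) << std::endl;
    return s.ok() ? 0 : 1;
}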


Re: domino-style OSD crash

2012-07-09 Thread Yann Dupont

On 09/07/2012 19:43, Tommi Virtanen wrote:

On Wed, Jul 4, 2012 at 1:06 AM, Yann Dupont yann.dup...@univ-nantes.fr wrote:

Well, I probably wasn't clear enough. I talked about crashed FS, but i was
talking about ceph. The underlying FS (btrfs in that case) of 1 node (and
only one) has PROBABLY crashed in the past, causing corruption in ceph data
on this node, and then the subsequent crash of other nodes.

RIGHT now btrfs on this node is OK. I can access the filesystem without
errors.

But the LevelDB isn't. It's contents got corrupted, somehow somewhere,
and it really is up to the LevelDB library to tolerate those errors;
we have a simple get/put interface we use, and LevelDB is triggering
an internal error.

Yes, understood.


One node had problem with btrfs, leading first to kernel problem , probably
corruption (in disk/ in memory maybe ?) ,and ultimately to a kernel oops.
Before that ultimate kernel oops, bad data has been transmitted to other
(sane) nodes, leading to ceph-osd crash on thoses nodes.

The LevelDB binary contents are not transferred over to other nodes;

Ok thanks for the clarification ;

this kind of corruption would not spread over the Ceph clustering
mechanisms. It's more likely that you have 4 independently corrupted
LevelDBs. Something in the workload Ceph runs makes that corruption
quite likely.
Very likely: since I reformatted my nodes with XFS I haven't had any 
problems so far.


The information here isn't enough to say whether the cause of the
corruption is btrfs or LevelDB, but the recovery needs to handled by
LevelDB -- and upstream is working on making it more robust:
http://code.google.com/p/leveldb/issues/detail?id=97
Yes, I saw this. It's very important. Sometimes, s... happens. With respect 
to the size ceph volumes can reach, having a tool to restart damaged 
nodes (for whatever reason) is a must.


Thanks for the time you took to answer. It's much clearer for me now.

Cheers,

--
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : yann.dup...@univ-nantes.fr



Re: domino-style OSD crash

2012-07-09 Thread Tommi Virtanen
On Mon, Jul 9, 2012 at 12:05 PM, Yann Dupont yann.dup...@univ-nantes.fr wrote:
 The information here isn't enough to say whether the cause of the
 corruption is btrfs or LevelDB, but the recovery needs to handled by
 LevelDB -- and upstream is working on making it more robust:
 http://code.google.com/p/leveldb/issues/detail?id=97

 Yes, saw this. It's very important. Sometimes, s... happens. In respect to
 the size ceph volumes can reach, having a tool to restart damaged nodes (for
 whatever reason) is a must.

 Thanks for the time you took to answer. It's much clearer for me now.

If it doesn't recover, you re-format the disk and thereby throw away
the contents. Not really all that different from handling hardware
failure. That's why we have replication.
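
In practice that amounts to removing the dead OSD and re-creating it empty; a 
rough sketch with N standing for the osd id (exact commands may vary slightly 
by version):

ceph osd out N
/etc/init.d/ceph stop osd.N
ceph osd crush remove osd.N
ceph auth del osd.N
ceph osd rm N
# re-mkfs the disk, re-add the osd, and let replication backfill it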


Re: domino-style OSD crash

2012-07-07 Thread Yann Dupont

On 06/07/2012 19:01, Gregory Farnum wrote:

On Fri, Jul 6, 2012 at 12:19 AM, Yann Dupont yann.dup...@univ-nantes.fr wrote:

On 05/07/2012 23:32, Gregory Farnum wrote:

[...]


ok, so as all nodes were identical, I probably have hit a btrfs bug (like
a
erroneous out of space ) in more or less the same time. And when 1 osd
was
out,


OH , I didn't finish the sentence... When 1 osd was out, missing data was
copied on another nodes, probably speeding btrfs problem on those nodes (I
suspect erroneous out of space conditions)

Ah. How full are/were the disks?


The OSD nodes were below 50% (all are 5 TB volumes):

osd.0 : 31%
osd.1 : 31%
osd.2 : 39%
osd.3 : 65%
no osd.4 :)
osd.5 : 35%
osd.6 : 60%
osd.7 : 42%
osd.8 : 34%

all the volumes were using btrfs with lzo compression.

[...]


Oh, interesting. Are the broken nodes all on the same set of arrays?


No. There are 4 completely independant raid arrays, in 4 different
locations. They are similar (same brand  model, but slighltly different
disks, and 1 different firmware), all arrays are multipathed. I don't think
the raid array is the problem. We use those particular models since 2/3
years, and in the logs I don't see any problem that can be caused by the
storage itself (like scsi or multipath errors)

I must have misunderstood then. What did you mean by 1 Array for 2 OSD nodes?


I have 8 osd nodes, in 4 different locations (several km apart). In each 
location I have 2 nodes and 1 raid array.
In each location, the raid array has 16 2 TB disks and 2 controllers with 
4x 8 Gb FC channels each. The 16 disks are organized in RAID 5 (8 disks 
for one set, 7 disks for the other). Each raid set is primarily attached to 1 
controller, and each osd node in the location has access to the 
controller over 2 distinct paths.


There was no correlation between failed nodes & raid arrays.

Cheers,

--
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : yann.dup...@univ-nantes.fr



Re: domino-style OSD crash

2012-07-06 Thread Yann Dupont

On 05/07/2012 23:32, Gregory Farnum wrote:

[...]

ok, so as all nodes were identical, I probably have hit a btrfs bug (like a
erroneous out of space ) in more or less the same time. And when 1 osd was
out,


OH, I didn't finish the sentence... When 1 osd was out, the missing data 
was copied onto other nodes, probably accelerating the btrfs problem on those 
nodes (I suspect erroneous out-of-space conditions)


I've reformatted the OSDs with xfs. Performance is slightly worse for the 
moment (well, it depends on the workload, and maybe the lack of syncfs is to 
blame), but at least I hope to have the storage layer rock-solid. BTW, 
I've managed to keep the faulty btrfs volumes.


[...]


I wonder if maybe there's a confounding factor here — are all your nodes
similar to each other,

Yes. I designed the cluster that way. All nodes are identical hardware
(powerEdge M610, 10G intel ethernet + emulex fibre channel attached to
storage (1 Array for 2 OSD nodes, 1 controller dedicated for each OSD)

Oh, interesting. Are the broken nodes all on the same set of arrays?


No. There are 4 completely independent raid arrays, in 4 different 
locations. They are similar (same brand & model, but slightly different 
disks, and 1 different firmware), and all arrays are multipathed. I don't 
think the raid arrays are the problem. We have been using those particular 
models for 2-3 years, and in the logs I don't see any problem that could be 
caused by the storage itself (like scsi or multipath errors)


Cheers,

--
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : yann.dup...@univ-nantes.fr



Re: domino-style OSD crash

2012-07-06 Thread Gregory Farnum
On Fri, Jul 6, 2012 at 12:19 AM, Yann Dupont yann.dup...@univ-nantes.fr wrote:
 On 05/07/2012 23:32, Gregory Farnum wrote:

 [...]

 ok, so as all nodes were identical, I probably have hit a btrfs bug (like
 a
 erroneous out of space ) in more or less the same time. And when 1 osd
 was
 out,


 OH , I didn't finish the sentence... When 1 osd was out, missing data was
 copied on another nodes, probably speeding btrfs problem on those nodes (I
 suspect erroneous out of space conditions)

Ah. How full are/were the disks?


 I've reformatted OSD with xfs. Performance is slightly worse for the moment
 (well, depend on the workload, and maybe lack of syncfs is to blame), but at
 least I hope to have the storage layer rock-solid. BTW, I've managed to keep
 the faulty btrfs volumes .

 [...]


 I wonder if maybe there's a confounding factor here — are all your nodes
 similar to each other,

 Yes. I designed the cluster that way. All nodes are identical hardware
 (powerEdge M610, 10G intel ethernet + emulex fibre channel attached to
 storage (1 Array for 2 OSD nodes, 1 controller dedicated for each OSD)

 Oh, interesting. Are the broken nodes all on the same set of arrays?


 No. There are 4 completely independant raid arrays, in 4 different
 locations. They are similar (same brand  model, but slighltly different
 disks, and 1 different firmware), all arrays are multipathed. I don't think
 the raid array is the problem. We use those particular models since 2/3
 years, and in the logs I don't see any problem that can be caused by the
 storage itself (like scsi or multipath errors)

I must have misunderstood then. What did you mean by 1 Array for 2 OSD nodes?


Re: domino-style OSD crash

2012-07-05 Thread Gregory Farnum
On Wed, Jul 4, 2012 at 10:53 AM, Yann Dupont yann.dup...@univ-nantes.fr wrote:
 On 04/07/2012 18:21, Gregory Farnum wrote:

 On Wednesday, July 4, 2012 at 1:06 AM, Yann Dupont wrote:

 On 03/07/2012 23:38, Tommi Virtanen wrote:

 On Tue, Jul 3, 2012 at 1:54 PM, Yann Dupont yann.dup...@univ-nantes.fr wrote:

 In the case I could repair, do you think a crashed FS as it is right
 now is
 valuable for you, for future reference , as I saw you can't reproduce
 the
 problem ? I can make an archive (or a btrfs dump ?), but it will be
 quite
 big.

     At this point, it's more about the upstream developers (of btrfs
 etc)
 than us; we're on good terms with them but not experts on the on-disk
 format(s). You might want to send an email to the relevant mailing
 lists before wiping the disks.

     Well, I probably wasn't clear enough. I talked about crashed FS, but
 i
 was talking about ceph. The underlying FS (btrfs in that case) of 1 node
 (and only one) has PROBABLY crashed in the past, causing corruption in
 ceph data on this node, and then the subsequent crash of other nodes.
   RIGHT now btrfs on this node is OK. I can access the filesystem without
 errors.
   For the moment, on 8 nodes, 4 refuse to restart .
 1 of the 4 nodes was the crashed node , the 3 others didn't had broblem
 with the underlying fs as far as I can tell.
   So I think the scenario is :
   One node had problem with btrfs, leading first to kernel problem ,
 probably corruption (in disk/ in memory maybe ?) ,and ultimately to a
 kernel oops. Before that ultimate kernel oops, bad data has been
 transmitted to other (sane) nodes, leading to ceph-osd crash on thoses
 nodes.

 I don't think that's actually possible — the OSDs all do quite a lot of
 interpretation between what they get off the wire and what goes on disk.
 What you've got here are 4 corrupted LevelDB databases, and we pretty much
 can't do that through the interfaces we have. :/


 ok, so as all nodes were identical, I probably have hit a btrfs bug (like a
 erroneous out of space ) in more or less the same time. And when 1 osd was
 out,



   If you think this scenario is highly improbable in real life (that is,
 btrfs will probably be fixed for good, and then, corruption can't
 happen), it's ok.
   But I wonder if this scenario can be triggered with other problem, and
 bad data can be transmitted to other sane nodes (power outage, out of
 memory condition, disk full... for example)
   That's why I proposed you a crashed ceph volume image (I shouldn't have
 talked about a crashed fs, sorry for the confusion)

 I appreciate the offer, but I don't think this will help much — it's a
 disk state managed by somebody else, not our logical state, which has
 broken. If we could figure out how that state got broken that'd be good, but
 a ceph image won't really help in doing so.

 ok, no problem. I'll restart from scratch, freshly formated.


 I wonder if maybe there's a confounding factor here — are all your nodes
 similar to each other,


 Yes. I designed the cluster that way. All nodes are identical hardware
 (powerEdge M610, 10G intel ethernet + emulex fibre channel attached to
 storage (1 Array for 2 OSD nodes, 1 controller dedicated for each OSD)

Oh, interesting. Are the broken nodes all on the same set of arrays?




   or are they running on different kinds of hardware? How did you do your
 Ceph upgrades? What's ceph -s display when the cluster is running as best it
 can?


 Ceph was running 0.47.2 at that time - (debian package for ceph). After the
 crash I couldn't restart all the nodes. Tried 0.47.3 and now 0.48 without
 success.

 Nothing particular for upgrades, because for the moment ceph is broken, so
 just apt-get upgrade with new version.


 ceph -s show that :

 root@label5:~# ceph -s
    health HEALTH_WARN 260 pgs degraded; 793 pgs down; 785 pgs peering; 32
 pgs recovering; 96 pgs stale; 793 pgs stuck inactive; 96 pgs stuck stale;
 1092 pgs stuck unclean; recovery 267286/2491140 degraded (10.729%);
 1814/1245570 unfound (0.146%)
    monmap e1: 3 mons at
 {chichibu=172.20.14.130:6789/0,glenesk=172.20.14.131:6789/0,karuizawa=172.20.14.133:6789/0},
 election epoch 12, quorum 0,1,2 chichibu,glenesk,karuizawa
    osdmap e2404: 8 osds: 3 up, 3 in
     pgmap v173701: 1728 pgs: 604 active+clean, 8 down, 5
 active+recovering+remapped, 32 active+clean+replay, 11
 active+recovering+degraded, 25 active+remapped, 710 down+peering, 222
 active+degraded, 7 stale+active+recovering+degraded, 61 stale+down+peering,
 20 stale+active+degraded, 6 down+remapped+peering, 8
 stale+down+remapped+peering, 9 active+recovering; 4786 GB data, 7495 GB
 used, 7280 GB / 15360 GB avail; 267286/2491140 degraded (10.729%);
 1814

Re: domino-style OSD crash

2012-07-04 Thread Yann Dupont

Le 03/07/2012 23:38, Tommi Virtanen a écrit :

On Tue, Jul 3, 2012 at 1:54 PM, Yann Dupont yann.dup...@univ-nantes.fr wrote:

In the case I could repair, do you think a crashed FS as it is right now is
valuable for you, for future reference , as I saw you can't reproduce the
problem ? I can make an archive (or a btrfs dump ?), but it will be quite
big.

At this point, it's more about the upstream developers (of btrfs etc)
than us; we're on good terms with them but not experts on the on-disk
format(s). You might want to send an email to the relevant mailing
lists before wiping the disks.


Well, I probably wasn't clear enough. I talked about a crashed FS, but I 
was talking about Ceph. The underlying FS (btrfs in that case) of 1 node 
(and only one) has PROBABLY crashed in the past, causing corruption in 
the ceph data on this node, and then the subsequent crash of other nodes.


RIGHT now btrfs on this node is OK. I can access the filesystem without 
errors.


For the moment, on 8 nodes, 4 refuse to restart.
1 of the 4 nodes was the crashed node; the 3 others didn't have problems 
with the underlying fs as far as I can tell.


So I think the scenario is:

One node had a problem with btrfs, leading first to a kernel problem, 
probably corruption (on disk / in memory maybe?), and ultimately to a 
kernel oops. Before that ultimate kernel oops, bad data was 
transmitted to other (sane) nodes, leading to ceph-osd crashes on those 
nodes.


If you think this scenario is highly improbable in real life (that is, 
btrfs will probably be fixed for good, and then corruption can't 
happen), it's OK.


But I wonder if this scenario can be triggered by other problems, with 
bad data being transmitted to other sane nodes (power outage, out of 
memory condition, disk full... for example).


That's why I offered you a crashed ceph volume image (I shouldn't have 
talked about a crashed fs, sorry for the confusion).


Talking about btrfs, there are a lot of fixes in btrfs between 3.4 and 
3.5rc. After the crash, I couldn't mount the btrfs volume. With 3.5rc I 
can, and there is no sign of problems on it. That doesn't mean the data 
there is safe, but I think it's a sign that at least some bugs have been 
corrected in the btrfs code.


Cheers,

--
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : yann.dup...@univ-nantes.fr



Re: domino-style OSD crash

2012-07-04 Thread Gregory Farnum
On Wednesday, July 4, 2012 at 1:06 AM, Yann Dupont wrote:
 Le 03/07/2012 23:38, Tommi Virtanen a écrit :
  On Tue, Jul 3, 2012 at 1:54 PM, Yann Dupont yann.dup...@univ-nantes.fr 
  (mailto:yann.dup...@univ-nantes.fr) wrote:
   In the case I could repair, do you think a crashed FS as it is right now 
   is
   valuable for you, for future reference , as I saw you can't reproduce the
   problem ? I can make an archive (or a btrfs dump ?), but it will be quite
   big.
   
   
  At this point, it's more about the upstream developers (of btrfs etc)
  than us; we're on good terms with them but not experts on the on-disk
  format(s). You might want to send an email to the relevant mailing
  lists before wiping the disks.
  
  
 Well, I probably wasn't clear enough. I talked about crashed FS, but i  
 was talking about ceph. The underlying FS (btrfs in that case) of 1 node  
 (and only one) has PROBABLY crashed in the past, causing corruption in  
 ceph data on this node, and then the subsequent crash of other nodes.
  
 RIGHT now btrfs on this node is OK. I can access the filesystem without  
 errors.
  
 For the moment, on 8 nodes, 4 refuse to restart .
 1 of the 4 nodes was the crashed node , the 3 others didn't had broblem  
 with the underlying fs as far as I can tell.
  
 So I think the scenario is :
  
 One node had problem with btrfs, leading first to kernel problem ,  
 probably corruption (in disk/ in memory maybe ?) ,and ultimately to a  
 kernel oops. Before that ultimate kernel oops, bad data has been  
 transmitted to other (sane) nodes, leading to ceph-osd crash on thoses  
 nodes.

I don't think that's actually possible — the OSDs all do quite a lot of 
interpretation between what they get off the wire and what goes on disk. What 
you've got here are 4 corrupted LevelDB databases, and we pretty much can't do 
that through the interfaces we have. :/
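
(Editor's note: purely illustrative, not actual Ceph code. The point Greg is
making is that data coming off the wire is decoded into typed structures,
with bounds and sanity checks, before anything is re-encoded and persisted
locally, so a corrupted byte stream tends to fail decoding rather than being
written through verbatim. A toy sketch of that decode-then-store pattern:)

#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <string>
#include <vector>

// Toy example: the payload is parsed and validated into a typed structure
// first, so random wire corruption usually throws here instead of ever
// reaching the on-disk store.
struct WireOp {
  uint32_t len = 0;
  std::string payload;

  static WireOp decode(const std::vector<uint8_t> &buf) {
    WireOp op;
    if (buf.size() < sizeof(op.len))
      throw std::runtime_error("short header");
    std::memcpy(&op.len, buf.data(), sizeof(op.len));
    if (buf.size() - sizeof(op.len) < op.len)
      throw std::runtime_error("truncated payload");
    op.payload.assign(buf.begin() + sizeof(op.len),
                      buf.begin() + sizeof(op.len) + op.len);
    return op;
  }
};

int main() {
  // A garbage buffer claiming a ~4 GB payload is rejected at decode time,
  // long before any local write would happen.
  std::vector<uint8_t> corrupted = {0xff, 0xff, 0xff, 0xff};
  try {
    WireOp::decode(corrupted);
  } catch (const std::exception &) {
    return 0;  // corruption detected, as expected
  }
  return 1;
}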
  
  
 If you think this scenario is highly improbable in real life (that is,  
 btrfs will probably be fixed for good, and then, corruption can't  
 happen), it's ok.
  
 But I wonder if this scenario can be triggered with other problem, and  
 bad data can be transmitted to other sane nodes (power outage, out of  
 memory condition, disk full... for example)
  
 That's why I proposed you a crashed ceph volume image (I shouldn't have  
 talked about a crashed fs, sorry for the confusion)

I appreciate the offer, but I don't think this will help much — it's a disk 
state managed by somebody else, not our logical state, which has broken. If we 
could figure out how that state got broken that'd be good, but a ceph image 
won't really help in doing so.

I wonder if maybe there's a confounding factor here — are all your nodes 
similar to each other, or are they running on different kinds of hardware? How 
did you do your Ceph upgrades? What's ceph -s display when the cluster is 
running as best it can?
-Greg



Re: domino-style OSD crash

2012-07-04 Thread Yann Dupont

Le 04/07/2012 18:21, Gregory Farnum a écrit :

On Wednesday, July 4, 2012 at 1:06 AM, Yann Dupont wrote:

Le 03/07/2012 23:38, Tommi Virtanen a écrit :

On Tue, Jul 3, 2012 at 1:54 PM, Yann Dupont yann.dup...@univ-nantes.fr 
(mailto:yann.dup...@univ-nantes.fr) wrote:

In the case I could repair, do you think a crashed FS as it is right now is
valuable for you, for future reference , as I saw you can't reproduce the
problem ? I can make an archive (or a btrfs dump ?), but it will be quite
big.
  
  
At this point, it's more about the upstream developers (of btrfs etc)

than us; we're on good terms with them but not experts on the on-disk
format(s). You might want to send an email to the relevant mailing
lists before wiping the disks.
  
  
Well, I probably wasn't clear enough. I talked about crashed FS, but i

was talking about ceph. The underlying FS (btrfs in that case) of 1 node
(and only one) has PROBABLY crashed in the past, causing corruption in
ceph data on this node, and then the subsequent crash of other nodes.
  
RIGHT now btrfs on this node is OK. I can access the filesystem without

errors.
  
For the moment, on 8 nodes, 4 refuse to restart .

1 of the 4 nodes was the crashed node , the 3 others didn't had broblem
with the underlying fs as far as I can tell.
  
So I think the scenario is :
  
One node had problem with btrfs, leading first to kernel problem ,

probably corruption (in disk/ in memory maybe ?) ,and ultimately to a
kernel oops. Before that ultimate kernel oops, bad data has been
transmitted to other (sane) nodes, leading to ceph-osd crash on thoses
nodes.

I don't think that's actually possible — the OSDs all do quite a lot of 
interpretation between what they get off the wire and what goes on disk. What 
you've got here are 4 corrupted LevelDB databases, and we pretty much can't do 
that through the interfaces we have. :/


OK, so as all nodes were identical, I probably hit a btrfs bug (like an 
erroneous out-of-space condition) at more or less the same time. And when 
1 OSD was out,
   
  
If you think this scenario is highly improbable in real life (that is,

btrfs will probably be fixed for good, and then, corruption can't
happen), it's ok.
  
But I wonder if this scenario can be triggered with other problem, and

bad data can be transmitted to other sane nodes (power outage, out of
memory condition, disk full... for example)
  
That's why I proposed you a crashed ceph volume image (I shouldn't have

talked about a crashed fs, sorry for the confusion)

I appreciate the offer, but I don't think this will help much — it's a disk state managed 
by somebody else, not our logical state, which has broken. If we could figure out how 
that state got broken that'd be good, but a ceph image won't really help in 
doing so.

OK, no problem. I'll restart from scratch, freshly formatted.


I wonder if maybe there's a confounding factor here — are all your nodes 
similar to each other,


Yes. I designed the cluster that way. All nodes are identical hardware 
(PowerEdge M610, 10G Intel Ethernet + Emulex Fibre Channel attached to 
storage; 1 array for 2 OSD nodes, 1 controller dedicated to each OSD).



  or are they running on different kinds of hardware? How did you do your Ceph 
upgrades? What's ceph -s display when the cluster is running as best it can?


Ceph was running 0.47.2 at that time (the Debian package for Ceph). After 
the crash I couldn't restart all the nodes. I tried 0.47.3 and now 0.48 
without success.


Nothing particular for upgrades; because Ceph is broken at the moment, it 
was just apt-get upgrade with the new version.



ceph -s show that :

root@label5:~# ceph -s
   health HEALTH_WARN 260 pgs degraded; 793 pgs down; 785 pgs peering; 
32 pgs recovering; 96 pgs stale; 793 pgs stuck inactive; 96 pgs stuck 
stale; 1092 pgs stuck unclean; recovery 267286/2491140 degraded 
(10.729%); 1814/1245570 unfound (0.146%)
   monmap e1: 3 mons at 
{chichibu=172.20.14.130:6789/0,glenesk=172.20.14.131:6789/0,karuizawa=172.20.14.133:6789/0}, 
election epoch 12, quorum 0,1,2 chichibu,glenesk,karuizawa

   osdmap e2404: 8 osds: 3 up, 3 in
pgmap v173701: 1728 pgs: 604 active+clean, 8 down, 5 
active+recovering+remapped, 32 active+clean+replay, 11 
active+recovering+degraded, 25 active+remapped, 710 down+peering, 222 
active+degraded, 7 stale+active+recovering+degraded, 61 
stale+down+peering, 20 stale+active+degraded, 6 down+remapped+peering, 8 
stale+down+remapped+peering, 9 active+recovering; 4786 GB data, 7495 GB 
used, 7280 GB / 15360 GB avail; 267286/2491140 degraded (10.729%); 
1814/1245570 unfound (0.146%)

   mdsmap e172: 1/1/1 up {0=karuizawa=up:replay}, 2 up:standby



BTW, after the 0.48 upgrade, there was a disk format conversion. 1 of 
the 4 surviving OSDs didn't

Re: domino-style OSD crash

2012-07-03 Thread Tommi Virtanen
On Tue, Jul 3, 2012 at 1:40 AM, Yann Dupont yann.dup...@univ-nantes.fr wrote:
 Upgraded the kernel to 3.5.0-rc4 + some patches, seems btrfs is OK right
 now.

 Tried to restart osd with 0.47.3, then next branch, and today with 0.48.

 4 of 8 nodes fails with the same message :

 ceph version 0.48argonaut (commit:c2b20ca74249892c8e5e40c12aa14446a2bf2030)
  1: /usr/bin/ceph-osd() [0x701929]
...
  13: (leveldb::InternalKeyComparator::FindShortestSeparator(std::string*,
 leveldb::Slice const) const+0x4d) [0x6e811d]

That looks like http://tracker.newdream.net/issues/2563 and the best
we have for that ticket is "looks like you have a corrupted leveldb
file". Is this reproducible with a freshly mkfs'ed data partition?


Re: domino-style OSD crash

2012-07-03 Thread Yann Dupont

Le 03/07/2012 21:42, Tommi Virtanen a écrit :

On Tue, Jul 3, 2012 at 1:40 AM, Yann Dupont yann.dup...@univ-nantes.fr wrote:

Upgraded the kernel to 3.5.0-rc4 + some patches, seems btrfs is OK right
now.

Tried to restart osd with 0.47.3, then next branch, and today with 0.48.

4 of 8 nodes fails with the same message :

ceph version 0.48argonaut (commit:c2b20ca74249892c8e5e40c12aa14446a2bf2030)
  1: /usr/bin/ceph-osd() [0x701929]

...

  13: (leveldb::InternalKeyComparator::FindShortestSeparator(std::string*,
leveldb::Slice const) const+0x4d) [0x6e811d]

That looks like http://tracker.newdream.net/issues/2563 and the best
we have for that ticket is looks like you have a corrupted leveldb
file. Is this reproducible with a freshly mkfs'ed data partition?

Probably not. I have multiple data volumes on each node (I was planning 
xfs vs ext4 vs btrfs benchmarks before being ill) and those nodes start 
OK with another data partition.


It's very probable that there is corruption somewhere, due to a kernel 
bug, probably triggered by btrfs.


Issue 2563 is probably the same.

I'd like to restart those nodes without formatting them, not because the 
data is valuable, but because if the same thing happens in production, a 
method similar to fsck for the node could be of great value.


I saw the method to check the leveldb. I will try tomorrow, without guarantees.
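
(Editor's note: for reference, a rough sketch of what such a check could look
like against the stock leveldb C++ API. The omap path is only an example,
RepairDB is a best-effort salvage that can lose data, and this is not
necessarily the exact method discussed in the thread; run it on a copy of the
store, with the OSD stopped.)

#include <iostream>
#include <leveldb/db.h>

int main(int argc, char **argv) {
    // Example path only; point this at a *copy* of the OSD's leveldb dir.
    std::string path = argc > 1 ? argv[1] : "/srv/osd.22/current/omap";

    leveldb::Options opts;
    opts.paranoid_checks = true;          // verify checksums aggressively

    leveldb::DB *db = nullptr;
    leveldb::Status s = leveldb::DB::Open(opts, path, &db);
    if (!s.ok()) {
        std::cerr << "open failed: " << s.ToString() << std::endl;
        // Best-effort repair; data loss is possible.
        leveldb::Status r = leveldb::RepairDB(path, opts);
        std::cerr << "RepairDB: " << r.ToString() << std::endl;
        return 1;
    }

    // Walk every key so that all blocks are read and checksummed.
    leveldb::Iterator *it = db->NewIterator(leveldb::ReadOptions());
    size_t n = 0;
    for (it->SeekToFirst(); it->Valid(); it->Next())
        ++n;
    std::cout << "scanned " << n << " keys, iterator status: "
              << it->status().ToString() << std::endl;
    delete it;
    delete db;
    return 0;
}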

In the case I could repair it, do you think the crashed FS as it is right 
now is valuable for you, for future reference, as I saw you can't reproduce 
the problem? I can make an archive (or a btrfs dump?), but it will be 
quite big.


Cheers,

--
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : yann.dup...@univ-nantes.fr



Re: domino-style OSD crash

2012-07-03 Thread Tommi Virtanen
On Tue, Jul 3, 2012 at 1:54 PM, Yann Dupont yann.dup...@univ-nantes.fr wrote:
 In the case I could repair, do you think a crashed FS as it is right now is
 valuable for you, for future reference , as I saw you can't reproduce the
 problem ? I can make an archive (or a btrfs dump ?), but it will be quite
 big.

At this point, it's more about the upstream developers (of btrfs etc)
than us; we're on good terms with them but not experts on the on-disk
format(s). You might want to send an email to the relevant mailing
lists before wiping the disks.


Re: Should an OSD crash when journal device is out of space?

2012-07-02 Thread Gregory Farnum
Hey guys,
Thanks for the problem report. I've created an issue to track it at
http://tracker.newdream.net/issues/2687.
It looks like we just assume that if you're using a file, you've got
enough space for it. It shouldn't be a big deal to at least do some
startup checks which will fail gracefully.
-Greg
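
(Editor's note: as an illustration only, not the actual fix tracked in issue
2687. The kind of graceful startup check meant here could be as simple as
comparing the configured journal size against the free space on the journal's
filesystem, e.g. with statvfs, before writing anything. The function and
paths below are a hypothetical sketch.)

#include <sys/statvfs.h>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical pre-flight check: refuse to start if the filesystem holding
// the journal cannot fit the configured journal size, instead of hitting
// ENOSPC and asserting later in FileJournal::do_write().
static int check_journal_space(const std::string &journal_dir,
                               uint64_t journal_bytes) {
  struct statvfs vfs;
  if (statvfs(journal_dir.c_str(), &vfs) != 0) {
    perror("statvfs");
    return -1;
  }
  uint64_t avail = (uint64_t)vfs.f_bavail * vfs.f_frsize;
  if (avail < journal_bytes) {
    fprintf(stderr,
            "journal in %s needs %llu bytes but only %llu are free\n",
            journal_dir.c_str(),
            (unsigned long long)journal_bytes,
            (unsigned long long)avail);
    return -1;  // caller exits cleanly instead of crashing mid-write
  }
  return 0;
}

int main() {
  // The case reported below: a 5000 MB journal on a 2560 MB tmpfs.
  uint64_t want = 5000ULL * 1024 * 1024;
  return check_journal_space("/tmpfs", want) == 0 ? 0 : 1;
}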

On Wed, Jun 20, 2012 at 1:57 PM, Matthew Roy imjustmatt...@gmail.com wrote:
 I hit this a couple times and wondered the same thing. Why does the
 OSD need to bail when it runs out of journal space?

 On Wed, Jun 20, 2012 at 3:56 PM, Travis Rhoden trho...@gmail.com wrote:
 Not sure if this is a bug or not.  It was definitely user error -- but
 since the OSD process bailed, figured I would report it.

 I had /tmpfs mounted with 2.5GB of space:

 tmpfs on /tmpfs type tmpfs (rw,size=2560m)

 Then I decided to increase my journal size to 5G, but forgot to
 increase the limit on /tmpfs.  =)

 osd journal size = 5000


 Predictably, things didn't go well when I ran a rados bench that
 filled up the journal.  I'm not sure if such a case can be handled
 more gracefully:


    -4 2012-06-20 12:39:36.648773 7fc042a5f780  1 journal _open
 /tmpfs/osd.2.journal fd 30: 524288 bytes, block size 4096 bytes,
 directio = 0, aio = 0
    -3 2012-06-20 12:42:23.179164 7fc02e1ad700  1
 CephxAuthorizeHandler::verify_authorizer isvalid=1
    -2 2012-06-20 12:42:46.643205 7fc0396cf700 -1 journal
 FileJournal::write_bl : write_fd failed: (28) No space left on device
    -1 2012-06-20 12:42:46.643245 7fc0396cf700 -1 journal
 FileJournal::do_write: write_bl(pos=2678079488) failed
     0 2012-06-20 12:42:46.676991 7fc0396cf700 -1 os/FileJournal.cc:
 In function 'void FileJournal::do_write(ceph::bufferlist)' thread
 7fc0396cf700 time 2012-06-20 12:42:46.643315
 os/FileJournal.cc: 994: FAILED assert(0)

  ceph version 0.47.2 (commit:8bf9fde89bd6ebc4b0645b2fe02dadb1c17ad372)
  1: (FileJournal::do_write(ceph::buffer::list)+0xe22) [0x653082]
  2: (FileJournal::write_thread_entry()+0x735) [0x659545]
  3: (FileJournal::Writer::entry()+0xd) [0x5de41d]
  4: (()+0x7e9a) [0x7fc042434e9a]
  5: (clone()+0x6d) [0x7fc0409e94bd]
  NOTE: a copy of the executable, or `objdump -rdS executable` is
 needed to interpret this.

 --- end dump of recent events ---
 2012-06-20 12:42:46.693963 7fc0396cf700 -1 *** Caught signal (Aborted) **
  in thread 7fc0396cf700

  ceph version 0.47.2 (commit:8bf9fde89bd6ebc4b0645b2fe02dadb1c17ad372)
  1: /usr/bin/ceph-osd() [0x6eb32a]
  2: (()+0xfcb0) [0x7fc04243ccb0]
  3: (gsignal()+0x35) [0x7fc04092d445]
  4: (abort()+0x17b) [0x7fc040930bab]
  5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fc04127b69d]
  6: (()+0xb5846) [0x7fc041279846]
  7: (()+0xb5873) [0x7fc041279873]
  8: (()+0xb596e) [0x7fc04127996e]
  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
 const*)+0x282) [0x79dd02]
  10: (FileJournal::do_write(ceph::buffer::list)+0xe22) [0x653082]
  11: (FileJournal::write_thread_entry()+0x735) [0x659545]
  12: (FileJournal::Writer::entry()+0xd) [0x5de41d]
  13: (()+0x7e9a) [0x7fc042434e9a]
  14: (clone()+0x6d) [0x7fc0409e94bd]
  NOTE: a copy of the executable, or `objdump -rdS executable` is
 needed to interpret this.

 --- begin dump of recent events ---
     0 2012-06-20 12:42:46.693963 7fc0396cf700 -1 *** Caught signal
 (Aborted) **
  in thread 7fc0396cf700

  ceph version 0.47.2 (commit:8bf9fde89bd6ebc4b0645b2fe02dadb1c17ad372)
  1: /usr/bin/ceph-osd() [0x6eb32a]
  2: (()+0xfcb0) [0x7fc04243ccb0]
  3: (gsignal()+0x35) [0x7fc04092d445]
  4: (abort()+0x17b) [0x7fc040930bab]
  5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fc04127b69d]
  6: (()+0xb5846) [0x7fc041279846]
  7: (()+0xb5873) [0x7fc041279873]
  8: (()+0xb596e) [0x7fc04127996e]
  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
 const*)+0x282) [0x79dd02]
  10: (FileJournal::do_write(ceph::buffer::list)+0xe22) [0x653082]
  11: (FileJournal::write_thread_entry()+0x735) [0x659545]
  12: (FileJournal::Writer::entry()+0xd) [0x5de41d]
  13: (()+0x7e9a) [0x7fc042434e9a]
  14: (clone()+0x6d) [0x7fc0409e94bd]
  NOTE: a copy of the executable, or `objdump -rdS executable` is
 needed to interpret this.

 --- end dump of recent events ---


Re: reproducable osd crash

2012-06-27 Thread Stefan Priebe - Profihost AG
THANKS a lot. This fixes it. I've merged your branch into next and I 
wasn't able to trigger the OSD crash again. So please include this in 0.48.


Greets
Stefan

Am 26.06.2012 20:01, schrieb Sam Just:

Stefan,

Sorry for the delay, I think I've found the problem.  Could you give
wip_ms_handle_reset_race a try?
-Sam

On Tue, Jun 26, 2012 at 9:47 AM, Stefan Priebe s.pri...@profihost.ag wrote:

Am 26.06.2012 18:05, schrieb Tommi Virtanen:


On Mon, Jun 25, 2012 at 10:48 PM, Stefan Priebe s.pri...@profihost.ag
wrote:


Strange just copied /core.hostname and /usr/bin/ceph-osd no idea how this
can happen. For building I use the provided Debian scripts.



Perhaps you upgraded the debs but did not restart the daemons? That
would make the on-disk executable with that name not match the
in-memory one.



No, i reboot after each upgrade ;-)

Right now i'm witing for a FS fix xfs or btrfs and i will then reproduce the
issue.

Stefan



Re: reproducable osd crash

2012-06-27 Thread Sage Weil
On Wed, 27 Jun 2012, Stefan Priebe - Profihost AG wrote:
 THANKS a lot. This fixes it. I've merged your branch into next and i wsn't
 able to trigger the osd crash again. So please include this into 0.48.

Excellent.  Thanks for testing!  This is now in next.

sage


 
 Greets
 Stefan
 
 Am 26.06.2012 20:01, schrieb Sam Just:
  Stefan,
  
  Sorry for the delay, I think I've found the problem.  Could you give
  wip_ms_handle_reset_race a try?
  -Sam
  
  On Tue, Jun 26, 2012 at 9:47 AM, Stefan Priebe s.pri...@profihost.ag
  wrote:
   Am 26.06.2012 18:05, schrieb Tommi Virtanen:
   
On Mon, Jun 25, 2012 at 10:48 PM, Stefan Priebe s.pri...@profihost.ag
wrote:
 
 Strange just copied /core.hostname and /usr/bin/ceph-osd no idea how
 this
 can happen. For building I use the provided Debian scripts.


Perhaps you upgraded the debs but did not restart the daemons? That
would make the on-disk executable with that name not match the
in-memory one.
   
   
   No, i reboot after each upgrade ;-)
   
   Right now i'm witing for a FS fix xfs or btrfs and i will then reproduce
   the
   issue.
   
   Stefan


Re: reproducable osd crash

2012-06-26 Thread Tommi Virtanen
On Mon, Jun 25, 2012 at 10:48 PM, Stefan Priebe s.pri...@profihost.ag wrote:
 Strange just copied /core.hostname and /usr/bin/ceph-osd no idea how this
 can happen. For building I use the provided Debian scripts.

Perhaps you upgraded the debs but did not restart the daemons? That
would make the on-disk executable with that name not match the
in-memory one.


Re: reproducable osd crash

2012-06-26 Thread Stefan Priebe

Am 26.06.2012 18:05, schrieb Tommi Virtanen:

On Mon, Jun 25, 2012 at 10:48 PM, Stefan Priebe s.pri...@profihost.ag wrote:

Strange just copied /core.hostname and /usr/bin/ceph-osd no idea how this
can happen. For building I use the provided Debian scripts.


Perhaps you upgraded the debs but did not restart the daemons? That
would make the on-disk executable with that name not match the
in-memory one.


No, I reboot after each upgrade ;-)

Right now I'm waiting for an FS fix (xfs or btrfs) and I will then reproduce 
the issue.


Stefan


Re: reproducable osd crash

2012-06-26 Thread Sam Just
Stefan,

Sorry for the delay, I think I've found the problem.  Could you give
wip_ms_handle_reset_race a try?
-Sam

On Tue, Jun 26, 2012 at 9:47 AM, Stefan Priebe s.pri...@profihost.ag wrote:
 Am 26.06.2012 18:05, schrieb Tommi Virtanen:

 On Mon, Jun 25, 2012 at 10:48 PM, Stefan Priebe s.pri...@profihost.ag
 wrote:

 Strange just copied /core.hostname and /usr/bin/ceph-osd no idea how this
 can happen. For building I use the provided Debian scripts.


 Perhaps you upgraded the debs but did not restart the daemons? That
 would make the on-disk executable with that name not match the
 in-memory one.


 No, i reboot after each upgrade ;-)

 Right now i'm witing for a FS fix xfs or btrfs and i will then reproduce the
 issue.

 Stefan


Re: reproducable osd crash

2012-06-25 Thread Dan Mick
I've yet to make the core match the binary.  

On Jun 22, 2012, at 11:32 PM, Stefan Priebe s.pri...@profihost.ag wrote:

 Thanks did you find anything?
 
 Am 23.06.2012 um 01:59 schrieb Sam Just sam.j...@inktank.com:
 
 I am still looking into the logs.
 -Sam
 
 On Fri, Jun 22, 2012 at 3:56 PM, Dan Mick dan.m...@inktank.com wrote:
 Stefan, I'm looking at your logs and coredump now.
 
 
 On 06/21/2012 11:43 PM, Stefan Priebe wrote:
 
 Does anybody have an idea? This is right now a showstopper to me.
 
 Am 21.06.2012 um 14:55 schrieb Stefan Priebe - Profihost
 AGs.pri...@profihost.ag:
 
 Hello list,
 
 i'm able to reproducably crash osd daemons.
 
 How i can reproduce:
 
 Kernel: 3.5.0-rc3
 Ceph: 0.47.3
 FS: btrfs
 Journal: 2GB tmpfs per OSD
 OSD: 3x servers with 4x Intel SSD OSDs each
 10GBE Network
 rbd_cache_max_age: 2.0
 rbd_cache_size: 33554432
 
 Disk is set to writeback.
 
 Start a KVM VM via PXE with the disk attached in writeback mode.
 
 Then run randwrite stress more than 2 time. Mostly OSD 22 in my case
 crashes.
 
 # fio --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G
 --numjobs=50 --runtime=90 --group_reporting --name=file1; fio
 --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G
 --numjobs=50 --runtime=90 --group_reporting --name=file1; fio
 --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G
 --numjobs=50 --runtime=90 --group_reporting --name=file1; halt
 
 Strangely exactly THIS OSD also has the most log entries:
 64K ceph-osd.20.log
 64K ceph-osd.21.log
 1,3Mceph-osd.22.log
 64K ceph-osd.23.log
 
 But all OSDs are set to debug osd = 20.
 
 dmesg shows:
 ceph-osd[5381]: segfault at 3f592c000 ip 7fa281d8eb23 sp
 7fa27702d260 error 4 in libtcmalloc.so.0.0.0[7fa281d6a000+3d000]
 
 I uploaded the following files:
 priebe_fio_randwrite_ceph-osd.21.log.bz2 =  OSD which was OK and didn't
 crash
 priebe_fio_randwrite_ceph-osd.22.log.bz2 =  Log from the crashed OSD
 priebe_fio_randwrite_core.ssdstor001.27204.bz2 =  Core dump
 priebe_fio_randwrite_ceph-osd.bz2 =  osd binary
 
 Stefan
 


Re: reproducable osd crash

2012-06-23 Thread Stefan Priebe
Thanks, yes, it is from the next branch.

Am 23.06.2012 um 02:26 schrieb Dan Mick dan.m...@inktank.com:

 The ceph-osd binary you sent claims to be version 0.47.2-521-g88c762, which 
 is not quite 0.47.3.  You can get the version with binary -v, or (in my 
 case) examining strings in the binary.  I'm retrieving that version to 
 analyze the core dump.
 
 
 On 06/21/2012 11:43 PM, Stefan Priebe wrote:
 Does anybody have an idea? This is right now a showstopper to me.
 
 Am 21.06.2012 um 14:55 schrieb Stefan Priebe - Profihost 
 AGs.pri...@profihost.ag:
 
 Hello list,
 
 i'm able to reproducably crash osd daemons.
 
 How i can reproduce:
 
 Kernel: 3.5.0-rc3
 Ceph: 0.47.3
 FS: btrfs
 Journal: 2GB tmpfs per OSD
 OSD: 3x servers with 4x Intel SSD OSDs each
 10GBE Network
 rbd_cache_max_age: 2.0
 rbd_cache_size: 33554432
 
 Disk is set to writeback.
 
 Start a KVM VM via PXE with the disk attached in writeback mode.
 
 Then run randwrite stress more than 2 time. Mostly OSD 22 in my case 
 crashes.
 
 # fio --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G 
 --numjobs=50 --runtime=90 --group_reporting --name=file1; fio 
 --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G 
 --numjobs=50 --runtime=90 --group_reporting --name=file1; fio 
 --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G 
 --numjobs=50 --runtime=90 --group_reporting --name=file1; halt
 
 Strangely exactly THIS OSD also has the most log entries:
 64K ceph-osd.20.log
 64K ceph-osd.21.log
 1,3Mceph-osd.22.log
 64K ceph-osd.23.log
 
 But all OSDs are set to debug osd = 20.
 
 dmesg shows:
 ceph-osd[5381]: segfault at 3f592c000 ip 7fa281d8eb23 sp 
 7fa27702d260 error 4 in libtcmalloc.so.0.0.0[7fa281d6a000+3d000]
 
 I uploaded the following files:
 priebe_fio_randwrite_ceph-osd.21.log.bz2 =  OSD which was OK and didn't 
 crash
 priebe_fio_randwrite_ceph-osd.22.log.bz2 =  Log from the crashed OSD
 priebe_fio_randwrite_core.ssdstor001.27204.bz2 =  Core dump
 priebe_fio_randwrite_ceph-osd.bz2 =  osd binary
 
 Stefan


Re: reproducable osd crash

2012-06-23 Thread Stefan Priebe
Thanks, did you find anything?

Am 23.06.2012 um 01:59 schrieb Sam Just sam.j...@inktank.com:

 I am still looking into the logs.
 -Sam
 
 On Fri, Jun 22, 2012 at 3:56 PM, Dan Mick dan.m...@inktank.com wrote:
 Stefan, I'm looking at your logs and coredump now.
 
 
 On 06/21/2012 11:43 PM, Stefan Priebe wrote:
 
 Does anybody have an idea? This is right now a showstopper to me.
 
 Am 21.06.2012 um 14:55 schrieb Stefan Priebe - Profihost
 AGs.pri...@profihost.ag:
 
 Hello list,
 
 i'm able to reproducably crash osd daemons.
 
 How i can reproduce:
 
 Kernel: 3.5.0-rc3
 Ceph: 0.47.3
 FS: btrfs
 Journal: 2GB tmpfs per OSD
 OSD: 3x servers with 4x Intel SSD OSDs each
 10GBE Network
 rbd_cache_max_age: 2.0
 rbd_cache_size: 33554432
 
 Disk is set to writeback.
 
 Start a KVM VM via PXE with the disk attached in writeback mode.
 
 Then run randwrite stress more than 2 time. Mostly OSD 22 in my case
 crashes.
 
 # fio --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G
 --numjobs=50 --runtime=90 --group_reporting --name=file1; fio
 --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G
 --numjobs=50 --runtime=90 --group_reporting --name=file1; fio
 --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G
 --numjobs=50 --runtime=90 --group_reporting --name=file1; halt
 
 Strangely exactly THIS OSD also has the most log entries:
 64K ceph-osd.20.log
 64K ceph-osd.21.log
 1,3Mceph-osd.22.log
 64K ceph-osd.23.log
 
 But all OSDs are set to debug osd = 20.
 
 dmesg shows:
 ceph-osd[5381]: segfault at 3f592c000 ip 7fa281d8eb23 sp
 7fa27702d260 error 4 in libtcmalloc.so.0.0.0[7fa281d6a000+3d000]
 
 I uploaded the following files:
 priebe_fio_randwrite_ceph-osd.21.log.bz2 =  OSD which was OK and didn't
 crash
 priebe_fio_randwrite_ceph-osd.22.log.bz2 =  Log from the crashed OSD
 priebe_fio_randwrite_core.ssdstor001.27204.bz2 =  Core dump
 priebe_fio_randwrite_ceph-osd.bz2 =  osd binary
 
 Stefan
 


Re: reproducable osd crash

2012-06-22 Thread Stefan Priebe - Profihost AG
I'm still able to crash the Ceph cluster by doing a lot of random I/O 
and then shutting down the KVM.


Stefan

Am 21.06.2012 21:57, schrieb Stefan Priebe:

OK i discovered this time that all osds had the same disk usage before
crash. After starting the osd again i got this one:
/dev/sdb1 224G 23G 191G 11% /srv/osd.30
/dev/sdc1 224G 1,5G 213G 1% /srv/osd.31
/dev/sdd1 224G 1,5G 213G 1% /srv/osd.32
/dev/sde1 224G 1,6G 213G 1% /srv/osd.33

So instead of 1,5GB osd 30 now uses 23G.

Stefan

Am 21.06.2012 15:23, schrieb Stefan Priebe - Profihost AG:

Mhm is this normal (ceph health is NOW OK again)

/dev/sdb1 224G 655M 214G 1% /srv/osd.20
/dev/sdc1 224G 640M 214G 1% /srv/osd.21
/dev/sdd1 224G 34G 181G 16% /srv/osd.22
/dev/sde1 224G 608M 214G 1% /srv/osd.23

Why does one OSD has so much more used space than the others?

On my other OSD nodes all have around 600MB-700MB. Even when i reformat
/dev/sdd1 after the backfill it has again 34GB?

Stefan

Am 21.06.2012 15:13, schrieb Stefan Priebe - Profihost AG:

Another strange thing. Why does THIS OSD have 24GB and the others just
650MB?

/dev/sdb1 224G 654M 214G 1% /srv/osd.20
/dev/sdc1 224G 638M 214G 1% /srv/osd.21
/dev/sdd1 224G 24G 190G 12% /srv/osd.22
/dev/sde1 224G 607M 214G 1% /srv/osd.23


When i start now the OSD again it seems to hang for forever. Load goes
up to 200 and I/O Waits rise vom 0% to 20%.

Am 21.06.2012 14:55, schrieb Stefan Priebe - Profihost AG:

Hello list,

i'm able to reproducably crash osd daemons.

How i can reproduce:

Kernel: 3.5.0-rc3
Ceph: 0.47.3
FS: btrfs
Journal: 2GB tmpfs per OSD
OSD: 3x servers with 4x Intel SSD OSDs each
10GBE Network
rbd_cache_max_age: 2.0
rbd_cache_size: 33554432

Disk is set to writeback.

Start a KVM VM via PXE with the disk attached in writeback mode.

Then run randwrite stress more than 2 time. Mostly OSD 22 in my case
crashes.

# fio --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k
--size=200G
--numjobs=50 --runtime=90 --group_reporting --name=file1; fio
--filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G
--numjobs=50 --runtime=90 --group_reporting --name=file1; fio
--filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G
--numjobs=50 --runtime=90 --group_reporting --name=file1; halt

Strangely exactly THIS OSD also has the most log entries:
64K ceph-osd.20.log
64K ceph-osd.21.log
1,3M ceph-osd.22.log
64K ceph-osd.23.log

But all OSDs are set to debug osd = 20.

dmesg shows:
ceph-osd[5381]: segfault at 3f592c000 ip 7fa281d8eb23 sp
7fa27702d260 error 4 in libtcmalloc.so.0.0.0[7fa281d6a000+3d000]

I uploaded the following files:
priebe_fio_randwrite_ceph-osd.21.log.bz2 = OSD which was OK and
didn't
crash
priebe_fio_randwrite_ceph-osd.22.log.bz2 = Log from the crashed OSD
priebe_fio_randwrite_core.ssdstor001.27204.bz2 = Core dump
priebe_fio_randwrite_ceph-osd.bz2 = osd binary

Stefan



Re: reproducable osd crash

2012-06-22 Thread Dan Mick

Stefan, I'm looking at your logs and coredump now.

On 06/21/2012 11:43 PM, Stefan Priebe wrote:

Does anybody have an idea? This is right now a showstopper to me.

Am 21.06.2012 um 14:55 schrieb Stefan Priebe - Profihost 
AGs.pri...@profihost.ag:


Hello list,

i'm able to reproducably crash osd daemons.

How i can reproduce:

Kernel: 3.5.0-rc3
Ceph: 0.47.3
FS: btrfs
Journal: 2GB tmpfs per OSD
OSD: 3x servers with 4x Intel SSD OSDs each
10GBE Network
rbd_cache_max_age: 2.0
rbd_cache_size: 33554432

Disk is set to writeback.

Start a KVM VM via PXE with the disk attached in writeback mode.

Then run randwrite stress more than 2 time. Mostly OSD 22 in my case crashes.

# fio --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G 
--numjobs=50 --runtime=90 --group_reporting --name=file1; fio 
--filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G --numjobs=50 
--runtime=90 --group_reporting --name=file1; fio --filename=/dev/vda1 
--direct=1 --rw=randwrite --bs=4k --size=200G --numjobs=50 --runtime=90 
--group_reporting --name=file1; halt

Strangely exactly THIS OSD also has the most log entries:
64K ceph-osd.20.log
64K ceph-osd.21.log
1,3Mceph-osd.22.log
64K ceph-osd.23.log

But all OSDs are set to debug osd = 20.

dmesg shows:
ceph-osd[5381]: segfault at 3f592c000 ip 7fa281d8eb23 sp 7fa27702d260 
error 4 in libtcmalloc.so.0.0.0[7fa281d6a000+3d000]

I uploaded the following files:
priebe_fio_randwrite_ceph-osd.21.log.bz2 =  OSD which was OK and didn't crash
priebe_fio_randwrite_ceph-osd.22.log.bz2 =  Log from the crashed OSD
priebe_fio_randwrite_core.ssdstor001.27204.bz2 =  Core dump
priebe_fio_randwrite_ceph-osd.bz2 =  osd binary

Stefan



Re: reproducable osd crash

2012-06-22 Thread Sam Just
I am still looking into the logs.
-Sam

On Fri, Jun 22, 2012 at 3:56 PM, Dan Mick dan.m...@inktank.com wrote:
 Stefan, I'm looking at your logs and coredump now.


 On 06/21/2012 11:43 PM, Stefan Priebe wrote:

 Does anybody have an idea? This is right now a showstopper to me.

 Am 21.06.2012 um 14:55 schrieb Stefan Priebe - Profihost
 AGs.pri...@profihost.ag:

 Hello list,

 i'm able to reproducably crash osd daemons.

 How i can reproduce:

 Kernel: 3.5.0-rc3
 Ceph: 0.47.3
 FS: btrfs
 Journal: 2GB tmpfs per OSD
 OSD: 3x servers with 4x Intel SSD OSDs each
 10GBE Network
 rbd_cache_max_age: 2.0
 rbd_cache_size: 33554432

 Disk is set to writeback.

 Start a KVM VM via PXE with the disk attached in writeback mode.

 Then run randwrite stress more than 2 time. Mostly OSD 22 in my case
 crashes.

 # fio --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G
 --numjobs=50 --runtime=90 --group_reporting --name=file1; fio
 --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G
 --numjobs=50 --runtime=90 --group_reporting --name=file1; fio
 --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G
 --numjobs=50 --runtime=90 --group_reporting --name=file1; halt

 Strangely exactly THIS OSD also has the most log entries:
 64K     ceph-osd.20.log
 64K     ceph-osd.21.log
 1,3M    ceph-osd.22.log
 64K     ceph-osd.23.log

 But all OSDs are set to debug osd = 20.

 dmesg shows:
 ceph-osd[5381]: segfault at 3f592c000 ip 7fa281d8eb23 sp
 7fa27702d260 error 4 in libtcmalloc.so.0.0.0[7fa281d6a000+3d000]

 I uploaded the following files:
 priebe_fio_randwrite_ceph-osd.21.log.bz2 =  OSD which was OK and didn't
 crash
 priebe_fio_randwrite_ceph-osd.22.log.bz2 =  Log from the crashed OSD
 priebe_fio_randwrite_core.ssdstor001.27204.bz2 =  Core dump
 priebe_fio_randwrite_ceph-osd.bz2 =  osd binary

 Stefan



Re: reproducable osd crash

2012-06-22 Thread Dan Mick
The ceph-osd binary you sent claims to be version 0.47.2-521-g88c762, 
which is not quite 0.47.3.  You can get the version with binary -v, or 
(in my case) examining strings in the binary.  I'm retrieving that 
version to analyze the core dump.



On 06/21/2012 11:43 PM, Stefan Priebe wrote:

Does anybody have an idea? This is right now a showstopper to me.

Am 21.06.2012 um 14:55 schrieb Stefan Priebe - Profihost 
AGs.pri...@profihost.ag:


Hello list,

i'm able to reproducably crash osd daemons.

How i can reproduce:

Kernel: 3.5.0-rc3
Ceph: 0.47.3
FS: btrfs
Journal: 2GB tmpfs per OSD
OSD: 3x servers with 4x Intel SSD OSDs each
10GBE Network
rbd_cache_max_age: 2.0
rbd_cache_size: 33554432

Disk is set to writeback.

Start a KVM VM via PXE with the disk attached in writeback mode.

Then run randwrite stress more than 2 time. Mostly OSD 22 in my case crashes.

# fio --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G 
--numjobs=50 --runtime=90 --group_reporting --name=file1; fio 
--filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G --numjobs=50 
--runtime=90 --group_reporting --name=file1; fio --filename=/dev/vda1 
--direct=1 --rw=randwrite --bs=4k --size=200G --numjobs=50 --runtime=90 
--group_reporting --name=file1; halt

Strangely exactly THIS OSD also has the most log entries:
64K ceph-osd.20.log
64K ceph-osd.21.log
1,3Mceph-osd.22.log
64K ceph-osd.23.log

But all OSDs are set to debug osd = 20.

dmesg shows:
ceph-osd[5381]: segfault at 3f592c000 ip 7fa281d8eb23 sp 7fa27702d260 
error 4 in libtcmalloc.so.0.0.0[7fa281d6a000+3d000]

I uploaded the following files:
priebe_fio_randwrite_ceph-osd.21.log.bz2 =  OSD which was OK and didn't crash
priebe_fio_randwrite_ceph-osd.22.log.bz2 =  Log from the crashed OSD
priebe_fio_randwrite_core.ssdstor001.27204.bz2 =  Core dump
priebe_fio_randwrite_ceph-osd.bz2 =  osd binary

Stefan



reproducable osd crash

2012-06-21 Thread Stefan Priebe - Profihost AG

Hello list,

I'm able to reproducibly crash OSD daemons.

How I can reproduce:

Kernel: 3.5.0-rc3
Ceph: 0.47.3
FS: btrfs
Journal: 2GB tmpfs per OSD
OSD: 3x servers with 4x Intel SSD OSDs each
10GBE Network
rbd_cache_max_age: 2.0
rbd_cache_size: 33554432

Disk is set to writeback.

Start a KVM VM via PXE with the disk attached in writeback mode.

Then run the randwrite stress more than 2 times. Mostly OSD 22 crashes 
in my case.


# fio --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G 
--numjobs=50 --runtime=90 --group_reporting --name=file1; fio 
--filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G 
--numjobs=50 --runtime=90 --group_reporting --name=file1; fio 
--filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G 
--numjobs=50 --runtime=90 --group_reporting --name=file1; halt


Strangely exactly THIS OSD also has the most log entries:
64K ceph-osd.20.log
64K ceph-osd.21.log
1,3Mceph-osd.22.log
64K ceph-osd.23.log

But all OSDs are set to debug osd = 20.

dmesg shows:
ceph-osd[5381]: segfault at 3f592c000 ip 7fa281d8eb23 sp 
7fa27702d260 error 4 in libtcmalloc.so.0.0.0[7fa281d6a000+3d000]


I uploaded the following files:
priebe_fio_randwrite_ceph-osd.21.log.bz2 = OSD which was OK and didn't 
crash

priebe_fio_randwrite_ceph-osd.22.log.bz2 = Log from the crashed OSD
priebe_fio_randwrite_core.ssdstor001.27204.bz2 = Core dump
priebe_fio_randwrite_ceph-osd.bz2 = osd binary

Stefan


Re: reproducable osd crash

2012-06-21 Thread Stefan Priebe - Profihost AG


When I now start the OSD again, it seems to hang forever. Load goes 
up to 200 and I/O waits rise from 0% to 20%.


Am 21.06.2012 14:55, schrieb Stefan Priebe - Profihost AG:

Hello list,

i'm able to reproducably crash osd daemons.

How i can reproduce:

Kernel: 3.5.0-rc3
Ceph: 0.47.3
FS: btrfs
Journal: 2GB tmpfs per OSD
OSD: 3x servers with 4x Intel SSD OSDs each
10GBE Network
rbd_cache_max_age: 2.0
rbd_cache_size: 33554432

Disk is set to writeback.

Start a KVM VM via PXE with the disk attached in writeback mode.

Then run randwrite stress more than 2 time. Mostly OSD 22 in my case
crashes.

# fio --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G
--numjobs=50 --runtime=90 --group_reporting --name=file1; fio
--filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G
--numjobs=50 --runtime=90 --group_reporting --name=file1; fio
--filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G
--numjobs=50 --runtime=90 --group_reporting --name=file1; halt

Strangely exactly THIS OSD also has the most log entries:
64K ceph-osd.20.log
64K ceph-osd.21.log
1,3M ceph-osd.22.log
64K ceph-osd.23.log

But all OSDs are set to debug osd = 20.

dmesg shows:
ceph-osd[5381]: segfault at 3f592c000 ip 7fa281d8eb23 sp
7fa27702d260 error 4 in libtcmalloc.so.0.0.0[7fa281d6a000+3d000]

I uploaded the following files:
priebe_fio_randwrite_ceph-osd.21.log.bz2 = OSD which was OK and didn't
crash
priebe_fio_randwrite_ceph-osd.22.log.bz2 = Log from the crashed OSD
priebe_fio_randwrite_core.ssdstor001.27204.bz2 = Core dump
priebe_fio_randwrite_ceph-osd.bz2 = osd binary

Stefan



Re: reproducable osd crash

2012-06-21 Thread Stefan Priebe - Profihost AG
Another strange thing. Why does THIS OSD have 24GB and the others just 
650MB?


/dev/sdb1 224G  654M  214G   1% /srv/osd.20
/dev/sdc1 224G  638M  214G   1% /srv/osd.21
/dev/sdd1 224G   24G  190G  12% /srv/osd.22
/dev/sde1 224G  607M  214G   1% /srv/osd.23


When i start now the OSD again it seems to hang for forever. Load goes
up to 200 and I/O Waits rise vom 0% to 20%.

Am 21.06.2012 14:55, schrieb Stefan Priebe - Profihost AG:

Hello list,

i'm able to reproducably crash osd daemons.

How i can reproduce:

Kernel: 3.5.0-rc3
Ceph: 0.47.3
FS: btrfs
Journal: 2GB tmpfs per OSD
OSD: 3x servers with 4x Intel SSD OSDs each
10GBE Network
rbd_cache_max_age: 2.0
rbd_cache_size: 33554432

Disk is set to writeback.

Start a KVM VM via PXE with the disk attached in writeback mode.

Then run randwrite stress more than 2 time. Mostly OSD 22 in my case
crashes.

# fio --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G
--numjobs=50 --runtime=90 --group_reporting --name=file1; fio
--filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G
--numjobs=50 --runtime=90 --group_reporting --name=file1; fio
--filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G
--numjobs=50 --runtime=90 --group_reporting --name=file1; halt

Strangely exactly THIS OSD also has the most log entries:
64K ceph-osd.20.log
64K ceph-osd.21.log
1,3M ceph-osd.22.log
64K ceph-osd.23.log

But all OSDs are set to debug osd = 20.

dmesg shows:
ceph-osd[5381]: segfault at 3f592c000 ip 7fa281d8eb23 sp
7fa27702d260 error 4 in libtcmalloc.so.0.0.0[7fa281d6a000+3d000]

I uploaded the following files:
priebe_fio_randwrite_ceph-osd.21.log.bz2 = OSD which was OK and didn't
crash
priebe_fio_randwrite_ceph-osd.22.log.bz2 = Log from the crashed OSD
priebe_fio_randwrite_core.ssdstor001.27204.bz2 = Core dump
priebe_fio_randwrite_ceph-osd.bz2 = osd binary

Stefan



Re: reproducable osd crash

2012-06-21 Thread Stefan Priebe - Profihost AG

Hmm, is this normal? (ceph health is NOW OK again)

/dev/sdb1 224G  655M  214G   1% /srv/osd.20
/dev/sdc1 224G  640M  214G   1% /srv/osd.21
/dev/sdd1 224G   34G  181G  16% /srv/osd.22
/dev/sde1 224G  608M  214G   1% /srv/osd.23

Why does one OSD have so much more used space than the others?

On my other OSD nodes, all OSDs have around 600MB-700MB. Even when I 
reformat /dev/sdd1, it again has 34GB after the backfill?


Stefan

Am 21.06.2012 15:13, schrieb Stefan Priebe - Profihost AG:

Another strange thing. Why does THIS OSD have 24GB and the others just
650MB?

/dev/sdb1 224G 654M 214G 1% /srv/osd.20
/dev/sdc1 224G 638M 214G 1% /srv/osd.21
/dev/sdd1 224G 24G 190G 12% /srv/osd.22
/dev/sde1 224G 607M 214G 1% /srv/osd.23


When i start now the OSD again it seems to hang for forever. Load goes
up to 200 and I/O Waits rise vom 0% to 20%.

Am 21.06.2012 14:55, schrieb Stefan Priebe - Profihost AG:

Hello list,

i'm able to reproducably crash osd daemons.

How i can reproduce:

Kernel: 3.5.0-rc3
Ceph: 0.47.3
FS: btrfs
Journal: 2GB tmpfs per OSD
OSD: 3x servers with 4x Intel SSD OSDs each
10GBE Network
rbd_cache_max_age: 2.0
rbd_cache_size: 33554432

Disk is set to writeback.

Start a KVM VM via PXE with the disk attached in writeback mode.

Then run randwrite stress more than 2 time. Mostly OSD 22 in my case
crashes.

# fio --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G
--numjobs=50 --runtime=90 --group_reporting --name=file1; fio
--filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G
--numjobs=50 --runtime=90 --group_reporting --name=file1; fio
--filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G
--numjobs=50 --runtime=90 --group_reporting --name=file1; halt

Strangely exactly THIS OSD also has the most log entries:
64K ceph-osd.20.log
64K ceph-osd.21.log
1,3M ceph-osd.22.log
64K ceph-osd.23.log

But all OSDs are set to debug osd = 20.

dmesg shows:
ceph-osd[5381]: segfault at 3f592c000 ip 7fa281d8eb23 sp
7fa27702d260 error 4 in libtcmalloc.so.0.0.0[7fa281d6a000+3d000]

I uploaded the following files:
priebe_fio_randwrite_ceph-osd.21.log.bz2 = OSD which was OK and didn't
crash
priebe_fio_randwrite_ceph-osd.22.log.bz2 = Log from the crashed OSD
priebe_fio_randwrite_core.ssdstor001.27204.bz2 = Core dump
priebe_fio_randwrite_ceph-osd.bz2 = osd binary

Stefan



Re: reproducable osd crash

2012-06-21 Thread Stefan Priebe
OK i discovered this time that all osds had the same disk usage before 
crash. After starting the osd again i got this one:

/dev/sdb1 224G   23G  191G  11% /srv/osd.30
/dev/sdc1 224G  1,5G  213G   1% /srv/osd.31
/dev/sdd1 224G  1,5G  213G   1% /srv/osd.32
/dev/sde1 224G  1,6G  213G   1% /srv/osd.33

So instead of 1,5GB osd 30 now uses 23G.

Stefan

Am 21.06.2012 15:23, schrieb Stefan Priebe - Profihost AG:

Mhm is this normal (ceph health is NOW OK again)

/dev/sdb1 224G  655M  214G   1% /srv/osd.20
/dev/sdc1 224G  640M  214G   1% /srv/osd.21
/dev/sdd1 224G   34G  181G  16% /srv/osd.22
/dev/sde1 224G  608M  214G   1% /srv/osd.23

Why does one OSD has so much more used space than the others?

On my other OSD nodes all have around 600MB-700MB. Even when i reformat
/dev/sdd1 after the backfill it has again 34GB?

Stefan

Am 21.06.2012 15:13, schrieb Stefan Priebe - Profihost AG:

Another strange thing. Why does THIS OSD have 24GB and the others just
650MB?

/dev/sdb1 224G 654M 214G 1% /srv/osd.20
/dev/sdc1 224G 638M 214G 1% /srv/osd.21
/dev/sdd1 224G 24G 190G 12% /srv/osd.22
/dev/sde1 224G 607M 214G 1% /srv/osd.23


When i start now the OSD again it seems to hang forever. Load goes
up to 200 and I/O Waits rise from 0% to 20%.

Am 21.06.2012 14:55, schrieb Stefan Priebe - Profihost AG:

Hello list,

i'm able to reproducably crash osd daemons.

How i can reproduce:

Kernel: 3.5.0-rc3
Ceph: 0.47.3
FS: btrfs
Journal: 2GB tmpfs per OSD
OSD: 3x servers with 4x Intel SSD OSDs each
10GBE Network
rbd_cache_max_age: 2.0
rbd_cache_size: 33554432

Disk is set to writeback.

Start a KVM VM via PXE with the disk attached in writeback mode.

Then run randwrite stress more than 2 time. Mostly OSD 22 in my case
crashes.

# fio --filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k
--size=200G
--numjobs=50 --runtime=90 --group_reporting --name=file1; fio
--filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G
--numjobs=50 --runtime=90 --group_reporting --name=file1; fio
--filename=/dev/vda1 --direct=1 --rw=randwrite --bs=4k --size=200G
--numjobs=50 --runtime=90 --group_reporting --name=file1; halt

Strangely exactly THIS OSD also has the most log entries:
64K ceph-osd.20.log
64K ceph-osd.21.log
1,3M ceph-osd.22.log
64K ceph-osd.23.log

But all OSDs are set to debug osd = 20.

dmesg shows:
ceph-osd[5381]: segfault at 3f592c000 ip 7fa281d8eb23 sp
7fa27702d260 error 4 in libtcmalloc.so.0.0.0[7fa281d6a000+3d000]

I uploaded the following files:
priebe_fio_randwrite_ceph-osd.21.log.bz2 = OSD which was OK and didn't
crash
priebe_fio_randwrite_ceph-osd.22.log.bz2 = Log from the crashed OSD
priebe_fio_randwrite_core.ssdstor001.27204.bz2 = Core dump
priebe_fio_randwrite_ceph-osd.bz2 = osd binary

Stefan

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html




Should an OSD crash when journal device is out of space?

2012-06-20 Thread Travis Rhoden
Not sure if this is a bug or not.  It was definitely user error -- but
since the OSD process bailed, figured I would report it.

I had /tmpfs mounted with 2.5GB of space:

tmpfs on /tmpfs type tmpfs (rw,size=2560m)

Then I decided to increase my journal size to 5G, but forgot to
increase the limit on /tmpfs.  =)

osd journal size = 5000


Predictably, things didn't go well when I ran a rados bench that
filled up the journal.  I'm not sure if such a case can be handled
more gracefully:


-4 2012-06-20 12:39:36.648773 7fc042a5f780  1 journal _open
/tmpfs/osd.2.journal fd 30: 524288 bytes, block size 4096 bytes,
directio = 0, aio = 0
-3 2012-06-20 12:42:23.179164 7fc02e1ad700  1
CephxAuthorizeHandler::verify_authorizer isvalid=1
-2 2012-06-20 12:42:46.643205 7fc0396cf700 -1 journal
FileJournal::write_bl : write_fd failed: (28) No space left on device
-1 2012-06-20 12:42:46.643245 7fc0396cf700 -1 journal
FileJournal::do_write: write_bl(pos=2678079488) failed
 0 2012-06-20 12:42:46.676991 7fc0396cf700 -1 os/FileJournal.cc:
In function 'void FileJournal::do_write(ceph::bufferlist)' thread
7fc0396cf700 time 2012-06-20 12:42:46.643315
os/FileJournal.cc: 994: FAILED assert(0)

 ceph version 0.47.2 (commit:8bf9fde89bd6ebc4b0645b2fe02dadb1c17ad372)
 1: (FileJournal::do_write(ceph::buffer::list)+0xe22) [0x653082]
 2: (FileJournal::write_thread_entry()+0x735) [0x659545]
 3: (FileJournal::Writer::entry()+0xd) [0x5de41d]
 4: (()+0x7e9a) [0x7fc042434e9a]
 5: (clone()+0x6d) [0x7fc0409e94bd]
 NOTE: a copy of the executable, or `objdump -rdS executable` is
needed to interpret this.

--- end dump of recent events ---
2012-06-20 12:42:46.693963 7fc0396cf700 -1 *** Caught signal (Aborted) **
 in thread 7fc0396cf700

 ceph version 0.47.2 (commit:8bf9fde89bd6ebc4b0645b2fe02dadb1c17ad372)
 1: /usr/bin/ceph-osd() [0x6eb32a]
 2: (()+0xfcb0) [0x7fc04243ccb0]
 3: (gsignal()+0x35) [0x7fc04092d445]
 4: (abort()+0x17b) [0x7fc040930bab]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fc04127b69d]
 6: (()+0xb5846) [0x7fc041279846]
 7: (()+0xb5873) [0x7fc041279873]
 8: (()+0xb596e) [0x7fc04127996e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x282) [0x79dd02]
 10: (FileJournal::do_write(ceph::buffer::list)+0xe22) [0x653082]
 11: (FileJournal::write_thread_entry()+0x735) [0x659545]
 12: (FileJournal::Writer::entry()+0xd) [0x5de41d]
 13: (()+0x7e9a) [0x7fc042434e9a]
 14: (clone()+0x6d) [0x7fc0409e94bd]
 NOTE: a copy of the executable, or `objdump -rdS executable` is
needed to interpret this.

--- begin dump of recent events ---
 0 2012-06-20 12:42:46.693963 7fc0396cf700 -1 *** Caught signal
(Aborted) **
 in thread 7fc0396cf700

 ceph version 0.47.2 (commit:8bf9fde89bd6ebc4b0645b2fe02dadb1c17ad372)
 1: /usr/bin/ceph-osd() [0x6eb32a]
 2: (()+0xfcb0) [0x7fc04243ccb0]
 3: (gsignal()+0x35) [0x7fc04092d445]
 4: (abort()+0x17b) [0x7fc040930bab]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fc04127b69d]
 6: (()+0xb5846) [0x7fc041279846]
 7: (()+0xb5873) [0x7fc041279873]
 8: (()+0xb596e) [0x7fc04127996e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x282) [0x79dd02]
 10: (FileJournal::do_write(ceph::buffer::list)+0xe22) [0x653082]
 11: (FileJournal::write_thread_entry()+0x735) [0x659545]
 12: (FileJournal::Writer::entry()+0xd) [0x5de41d]
 13: (()+0x7e9a) [0x7fc042434e9a]
 14: (clone()+0x6d) [0x7fc0409e94bd]
 NOTE: a copy of the executable, or `objdump -rdS executable` is
needed to interpret this.

--- end dump of recent events ---
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Should an OSD crash when journal device is out of space?

2012-06-20 Thread Matthew Roy
I hit this a couple times and wondered the same thing. Why does the
OSD need to bail when it runs out of journal space?
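For context, a simplified sketch of the failure path being discussed. This is
assumed code, not the actual FileJournal implementation -- only the function
names write_bl/do_write are taken from the trace quoted below, the bodies are
made up for illustration. The point is that a journal write error (such as
ENOSPC) is treated as unrecoverable, so the daemon asserts instead of
returning an error:

// Simplified sketch (assumed code, not Ceph's FileJournal): a journal
// writer that treats a failed write as fatal.
#include <cassert>
#include <cerrno>
#include <cstdio>
#include <cstring>
#include <fcntl.h>
#include <unistd.h>

// Keep writing until the buffer is fully persisted, or return -errno.
static int write_bl(int fd, const char* buf, size_t len, off_t pos) {
  while (len > 0) {
    ssize_t r = ::pwrite(fd, buf, len, pos);
    if (r < 0) {
      if (errno == EINTR)
        continue;
      std::fprintf(stderr, "journal write_bl: write failed: (%d) %s\n",
                   errno, std::strerror(errno)); // e.g. (28) No space left on device
      return -errno;
    }
    buf += r;
    len -= static_cast<size_t>(r);
    pos += r;
  }
  return 0;
}

// The journal is the only durable record of acknowledged writes, so if it
// cannot be persisted there is no safe way to keep going -- hence an assert
// (the same shape as the FAILED assert(0) in the trace) instead of an error
// return to the caller.
static void do_write(int fd, const char* buf, size_t len, off_t pos) {
  int r = write_bl(fd, buf, len, pos);
  assert(r == 0);
}

int main() {
  int fd = ::open("/tmp/journal.sketch", O_CREAT | O_WRONLY, 0600);
  assert(fd >= 0);
  const char entry[] = "journal entry";
  do_write(fd, entry, sizeof(entry), 0); // would abort if the device were full
  ::close(fd);
  return 0;
}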

On Wed, Jun 20, 2012 at 3:56 PM, Travis Rhoden trho...@gmail.com wrote:
 Not sure if this is a bug or not.  It was definitely user error -- but
 since the OSD process bailed, figured I would report it.

 I had /tmpfs mounted with 2.5GB of space:

 tmpfs on /tmpfs type tmpfs (rw,size=2560m)

 Then I decided to increase my journal size to 5G, but forgot to
 increase the limit on /tmpfs.  =)

 osd journal size = 5000


 Predictably, things didn't go well when I ran a rados bench that
 filled up the journal.  I'm not sure if such a case can be handled
 more gracefully:


    -4 2012-06-20 12:39:36.648773 7fc042a5f780  1 journal _open
 /tmpfs/osd.2.journal fd 30: 524288 bytes, block size 4096 bytes,
 directio = 0, aio = 0
    -3 2012-06-20 12:42:23.179164 7fc02e1ad700  1
 CephxAuthorizeHandler::verify_authorizer isvalid=1
    -2 2012-06-20 12:42:46.643205 7fc0396cf700 -1 journal
 FileJournal::write_bl : write_fd failed: (28) No space left on device
    -1 2012-06-20 12:42:46.643245 7fc0396cf700 -1 journal
 FileJournal::do_write: write_bl(pos=2678079488) failed
     0 2012-06-20 12:42:46.676991 7fc0396cf700 -1 os/FileJournal.cc:
 In function 'void FileJournal::do_write(ceph::bufferlist)' thread
 7fc0396cf700 time 2012-06-20 12:42:46.643315
 os/FileJournal.cc: 994: FAILED assert(0)

  ceph version 0.47.2 (commit:8bf9fde89bd6ebc4b0645b2fe02dadb1c17ad372)
  1: (FileJournal::do_write(ceph::buffer::list)+0xe22) [0x653082]
  2: (FileJournal::write_thread_entry()+0x735) [0x659545]
  3: (FileJournal::Writer::entry()+0xd) [0x5de41d]
  4: (()+0x7e9a) [0x7fc042434e9a]
  5: (clone()+0x6d) [0x7fc0409e94bd]
  NOTE: a copy of the executable, or `objdump -rdS executable` is
 needed to interpret this.

 --- end dump of recent events ---
 2012-06-20 12:42:46.693963 7fc0396cf700 -1 *** Caught signal (Aborted) **
  in thread 7fc0396cf700

  ceph version 0.47.2 (commit:8bf9fde89bd6ebc4b0645b2fe02dadb1c17ad372)
  1: /usr/bin/ceph-osd() [0x6eb32a]
  2: (()+0xfcb0) [0x7fc04243ccb0]
  3: (gsignal()+0x35) [0x7fc04092d445]
  4: (abort()+0x17b) [0x7fc040930bab]
  5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fc04127b69d]
  6: (()+0xb5846) [0x7fc041279846]
  7: (()+0xb5873) [0x7fc041279873]
  8: (()+0xb596e) [0x7fc04127996e]
  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
 const*)+0x282) [0x79dd02]
  10: (FileJournal::do_write(ceph::buffer::list)+0xe22) [0x653082]
  11: (FileJournal::write_thread_entry()+0x735) [0x659545]
  12: (FileJournal::Writer::entry()+0xd) [0x5de41d]
  13: (()+0x7e9a) [0x7fc042434e9a]
  14: (clone()+0x6d) [0x7fc0409e94bd]
  NOTE: a copy of the executable, or `objdump -rdS executable` is
 needed to interpret this.

 --- begin dump of recent events ---
     0 2012-06-20 12:42:46.693963 7fc0396cf700 -1 *** Caught signal
 (Aborted) **
  in thread 7fc0396cf700

  ceph version 0.47.2 (commit:8bf9fde89bd6ebc4b0645b2fe02dadb1c17ad372)
  1: /usr/bin/ceph-osd() [0x6eb32a]
  2: (()+0xfcb0) [0x7fc04243ccb0]
  3: (gsignal()+0x35) [0x7fc04092d445]
  4: (abort()+0x17b) [0x7fc040930bab]
  5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fc04127b69d]
  6: (()+0xb5846) [0x7fc041279846]
  7: (()+0xb5873) [0x7fc041279873]
  8: (()+0xb596e) [0x7fc04127996e]
  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
 const*)+0x282) [0x79dd02]
  10: (FileJournal::do_write(ceph::buffer::list)+0xe22) [0x653082]
  11: (FileJournal::write_thread_entry()+0x735) [0x659545]
  12: (FileJournal::Writer::entry()+0xd) [0x5de41d]
  13: (()+0x7e9a) [0x7fc042434e9a]
  14: (clone()+0x6d) [0x7fc0409e94bd]
  NOTE: a copy of the executable, or `objdump -rdS executable` is
 needed to interpret this.

 --- end dump of recent events ---
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: OSD crash

2012-06-18 Thread Stefan Priebe - Profihost AG

Am 17.06.2012 23:16, schrieb Sage Weil:

Hi Stefan,

I opened http://tracker.newdream.net/issues/2599 to track this, but the
dump strangely does not include the ceph version or commit sha1.  What
version were you running?
Sorry that was my build system it accidently removed the .git dir while 
builing so the version string couldn't be compiled in.


It was 5efaa8d7799347dfae38333b1fd6e1a87dc76b28

Stefan
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


OSD crash

2012-06-16 Thread Stefan Priebe

Hi,

today i got another osd crash ;-( Strangely the osd logs are all empty. 
It seems the logrotate hasn't reloaded the daemons but i still have the 
core dump file? What's next?


Stefan

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: OSD crash

2012-06-16 Thread Stefan Priebe

and another crash again ;-(


 0 2012-06-16 15:31:32.524369 7fd8935c4700 -1 ./common/Mutex.h: In 
function 'void Mutex::Lock(bool)' thread 7fd8935c4700 time 2012-06-16 
15:31:32.522446

./common/Mutex.h: 110: FAILED assert(r == 0)

 ceph version  (commit:)
 1: /usr/bin/ceph-osd() [0x51a07d]
 2: (ReplicatedPG::C_OSD_OndiskWriteUnlock::finish(int)+0x2a) [0x579c5a]
 3: (FileStore::_finish_op(FileStore::OpSequencer*)+0x2e4) [0x684374]
 4: (ThreadPool::worker()+0xbb7) [0x7bc087]
 5: (ThreadPool::WorkThread::entry()+0xd) [0x5f144d]
 6: (()+0x68ca) [0x7fd89db3a8ca]
 7: (clone()+0x6d) [0x7fd89c1bec0d]
 NOTE: a copy of the executable, or `objdump -rdS executable` is 
needed to interpret this.


--- end dump of recent events ---
2012-06-16 15:31:32.531567 7fd8935c4700 -1 *** Caught signal (Aborted) **
 in thread 7fd8935c4700

 ceph version  (commit:)
 1: /usr/bin/ceph-osd() [0x70e4b9]
 2: (()+0xeff0) [0x7fd89db42ff0]
 3: (gsignal()+0x35) [0x7fd89c121225]
 4: (abort()+0x180) [0x7fd89c124030]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7fd89c9b5dc5]
 6: (()+0xcb166) [0x7fd89c9b4166]
 7: (()+0xcb193) [0x7fd89c9b4193]
 8: (()+0xcb28e) [0x7fd89c9b428e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x940) [0x78af20]

 10: /usr/bin/ceph-osd() [0x51a07d]
 11: (ReplicatedPG::C_OSD_OndiskWriteUnlock::finish(int)+0x2a) [0x579c5a]
 12: (FileStore::_finish_op(FileStore::OpSequencer*)+0x2e4) [0x684374]
 13: (ThreadPool::worker()+0xbb7) [0x7bc087]
 14: (ThreadPool::WorkThread::entry()+0xd) [0x5f144d]
 15: (()+0x68ca) [0x7fd89db3a8ca]
 16: (clone()+0x6d) [0x7fd89c1bec0d]
 NOTE: a copy of the executable, or `objdump -rdS executable` is 
needed to interpret this.


--- begin dump of recent events ---
 0 2012-06-16 15:31:32.531567 7fd8935c4700 -1 *** Caught signal 
(Aborted) **

 in thread 7fd8935c4700

 ceph version  (commit:)
 1: /usr/bin/ceph-osd() [0x70e4b9]
 2: (()+0xeff0) [0x7fd89db42ff0]
 3: (gsignal()+0x35) [0x7fd89c121225]
 4: (abort()+0x180) [0x7fd89c124030]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7fd89c9b5dc5]
 6: (()+0xcb166) [0x7fd89c9b4166]
 7: (()+0xcb193) [0x7fd89c9b4193]
 8: (()+0xcb28e) [0x7fd89c9b428e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x940) [0x78af20]

 10: /usr/bin/ceph-osd() [0x51a07d]
 11: (ReplicatedPG::C_OSD_OndiskWriteUnlock::finish(int)+0x2a) [0x579c5a]
 12: (FileStore::_finish_op(FileStore::OpSequencer*)+0x2e4) [0x684374]
 13: (ThreadPool::worker()+0xbb7) [0x7bc087]
 14: (ThreadPool::WorkThread::entry()+0xd) [0x5f144d]
 15: (()+0x68ca) [0x7fd89db3a8ca]
 16: (clone()+0x6d) [0x7fd89c1bec0d]
 NOTE: a copy of the executable, or `objdump -rdS executable` is 
needed to interpret this.


--- end dump of recent events ---

Am 16.06.2012 14:57, schrieb Stefan Priebe:

Hi,

today i got another osd crash ;-( Strangely the osd logs are all empty.
It seems the logrotate hasn't reloaded the daemons but i still have the
core dump file? What's next?

Stefan



--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


domino-style OSD crash

2012-06-04 Thread Yann Dupont

Hello,
Besides the performance inconsistency (see other thread titled poor OSD 
performance using kernel 3.4) where I promised some tests (will run this 
afternoon), we tried this week-end to stress test ceph, making backups 
with bacula on a rbd volume of 15T (8 osd nodes, using 8 physical machines)


Results : Worked like a charm during two days, apart btrfs warn messages 
then OSD begin to crash 1 after all 'domino style'.


This morning, only 2 OSD of 8 are left.

1 of the physical machine was in kernel oops state - Nothing was remote 
logged, don't know what happened, there were no clear stack message. I 
suspect btrfs , but I have no proof.


This node (OSD.7) seems to have been the 1st one to crash, generated 
reconstruction between OSD  then lead to the cascade osd crash.


The other physical machines are still up, but with no osd running. here 
are some trace found in osd log :


   -3 2012-06-03 12:43:32.524671 7ff1352b8700  0 log [WRN] : slow request 30.506952 seconds old, received at 2012-06-03 12:43:01.997386: osd_sub_op(osd.0.0:1842628 2.57 ea8d5657/label5_17606_object7068/head [push] v 191'628 snapset=0=[]:[] snapc=0=[]) v6 currently queued for pg
   -2 2012-06-03 12:44:32.869852 7ff1352b8700  0 log [WRN] : 1 slow requests, 1 included below; oldest blocked for 30.073136 secs
   -1 2012-06-03 12:44:32.869886 7ff1352b8700  0 log [WRN] : slow request 30.073136 seconds old, received at 2012-06-03 12:44:02.796651: osd_sub_op(osd.6.0:1837430 2.59 97e62059/rb.0.1.000a2cdf/head [push] v 1438'9416 snapset=0=[]:[] snapc=0=[]) v6 currently started
    0 2012-06-03 12:55:33.088034 7ff1237f6700 -1 *** Caught signal (Aborted) **
 in thread 7ff1237f6700

 ceph version 0.47.2 (commit:8bf9fde89bd6ebc4b0645b2fe02dadb1c17ad372)
 1: /usr/bin/ceph-osd() [0x708ea9]
 2: (()+0xeff0) [0x7ff13af2cff0]
 3: (gsignal()+0x35) [0x7ff13950b1b5]
 4: (abort()+0x180) [0x7ff13950dfc0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7ff139d9fdc5]
 6: (()+0xcb166) [0x7ff139d9e166]
 7: (()+0xcb193) [0x7ff139d9e193]
 8: (()+0xcb28e) [0x7ff139d9e28e]
 9: (std::__throw_length_error(char const*)+0x67) [0x7ff139d39307]
 10: (std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&)+0x72) [0x7ff139d7ab42]
 11: (()+0xa8565) [0x7ff139d7b565]
 12: (std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(char const*, unsigned long, std::allocator<char> const&)+0x1b) [0x7ff139d7b7ab]
 13: (leveldb::InternalKeyComparator::FindShortestSeparator(std::string*, leveldb::Slice const&) const+0x4d) [0x6ef69d]
 14: (leveldb::TableBuilder::Add(leveldb::Slice const&, leveldb::Slice const&)+0x9f) [0x6fdd9f]
 15: (leveldb::DBImpl::DoCompactionWork(leveldb::DBImpl::CompactionState*)+0x4d3) [0x6eaba3]

 16: (leveldb::DBImpl::BackgroundCompaction()+0x222) [0x6ebb02]
 17: (leveldb::DBImpl::BackgroundCall()+0x68) [0x6ec378]
 18: /usr/bin/ceph-osd() [0x704981]
 19: (()+0x68ca) [0x7ff13af248ca]
 20: (clone()+0x6d) [0x7ff1395a892d]
 NOTE: a copy of the executable, or `objdump -rdS executable` is 
needed to interpret this.


2 OSD exhibit similar traces.

---

4 other had traces like this one :

-5 2012-06-03 13:31:39.393489 7f74fd9c7700 -1 osd.3 1513 
heartbeat_check: no reply from osd.5 sin

ce 2012-06-03 13:31:18.459792 (cutoff 2012-06-03 13:31:19.393488)
-4 2012-06-03 13:31:40.393689 7f74fd9c7700 -1 osd.3 1513 
heartbeat_check: no reply from osd.5 sin

ce 2012-06-03 13:31:18.459792 (cutoff 2012-06-03 13:31:20.393687)
-3 2012-06-03 13:31:41.402873 7f74fd9c7700 -1 osd.3 1513 
heartbeat_check: no reply from osd.5 sin

ce 2012-06-03 13:31:18.459792 (cutoff 2012-06-03 13:31:21.402872)
-2 2012-06-03 13:31:42.363270 7f74f08ac700 -1 osd.3 1513 
heartbeat_check: no reply from osd.5 sin

ce 2012-06-03 13:31:18.459792 (cutoff 2012-06-03 13:31:22.363269)
-1 2012-06-03 13:31:42.416968 7f74fd9c7700 -1 osd.3 1513 
heartbeat_check: no reply from osd.5 sin

ce 2012-06-03 13:31:18.459792 (cutoff 2012-06-03 13:31:22.416966)
 0 2012-06-03 13:36:48.147020 7f74f58b6700 -1 osd/PG.cc: In 
function 'void PG::merge_log(ObjectStore::Transaction, pg_info_t, 
pg_log_t, int)' thread 7f74f58b6700 time 2012-06-03 13:36:48.100157
osd/PG.cc: 402: FAILED assert(log.head >= olog.tail && olog.head >= log.tail)


 ceph version 0.47.2 (commit:8bf9fde89bd6ebc4b0645b2fe02dadb1c17ad372)
 1: (PG::merge_log(ObjectStore::Transaction, pg_info_t, pg_log_t, 
int)+0x1eae) [0x649cce]
 2: (PG::RecoveryState::Stray::react(PG::RecoveryState::MLogRec 
const)+0x2b1) [0x649fc1]
 3: (boost::statechart::simple_statePG::RecoveryState::Stray, 
PG::RecoveryState::Started, boost::mpl::listmpl_::na, mpl_::na, 
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, 
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, 
mpl_::na, mpl_::na, mpl_::na, mpl_::na, 
(boost::statechart::history_mode)0::react_impl(boost::statechart::event_base 
const, void const*)+0x203) [0x660343]
 4: 
(boost

Re: domino-style OSD crash

2012-06-04 Thread Tommi Virtanen
On Mon, Jun 4, 2012 at 1:44 AM, Yann Dupont yann.dup...@univ-nantes.fr wrote:
 Results : Worked like a charm during two days, apart btrfs warn messages
 then OSD begin to crash 1 after all 'domino style'.

Sorry to hear that. Reading through your message, there seem to be
several problems; whether they are because of the same root cause, I
can't tell.

Quick triage to benefit the other devs:

#1: kernel crash, no details available
 1 of the physical machine was in kernel oops state - Nothing was remote

#2: leveldb corruption? may be memory corruption that started
elsewhere.. Sam, does this look like the leveldb issue you saw?
  [push] v 1438'9416 snapset=0=[]:[] snapc=0=[]) v6 currently started
     0 2012-06-03 12:55:33.088034 7ff1237f6700 -1 *** Caught signal
 (Aborted) **
...
  13: (leveldb::InternalKeyComparator::FindShortestSeparator(std::string*, leveldb::Slice const&) const+0x4d) [0x6ef69d]
  14: (leveldb::TableBuilder::Add(leveldb::Slice const&, leveldb::Slice const&)+0x9f) [0x6fdd9f]

#3: PG::merge_log assertion while recovering from the above; Sam, any ideas?
     0 2012-06-03 13:36:48.147020 7f74f58b6700 -1 osd/PG.cc: In function
 'void PG::merge_log(ObjectStore::Transaction, pg_info_t, pg_log_t, int)'
 thread 7f74f58b6700 time 2012-06-03 13:36:48.100157
 osd/PG.cc: 402: FAILED assert(log.head >= olog.tail && olog.head >= log.tail)

#4: unknown btrfs warnings, there should an actual message above this
traceback; believed fixed in latest kernel
 Jun  2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479278]
 [a026fca5] ? btrfs_orphan_commit_root+0x105/0x110 [btrfs]
 Jun  2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479328]
 [a026965a] ? commit_fs_roots.isra.22+0xaa/0x170 [btrfs]
 Jun  2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479379]
 [a02bc9a0] ? btrfs_scrub_pause+0xf0/0x100 [btrfs]
 Jun  2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479415]
 [a026a6f1] ? btrfs_commit_transaction+0x521/0x9d0 [btrfs]
 Jun  2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479460]
 [8105a9f0] ? add_wait_queue+0x60/0x60
 Jun  2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479493]
 [a026aba0] ? btrfs_commit_transaction+0x9d0/0x9d0 [btrfs]
 Jun  2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479543]
 [a026abb1] ? do_async_commit+0x11/0x20 [btrfs]
 Jun  2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479572]
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: domino-style OSD crash

2012-06-04 Thread Sam Just
Can you send the osd logs?  The merge_log crashes are probably fixable
if I can see the logs.

The leveldb crash is almost certainly a result of memory corruption.

Thanks
-Sam

On Mon, Jun 4, 2012 at 9:16 AM, Tommi Virtanen t...@inktank.com wrote:
 On Mon, Jun 4, 2012 at 1:44 AM, Yann Dupont yann.dup...@univ-nantes.fr 
 wrote:
 Results : Worked like a charm during two days, apart btrfs warn messages
 then OSD begin to crash 1 after all 'domino style'.

 Sorry to hear that. Reading through your message, there seem to be
 several problems; whether they are because of the same root cause, I
 can't tell.

 Quick triage to benefit the other devs:

 #1: kernel crash, no details available
 1 of the physical machine was in kernel oops state - Nothing was remote

 #2: leveldb corruption? may be memory corruption that started
 elsewhere.. Sam, does this look like the leveldb issue you saw?
  [push] v 1438'9416 snapset=0=[]:[] snapc=0=[]) v6 currently started
     0 2012-06-03 12:55:33.088034 7ff1237f6700 -1 *** Caught signal
 (Aborted) **
 ...
  13: (leveldb::InternalKeyComparator::FindShortestSeparator(std::string*, leveldb::Slice const&) const+0x4d) [0x6ef69d]
  14: (leveldb::TableBuilder::Add(leveldb::Slice const&, leveldb::Slice const&)+0x9f) [0x6fdd9f]

 #3: PG::merge_log assertion while recovering from the above; Sam, any ideas?
     0 2012-06-03 13:36:48.147020 7f74f58b6700 -1 osd/PG.cc: In function
 'void PG::merge_log(ObjectStore::Transaction, pg_info_t, pg_log_t, int)'
 thread 7f74f58b6700 time 2012-06-03 13:36:48.100157
  osd/PG.cc: 402: FAILED assert(log.head >= olog.tail && olog.head >= log.tail)

 #4: unknown btrfs warnings, there should an actual message above this
 traceback; believed fixed in latest kernel
 Jun  2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479278]
 [a026fca5] ? btrfs_orphan_commit_root+0x105/0x110 [btrfs]
 Jun  2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479328]
 [a026965a] ? commit_fs_roots.isra.22+0xaa/0x170 [btrfs]
 Jun  2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479379]
 [a02bc9a0] ? btrfs_scrub_pause+0xf0/0x100 [btrfs]
 Jun  2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479415]
 [a026a6f1] ? btrfs_commit_transaction+0x521/0x9d0 [btrfs]
 Jun  2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479460]
 [8105a9f0] ? add_wait_queue+0x60/0x60
 Jun  2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479493]
 [a026aba0] ? btrfs_commit_transaction+0x9d0/0x9d0 [btrfs]
 Jun  2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479543]
 [a026abb1] ? do_async_commit+0x11/0x20 [btrfs]
 Jun  2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479572]
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: domino-style OSD crash

2012-06-04 Thread Greg Farnum
This is probably the same/similar to http://tracker.newdream.net/issues/2462, 
no? There's a log there, though I've no idea how helpful it is.


On Monday, June 4, 2012 at 10:40 AM, Sam Just wrote:

 Can you send the osd logs? The merge_log crashes are probably fixable
 if I can see the logs.
 
 The leveldb crash is almost certainly a result of memory corruption.
 
 Thanks
 -Sam
 
 On Mon, Jun 4, 2012 at 9:16 AM, Tommi Virtanen t...@inktank.com 
 (mailto:t...@inktank.com) wrote:
  On Mon, Jun 4, 2012 at 1:44 AM, Yann Dupont yann.dup...@univ-nantes.fr 
  (mailto:yann.dup...@univ-nantes.fr) wrote:
   Results : Worked like a charm during two days, apart btrfs warn messages
   then OSD begin to crash 1 after all 'domino style'.
  
  
  
  Sorry to hear that. Reading through your message, there seem to be
  several problems; whether they are because of the same root cause, I
  can't tell.
  
  Quick triage to benefit the other devs:
  
  #1: kernel crash, no details available
   1 of the physical machine was in kernel oops state - Nothing was remote
  
  
  
  #2: leveldb corruption? may be memory corruption that started
  elsewhere.. Sam, does this look like the leveldb issue you saw?
   [push] v 1438'9416 snapset=0=[]:[] snapc=0=[]) v6 currently started
   0 2012-06-03 12:55:33.088034 7ff1237f6700 -1 *** Caught signal
   (Aborted) **
  
  
  ...
   13: (leveldb::InternalKeyComparator::FindShortestSeparator(std::string*, leveldb::Slice const&) const+0x4d) [0x6ef69d]
   14: (leveldb::TableBuilder::Add(leveldb::Slice const&, leveldb::Slice const&)+0x9f) [0x6fdd9f]
  
  
  
  #3: PG::merge_log assertion while recovering from the above; Sam, any ideas?
   0 2012-06-03 13:36:48.147020 7f74f58b6700 -1 osd/PG.cc (http://PG.cc): 
   In function
   'void PG::merge_log(ObjectStore::Transaction, pg_info_t, pg_log_t, 
   int)'
   thread 7f74f58b6700 time 2012-06-03 13:36:48.100157
    osd/PG.cc (http://PG.cc): 402: FAILED assert(log.head >= olog.tail && olog.head >= log.tail)
  
  
  
  #4: unknown btrfs warnings, there should an actual message above this
  traceback; believed fixed in latest kernel
   Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479278]
   [a026fca5] ? btrfs_orphan_commit_root+0x105/0x110 [btrfs]
   Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479328]
   [a026965a] ? commit_fs_roots.isra.22+0xaa/0x170 [btrfs]
   Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479379]
   [a02bc9a0] ? btrfs_scrub_pause+0xf0/0x100 [btrfs]
   Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479415]
   [a026a6f1] ? btrfs_commit_transaction+0x521/0x9d0 [btrfs]
   Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479460]
   [8105a9f0] ? add_wait_queue+0x60/0x60
   Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479493]
   [a026aba0] ? btrfs_commit_transaction+0x9d0/0x9d0 [btrfs]
   Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479543]
   [a026abb1] ? do_async_commit+0x11/0x20 [btrfs]
   Jun 2 23:40:03 chichibu.u14.univ-nantes.prive kernel: [200652.479572]
  
  
  --
  To unsubscribe from this list: send the line unsubscribe ceph-devel in
  the body of a message to majord...@vger.kernel.org 
  (mailto:majord...@vger.kernel.org)
  More majordomo info at http://vger.kernel.org/majordomo-info.html
 
 
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org 
 (mailto:majord...@vger.kernel.org)
 More majordomo info at http://vger.kernel.org/majordomo-info.html



--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Problem after ceph-osd crash

2012-02-20 Thread Sage Weil
On Mon, 20 Feb 2012, Oliver Francke wrote:
 Hi Sage,
 
 On 02/20/2012 06:41 PM, Sage Weil wrote:
  On Mon, 20 Feb 2012, Oliver Francke wrote:
   Hi,
   
   we are just in trouble after some mess with trying to include a new
   OSD-node
   into our cluster.
   
   We get some weird libceph: corrupt inc osdmap epoch 880 off 102
   (c9001db8990a of c9001db898a4-c9001db89dae)

I just retested the kernel client against the new server code and I don't 
see this.  If you can pull the osdmap/880 file from the monitor data 
directory (soon, please, the monitor will delete it once things fully 
recover and move on) I can see what the data looks like.

   
   on the console.
   The whole system is in a state ala:
   
   012-02-20 17:56:27.585295pg v942504: 2046 pgs: 1348 active+clean, 43
   active+recovering+degraded+remapped+backfill, 218 active+recovering, 437
   active+recovering+remapped+backfill; 1950 GB data, 3734 GB used, 26059 GB
   /
   29794 GB avail; 272914/1349073 degraded (20.230%)
   
   and sometimes the ceph-osd on node0 is crashing. At the moment of writing,
   the
   degrading continues to shrink down below 20%.
  How did ceph-osd crash?  Is there a dump in the log?
 
 'course I will provide all logs, uhm, a bit later, we are busy to start all
 VM's, and handle first customer-tickets right now ;-)

 To be most complete for the collection, would you be so kind to give a 
 list of all necessary kern.log osdX.log etc.?

I think just the crashed osd log will be enough.  It looks like the rest 
of the cluster is recovering ok...

Are the VMs running on top of the kernel rbd client, or KVM+librbd?

sage


 
 Thnx for the fast reaction,
 
 Oliver.
 
  sage
  
   Any clues?
   
   Thnx in @vance,
   
   Oliver.
   
   -- 
   
   Oliver Francke
   
   filoo GmbH
   Moltkestraße 25a
   0 Gütersloh
   HRB4355 AG Gütersloh
   
   Geschäftsführer: S.Grewing | J.Rehpöhler | C.Kunz
   
   Folgen Sie uns auf Twitter: http://twitter.com/filoogmbh
   
   --
   To unsubscribe from this list: send the line unsubscribe ceph-devel in
   the body of a message to majord...@vger.kernel.org
   More majordomo info at  http://vger.kernel.org/majordomo-info.html
   
 
 
 -- 
 
 Oliver Francke
 
 filoo GmbH
 Moltkestraße 25a
 0 Gütersloh
 HRB4355 AG Gütersloh
 
 Geschäftsführer: S.Grewing | J.Rehpöhler | C.Kunz
 
 Folgen Sie uns auf Twitter: http://twitter.com/filoogmbh
 
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 
 

Re: osd crash during resync

2012-01-26 Thread Martin Mailand

Hi Sage,
I uploaded the osd.0 log as well.

http://85.214.49.87/ceph/20120124/osd.0.log.bz2

-martin

Am 25.01.2012 23:08, schrieb Sage Weil:

Hi Martin,

On Tue, 24 Jan 2012, Martin Mailand wrote:

Hi,
today I tried the btrfs patch mentioned on the btrfs ml. Therefore I rebooted
osd.0 with a new kernel and created a new btrfs on the osd.0, than I took the
osd.0 into the cluster. During the the resync of osd.0 osd.2 and osd.3
crashed.
I am not sure, if the crashes happened because I played with osd.0, or if they
are bugs.


osd.2
-rw---  1 root root 1.1G 2012-01-24 12:19
core-ceph-osd-1000-1327403927-s-brick-002

log:
2012-01-24 12:15:45.563135 7f1fdd42c700 log [INF] : 2.a restarting backfill on
osd.0 from (185'113859,185'113859] 0//0 to 196'114038
osd/PG.cc: In function 'void PG::finish_recovery_op(const hobject_t&, bool)',
in thread '7f1fdab26700'
osd/PG.cc: 1553: FAILED assert(recovery_ops_active > 0)

-rw---  1 root root 758M 2012-01-24 15:58
core-ceph-osd-20755-1327417128-s-brick-002


Can you post the log for osd.0 too?

Thanks!
sage





log:
2012-01-24 15:58:48.356892 7fe26acbf700 osd.2 379 pg[2.ff( v 379'286211 lc
202'286160 (185'285159,379'286211] n=112 ec=1 les/c 379/310 373/376/376) [2,1]
r=0 lpr=376 rops=1 mlcod 202'286160 active m=6]  * oi->watcher: client.4478
cookie=1
osd/ReplicatedPG.cc: In function 'void
ReplicatedPG::populate_obc_watchers(ReplicatedPG::ObjectContext*)', in thread
'7fe26fdca700'
osd/ReplicatedPG.cc: 3199: FAILED assert(obc->watchers.size() == 0)
osd/ReplicatedPG.cc: In function 'void
ReplicatedPG::populate_obc_watchers(ReplicatedPG::ObjectContext*)', in thread
'7fe26fdca700'

http://85.214.49.87/ceph/20120124/osd.2.log.bz2



osd.3
-rw---  1 root root 986M 2012-01-24 12:24
core-ceph-osd-962-1327404263-s-brick-003

log:
2012-01-24 12:15:50.241321 7f30c8fde700 log [INF] : 2.2e restarting backfill
on osd.0 from (185'338312,185'338312] 0//0 to 196'339910
2012-01-24 12:21:48.420242 7f30c5ed7700 log [INF] : 2.9d scrub ok
osd/PG.cc: In function 'void PG::activate(ObjectStore::Transaction&, std::list<Context*>&, std::map<int, std::map<pg_t, PG::Query> >&, std::map<int, MOSDPGInfo*>*)', in thread '7f30c8fde700'

http://85.214.49.87/ceph/20120124/osd.3.log.bz2



-martin


--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html




--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: osd crash during resync

2012-01-25 Thread Sage Weil
Hi Martin,

On Tue, 24 Jan 2012, Martin Mailand wrote:
 Hi,
 today I tried the btrfs patch mentioned on the btrfs ml. Therefore I rebooted
 osd.0 with a new kernel and created a new btrfs on the osd.0, than I took the
 osd.0 into the cluster. During the the resync of osd.0 osd.2 and osd.3
 crashed.
 I am not sure, if the crashes happened because I played with osd.0, or if they
 are bugs.
 
 
 osd.2
 -rw---  1 root root 1.1G 2012-01-24 12:19
 core-ceph-osd-1000-1327403927-s-brick-002
 
 log:
 2012-01-24 12:15:45.563135 7f1fdd42c700 log [INF] : 2.a restarting backfill on
 osd.0 from (185'113859,185'113859] 0//0 to 196'114038
 osd/PG.cc: In function 'void PG::finish_recovery_op(const hobject_t&, bool)',
 in thread '7f1fdab26700'
 osd/PG.cc: 1553: FAILED assert(recovery_ops_active > 0)
 
 -rw---  1 root root 758M 2012-01-24 15:58
 core-ceph-osd-20755-1327417128-s-brick-002

Can you post the log for osd.0 too?

Thanks!
sage



 
 log:
 2012-01-24 15:58:48.356892 7fe26acbf700 osd.2 379 pg[2.ff( v 379'286211 lc
 202'286160 (185'285159,379'286211] n=112 ec=1 les/c 379/310 373/376/376) [2,1]
 r=0 lpr=376 rops=1 mlcod 202'286160 active m=6]  * oi->watcher: client.4478
 cookie=1
 osd/ReplicatedPG.cc: In function 'void
 ReplicatedPG::populate_obc_watchers(ReplicatedPG::ObjectContext*)', in thread
 '7fe26fdca700'
 osd/ReplicatedPG.cc: 3199: FAILED assert(obc->watchers.size() == 0)
 osd/ReplicatedPG.cc: In function 'void
 ReplicatedPG::populate_obc_watchers(ReplicatedPG::ObjectContext*)', in thread
 '7fe26fdca700'
 
 http://85.214.49.87/ceph/20120124/osd.2.log.bz2
 
 
 
 osd.3
 -rw---  1 root root 986M 2012-01-24 12:24
 core-ceph-osd-962-1327404263-s-brick-003
 
 log:
 2012-01-24 12:15:50.241321 7f30c8fde700 log [INF] : 2.2e restarting backfill
 on osd.0 from (185'338312,185'338312] 0//0 to 196'339910
 2012-01-24 12:21:48.420242 7f30c5ed7700 log [INF] : 2.9d scrub ok
 osd/PG.cc: In function 'void PG::activate(ObjectStore::Transaction&, std::list<Context*>&, std::map<int, std::map<pg_t, PG::Query> >&, std::map<int, MOSDPGInfo*>*)', in thread '7f30c8fde700'
 
 http://85.214.49.87/ceph/20120124/osd.3.log.bz2
 
 
 
 -martin
 
 
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 
 
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: osd crash during resync

2012-01-24 Thread Gregory Farnum
On Tue, Jan 24, 2012 at 10:48 AM, Martin Mailand mar...@tuxadero.com wrote:
 Hi,
 today I tried the btrfs patch mentioned on the btrfs ml. Therefore I
 rebooted osd.0 with a new kernel and created a new btrfs on the osd.0, than
 I took the osd.0 into the cluster. During the the resync of osd.0 osd.2 and
 osd.3 crashed.
 I am not sure, if the crashes happened because I played with osd.0, or if
 they are bugs.

These are OSD-level issues not caused by btrfs, so your new kernel
definitely didn't do it. It's probably fallout from the backfill
changes that got merged in last week. I created new bugs to track
them: http://tracker.newdream.net/issues/1982 (1983, 1984). Sam and
Josh are going wild on some other issues that we've turned up and
these have been added to the queue as soon as somebody qualified can
get to them. :)
-Greg
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: osd crash during resync

2012-01-24 Thread Martin Mailand

Hi Greg,
ok, do you guys still need the core files, or could I delete them?

-martin

Am 24.01.2012 22:13, schrieb Gregory Farnum:

On Tue, Jan 24, 2012 at 10:48 AM, Martin Mailandmar...@tuxadero.com  wrote:

Hi,
today I tried the btrfs patch mentioned on the btrfs ml. Therefore I
rebooted osd.0 with a new kernel and created a new btrfs on the osd.0, than
I took the osd.0 into the cluster. During the the resync of osd.0 osd.2 and
osd.3 crashed.
I am not sure, if the crashes happened because I played with osd.0, or if
they are bugs.


These are OSD-level issues not caused by btrfs, so your new kernel
definitely didn't do it. It's probably fallout from the backfill
changes that got merged in last week. I created new bugs to track
them: http://tracker.newdream.net/issues/1982 (1983, 1984). Sam and
Josh are going wild on some other issues that we've turned up and
these have been added to the queue as soon as somebody qualified can
get to them. :)
-Greg
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: osd crash during resync

2012-01-24 Thread Gregory Farnum
On Tue, Jan 24, 2012 at 1:22 PM, Martin Mailand mar...@tuxadero.com wrote:
 Hi Greg,
 ok, do you guys still need the core files, or could I delete them?

Sam thinks probably not since we have the backtraces and the
logs...thanks for asking, though! :)
-Greg
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: OSD crash

2011-05-27 Thread Gregory Farnum
This is an interesting one -- the invariant that assert is checking
isn't too complicated (that the object lives on the RecoveryWQ's
queue) and seems to hold everywhere the RecoveryWQ is called. And the
functions modifying the queue are always called under the workqueue
lock, and do maintenance if the xlist::item is on a different list.
Which makes me think that the problem must be from conflating the
RecoveryWQ lock and the PG lock in the few places that modify the
PG::recovery_item directly, rather than via RecoveryWQ functions.
Anybody more familiar than me with this have ideas?
Fyodor, based on the time stamps and output you've given us, I assume
you don't have more detailed logs?
-Greg
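
To make the invariant concrete, here is a minimal sketch of the kind of check
that assert performs. This is assumed, simplified code, not the actual Ceph
xlist -- the name xlist_sketch and its members are made up -- but it shows why
removing an item whose back-pointer no longer matches the owning list trips
the assert:

#include <cassert>

// Minimal sketch of an intrusive list whose items carry a back-pointer to
// the list that owns them (assumed code for illustration, not Ceph's xlist).
template <typename T>
class xlist_sketch {
public:
  struct item {
    T value;
    xlist_sketch* _list = nullptr; // which list currently owns this item
    item* prev = nullptr;
    item* next = nullptr;
    explicit item(T v) : value(v) {}
  };

  void push_back(item* i) {
    assert(i->_list == nullptr);   // must not already sit on another list
    i->_list = this;
    i->prev = tail;
    i->next = nullptr;
    (tail ? tail->next : head) = i;
    tail = i;
  }

  void remove(item* i) {
    assert(i->_list == this);      // the invariant that fired in the backtrace
    (i->prev ? i->prev->next : head) = i->next;
    (i->next ? i->next->prev : tail) = i->prev;
    i->_list = nullptr;
    i->prev = i->next = nullptr;
  }

  item* pop_front() {
    item* i = head;
    if (i)
      remove(i);
    return i;
  }

private:
  item* head = nullptr;
  item* tail = nullptr;
};

int main() {
  xlist_sketch<int> recovery_queue;
  xlist_sketch<int>::item pg(42);
  recovery_queue.push_back(&pg);
  recovery_queue.pop_front();      // fine: pg._list == &recovery_queue here
  // If another code path had already moved `pg` onto a different list (or
  // cleared _list) without holding the same lock, remove() would abort --
  // which is what the _dequeue()/pop_front() backtrace shows.
  return 0;
}

The suspicion above -- touching PG::recovery_item under the PG lock rather
than the RecoveryWQ lock -- is exactly the kind of unlocked move that would
leave the back-pointer pointing at a different list when _dequeue() runs.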

On Thu, May 26, 2011 at 5:12 PM, Fyodor Ustinov u...@ufm.su wrote:
 Hi!

 2011-05-27 02:35:22.046798 7fa8ff058700 journal check_for_full at 837623808
 : JOURNAL FULL 837623808 = 147455 (max_size 996147200 start 837771264)
 2011-05-27 02:35:23.479379 7fa8f7f49700 journal throttle: waited for bytes
 2011-05-27 02:35:34.730418 7fa8ff058700 journal check_for_full at 836984832
 : JOURNAL FULL 836984832 = 638975 (max_size 996147200 start 837623808)
 2011-05-27 02:35:36.050384 7fa8f7f49700 journal throttle: waited for bytes
 2011-05-27 02:35:47.226789 7fa8ff058700 journal check_for_full at 836882432
 : JOURNAL FULL 836882432 = 102399 (max_size 996147200 start 836984832)
 2011-05-27 02:35:48.937259 7fa8f874a700 journal throttle: waited for bytes
 2011-05-27 02:35:59.985040 7fa8ff058700 journal check_for_full at 836685824
 : JOURNAL FULL 836685824 = 196607 (max_size 996147200 start 836882432)
 2011-05-27 02:36:01.654955 7fa8f874a700 journal throttle: waited for bytes
 2011-05-27 02:36:12.362896 7fa8ff058700 journal check_for_full at 835723264
 : JOURNAL FULL 835723264 = 962559 (max_size 996147200 start 836685824)
 2011-05-27 02:36:14.375435 7fa8f7f49700 journal throttle: waited for bytes
./include/xlist.h: In function 'void xlist<T>::remove(xlist<T>::item*) [with T = PG*]', in thread '0x7fa8f7748700'
./include/xlist.h: 107: FAILED assert(i->_list == this)
  ceph version 0.28.1 (commit:d66c6ca19bbde3c363b135b66072de44e67c6632)
 1: (xlist<PG*>::pop_front()+0xbb) [0x54f28b]
  2: (OSD::RecoveryWQ::_dequeue()+0x73) [0x56bcc3]
  3: (ThreadPool::worker()+0x10a) [0x65799a]
  4: (ThreadPool::WorkThread::entry()+0xd) [0x548c8d]
  5: (()+0x6d8c) [0x7fa904294d8c]
  6: (clone()+0x6d) [0x7fa90314704d]
  ceph version 0.28.1 (commit:d66c6ca19bbde3c363b135b66072de44e67c6632)
  1: (xlist<PG*>::pop_front()+0xbb) [0x54f28b]
  2: (OSD::RecoveryWQ::_dequeue()+0x73) [0x56bcc3]
  3: (ThreadPool::worker()+0x10a) [0x65799a]
  4: (ThreadPool::WorkThread::entry()+0xd) [0x548c8d]
  5: (()+0x6d8c) [0x7fa904294d8c]
  6: (clone()+0x6d) [0x7fa90314704d]
 *** Caught signal (Aborted) **
  in thread 0x7fa8f7748700
  ceph version 0.28.1 (commit:d66c6ca19bbde3c363b135b66072de44e67c6632)
  1: /usr/bin/cosd() [0x6729f9]
  2: (()+0xfc60) [0x7fa90429dc60]
  3: (gsignal()+0x35) [0x7fa903094d05]
  4: (abort()+0x186) [0x7fa903098ab6]
  5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7fa90394b6dd]
  6: (()+0xb9926) [0x7fa903949926]
  7: (()+0xb9953) [0x7fa903949953]
  8: (()+0xb9a5e) [0x7fa903949a5e]
  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
 const*)+0x362) [0x655e32]
  10: (xlist<PG*>::pop_front()+0xbb) [0x54f28b]
  11: (OSD::RecoveryWQ::_dequeue()+0x73) [0x56bcc3]
  12: (ThreadPool::worker()+0x10a) [0x65799a]
  13: (ThreadPool::WorkThread::entry()+0xd) [0x548c8d]
  14: (()+0x6d8c) [0x7fa904294d8c]
  15: (clone()+0x6d) [0x7fa90314704d]

 WBR,
    Fyodor.
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: OSD crash

2011-05-27 Thread Fyodor Ustinov

On 05/27/2011 06:16 PM, Gregory Farnum wrote:

This is an interesting one -- the invariant that assert is checking
isn't too complicated (that the object lives on the RecoveryWQ's
queue) and seems to hold everywhere the RecoveryWQ is called. And the
functions modifying the queue are always called under the workqueue
lock, and do maintenance if the xlist::item is on a different list.
Which makes me think that the problem must be from conflating the
RecoveryWQ lock and the PG lock in the few places that modify the
PG::recovery_item directly, rather than via RecoveryWQ functions.
Anybody more familiar than me with this have ideas?
Fyodor, based on the time stamps and output you've given us, I assume
you don't have more detailed logs?
-Greg


Greg, i got this crash again.
Let me tell you the configuration and what is happening:
Configuration:
6 osd servers. 4G RAM, 4*1T hdd (mdadmed to raid0), 2*1G etherchannel 
ethernet, Ubuntu server 11.04/64  with kernel 2.6.39 (hand compiled)

mon+mds server 24G RAM, the same os.

On each OSD Journal placed on 1G tempfs. OSD data - on xfs in this case.

Configuration file:

[global]
max open files = 131072
log file = /var/log/ceph/$name.log
pid file = /var/run/ceph/$name.pid

[mon]
mon data = /mfs/mon$id

[mon.0]
mon addr  = 10.5.51.230:6789

[mds]
keyring = /mfs/mds/keyring.$name

[mds.0]
host = mds0


[osd]
osd data = /$name
osd journal = /journal/$name
osd journal size = 950
journal dio = false

[osd.0]
host = osd0
cluster addr = 10.5.51.10
public addr = 10.5.51.140

[osd.1]
host = osd1
cluster addr = 10.5.51.11
public addr = 10.5.51.141

[osd.2]
host = osd2
cluster addr = 10.5.51.12
public addr = 10.5.51.142

[osd.3]
host = osd3
cluster addr = 10.5.51.13
public addr = 10.5.51.143

[osd.4]
host = osd4
cluster addr = 10.5.51.14
public addr = 10.5.51.144

[osd.5]
host = osd5
cluster addr = 10.5.51.15
public addr = 10.5.51.145

What happening:
osd2 was crashed, rebooted, osd data and journal created from scratch by 
cosd --mkfs -i 2 --monmap /tmp/monmap and server started.
Additionally - writeahead is enabled on osd2, but I think it's not 
the main issue in this case.


Well, server start rebalancing:

2011-05-27 15:12:49.323558 7f3b69de5740 ceph version 0.28.1.commit: 
d66c6ca19bbde3c363b135b66072de44e67c6632. process: cosd. pid: 1694
2011-05-27 15:12:49.325331 7f3b69de5740 filestore(/osd.2) mount FIEMAP 
ioctl is NOT supported
2011-05-27 15:12:49.325378 7f3b69de5740 filestore(/osd.2) mount did NOT 
detect btrfs
2011-05-27 15:12:49.325467 7f3b69de5740 filestore(/osd.2) mount found 
snaps 
2011-05-27 15:12:49.325512 7f3b69de5740 filestore(/osd.2) mount: 
WRITEAHEAD journal mode explicitly enabled in conf
2011-05-27 15:12:49.325526 7f3b69de5740 filestore(/osd.2) mount WARNING: 
not btrfs or ext3; data may be lost
2011-05-27 15:12:49.325606 7f3b69de5740 journal _open /journal/osd.2 fd 
11: 996147200 bytes, block size 4096 bytes, directio = 0
2011-05-27 15:12:49.325641 7f3b69de5740 journal read_entry 4096 : seq 1 
203 bytes
2011-05-27 15:12:49.325698 7f3b69de5740 journal _open /journal/osd.2 fd 
11: 996147200 bytes, block size 4096 bytes, directio = 0
2011-05-27 15:12:49.544716 7f3b59656700 -- 10.5.51.12:6801/1694 >> 10.5.51.14:6801/5070 pipe(0x1239d20 sd=27 pgs=0 cs=0 l=0).accept we reset (peer sent cseq 2), sending RESETSESSION
2011-05-27 15:12:49.544798 7f3b59c5c700 -- 10.5.51.12:6801/1694 >> 10.5.51.13:6801/5165 pipe(0x104b950 sd=14 pgs=0 cs=0 l=0).accept we reset (peer sent cseq 2), sending RESETSESSION
2011-05-27 15:12:49.544864 7f3b59757700 -- 10.5.51.12:6801/1694 >> 10.5.51.15:6801/1574 pipe(0x11e7cd0 sd=16 pgs=0 cs=0 l=0).accept we reset (peer sent cseq 2), sending RESETSESSION
2011-05-27 15:12:49.544909 7f3b59959700 -- 10.5.51.12:6801/1694 >> 10.5.51.10:6801/6148 pipe(0x11d7d30 sd=15 pgs=0 cs=0 l=0).accept we reset (peer sent cseq 2), sending RESETSESSION
2011-05-27 15:13:23.015637 7f3b64579700 journal check_for_full at 
66404352 : JOURNAL FULL 66404352 = 851967 (max_size 996147200 start 
67256320)

2011-05-27 15:13:25.586081 7f3b5dc6b700 journal throttle: waited for bytes
2011-05-27 15:13:25.601789 7f3b5d46a700 journal throttle: waited for bytes

[...] and after 2 hours:

2011-05-27 17:30:21.355034 7f3b64579700 journal check_for_full at 
415199232 : JOURNAL FULL 415199232 = 778239 (max_size 996147200 start 
415977472)

2011-05-27 17:30:23.441445 7f3b5d46a700 journal throttle: waited for bytes
2011-05-27 17:30:36.362877 7f3b64579700 journal check_for_full at 
414326784 : JOURNAL FULL 414326784 = 872447 (max_size 996147200 start 
415199232)

2011-05-27 17:30:38.391372 7f3b5d46a700 journal throttle: waited for bytes
2011-05-27 17:30:50.373936 7f3b64579700 journal check_for_full at 
414314496 : JOURNAL FULL 414314496 = 12287 (max_size 996147200 

Re: OSD crash

2011-05-27 Thread Fyodor Ustinov

On 05/27/2011 10:18 PM, Gregory Farnum wrote:

Can you check out the recoverywq_fix branch and see if that prevents
this issue? Or just apply the patch I've included below. :)
-Greg


Looks as though this patch has helped.
At least this osd has completd rebalancing.
Great! Thanks!

WBR,
Fyodor.
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

