RE: OSD sometimes stuck in init phase

2015-08-07 Thread Gurjar, Unmesh
@Haomai: Thanks for your support and curiosity.

After enabling verbose journal logging and comparing the logs of working and
stuck OSDs, the problem appeared to be related to 'journal aio'. To confirm
this, I disabled journal aio and deployed a new cluster with 9 OSDs, all of
which reached the 'up' and 'in' state when the OSD service was started for the
first time!
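
A minimal sketch of the workaround described above, assuming the setting lives in the [osd] section of ceph.conf and that the file is simple enough for Python's configparser to read (no duplicate keys). The helper name `disable_journal_aio` and the sample file contents, including the placeholder fsid, are hypothetical.

```python
import configparser
import io


def disable_journal_aio(conf_text):
    """Return conf_text with 'journal aio = false' set in the [osd] section."""
    # interpolation=None keeps any '%' characters in values untouched.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    if not parser.has_section("osd"):
        parser.add_section("osd")
    parser.set("osd", "journal aio", "false")
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


# Hypothetical sample configuration, not the one from the pastes in this thread.
sample = (
    "[global]\n"
    "fsid = 00000000-0000-0000-0000-000000000000\n"
    "\n"
    "[osd]\n"
    "journal dio = true\n"
)
print(disable_journal_aio(sample))
```

Note that configparser drops comments when it rewrites a file, so on a real ceph.conf a manual edit is usually the safer route.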

The fix for the related issue [1] is available in Firefly 0.80.8 and later, so
I should probably move to the latest Firefly release (0.80.10)!

[1] - http://tracker.ceph.com/issues/9073
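
A small sketch of the version check implied above, assuming plain dotted version strings and taking 0.80.8 as the first release carrying the fix, as stated in this message. The helper names are hypothetical. Versions are compared as integer tuples because a plain string comparison would sort 0.80.10 before 0.80.8.

```python
def parse_version(text):
    """Turn a dotted version such as '0.80.7' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


# Assumption taken from the message above: the fix ships in 0.80.8 and later.
FIRST_FIXED = parse_version("0.80.8")


def has_fix(version):
    """Hypothetical helper: True if version is at or past the first fixed release."""
    return parse_version(version) >= FIRST_FIXED


for v in ("0.80.7", "0.80.8", "0.80.10"):
    print(v, "includes the fix" if has_fix(v) else "predates the fix")
```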

Regards,
Unmesh G.
IRC: unmeshg

 -----Original Message-----
 From: Haomai Wang [mailto:haomaiw...@gmail.com]
 Sent: Thursday, August 06, 2015 6:59 PM
 To: Gurjar, Unmesh
 Cc: ceph-devel@vger.kernel.org
 Subject: Re: OSD sometimes stuck in init phase
 
 It seems the filestore is not applying the transaction as expected. Sorry,
 you will need to add debug_journal=20/20 to help find the reason. :-)
 
 By the way, what's your OS version? How many OSDs do you have in this
 cluster, and how many of them failed to start like this?
 

RE: OSD sometimes stuck in init phase

2015-08-06 Thread Gurjar, Unmesh
Thanks for the quick response, Haomai! Please find the backtrace here [1].

[1] - http://paste.openstack.org/show/411139/

Regards,
Unmesh G.
IRC: unmeshg

 -----Original Message-----
 From: Haomai Wang [mailto:haomaiw...@gmail.com]
 Sent: Thursday, August 06, 2015 5:31 PM
 To: Gurjar, Unmesh
 Cc: ceph-devel@vger.kernel.org
 Subject: Re: OSD sometimes stuck in init phase
 
 Could you print the backtraces of all threads via 'thread apply all bt'?
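
One way to collect that output without an interactive session is to run gdb in batch mode against the OSD process. The sketch below only builds and prints the command. It assumes gdb is installed, that the caller may attach to the process, and that the PID is known; the helper names and the PID 12345 are hypothetical.

```python
import subprocess


def gdb_backtrace_command(pid):
    """Build a gdb batch invocation that dumps a backtrace of every thread."""
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def dump_backtraces(pid):
    """Run gdb against a live process and return its output as text."""
    result = subprocess.run(
        gdb_backtrace_command(pid),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout


# Only show the command here; 12345 is a placeholder PID.
print(" ".join(gdb_backtrace_command(12345)))
```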
 
 --
 Best Regards,
 
 Wheat

RE: OSD sometimes stuck in init phase

2015-08-06 Thread Gurjar, Unmesh
Please find ceph.conf at [1] and the corresponding OSD log at [2].

To clarify one thing I skipped earlier: while bringing up the OSDs,
'ceph-disk activate' was hanging (due to issue [3]). To get past this, I had
to temporarily disable 'journal dio' to get the disk activated (with
'mark-init' set to none), and then explicitly start the OSD service after
updating the conf to enable 'journal dio' again. I am hopeful that this should
not cause the present issue (since a few OSDs start successfully on the first
attempt and others on subsequent service restarts)!

[1] - http://paste.openstack.org/show/411161/
[2] - http://paste.openstack.org/show/411162/
[3] - http://tracker.ceph.com/issues/9768

Regards,
Unmesh G.
IRC: unmeshg

 -----Original Message-----
 From: Haomai Wang [mailto:haomaiw...@gmail.com]
 Sent: Thursday, August 06, 2015 6:22 PM
 To: Gurjar, Unmesh
 Cc: ceph-devel@vger.kernel.org
 Subject: Re: OSD sometimes stuck in init phase
 
 I don't see anything strange.
 
 Could you paste your ceph.conf? And restart this OSD with debug_osd=20/20 and
 debug_filestore=20/20 :-)
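
As a sketch of what those settings could look like in ceph.conf form, the snippet below renders a section with one 'debug <subsystem>' line per entry. It assumes the levels are placed in the [osd] section; the helper name `render_debug_section` is hypothetical.

```python
def render_debug_section(levels, section="osd"):
    """Render a ceph.conf-style section with one 'debug <subsystem>' line each."""
    lines = ["[%s]" % section]
    for subsystem, level in levels.items():
        lines.append("debug %s = %s" % (subsystem, level))
    return "\n".join(lines) + "\n"


# The two subsystems requested above, both at level 20/20.
print(render_debug_section({"osd": "20/20", "filestore": "20/20"}))
```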
 
 --
 Best Regards,
 
 Wheat


OSD sometimes stuck in init phase

2015-08-06 Thread Gurjar, Unmesh
Hi,

On a Ceph Firefly cluster (version [1]), OSDs are configured to use separate
data and journal disks (using the ceph-disk utility). A few OSDs start up fine
(they reach the 'up' and 'in' state); however, others get stuck in the
'init creating/touching snapmapper object' phase. Below is an OSD start-up log
snippet:

2015-08-06 08:58:02.491537 7fd312df97c0  1 journal _open /var/lib/ceph/osd/ceph-0/journal fd 21: 1073741824 bytes, block size 4096 bytes, directio = 1, aio = 1
2015-08-06 08:58:02.498447 7fd312df97c0  1 journal _open /var/lib/ceph/osd/ceph-0/journal fd 21: 1073741824 bytes, block size 4096 bytes, directio = 1, aio = 1
2015-08-06 08:58:02.498720 7fd312df97c0  2 osd.0 0 boot
2015-08-06 08:58:02.498865 7fd312df97c0 10 osd.0 0 read_superblock sb(2645bbf6-16d0-4c42-8835-8ba9f5c95a1d osd.0 a821146f-0742-4724-b4ca-39ea4ccc298d e0 [0,0] lci=[0,0])
2015-08-06 08:58:02.498937 7fd312df97c0 10 osd.0 0 init creating/touching snapmapper object
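
The journal settings in effect can be read straight out of the 'journal _open' lines in the snippet above. Below is a minimal sketch that extracts them with a regular expression, assuming the line layout shown in this log; the function name `parse_journal_open` is hypothetical. Whitespace is normalised first so that a log line wrapped by a mail client still matches once its pieces are joined.

```python
import re

# Pattern written against the 'journal _open' line format shown in this log.
JOURNAL_OPEN = re.compile(
    r"journal _open (?P<path>\S+) fd (?P<fd>\d+): (?P<size>\d+) bytes, "
    r"block size (?P<block>\d+) bytes, directio = (?P<dio>\d), aio = (?P<aio>\d)"
)


def parse_journal_open(line):
    """Return the journal settings from a 'journal _open' line, or None."""
    match = JOURNAL_OPEN.search(" ".join(line.split()))
    if match is None:
        return None
    return {
        "path": match.group("path"),
        "size_bytes": int(match.group("size")),
        "block_size": int(match.group("block")),
        "directio": match.group("dio") == "1",
        "aio": match.group("aio") == "1",
    }


sample = (
    "2015-08-06 08:58:02.491537 7fd312df97c0  1 journal _open "
    "/var/lib/ceph/osd/ceph-0/journal fd 21: 1073741824 bytes, "
    "block size 4096 bytes, directio = 1, aio = 1"
)
print(parse_journal_open(sample))
```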

The log statement is inaccurate though, since the code is actually performing
the init operation for the 'infos' object (as can be observed from source [2]).

Upon debugging further, the thread seems to be blocked in pthread_cond_wait,
waiting on the condition associated with the
'ObjectStore::apply_transaction::my_lock' mutex. Below is the debug trace:

(gdb) where
#0  0x7fd3122b708f in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x7fd313132bf4 in ObjectStore::apply_transactions(ObjectStore::Sequencer*, std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, Context*) ()
#2  0x7fd313097d08 in ObjectStore::apply_transaction(ObjectStore::Transaction&, Context*) ()
#3  0x7fd313076790 in OSD::init() ()
#4  0x7fd3130233a7 in main ()
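
Frame #0 of the trace above is a condition-variable wait, which is the shape a synchronous call takes when it blocks until a completion callback fires. The sketch below is only an analogy in Python, not Ceph code: it assumes that pattern and shows that the waiter never returns if the completion is never signalled. The class name `Completion` is hypothetical, and a timeout is used so the example terminates.

```python
import threading


class Completion:
    """A flag guarded by a condition variable, set once by a callback."""

    def __init__(self):
        self._cond = threading.Condition()
        self._done = False

    def finish(self):
        """Callback side: mark the work as done and wake any waiter."""
        with self._cond:
            self._done = True
            self._cond.notify_all()

    def wait(self, timeout=None):
        """Caller side: block until finish() runs; return False on timeout."""
        with self._cond:
            return self._cond.wait_for(lambda: self._done, timeout=timeout)


# Nobody ever calls finish(): without the timeout this wait would hang forever.
stuck = Completion()
print("completed:", stuck.wait(timeout=0.1))

# Here another thread signals completion shortly afterwards, so the wait returns.
ok = Completion()
threading.Timer(0.05, ok.finish).start()
print("completed:", ok.wait(timeout=1.0))
```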

In a few cases, restarting the stuck OSD service lets it complete the 'init'
phase successfully and reach the 'up' and 'in' state!

Any help is greatly appreciated. Please let me know if any more details are
needed to find the root cause.

[1] - 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3)
[2] -  https://github.com/ceph/ceph/blob/firefly/src/osd/OSD.cc#L1211

Regards,
Unmesh G.
IRC: unmeshg
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html