Re: [ceph-users] Read Errors and OSD Flapping
Hello,

On Sat, 30 May 2015 22:23:22 +0100 Nick Fisk wrote:

> Hi All,
>
> I was noticing poor performance on my cluster and when I went to
> investigate I noticed OSD 29 was flapping up and down. On
> investigation it looks like it has 2 pending sectors, and the kernel
> log is filled with the following:
>
> end_request: critical medium error, dev sdk, sector 4483365656
> end_request: critical medium error, dev sdk, sector 4483365872
>
> I can see in the OSD logs that the OSD was crashing while trying to
> scrub the PG, probably failing when the kernel passes up the read
> error.
>
> ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
> 1: /usr/bin/ceph-osd() [0xacaf4a]
> 2: (()+0x10340) [0x7fdc43032340]
> 3: (gsignal()+0x39) [0x7fdc414d1cc9]
> 4: (abort()+0x148) [0x7fdc414d50d8]
> 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fdc41ddc6b5]
> 6: (()+0x5e836) [0x7fdc41dda836]
> 7: (()+0x5e863) [0x7fdc41dda863]
> 8: (()+0x5eaa2) [0x7fdc41ddaaa2]
> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x278) [0xbc2908]
> 10: (FileStore::read(coll_t, ghobject_t const&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int, bool)+0xc98) [0x9168e8]
> 11: (ReplicatedBackend::be_deep_scrub(hobject_t const&, unsigned int, ScrubMap::object&, ThreadPool::TPHandle&)+0x2f9) [0xa05bf9]
> 12: (PGBackend::be_scan_list(ScrubMap&, std::vector<hobject_t, std::allocator<hobject_t> > const&, bool, unsigned int, ThreadPool::TPHandle&)+0x2c8) [0x8dab98]
> 13: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t&, hobject_t&, bool, unsigned int, ThreadPool::TPHandle&)+0x1fa) [0x7f099a]
> 14: (PG::replica_scrub(MOSDRepScrub*, ThreadPool::TPHandle&)+0x4a2) [0x7f1132]
> 15: (OSD::RepScrubWQ::_process(MOSDRepScrub*, ThreadPool::TPHandle&)+0xbe) [0x6e583e]
> 16: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa5e) [0xbb38ae]
> 17: (ThreadPool::WorkThread::entry()+0x10) [0xbb4950]
> 18: (()+0x8182) [0x7fdc4302a182]
> 19: (clone()+0x6d) [0x7fdc4159547d]
>
> A few questions:
>
> 1. Is this the expected behaviour, or should Ceph try and do
> something to either keep the OSD down or rewrite the sector to cause
> a sector remap?

I guess what you see is what you get, but both things, especially the
rewrite, would be better. Alas, I suppose it is a bit of work for Ceph
to do the right thing there (getting the OSD to rewrite things with a
replica from another node) AND to be certain that this wasn't the last
good replica, read error or not.

> 2. I am monitoring SMART stats, but is there any other way of picking
> this up or getting Ceph to highlight it? Something like a
> flapping-OSD notification would be nice.

Lots of improvement opportunities in the Ceph status indeed, starting
with what constitutes which level (ERR, WRN, INF).

> 3. I'm assuming at this stage this disk will not be replaceable under
> warranty. Am I best to mark it as out, let it drain and then
> re-introduce it again, which should overwrite the sector and cause a
> remap? Or is there a better way?

That's the safe, easy way. You might want to add a dd zeroing of the
drive and a long SMART test afterwards for good measure before
re-adding it.

A faster way might be to determine which PG and file are affected and
just rewrite that, preferably with a good copy of the data. After that,
deep-scrub that PG, potentially doing a manual repair if this was the
acting copy.

Christian

> Many Thanks,
> Nick

-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
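For the archives, a minimal shell sketch of the drain, zero and re-add
path described above, assuming osd.29 on /dev/sdk as in this thread,
pools with more than one replica, and Ubuntu-style upstart service
management. This is an outline to adapt, not a tested procedure:

  # Drain the OSD; its PGs are re-replicated elsewhere while it is up.
  ceph osd out 29
  ceph -w                      # watch until recovery completes

  # Stop the daemon and overwrite the disk so the drive remaps the
  # pending sectors (this destroys all data on /dev/sdk).
  stop ceph-osd id=29
  dd if=/dev/zero of=/dev/sdk bs=1M oflag=direct

  # Long SMART self-test, then confirm nothing is still pending.
  smartctl -t long /dev/sdk
  smartctl -A /dev/sdk | egrep 'Reallocated_Sector|Current_Pending_Sector'

  # The zeroed disk must be re-created as an OSD (e.g. ceph-disk
  # prepare/activate) before it can take data and backfill again.

  # The faster path: once the affected PG is known and the bad file
  # has been rewritten, re-verify it (<pgid> is a placeholder):
  ceph pg deep-scrub <pgid>
  ceph pg repair <pgid>        # only if the bad copy was the acting one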
[ceph-users] Read Errors and OSD Flapping
Hi All,

I was noticing poor performance on my cluster and when I went to
investigate I noticed OSD 29 was flapping up and down. On investigation
it looks like it has 2 pending sectors, and the kernel log is filled
with the following:

end_request: critical medium error, dev sdk, sector 4483365656
end_request: critical medium error, dev sdk, sector 4483365872

I can see in the OSD logs that the OSD was crashing while trying to
scrub the PG, probably failing when the kernel passes up the read
error.

ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
1: /usr/bin/ceph-osd() [0xacaf4a]
2: (()+0x10340) [0x7fdc43032340]
3: (gsignal()+0x39) [0x7fdc414d1cc9]
4: (abort()+0x148) [0x7fdc414d50d8]
5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fdc41ddc6b5]
6: (()+0x5e836) [0x7fdc41dda836]
7: (()+0x5e863) [0x7fdc41dda863]
8: (()+0x5eaa2) [0x7fdc41ddaaa2]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x278) [0xbc2908]
10: (FileStore::read(coll_t, ghobject_t const&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int, bool)+0xc98) [0x9168e8]
11: (ReplicatedBackend::be_deep_scrub(hobject_t const&, unsigned int, ScrubMap::object&, ThreadPool::TPHandle&)+0x2f9) [0xa05bf9]
12: (PGBackend::be_scan_list(ScrubMap&, std::vector<hobject_t, std::allocator<hobject_t> > const&, bool, unsigned int, ThreadPool::TPHandle&)+0x2c8) [0x8dab98]
13: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t&, hobject_t&, bool, unsigned int, ThreadPool::TPHandle&)+0x1fa) [0x7f099a]
14: (PG::replica_scrub(MOSDRepScrub*, ThreadPool::TPHandle&)+0x4a2) [0x7f1132]
15: (OSD::RepScrubWQ::_process(MOSDRepScrub*, ThreadPool::TPHandle&)+0xbe) [0x6e583e]
16: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa5e) [0xbb38ae]
17: (ThreadPool::WorkThread::entry()+0x10) [0xbb4950]
18: (()+0x8182) [0x7fdc4302a182]
19: (clone()+0x6d) [0x7fdc4159547d]

A few questions:

1. Is this the expected behaviour, or should Ceph try and do something
to either keep the OSD down or rewrite the sector to cause a sector
remap?

2. I am monitoring SMART stats, but is there any other way of picking
this up or getting Ceph to highlight it? Something like a flapping-OSD
notification would be nice.

3. I'm assuming at this stage this disk will not be replaceable under
warranty. Am I best to mark it as out, let it drain and then
re-introduce it again, which should overwrite the sector and cause a
remap? Or is there a better way?

Many Thanks,
Nick
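On question 2, a sketch of one way to monitor for this from cron,
checking SMART pending-sector counts and kernel medium errors. The
device glob and output format are illustrative only:

  #!/bin/sh
  # Warn on drives with pending (unreadable, not yet remapped) sectors.
  for dev in /dev/sd?; do
      pending=$(smartctl -A "$dev" | awk '/Current_Pending_Sector/ {print $10}')
      if [ -n "$pending" ] && [ "$pending" -gt 0 ]; then
          echo "WARNING: $dev reports $pending pending sectors"
      fi
  done

  # Count medium errors the kernel has logged since boot.
  dmesg | grep -c 'critical medium error'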
Re: [ceph-users] SSD disk distribution
Martin,

It all depends on your workload. For example, if you are not bothered
about write speed at all, I would say to configure the primary affinity
of your cluster so that the primary OSDs are the ones hosted on SSDs.
If you are considering 4 SSDs per node, that is a total of 56 SSDs and
14 * 12 = 168 HDDs, and I guess the numbers should work out reasonably
well (considering 1 OSD per disk). This should give your cluster
all-SSD-like read performance, but write performance won't improve (it
will remain HDD-like). In this case, making 2 or 3 all-SSD nodes with
high-performance servers makes sense, since all the read traffic will
be landing there and with SSDs you need a more powerful CPU complex.

If your workload is a read/write mix, I would say your theory of 2 SSDs
for journals and 2 for a cache pool makes sense. The journal helps only
for writes, while a cache tier can help for reads. But I must say I am
yet to evaluate cache tiering performance. In this case, as you said,
distributing the SSDs across all nodes is the correct approach.

Hope this helps,

Thanks & Regards,
Somnath

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Martin Palma
Sent: Saturday, May 30, 2015 1:37 AM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] SSD disk distribution

Hello,

We are planning to deploy our first Ceph cluster with 14 storage nodes
and 3 monitor nodes. The storage nodes have 12 SATA disks and 4 SSDs;
2 of the SSDs we plan to use as journal disks and 2 for cache tiering.

Now the question was raised in our team whether it would be better to
put all the SSDs in, say, 2 storage nodes and consider them fast nodes,
or to distribute the SSDs for cache tiering over all 14 nodes (2 per
node).

In my opinion, if I understood the concept of Ceph right (I'm still in
the learning process ;-), distributing the SSDs across all storage
nodes would be better, since this would also distribute the network
traffic (client access) across all 14 nodes and not limit it to only 2
nodes. Right?

Any suggestion on that?

Best,
Martin
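For reference, the primary-affinity knobs mentioned above look roughly
like this on firefly/hammer; the OSD ids are placeholders:

  # Primary affinity is gated off by default; enable it on the monitors:
  #   [mon]
  #   mon osd allow primary affinity = true

  # Make an SSD-backed OSD the preferred primary, and an HDD-backed one
  # a primary only as a last resort:
  ceph osd primary-affinity osd.0 1.0
  ceph osd primary-affinity osd.1 0.0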
[ceph-users] osd crash with object store as newstore
Hi,

I built the Ceph code from wip-newstore on RHEL7 and am running
performance tests to compare with filestore. After a few hours of
running the tests the OSD daemons started to crash. Here is the stack
trace; the OSD crashes immediately after a restart, so I could not get
the OSD up and running.

ceph version eb8e22893f44979613738dfcdd40dada2b513118 (eb8e22893f44979613738dfcdd40dada2b513118)
1: /usr/bin/ceph-osd() [0xb84652]
2: (()+0xf130) [0x7f915f84f130]
3: (gsignal()+0x39) [0x7f915e2695c9]
4: (abort()+0x148) [0x7f915e26acd8]
5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f915eb6d9d5]
6: (()+0x5e946) [0x7f915eb6b946]
7: (()+0x5e973) [0x7f915eb6b973]
8: (()+0x5eb9f) [0x7f915eb6bb9f]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x27a) [0xc84c5a]
10: (NewStore::collection_list_partial(coll_t, ghobject_t, int, int, snapid_t, std::vector<ghobject_t, std::allocator<ghobject_t> >*, ghobject_t*)+0x13c9) [0xa08639]
11: (PGBackend::objects_list_partial(hobject_t const&, int, int, snapid_t, std::vector<hobject_t, std::allocator<hobject_t> >*, hobject_t*)+0x352) [0x918a02]
12: (ReplicatedPG::do_pg_op(std::tr1::shared_ptr<OpRequest>)+0x1066) [0x8aa906]
13: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>&)+0x1eb) [0x8cd06b]
14: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x68a) [0x85dbea]
15: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3ed) [0x6c3f5d]
16: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x2e9) [0x6c4449]
17: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) [0xc746bf]
18: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xc767f0]
19: (()+0x7df3) [0x7f915f847df3]
20: (clone()+0x6d) [0x7f915e32a01d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Please let me know what the cause of this crash is.

Regards,
Srikanth
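As the NOTE in the trace says, the addresses are only meaningful
against the exact binary that crashed. A sketch of how they might be
resolved, assuming the binary is unstripped or the matching debuginfo
package is installed:

  # Translate a frame address into a demangled function and file:line.
  addr2line -Cfe /usr/bin/ceph-osd 0xa08639    # frame 10 above

  # Or dump a full disassembly with interleaved source for browsing.
  objdump -rdS /usr/bin/ceph-osd > ceph-osd.dump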
[ceph-users] SSD disk distribution
Hello,

We are planning to deploy our first Ceph cluster with 14 storage nodes
and 3 monitor nodes. The storage nodes have 12 SATA disks and 4 SSDs;
2 of the SSDs we plan to use as journal disks and 2 for cache tiering.

Now the question was raised in our team whether it would be better to
put all the SSDs in, say, 2 storage nodes and consider them fast nodes,
or to distribute the SSDs for cache tiering over all 14 nodes (2 per
node).

In my opinion, if I understood the concept of Ceph right (I'm still in
the learning process ;-), distributing the SSDs across all storage
nodes would be better, since this would also distribute the network
traffic (client access) across all 14 nodes and not limit it to only 2
nodes. Right?

Any suggestion on that?

Best,
Martin
Re: [ceph-users] SSD disk distribution
Hello,

See the current "Blocked requests/ops?" thread in this ML, especially
the later parts, and a number of similar threads.

In short, the CPU requirements for SSD-based pools are significantly
higher than for HDD or HDD/SSD-journal pools. So having dedicated SSD
nodes with fewer OSDs, faster CPUs and potentially a faster network
makes a lot of sense. It also helps a bit to keep you and your CRUSH
rules sane.

In your example you'd have 12 HDD-based OSDs with journals; plan for
1.5-2GHz of CPU per OSD (things will get CPU-bound with small write
IOPS). An SSD-based OSD (I'm assuming something like a DC S3700) will
eat all the CPU you can throw at it; 6-8GHz would be a pretty
conservative number. Search the archives for the latest
tests/benchmarks by others, don't take my (slightly dated) word for it.

Lastly, you may find, like others, that cache tiers currently aren't
all that great performance-wise.

Christian

On Sat, 30 May 2015 10:36:39 +0200 Martin Palma wrote:

> Hello,
>
> We are planning to deploy our first Ceph cluster with 14 storage
> nodes and 3 monitor nodes. The storage nodes have 12 SATA disks and
> 4 SSDs; 2 of the SSDs we plan to use as journal disks and 2 for cache
> tiering.
>
> Now the question was raised in our team whether it would be better to
> put all the SSDs in, say, 2 storage nodes and consider them fast
> nodes, or to distribute the SSDs for cache tiering over all 14 nodes
> (2 per node).
>
> In my opinion, if I understood the concept of Ceph right (I'm still
> in the learning process ;-), distributing the SSDs across all storage
> nodes would be better, since this would also distribute the network
> traffic (client access) across all 14 nodes and not limit it to only
> 2 nodes. Right?
>
> Any suggestion on that?
>
> Best,
> Martin

-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
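For anyone taking the dedicated-SSD-node route, the CRUSH side could
look roughly like this on hammer. The bucket, host and pool names are
made up, and the rule id at the end comes from 'ceph osd crush rule
dump':

  # Give the SSD-only hosts their own CRUSH root...
  ceph osd crush add-bucket ssd root
  ceph osd crush move node-ssd1 root=ssd
  ceph osd crush move node-ssd2 root=ssd

  # ...add a rule that places data only under that root, with host as
  # the failure domain...
  ceph osd crush rule create-simple ssd-rule ssd host

  # ...and point the cache pool at the new rule.
  ceph osd pool set cache-pool crush_ruleset 1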
Re: [ceph-users] RGW - Can't download complete object
The code has been backported and should be part of the firefly 0.80.10
release and the hammer 0.94.2 release.

Nathan

On 05/14/2015 07:30 AM, Yehuda Sadeh-Weinraub wrote:

The code is in wip-11620, and it's currently on top of the next branch.
We'll get it through the tests, then get it into hammer and firefly. I
wouldn't recommend installing it in production without proper testing
first.

Yehuda

----- Original Message -----
From: Sean Sullivan <seapasu...@uchicago.edu>
To: Yehuda Sadeh-Weinraub <yeh...@redhat.com>
Cc: ceph-users@lists.ceph.com
Sent: Wednesday, May 13, 2015 7:22:10 PM
Subject: Re: [ceph-users] RGW - Can't download complete object

Thank you so much Yehuda! I look forward to testing these. Is there a
way for me to pull this code in? Is it in master?

On May 13, 2015 7:08:44 PM Yehuda Sadeh-Weinraub <yeh...@redhat.com> wrote:

Ok, I dug a bit more, and it seems to me that the problem is with the
manifest that was created. I was able to reproduce a similar issue
(opened ceph bug #11622), for which I also have a fix. I created new
tests to cover this issue, and we'll get those recent fixes in as soon
as we can, after we test for any regressions.

Thanks,
Yehuda

----- Original Message -----
From: Yehuda Sadeh-Weinraub <yeh...@redhat.com>
To: Sean Sullivan <seapasu...@uchicago.edu>
Cc: ceph-users@lists.ceph.com
Sent: Wednesday, May 13, 2015 2:33:07 PM
Subject: Re: [ceph-users] RGW - Can't download complete object

That's another interesting issue. Note that for part 12_80 the manifest
specifies (I assume, by the messenger log) this part:

default.20283.1__shadow_b235040a-46b6-42b3-b134-962b1f8813d5/28357709e44fff211de63b1d2c437159.bam.tJ8UddmcCxe0lOsgfHR9Q-ZHXdlrM14.12_80

(note the 'tJ8UddmcCxe0lOsgfHR9Q-ZHXdlrM14')

whereas it seems that you do have the original part:

default.20283.1__shadow_b235040a-46b6-42b3-b134-962b1f8813d5/28357709e44fff211de63b1d2c437159.bam.2/-ztodNISNLlaNeV4kDmrQwmkECBP2mZ.12_80

(note the '2/...')

The part that the manifest specifies does not exist, which makes me
think that there is some weird upload sequence, something like:

- client uploads part, upload finishes but client does not get ack for it
- client retries (second upload)
- client gets ack for the first upload and gives up on the second one

But I'm not sure if it would explain the manifest; I'll need to take a
look at the code. Could such a sequence happen with the client that
you're using to upload?

Yehuda

----- Original Message -----
From: Sean Sullivan <seapasu...@uchicago.edu>
To: Yehuda Sadeh-Weinraub <yeh...@redhat.com>
Cc: ceph-users@lists.ceph.com
Sent: Wednesday, May 13, 2015 2:07:22 PM
Subject: Re: [ceph-users] RGW - Can't download complete object

Sorry for the delay. It took me a while to figure out how to do a range
request and append the data to a single file. The good news is that the
end file seems to be 14G in size, which matches the file's manifest
size. The bad news is that the file is completely corrupt and the
radosgw log has errors.
I am using the following code to perform the download:

https://raw.githubusercontent.com/mumrah/s3-multipart/master/s3-mp-download.py

Here is a clip of the log file:

2015-05-11 15:28:52.313742 7f570db7d700 1 -- 10.64.64.126:0/108 <== osd.11 10.64.64.101:6809/942707 5 ==== osd_op_reply(74566287 default.20283.1__shadow_b235040a-46b6-42b3-b134-962b1f8813d5/28357709e44fff211de63b1d2c437159.bam.2/-ztodNISNLlaNeV4kDmrQwmkECBP2mZ.13_12 [read 0~858004] v0'0 uv41308 ondisk = 0) v6 ==== 304+0+858004 (1180387808 0 2445559038) 0x7f53d005b1a0 con 0x7f56f8119240
2015-05-11 15:28:52.313797 7f57067fc700 20 get_obj_aio_completion_cb: io completion ofs=12934184960 len=858004
2015-05-11 15:28:52.372453 7f570db7d700 1 -- 10.64.64.126:0/108 <== osd.45 10.64.64.101:6845/944590 2 ==== osd_op_reply(74566142 default.20283.1__shadow_b235040a-46b6-42b3-b134-962b1f8813d5/28357709e44fff211de63b1d2c437159.bam.tJ8UddmcCxe0lOsgfHR9Q-ZHXdlrM14.12_80 [read 0~4194304] v0'0 uv0 ack = -2 ((2) No such file or directory)) v6 ==== 302+0+0 (3754425489 0 0) 0x7f53d005b1a0 con 0x7f56f81b1f30
2015-05-11 15:28:52.372494 7f57067fc700 20 get_obj_aio_completion_cb: io completion ofs=12145655808 len=4194304
2015-05-11 15:28:52.372501 7f57067fc700 0 ERROR: got unexpected error when trying to read object: -2
2015-05-11 15:28:52.426079 7f570db7d700 1 -- 10.64.64.126:0/108 <== osd.21 10.64.64.102:6856/1133473 16 ==== osd_op_reply(74566144 default.20283.1__shadow_b235040a-46b6-42b3-b134-962b1f8813d5/28357709e44fff211de63b1d2c437159.bam.2/-ztodNISNLlaNeV4kDmrQwmkECBP2mZ.11_12 [read 0~3671316] v0'0 uv41395 ondisk = 0) v6 ==== 304+0+3671316 (1695485150 0 3933234139) 0x7f53d005b1a0 con 0x7f56f81e17d0
2015-05-11 15:28:52.426123 7f57067fc700 20 get_obj_aio_completion_cb: io completion ofs=10786701312 len=3671316
2015-05-11 15:28:52.504072 7f570db7d700 1 -- 10.64.64.126:0/108 <== osd.82 10.64.64.103:6857/88524 2 ==== osd_op_reply(74566283
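For completeness: the linked s3-mp-download.py fetches ranges in
parallel; a sequential shell equivalent might look like the sketch
below, with the endpoint, bucket and key as placeholders and awscli
assumed to be configured against the RGW endpoint. Note that a
multipart object's ETag is not a plain MD5, so the stitched file has to
be verified some other way, e.g. against a checksum taken before
upload.

  BUCKET=mybucket
  KEY=28357709e44fff211de63b1d2c437159.bam
  EP=http://rgw.example.com

  SIZE=$(aws --endpoint-url "$EP" s3api head-object \
          --bucket "$BUCKET" --key "$KEY" \
          --query ContentLength --output text)
  CHUNK=$((64 * 1024 * 1024))

  rm -f stitched.bam
  ofs=0
  while [ "$ofs" -lt "$SIZE" ]; do
      end=$((ofs + CHUNK - 1))
      [ "$end" -ge "$SIZE" ] && end=$((SIZE - 1))
      # Ranged GET of one chunk, appended to the output file.
      aws --endpoint-url "$EP" s3api get-object \
          --bucket "$BUCKET" --key "$KEY" \
          --range "bytes=${ofs}-${end}" part.tmp
      cat part.tmp >> stitched.bam
      ofs=$((ofs + CHUNK))
  done
  rm -f part.tmp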
Re: [ceph-users] umount stuck on NFS gateways switch over by using Pacemaker
Dear Eric:

Thanks for your information. The command 'reboot -fn' works well. I
have no idea whether anybody else has met the 'umount stuck' condition
like me. If possible, I hope I can find the reason why the fail-over
process doesn't work after 30 minutes.

WD

-----Original Message-----
From: Eric Eastman [mailto:eric.east...@keepertech.com]
Sent: Thursday, May 28, 2015 10:56 PM
To: WD Hwang/WHQ/Wistron
Cc: Ceph Users
Subject: Re: [ceph-users] umount stuck on NFS gateways switch over by using Pacemaker

On Thu, May 28, 2015 at 1:33 AM, <wd_hw...@wistron.com> wrote:

> Hello,
>
> I am testing NFS over RBD recently. I am trying to build an NFS HA
> environment under Ubuntu 14.04 for testing, with the following
> package versions:
>
> - Ubuntu 14.04: 3.13.0-32-generic (Ubuntu 14.04.2 LTS)
> - ceph: 0.80.9-0ubuntu0.14.04.2
> - ceph-common: 0.80.9-0ubuntu0.14.04.2
> - pacemaker (git20130802-1ubuntu2.3)
> - corosync (2.3.3-1ubuntu1)
>
> PS: I also tried ceph/ceph-common (0.87.1-1trusty and 0.87.2-1trusty)
> on 3.13.0-48-generic (Ubuntu 14.04.2) servers and got the same
> results.
>
> The environment has 5 nodes in the Ceph cluster (3 MONs and 5 OSDs)
> and two NFS gateways (nfs1 and nfs2) for high availability. I issued
> the command 'sudo service pacemaker stop' on 'nfs1' to force these
> resources to stop and be transferred to 'nfs2', and vice versa. When
> the two nodes are up and I issue 'sudo service pacemaker stop' on one
> node, the other node takes over all resources. Everything looks fine.
>
> Then I waited about 30 minutes, doing nothing to the NFS gateways,
> and repeated the previous steps to test the fail-over procedure. I
> found the process state of 'umount' was 'D' (uninterruptible sleep);
> 'ps' showed the following:
>
> root 21047 0.0 0.0 17412 952 ? D 16:39 0:00 umount /mnt/block1
>
> Any idea how to solve or work around this? Because of the stuck
> 'umount', neither 'reboot' nor 'shutdown' works, so unless I wait
> 20 minutes for the 'umount' to time out, the only thing I can do is
> power off the server directly.
>
> Any help would be much appreciated.

I am not sure how to get out of the stuck umount, but you can skip the
shutdown scripts that call the umount during a reboot using:

  reboot -fn

This can cause data loss, as it is like a power cycle, so it is best to
run 'sync' before running the 'reboot -fn' command to flush out
buffers. Sometimes when a system is really hung, 'reboot -fn' does not
work, but this seems to always work if run as root:

  echo 1 > /proc/sys/kernel/sysrq
  echo b > /proc/sysrq-trigger

Eric
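For the next reader who hits this: before power-cycling, it may be
worth seeing what the umount is actually blocked on. A sketch, using
the mount point from this thread; sysrq 'w' only writes blocked-task
stacks to the kernel log, unlike the 'b' (immediate reboot) used above:

  # Which processes still reference the mount? (This can itself hang
  # if the underlying RBD/filesystem is wedged; use a timeout if unsure.)
  fuser -vm /mnt/block1

  # Dump all D-state tasks with kernel stacks to the log, then read it
  # to see where umount is stuck.
  echo 1 > /proc/sys/kernel/sysrq
  echo w > /proc/sysrq-trigger
  dmesg | tail -n 60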