Re: [ceph-users] Cascading failure on a placement group

2016-08-13 Thread Goncalo Borges


>It should be worthwhile to check if timezone is/was different in mind.

What I meant was that it should be worthwhile to check if timezone is/was 
different in MONS also.

Cheers

From: Hein-Pieter van Braam [h...@tmm.cx]
Sent: 13 August 2016 22:42
To: Goncalo Borges; ceph-users
Subject: Re: [ceph-users] Cascading failure on a placement group

Hi,

The timezones on all my systems appear to be the same, I just verified
it by running 'date' on all my boxes.

- HP

On Sat, 2016-08-13 at 12:36 +, Goncalo Borges wrote:
> The ticket I mentioned earlier was marked as a duplicate of
>
> http://tracker.ceph.com/issues/9732
>
> Cheers
> Goncalo
>
> From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of
> Goncalo Borges [goncalo.bor...@sydney.edu.au]
> Sent: 13 August 2016 22:23
> To: Hein-Pieter van Braam; ceph-users
> Subject: Re: [ceph-users] Cascading failure on a placement group
>
> Hi HP.
>
> I am just a site admin so my opinion should be validated by proper
> support staff
>
> Seems really similar to
> http://tracker.ceph.com/issues/14399
>
> The ticket speaks about timezone difference between osds. Maybe it is
> something worthwhile to check?
>
> Cheers
> Goncalo
>
> 
> From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of
> Hein-Pieter van Braam [h...@tmm.cx]
> Sent: 13 August 2016 21:48
> To: ceph-users
> Subject: [ceph-users] Cascading failure on a placement group
>
> Hello all,
>
> My cluster started to lose OSDs without any warning, whenever an OSD
> becomes the primary for a particular PG it crashes with the following
> stacktrace:
>
>  ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
>  1: /usr/bin/ceph-osd() [0xada722]
>  2: (()+0xf100) [0x7fc28bca5100]
>  3: (gsignal()+0x37) [0x7fc28a6bd5f7]
>  4: (abort()+0x148) [0x7fc28a6bece8]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7fc28afc29d5]
>  6: (()+0x5e946) [0x7fc28afc0946]
>  7: (()+0x5e973) [0x7fc28afc0973]
>  8: (()+0x5eb93) [0x7fc28afc0b93]
>  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x27a) [0xbddcba]
>  10: (ReplicatedPG::hit_set_trim(ReplicatedPG::RepGather*, unsigned
> int)+0x75f) [0x87e48f]
>  11: (ReplicatedPG::hit_set_persist()+0xedb) [0x87f4ab]
>  12: (ReplicatedPG::do_op(std::tr1::shared_ptr&)+0xe3a)
> [0x8a0d1a]
>  13: (ReplicatedPG::do_request(std::tr1::shared_ptr&,
> ThreadPool::TPHandle&)+0x68a) [0x83be4a]
>  14: (OSD::dequeue_op(boost::intrusive_ptr,
> std::tr1::shared_ptr, ThreadPool::TPHandle&)+0x405)
> [0x69a5c5]
>  15: (OSD::ShardedOpWQ::_process(unsigned int,
> ceph::heartbeat_handle_d*)+0x333) [0x69ab33]
>  16: (ShardedThreadPool::shardedthreadpool_worker(unsigned
> int)+0x86f)
> [0xbcd1cf]
>  17: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xbcf300]
>  18: (()+0x7dc5) [0x7fc28bc9ddc5]
>  19: (clone()+0x6d) [0x7fc28a77eced]
>  NOTE: a copy of the executable, or `objdump -rdS ` is
> needed to interpret this.
>
> Has anyone ever seen this? Is there a way to fix this? My cluster is
> in
> rather large disarray at the moment. I have one of the OSDs now in a
> restart loop and that is at least preventing other OSDs from going
> down, but obviously not all other PGs can peer now.
>
> I'm not sure what else to do at the moment.
>
> Thank you so much,
>
> - HP
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: lost power. monitors died. Cephx errors now

2016-08-13 Thread Sean Sullivan
So with a patched leveldb to skip errors I now have a store.db that I can
extract the pg, mon, and osd maps from. That said, when I try to start kh10-8
it bombs out:

---
---
root@kh10-8:/var/lib/ceph/mon/ceph-kh10-8# ceph-mon -i $(hostname) -d
2016-08-13 22:30:54.596039 7fa8b9e088c0  0 ceph version 0.94.7
(d56bdf93ced6b80b07397d57e3fa68fe68304432), process ceph-mon, pid 708653
starting mon.kh10-8 rank 2 at 10.64.64.125:6789/0 mon_data
/var/lib/ceph/mon/ceph-kh10-8 fsid e452874b-cb29-4468-ac7f-f8901dfccebf
2016-08-13 22:30:54.608150 7fa8b9e088c0  0 starting mon.kh10-8 rank 2 at
10.64.64.125:6789/0 mon_data /var/lib/ceph/mon/ceph-kh10-8 fsid
e452874b-cb29-4468-ac7f-f8901dfccebf
2016-08-13 22:30:54.608395 7fa8b9e088c0  1 mon.kh10-8@-1(probing) e1
preinit fsid e452874b-cb29-4468-ac7f-f8901dfccebf
2016-08-13 22:30:54.608617 7fa8b9e088c0  1
mon.kh10-8@-1(probing).paxosservice(pgmap
0..35606392) refresh upgraded, format 0 -> 1
2016-08-13 22:30:54.608629 7fa8b9e088c0  1 mon.kh10-8@-1(probing).pg v0
on_upgrade discarding in-core PGMap
terminate called after throwing an instance of 'ceph::buffer::end_of_buffer'
  what():  buffer::end_of_buffer
*** Caught signal (Aborted) **
 in thread 7fa8b9e088c0
 ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
 1: ceph-mon() [0x9b25ea]
 2: (()+0x10330) [0x7fa8b8f0b330]
 3: (gsignal()+0x37) [0x7fa8b73a8c37]
 4: (abort()+0x148) [0x7fa8b73ac028]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fa8b7cb3535]
 6: (()+0x5e6d6) [0x7fa8b7cb16d6]
 7: (()+0x5e703) [0x7fa8b7cb1703]
 8: (()+0x5e922) [0x7fa8b7cb1922]
 9: ceph-mon() [0x853c39]
 10:
(object_stat_collection_t::decode(ceph::buffer::list::iterator&)+0x167)
[0x894227]
 11: (pg_stat_t::decode(ceph::buffer::list::iterator&)+0x5ff) [0x894baf]
 12: (PGMap::update_pg(pg_t, ceph::buffer::list&)+0xa3) [0x91a8d3]
 13: (PGMonitor::read_pgmap_full()+0x1d8) [0x68b9b8]
 14: (PGMonitor::update_from_paxos(bool*)+0xbf7) [0x6977b7]
 15: (PaxosService::refresh(bool*)+0x19a) [0x605b5a]
 16: (Monitor::refresh_from_paxos(bool*)+0x1db) [0x5b1ffb]
 17: (Monitor::init_paxos()+0x85) [0x5b2365]
 18: (Monitor::preinit()+0x7d7) [0x5b6f87]
 19: (main()+0x230c) [0x57853c]
 20: (__libc_start_main()+0xf5) [0x7fa8b7393f45]
 21: ceph-mon() [0x59a3c7]
2016-08-13 22:30:54.611791 7fa8b9e088c0 -1 *** Caught signal (Aborted) **
 in thread 7fa8b9e088c0

 ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
 1: ceph-mon() [0x9b25ea]
 2: (()+0x10330) [0x7fa8b8f0b330]
 3: (gsignal()+0x37) [0x7fa8b73a8c37]
 4: (abort()+0x148) [0x7fa8b73ac028]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fa8b7cb3535]
 6: (()+0x5e6d6) [0x7fa8b7cb16d6]
 7: (()+0x5e703) [0x7fa8b7cb1703]
 8: (()+0x5e922) [0x7fa8b7cb1922]
 9: ceph-mon() [0x853c39]
 10:
(object_stat_collection_t::decode(ceph::buffer::list::iterator&)+0x167)
[0x894227]
 11: (pg_stat_t::decode(ceph::buffer::list::iterator&)+0x5ff) [0x894baf]
 12: (PGMap::update_pg(pg_t, ceph::buffer::list&)+0xa3) [0x91a8d3]
 13: (PGMonitor::read_pgmap_full()+0x1d8) [0x68b9b8]
 14: (PGMonitor::update_from_paxos(bool*)+0xbf7) [0x6977b7]
 15: (PaxosService::refresh(bool*)+0x19a) [0x605b5a]
 16: (Monitor::refresh_from_paxos(bool*)+0x1db) [0x5b1ffb]
 17: (Monitor::init_paxos()+0x85) [0x5b2365]
 18: (Monitor::preinit()+0x7d7) [0x5b6f87]
 19: (main()+0x230c) [0x57853c]
 20: (__libc_start_main()+0xf5) [0x7fa8b7393f45]
 21: ceph-mon() [0x59a3c7]
 NOTE: a copy of the executable, or `objdump -rdS ` is needed
to interpret this.

--- begin dump of recent events ---
   -33> 2016-08-13 22:30:54.593450 7fa8b9e088c0  5 asok(0x36a20f0)
register_command perfcounters_dump hook 0x365a050
   -32> 2016-08-13 22:30:54.593480 7fa8b9e088c0  5 asok(0x36a20f0)
register_command 1 hook 0x365a050
   -31> 2016-08-13 22:30:54.593486 7fa8b9e088c0  5 asok(0x36a20f0)
register_command perf dump hook 0x365a050
   -30> 2016-08-13 22:30:54.593496 7fa8b9e088c0  5 asok(0x36a20f0)
register_command perfcounters_schema hook 0x365a050
   -29> 2016-08-13 22:30:54.593499 7fa8b9e088c0  5 asok(0x36a20f0)
register_command 2 hook 0x365a050
   -28> 2016-08-13 22:30:54.593501 7fa8b9e088c0  5 asok(0x36a20f0)
register_command perf schema hook 0x365a050
   -27> 2016-08-13 22:30:54.593503 7fa8b9e088c0  5 asok(0x36a20f0)
register_command perf reset hook 0x365a050
   -26> 2016-08-13 22:30:54.593505 7fa8b9e088c0  5 asok(0x36a20f0)
register_command config show hook 0x365a050
   -25> 2016-08-13 22:30:54.593508 7fa8b9e088c0  5 asok(0x36a20f0)
register_command config set hook 0x365a050
   -24> 2016-08-13 22:30:54.593510 7fa8b9e088c0  5 asok(0x36a20f0)
register_command config get hook 0x365a050
   -23> 2016-08-13 22:30:54.593512 7fa8b9e088c0  5 asok(0x36a20f0)
register_command config diff hook 0x365a050
   -22> 2016-08-13 22:30:54.593513 7fa8b9e088c0  5 asok(0x36a20f0)
register_command log flush hook 0x365a050
   -21> 2016-08-13 22:30:54.593557 7fa8b9e088c0  5 asok(0x36a20f0)
register_command log dump 
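Before going further it may be worth sanity-checking the maps extracted from the patched
store, to confirm that at least those structures decode cleanly (the /tmp paths below are
just examples, and the pgmap, which is where the decode above blows up, is not covered by
these tools):

monmaptool --print /tmp/monmap    # should list the expected mon ranks and the cluster fsid
osdmaptool --print /tmp/osdmap    # should show the expected epoch, pools and OSDs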

Re: [ceph-users] Cascading failure on a placement group

2016-08-13 Thread Goncalo Borges
Hi HP

My 2 cents again.

In

> http://tracker.ceph.com/issues/9732

There is a comment from Samuel saying "This...is not resolved! The 
utime_t->hobject_t mapping is timezone dependent. Needs to be not timezone 
dependent when generating the archive object names."

The way I read it is that you will get problems if, at some point, your 
timezone was different (since it is used for the archive object names), even if 
everything is now in the same timezone. So I guess it could be worthwhile 
to check whether, around the time of the first failures, your timezone was 
different, even if it is OK now.
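
For example, a quick way to compare the timezone currently configured on every OSD and MON
host (the host names below are placeholders, and this only shows the current setting; a past
change would have to be traced through the hosts' configuration history):

for h in osd01 osd02 osd03 mon01 mon02 mon03; do
    printf '%s: ' "$h"; ssh "$h" 'date "+%Z %z"'    # every host should report the same zone and offset
done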

It should be worthwhile to check if timezone is/was different in mind.

Cheers

From: Hein-Pieter van Braam [h...@tmm.cx]
Sent: 13 August 2016 22:42
To: Goncalo Borges; ceph-users
Subject: Re: [ceph-users] Cascading failure on a placement group

Hi,

The timezones on all my systems appear to be the same, I just verified
it by running 'date' on all my boxes.

- HP

On Sat, 2016-08-13 at 12:36 +, Goncalo Borges wrote:
> The ticket I mentioned earlier was marked as a duplicate of
>
> http://tracker.ceph.com/issues/9732
>
> Cheers
> Goncalo
>
> From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of
> Goncalo Borges [goncalo.bor...@sydney.edu.au]
> Sent: 13 August 2016 22:23
> To: Hein-Pieter van Braam; ceph-users
> Subject: Re: [ceph-users] Cascading failure on a placement group
>
> Hi HP.
>
> I am just a site admin so my opinion should be validated by proper
> support staff
>
> Seems really similar to
> http://tracker.ceph.com/issues/14399
>
> The ticket speaks about timezone difference between osds. Maybe it is
> something worthwhile to check?
>
> Cheers
> Goncalo
>
> 
> From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of
> Hein-Pieter van Braam [h...@tmm.cx]
> Sent: 13 August 2016 21:48
> To: ceph-users
> Subject: [ceph-users] Cascading failure on a placement group
>
> Hello all,
>
> My cluster started to lose OSDs without any warning, whenever an OSD
> becomes the primary for a particular PG it crashes with the following
> stacktrace:
>
>  ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
>  1: /usr/bin/ceph-osd() [0xada722]
>  2: (()+0xf100) [0x7fc28bca5100]
>  3: (gsignal()+0x37) [0x7fc28a6bd5f7]
>  4: (abort()+0x148) [0x7fc28a6bece8]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7fc28afc29d5]
>  6: (()+0x5e946) [0x7fc28afc0946]
>  7: (()+0x5e973) [0x7fc28afc0973]
>  8: (()+0x5eb93) [0x7fc28afc0b93]
>  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x27a) [0xbddcba]
>  10: (ReplicatedPG::hit_set_trim(ReplicatedPG::RepGather*, unsigned
> int)+0x75f) [0x87e48f]
>  11: (ReplicatedPG::hit_set_persist()+0xedb) [0x87f4ab]
>  12: (ReplicatedPG::do_op(std::tr1::shared_ptr&)+0xe3a)
> [0x8a0d1a]
>  13: (ReplicatedPG::do_request(std::tr1::shared_ptr&,
> ThreadPool::TPHandle&)+0x68a) [0x83be4a]
>  14: (OSD::dequeue_op(boost::intrusive_ptr,
> std::tr1::shared_ptr, ThreadPool::TPHandle&)+0x405)
> [0x69a5c5]
>  15: (OSD::ShardedOpWQ::_process(unsigned int,
> ceph::heartbeat_handle_d*)+0x333) [0x69ab33]
>  16: (ShardedThreadPool::shardedthreadpool_worker(unsigned
> int)+0x86f)
> [0xbcd1cf]
>  17: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xbcf300]
>  18: (()+0x7dc5) [0x7fc28bc9ddc5]
>  19: (clone()+0x6d) [0x7fc28a77eced]
>  NOTE: a copy of the executable, or `objdump -rdS ` is
> needed to interpret this.
>
> Has anyone ever seen this? Is there a way to fix this? My cluster is
> in
> rather large disarray at the moment. I have one of the OSDs now in a
> restart loop and that is at least preventing other OSDs from going
> down, but obviously not all other PGs can peer now.
>
> I'm not sure what else to do at the moment.
>
> Thank you so much,
>
> - HP
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] what happen to the OSDs if the OS disk dies?

2016-08-13 Thread Georgios Dimitrakakis



On 13 Aug 2016, at 03:19, Bill Sharer wrote:


If all the system disk does is handle the OS (i.e. OSD journals are
on dedicated or OSD drives as well), no problem. Just rebuild the
system and copy the ceph.conf back in when you re-install Ceph.
Keep a spare copy of your original fstab to keep your OSD filesystem
mounts straight.


With systems deployed with ceph-disk/ceph-deploy you no longer need a
fstab. Udev handles it.


Just keep in mind that you are down 11 OSDs while that system drive
gets rebuilt though. It's safer to do 10 OSDs and then have a
mirror set for the system disk.


In the years that I have run Ceph I have rarely seen OS disks fail. Why bother?
Ceph is designed for failure.

I would not sacrifice an OSD slot for an OS disk. Also, let's say an
additional OS disk is €100.

If you put that disk in 20 machines, that's €2,000. For that money
you can even buy an additional chassis.

No, I would run on a single OS disk. It fails? Let it fail. Re-install
and you're good again.

Ceph makes sure the data is safe.



Wido,

Can you elaborate a little bit more on this? How does Ceph achieve 
that? Is it by redundant MONs?


To my understanding the OSD mapping is needed to get the cluster back. 
In our setup (and I assume in others as well) that is stored on the OS 
disk. Furthermore, our MONs are running on the same hosts as the OSDs. So if 
the OS disk fails we not only lose the OSD host but we also lose the 
MON node. Is there another way to be protected from such a failure besides 
additional MONs?


We recently had a problem where a user accidentally deleted a volume. 
Of course this has nothing to do with OS disk failure itself, but it put us 
in the loop of looking for other possible failures in our 
system that could jeopardize data, and this thread got my attention.



Warmest regards,

George



Wido

 Bill Sharer

 On 08/12/2016 03:33 PM, Ronny Aasen wrote:


On 12.08.2016 13:41, Félix Barbeira wrote:


Hi,

I'm planning to make a Ceph cluster but I have a serious doubt. At
this moment we have ~10 DELL R730xd servers with 12x4TB SATA
disks. The official Ceph docs say:

"We recommend using a dedicated drive for the operating system and
software, and one drive for each Ceph OSD Daemon you run on the
host."

I could use, for example, 1 disk for the OS and 11 for OSD data. In
the operating system I would run 11 daemons to control the OSDs.
But... what happens to the cluster if the disk with the OS fails?
Maybe the cluster thinks that 11 OSDs failed and tries to replicate
all that data over the cluster... that does not sound good.

Should I use 2 disks for the OS in a RAID 1? In that case I'm
"wasting" 8TB for the ~10GB that the OS needs.

All the docs that I've been reading say Ceph has no single
point of failure, so I think that this scenario must have an
optimal solution; maybe somebody could help me.

Thanks in advance.

--

Félix Barbeira.

If you do not have dedicated slots on the back for OS disks, then I
would recommend using SATADOM flash modules plugged directly into an
internal SATA port in the machine. That saves you 2 slots for OSDs and
they are quite reliable. You could even use 2 SD cards if your machine
has the internal SD slot




http://www.dell.com/downloads/global/products/pedge/en/poweredge-idsdm-whitepaper-en.pdf

[1]

kind regards
Ronny Aasen

___
ceph-users mailing list
ceph-users@lists.ceph.com [2]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com [3]

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



Links:
--
[1]

http://www.dell.com/downloads/global/products/pedge/en/poweredge-idsdm-whitepaper-en.pdf
[2] mailto:ceph-users@lists.ceph.com
[3] http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[4] mailto:bsha...@sharerland.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CephFS quota

2016-08-13 Thread Willi Fehler

Hello,

I'm trying to use CephFS quotas. On my client I've created a 
subdirectory in my CephFS mountpoint and used the following command from 
the documentation.


setfattr -n ceph.quota.max_bytes -v 1 /mnt/cephfs/quota

But if I create files bigger than my quota nothing happens. Do I need a 
mount option to use Quotas?


Regards - Willi

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] what happen to the OSDs if the OS disk dies?

2016-08-13 Thread w...@42on.com


> On 13 Aug 2016, at 03:19, Bill Sharer wrote:
> 
> If all the system disk does is handle the OS (i.e. OSD journals are on 
> dedicated or OSD drives as well), no problem.  Just rebuild the system and 
> copy the ceph.conf back in when you re-install Ceph.  Keep a spare copy of 
> your original fstab to keep your OSD filesystem mounts straight.
> 

With systems deployed with ceph-disk/ceph-deploy you no longer need a fstab. 
Udev handles it.
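
As a rough sketch of the recovery flow for a failed OS disk on a node deployed with
ceph-disk (the exact steps are assumptions and vary per distribution): reinstall the OS and
the same Ceph release, restore /etc/ceph/ceph.conf and the keyrings from backup or
configuration management, and then:

ceph-disk list           # the OSD data partitions should show up as "ceph data"
ceph-disk activate-all   # or simply reboot and let udev activate the tagged partitions
ceph osd tree            # confirm the host's OSDs come back up and in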

> Just keep in mind that you are down 11 osds while that system drive gets 
> rebuilt though.  It's safer to do 10 osds and then have a mirror set for the 
> system disk.
> 

In the years that I have run Ceph I have rarely seen OS disks fail. Why bother? Ceph is 
designed for failure.

I would not sacrifice an OSD slot for an OS disk. Also, let's say an additional OS 
disk is €100.

If you put that disk in 20 machines, that's €2,000. For that money you can even 
buy an additional chassis.

No, I would run on a single OS disk. It fails? Let it fail. Re-install and 
you're good again.

Ceph makes sure the data is safe.

Wido

> Bill Sharer
> 
> 
>> On 08/12/2016 03:33 PM, Ronny Aasen wrote:
>>> On 12.08.2016 13:41, Félix Barbeira wrote:
>>> Hi,
>>> 
>>> I'm planning to make a Ceph cluster but I have a serious doubt. At this 
>>> moment we have ~10 DELL R730xd servers with 12x4TB SATA disks. The official 
>>> Ceph docs say:
>>> 
>>> "We recommend using a dedicated drive for the operating system and 
>>> software, and one drive for each Ceph OSD Daemon you run on the host."
>>> 
>>> I could use, for example, 1 disk for the OS and 11 for OSD data. In the 
>>> operating system I would run 11 daemons to control the OSDs. But... what 
>>> happens to the cluster if the disk with the OS fails? Maybe the cluster 
>>> thinks that 11 OSDs failed and tries to replicate all that data over the 
>>> cluster... that does not sound good.
>>> 
>>> Should I use 2 disks for the OS in a RAID 1? In that case I'm "wasting" 
>>> 8TB for the ~10GB that the OS needs.
>>> 
>>> All the docs that I've been reading say Ceph has no single point 
>>> of failure, so I think that this scenario must have an optimal solution; 
>>> maybe somebody could help me.
>>> 
>>> Thanks in advance.
>>> 
>>> -- 
>>> Félix Barbeira.
>>> 
>> If you do not have dedicated slots on the back for OS disks, then I would 
>> recommend using SATADOM flash modules plugged directly into an internal SATA 
>> port in the machine. That saves you 2 slots for OSDs and they are quite 
>> reliable. You could even use 2 SD cards if your machine has the internal SD slot 
>> 
>> http://www.dell.com/downloads/global/products/pedge/en/poweredge-idsdm-whitepaper-en.pdf
>> 
>> kind regards
>> Ronny Aasen
>> 
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] what happen to the OSDs if the OS disk dies?

2016-08-13 Thread w...@42on.com


> On 13 Aug 2016, at 08:58, Georgios Dimitrakakis wrote:
> 
> 
>>> On 13 Aug 2016, at 03:19, Bill Sharer wrote:
>>> 
>>> If all the system disk does is handle the OS (i.e. OSD journals are
>>> on dedicated or OSD drives as well), no problem. Just rebuild the
>>> system and copy the ceph.conf back in when you re-install Ceph.
>>> Keep a spare copy of your original fstab to keep your OSD filesystem
>>> mounts straight.
>> 
>> With systems deployed with ceph-disk/ceph-deploy you no longer need a
>> fstab. Udev handles it.
>> 
>>> Just keep in mind that you are down 11 osds while that system drive
>>> gets rebuilt though. It's safer to do 10 osds and then have a
>>> mirror set for the system disk.
>> 
>> In the years that I have run Ceph I have rarely seen OS disks fail. Why bother?
>> Ceph is designed for failure.
>> 
>> I would not sacrifice an OSD slot for an OS disk. Also, let's say an
>> additional OS disk is €100.
>> 
>> If you put that disk in 20 machines, that's €2,000. For that money
>> you can even buy an additional chassis.
>> 
>> No, I would run on a single OS disk. It fails? Let it fail. Re-install
>> and you're good again.
>> 
>> Ceph makes sure the data is safe.
>> 
> 
> Wido,
> 
> Can you elaborate a little bit more on this? How does Ceph achieve that? Is 
> it by redundant MONs?
> 

No, Ceph replicates over hosts by default. So you can lose a host and the 
other ones will have copies.
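
To see why a whole-host failure is survivable, you can check that the CRUSH rule your pools
use spreads replicas across hosts (standard tooling; the temporary file names are just
examples):

ceph osd getcrushmap -o /tmp/crushmap            # export the compiled CRUSH map
crushtool -d /tmp/crushmap -o /tmp/crushmap.txt  # decompile it to text
grep -A8 '^rule' /tmp/crushmap.txt               # the rule should contain:
                                                 #   step chooseleaf firstn 0 type host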


> To my understanding the OSD mapping is needed to get the cluster back. In 
> our setup (and I assume in others as well) that is stored on the OS disk. 
> Furthermore, our MONs are running on the same hosts as the OSDs, so if the OS 
> disk fails we not only lose the OSD host but we also lose the MON node. Is 
> there another way to be protected from such a failure besides additional MONs?
> 

Aha, MON on the OSD host. I never recommend that. Try to use dedicated machines 
with a good SSD for MONs.

Technically you can run the MON on the OSD nodes, but I always try to avoid it. 
It just isn't practical when stuff really goes wrong.

Wido

> We recently had a problem where a user accidentally deleted a volume. Of 
> course this has nothing to do with OS disk failure itself, but it put us in 
> the loop of looking for other possible failures in our system that 
> could jeopardize data, and this thread got my attention.
> 
> 
> Warmest regards,
> 
> George
> 
> 
>> Wido
>> 
>> Bill Sharer
>> 
>>> On 08/12/2016 03:33 PM, Ronny Aasen wrote:
>>> 
 On 12.08.2016 13:41, Félix Barbeira wrote:
 
 Hi,
 
 I'm planning to make a ceph cluster but I have a serious doubt. At
 this moment we have ~10 servers DELL R730xd with 12x4TB SATA
 disks. The official ceph docs says:
 
 "We recommend using a dedicated drive for the operating system and
 software, and one drive for each Ceph OSD Daemon you run on the
 host."
 
 I could use for example 1 disk for the OS and 11 for OSD data. In
 the operating system I would run 11 daemons to control the OSDs.
 But...what happen to the cluster if the disk with the OS fails??
 maybe the cluster thinks that 11 OSD failed and try to replicate
 all that data over the cluster...that sounds no good.
 
 Should I use 2 disks for the OS making a RAID1? in this case I'm
 "wasting" 8TB only for ~10GB that the OS needs.
 
 In all the docs that i've been reading says ceph has no unique
 single point of failure, so I think that this scenario must have a
 optimal solution, maybe somebody could help me.
 
 Thanks in advance.
 
 --
 
 Félix Barbeira.
>>> If you do not have dedicated slots on the back for OS disks, then I
>>> would recommend using SATADOM flash modules plugged directly into an
>>> internal SATA port in the machine. That saves you 2 slots for OSDs and
>>> they are quite reliable. You could even use 2 SD cards if your machine
>>> has the internal SD slot
>>> 
>>> 
>> http://www.dell.com/downloads/global/products/pedge/en/poweredge-idsdm-whitepaper-en.pdf
>>> [1]
>>> 
>>> kind regards
>>> Ronny Aasen
>>> 
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com [2]
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com [3]
>>> 
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
>> 
>> Links:
>> --
>> [1]
>> http://www.dell.com/downloads/global/products/pedge/en/poweredge-idsdm-whitepaper-en.pdf
>> [2] mailto:ceph-users@lists.ceph.com
>> [3] http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> [4] mailto:bsha...@sharerland.com
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list

Re: [ceph-users] CephFS quota

2016-08-13 Thread w...@42on.com


> On 13 Aug 2016, at 09:24, Willi Fehler wrote:
> 
> Hello,
> 
> I'm trying to use CephFS quotas. On my client I've created a subdirectory in 
> my CephFS mountpoint and used the following command from the documentation.
> 
> setfattr -n ceph.quota.max_bytes -v 1 /mnt/cephfs/quota
> 
> But if I create files bigger than my quota nothing happens. Do I need a mount 
> option to use Quotas?
> 

What version is the client? CephFS quotas rely on the client to support it as 
well.
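
For reference, a quick way to check what you are running on the client and on the cluster
nodes (at this point the kernel CephFS client does not enforce quotas at all; ceph-fuse does,
given a recent enough version and the quota option mentioned elsewhere in this thread):

ceph-fuse --version    # the FUSE client is what actually enforces the quota
ceph --version         # version of the Ceph packages on this node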

Wido

> Regards - Willi
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Cascading failure on a placement group

2016-08-13 Thread Hein-Pieter van Braam
Hello all,

My cluster started to lose OSDs without any warning, whenever an OSD
becomes the primary for a particular PG it crashes with the following
stacktrace:

 ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
 1: /usr/bin/ceph-osd() [0xada722]
 2: (()+0xf100) [0x7fc28bca5100]
 3: (gsignal()+0x37) [0x7fc28a6bd5f7]
 4: (abort()+0x148) [0x7fc28a6bece8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7fc28afc29d5]
 6: (()+0x5e946) [0x7fc28afc0946]
 7: (()+0x5e973) [0x7fc28afc0973]
 8: (()+0x5eb93) [0x7fc28afc0b93]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x27a) [0xbddcba]
 10: (ReplicatedPG::hit_set_trim(ReplicatedPG::RepGather*, unsigned
int)+0x75f) [0x87e48f]
 11: (ReplicatedPG::hit_set_persist()+0xedb) [0x87f4ab]
 12: (ReplicatedPG::do_op(std::tr1::shared_ptr&)+0xe3a)
[0x8a0d1a]
 13: (ReplicatedPG::do_request(std::tr1::shared_ptr&,
ThreadPool::TPHandle&)+0x68a) [0x83be4a]
 14: (OSD::dequeue_op(boost::intrusive_ptr,
std::tr1::shared_ptr, ThreadPool::TPHandle&)+0x405)
[0x69a5c5]
 15: (OSD::ShardedOpWQ::_process(unsigned int,
ceph::heartbeat_handle_d*)+0x333) [0x69ab33]
 16: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f)
[0xbcd1cf]
 17: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xbcf300]
 18: (()+0x7dc5) [0x7fc28bc9ddc5]
 19: (clone()+0x6d) [0x7fc28a77eced]
 NOTE: a copy of the executable, or `objdump -rdS ` is
needed to interpret this.

Has anyone ever seen this? Is there a way to fix this? My cluster is in
rather large disarray at the moment. I have one of the OSDs now in a
restart loop and that is at least preventing other OSDs from going
down, but obviously not all other PGs can peer now.

I'm not sure what else to do at the moment.

Thank you so much,

- HP
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cascading failure on a placement group

2016-08-13 Thread Goncalo Borges
The ticket I mentioned earlier was marked as a duplicate of

http://tracker.ceph.com/issues/9732

Cheers
Goncalo

From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Goncalo 
Borges [goncalo.bor...@sydney.edu.au]
Sent: 13 August 2016 22:23
To: Hein-Pieter van Braam; ceph-users
Subject: Re: [ceph-users] Cascading failure on a placement group

Hi HP.

I am just a site admin so my opinion should be validated by proper support staff

Seems really similar to
http://tracker.ceph.com/issues/14399

The ticket speaks about timezone differences between OSDs. Maybe it is something 
worthwhile to check?

Cheers
Goncalo


From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Hein-Pieter 
van Braam [h...@tmm.cx]
Sent: 13 August 2016 21:48
To: ceph-users
Subject: [ceph-users] Cascading failure on a placement group

Hello all,

My cluster started to lose OSDs without any warning, whenever an OSD
becomes the primary for a particular PG it crashes with the following
stacktrace:

 ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
 1: /usr/bin/ceph-osd() [0xada722]
 2: (()+0xf100) [0x7fc28bca5100]
 3: (gsignal()+0x37) [0x7fc28a6bd5f7]
 4: (abort()+0x148) [0x7fc28a6bece8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7fc28afc29d5]
 6: (()+0x5e946) [0x7fc28afc0946]
 7: (()+0x5e973) [0x7fc28afc0973]
 8: (()+0x5eb93) [0x7fc28afc0b93]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x27a) [0xbddcba]
 10: (ReplicatedPG::hit_set_trim(ReplicatedPG::RepGather*, unsigned
int)+0x75f) [0x87e48f]
 11: (ReplicatedPG::hit_set_persist()+0xedb) [0x87f4ab]
 12: (ReplicatedPG::do_op(std::tr1::shared_ptr&)+0xe3a)
[0x8a0d1a]
 13: (ReplicatedPG::do_request(std::tr1::shared_ptr&,
ThreadPool::TPHandle&)+0x68a) [0x83be4a]
 14: (OSD::dequeue_op(boost::intrusive_ptr,
std::tr1::shared_ptr, ThreadPool::TPHandle&)+0x405)
[0x69a5c5]
 15: (OSD::ShardedOpWQ::_process(unsigned int,
ceph::heartbeat_handle_d*)+0x333) [0x69ab33]
 16: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f)
[0xbcd1cf]
 17: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xbcf300]
 18: (()+0x7dc5) [0x7fc28bc9ddc5]
 19: (clone()+0x6d) [0x7fc28a77eced]
 NOTE: a copy of the executable, or `objdump -rdS ` is
needed to interpret this.

Has anyone ever seen this? Is there a way to fix this? My cluster is in
rather large disarray at the moment. I have one of the OSDs now in a
restart loop and that is at least preventing other OSDs from going
down, but obviously not all other PGs can peer now.

I'm not sure what else to do at the moment.

Thank you so much,

- HP
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS quota

2016-08-13 Thread Goncalo Borges
Hi Willi
If you are using ceph-fuse, to enable quotas you need to pass the "--client-quota" 
option in the mount operation.
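
A minimal sketch of the whole flow, assuming a ceph-fuse mount at /mnt/cephfs; the 10 GB
limit and the paths are illustrative values, not taken from Willi's setup:

ceph-fuse --client-quota /mnt/cephfs                                # mount with quota enforcement enabled
mkdir -p /mnt/cephfs/quota
setfattr -n ceph.quota.max_bytes -v 10737418240 /mnt/cephfs/quota   # 10 GB cap on this subtree
getfattr -n ceph.quota.max_bytes /mnt/cephfs/quota                  # verify the attribute is set

Note that the quota is enforced by the client, so only clients that support quotas will
honour it.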
Cheers
Goncalo


From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Willi Fehler 
[willi.feh...@t-online.de]
Sent: 13 August 2016 17:23
To: ceph-users
Subject: [ceph-users] CephFS quota

Hello,

I'm trying to use CephFS quotas. On my client I've created a
subdirectory in my CephFS mountpoint and used the following command from
the documentation.

setfattr -n ceph.quota.max_bytes -v 1 /mnt/cephfs/quota

But if I create files bigger than my quota nothing happens. Do I need a
mount option to use Quotas?

Regards - Willi

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cascading failure on a placement group

2016-08-13 Thread Goncalo Borges
Hi HP.

I am just a site admin so my opinion should be validated by proper support staff

Seems really similar to
http://tracker.ceph.com/issues/14399

The ticket speaks about timezone differences between OSDs. Maybe it is something 
worthwhile to check?

Cheers
Goncalo


From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Hein-Pieter 
van Braam [h...@tmm.cx]
Sent: 13 August 2016 21:48
To: ceph-users
Subject: [ceph-users] Cascading failure on a placement group

Hello all,

My cluster started to lose OSDs without any warning, whenever an OSD
becomes the primary for a particular PG it crashes with the following
stacktrace:

 ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
 1: /usr/bin/ceph-osd() [0xada722]
 2: (()+0xf100) [0x7fc28bca5100]
 3: (gsignal()+0x37) [0x7fc28a6bd5f7]
 4: (abort()+0x148) [0x7fc28a6bece8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7fc28afc29d5]
 6: (()+0x5e946) [0x7fc28afc0946]
 7: (()+0x5e973) [0x7fc28afc0973]
 8: (()+0x5eb93) [0x7fc28afc0b93]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x27a) [0xbddcba]
 10: (ReplicatedPG::hit_set_trim(ReplicatedPG::RepGather*, unsigned
int)+0x75f) [0x87e48f]
 11: (ReplicatedPG::hit_set_persist()+0xedb) [0x87f4ab]
 12: (ReplicatedPG::do_op(std::tr1::shared_ptr&)+0xe3a)
[0x8a0d1a]
 13: (ReplicatedPG::do_request(std::tr1::shared_ptr&,
ThreadPool::TPHandle&)+0x68a) [0x83be4a]
 14: (OSD::dequeue_op(boost::intrusive_ptr,
std::tr1::shared_ptr, ThreadPool::TPHandle&)+0x405)
[0x69a5c5]
 15: (OSD::ShardedOpWQ::_process(unsigned int,
ceph::heartbeat_handle_d*)+0x333) [0x69ab33]
 16: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f)
[0xbcd1cf]
 17: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xbcf300]
 18: (()+0x7dc5) [0x7fc28bc9ddc5]
 19: (clone()+0x6d) [0x7fc28a77eced]
 NOTE: a copy of the executable, or `objdump -rdS ` is
needed to interpret this.

Has anyone ever seen this? Is there a way to fix this? My cluster is in
rather large disarray at the moment. I have one of the OSDs now in a
restart loop and that is at least preventing other OSDs from going
down, but obviously not all other PGs can peer now.

I'm not sure what else to do at the moment.

Thank you so much,

- HP
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Multiple OSD crashing a lot

2016-08-13 Thread Hein-Pieter van Braam
Hi Blade,

I appear to be stuck in the same situation you were in. Do you still
happen to have a patch to implement this workaround you described?

Thanks,

- HP
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cascading failure on a placement group

2016-08-13 Thread Hein-Pieter van Braam
Hi Goncalo,

Thank you for your response. I had already found that issue but it does
not apply to my situation. The timezones are correct and I'm running a
pure hammer cluster.

- HP

On Sat, 2016-08-13 at 12:23 +, Goncalo Borges wrote:
> Hi HP.
> 
> I am just a site admin so my opinion should be validated by proper
> support staff
> 
> Seems really similar to
> http://tracker.ceph.com/issues/14399
> 
> The ticket speaks about timezone difference between osds. Maybe it is
> something worthwhile to check?
> 
> Cheers
> Goncalo
> 
> 
> From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of
> Hein-Pieter van Braam [h...@tmm.cx]
> Sent: 13 August 2016 21:48
> To: ceph-users
> Subject: [ceph-users] Cascading failure on a placement group
> 
> Hello all,
> 
> My cluster started to lose OSDs without any warning, whenever an OSD
> becomes the primary for a particular PG it crashes with the following
> stacktrace:
> 
>  ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
>  1: /usr/bin/ceph-osd() [0xada722]
>  2: (()+0xf100) [0x7fc28bca5100]
>  3: (gsignal()+0x37) [0x7fc28a6bd5f7]
>  4: (abort()+0x148) [0x7fc28a6bece8]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7fc28afc29d5]
>  6: (()+0x5e946) [0x7fc28afc0946]
>  7: (()+0x5e973) [0x7fc28afc0973]
>  8: (()+0x5eb93) [0x7fc28afc0b93]
>  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x27a) [0xbddcba]
>  10: (ReplicatedPG::hit_set_trim(ReplicatedPG::RepGather*, unsigned
> int)+0x75f) [0x87e48f]
>  11: (ReplicatedPG::hit_set_persist()+0xedb) [0x87f4ab]
>  12: (ReplicatedPG::do_op(std::tr1::shared_ptr&)+0xe3a)
> [0x8a0d1a]
>  13: (ReplicatedPG::do_request(std::tr1::shared_ptr&,
> ThreadPool::TPHandle&)+0x68a) [0x83be4a]
>  14: (OSD::dequeue_op(boost::intrusive_ptr,
> std::tr1::shared_ptr, ThreadPool::TPHandle&)+0x405)
> [0x69a5c5]
>  15: (OSD::ShardedOpWQ::_process(unsigned int,
> ceph::heartbeat_handle_d*)+0x333) [0x69ab33]
>  16: (ShardedThreadPool::shardedthreadpool_worker(unsigned
> int)+0x86f)
> [0xbcd1cf]
>  17: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xbcf300]
>  18: (()+0x7dc5) [0x7fc28bc9ddc5]
>  19: (clone()+0x6d) [0x7fc28a77eced]
>  NOTE: a copy of the executable, or `objdump -rdS ` is
> needed to interpret this.
> 
> Has anyone ever seen this? Is there a way to fix this? My cluster is
> in
> rather large disarray at the moment. I have one of the OSDs now in a
> restart loop and that is at least preventing other OSDs from going
> down, but obviously not all other PGs can peer now.
> 
> I'm not sure what else to do at the moment.
> 
> Thank you so much,
> 
> - HP
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cascading failure on a placement group

2016-08-13 Thread Hein-Pieter van Braam
Hi,

The timezones on all my systems appear to be the same, I just verified
it by running 'date' on all my boxes.

- HP

On Sat, 2016-08-13 at 12:36 +, Goncalo Borges wrote:
> The ticket I mentioned earlier was marked as a duplicate of
> 
> http://tracker.ceph.com/issues/9732
> 
> Cheers
> Goncalo
> 
> From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of
> Goncalo Borges [goncalo.bor...@sydney.edu.au]
> Sent: 13 August 2016 22:23
> To: Hein-Pieter van Braam; ceph-users
> Subject: Re: [ceph-users] Cascading failure on a placement group
> 
> Hi HP.
> 
> I am just a site admin so my opinion should be validated by proper
> support staff
> 
> Seems really similar to
> http://tracker.ceph.com/issues/14399
> 
> The ticket speaks about timezone difference between osds. Maybe it is
> something worthwhile to check?
> 
> Cheers
> Goncalo
> 
> 
> From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of
> Hein-Pieter van Braam [h...@tmm.cx]
> Sent: 13 August 2016 21:48
> To: ceph-users
> Subject: [ceph-users] Cascading failure on a placement group
> 
> Hello all,
> 
> My cluster started to lose OSDs without any warning, whenever an OSD
> becomes the primary for a particular PG it crashes with the following
> stacktrace:
> 
>  ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
>  1: /usr/bin/ceph-osd() [0xada722]
>  2: (()+0xf100) [0x7fc28bca5100]
>  3: (gsignal()+0x37) [0x7fc28a6bd5f7]
>  4: (abort()+0x148) [0x7fc28a6bece8]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7fc28afc29d5]
>  6: (()+0x5e946) [0x7fc28afc0946]
>  7: (()+0x5e973) [0x7fc28afc0973]
>  8: (()+0x5eb93) [0x7fc28afc0b93]
>  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x27a) [0xbddcba]
>  10: (ReplicatedPG::hit_set_trim(ReplicatedPG::RepGather*, unsigned
> int)+0x75f) [0x87e48f]
>  11: (ReplicatedPG::hit_set_persist()+0xedb) [0x87f4ab]
>  12: (ReplicatedPG::do_op(std::tr1::shared_ptr&)+0xe3a)
> [0x8a0d1a]
>  13: (ReplicatedPG::do_request(std::tr1::shared_ptr&,
> ThreadPool::TPHandle&)+0x68a) [0x83be4a]
>  14: (OSD::dequeue_op(boost::intrusive_ptr,
> std::tr1::shared_ptr, ThreadPool::TPHandle&)+0x405)
> [0x69a5c5]
>  15: (OSD::ShardedOpWQ::_process(unsigned int,
> ceph::heartbeat_handle_d*)+0x333) [0x69ab33]
>  16: (ShardedThreadPool::shardedthreadpool_worker(unsigned
> int)+0x86f)
> [0xbcd1cf]
>  17: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xbcf300]
>  18: (()+0x7dc5) [0x7fc28bc9ddc5]
>  19: (clone()+0x6d) [0x7fc28a77eced]
>  NOTE: a copy of the executable, or `objdump -rdS ` is
> needed to interpret this.
> 
> Has anyone ever seen this? Is there a way to fix this? My cluster is
> in
> rather large disarray at the moment. I have one of the OSDs now in a
> restart loop and that is at least preventing other OSDs from going
> down, but obviously not all other PGs can peer now.
> 
> I'm not sure what else to do at the moment.
> 
> Thank you so much,
> 
> - HP
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Scst-devel] Thin Provisioning and Ceph RBD's

2016-08-13 Thread Alex Gorbachev
On Mon, Aug 8, 2016 at 7:56 AM, Ilya Dryomov  wrote:
> On Sun, Aug 7, 2016 at 7:57 PM, Alex Gorbachev  
> wrote:
>>> I'm confused.  How can a 4M discard not free anything?  It's either
>>> going to hit an entire object or two adjacent objects, truncating the
>>> tail of one and zeroing the head of another.  Using rbd diff:
>>>
>>> $ rbd diff test | grep -A 1 25165824
>>> 25165824  4194304 data
>>> 29360128  4194304 data
>>>
>>> # a 4M discard at 1M into a RADOS object
>>> $ blkdiscard -o $((25165824 + (1 << 20))) -l $((4 << 20)) /dev/rbd0
>>>
>>> $ rbd diff test | grep -A 1 25165824
>>> 25165824  1048576 data
>>> 29360128  4194304 data
>>
>> I have tested this on a small RBD device with such offsets and indeed,
>> the discard works as you describe, Ilya.
>>
>> Looking more into why ESXi's discard is not working.  I found this
>> message in kern.log on Ubuntu on creation of the SCST LUN, which shows
>> unmap_alignment 0:
>>
>> Aug  6 22:02:33 e1 kernel: [300378.136765] virt_id 33 (p_iSCSILun_sclun945)
>> Aug  6 22:02:33 e1 kernel: [300378.136782] dev_vdisk: Auto enable thin
>> provisioning for device /dev/rbd/spin1/unmap1t
>> Aug  6 22:02:33 e1 kernel: [300378.136784] unmap_gran 8192,
>> unmap_alignment 0, max_unmap_lba 8192, discard_zeroes_data 1
>> Aug  6 22:02:33 e1 kernel: [300378.136786] dev_vdisk: Attached SCSI
>> target virtual disk p_iSCSILun_sclun945
>> (file="/dev/rbd/spin1/unmap1t", fs=409600MB, bs=512,
>> nblocks=838860800, cyln=409600)
>> Aug  6 22:02:33 e1 kernel: [300378.136847] [4682]:
>> scst_alloc_add_tgt_dev:5287:Device p_iSCSILun_sclun945 on SCST lun=32
>> Aug  6 22:02:33 e1 kernel: [300378.136853] [4682]: scst:
>> scst_alloc_set_UA:12711:Queuing new UA 8810251f3a90 (6:29:0,
>> d_sense 0) to tgt_dev 88102583ad00 (dev p_iSCSILun_sclun945,
>> initiator copy_manager_sess)
>>
>> even though:
>>
>> root@e1:/sys/block/rbd29# cat discard_alignment
>> 4194304
>>
>> So somehow the discard_alignment is not making it into the LUN.  Could
>> this be the issue?
>
> No, if you are not seeing *any* effect, the alignment is pretty much
> irrelevant.  Can you do the following on a small test image?
>
> - capture "rbd diff" output
> - blktrace -d /dev/rbd0 -o - | blkparse -i - -o rbd0.trace
> - issue a few discards with blkdiscard
> - issue a few unmaps with ESXi, preferably with SCST debugging enabled
> - capture "rbd diff" output again
>
> and attach all of the above?  (You might need to install a blktrace
> package.)
>
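
A rough sketch of the capture sequence Ilya describes above (a sketch only; the image and
device names are placeholders matching the earlier examples):

rbd diff spin1/unmap1t > diff.before                                 # baseline allocation map
blktrace -d /dev/rbd0 -o - | blkparse -i - -o rbd0.trace &           # record block-layer traffic, including discards
blkdiscard -o $((25165824 + (1 << 20))) -l $((4 << 20)) /dev/rbd0    # a few manual test discards
# ... run the ESXi unmap test against the same LUN, with SCST debugging enabled ...
kill %1                                                              # stop the blktrace pipeline
rbd diff spin1/unmap1t > diff.after
diff diff.before diff.after                                          # see which extents were actually released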

Latest results from VMWare validation tests:

Each test creates and deletes a virtual disk, then calls ESXi unmap
for what ESXi maps to that volume:

Test 1: 10GB reclaim, rbd diff size: 3GB, discards: 4829

Test 2: 100GB reclaim, rbd diff size: 50GB, discards: 197837

Test 3: 175GB reclaim, rbd diff size: 47 GB, discards: 197824

Test 4: 250GB reclaim, rbd diff size: 125GB, discards: 197837

Test 5: 250GB reclaim, rbd diff size: 80GB, discards: 197837

At the end, the combined used size via rbd diff is 608 GB out of 775 GB
of data, so we release only about 20% via discards in the end.
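
For reference, one way to get figures like the "rbd diff size" numbers above is to sum the
allocated extents that rbd diff reports, e.g. for the image from the earlier kern.log excerpt:

rbd diff spin1/unmap1t | awk '$3 == "data" { sum += $2 } END { printf "%.1f GiB\n", sum/1024/1024/1024 }'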

Thank you,
Alex
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Multiple OSD crashing a lot

2016-08-13 Thread Hein-Pieter van Braam
Hi Blade,

I was planning to do something similar. Run the OSD in the way you
describe, use object copy to copy the data to a new volume, then move
the clients to the new volume.

Thanks a lot,

- HP

On Sat, 2016-08-13 at 08:18 -0700, Blade Doyle wrote:
> Hi HP.
> 
> Mine was not really a fix, it was just a hack to get the OSD up long
> enough to make sure I had a full backup, then I rebuilt the cluster
> from scratch and restored the data.  Though the hack did stop the OSD
> from crashing, it is probably a symptom of some internal problem, and
> may not be "safe" to run like that in the long term.
> 
> The change was something like this:
> 
> Ref:  https://github.com/ceph/ceph/blob/master/src/osd/ReplicatedPG.cc
> 
> I changed this:
> 
> ObjectContextRef obc = get_object_context(oid, false);
> assert(obc);
> --ctx->delta_stats.num_objects;
> --ctx->delta_stats.num_objects_hit_set_archive;
> ctx->delta_stats.num_bytes -= obc->obs.oi.size;
> ctx->delta_stats.num_bytes_hit_set_archive -= obc->obs.oi.size;
> 
> to this:
> 
> ObjectContextRef obc = 0; // get_object_context(oid, false); assert(obc);
> --ctx->delta_stats.num_objects;
> --ctx->delta_stats.num_objects_hit_set_archive;
> if (obc)
> {
>  ctx->delta_stats.num_bytes -= obc->obs.oi.size;
>  ctx->delta_stats.num_bytes_hit_set_archive -= obc->obs.oi.size;
> }
> 
> 
> Good luck!
> Blade.
> 
> 
> On Sat, Aug 13, 2016 at 5:52 AM, Hein-Pieter van Braam 
> wrote:
> > Hi Blade,
> > 
> > I appear to be stuck in the same situation you were in. Do you
> > still
> > happen to have a patch to implement this workaround you described?
> > 
> > Thanks,
> > 
> > - HP
> > 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Multiple OSD crashing a lot

2016-08-13 Thread Blade Doyle
Hi HP.

Mine was not really a fix, it was just a hack to get the OSD up long enough
to make sure I had a full backup; then I rebuilt the cluster from scratch
and restored the data.  Though the hack did stop the OSD from crashing, the
crash is probably a symptom of some internal problem, and it may not be "safe" to
run like that in the long term.

The change was something like this:

Ref:  https://github.com/ceph/ceph/blob/master/src/osd/ReplicatedPG.cc

I changed this:

ObjectContextRef obc = get_object_context(oid, false);
assert(obc);
--ctx->delta_stats.num_objects;
--ctx->delta_stats.num_objects_hit_set_archive;
ctx->delta_stats.num_bytes -= obc->obs.oi.size;
ctx->delta_stats.num_bytes_hit_set_archive -= obc->obs.oi.size;

to this:

ObjectContextRef obc = 0; // get_object_context(oid, false); assert(obc);
--ctx->delta_stats.num_objects;
--ctx->delta_stats.num_objects_hit_set_archive;
if (obc)  // obc is always null here, so the byte counters are simply left untouched
{
  ctx->delta_stats.num_bytes -= obc->obs.oi.size;
  ctx->delta_stats.num_bytes_hit_set_archive -= obc->obs.oi.size;
}
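
As a sanity check alongside the hack, listing the hit-set archive objects in the affected
cache pool can show whether an object that hit_set_trim expects has actually gone missing
(the pool name is a placeholder, and this assumes the failing assert is the
missing-archive-object case):

rados -p <cache-pool> ls | grep '^hit_set_'    # archive objects carry a hit_set_ prefix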


Good luck!
Blade.


On Sat, Aug 13, 2016 at 5:52 AM, Hein-Pieter van Braam  wrote:

> Hi Blade,
>
> I appear to be stuck in the same situation you were in. Do you still
> happen to have a patch to implement this workaround you described?
>
> Thanks,
>
> - HP
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com