Re: [ceph-users] xenserver or xen ceph

2016-03-30 Thread Jiri Kanicky

Hi.

There is a solution for Ceph in XenServer. With the help of my engineer
Mark, we developed a simple patch which allows you to search for and attach
an RBD image on XenServer. We create an LVHD over the RBD (not RBD-per-VDI
mapping yet), so it is far from ideal, but it's a good start. The process
of creating the SR over RBD works even from XenCenter.


https://github.com/mstarikov/rbdsr

Install notes are included and it's very simple; it takes only a few minutes
per XenServer.
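
For anyone who wants a feel for what the patch does before reading the
install notes: the general idea is to map an RBD image on the XenServer
host and lay an LVHD SR over it. A minimal sketch only (the image name,
size and SR label below are examples, not the patch's actual commands):

$ rbd create xen-sr --pool rbd --size 1048576     # 1 TB image for the SR
$ rbd map rbd/xen-sr                              # exposes e.g. /dev/rbd0 on the host
$ xe sr-create name-label="ceph-rbd-sr" type=lvm device-config:device=/dev/rbd0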


We have been running this in our Sydney Citrix lab for some time and I
have been running it at home as well. It works great. The patch should also
work in the upcoming version of XenServer (Dundee). In addition, we are
trying to get native Ceph packages into the new version and to build an
experimental (not official or approved yet) version of smapi which would
allow us to map an RBD per VDI, but there are no details on this yet.
Anyway, everyone is welcome to participate in improving the patch on GitHub.


Let me know if you have any questions.

Cheers,
Jiri

On 16/02/2016 15:30, Christian Balzer wrote:

On Tue, 16 Feb 2016 11:52:17 +0800 (CST) maoqi1982 wrote:


Hi lists
Is there any solution or documents that ceph as xenserver or xen backend
storage?



Not really.

There was a project to natively support Ceph (RBD) in Xenserver but that
seems to have gone nowhere.

There was also a thread here last year, "RBD hard crash on kernel
3.10" (google for it), where Shawn Edwards was working on something similar,
but that seems to have died off silently as well.

While you could of course put an NFS (some pain) or iSCSI (major pain)
head in front of Ceph, the pain and reduced performance make it an
unattractive proposition.

Christian


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Issue with journal on another drive

2015-09-30 Thread Jiri Kanicky

Thanks to all for responses. Great thread with a lot of info.

I will go with 3 partitions on the Kingston SSD for the 3 OSDs on each node.
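
For reference, a minimal sketch of carving out the three journal
partitions with sgdisk (the SSD device name and partition sizes are
assumptions):

$ sudo sgdisk -n 1:0:+10G -t 1:8300 /dev/sdd
$ sudo sgdisk -n 2:0:+10G -t 2:8300 /dev/sdd
$ sudo sgdisk -n 3:0:+10G -t 3:8300 /dev/sdd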

Thanks
Jiri

On 30/09/2015 00:38, Lionel Bouton wrote:

Hi,

Le 29/09/2015 13:32, Jiri Kanicky a écrit :

Hi Lionel.

Thank you for your reply. In this case I am considering creating
separate partitions on the SSD drive, one for each disk. It would be good to
know what the performance difference is, because creating partitions
is kind of a waste of space.

The difference is hard to guess: filesystems need more CPU power than
raw block devices for example, so if you don't have much CPU power this
can make a significant difference. Filesystems might put more load on
your storage too (for example ext3/4 with data=journal will at least
double the disk writes). So there's a lot to consider, and nothing will
be faster for journals than a raw partition. LVM logical volumes come a
close second because usually (if you simply use LVM to create
your logical volumes and don't try to use anything else like snapshots)
they don't change access patterns and almost don't need any CPU power.


One more question: is it a good idea to move the journals for 3 OSDs to a
single SSD, considering that if the SSD fails the whole node with 3 HDDs will
be down?

If your SSDs are working well with Ceph and aren't cheap models dying
under heavy writes, yes. I use one 200GB DC3710 SSD for 6 7200rpm SATA
OSDs (using 60GB of it for the 6 journals) and it works very well (they
were a huge performance boost compared to our previous use of internal
journals).
Some SSDs are slower than HDDs for Ceph journals though (there has been
a lot of discussions on this subject on this mailing list).


Thinking of it, leaving the journal on each OSD might be safer, because
a journal on one disk does not affect the other disks (OSDs). Or do you
think that having the journals on the SSD is a better trade-off?

You will put significantly more stress on your HDDs by leaving the journals
on them, and good SSDs are far more robust than HDDs, so if you pick Intel DC
or equivalent SSDs for journals your infrastructure might even be more
robust than one using internal journals (HDDs are dropping like flies
when you have hundreds of them). There are other components able to take
down all your OSDs: the disk controller, the CPU, the memory, the power
supply, ... So adding one robust SSD shouldn't change the overall
availability much (you must check their wear level and choose the models
according to the amount of writes you want them to support over their
lifetime though).

The main reason for journals on SSD is performance anyway. If your setup
is already fast enough without them, I wouldn't try to add SSDs.
Otherwise, if you can't reach the level of performance needed by adding
the OSDs already needed for your storage capacity objectives, go SSD.

Best regards,

Lionel


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Issue with journal on another drive

2015-09-29 Thread Jiri Kanicky

Hi Lionel.

Thank you for your reply. In this case I am considering creating
separate partitions on the SSD drive, one for each disk. It would be good to
know what the performance difference is, because creating partitions is
kind of a waste of space.


One more question: is it a good idea to move the journals for 3 OSDs to a
single SSD, considering that if the SSD fails the whole node with 3 HDDs will
be down? Thinking of it, leaving the journal on each OSD might be safer,
because a journal on one disk does not affect the other disks (OSDs). Or do
you think that having the journals on the SSD is a better trade-off?


Thank you
Jiri

On 29/09/2015 21:10, Lionel Bouton wrote:

Le 29/09/2015 07:29, Jiri Kanicky a écrit :

Hi,

Is it possible to create journal in directory as explained here:
http://wiki.skytech.dk/index.php/Ceph_-_howto,_rbd,_lvm,_cluster#Add.2Fmove_journal_in_running_cluster

Yes, the general idea (stop, flush, move, update ceph.conf, mkjournal,
start) is valid for moving your journal wherever you want.
That said, it probably won't perform as well on a filesystem (LVM has
lower overhead than a filesystem).
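
For reference, the sequence described above maps to roughly these
commands (a sketch only; the OSD id, init script and new journal
location are assumptions):

$ sudo service ceph stop osd.0
$ sudo ceph-osd -i 0 --flush-journal
# point "osd journal" for osd.0 at the new location in ceph.conf
# (or replace the journal symlink in the OSD data directory)
$ sudo ceph-osd -i 0 --mkjournal
$ sudo service ceph start osd.0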


1. Create BTRFS over /dev/sda6 (assuming this is the SSD partition allocated
for the journal) and mount it to /srv/ceph/journal

BTRFS is probably the worst idea for hosting journals. If you must use
BTRFS, you'll have to make sure that the journals are created NoCoW
before the first byte is ever written to them.
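
One way to do that, as a sketch (paths are assumptions), is to set the
NoCoW attribute on the journal file while it is still empty:

$ touch /srv/ceph/journal/osd0/journal
$ chattr +C /srv/ceph/journal/osd0/journal    # +C only takes effect on an empty file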


2. Add OSD: ceph-deploy osd create --fs-type btrfs
ceph1:sdb:/srv/ceph/journal/osd$id/journal

I've no experience with ceph-deploy...

Best regards,

Lionel



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph OSD nodes in XenServer VMs

2015-09-07 Thread Jiri Kanicky

Hi.

As we would like to use the Ceph storage with CloudStack/XS, we have to
use NFS or iSCSI client nodes to provide shared storage. To avoid having
several nodes of physical hardware, we thought that we could run the
NFS/iSCSI client node on the same box as a Ceph OSD node. Possibly we
could even run separate VMs for MONs on the same hypervisor. This would
give us flexibility, and we could migrate the NFS/iSCSI or MON VMs to any
host we want at any time.
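
To make the idea concrete, the NFS head would do something roughly like
this (a sketch only; image name, filesystem and export path are
assumptions):

$ rbd create xen-store --pool rbd --size 2097152    # 2 TB image
$ rbd map rbd/xen-store                             # e.g. /dev/rbd0
$ mkfs.xfs /dev/rbd0
$ mount /dev/rbd0 /export/xen
$ echo '/export/xen *(rw,sync,no_root_squash)' >> /etc/exports
$ exportfs -ra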


Also, we could take snapshots of the Ceph OSD VMs during upgrades. If
something does not go well, we can roll back quickly.


Potentially, we could turn every XS host's local storage into Ceph OSDs
(utilizing the unused local HDDs).


I think the I/O performance penalty of accessing a raw local disk from a VM
is negligible in comparison with a physical box.



Anyway, these are just thoughts. Might not be the best idea, but that is 
the reason why I am asking.


Thx Jiri



On 21/08/2015 10:12, Steven McDonald wrote:

Hi Jiri,

On Thu, 20 Aug 2015 11:55:55 +1000
Jiri Kanicky <j...@ganomi.com> wrote:


We are experimenting with an idea to run OSD nodes in XenServer VMs.
We believe this could provide better flexibility, backups for the
nodes etc.

Could you expand on this? As written, it seems like a bad idea to me,
just because you'd be adding complexity for no gain. Can you explain,
for instance, why you think it would enable better flexibility, or why
it would help with backups?

What is it that you intend to back up? Backing up the OS on a storage
node should never be necessary, since it should be recreatable from
config management, and backing up data on the OSDs is best done on a
per-pool basis because the requirements are going to differ by pool and
not by OSD.


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph OSD nodes in XenServer VMs

2015-08-19 Thread Jiri Kanicky

Hi all,

We are experimenting with an idea to run OSD nodes in XenServer VMs. We 
believe this could provide better flexibility, backups for the nodes etc.


For example:
Xenserver with 4 HDDs dedicated for Ceph.
We would introduce 1 VM (OSD node) with raw/direct access to 4 HDDs or 2 
VMs (2 OSD nodes) with 2 HDDs each.


Do you have any experience with this? Any thoughts on this? Good or bad 
idea?


Thank you
Jiri
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mount error: ceph filesystem not supported by the system

2015-08-06 Thread Jiri Kanicky

Hi,

I can answer this myself: it was the kernel. After upgrading to the latest
Debian Jessie kernel (3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt11-1+deb8u2
(2015-07-17) x86_64 GNU/Linux), everything started to work as normal.
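
For anyone hitting the same modprobe error, a quick sanity check after a
kernel upgrade (a sketch):

$ sudo modprobe ceph
$ lsmod | grep ceph
$ uname -r    # confirm the running kernel matches the newly installed one (reboot if not)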


Thanks :)

On 6/08/2015 22:38, Jiri Kanicky wrote:

Hi,

I am trying to mount my CephFS and getting the following message. It 
was all working previously, but after power failure I am not able to 
mount it anymore (Debian Jessie).


cephadmin@maverick:/etc/ceph$ sudo mount -t ceph 
ceph1.allsupp.corp,ceph2.allsupp.corp:6789:/ /mnt/cephdata/ -o 
name=admin,secretfile=/etc/ceph/admin.secret
modprobe: ERROR: could not insert 'ceph': Unknown symbol in module, or 
unknown parameter (see dmesg)

failed to load ceph kernel module (1)
mount error: ceph filesystem not supported by the system


Ceph seems to be healthy (ignore the PGs).

cephadmin@ceph1:~$ ceph status
cluster bce2ff4d-e03b-4b75-9b17-8a48ee4d7788
 health HEALTH_WARN
too many PGs per OSD (384 > max 300)
 monmap e2: 3 mons at 
{ceph1=192.168.30.21:6789/0,ceph2=192.168.30.22:6789/0,ceph3=192.168.30.23:6789/0}

election epoch 100, quorum 0,1,2 ceph1,ceph2,ceph3
 mdsmap e98: 1/1/1 up {0=ceph1=up:active}, 1 up:standby
 osdmap e773: 4 osds: 4 up, 4 in
  pgmap v457296: 768 pgs, 3 pools, 2020 GB data, 574 kobjects
4804 GB used, 6350 GB / 11158 GB avail
 768 active+clean


Any ideas?

Thank you
Jiri


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] mount error: ceph filesystem not supported by the system

2015-08-06 Thread Jiri Kanicky

Hi,

I am trying to mount my CephFS and getting the following message. It was 
all working previously, but after power failure I am not able to mount 
it anymore (Debian Jessie).


cephadmin@maverick:/etc/ceph$ sudo mount -t ceph 
ceph1.allsupp.corp,ceph2.allsupp.corp:6789:/ /mnt/cephdata/ -o 
name=admin,secretfile=/etc/ceph/admin.secret
modprobe: ERROR: could not insert 'ceph': Unknown symbol in module, or 
unknown parameter (see dmesg)

failed to load ceph kernel module (1)
mount error: ceph filesystem not supported by the system


Ceph seems to be healthy (ignore the PGs).

cephadmin@ceph1:~$ ceph status
cluster bce2ff4d-e03b-4b75-9b17-8a48ee4d7788
 health HEALTH_WARN
too many PGs per OSD (384 > max 300)
 monmap e2: 3 mons at 
{ceph1=192.168.30.21:6789/0,ceph2=192.168.30.22:6789/0,ceph3=192.168.30.23:6789/0}

election epoch 100, quorum 0,1,2 ceph1,ceph2,ceph3
 mdsmap e98: 1/1/1 up {0=ceph1=up:active}, 1 up:standby
 osdmap e773: 4 osds: 4 up, 4 in
  pgmap v457296: 768 pgs, 3 pools, 2020 GB data, 574 kobjects
4804 GB used, 6350 GB / 11158 GB avail
 768 active+clean


Any ideas?

Thank you
Jiri
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph config files

2015-05-23 Thread Jiri Kanicky

Hi,

I just added a new monitor (MON). "ceph status" shows the monitor in the
quorum, but the new monitor is not shown in /etc/ceph/ceph.conf. I am
wondering what role /etc/ceph/ceph.conf plays. Do I need to manually
edit the file on each node and add the monitors?
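
For what it's worth, the entries that usually get updated by hand after
adding a monitor look roughly like this; a sketch only, the host names
and addresses are illustrative (borrowed from other output in this
digest):

[global]
mon initial members = ceph1, ceph2, ceph3
mon host = 192.168.30.21, 192.168.30.22, 192.168.30.23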


In addition, there are almost identical config files in my ceph-deploy
working directory, but they can differ from what is in
/etc/ceph/ceph.conf. What role do these files play?


Could someone please explain the function of the Ceph config files?

Thank you
Jiri
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PGs degraded with 3 MONs and 1 OSD node

2015-01-21 Thread Jiri Kanicky

Hi,

Thanks for the reply. That clarifies it. I thought that redundancy
could be achieved with multiple OSDs (like multiple disks in RAID) when
you don't have more nodes. Obviously the single point of failure would
be the box.


My current setting is:
osd_pool_default_size = 2

Thank you
Jiri


On 20/01/2015 13:13, Lindsay Mathieson wrote:
You only have one osd node (ceph4). The default replication
requirement for your pools (size = 3) requires osd's spread over
three nodes, so the data can be replicated on three different nodes.
That will be why your pgs are degraded.


You need to either add more osd nodes or reduce your size setting down
to the number of osd nodes you have.


Setting your size to 1 would be a bad idea; there would be no
redundancy in your data at all. Losing one disk would destroy all
your data.


The command to see your pool size is:

sudo ceph osd pool get <poolname> size

assuming the default setup:

ceph osd pool get rbd size
returns: 3
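
And, for reference, a sketch of lowering the requirement to match a
single OSD node (pool name assumed):

ceph osd pool set rbd size 2
ceph osd pool set rbd min_size 1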

On 20 January 2015 at 10:51, Jiri Kanicky <j...@ganomi.com> wrote:


Hi,

I just would like to clarify if I should expect degraded PGs with
11 OSD in one node. I am not sure if a setup with 3 MON and 1 OSD
(11 disks) nodes allows me to have healthy cluster.

$ sudo ceph osd pool create test 512
pool 'test' created

$ sudo ceph status
cluster 4e77327a-118d-450d-ab69-455df6458cd4
 health HEALTH_WARN 512 pgs degraded; 512 pgs stuck unclean;
512 pgs undersized
 monmap e1: 3 mons at
{ceph1=172.16.41.31:6789/0,ceph2=172.16.41.32:6789/0,ceph3=172.16.41.33:6789/0},
election epoch 36, quorum 0,1,2 ceph1,ceph2,ceph3
 osdmap e190: 11 osds: 11 up, 11 in
  pgmap v342: 512 pgs, 1 pools, 0 bytes data, 0 objects
53724 kB used, 9709 GB / 9720 GB avail
 512 active+undersized+degraded

$ sudo ceph osd tree
# id    weight  type name       up/down reweight
-1      9.45    root default
-2      9.45            host ceph4
0       0.45                    osd.0   up      1
1       0.9                     osd.1   up      1
2       0.9                     osd.2   up      1
3       0.9                     osd.3   up      1
4       0.9                     osd.4   up      1
5       0.9                     osd.5   up      1
6       0.9                     osd.6   up      1
7       0.9                     osd.7   up      1
8       0.9                     osd.8   up      1
9       0.9                     osd.9   up      1
10      0.9                     osd.10  up      1


Thank you,
Jiri
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




--
Lindsay


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] PGs degraded with 3 MONs and 1 OSD node

2015-01-19 Thread Jiri Kanicky

Hi,

I would just like to clarify whether I should expect degraded PGs with 11
OSDs in one node. I am not sure if a setup with 3 MON nodes and 1 OSD node
(11 disks) allows me to have a healthy cluster.


$ sudo ceph osd pool create test 512
pool 'test' created

$ sudo ceph status
cluster 4e77327a-118d-450d-ab69-455df6458cd4
 health HEALTH_WARN 512 pgs degraded; 512 pgs stuck unclean; 512 
pgs undersized
 monmap e1: 3 mons at 
{ceph1=172.16.41.31:6789/0,ceph2=172.16.41.32:6789/0,ceph3=172.16.41.33:6789/0}, 
election epoch 36, quorum 0,1,2 ceph1,ceph2,ceph3

 osdmap e190: 11 osds: 11 up, 11 in
  pgmap v342: 512 pgs, 1 pools, 0 bytes data, 0 objects
53724 kB used, 9709 GB / 9720 GB avail
 512 active+undersized+degraded

$ sudo ceph osd tree
# id    weight  type name       up/down reweight
-1      9.45    root default
-2      9.45            host ceph4
0       0.45                    osd.0   up      1
1       0.9                     osd.1   up      1
2       0.9                     osd.2   up      1
3       0.9                     osd.3   up      1
4       0.9                     osd.4   up      1
5       0.9                     osd.5   up      1
6       0.9                     osd.6   up      1
7       0.9                     osd.7   up      1
8       0.9                     osd.8   up      1
9       0.9                     osd.9   up      1
10      0.9                     osd.10  up      1


Thank you,
Jiri
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PGs degraded with 3 MONs and 1 OSD node

2015-01-19 Thread Jiri Kanicky

Hi.

I was just curious. This is just a lab environment and we are short on
hardware :). We will have more hardware later, but right now this is all
I have. The monitors are VMs.


Anyway, we will have to survive with this somehow :).

Thanks
Jiri

On 20/01/2015 15:33, Lindsay Mathieson wrote:



On 20 January 2015 at 14:10, Jiri Kanicky <j...@ganomi.com> wrote:


Hi,

BTW, is there a way to achieve redundancy over multiple OSDs
in one box by changing the CRUSH map?



I asked that same question myself a few weeks back :)

The answer was yes - but it's fiddly, and why would you do that?

It's kinda breaking the purpose of Ceph, which is large amounts of data
stored redundantly over multiple nodes.
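
For reference, the fiddly bit is making the CRUSH failure domain an osd
instead of a host. A sketch only (the rule name and ruleset id are
assumptions):

ceph osd crush rule create-simple replicate-by-osd default osd
ceph osd pool set rbd crush_ruleset 1    # use the id reported for the new rule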


Perhaps you should re-examine your requirements. If what you want is 
data redundantly stored on hard disks on one node, perhaps you would 
be better served by creating a ZFS raid setup. With just one node it 
would be easier and more flexible - better performance as well.


Alternatively, could you put some OSDs on your monitor nodes? What
spec are they?




Thank you
Jiri


On 20/01/2015 13:37, Jiri Kanicky wrote:

Hi,

Thanks for the reply. That clarifies it. I thought that the
redundancy can be achieved with multiple OSDs (like multiple
disks in RAID) in case you don't have more nodes. Obviously the
single point of failure would be the box.

My current setting is:
osd_pool_default_size = 2

Thank you
Jiri


On 20/01/2015 13:13, Lindsay Mathieson wrote:

You only have one osd node (ceph4). The default replication
requirements  for your pools (size = 3) require osd's spread
over three nodes, so the data can be replicate on three
different nodes. That will be why your pgs are degraded.

You need to either add mode osd nodes or reduce your size
setting down to the number of osd nodes you have.

Setting your size to 1 would be a bad idea, there would be no
redundancy in your data at all. Loosing one disk would destroy
all your data.

The command to see you pool size is:

sudo ceph osd pool get poolname size

assuming default setup:

ceph osd pool  get rbd size
returns: 3

On 20 January 2015 at 10:51, Jiri Kanicky j...@ganomi.com
mailto:j...@ganomi.com wrote:

Hi,

I just would like to clarify if I should expect degraded PGs
with 11 OSD in one node. I am not sure if a setup with 3 MON
and 1 OSD (11 disks) nodes allows me to have healthy cluster.

$ sudo ceph osd pool create test 512
pool 'test' created

$ sudo ceph status
cluster 4e77327a-118d-450d-ab69-455df6458cd4
 health HEALTH_WARN 512 pgs degraded; 512 pgs stuck
unclean; 512 pgs undersized
 monmap e1: 3 mons at

{ceph1=172.16.41.31:6789/0,ceph2=172.16.41.32:6789/0,ceph3=172.16.41.33:6789/0

http://172.16.41.31:6789/0,ceph2=172.16.41.32:6789/0,ceph3=172.16.41.33:6789/0},
election epoch 36, quorum 0,1,2 ceph1,ceph2,ceph3
 osdmap e190: 11 osds: 11 up, 11 in
  pgmap v342: 512 pgs, 1 pools, 0 bytes data, 0 objects
53724 kB used, 9709 GB / 9720 GB avail
 512 active+undersized+degraded

$ sudo ceph osd tree
# idweight  type name   up/down reweight
-1  9.45root default
-2  9.45host ceph4
0   0.45osd.0  up  1
1   0.9 osd.1  up  1
2   0.9 osd.2  up  1
3   0.9 osd.3  up  1
4   0.9 osd.4  up  1
5   0.9 osd.5  up  1
6   0.9 osd.6  up  1
7   0.9 osd.7  up  1
8   0.9 osd.8  up  1
9   0.9 osd.9  up  1
10  0.9 osd.10 up  1


Thank you,
Jiri
___
ceph-users mailing list
ceph-users@lists.ceph.com mailto:ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




-- 
Lindsay







--
Lindsay


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CEPH Expansion

2015-01-18 Thread Jiri Kanicky

Hi George,

List disks available:
# $ ceph-deploy disk list {node-name [node-name]...}

Add OSD using osd create:
# $ ceph-deploy osd create {node-name}:{disk}[:{path/to/journal}]

Or you can use the manual steps to prepare and activate disk described 
at 
http://ceph.com/docs/master/start/quick-ceph-deploy/#expanding-your-cluster
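
For example (the node and device names here are assumptions, not
George's actual layout):

# $ ceph-deploy disk list cephnode2
# $ ceph-deploy osd create cephnode2:sdb:/dev/sdk1
# $ ceph-deploy osd create cephnode2:sdc:/dev/sdk2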


Jiri

On 15/01/2015 06:36, Georgios Dimitrakakis wrote:

Hi all!

I would like to expand our CEPH Cluster and add a second OSD node.

In this node I will have ten 4TB disks dedicated to CEPH.

What is the proper way of putting them in the already available CEPH 
node?


I guess that the first thing to do is to prepare them with ceph-deploy 
and mark them as out at preparation.


I should then restart the services and add (mark as in) one of them. 
Afterwards, I have to wait for the rebalance
to occur and upon finishing I will add the second and so on. Is this 
safe enough?



How long do you expect the rebalancing procedure to take?


I already have ten more 4TB disks at another node and the amount of 
data is around 40GB with 2x replication factor.

The connection is over Gigabit.


Best,


George


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Giant on Centos 7 with custom cluster name

2015-01-18 Thread Jiri Kanicky

Hi,

I have upgraded Firefly to Giant on Debian Wheezy and it went without 
any problems.


Jiri


On 16/01/2015 06:49, Erik McCormick wrote:

Hello all,

I've got an existing Firefly cluster on Centos 7 which I deployed with 
ceph-deploy. In the latest version of ceph-deploy, it refuses to 
handle commands issued with a cluster name.


[ceph_deploy.install][ERROR ] custom cluster names are not supported 
on sysvinit hosts


This is a production cluster. Small, but still production. Is it safe 
to go through manually upgrading the packages? I'd hate to do the 
upgrade and find out I can no longer start the cluster because it 
can't be called anything other than ceph.


Thanks,
Erik




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Is ceph production ready? [was: Ceph PG Incomplete = Cluster unusable]

2015-01-09 Thread Jiri Kanicky

Hi Nico,

I would probably recommend upgrading to 0.87 (Giant). I have been running
this version for some time now and it works very well. I also upgraded from
Firefly and the upgrade was easy.


The issue you are experiencing seems quite complex and it would require 
debug logs to troubleshoot.


Apologies that I could not help much.

-Jiri

On 9/01/2015 20:23, Nico Schottelius wrote:

Good morning Jiri,

sure, let me catch up on this:

- Kernel 3.16
- ceph: 0.80.7
- fs: xfs
- os: debian (backports) (1x)/ubuntu (2x)

Cheers,

Nico

Jiri Kanicky [Fri, Jan 09, 2015 at 10:44:33AM +1100]:

Hi Nico.

If you are experiencing such issues, it would be good if you provided more
info about your deployment: Ceph version, kernel version, OS, and filesystem
(btrfs/xfs).

Thx Jiri

- Reply message -
From: Nico Schottelius nico-eph-us...@schottelius.org
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Is ceph production ready? [was: Ceph PG Incomplete = 
Cluster unusable]
Date: Wed, Dec 31, 2014 02:36

Good evening,

we also tried to rescue data *from* our old / broken pool by map'ing the
rbd devices, mounting them on a host and rsync'ing away as much as
possible.

However, after some time rsync got completely stuck and eventually the
host which mounted the rbd-mapped devices decided to kernel panic, at
which time we decided to drop the pool and go with a backup.

This story and the one of Christian makes me wonder:

Is anyone using ceph as a backend for qemu VM images in production?

And:

Has anyone on the list been able to recover from a pg incomplete /
stuck situation like ours?

Reading about the issues on the list here gives me the impression that
ceph as a software is "stuck/incomplete" and has not yet become ready /
"clean" for production (sorry for the word joke).

Cheers,

Nico

Christian Eichelmann [Tue, Dec 30, 2014 at 12:17:23PM +0100]:

Hi Nico and all others who answered,

After some more trying to somehow get the pgs in a working state (I've
tried force_create_pg, which was putting them in creating state. But
that was obviously not true, since after rebooting one of the containing
osd's it went back to incomplete), I decided to save what can be saved.

I've created a new pool, created a new image there, mapped the old image
from the old pool and the new image from the new pool to a machine, to
copy data on posix level.

Unfortunately, formatting the image from the new pool hangs after some
time. So it seems that the new pool is suffering from the same problem
as the old pool. Which is totally not understandable to me.

Right now, it seems like Ceph is giving me no options to either save
some of the still intact rbd volumes, or to create a new pool along the
old one to at least enable our clients to send data to ceph again.

To tell the truth, I guess that will result in the end of our ceph
project (which has already been running for 9 months).

Regards,
Christian

Am 29.12.2014 15:59, schrieb Nico Schottelius:

Hey Christian,

Christian Eichelmann [Mon, Dec 29, 2014 at 10:56:59AM +0100]:

[incomplete PG / RBD hanging, osd lost also not helping]

that is very interesting to hear, because we had a similar situation
with ceph 0.80.7 and had to re-create a pool, after I deleted 3 pg
directories to allow OSDs to start after the disk filled up completely.

So I am sorry not to being able to give you a good hint, but I am very
interested in seeing your problem solved, as it is a show stopper for
us, too. (*)

Cheers,

Nico

(*) We migrated from sheepdog to gluster to ceph and so far sheepdog
 seems to run much smoother. The first one is however not supported
 by opennebula directly, the second one not flexible enough to host
 our heterogeneous infrastructure (mixed disk sizes/amounts) - so we
 are using ceph at the moment.



--
Christian Eichelmann
Systemadministrator

1&1 Internet AG - IT Operations Mail & Media Advertising & Targeting
Brauerstraße 48 · DE-76135 Karlsruhe
Telefon: +49 721 91374-8026
christian.eichelm...@1und1.de

Amtsgericht Montabaur / HRB 6484
Vorstände: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Robert
Hoffmann, Markus Huhn, Hans-Henning Kettler, Dr. Oliver Mauss, Jan Oetjen
Aufsichtsratsvorsitzender: Michael Scheeren

--
New PGP key: 659B 0D91 E86E 7E24 FD15  69D0 C729 21A1 293F 2D24
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Is ceph production ready? [was: Ceph PG Incomplete = Cluster unusable]

2015-01-08 Thread Jiri Kanicky
Hi Nico.

If you are experiencing such issues, it would be good if you provided more
info about your deployment: Ceph version, kernel version, OS, and filesystem
(btrfs/xfs).

Thx Jiri

- Reply message -
From: Nico Schottelius nico-eph-us...@schottelius.org
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Is ceph production ready? [was: Ceph PG Incomplete = 
Cluster unusable]
Date: Wed, Dec 31, 2014 02:36

Good evening,

we also tried to rescue data *from* our old / broken pool by map'ing the
rbd devices, mounting them on a host and rsync'ing away as much as
possible.

However, after some time rsync got completly stuck and eventually the
host which mounted the rbd mapped devices decided to kernel panic at
which time we decided to drop the pool and go with a backup.

This story and the one of Christian makes me wonder:

Is anyone using ceph as a backend for qemu VM images in production?

And:

Has anyone on the list been able to recover from a pg incomplete /
stuck situation like ours?

Reading about the issues on the list here gives me the impression that
ceph as a software is stuck/incomplete and has not yet become ready
clean for production (sorry for the word joke).

Cheers,

Nico

Christian Eichelmann [Tue, Dec 30, 2014 at 12:17:23PM +0100]:
 Hi Nico and all others who answered,
 
 After some more trying to somehow get the pgs in a working state (I've
 tried force_create_pg, which was putting then in creating state. But
 that was obviously not true, since after rebooting one of the containing
 osd's it went back to incomplete), I decided to save what can be saved.
 
 I've created a new pool, created a new image there, mapped the old image
 from the old pool and the new image from the new pool to a machine, to
 copy data on posix level.
 
 Unfortunately, formatting the image from the new pool hangs after some
 time. So it seems that the new pool is suffering from the same problem
 as the old pool. Which is totaly not understandable for me.
 
 Right now, it seems like Ceph is giving me no options to either save
 some of the still intact rbd volumes, or to create a new pool along the
 old one to at least enable our clients to send data to ceph again.
 
 To tell the truth, I guess that will result in the end of our ceph
 project (running for already 9 Monthes).
 
 Regards,
 Christian
 
 Am 29.12.2014 15:59, schrieb Nico Schottelius:
  Hey Christian,
  
  Christian Eichelmann [Mon, Dec 29, 2014 at 10:56:59AM +0100]:
  [incomplete PG / RBD hanging, osd lost also not helping]
  
  that is very interesting to hear, because we had a similar situation
  with ceph 0.80.7 and had to re-create a pool, after I deleted 3 pg
  directories to allow OSDs to start after the disk filled up completly.
  
  So I am sorry not to being able to give you a good hint, but I am very
  interested in seeing your problem solved, as it is a show stopper for
  us, too. (*)
  
  Cheers,
  
  Nico
  
  (*) We migrated from sheepdog to gluster to ceph and so far sheepdog
  seems to run much smoother. The first one is however not supported
  by opennebula directly, the second one not flexible enough to host
  our heterogeneous infrastructure (mixed disk sizes/amounts) - so we 
  are using ceph at the moment.
  
 
 
 -- 
 Christian Eichelmann
 Systemadministrator
 
 11 Internet AG - IT Operations Mail  Media Advertising  Targeting
 Brauerstraße 48 · DE-76135 Karlsruhe
 Telefon: +49 721 91374-8026
 christian.eichelm...@1und1.de
 
 Amtsgericht Montabaur / HRB 6484
 Vorstände: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Robert
 Hoffmann, Markus Huhn, Hans-Henning Kettler, Dr. Oliver Mauss, Jan Oetjen
 Aufsichtsratsvorsitzender: Michael Scheeren

-- 
New PGP key: 659B 0D91 E86E 7E24 FD15  69D0 C729 21A1 293F 2D24
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] pg repair unsuccessful

2015-01-07 Thread Jiri Kanicky

Hi,

I have been experiencing issues with several PGs which remain in an
inconsistent state (I use BTRFS). "ceph pg repair" is not able to repair
them. The only fix I have found is to delete the corresponding file, which
is causing the issue (see logs below), from the OSDs. This however means
loss of data.


Is there any other way how to fix it?

$ ceph health detail
HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
pg 2.17 is active+clean+inconsistent, acting [1,3]
1 scrub errors

Log output:
2015-01-07 21:43:13.396376 7f0c5ac53700 -1 log_channel(default) log 
[ERR] : repair 2.17 f2a47417/100f485./head//2 on disk size 
(4194304) does not match object info size (0) adjusted for ondisk to (0)
2015-01-07 21:43:56.771820 7f0c5ac53700 -1 log_channel(default) log 
[ERR] : 2.17 repair 1 errors, 0 fixed
2015-01-07 21:44:10.473870 7f0c5ac53700 -1 log_channel(default) log 
[ERR] : deep-scrub 2.17 f2a47417/100f485./head//2 on disk 
size (4194304) does not match object info size (0) adjusted for ondisk 
to (0)
2015-01-07 21:44:42.919425 7f0c5ac53700 -1 log_channel(default) log 
[ERR] : 2.17 deep-scrub 1 errors
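
For reference, the usual manual workflow in this situation is to locate
the object's file on the acting OSDs, remove (or move away) the bad
copy, and re-run the repair. A sketch only; the mount points and the
object-name pattern below are assumptions:

$ find /var/lib/ceph/osd/ceph-1/current/2.17_head -name '*100f485*' -ls
$ find /var/lib/ceph/osd/ceph-3/current/2.17_head -name '*100f485*' -ls
# after removing the bad replica on one OSD:
$ ceph pg repair 2.17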



Thx Jiri
Ceph 0.87, Debian Wheezy, BTRFS

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs usable or not?

2015-01-07 Thread Jiri Kanicky

Hi Max,

Thanks for this info.

I am planning to use CephFS (Ceph 0.87) at home, because it's more
convenient than NFS over RBD. I don't have a large environment; about 20TB,
so hopefully it will hold up.


I back up all important data just in case. :)

Thank you.
Jiri

On 29/12/2014 21:09, Thomas Lemarchand wrote:

Hi Max,

I do use CephFS (Giant) in a production environment. It works really
well, but I have backups ready to use, just in case.

As Wido said, kernel version is not really relevant if you use ceph-fuse
(which I recommend over cephfs kernel, for stability and ease of upgrade
reasons).
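
For reference, a minimal ceph-fuse mount looks like this (monitor
address and mount point are assumptions):

$ sudo ceph-fuse -m ceph1:6789 /mnt/cephfs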

However, I found ceph-mds memory usage hard to predict, and I had some
problems with that. At first it was undersized (16GB, for ~8M files /
dirs, and 1M inodes cached), but it worked well until I had a server
crash that did not recover (mds rejoin / rebuild) because of a lack of
memory. So I gave it 24GB memory + 24GB swap, no problem anymore.



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OSDs with btrfs are down

2015-01-04 Thread Jiri Kanicky

Hi,

My OSDs with btrfs are down on one node. I found the cluster in this state:

cephadmin@ceph1:~$ ceph osd tree
# id    weight  type name       up/down reweight
-1      10.88   root default
-2      5.44            host ceph1
0       2.72                    osd.0   down    0
1       2.72                    osd.1   down    0
-3      5.44            host ceph2
2       2.72                    osd.2   up      1
3       2.72                    osd.3   up      1


cephadmin@ceph1:~$ ceph status
cluster bce2ff4d-e03b-4b75-9b17-8a48ee4d7788
 health HEALTH_ERR 645 pgs degraded; 29 pgs inconsistent; 14 pgs 
recovering; 645 pgs stuck degraded; 768 pgs stuck unclean; 631 pgs stuck 
undersized; 631 pgs undersized; recovery 397226/915548 objects degraded 
(43.387%); 72026/915548 objects misplaced (7.867%); 783 scrub errors
 monmap e1: 2 mons at 
{ceph1=192.168.30.21:6789/0,ceph2=192.168.30.22:6789/0}, election epoch 
30, quorum 0,1 ceph1,ceph2

 mdsmap e30: 1/1/1 up {0=ceph1=up:active}, 1 up:standby
 osdmap e242: 4 osds: 2 up, 2 in
  pgmap v38318: 768 pgs, 3 pools, 1572 GB data, 447 kobjects
1811 GB used, 3764 GB / 5579 GB avail
397226/915548 objects degraded (43.387%); 72026/915548 
objects misplaced (7.867%)

  14 active+recovering+degraded+remapped
 122 active+remapped
   1 active+remapped+inconsistent
 603 active+undersized+degraded
  28 active+undersized+degraded+inconsistent


Would you know if this is a pure BTRFS issue, or is there some setting I
forgot to use?


Jan  4 17:11:06 ceph1 kernel: [756636.535661] kworker/0:2: page 
allocation failure: order:1, mode:0x204020
Jan  4 17:11:06 ceph1 kernel: [756636.535669] CPU: 0 PID: 62644 Comm: 
kworker/0:2 Not tainted 3.16.0-0.bpo.4-amd64 #1 Debian 3.16.7-ckt2-1~bpo70+1
Jan  4 17:11:06 ceph1 kernel: [756636.535671] Hardware name: HP ProLiant 
MicroServer Gen8, BIOS J06 11/09/2013
Jan  4 17:11:06 ceph1 kernel: [756636.535701] Workqueue: events 
do_async_commit [btrfs]
Jan  4 17:11:06 ceph1 kernel: [756636.535704]   
0001 81541f8f 00204020
Jan  4 17:11:06 ceph1 kernel: [756636.535707]  811519ed 
0001 880075de0c00 0002
Jan  4 17:11:06 ceph1 kernel: [756636.535710]   
0001 880075de0c08 0096

Jan  4 17:11:06 ceph1 kernel: [756636.535713] Call Trace:
Jan  4 17:11:06 ceph1 kernel: [756636.535720] [81541f8f] ? 
dump_stack+0x41/0x51
Jan  4 17:11:06 ceph1 kernel: [756636.535725] [811519ed] ? 
warn_alloc_failed+0xfd/0x160
Jan  4 17:11:06 ceph1 kernel: [756636.535730] [8115606f] ? 
__alloc_pages_nodemask+0x91f/0xbb0
Jan  4 17:11:06 ceph1 kernel: [756636.535734] [8119fe00] ? 
kmem_getpages+0x60/0x110
Jan  4 17:11:06 ceph1 kernel: [756636.535737] [811a1648] ? 
fallback_alloc+0x158/0x220
Jan  4 17:11:06 ceph1 kernel: [756636.535741] [811a1f44] ? 
kmem_cache_alloc+0x1a4/0x1e0
Jan  4 17:11:06 ceph1 kernel: [756636.535745] [812c8fa0] ? 
ida_pre_get+0x60/0xd0
Jan  4 17:11:06 ceph1 kernel: [756636.535749] [811bdf61] ? 
get_anon_bdev+0x21/0xe0
Jan  4 17:11:06 ceph1 kernel: [756636.535762] [a015caef] ? 
btrfs_init_fs_root+0xff/0x1b0 [btrfs]
Jan  4 17:11:06 ceph1 kernel: [756636.535774] [a015de23] ? 
btrfs_read_fs_root+0x33/0x40 [btrfs]
Jan  4 17:11:06 ceph1 kernel: [756636.535785] [a015df06] ? 
btrfs_get_fs_root+0xd6/0x230 [btrfs]
Jan  4 17:11:06 ceph1 kernel: [756636.535796] [a0162383] ? 
create_pending_snapshot+0x793/0xa00 [btrfs]
Jan  4 17:11:06 ceph1 kernel: [756636.535807] [a0162679] ? 
create_pending_snapshots+0x89/0xa0 [btrfs]
Jan  4 17:11:06 ceph1 kernel: [756636.535818] [a016394a] ? 
btrfs_commit_transaction+0x35a/0xa10 [btrfs]
Jan  4 17:11:06 ceph1 kernel: [756636.535824] [8107a88e] ? 
mod_timer+0x10e/0x220
Jan  4 17:11:06 ceph1 kernel: [756636.535834] [a016402a] ? 
do_async_commit+0x2a/0x40 [btrfs]
Jan  4 17:11:06 ceph1 kernel: [756636.535839] [810878fc] ? 
process_one_work+0x15c/0x450
Jan  4 17:11:06 ceph1 kernel: [756636.535843] [81088b52] ? 
worker_thread+0x112/0x540
Jan  4 17:11:06 ceph1 kernel: [756636.535847] [81088a40] ? 
create_and_start_worker+0x60/0x60
Jan  4 17:11:06 ceph1 kernel: [756636.535851] [8108f511] ? 
kthread+0xc1/0xe0
Jan  4 17:11:06 ceph1 kernel: [756636.535854] [8108f450] ? 
flush_kthread_worker+0xb0/0xb0
Jan  4 17:11:06 ceph1 kernel: [756636.535858] [815483bc] ? 
ret_from_fork+0x7c/0xb0
Jan  4 17:11:06 ceph1 kernel: [756636.535861] [8108f450] ? 
flush_kthread_worker+0xb0/0xb0

Jan  4 17:11:06 ceph1 kernel: [756636.535863] Mem-Info:
Jan  4 17:11:06 ceph1 kernel: [756636.535865] Node 0 DMA per-cpu:
Jan  4 17:11:06 ceph1 kernel: [756636.535867] CPU0: hi:0, 
btch:   1 usd:   0
Jan  4 17:11:06 ceph1 kernel: [756636.535869] 

Re: [ceph-users] OSDs with btrfs are down

2015-01-04 Thread Jiri Kanicky

Hi,

Here is my memory output. I use HP Microservers with 2GB RAM. Swap is 
500MB on SSD disk.


cephadmin@ceph1:~$ free
             total       used       free     shared    buffers     cached
Mem:       1885720    1817860      67860          0         32     694552
-/+ buffers/cache:    1123276     762444
Swap:      3859452     633492    3225960

More googling took me to the following post:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-June/040279.html

Linux 3.14.1 is affected by serious Btrfs regression(s) that were fixed in
later releases.

Unfortunately even latest Linux can crash and corrupt Btrfs file system if
OSDs are using snapshots (which is the default). Due to kernel bugs related to
Btrfs snapshots I also lost some OSDs until I found that snapshotting can be
disabled with filestore btrfs snap = false.


I am wondering if this can be the problem.
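
If it is, the change is a one-liner in ceph.conf on the OSD hosts,
followed by an OSD restart. A sketch (section placement assumed):

[osd]
filestore btrfs snap = false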

-Jiri


On 5/01/2015 01:17, Dyweni - BTRFS wrote:

Hi,

BTRFS crashed because the system ran out of memory...

I see these entries in your logs:


Jan  4 17:11:06 ceph1 kernel: [756636.535661] kworker/0:2: page
allocation failure: order:1, mode:0x204020



Jan  4 17:11:06 ceph1 kernel: [756636.536112] BTRFS: error (device
sdb1) in create_pending_snapshot:1334: errno=-12 Out of memory



Jan  4 17:11:06 ceph1 kernel: [756636.536135] BTRFS: error (device
sdb1) in cleanup_transaction:1577: errno=-12 Out of memory



How much memory do you have in this node?  Were you using Ceph
(as a client) on this node?  Do you have swap configured on this
node?









On 2015-01-04 07:12, Jiri Kanicky wrote:

Hi,

My OSDs with btrfs are down on one node. I found the cluster in this 
state:


cephadmin@ceph1:~$ ceph osd tree
# idweight  type name   up/down reweight
-1  10.88   root default
-2  5.44host ceph1
0   2.72osd.0   down0
1   2.72osd.1   down0
-3  5.44host ceph2
2   2.72osd.2   up  1
3   2.72osd.3   up  1


cephadmin@ceph1:~$ ceph status
cluster bce2ff4d-e03b-4b75-9b17-8a48ee4d7788
 health HEALTH_ERR 645 pgs degraded; 29 pgs inconsistent; 14 pgs
recovering; 645 pgs stuck degraded; 768 pgs stuck unclean; 631 pgs
stuck undersized; 631 pgs undersized; recovery 397226/915548 objects
degraded (43.387%); 72026/915548 objects misplaced (7.867%); 783 scrub
errors
 monmap e1: 2 mons at
{ceph1=192.168.30.21:6789/0,ceph2=192.168.30.22:6789/0}, election
epoch 30, quorum 0,1 ceph1,ceph2
 mdsmap e30: 1/1/1 up {0=ceph1=up:active}, 1 up:standby
 osdmap e242: 4 osds: 2 up, 2 in
  pgmap v38318: 768 pgs, 3 pools, 1572 GB data, 447 kobjects
1811 GB used, 3764 GB / 5579 GB avail
397226/915548 objects degraded (43.387%); 72026/915548
objects misplaced (7.867%)
  14 active+recovering+degraded+remapped
 122 active+remapped
   1 active+remapped+inconsistent
 603 active+undersized+degraded
  28 active+undersized+degraded+inconsistent


Would you know if this is pure BTRFS issue or is there any setting I
forgot to use?

Jan  4 17:11:06 ceph1 kernel: [756636.535661] kworker/0:2: page
allocation failure: order:1, mode:0x204020
Jan  4 17:11:06 ceph1 kernel: [756636.535669] CPU: 0 PID: 62644 Comm:
kworker/0:2 Not tainted 3.16.0-0.bpo.4-amd64 #1 Debian
3.16.7-ckt2-1~bpo70+1
Jan  4 17:11:06 ceph1 kernel: [756636.535671] Hardware name: HP
ProLiant MicroServer Gen8, BIOS J06 11/09/2013
Jan  4 17:11:06 ceph1 kernel: [756636.535701] Workqueue: events
do_async_commit [btrfs]
Jan  4 17:11:06 ceph1 kernel: [756636.535704]  
0001 81541f8f 00204020
Jan  4 17:11:06 ceph1 kernel: [756636.535707]  811519ed
0001 880075de0c00 0002
Jan  4 17:11:06 ceph1 kernel: [756636.535710]  
0001 880075de0c08 0096
Jan  4 17:11:06 ceph1 kernel: [756636.535713] Call Trace:
Jan  4 17:11:06 ceph1 kernel: [756636.535720] [81541f8f] ?
dump_stack+0x41/0x51
Jan  4 17:11:06 ceph1 kernel: [756636.535725] [811519ed] ?
warn_alloc_failed+0xfd/0x160
Jan  4 17:11:06 ceph1 kernel: [756636.535730] [8115606f] ?
__alloc_pages_nodemask+0x91f/0xbb0
Jan  4 17:11:06 ceph1 kernel: [756636.535734] [8119fe00] ?
kmem_getpages+0x60/0x110
Jan  4 17:11:06 ceph1 kernel: [756636.535737] [811a1648] ?
fallback_alloc+0x158/0x220
Jan  4 17:11:06 ceph1 kernel: [756636.535741] [811a1f44] ?
kmem_cache_alloc+0x1a4/0x1e0
Jan  4 17:11:06 ceph1 kernel: [756636.535745] [812c8fa0] ?
ida_pre_get+0x60/0xd0
Jan  4 17:11:06 ceph1 kernel: [756636.535749] [811bdf61] ?
get_anon_bdev+0x21/0xe0
Jan  4 17:11:06 ceph1 kernel: [756636.535762] [a015caef] ?
btrfs_init_fs_root+0xff/0x1b0 [btrfs]
Jan  4 17:11:06 ceph1 kernel: [756636.535774] [a015de23] ?
btrfs_read_fs_root+0x33/0x40

Re: [ceph-users] OSDs with btrfs are down

2015-01-04 Thread Jiri Kanicky

Hi.

Correction: my swap is 3GB on an SSD disk. I don't use the nodes for client
stuff.


Thx Jiri

On 5/01/2015 01:21, Jiri Kanicky wrote:

Hi,

Here is my memory output. I use HP Microservers with 2GB RAM. Swap is 
500MB on SSD disk.


cephadmin@ceph1:~$ free
             total       used       free     shared    buffers     cached
Mem:       1885720    1817860      67860          0         32     694552
-/+ buffers/cache:    1123276     762444
Swap:      3859452     633492    3225960

More googling took me to the following post:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-June/040279.html

Linux 3.14.1 is affected by serious Btrfs regression(s) that were 
fixed in

later releases.

Unfortunately even latest Linux can crash and corrupt Btrfs file 
system if
OSDs are using snapshots (which is the default). Due to kernel bugs 
related to
Btrfs snapshots I also lost some OSDs until I found that snapshotting 
can be

disabled with filestore btrfs snap = false.


I am wondering if this can be the problem.

-Jiri


On 5/01/2015 01:17, Dyweni - BTRFS wrote:

Hi,

BTRFS crashed because the system ran out of memory...

I see these entries in your logs:


Jan  4 17:11:06 ceph1 kernel: [756636.535661] kworker/0:2: page
allocation failure: order:1, mode:0x204020



Jan  4 17:11:06 ceph1 kernel: [756636.536112] BTRFS: error (device
sdb1) in create_pending_snapshot:1334: errno=-12 Out of memory



Jan  4 17:11:06 ceph1 kernel: [756636.536135] BTRFS: error (device
sdb1) in cleanup_transaction:1577: errno=-12 Out of memory



How much memory do you have in this node?  Where you using Ceph
(as a client) on this node?  Do you have swap configured on this
node?









On 2015-01-04 07:12, Jiri Kanicky wrote:

Hi,

My OSDs with btrfs are down on one node. I found the cluster in this 
state:


cephadmin@ceph1:~$ ceph osd tree
# idweight  type name   up/down reweight
-1  10.88   root default
-2  5.44host ceph1
0   2.72osd.0   down0
1   2.72osd.1   down0
-3  5.44host ceph2
2   2.72osd.2   up  1
3   2.72osd.3   up  1


cephadmin@ceph1:~$ ceph status
cluster bce2ff4d-e03b-4b75-9b17-8a48ee4d7788
 health HEALTH_ERR 645 pgs degraded; 29 pgs inconsistent; 14 pgs
recovering; 645 pgs stuck degraded; 768 pgs stuck unclean; 631 pgs
stuck undersized; 631 pgs undersized; recovery 397226/915548 objects
degraded (43.387%); 72026/915548 objects misplaced (7.867%); 783 scrub
errors
 monmap e1: 2 mons at
{ceph1=192.168.30.21:6789/0,ceph2=192.168.30.22:6789/0}, election
epoch 30, quorum 0,1 ceph1,ceph2
 mdsmap e30: 1/1/1 up {0=ceph1=up:active}, 1 up:standby
 osdmap e242: 4 osds: 2 up, 2 in
  pgmap v38318: 768 pgs, 3 pools, 1572 GB data, 447 kobjects
1811 GB used, 3764 GB / 5579 GB avail
397226/915548 objects degraded (43.387%); 72026/915548
objects misplaced (7.867%)
  14 active+recovering+degraded+remapped
 122 active+remapped
   1 active+remapped+inconsistent
 603 active+undersized+degraded
  28 active+undersized+degraded+inconsistent


Would you know if this is pure BTRFS issue or is there any setting I
forgot to use?

Jan  4 17:11:06 ceph1 kernel: [756636.535661] kworker/0:2: page
allocation failure: order:1, mode:0x204020
Jan  4 17:11:06 ceph1 kernel: [756636.535669] CPU: 0 PID: 62644 Comm:
kworker/0:2 Not tainted 3.16.0-0.bpo.4-amd64 #1 Debian
3.16.7-ckt2-1~bpo70+1
Jan  4 17:11:06 ceph1 kernel: [756636.535671] Hardware name: HP
ProLiant MicroServer Gen8, BIOS J06 11/09/2013
Jan  4 17:11:06 ceph1 kernel: [756636.535701] Workqueue: events
do_async_commit [btrfs]
Jan  4 17:11:06 ceph1 kernel: [756636.535704] 
0001 81541f8f 00204020
Jan  4 17:11:06 ceph1 kernel: [756636.535707] 811519ed
0001 880075de0c00 0002
Jan  4 17:11:06 ceph1 kernel: [756636.535710] 
0001 880075de0c08 0096
Jan  4 17:11:06 ceph1 kernel: [756636.535713] Call Trace:
Jan  4 17:11:06 ceph1 kernel: [756636.535720] [81541f8f] ?
dump_stack+0x41/0x51
Jan  4 17:11:06 ceph1 kernel: [756636.535725] [811519ed] ?
warn_alloc_failed+0xfd/0x160
Jan  4 17:11:06 ceph1 kernel: [756636.535730] [8115606f] ?
__alloc_pages_nodemask+0x91f/0xbb0
Jan  4 17:11:06 ceph1 kernel: [756636.535734] [8119fe00] ?
kmem_getpages+0x60/0x110
Jan  4 17:11:06 ceph1 kernel: [756636.535737] [811a1648] ?
fallback_alloc+0x158/0x220
Jan  4 17:11:06 ceph1 kernel: [756636.535741] [811a1f44] ?
kmem_cache_alloc+0x1a4/0x1e0
Jan  4 17:11:06 ceph1 kernel: [756636.535745] [812c8fa0] ?
ida_pre_get+0x60/0xd0
Jan  4 17:11:06 ceph1 kernel: [756636.535749] [811bdf61] ?
get_anon_bdev+0x21/0xe0
Jan  4 17:11:06 ceph1 kernel: [756636.535762

Re: [ceph-users] OSDs with btrfs are down

2015-01-04 Thread Jiri Kanicky

Hi.

I have been experiencing the same issues on both nodes over the past 2 days
(never on both nodes at the same time). It seems the issue occurs after
some time when copying a large number of files to CephFS from my client
node (I don't use RBD yet).


These are new HP servers and the memory does not show any issues in a
memory test. I use an SSD for the OS and normal drives for the OSDs. I
think the issue is not related to the drives, as it would be too much of a
coincidence to have 6 drives with bad blocks on both nodes.


I will also disable the snapshots and report back after a few days.

Thx Jiri


On 5/01/2015 01:33, Dyweni - Ceph-Users wrote:



On 2015-01-04 08:21, Jiri Kanicky wrote:


More googling took me to the following post:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-June/040279.html 



Linux 3.14.1 is affected by serious Btrfs regression(s) that were 
fixed in

later releases.

Unfortunately even latest Linux can crash and corrupt Btrfs file 
system if
OSDs are using snapshots (which is the default). Due to kernel bugs 
related to
Btrfs snapshots I also lost some OSDs until I found that snapshotting 
can be

disabled with filestore btrfs snap = false.


I am wondering if this can be the problem.




Very interesting... I think I was just hit with that over night. :)

Yes, I would definitely recommend turning off snapshots.  I'm going to 
do that myself now.


Have you tested the memory in your server lately?  Memtest86+ on the 
ram, and badblocks on the SSD swap partition?






___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSDs with btrfs are down

2015-01-04 Thread Jiri Kanicky

Hi.

Do you know how to tell whether the option "filestore btrfs snap = false"
is set?
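
One way to check the running value is via the OSD admin socket; a
sketch, the socket path and OSD id are assumptions:

$ sudo ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep btrfs_snap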


Thx Jiri

On 5/01/2015 02:25, Jiri Kanicky wrote:

Hi.

I have been experiencing same issues on both nodes over the past 2 
days (never both nodes at the same time).  It seems the issue occurs 
after some time when copying  a large number of files to CephFS on my 
client node (I dont use RBD yet).


These are new HP servers and the memory does not seem to have any 
issues in mem test. I use SSD for OS and normal drives for OSD. I 
think that the issue is not related to drives as it would be too much 
coincident to have 6 drives with bad blocks on both nodes.


I will also disable the snapshots and report back after few days.

Thx Jiri


On 5/01/2015 01:33, Dyweni - Ceph-Users wrote:



On 2015-01-04 08:21, Jiri Kanicky wrote:


More googling took me to the following post:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-June/040279.html 



Linux 3.14.1 is affected by serious Btrfs regression(s) that were 
fixed in

later releases.

Unfortunately even latest Linux can crash and corrupt Btrfs file 
system if
OSDs are using snapshots (which is the default). Due to kernel bugs 
related to
Btrfs snapshots I also lost some OSDs until I found that 
snapshotting can be

disabled with filestore btrfs snap = false.


I am wondering if this can be the problem.




Very interesting... I think I was just hit with that over night. :)

Yes, I would definitely recommend turning off snapshots.  I'm going 
to do that myself now.


Have you tested the memory in your server lately?  Memtest86+ on the 
ram, and badblocks on the SSD swap partition?






___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] redundancy with 2 nodes

2015-01-02 Thread Jiri Kanicky

Hi,

I noticed this message after shutting down the other node. You might be 
right that I need 3 monitors.

2015-01-01 15:47:35.990260 7f22858dd700  0 monclient: hunting for new mon

But what is quite unexpected is that you cannot even run "ceph status"
on the running node to find out the state of the cluster.
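
For reference: monitors need a strict majority to form a quorum, so with
two monitors the loss of either one leaves the cluster unable to answer
any command. Adding a third monitor fixes that; a sketch, host name
assumed:

$ ceph-deploy mon add ceph3
$ ceph quorum_status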


Thx Jiri


On 1/01/2015 15:46, Jiri Kanicky wrote:

Hi,

I have:
- 2 monitors, one on each node
- 4 OSDs, two on each node
- 2 MDS, one on each node

Yes, all pools are set with size=2 and min_size=1

cephadmin@ceph1:~$ ceph osd dump
epoch 88
fsid bce2ff4d-e03b-4b75-9b17-8a48ee4d7788
created 2014-12-27 23:38:00.455097
modified 2014-12-30 20:45:51.343217
flags
pool 0 'rbd' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 86 flags hashpspool stripe_width 0
pool 1 'media' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 60 flags hashpspool stripe_width 0
pool 2 'data' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 63 flags hashpspool stripe_width 0
pool 3 'cephfs_test' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 71 flags hashpspool crash_replay_interval 45 stripe_width 0
pool 4 'cephfs_metadata' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 69 flags hashpspool stripe_width 0

max_osd 4
osd.0 up   in  weight 1 up_from 55 up_thru 86 down_at 51 last_clean_interval [39,50) 192.168.30.21:6800/17319 10.1.1.21:6800/17319 10.1.1.21:6801/17319 192.168.30.21:6801/17319 exists,up 4f3172e1-adb8-4ca3-94af-6f0b8fcce35a
osd.1 up   in  weight 1 up_from 57 up_thru 86 down_at 53 last_clean_interval [41,52) 192.168.30.21:6803/17684 10.1.1.21:6802/17684 10.1.1.21:6804/17684 192.168.30.21:6805/17684 exists,up 1790347a-94fa-4b81-b429-1e7c7f11d3fd
osd.2 up   in  weight 1 up_from 79 up_thru 86 down_at 74 last_clean_interval [13,73) 192.168.30.22:6801/3178 10.1.1.22:6800/3178 10.1.1.22:6801/3178 192.168.30.22:6802/3178 exists,up 5520835f-c411-4750-974b-34e9aea2585d
osd.3 up   in  weight 1 up_from 81 up_thru 86 down_at 72 last_clean_interval [20,71) 192.168.30.22:6804/3414 10.1.1.22:6802/3414 10.1.1.22:6803/3414 192.168.30.22:6805/3414 exists,up 25e62059-6392-4a69-99c9-214ae335004


Thx Jiri

On 1/01/2015 15:21, Lindsay Mathieson wrote:

On Thu, 1 Jan 2015 02:59:05 PM Jiri Kanicky wrote:

I would expect that if I shut down one node, the system will keep
running. But when I tested it, I cannot even execute ceph status
command on the running node.

2 osd Nodes, 3 Mon nodes here, works perfectly for me.

How many monitors do you have?
Maybe you need a third monitor only node for quorum?



I set osd_pool_default_size = 2 (min_size=1) on all pools, so I
thought that each copy will reside on each node. Which means that if 1
node goes down the second one will be still operational.

does:
ceph osd pool get {pool name} size
   return 2

ceph osd pool get {pool name} min_size
   return 1










___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] redundancy with 2 nodes

2014-12-31 Thread Jiri Kanicky

Hi,

Is it possible to achieve redundancy with 2 nodes only?

cephadmin@ceph1:~$ ceph osd tree
# id    weight  type name       up/down reweight
-1      10.88   root default
-2      5.44            host ceph1
0       2.72                    osd.0   up      1
1       2.72                    osd.1   up      1
-3      5.44            host ceph2
2       2.72                    osd.2   up      1
3       2.72                    osd.3   up      1

cephadmin@ceph1:~$ ceph status
cluster bce2ff4d-e03b-4b75-9b17-8a48ee4d7788
 health HEALTH_OK
 monmap e1: 2 mons at 
{ceph1=192.168.30.21:6789/0,ceph2=192.168.30.22:6789/0}, election epoch 
12, quorum 0,1 ceph1,ceph2

 mdsmap e7: 1/1/1 up {0=ceph1=up:active}, 1 up:standby
 osdmap e88: 4 osds: 4 up, 4 in
  pgmap v2051: 1280 pgs, 5 pools, 13184 MB data, 3328 objects
26457 MB used, 11128 GB / 11158 GB avail
1280 active+clean

I would expect that if I shut down one node, the system will keep
running. But when I tested it, I could not even execute the "ceph status"
command on the running node.


I set osd_pool_default_size = 2 (min_size = 1) on all pools, so I
thought that one copy would reside on each node, which means that if one
node goes down the second one will still be operational.


I think my assumptions are wrong, but I could not find an explanation why.

Thanks Jiri
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] redundancy with 2 nodes

2014-12-31 Thread Jiri Kanicky

Hi,

I have:
- 2 monitors, one on each node
- 4 OSDs, two on each node
- 2 MDS, one on each node

Yes, all pools are set with size=2 and min_size=1

cephadmin@ceph1:~$ ceph osd dump
epoch 88
fsid bce2ff4d-e03b-4b75-9b17-8a48ee4d7788
created 2014-12-27 23:38:00.455097
modified 2014-12-30 20:45:51.343217
flags
pool 0 'rbd' replicated *size 2 min_size 1* crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 86 flags hashpspool stripe_width 0
pool 1 'media' replicated *size 2 min_size 1* crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 60 flags hashpspool stripe_width 0
pool 2 'data' replicated *size 2 min_size 1* crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 63 flags hashpspool stripe_width 0
pool 3 'cephfs_test' replicated *size 2 min_size 1* crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 71 flags hashpspool crash_replay_interval 45 stripe_width 0
pool 4 'cephfs_metadata' replicated *size 2 min_size 1* crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 69 flags hashpspool stripe_width 0

max_osd 4
osd.0 up   in  weight 1 up_from 55 up_thru 86 down_at 51 last_clean_interval [39,50) 192.168.30.21:6800/17319 10.1.1.21:6800/17319 10.1.1.21:6801/17319 192.168.30.21:6801/17319 exists,up 4f3172e1-adb8-4ca3-94af-6f0b8fcce35a
osd.1 up   in  weight 1 up_from 57 up_thru 86 down_at 53 last_clean_interval [41,52) 192.168.30.21:6803/17684 10.1.1.21:6802/17684 10.1.1.21:6804/17684 192.168.30.21:6805/17684 exists,up 1790347a-94fa-4b81-b429-1e7c7f11d3fd
osd.2 up   in  weight 1 up_from 79 up_thru 86 down_at 74 last_clean_interval [13,73) 192.168.30.22:6801/3178 10.1.1.22:6800/3178 10.1.1.22:6801/3178 192.168.30.22:6802/3178 exists,up 5520835f-c411-4750-974b-34e9aea2585d
osd.3 up   in  weight 1 up_from 81 up_thru 86 down_at 72 last_clean_interval [20,71) 192.168.30.22:6804/3414 10.1.1.22:6802/3414 10.1.1.22:6803/3414 192.168.30.22:6805/3414 exists,up 25e62059-6392-4a69-99c9-214ae335004


Thx Jiri

On 1/01/2015 15:21, Lindsay Mathieson wrote:

On Thu, 1 Jan 2015 02:59:05 PM Jiri Kanicky wrote:

I would expect that if I shut down one node, the system will keep
running. But when I tested it, I cannot even execute ceph status
command on the running node.

2 osd Nodes, 3 Mon nodes here, works perfectly for me.

How many monitors do you have?
Maybe you need a third monitor only node for quorum?
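
With only two monitors the quorum majority is two, so losing either node 
leaves the surviving monitor unable to form a quorum and every ceph command 
just hangs. A minimal sketch of adding a third, monitor-only node, assuming 
ceph-deploy was used for the install and a spare host called ceph3 is 
reachable (both the tool and the hostname are assumptions, not from this 
thread):

    ceph-deploy mon add ceph3                  # or: ceph-deploy mon create ceph3
    ceph quorum_status --format json-pretty    # confirm all three mons are in quorum

With three monitors, any single node can go down and the remaining two 
still form a majority.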



I set osd_pool_default_size = 2 (min_size=1) on all pools, so I
thought that each copy will reside on each node. Which means that if 1
node goes down the second one will be still operational.


does:
ceph osd pool get {pool name} size
   return 2

ceph osd pool get {pool name} min_size
   return 1
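
For this cluster that would be, for example (with size=2/min_size=1 the 
output should read "size: 2" and "min_size: 1"):

    ceph osd pool get rbd size
    ceph osd pool get rbd min_size

and the same check repeated for the media, data, cephfs_test and 
cephfs_metadata pools.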




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




Re: [ceph-users] cephfs kernel module reports error on mount

2014-12-30 Thread Jiri Kanicky

Hi.

I have got the same message in Debian Jessie, while the CephFS mounts 
and works fine.


Jiri.

On 18/12/2014 01:00, John Spray wrote:

Hmm, from a quick google it appears you are not the only one who has
seen this symptom with mount.ceph.  Our mtab code appears to have
diverged a bit from the upstream util-linux repo, so it seems entirely
possible we have a bug in ours somewhere.  I've opened
http://tracker.ceph.com/issues/10351 to track it.

Cheers,
John

On Wed, Dec 17, 2014 at 1:31 PM, Lindsay Mathieson
lindsay.mathie...@gmail.com wrote:

mount reports:

mount: error writing /etc/mtab: Invalid argument



fstab entry is:



vnb.proxmox.softlog,vng.proxmox.softlog,vnt.proxmox.softlog:/ /mnt/test ceph _netdev,defaults,name=admin,secretfile=/etc/pve/priv/admin.secret 0 0





However the mount is successful and an mtab entry is made.
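
For reference, the equivalent manual mount (a sketch derived from the fstab 
line above, not something from the original report) would be roughly:

    sudo mount -t ceph vnb.proxmox.softlog,vng.proxmox.softlog,vnt.proxmox.softlog:/ /mnt/test \
        -o name=admin,secretfile=/etc/pve/priv/admin.secret

which goes through the same mount.ceph helper that tries to update /etc/mtab.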



debian wheezy, ceph 0.87



--

Lindsay


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




[ceph-users] HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133 pgs stuck unclean; 29 pgs stuck undersized;

2014-12-27 Thread Jiri Kanicky

Hi,

I just built my CEPH cluster but am having problems with the health of the 
cluster.


Here are few details:
- I followed the ceph documentation.
- I used btrfs filesystem for all OSDs
- I did not set "osd pool default size = 2" as I thought that if I have 
2 nodes + 4 OSDs, I can leave default=3. I am not sure if this was right.
- I noticed that the default pools "data" and "metadata" were not created. Only 
the "rbd" pool was created.
- As it was complaining that the pg_num is too low, I increased the 
pg_num for pool rbd to 133 (400/3) and ended up with "pool rbd pg_num 133 
> pgp_num 64".


Would you give me a hint where I have made the mistake? (I can remove the 
OSDs and start over if needed.)



cephadmin@ceph1:/etc/ceph$ sudo ceph health
HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133 pgs stuck 
unclean; 29 pgs stuck undersized; 29 pgs undersized; pool rbd pg_num 133 
> pgp_num 64

cephadmin@ceph1:/etc/ceph$ sudo ceph status
cluster bce2ff4d-e03b-4b75-9b17-8a48ee4d7788
 health HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133 pgs 
stuck unclean; 29 pgs stuck undersized; 29 pgs undersized; pool rbd 
pg_num 133 > pgp_num 64
 monmap e1: 2 mons at 
{ceph1=192.168.30.21:6789/0,ceph2=192.168.30.22:6789/0}, election epoch 
8, quorum 0,1 ceph1,ceph2

 osdmap e42: 4 osds: 4 up, 4 in
  pgmap v77: 133 pgs, 1 pools, 0 bytes data, 0 objects
11704 kB used, 11154 GB / 11158 GB avail
  29 active+undersized+degraded
 104 active+remapped


cephadmin@ceph1:/etc/ceph$ sudo ceph osd tree
# id    weight  type name       up/down reweight
-1      10.88   root default
-2      5.44            host ceph1
0       2.72                    osd.0   up      1
1       2.72                    osd.1   up      1
-3      5.44            host ceph2
2       2.72                    osd.2   up      1
3       2.72                    osd.3   up      1


cephadmin@ceph1:/etc/ceph$ sudo ceph osd lspools
0 rbd,

cephadmin@ceph1:/etc/ceph$ cat ceph.conf
[global]
fsid = bce2ff4d-e03b-4b75-9b17-8a48ee4d7788
public_network = 192.168.30.0/24
cluster_network = 10.1.1.0/24
mon_initial_members = ceph1, ceph2
mon_host = 192.168.30.21,192.168.30.22
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true

Thank you
Jiri
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133 pgs stuck unclean; 29 pgs stuck undersized;

2014-12-27 Thread Jiri Kanicky

Hi Christian.

Thank you for your comments again. Very helpful.

I will try to fix the current pool and see how it goes. It's good to 
learn some troubleshooting skills.


Regarding the BTRFS vs XFS, not sure if the documentation is old. My 
decision was based on this:


http://ceph.com/docs/master/rados/configuration/filesystem-recommendations/

Note

We currently recommend XFS for production deployments. We 
recommend btrfs for testing, development, and any non-critical 
deployments. *We believe that btrfs has the correct feature set 
and roadmap to serve Ceph in the long-term*, but XFS and ext4 provide the 
necessary stability for today’s deployments. btrfs development is 
proceeding rapidly: users should be comfortable installing the latest 
released upstream kernels and be able to track development activity for 
critical bug fixes.
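
For completeness, the filesystem is picked when each OSD is prepared, so 
switching to XFS later means re-creating the OSDs. A rough sketch of the 
relevant knobs (option names as commonly used in ceph.conf around this 
release; double-check against your version):

    [osd]
    osd mkfs type = xfs
    osd mkfs options xfs = -f
    osd mount options xfs = rw,noatime,inode64

or per OSD at deploy time, e.g. "ceph-deploy osd prepare --fs-type xfs 
ceph1:/dev/sdb" (the host and device here are only placeholders).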




Thanks
Jiri


On 28/12/2014 16:01, Christian Balzer wrote:

Hello,

On Sun, 28 Dec 2014 11:58:59 +1100 ji...@ganomi.com wrote:


Hi Christian.

Thank you for your suggestions.

I will set the osd pool default size to 2 as you recommended. As
mentioned the documentation is talking about OSDs, not nodes, so that
must have confused me.


Note that changing this will only affect new pools of course. So to sort
out your current state either start over with this value set before
creating/starting anything or reduce the current size (ceph osd pool set
poolname size).

Have a look at the crushmap example or even better your own, current one
and you will see where by default the host is the failure domain.
Which of course makes a lot of sense.
  

Regarding the BTRFS, i thought that btrfs is better option for the
future providing more features. I know that XFS might be more stable,
but again my impression was that btrfs is the focus for future
development. Is that correct?


I'm not a developer, but if you scour the ML archives you will find a
number of threads about BTRFS (and ZFS).
The biggest issues with BTRFS are not just stability but also the fact
that it degrades rather quickly (fragmentation) due to the COW nature of
it and less smarts than ZFS in that area.
So development on the Ceph side is not the issue per se.

IMHO BTRFS looks more and more stillborn and with regard to Ceph ZFS might
become the better choice (in the future), with KV store backends being an
alternative for some use cases (also far from production ready at this
time).

Regards,

Christian

You are right with the round up. I forgot about that.

Thanks for your help. Much appreciated.
Jiri

- Reply message -
From: Christian Balzer ch...@gol.com
To: ceph-us...@ceph.com
Cc: Jiri Kanicky ji...@ganomi.com
Subject: [ceph-users] HEALTH_WARN 29 pgs degraded; 29 pgs stuck
degraded; 133 pgs stuck unclean; 29 pgs stuck undersized; Date: Sun, Dec
28, 2014 03:29

Hello,

On Sun, 28 Dec 2014 01:52:39 +1100 Jiri Kanicky wrote:


Hi,

I just built my CEPH cluster but am having problems with the health of
the cluster.


You're not telling us the version, but it's clearly 0.87 or beyond.


Here are few details:
- I followed the ceph documentation.

Outdated, unfortunately.


- I used btrfs filesystem for all OSDs

Big mistake number 1, do some research (google, ML archives).
Though not related to your problems.


- I did not set osd pool default size = 2  as I thought that if I
have 2 nodes + 4 OSDs, I can leave default=3. I am not sure if this
was right.

Big mistake, assumption number 2: replication size by the default CRUSH
rule is determined by hosts. So that's your main issue here.
Either set it to 2 or use 3 hosts.
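
A sketch of both options for the size (the pool name rbd is taken from this 
cluster; the ceph.conf defaults only affect pools created afterwards):

    # in /etc/ceph/ceph.conf, [global] section, before creating pools:
    osd pool default size = 2
    osd pool default min size = 1

    # or change an already existing pool:
    ceph osd pool set rbd size 2
    ceph osd pool set rbd min_size 1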


- I noticed that default pools data,metadata were not created. Only
rbd pool was created.

See outdated docs above. The majority of use cases is with RBD, so since
Giant the cephfs pools are not created by default.


- As it was complaining that the pg_num is too low, I increased the
pg_num for pool rbd to 133 (400/3) and ended up with "pool rbd pg_num
133 > pgp_num 64".


Re-read the (in this case correct) documentation.
It clearly states to round up to nearest power of 2, in your case 256.
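
Spelled out with this cluster's numbers, the usual rule of thumb (a guideline, 
not an exact science) is pg_num = (number of OSDs * 100) / replica count = 
(4 * 100) / 2 = 200, which rounds up to 256, and pgp_num should be raised to 
match:

    ceph osd pool set rbd pg_num 256
    ceph osd pool set rbd pgp_num 256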

Regards.

Christian


Would you give me hint where I have made the mistake? (I can remove
the OSDs and start over if needed.)


cephadmin@ceph1:/etc/ceph$ sudo ceph health
HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133 pgs stuck
unclean; 29 pgs stuck undersized; 29 pgs undersized; pool rbd pg_num
133 > pgp_num 64
cephadmin@ceph1:/etc/ceph$ sudo ceph status
  cluster bce2ff4d-e03b-4b75-9b17-8a48ee4d7788
   health HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133
pgs stuck unclean; 29 pgs stuck undersized; 29 pgs undersized; pool
rbd pg_num 133 > pgp_num 64
   monmap e1: 2 mons at
{ceph1=192.168.30.21:6789/0,ceph2=192.168.30.22:6789/0}, election
epoch 8, quorum 0,1 ceph1,ceph2
   osdmap e42: 4 osds: 4 up, 4 in
pgmap v77: 133 pgs, 1 pools, 0 bytes data, 0 objects
  11704 kB used, 11154 GB / 11158 GB avail
29 active+undersized+degraded

Re: [ceph-users] HEALTH_WARN 29 pgs degraded; 29 pgs stuck degraded; 133 pgs stuck unclean; 29 pgs stuck undersized;

2014-12-27 Thread Jiri Kanicky

Hi Christian,

Thank you for the valuable info. As I will use this cluster mainly at 
home for my data, and testing (backup in place), I will continue to use 
BTRFS. In production, I would go with XFS as recommended. ZFS - perhaps 
when it becomes officially supported.


BTW, I fixed the HEALTH of my cluster:
1. I ran "ceph osd pool set rbd size 2"
2. I ran "ceph osd pool set rbd pg_num 256" and "ceph osd pool set rbd 
pgp_num 256"


5 pgs remained stuck unclean (stuck unclean since forever, current state 
active, last acting). I fixed this by restarting Ceph ("ceph -a"). I think the 
OSD restart fixed this. I guess there might be a more elegant solution, 
but I was not able to figure it out. Tried "pg repair" but that didn't 
do the trick.
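
For anyone hitting the same state, a few read-only commands that help pin 
down such PGs before resorting to a restart (just a sketch; <pgid> is a 
placeholder for an id taken from the dump output):

    ceph health detail              # lists the stuck PG ids
    ceph pg dump_stuck unclean      # shows which OSDs they map to
    ceph pg <pgid> query            # detailed state of a single PG

The restart mentioned above would typically be the sysvinit wrapper, e.g. 
"sudo /etc/init.d/ceph -a restart" on each node.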


Anyway, it seems to be healthy now :).
cephadmin@ceph1:~$ sudo ceph status
cluster bce2ff4d-e03b-4b75-9b17-8a48ee4d7788
 health HEALTH_OK
 monmap e1: 2 mons at 
{ceph1=192.168.30.21:6789/0,ceph2=192.168.30.22:6789/0}, election epoch 
10, quorum 0,1 ceph1,ceph2

 osdmap e59: 4 osds: 4 up, 4 in
  pgmap v179: 256 pgs, 1 pools, 0 bytes data, 0 objects
16924 kB used, 11154 GB / 11158 GB avail
 256 active+clean

Thanks for the help!
Jiri

On 28/12/2014 16:59, Christian Balzer wrote:

Hello Jiri,

On Sun, 28 Dec 2014 16:14:04 +1100 Jiri Kanicky wrote:


Hi Christian.

Thank you for your comments again. Very helpful.

I will try to fix the current pool and see how it goes. It's good to
learn some troubleshooting skills.


Indeed, knowing what to do when things break is where it's at.


Regarding the BTRFS vs XFS, not sure if the documentation is old. My
decision was based on this:

http://ceph.com/docs/master/rados/configuration/filesystem-recommendations/


It's dated for sure and a bit of wishful thinking on behalf of the Ceph
developers, who understandably didn't want to re-invent the wheel inside Ceph
when the underlying file system could provide it (checksums, snapshots, etc).

ZFS has all the features (and much better tested) BTRFS is aspiring to and
if kept below 80% utilization doesn't fragment itself to death.

At the end of that page they mention deduplication, which of course (as I
wrote recently in the "use ZFS for OSDs" thread) is unlikely to do anything
worthwhile at all.

Simply put, some things _need_ to be done in Ceph to work properly and
can't be delegated to the underlying FS or other storage backend.

Christian


Note

We currently recommend XFS for production deployments. We
recommend btrfs for testing, development, and any non-critical
deployments. *We believe that btrfs has the correct feature set
and roadmap to serve Ceph in the long-term*, but XFS and ext4 provide the
necessary stability for today’s deployments. btrfs development is
proceeding rapidly: users should be comfortable installing the latest
released upstream kernels and be able to track development activity for
critical bug fixes.



Thanks
Jiri


On 28/12/2014 16:01, Christian Balzer wrote:

Hello,

On Sun, 28 Dec 2014 11:58:59 +1100 ji...@ganomi.com wrote:


Hi Christian.

Thank you for your suggestions.

I will set the osd pool default size to 2 as you recommended. As
mentioned the documentation is talking about OSDs, not nodes, so that
must have confused me.


Note that changing this will only affect new pools of course. So to
sort out your current state either start over with this value set
before creating/starting anything or reduce the current size (ceph osd
pool set poolname size).

Have a look at the crushmap example or even better your own, current
one and you will see where by default the host is the failure domain.
Which of course makes a lot of sense.
   

Regarding the BTRFS, i thought that btrfs is better option for the
future providing more features. I know that XFS might be more stable,
but again my impression was that btrfs is the focus for future
development. Is that correct?


I'm not a developer, but if you scour the ML archives you will find a
number of threads about BTRFS (and ZFS).
The biggest issues with BTRFS are not just stability but also the fact
that it degrades rather quickly (fragmentation) due to the COW nature
of it and less smarts than ZFS in that area.
So development on the Ceph side is not the issue per se.

IMHO BTRFS looks more and more stillborn and with regard to Ceph ZFS
might become the better choice (in the future), with KV store backends
being an alternative for some use cases (also far from production
ready at this time).

Regards,

Christian

You are right with the round up. I forgot about that.

Thanks for your help. Much appreciated.
Jiri

- Reply message -
From: Christian Balzer ch...@gol.com
To: ceph-us...@ceph.com
Cc: Jiri Kanicky ji...@ganomi.com
Subject: [ceph-users] HEALTH_WARN 29 pgs degraded; 29 pgs stuck
degraded; 133 pgs stuck unclean; 29 pgs stuck undersized; Date: Sun,
Dec 28, 2014 03:29

Hello,

On Sun, 28 Dec 2014 01:52:39 +1100 Jiri Kanicky wrote:


Hi,

I just build my CEPH cluster but having

[ceph-users] do I have to use sudo for CEPH install

2014-12-01 Thread Jiri Kanicky

Hi.

Do I have to install sudo in Debian Wheezy to deploy CEPH successfully? I 
don't normally use sudo.


Thank you
Jiri
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com