Re: [ceph-users] Issue with free Inodes

2015-03-25 Thread Kamil Kuramshin

Maybe someone can shed some new light on this:

1. Only the SSD cache OSDs are affected by this issue.
2. The total cache OSD count is 12 x 60 GiB; the backend filesystem is ext4.
3. I have created 2 cache tier pools with replica size=3 on those OSDs, 
both with pg_num:400, pgp_num:400.

4. There was a crush ruleset:
superuser@admin:~$ ceph osd crush rule dump ssd
{ "rule_id": 3,
  "rule_name": "ssd",
  "ruleset": 3,
  "type": 1,
  "min_size": 1,
  "max_size": 10,
  "steps": [
        { "op": "take",
          "item": -21,
          "item_name": "ssd"},
        { "op": "chooseleaf_firstn",
          "num": 0,
          "type": "disktype"},
        { "op": "emit"}]}
which gathers all SSD OSDs from all nodes by *disktype*.
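
For reference, pools are typically created on, or pinned to, such a ruleset along these lines (the pool name "ssd-cache" below is only a placeholder, not necessarily the real one):

$ ceph osd pool create ssd-cache 400 400 replicated ssd   # create a replicated pool directly on the "ssd" ruleset
$ ceph osd pool set ssd-cache crush_ruleset 3             # or move an existing pool onto ruleset id 3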

I guess there may be a lot of *directories* created on the 
filesystem for organizing placement groups; can that explain such a large 
number of inodes being occupied by directory entries?
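
A quick way to check that, assuming one of the affected OSDs is still mounted (path as in the logs further down):

$ sudo find /var/lib/ceph/osd/ceph-45/current -type d | wc -l   # inodes used by directories
$ sudo find /var/lib/ceph/osd/ceph-45/current -type f | wc -l   # inodes used by regular files
# if the directory count is only a small fraction of 3335808, directories are not the culprit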




24.03.2015 16:52, Gregory Farnum wrote:

On Tue, Mar 24, 2015 at 12:13 AM, Christian Balzer ch...@gol.com wrote:

On Tue, 24 Mar 2015 09:41:04 +0300 Kamil Kuramshin wrote:


Yes, I read it, but I do not understand what you mean by *verify
this*. All 3335808 inodes are definitely files and directories created by
the ceph OSD process:


What I mean is: how/why did Ceph create 3+ million files, where in the tree
are they actually, and are they evenly distributed across the respective PG
sub-directories?

Or to ask it differently, how large is your cluster (how many OSDs,
objects), in short the output of ceph -s.

If cache-tiers actually are reserving each object that exists on the
backing store (even if there isn't data in it yet on the cache tier) and
your cluster is large enough, it might explain this.

Nope. As you've said, this doesn't make any sense unless the objects
are all ludicrously small (and you can't actually get 10-byte objects
in Ceph; the names alone tend to be bigger than that) or something
else is using up inodes.


And that should both be mentioned, and precautions against running out of
inodes should be taken in the Ceph code.

If not, this may be a bug after all.

It would be nice if somebody from the Ceph devs could have a gander at this.

Christian


*tune2fs 1.42.5 (29-Jul-2012)*
Filesystem volume name:   none
Last mounted on:  /var/lib/ceph/tmp/mnt.05NAJ3
Filesystem UUID: e4dcca8a-7b68-4f60-9b10-c164dc7f9e33
Filesystem magic number:  0xEF53
Filesystem revision #:1 (dynamic)
Filesystem features:  has_journal ext_attr resize_inode dir_index
filetype extent flex_bg sparse_super large_file huge_file uninit_bg
dir_nlink extra_isize
Filesystem flags: signed_directory_hash
Default mount options:user_xattr acl
Filesystem state: clean
Errors behavior:  Continue
Filesystem OS type:   Linux
*Inode count:  3335808*
Block count:  13342945
Reserved block count: 667147
Free blocks:  5674105
*Free inodes:  0*
First block:  0
Block size:   4096
Fragment size:4096
Reserved GDT blocks:  1020
Blocks per group: 32768
Fragments per group:  32768
Inodes per group: 8176
Inode blocks per group:   511
Flex block group size:16
Filesystem created:   Fri Feb 20 16:44:25 2015
Last mount time:  Tue Mar 24 09:33:19 2015
Last write time:  Tue Mar 24 09:33:27 2015
Mount count:  7
Maximum mount count:  -1
Last checked: Fri Feb 20 16:44:25 2015
Check interval:   0 (none)
Lifetime writes:  4116 GB
Reserved blocks uid:  0 (user root)
Reserved blocks gid:  0 (group root)
First inode:  11
Inode size:   256
Required extra isize: 28
Desired extra isize:  28
Journal inode:8
Default directory hash:   half_md4
Directory Hash Seed: 148ee5dd-7ee0-470c-a08a-b11c318ff90b
Journal backup:   inode blocks

*fsck.ext4 /dev/sda1*
e2fsck 1.42.5 (29-Jul-2012)
/dev/sda1: clean, 3335808/3335808 files, 7668840/13342945 blocks

23.03.2015 17:09, Christian Balzer wrote:

On Mon, 23 Mar 2015 15:26:07 +0300 Kamil Kuramshin wrote:


Yes, I understand that.

The initial purpose of my first email was just advice for newcomers.
My mistake was selecting ext4 as the backend filesystem for the SSD disks.
But I did not foresee that the inode count could reach its limit before
the free space does :)

And maybe there should be a warning not only for free space
in MiB (GiB, TiB), but also a dedicated warning about free
inodes for filesystems with static inode allocation like ext4.
Because once an OSD reaches the inode limit it becomes totally unusable and
immediately goes down, and from that moment there is no way to start
it!


While all that is true and should probably be addressed, please re-read
what I wrote before.

With the 3.3 million inodes used and thus likely as many files (did you
verify this?) and 4MB objects that would make something in the 12TB
ballpark area.

Something very very strange and wrong is going on with your cache tier.

Christian



Re: [ceph-users] Issue with free Inodes

2015-03-24 Thread Gregory Farnum
On Tue, Mar 24, 2015 at 12:13 AM, Christian Balzer ch...@gol.com wrote:
 On Tue, 24 Mar 2015 09:41:04 +0300 Kamil Kuramshin wrote:

 Yes, I read it, but I do not understand what you mean by *verify
 this*. All 3335808 inodes are definitely files and directories created by
 the ceph OSD process:

 What I mean is: how/why did Ceph create 3+ million files, where in the tree
 are they actually, and are they evenly distributed across the respective PG
 sub-directories?

 Or to ask it differently, how large is your cluster (how many OSDs,
 objects), in short the output of ceph -s.

 If cache-tiers actually are reserving each object that exists on the
 backing store (even if there isn't data in it yet on the cache tier) and
 your cluster is large enough, it might explain this.

Nope. As you've said, this doesn't make any sense unless the objects
are all ludicrously small (and you can't actually get 10-byte objects
in Ceph; the names alone tend to be bigger than that) or something
else is using up inodes.
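
One way to sanity-check the actual object sizes in the cache pool would be something along these lines (the pool name "ssd-cache" is a placeholder):

$ rados -p ssd-cache ls | head -5                               # a few object names from the cache pool
$ rados -p ssd-cache stat "$(rados -p ssd-cache ls | head -1)"  # size and mtime of one of those objects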


 And that should both be mentioned, and precautions against running out of
 inodes should be taken in the Ceph code.

 If not, this may be a bug after all.

 It would be nice if somebody from the Ceph devs could have a gander at this.

 Christian

 *tune2fs 1.42.5 (29-Jul-2012)*
 Filesystem volume name:   none
 Last mounted on:  /var/lib/ceph/tmp/mnt.05NAJ3
 Filesystem UUID: e4dcca8a-7b68-4f60-9b10-c164dc7f9e33
 Filesystem magic number:  0xEF53
 Filesystem revision #:1 (dynamic)
 Filesystem features:  has_journal ext_attr resize_inode dir_index
 filetype extent flex_bg sparse_super large_file huge_file uninit_bg
 dir_nlink extra_isize
 Filesystem flags: signed_directory_hash
 Default mount options:user_xattr acl
 Filesystem state: clean
 Errors behavior:  Continue
 Filesystem OS type:   Linux
 *Inode count:  3335808*
 Block count:  13342945
 Reserved block count: 667147
 Free blocks:  5674105
 *Free inodes:  0*
 First block:  0
 Block size:   4096
 Fragment size:4096
 Reserved GDT blocks:  1020
 Blocks per group: 32768
 Fragments per group:  32768
 Inodes per group: 8176
 Inode blocks per group:   511
 Flex block group size:16
 Filesystem created:   Fri Feb 20 16:44:25 2015
 Last mount time:  Tue Mar 24 09:33:19 2015
 Last write time:  Tue Mar 24 09:33:27 2015
 Mount count:  7
 Maximum mount count:  -1
 Last checked: Fri Feb 20 16:44:25 2015
 Check interval:   0 (none)
 Lifetime writes:  4116 GB
 Reserved blocks uid:  0 (user root)
 Reserved blocks gid:  0 (group root)
 First inode:  11
 Inode size:   256
 Required extra isize: 28
 Desired extra isize:  28
 Journal inode:8
 Default directory hash:   half_md4
 Directory Hash Seed: 148ee5dd-7ee0-470c-a08a-b11c318ff90b
 Journal backup:   inode blocks

 *fsck.ext4 /dev/sda1*
 e2fsck 1.42.5 (29-Jul-2012)
 /dev/sda1: clean, 3335808/3335808 files, 7668840/13342945 blocks

 23.03.2015 17:09, Christian Balzer wrote:
  On Mon, 23 Mar 2015 15:26:07 +0300 Kamil Kuramshin wrote:
 
  Yes, I understand that.
 
  The initial purpose of my first email was just advice for newcomers.
  My mistake was selecting ext4 as the backend filesystem for the SSD disks.
  But I did not foresee that the inode count could reach its limit before
  the free space does :)
 
  And maybe there should be a warning not only for free space
  in MiB (GiB, TiB), but also a dedicated warning about free
  inodes for filesystems with static inode allocation like ext4.
  Because once an OSD reaches the inode limit it becomes totally unusable and
  immediately goes down, and from that moment there is no way to start
  it!
 
  While all that is true and should probably be addressed, please re-read
  what I wrote before.
 
  With the 3.3 million inodes used and thus likely as many files (did you
  verify this?) and 4MB objects that would make something in the 12TB
  ballpark area.
 
  Something very very strange and wrong is going on with your cache tier.
 
  Christian
 
  23.03.2015 13:42, Thomas Foster wrote:
  You could fix this by changing your block size when formatting the
  mount-point with the mkfs -b command.  I had this same issue when
  dealing with the filesystem using glusterfs and the solution is to
  either use a filesystem that allocates inodes automatically or change
  the block size when you build the filesystem.  Unfortunately, the
  only way to fix the problem that I have seen is to reformat
 
  On Mon, Mar 23, 2015 at 5:51 AM, Kamil Kuramshin
  kamil.kurams...@tatar.ru mailto:kamil.kurams...@tatar.ru wrote:
 
   In my case it was a cache pool for an EC pool serving RBD images;
   the object size is 4 MB, and the client was a /kernel-rbd/ client.
   Each SSD is a 60 GB disk, 2 disks per node, 6 nodes in total =
   12 OSDs in total.

Re: [ceph-users] Issue with free Inodes

2015-03-24 Thread Kamil Kuramshin

Yes, I read it, but I do not understand what you mean by *verify this*.
All 3335808 inodes are definitely files and directories created by the ceph 
OSD process:


*tune2fs 1.42.5 (29-Jul-2012)*
Filesystem volume name:   none
Last mounted on:  /var/lib/ceph/tmp/mnt.05NAJ3
Filesystem UUID: e4dcca8a-7b68-4f60-9b10-c164dc7f9e33
Filesystem magic number:  0xEF53
Filesystem revision #:1 (dynamic)
Filesystem features:  has_journal ext_attr resize_inode dir_index 
filetype extent flex_bg sparse_super large_file huge_file uninit_bg 
dir_nlink extra_isize

Filesystem flags: signed_directory_hash
Default mount options:user_xattr acl
Filesystem state: clean
Errors behavior:  Continue
Filesystem OS type:   Linux
*Inode count:  3335808*
Block count:  13342945
Reserved block count: 667147
Free blocks:  5674105
*Free inodes:  0*
First block:  0
Block size:   4096
Fragment size:4096
Reserved GDT blocks:  1020
Blocks per group: 32768
Fragments per group:  32768
Inodes per group: 8176
Inode blocks per group:   511
Flex block group size:16
Filesystem created:   Fri Feb 20 16:44:25 2015
Last mount time:  Tue Mar 24 09:33:19 2015
Last write time:  Tue Mar 24 09:33:27 2015
Mount count:  7
Maximum mount count:  -1
Last checked: Fri Feb 20 16:44:25 2015
Check interval:   0 (none)
Lifetime writes:  4116 GB
Reserved blocks uid:  0 (user root)
Reserved blocks gid:  0 (group root)
First inode:  11
Inode size:   256
Required extra isize: 28
Desired extra isize:  28
Journal inode:8
Default directory hash:   half_md4
Directory Hash Seed: 148ee5dd-7ee0-470c-a08a-b11c318ff90b
Journal backup:   inode blocks

*fsck.ext4 /dev/sda1*
e2fsck 1.42.5 (29-Jul-2012)
/dev/sda1: clean, 3335808/3335808 files, 7668840/13342945 blocks

23.03.2015 17:09, Christian Balzer wrote:

On Mon, 23 Mar 2015 15:26:07 +0300 Kamil Kuramshin wrote:


Yes, I understand that.

The initial purpose of my first email was just advice for newcomers. My
mistake was selecting ext4 as the backend filesystem for the SSD disks.
But I did not foresee that the inode count could reach its limit before the
free space does :)

And maybe there should be a warning not only for free space in
MiB (GiB, TiB), but also a dedicated warning about free inodes
for filesystems with static inode allocation like ext4.
Because once an OSD reaches the inode limit it becomes totally unusable and
immediately goes down, and from that moment there is no way to start it!


While all that is true and should probably be addressed, please re-read
what I wrote before.

With the 3.3 million inodes used and thus likely as many files (did you
verify this?) and 4MB objects that would make something in the 12TB
ballpark area.

Something very very strange and wrong is going on with your cache tier.

Christian


23.03.2015 13:42, Thomas Foster wrote:

You could fix this by changing your block size when formatting the
mount-point with the mkfs -b command.  I had this same issue when
dealing with the filesystem using glusterfs and the solution is to
either use a filesystem that allocates inodes automatically or change
the block size when you build the filesystem.  Unfortunately, the only
way to fix the problem that I have seen is to reformat

On Mon, Mar 23, 2015 at 5:51 AM, Kamil Kuramshin
kamil.kurams...@tatar.ru mailto:kamil.kurams...@tatar.ru wrote:

 In my case it was a cache pool for an EC pool serving RBD images;
 the object size is 4 MB, and the client was a /kernel-rbd/ client.
 Each SSD is a 60 GB disk, 2 disks per node, 6 nodes in total = 12
 OSDs in total.


  23.03.2015 12:00, Christian Balzer wrote:

 Hello,

 This is rather confusing, as cache-tiers are just normal
OSDs/pools and thus should have Ceph objects of around 4MB in size by
default.

 This is matched on what I see with Ext4 here (normal OSD, not a
cache tier):
 ---
 size:
 /dev/sde1   2.7T  204G  2.4T   8% /var/lib/ceph/osd/ceph-0
 inodes:
 /dev/sde1  183148544  55654  183092890    1% /var/lib/ceph/osd/ceph-0
 ---

 On a more fragmented cluster I see a 5:1 size to inode ratio.

 I just can't fathom how there could be 3.3 million inodes (and
thus a close number of files) using 30G, making the average file size
below 10 Bytes.

 Something other than your choice of file system is probably at
play here.

 How fragmented are those SSDs?
 What's your default Ceph object size?
 Where _are_ those 3 million files in that OSD, are they actually
in the object files like:
 -rw-r--r-- 1 root root 4194304 Jan  9
15:27 
/var/lib/ceph/osd/ceph-0/current/3.117_head/DIR_7/DIR_1/DIR_5/rb.0.23a8f.238e1f29.00027632__head_C4F3D517__3

 What's your use case, RBD, CephFS, RadosGW?

 Regards,

 Christian


Re: [ceph-users] Issue with free Inodes

2015-03-24 Thread Christian Balzer
On Tue, 24 Mar 2015 09:41:04 +0300 Kamil Kuramshin wrote:

 Yes, I read it, but I do not understand what you mean by *verify
 this*. All 3335808 inodes are definitely files and directories created by
 the ceph OSD process:
 
What I mean is: how/why did Ceph create 3+ million files, where in the tree
are they actually, and are they evenly distributed across the respective PG
sub-directories?

Or to ask it differently, how large is your cluster (how many OSDs,
objects), in short the output of ceph -s.

If cache-tiers actually are reserving each object that exists on the
backing store (even if there isn't data in it yet on the cache tier) and
your cluster is large enough, it might explain this.
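
One way to check that hypothesis is to compare per-pool object counts with the data actually stored, e.g.:

$ ceph -s      # overall cluster state, including the total object count
$ ceph df      # per-pool used space and object counts
$ rados df     # similar view from the rados side

If the cache pools report roughly as many objects as the backing EC pool while holding almost no data, that would fit the "reserved objects" theory above.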

And that should both be mentioned, and precautions against running out of
inodes should be taken in the Ceph code.

If not, this may be a bug after all.

It would be nice if somebody from the Ceph devs could have a gander at this.

Christian

 *tune2fs 1.42.5 (29-Jul-2012)*
 Filesystem volume name:   none
 Last mounted on:  /var/lib/ceph/tmp/mnt.05NAJ3
 Filesystem UUID: e4dcca8a-7b68-4f60-9b10-c164dc7f9e33
 Filesystem magic number:  0xEF53
 Filesystem revision #:1 (dynamic)
 Filesystem features:  has_journal ext_attr resize_inode dir_index 
 filetype extent flex_bg sparse_super large_file huge_file uninit_bg 
 dir_nlink extra_isize
 Filesystem flags: signed_directory_hash
 Default mount options:user_xattr acl
 Filesystem state: clean
 Errors behavior:  Continue
 Filesystem OS type:   Linux
 *Inode count:  3335808*
 Block count:  13342945
 Reserved block count: 667147
 Free blocks:  5674105
 *Free inodes:  0*
 First block:  0
 Block size:   4096
 Fragment size:4096
 Reserved GDT blocks:  1020
 Blocks per group: 32768
 Fragments per group:  32768
 Inodes per group: 8176
 Inode blocks per group:   511
 Flex block group size:16
 Filesystem created:   Fri Feb 20 16:44:25 2015
 Last mount time:  Tue Mar 24 09:33:19 2015
 Last write time:  Tue Mar 24 09:33:27 2015
 Mount count:  7
 Maximum mount count:  -1
 Last checked: Fri Feb 20 16:44:25 2015
 Check interval:   0 (none)
 Lifetime writes:  4116 GB
 Reserved blocks uid:  0 (user root)
 Reserved blocks gid:  0 (group root)
 First inode:  11
 Inode size:   256
 Required extra isize: 28
 Desired extra isize:  28
 Journal inode:8
 Default directory hash:   half_md4
 Directory Hash Seed: 148ee5dd-7ee0-470c-a08a-b11c318ff90b
 Journal backup:   inode blocks
 
 *fsck.ext4 /dev/sda1*
 e2fsck 1.42.5 (29-Jul-2012)
 /dev/sda1: clean, 3335808/3335808 files, 7668840/13342945 blocks
 
 23.03.2015 17:09, Christian Balzer wrote:
  On Mon, 23 Mar 2015 15:26:07 +0300 Kamil Kuramshin wrote:
 
  Yes, I understand that.
 
  The initial purpose of my first email was just advice for newcomers.
  My mistake was selecting ext4 as the backend filesystem for the SSD disks.
  But I did not foresee that the inode count could reach its limit before
  the free space does :)
 
  And maybe there should be a warning not only for free space
  in MiB (GiB, TiB), but also a dedicated warning about free
  inodes for filesystems with static inode allocation like ext4.
  Because once an OSD reaches the inode limit it becomes totally unusable and
  immediately goes down, and from that moment there is no way to start
  it!
 
  While all that is true and should probably be addressed, please re-read
  what I wrote before.
 
  With the 3.3 million inodes used and thus likely as many files (did you
  verify this?) and 4MB objects that would make something in the 12TB
  ballpark area.
 
  Something very very strange and wrong is going on with your cache tier.
 
  Christian
 
  23.03.2015 13:42, Thomas Foster wrote:
  You could fix this by changing your block size when formatting the
  mount-point with the mkfs -b command.  I had this same issue when
  dealing with the filesystem using glusterfs and the solution is to
  either use a filesystem that allocates inodes automatically or change
  the block size when you build the filesystem.  Unfortunately, the
  only way to fix the problem that I have seen is to reformat
 
  On Mon, Mar 23, 2015 at 5:51 AM, Kamil Kuramshin
  kamil.kurams...@tatar.ru mailto:kamil.kurams...@tatar.ru wrote:
 
   In my case it was a cache pool for an EC pool serving RBD images;
   the object size is 4 MB, and the client was a /kernel-rbd/ client.
   Each SSD is a 60 GB disk, 2 disks per node, 6 nodes in total =
   12 OSDs in total.
 
 
   23.03.2015 12:00, Christian Balzer wrote:
   Hello,
 
   This is rather confusing, as cache-tiers are just normal
  OSDs/pools and thus should have Ceph objects of around 4MB in size
  by default.
 
   This is matched on what I see with Ext4 here (normal OSD, not a
  cache tier):
   

Re: [ceph-users] Issue with free Inodes

2015-03-23 Thread Kamil Kuramshin

Yes, I understand that.

The initial purpose of my first email was just advice for newcomers. My 
mistake was selecting ext4 as the backend filesystem for the SSD disks.
But I did not foresee that the inode count could reach its limit before the 
free space does :)


And maybe there should be a warning not only for free space in 
MiB (GiB, TiB), but also a dedicated warning about free inodes 
for filesystems with static inode allocation like ext4.
Because once an OSD reaches the inode limit it becomes totally unusable and 
immediately goes down, and from that moment there is no way to start it!
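
Until Ceph warns about this itself, a crude external check can catch it early; a minimal sketch (the 90% threshold and the paths are arbitrary examples), e.g. run from cron on every OSD node:

$ df -iP /var/lib/ceph/osd/ceph-* | awk 'NR>1 { sub(/%/,"",$5); if ($5+0 >= 90) print "WARNING: " $5 "% of inodes used on " $6 }'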



23.03.2015 13:42, Thomas Foster wrote:
You could fix this by changing your block size when formatting the 
mount-point with the mkfs -b command.  I had this same issue when 
dealing with the filesystem using glusterfs and the solution is to 
either use a filesystem that allocates inodes automatically or change 
the block size when you build the filesystem.  Unfortunately, the only 
way to fix the problem that I have seen is to reformat


On Mon, Mar 23, 2015 at 5:51 AM, Kamil Kuramshin 
kamil.kurams...@tatar.ru mailto:kamil.kurams...@tatar.ru wrote:


In my case it was a cache pool for an EC pool serving RBD images;
the object size is 4 MB, and the client was a /kernel-rbd/ client.
Each SSD is a 60 GB disk, 2 disks per node, 6 nodes in total = 12
OSDs in total.


23.03.2015 12:00, Christian Balzer wrote:

Hello,

This is rather confusing, as cache-tiers are just normal OSDs/pools and
thus should have Ceph objects of around 4MB in size by default.

This is matched on what I see with Ext4 here (normal OSD, not a cache
tier):
---
size:
/dev/sde1   2.7T  204G  2.4T   8% /var/lib/ceph/osd/ceph-0
inodes:
/dev/sde1  183148544  55654  183092890    1% /var/lib/ceph/osd/ceph-0
---

On a more fragmented cluster I see a 5:1 size to inode ratio.

I just can't fathom how there could be 3.3 million inodes (and thus a
close number of files) using 30G, making the average file size below 10
Bytes.

Something other than your choice of file system is probably at play here.

How fragmented are those SSDs?
What's your default Ceph object size?
Where _are_ those 3 million files in that OSD, are they actually in the
object files like:
-rw-r--r-- 1 root root 4194304 Jan  9 15:27 
/var/lib/ceph/osd/ceph-0/current/3.117_head/DIR_7/DIR_1/DIR_5/rb.0.23a8f.238e1f29.00027632__head_C4F3D517__3

What's your use case, RBD, CephFS, RadosGW?

Regards,

Christian

On Mon, 23 Mar 2015 10:32:55 +0300 Kamil Kuramshin wrote:


Recently got a problem with OSDs based on SSD disks used in cache tier
for EC-pool

superuser@node02:~$ df -i
Filesystem       Inodes   IUsed  *IFree* IUse% Mounted on
...
/dev/sdb1       3335808 3335808      *0*  100% /var/lib/ceph/osd/ceph-45
/dev/sda1       3335808 3335808      *0*  100% /var/lib/ceph/osd/ceph-46

Now that OSDs are down on each ceph-node and cache tiering is not
working.

superuser@node01:~$ sudo tail /var/log/ceph/ceph-osd.45.log
2015-03-23 10:04:23.631137 7fb105345840  0 ceph version 0.87.1
(283c2e7cfa2457799f534744d7d549f83ea1335e), process ceph-osd, pid 1453465
2015-03-23 10:04:23.640676 7fb105345840  0
filestore(/var/lib/ceph/osd/ceph-45) backend generic (magic 0xef53)
2015-03-23 10:04:23.640735 7fb105345840 -1
genericfilestorebackend(/var/lib/ceph/osd/ceph-45) detect_features:
unable to create /var/lib/ceph/osd/ceph-45/fiemap_test: (28) No space
left on device
2015-03-23 10:04:23.640763 7fb105345840 -1
filestore(/var/lib/ceph/osd/ceph-45) _detect_fs: detect_features error:
(28) No space left on device
2015-03-23 10:04:23.640772 7fb105345840 -1
filestore(/var/lib/ceph/osd/ceph-45) FileStore::mount : error in
_detect_fs: (28) No space left on device
2015-03-23 10:04:23.640783 7fb105345840 -1  ** ERROR: error converting
store /var/lib/ceph/osd/ceph-45: (28) *No space left on device*

At the same time, *df -h* is confusing:

superuser@node01:~$ df -h
Filesystem      Size  Used *Avail* Use% Mounted on
...
/dev/sda1        50G   29G   *20G*  60% /var/lib/ceph/osd/ceph-45
/dev/sdb1        50G   27G   *21G*  56% /var/lib/ceph/osd/ceph-46


Filesystem used on affected OSDs is EXt4. All OSDs are deployed with
ceph-deploy:
$ ceph-deploy osd create --zap-disk --fs-type ext4 node-name:device


Luckily it was just a test deployment: all EC-pool data was
lost, since I /could not start the OSDs/ and the ceph cluster /became degraded/
until I removed all affected tiered pools (cache & EC).
So this is just my observation of what kind of problems can be faced if
you choose the wrong filesystem for the OSD backend.
And now I *strongly* recommend you to choose *XFS* or *Btrfs* filesystems,
because both support dynamic inode allocation and this problem cannot arise with them.

Re: [ceph-users] Issue with free Inodes

2015-03-23 Thread Thomas Foster
You could fix this by changing your block size when formatting the
mount-point with the mkfs -b command.  I had this same issue when dealing
with the filesystem using glusterfs and the solution is to either use a
filesystem that allocates inodes automatically or change the block size
when you build the filesystem.  Unfortunately, the only way to fix the
problem that I have seen is to reformat.
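
As a sketch of that: on ext4 the inode count is fixed at mkfs time and is mainly controlled by the bytes-per-inode ratio (-i) or an explicit inode count (-N) rather than the block size alone, so a reformat for this workload could look like the first command below; alternatively the OSD can be redeployed on a filesystem with dynamic inode allocation (device and host names are placeholders):

$ mkfs.ext4 -i 8192 /dev/sdX1                                 # one inode per 8 KiB of space instead of the 16 KiB default
$ ceph-deploy osd create --zap-disk --fs-type xfs node01:sdX  # or redeploy the OSD on XFS, which allocates inodes dynamically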

On Mon, Mar 23, 2015 at 5:51 AM, Kamil Kuramshin kamil.kurams...@tatar.ru
wrote:

  In my case there was cache pool for ec-pool serving RBD-images, and
 object size is 4Mb, and client was an *kernel-rbd *client
 each SSD disk is 60G disk, 2 disk per node,  6 nodes in total = 12 OSDs in
 total


 23.03.2015 12:00, Christian Balzer wrote:

 Hello,

 This is rather confusing, as cache-tiers are just normal OSDs/pools and
 thus should have Ceph objects of around 4MB in size by default.

 This is matched on what I see with Ext4 here (normal OSD, not a cache
 tier):
 ---
 size:
 /dev/sde1   2.7T  204G  2.4T   8% /var/lib/ceph/osd/ceph-0
 inodes:
 /dev/sde1  183148544  55654  183092890    1% /var/lib/ceph/osd/ceph-0
 ---

 On a more fragmented cluster I see a 5:1 size to inode ratio.

 I just can't fathom how there could be 3.3 million inodes (and thus a
 close number of files) using 30G, making the average file size below 10
 Bytes.

 Something other than your choice of file system is probably at play here.

 How fragmented are those SSDs?
 What's your default Ceph object size?
 Where _are_ those 3 million files in that OSD, are they actually in the
 object files like:
 -rw-r--r-- 1 root root 4194304 Jan  9 15:27 
 /var/lib/ceph/osd/ceph-0/current/3.117_head/DIR_7/DIR_1/DIR_5/rb.0.23a8f.238e1f29.00027632__head_C4F3D517__3

 What's your use case, RBD, CephFS, RadosGW?

 Regards,

 Christian

 On Mon, 23 Mar 2015 10:32:55 +0300 Kamil Kuramshin wrote:


  Recently got a problem with OSDs based on SSD disks used in cache tier
 for EC-pool

 superuser@node02:~$ df -i
 Filesystem       Inodes   IUsed  *IFree* IUse% Mounted on
 ...
 /dev/sdb1       3335808 3335808      *0*  100% /var/lib/ceph/osd/ceph-45
 /dev/sda1       3335808 3335808      *0*  100% /var/lib/ceph/osd/ceph-46

 Now that OSDs are down on each ceph-node and cache tiering is not
 working.

 superuser@node01:~$ sudo tail /var/log/ceph/ceph-osd.45.log
 2015-03-23 10:04:23.631137 7fb105345840  0 ceph version 0.87.1
 (283c2e7cfa2457799f534744d7d549f83ea1335e), process ceph-osd, pid 1453465
 2015-03-23 10:04:23.640676 7fb105345840  0
 filestore(/var/lib/ceph/osd/ceph-45) backend generic (magic 0xef53)
 2015-03-23 10:04:23.640735 7fb105345840 -1
 genericfilestorebackend(/var/lib/ceph/osd/ceph-45) detect_features:
 unable to create /var/lib/ceph/osd/ceph-45/fiemap_test: (28) No space
 left on device
 2015-03-23 10:04:23.640763 7fb105345840 -1
 filestore(/var/lib/ceph/osd/ceph-45) _detect_fs: detect_features error:
 (28) No space left on device
 2015-03-23 10:04:23.640772 7fb105345840 -1
 filestore(/var/lib/ceph/osd/ceph-45) FileStore::mount : error in
 _detect_fs: (28) No space left on device
 2015-03-23 10:04:23.640783 7fb105345840 -1  ** ERROR: error converting
 store /var/lib/ceph/osd/ceph-45: (28) *No space left on device*

 At the same time, *df -h* is confusing:

 superuser@node01:~$ df -h
 Filesystem      Size  Used *Avail* Use% Mounted on
 ...
 /dev/sda1        50G   29G   *20G*  60% /var/lib/ceph/osd/ceph-45
 /dev/sdb1        50G   27G   *21G*  56% /var/lib/ceph/osd/ceph-46


 Filesystem used on affected OSDs is EXt4. All OSDs are deployed with
 ceph-deploy:
 $ ceph-deploy osd create --zap-disk --fs-type ext4 node-name:device


 Luckily it was just a test deployment: all EC-pool data was
 lost, since I /could not start the OSDs/ and the ceph cluster /became degraded/
 until I removed all affected tiered pools (cache & EC).
 So this is just my observation of what kind of problems can be faced if
 you choose the wrong filesystem for the OSD backend.
 And now I *strongly* recommend you to choose *XFS* or *Btrfs* filesystems,
 because both support dynamic inode allocation and this problem
 cannot arise with them.







Re: [ceph-users] Issue with free Inodes

2015-03-23 Thread Christian Balzer
On Mon, 23 Mar 2015 15:26:07 +0300 Kamil Kuramshin wrote:

 Yes, I understand that.
 
 The initial purpose of my first email was just advice for newcomers. My 
 mistake was selecting ext4 as the backend filesystem for the SSD disks.
 But I did not foresee that the inode count could reach its limit before the 
 free space does :)
 
 And maybe there should be a warning not only for free space in 
 MiB (GiB, TiB), but also a dedicated warning about free inodes 
 for filesystems with static inode allocation like ext4.
 Because once an OSD reaches the inode limit it becomes totally unusable and 
 immediately goes down, and from that moment there is no way to start it!
 
While all that is true and should probably be addressed, please re-read
what I wrote before.

With the 3.3 million inodes used and thus likely as many files (did you
verify this?) and 4MB objects that would make something in the 12TB
ballpark area.
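
The arithmetic behind that ballpark figure, for reference:

$ echo $(( 3335808 * 4 ))   # 13343232 MiB, i.e. roughly 12.7 TiB if every file were a full 4 MB object, versus the ~30 GB actually used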

Something very very strange and wrong is going on with your cache tier.

Christian

 
 23.03.2015 13:42, Thomas Foster wrote:
  You could fix this by changing your block size when formatting the 
  mount-point with the mkfs -b command.  I had this same issue when 
  dealing with the filesystem using glusterfs and the solution is to 
  either use a filesystem that allocates inodes automatically or change 
  the block size when you build the filesystem.  Unfortunately, the only 
  way to fix the problem that I have seen is to reformat
 
  On Mon, Mar 23, 2015 at 5:51 AM, Kamil Kuramshin 
  kamil.kurams...@tatar.ru mailto:kamil.kurams...@tatar.ru wrote:
 
  In my case it was a cache pool for an EC pool serving RBD images;
  the object size is 4 MB, and the client was a /kernel-rbd/ client.
  Each SSD is a 60 GB disk, 2 disks per node, 6 nodes in total = 12
  OSDs in total.
 
 
  23.03.2015 12:00, Christian Balzer wrote:
  Hello,
 
  This is rather confusing, as cache-tiers are just normal
  OSDs/pools and thus should have Ceph objects of around 4MB in size by
  default.
 
  This is matched on what I see with Ext4 here (normal OSD, not a
  cache tier):
  ---
  size:
  /dev/sde1   2.7T  204G  2.4T   8% /var/lib/ceph/osd/ceph-0
  inodes:
  /dev/sde1  183148544  55654  183092890    1% /var/lib/ceph/osd/ceph-0
  ---
 
  On a more fragmented cluster I see a 5:1 size to inode ratio.
 
  I just can't fathom how there could be 3.3 million inodes (and
  thus a close number of files) using 30G, making the average file size
  below 10 Bytes.
 
  Something other than your choice of file system is probably at
  play here.
 
  How fragmented are those SSDs?
  What's your default Ceph object size?
  Where _are_ those 3 million files in that OSD, are they actually
  in the object files like:
  -rw-r--r-- 1 root root 4194304 Jan  9
  15:27 
  /var/lib/ceph/osd/ceph-0/current/3.117_head/DIR_7/DIR_1/DIR_5/rb.0.23a8f.238e1f29.00027632__head_C4F3D517__3
 
  What's your use case, RBD, CephFS, RadosGW?
 
  Regards,
 
  Christian
 
  On Mon, 23 Mar 2015 10:32:55 +0300 Kamil Kuramshin wrote:
 
  Recently got a problem with OSDs based on SSD disks used in
  cache tier for EC-pool
 
  superuser@node02:~$ df -i
  Filesystem       Inodes   IUsed  *IFree* IUse% Mounted on
  ...
  /dev/sdb1       3335808 3335808      *0*  100% /var/lib/ceph/osd/ceph-45
  /dev/sda1       3335808 3335808      *0*  100% /var/lib/ceph/osd/ceph-46
 
  Now that OSDs are down on each ceph-node and cache tiering is not
  working.
 
  superuser@node01:~$ sudo tail /var/log/ceph/ceph-osd.45.log
  2015-03-23 10:04:23.631137 7fb105345840  0 ceph version 0.87.1
  (283c2e7cfa2457799f534744d7d549f83ea1335e), process ceph-osd,
  pid 1453465 2015-03-23 10:04:23.640676 7fb105345840  0
  filestore(/var/lib/ceph/osd/ceph-45) backend generic (magic
  0xef53) 2015-03-23 10:04:23.640735 7fb105345840 -1
  genericfilestorebackend(/var/lib/ceph/osd/ceph-45)
  detect_features: unable to
  create /var/lib/ceph/osd/ceph-45/fiemap_test: (28) No space left on
  device 2015-03-23 10:04:23.640763 7fb105345840 -1
  filestore(/var/lib/ceph/osd/ceph-45) _detect_fs: detect_features
  error: (28) No space left on device
  2015-03-23 10:04:23.640772 7fb105345840 -1
  filestore(/var/lib/ceph/osd/ceph-45) FileStore::mount : error in
  _detect_fs: (28) No space left on device
  2015-03-23 10:04:23.640783 7fb105345840 -1  ** ERROR: error
  converting store /var/lib/ceph/osd/ceph-45: (28) *No space left on
  device*
 
  At the same time, *df -h* is confusing:
 
  superuser@node01:~$ df -h
  Filesystem      Size  Used *Avail* Use% Mounted on
  ...
  /dev/sda1        50G   29G   *20G*  60% /var/lib/ceph/osd/ceph-45
  /dev/sdb1        50G   27G   *21G*  56% /var/lib/ceph/osd/ceph-46
 
 
  Filesystem used on affected OSDs is EXt4. All OSDs are deployed
 

Re: [ceph-users] Issue with free Inodes

2015-03-23 Thread Kamil Kuramshin
In my case it was a cache pool for an EC pool serving RBD images; the 
object size is 4 MB, and the client was a /kernel-rbd/ client.
Each SSD is a 60 GB disk, 2 disks per node, 6 nodes in total = 12 OSDs 
in total.
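
For context, a cache tier of that kind is attached to an erasure-coded pool roughly as follows (pool names here are illustrative, not the actual ones from this cluster):

$ ceph osd tier add ecpool ssd-cache             # attach the cache pool to the EC pool
$ ceph osd tier cache-mode ssd-cache writeback   # the cache absorbs writes and later flushes them to the EC pool
$ ceph osd tier set-overlay ecpool ssd-cache     # redirect client I/O for ecpool through the cache tier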



23.03.2015 12:00, Christian Balzer wrote:

Hello,

This is rather confusing, as cache-tiers are just normal OSDs/pools and
thus should have Ceph objects of around 4MB in size by default.

This is matched on what I see with Ext4 here (normal OSD, not a cache
tier):
---
size:
/dev/sde1   2.7T  204G  2.4T   8% /var/lib/ceph/osd/ceph-0
inodes:
/dev/sde1  183148544  55654  183092890    1% /var/lib/ceph/osd/ceph-0
---

On a more fragmented cluster I see a 5:1 size to inode ratio.

I just can't fathom how there could be 3.3 million inodes (and thus a
close number of files) using 30G, making the average file size below 10
Bytes.

Something other than your choice of file system is probably at play here.

How fragmented are those SSDs?
What's your default Ceph object size?
Where _are_ those 3 million files in that OSD, are they actually in the
object files like:
-rw-r--r-- 1 root root 4194304 Jan  9 15:27 
/var/lib/ceph/osd/ceph-0/current/3.117_head/DIR_7/DIR_1/DIR_5/rb.0.23a8f.238e1f29.00027632__head_C4F3D517__3

What's your use case, RBD, CephFS, RadosGW?

Regards,

Christian

On Mon, 23 Mar 2015 10:32:55 +0300 Kamil Kuramshin wrote:


Recently got a problem with OSDs based on SSD disks used in cache tier
for EC-pool

superuser@node02:~$ df -i
Filesystem       Inodes   IUsed  *IFree* IUse% Mounted on
...
/dev/sdb1       3335808 3335808      *0*  100% /var/lib/ceph/osd/ceph-45
/dev/sda1       3335808 3335808      *0*  100% /var/lib/ceph/osd/ceph-46

Now that OSDs are down on each ceph-node and cache tiering is not
working.

superuser@node01:~$ sudo tail /var/log/ceph/ceph-osd.45.log
2015-03-23 10:04:23.631137 7fb105345840  0 ceph version 0.87.1
(283c2e7cfa2457799f534744d7d549f83ea1335e), process ceph-osd, pid 1453465
2015-03-23 10:04:23.640676 7fb105345840  0
filestore(/var/lib/ceph/osd/ceph-45) backend generic (magic 0xef53)
2015-03-23 10:04:23.640735 7fb105345840 -1
genericfilestorebackend(/var/lib/ceph/osd/ceph-45) detect_features:
unable to create /var/lib/ceph/osd/ceph-45/fiemap_test: (28) No space
left on device
2015-03-23 10:04:23.640763 7fb105345840 -1
filestore(/var/lib/ceph/osd/ceph-45) _detect_fs: detect_features error:
(28) No space left on device
2015-03-23 10:04:23.640772 7fb105345840 -1
filestore(/var/lib/ceph/osd/ceph-45) FileStore::mount : error in
_detect_fs: (28) No space left on device
2015-03-23 10:04:23.640783 7fb105345840 -1  ** ERROR: error converting
store /var/lib/ceph/osd/ceph-45: (28) *No space left on device*

At the same time, *df -h* is confusing:

superuser@node01:~$ df -h
Filesystem      Size  Used *Avail* Use% Mounted on
...
/dev/sda1        50G   29G   *20G*  60% /var/lib/ceph/osd/ceph-45
/dev/sdb1        50G   27G   *21G*  56% /var/lib/ceph/osd/ceph-46


Filesystem used on affected OSDs is EXt4. All OSDs are deployed with
ceph-deploy:
$ ceph-deploy osd create --zap-disk --fs-type ext4 node-name:device


Luckily it was just a test deployment: all EC-pool data was
lost, since I /could not start the OSDs/ and the ceph cluster /became degraded/
until I removed all affected tiered pools (cache & EC).
So this is just my observation of what kind of problems can be faced if
you choose the wrong filesystem for the OSD backend.
And now I *strongly* recommend you to choose *XFS* or *Btrfs* filesystems,
because both support dynamic inode allocation and this problem
cannot arise with them.








Re: [ceph-users] Issue with free Inodes

2015-03-23 Thread Christian Balzer

Hello,

This is rather confusing, as cache-tiers are just normal OSDs/pools and
thus should have Ceph objects of around 4MB in size by default.

This is matched on what I see with Ext4 here (normal OSD, not a cache
tier):
---
size:
/dev/sde1   2.7T  204G  2.4T   8% /var/lib/ceph/osd/ceph-0
inodes:
/dev/sde1  183148544  55654  183092890    1% /var/lib/ceph/osd/ceph-0
---

On a more fragmented cluster I see a 5:1 size to inode ratio.

I just can't fathom how there could be 3.3 million inodes (and thus a
close number of files) using 30G, making the average file size below 10
Bytes. 

Something other than your choice of file system is probably at play here.

How fragmented are those SSDs?
What's your default Ceph object size?
Where _are_ those 3 million files in that OSD, are they actually in the
object files like:
-rw-r--r-- 1 root root 4194304 Jan  9 15:27 
/var/lib/ceph/osd/ceph-0/current/3.117_head/DIR_7/DIR_1/DIR_5/rb.0.23a8f.238e1f29.00027632__head_C4F3D517__3

What's your use case, RBD, CephFS, RadosGW?
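
Those questions can be answered directly on one of the affected OSDs, along these lines (path as in the logs above; the size histogram shows whether the files are real 4 MB objects or something tiny):

$ sudo find /var/lib/ceph/osd/ceph-45/current -type f -printf '%s\n' | sort -n | uniq -c | sort -rn | head   # most common file sizes, by count
$ sudo sh -c 'for d in /var/lib/ceph/osd/ceph-45/current/*_head; do echo "$(find "$d" -type f | wc -l) $d"; done' | sort -rn | head   # files per PG directory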

Regards,

Christian

On Mon, 23 Mar 2015 10:32:55 +0300 Kamil Kuramshin wrote:

 Recently got a problem with OSDs based on SSD disks used in cache tier 
 for EC-pool
 
 superuser@node02:~$ df -i
 Filesystem       Inodes   IUsed  *IFree* IUse% Mounted on
 ...
 /dev/sdb1       3335808 3335808      *0*  100% /var/lib/ceph/osd/ceph-45
 /dev/sda1       3335808 3335808      *0*  100% /var/lib/ceph/osd/ceph-46
 
 Now that OSDs are down on each ceph-node and cache tiering is not
 working.
 
 superuser@node01:~$ sudo tail /var/log/ceph/ceph-osd.45.log
 2015-03-23 10:04:23.631137 7fb105345840  0 ceph version 0.87.1 
 (283c2e7cfa2457799f534744d7d549f83ea1335e), process ceph-osd, pid 1453465
 2015-03-23 10:04:23.640676 7fb105345840  0 
 filestore(/var/lib/ceph/osd/ceph-45) backend generic (magic 0xef53)
 2015-03-23 10:04:23.640735 7fb105345840 -1 
 genericfilestorebackend(/var/lib/ceph/osd/ceph-45) detect_features: 
 unable to create /var/lib/ceph/osd/ceph-45/fiemap_test: (28) No space 
 left on device
 2015-03-23 10:04:23.640763 7fb105345840 -1 
 filestore(/var/lib/ceph/osd/ceph-45) _detect_fs: detect_features error: 
 (28) No space left on device
 2015-03-23 10:04:23.640772 7fb105345840 -1 
 filestore(/var/lib/ceph/osd/ceph-45) FileStore::mount : error in 
 _detect_fs: (28) No space left on device
 2015-03-23 10:04:23.640783 7fb105345840 -1  ** ERROR: error converting 
 store /var/lib/ceph/osd/ceph-45: (28) *No space left on device*
 
 At the same time, *df -h* is confusing:
 
 superuser@node01:~$ df -h
 Filesystem      Size  Used *Avail* Use% Mounted on
 ...
 /dev/sda1        50G   29G   *20G*  60% /var/lib/ceph/osd/ceph-45
 /dev/sdb1        50G   27G   *21G*  56% /var/lib/ceph/osd/ceph-46
 
 
 Filesystem used on affected OSDs is EXt4. All OSDs are deployed with 
 ceph-deploy:
 $ ceph-deploy osd create --zap-disk --fs-type ext4 node-name:device
 
 
 Luckily it was just a test deployment: all EC-pool data was
 lost, since I /could not start the OSDs/ and the ceph cluster /became degraded/
 until I removed all affected tiered pools (cache & EC).
 So this is just my observation of what kind of problems can be faced if
 you choose the wrong filesystem for the OSD backend.
 And now I *strongly* recommend you to choose *XFS* or *Btrfs* filesystems,
 because both support dynamic inode allocation and this problem
 cannot arise with them.
 
 


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/