Re: [ceph-users] Issue with free Inodes
Maybe someone can shed new light on this:

1. Only the SSD-cache OSDs are affected by this issue.
2. Total cache OSD count is 12 x 60 GiB; the backend filesystem is ext4.
3. I created two cache-tier pools with replica size=3 on those OSDs, both with pg_num: 400 and pgp_num: 400.
4. There is a CRUSH ruleset for gathering all SSD OSDs from all nodes by *disktype*:

superuser@admin:~$ ceph osd crush rule dump ssd
{ "rule_id": 3,
  "rule_name": "ssd",
  "ruleset": 3,
  "type": 1,
  "min_size": 1,
  "max_size": 10,
  "steps": [
        { "op": "take", "item": -21, "item_name": "ssd" },
        { "op": "chooseleaf_firstn", "num": 0, "type": "disktype" },
        { "op": "emit" } ] }

I guess a lot of *directories* may have been created on the filesystem to organize placement groups; can that explain such a large number of inodes being occupied by directory entries?

24.03.2015 16:52, Gregory Farnum wrote:
Nope. As you've said, this doesn't make any sense unless the objects are all ludicrously small (and you can't actually get 10-byte objects in Ceph; the names alone tend to be bigger than that) or something else is using up inodes.
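A quick way to test the directory theory is to count directories, regular files, and zero-length files under the OSD data directory. A sketch, assuming the filesystem can still be mounted and using the ceph-45 path quoted later in the thread:

    # Hypothetical OSD mount point, taken from the log excerpts below.
    OSD=/var/lib/ceph/osd/ceph-45

    # How many of the used inodes are directories vs. regular files?
    find "$OSD/current" -xdev -type d | wc -l
    find "$OSD/current" -xdev -type f | wc -l

    # Zero-length object files would consume inodes without using data blocks.
    find "$OSD/current" -xdev -type f -size 0 | wc -l

    # Top ten PG directories by file count, to see whether the files are
    # spread evenly across PGs or concentrated in a few of them.
    for d in "$OSD"/current/*_head; do
        printf '%s %s\n' "$(find "$d" -type f | wc -l)" "$d"
    done | sort -rn | head

For scale: two pools of 400 PGs each at size 3 spread over 12 OSDs works out to roughly 200 PG directories per OSD, so even with filestore subdirectory splitting it is hard to see directory inodes alone reaching 3.3 million; the counts above should show whether files or directories dominate.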
Re: [ceph-users] Issue with free Inodes
On Tue, Mar 24, 2015 at 12:13 AM, Christian Balzer ch...@gol.com wrote:
What I mean is: how/why did Ceph create 3+ million files, where in the tree are they actually, and are they evenly distributed in the respective PG sub-directories? Or, to ask it differently, how large is your cluster (how many OSDs, objects); in short, the output of ceph -s. If cache tiers actually are reserving each object that exists on the backing store (even if there isn't data in it yet on the cache tier) and your cluster is large enough, it might explain this.

Nope. As you've said, this doesn't make any sense unless the objects are all ludicrously small (and you can't actually get 10-byte objects in Ceph; the names alone tend to be bigger than that) or something else is using up inodes.
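For reference, the arithmetic behind the "ludicrously small objects" point can be read straight off df. A sketch, assuming GNU coreutils df (for --output) and the ceph-45 mount point quoted later in the thread:

    # Hypothetical OSD mount point.
    OSD=/var/lib/ceph/osd/ceph-45

    # Bytes used and inodes used on the affected filesystem.
    used_bytes=$(df -B1 --output=used "$OSD" | tail -1)
    used_inodes=$(df --output=iused "$OSD" | tail -1)

    # Implied average file size; with roughly 29 GB used and 3.3 million
    # inodes this comes out around 9 KB per inode, nowhere near the 4 MB
    # default RADOS object size.
    echo $(( used_bytes / used_inodes ))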
Re: [ceph-users] Issue with free Inodes
Yes, I read it and do not understand what you mean when you say *verify this*. All 3335808 inodes are definitely files and directories created by the ceph OSD process:

*tune2fs 1.42.5 (29-Jul-2012)*
Filesystem volume name:   <none>
Last mounted on:          /var/lib/ceph/tmp/mnt.05NAJ3
Filesystem UUID:          e4dcca8a-7b68-4f60-9b10-c164dc7f9e33
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags:         signed_directory_hash
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
*Inode count:             3335808*
Block count:              13342945
Reserved block count:     667147
Free blocks:              5674105
*Free inodes:             0*
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      1020
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8176
Inode blocks per group:   511
Flex block group size:    16
Filesystem created:       Fri Feb 20 16:44:25 2015
Last mount time:          Tue Mar 24 09:33:19 2015
Last write time:          Tue Mar 24 09:33:27 2015
Mount count:              7
Maximum mount count:      -1
Last checked:             Fri Feb 20 16:44:25 2015
Check interval:           0 (<none>)
Lifetime writes:          4116 GB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     28
Desired extra isize:      28
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      148ee5dd-7ee0-470c-a08a-b11c318ff90b
Journal backup:           inode blocks

*fsck.ext4 /dev/sda1*
e2fsck 1.42.5 (29-Jul-2012)
/dev/sda1: clean, 3335808/3335808 files, 7668840/13342945 blocks

23.03.2015 17:09, Christian Balzer wrote:
With the 3.3 million inodes used and thus likely as many files (did you verify this?) and 4MB objects that would make something in the 12TB ballpark area. Something very very strange and wrong is going on with your cache tier.
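To pin down where those 3.3 million inodes actually live, GNU du can break the count down per directory. A sketch, assuming coreutils 8.22 or newer (for --inodes) and the same ceph-45 path:

    OSD=/var/lib/ceph/osd/ceph-45

    # Inode usage per top-level directory of the OSD data dir, largest first.
    du --inodes -x --max-depth=1 "$OSD" | sort -rn | head

    # One level deeper into current/ shows which PG directories hold the
    # most entries.
    du --inodes -x --max-depth=1 "$OSD/current" | sort -rn | head -20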
Re: [ceph-users] Issue with free Inodes
On Tue, 24 Mar 2015 09:41:04 +0300 Kamil Kuramshin wrote:
Yes, I read it and do not understand what you mean when you say *verify this*. All 3335808 inodes are definitely files and directories created by the ceph OSD process.

What I mean is: how/why did Ceph create 3+ million files, where in the tree are they actually, and are they evenly distributed in the respective PG sub-directories? Or, to ask it differently, how large is your cluster (how many OSDs, objects); in short, the output of ceph -s.

If cache tiers actually are reserving each object that exists on the backing store (even if there isn't data in it yet on the cache tier) and your cluster is large enough, it might explain this. And that should both be mentioned, and precautions against running out of inodes should be taken in the Ceph code. If not, this may be a bug after all. Would be nice if somebody from the Ceph devs could have a gander at this.

Christian
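The cluster-wide and per-pool object counts Christian is asking for come from the standard status commands; a short sketch, with nothing cluster-specific assumed:

    # Cluster summary: OSD count, PG count, total data and object count.
    ceph -s

    # Per-pool breakdown; compare the cache pools' object counts against
    # the ~3.3 million inodes used on each 60 GB cache OSD.
    ceph df detail
    rados df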
Re: [ceph-users] Issue with free Inodes
Yes, I understand that. The initial purpose of my first email was just to give advice to newcomers. My fault was that I selected ext4 as the backend for the SSD disks, and I did not foresee that the inode count could hit its limit before the free space runs out :) Maybe there should be not only a warning for free space in MiB/GiB/TiB but also a dedicated warning about free inodes for filesystems with static inode allocation like ext4, because once an OSD reaches the inode limit it becomes totally unusable and immediately goes down, and from that moment there is no way to start it!

23.03.2015 13:42, Thomas Foster wrote:
You could fix this by changing your block size when formatting the mount point with the mkfs -b option. I had this same issue when dealing with the filesystem using glusterfs, and the solution is to either use a filesystem that allocates inodes automatically or change the block size when you build the filesystem.
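The missing warning Kamil describes is easy to approximate outside of Ceph with a small cron job. A minimal sketch; the 90% threshold and the OSD mount-point glob are arbitrary assumptions:

    #!/bin/sh
    # Warn when any OSD filesystem has used more than THRESHOLD percent
    # of its inodes.
    THRESHOLD=90

    df -iP /var/lib/ceph/osd/ceph-* | tail -n +2 | \
    while read fs inodes iused ifree ipct mnt; do
        pct=${ipct%\%}
        if [ "$pct" -ge "$THRESHOLD" ]; then
            echo "WARNING: $mnt has used ${pct}% of its inodes ($ifree free)"
        fi
    done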
Re: [ceph-users] Issue with free Inodes
You could fix this by changing your block size when formatting the mount point with the mkfs -b option. I had this same issue when dealing with the filesystem using glusterfs, and the solution is to either use a filesystem that allocates inodes automatically or change the block size when you build the filesystem. Unfortunately, once it has happened, the only way to fix the problem that I have seen is to reformat.

On Mon, Mar 23, 2015 at 5:51 AM, Kamil Kuramshin kamil.kurams...@tatar.ru wrote:
In my case there was a cache pool for an EC pool serving RBD images, and the object size is 4 MB, and the client was a *kernel-rbd* client. Each SSD disk is a 60G disk, 2 disks per node, 6 nodes in total = 12 OSDs in total.
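If ext4 has to be used, the inode count can also be raised explicitly at mkfs time instead of indirectly through the block size. A sketch; the device name and values are examples, not commands from this thread:

    # ext4 allocates inodes statically at mkfs time, by default roughly one
    # inode per 16 KB of capacity. Lower the bytes-per-inode ratio, or set
    # an absolute inode count, when formatting an OSD that will hold many
    # small files:
    mkfs.ext4 -i 4096 /dev/sdb1          # one inode per 4 KB of space
    mkfs.ext4 -N 12000000 /dev/sdb1      # or: an explicit inode count

    # Verify before deploying the OSD:
    tune2fs -l /dev/sdb1 | grep -i 'inode count'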
Re: [ceph-users] Issue with free Inodes
On Mon, 23 Mar 2015 15:26:07 +0300 Kamil Kuramshin wrote:
Yes, I understand that. The initial purpose of my first email was just to give advice to newcomers.

While all that is true and should probably be addressed, please re-read what I wrote before. With the 3.3 million inodes used, and thus likely as many files (did you verify this?), and 4 MB objects, that would make something in the 12 TB ballpark area. Something very, very strange and wrong is going on with your cache tier.

Christian
Re: [ceph-users] Issue with free Inodes
In my case there was a cache pool for an EC pool serving RBD images, and the object size is 4 MB, and the client was a /kernel-rbd/ client. Each SSD disk is a 60G disk, 2 disks per node, 6 nodes in total = 12 OSDs in total.

23.03.2015 12:00, Christian Balzer wrote:
How fragmented are those SSDs? What's your default Ceph object size? Where _are_ those 3 million files in that OSD? What's your use case, RBD, CephFS, RadosGW?
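For context, a writeback cache tier in front of an EC pool is normally wired up and bounded along these lines; a sketch with example pool names and sizes, not the commands actually used here. Without target_max_bytes or target_max_objects the tiering agent has no limit to flush and evict against, which makes it easy for a small SSD cache pool to fill up:

    # Attach a replicated cache pool to the EC pool and enable writeback.
    ceph osd tier add ecpool cachepool
    ceph osd tier cache-mode cachepool writeback
    ceph osd tier set-overlay ecpool cachepool

    # Track hits and give the agent explicit limits, sized well below the
    # raw 12 x 60 GB of SSD capacity (values are illustrative only).
    ceph osd pool set cachepool hit_set_type bloom
    ceph osd pool set cachepool hit_set_count 1
    ceph osd pool set cachepool hit_set_period 3600
    ceph osd pool set cachepool target_max_bytes 150000000000
    ceph osd pool set cachepool target_max_objects 40000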
Re: [ceph-users] Issue with free Inodes
Hello,

This is rather confusing, as cache tiers are just normal OSDs/pools and thus should have Ceph objects of around 4MB in size by default. This matches what I see with ext4 here (normal OSD, not a cache tier):

---
size:
/dev/sde1        2.7T  204G  2.4T   8% /var/lib/ceph/osd/ceph-0
inodes:
/dev/sde1   183148544  55654  183092890   1% /var/lib/ceph/osd/ceph-0
---

On a more fragmented cluster I see a 5:1 size-to-inode ratio. I just can't fathom how there could be 3.3 million inodes (and thus a close number of files) using 30G, making the average file size below 10 Bytes. Something other than your choice of file system is probably at play here.

How fragmented are those SSDs? What's your default Ceph object size?

Where _are_ those 3 million files in that OSD, are they actually in object files like:
-rw-r--r-- 1 root root 4194304 Jan  9 15:27 /var/lib/ceph/osd/ceph-0/current/3.117_head/DIR_7/DIR_1/DIR_5/rb.0.23a8f.238e1f29.00027632__head_C4F3D517__3

What's your use case, RBD, CephFS, RadosGW?

Regards,

Christian

On Mon, 23 Mar 2015 10:32:55 +0300 Kamil Kuramshin wrote:
Recently I got a problem with OSDs based on SSD disks used in a cache tier for an EC pool:

superuser@node02:~$ df -i
Filesystem    Inodes    IUsed  *IFree*  IUse% Mounted on
...
/dev/sdb1    3335808  3335808     *0*   100% /var/lib/ceph/osd/ceph-45
/dev/sda1    3335808  3335808     *0*   100% /var/lib/ceph/osd/ceph-46

Now those OSDs are down on each ceph node and cache tiering is not working.

superuser@node01:~$ sudo tail /var/log/ceph/ceph-osd.45.log
2015-03-23 10:04:23.631137 7fb105345840  0 ceph version 0.87.1 (283c2e7cfa2457799f534744d7d549f83ea1335e), process ceph-osd, pid 1453465
2015-03-23 10:04:23.640676 7fb105345840  0 filestore(/var/lib/ceph/osd/ceph-45) backend generic (magic 0xef53)
2015-03-23 10:04:23.640735 7fb105345840 -1 genericfilestorebackend(/var/lib/ceph/osd/ceph-45) detect_features: unable to create /var/lib/ceph/osd/ceph-45/fiemap_test: (28) No space left on device
2015-03-23 10:04:23.640763 7fb105345840 -1 filestore(/var/lib/ceph/osd/ceph-45) _detect_fs: detect_features error: (28) No space left on device
2015-03-23 10:04:23.640772 7fb105345840 -1 filestore(/var/lib/ceph/osd/ceph-45) FileStore::mount : error in _detect_fs: (28) No space left on device
2015-03-23 10:04:23.640783 7fb105345840 -1  ** ERROR: error converting store /var/lib/ceph/osd/ceph-45: (28) *No space left on device*

At the same time, df -h is confusing:

superuser@node01:~$ df -h
Filesystem    Size  Used  *Avail*  Use% Mounted on
...
/dev/sda1     50G   29G   *20G*    60% /var/lib/ceph/osd/ceph-45
/dev/sdb1     50G   27G   *21G*    56% /var/lib/ceph/osd/ceph-46

The filesystem used on the affected OSDs is ext4. All OSDs were deployed with ceph-deploy:

$ ceph-deploy osd create --zap-disk --fs-type ext4 node-name:device

Luckily it was only a test deployment; all EC-pool data was lost, since I /can't start the OSDs/ and the cluster /became degraded/ until I removed all the affected tiered pools (cache + EC).

So this is just my observation of what kind of problems you can face if you choose the wrong filesystem for the OSD backend. And now I *strongly* recommend choosing *XFS* or *Btrfs*, because both support dynamic inode allocation and this problem cannot arise with them.

--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/