Re: [ceph-users] scrubbing for a long time and not finished
Currently, users do not know when some PG has been scrubbing for a long time. I think we could give a warning if that happens (defined as osd_scrub_max_time). It would tell the user that something may be wrong in the cluster.

2015-03-17 21:21 GMT+08:00 池信泽 xmdx...@gmail.com: On Tue, Mar 17, 2015 at 10:01 AM, Xinze Chi xmdx...@gmail.com wrote: hi, all: I find a PG on my test cluster that has been scrubbing for a long time and does not finish. There is no useful scrubbing log. scrubs_active is 1, so inc_scrubs_pending returns false. I think the reason is that some scrub message was lost, so the primary cannot continue chunky_scrub and it hangs at scrubbing. Could anyone give some suggestions? Thanks

[root@ceph0 ~]# date
Tue Mar 17 09:54:54 CST 2015
[root@ceph0 ~]# ceph pg dump | grep scrub
dumped all in format plain
pg_stat objects mip degr misp unf bytes log disklog state state_stamp v reported up up_primary acting acting_primary last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp
1.97 30 0 0 0 0 117702656 31 31 active+clean+scrubbing 2015-03-16 14:50:02.110796 78'31 78:50 [9,6,1] 9 [9,6,1] 9 0'0 2015-03-15 14:49:33.661597 0'0 2015-03-13 14:48:53.341679

The attachment is the log from the primary; the scrubbing PG is 1.97s0. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
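For reference, a hedged sketch of how a stuck scrub like this is often inspected and kicked (assuming the primary is osd.9, as in the dump above; verify against your own 'ceph pg dump' output first):

[root@ceph0 ~]# ceph pg 1.97 query           # inspect the scrubber/peering state on the primary
[root@ceph0 ~]# ceph osd set noscrub         # keep new scrubs from starting while you intervene
[root@ceph0 ~]# service ceph restart osd.9   # restarting the primary resets its scrub state machine
[root@ceph0 ~]# ceph osd unset noscrub

Restarting the primary is a workaround that has been reported on this list, not a confirmed fix for the lost-message case described above.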
Re: [ceph-users] Cache Tier Flush = immediate base tier journal sync?
Hello, On Wed, 18 Mar 2015 11:05:47 -0700 Gregory Farnum wrote: On Wed, Mar 18, 2015 at 8:04 AM, Nick Fisk n...@fisk.me.uk wrote: Hi Greg, Thanks for your input; I completely agree that we cannot expect developers to fully document what impact each setting has on a cluster, particularly in a performance-related way. That said, if you or others could spare some time for a few pointers it would be much appreciated, and I will endeavour to create some useful results/documents that are more relevant to end users.

I have taken on board what you said about the WB throttle and have been experimenting with it by switching it on and off. I know it's a bit of a blunt configuration change, but it was useful for understanding its effect. With it off, I initially see quite a large performance increase, but over time it actually starts to slow the average throughput down. Like you said, I am guessing this is to do with it making sure the journal doesn't get too far ahead, leaving it with massive syncs to carry out. One thing I do see with the WBT enabled, and to some extent with it disabled, is that there are large periods of small-block writes at the max speed of the underlying SATA disk (70-80 IOPS). Here are 2 blktrace seekwatcher traces of performing an OSD bench (64 KB IOs for 500 MB) where this behaviour can be seen.

If you're doing 64k IOs then I believe it's creating a new on-disk file for each of those writes. How that's laid out on disk will depend on your filesystem and the specific config options that we're using to try to avoid running too far ahead of the journal.

Could you elaborate on that a bit? I would have expected those 64KB writes to go to the same object (file) until it is full (4MB), because this behavior would explain some (if not all) of the write amplification I've seen in the past with small writes (see the SSD Hardware recommendation thread). Christian

I think you're just using these config options in conflict with each other. You've set the min sync time to 20 seconds for some reason, presumably to try and batch stuff up? So in that case you probably want to let your journal run for twenty seconds' worth of backing-disk IO before you start throttling it, and probably 10-20 seconds' worth of IO before forcing file flushes. That means increasing the throttle limits while still leaving the flusher enabled. -Greg

http://www.sys-pro.co.uk/misc/wbt_on.png http://www.sys-pro.co.uk/misc/wbt_off.png

I would really appreciate it if someone could comment on why this type of behaviour happens. As can be seen in the trace, if the blocks are submitted to the disk as larger IOs and with higher concurrency, hundreds of MB of data can be flushed in seconds. Is this something specific to the filesystem behaviour which Ceph cannot influence, like dirty filesystem metadata/inodes which can't be merged into larger IOs? For sequential writes, I would have thought that in an optimum scenario a spinning disk should be able to almost maintain its large-block write speed (100MB/s) no matter the underlying block size. That being said, from what I understand, when a sync is called it will try and flush all dirty data, so the end result is probably slightly different to a traditional battery-backed write-back cache.

Chris, would you be interested in forming a ceph-users based performance team? There's a developer performance meeting which is mainly concerned with improving the internals of Ceph.
There is also a raft of information in the mailing list archives where people have said hey, look at my SSD speed at x,y,z settings, but making comparisons or recommendations is not that easy. It may also reduce a lot of the repetitive posts of why is X so slow, etc. -- Christian Balzer Network/Systems Engineer ch...@gol.com Global OnLine Japan/Fusion Communications http://www.gol.com/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
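A minimal ceph.conf sketch of Greg's suggestion above (raise the writeback-throttle limits but keep the flusher enabled); the option names exist in firefly-era OSDs, but the values here are purely illustrative and should be sized to roughly 10-20 seconds of your backing disk's throughput:

[osd]
filestore wbthrottle enable = true
# start flushing later, and only hard-throttle at a much higher backlog
filestore wbthrottle xfs bytes start flusher = 838860800    # ~800 MB, illustrative
filestore wbthrottle xfs bytes hard limit = 1677721600      # ~1.6 GB, illustrative
filestore wbthrottle xfs ios start flusher = 5000
filestore wbthrottle xfs ios hard limit = 10000
filestore max sync interval = 20

Use the btrfs variants of the same options if your OSDs are on btrfs.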
Re: [ceph-users] SSD Hardware recommendation
On Wed, 18 Mar 2015 08:59:14 +0100 Josef Johansson wrote: Hi, On 18 Mar 2015, at 05:29, Christian Balzer ch...@gol.com wrote: Hello, On Wed, 18 Mar 2015 03:52:22 +0100 Josef Johansson wrote: [snip] We thought of doing a cluster with 3 servers, and any recommendation of Supermicro servers would be appreciated.

Why 3, replication of 3? With Intel SSDs and diligent (SMART/NAGIOS) wear-level monitoring I'd personally feel safe with a replication factor of 2.

I've seen recommendations of replication 2! The Intel SSDs are indeed endurable. This is only with Intel SSDs, I assume?

From the specifications and reviews I've seen, the Samsung 845DC PRO, the SM 843T and even more so the SV843 (http://www.samsung.com/global/business/semiconductor/product/flash-ssd/overview ; don't you love it when the same company has different, competing products?) should do just fine when it comes to endurance and performance. Alas I have no first-hand experience with either, just the (read-optimized) 845DC EVO.

This 1U http://www.supermicro.com.tw/products/system/1U/1028/SYS-1028U-TR4T_.cfm is really nice, missing the SuperDOM peripherals though..

While I certainly see use cases for SuperDOM, not all models have 2 connectors, so no chance to RAID1 things, thus the need to _definitely_ pull the server out (and re-install the OS) should it fail.

so you really get 8 drives if you need two for the OS. And the rails.. don't get me started, but lately they do just snap into the racks! No screws needed. That's a refresh from earlier 1U SM rails.

Ah, the only 1U servers I'm currently deploying from SM are older ones, so still no snap-in rails. Everything 2U has been that way for at least 2 years, though. ^^ Christian -- Christian Balzer Network/Systems Engineer ch...@gol.com Global OnLine Japan/Fusion Communications http://www.gol.com/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Mapping OSD to physical device
I don't use ceph-deploy, but using ceph-disk for creating the OSDs automatically uses the by-partuuid reference for the journals (at least I recall only using /dev/sdX for the journal reference, which is what I have in my documentation). Since ceph-disk does all the partitioning, it automatically finds the volume with udev, mounts it in the correct location and accesses the journal on the right disk. It also may be a limitation of the version of ceph-deploy/ceph-disk you are using.

On Thu, Mar 19, 2015 at 5:54 PM, Colin Corr co...@pc-doctor.com wrote: On 03/19/2015 12:27 PM, Robert LeBlanc wrote: Udev already provides some of this for you. Look in /dev/disk/by-*. You can reference drives by UUID, id or path (for SAS/SCSI/FC/iSCSI/etc) which will provide some consistency across reboots and hardware changes.

Thanks for the quick responses. And to Kobi (off list) as well. It seems the optimal way to do this is to create the OSDs by ID in the first place. So, for /dev/sde with a journal on /dev/sda5:

root@osd1:~$ ls -l /dev/disk/by-id/ | grep sde
lrwxrwxrwx 1 root root 9 Mar 19 23:36 ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E5TUCJX9 -> ../../sde
lrwxrwxrwx 1 root root 9 Mar 19 23:36 wwn-0x50014ee20a66aefe -> ../../sde
root@osd1:~$ ls -l /dev/disk/by-id/ | grep sda5
lrwxrwxrwx 1 root root 10 Mar 19 23:36 ata-Crucial_CT480M500SSD1_14210C292B50-part5 -> ../../sda5
lrwxrwxrwx 1 root root 10 Mar 19 23:36 wwn-0x500a07510c292b50-part5 -> ../../sda5

The deploy command looks like this:

ceph-deploy --overwrite-conf osd create osd1:/dev/disk/by-id/wwn-0x50014ee20a66aefe:/dev/disk/by-id/wwn-0x500a07510c292b50-part5

And alternatively, create a udev rule set for existing devices. I haven't tested yet, but I am guessing that the udev rule for that same disk (deployed as sde) would look something like this:

KERNEL=="sde", SUBSYSTEM=="block", DEVLINKS=="/dev/disk/by-id/wwn-0x50014ee20a66aefe"

Many thanks for the assistance! Colin ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] hadoop namenode not starting due to bindException while deploying hadoop with cephFS
Hi, I have a 5-node Ceph (v0.87) cluster and am trying to deploy Hadoop with CephFS. I have installed hadoop-1.1.1 on the nodes and changed the conf/core-site.xml file according to the Ceph documentation (http://ceph.com/docs/master/cephfs/hadoop/), but after changing the file the namenode does not start (it can still be formatted), while the other services (datanode, jobtracker, tasktracker) run fine in Hadoop. The default Hadoop setup works, but when I change core-site.xml as above I get the following BindException in the namenode log:

2015-03-19 01:37:31,436 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.net.BindException: Problem binding to node1/10.242.144.225:6789 : Cannot assign requested address

I have one monitor in the ceph cluster (node1/10.242.144.225) and I set ceph://10.242.144.225:6789 as the value of fs.default.name in core-site.xml. Port 6789 is the default port used by the Ceph monitor, so that may be the reason for the BindException, but the Ceph documentation says it should be configured like this in core-site.xml. It would be really helpful to get some pointers to what I am doing wrong in the setup. Thank you. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
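For context: when CephFS replaces HDFS there is no namenode at all, and the BindException above is what happens when start-all.sh still tries to start one, since the namenode then attempts to bind to the fs.default.name address (the monitor's, not its own). A sketch of the relevant core-site.xml properties from the Ceph Hadoop docs, using the monitor address from the post above:

<property>
  <name>fs.default.name</name>
  <value>ceph://10.242.144.225:6789/</value>
</property>
<property>
  <name>fs.ceph.impl</name>
  <value>org.apache.hadoop.fs.ceph.CephFileSystem</value>
</property>

With this in place, start only the MapReduce daemons (bin/start-mapred.sh) and skip start-dfs.sh; the namenode and datanode are simply not used.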
[ceph-users] 'pgs stuck unclean ' problem
Dear all, Ceph 0.72.2 is deployed on three hosts, but the cluster's status is HEALTH_WARN. The status is as follows:

# ceph -s
    cluster e25909ed-25d9-42fd-8c97-0ed31eec6194
     health HEALTH_WARN 768 pgs degraded; 768 pgs stuck unclean; recovery 2/3 objects degraded (66.667%)
     monmap e3: 3 mons at {ceph-node1=192.168.57.101:6789/0,ceph-node2=192.168.57.102:6789/0,ceph-node3=192.168.57.103:6789/0}, election epoch 34, quorum 0,1,2 ceph-node1,ceph-node2,ceph-node3
     osdmap e170: 9 osds: 9 up, 9 in
      pgmap v1741: 768 pgs, 7 pools, 36 bytes data, 1 objects
            367 MB used, 45612 MB / 45980 MB avail
            2/3 objects degraded (66.667%)
            768 active+degraded

There are 3 pools created, but 7 pools appear in the status above.

# ceph osd lspools
5 data,6 metadata,7 rbd,

The object in pool 'data' has just one replica, but the pool's replication size is set to 3.

# ceph osd map data object1
osdmap e170 pool 'data' (5) object 'object1' -> pg 5.bac5debc (5.bc) -> up [6] acting [6]

# ceph osd dump | more
epoch 170
fsid e25909ed-25d9-42fd-8c97-0ed31eec6194
created 2015-03-16 11:23:28.805286
modified 2015-03-19 15:45:39.451077
flags
pool 5 'data' rep size 3 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 155 owner 0
pool 6 'metadata' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 161 owner 0
pool 7 'rbd' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 163 owner 0

Other info is depicted here.

# ceph osd tree
# id    weight  type name                       up/down reweight
-1      0       root default
-7      0               rack rack03
-4      0                       host ceph-node3
6       0                               osd.6   up      1
7       0                               osd.7   up      1
8       0                               osd.8   up      1
-6      0               rack rack02
-3      0                       host ceph-node2
3       0                               osd.3   up      1
4       0                               osd.4   up      1
5       0                               osd.5   up      1
-5      0               rack rack01
-2      0                       host ceph-node1
0       0                               osd.0   up      1
1       0                               osd.1   up      1
2       0                               osd.2   up      1

The crushmap is:

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8

# types
type 0 osd
type 1 host
type 2 rack
type 3 row
type 4 room
type 5 datacenter
type 6 root

# buckets
host ceph-node3 {
        id -4           # do not change unnecessarily
        # weight 0.000
        alg straw
        hash 0  # rjenkins1
        item osd.6 weight 0.000
        item osd.7 weight 0.000
        item osd.8 weight 0.000
}
rack rack03 {
        id -7           # do not change unnecessarily
        # weight 0.000
        alg straw
        hash 0  # rjenkins1
        item ceph-node3 weight 0.000
}
host ceph-node2 {
        id -3           # do not change unnecessarily
        # weight 0.000
        alg straw
        hash 0  # rjenkins1
        item osd.3 weight 0.000
        item osd.4 weight 0.000
        item osd.5 weight 0.000
}
rack rack02 {
        id -6           # do not change unnecessarily
        # weight 0.000
        alg straw
        hash 0  # rjenkins1
        item ceph-node2 weight 0.000
}
host ceph-node1 {
        id -2           # do not change unnecessarily
        # weight 0.000
        alg straw
        hash 0  # rjenkins1
        item osd.0 weight 0.000
        item osd.1 weight 0.000
        item osd.2 weight 0.000
}
rack rack01 {
        id -5           # do not change unnecessarily
        # weight 0.000
        alg straw
        hash 0  # rjenkins1
        item ceph-node1 weight 0.000
}
root default {
        id -1           # do not change unnecessarily
        # weight 0.000
        alg straw
        hash 0  # rjenkins1
        item rack03 weight 0.000
        item rack02 weight 0.000
        item rack01 weight 0.000
}

# rules
rule data {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}
# end crush map

# ceph health detail | more
HEALTH_WARN 768 pgs degraded; 768 pgs stuck unclean; recovery 2/3 objects
degraded (66.667%)
pg 5.17 is stuck unclean since forever, current state active+degraded, last acting [6]
pg 6.14 is stuck unclean since forever, current state active+degraded, last acting [6]
pg 7.15 is stuck unclean since forever, current state active+degraded, last acting [6]
pg 5.14 is stuck unclean since forever, current
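Note that every weight in the 'ceph osd tree' output above is 0.000 (typical when OSDs are created on very small test disks); with straw buckets a zero weight means CRUSH can never select the other hosts for replicas, which matches the single-OSD acting set [6]. A likely fix, as a sketch (repeat for osd.0 through osd.8; the 0.01 value is illustrative, any non-zero weight works):

# ceph osd crush reweight osd.0 0.01
# ceph osd crush reweight osd.1 0.01
# ceph osd crush reweight osd.2 0.01

After the reweights propagate up the tree, the PGs should peer and go active+clean as replicas get placed on all three hosts.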
[ceph-users] Segfault after modifying CRUSHMAP
Hi guys, I was creating new buckets and adjusting the crush map when 1 monitor stopped replying. The scenario is: 2 servers, 2 MONs, 21 OSDs per server. Error message in the mon.log: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. I uploaded the stderr to: http://ur1.ca/jxbrp Does anybody have any idea? Thank you, Gian ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] SSD Hardware recommendation
Hello, On Wed, 18 Mar 2015 11:41:17 +0100 Francois Lafont wrote: Hi, Christian Balzer wrote:

Consider what you think your IO load (writes) generated by your client(s) will be, multiply that by your replication factor, divide by the number of OSDs; that will give you the base load per OSD. Then multiply by 2 (journal on OSD) per OSD. Finally, based on my experience and measurements (link below), multiply that by at least 6, probably 10 to be on the safe side. Use that number to find the SSD that can handle this write load for the time period you're budgeting that cluster for. http://lists.opennebula.org/pipermail/ceph-users-ceph.com/2014-October/043949.html

Thanks Christian for these interesting explanations. I have read your link and I'd like to understand why the write amplification is greater than the replication factor. For me, in theory, write amplification should be approximately equal to the replication factor. What are the reasons for this difference? Er... in fact, after thinking about it a little, I imagine that 1 write IO on the client side becomes 2*R IO on the cluster side (where R is the replication factor), because there are R IO for the OSDs and R IO for the journals. So, with R = 2, I can imagine a write amplification equal to 4, but I don't understand why it's 5 or 6. Is it possible to have an explanation for this?

You're asking the wrong person, as I'm neither a Ceph nor a kernel developer. ^o^ Back then Mark Nelson from the Ceph team didn't expect to see those numbers as well, but both Mark Wu and I saw them.

Anyway, let's start with the basics and things that are understandable without any detailed knowledge. Assume a cluster with 2 nodes, 10 OSDs each and a replication of 2 (since we're talking about an SSD cluster here, to keep things related to the question of the OP). Now a client writes 40MB of data to the cluster. Assuming an ideal scenario where all PGs are evenly distributed (they won't be) and this is totally fresh data (resulting in 10 4MB Ceph objects), this would mean that each OSD will receive 4MB (10 primary PGs, 10 secondary ones). With journals on the same SSD (currently the best way based on tests), we get a write amplification of 2, as the data is written both to the journal and the actual storage space. But as my results in the link above showed, that is very much dependent on the write size. With a 4MB block size (the ideal size for default RBD pools and objects) I saw even slightly less than the 2x amplification expected; I assume that was due to caching and PG imbalances.

Now my guess as to what happens with small (4KB) writes is that all these small writes do not coalesce sufficiently before being written to the object on the OSD. So up to 1000 4KB writes could happen to that 4MB object (clearly it is much less than that, but how much I can't tell), resulting in the same blocks being rewritten several times. There's also the journaling done by the respective file system (I used ext4 during that test), and while there are bound to be some differences, in a worst-case scenario that could result in another 2x write amplification (FS journal and actual file). In addition, Ceph updates various files like the omap leveldb and metadata; quantifying that, however, would require much more detailed analysis or familiarity with the Ceph code.
Regards, Christian -- Christian Balzer Network/Systems Engineer ch...@gol.com Global OnLine Japan/Fusion Communications http://www.gol.com/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
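To make the arithmetic concrete, a worked example using the rule of thumb quoted above (all numbers illustrative):

  client writes:        100 MB/s aggregate
  replication factor:   2   -> 200 MB/s written across the cluster
  20 OSDs:              200 / 20 = 10 MB/s base load per OSD
  journal on same SSD:  10 x 2   = 20 MB/s per device
  small-write factor:   x6 to x10 -> 120-200 MB/s sustained per SSD

It is that last empirical factor, not the replication itself, that dominates endurance sizing for small-block workloads.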
[ceph-users] Readonly cache tiering and rbd.
Hi, From the documentation: Cache Tier readonly: Read-only Mode: When admins configure tiers with readonly mode, Ceph clients write data to the backing tier. On read, Ceph copies the requested object(s) from the backing tier to the cache tier. Stale objects get removed from the cache tier based on the defined policy. This approach is ideal for immutable data (e.g., presenting pictures/videos on a social network, DNA data, X-Ray imaging, etc.), because reading data from a cache pool that might contain out-of-date data provides weak consistency. Do not use readonly mode for mutable data.

Does this mean that when a client (xen / kvm with a RBD volume) writes some data, the OSD does not mark the readonly cache dirty? In other words, what does 'weak consistency' mean here? Regards, Matthijs ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
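For reference, a read-only tier of this kind is configured along these lines (the pool names are placeholders; check the tier commands on your version, as some releases require an extra confirmation flag):

# ceph osd tier add rbd-pool cache-pool
# ceph osd tier cache-mode cache-pool readonly
# ceph osd tier set-overlay rbd-pool cache-pool

The overlay is what routes client reads through cache-pool while writes proceed against rbd-pool, which is exactly where the consistency question in the post above comes from.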
[ceph-users] Code for object deletion
Can anyone tell me where the code for deleting objects with the command rados rm test-object-1 --pool=data can be found, for Ceph version 0.80.5? Thanks. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
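Some pointers, hedged from memory of the firefly-era tree: the rados CLI parses 'rm' in the rados tool source (src/tools/rados/rados.cc in recent trees, src/rados.cc in some older ones) and calls librados IoCtx::remove(); on the OSD side the resulting CEPH_OSD_OP_DELETE op is handled in ReplicatedPG::do_osd_ops() in src/osd/ReplicatedPG.cc. Grepping your 0.80.5 checkout for CEPH_OSD_OP_DELETE should let you follow the full path.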
[ceph-users] Server Specific Pools
Hi, I have a Ceph cluster with both ARM and x86 based servers in the same cluster. Is there a way for me to define pools, or some logical separation, that would allow me to use only one set of machines for a particular test? That would make it easy for me to run tests on either x86 or ARM and do some comparison testing. Thanks Pankaj ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Server Specific Pools
Pankaj, You can define them via different crush rules, and then assign a pool to a given crush rule. This is the same in practice as having a node type with all SSDs and another with all spinners. You can read more about how to set this up here: http://ceph.com/docs/master/rados/operations/crush-map/#placing-different-pools-on-different-osds Cheers, On Thu, Mar 19, 2015 at 9:28 PM, Garg, Pankaj pankaj.g...@caviumnetworks.com wrote: Hi, I have a Ceph cluster with both ARM and x86 based servers in the same cluster. Is there a way for me to define Pools or some logical separation that would allow me to use only 1 set of machines for a particular test. That way it makes easy for me to run tests either on x86 or ARM and do some comparison testing. Thanks Pankaj ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- David Burley NOC Manager, Sr. Systems Programmer/Analyst Slashdot Media e: da...@slashdotmedia.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
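A sketch of what that looks like, in the same syntax as the decompiled crush maps elsewhere in this archive; the bucket and rule names, ids and weights are placeholders for your actual hosts:

root arm {
        id -20          # do not change unnecessarily
        alg straw
        hash 0  # rjenkins1
        item arm-host1 weight 1.000
        item arm-host2 weight 1.000
}
rule arm-only {
        ruleset 2
        type replicated
        min_size 1
        max_size 10
        step take arm
        step chooseleaf firstn 0 type host
        step emit
}

After recompiling and injecting the map, point a pool at the rule:

# ceph osd pool create armpool 128 128
# ceph osd pool set armpool crush_ruleset 2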
Re: [ceph-users] Mapping OSD to physical device
On 03/19/2015 12:27 PM, Robert LeBlanc wrote: Udev already provides some of this for you. Look in /dev/disk/by-*. You can reference drives by UUID, id or path (for SAS/SCSI/FC/iSCSI/etc) which will provide some consistency across reboots and hardware changes.

Thanks for the quick responses. And to Kobi (off list) as well. It seems the optimal way to do this is to create the OSDs by ID in the first place. So, for /dev/sde with a journal on /dev/sda5:

root@osd1:~$ ls -l /dev/disk/by-id/ | grep sde
lrwxrwxrwx 1 root root 9 Mar 19 23:36 ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E5TUCJX9 -> ../../sde
lrwxrwxrwx 1 root root 9 Mar 19 23:36 wwn-0x50014ee20a66aefe -> ../../sde
root@osd1:~$ ls -l /dev/disk/by-id/ | grep sda5
lrwxrwxrwx 1 root root 10 Mar 19 23:36 ata-Crucial_CT480M500SSD1_14210C292B50-part5 -> ../../sda5
lrwxrwxrwx 1 root root 10 Mar 19 23:36 wwn-0x500a07510c292b50-part5 -> ../../sda5

The deploy command looks like this:

ceph-deploy --overwrite-conf osd create osd1:/dev/disk/by-id/wwn-0x50014ee20a66aefe:/dev/disk/by-id/wwn-0x500a07510c292b50-part5

And alternatively, create a udev rule set for existing devices. I haven't tested yet, but I am guessing that the udev rule for that same disk (deployed as sde) would look something like this:

KERNEL=="sde", SUBSYSTEM=="block", DEVLINKS=="/dev/disk/by-id/wwn-0x50014ee20a66aefe"

Many thanks for the assistance! Colin ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] FastCGI and RadosGW issue?
- Original Message - From: Potato Farmer potato_far...@outlook.com To: ceph-users@lists.ceph.com Sent: Thursday, March 19, 2015 12:26:41 PM Subject: [ceph-users] FastCGI and RadosGW issue?

Hi, I am running into an issue uploading to a bucket over an s3 connection to ceph. I can create buckets just fine. I just can't create a key and copy data to it. Command that causes the error:

key.set_contents_from_string("testing from string")

I encounter the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/site-packages/boto/s3/key.py", line 1424, in set_contents_from_string
    encrypt_key=encrypt_key)
  File "/usr/lib/python2.7/site-packages/boto/s3/key.py", line 1291, in set_contents_from_file
    chunked_transfer=chunked_transfer, size=size)
  File "/usr/lib/python2.7/site-packages/boto/s3/key.py", line 748, in send_file
    chunked_transfer=chunked_transfer, size=size)
  File "/usr/lib/python2.7/site-packages/boto/s3/key.py", line 949, in _send_file_internal
    query_args=query_args
  File "/usr/lib/python2.7/site-packages/boto/s3/connection.py", line 664, in make_request
    retry_handler=retry_handler
  File "/usr/lib/python2.7/site-packages/boto/connection.py", line 1068, in make_request
    retry_handler=retry_handler)
  File "/usr/lib/python2.7/site-packages/boto/connection.py", line 1025, in _mexe
    raise BotoServerError(response.status, response.reason, body)
boto.exception.BotoServerError: BotoServerError: 500 Internal Server Error
None

In the Apache logs I see the following:

[Thu Mar 19 12:03:13 2015] [error] [] FastCGI: comm with server /var/www/s3gw.fcgi aborted: idle timeout (30 sec)
[Thu Mar 19 12:03:13 2015] [error] [] FastCGI: incomplete headers (0 bytes) received from server /var/www/s3gw.fcgi
[Thu Mar 19 12:03:32 2015] [error] [] FastCGI: comm with server /var/www/s3gw.fcgi aborted: idle timeout (30 sec)
[Thu Mar 19 12:03:32 2015] [error] [] FastCGI: incomplete headers (0 bytes) received from server /var/www/s3gw.fcgi

I do not get any data to show in the radosgw logs; it is empty. I have turned off FastCGIWrapper and set rgw print continue to false in ceph.conf. I am using the version of FastCGI provided by the ceph repo.

In this case you don't need to have 'rgw print continue' set to false; either remove that line, or set it to true. Yehuda

Has anyone run into this before? Any suggestions? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
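In ceph.conf terms, Yehuda's fix amounts to something like this (the client section name is an assumption; use whatever your gateway instance is called):

[client.radosgw.gateway]
rgw print continue = true

'rgw print continue = true' is the right setting when the ceph-patched mod_fastcgi is in use, since that module handles the 100-continue responses; setting it to false is only needed with an unpatched FastCGI module.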
Re: [ceph-users] FastCGI and RadosGW issue?
Yehuda, You rock! Thank you for the suggestion. That fixed the issue. :)

-----Original Message----- From: Yehuda Sadeh-Weinraub [mailto:yeh...@redhat.com] Sent: Thursday, March 19, 2015 12:45 PM To: Potato Farmer Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] FastCGI and RadosGW issue?

- Original Message - From: Potato Farmer potato_far...@outlook.com To: ceph-users@lists.ceph.com Sent: Thursday, March 19, 2015 12:26:41 PM Subject: [ceph-users] FastCGI and RadosGW issue?

Hi, I am running into an issue uploading to a bucket over an s3 connection to ceph. I can create buckets just fine. I just can't create a key and copy data to it. Command that causes the error:

key.set_contents_from_string("testing from string")

I encounter the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/site-packages/boto/s3/key.py", line 1424, in set_contents_from_string
    encrypt_key=encrypt_key)
  File "/usr/lib/python2.7/site-packages/boto/s3/key.py", line 1291, in set_contents_from_file
    chunked_transfer=chunked_transfer, size=size)
  File "/usr/lib/python2.7/site-packages/boto/s3/key.py", line 748, in send_file
    chunked_transfer=chunked_transfer, size=size)
  File "/usr/lib/python2.7/site-packages/boto/s3/key.py", line 949, in _send_file_internal
    query_args=query_args
  File "/usr/lib/python2.7/site-packages/boto/s3/connection.py", line 664, in make_request
    retry_handler=retry_handler
  File "/usr/lib/python2.7/site-packages/boto/connection.py", line 1068, in make_request
    retry_handler=retry_handler)
  File "/usr/lib/python2.7/site-packages/boto/connection.py", line 1025, in _mexe
    raise BotoServerError(response.status, response.reason, body)
boto.exception.BotoServerError: BotoServerError: 500 Internal Server Error
None

In the Apache logs I see the following:

[Thu Mar 19 12:03:13 2015] [error] [] FastCGI: comm with server /var/www/s3gw.fcgi aborted: idle timeout (30 sec)
[Thu Mar 19 12:03:13 2015] [error] [] FastCGI: incomplete headers (0 bytes) received from server /var/www/s3gw.fcgi
[Thu Mar 19 12:03:32 2015] [error] [] FastCGI: comm with server /var/www/s3gw.fcgi aborted: idle timeout (30 sec)
[Thu Mar 19 12:03:32 2015] [error] [] FastCGI: incomplete headers (0 bytes) received from server /var/www/s3gw.fcgi

I do not get any data to show in the radosgw logs; it is empty. I have turned off FastCGIWrapper and set rgw print continue to false in ceph.conf. I am using the version of FastCGI provided by the ceph repo.

In this case you don't need to have 'rgw print continue' set to false; either remove that line, or set it to true. Yehuda

Has anyone run into this before? Any suggestions? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Readonly cache tiering and rbd.
On Thu, Mar 19, 2015 at 4:46 AM, Matthijs Möhlmann matth...@cacholong.nl wrote: Hi, From the documentation: Cache Tier readonly: Read-only Mode: When admins configure tiers with readonly mode, Ceph clients write data to the backing tier. On read, Ceph copies the requested object(s) from the backing tier to the cache tier. Stale objects get removed from the cache tier based on the defined policy. This approach is ideal for immutable data (e.g., presenting pictures/videos on a social network, DNA data, X-Ray imaging, etc.), because reading data from a cache pool that might contain out-of-date data provides weak consistency. Do not use readonly mode for mutable data.

Does this mean that when a client (xen / kvm with a RBD volume) writes some data, the OSD does not mark the readonly cache dirty?

Yes, exactly. Reads are directed to the cache but writes go directly to the base tier, and there's no attempt at communication about the changed objects. -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Ceiling on number of PGs in a OSD
Hi, Is there a ceiling on the number of placement groups per OSD beyond which steady-state and/or recovery performance will start to suffer? Example: I need to create a pool with 750 OSDs (25 OSDs per server, 50 servers). The PG calculator gives me 65536 placement groups, about 300 PGs per OSD. Now as the cluster expands, the number of PGs per OSD has to increase as well. If the cluster size increases by a factor of 10, the number of PGs per OSD will also need to be increased. What would be the impact of a large PG count per OSD on peering and rebalancing? There is 3GB per OSD available. thanks, Sreenath ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
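For reference, the usual rule of thumb behind those calculators is:

  total PGs ~ (number of OSDs x 100) / replica count, rounded up to the next power of two

So for this example: 750 x 100 / 3 = 25000 -> 32768 PGs, and the higher 65536 figure corresponds to targeting roughly 200-300 PGs per OSD (65536 x 3 / 750 ~ 262). Counts much above ~300 PGs per OSD are generally where memory use during peering and recovery starts to become a concern, though the exact ceiling depends on RAM and object counts.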
Re: [ceph-users] scrubbing for a long time and not finished
On Thu, 19 Mar 2015, Xinze Chi wrote: Currently, users do not know when some PG has been scrubbing for a long time. I think we could give a warning if that happens (defined as osd_scrub_max_time). It would tell the user that something may be wrong in the cluster.

This should be pretty straightforward to add along with the other 'stuck X' warnings based on the pg_stat_t state timestamps. On the other hand, that may be a somewhat heavyweight approach (each new warning bloats the stat structure a bit); open to other ideas! sage ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] cciss driver package for RHEL7
I understand there's a KMOD_CCISS package available. However, I can't find it for download. Anybody have any ideas? Thanks! Dan O'Reilly UNIX Systems Administration 9601 S. Meridian Blvd. Englewood, CO 80112 720-514-6293 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Issue with Ceph mons starting up- leveldb store
Hello: We have a cuttlefish (0.61.9) 192-OSD cluster that we are trying to get back to quorum. We have 2 mon nodes up and ready; we just need this 3rd one. We moved the data dir (/var/lib/ceph/mon) over from one of the good ones to this 3rd node, but it won't start. We see this error, after which no further logging occurs:

2015-03-19 06:25:05.395210 7fcb57f1c7c0 -1 failed to create new leveldb store
2015-03-19 06:25:05.417716 7f272ae0d7c0 0 ceph version 0.61.9 (7440dcd135750839fa0f00263f80722ff6f51e90), process ceph-mon, pid 37967

Does anyone have an idea why the mon process would have issues creating the leveldb store (we've seen this error since the outage), and where does it create it? Is it part of the paxos implementation? Thanks for any help, -andy ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] FastCGI and RadosGW issue?
Hi, I am running into an issue uploading to a bucket over an s3 connection to ceph. I can create buckets just fine. I just can't create a key and copy data to it. Command that causes the error:

key.set_contents_from_string("testing from string")

I encounter the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/site-packages/boto/s3/key.py", line 1424, in set_contents_from_string
    encrypt_key=encrypt_key)
  File "/usr/lib/python2.7/site-packages/boto/s3/key.py", line 1291, in set_contents_from_file
    chunked_transfer=chunked_transfer, size=size)
  File "/usr/lib/python2.7/site-packages/boto/s3/key.py", line 748, in send_file
    chunked_transfer=chunked_transfer, size=size)
  File "/usr/lib/python2.7/site-packages/boto/s3/key.py", line 949, in _send_file_internal
    query_args=query_args
  File "/usr/lib/python2.7/site-packages/boto/s3/connection.py", line 664, in make_request
    retry_handler=retry_handler
  File "/usr/lib/python2.7/site-packages/boto/connection.py", line 1068, in make_request
    retry_handler=retry_handler)
  File "/usr/lib/python2.7/site-packages/boto/connection.py", line 1025, in _mexe
    raise BotoServerError(response.status, response.reason, body)
boto.exception.BotoServerError: BotoServerError: 500 Internal Server Error
None

In the Apache logs I see the following:

[Thu Mar 19 12:03:13 2015] [error] [] FastCGI: comm with server /var/www/s3gw.fcgi aborted: idle timeout (30 sec)
[Thu Mar 19 12:03:13 2015] [error] [] FastCGI: incomplete headers (0 bytes) received from server /var/www/s3gw.fcgi
[Thu Mar 19 12:03:32 2015] [error] [] FastCGI: comm with server /var/www/s3gw.fcgi aborted: idle timeout (30 sec)
[Thu Mar 19 12:03:32 2015] [error] [] FastCGI: incomplete headers (0 bytes) received from server /var/www/s3gw.fcgi

I do not get any data to show in the radosgw logs; it is empty. I have turned off FastCGIWrapper and set rgw print continue to false in ceph.conf. I am using the version of FastCGI provided by the ceph repo. Has anyone run into this before? Any suggestions? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Mapping OSD to physical device
Udev already provides some of this for you. Look in /dev/disk/by-*. You can reference drives by UUID, id or path (for SAS/SCSI/FC/iSCSI/etc), which will provide some consistency across reboots and hardware changes.

On Thu, Mar 19, 2015 at 1:10 PM, Colin Corr co...@pc-doctor.com wrote: Greetings Cephers, I have been lurking on this list for a while, but this is my first inquiry. I have been playing with Ceph for the past 9 months and am in the process of deploying a production Ceph cluster. I am seeking advice on an issue that I have encountered. I do not believe it is a Ceph-specific issue, but more of a Linux issue. Technically, it's not an issue, just undesired behaviour that I am hoping someone here has encountered and can provide some insight on, as to a workaround.

Basically, there are occasions when an OSD host machine gets rebooted and one or more drives does not spin up properly. This causes the OSD to go offline, along with all other OSDs after it in sequence. I created my OSDs using the online docs with the Linux device name (ex. /dev/sdc, sdd, sde, etc). So, osd.0 = /dev/sdc, osd.1 = /dev/sdd, osd.2 = /dev/sde, osd.3 = /dev/sdf, etc. But if one of the drives fails/does not spin up, then Linux will rename the drives. For example, /dev/sdd fails on reboot, so now osd.1 comes up with /dev/sde, but /dev/sde is actually the osd.2 drive, and osd.2 comes up with what was the osd.3 drive; then they all fall offline in sequence after the one failed osd.1. As expected, if I replace the failed drive and reboot, Linux enumerates the drives and gives them the original device names, and Ceph behaves properly by marking the affected OSD as down and out, while the remaining drives in sequence come up and recover gracefully.

Does anyone have any thoughts or experience with how one can ensure that Linux device names will always map to the physical device ID? I was thinking along the lines of a udev ruleset for the drives or something similar. Or is there a better way to create the OSD using the physical device ID? Basically, some sort of way to ensure that a specific physical drive always gets mapped to the same device name and OSD. Thanks for any insight or thoughts on this, Colin ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Mapping OSD to physical device
Greetings Cephers, I have been lurking on this list for a while, but this is my first inquiry. I have been playing with Ceph for the past 9 months and am in the process of deploying a production Ceph cluster. I am seeking advice on an issue that I have encountered. I do not believe it is a Ceph-specific issue, but more of a Linux issue. Technically, it's not an issue, just undesired behaviour that I am hoping someone here has encountered and can provide some insight on, as to a workaround.

Basically, there are occasions when an OSD host machine gets rebooted and one or more drives does not spin up properly. This causes the OSD to go offline, along with all other OSDs after it in sequence. I created my OSDs using the online docs with the Linux device name (ex. /dev/sdc, sdd, sde, etc). So, osd.0 = /dev/sdc, osd.1 = /dev/sdd, osd.2 = /dev/sde, osd.3 = /dev/sdf, etc. But if one of the drives fails/does not spin up, then Linux will rename the drives. For example, /dev/sdd fails on reboot, so now osd.1 comes up with /dev/sde, but /dev/sde is actually the osd.2 drive, and osd.2 comes up with what was the osd.3 drive; then they all fall offline in sequence after the one failed osd.1. As expected, if I replace the failed drive and reboot, Linux enumerates the drives and gives them the original device names, and Ceph behaves properly by marking the affected OSD as down and out, while the remaining drives in sequence come up and recover gracefully.

Does anyone have any thoughts or experience with how one can ensure that Linux device names will always map to the physical device ID? I was thinking along the lines of a udev ruleset for the drives or something similar. Or is there a better way to create the OSD using the physical device ID? Basically, some sort of way to ensure that a specific physical drive always gets mapped to the same device name and OSD. Thanks for any insight or thoughts on this, Colin ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Cache Tier Flush = immediate base tier journal sync?
I think this could be part of what I am seeing. I found this post from back in 2013: http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/12083 which seems to describe a workaround for behaviour similar to what I am seeing.

The constant small-block IO I was seeing looks like it was either the pg log and info updates, or FS metadata. I have been going through the blktraces I did today and 90% of the time I am just seeing 8kb writes and journal writes. I think the journal and filestore settings I have been adjusting have just been moving the data sync around the benchmark timeline and altering when the journal starts throttling.

It seems that with small IOs the metadata overhead takes several times longer than the actual data writing. This probably also explains why a full SSD OSD is faster than a HDD+SSD even for brief bursts of IO. In the thread I posted above, it seems that adding something like flashcache can massively help overcome this problem, so this is something I might look into. It's a shame I didn't get BBWC with my OSD nodes, as this would have also likely alleviated the problem with a lot less hassle.

Ah, no, you're right. With the bench command it all goes into one object, it's just a separate transaction for each 64k write. But again, depending on flusher and throttler settings in the OSD, and the backing FS' configuration, it can be a lot of individual updates; in particular, every time there's a sync it has to update the inode. Certainly that'll be the case in the described configuration, with relatively low writeahead limits on the journal but high sync intervals: once you hit the limits, every write will get an immediate flush request. But none of that should have much impact on your write amplification tests unless you're actually using osd bench to test it. You're more likely to be seeing the overhead of the pg log entry, pg info change, etc. that's associated with each write. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Cache Tier Flush = immediate base tier journal sync?
On Wed, Mar 18, 2015 at 11:10 PM, Christian Balzer ch...@gol.com wrote: Hello, On Wed, 18 Mar 2015 11:05:47 -0700 Gregory Farnum wrote: On Wed, Mar 18, 2015 at 8:04 AM, Nick Fisk n...@fisk.me.uk wrote: Hi Greg, Thanks for your input; I completely agree that we cannot expect developers to fully document what impact each setting has on a cluster, particularly in a performance-related way. That said, if you or others could spare some time for a few pointers it would be much appreciated, and I will endeavour to create some useful results/documents that are more relevant to end users.

I have taken on board what you said about the WB throttle and have been experimenting with it by switching it on and off. I know it's a bit of a blunt configuration change, but it was useful for understanding its effect. With it off, I initially see quite a large performance increase, but over time it actually starts to slow the average throughput down. Like you said, I am guessing this is to do with it making sure the journal doesn't get too far ahead, leaving it with massive syncs to carry out. One thing I do see with the WBT enabled, and to some extent with it disabled, is that there are large periods of small-block writes at the max speed of the underlying SATA disk (70-80 IOPS). Here are 2 blktrace seekwatcher traces of performing an OSD bench (64 KB IOs for 500 MB) where this behaviour can be seen.

If you're doing 64k IOs then I believe it's creating a new on-disk file for each of those writes. How that's laid out on disk will depend on your filesystem and the specific config options that we're using to try to avoid running too far ahead of the journal.

Could you elaborate on that a bit? I would have expected those 64KB writes to go to the same object (file) until it is full (4MB), because this behavior would explain some (if not all) of the write amplification I've seen in the past with small writes (see the SSD Hardware recommendation thread).

Ah, no, you're right. With the bench command it all goes into one object, it's just a separate transaction for each 64k write. But again, depending on flusher and throttler settings in the OSD, and the backing FS' configuration, it can be a lot of individual updates; in particular, every time there's a sync it has to update the inode. Certainly that'll be the case in the described configuration, with relatively low writeahead limits on the journal but high sync intervals: once you hit the limits, every write will get an immediate flush request. But none of that should have much impact on your write amplification tests unless you're actually using osd bench to test it. You're more likely to be seeing the overhead of the pg log entry, pg info change, etc. that's associated with each write. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] PGs issue
Hello, everyone! I have created a Ceph cluster (v0.87.1-1) using the info from the 'Quick deploy' page (http://docs.ceph.com/docs/master/start/quick-ceph-deploy/), with the following setup: 1 x admin/deploy node; 3 x OSD and MON nodes; each OSD node has 2 x 8 GB HDDs. The setup was made using VirtualBox images, on Ubuntu 14.04.2.

After performing all the steps, the 'ceph health' output lists the cluster in the HEALTH_WARN state, with the following details: HEALTH_WARN 64 pgs degraded; 64 pgs stuck degraded; 64 pgs stuck unclean; 64 pgs stuck undersized; 64 pgs undersized; too few pgs per osd (10 < min 20)

The output of 'ceph -s':

    cluster b483bc59-c95e-44b1-8f8d-86d3feffcfab
     health HEALTH_WARN 64 pgs degraded; 64 pgs stuck degraded; 64 pgs stuck unclean; 64 pgs stuck undersized; 64 pgs undersized; too few pgs per osd (10 < min 20)
     monmap e1: 3 mons at {osd-003=192.168.122.23:6789/0,osd-002=192.168.122.22:6789/0,osd-001=192.168.122.21:6789/0}, election epoch 6, quorum 0,1,2 osd-001,osd-002,osd-003
     osdmap e20: 6 osds: 6 up, 6 in
      pgmap v36: 64 pgs, 1 pools, 0 bytes data, 0 objects
            199 MB used, 18166 MB / 18365 MB avail
            64 active+undersized+degraded

I have tried to increase the pg_num and pgp_num to 512, as advised at http://ceph.com/docs/master/rados/operations/placement-groups/#a-preselection-of-pg-num, but Ceph refused to do that, with the following error:

Error E2BIG: specified pg_num 512 is too large (creating 384 new PGs on ~6 OSDs exceeds per-OSD max of 32)

After changing the pg*_num to 256, as advised at http://ceph.com/docs/master/rados/operations/placement-groups/#choosing-the-number-of-placement-groups, the warning changed to:

health HEALTH_WARN 256 pgs degraded; 256 pgs stuck unclean; 256 pgs undersized

What is the issue behind these warnings, and what do I need to do to fix it? I'm a newcomer in the Ceph world, so please don't shoot me if this issue has been answered / discussed countless times before :) I have searched the web and the mailing list for answers, but I couldn't find a valid solution. Any help is highly appreciated. Thank you! Regards, Bogdan ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] PGs issue
-----Original Message----- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Bogdan SOLGA Sent: 19 March 2015 20:51 To: ceph-users@lists.ceph.com Subject: [ceph-users] PGs issue

Hello, everyone! I have created a Ceph cluster (v0.87.1-1) using the info from the 'Quick deploy' page, with the following setup: 1 x admin/deploy node; 3 x OSD and MON nodes; each OSD node has 2 x 8 GB HDDs. The setup was made using VirtualBox images, on Ubuntu 14.04.2. After performing all the steps, the 'ceph health' output lists the cluster in the HEALTH_WARN state, with the following details: HEALTH_WARN 64 pgs degraded; 64 pgs stuck degraded; 64 pgs stuck unclean; 64 pgs stuck undersized; 64 pgs undersized; too few pgs per osd (10 < min 20)

The output of 'ceph -s':

    cluster b483bc59-c95e-44b1-8f8d-86d3feffcfab
     health HEALTH_WARN 64 pgs degraded; 64 pgs stuck degraded; 64 pgs stuck unclean; 64 pgs stuck undersized; 64 pgs undersized; too few pgs per osd (10 < min 20)
     monmap e1: 3 mons at {osd-003=192.168.122.23:6789/0,osd-002=192.168.122.22:6789/0,osd-001=192.168.122.21:6789/0}, election epoch 6, quorum 0,1,2 osd-001,osd-002,osd-003
     osdmap e20: 6 osds: 6 up, 6 in
      pgmap v36: 64 pgs, 1 pools, 0 bytes data, 0 objects
            199 MB used, 18166 MB / 18365 MB avail
            64 active+undersized+degraded

I have tried to increase the pg_num and pgp_num to 512, as advised, but Ceph refused to do that, with the following error: Error E2BIG: specified pg_num 512 is too large (creating 384 new PGs on ~6 OSDs exceeds per-OSD max of 32). After changing the pg*_num to 256, as advised, the warning changed to: health HEALTH_WARN 256 pgs degraded; 256 pgs stuck unclean; 256 pgs undersized. What is the issue behind these warnings, and what do I need to do to fix it?

It's basically telling you that your currently available OSDs don't meet the requirements to satisfy the number of replicas you have requested. What replica size have you configured for that pool?

I'm a newcomer in the Ceph world, so please don't shoot me if this issue has been answered / discussed countless times before :) I have searched the web and the mailing list for answers, but I couldn't find a valid solution. Any help is highly appreciated. Thank you! Regards, Bogdan ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
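A quick way to check, hedged since the pool name may differ (the single pool in the 'ceph -s' output is most likely the default 'rbd' pool):

# ceph osd pool get rbd size          # replica count requested for the pool
# ceph osd tree                       # confirm the OSD crush weights are non-zero
# ceph osd pool set rbd size 2        # e.g. reduce to 2 replicas if that matches your intent

With 'undersized' PGs on a 3-host/6-OSD virtual cluster, the usual culprits are a replica size larger than what CRUSH can place, or near-zero crush weights from the tiny 8 GB disks.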
[ceph-users] OSD + Flashcache + udev + Partition uuid
I'm looking at trialling OSD's with a small flashcache device over them to hopefully reduce the impact of metadata updates when doing small block io. Inspiration from here:- http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/12083 One thing I suspect will happen, is that when the OSD node starts up udev could possibly mount the base OSD partition instead of flashcached device, as the base disk will have the ceph partition uuid type. This could result in quite nasty corruption. I have had a look at the Ceph udev rules and can see that something similar has been done for encrypted OSD's. Am I correct in assuming that what I need to do is to create a new partition uuid type for flashcached OSD's and then create a udev rule to activate these new uuid'd OSD's once flashcache has finished assembling them? Many Thanks, Nick ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] OSD + Flashcache + udev + Partition uuid
On Thu, Mar 19, 2015 at 2:41 PM, Nick Fisk n...@fisk.me.uk wrote: I'm looking at trialling OSD's with a small flashcache device over them to hopefully reduce the impact of metadata updates when doing small block io. Inspiration from here:- http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/12083 One thing I suspect will happen, is that when the OSD node starts up udev could possibly mount the base OSD partition instead of flashcached device, as the base disk will have the ceph partition uuid type. This could result in quite nasty corruption. I have had a look at the Ceph udev rules and can see that something similar has been done for encrypted OSD's. Am I correct in assuming that what I need to do is to create a new partition uuid type for flashcached OSD's and then create a udev rule to activate these new uuid'd OSD's once flashcache has finished assembling them? I haven't worked with the udev rules in a while, but that sounds like the right way to go. -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
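As a sketch of that approach (the GUID below is deliberately made up, generate your own with uuidgen; the activation script is a hypothetical placeholder for whatever assembles your flashcache device):

# tag the backing partition with the new type GUID instead of the standard Ceph OSD one
sgdisk --typecode=1:a2a45661-01b9-4e65-bd0f-000000000001 /dev/sdb

# /etc/udev/rules.d/96-flashcache-osd.rules
ENV{ID_PART_ENTRY_TYPE}=="a2a45661-01b9-4e65-bd0f-000000000001", \
  RUN+="/usr/local/sbin/flashcache-osd-activate %k"

Ceph's own 95-ceph-osd.rules matches on ID_PART_ENTRY_TYPE in the same way, so once the backing partition no longer carries the standard OSD type GUID it will not be mounted directly, and your rule can assemble the flashcache device first and then activate the OSD on top of it.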
Re: [ceph-users] Issue with Ceph mons starting up- leveldb store
On 19/03/2015, at 15.50, Andrew Diller dill...@gmail.com wrote: We moved the data dir (/var/lib/ceph/mon) over from one of the good ones to this 3rd node, but it won't start. We see this error, after which no further logging occurs:

2015-03-19 06:25:05.395210 7fcb57f1c7c0 -1 failed to create new leveldb store
2015-03-19 06:25:05.417716 7f272ae0d7c0 0 ceph version 0.61.9 (7440dcd135750839fa0f00263f80722ff6f51e90), process ceph-mon, pid 37967

Does anyone have an idea why the mon process would have issues creating the leveldb store (we've seen this error since the outage) and where does it create it? Is it part of the paxos implementation?

Just guessing... maybe the simple, often-seen root cause: permissions on the dirs along the path. /Steffen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
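Concretely, a few things worth checking on the 3rd node (the mon name and paths are guesses; adjust to your cluster and mon id):

# ls -ld /var/lib/ceph/mon/ceph-c
# ls -l /var/lib/ceph/mon/ceph-c/store.db | head
# df -h /var/lib/ceph/mon

The mon keeps its leveldb store in store.db under the mon data dir, so wrong ownership/permissions after the copy, or a full disk, are both consistent with 'failed to create new leveldb store'.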
Re: [ceph-users] cciss driver package for RHEL7
On 19/03/2015, at 15.57, O'Reilly, Dan daniel.orei...@dish.com wrote: I understand there's a KMOD_CCISS package available. However, I can't find it for download. Anybody have any ideas?

I believe HP swapped cciss for the hpsa (Smart Array) driver long ago… so maybe download the latest cciss source and compile it yourself, or… Sourceforge (http://cciss.sourceforge.net/) says: *New* The cciss driver has been removed from RHEL7 and SLES12. If you really want cciss on RHEL7, check out the elrepo (http://elrepo.org/) directory. /Steffen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
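If you go the elrepo route, the steps are roughly as follows (package names per elrepo's conventions; check elrepo.org for the current elrepo-release URL for EL7):

# rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
# yum install http://www.elrepo.org/elrepo-release-7.0-2.el7.elrepo.noarch.rpm
# yum --enablerepo=elrepo install kmod-cciss

For installing RHEL 7 itself onto a cciss-only controller, elrepo also publishes driver update disk (DUD) images for many of its kmods, which can be fed to the installer at boot time.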
Re: [ceph-users] cciss driver package for RHEL7
The problem with using the hpsa driver is that I need to install RHEL 7.1 on a ProLiant system using the SmartArray 400 controller. Therefore, I need a driver that supports it to even install RHEL 7.1. RHEL 7.1 doesn't generically recognize that controller out of the box.

From: Steffen W Sørensen [mailto:ste...@me.com] Sent: Thursday, March 19, 2015 10:08 AM To: O'Reilly, Dan Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] cciss driver package for RHEL7

On 19/03/2015, at 15.57, O'Reilly, Dan daniel.orei...@dish.com wrote: I understand there's a KMOD_CCISS package available. However, I can't find it for download. Anybody have any ideas?

I believe HP swapped cciss for the hpsa (Smart Array) driver long ago… so maybe download the latest cciss source and compile it yourself, or… Sourceforge (http://cciss.sourceforge.net/) says: *New* The cciss driver has been removed from RHEL7 and SLES12. If you really want cciss on RHEL7, check out the elrepo (http://elrepo.org/) directory. /Steffen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com