Re: [ceph-users] pgs inconsistent
On 15.08.2019 16:38, huxia...@horebdata.cn wrote:
Dear folks, I had a Ceph cluster with replication 2, 3 nodes, each node with 3 OSDs, on Luminous 12.2.12. Some days ago I had one OSD go down (the disk is still fine) due to a rocksdb crash. I tried to restart that OSD but failed, so I tried to rebalance but encountered inconsistent PGs. What can I do to make the cluster work again? Thanks a lot for helping me out. Samuel

# ceph -s
  cluster:
    id:     289e3afa-f188-49b0-9bea-1ab57cc2beb8
    health: HEALTH_ERR
            pauserd,pausewr,noout flag(s) set
            191444 scrub errors
            Possible data damage: 376 pgs inconsistent
  services:
    mon: 3 daemons, quorum horeb71,horeb72,horeb73
    mgr: horeb73(active), standbys: horeb71, horeb72
    osd: 9 osds: 8 up, 8 in
         flags pauserd,pausewr,noout
  data:
    pools:   1 pools, 1024 pgs
    objects: 524.29k objects, 1.99TiB
    usage:   3.67TiB used, 2.58TiB / 6.25TiB avail
    pgs:     645 active+clean
             376 active+clean+inconsistent
             3   active+clean+scrubbing+deep

That is a lot of inconsistent pg's. When you say replication = 2, do you mean size=3 min_size=2, or size=2 min_size=1? The reason I ask is that min_size=1 is a well-known way to get into lots of problems: one disk can accept a write alone, and before it is recovered/backfilled that drive can die.

If you have min_size=1 I would recommend you set min_size=2 as the first step, to avoid creating more inconsistency while troubleshooting. If you have the space for it in the cluster you should also set size=3.

If you run "ceph health detail" you will get a list of the pg's that are inconsistent. Check whether there is a repeat-offender OSD in that list of pg's, and check that disk for issues: look at dmesg, the logs of the OSD, and whether there are SMART errors. You can try to repair the inconsistent pg's automatically by running "ceph pg repair [pg id]", but make sure the hardware is good first.

good luck
Ronny
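In practice, the checks described above look roughly like this (the pg id 2.5 is only a placeholder, not taken from the cluster above):

# list the inconsistent pgs and the osds they map to
ceph health detail | grep inconsistent
# inspect what kind of inconsistency a given pg has (Luminous and later)
rados list-inconsistent-obj 2.5 --format=json-pretty
# once the underlying disk is known to be healthy, repair that pg
ceph pg repair 2.5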
Re: [ceph-users] VM management setup
Proxmox VE is a simple solution: https://www.proxmox.com/en/proxmox-ve It is based on Debian, can administer an internal ceph cluster or connect to an externally connected one, and has an easy and almost self-explanatory web interface. Good luck in your search!
Ronny

On 05.04.2019 21:34, jes...@krogh.cc wrote:
Hi. Knowing this is a bit off-topic, but seeking recommendations and advice anyway. We're looking for a "management" solution for VMs - currently in the 40-50 VM range - and would like better tooling for managing them: potentially migrating them across multiple hosts, setting up block devices, etc. This is only to be used internally in a department where a bunch of engineering people will manage it; no customers and that kind of thing. Up until now we have been using virt-manager with KVM and have been quite satisfied while we were in the "few VMs" stage, but it seems like time to move on. Thus we're looking for something "simple" that can help manage a ceph+kvm based setup - the simpler and more to the point the better. Any recommendations? We found a lot of names already: OpenStack, CloudStack, Proxmox ... but recommendations are truly welcome. Thanks.
Re: [ceph-users] v14.2.0 Nautilus released
With Debian buster frozen, if there are issues with ceph on Debian that would best be fixed in Debian, now is the last chance to get anything into buster before the next release. It is also important to get mimic and luminous packages built for buster, since you want to avoid a situation where you have to upgrade both the OS and ceph at the same time.
kind regards
Ronny Aasen

On 20.03.2019 07:09, Alfredo Deza wrote: There aren't any Debian packages built for this release because we haven't updated the infrastructure to build (and test) Debian packages yet.

On Tue, Mar 19, 2019 at 10:24 AM Sean Purdy wrote: Hi, Will debian packages be released? I don't see them in the nautilus repo. I thought that Nautilus was going to be debian-friendly, unlike Mimic. Sean

On Tue, 19 Mar 2019 14:58:41 +0100 Abhishek Lekshmanan wrote: We're glad to announce the first release of the Nautilus v14.2.0 stable series. There have been a lot of changes across components from the previous Ceph releases, and we advise everyone to go through the release and upgrade notes carefully.
[ceph-users] debian packages on download.ceph.com
On 2019-02-12 Debian buster went into soft freeze (https://release.debian.org/buster/freeze_policy.html), so all the Debian developers are hard at work getting buster ready for release. It would be really awesome if we could get Debian buster packages built on http://download.ceph.com/ both for luminous and mimic, so one can test upgrades. Since there is no mimic on stretch, we are forced to upgrade to buster before upgrading ceph to mimic. This is basically the last chance to try ceph on buster and find all the bugs potentially affecting ceph on Debian before the release, while it is still possible to get them fixed.

On a related note: is the build infrastructure for ceph on git somewhere?

kind regards
Ronny Aasen
Re: [ceph-users] Mimic 13.2.3?
On 09.01.2019 17:27, Matthew Vernon wrote: Hi, On 08/01/2019 18:58, David Galloway wrote: The current distro matrix is: Luminous: xenial centos7 trusty jessie stretch; Mimic: bionic xenial centos7. Thanks for clarifying :) This may have been different in previous point releases because, as Greg mentioned in an earlier post in this thread, the release process has changed hands and I'm still working on getting a solid/bulletproof process documented, in place, and (more) automated. I wouldn't be the final decision maker, but if you think we should be building Mimic packages for Debian (for example), we could consider it. The build process should support it, I believe. Could I suggest building Luminous for Bionic, and Mimic for Buster, please?

Getting mimic and luminous built for buster would be awesome. It would let us start some testing on mimic, but it would also allow us to detect and fix potential bugs before buster hard freezes. It is important to get luminous built as well, since we do not want to upgrade both the OS and ceph in the same process - too many moving parts. Since it is impossible to get (official) mimic on current Debian stable, one would assume you first upgrade Debian to buster while still running luminous, and afterwards upgrade luminous to mimic.

kind regards
Ronny Aasen
Re: [ceph-users] disk controller failure
On 13.12.2018 18:19, Alex Gorbachev wrote: On Thu, Dec 13, 2018 at 10:48 AM Dietmar Rieder wrote: Hi Cephers, one of our OSD nodes is experiencing a disk controller problem/failure (frequent resetting), so the OSDs on this controller are flapping (up/down in/out). I will hopefully get the replacement part soon. I have some simple questions: what are the best steps to take now, before and after replacement of the controller? - mark down and shut down all OSDs on that node? - wait until rebalance is finished - replace the controller - just restart the OSDs? Or redeploy them, since they still hold data? We are running ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous (stable) on CentOS 7.5. Sorry for my naive questions.

I usually do "ceph osd set noout" first to prevent any recoveries, then replace the hardware and make sure all OSDs come back online, then "ceph osd unset noout". Best regards, Alex

Setting noout prevents the osd's from being marked out and hence from triggering rebalancing. Use it when you are doing a short fix and do not want rebalancing to start because you know the data will be available again shortly, e.g. a reboot or similar. If osd's are flapping you normally want them out of the cluster, so they do not impact performance any more.

kind regards
Ronny Aasen
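A rough sketch of the noout workflow described above (the osd ids are placeholders for the osds behind the failing controller):

ceph osd set noout
systemctl stop ceph-osd@10 ceph-osd@11
# ... replace the controller, boot the node, then:
systemctl start ceph-osd@10 ceph-osd@11
ceph -s            # wait until all osds are up and pgs are active+clean
ceph osd unset noout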
Re: [ceph-users] KVM+Ceph: Live migration of I/O-heavy VM
On 11.12.2018 12:59, Kevin Olbrich wrote: Hi! Currently I plan a migration of a large VM (MS Exchange, 300 mailboxes and 900GB DB) from qcow2 on ext4 (RAID1) to an all-flash Ceph luminous cluster (which already holds lots of images). The server has access to both local and cluster storage; I only need to live-migrate the storage, not the machine. I have never used live migration as it can cause more issues, and the VMs that are already migrated had planned downtime. Taking the VM offline and converting/importing using qemu-img would take some hours, but I would like to keep serving clients, even if it is slower. The VM is I/O-heavy in terms of the old storage (LSI/Adaptec with BBU). There are two HDDs bound as RAID1 which are constantly under 30% - 60% load (this goes up to 100% during reboots, updates or login prime-time). What happens when either the local compute node or the ceph cluster fails (degraded)? Or the network is unavailable? Are all writes performed to both locations? Is this fail-safe? Or does the VM crash in the worst case, which can lead to a dirty shutdown for MS-EX DBs?

The disk stays on the source location until the migration is finalized. If the local compute node crashes and the VM dies with it before the migration is done, the disk is on the source location as expected. If nodes in the ceph cluster die but the cluster is operational, ceph just self-heals and the migration finishes. If the cluster dies hard enough to actually break, the migration will time out and abort, and the disk remains on the source location. If the network is unavailable the transfer will also time out.

good luck
Ronny Aasen
Re: [ceph-users] KVM+Ceph: Live migration of I/O-heavy VM
On 11.12.2018 17:39, Lionel Bouton wrote: Le 11/12/2018 à 15:51, Konstantin Shalygin a écrit : Currently I plan a migration of a large VM (MS Exchange, 300 mailboxes and 900GB DB) from qcow2 on ext4 (RAID1) to an all-flash Ceph luminous cluster (which already holds lots of images). [...] The node currently has 4GB free RAM and 29GB listed as cache / available. These numbers need caution because we have "tuned" enabled, which causes de-duplication of RAM, and this host runs about 10 Windows VMs. During reboots or updates, RAM can get full again. Maybe I am too cautious about live storage migration, maybe I am not. What are your experiences or advice? Thank you very much!

I have read your message twice and still can't figure out what your question is. Do you need to move your block image from some storage to Ceph? No, you can't do this without downtime because of fs consistency. You can easily migrate your filesystem via rsync, for example, with a small downtime for a VM reboot.

I believe the OP is trying to use the storage migration feature of QEMU. I've never tried it and I wouldn't recommend it (probably not very tested and there is a large window for failure).

I use the qemu storage migration feature via the proxmox webui several times a day, never any issues. I regularly migrate between ceph rbd, local directories, shared lvm over fibre channel, and nfs servers. Super easy and convenient.

Ronny Aasen
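Outside of Proxmox, the same QEMU feature can be driven through libvirt; a rough sketch (the domain name, disk target and destination path are made up, and depending on the libvirt version the job may need extra flags):

# live-copy the disk of a running guest to a new destination, then pivot the guest onto it
virsh blockcopy exchange-vm vda --dest /mnt/newstore/exchange.qcow2 \
    --format qcow2 --wait --verbose --pivot
# copying onto an rbd-backed target instead requires passing a full <disk> definition via --xml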
Re: [ceph-users] Ceph Bluestore : Deep Scrubbing vs Checksums
On 22.11.2018 17:06, Eddy Castillon wrote: Hello dear ceph users: We are running a ceph cluster with Luminous (BlueStore). As you may know, this new ceph version has a new feature called "checksums". I would like to ask if this feature replaces deep-scrub. In our cluster we run deep-scrub every month, however the impact on performance is high. Source: ceph's documentation: Checksums: BlueStore calculates, stores, and verifies checksums for all data and metadata it stores. Any time data is read off of disk, a checksum is used to verify the data is correct before it is exposed to any other part of the system (or the user).

Checksums and deep-scrub do different things, and you want to keep doing both. A checksum helps determine whether the data is OK or not when it is read off the disk. But if data sits idle on a drive it can become unreadable due to bad blocks over time, and if the data is never read you can end up in a situation where all of an object's replicas have become unreadable. Deep-scrub periodically reads the data on the drive to verify it is still readable and correct (checking against the other replicas and the checksums). You can schedule deep-scrub to run in off-peak hours.

kind regards
Ronny Aasen
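Pushing scrubs into off-peak hours is done with the osd scrub time-window options; a minimal example (the 22:00-06:00 window is only illustrative):

[osd]
# only start (deep-)scrubs between 22:00 and 06:00 local time
osd scrub begin hour = 22
osd scrub end hour = 6

or, applied at runtime:

ceph tell osd.* injectargs '--osd_scrub_begin_hour 22 --osd_scrub_end_hour 6'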
Re: [ceph-users] https://ceph-storage.slack.com
On 18.09.2018 21:15, Alfredo Daniel Rezinovsky wrote: Can anyone add me to this slack? with my email alfrenov...@gmail.com Thanks.

Why would a ceph slack be invite-only? Also, is the slack bridged to matrix? If so, what is the room id?

kind regards
Ronny Aasen
Re: [ceph-users] CephFS performance.
On 10/4/18 7:04 AM, jes...@krogh.cc wrote: Hi All. First, thanks for the good discussion and strong answers I've gotten so far. The current cluster setup is 4 hosts x 10 x 12TB 7.2K RPM drives, 10GbitE, and metadata on rotating drives - 3x replication - 256GB memory in the OSD hosts and 32+ cores, behind a Perc with each disk as RAID0 and BBWC. Planned changes: - get 1-2 more OSD hosts - experiment with EC pools for CephFS - MDS onto a separate host and metadata onto SSDs. I'm still struggling to get "non-cached" performance up to "hardware" speed - whatever that means. I do a "fio" benchmark using 10GB files, 16 threads, 4M block size - at which I can "almost" sustainably fill the 10GbitE NIC. In this configuration I would have expected it to be "way above" 10Gbit speed and thus have the NIC not "almost" filled but fully filled - could that be the metadata activity? But on "big files" and reads that should not be much, right? The above is actually OK for production, thus not a big issue, just information. Single-threaded performance is still struggling.

Cold HDD (read from disk on the NFS-server end) / NFS performance:
jk@zebra01:~$ pipebench < /nfs/16GB.file > /dev/null
Summary: Piped 15.86 GB in 00h00m27.53s: 589.88 MB/second
Local page cache (just to show it isn't the profiling tool delivering limitations):
jk@zebra03:~$ pipebench < /nfs/16GB.file > /dev/null
Summary: Piped 29.24 GB in 00h00m09.15s: 3.19 GB/second
Now from the Ceph system:
jk@zebra01:~$ pipebench < /ceph/bigfile.file > /dev/null
Summary: Piped 36.79 GB in 00h03m47.66s: 165.49 MB/second

Can block/stripe-size be tuned? Does it make sense? Does read-ahead on the CephFS kernel client need tuning? What performance are other people seeing? Other thoughts - recommendations? On some of the shares we're storing pretty large files (GB size) and need the backup to move them to tape, so it is preferred to be capable of filling an LTO6 drive's write speed with a single thread. 40-ish 7.2K RPM drives should add up to more than the above, right? This is the only current load being put on the cluster, plus ~100MB/s recovery traffic.

The problem with single-threaded performance in ceph is that it reads the spindles serially: you are practically reading one drive at a time, and see a single disk's performance minus all the overheads from ceph, network, mds, etc. You do not get the combined performance of the drives, only one drive at a time. So the trick for ceph performance is to get more spindles working for you at the same time. There are ways to get more performance out of a single thread:
- faster components in the path, i.e. faster disk/network/cpu/memory
- larger pre-fetching/read-ahead; with a large enough read-ahead more osd's will participate in reading simultaneously. [1] shows a table of benchmarks with different read-ahead sizes.
- erasure coding. While erasure coding adds latency vs replicated pools, you get more spindles involved in reading in parallel, so for large sequential loads erasure coding can be a benefit.
- some sort of extra caching scheme. I have not looked at cachefiles, but it may provide some benefit. You can also play with different cephfs implementations: there is a FUSE client where you can play with different cache solutions, but generally the kernel client is faster.
- in rbd there is a fancy striping option, using --stripe-unit and --stripe-count. This would get more spindles running; perhaps consider using rbd instead of cephfs if it fits the workload (a sketch of both knobs follows below the link).
[1] https://tracker.ceph.com/projects/ceph/wiki/Kernel_client_read_ahead_optimization

good luck
Ronny Aasen
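For reference, the two knobs mentioned above look roughly like this (the sizes, monitor address and pool/image names are only examples):

# larger read-ahead for the cephfs kernel client: rasize is in bytes (here 64 MiB)
mount -t ceph mon1:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/secret,rasize=67108864

# rbd "fancy striping": spread each chunk of the image over more osds
rbd create mypool/myimage --size 100G --stripe-unit 65536 --stripe-count 16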
Re: [ceph-users] Bluestore vs. Filestore
On 03.10.2018 20:10, jes...@krogh.cc wrote: Your use case sounds like it might profit from the rados cache tier feature. It's a rarely used feature because it only works in very specific circumstances, but your scenario sounds like it might work. Definitely worth giving it a try. Also, dm-cache with LVM *might* help. But if your active working set is really just 400GB: the Bluestore cache should handle this just fine. Don't worry about "unequal" distribution; every 4MB chunk of every file will go to a random OSD.

I tried it out - and will try it more - but initial tests didn't really convince me.

One very powerful and simple optimization is moving the metadata pool to SSD only. Even if it's just 3 small but fast SSDs, that can make a huge difference to how fast your filesystem "feels".

They are ordered and will hopefully arrive very soon. Can I: 1) Add disks 2) Create pool 3) stop all MDS's 4) rados cppool 5) Start MDS? Yes, that's a cluster-down on CephFS, but it shouldn't take long. Or is there a better guide?

This post https://ceph.com/community/new-luminous-crush-device-classes/ and this document http://docs.ceph.com/docs/master/rados/operations/pools/ explain how the osd device class is used to define a crush placement rule. You can then set the crush_rule on the existing pool and ceph will move the data. No downtime needed.

kind regards
Ronny Aasen
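With device classes there is no need for a pool copy; a rough sketch (the rule name and the metadata pool name are placeholders):

# make a replicated rule that only picks ssd-class osds, then point the metadata pool at it
ceph osd crush rule create-replicated ssd-rule default host ssd
ceph osd pool set cephfs_metadata crush_rule ssd-rule
# ceph migrates the pgs onto the ssd osds in the background; the mds keeps running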
Re: [ceph-users] Bluestore vs. Filestore
On 02.10.2018 21:21, jes...@krogh.cc wrote: On 02.10.2018 19:28, jes...@krogh.cc wrote: In the cephfs world there is no central server that holds the cache; each cephfs client reads data directly from the osd's. -- I can accept this argument, but nevertheless: if I used Filestore, it would work.

Bluestore is fairly new though, so if your use case fits filestore better, there is no huge reason not to just use that. This also means no single point of failure, and you can scale out performance by spreading metadata tree information over multiple MDS servers, and scale out storage and throughput with added osd nodes. So if the cephfs client cache is not sufficient, you can look at the bluestore cache. http://docs.ceph.com/docs/mimic/rados/configuration/bluestore-config-ref/#cache-size

I have been there, but it seems to "not work" - I think the need to slice per OSD and statically allocate memory per OSD breaks the efficiency (but I cannot prove it).

Or you can look at adding an ssd layer over the spinning disks, with e.g. bcache. I assume you are using an ssd/nvram for the bluestore db already.

My current bluestore(s) are backed by 10TB 7.2K RPM drives, although behind BBWC. Can you elaborate on the "assumption"? As we're not doing that, I'd like to explore it.

https://ceph.com/community/new-luminous-bluestore/ - read about "multiple devices". You can split out the DB part of bluestore to a faster drive (ssd); many tend to put the db's for 4 spinners on a single ssd. The db is the osd metadata - it says where on the block device the objects are - and it increases the performance of bluestore significantly. You should also look at tuning the cephfs metadata servers: make sure the metadata pool is on fast ssd osd's, and tune the mds cache to the mds server's ram, so you cache as much metadata as possible.

Yes, we're in the process of doing that - I believe we're seeing the MDS suffering when we saturate a few disks in the setup - and they are sharing. Thus we'll move the metadata as per recommendations to SSD.

good luck
Ronny Aasen
Re: [ceph-users] Bluestore vs. Filestore
On 02.10.2018 19:28, jes...@krogh.cc wrote: Hi. Based on some recommendations we have set up our CephFS installation using bluestore*. We're trying to get a strong replacement for a "huge" xfs+NFS server - 100TB-ish in size. The current setup is a sizeable Linux host with 512GB of memory, one large Dell MD1200 or MD1220 - 100TB - and a Linux kernel NFS server. Since our "hot" dataset is < 400GB we can actually serve the hot data directly out of the host page-cache and never really touch the "slow" underlying drives, except when new bulk data are written, where a Perc with BBWC is consuming the data. In the CephFS + Bluestore world, Ceph is "deliberately" bypassing the host OS page-cache, so even when we have 4-5 x 256GB memory** in the OSD hosts it is really hard to create a synthetic test where the hot data does not end up being read off the underlying disks. Yes, the client-side page cache works very well, but in our scenario we have 30+ hosts pulling the same data over NFS. Is bluestore just a "bad fit"? Would Filestore "do the right thing"? Is the recommendation to make an SSD "overlay" on the slow drives? Thoughts? Jesper (* Bluestore should be the new and shiny future - right? ** Total mem 1TB+)

In the cephfs world there is no central server that holds the cache; each cephfs client reads data directly from the osd's. This also means no single point of failure, and you can scale out performance by spreading metadata tree information over multiple MDS servers, and scale out storage and throughput with added osd nodes. So if the cephfs client cache is not sufficient, you can look at the bluestore cache: http://docs.ceph.com/docs/mimic/rados/configuration/bluestore-config-ref/#cache-size Or you can look at adding an ssd layer over the spinning disks, with e.g. bcache. I assume you are using an ssd/nvram for the bluestore db already. You should also look at tuning the cephfs metadata servers: make sure the metadata pool is on fast ssd osd's, and tune the mds cache to the mds server's ram, so you cache as much metadata as possible.

good luck
Ronny Aasen
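The two cache knobs mentioned above are plain config options; an example ceph.conf fragment (the sizes are only illustrative and have to fit the hosts' RAM):

[osd]
# per-osd bluestore cache for hdd-backed osds, default is 1 GiB (here 4 GiB)
bluestore cache size hdd = 4294967296

[mds]
# memory the mds may use for its metadata cache (Luminous 12.2.1 and later), here 16 GiB
mds cache memory limit = 17179869184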
Re: [ceph-users] Slow Ceph: Any plans on torrent-like transfers from OSDs ?
Ceph is a distributed system; it scales by concurrent access to nodes. Generally a single client will access a single OSD at a time, i.e. the max possible single-thread read is the read speed of the drive, and the max possible write is roughly a single drive's write speed divided by (replication size - 1). But when you have many VMs accessing the same cluster, the load is spread all over (just like when you see recovery running). A single spinning disk should be able to do 100-150MB/s depending on make and model, even with the overhead of ceph and networking, so I still think 20MB/s is a bit on the low side, depending on how you benchmark. I would start by going through this benchmarking guide and see if you find some issues: https://tracker.ceph.com/projects/ceph/wiki/Benchmark_Ceph_Cluster_Performance

In order to get more single-thread performance out of ceph you must get faster individual parts (nvram disks, fast RAM and processors, fast network, etc.), or you can cheat by spreading the load over more disks - e.g. rbd fancy striping, or attaching multiple disks with individual controllers in the VM - or use caching and/or readahead. When it comes to cache tiering I would remove that; it does not get the love it needs, and Red Hat has even stopped supporting it in deployments. But you can use dm-cache or bcache on the osd's and/or rbd cache on the kvm clients.

good luck
Ronny Aasen

On 09.09.2018 11:20, Alex Lupsa wrote: Hi, Any ideas about the below? Thanks, Alex -- Hi, I have a really small homelab 3-node ceph cluster on consumer hardware - thanks to Proxmox for making it easy to deploy. The problem I am having is very, very bad transfer rates, i.e. 20MB/sec for both read and write on 17 OSDs with a cache layer. However, during recovery the speed hovers between 250 and 700MB/sec, which proves that the cluster IS capable of reaching way above those 20MB/sec in KVM. Reading the documentation, I see that during recovery "nearly all OSDs participate in resilvering a new drive" - kind of a torrent of data incoming from multiple sources at once, causing a huge deluge. However, I believe this does not happen during normal transfers, so my question is simply: are there any hidden tunables I can enable for this, with the implied cost of network and heavy usage of disks? Will there be in the future if not? I have tried disabling cephx, upgrading the network to 10gbit, using bigger journals and more bluestore cache, and disabling the debugging logs, as has been advised on the list. The only thing that did help a bit was cache tiering, but this only helps somewhat, as the ops do not get promoted unless I am very adamant about keeping programs in KVM open for very long times so that the writes/reads are promoted. To add insult to injury, once the cache gets full the whole 3-node cluster grinds to a full halt until I start forcefully evicting data from the cache... manually! So I am guessing a really bad misconfiguration on my side. The next step would be removing the cache layer and using those SSDs as bcache instead, as it seems to yield 5x the results, even though it does add yet another layer of complexity and RAM requirements.
Full config details: https://pastebin.com/xUM7VF9k

rados bench -p ceph_pool 30 write
Total time run:         30.983343
Total writes made:      762
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     98.3754
Stddev Bandwidth:       20.9586
Max bandwidth (MB/sec): 132
Min bandwidth (MB/sec): 16
Average IOPS:           24
Stddev IOPS:            5
Max IOPS:               33
Min IOPS:               4
Average Latency(s):     0.645017
Stddev Latency(s):      0.326411
Max latency(s):         2.08067
Min latency(s):         0.0355789
Cleaning up (deleting benchmark objects)
Removed 762 objects
Clean up completed and total clean up time: 3.925631

Thanks,
Alex
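When comparing against those numbers it also helps to benchmark with a single thread, since that is the case the original post struggles with; roughly (pool name as above, -t sets the number of concurrent operations, default 16):

rados bench -p ceph_pool 30 write -t 1 --no-cleanup
rados bench -p ceph_pool 30 seq -t 1
rados -p ceph_pool cleanup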
Re: [ceph-users] how to swap osds between servers
On 03.09.2018 17:42, Andrei Mikhailovsky wrote: Hello everyone, I am in the process of adding an additional osd server to my small ceph cluster as well as migrating from filestore to bluestore. Here is my setup at the moment: Ceph 12.2.5, running on Ubuntu 16.04 with latest updates; 3 x osd servers with 10x3TB SAS drives, 2 x Intel S3710 200GB ssd and 64GB ram in each server. The same servers are also mon servers. I am adding the following to the cluster: 1 x osd+mon server with 64GB of ram and 2 x Intel S3710 200GB ssds, adding 4 x 6TB disks and 2 x 3TB disks. Thus, the new setup will have the following configuration: 4 x osd servers with 8x3TB SAS drives and 1x6TB SAS drive, 2 x Intel S3710 200GB ssd and 64GB ram in each server. This will make sure that all servers have the same amount/capacity of drives. There will be 3 mon servers in total. As a result, I will have to remove 2 x 3TB drives from each of the existing three osd servers and place them into the new osd server, and add a 6TB drive to each osd server. As those 6 x 3TB drives which will be taken from the existing osd servers and placed in the new server still have data stored on them, what is the best way to do this? I would like to minimise the data migration all over the place as it creates havoc on the cluster performance. What is the best workflow to achieve the hardware upgrade? If I add the new osd host server into the cluster and physically take an osd disk from one server and place it in the other server, will it be recognised and accepted by the cluster?

Data will migrate no matter how you change the crushmap, and since you want to migrate to bluestore this is also unavoidable. If it is critical data and you want to minimize impact, I prefer to do it the slow and steady way: add a new bluestore drive to the new host with weight 0 and gradually increase its weight, while gradually lowering the weight of the filestore drive being removed. A worse option, if you do not have a drive to spare for that, is to gradually drain a drive, remove it from the cluster, move it over, zap and recreate it as bluestore, and gradually fill it again. But this takes longer, and if you have space issues it can be complicated. An even worse option is to move the osd drive over (with its journal and data) and have the cluster shuffle all the data around; this is a big impact, and then you are still running filestore, so you still need to migrate to bluestore.

kind regards
Ronny Aasen
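The gradual approach boils down to repeated crush reweights; a rough sketch (osd ids and step sizes are placeholders):

# bring the new bluestore osd in slowly ...
ceph osd crush reweight osd.30 0.5      # then 1.0, 1.5, ... up to its full weight
# ... while draining the old filestore osd
ceph osd crush reweight osd.12 2.0      # then 1.5, 1.0, ... down to 0
# wait for HEALTH_OK between steps; once empty, take the old osd out
ceph osd out 12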
Re: [ceph-users] Help Basically..
On 02.09.2018 17:12, Lee wrote: Should I just out the OSD's first, or completely zap them and recreate them? Or delete them and let the cluster repair itself? On the second node, when it started back up, I had problems with the journals for ID 5 and 7; they were also recreated. All the rest are still the originals. I know that some PG's are on both 24 and 5 and 7, i.e.

Personally I would never wipe a disk until the cluster is HEALTH_OK. Out them from the cluster, and if you need the slots for healthy disks you can remove them physically, but label them and store them together with their journals until you are HEALTH_OK.

kind regards
Ronny Aasen
Re: [ceph-users] ls operation is too slow in cephfs
What are you talking about when you say you have an MDS in a region? AFAIK only radosgw supports multisite and regions. It sounds like you have a cluster spread out over a geographical area, and this will have a massive impact on latency. What is the latency between all servers in the cluster?

kind regards
Ronny Aasen

On 25.07.2018 12:03, Surya Bala wrote: The time got reduced when an MDS from the same region became active. Each region has an MDS. The OSD nodes are in one region and the active MDS was in another region, hence the delay.

On Tue, Jul 17, 2018 at 6:23 PM, John Spray <jsp...@redhat.com> wrote: On Tue, Jul 17, 2018 at 8:26 AM Surya Bala <sooriya.ba...@gmail.com> wrote:
> Hi folks,
> We have a production cluster with 8 nodes and each node has 60 disks of size 6TB each. We are using cephfs and the FUSE client with a global mount point. We are doing rsync from our old server to this cluster; rsync is slow compared to a normal server.
> When we do 'ls' inside some folder which has a very large number of files, like 100k or 200k, the response is too slow.

The first thing to check is what kind of "ls" you're doing. Some systems colorize ls by default, and that involves statting every file in addition to listing the directory. Try with "ls --color=never". It also helps to be more specific about what "too slow" means. How many seconds, and how many files?

John
Re: [ceph-users] Reclaim free space on RBD images that use Bluestore?????
On 23.07.2018 22:18, Sean Bolding wrote: I have XenServers that connect via iSCSI to Ceph gateway servers that use lrbd and targetcli. On my ceph cluster the RBD images I create are used as storage repositories in XenServer for the virtual machine vdisks. Whenever I delete a virtual machine, XenServer shows that the repository size has decreased. This also happens when I mount a virtual drive in XenServer as a virtual drive in a Windows guest: if I delete a large file, such as an exported VM, it shows as deleted and the space as available. However, when I check in Ceph using ceph -s or ceph df, it still shows the space as being used. I checked everywhere and it seems there was a reference to it here https://github.com/ceph/ceph/pull/14727 but I am not sure if a way to trim or discard freed blocks was ever implemented. The only way I have found is to play musical chairs and move the VMs to different repositories and then completely remove the old RBD images in ceph. This is not exactly easy to do. Is there a way to reclaim free space on RBD images that use Bluestore? What commands do I use and where do I use them? Do I run them on the ceph cluster or from XenServer? Please help. Sean

I am not familiar with Xen, but it does sound like you have an rbd mounted with a filesystem on the XenServer. In that case it is the same as for other filesystems: deleted files are just marked deleted in the file allocation table, and the RBD space is "reclaimed" when the filesystem discards the now-unused blocks. On many filesystems you would run the fstrim command to discard freed blocks, optionally mounting the fs with the discard option. In XenServer > 6.5 there should be a button in XenCenter to reclaim freed space.

kind regards
Ronny Aasen
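On a Linux guest or gateway the reclaim itself is the usual filesystem discard; roughly (mountpoint and device are placeholders, and this only reaches ceph if every layer in between - targetcli/LIO, the rbd mapping - passes discards through):

# one-off: tell the filesystem to discard its unused blocks down to the rbd image
fstrim -v /mnt/repository
# or mount with continuous discard (higher overhead) via /etc/fstab:
# /dev/xvdb1  /mnt/repository  ext4  defaults,discard  0 2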
Re: [ceph-users] active+clean+inconsistent PGs after upgrade to 12.2.7
On 19. juli 2018 10:37, Robert Sander wrote: Hi, just a quick warning: we currently see active+clean+inconsistent PGs on two clusters after upgrading to 12.2.7. I created http://tracker.ceph.com/issues/24994 Regards

Did you upgrade from 12.2.5 or 12.2.6? It sounds like you hit the reason for the 12.2.7 release; read https://ceph.com/releases/12-2-7-luminous-released/ - 12.2.8 should bring features that can deal with the "objects are in sync but checksums are wrong" scenario.

kind regards
Ronny Aasen
Re: [ceph-users] ceph cluster
On 12. juni 2018 12:17, Muneendra Kumar M wrote: conf file as shown below. If I reconfigure my IP addresses from 10.xx.xx.xx to 192.xx.xx.xx by changing the public network and mon_host fields in ceph.conf, will my cluster work as it is? Below are my ceph.conf details. Any inputs will really help me understand more about this.

No. Changing the subnet of the cluster is a complex operation. Since you are using private IP addresses anyway, I would reconsider changing them, and only change them if there is no other way. This is the documentation for mimic on how to change a monitor's IP address:
http://docs.ceph.com/docs/mimic/rados/operations/add-or-rm-mons/#changing-a-monitor-s-ip-address-the-messy-way

kind regards
Ronny Aasen
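For completeness, the "messy way" in that document boils down to rewriting the monmap; very roughly (the mon name and the new address are placeholders):

ceph mon getmap -o /tmp/monmap               # grab the current monmap while the cluster is up
monmaptool --rm mon-a /tmp/monmap            # drop the monitor's old address
monmaptool --add mon-a 192.168.1.10:6789 /tmp/monmap
systemctl stop ceph-mon@mon-a
ceph-mon -i mon-a --inject-monmap /tmp/monmap
systemctl start ceph-mon@mon-a               # then repeat per monitor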
Re: [ceph-users] Ceph Mimic on Debian 9 Stretch
On 04.06.2018 21:08, Joao Eduardo Luis wrote: On 06/04/2018 07:39 PM, Sage Weil wrote: [1] http://lists.ceph.com/private.cgi/ceph-maintainers-ceph.com/2018-April/000603.html [2] http://lists.ceph.com/private.cgi/ceph-maintainers-ceph.com/2018-April/000611.html Just a heads up, seems the ceph-maintainers archives are not public. -Joao

The debian-gcc list is public: https://lists.debian.org/debian-gcc/2018/04/msg00137.html

Ronny Aasen
Re: [ceph-users] Ceph Mimic on Debian 9 Stretch
On 04. juni 2018 06:41, Charles Alva wrote: Hi Guys, When will the Ceph Mimic packages for Debian Stretch be released? I could not find the packages even after changing the sources.list.

I am also eager to test mimic on my ceph; debian-mimic only contains ceph-deploy at the moment.

kind regards
Ronny Aasen
Re: [ceph-users] How to normally expand OSD’s capacity?
On 10.05.2018 12:24, Yi-Cian Pu wrote: Hi All, We are wondering if there is any way to expand an OSD's capacity. We are studying this and conducted an experiment. However, in the result, the size of the expanded capacity is counted in the USED part rather than the AVAIL one. The following shows the process of our experiment:

1. We prepare a small cluster of luminous v12.2.4 and write some data into a pool. osd.1 is manually deployed and uses a disk partition of size 100GB (the whole disk is 320GB).

[root@workstation /]# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE USE    AVAIL  %USE  VAR  PGS
 0 hdd   0.28999 1.0      297G 27062M 271G    8.89 0.67  32
 1 hdd   0.0     1.0      100G 27062M 76361M 26.17 1.97  32
            TOTAL         398G 54125M 345G   13.27
MIN/MAX VAR: 0.67/1.97  STDDEV: 9.63

2. Then, we expand the disk partition used by osd.1 with the following steps: (1) stop the osd.1 daemon, (2) use the "parted" command to grow the disk partition by 50GB, (3) restart the osd.1 daemon.

3. After we do the above steps, we get the result that the expanded size is counted in the USED part:

[root@workstation /]# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE USE    AVAIL  %USE  VAR  PGS
 0 hdd   0.28999 1.0      297G 27063M 271G    8.89 0.39  32
 1 hdd   0.0     1.0      150G 78263M 76360M 50.62 2.21  32
            TOTAL         448G 102G   345G   22.94
MIN/MAX VAR: 0.39/2.21  STDDEV: 21.95

This is what we have tried, and the result looks very confusing. We'd really like to know if there is any way to properly expand an OSD's capacity. Any feedback or suggestions would be much appreciated.

You do not do this in ceph. You would normally not partition the osd drive; you use the whole drive, so you never get into the position of needing to grow it. You add space by adding osd's and adding nodes, so increasing osd size is not the normal approach. If you must, for some oddball reason, you can remove the osd (drain or destroy it), repartition, re-add the osd and let ceph backfill the drive. Or you can just make a new osd with the remaining disk space. Since the space increase will change the crushmap, there is no way to avoid some data movement anyway.

mvh
Ronny Aasen
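The "remove - repartition - re-add" route is simply the normal osd replacement procedure; roughly (the osd id and device are placeholders):

ceph osd out 1                      # drain it, wait for HEALTH_OK
systemctl stop ceph-osd@1
ceph osd purge 1 --yes-i-really-mean-it
# wipe / repartition the disk, then recreate the osd on the whole device
# (ceph-volume is available from Luminous 12.2.2 on)
ceph-volume lvm create --bluestore --data /dev/sdb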
Re: [ceph-users] 3 monitor servers to monitor 2 different OSD set of servers
On 26.04.2018 17:05, DHD.KOHA wrote: Hello, I am wondering if this is possible. I am currently running a ceph cluster consisting of 3 servers as monitors and 6 OSD servers that host the disk drives, that is 10x8T osds on each server. Since we did some cleaning up of old servers, I am able to create another set of OSDs with 3 servers having 16 drives of 10T each. Since adding the above servers to expand the current set of OSD servers doesn't seem to be a good idea according to the documentation, because the disk drives are of different sizes (8T and 10T), I wonder if it is possible to create another cluster using the same 3 monitors and have them monitor a second cluster as well, so that --cluster ceph and --cluster ceph2 processes are running on the same set of monitor servers.

I do not think you can easily make a new cluster that shares mon servers. But you can make a new class of hdd, e.g. call it hdd10 or something, and create/move some pools to use that class of device. It is the same cluster, but the pools will not share disks, so you need to "think" about them almost like separate clusters. That should be quite straightforward: https://ceph.com/community/new-luminous-crush-device-classes/

kind regards
Ronny Aasen
Re: [ceph-users] Cluster degraded after Ceph Upgrade 12.2.1 => 12.2.2
The difference in cost between 2 and 3 servers is not huge, but the reliability difference between a size 2/1 pool and a 3/2 pool is massive. A 2/1 pool is just a single fault during maintenance away from data loss, while you need multiple simultaneous faults and very bad luck to break a 3/2 pool. I would recommend rather using 2/2 pools if you are willing to accept a little downtime when a disk dies: cluster I/O would stop until the disks backfill to cover for the lost disk, but that is better than having inconsistent pg's or data loss because a disk crashed during a routine reboot, or because 2 disks failed. It is also worth reading this link, a good explanation: https://www.spinics.net/lists/ceph-users/msg32895.html If you have good backups and are willing to restore the whole pool, it is of course your privilege to run 2/1 pools, but be mindful of the risks of doing so.

kind regards
Ronny Aasen

BTW: I did not know Ubuntu automatically rebooted after an upgrade. You can probably avoid that reboot somehow in Ubuntu and do the restarts of services manually, if you wish to maintain service during the upgrade.

On 25.04.2018 11:52, Ranjan Ghosh wrote: Thanks a lot for your detailed answer. The problem for us, however, was that we use the Ceph packages that come with the Ubuntu distribution. If you do an Ubuntu upgrade, all packages are upgraded in one go and the server is rebooted. You cannot influence anything or start/stop services one by one, etc. This was concerning me, because the upgrade instructions didn't mention anything about an alternative or what to do in this case. But someone here enlightened me that - in general - it all doesn't matter that much *if you are just accepting a downtime*. And, indeed, it all worked nicely. We stopped all services on all servers, upgraded the Ubuntu version, rebooted all servers and were ready to go again. We didn't encounter any problems there. The only problem turned out to be our own fault and simply a firewall misconfiguration. And, yes, we're running "size:2 min_size:1" because we're on a very tight budget. If I understand correctly, this means: make changes to files on one server, *eventually* copy them to the other server. I hope this *eventually* means after a few minutes. Up until now I've never experienced *any* problems with file integrity with this configuration. In fact, Ceph is incredibly stable. Amazing. I have never ever had any issues whatsoever with broken files, partially written files, files that contain garbage, etc. - even after starting/stopping services, rebooting, etc. With GlusterFS and other cluster file systems I've experienced many such problems over the years, so this is what makes Ceph so great. I now have a lot of trust in Ceph, that it will eventually repair everything :-) And: if a file that was written a few seconds ago is really lost, it wouldn't be that bad for our use case. It's a web server. The most important stuff is in the DB. We have hourly backups of everything. In a huge emergency, we could even restore the backup from an hour ago if we really had to. Not nice, but if it happens every 6 years or so due to some freak hardware failure, I think it is manageable. I accept it's not the recommended/perfect solution if you have infinite amounts of money at your hands, but in our case, I think it's not extremely audacious either to do it like this, right?

On 11.04.2018 19:25, Ronny Aasen wrote: Ceph upgrades are usually not a problem: ceph has to be upgraded in the right order.
Normally, when each service is on its own machine, this is not difficult. But when you have mon, mgr, osd, mds, and clients on the same host you have to do it a bit carefully. I tend to have a terminal open with "watch ceph -s" running, and I never do another service until the health is OK again. First, apt upgrade the packages on all the hosts. This only updates the software on disk, not the running services. Then do the restart of services in the right order, and only on one host at a time.

mons: first restart the mon service on all mon-running hosts. All 3 mons are active at the same time, so there is no "shifting around", but make sure the quorum is OK again before you do the next mon.

mgr: then restart mgr on all hosts that run mgr. There is only one active mgr at a time, so here there will be a bit of shifting around, but it is only for statistics/management, so it may affect your "ceph -s" output but not cluster operation.

osd: restart osd processes one osd at a time; make sure the health is OK before doing the next osd process. Do this for all hosts that have osd's.

mds: restart the mds's one at a time. You will notice the standby mds taking over for the mds that was restarted. Do both.

clients: restart clients; that means remounting filesystems, migrating or restarting VMs, or restarting whatever process uses the old ceph libraries (a sketch of the per-service commands follows below).

about pools: s
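On systemd-based hosts, the per-service restarts described above map to something like the following (host names and osd ids are placeholders):

watch ceph -s                          # keep running in a second terminal
systemctl restart ceph-mon.target      # on each mon host, one host at a time
systemctl restart ceph-mgr.target      # on each mgr host
systemctl restart ceph-osd@12          # one osd at a time, wait for HEALTH_OK in between
systemctl restart ceph-mds.target      # one mds host at a time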
Re: [ceph-users] Cephalocon APAC 2018 report, videos and slides
On 24.04.2018 17:30, Leonardo Vaz wrote: Hi, Last night I posted the Cephalocon 2018 conference report on the Ceph blog[1], published the video recordings from the sessions on YouTube[2] and the slide decks on Slideshare[3]. [1] https://ceph.com/community/cephalocon-apac-2018-report/ [2] https://www.youtube.com/playlist?list=PLrBUGiINAakNgeLvjald7NcWps_yDCblr [3] https://www.slideshare.net/Inktank_Ceph/tag/cephalocon-apac-2018 I'd like to take the opportunity to apologize for the flood of posts on Twitter and Google+ about the video uploads last night. It seems that even though I disabled the checkbox for announcing new uploads on social media, YouTube decided to post them anyway. Sorry for the inconvenience. Kindest regards, Leo

Thanks to the presenters and yourself for your awesome work. This is a goldmine for those of us who could not attend. :)

kind regards
Ronny Aasen
Re: [ceph-users] configuration section for each host
On 24.04.2018 18:24, Robert Stanford wrote: In examples I see that each host has a section in ceph.conf, on every host (host-a has a section in its conf on host-a, but there's also a host-a section in the ceph.conf on host-b, etc.). Is this really necessary? I've been using just generic osd and monitor sections, and that has worked out fine so far. Am I setting myself up for unexpected problems?

Only if you want to override default values for that individual host. I have never had anything but generic sections. Ceph is moving more and more away from must-have information in the configuration file; in the next version you will probably not need initial monitors either, since they can be discovered via SRV DNS records (see the sketch below).

kind regards
Ronny Aasen
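The DNS-based mon lookup uses ordinary SRV records; a sketch of what such zone entries could look like (the domain and host names are made up, 6789 is the default mon port):

_ceph-mon._tcp.example.com. 3600 IN SRV 10 60 6789 mon1.example.com.
_ceph-mon._tcp.example.com. 3600 IN SRV 10 60 6789 mon2.example.com.
_ceph-mon._tcp.example.com. 3600 IN SRV 10 60 6789 mon3.example.com.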
Re: [ceph-users] osds with different disk sizes may killing performance (?? ?)
On 13. april 2018 05:32, Chad William Seys wrote: Hello, I think your observations suggest that, to a first approximation, filling drives with bytes to the same absolute level is better for performance than filling drives to the same percentage. Assuming a random distribution of PGs, this would cause the smallest drives to be as active as the largest drives. E.g. if every drive had 1TB of data, each would be equally likely to contain the PG of interest. Of course, as more data was added the smallest drives could not hold more and the larger drives would become more active, but at least the smaller drives would be as active as possible.

But in this case you would have a steep drop-off in performance: when you reach the fill level where the small drives do not accept more data, suddenly you would have a performance cliff where only your larger disks are doing new writes, and only the larger disks are doing reads on new data. It is also easier to make the logical connection while you are installing new nodes/disks than a year later when your cluster just happens to reach that fill level. It would also be an easier job balancing disks between nodes when you are adding osd's anyway and the new ones are mostly empty, rather than when your small osd's are full and your large disks have significant data on them.

kind regards
Ronny Aasen
Re: [ceph-users] Cluster degraded after Ceph Upgrade 12.2.1 => 12.2.2
Ceph upgrades are usually not a problem: ceph has to be upgraded in the right order. Normally, when each service is on its own machine, this is not difficult. But when you have mon, mgr, osd, mds, and clients on the same host you have to do it a bit carefully. I tend to have a terminal open with "watch ceph -s" running, and I never do another service until the health is OK again. First, apt upgrade the packages on all the hosts. This only updates the software on disk, not the running services. Then do the restart of services in the right order, and only on one host at a time.

mons: first restart the mon service on all mon-running hosts. All 3 mons are active at the same time, so there is no "shifting around", but make sure the quorum is OK again before you do the next mon.

mgr: then restart mgr on all hosts that run mgr. There is only one active mgr at a time, so here there will be a bit of shifting around, but it is only for statistics/management, so it may affect your "ceph -s" output but not cluster operation.

osd: restart osd processes one osd at a time; make sure the health is OK before doing the next osd process. Do this for all hosts that have osd's.

mds: restart the mds's one at a time. You will notice the standby mds taking over for the mds that was restarted. Do both.

clients: restart clients; that means remounting filesystems, migrating or restarting VMs, or restarting whatever process uses the old ceph libraries.

About pools: since you only have 2 osd's you can obviously not be running the recommended 3x replication pools. This makes me worry that you may be running size:2 min_size:1 pools, and are daily running the risk of data loss due to corruption and inconsistencies, especially when you restart osd's. If your pools are size:2 min_size:2 then your cluster will block when any osd is restarted, until the osd is up and healthy again, but you have less chance of data loss than with 2/1 pools. If you add an osd on a third host you can run size:3 min_size:2, the recommended config, where you have both redundancy and high availability.

kind regards
Ronny Aasen

On 11.04.2018 17:42, Ranjan Ghosh wrote: Ah, never mind, we've solved it. It was a firewall issue. The only thing that's weird is that it became an issue immediately after an update. Perhaps it has something to do with monitor nodes shifting around or something. Well, thanks again for your quick support, though. It's much appreciated. BR Ranjan

On 11.04.2018 17:07, Ranjan Ghosh wrote: Thank you for your answer. Do you have any specifics on which thread you're talking about? I would be very interested to read about a success story, because I fear that if I update the other node the whole cluster comes down.

On 11.04.2018 10:47, Marc Roos wrote: I think you have to update all osd's, mon's etc. I can remember running into a similar issue. You should be able to find more about this in the mailing list archive.

-Original Message- From: Ranjan Ghosh [mailto:gh...@pw6.de] Sent: Wednesday, 11 April 2018 16:02 To: ceph-users Subject: [ceph-users] Cluster degraded after Ceph Upgrade 12.2.1 => 12.2.2

Hi all, We have a two-node cluster (with a third "monitoring-only" node). Over the last months, everything ran *perfectly* smoothly. Today, I did an Ubuntu "apt-get upgrade" on one of the two servers. Among others, the ceph packages were upgraded from 12.2.1 to 12.2.2. A minor release update, one might think. But, to my surprise, after restarting the services, Ceph is now in a degraded state :-( (see below).
Only the first node - which is still on 12.2.1 - seems to be running. I did a bit of research and found this: https://ceph.com/community/new-luminous-pg-overdose-protection/ I did set "mon_max_pg_per_osd = 300" to no avail. I don't know if this is the problem at all. Looking at the status it seems we have 264 pgs, right? When I enter "ceph osd df" (which I found on another website claiming it should print the number of PGs per OSD), it just hangs (I need to abort with Ctrl+C). I hope anybody can help me. The cluster now works with the single node, but it is definitely quite worrying because we don't have redundancy. Thanks in advance, Ranjan

root@tukan2 /var/www/projects # ceph -s
  cluster:
    id:     19895e72-4a0c-4d5d-ae23-7f631ec8c8e4
    health: HEALTH_WARN
            insufficient standby MDS daemons available
            Reduced data availability: 264 pgs inactive
            Degraded data redundancy: 264 pgs unclean
  services:
    mon: 3 daemons, quorum tukan1,tukan2,tukan0
    mgr: tukan0(active), standbys: tukan2
    mds: cephfs-1/1/1 up {0=tukan2=up:active}
    osd: 2 osds: 2 up, 2 in
  data:
    pools:   3 pools, 264 pgs
    objects: 0 objects, 0 bytes
    usage:   0 kB used, 0 kB / 0 kB avail
    pgs:     100.000% pgs unknown
Re: [ceph-users] Bluestore and scrubbing/deep scrubbing
On 29.03.2018 20:02, Alex Gorbachev wrote: With a Luminous 12.2.4 cluster with Bluestore, I see a good deal of scrub and deep-scrub operations. I tried to find a reference, but there's nothing obvious out there - wasn't it supposed to no longer need scrubbing due to the CRC checks?

CRC gives you checks as you read data, i.e. you do not hand corrupt data to clients while thinking it is OK data. Scrubs periodically check your data for corruption by reading it and comparing it to the CRCs and the other replicas; this protects against bitrot [1]. CRC also helps the system know which object is good and which object is bad during a scrub. CRC is not a replacement for scrub, but a complement: it improves the quality of the data you provide to clients, and it makes it easier for scrub to detect errors.

kind regards
Ronny Aasen

[1] https://en.wikipedia.org/wiki/Data_degradation
Re: [ceph-users] split brain case
On 29.03.2018 11:13, ST Wong (ITSC) wrote: Hi, Thanks.

> of course the 4 osd's left working now want to self-heal by recreating all objects stored on the 4 split-off osd's and have a huge recovery job. And you risk that the osd's go into a too_full error, unless you have enough free space in your osd's to recreate all the data from the defective part of the cluster; or they will be stuck in recovery mode until you get the second room running, depending on your crush map.

Does that mean we have to give the 4 OSD machines sufficient space to hold all data, and thus the usable space will be halved?

Yes, if you want to be able to operate one room as if it were the whole cluster (HA) then you need this. Also, if you want to have 4+2 instead of 3+2 pool sizes to avoid the blocking during recovery, that would take a whole lot of extra space. You can optionally let the cluster run degraded with 4+2 while one room is down, or temporarily set pools to 2+2 while the other room is down, to reduce the space requirements.

> the point is that splitting the cluster hurts. And if HA is the most important thing, then you may want to check out rbd mirror.

We will consider it when there is budget to set up another ceph cluster for rbd mirror.

I do not know your needs or applications, but while you only have 2 rooms you may just think of it as a single cluster that happens to occupy 2 rooms. With that few osd's you should perhaps just put the cluster in a single room; the pain of splitting a cluster down the middle is quite significant, and I would perhaps use the resources to improve the redundancy of the networks between the buildings instead: have multiple paths between the buildings to prevent service disruption in the building that does not house the cluster. Having 5 mons is quite a lot; I think most clusters have 3 mons up to several hundred osd hosts. How many servers are your osd's split over? Keep in mind that ceph by default picks one osd from each host, so you would need a minimum of 4 osd hosts in total to be able to use 4+2 pools, and with only 4 hosts you have no failure domain to spare. But 4 hosts is the minimum sane starting point for a regular small cluster with 3+2 pools (you can lose a node and ceph self-heals as long as there is enough free space).

kind regards
Ronny Aasen
Re: [ceph-users] split brain case
On 29.03.2018 10:25, ST Wong (ITSC) wrote: Hi all, We put 8 (4+4) OSD and 5 (2+3) MON servers in server rooms in 2 buildings for redundancy. The buildings are connected through direct connection. While servers in each building have alternate uplinks. What will happen in case the link between the buildings is broken (application servers in each server room will continue to write to OSDs in the same room) ? Thanks a lot. Rgds /st wong my guesstimate is that the server room with 3 mons will retain quorum, and continue operation. the room with 2 mons will notice they are split out and block. assuming you have 3+2 pools and one of the objects is always in the other server room: some pg's will be active because you have 2 objects in the working room, but some pg's will be inactive until they can self-heal and backfill a second copy of the objects. i assume you could have 4+2 replication to avoid this issue. of course the 4 osd's left working now want to self-heal by recreating all objects stored on the 4 split off osd's and have a huge recovery job. and you may risk that the osd's go into a too_full error, unless you have free space in your osd's to recreate all the data in the defective part of the cluster. or they will be stuck in recovery mode until you get the second room running, this depends on your crush map. if you really need to split a cluster into separate rooms, i would have used 3 rooms, with redundant data paths between them. primary path between room A and C is direct. redundant path is via A-B-C. this should reduce the disaster if a single path is broken. with 1 mon in each room you can lose a whole room to power loss, and still have a working cluster. and you would only need 33% instead of 50% cluster capacity as free space in your cluster to be able to self-heal. the point is that splitting the cluster hurts. and if HA is the most important then you may want to check out rbd mirror. kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Separate BlueStore WAL/DB : best scenario ?
keep in mind that with 4+2 =6 erasure coding, ceph can not self heal if a node dies if you have only 6 nodes. that means that you have a degraded cluster with low performance, and higher risk until you replace or fix or buy a new node. it is kind of like loosing a disk in raid5 you have to scramble to fix it asap. if you have an additional node. ceph can self heal if a node dies and you can look at it on monday after the meeting... no stress. in my opinion ceph's self healing is one of THE killer apps for ceph. that makes a ceph cluster robust and reliable. pity not to take advantage of that in your design/pool configuration. kind regards Ronny Aasen On 22.03.2018 10:53, Hervé Ballans wrote: Le 21/03/2018 à 11:48, Ronny Aasen a écrit : On 21. mars 2018 11:27, Hervé Ballans wrote: Hi all, I have a question regarding a possible scenario to put both wal and db in a separate SSD device for an OSD node composed by 22 OSDs (HDD SAS 10k 1,8 To). I'm thinking of 2 options (at about the same price) : - add 2 SSD SAS Write Intensive (10DWPD) - or add a unique SSD NVMe 800 Go (it's the minimum capacity currently on the market !..) In both case, that's a lot of partitions on each SSD disk, especially on the second solution where we would have 44 partitions (22 WAL and 22 DB) ! Is this solution workable (I mean in term of i/o speeds), or is it unsafe despite the high PCIe bus transfer rate ? I just want to talk here about throughput performances, not data integrity on the node in case of SSD crashes... Thanks in advance for your advices, if you put the wal and db on the same device anyway, there is no real benefit to having a partition for each. the reason you can split them up is if you have them on different devices. Eg db on ssd, but wal on nvram. it is easier to just colocat wal and db into the same partition since they live on the same device in your case anyway. if you have too many osd's db's on the same ssd, you may end up with the ssd beeing the bottleneck. 4 osd's db's on a ssd have been a "golden rule" on the mailinglist for a while. for nvram you can possibly have some more. but the bottleneck is only one part of the problem. when the 22 partitions db nvram dies, it brings down 22 osd's at once and will be a huge pain on your cluster. (depending on how large it is...) i would spread the db's on more devices to reduce the bottleneck and failure domains in this situation. Hi Ronny, Thank you for your clear answer. OK for putting both wal and db on the same partition, I didn't have this information, but indeed it seems more interesting in my case (in particular if I choose the fastest device, i.e. NVMe*) I plan to have 6 OSDs nodes (same configuration for each) but I don't know yet if I will use replication (x3) or Erasure Coding (4+2 ?) pools. Also in both cases, I could eventually accept the loss of a node on a reduced time (replacement of the journals disk + OSDs reconfiguration). But you're right, I will start on a configuration where I will spread the db's on at least 2 fast disks. Regards, Hervé * Just for information, I look closely at the SAMSUNG PM1725 NVMe PCIe SSD. The (theorical) technical specifications seem interesting, especilly on the IOPS : up to 750K IOPS for Random Read and 120K IOPS for Random Write... 
kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Separate BlueStore WAL/DB : best scenario ?
On 21. mars 2018 11:27, Hervé Ballans wrote: Hi all, I have a question regarding a possible scenario to put both wal and db on a separate SSD device for an OSD node composed of 22 OSDs (HDD SAS 10k 1,8 To). I'm thinking of 2 options (at about the same price) : - add 2 SSD SAS Write Intensive (10DWPD) - or add a unique SSD NVMe 800 Go (it's the minimum capacity currently on the market !..) In both cases, that's a lot of partitions on each SSD disk, especially on the second solution where we would have 44 partitions (22 WAL and 22 DB) ! Is this solution workable (I mean in terms of i/o speeds), or is it unsafe despite the high PCIe bus transfer rate ? I just want to talk here about throughput performance, not data integrity on the node in case of SSD crashes... Thanks in advance for your advice, if you put the wal and db on the same device anyway, there is no real benefit to having a partition for each. the reason you can split them up is if you have them on different devices, e.g. db on ssd, but wal on nvram. it is easier to just colocate wal and db in the same partition since they live on the same device in your case anyway. if you have too many osd's db's on the same ssd, you may end up with the ssd being the bottleneck. 4 osd's db's on a ssd has been a "golden rule" on the mailing list for a while. for nvram you can possibly have some more. but the bottleneck is only one part of the problem. when the nvme device holding the 22 db partitions dies, it brings down 22 osd's at once and will be a huge pain on your cluster (depending on how large it is...). i would spread the db's over more devices to reduce the bottleneck and failure domains in this situation. kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
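For reference, creating one such OSD with its db (and the wal, which lives with the db unless told otherwise) on a separate fast device looks roughly like this on luminous with ceph-volume; the device names are placeholders and the db partition must be created beforehand:
# ceph-volume lvm create --bluestore --data /dev/sdc --block.db /dev/nvme0n1p1
With wal and db colocated on the same device there is no need for a separate --block.wal argument.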
Re: [ceph-users] Delete a Pool - how hard should be?
On 06. mars 2018 10:26, Max Cuttins wrote: On 05/03/2018 20:17, Gregory Farnum wrote: You're not wrong, and indeed that's why I pushed back on the latest attempt to make deleting pools even more cumbersome. But having a "trash" concept is also pretty weird. If admins can override it to just immediately delete the data (if they need the space), how is that different from just being another hoop to jump through? If we want to give the data owners a chance to undo, how do we identify and notify *them* rather than the admin running the command? But if admins can't override the trash and delete immediately, what do we do for things like testing and proofs of concept where large-scale data creates and deletes are to be expected? -Greg I'm talking about my experience: * Data Owners are a little bit in their LA LA LAND, and think that they can safely delete some of their data without losses. * Data Owners should think that their pool has really been deleted * Data Owners should not be aware of the existence of the "/trash/" * So Data Owners ask to restore from backup (but instead we'll simply use the trash). That said, we also have to think that: * The Administrator is always GOD, so he needs to be able to override if needed whenever he needs. * However the Administrator should just put the pool in status delete, without overriding this behaviour if there is no need to do so. * Override should be allowed only with many cumbersome warnings telling you that YOU SHOULD NOT OVERRIDE - PLEASE AVOID OVERRIDE I don't like that the software can limit administrators in doing their job... in the end the Administrator will always find a way to do what he wants (it's root). Of course I like the feature to push the Admin to follow the right behaviour. some sort of active/inactive toggle both on RBD images, pools, buckets and filesystem trees is nice to allow admins to perform scream tests. "data owner requests deletion - admin disables pool (kicks all clients) - data owner screams - admin reactivates" sounds much better than the last step being the admin checking if the backups are good... i try to do something similar by renaming pools to be deleted, but that is not always the same as inactive. kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
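The rename trick is a one-liner (pool names are examples); clients still pointing at the old name will start failing, which is exactly the scream test, and the data stays intact until the pool is really deleted:
# ceph osd pool rename images images-to-be-deleted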
Re: [ceph-users] Ceph newbie(?) issues
On 05. mars 2018 14:45, Jan Marquardt wrote: Am 05.03.18 um 13:13 schrieb Ronny Aasen: i had some similar issues when i started my proof of concept. especialy the snapshot deletion i remember well. the rule of thumb for filestore that i assume you are running is 1GB ram per TB of osd. so with 8 x 4TB osd's you are looking at 32GB of ram for osd's + some GB's for the mon service, + some GB's for the os itself. i suspect if you inspect your dmesg log and memory graphs you will find that the out of memory killer ends your osd's when the snap deletion (or any other high load task) runs. I ended up reducing the number of osd's per node, since the old mainboard i used was maxed for memory. Well, thanks for the broad hint. Somehow I assumed we fulfill the recommendations, but of course you are right. We'll check if our boards support 48 GB RAM. Unfortunately, there are currently no corresponding messages. But I can't rule out that there haven't been any. corruptions occured for me as well. and they was normaly associated with disks dying or giving read errors. ceph often managed to fix them but sometimes i had to just remove the hurting OSD disk. hage some graph's to look at. personaly i used munin/munin-node since it was just an apt-get away from functioning graphs also i used smartmontools to send me emails about hurting disks. and smartctl to check all disks for errors. I'll check S.M.A.R.T stuff. I am wondering if scrubbing errors are always caused by disk problems or if they also could be triggered by flapping OSDs or other circumstances. good luck with ceph ! Thank you! in my not that extensive experience, schrub errors come mainly from 2 issues. Either disk's giving read errors (should be visible both in the log and dmesg.) or having pools with size=2/min_size=1 instead of the default and recomended size=3/min_size=2 but i can not say that they do not come from crashing OSD's but my case the osd kept crashing due to bad disk and/or low memory. If you have scrub errors you can not get rid of on filestore (not bluestore!) you can read the two following urls. http://ceph.com/geen-categorie/ceph-manually-repair-object/ and on http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/ basicaly the steps are: - find the pg :: rados list-inconsistent-pg [pool] - find the problem :: rados list-inconsistent-obj 0.6 --format=json-pretty ; give you the object name look for hints to what is the bad object - find the object on disks :: manually check the objects on each osd for the given pg, check the object metadata (size/date/etc), run md5sum on them all and compare. check objects on the nonrunning osd's and compare there as well. anything to try to determine what object is ok and what is bad. - fix the problem :: assuming you find the bad object, stop the affected osd with the bad object, remove the object manually, restart osd. issue repair command. Once i fixed my min_size=1 misconfiguration, and pulled the dying (but functional) disks from my cluster, and reduced osd count to prevent dying osd's all of those scrub errors went away. have not seen one in 6 months now. kinds regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
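Put together, the repair procedure above looks roughly like this (pool name and pg id are taken from the thread; adapt them to your own ceph health detail output):
# rados list-inconsistent-pg rbd
# rados list-inconsistent-obj 0.1b2 --format=json-pretty
# ceph pg repair 0.1b2
The manual object comparison and removal step in between is only needed when repair alone cannot decide which copy is good.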
Re: [ceph-users] Ceph newbie(?) issues
On 05. mars 2018 11:21, Jan Marquardt wrote: Hi, we are relatively new to Ceph and are observing some issues, where I'd like to know how likely they are to happen when operating a Ceph cluster. Currently our setup consists of three servers which are acting as OSDs and MONs. Each server has two Intel Xeon L5420 (yes, I know, it's not state of the art, but we thought it would be sufficient for a Proof of Concept. Maybe we were wrong?) and 24 GB RAM and is running 8 OSDs with 4 TB harddisks. 4 OSDs are sharing one SSD for journaling. We started on Kraken and upgraded lately to Luminous. The next two OSD servers and three separate MONs are ready for deployment. Please find attached our ceph.conf. Current usage looks like this: data: pools: 1 pools, 768 pgs objects: 5240k objects, 18357 GB usage: 59825 GB used, 29538 GB / 89364 GB avail We have only one pool which is exclusively used for rbd. We started filling it with data and creating snapshots in January until Mid of February. Everything was working like a charm until we started removing old snapshots then. While we were removing snapshots for the first time, OSDs started flapping. Besides this there was no other load on the cluster. For idle times we solved it by adding osd snap trim priority = 1 osd snap trim sleep = 0.1 to ceph.conf. When there is load from other operations and we remove big snapshots OSD flapping still occurs. Last week our first scrub errors appeared. Repairing the first one was no big deal. The second one however was, because the instructed OSD started crashing. First on Friday osd.17 and today osd.11. ceph1:~# ceph pg repair 0.1b2 instructing pg 0.1b2 on osd.17 to repair ceph1:~# ceph pg repair 0.1b2 instructing pg 0.1b2 on osd.11 to repair I am still researching on the crashes, but already would be thankful for any input. Any opinions, hints and advices would really be appreciated. i had some similar issues when i started my proof of concept. especialy the snapshot deletion i remember well. the rule of thumb for filestore that i assume you are running is 1GB ram per TB of osd. so with 8 x 4TB osd's you are looking at 32GB of ram for osd's + some GB's for the mon service, + some GB's for the os itself. i suspect if you inspect your dmesg log and memory graphs you will find that the out of memory killer ends your osd's when the snap deletion (or any other high load task) runs. I ended up reducing the number of osd's per node, since the old mainboard i used was maxed for memory. corruptions occured for me as well. and they was normaly associated with disks dying or giving read errors. ceph often managed to fix them but sometimes i had to just remove the hurting OSD disk. hage some graph's to look at. personaly i used munin/munin-node since it was just an apt-get away from functioning graphs also i used smartmontools to send me emails about hurting disks. and smartctl to check all disks for errors. good luck with ceph ! kinds regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Cannot delete a pool
On 01. mars 2018 13:04, Max Cuttins wrote: I was testing IO and I created a bench pool. But if I tried to delete I get: Error EPERM: pool deletion is disabled; you must first set the mon_allow_pool_delete config option to true before you can destroy a pool So I run: ceph tell mon.\* injectargs '--mon-allow-pool-delete=true' mon.ceph-node1: injectargs:mon_allow_pool_delete = 'true' (not observed, change may require restart) mon.ceph-node2: injectargs:mon_allow_pool_delete = 'true' (not observed, change may require restart) mon.ceph-node3: injectargs:mon_allow_pool_delete = 'true' (not observed, change may require restart) I restarted all the nodes. But the flag has not been observed. Is this the right way to remove a pool? i think you need to set the option in the ceph.conf of the monitors. and then restart the mon's one by one. afaik that is by design. https://blog.widodh.nl/2015/04/protecting-your-ceph-pools-against-removal-or-property-changes/ kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
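In other words, something like this on each monitor host, restarting the mons one at a time, and then the delete itself (pool name bench as in the mail; the repeated pool name and the flag are required by the command):
[mon]
mon_allow_pool_delete = true
# systemctl restart ceph-mon@ceph-node1
# ceph osd pool delete bench bench --yes-i-really-really-mean-it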
Re: [ceph-users] Install previous version of Ceph
On 23. feb. 2018 23:37, Scottix wrote: Hey, We had one of our monitor servers die on us and I have a replacement computer now. In between that time you have released 12.2.3 but we are still on 12.2.2. We are on Ubuntu servers I see all the binaries are in the repo but your package cache only shows 12.2.3, is there a reason for not keeping the previous builds like in my case. I could do an install like apt install ceph-mon=12.2.2 Also how would I go installing 12.2.2 in my scenario since I don't want to update till have this monitor running again. Thanks, Scott did you figure out a solution to this ? I have the same problem now. I assume you have to download the old version manually and install with dpkg -i optionally mirror the ceph repo and build your own repo index containing all versions. kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
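A sketch of both approaches on Ubuntu, assuming the standard download.ceph.com repository layout (the exact version strings below are illustrative and should be checked against the repo's pool directory):
# apt-get install ceph-mon=12.2.2-1xenial ceph-common=12.2.2-1xenial
or download the matching .deb files and install them directly:
# dpkg -i ceph-common_12.2.2-1xenial_amd64.deb ceph-mon_12.2.2-1xenial_amd64.deb
An apt pin in /etc/apt/preferences.d/ceph.pref can then keep the host from being upgraded past that version:
Package: ceph*
Pin: version 12.2.2*
Pin-Priority: 1001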
Re: [ceph-users] Luminous and calamari
On 16.02.2018 06:20, Laszlo Budai wrote: Hi, I've just started up the dashboard component of the ceph mgr. It looks OK, but from what can be seen, and what I was able to find in the docs, the dashboard is just for monitoring. Is there any plugin that allows management of the ceph resources (pool create/delete)? openattic allows for web administration, but i think it is only possible to run it comfortably on opensuse leap atm. I could not find updated debian packages last time i checked. proxmox also allows for ceph administration, but proxmox is probably a bit overkill for only ceph admin, since it is a web admin tool for kvm vm's and lxc containers as well as ceph. kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
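For completeness, the luminous mgr dashboard referred to above (read-only) is enabled and located with:
# ceph mgr module enable dashboard
# ceph mgr services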
Re: [ceph-users] Query regarding min_size.
On 03. jan. 2018 14:51, James Poole wrote: Hi all, Whilst on a training course recently I was told that 'min_size' had an affect on client write performance, in that it's the required number of copies before ceph reports back to the client that an object has been written therefore setting a 'min_size' of 0 would only require a write to be accepted by the journal before confirming it's been accepted. This is contrary to further reading elsewhere that the 'min_size' is the minimum number of copies required of an object to allow I/O and that 'size' is the parameter that would affect write speed i.e. desired number of replicas. Setting 'min_size' to 0 with a 'size' of 3 you would still have an effective 'min_size' of 2 from: https://raw.githubusercontent.com/ceph/ceph/master/doc/release-notes.rst "* Degraded mode (when there fewer than the desired number of replicas) is now more configurable on a per-pool basis, with the min_size parameter. By default, with min_size 0, this allows I/O to objects with N - floor(N/2) replicas, where N is the total number of expected copies. Argonaut behavior was equivalent to having min_size = 1, so I/O would always be possible if any completely up to date copy remained. min_size = 1 could result in lower overall availability in certain cases, such as flapping network partition" Which leads to the conclusion that changing 'min_size' has nothing to do with performance but is solely related to data integrity/resilience. Could someone confirm my assertion is correct? Many thanks James you are correct that it is related to data integrity. the writes to a osd filestore is allways acked internally when it have hit the journal. unrelated to size/min_size. in normal operation, all osd's must ack the write before the write is acked to the client: iow all 3 (size 3) must ack. and min_size is not relevant in any case. min_size is only relevant when a pg is degraded while being remapped or backfilled (or degraded because of no space to remap/backfill into) because of a osd or node failure. in that case min_size specify how many osd's must ack the write before the write is acked to the client. since failure is most likely when disks are stressing (eg with rebuild), reducing min_size is just asking for corruption and data loss. kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Running Jewel and Luminous mixed for a longer period
On 30.12.2017 15:41, Milanov, Radoslav Nikiforov wrote: Performance as well - in my testing FileStore was much quicker than BlueStore. with filestore you often have an ssd journal in front; this will often mask/hide slow spinning disk write performance, until the journal size becomes the bottleneck. with bluestore only the metadata db and wal are on ssd, so there is no double write, and there is no journal bottleneck. but write latency will be the speed of the disk, and not the speed of the ssd journal. this will feel like a write performance regression. you can use bcache in front of bluestore to regain the "journal + double write" characteristic of filestore+journal. kind regards ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph status doesnt show available and used disk space after upgrade
On 20.12.2017 19:02, kevin parrikar wrote: hi All, I have upgraded the cluster from Hammer to Jewel and to Luminous . i am able to upload/download glance images but ceph -s shows 0kb used and Available and probably because of that cinder create is failing. ceph -s cluster: id: 06c5c906-fc43-499f-8a6f-6c8e21807acf health: HEALTH_WARN Reduced data availability: 6176 pgs inactive Degraded data redundancy: 6176 pgs unclean services: mon: 3 daemons, quorum controller3,controller2,controller1 mgr: controller3(active) osd: 71 osds: 71 up, 71 in rgw: 1 daemon active data: pools: 4 pools, 6176 pgs objects: 0 objects, 0 bytes usage: 0 kB used, 0 kB / 0 kB avail pgs: 100.000% pgs unknown 6176 unknown i deployed ceph-mgr using ceph-deploy gather-keys && ceph-deploy mgr create ,it was successfull but for some reason ceph -s is not showing correct values. Can some one help me here please Regards, Kevin is ceph-mgr actually running ? all statistics now require a ceph-mgr to be running. also check the mgr's logfile to see if it is able to authenticate/start properly. kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
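A quick way to check that, assuming the mgr id matches the hostname as in the status output above and the default log location:
# ceph -s | grep mgr
# systemctl status ceph-mgr@controller3
# tail -n 50 /var/log/ceph/ceph-mgr.controller3.log
If the daemon is running but cannot authenticate, the log will typically show access denied / auth errors.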
Re: [ceph-users] add hard drives to 3 CEPH servers (3 server cluster)
if you have a global setting in ceph.conf it will only affect the creation of new pools. i reccomend using the default size:3 + min_size:2 also check your pools that you have min_size=2 kind regards Ronny Aasen On 15.12.2017 23:00, James Okken wrote: This whole effort went extremely well, thanks to Cary, and Im not used to that with CEPH so far. (And openstack ever) Thank you Cary. Ive upped the replication factor and now I see "replicated size 3" in each of my pools. Is this the only place to check replication level? Is there a Global setting or only a setting per Pool? ceph osd pool ls detail pool 0 'rbd' replicated size 3.. pool 1 'images' replicated size 3... ... One last question! At this replication level how can I tell how much total space I actually have now? Do I just 1/3 the Global size? ceph df GLOBAL: SIZE AVAIL RAW USED %RAW USED 13680G 12998G 682G 4.99 POOLS: NAMEID USED %USED MAX AVAIL OBJECTS rbd 0 0 0 6448G 0 images 1 216G 3.24 6448G 27745 backups 2 0 0 6448G 0 volumes 3 117G 1.79 6448G 30441 compute 4 0 0 6448G 0 ceph osd df ID WEIGHT REWEIGHT SIZE USEAVAIL %USE VAR PGS 0 0.81689 1.0 836G 36549M 800G 4.27 0.86 67 4 3.7 1.0 3723G 170G 3553G 4.58 0.92 270 1 0.81689 1.0 836G 49612M 788G 5.79 1.16 56 5 3.7 1.0 3723G 192G 3531G 5.17 1.04 282 2 0.81689 1.0 836G 33639M 803G 3.93 0.79 58 3 3.7 1.0 3723G 202G 3521G 5.43 1.09 291 TOTAL 13680G 682G 12998G 4.99 MIN/MAX VAR: 0.79/1.16 STDDEV: 0.67 Thanks! -Original Message- From: Cary [mailto:dynamic.c...@gmail.com] Sent: Friday, December 15, 2017 4:05 PM To: James Okken Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] add hard drives to 3 CEPH servers (3 server cluster) James, Those errors are normal. Ceph creates the missing files. You can check "/var/lib/ceph/osd/ceph-6", before and after you run those commands to see what files are added there. Make sure you get the replication factor set. Cary -Dynamic On Fri, Dec 15, 2017 at 6:11 PM, James Okken <james.ok...@dialogic.com> wrote: Thanks again Cary, Yes, once all the backfilling was done I was back to a Healthy cluster. I moved on to the same steps for the next server in the cluster, it is backfilling now. Once that is done I will do the last server in the cluster, and then I think I am done! Just checking on one thing. I get these messages when running this command. I assume this is OK, right? root@node-54:~# ceph-osd -i 4 --mkfs --mkkey --osd-uuid 25c21708-f756-4593-bc9e-c5506622cf07 2017-12-15 17:28:22.849534 7fd2f9e928c0 -1 journal FileJournal::_open: disabling aio for non-block journal. Use journal_force_aio to force use of aio anyway 2017-12-15 17:28:22.855838 7fd2f9e928c0 -1 journal FileJournal::_open: disabling aio for non-block journal. 
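For the question about where to check: "ceph osd pool ls detail" or "ceph osd pool get <pool> size" per pool is authoritative; the only global knobs are osd_pool_default_size / osd_pool_default_min_size, which, as noted above, affect new pools only. A quick loop over all pools:
# for p in $(ceph osd pool ls); do echo -n "$p "; ceph osd pool get $p size; done
# for p in $(ceph osd pool ls); do echo -n "$p "; ceph osd pool get $p min_size; done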
Use journal_force_aio to force use of aio anyway 2017-12-15 17:28:22.856444 7fd2f9e928c0 -1 filestore(/var/lib/ceph/osd/ceph-4) could not find #-1:7b3f43c4:::osd_superblock:0# in index: (2) No such file or directory 2017-12-15 17:28:22.893443 7fd2f9e928c0 -1 created object store /var/lib/ceph/osd/ceph-4 for osd.4 fsid 2b9f7957-d0db-481e-923e-89972f6c594f 2017-12-15 17:28:22.893484 7fd2f9e928c0 -1 auth: error reading file: /var/lib/ceph/osd/ceph-4/keyring: can't open /var/lib/ceph/osd/ceph-4/keyring: (2) No such file or directory 2017-12-15 17:28:22.893662 7fd2f9e928c0 -1 created new key in keyring /var/lib/ceph/osd/ceph-4/keyring thanks -Original Message- From: Cary [mailto:dynamic.c...@gmail.com] Sent: Thursday, December 14, 2017 7:13 PM To: James Okken Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] add hard drives to 3 CEPH servers (3 server cluster) James, Usually once the misplaced data has balanced out the cluster should reach a healthy state. If you run a "ceph health detail" Ceph will show you some more detail about what is happening. Is Ceph still recovering, or has it stalled? has the "objects misplaced (62.511%" changed to a lower %? Cary -Dynamic On Thu, Dec 14, 2017 at 10:52 PM, James Okken <james.ok...@dialogic.com> wrote: Thanks Cary! Your directions worked on my first sever. (once I found the missing carriage return in your list of commands, the email musta messed it up. For anyone else: chown -R ceph:ceph /var/lib/ceph/osd/ceph-4 ceph auth add osd.4 osd 'allow *' mon 'allow profile osd' -i /etc/ceph/ceph.osd.4.keyring really is 2 commands: chown -R ceph:ceph /var/lib/ceph/osd/ceph-4 and ceph auth add osd.4 osd 'allow *' mon 'allow profile osd' -i /etc/ceph/ceph.osd.4.keyring Cary, what am I looking for in ceph -w and c
Re: [ceph-users] add hard drives to 3 CEPH servers (3 server cluster)
On 14.12.2017 18:34, James Okken wrote: Hi all, Please let me know if I am missing steps or using the wrong steps I'm hoping to expand my small CEPH cluster by adding 4TB hard drives to each of the 3 servers in the cluster. I also need to change my replication factor from 1 to 3. This is part of an Openstack environment deployed by Fuel and I had foolishly set my replication factor to 1 in the Fuel settings before deploy. I know this would have been done better at the beginning. I do want to keep the current cluster and not start over. I know this is going thrash my cluster for a while replicating, but there isn't too much data on it yet. To start I need to safely turn off each CEPH server and add in the 4TB drive: To do that I am going to run: ceph osd set noout systemctl stop ceph-osd@1 (or 2 or 3 on the other servers) ceph osd tree (to verify it is down) poweroff, install the 4TB drive, bootup again ceph osd unset noout Next step wouyld be to get CEPH to use the 4TB drives. Each CEPH server already has a 836GB OSD. ceph> osd df ID WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR PGS 0 0.81689 1.0 836G 101G 734G 12.16 0.90 167 1 0.81689 1.0 836G 115G 721G 13.76 1.02 166 2 0.81689 1.0 836G 121G 715G 14.49 1.08 179 TOTAL 2509G 338G 2171G 13.47 MIN/MAX VAR: 0.90/1.08 STDDEV: 0.97 ceph> df GLOBAL: SIZE AVAIL RAW USED %RAW USED 2509G 2171G 338G 13.47 POOLS: NAMEID USED %USED MAX AVAIL OBJECTS rbd 0 0 0 2145G 0 images 1 216G 9.15 2145G 27745 backups 2 0 0 2145G 0 volumes 3 114G 5.07 2145G 29717 compute 4 0 0 2145G 0 Once I get the 4TB drive into each CEPH server should I look to increasing the current OSD (ie: to 4836GB)? Or create a second 4000GB OSD on each CEPH server? If I am going to create a second OSD on each CEPH server I hope to use this doc: http://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/ As far as changing the replication factor from 1 to 3: Here are my pools now: ceph osd pool ls detail pool 0 'rbd' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool stripe_width 0 pool 1 'images' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 116 flags hashpspool stripe_width 0 removed_snaps [1~3,b~6,12~8,20~2,24~6,2b~8,34~2,37~20] pool 2 'backups' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 7 flags hashpspool stripe_width 0 pool 3 'volumes' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 73 flags hashpspool stripe_width 0 removed_snaps [1~3] pool 4 'compute' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 34 flags hashpspool stripe_width 0 I plan on using these steps I saw online: ceph osd pool set rbd size 3 ceph -s (Verify that replication completes successfully) ceph osd pool set images size 3 ceph -s ceph osd pool set backups size 3 ceph -s ceph osd pool set volumes size 3 ceph -s please let me know any advice or better methods... you normaly want each drive to be it's own osd. it is the number of osd's that give ceph it's scaleabillity. so more osd's = more aggeregate performance. only exception is if you are limited by something like cpu or ram and must limit osd count becouse of that. also remember to up your min_size from 1 to the default 2. with 1 your cluster will accept writes with only a single operational osd. and if that one fail you will have dataloss corruption and inconsistencies. 
you might also consider upping your size and min_size before taking down an osd, since you obviously will have the pg's on that osd unavailable, and you may want to have the extra redundancy before shaking the tree. with max usage 15% on the most used OSD you should have the space for it. good luck Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Corrupted files on CephFS since Luminous upgrade
On 08. des. 2017 14:49, Florent B wrote: On 08/12/2017 14:29, Yan, Zheng wrote: On Fri, Dec 8, 2017 at 6:51 PM, Florent B <flor...@coppint.com> wrote: I don't know I didn't touched that setting. Which one is recommended ? If multiple dovecot instances are running at the same time and they all modify the same files. you need to set fuse_disable_pagecache to true. Ok, but in my configuration, each mail user is mapped to a single server. So files are accessed only by a single server at a time. how about mail delivery ? if you use dovecot deliver a delivery can occur (and rewrite dovecot index/cache) at the same time as a user accesses imap and writes to dovecot index/cache. kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] I cannot make the OSD to work, Journal always breaks 100% time
entry()+0x10) [0x55569c1f2a60] 7: (()+0x76ba) [0x7f24503e36ba] 8: (clone()+0x6d) [0x7f244e45b3dd] NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this. 2017-12-05 13:19:04.442866 7f243d9a1700 -1 os/filestore/FileStore.cc: In function 'void FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int, ThreadPool::TPHandle*)' thread 7f243d9a1700 time 2017-12-05 13:19:04.435362 os/filestore/FileStore.cc: 2930: FAILED assert(0 == "unexpected error") ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x55569c1ff790] 2: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*)+0xb8e) [0x55569be9d58e] 3: (FileStore::_do_transactions(std::vector<ObjectStore::Transaction, std::allocator >&, unsigned long, ThreadPool::TPHandle*)+0x3b) [0x55569bea3a1b] 4: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x39d) [0x55569bea3ded] 5: (ThreadPool::worker(ThreadPool::WorkThread*)+0xdb1) [0x55569c1f1961] 6: (ThreadPool::WorkThread::entry()+0x10) [0x55569c1f2a60] 7: (()+0x76ba) [0x7f24503e36ba] 8: (clone()+0x6d) [0x7f244e45b3dd] NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this. 0> 2017-12-05 13:19:04.442866 7f243d9a1700 -1 os/filestore/FileStore.cc: In function 'void FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int, ThreadPool::TPHandle*)' thread 7f243d9a1700 time 2017-12-05 13:19:04.435362 os/filestore/FileStore.cc: 2930: FAILED assert(0 == "unexpected error") ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x55569c1ff790] 2: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*)+0xb8e) [0x55569be9d58e] 3: (FileStore::_do_transactions(std::vector<ObjectStore::Transaction, std::allocator >&, unsigned long, ThreadPool::TPHandle*)+0x3b) [0x55569bea3a1b] 4: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x39d) [0x55569bea3ded] 5: (ThreadPool::worker(ThreadPool::WorkThread*)+0xdb1) [0x55569c1f1961] 6: (ThreadPool::WorkThread::entry()+0x10) [0x55569c1f2a60] 7: (()+0x76ba) [0x7f24503e36ba] 8: (clone()+0x6d) [0x7f244e45b3dd] NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this. *** Caught signal (Aborted) ** in thread 7f243d1a0700 thread_name:tp_fstore_op I tried to boot it several times. I zero the journal dd if=/dev/zero of=/dev/sde2 This probably kills the OSD, at the very least it destroys objects that was written to journal (and cluster assumed was safe), unless you flushed it successfully previously. create a new journal ceph-osd --mkjournal -i 6 Flush it. But's empty so ok. /usr/bin/ceph-osd -f --cluster ceph --id 6 --setuser ceph --setgroup ceph --flush-journal and boot manually the osd. /usr/bin/ceph-osd -f --cluster ceph --id 6 --setuser ceph --setgroup ceph Then it breaks. I pasted bin my whole configuration in https://pastebin.com/QfrE71Dg. But I changed also the journal partition from sde4 to sde2 to see if this has something to do. sde is SSD disk so wanted to see no block is corrupting everything. Nothing it breaks 100% of time after a while. I'm desperate to see how it breaks. I must say that this is other OSD that failed and I recovered. Smartscan long is correct xfs_repair is ok on disk everything seems correct. But it keep crashing. Any advice? 
Can I run the disk without a journal for a while until all pgs are backed up to the other disks? I just increased the size of the pools and min size as well, and I need this disk in order to recover all information. you need this disk to recover all information ? do you not have replication, so the objects are safe? i can not see from your pastebin that you have missing objects (that are only on this one disk). if you need the actual objects from this disk, then you need to do a recovery. that is a whole other job. if you only need the space of the disk, then you should zap and wipe it, and insert it as a new fresh OSD. but these 2 lines from your pastebin are a bit over the top. how you can have this many degraded objects based on only 289090 objects is hard to understand. recovery 20266198323167232/289090 objects degraded (7010342219781.809%) 37154696925806625 scrub errors i have not seen that before so hopefully someone else can chime in. also what exact os kernel and ceph versions are you running? kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Another OSD broken today. How can I recover it?
just as long as you are aware that size=3, min_size=2 is the right config for everyone except those that really know what they are doing. and if you ever run min_size=1 you better be expecting to corrupt your cluster sooner or later. Ronny On 05.12.2017 21:22, Denes Dolhay wrote: Hi, So for this to happen you have to lose another osd before backfilling is done. Thank You! This clarifies it! Denes On 12/05/2017 03:32 PM, Ronny Aasen wrote: On 05. des. 2017 10:26, Denes Dolhay wrote: Hi, This question popped up a few times already under filestore and bluestore too, but please help me understand, why this is? "when you have 2 different objects, both with correct digests, in your cluster, the cluster can not know witch of the 2 objects are the correct one." Doesn't it use an epoch, or an omap epoch when storing new data? If so why can it not use the recent one? this have been discussed a few times on the list. generally you have 2 disks. first disk fail. and writes happen to the other disk.. first disk recovers, and second disk fail before recovery is done. writes happen to second disk.. all objects have correct checksum. and both osd's think they are the correct one. so your cluster is inconsistent. so bluestore checksums does not solve this problem, both objects are objectivly "correct" :) with min_size =2 the cluster would not accept a write unless 2 disks accepted the write. kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Another OSD broken today. How can I recover it?
On 05. des. 2017 10:26, Denes Dolhay wrote: Hi, This question popped up a few times already under filestore and bluestore too, but please help me understand, why this is? "when you have 2 different objects, both with correct digests, in your cluster, the cluster can not know which of the 2 objects is the correct one." Doesn't it use an epoch, or an omap epoch when storing new data? If so why can it not use the recent one? this has been discussed a few times on the list. generally you have 2 disks. the first disk fails, and writes happen to the other disk.. the first disk recovers, and the second disk fails before recovery is done. writes now happen to the first disk.. all objects have correct checksums, and both osd's think they are the correct one. so your cluster is inconsistent. so bluestore checksums do not solve this problem, both objects are objectively "correct" :) with min_size=2 the cluster would not accept a write unless 2 disks accepted the write. kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Adding multiple OSD
On 05. des. 2017 00:14, Karun Josy wrote: Thank you for detailed explanation! Got one another doubt, This is the total space available in the cluster : TOTAL : 23490G Use : 10170G Avail : 13320G But ecpool shows max avail as just 3 TB. What am I missing ? == $ ceph df GLOBAL: SIZE AVAIL RAW USED %RAW USED 23490G 13338G 10151G 43.22 POOLS: NAME ID USED %USED MAX AVAIL OBJECTS ostemplates 1 162G 2.79 1134G 42084 imagepool 34 122G 2.11 1891G 34196 cvm1 54 8058 0 1891G 950 ecpool1 55 4246G 42.77 3546G 1232590 $ ceph osd df ID CLASS WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR PGS 0 ssd 1.86469 1.0 1909G 625G 1284G 32.76 0.76 201 1 ssd 1.86469 1.0 1909G 691G 1217G 36.23 0.84 208 2 ssd 0.87320 1.0 894G 587G 306G 65.67 1.52 156 11 ssd 0.87320 1.0 894G 631G 262G 70.68 1.63 186 3 ssd 0.87320 1.0 894G 605G 288G 67.73 1.56 165 14 ssd 0.87320 1.0 894G 635G 258G 71.07 1.64 177 4 ssd 0.87320 1.0 894G 419G 474G 46.93 1.08 127 15 ssd 0.87320 1.0 894G 373G 521G 41.73 0.96 114 16 ssd 0.87320 1.0 894G 492G 401G 55.10 1.27 149 5 ssd 0.87320 1.0 894G 288G 605G 32.25 0.74 87 6 ssd 0.87320 1.0 894G 342G 551G 38.28 0.88 102 7 ssd 0.87320 1.0 894G 300G 593G 33.61 0.78 93 22 ssd 0.87320 1.0 894G 343G 550G 38.43 0.89 104 8 ssd 0.87320 1.0 894G 267G 626G 29.90 0.69 77 9 ssd 0.87320 1.0 894G 376G 518G 42.06 0.97 118 10 ssd 0.87320 1.0 894G 322G 571G 36.12 0.83 102 19 ssd 0.87320 1.0 894G 339G 554G 37.95 0.88 109 12 ssd 0.87320 1.0 894G 360G 534G 40.26 0.93 112 13 ssd 0.87320 1.0 894G 404G 489G 45.21 1.04 120 20 ssd 0.87320 1.0 894G 342G 551G 38.29 0.88 103 23 ssd 0.87320 1.0 894G 148G 745G 16.65 0.38 61 17 ssd 0.87320 1.0 894G 423G 470G 47.34 1.09 117 18 ssd 0.87320 1.0 894G 403G 490G 45.18 1.04 120 21 ssd 0.87320 1.0 894G 444G 450G 49.67 1.15 130 TOTAL 23490G 10170G 13320G 43.30 Karun Josy On Tue, Dec 5, 2017 at 4:42 AM, Karun Josy <karunjo...@gmail.com <mailto:karunjo...@gmail.com>> wrote: Thank you for detailed explanation! Got one another doubt, This is the total space available in the cluster : TOTAL 23490G Use 10170G Avail : 13320G But ecpool shows max avail as just 3 TB. without knowing details of your cluster, this is just assumption guessing, but... perhaps one of your hosts have less free space then the others, replicated can pick 3 of the hosts that have plenty of space, but erasure perhaps require more hosts, so the host with least space is the limiting factor. check ceph osd df tree to see how it looks. kinds regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Another OSD broken today. How can I recover it?
On 05. des. 2017 09:18, Gonzalo Aguilar Delgado wrote: Hi, I created this. http://paste.debian.net/999172/ But the expiration date is too short. So I did this too https://pastebin.com/QfrE71Dg. What I want to mention is that there's no known cause for what's happening. It's true that time desync happens on reboot because of a few millis skew. But ntp corrects it fast. There are no network issues and the log of the osd is in the output. I only see in other osds the errors that are becoming more and more usual: 2017-12-05 08:58:56.637773 7f0feff7f700 -1 log_channel(cluster) log [ERR] : 10.7a shard 2: soid 10:5ff4f7a3:::rbd_data.56bf3a4775a618.2efa:head data_digest 0xfae07534 != data_digest 0xe2de2a76 from auth oi 10:5ff4f7a3:::rbd_data.56bf3a4775a618.2efa:head(3873'5250781 client.5697316.0:51282235 dirty|data_digest|omap_digest s 4194304 uv 5250781 dd e2de2a76 od alloc_hint [0 0]) 2017-12-05 08:58:56.637775 7f0feff7f700 -1 log_channel(cluster) log [ERR] : 10.7a shard 6: soid 10:5ff4f7a3:::rbd_data.56bf3a4775a618.2efa:head data_digest 0xfae07534 != data_digest 0xe2de2a76 from auth oi 10:5ff4f7a3:::rbd_data.56bf3a4775a618.2efa:head(3873'5250781 client.5697316.0:51282235 dirty|data_digest|omap_digest s 4194304 uv 5250781 dd e2de2a76 od alloc_hint [0 0]) 2017-12-05 08:58:56.63 7f0feff7f700 -1 log_channel(cluster) log [ERR] : 10.7a soid 10:5ff4f7a3:::rbd_data.56bf3a4775a618.2efa:head: failed to pick suitable auth object Digests not matching basically. Someone told me that this can be caused by a faulty disk. So I replaced the offending drive, and now I find the new disk is doing the same. Ok. But this thread is not for checking the source of the problem. This will be done later. This thread is to try to recover an OSD that seems ok to the object store tool. That is: why does it break here? if i get errors on a disk that i suspect are from reasons other than the disk being faulty, i remove the disk from the cluster, run it thru smart disk tests + long test, then run it thru the vendor's diagnostic tools (i have a separate 1u machine for this). if the disk clears as OK i wipe it and reinsert it as a new OSD. the reason you are getting corrupt digests is probably the very common way most people get corruptions.. you have size=2 , min_size=1. when you have 2 different objects, both with correct digests, in your cluster, the cluster can not know which of the 2 objects is the correct one. just search this list for all the users that end up in your situation for the same reason, also read this : http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-March/016663.html simple rule of thumb size=2, min_size=1 :: i do not care about my data, the data is volatile, but i want the cluster to accept writes _all the time_ size=2, min_size=2 :: i can not afford real redundancy, but i do care a little about my data, i accept that the cluster will block writes in error situations until the problem is fixed. size=3, min_size=2 :: i want safe and available data, and i understand that the ceph defaults are there for a reason. basically: size=3, min_size=2 if you want to avoid corruptions. remove-wipe-reinstall disks that have developed corruptions/inconsistencies with the cluster. kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
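The smart test sequence referred to above is roughly this (the device name is a placeholder; the long test runs in the background and can take hours, so check the result afterwards):
# smartctl -t long /dev/sdX
# smartctl -l selftest /dev/sdX
# smartctl -a /dev/sdX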
Re: [ceph-users] HELP with some basics please
On 04.12.2017 19:18, tim taler wrote: With size=2 losing any 2 disks on different hosts would probably cause data to be unavailable / lost, as the pg copies are randomly distributed across the osds. Chances are that you can find a pg whose acting group is the two failed osds (you lost all your replicas) okay I see, getting clearer at least ;-) you can also consider running size=2, min_size=2 while restructuring. it will block your problematic pg's if there is a failure, until the rebuild/rebalance is done. But it should be a bit more resistant to full cluster loss and/or corruption. basically it means: if there are fewer than 2 copies, do not accept writes. whether you want to do this depends on your requirements: is it a bigger disaster to be unavailable for a while, or to have to restore from backup. kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Another OSD broken today. How can I recover it?
On 04. des. 2017 10:22, Gonzalo Aguilar Delgado wrote: Hello, Things are getting worse every day. ceph -w cluster 9028f4da-0d77-462b-be9b-dbdf7fa57771 health HEALTH_ERR 1 pgs are stuck inactive for more than 300 seconds 8 pgs inconsistent 1 pgs repair 1 pgs stale 1 pgs stuck stale recovery 20266198323167232/288980 objects degraded (7013010700798.405%) 37154696925806624 scrub errors no legacy OSD present but 'sortbitwise' flag is not set But I'm finally finding time to recover. The disk seems to be correct, no smart errors and everything looks fine, just ceph not starting. Today I started to look at the ceph-objectstore-tool, which I don't really know much about. It just works nicely; no crash like on the OSD, which I had expected. So I'm lost. Since both the OSD and the ceph objectstore tool use the same backend, how is this possible? Can someone help me on fixing this, please? this line seems quite insane: recovery 20266198323167232/288980 objects degraded (7013010700798.405%) there is obviously something wrong in your cluster. once the defective osd is down/out, does the cluster eventually heal to HEALTH_OK ? you should start by reading and understanding this page: http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/ also, in order to get assistance you need to provide a lot more detail. how many nodes, how many osd's per node, what kind of nodes (cpu/ram), what kind of networking setup. show the output from ceph -s ceph osd tree ceph osd pool ls detail ceph health detail since you are systematically losing osd's i would start by checking the timestamp in the defective osd for when it died. double-check your clock sync settings so that all servers are time synchronized, and then check all logs for the time in question. especially dmesg: did the OOM killer do something ? was networking flaky ? mon logs ? did they complain about the osd in some fashion ? also, since you fail to start the osd again there is probably some corruption going on. bump the log level for that osd in the node's ceph.conf, something like [osd.XX] debug osd = 20 rename the log for the osd so you have a fresh file, and try to start the osd once. put the log on some pastebin and send the url. read http://ceph.com/planet/how-to-increase-debug-levels-and-harvest-a-detailed-osd-log/ for details. generally: try to make it easy for people to help you without having to drag details out of you. If you can collect all of the above on a pastebin like http://paste.debian.net/ instead of piecing it together from 3-4 different email threads, you will find a lot more eyeballs willing to give it a look. good luck and kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
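Concretely, the debug bump and log capture could look like this for an example osd.6 on filestore (paths are the defaults; drop the filestore line on bluestore OSDs):
[osd.6]
debug osd = 20
debug filestore = 20
# mv /var/log/ceph/ceph-osd.6.log /var/log/ceph/ceph-osd.6.log.old
# systemctl start ceph-osd@6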
Re: [ceph-users] Ceph - SSD cluster
On 20. nov. 2017 23:06, Christian Balzer wrote: On Mon, 20 Nov 2017 15:53:31 +0100 Ansgar Jazdzewski wrote: Hi *, just one note because we hit it: take a look at your discard options and make sure discard does not run on all OSDs at the same time. Any SSD that actually _requires_ the use of TRIM/DISCARD to maintain either speed or endurance I'd consider unfit for Ceph to boot. hello is there some sort of hardware compatibility list for this part ? perhaps community maintained on the wiki or similar. there are some older blog posts covering some devices, but it is hard to find ceph-related information for current devices. kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Moving bluestore WAL and DB after bluestore creation
On 16.11.2017 09:45, Loris Cuoghi wrote: Le Wed, 15 Nov 2017 19:46:48 +, Shawn Edwards <lesser.e...@gmail.com> a écrit : On Wed, Nov 15, 2017, 11:07 David Turner <drakonst...@gmail.com> wrote: I'm not going to lie. This makes me dislike Bluestore quite a bit. Using multiple OSDs to an SSD journal allowed for you to monitor the write durability of the SSD and replace it without having to out and re-add all of the OSDs on the device. Having to now out and backfill back onto the HDDs is awful and would have made a time when I realized that 20 journal SSDs all ran low on writes at the same time nearly impossible to recover from. Flushing journals, replacing SSDs, and bringing it all back online was a slick process. Formatting the HDDs and backfilling back onto the same disks sounds like a big regression. A process to migrate the WAL and DB onto the HDD and then back off to a new device would be very helpful. On Wed, Nov 15, 2017 at 10:51 AM Mario Giammarco <mgiamma...@gmail.com> wrote: It seems it is not possible. I recreated the OSD 2017-11-12 17:44 GMT+01:00 Shawn Edwards <lesser.e...@gmail.com>: I've created some Bluestore OSD with all data (wal, db, and data) all on the same rotating disk. I would like to now move the wal and db onto an nvme disk. Is that possible without re-creating the OSD? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com This. Exactly this. Not being able to move the .db and .wal data on and off the main storage disk on Bluestore is a regression. Hello, What stops you from dd'ing the DB/WAL's partitions on another disk and updating the symlinks in the OSD's mount point under /var/lib/ceph/osd? this probably works when you deployed bluestore with partitions, but if you did not create partitions for block.db on orginal bluestore creation there is no block.db symlink, db and wal are mixed into the block partition and not easy to extract. also just dd the block device may not help if you want to change the size of the db partition. this needs more testing. probably tools can be created in the future for resizing db and wal partitions, and for extracting db data from block into a separate block.db partition. dd block.db would probably work when you need to replace a worn out ssd drive. but not so much if you want to deploy separate block.db from a bluestore made without block.db kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Cluster network slower than public network
On 15.11.2017 13:50, Gandalf Corvotempesta wrote: As 10gb switches are expensive, what would happen by using a gigabit cluster network and a 10gb public network? Replication and rebalance should be slow, but what about public I/O ? When a client wants to write to a file, does it write over the public network and ceph automatically replicates it over the cluster network, or is the whole IO made over the public network? public io would be slow. each write goes from the client to the primary osd on the public network, then is replicated 2 times to the secondary osd's over the cluster network, and then the client is informed the block is written. the cluster network would see 2x the write traffic of the public network when things are OK, and many times the traffic of the public network when things are recovering or backfilling. i would prioritize the cluster network for the highest speed if one could not have 10Gbps on everything. kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
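The split itself is just two options in the [global] section of ceph.conf on every node (the subnets are placeholders):
[global]
public network = 10.0.1.0/24
cluster network = 10.0.2.0/24
OSDs pick up the change on restart; mon and client traffic always stays on the public network.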
Re: [ceph-users] Undersized fix for small cluster, other than adding a 4th node?
On 09. nov. 2017 22:52, Marc Roos wrote: I added an erasure k=3,m=2 coded pool on a 3 node test cluster and am getting these errors. pg 48.0 is stuck undersized for 23867.00, current state active+undersized+degraded, last acting [9,13,2147483647,7,2147483647] pg 48.1 is stuck undersized for 27479.944212, current state active+undersized+degraded, last acting [12,1,2147483647,8,2147483647] pg 48.2 is stuck undersized for 27479.944514, current state active+undersized+degraded, last acting [12,1,2147483647,3,2147483647] pg 48.3 is stuck undersized for 27479.943845, current state active+undersized+degraded, last acting [11,0,2147483647,2147483647,5] pg 48.4 is stuck undersized for 27479.947473, current state active+undersized+degraded, last acting [8,4,2147483647,2147483647,5] pg 48.5 is stuck undersized for 27479.940289, current state active+undersized+degraded, last acting [6,5,11,2147483647,2147483647] pg 48.6 is stuck undersized for 27479.947125, current state active+undersized+degraded, last acting [5,8,2147483647,1,2147483647] pg 48.7 is stuck undersized for 23866.977708, current state active+undersized+degraded, last acting [13,11,2147483647,0,2147483647] Mentioned here http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-May/009572.html is that the problem was resolved by adding an extra node. I already changed the min_size to 3. Or should I change to k=2,m=2, but do I still have a good saving on storage then? How can you calculate the storage saving of an erasure pool? the minimum number of nodes for a cluster is k+m, and with that you have no nodes for an additional failure domain. IOW, if a node fails your cluster is degraded and can not heal itself. (the 2147483647 entries in the acting sets mean crush could not find an osd to place that shard on: with the default host failure domain, k=3,m=2 needs 5 different hosts and you only have 3.) having ceph heal on failures is kind of one of the best things about ceph. so when choosing how many nodes to have in your cluster, you need to think: k + m + how many node failures do i want to tolerate without stressing = minimum number of nodes. basically with a 3 node cluster, you can either run 3x replication or k=2 + m=1. for space savings you can read http://ceph.com/geen-categorie/ceph-erasure-coding-overhead-in-a-nutshell/ kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
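For a 3 node test cluster the profile has to fit the host count; a sketch of the two usual options (profile and pool names are examples; on luminous the parameter is crush-failure-domain, older releases called it ruleset-failure-domain):
# ceph osd erasure-code-profile set ec21 k=2 m=1 crush-failure-domain=host
# ceph osd pool create ecpool21 64 64 erasure ec21
or, for testing only, keep k=3,m=2 but let several chunks share a host:
# ceph osd erasure-code-profile set ec32osd k=3 m=2 crush-failure-domain=osd
Note that crush-failure-domain=osd gives no protection against losing a whole host.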
Re: [ceph-users] How to enable jumbo frames on IPv6 only cluster?
On 27. okt. 2017 14:22, Félix Barbeira wrote: Hi, I'm trying to configure a ceph cluster using IPv6 only but I can't enable jumbo frames. I made the definition on the 'interfaces' file and it seems like the value is applied but when I test it looks like only works on IPv4, not IPv6. It works on IPv4: root@ceph-node01:~# ping -c 3 -M do -s 8972 ceph-node02 PING ceph-node02 (x.x.x.x) 8972(9000) bytes of data. 8980 bytes from ceph-node02 (x.x.x.x): icmp_seq=1 ttl=64 time=0.474 ms 8980 bytes from ceph-node02 (x.x.x.x): icmp_seq=2 ttl=64 time=0.254 ms 8980 bytes from ceph-node02 (x.x.x.x): icmp_seq=3 ttl=64 time=0.288 ms --- ceph-node02 ping statistics --- 3 packets transmitted, 3 received, 0% packet loss, time 2000ms rtt min/avg/max/mdev = 0.254/0.338/0.474/0.099 ms root@ceph-node01:~# But *not* in IPv6: root@ceph-node01:~# ping6 -c 3 -M do -s 8972 ceph-node02 PING ceph-node02(x:x:x:x:x:x:x:x) 8972 data bytes ping: local error: Message too long, mtu=1500 ping: local error: Message too long, mtu=1500 ping: local error: Message too long, mtu=1500 --- ceph-node02 ping statistics --- 4 packets transmitted, 0 received, +4 errors, 100% packet loss, time 3024ms root@ceph-node01:~# root@ceph-node01:~# ifconfig eno1 Link encap:Ethernet HWaddr 24:6e:96:05:55:f8 inet6 addr: 2a02:x:x:x:x:x:x:x/64 Scope:Global inet6 addr: fe80::266e:96ff:fe05:55f8/64 Scope:Link UP BROADCAST RUNNING MULTICAST *MTU:9000* Metric:1 RX packets:633318 errors:0 dropped:0 overruns:0 frame:0 TX packets:649607 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:463355602 (463.3 MB) TX bytes:498891771 (498.8 MB) lo Link encap:Local Loopback inet addr:127.0.0.1 Mask:255.0.0.0 inet6 addr: ::1/128 Scope:Host UP LOOPBACK RUNNING MTU:65536 Metric:1 RX packets:127420 errors:0 dropped:0 overruns:0 frame:0 TX packets:127420 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1 RX bytes:179470326 (179.4 MB) TX bytes:179470326 (179.4 MB) root@ceph-node01:~# root@ceph-node01:~# cat /etc/network/interfaces # This file describes network interfaces avaiulable on your system # and how to activate them. For more information, see interfaces(5). source /etc/network/interfaces.d/* # The loopback network interface auto lo iface lo inet loopback # The primary network interface auto eno1 iface eno1 inet6 auto post-up ifconfig eno1 mtu 9000 root@ceph-node01:# Please help! hello have you changed on all nodes ? also the ipv6 icmpv6 protocol can advertise a link MTU value. the client will pick up this mtu value and store it in/proc/sys/net/ipv6/conf/eth0/mtu if /proc/sys/net/ipv6/conf/ens32/accept_ra_mtu is enabled. you can perhaps change what mtu is advertised on the link by altering your Router or device that advertise RA's kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
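A few commands that may help confirm whether a router advertisement is clamping the IPv6 MTU; the interface name is taken from the mail above, and rdisc6 comes from the ndisc6 package:

cat /proc/sys/net/ipv6/conf/eno1/mtu         # the MTU IPv6 is actually using on this link
sysctl net.ipv6.conf.eno1.accept_ra_mtu      # 1 = the MTU option in RAs is honoured
rdisc6 eno1                                  # shows the MTU (if any) the router advertises

If the RA advertises 1500, the cleanest fix is on the router (for example AdvLinkMTU 9000 in radvd.conf); setting accept_ra_mtu to 0 on the nodes is a possible workaround.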
Re: [ceph-users] MDS damaged
if you were following this page: http://docs.ceph.com/docs/jewel/rados/troubleshooting/troubleshooting-pg/ then there is normally hours of troubleshooting in the following paragraph, before finally admitting defeat and marking the object as lost: "It is possible that there are other locations where the object can exist that are not listed. For example, if a ceph-osd is stopped and taken out of the cluster, the cluster fully recovers, and due to some future set of failures ends up with an unfound object, it won’t consider the long-departed ceph-osd as a potential location to consider. (This scenario, however, is unlikely.)" Also this warning is important regarding the loosing of objects: "Use this with caution, as it may confuse applications that expected the object to exist." mds is definitiftly such an application. i think rgw would be the only application that loosing a object could be acceptable, depending on what used the object storage. rbd and cephfs will have issues of varying degree. One could argue that the mark-unfound-lost command should have a --yes-i-mean-it type of warning, especialy of the pool application is cephfs or rbd This is ofcourse a bit late now that the object is marked as lost. but for your future reference: since you had a inconsistent pg, most likely you had one corrupt object and 1 or more OK object on some osd. and using the methods written about in http://ceph.com/geen-categorie/ceph-manually-repair-object/ might have recovered that object for you. kind regards Ronny Aasen On 26. okt. 2017 04:38, dani...@igb.illinois.edu wrote: Hi Ronny, From the documentation, I thought this was the proper way to resolve the issue. Dan On 24. okt. 2017 19:14, Daniel Davidson wrote: Our ceph system is having a problem. A few days a go we had a pg that was marked as inconsistent, and today I fixed it with a: #ceph pg repair 1.37c then a file was stuck as missing so I did a: #ceph pg 1.37c mark_unfound_lost delete pg has 1 objects unfound and apparently lost marking sorry i can not assist on the corrupt mds part. i have no experience in that part. But I felt this escaleted a bit quick. since this is a "i accept lost object" type of command, the consequences are quite ugly, depending on what the missing object was for. Did you do much troubleshooting before jumping to this command so you were certain there was no other non dataloss options ? kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
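Before reaching for mark_unfound_lost, the usual sequence is roughly the following (the pg id here is just the one from this thread, used as a placeholder):

ceph health detail                  # which pgs report unfound objects
ceph pg 1.37c list_missing          # names and versions of the unfound objects
ceph pg 1.37c query                 # which osds were probed, and which might still hold a copy

Then try to bring back (or temporarily re-add) any down osd that the query output says might still hold the data, before declaring anything lost.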
Re: [ceph-users] MDS damaged
On 24. okt. 2017 19:14, Daniel Davidson wrote: Our ceph system is having a problem. A few days ago we had a pg that was marked as inconsistent, and today I fixed it with a: #ceph pg repair 1.37c then a file was stuck as missing so I did a: #ceph pg 1.37c mark_unfound_lost delete pg has 1 objects unfound and apparently lost marking Sorry, I cannot assist on the corrupt MDS part; I have no experience there. But I felt this escalated a bit quickly. Since this is an "I accept losing the object" type of command, the consequences can be quite ugly, depending on what the missing object was for. Did you do much troubleshooting before jumping to this command, so you were certain there were no other non-dataloss options? kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Erasure code profile
Yes you can. But just like a raid5 array with a lost disk, it is not a comfortable way to run your cluster for any significant time, and you also get performance degradation. Having a warning active all the time makes it harder to detect new issues; one becomes numb to a warning that is always on. Strive to have your cluster in HEALTH_OK all the time, and design so that you have the fault tolerance you want as overhead. Having more nodes than strictly needed allows ceph to self-heal quickly, and also gives better performance by spreading load over more machines. 10+4 on 14 nodes means each and every node is hit on every write. kind regards Ronny Aasen On 23. okt. 2017 21:12, Jorge Pinilla López wrote: I have one question: what can or can't a cluster do when working in degraded mode? With K=10 + M=4, if one of my OSD nodes fails it will start working in degraded mode, but can I still do writes and reads from that pool? On 23/10/2017 at 21:01, Ronny Aasen wrote: On 23.10.2017 20:29, Karun Josy wrote: Hi, While creating a pool with erasure code profile k=10, m=4, I get PG status as "200 creating+incomplete" While creating a pool with profile k=5, m=3 it works fine. Cluster has 8 OSDs with a total of 23 disks. Are there any requirements for setting the first profile? You need K+M+X osd nodes. K and M come from the profile; X is how many nodes you want to be able to tolerate the failure of without becoming degraded (how many failed nodes ceph should be able to heal from automatically). So with K=10 + M=4 you need a minimum of 14 nodes and have 0 fault tolerance (a single failure = a degraded cluster), so you have to scramble to replace the node to get HEALTH_OK again. If you have 15 nodes you can lose 1 node and ceph will automatically rebalance onto the 14 needed nodes, and you can replace the lost node at your leisure. kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- *Jorge Pinilla López* jorp...@unizar.es Computer engineering student Systems area intern (SICUZ) Universidad de Zaragoza PGP-KeyID: A34331932EBC715A <http://pgp.rediris.es:11371/pks/lookup?op=get=0xA34331932EBC715A> ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Erasure code profile
On 23.10.2017 20:29, Karun Josy wrote: Hi, While creating a pool with erasure code profile k=10, m=4, I get PG status as "200 creating+incomplete" While creating a pool with profile k=5, m=3 it works fine. Cluster has 8 OSDs with a total of 23 disks. Are there any requirements for setting the first profile? You need K+M+X osd nodes. K and M come from the profile; X is how many nodes you want to be able to tolerate the failure of without becoming degraded (how many failed nodes ceph should be able to heal from automatically). So with K=10 + M=4 you need a minimum of 14 nodes and have 0 fault tolerance (a single failure = a degraded cluster), so you have to scramble to replace the node to get HEALTH_OK again. If you have 15 nodes you can lose 1 node and ceph will automatically rebalance onto the 14 needed nodes, and you can replace the lost node at your leisure. kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
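For a test cluster with only a handful of hosts, one way around the k+m host requirement is to let the profile place shards per osd instead of per host, at the cost of host-level fault tolerance. A sketch (the profile and pool names are made up, and the option was called ruleset-failure-domain on pre-luminous releases):

ceph osd erasure-code-profile set ec-k10-m4 k=10 m=4 crush-failure-domain=osd
ceph osd erasure-code-profile get ec-k10-m4
ceph osd pool create ecpool 128 128 erasure ec-k10-m4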
Re: [ceph-users] Brand new cluster -- pg is stuck inactive
strange that no osd is acting for your pg's can you show the output from ceph osd tree mvh Ronny Aasen On 13.10.2017 18:53, dE wrote: Hi, I'm running ceph 10.2.5 on Debian (official package). It cant seem to create any functional pools -- ceph health detail HEALTH_ERR 64 pgs are stuck inactive for more than 300 seconds; 64 pgs stuck inactive; too few PGs per OSD (21 < min 30) pg 0.39 is stuck inactive for 652.741684, current state creating, last acting [] pg 0.38 is stuck inactive for 652.741688, current state creating, last acting [] pg 0.37 is stuck inactive for 652.741690, current state creating, last acting [] pg 0.36 is stuck inactive for 652.741692, current state creating, last acting [] pg 0.35 is stuck inactive for 652.741694, current state creating, last acting [] pg 0.34 is stuck inactive for 652.741696, current state creating, last acting [] pg 0.33 is stuck inactive for 652.741698, current state creating, last acting [] pg 0.32 is stuck inactive for 652.741701, current state creating, last acting [] pg 0.3 is stuck inactive for 652.741762, current state creating, last acting [] pg 0.2e is stuck inactive for 652.741715, current state creating, last acting [] pg 0.2d is stuck inactive for 652.741719, current state creating, last acting [] pg 0.2c is stuck inactive for 652.741721, current state creating, last acting [] pg 0.2b is stuck inactive for 652.741723, current state creating, last acting [] pg 0.2a is stuck inactive for 652.741725, current state creating, last acting [] pg 0.29 is stuck inactive for 652.741727, current state creating, last acting [] pg 0.28 is stuck inactive for 652.741730, current state creating, last acting [] pg 0.27 is stuck inactive for 652.741732, current state creating, last acting [] pg 0.26 is stuck inactive for 652.741734, current state creating, last acting [] pg 0.3e is stuck inactive for 652.741707, current state creating, last acting [] pg 0.f is stuck inactive for 652.741761, current state creating, last acting [] pg 0.3f is stuck inactive for 652.741708, current state creating, last acting [] pg 0.10 is stuck inactive for 652.741763, current state creating, last acting [] pg 0.4 is stuck inactive for 652.741773, current state creating, last acting [] pg 0.5 is stuck inactive for 652.741774, current state creating, last acting [] pg 0.3a is stuck inactive for 652.741717, current state creating, last acting [] pg 0.b is stuck inactive for 652.741771, current state creating, last acting [] pg 0.c is stuck inactive for 652.741772, current state creating, last acting [] pg 0.3b is stuck inactive for 652.741721, current state creating, last acting [] pg 0.d is stuck inactive for 652.741774, current state creating, last acting [] pg 0.3c is stuck inactive for 652.741722, current state creating, last acting [] pg 0.e is stuck inactive for 652.741776, current state creating, last acting [] pg 0.3d is stuck inactive for 652.741724, current state creating, last acting [] pg 0.22 is stuck inactive for 652.741756, current state creating, last acting [] pg 0.21 is stuck inactive for 652.741758, current state creating, last acting [] pg 0.a is stuck inactive for 652.741783, current state creating, last acting [] pg 0.20 is stuck inactive for 652.741761, current state creating, last acting [] pg 0.9 is stuck inactive for 652.741787, current state creating, last acting [] pg 0.1f is stuck inactive for 652.741764, current state creating, last acting [] pg 0.8 is stuck inactive for 652.741790, current state creating, last acting [] pg 0.7 is stuck inactive 
for 652.741792, current state creating, last acting [] pg 0.6 is stuck inactive for 652.741794, current state creating, last acting [] pg 0.1e is stuck inactive for 652.741770, current state creating, last acting [] pg 0.1d is stuck inactive for 652.741772, current state creating, last acting [] pg 0.1c is stuck inactive for 652.741774, current state creating, last acting [] pg 0.1b is stuck inactive for 652.741777, current state creating, last acting [] pg 0.1a is stuck inactive for 652.741784, current state creating, last acting [] pg 0.2 is stuck inactive for 652.741812, current state creating, last acting [] pg 0.31 is stuck inactive for 652.741762, current state creating, last acting [] pg 0.19 is stuck inactive for 652.741789, current state creating, last acting [] pg 0.11 is stuck inactive for 652.741797, current state creating, last acting [] pg 0.18 is stuck inactive for 652.741793, current state creating, last acting [] pg 0.1 is stuck inactive for 652.741820, current state creating, last acting [] pg 0.30 is stuck inactive for 652.741769, current state creating, last acting [] pg 0.17 is stuck inactive for 652.741797, current state creating, last acting [] pg 0.0 is stuck inactive for 652.741829, current state creating, last acting [] pg 0.2f is stuck inactive for 652.741774, current state creating, last acting [] pg 0.16 is stuck inact
[ceph-users] windows server 2016 refs3.1 veeam syntetic backup with fast block clone
Greetings. When using Windows Storage Spaces and ReFS 3.1, Veeam backups can use something called block clone to build synthetic backups and to reduce the time taken to back up VMs. I have used Windows Server 2016 with ReFS 3.1 on ceph. My question is whether it is possible to get fast block clone and fast synthetic full backups when using ReFS on RBD on ceph. I of course have other backup solutions, but this is specific to VMware backups. Possible? kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph luminous repo not working on Ubuntu xenial
"apt-cache policy" shows you the different versions that are possible to install, and the prioritized order they have. the highest version will normally be installed unless priorities are changed. example: apt-cache policy ceph ceph: Installed: 12.2.1-1~bpo90+1 Candidate: 12.2.1-1~bpo90+1 Version table: *** 12.2.1-1~bpo90+1 500 500 http://download.ceph.com/debian-luminous stretch/main amd64 Packages 100 /var/lib/dpkg/status 10.2.5-7.2 500 500 http://deb.debian.org/debian stretch/main amd64 Packages apt-get install ceph=$version will install that spesific version. example in my case: apt install ceph=10.2.5-7.2 will downgrade to the previous version. kind regards Ronny Aasen On 29.09.2017 15:40, Kashif Mumtaz wrote: Dear Stefan, Thanks for your help. You are right. I was missing apt update" after adding repo. After doing apt update I am able to install luminous cadmin@admin:~/my-cluster$ ceph --version ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable) I am not much in practice with Ubuntu. I use Centos/RHEL only . This time a specific requirement to install it on Ubuntu. I want to ask one thing. Now ceph two version availbe in repository. 1- Jewel in Ubuntu update repository 2 - Manually added ceph Repository If one package available in multiple repository with different version How can I install specific version ? . On Friday, September 29, 2017 9:57 AM, Stefan Kooman <ste...@bit.nl> wrote: Quoting Kashif Mumtaz (kashif.mum...@yahoo.com <mailto:kashif.mum...@yahoo.com>): > > Dear User, > I am striving had to install Ceph luminous version on Ubuntu 16.04.3 ( xenial ). > Its repo is available at https://download.ceph.com/debian-luminous/ <https://download.ceph.com/debian-luminous/%C2%A0> > I added it like sudo apt-add-repository 'deb https://download.ceph.com/debian-luminous/ xenial main' > # more sources.list > deb https://download.ceph.com/debian-luminous/ xenial main ^^ That looks good. > It say no package available. Did anybody able to install Luminous on Xenial by using repo? Just checkin': you did a "apt update" after adding the repo? The repo works fine for me. Is the Ceph gpg key installed? apt-key list |grep Ceph uid Ceph.com (release key) <secur...@ceph.com <mailto:secur...@ceph.com>> Make sure you have "apt-transport-https" installed (as the repos uses TLS). Gr. Stefan -- | BIT BV http://www.bit.nl/ Kamer van Koophandel 09090351 | GPG: 0xD14839C6 +31 318 648 688 / i...@bit.nl <mailto:i...@bit.nl> ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Power outages!!! help!
On 28. sep. 2017 18:53, hjcho616 wrote: Yay! Finally, after almost exactly one month, I am able to mount the drive! Now it is time to see how my data is doing. =P Doesn't look too bad though. Got to love open source. =) I downloaded the ceph source code, built it, then tried to run the ceph-objectstore-tool export on that osd.4 and started debugging it. Obviously I don't have any idea what everything does... but I was able to trace to the error message. The corruption appears to be in the mount region. When it tries to decode a buffer, most buffers had very periodic access to data (looking at the printfs I put in), but a few of them had huge numbers. That "1" that didn't make sense came from the corruption: the struct_v portion of the data changed to the ASCII value of 1, which happily printed 1. =P Since it was the mount portion... and hoping it doesn't impact the data much... I went ahead and allowed those corrupted values. I was able to export osd.4 with journal! Congratulations and well done :) Just imagine trying to do this on $vendor's proprietary black box... Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] PG in active+clean+inconsistent, but list-inconsistent-obj doesn't show it
On 28. sep. 2017 09:27, Olivier Migeot wrote: Greetings, we're in the process of recovering a cluster after an electrical disaster. It hasn't gone badly so far, we managed to clear most of the errors. All that prevents a return to HEALTH_OK now is a bunch (6) of scrub errors, apparently from a PG that's marked as active+clean+inconsistent. Thing is, rados list-inconsistent-obj doesn't return anything but an empty list (plus, in the most recent attempts: error 2: (2) No such file or directory). We're on Jewel (waiting for this to be fixed before planning an upgrade), and the pool our PG belongs to has a replica of 2. No success with ceph pg repair, and I already tried to remove and import the most recent version of said PG in both its acting OSDs: it doesn't change a thing. Is there anything else I could try? Thanks, size=2 is of course horrible, and I assume you know that... But even more important: I hope you have min_size=2 so you avoid generating more problems in the future, or while troubleshooting. First of all, read this link a few times: http://ceph.com/geen-categorie/ceph-manually-repair-object/ You need to locate the bad objects to fix them. Since rados list-inconsistent-obj does not work, you need to manually check the logs of the osd's that are participating in the pg in question. Grep for ERR; once you find the name of the problem object, locate it using find /path/of/pg -name 'objectname'. Once you have the object path you need to compare the 2 objects and find out which one is the bad one. This is where 3x replication would have helped, since when one is bad, how do you know the bad from the good... The error message in the log may give hints; read and understand what the error message is, since it is critical to understanding what is wrong with the object. The object type also helps when determining the wrong one: is it a rados object, an rbd block, or a cephfs metadata or data object? Knowing what it should be helps in determining the wrong one. Things to try: ls -lh $path ; compare metadata, are there obvious problems? Refer to the error in the log. - does one have size 0 when there should have been a size? - does one have a size greater than 0 when it should have been size 0? - is one significantly larger than the other, perhaps one truncated or with garbage appended? md5sum $path - perhaps a block has a read error; it would show on this command and be a dead giveaway to the problem object. - compare checksums; do you know what sum the object should have? Actually look at the object: use strings or hexdump to try to determine the contents versus what the object should contain. If you can locate the bad object: stop the osd, flush its journal, move the bad object away (I just mv it somewhere else), restart the osd, run repair on the pg, tail the logs and wait for the repair and scrub to finish. -- If you are unable to tell the good object from the bad, you can try to determine which file it refers to in cephfs, or which block it refers to in rbd, and by overwriting that file or block in cephfs or rbd you can indirectly overwrite both objects with new data. If this is an rbd you should run a filesystem check on the fs on that rbd after all the ceph problems are repaired. good luck Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
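A condensed sketch of that procedure, with ids, paths and object names as placeholders only:

grep ERR /var/log/ceph/ceph-osd.21.log                        # object name and nature of the error
find /var/lib/ceph/osd/ceph-21/current/0.6_head/ -name 'objectname*'
ls -lh <path>; md5sum <path>                                  # compare against the copy on the other osd
systemctl stop ceph-osd@21
ceph-osd -i 21 --flush-journal
mv <path> /root/backup/                                       # move the bad copy out of the way
systemctl start ceph-osd@21
ceph pg repair 0.6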
Re: [ceph-users] Re install ceph
On 27. sep. 2017 10:09, Pierre Palussiere wrote: Hi, Does anyone know if it is possible to reinstall ceph on a host and keep the OSDs without wiping the data on them? Hope you can help me. It depends... If you have the journal on the same drive as the OSD, you should be able to eject the drive from one server, connect it to another, and udev should mount and activate the OSD (the data will of course move). I cannot see why a reinstall of a host would be much different from moving the disk. If you have the journal on a separate device, then you need to move the OSD and the journal device together. You can also have configurations that make this process less automatic. BUT: I would not in any way risk reinstalling a host with live OSDs! I would either set all its OSDs out and let the data remap to other OSDs, so you have a temporary replica elsewhere while reinstalling (the backfill should be fast since the data is still on disk), or I would set the crush weight to 0 and drain all OSDs off the node before reinstalling; here the backfill will take longer, since you actually have to refill the disks. kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
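The two drain variants as commands (the osd id is a placeholder):

ceph osd out 12                     # temporary: data is remapped away but stays on the disk
# or, to move the data off the disk for good:
ceph osd crush reweight osd.12 0
# in both cases wait for recovery to finish before reinstalling:
ceph -s
ceph osd df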
Re: [ceph-users] Power outages!!! help!
I would only tar the pg you have missing objects from; trying to inject older objects when the pg is correct cannot be good. Scrub errors are kind of the issue with only 2 replicas: when you have 2 different objects, how do you know which one is correct and which one is bad? As you have read on http://ceph.com/geen-categorie/ceph-manually-repair-object/ and on http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/ you need to - find the pg :: rados list-inconsistent-pg [pool] - find the problem :: rados list-inconsistent-obj 0.6 --format=json-pretty ; gives you the object name, look for hints to what the bad object is - find the object :: manually check the objects, check the object metadata, run md5sum on them all and compare. Check the objects on the non-running osd's and compare there as well. Anything to try to determine which object is ok and which is bad. - fix the problem :: assuming you find the bad object, stop the affected osd holding the bad object, remove the object manually, restart the osd, and issue the repair command. If the rados commands do not give you the info you need, do it all manually as on http://ceph.com/geen-categorie/ceph-manually-repair-object/ good luck Ronny Aasen On 20.09.2017 22:17, hjcho616 wrote: Thanks Ronny. I decided to try to tar everything under the current directory. Is this the correct command for it? Is there any directory we do not want on the new drive? commit_op_seq, meta, nosnap, omap? tar --xattrs --preserve-permissions -zcvf osd.4.tar.gz . As far as inconsistent PGs... I am running into these errors. I tried moving one copy of the pg to another location, but it just says the moved shard is missing. Tried setting 'noout' and turning one of them down; that seems to work on something but then it is back to the same error. Currently trying to move to a different osd... making sure the drive is not faulty, got a few of them.. but still persisting.. I've been kicking off ceph pg repair PG#, hoping it would fix them. =P Any other suggestion?
2017-09-20 09:39:48.481400 7f163c5fa700 0 log_channel(cluster) log [INF] : 0.29 repair starts 2017-09-20 09:47:37.384921 7f163c5fa700 -1 log_channel(cluster) log [ERR] : 0.29 shard 6: soid 0:97126ead:::200014ce4c3.028f:head data_digest 0x8f679a50 != data_digest 0x979f2ed4 from auth oi 0:97126ead:::200014ce4c3.028f:head(19366'539375 client.535319.1:2361163 dirty|data_digest|omap_digest s 4194304 uv 539375 dd 979f2ed4 od alloc_hint [0 0]) 2017-09-20 09:47:37.384931 7f163c5fa700 -1 log_channel(cluster) log [ERR] : 0.29 shard 7: soid 0:97126ead:::200014ce4c3.028f:head data_digest 0x8f679a50 != data_digest 0x979f2ed4 from auth oi 0:97126ead:::200014ce4c3.028f:head(19366'539375 client.535319.1:2361163 dirty|data_digest|omap_digest s 4194304 uv 539375 dd 979f2ed4 od alloc_hint [0 0]) 2017-09-20 09:47:37.384936 7f163c5fa700 -1 log_channel(cluster) log [ERR] : 0.29 soid 0:97126ead:::200014ce4c3.028f:head: failed to pick suitable auth object 2017-09-20 09:48:11.138566 7f1639df5700 -1 log_channel(cluster) log [ERR] : 0.29 shard 6: soid 0:97d5c15a:::10101b4.6892:head data_digest 0xd65b4014 != data_digest 0xf41cfab8 from auth oi 0:97d5c15a:::10101b4.6892:head(12962'65557 osd.4.0:42234 dirty|data_digest|omap_digest s 4194304 uv 776 dd f41cfab8 od alloc_hint [0 0]) 2017-09-20 09:48:11.138575 7f1639df5700 -1 log_channel(cluster) log [ERR] : 0.29 shard 7: soid 0:97d5c15a:::10101b4.6892:head data_digest 0xd65b4014 != data_digest 0xf41cfab8 from auth oi 0:97d5c15a:::10101b4.6892:head(12962'65557 osd.4.0:42234 dirty|data_digest|omap_digest s 4194304 uv 776 dd f41cfab8 od alloc_hint [0 0]) 2017-09-20 09:48:11.138581 7f1639df5700 -1 log_channel(cluster) log [ERR] : 0.29 soid 0:97d5c15a:::10101b4.6892:head: failed to pick suitable auth object 2017-09-20 09:48:55.584022 7f1639df5700 -1 log_channel(cluster) log [ERR] : 0.29 repair 4 errors, 0 fixed Latest health... HEALTH_ERR 1 pgs are stuck inactive for more than 300 seconds; 1 pgs down; 1 pgs incomplete; 9 pgs inconsistent; 1 pgs repair; 1 pgs stuck inactive; 1 pgs stuck unclean; 68 scrub errors; mds rank 0 has failed; mds cluster is degraded; no legacy OSD present but 'sortbitwise' flag is not set Regards, Hong On Wednesday, September 20, 2017 11:53 AM, Ronny Aasen <ronny+ceph-us...@aasen.cx> wrote: On 20.09.2017 16:49, hjcho616 wrote: Anyone? Can this page be saved? If not what are my options? Regards, Hong On Saturday, September 16, 2017 1:55 AM, hjcho616 <hjcho...@yahoo.com> <mailto:hjcho...@yahoo.com> wrote: Looking better... working on scrubbing.. HEALTH_ERR 1 pgs are stuck inactive for more than 300 seconds; 1 pgs incomplete; 12 pgs inconsistent; 2 pgs repair; 1 pgs stuck inactive; 1 pgs stuck unclean; 109 scrub errors; too few PGs per OSD (29 < min 30); mds rank 0 has failed; mds cluster is degrade
Re: [ceph-users] Power outages!!! help!
On 20.09.2017 16:49, hjcho616 wrote: Anyone? Can this page be saved? If not what are my options? Regards, Hong On Saturday, September 16, 2017 1:55 AM, hjcho616 <hjcho...@yahoo.com> wrote: Looking better... working on scrubbing.. HEALTH_ERR 1 pgs are stuck inactive for more than 300 seconds; 1 pgs incomplete; 12 pgs inconsistent; 2 pgs repair; 1 pgs stuck inactive; 1 pgs stuck unclean; 109 scrub errors; too few PGs per OSD (29 < min 30); mds rank 0 has failed; mds cluster is degraded; noout flag(s) set; no legacy OSD present but 'sortbitwise' flag is not set Now PG1.28.. looking at all old osds dead or alive. Only one with DIR_* directory is in osd.4. This appears to be metadata pool! 21M of metadata can be quite a bit of stuff.. so I would like to rescue this! But I am not able to start this OSD. exporting through ceph-objectstore-tool appears to crash. Even with --skip-journal-replay and --skip-mount-omap (different failure). As I mentioned in earlier email, that exception thrown message is bogus... # ceph-objectstore-tool --op export --pgid 1.28 --data-path /var/lib/ceph/osd/ceph-4 --journal-path /var/lib/ceph/osd/ceph-4/journal --file ~/1.28.export terminate called after throwing an instance of 'std::domain_error' [SNIP] What can I do to save that PG1.28? Please let me know if you need more information. So close!... =) Regards, Hong 12 inconsistent and 109 scrub errors is something you should fix first of all. also you can consider using the paid-services of many ceph support companies. that specialize in these kind of situations. -- that beeing said, here are some suggestions... when it comes to lost object recovery you have come about as far as i have ever experienced. so everything after here is just assumptions and wild guesswork to what you can try. I hope others shouts out if i tell you wildly wrong things. if you have found date pg1.28 from the broken osd and have checked all other working and nonworking drives, for that pg. then you need to try and extract the pg from the broken drive. As always in recovery cases, take a dd clone of the drive and work from the cloned image. to avoid more damage to the drive, and to allow you to try multiple times. you should add a temporary injection drive large enough for that pg, and set its crush weight to 0 so it always drains. make sure it is up and registered properly in ceph. the idea is to copy the pg manually from broken-osd to the injection drive, since the export/import fails.. making sure you get all xattrs included. one can either copy the whole pg, or just the "missing" objects. if there are few objects i would go for that, if there are many i would take the whole pg. you wont get data from leveldb. so i am not at all sure this would work. but worth a shot. - stop your injection osd, verify it is down and the proccess not running. - from the mountpoint of your broken-osd go into the current directory. and tar up the pg1.28 make sure you use -p and --xattrs when you create the archive. - if tar errors out on unreadable files, just rm those (since you are working on a copy of your rescue image, you can allways try again) - copy the tar file to the injection drive and extract while sitting in the current directory (remember --xattrs) - set debug options on the injection drive in ceph.conf - start the injection drive, and follow along in the log file. hopefully it should scan, locate the pg, and replicate the pg1.28 objects off to the current primary drive for pg1.28. and since it have crush weight 0 it should drain out. 
- if that works, verify the injection drive is drained, stop it and remove it from ceph, and zap the drive. (A rough command sketch of the copy step follows below.) This is all, as I said, guesswork, so your mileage may vary. good luck Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
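A rough sketch of the copy step described above; the paths, pg id and osd id are placeholders, the injection osd must be stopped first, and the source should be the cloned image rather than the failing disk:

cd /mnt/broken-osd-clone/current
tar --xattrs -cpf /tmp/pg1.28.tar 1.28_head
cd /var/lib/ceph/osd/ceph-12/current
tar --xattrs -xpf /tmp/pg1.28.tar
chown -R ceph:ceph 1.28_head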
Re: [ceph-users] Power outages!!! help!
you write you had all pg's exported except one. so i assume you have injected those pg's into the cluster again using the method linked a few times in this thread. How did that go, were you successfull in recovering those pg's ? kind regards. Ronny Aasen On 15. sep. 2017 07:52, hjcho616 wrote: I just did this and backfilling started. Let's see where this takes me. ceph osd lost 0 --yes-i-really-mean-it Regards, Hong On Friday, September 15, 2017 12:44 AM, hjcho616 <hjcho...@yahoo.com> wrote: Ronny, Working with all of the pgs shown in the "ceph health detail", I ran below for each PG to export. ceph-objectstore-tool --op export --pgid 0.1c --data-path /var/lib/ceph/osd/ceph-0 --journal-path /var/lib/ceph/osd/ceph-0/journal --skip-journal-replay --file 0.1c.export I have all PGs exported, except 1... PG 1.28. It is on ceph-4. This error doesn't make much sense to me. Looking at the source code from https://github.com/ceph/ceph/blob/master/src/osd/osd_types.cc, that message is telling me struct_v is 1... but not sure how it ended up in the default in the case statement when 1 case is defined... I tried with --skip-journal-replay, fails with same error message. ceph-objectstore-tool --op export --pgid 1.28 --data-path /var/lib/ceph/osd/ceph-4 --journal-path /var/lib/ceph/osd/ceph-4/journal --file 1.28.export terminate called after throwing an instance of 'std::domain_error' what(): coll_t::decode(): don't know how to decode version 1 *** Caught signal (Aborted) ** in thread 7fabc5ecc940 thread_name:ceph-objectstor ceph version 10.2.9 (2ee413f77150c0f375ff6f10edd6c8f9c7d060d0) 1: (()+0x996a57) [0x55b2d3323a57] 2: (()+0x110c0) [0x7fabc46d50c0] 3: (gsignal()+0xcf) [0x7fabc2b08fcf] 4: (abort()+0x16a) [0x7fabc2b0a3fa] 5: (__gnu_cxx::__verbose_terminate_handler()+0x15d) [0x7fabc33efb3d] 6: (()+0x5ebb6) [0x7fabc33edbb6] 7: (()+0x5ec01) [0x7fabc33edc01] 8: (()+0x5ee19) [0x7fabc33ede19] 9: (coll_t::decode(ceph::buffer::list::iterator&)+0x21e) [0x55b2d2ff401e] 10: (DBObjectMap::_Header::decode(ceph::buffer::list::iterator&)+0x125) [0x55b2d31315f5] 11: (DBObjectMap::check(std::ostream&, bool)+0x279) [0x55b2d3126bb9] 12: (DBObjectMap::init(bool)+0x288) [0x55b2d3125eb8] 13: (FileStore::mount()+0x2525) [0x55b2d305ceb5] 14: (main()+0x28c0) [0x55b2d2c8d400] 15: (__libc_start_main()+0xf1) [0x7fabc2af62b1] 16: (()+0x34f747) [0x55b2d2cdc747] Aborted Then wrote a simple script to run import process... just created an OSD per PG. Basically ran below for each PG. mkdir /var/lib/ceph/osd/ceph-5/tmposd_0.1c/ ceph-disk prepare /var/lib/ceph/osd/ceph-5/tmposd_0.1c/ chown -R ceph.ceph /var/lib/ceph/osd/ceph-5/tmposd_0.1c/ ceph-disk activate /var/lib/ceph/osd/ceph-5/tmposd_0.1c/ ceph osd crush reweight osd.$(cat /var/lib/ceph/osd/ceph-5/tmposd_0.1c/whoami) 0 systemctl stop ceph-osd@$(cat /var/lib/ceph/osd/ceph-5/tmposd_0.1c/whoami) ceph-objectstore-tool --op import --pgid 0.1c --data-path /var/lib/ceph/osd/ceph-$(cat /var/lib/ceph/osd/ceph-5/tmposd_0.1c/whoami) --journal-path /var/lib/ceph/osd/ceph-$(cat /var/lib/ceph/osd/ceph-5/tmposd_0.1c/whoami)/journal --file ./export/0.1c.export chown -R ceph.ceph /var/lib/ceph/osd/ceph-5/tmposd_0.1c/ systemctl start ceph-osd@$(cat /var/lib/ceph/osd/ceph-5/tmposd_0.1c/whoami) Sometimes import didn't work.. but stopping OSD and rerunning ceph-objectstore-tool again seems to help or when some PG didn't really want to import . Unfound messages are gone! But I still have down+peering, or down+remapped+peering. 
# ceph health detail HEALTH_ERR 22 pgs are stuck inactive for more than 300 seconds; 22 pgs down; 1 pgs inconsistent; 22 pgs peering; 22 pgs stuck inactive; 22 pgs stuck unclean; 1 requests are blocked > 32 sec; 1 osds have slow requests; 2 scrub errors; mds cluster is degraded; noout flag(s) set; no legacy OSD present but 'sortbitwise' flag is not set pg 1.d is stuck inactive since forever, current state down+peering, last acting [11,2] pg 0.a is stuck inactive since forever, current state down+remapped+peering, last acting [11,7] pg 2.8 is stuck inactive since forever, current state down+remapped+peering, last acting [11,7] pg 2.b is stuck inactive since forever, current state down+remapped+peering, last acting [7,11] pg 1.9 is stuck inactive since forever, current state down+remapped+peering, last acting [11,7] pg 0.e is stuck inactive since forever, current state down+peering, last acting [11,2] pg 1.3d is stuck inactive since forever, current state down+remapped+peering, last acting [10,6] pg 0.2c is stuck inactive since forever, current state down+peering, last acting [1,11] pg 0.0 is stuck inactive since forever, current state down+remapped+peering, last acting [10,7] pg 1.2b is stuck inactive since forever, current state down+peering, last acting [1,11] pg 0.29 is stuck inactive since forever, current state down+peering, last acting [11,6]
Re: [ceph-users] OSD_OUT_OF_ORDER_FULL even when the ratios are in order.
On 14. sep. 2017 11:58, dE . wrote: Hi, I got a ceph cluster where I'm getting a OSD_OUT_OF_ORDER_FULL health error, even though it appears that it is in order -- full_ratio 0.99 backfillfull_ratio 0.97 nearfull_ratio 0.98 These don't seem like a mistake to me but ceph is complaining -- OSD_OUT_OF_ORDER_FULL full ratio(s) out of order backfillfull_ratio (0.97) < nearfull_ratio (0.98), increased osd_failsafe_full_ratio (0.97) < full_ratio (0.99), increased ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com post output from ceph osd df ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
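The health check fires when the ratios do not satisfy nearfull < backfillfull < full. On luminous and newer they can be put back in order at runtime, for example back to the defaults:

ceph osd set-nearfull-ratio 0.85
ceph osd set-backfillfull-ratio 0.90
ceph osd set-full-ratio 0.95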
Re: [ceph-users] access ceph filesystem at storage level and not via ethernet
On 14. sep. 2017 00:34, James Okken wrote: Thanks Ronny! Exactly the info I need. And kind of what I thought the answer would be as I was typing and thinking more clearly about what I was asking. I was just hoping CEPH would work like this since the openstack fuel tools deploy CEPH storage nodes easily. I agree I would not be using CEPH for its strengths. I am interested further in what you've said in this paragraph though: "if you want to have FC SAN attached storage on servers, shareable between servers in a usable fashion I would rather mount the same SAN lun on multiple servers and use a cluster filesystem like ocfs or gfs that is made for this kind of solution." Please allow me to ask you a few questions regarding that even though it isn't CEPH specific. Do you mean gfs/gfs2, the global file system? Do ocfs and/or gfs require some sort of management/clustering server to maintain and manage them? (akin to a CEPH OSD) I'd love to find a distributed/cluster filesystem where I can just partition and format, and then be able to mount and use that same SAN datastore from multiple servers without a management server. If ocfs or gfs do need a server of this sort, does it need to be involved in the I/O? Or will I be able to mount the datastore like any other disk, with the IO going across the fibre channel? I only have experience with ocfs, but I think gfs works similarly. There are quite a few cluster filesystems to choose from: https://en.wikipedia.org/wiki/Clustered_file_system Servers that mount ocfs shared filesystems must have ocfs2-tools installed, have access to the common shared LUN via FC, and be aware of the other ocfs servers on the same lun, which you define in the /etc/ocfs2/cluster.conf config file; the ocfs daemon must also be running. Then it is just a matter of making the ocfs filesystem (on one server), adding it to fstab (on all servers) and mounting it. One final question, if you don't mind: do you think I could use ext4 or xfs and "mount the same SAN lun on multiple servers" if I can guarantee each server will only write to its own specific directory and never anywhere the other servers will be writing? (I even have the SAN mapped to each server using different lun's) Mounting the same (non-cluster) filesystem on multiple servers is guaranteed to destroy the filesystem: you will have multiple servers writing in the same metadata area and the same journal area, and generally trampling over each other. Luckily I think most modern filesystems would detect that the FS is mounted somewhere else and prevent you from mounting it again without big fat warnings. kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
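A minimal sketch of what an ocfs2 setup looks like; the node names, addresses and LUN path are made up:

# /etc/ocfs2/cluster.conf (identical on every node, o2cb service enabled)
cluster:
        node_count = 2
        name = sancluster
node:
        ip_port = 7777
        ip_address = 10.0.0.1
        number = 1
        name = server1
        cluster = sancluster
node:
        ip_port = 7777
        ip_address = 10.0.0.2
        number = 2
        name = server2
        cluster = sancluster

mkfs.ocfs2 -L shared /dev/mapper/sanlun     # run once, on one node
mount /dev/mapper/sanlun /srv/shared        # then mount on every node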
Re: [ceph-users] access ceph filesystem at storage level and not via ethernet
On 13.09.2017 19:03, James Okken wrote: Hi, Novice question here: The way I understand CEPH is that it distributes data in OSDs in a cluster. The reads and writes come across the ethernet as RBD requests and the actual data IO then also goes across the ethernet. I have a CEPH environment being setup on a fiber channel disk array (via an openstack fuel deploy). The servers using the CEPH storage also have access to the same fiber channel disk array. From what I understand those servers would need to make the RDB requests and do the IO across ethernet, is that correct? Even though with this infrastructure setup there is a “shorter” and faster path to those disks, via the fiber channel. Is there a way to access storage on a CEPH cluster when one has this “better” access to the disks in the cluster? (how about if it were to be only a single OSD with replication set to 1) Sorry if this question is crazy… thanks a bit cracy :) if the disks are directly attached on a OSD node, or attachable on Fiberchannel does not make a difference. you can not shortcut the ceph cluster and talk to the osd disks directly without eventually destroying the ceph cluster. Even if you did, ceph is an object storage on disk, so you would not find filesystem or RBD diskimages there, only objects on your FC attached osd node disks with filestore, and with bluestore not even readable objects. that beeing said I think a FC SAN attached ceph osd node sounds a bit strange. ceph's strength is the distributed scaleable solution. and having the osd nodes collected on a SAN array would nuter ceph's strengths, and amplify ceph's weakness of high latency. i would only consider such a solution for testing, learning or playing around without having actual hardware for a distributed system. and in that case use 1 lun for each osd disk, give 8-10 vm's some luns/osd's each, just to learn how to work with ceph. if you want to have FC SAN attached storage on servers, shareable between servers in a usable fashion I would rather mount the same SAN lun on multiple servers and use a cluster filesystem like ocfs or gfs that is made for this kind of solution. kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Power outages!!! help!
On 13. sep. 2017 07:04, hjcho616 wrote: Ronny, Did a bunch of ceph pg repair pg# and got the scrub errors down to 10... well it was 9, trying to fix one it became 10.. waiting for it to fix (I did that noout trick as I only have two copies). 8 of those scrub errors look like they would need data from osd.0. HEALTH_ERR 22 pgs are stuck inactive for more than 300 seconds; 22 pgs degraded; 6 pgs down; 3 pgs inconsistent; 6 pgs peering; 6 pgs recovering; 16 pgs stale; 22 pgs stuck degraded; 6 pgs stuck inactive; 16 pgs stuck stale; 28 pgs stuck unclean; 16 pgs stuck undersized; 16 pgs undersized; 1 requests are blocked > 32 sec; recovery 221990/4503980 objects degraded (4.929%); recovery 147/2251990 unfound (0.007%); 10 scrub errors; mds cluster is degraded; no legacy OSD present but 'sortbitwise' flag is not set From what I saw in ceph health detail, running osd.0 would solve the majority of the problems. But that was the disk with the smart error earlier. I did move to a new drive using ddrescue. When trying to start osd.0, I get this. Is there any way I can get around this? Running a rescued disk is not something you should try; this is when you should try to export using the ceph-objectstore-tool. Was this the drive that failed to export pg's because of the missing superblock? You could also try the export directly on the failed drive, just to see if that works. You may have to run the tool as the ceph user if that is the user owning all the files. You could try running the export of one of the pg's on osd.0 again and post all commands and output. good luck Ronny ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Power outages!!! help!
you can start by posting more details. atleast "ceph osd tree" "cat ceph.conf" and "ceph osd df" so we can see what settings you are running, and how your cluster is balanced at the moment. generally: inconsistent pg's are pg's that have scrub errors. use rados list-inconsistent-pg [pool] and rados-list-inconsistent-obj [pg] to locate the objects with problems. compare and fix the objects using info from http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/#pgs-inconsistent also read http://ceph.com/geen-categorie/ceph-manually-repair-object/ since you have so many scrub errors i would assume there are more bad disks, check all disk's smart values and look for read errors in logs. if you find any you should drain those disks by setting crush weight to 0. and when they are empty remove them from the cluster. personally i use smartmontools it sends me emails about bad disks, and check disks manually withsmartctl -a /dev/sda || echo bad-disk: $? pg's that are down+peering need to have one of the acting osd's started again. or to have the objects recovered using the methods we have discussed previously. ref: http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/#placement-group-down-peering-failure nb: do not mark any osd's as lost since that = dataloss. I would - check smart stats of all disks. drain disks that are going bad. make sure you have enough space on good disks to drain them properly. - check scrub errors and objects. fix those that are fixable. some may require an object from a down osd. - try to get down osd's running again if possible. if you manage to get one running, let it recover and stabilize. - recover and inject objects from osd's that do not run. stasrt by doing one and one pg. and once you get the hang of the method you can do multiple pg's at the same time. good luck Ronny Aasen On 11. sep. 2017 06:51, hjcho616 wrote: It took a while. It appears to have cleaned up quite a bit... but still has issues. I've been seeing below message for more than a day and cpu utilization and io utilization is low... looks like something is stuck... I rebooted OSDs several times when it looked like it was stuck earlier and it would work on something else, but now it is not changing much. What can I try now? 
Regards, Hong # ceph health detail HEALTH_ERR 22 pgs are stuck inactive for more than 300 seconds; 22 pgs degraded; 6 pgs down; 11 pgs inconsistent; 6 pgs peering; 6 pgs recovering; 16 pgs stale; 22 pgs stuck degraded; 6 pgs stuck inactive; 16 pgs stuck stale; 28 pgs stuck unclean; 16 pgs stuck undersized; 16 pgs undersized; 1 requests are blocked > 32 sec; 1 osds have slow requests; recovery 221990/4503980 objects degraded (4.929%); recovery 147/2251990 unfound (0.007%); 95 scrub errors; mds cluster is degraded; no legacy OSD present but 'sortbitwise' flag is not set pg 0.e is stuck inactive since forever, current state down+peering, last acting [11,2] pg 1.d is stuck inactive since forever, current state down+peering, last acting [11,2] pg 1.28 is stuck inactive since forever, current state down+peering, last acting [11,6] pg 0.29 is stuck inactive since forever, current state down+peering, last acting [11,6] pg 1.2b is stuck inactive since forever, current state down+peering, last acting [1,11] pg 0.2c is stuck inactive since forever, current state down+peering, last acting [1,11] pg 0.e is stuck unclean since forever, current state down+peering, last acting [11,2] pg 0.a is stuck unclean for 1233182.248198, current state stale+active+undersized+degraded+inconsistent, last acting [0] pg 2.8 is stuck unclean for 1238044.714421, current state stale+active+undersized+degraded, last acting [0] pg 2.1a is stuck unclean for 1238933.203920, current state active+recovering+degraded, last acting [2,11] pg 2.3 is stuck unclean for 1238882.443876, current state stale+active+undersized+degraded, last acting [0] pg 2.27 is stuck unclean for 1295260.765981, current state active+recovering+degraded, last acting [11,6] pg 0.d is stuck unclean for 1230831.504001, current state stale+active+undersized+degraded, last acting [0] pg 1.c is stuck unclean for 1238044.715698, current state stale+active+undersized+degraded, last acting [0] pg 1.3d is stuck unclean for 1232066.572856, current state stale+active+undersized+degraded, last acting [0] pg 1.28 is stuck unclean since forever, current state down+peering, last acting [11,6] pg 0.29 is stuck unclean since forever, current state down+peering, last acting [11,6] pg 1.2b is stuck unclean since forever, current state down+peering, last acting [1,11] pg 2.2f is stuck unclean for 1238127.474088, current state active+recovering+degraded+remapped, last acting [9,10] pg 0.0 is stuck unclean for 1233182.247776, current state stale+active+undersized+degraded, last acting [0] pg 0.2c is stuck unclean since forever, current
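For the smart checks mentioned in the advice above, something along these lines works; the device list is only an example:

apt-get install smartmontools
for d in /dev/sd[a-f]; do
  echo "== $d"; smartctl -H $d
  smartctl -A $d | egrep -i 'Reallocated_Sector|Current_Pending|Offline_Uncorrectable'
done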
Re: [ceph-users] Power outages!!! help!
I would not even attempt to connect a recovered drive to ceph, especially not one that has had xfs errors and corruption. Your pg's that are undersized lead me to believe you still need to either expand with more disks or nodes, or that you need to set osd crush chooseleaf type = 0 to let ceph pick 2 disks on the same node as a valid object placement (temporary, until you get 2 balanced nodes; see the command sketch after this message). Generally, let ceph self-heal as much as possible (no misplaced or degraded objects); this requires that ceph has space for the recovery. I would run with size=2 min_size=2. You should also look at the 7 scrub errors. They indicate that there can be other drives with issues; you want to locate where those inconsistent objects are and fix them. Read this page about fixing scrub errors: http://ceph.com/geen-categorie/ceph-manually-repair-object/ Then you would be left with the 103 unfound objects, and those you should try to recover from the recovered drive, by using the ceph-objectstore-tool export/import to export the pg's with missing objects to a dedicated, temporarily added import drive. The import drive does not need to be very large, since you can do one pg at a time, and you should only recover pg's that contain unfound objects; there are really only 103 unfound objects that you need to recover. Once the recovery is complete you can wipe the functioning recovery drive and install it as a new osd in the cluster. kind regards Ronny Aasen On 03.09.2017 06:20, hjcho616 wrote: I checked with ceph-2, 3, 4, 5 so I figured it was safe to assume that superblock file is the same. I copied it over and started OSD. It still fails with the same error message. Looks like when I updated to 10.2.9, some osd needs to be updated and that process is not finding the data it needs? What can I do about this situation? 2017-09-01 22:27:35.590041 7f68837e5800 1 filestore(/var/lib/ceph/osd/ceph-0) upgrade 2017-09-01 22:27:35.590149 7f68837e5800 -1 filestore(/var/lib/ceph/osd/ceph-0) could not find #-1:7b3f43c4:::osd_superblock:0# in index: (2) No such file or directory Regards, Hong On Friday, September 1, 2017 11:10 PM, hjcho616 <hjcho...@yahoo.com> wrote: Just realized there is a file called superblock in the ceph directory. ceph-1 and ceph-2's superblock file is identical, ceph-6 and ceph-7 are identical, but not between the two groups. When I originally created the OSDs, I created ceph-0 through 5. Can the superblock file be copied over from ceph-1 to ceph-0? Hmm.. it appears to be doing something in the background even though osd.0 is down. ceph health output is changing! # ceph health HEALTH_ERR 40 pgs are stuck inactive for more than 300 seconds; 14 pgs backfill_wait; 21 pgs degraded; 10 pgs down; 2 pgs inconsistent; 10 pgs peering; 3 pgs recovering; 2 pgs recovery_wait; 30 pgs stale; 21 pgs stuck degraded; 10 pgs stuck inactive; 30 pgs stuck stale; 45 pgs stuck unclean; 16 pgs stuck undersized; 16 pgs undersized; 2 requests are blocked > 32 sec; recovery 221826/2473662 objects degraded (8.968%); recovery 254711/2473662 objects misplaced (10.297%); recovery 103/2251966 unfound (0.005%); 7 scrub errors; mds cluster is degraded; no legacy OSD present but 'sortbitwise' flag is not set Regards, Hong On Friday, September 1, 2017 10:37 PM, hjcho616 <hjcho...@yahoo.com> wrote: Tried connecting recovered osd. Looks like some of the files in the lost+found are super blocks. Below is the log. What can I do about this?
2017-09-01 22:27:27.634228 7f68837e5800 0 set uid:gid to 1001:1001 (ceph:ceph) 2017-09-01 22:27:27.634245 7f68837e5800 0 ceph version 10.2.9 (2ee413f77150c0f375ff6f10edd6c8f9c7d060d0), process ceph-osd, pid 5432 2017-09-01 22:27:27.635456 7f68837e5800 0 pidfile_write: ignore empty --pid-file 2017-09-01 22:27:27.646849 7f68837e5800 0 filestore(/var/lib/ceph/osd/ceph-0) backend xfs (magic 0x58465342) 2017-09-01 22:27:27.647077 7f68837e5800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option 2017-09-01 22:27:27.647080 7f68837e5800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: SEEK_DATA/SEEK_HOLE is disabled via 'filestore seek data hole' config option 2017-09-01 22:27:27.647091 7f68837e5800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: splice is supported 2017-09-01 22:27:27.678937 7f68837e5800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: syncfs(2) syscall fully supported (by glibc and kernel) 2017-09-01 22:27:27.679044 7f68837e5800 0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_feature: extsize is disabled by conf 2017-09-01 22:27:27.680718 7f68837e5800 1 leveldb: Recovering log #28054 2017-09-01 22:27:27.804501 7f68837e5800 1 leveldb: Delete type=0 #28054 2017-09-01 22:27:27.804579 7f68837e5800 1 leveldb: Delete type=3 #28053 2
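The chooseleaf change mentioned above can be set in ceph.conf before a cluster is first deployed, or applied to an existing cluster by editing the crushmap. A sketch:

# ceph.conf (only affects the rule created at cluster creation):
[global]
osd crush chooseleaf type = 0

# existing cluster: edit the crush rule instead
ceph osd getcrushmap -o cm.bin
crushtool -d cm.bin -o cm.txt
# in cm.txt change "step chooseleaf firstn 0 type host" to "... type osd"
crushtool -c cm.txt -o cm-new.bin
ceph osd setcrushmap -i cm-new.bin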
Re: [ceph-users] Power outages!!! help!
On 30.08.2017 15:32, Steve Taylor wrote: I'm not familiar with dd_rescue, but I've just been reading about it. I'm not seeing any features that would be beneficial in this scenario that aren't also available in dd. What specific features give it "really a far better chance of restoring a copy of your disk" than dd? I'm always interested in learning about new recovery tools. i see i wrote dd_rescue from old habit, but the package one should use on debian is gddrescue or also called gnu ddrecue. this page have some details on the differences on dd vs the ddrescue variants. http://www.toad.com/gnu/sysadmin/index.html#ddrescue kind regards Ronny Aasen *Steve Taylor* | Senior Software Engineer |***StorageCraft Technology Corporation* <https://storagecraft.com> 380 Data Drive Suite 300 | Draper | Utah | 84020 *Office:* 801.871.2799 | If you are not the intended recipient of this message or received it erroneously, please notify the sender and delete it, together with any attachments, and be advised that any dissemination or copying of this message is prohibited. On Tue, 2017-08-29 at 21:49 +0200, Willem Jan Withagen wrote: On 29-8-2017 19:12, Steve Taylor wrote: Hong, Probably your best chance at recovering any data without special, expensive, forensic procedures is to perform a dd from /dev/sdb to somewhere else large enough to hold a full disk image and attempt to repair that. You'll want to use 'conv=noerror' with your dd command since your disk is failing. Then you could either re-attach the OSD from the new source or attempt to retrieve objects from the filestore on it. Like somebody else already pointed out In problem "cases like disk, use dd_rescue. It has really a far better chance of restoring a copy of your disk --WjW I have actually done this before by creating an RBD that matches the disk size, performing the dd, running xfs_repair, and eventually adding it back to the cluster as an OSD. RBDs as OSDs is certainly a temporary arrangement for repair only, but I'm happy to report that it worked flawlessly in my case. I was able to weight the OSD to 0, offload all of its data, then remove it for a full recovery, at which point I just deleted the RBD. The possibilities afforded by Ceph inception are endless. ☺ Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation 380 Data Drive Suite 300 | Draper | Utah | 84020 Office: 801.871.2799 | If you are not the intended recipient of this message or received it erroneously, please notify the sender and delete it, together with any attachments, and be advised that any dissemination or copying of this message is prohibited. On Mon, 2017-08-28 at 23:17 +0100, Tomasz Kusmierz wrote: Rule of thumb with batteries is: - more “proper temperature” you run them at the more life you get out of them - more battery is overpowered for your application the longer it will survive. Get your self a LSI 94** controller and use it as HBA and you will be fine. but get MORE DRIVES ! … On 28 Aug 2017, at 23:10, hjcho616 <hjcho...@yahoo.com <mailto:hjcho...@yahoo.com>> wrote: Thank you Tomasz and Ronny. I'll have to order some hdd soon and try these out. Car battery idea is nice! I may try that.. =) Do they last longer? Ones that fit the UPS original battery spec didn't last very long... part of the reason why I gave up on them.. =P My wife probably won't like the idea of car battery hanging out though ha! The OSD1 (one with mostly ok OSDs, except that smart failure) motherboard doesn't have any additional SATA connectors available. 
Would it be safe to add another OSD host? Regards, Hong On Monday, August 28, 2017 4:43 PM, Tomasz Kusmierz <tom.kusmierz@gmail.com> wrote: Sorry for being brutal … anyway 1. get the battery for the UPS ( a car battery will do as well, I've modded an UPS in the past with a truck battery and it was working like a charm :D ) 2. get spare drives and put those in, because your cluster CAN NOT get out of error due to lack of space 3. follow the advice of Ronny Aasen on how to recover data from the hard drives 4. get cooling to the drives or you will lose more ! On 28 Aug 2017, at 22:39, hjcho616 <hjcho...@yahoo.com> wrote: Tomasz, Those machines are behind a surge protector. Doesn't appear to be a good one! I do have a UPS... but it is my fault... no battery. Power was pretty reliable for a while... and the UPS was just beeping every chance it had, disrupting some sleep.. =P So running on surge protector only. I am running this in a home environment. So far, HDD failures have been very rare for this environment. =) It just doesn't get loaded as much! I am not s
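To make the dd vs ddrescue comparison in this thread concrete, here is a minimal sketch of the two approaches; the device name, image path and map file are just examples, not taken from the thread:

# dd if=/dev/sdb of=/mnt/backup/sdb.img bs=64K conv=noerror,sync

dd does a single pass, keeps going on read errors (noerror) and pads unreadable blocks with zeroes (sync), but it has no memory of where the bad areas were.

# ddrescue -n /dev/sdb /mnt/backup/sdb.img /mnt/backup/sdb.map
# ddrescue -r3 /dev/sdb /mnt/backup/sdb.img /mnt/backup/sdb.map

GNU ddrescue (package gddrescue on debian) copies the easy areas first (-n skips the slow scraping phase), the second run retries the bad areas (-r3 means 3 retry passes), and the map file lets you stop and resume without losing progress or re-stressing areas that already copied fine.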
Re: [ceph-users] Power outages!!! help!
[snip] I'm not sure if I am liking what I see on fdisk... it doesn't show sdb1. I hope it shows up when I run dd_rescue to the other drive... =P # fdisk /dev/sdb Welcome to fdisk (util-linux 2.25.2). Changes will remain in memory only, until you decide to write them. Be careful before using the write command. /dev/sdb: device contains a valid 'xfs' signature, it's strongly recommended to wipe the device by command wipefs(8) if this setup is unexpected to avoid possible collisions. Device does not contain a recognized partition table. Created a new DOS disklabel with disk identifier 0xe684adb6. Command (m for help): p Disk /dev/sdb: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors Units: sectors of 1 * 512 = 512 bytes Sector size (logical/physical): 512 bytes / 512 bytes I/O size (minimum/optimal): 512 bytes / 512 bytes Disklabel type: dos Disk identifier: 0xe684adb6 Command (m for help): Do not use fdisk for osd drives: they use the GPT partition structure and depend on the GPT uuids being correct. So use either parted or gdisk/cgdisk/sgdisk if you want to look at it. Writing an MBR partition table to the osd will naturally break it. kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
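A hedged example of how to look at that disk without touching it; both commands only print the existing table (using /dev/sdb from the quote above):

# parted /dev/sdb print
# sgdisk -p /dev/sdb

Unlike an interactive fdisk session, neither of these creates a new DOS disklabel in memory, so there is nothing to accidentally write to a disk you are still trying to rescue.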
Re: [ceph-users] Power outages!!! help!
> [SNIP - bad drives] Generally when a disk is displaying bad blocks to the OS, the drive has been remapping blocks for ages in the background, and the disk is really on its last legs. a bit unlikely that you get so many disks dying at the same time though. but the problem can have been silently worsening and was not really noticed until the osd had to restart due to the power loss. if this is _very_ important data i would recommend you start by taking the bad drives out of operation, and cloning each bad drive block by block onto a good one using dd_rescue. it is also a good idea to store an image of the disk so you can try the different rescue methods several times. in the very worst case send the disk to a professional data recovery company. once that is done, you have 2 options: try to make the osd run again: xfs_repair, plus manually finding corrupt objects (find + md5sum, looking for read errors) and deleting them, has helped me in the past. if you manage to get the osd to run, drain it by setting its crush weight to 0, and eventually remove the disk from the cluster. alternatively, if you can not get the osd running again: use ceph-objectstore-tool to extract objects and inject them using a clean node and osd, as described in http://ceph.com/geen-categorie/incomplete-pgs-oh-my/ read the man page and help for the tool; i think the arguments have changed slightly since that blog post. you may also run into read errors on corrupt objects, stopping your export. in that case rm the offending object and rerun the export. repeat for all bad drives. when doing the inject it is important that your cluster is operational and able to accept objects from the draining drive, so either set the crush failure domain to osd, or even better, add more osd nodes to make an operational cluster (with missing objects). also i see in your log you have os-prober testing all partitions. i tend to remove os-prober on machines that do not dual-boot with another os. rules of thumb for future ceph clusters: min_size=2 for a reason; it should never be 1 unless data loss is wanted. size=3 if you need the cluster to keep operating with a drive or node in an error state. size=2 gives you more space but the cluster will block on errors until the recovery is done. better to be blocking than losing data. if you have size=3 and 3 nodes and you lose a node, then your cluster can not self heal. you should have more nodes than you have set size to. have free space on the drives, this is where data is replicated to in case of a down node. if you have 4 nodes and you want to be able to lose one and still operate, you need leftover room on your 3 remaining nodes to cover for the lost one. the more nodes you have, the less the impact of a node failure is, and the less spare room is needed. for a 4 node cluster you should not fill more than 66% if you want to be able to self-heal + operate. good luck Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
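To make the ceph-objectstore-tool route above a bit more concrete, here is a rough sketch of the export/import steps from that blog post. The osd ids, pg id and file paths are only examples (not taken from this thread), and the arguments have changed a little between releases, so check ceph-objectstore-tool --help on your version first. Both the source osd and the destination osd must be stopped while the tool runs.

# ceph-objectstore-tool --op export --pgid 0.1f --data-path /var/lib/ceph/osd/ceph-3 --journal-path /var/lib/ceph/osd/ceph-3/journal --file /var/tmp/0.1f.export
# ceph-objectstore-tool --op import --data-path /var/lib/ceph/osd/ceph-30 --journal-path /var/lib/ceph/osd/ceph-30/journal --file /var/tmp/0.1f.export

The first command dumps one pg from the broken osd to a file, the second injects it into a spare empty osd; once that osd is started the cluster can find the previously missing objects and backfill them to wherever crush wants them. Draining an osd that does still run is the crush weight trick mentioned above: # ceph osd crush reweight osd.3 0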
Re: [ceph-users] Power outages!!! help!
comments inline On 28.08.2017 18:31, hjcho616 wrote: I'll see what I can do on that... Looks like I may have to add another OSD host as I utilized all of the SATA ports on those boards. =P Ronny, I am running with size=2 min_size=1. I created everything with ceph-deploy and didn't touch much of those pool settings... I hope not, but sounds like I may have lost some files! I do want some of those OSDs to come back online somehow... to get that confidence level up. =P This is a bad idea, as you have found out. Once your cluster is healthy you should look at improving this. The dead osd.3 message is probably me trying to stop and start the osd. There were some cases where stop didn't kill the ceph-osd process. I just started or restarted the osd to try and see if that worked.. After that, there were some reboots and I am not seeing those messages after it... when providing logs, try to move away the old one, do a single startup, and post that. it makes it easier to read when you have a single run in the file. This is something I am running at home. I am the only user. In a way it is a production environment but just driven by me. =) Do you have any suggestions to get any of those osd.3, osd.4, osd.5, and osd.8 come back up without removing them? I have a feeling I can get some data back with some of them intact. just in case: not being able to make them run again does not automatically mean the data is lost. i have successfully recovered lost objects using these instructions http://ceph.com/geen-categorie/incomplete-pgs-oh-my/ I would start by renaming the osd's log file, doing a single try at starting the osd, and posting that log. have you done anything to the osd's that could make them not run ? kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
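A minimal sketch of the two suggestions above; the pool name "rbd", osd id 3 and the systemd unit name are assumptions, so adjust them to your setup (older sysvinit installs would use /etc/init.d/ceph start osd.3 instead):

# ceph osd pool set rbd min_size 2
# ceph osd pool set rbd size 3
# mv /var/log/ceph/ceph-osd.3.log /var/log/ceph/ceph-osd.3.log.old
# systemctl start ceph-osd@3

Raising size only helps once there is enough space and enough hosts for a third copy; raising min_size to 2 can be done right away to stop new single-copy writes while troubleshooting.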
Re: [ceph-users] Power outages!!! help!
h-3' is currently in use. (Is ceph-osd already running?) 7faf16e23800 -1 ** ERROR: osd pre_init failed: (16) Device or resource busy This can indicate that you have a dead osd.3 process keeping the resources open, and preventing a new osd from starting. check with ps aux if you can see any ceph processes. If you do find something relating to your down osd's, you should try stopping it normally, and if that fails, kill it manually before trying to restart the osd. also check dmesg for messages relating to faulty hardware or the OOM killer. i have had experiences with the OOM killer where the osd node became unreliable until i rebooted the machine. kind regards, and good luck Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
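A small sketch of those checks; <pid> stands for whatever leftover ceph-osd process ps shows:

# ps aux | grep '[c]eph-osd'
# kill <pid>
# kill -9 <pid>
# dmesg | egrep -i 'i/o error|ata|out of memory|oom'

The bracketed first letter keeps grep itself out of the listing. Try the plain kill first and only fall back to -9 if the process refuses to die; the dmesg filter is just a starting point for spotting disk errors and OOM kills.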
Re: [ceph-users] Monitoring a rbd map rbd connection
write to a subdirectory on the RBD. so if it is not mounted, the directory will be missing, and you get a no such file error. Ronny Aasen On 25.08.2017 18:04, David Turner wrote: Additionally, solely testing if you can write to the path could give a false sense of security if the path is writable when the RBD is not mounted. It would write a file to the system drive and you would see it as successful. On Fri, Aug 25, 2017 at 2:27 AM Adrian Saul <adrian.s...@tpgtelecom.com.au> wrote: If you are monitoring to ensure that it is mounted and active, a simple check_disk on the mountpoint should work. If the mount is not present, or the filesystem is non-responsive then this should pick it up. A second check to perhaps test you can actually write files to the file system would not go astray either. Other than that I don't think there is much point checking anything else like rbd mapped output. > -Original Message- > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of > Hauke Homburg > Sent: Friday, 25 August 2017 1:35 PM > To: ceph-users <ceph-us...@ceph.com> > Subject: [ceph-users] Monitoring a rbd map rbd connection > > Hello, > > I want to monitor the mapped connection between an rbd image mapped with rbd map > and a /dev/rbd device. > > I want to do this with icinga. > > Has anyone an idea how i can do this? > > My first idea is to touch and remove a file in the mount point. I am not sure > that this is the only thing i have to do > > > Thanks for help > > Hauke > > -- > www.w3-creative.de > > www.westchat.de > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com Confidentiality: This email and any attachments are confidential and may be subject to copyright, legal or some other professional privilege. They are intended solely for the attention and use of the named addressee(s). They may only be copied, distributed or disclosed with the consent of the copyright owner. If you have received this email by mistake or by breach of the confidentiality clause, please notify the sender immediately by return email and delete or destroy all copies of the email. Any confidentiality, privilege or copyright is not waived or lost because this email has been sent to you by mistake. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
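A tiny sketch of the "write to a subdirectory" idea above, as a nagios/icinga style check; the mount point and marker path are made-up examples:

#!/bin/sh
# fails unless the marker directory exists on the mounted RBD and is writable
MARKER=/mnt/rbd0/.mounted/heartbeat
if date > "$MARKER" 2>/dev/null; then
    echo "OK: rbd filesystem mounted and writable"
    exit 0
else
    echo "CRITICAL: cannot write $MARKER (rbd not mounted?)"
    exit 2
fi

Because the .mounted directory only exists inside the RBD filesystem, a write that would otherwise land on the system disk (when the mount is missing) fails with "no such file", which is exactly the signal David's caveat asks for.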
Re: [ceph-users] luminous/bluestore osd memory requirements
On 10.08.2017 17:30, Gregory Farnum wrote: This has been discussed a lot in the performance meetings so I've added Mark to discuss. My naive recollection is that the per-terabyte recommendation will be more realistic than it was in the past (an effective increase in memory needs), but also that it will be under much better control than previously. Is there any way to tune or reduce the memory footprint? perhaps by sacrificing performance? our jewel cluster's osd servers are maxed out on memory, and with the added memory requirements I fear we may not be able to upgrade to luminous/bluestore. kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph packages on stretch from eu.ceph.com
Thanks for the suggestions. i did do a trial with the proxmox ones, on a single node machine though. But i hope, now that debian 9 is released and stable, that the ceph repos will include stretch soon.. Hint Hint :) I am itching to try to upgrade my testing cluster. :) kind regards Ronny Aasen On 26. april 2017 19:46, Alexandre DERUMIER wrote: you can try the proxmox stretch repository if you want http://download.proxmox.com/debian/ceph-luminous/dists/stretch/ - Original Message - From: "Wido den Hollander" <w...@42on.com> To: "ceph-users" <ceph-users@lists.ceph.com>, "Ronny Aasen" <ronny+ceph-us...@aasen.cx> Sent: Wednesday 26 April 2017 16:58:04 Subject: Re: [ceph-users] ceph packages on stretch from eu.ceph.com On 25 April 2017 at 20:07, Ronny Aasen <ronny+ceph-us...@aasen.cx> wrote: Hello i am trying to install ceph on debian stretch from http://eu.ceph.com/debian-jewel/dists/ but there is no stretch repo there. now with stretch being frozen, it is a good time to be testing ceph on stretch. is it possible to get packages for stretch on jewel, kraken, and luminous ? Afaik packages are only built for stable releases. As Stretch isn't out there are no packages. You can try if the Ubuntu 16.04 (Xenial) packages work. Wido kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
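For anyone finding this later, a hedged sketch of what using that proxmox repository would look like on stretch; the component name ("main") and the key handling are assumptions, so check the dists/ listing above and proxmox's own documentation before relying on it:

# echo 'deb http://download.proxmox.com/debian/ceph-luminous stretch main' > /etc/apt/sources.list.d/ceph-luminous.list
# apt-get update && apt-get install ceph

You would also need the proxmox release key in apt's trusted keyring (apt-key add, or a file under /etc/apt/trusted.gpg.d/) for apt to accept packages from the repository.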
[ceph-users] ceph packages on stretch from eu.ceph.com
Hello i am trying to install ceph on debian stretch from http://eu.ceph.com/debian-jewel/dists/ but there is no stretch repo there. now with stretch being frozen, it is a good time to be testing ceph on stretch. is it possible to get packages for stretch on jewel, kraken, and luminous ? kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] best practices in connecting clients to cephfs public network
hello i want to connect 3 servers to cephfs. The servers are normally not in the public network. is it best practice to connect 2 interfaces on the servers, so they are directly connected to the public network? or to route between the networks via their common default gateway? the machines are vm's so it's easy to add interfaces, and the servers' lan and the cluster's public network are on the same router, so it's also easy to route between them. there is a separate firewall in front of the routed networks, so the security aspect is quite similar one way or the other. what is the recommended way to connect clients to the public network ? kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Directly addressing files on individual OSD
On 16.03.2017 08:26, Youssef Eldakar wrote: Thanks for the reply, Anthony, and I am sorry my question did not give sufficient background. This is the cluster behind archive.bibalex.org. Storage nodes keep archived webpages as multi-member GZIP files on the disks, which are formatted using XFS as standalone file systems. The access system consults an index that says where a URL is stored, which is then fetched over HTTP from the individual storage node that has the URL somewhere on one of the disks. So far, we have pretty much been managing the storage using homegrown scripts to have each GZIP file stored on 2 separate nodes. This obviously has been requiring a good deal of manual work and as such has not been very effective. Given that description, do you feel Ceph could be an appropriate choice? if you adapt your scripts to something like... "Storage nodes archive webpages as gzip files, hash the url to use as an object name, and save the gzip files as objects in ceph via the S3 interface. The access system gets a request for a url, hashes the url into an object name, and fetches the gzip (object) using regular S3 GET syntax." ceph would deal with replication; you would only put objects in and fetch them out. you could, if you need it, keep the list of urls and hashes as a record of what you have stored. this is just an example though. you could also use cephfs, mounted on the nodes, and serve files as today. ceph is just a storage tool; it could work very nicely for your needs, but accessing the files on osd's directly will only bring pain. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
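A minimal sketch of that hash-and-put flow using s3cmd against a radosgw; the bucket name, the configured endpoint and the choice of sha256 are assumptions, not anything from the original setup:

# KEY=$(echo -n "http://example.org/some/page.html" | sha256sum | awk '{print $1}')
# s3cmd put page-12345.warc.gz s3://webarchive/$KEY
# s3cmd get s3://webarchive/$KEY /tmp/page.gz

The same hashing in the access system turns a url lookup into a single GET, and ceph handles placement and the two (or three) copies that the homegrown scripts manage today.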
[ceph-users] ceph osd crash on startup / crashed first during snap removal
greetings when i removed a single large rbd snap today, from a 20 TB rbd, my osd's had very high load for a while. during this period of high load, where multiple osd's were marked down and marked themselves up again, 2 of my osd's crashed, and these do not want to start again. the log does not show anything obvious to me as to why the osd should crash so quickly like that on startup. the logs do not show anything wrong with the hardware either. i have shared a complete log file, using debug osd/filestore/journal = 20, where i try to start the osd. https://owncloud.fjordane-it.no/index.php/s/gYEmYOcuil8ANG2 i still have the osd available so i can try starting it again with other debug values if that is valuable. i hope someone can shed some light on why this osd crashes. kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
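For reference, debug levels like those are typically set either in ceph.conf (useful here, since the osd crashes at startup and runtime injection never gets a chance) or injected into running daemons; a small sketch:

[osd]
    debug osd = 20
    debug filestore = 20
    debug journal = 20

# ceph tell osd.* injectargs '--debug-osd 20 --debug-filestore 20 --debug-journal 20'

Level 20 logging is very chatty, so remember to turn it back down afterwards and watch the free space on the log partition.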
Re: [ceph-users] pg stuck with unfound objects on non-existing osd's
thanks for the suggestion. is a rolling reboot sufficient? or must all osd's be down at the same time? one is no problem. the other takes some scheduling.. Ronny Aasen On 01.11.2016 21:52, c...@elchaka.de wrote: Hello Ronny, if it is possible for you, try to reboot all OSD nodes. I had this issue on my test cluster and it became healthy after rebooting. Hth - Mehmet On 1 November 2016 19:55:07 MEZ, Ronny Aasen <ronny+ceph-us...@aasen.cx> wrote: Hello. I have a cluster stuck with 2 pg's stuck undersized degraded, with 25 unfound objects. [snip - the original post is quoted in full further down in this thread] this is hammer 0.94.9 on debian 8. kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Need help! Ceph backfill_toofull and recovery_wait+degraded
if you have the default crushmap and osd pool default size = 3, then ceph creates 3 copies of each object. and store it on 3 separate nodes. so the best way to solve your space problems is to try to even out the space between your hosts. either by adding disks to ceph1 ceph2 ceph3, or by adding more nodes. kind regards Ronny Aasen On 01.11.2016 20:14, Marcus Müller wrote: > Hi all, > > i have a big problem and i really hope someone can help me! > > We are running a ceph cluster since a year now. Version is: 0.94.7 (Hammer) > Here is some info: > > Our osd map is: > > ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY > -1 26.67998 root default > -2 3.64000 host ceph1 > 0 3.64000 osd.0 up 1.0 1.0 > -3 3.5 host ceph2 > 1 3.5 osd.1 up 1.0 1.0 > -4 3.64000 host ceph3 > 2 3.64000 osd.2 up 1.0 1.0 > -5 15.89998 host ceph4 > 3 4.0 osd.3 up 1.0 1.0 > 4 3.5 osd.4 up 1.0 1.0 > 5 3.2 osd.5 up 1.0 1.0 > 6 5.0 osd.6 up 1.0 1.0 > > ceph df: > > GLOBAL: > SIZE AVAIL RAW USED %RAW USED > 40972G 26821G 14151G 34.54 > POOLS: > NAMEID USED %USED MAX AVAIL OBJECTS > blocks 7 4490G 10.96 1237G 7037004 > commits 8 473M 0 1237G 802353 > fs 9 9666M 0.02 1237G 7863422 > > ceph osd df: > > ID WEIGHT REWEIGHT SIZE USEAVAIL %USE VAR > 0 3.64000 1.0 3724G 3128G 595G 84.01 2.43 > 1 3.5 1.0 3724G 3237G 487G 86.92 2.52 > 2 3.64000 1.0 3724G 3180G 543G 85.41 2.47 > 3 4.0 1.0 7450G 1616G 5833G 21.70 0.63 > 4 3.5 1.0 7450G 1246G 6203G 16.74 0.48 > 5 3.2 1.0 7450G 1181G 6268G 15.86 0.46 > 6 5.0 1.0 7450G 560G 6889G 7.52 0.22 > TOTAL 40972G 14151G 26820G 34.54 > MIN/MAX VAR: 0.22/2.52 STDDEV: 36.53 > > > Our current cluster state is: > > health HEALTH_WARN > 63 pgs backfill > 8 pgs backfill_toofull > 9 pgs backfilling > 11 pgs degraded > 1 pgs recovering > 10 pgs recovery_wait > 11 pgs stuck degraded > 89 pgs stuck unclean > recovery 8237/52179437 objects degraded (0.016%) > recovery 9620295/52179437 objects misplaced (18.437%) > 2 near full osd(s) > noout,noscrub,nodeep-scrub flag(s) set > monmap e8: 4 mons at {ceph1=192.168.10.3:6789/0,ceph2=192.168.10.4:6789/0,ceph3=192.168.10.5:6789/0,ceph4=192.168.60.6:6789/0} > election epoch 400, quorum 0,1,2,3 ceph1,ceph2,ceph3,ceph4 > osdmap e1774: 7 osds: 7 up, 7 in; 84 remapped pgs > flags noout,noscrub,nodeep-scrub > pgmap v7316159: 320 pgs, 3 pools, 4501 GB data, 15336 kobjects > 14152 GB used, 26820 GB / 40972 GB avail > 8237/52179437 objects degraded (0.016%) > 9620295/52179437 objects misplaced (18.437%) > 231 active+clean > 61 active+remapped+wait_backfill >9 active+remapped+backfilling >6 active+recovery_wait+degraded+remapped >6 active+remapped+backfill_toofull >4 active+recovery_wait+degraded >2 active+remapped+wait_backfill+backfill_toofull >1 active+recovering+degraded > recovery io 11754 kB/s, 35 objects/s > client io 1748 kB/s rd, 249 kB/s wr, 44 op/s > > > My main problems are: > > - As you can see from the osd tree, we have three separate hosts with only one osd each. Another one has four osds. Ceph allows me not to get data back from these three nodes with only one HDD, which are all near full. I tried to set the weight of the osds in the bigger node higher but this just does not work. So i added a new osd yesterday which made things not better, as you can see now. What do i have to do to just become these three nodes empty again and put more data on the other node with the four HDDs. > > - I added the „ceph4“ node later, this resulted in a strange ip change as you can see in the mon list. 
The public network and the cluster network were swapped or not assigned right. See ceph.conf > > [global] > fsid = xxx > mon_initial_members = ceph1 > mon_host = 192.168.10.3, 192.168.10.4, 192.168.10.5, 192.168.10.11 > auth_cluster_required = ce
[ceph-users] pg stuck with unfound objects on non-existing osd's
Hello. I have a cluster stuck with 2 pg's stuck undersized degraded, with 25 unfound objects. # ceph health detail HEALTH_WARN 2 pgs degraded; 2 pgs recovering; 2 pgs stuck degraded; 2 pgs stuck unclean; 2 pgs stuck undersized; 2 pgs undersized; recovery 294599/149522370 objects degraded (0.197%); recovery 640073/149522370 objects misplaced (0.428%); recovery 25/46579241 unfound (0.000%); noout flag(s) set pg 6.d4 is stuck unclean for 8893374.380079, current state active+recovering+undersized+degraded+remapped, last acting [62] pg 6.ab is stuck unclean for 8896787.249470, current state active+recovering+undersized+degraded+remapped, last acting [18,12] pg 6.d4 is stuck undersized for 438122.427341, current state active+recovering+undersized+degraded+remapped, last acting [62] pg 6.ab is stuck undersized for 416947.461950, current state active+recovering+undersized+degraded+remapped, last acting [18,12] pg 6.d4 is stuck degraded for 438122.427402, current state active+recovering+undersized+degraded+remapped, last acting [62] pg 6.ab is stuck degraded for 416947.462010, current state active+recovering+undersized+degraded+remapped, last acting [18,12] pg 6.d4 is active+recovering+undersized+degraded+remapped, acting [62], 25 unfound pg 6.ab is active+recovering+undersized+degraded+remapped, acting [18,12] recovery 294599/149522370 objects degraded (0.197%) recovery 640073/149522370 objects misplaced (0.428%) recovery 25/46579241 unfound (0.000%) noout flag(s) set have been following the troubleshooting guide at http://docs.ceph.com/docs/hammer/rados/troubleshooting/troubleshooting-pg/ but gets stuck without a resolution. luckily it is not critical data. so i wanted to mark the pg lost so it could become health-ok # ceph pg 6.d4 mark_unfound_lost delete Error EINVAL: pg has 25 unfound objects but we haven't probed all sources, not marking lost querying the pg i see that it would want osd.80 and osd 36 { "osd": "80", "status": "osd is down" }, trying to mark the osd's lost does not work either. since the osd's was removed from the cluster a long time ago. # ceph osd lost 80 --yes-i-really-mean-it osd.80 is not down or doesn't exist # ceph osd lost 36 --yes-i-really-mean-it osd.36 is not down or doesn't exist and this is where i am stuck. have tried stopping and starting the 3 osd's but that did not have any effect. Anyone have any advice how to proceed ? full output at: http://paste.debian.net/hidden/be03a185/ this is hammer 0.94.9 on debian 8. kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] offending shards are crashing osd's
On 19. okt. 2016 13:00, Ronny Aasen wrote: On 06. okt. 2016 13:41, Ronny Aasen wrote: hello I have a few osd's in my cluster that are regularly crashing. [snip] ofcourse having 3 osd's dying regularly is not good for my health. so i have set noout, to avoid heavy recoveries. googeling this error messages gives exactly 1 hit: https://github.com/ceph/ceph/pull/6946 where it saies: "the shard must be removed so it can be reconstructed" but with my 3 osd's failing, i am not certain witch of them contain the broken shard. (or perhaps all 3 of them?) a bit reluctant to delete on all 3. I have 4+2 erasure coding. ( erasure size 6 min_size 4 ) so finding out witch one is bad would be nice. hope someone have an idea how to progress. kind regards Ronny Aasen i again have this problem with crashing osd's. a more detailed log is on the tail of this mail. Does anyone have any suggestions on how i can identify what shard that needs to be removed to allow the EC to recover. ? and more importantly how i can stop the osd's from crashing? kind regards Ronny Aasen Answering my own question for googleability. using this one-liner: for dir in $(find /var/lib/ceph/osd/ceph-* -maxdepth 2 -type d -name '5.26*' | sort | uniq) ; do find $dir -name '*3a3938238e1f29.002d80ca*' -type f -ls ;done i got a list of all shards of the problematic object. One of the objects had size 0 but was otherwise readable without any io errors. I guess this explains the inconsistent size, but it does not explain why ceph decides it's better to crash 3 osd's, rather than move a 0 byte file into a "LOST+FOUND" style directory structure. Or just delete it, since it will not have any useful data anyway. Deleting this file (mv to /tmp) allowed the 3 broken osd's to start, and they have been running for >24h now, while usually they crash within 10 minutes. Yay! Generally you need to check _all_ shards on the given pg, not just the 3 crashing. This was what confused me, since i only focused on the crashing osd's. I used the one-liner that checked all osd's for the pg, since due to backfilling the pg was spread all over the place, and i could run it from ansible to reduce tedious work. Also it would be convenient to be able to mark a broken/inconsistent pg manually "inactive", instead of crashing 3 osd's and taking lots of other pg's down with them. One could set the pg inactive while troubleshooting, and unset pg-inactive when done, without having osd's crash and all the following high load rebalancing. Also i ran a find for 0 size files on that pg and there are multiple other files. is a 0 byte rbd_data file in a pg a normal occurrence, or can i have more similar problems in the future due to the other 0 size files ? kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
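For googleability as well: the "find for 0 size files" mentioned above can be done with a variant of the same one-liner (the pg id 5.26 is the one from this thread, the rest is a generic sketch):

# for dir in $(find /var/lib/ceph/osd/ceph-* -maxdepth 2 -type d -name '5.26*' | sort | uniq); do find $dir -type f -size 0 -ls; done

-size 0 only matches completely empty files, so the output is a candidate list of shards to compare against their siblings on the other osd's before deciding whether any of them are the same kind of problem.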
Re: [ceph-users] offending shards are crashing osd's
On 06. okt. 2016 13:41, Ronny Aasen wrote: hello I have a few osd's in my cluster that are regularly crashing. [snip] ofcourse having 3 osd's dying regularly is not good for my health. so i have set noout, to avoid heavy recoveries. googeling this error messages gives exactly 1 hit: https://github.com/ceph/ceph/pull/6946 where it saies: "the shard must be removed so it can be reconstructed" but with my 3 osd's failing, i am not certain witch of them contain the broken shard. (or perhaps all 3 of them?) a bit reluctant to delete on all 3. I have 4+2 erasure coding. ( erasure size 6 min_size 4 ) so finding out witch one is bad would be nice. hope someone have an idea how to progress. kind regards Ronny Aasen i again have this problem with crashing osd's. a more detailed log is on the tail of this mail. Does anyone have any suggestions on how i can identify what shard that needs to be removed to allow the EC to recover. ? and more importantly how i can stop the osd's from crashing? kind regards Ronny Aasen -- query of pg in question -- # ceph pg 5.26 query { "state": "active+undersized+degraded+remapped+wait_backfill", "snap_trimq": "[]", "epoch": 138744, "up": [ 27, 109, 2147483647, 2147483647, 62, 75 ], "acting": [ 2147483647, 2147483647, 32, 107, 62, 38 ], "backfill_targets": [ "27(0)", "75(5)", "109(1)" ], "actingbackfill": [ "27(0)", "32(2)", "38(5)", "62(4)", "75(5)", "107(3)", "109(1)" ], "info": { "pgid": "5.26s2", "last_update": "84093'35622", "last_complete": "84093'35622", "log_tail": "82361'32622", "last_user_version": 0, "last_backfill": "MAX", "purged_snaps": "[1~7]", "history": { "epoch_created": 61149, "last_epoch_started": 138692, "last_epoch_clean": 136567, "last_epoch_split": 0, "same_up_since": 138691, "same_interval_since": 138691, "same_primary_since": 138691, "last_scrub": "84093'35622", "last_scrub_stamp": "2016-10-18 06:18:28.253508", "last_deep_scrub": "84093'35622", "last_deep_scrub_stamp": "2016-10-14 05:33:56.701167", "last_clean_scrub_stamp": "2016-10-14 05:33:56.701167" }, "stats": { "version": "84093'35622", "reported_seq": "210475", "reported_epoch": "138730", "state": "active+undersized+degraded+remapped+wait_backfill", "last_fresh": "2016-10-19 12:40:32.982617", "last_change": "2016-10-19 12:03:29.377914", "last_active": "2016-10-19 12:40:32.982617", "last_peered": "2016-10-19 12:40:32.982617", "last_clean": "2016-07-19 12:03:54.814292", "last_became_active": "0.00", "last_became_peered": "0.00", "last_unstale": "2016-10-19 12:40:32.982617", "last_undegraded": "2016-10-19 12:02:03.030755", "last_fullsized": "2016-10-19 12:02:03.030755", "mapping_epoch": 138627, "log_start": "82361'32622", "ondisk_log_start": "82361'32622", "created": 61149, "last_epoch_clean": 136567, "parent": "0.0", "parent_split_bits": 0, "last_scrub": "84093'35622", "last_scrub_stamp": "2016-10-18 06:18:28.253508", "last_deep_scrub": "84093'35622", "last_deep_scrub_stamp": "2016-10-14 05:33:56.701167", "last_clean_scrub_stamp": "2016-10-14 05:33:56.701167", "log_size": 3000, "ondisk_log_size
Re: [ceph-users] Recovery/Backfill Speedup
how did you set the parameter ? editing ceph.conf only works when you restart the osd nodes. but running something like ceph tell osd.* injectargs '--osd-max-backfills 6' would set all osd's max backfill dynamically without restarting the osd. and you should fairly quickly afterwards see more backfills in ceph -s I have also noticed that if i run ceph -n osd.0 --show-config on one of my mon nodes, it shows the deafult settings. it does not actualy talk to osd.0 and get the current settings. but if i run it from any osd node it works. But i am on hammer and not on jewel so this might have changed and actualy work for you. Kind regards Ronny Aasen On 05. okt. 2016 21:52, Dan Jakubiec wrote: Thank Ronny, I am working with Reed on this problem. Yes something is very strange. Docs say osd_max_backfills default to 10, but when we examined the run-time configuration using "ceph --show-config" it was showing osd_max_backfills set to 1 (we are running latest Jewel release). We have explicitly set this parameter to 10 now. Sadly, about 2 hours in backfills continue to be anemic. Any other ideas? $ ceph -s cluster edeb727e-c6d3-4347-bfbb-b9ce7f60514b health HEALTH_WARN 246 pgs backfill_wait 3 pgs backfilling 329 pgs degraded 83 pgs recovery_wait 332 pgs stuck unclean 257 pgs undersized recovery 154681996/676556815 objects degraded (22.863%) recovery 278768286/676556815 objects misplaced (41.204%) noscrub,nodeep-scrub,sortbitwise flag(s) set monmap e1: 3 mons at {core=10.0.1.249:6789/0,db=10.0.1.251:6789/0,dev=10.0.1.250:6789/0} election epoch 210, quorum 0,1,2 core,dev,db osdmap e4274: 16 osds: 16 up, 16 in; 279 remapped pgs flags noscrub,nodeep-scrub,sortbitwise pgmap v1657039: 576 pgs, 2 pools, 6427 GB data, 292 Mobjects 15308 GB used, 101 TB / 116 TB avail 154681996/676556815 objects degraded (22.863%) 278768286/676556815 objects misplaced (41.204%) 244 active+clean 242 active+undersized+degraded+remapped+wait_backfill 53 active+recovery_wait+degraded 17 active+recovery_wait+degraded+remapped 13 active+recovery_wait+undersized+degraded+remapped 3 active+remapped+wait_backfill 2 active+undersized+degraded+remapped+backfilling 1 active+degraded+remapped+wait_backfill 1 active+degraded+remapped+backfilling recovery io 1568 kB/s, 109 objects/s client io 5629 kB/s rd, 411 op/s rd, 0 op/s wr Here is what our current configuration looks like: $ ceph -n osd.0 --show-config | grep osd | egrep "recovery|backfill" | sort osd_allow_recovery_below_min_size = true osd_backfill_full_ratio = 0.85 osd_backfill_retry_interval = 10 osd_backfill_scan_max = 512 osd_backfill_scan_min = 64 osd_debug_reject_backfill_probability = 0 osd_debug_skip_full_check_in_backfill_reservation = false osd_kill_backfill_at = 0 osd_max_backfills = 10 osd_min_recovery_priority = 0 osd_recovery_delay_start = 0 osd_recovery_forget_lost_objects = false osd_recovery_max_active = 15 osd_recovery_max_chunk = 8388608 osd_recovery_max_single_start = 1 osd_recovery_op_priority = 63 osd_recovery_op_warn_multiple = 16 osd_recovery_sleep = 0 osd_recovery_thread_suicide_timeout = 300 osd_recovery_thread_timeout = 30 osd_recovery_threads = 5 -- Dan Ronny Aasen wrote: On 04.10.2016 16:31, Reed Dier wrote: Attempting to expand our small ceph cluster currently. Have 8 nodes, 3 mons, and went from a single 8TB disk per node to 2x 8TB disks per node, and the rebalancing process is excruciatingly slow. Originally at 576 PGs before expansion, and wanted to allow rebalance to finish before expanding the PG count for the single pool, and the replication size. 
I have stopped scrubs for the time being, as well as set client and recovery io to equal parts so that client io is not burying the recovery io. Also have increased the number of recovery threads per osd. [osd] osd_recovery_threads = 5 filestore_max_sync_interval = 30 osd_client_op_priority = 32 osd_recovery_op_priority = 32 Also, this is 10G networking we are working with and recovery io typically hovers between 0-35 MB’s but typically very bursty. Disks are 8TB 7.2k SAS disks behind an LSI 3108 controller, configured as individual RAID0 VD’s, with pdcache disabled, but BBU backed write back caching enabled at the controller level. Have thought about increasing the ‘osd_max_backfills’ as well as ‘osd_recovery_max_active’, and possibly ‘osd_recovery_max_chunk’ to attempt to speed it up, but will hopefully get some insight from the community here. ceph -s about 4 days in: health HEALTH_WARN 255 pgs backfill_wait 4 pgs backfilling 385 pgs degraded 1
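A hedged footnote on verifying what a running osd actually uses, since (as noted above) running --show-config on a mon only reports defaults and does not talk to the osd: the admin socket on the osd node itself answers for the live daemon.

# ceph daemon osd.0 config get osd_max_backfills
# ceph daemon osd.0 config get osd_recovery_max_active
# ceph tell osd.* injectargs '--osd-max-backfills 6 --osd-recovery-max-active 6'

The first two must be run on the host where osd.0 lives; the injectargs line changes every osd at runtime, and the effect should show up as more parallel backfills in ceph -s within a few minutes.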
[ceph-users] offending shards are crashing osd's
hello I have a few osd's in my cluster that are regularly crashing. in the log of them i can see osd.7 -1> 2016-10-06 08:09:18.869687 7ffaa037f700 -1 osd.7 pg_epoch: 128840 pg[5.3as0( v 84797'30080 (67219'27080,84797'30080] local-les=128834 n=13146 ec=61149 les/c 128834/127358 128829/128829/128829) [7,109,4,0,62,32]/[7,109,32,0,62,39] r=0 lpr=128829 pi=127357-128828/12 rops=5 bft=4(2),32(5) crt=0'0 lcod 0'0 mlcod 0'0 active+remapped+backfilling] handle_recovery_read_complete: inconsistent shard sizes 5/abc6d43a/rbd_data.33640a238e1f29.0003b165/head the offending shard must be manually removed after verifying there are enough shards to recover (0, 8388608, [32(2),0, 39(5),0]) osd.32 -411> 2016-10-06 13:21:15.166968 7fe45b6cb700 -1 osd.32 pg_epoch: 129181 pg[5.3as2( v 84797'30080 (67219'27080,84797'30080] local-les=129171 n=13146 ec=61149 les/c 129171/127358 129170/129170/129170) [2147483647,2147483647,4,0,62,32]/[2147483647,2147483647,32,0,62,39] r=2 lpr=129170 pi=121260-129169/43 rops=5 bft=4(2),32(5) crt=0'0 lcod 0'0 mlcod 0'0 active+undersized+degraded+remapped+backfilling] handle_recovery_read_complete: inconsistent shard sizes 5/abc6d43a/rbd_data.33640a238e1f29.0003b165/head the offending shard must be manually removed after verifying there are enough shards to recover (0, 8388608, [32(2),0, 39(5),0]) osd.109 -1> 2016-10-06 13:17:36.748340 7fa53d36c700 -1 osd.109 pg_epoch: 129167 pg[5.3as1( v 84797'30080 (66310'24592,84797'30080] local-les=129163 n=13146 ec=61149 les/c 129163/127358 129162/129162/129162) [2147483647,109,4,0,62,32]/[2147483647,109,32,0,62,39] r=1 lpr=129162 pi=112552-129161/59 rops=5 bft=4(2),32(5) crt=84797'30076 lcod 0'0 mlcod 0'0 active+undersized+degraded+remapped+backfilling] handle_recovery_read_complete: inconsistent shard sizes 5/abc6d43a/rbd_data.33640a238e1f29.0003b165/head the offending shard must be manually removed after verifying there are enough shards to recover (0, 8388608, [32(2),0, 39(5),0]) ofcourse having 3 osd's dying regularly is not good for my health. so i have set noout, to avoid heavy recoveries. googeling this error messages gives exactly 1 hit: https://github.com/ceph/ceph/pull/6946 where it saies: "the shard must be removed so it can be reconstructed" but with my 3 osd's failing, i am not certain witch of them contain the broken shard. (or perhaps all 3 of them?) a bit reluctant to delete on all 3. I have 4+2 erasure coding. ( erasure size 6 min_size 4 ) so finding out witch one is bad would be nice. hope someone have an idea how to progress. kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Give up on backfill, remove slow OSD
On 22. sep. 2016 09:16, Iain Buclaw wrote: Hi, I currently have an OSD that has been backfilling data off it for a little over two days now, and it's gone from approximately 68 PGs to 63. As data is still being read from, and written to it by clients whilst I'm trying to get it out of the cluster, this is not helping it at all. I figured that it's probably best just to cut my losses and just force it out entirely so that all new writes and reads to those PGs get redirected elsewhere to a functional disk, and the rest of the recovery can proceed without being blocked heavily by this one disk. Granted that objects and files have a 1:1 relationship, I can just rsync the data to a new server and write it back into ceph afterwards. Now, I know that as soon as I bring down this OSD, the entire cluster will stop operating. So what's the most swift method of telling the cluster to forget about this disk and everything that may be stored on it. Thanks It should normally not get new writes to it if you want to remove it from the cluster. I assume you did something wrong here. How did you define the osd out of the cluster ? generally my procedure for a working osd is something like 1. ceph osd crush reweight osd.X 0 2. ceph osd tree - check that the osd in question actually has 0 weight (the first number after ID) and that the host weight has been reduced accordingly. 3. ls /var/lib/ceph/osd/ceph-X/current ; periodically, wait for the osd to drain. there should be no PG directories (n.xxx_head or n.xxx_TEMP) left. this will take a while depending on the size of the osd. in reality i just wait until the disk usage graph settles, then double-check with ls. 4. once empty, I mark the osd out, stop the process, and remove the osd from the cluster as written in the documentation: - ceph auth del osd.X - ceph osd crush remove osd.X - ceph osd rm osd.X PS: if your cluster stops operating when an osd goes down, you have something else fundamentally wrong. you should look into this as well, as a separate case. kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Is it possible to recover the data of which all replicas are lost?
On 27. sep. 2016 13:29, xxhdx1985126 wrote: Hi, everyone. I've got a problem here. Due to some mis-operations, I deleted all three replicas of my data, is there any way to recover it? This is a very urgent problem. Please help me, Thanks. you do not give any details on how you deleted the data, so i am assuming a lot. But if you pulled 3 disks at the same time, and the disks are working, you can connect and mount the disks, and use the ceph-objectstore-tool to export a pg to a datafile, and then run the tool again to import it to a fresh empty osd. this older writeup gives an overview of the process. keep in mind the tool has changed name and is now part of the default install http://ceph.com/community/incomplete-pgs-oh-my/ if you actually deleted the pg's off the disks, or the disks are dead, then you need to stop writing to those osd's and use some kind of file recovery tool or service, and then as step 2 use the tool above to get the objects back onto the cluster. i would start by marking the 3 osd's out, so no more writes take place, and stop them as soon as possible (you do not want to make the problem worse). then try some file recovery tools, or send the disks to someone like ibas https://www.krollontrack.com/ if they are dead. keep in mind you need the xattr information as well in order to get the functioning objects back. once you have the file structure in place you use the ceph-objectstore-tool to export/import to a working osd. good luck Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] problem starting osd ; PGLog.cc: 984: FAILED assert hammer 0.94.9
added debug journal = 20 and got some new lines in the log. that i added to the end of this email. any of you can make something out of them ? kind regards Ronny Aasen On 18.09.2016 18:59, Kostis Fardelas wrote: If you are aware of the problematic PGs and they are exportable, then ceph-objectstore-tool is a viable solution. If not, then running gdb and/or higher debug osd level logs may prove useful (to understand more about the problem or collect info to ask for more in ceph-devel). On 13 September 2016 at 17:26, Henrik Korkuc <li...@kirneh.eu> wrote: On 16-09-13 11:13, Ronny Aasen wrote: I suspect this must be a difficult question since there have been no replies on irc or mailinglist. assuming it's impossible to get these osd's running again. Is there a way to recover objects from the disks. ? they are mounted and data is readable. I have pg's down since they want to probe these osd's that do not want to start. pg query claim it can continue if i mark the osd as lost. but i would prefer to not loose data. especially since the data is ok and readable on the nonfunctioning osd. also let me know if there is other debug i can extract in order to troubleshoot the non starting osd's kind regards Ronny Aasen I cannot help you with this, but you can try using http://ceph.com/community/incomplete-pgs-oh-my/ and http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-April/000238.html (found this mail thread googling for the objectool post). ymmv On 12. sep. 2016 13:16, Ronny Aasen wrote: after adding more osd's and having a big backfill running 2 of my osd's keep on stopping. We also recently upgraded from 0.94.7 to 0.94.9 but i do not know if that is related. the log say. [snip old error log. ] -17> 2016-09-18 22:52:06.405881 7f878791b880 10 filestore(/var/lib/ceph/osd/ceph-106) getattr 1.3b6_head /1/578c53b6/rb.0.392c.238e1f29.000513d5/head '_' = 266 -16> 2016-09-18 22:52:06.405915 7f878791b880 15 filestore(/var/lib/ceph/osd/ceph-106) getattr 1.3b6_head /1/578c53b6/rb.0.392c.238e1f29.000513d5/21 '_' -15> 2016-09-18 22:52:06.406049 7f878791b880 10 filestore(/var/lib/ceph/osd/ceph-106) getattr 1.3b6_head /1/578c53b6/rb.0.392c.238e1f29.000513d5/21 '_' = 251 -14> 2016-09-18 22:52:06.406079 7f878791b880 15 filestore(/var/lib/ceph/osd/ceph-106) getattr 1.3b6_head /1/4ecf13b6/rb.0.392c.238e1f29.0037c4cb/21 '_' -13> 2016-09-18 22:52:06.406166 7f878791b880 10 filestore(/var/lib/ceph/osd/ceph-106) error opening file /var/lib/ceph/osd/ceph-106/current/1.3b6_head/DIR_6/DIR_B/DIR_3/DIR_1/DIR_F/rb.0.392c.238e1f29.0037c4c b__21_4ECF13B6__1 with flags=2: (2) No such file or directory -12> 2016-09-18 22:52:06.406187 7f878791b880 10 filestore(/var/lib/ceph/osd/ceph-106) getattr 1.3b6_head /1/4ecf13b6/rb.0.392c.238e1f29.0037c4cb/21 '_' = -2 -11> 2016-09-18 22:52:06.406190 7f878791b880 15 read_log missing 104661'46956,1/4ecf13b6/rb.0.392c.238e 1f29.0037c4cb/21 -10> 2016-09-18 22:52:06.406195 7f878791b880 15 filestore(/var/lib/ceph/osd/ceph-106) getattr 1.3b6_head /1/e85f13b6/rb.0.392c.238e1f29.00b5bb3b/head '_' -9> 2016-09-18 22:52:06.406279 7f878791b880 10 filestore(/var/lib/ceph/osd/ceph-106) error opening file /var/lib/ceph/osd/ceph-106/current/1.3b6_head/DIR_6/DIR_B/DIR_3/DIR_1/DIR_F/rb.0.392c.238e1f29.00b5bb3 b__head_E85F13B6__1 with flags=2: (2) No such file or directory -8> 2016-09-18 22:52:06.406293 7f878791b880 10 filestore(/var/lib/ceph/osd/ceph-106) getattr 1.3b6_head /1/e85f13b6/rb.0.392c.238e1f29.00b5bb3b/head '_' = -2 -7> 2016-09-18 22:52:06.406297 7f878791b880 15 read_log missing 
104661'46955,1/e85f13b6/rb.0.392c.238e 1f29.00b5bb3b/head -6> 2016-09-18 22:52:06.406311 7f878791b880 15 filestore(/var/lib/ceph/osd/ceph-106) getattr 1.3b6_head /1/e85f13b6/rb.0.392c.238e1f29.00b5bb3b/21 '_' -5> 2016-09-18 22:52:06.406363 7f878791b880 10 filestore(/var/lib/ceph/osd/ceph-106) error opening file /var/lib/ceph/osd/ceph-106/current/1.3b6_head/DIR_6/DIR_B/DIR_3/DIR_1/DIR_F/rb.0.392c.238e1f29.00b5bb3 b__21_E85F13B6__1 with flags=2: (2) No such file or directory -4> 2016-09-18 22:52:06.406369 7f878791b880 10 filestore(/var/lib/ceph/osd/ceph-106) getattr 1.3b6_head /1/e85f13b6/rb.0.392c.238e1f29.00b5bb3b/21 '_' = -2 -3> 2016-09-18 22:52:06.406372 7f878791b880 15 read_log missing 91332'39092,1/e85f13b6/rb.0.392c.238e1 f29.00b5bb3b/21 -2> 2016-09-18 22:52:06.406375 7f878791b880 15 filestore(/var/lib/ceph/osd/ceph-106) getattr 1.3b6_head /1/d9c303b6/rb.0.392c.238e1f29.4943/head '_' -1> 2016-09-18 22:52:06.426875 7f878791b880 10 filestore(/var/lib/ceph/osd/ceph-106) getattr 1.3b6_head /1/d9c303b6/rb.0.392c.238e1f29.4943/head '_' = 266 0> 2016-09-18 22:52:06.455911 7f878791b880 -1 osd/PGLog.cc: In function 'static void PGLog::read_log(O