Re: [ceph-users] Looking for the best way to utilize 1TB NVMe added to the host with 8x3TB HDD OSDs
Ashley Merrick wrote: Correct, in a large cluster no problem. I was talking about Wladimir's setup, where they are running a single node with a failure domain of OSD. That would mean a loss of all OSDs and all data.

Sure, I am aware that running with 1 NVMe is risky, so we have a plan to add a mirroring NVMe to it in the future. I hope this can be solved with a simple mdadm+lvm scheme. Btw, are there any recommendations on the cheapest Ceph node hardware? Now I understand that 8x3TB HDDs in a single host is quite a centralized setup, and I have a feeling that a good Ceph cluster should have more hosts than OSDs in each host. Like, with 8 OSDs per host, at least 8 hosts. Or at least 3 hosts with 3 OSDs each. Right? Then it would be reasonable to add a single NVMe per host, to allow any component of the host to fail within failure domain=host. I am still thinking within the cheapest concept of multiple HDDs + a single NVMe per host.

On Sun, 22 Sep 2019 03:42:52 +0800 solarflow99 <solarflo...@gmail.com> wrote: Now, my understanding is that an NVMe drive is recommended to help speed up BlueStore. If it were to fail, then those OSDs would be lost, but assuming there is 3x replication and enough OSDs, I don't see the problem here. There are other scenarios where a whole server might be lost; it doesn't mean the total loss of the cluster.

On Sat, Sep 21, 2019 at 5:27 AM Ashley Merrick <singap...@amerrick.co.uk> wrote: Placing it as a Journal / BlueStore DB/WAL will help with writes mostly; by the sounds of it you want to increase read performance? How important is the data on this Ceph cluster? If you place it as a Journal / DB/WAL, any failure of it will cause total data loss, so I would very much advise against this unless it is purely for testing and total data loss is not an issue.
In that case it is worth upgrading to BlueStore by rebuilding each OSD, placing the DB/WAL on an SSD partition. You can do this one OSD at a time, but there is no migration path, so you would need to wait for data rebuilding after each OSD change before moving on to the next. If you need to make sure your data is safe, then you are really limited to using it as a read-only cache, but I think even then most setups would cause all OSDs to go offline until you manually removed the failed disk from the read-only cache. bcache/dm-cache may handle this automatically, but it is still a risk that I personally wouldn't want to take. Also, it really depends on your use of Ceph and the expected I/O activity as to what the best option may be.

On Fri, 20 Sep 2019 14:56:12 +0800 Wladimir Mutel <m...@mwg.dp.ua> wrote: Dear everyone, Last year I set up an experimental Ceph cluster (still single node, failure domain = osd, MB Asus P10S-M WS, CPU Xeon E3-1235L, RAM 64 GB, HDDs WD30EFRX, Ubuntu 18.04, now with kernel 5.3.0 from the Ubuntu mainline PPA and Ceph 14.2.4 from download.ceph.com/debian-nautilus/dists/bionic ). I set up a jerasure 2+1 pool, created some RBDs using that as data pool, and exported them over iSCSI (using tcmu-runner, gwcli and associated packages). But with an HDD-only setup their performance was less than stellar, not saturating even 1Gbit Ethernet on RBD reads. This year my experiment was funded with a Gigabyte PCIe NVMe 1TB SSD (GP-ASACNE2100TTTDR). Now it is plugged into the MB and is visible as a storage device to lsblk. I can also see its 4 interrupt queues in /proc/interrupts, and its transfer rate measured by hdparm -t is about 2.3 GB/sec. Now I want to ask your advice on how to best include it into this already existing setup. Should I allocate it for OSD journals and databases? Is there a way to reconfigure an existing OSD this way without destroying and recreating it?
Or are there plans to ease this kind of migration? Can I add it as a write-absorbing cache to individual RBD images? To individual block devices at the level of bcache/dm-cache? What about speeding up RBD reads? I would appreciate your opinions and recommendations. (Just want to warn you that in this situation I don't have the financial option of going full-SSD.)
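On the question of reconfiguring an existing OSD in place: since Nautilus, ceph-bluestore-tool can attach a dedicated DB device to a BlueStore OSD without rebuilding it. A minimal sketch, assuming a hypothetical OSD id 0 and NVMe partition /dev/nvme0n1p1 (these names are not from the thread, and the commands need a live cluster, so try on a scratch OSD first):

```shell
# Stop the OSD before touching its BlueStore devices (hypothetical id 0).
systemctl stop ceph-osd@0

# Attach a new dedicated DB device to the existing OSD...
ceph-bluestore-tool bluefs-bdev-new-db \
    --path /var/lib/ceph/osd/ceph-0 --dev-target /dev/nvme0n1p1

# ...and move the RocksDB data that already lives on the slow device over.
ceph-bluestore-tool bluefs-bdev-migrate \
    --path /var/lib/ceph/osd/ceph-0 \
    --devs-source /var/lib/ceph/osd/ceph-0/block \
    --dev-target /var/lib/ceph/osd/ceph-0/block.db

systemctl start ceph-osd@0
```

Repeated per OSD, this converts the host one daemon at a time; size each DB partition generously (a common rule of thumb is a few percent of the data device).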
[ceph-users] Looking for the best way to utilize 1TB NVMe added to the host with 8x3TB HDD OSDs
Dear everyone, Last year I set up an experimental Ceph cluster (still single node, failure domain = osd, MB Asus P10S-M WS, CPU Xeon E3-1235L, RAM 64 GB, HDDs WD30EFRX, Ubuntu 18.04, now with kernel 5.3.0 from the Ubuntu mainline PPA and Ceph 14.2.4 from download.ceph.com/debian-nautilus/dists/bionic ). I set up a jerasure 2+1 pool, created some RBDs using that as data pool, and exported them over iSCSI (using tcmu-runner, gwcli and associated packages). But with an HDD-only setup their performance was less than stellar, not saturating even 1Gbit Ethernet on RBD reads.

This year my experiment was funded with a Gigabyte PCIe NVMe 1TB SSD (GP-ASACNE2100TTTDR). Now it is plugged into the MB and is visible as a storage device to lsblk. I can also see its 4 interrupt queues in /proc/interrupts, and its transfer rate measured by hdparm -t is about 2.3 GB/sec. Now I want to ask your advice on how to best include it into this already existing setup. Should I allocate it for OSD journals and databases? Is there a way to reconfigure an existing OSD this way without destroying and recreating it? Or are there plans to ease this kind of migration? Can I add it as a write-absorbing cache to individual RBD images? To individual block devices at the level of bcache/dm-cache? What about speeding up RBD reads? I would appreciate your opinions and recommendations. (Just want to warn you that in this situation I don't have the financial option of going full-SSD.) Thank you all in advance for your response. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Mimic - EC and crush rules - clarification
David Turner wrote: Yes, when creating an EC profile, it automatically creates a CRUSH rule specific for that EC profile. You are also correct that 2+1 doesn't really have any resiliency built in. 2+2 would allow 1 node to go down while still having your data accessible. It will use 2x data to raw, as opposed to the 1.5x of 2+1, but it gives you resiliency.

Is not EC 2+2 the same as 2x replication (i.e. RAID1)? Is not the benefit and intention of EC to allow equivalent replication factors to be chosen between >1 and <2? That's why it is recommended to have m≥2 and k>m. Overall, your reliability in Ceph is measured as cluster rebuild/performance degradation time in case of up to m OSD failures, provided that no more than m OSDs (or larger failure domains) have failed at once. Sure, EC is beneficial only when you have enough failure domains (i.e. hosts). My criterion is that you should have more hosts than you have individual OSDs within a single host, i.e. at least 8 (and better >8) hosts when you have 8 OSDs per host.

The example in your command of 3+2 is not possible with your setup. May I ask why you want EC on such a small OSD count? I'm guessing to not use as much storage on your SSDs, but I would just suggest going with replica on such a small cluster. With a larger node/OSD count you can start seeing if EC is right for your use case, but if this is production data... I wouldn't risk it. When setting the crush rule, it wants the name of it, ssdrule, not 2.
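As a quick sanity check on the overhead numbers above: the raw-to-usable multiplier of an EC k+m profile is (k+m)/k. A small shell sketch (my own arithmetic, not from the thread):

```shell
# Raw-space multiplier of an EC k+m profile, in percent (integer shell math).
overhead_pct() { echo $(( ($1 + $2) * 100 / $1 )); }

overhead_pct 2 1   # prints 150: the 1.5x of 2+1
overhead_pct 2 2   # prints 200: 2+2 uses as much raw space as 2x replication
overhead_pct 4 2   # prints 150: 1.5x again, yet survives two failures
```

This is why k>m profiles such as 4+2 recover the between-1x-and-2x territory that 2+2 gives up; they just need more failure domains.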
[ceph-users] Upgrade Ceph 13.2.0 -> 13.2.1 and Windows iSCSI clients breakup
Dear all, I want to share some experience of upgrading my experimental 1-host Ceph cluster from v13.2.0 to v13.2.1. First, I fetched the new packages and installed them using 'apt dist-upgrade', which went smoothly as usual. Then I noticed from 'lsof' that the Ceph daemons were not restarted after the upgrade ('ceph osd versions' still showed 13.2.0). Using the instructions for the Luminous->Mimic upgrade, I decided to restart ceph-{mon,mgr,osd}.targets. And surely, on restarting ceph-osd.target, iSCSI sessions were broken on the tcmu-runner side ('Timing out cmd', 'Handler connection lost'), and the Windows (2008 R2) clients lost their iSCSI devices.

But that was only the beginning of the surprises that followed. Looking into Windows Disk Management, I noticed that the iSCSI disks were re-detected with a size about 0.12 GB larger, i.e. 2794.52 GB instead of 2794.40 GB, and of course the system lost sight of their GPT labels. I quickly checked 'rbd info' on the Ceph side and did not notice any increase in the RBD images. They were still exactly 715398 4MB-objects, as I intended initially. Restarting the iSCSI initiator service on Windows did not help. Restarting the whole of Windows did not help. Restarting tcmu-runner on the Ceph side did not help. What resolved the problem, to my great surprise, was _removing/re-adding the MPIO feature and re-adding iSCSI multipath support_. After that, Windows detected the iSCSI disks with the proper size again, and restored visibility of the GPT partitions, dynamic disk metadata and all the rest.

Ok, I avoided data loss this time, but I have some remaining questions:

1. Can Ceph minor version upgrades be made less disruptive and traumatic? Like, some kind of staged/rolling OSD daemon restart within a single upgraded host, without losing librbd sessions?

2. Is Windows (2008 R2) MPIO support really that screwed & crippled? Were there any improvements in Win2012/2016?
I have physical servers with Windows 2008 R2, and I would like to mirror their volumes to Ceph iSCSI targets, then convert them into QEMU/KVM virtual machines where the same data will be accessed with librbd. During my initial experiments, I found that reinstalling MPIO & re-enabling iSCSI multipath would fix most problems with Windows iSCSI access, but I would like to have a faster way of resetting the iSCSI+MPIO state when something goes wrong on the Windows side, as in my case.

3. Does anybody have an idea where these 0.12 GB (probably 120 or 128 MB) came from? Thank you in advance for your responses.
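Regarding question 1, the usual pattern for a less disruptive minor upgrade is a per-OSD rolling restart. A sketch assuming OSD ids 0-7 on one host (the commands are standard but need a live cluster; the pacing is up to you - librbd clients such as tcmu-runner normally ride out a single-OSD restart):

```shell
# Prevent rebalancing while daemons bounce one by one.
ceph osd set noout

for id in 0 1 2 3 4 5 6 7; do
    systemctl restart ceph-osd@$id
    # Wait here until `ceph -s` shows all PGs active+clean again
    # before restarting the next OSD.
    sleep 60
done

ceph osd unset noout
```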
Re: [ceph-users] RBD image repurpose between iSCSI and QEMU VM, how to do properly ?
Jason Dillaman wrote:

> > I am doing more experiments with the Ceph iSCSI gateway and I am a bit confused about how to properly repurpose an RBD image from an iSCSI target into a QEMU virtual disk and back
> This isn't really a use case that we support nor intend to support. Your best bet would be to use an initiator on your Linux host to connect to the same LUN as is being exported over iSCSI (just make sure the NTFS file system is quiesced / frozen).

After some more tries I found a way which is convenient enough for me: for every step which requires changing an RBD's role from qemu/librbd to iscsi or back, I create an RBD snap and a clone of this snap under some new name, and assign the clone to qemu/librbd or to gwcli/iscsi accordingly. Then I can easily drop the original RBDs as they become unneeded. The new Mimic functionality which frees me from mandatory snap protection is of great help.
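The snap+clone handoff described above can be sketched as follows (pool and image names are hypothetical; clone v2, which drops the snap-protect requirement, needs `ceph osd set-require-min-compat-client mimic`):

```shell
# Freeze the current state and hand a clone to the other consumer.
rbd snap create libvirt/ntfs-data@handoff
rbd clone libvirt/ntfs-data@handoff libvirt/ntfs-data-iscsi  # no snap protect needed on Mimic

# Once the clone is the only user, detach it from its parent
# and drop the original image.
rbd flatten libvirt/ntfs-data-iscsi
rbd snap rm libvirt/ntfs-data@handoff
rbd rm libvirt/ntfs-data
```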
[ceph-users] chkdsk /b fails on Ceph iSCSI volume
Hi, I cloned an NTFS filesystem with bad blocks from a USB HDD onto a Ceph RBD volume (using ntfsclone, so the copy has sparse regions), and decided to clean the bad blocks within the copy. I ran chkdsk /b from Windows and it fails on free space verification (step 5 of 5). In tcmu-runner.log I see that command 8f (SCSI Verify) is not supported. Does it mean that I should not try to run chkdsk /b on this volume at all? (it seems that the bad blocks were re-verified and cleared) Are there any plans to make the user:rbd backstore support verify requests? Thanks in advance for your replies.
[ceph-users] RBD image repurpose between iSCSI and QEMU VM, how to do properly ?
Dear all, I am doing more experiments with the Ceph iSCSI gateway and I am a bit confused about how to properly repurpose an RBD image from an iSCSI target into a QEMU virtual disk and back.

First, I create an RBD image and set it up as an iSCSI backstore in gwcli, specifying its size exactly to avoid unwanted resizes. Next, I connect Windows 2008 R2 to this image (enable MPIO before connect and select MPIO policy 'Failover only' for the accessed device). Then in Windows Disk Management I initialize the physical disk with GPT, convert it into a Dynamic disk and create a simple NTFS volume in its free space. Then in the same console I put the disk 'offline', and in the iSCSI control panel I disconnect the session from the Windows side. Then I attach the same RBD image to a QEMU/KVM virtual machine with Ubuntu 18.04 as a virtio/librbd storage drive.

Then I boot the Ubuntu 18.04 VM, find the NTFS filesystem using 'ldmtool create all', and during ntfsclone from an external disk I discover that the RBD image is mapped read-only. Ok, I stop the Ubuntu VM, do 'rbd lock rm' for this image (the lock is held by tcmu-runner, I suppose), restart Ubuntu, restart ntfsclone, and this time it goes well. Btw, ntfsclone onto the device-mapper target created by ldmtool goes about 2x faster than directly onto the Virtio Disk (vdN), so it transferred my 1600+GB in just 13+ hours instead of 27+.

Ok, the external NTFS is cloned seemingly well, I shut down the Ubuntu VM (it properly removed the RBD lock on shutdown) and try to access it from Windows by iSCSI again. And at this moment I stumble into trouble. First, I don't see the added RBD image in 'Devices' on the iSCSI initiator control panel. This I tried to resolve by restarting tcmu-runner. After reconnecting from the Windows side, the RBD image became visible in devices (and the RBD lock was reacquired from the tcmu side), but its MPIO button was disabled, so I could not check or change the MPIO policy (surely I enable MPIO in the 'Connect' dialog). I also tried to restart rbd-target-gw, but this did not help either.
Restarting the Windows server also did not improve the situation (MPIO button still disabled). What should I try to restart next, to avoid restarting the whole Ceph host? Maybe unload/reload some kernel modules? Thanks in advance for your help. I hope I could determine and resolve the problem myself, but that could take more time than getting help from you.
Re: [ceph-users] RBD gets resized when used as iSCSI target
Wladimir Mutel wrote:

> it back to gwcli/disks), I discover that its size is rounded up to 3 TiB, i.e. 3072 GiB or 786432*4M Ceph objects. As we know, GPT is stored
> 'targetcli ls /' (there, it is still 3.0T). Also, when I restart rbd-target-gw.service, it gets resized back up to 3.0T as shown by

Well, I see this size in /etc/target/saveconfig.json. And I see how the RBD is extended in /var/log/tcmu-runner.log. And I remember that once I lazily added a 2.7T RBD specifying its size as 3T in gwcli. Now trying to fix that without deleting/recreating the RBD...
[ceph-users] RBD gets resized when used as iSCSI target
Dear all, I create an RBD to be used as an iSCSI target, with size close to the most popular 3TB HDD size, 5860533168 512-byte sectors, or 715398*4M Ceph objects (2.7 TiB, or 2794.4 GiB). Then I add it into gwcli/disks (having to specify the same size, 2861592M), and then, after some manipulations which I do not remember exactly (like, remove it from the gwcli conf, use it for some time as an RBD target in a QEMU VM, then re-add it back to gwcli/disks), I discover that its size is rounded up to 3 TiB, i.e. 3072 GiB or 786432*4M Ceph objects. As we know, the backup GPT header is stored at the end of the block device, so when we increase its size this way, that GPT copy becomes inaccessible and partition limits need to be guessed anew in some other way.

I can shrink this gratuitously-increased RBD with 'rbd resize', and this is reflected in 'gwcli ls /' (3.0T becomes 2.7T). But not in 'targetcli ls /' (there, it is still 3.0T). Also, when I restart rbd-target-gw.service, it gets resized back up to 3.0T as shown by 'gwcli ls /', and to 786432 objects in 'rbd info'. I look into the rbd/gateway.conf RADOS object, and don't see any explicit size specified there. Where does it take this 3.0T size from? My last suspicion is the RBD name, which is 'tower-prime-e-3tb'. Can its '3tb' suffix be the culprit? Thank you in advance for your replies. I am getting lost and slowly infuriated with this behavior.
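For reference, the arithmetic behind the two sizes above (my own recomputation; it shows how a size re-parsed from a coarse "3T" string silently grows the backstore):

```shell
sectors=5860533168                  # 512-byte sectors of the 3TB HDD
obj=$(( 4 * 1024 * 1024 ))          # 4 MiB RBD object size

# Round the HDD size up to whole objects: the intended image size.
objects=$(( (sectors * 512 + obj - 1) / obj ))
echo $objects                       # prints 715398 (the 2.7T image)

# What a literal "3T" (3 TiB) turns into.
tib3_objects=$(( 3 * 1024 * 1024 * 1024 * 1024 / obj ))
echo $tib3_objects                  # prints 786432 (the 3.0T image)
```

So anything that re-creates the backstore from a "3.0T" display string, rather than from the exact object count, grows the image by 71034 objects.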
[ceph-users] HDD-only performance, how far can it be sped up ?
Dear all, I set up a minimal 1-node Ceph cluster to evaluate its performance. We tried to save as much as possible on the hardware, so now the box has an Asus P10S-M WS motherboard, Xeon E3-1235L v5 CPU, 64 GB DDR4 ECC RAM and 8x3TB HDDs (WD30EFRX) connected to the on-board SATA ports. We are also trying to save on storage redundancy, so for most of our RBD images we use an erasure-coded data-pool (default profile, jerasure 2+1) instead of 3x replication. I started with a Luminous/Xenial 12.2.5 setup which initialized my OSDs as Bluestore during deploy, then updated it to Mimic/Bionic 13.2.0. The base OS is Ubuntu 18.04 with the kernel updated to 4.17.2 from the Ubuntu mainline PPA.

With this setup, I created a number of RBD images to test iSCSI, rbd-nbd and QEMU+librbd performance (running the QEMU VMs on the same box). And that worked moderately well, as long as the data volume transferred within one session was limited. The fastest transfers I had were with 'rbd import', which pulled an ISO image file at up to 25 MBytes/sec from a remote CIFS share over Gigabit Ethernet and stored it into the EC data-pool. Windows 2008 R2 & 2016 setup, update installation, and Win 2008 upgrade to 2012 and to 2016 within a QEMU VM also went through tolerably well. I found that cache=writeback gives the best performance with librbd, unlike cache=unsafe which gave the best performance with VMs on plain local SATA drives. Also I have a subjective feeling (not confirmed by exact measurements) that providing a huge librbd cache (like, cache size = 1GB, max dirty = 7/8GB, max dirty age = 60) improved Windows VM performance on bursty writes (like, during Windows update installations) as well as on reboots (due to cached reads).

Now, what discouraged me was my next attempt: cloning an NTFS partition of ~2TB from a physical drive (via a USB3-SATA3 convertor) to a partition on an RBD image.
I tried to map the RBD image with rbd-nbd either locally or remotely over Gigabit Ethernet, and the fastest speed I got with ntfsclone was about 8 MBytes/sec. Which means it could spend up to 3 days copying these ~2TB of NTFS data. I thought about running ntfsclone /dev/sdX1 -o - | rbd import ... - , but ntfsclone needs to rewrite a part of the existing RBD image starting from a certain offset, so I decided this was not a solution in my situation. Now I am thinking about taking out one of the OSDs and using it as a 'bcache' for this operation, but I am not sure how good bcache performance is with its cache on a rotating HDD. I know that keeping OSD journals and RocksDB on the same HDD creates a seeky workload which hurts overall transfer performance.

Also I am thinking about a number of near-term possibilities, and I would like to hear your opinions on the benefits and drawbacks of each of them.

1. Would iSCSI access to that RBD image improve my performance (compared to rbd-nbd)? I have not checked that yet, but I noticed that Windows transferred about 2.5 MBytes/sec while formatting an NTFS volume on this RBD attached to it by iSCSI. So, for seeky/sparse workloads like NTFS formatting, the performance was not great.

2. Would it help to run ntfsclone in a Linux VM, with the RBD image accessed through QEMU+librbd? (also going to measure that myself)

3. Are there any performance benefits in using Ceph cache-tier pools with my setup? I hear use of this technique is now advised against, no?

4. We have an unused older box (Supermicro X8SIL-F mobo, Xeon X3430 CPU, 32 GB of DDR3 ECC RAM, 6 onboard SATA ports, used from 2010 to 2017, in perfectly working condition) which can be stuffed with up to 6 SATA HDDs and added to this Ceph cluster, so far with only Gigabit network interconnect. Like, move 4 OSDs out of the first box into it, to have 2 boxes with 4 HDDs each. Is this going to improve Ceph performance with the setup described above?

5.
I hear that RAID controllers like the Adaptec 5805 or LSI 2108 provide better performance with SATA HDDs exported as JBODs than onboard SATA AHCI controllers, due to more aggressive caching and request reordering. Is this true?

6. On the local market we can buy a Kingston KC1000/960GB NVMe drive for a moderately reasonable price. Its specification has a rewrite limit of 1 PB and 0.58 DWPD (drive writes per day). Are there any contraindications against using it in a production Ceph setup (i.e., is the rewrite limit too low, should we look for 8+PB)? What is the difference between using it as a 'bcache' or as a specifically-designed OSD journal+RocksDB storage? Can it be used as a single shared partition for all OSD daemons, or will it require splitting into 8 separate partitions? Thank you in advance for your replies.
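On the last question: each OSD needs its own DB/WAL volume; one shared partition cannot serve all eight daemons at once. A sketch of splitting such a drive with LVM (the device name /dev/nvme0n1, the data disk /dev/sda and the 110G sizing are illustrative assumptions, not from this thread, and the commands need a live cluster):

```shell
# One volume group on the NVMe, one logical volume per OSD.
vgcreate ceph-db /dev/nvme0n1
for i in 0 1 2 3 4 5 6 7; do
    lvcreate -L 110G -n db-$i ceph-db
done

# When (re)creating an OSD, point its RocksDB at the matching LV.
ceph-volume lvm create --bluestore --data /dev/sda --block.db ceph-db/db-0
```

LVM volumes are easier to resize later than raw partitions, which matters if the DB turns out undersized and spills over onto the HDD.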
Re: [ceph-users] iSCSI to a Ceph node with 2 network adapters - how to ?
Jason Dillaman wrote:

> > [1] http://docs.ceph.com/docs/master/rbd/iscsi-initiator-win/ I don't use either MPIO or MCS on the Windows 2008 R2 or Windows 10 initiator (not Win2016, but I hope there is not much difference). I try to make it work with a single session first. Also, right now I only have a single iSCSI gateway/portal (single host, single IP, single port). Or is MPIO mandatory to use with a Ceph target ?
> It's mandatory even if you only have a single path, since MPIO is responsible for activating the paths.

Who would have known? I installed MPIO, enabled it for iSCSI (required a Windows reboot), set the MPIO policy to 'Failover only', and now my iSCSI target is readable! Thanks a lot for your help! Probably this should be written in bigger and redder letters in the Ceph docs. Next question: would it be possible for the iPXE loader to boot from such iSCSI volumes? I am going to experiment with that, but if the answer is known in advance, it would be good to know.
Re: [ceph-users] iSCSI to a Ceph node with 2 network adapters - how to ?
Jason Dillaman wrote:

> чер 13 08:38:14 p10s tcmu-runner[54121]: 2018-06-13 08:38:14.726 54121 [DEBUG] dbus_name_acquired:461: name org.kernel.TCMUService1 acquired
> чер 13 08:38:30 p10s tcmu-runner[54121]: 2018-06-13 08:38:30.521 54121 [DEBUG_SCSI_CMD] tcmu_print_cdb_info:1205 rbd/libvirt.tower-prime-e-3tb: 1a 0 3f 0 c0 0
> чер 13 08:38:30 p10s tcmu-runner[54121]: 2018-06-13 08:38:30.523 54121 [DEBUG_SCSI_CMD] tcmu_print_cdb_info:1205 rbd/libvirt.tower-prime-e-3tb: 1a 0 3f 0 c0 0
> чер 13 08:38:30 p10s tcmu-runner[54121]: 2018-06-13 08:38:30.543 54121 [DEBUG_SCSI_CMD] tcmu_print_cdb_info:1205 rbd/libvirt.tower-prime-e-3tb: 9e 10 0 0 0 0 0 0 0 0 0 0 0 c 0 0
> чер 13 08:38:30 p10s tcmu-runner[54121]: 2018-06-13 08:38:30.550 54121 [DEBUG_SCSI_CMD] tcmu_print_cdb_info:1205 rbd/libvirt.tower-prime-e-3tb: 9e 10 0 0 0 0 0 0 0 0 0 0 0 c 0 0
> чер 13 08:38:47 p10s tcmu-runner[54121]: 2018-06-13 08:38:47.944 54121 [DEBUG_SCSI_CMD] tcmu_print_cdb_info:1205 rbd/libvirt.tower-prime-e-3tb: 9e 10 0 0 0 0 0 0 0 0 0 0 0 c 0 0
>
> Wikipedia says that 1A is 'mode sense' and 9e is 'service action in'. These records are logged when I try to put the disk online or initialize it with a GPT/MBR partition table in Windows Disk Management (and Windows reports errors after that). What to check next? Any importance of the missing 'SSD' device class?

Did you configure MPIO within Windows [1]? Any errors recorded in the Windows Event Viewer? The "SSD" device class isn't important -- it's just a way to describe the LUN as being backed by non-rotational media (e.g. VMware will show a different icon). [1] http://docs.ceph.com/docs/master/rbd/iscsi-initiator-win/

I don't use either MPIO or MCS on the Windows 2008 R2 or Windows 10 initiator (not Win2016, but I hope there is not much difference). I try to make it work with a single session first. Also, right now I only have a single iSCSI gateway/portal (single host, single IP, single port). Or is MPIO mandatory to use with a Ceph target ?
Re: [ceph-users] iSCSI to a Ceph node with 2 network adapters - how to ?
On Tue, Jun 12, 2018 at 10:39:59AM -0400, Jason Dillaman wrote:
> > So, my usual question is - where to look and what logs to enable
> > to find out what is going wrong ?
> If not overridden, tcmu-runner will default to 'client.admin' [1] so
> you shouldn't need to add any additional caps. In the short-term to
> debug your issue, you can perhaps increase the log level for
> tcmu-runner to see if it's showing an error [2].

So, I put 'log_level = 5' into /etc/tcmu/tcmu.conf , restarted tcmu-runner, and see only this in its logs:

чер 13 08:38:14 p10s systemd[1]: Starting LIO Userspace-passthrough daemon...
чер 13 08:38:14 p10s tcmu-runner[54121]: Inotify is watching "/etc/tcmu/tcmu.conf", wd: 1, mask: IN_ALL_EVENTS
чер 13 08:38:14 p10s tcmu-runner[54121]: 2018-06-13 08:38:14.634 54121 [DEBUG] load_our_module:531: Module 'target_core_user' is already loaded
чер 13 08:38:14 p10s tcmu-runner[54121]: 2018-06-13 08:38:14.634 54121 [DEBUG] main:1087: handler path: /usr/lib/x86_64-linux-gnu/tcmu-runner
чер 13 08:38:14 p10s tcmu-runner[54121]: 2018-06-13 08:38:14.657 54121 [DEBUG] main:1093: 2 runner handlers found
чер 13 08:38:14 p10s tcmu-runner[54121]: 2018-06-13 08:38:14.658 54121 [DEBUG] tcmu_block_device:404 rbd/libvirt.tower-prime-e-3tb: blocking kernel device
чер 13 08:38:14 p10s tcmu-runner[54121]: 2018-06-13 08:38:14.658 54121 [DEBUG] tcmu_block_device:410 rbd/libvirt.tower-prime-e-3tb: block done
чер 13 08:38:14 p10s tcmu-runner[54121]: 2018-06-13 08:38:14.658 54121 [DEBUG] dev_added:769 rbd/libvirt.tower-prime-e-3tb: Got block_size 512, size in bytes 3000596692992
чер 13 08:38:14 p10s tcmu-runner[54121]: 2018-06-13 08:38:14.658 54121 [DEBUG] tcmu_rbd_open:829 rbd/libvirt.tower-prime-e-3tb: tcmu_rbd_open config rbd/libvirt/tower-prime-e-3tb;osd_op_timeout=30 block size 512 num lbas 5860540416.
чер 13 08:38:14 p10s tcmu-runner[54121]: 2018-06-13 08:38:14.665 54121 [DEBUG] timer_check_and_set_def:383 rbd/libvirt.tower-prime-e-3tb: The cluster's default osd op timeout(0.00), osd heartbeat grace(20) interval(6)
чер 13 08:38:14 p10s tcmu-runner[54121]: 2018-06-13 08:38:14.672 54121 [DEBUG] tcmu_rbd_detect_device_class:300 rbd/libvirt.tower-prime-e-3tb: Pool libvirt using crush rule "replicated_rule"
чер 13 08:38:14 p10s tcmu-runner[54121]: 2018-06-13 08:38:14.672 54121 [DEBUG] tcmu_rbd_detect_device_class:316 rbd/libvirt.tower-prime-e-3tb: SSD not a registered device class.
чер 13 08:38:14 p10s tcmu-runner[54121]: 2018-06-13 08:38:14.715 7f30e08c9880 1 mgrc service_daemon_register tcmu-runner.p10s:libvirt/tower-prime-e-3tb metadata {arch=x86_64,ceph_release=mimic,ceph_version=ceph version 13.2.0 (79a10589f1f80dfe21e8f9794365ed98143071c4) mimic (stable),ceph_version_short=13.2.0,cpu=Intel(R) Xeon(R) CPU E3-1235L v5 @ 2.00GHz,distro=ubuntu,distro_description=Ubuntu 18.04 LTS,distro_version=18.04,hostname=p10s,image_id=25c21238e1f29,image_name=tower-prime-e-3tb,kernel_description=#201806032231 SMP Sun Jun 3 22:33:34 UTC 2018,kernel_version=4.17.0-041700-generic,mem_swap_kb=15622140,mem_total_kb=65827836,os=Linux,pool_name=libvirt}
чер 13 08:38:14 p10s tcmu-runner[54121]: 2018-06-13 08:38:14.721 54121 [DEBUG] tcmu_unblock_device:422 rbd/libvirt.tower-prime-e-3tb: unblocking kernel device
чер 13 08:38:14 p10s tcmu-runner[54121]: 2018-06-13 08:38:14.721 54121 [DEBUG] tcmu_unblock_device:428 rbd/libvirt.tower-prime-e-3tb: unblock done
чер 13 08:38:14 p10s tcmu-runner[54121]: 2018-06-13 08:38:14.724 54121 [DEBUG] dbus_bus_acquired:445: bus org.kernel.TCMUService1 acquired
чер 13 08:38:14 p10s systemd[1]: Started LIO Userspace-passthrough daemon.
чер 13 08:38:14 p10s tcmu-runner[54121]: 2018-06-13 08:38:14.726 54121 [DEBUG] dbus_name_acquired:461: name org.kernel.TCMUService1 acquired
чер 13 08:38:30 p10s tcmu-runner[54121]: 2018-06-13 08:38:30.521 54121 [DEBUG_SCSI_CMD] tcmu_print_cdb_info:1205 rbd/libvirt.tower-prime-e-3tb: 1a 0 3f 0 c0 0
чер 13 08:38:30 p10s tcmu-runner[54121]: 2018-06-13 08:38:30.523 54121 [DEBUG_SCSI_CMD] tcmu_print_cdb_info:1205 rbd/libvirt.tower-prime-e-3tb: 1a 0 3f 0 c0 0
чер 13 08:38:30 p10s tcmu-runner[54121]: 2018-06-13 08:38:30.543 54121 [DEBUG_SCSI_CMD] tcmu_print_cdb_info:1205 rbd/libvirt.tower-prime-e-3tb: 9e 10 0 0 0 0 0 0 0 0 0 0 0 c 0 0
чер 13 08:38:30 p10s tcmu-runner[54121]: 2018-06-13 08:38:30.550 54121 [DEBUG_SCSI_CMD] tcmu_print_cdb_info:1205 rbd/libvirt.tower-prime-e-3tb: 9e 10 0 0 0 0 0 0 0 0 0 0 0 c 0 0
чер 13 08:38:47 p10s tcmu-runner[54121]: 2018-06-13 08:38:47.944 54121 [DEBUG_SCSI_CMD] tcmu_print_cdb_info:1205 rbd/libvirt.tower-prime-e-3tb: 9e 10 0 0 0 0 0 0 0 0 0 0 0 c 0 0

Wikipedia says that 1A is 'mode sense' and 9e is 'service action in'. These records are logged when I try to put the disk online or initialize it with a GPT/MBR partition table in Windows Disk Management (and Windows reports errors after that). What to check next? Any importance of the missing 'SSD' device class?
Re: [ceph-users] iSCSI to a Ceph node with 2 network adapters - how to ?
Hi everyone again, I continue the setup of my testing Ceph cluster (1-node so far). I changed 'chooseleaf' from 'host' to 'osd' in the CRUSH map to make it run healthy on 1 node. For the same purpose, I also set 'minimum_gateways = 1' for the Ceph iSCSI gateway. Also I upgraded the Ubuntu 18.04 kernel to mainline v4.17 to get the up-to-date iSCSI attribute support required by gwcli (qfull_time_out and probably something else). I was able to add client host IQNs and configure their CHAP authentication. I was able to add iSCSI LUNs referring to RBD images, and to assign LUNs to clients. 'gwcli ls /' and 'targetcli ls /' show nice diagrams without signs of errors. iSCSI initiators on Windows 10 and 2008 R2 can log in to the portal with CHAP auth and list their assigned LUNs. And the authenticated sessions are also shown in the '*cli ls' printout.

But: in Windows disk management, the mapped LUN is shown in the 'offline' state. When I try to bring it online or to initialize the disk with an MBR or GPT partition table, I get messages like 'device not ready' on Win10, or 'driver detected controller error on \device\harddisk\dr5' or the like. So, my usual question is - where to look and what logs to enable to find out what is going wrong? My setup's specifics are that I create my RBDs in a non-default pool ('libvirt' instead of 'rbd'). Also I create them with an erasure data-pool (called 'jerasure21', as was configured in the default erasure profile). Should I add explicit access to these pools to some Ceph client I don't know about? I know that 'gwcli' logs into Ceph as 'client.admin', but I am not sure about tcmu-runner and/or the user:rbd backstore provider. Thank you in advance for your useful directions out of my problem.

Wladimir Mutel wrote:
> Failed : disk create/update failed on p10s. LUN allocation failure

Well, this was fixed by updating the kernel to v4.17 from the Ubuntu kernel/mainline PPA. Going on...
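For anyone repeating the single-node setup above: the chooseleaf change does not require hand-editing the CRUSH map; since Luminous, a replicated rule with an 'osd' failure domain can be created directly (the pool name 'rbd' below is just an example, and the commands need a live cluster):

```shell
# Replicated rule that picks leaves of type 'osd' instead of 'host'.
ceph osd crush rule create-replicated replicated-by-osd default osd

# Point an existing pool at the new rule.
ceph osd pool set rbd crush_rule replicated-by-osd
```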
Re: [ceph-users] QEMU maps RBD but can't read them
Jason Dillaman wrote:

> > One more question, how should I set the profile 'rbd-read-only' properly ? I tried to set it for 'client.iso' on both the 'iso' and 'jerasure21' pools, and this did not work. Setting the profile on both pools to 'rbd' worked. But I don't want my iso images to be accidentally modified by virtual guests. Can this be solved with Ceph auth, or in some other way ? (in fact, I am looking for a Ceph equivalent of 'chattr +i')
> QEMU doesn't currently handle the case for opening RBD images in read-only mode, so if you attempt to use 'profile rbd-read-only', I suspect attempting to open the image will fail. You could perhaps take a middle ground and just apply 'profile rbd-read-only pool=jerasure21' to protect the contents of the image.

For QEMU I found that the 'rbd-read-only' profile currently does not work. So, I use 'profile rbd' for both the replicated and erasure pools, and hope that a 'readonly' disk configuration in QEMU would help. In my past experience I found that running 'kvm ... -cdrom something.iso' would sometimes modify that .iso file, so I had to set the immutable attribute at the FS level.
Re: [ceph-users] QEMU maps RBD but can't read them
Jason Dillaman wrote:
>> The caps for those users look correct for Luminous and later clusters. Any chance you are using data pools with the images? It's just odd that you have enough permissions to open the RBD image but cannot read its data objects.
>> Yes, I use an erasure pool as the data pool for these images (to save on replication overhead). Should I add it to the [osd] profile list?
> Indeed, that's the problem, since the libvirt and/or iso user doesn't have access to the data pool.

This really helped, thanks!

client.iso
        key: AQBp...gA==
        caps: [mon] profile rbd
        caps: [osd] profile rbd pool=iso, profile rbd pool=jerasure21
client.libvirt
        key: AQBt...IA==
        caps: [mon] profile rbd
        caps: [osd] profile rbd pool=libvirt, profile rbd pool=jerasure21

Now I can boot the VM from the .iso image and install Windows.

One more question: how should I set the 'rbd-read-only' profile properly? I tried to set it for 'client.iso' on both the 'iso' and 'jerasure21' pools, and this did not work. Setting the profile on both pools to 'rbd' worked. But I don't want my iso images to be accidentally modified by virtual guests. Can this be solved with Ceph auth, or in some other way? (In fact, I am looking for a Ceph equivalent of 'chattr +i'.)
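For reference, a sketch of the standard commands that produce the caps listed above (pool names taken from this thread). Note that 'ceph auth caps' replaces the entire cap set for a client, so each invocation must list all pools that the client needs:

```
ceph auth caps client.iso \
    mon 'profile rbd' \
    osd 'profile rbd pool=iso, profile rbd pool=jerasure21'

ceph auth caps client.libvirt \
    mon 'profile rbd' \
    osd 'profile rbd pool=libvirt, profile rbd pool=jerasure21'
```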
Re: [ceph-users] QEMU maps RBD but can't read them
Jason Dillaman wrote:
> The caps for those users look correct for Luminous and later clusters. Any chance you are using data pools with the images? It's just odd that you have enough permissions to open the RBD image but cannot read its data objects.

Yes, I use an erasure pool as the data pool for these images (to save on replication overhead). Should I add it to the [osd] profile list?
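The replication-overhead saving mentioned above is easy to quantify. A quick back-of-the-envelope comparison for the 8x3TB setup from this thread, contrasting 3x replication with the JErasure k=2, m=1 profile; these are nominal capacities that ignore per-OSD overhead and the usual advice to keep OSDs well below full:

```python
def usable_tb(raw_tb: float, data_chunks: int, total_chunks: int) -> float:
    """Nominal usable capacity: raw capacity times the data fraction."""
    return raw_tb * data_chunks / total_chunks

raw = 8 * 3.0  # eight 3 TB HDDs = 24 TB raw

# 3x replication keeps 1 data copy out of 3 stored chunks
print(usable_tb(raw, 1, 3))  # prints 8.0

# EC 2+1 ('jerasure21') keeps 2 data chunks out of 3 stored chunks
print(usable_tb(raw, 2, 3))  # prints 16.0
```

So the erasure-coded data pool doubles the nominal usable space on the same disks, at the cost of tolerating only a single failed chunk.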
Re: [ceph-users] QEMU maps RBD but can't read them
Jason Dillaman wrote:
> Can you run "rbd --id libvirt --pool libvirt win206-test-3tb " w/o error? It sounds like your CephX caps for client.libvirt are not permitting read access to the image data objects.

I tried to run 'rbd export' with these params, but it said it was unable to find a keyring. Is a keyring file mandatory for every client?

'ceph auth ls' shows these accounts with seemingly proper permissions:

client.iso
        key: AQBp...gA==
        caps: [mon] profile rbd
        caps: [osd] profile rbd pool=iso
client.libvirt
        key: AQBt...IA==
        caps: [mon] profile rbd
        caps: [osd] profile rbd pool=libvirt

And these same keys are listed in /etc/libvirt/secrets :

/etc/libvirt/secrets# ls | while read a ; do echo $a : $(cat $a) ; done
ac1d8d7b-d243-4474-841d-91c26fd93a14.base64 : AQBt...IA==
ac1d8d7b-d243-4474-841d-91c26fd93a14.xml : ... private='yes' ... ac1d8d7b-d243-4474-841d-91c26fd93a14 ... CEPH passphrase example ... ceph_example
cf00c7e4-740a-4935-9d7c-223d3c81871f.base64 : AQBp...gA==
cf00c7e4-740a-4935-9d7c-223d3c81871f.xml : ... private='yes' ... cf00c7e4-740a-4935-9d7c-223d3c81871f ... CEPH ISO pool ... ceph_iso

I just thought this should be enough, no?
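Regarding the missing keyring: the rbd CLI looks for a per-client keyring file under /etc/ceph by default, following the ceph.<client>.keyring naming convention. A sketch of exporting one with standard Ceph commands (the destination path for the export is illustrative):

```
# Write the client.libvirt key into the default keyring location
ceph auth get client.libvirt -o /etc/ceph/ceph.client.libvirt.keyring

# With the keyring in place, the export should find its credentials
rbd export --id libvirt --pool libvirt win206-test-3tb /tmp/test.img
```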
[ceph-users] QEMU maps RBD but can't read them
Dear all,

I installed QEMU, libvirtd and its RBD plugins and am now trying to make QEMU use my Ceph storage. I created an 'iso' pool and imported a Windows installation image there (rbd import). I also created a 'libvirt' pool and created a 2.7-TB image there for Windows installation. I created client.iso and client.libvirt accounts for Ceph authentication, and configured their secrets for pool access in virsh (as told in http://docs.ceph.com/docs/master/rbd/libvirt/ ). Then I started the pools and checked that I can list their contents from virsh. Then I created a VM with a dummy HDD and optical drive, and edited them using 'virsh edit' (the optical drive's source refers to name='iso/SW_DVD9_Win_Server_STD_CORE_2016_64Bit_Russian_-4_DC_STD_MLF_X21-70539.ISO').

Now I see this in the systemd journalctl:

Jun 06 16:24:12 p10s qemu-system-x86_64[4907]: 2018-06-06 16:24:12.147 7f40f37fe700 -1 librbd::io::ObjectRequest: 0x7f40d4010500 handle_read_object: failed to read from object: (1) Operation not permitted

What should I check, and where? I can map the same RBD using rbd-nbd and read sectors from the mapped device. If I map it using the kernel RBD driver (I know this is not recommended on the same host), I get:

Jun 06 16:27:54 p10s kernel: rbd: image SW_DVD9_Win_Server_STD_CORE_2016_64Bit_Russian_-4_DC_STD_MLF_X21-70539.ISO: image uses unsupported features: 0x38

and "RBD image feature set mismatch. You can disable features unsupported by the kernel with "rbd feature disable iso/SW_DVD9_Win_Server_STD_CORE_2016_64Bit_Russian_-4_DC_STD_MLF_X21-70539.ISO object-map fast-diff deep-flatten"."

Probably I need to change some attributes for the RBD to be usable with QEMU. Please give some hints. Thank you in advance.
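As the replies further up this thread show, the 'Operation not permitted' here came from CephX caps that did not cover the erasure-coded data pool. A quick way to confirm that diagnosis with standard commands is to compare the image's data pool against the clients' caps:

```
# Show the image details; a 'data_pool' line means an EC data pool is in use
rbd info iso/SW_DVD9_Win_Server_STD_CORE_2016_64Bit_Russian_-4_DC_STD_MLF_X21-70539.ISO

# Compare with the OSD caps actually granted to the clients:
# each pool in 'data_pool' must also appear in the 'profile rbd pool=...' list
ceph auth get client.iso
ceph auth get client.libvirt
```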
Re: [ceph-users] iSCSI to a Ceph node with 2 network adapters - how to ?
On Mon, Jun 04, 2018 at 11:12:58AM +0300, Wladimir Mutel wrote:
> /disks> create pool=rbd image=win2016-3tb-1 size=2861589M
> CMD: /disks/ create pool=rbd image=win2016-3tb-1 size=2861589M count=1 max_data_area_mb=None
> pool 'rbd' is ok to use
> Creating/mapping disk rbd/win2016-3tb-1
> Issuing disk create request
> Failed : disk create/update failed on p10s. LUN allocation failure
> Surely I could investigate what is happening by studying gwcli sources, but if anyone already knows how to fix that, I would appreciate your response.

Well, this was fixed by updating the kernel to v4.17 from the Ubuntu kernel/mainline PPA. Going on...
Re: [ceph-users] iSCSI to a Ceph node with 2 network adapters - how to ?
On Fri, Jun 01, 2018 at 08:20:12PM +0300, Wladimir Mutel wrote:
> And still, when I do '/disks create ...' in gwcli, it says that it wants 2 existing gateways. Probably this is related to the created 2-TPG structure and I should look for more ways to 'improve' that json config so that rbd-target-gw loads it as I need on a single host.

Well, I decided to bond my network interfaces and assign a single IP to them (as mchristi@ suggested). Also I put 'minimum_gateways = 1' into /etc/ceph/iscsi-gateway.cfg and got rid of 'At least 2 gateways required' in gwcli. But now I hit one more stumbling block:

gwcli -d
Adding ceph cluster 'ceph' to the UI
Fetching ceph osd information
Querying ceph for state information
Refreshing disk information from the config object
- Scanning will use 8 scan threads
- rbd image scan complete: 0s
Refreshing gateway & client information
- checking iSCSI/API ports on p10s
Querying ceph for state information
Gathering pool stats for cluster 'ceph'
/disks> create pool=rbd image=win2016-3tb-1 size=2861589M
CMD: /disks/ create pool=rbd image=win2016-3tb-1 size=2861589M count=1 max_data_area_mb=None
pool 'rbd' is ok to use
Creating/mapping disk rbd/win2016-3tb-1
Issuing disk create request
Failed : disk create/update failed on p10s. LUN allocation failure

Surely I could investigate what is happening by studying the gwcli sources, but if anyone already knows how to fix this, I would appreciate your response.
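For reference, a minimal /etc/ceph/iscsi-gateway.cfg sketch for the single-gateway setup described above. The trusted_ip_list value is the address used in this thread; the other values are common defaults and should be adjusted to your own cluster:

```
[config]
cluster_name = ceph
gateway_keyring = ceph.client.admin.keyring
api_secure = false
trusted_ip_list = 192.168.200.230
minimum_gateways = 1
```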
Re: [ceph-users] iSCSI to a Ceph node with 2 network adapters - how to ?
Ok, I looked into the Python sources of ceph-iscsi-{cli,config} and found that per-host configuration sections use the short host name (returned by the this_host() function) as their primary key. So I can't trick gwcli with an alternative host name like p10s2, which I put into /etc/hosts to denote my second IP, because this_host() calls gethostname() and the further code disregards alternative host names entirely. I added 192.168.201.231 to trusted_ip_list, but after 'create p10s2 192.168.201.231 skipchecks=true' I got a KeyError 'p10s2' in gwcli/gateway.py line 571.

Fortunately, I found a way to edit the Ceph iSCSI configuration as a text file (rados --pool rbd get gateway.conf gateway.conf). I added the needed IP to the appropriate JSON lists (."gateways"."ip_list" and ."gateways"."p10s"."gateway_ip_list"), put the file back into RADOS and restarted rbd-target-gw, in the hope that everything would go well.

Unfortunately, I found (by running 'targetcli ls') that it now creates 2 TPGs with a single IP portal in each of them. Also, it disables the 1st TPG but enables the 2nd one, like this:

o- iscsi [Targets: 1]
| o- iqn.2018-06.domain.p10s:p10s [TPGs: 2]
|   o- tpg1 [disabled]
|   | o- portals [Portals: 1]
|   |   o- 192.168.200.230:3260 [OK]
|   o- tpg2 [no-gen-acls, no-auth]
|     o- portals [Portals: 1]
|       o- 192.168.201.231:3260 [OK]

And still, when I do '/disks create ...' in gwcli, it says that it wants 2 existing gateways. Probably this is related to the created 2-TPG structure, and I should look for more ways to 'improve' that JSON config so that rbd-target-gw loads it as I need on a single host.
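The gateway.conf round-trip described above can be sketched as follows. The rados commands are standard; the jq invocation is illustrative; note that ceph-iscsi treats this RADOS object as its private state, so hand-editing is at your own risk:

```
# Fetch the ceph-iscsi config object into a local file
rados --pool rbd get gateway.conf gateway.conf

# Inspect the JSON lists mentioned above before editing by hand
jq '.gateways.ip_list, .gateways.p10s.gateway_ip_list' gateway.conf

# Put the modified object back and restart the gateway service
rados --pool rbd put gateway.conf gateway.conf
systemctl restart rbd-target-gw
```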
Wladimir Mutel wrote:
> Well, ok, I moved the second address into a different subnet (192.168.201.231/24) and also reflected that in the 'hosts' file. But that did not help much:
>
> /iscsi-target...test/gateways> create p10s2 192.168.201.231 skipchecks=true
> OS version/package checks have been bypassed
> Adding gateway, sync'ing 0 disk(s) and 0 client(s)
> Failed : Gateway creation failed, gateway(s) unavailable:192.168.201.231(UNKNOWN state)
>
> /disks> create pool=replicated image=win2016-3gb size=2861589M
> Failed : at least 2 gateways must exist before disk operations are permitted
>
> I see this mentioned in the ceph-iscsi-cli GitHub issues https://github.com/ceph/ceph-iscsi-cli/issues/54 and https://github.com/ceph/ceph-iscsi-cli/issues/59, but apparently without a solution.
>
> So, would anybody propose an idea of how I can start using iSCSI over Ceph on the cheap, with the single P10S host I have in my hands right now? An additional host and 10GbE hardware would require additional funding, which will be possible only at some point in the future. Thanks in advance for your responses.
>
> Wladimir Mutel wrote:
>> I have both its Ethernets connected to the same LAN, with different IPs in the same subnet (like, 192.168.200.230/24 and 192.168.200.231/24)
>> 192.168.200.230 p10s
>> 192.168.200.231 p10s2
Re: [ceph-users] iSCSI to a Ceph node with 2 network adapters - how to ?
Well, ok, I moved the second address into a different subnet (192.168.201.231/24) and also reflected that in the 'hosts' file. But that did not help much:

/iscsi-target...test/gateways> create p10s2 192.168.201.231 skipchecks=true
OS version/package checks have been bypassed
Adding gateway, sync'ing 0 disk(s) and 0 client(s)
Failed : Gateway creation failed, gateway(s) unavailable:192.168.201.231(UNKNOWN state)

/disks> create pool=replicated image=win2016-3gb size=2861589M
Failed : at least 2 gateways must exist before disk operations are permitted

I see this mentioned in the ceph-iscsi-cli GitHub issues https://github.com/ceph/ceph-iscsi-cli/issues/54 and https://github.com/ceph/ceph-iscsi-cli/issues/59, but apparently without a solution.

So, would anybody propose an idea of how I can start using iSCSI over Ceph on the cheap, with the single P10S host I have in my hands right now? An additional host and 10GbE hardware would require additional funding, which will be possible only at some point in the future. Thanks in advance for your responses.

Wladimir Mutel wrote:
> I have both its Ethernets connected to the same LAN, with different IPs in the same subnet (like, 192.168.200.230/24 and 192.168.200.231/24)
> 192.168.200.230 p10s
> 192.168.200.231 p10s2
[ceph-users] iSCSI to a Ceph node with 2 network adapters - how to ?
Dear all,

I am experimenting with a Ceph setup. I set up a single node (Asus P10S-M WS, Xeon E3-1235 v5, 64 GB RAM, 8x3TB SATA HDDs, Ubuntu 18.04 Bionic, Ceph packages from http://download.ceph.com/debian-luminous/dists/xenial/ and the iSCSI parts built manually per http://docs.ceph.com/docs/master/rbd/iscsi-target-cli-manual-install/). Also I changed 'chooseleaf ... host' into 'chooseleaf ... osd' in the CRUSH map to run with a single host.

I have both its Ethernets connected to the same LAN, with different IPs in the same subnet (like, 192.168.200.230/24 and 192.168.200.231/24). mon_host in ceph.conf is set to 192.168.200.230, and the Ceph daemons (mgr, mon, osd) are listening on this IP.

What I would finally like to achieve is to provide multipath iSCSI access to Ceph RBDs through both these Ethernets, but apparently gwcli does not allow me to add a second gateway to the same target. It goes like this:

/iscsi-target> create iqn.2018-06.host.test:test
ok
/iscsi-target> cd iqn.2018-06.host.test:test/gateways
/iscsi-target...test/gateways> create p10s 192.168.200.230 skipchecks=true
OS version/package checks have been bypassed
Adding gateway, sync'ing 0 disk(s) and 0 client(s)
ok
/iscsi-target...test/gateways> create p10s2 192.168.200.231 skipchecks=true
OS version/package checks have been bypassed
Adding gateway, sync'ing 0 disk(s) and 0 client(s)
Failed : Gateway creation failed, gateway(s) unavailable:192.168.200.231(UNKNOWN state)

The host names are defined in /etc/hosts as follows:

192.168.200.230 p10s
192.168.200.231 p10s2

So I suppose that something is not listening on 192.168.200.231, but I don't have an idea what that thing is or how to make it listen there. Or how to achieve this goal (utilization of both Ethernets for iSCSI) in a different way. Should I aggregate the Ethernets into a 'bond' interface with a single IP? Should I build and use the 'lrbd' tool instead of 'gwcli'? Is it acceptable that I run kernel 4.15, not 4.16+?
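The bonding option mentioned at the end is the route eventually taken in this thread. A hedged sketch of what that could look like with netplan on Ubuntu 18.04; the interface names eno1/eno2 and the balance-alb mode are assumptions, so adjust them to your NICs and switch capabilities:

```yaml
# /etc/netplan/01-bond.yaml (illustrative; apply with 'netplan apply')
network:
  version: 2
  ethernets:
    eno1: {}
    eno2: {}
  bonds:
    bond0:
      interfaces: [eno1, eno2]
      parameters:
        mode: balance-alb   # adaptive load balancing, no switch-side LACP needed
      addresses: [192.168.200.230/24]
```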
What other directions could you give me on this task? Thanks in advance for your replies.