Re: cmake
On Wed, Dec 16, 2015 at 5:33 PM, Sage Weil wrote:
> The work to transition to cmake has stalled somewhat. I've tried to use
> it a few times but keep running into issues that make it unusable for me.
> Not having make check is a big one, but I think the hackery required to
> get that going points to the underlying problem(s).
>
> It seems like the main problem is that automake puts all build targets in
> src/ and cmake spreads them all over build/*. This means that you can't
> just add ./ to anything that would normally be in your path (or
> PATH=.:$PATH, and then run, say, ../qa/workunits/cephtool/test.sh).
> There's a bunch of kludges in vstart.sh to make it work that I think
> mostly point to this issue (and the .libs things). Is there simply an
> option we can give cmake to make it put built binaries directly in build/?
>
> Stepping back a bit, it seems like the goals should be
>
> 1. Be able to completely replace autotools. I don't fancy maintaining
> both in parallel.

Yes!

> 2. Be able to run vstart etc from the build dir.

I'm currently doing this (i.e. being in the build dir and running
../src/vstart.sh), along with the vstart_runner.py for cephfs tests. I did
indeed have to make sure that vstart_runner was aware of the differing
binary paths, though. Though I'm obviously using just MDS+OSD, so I might
be overstating the extent to which it currently works.

> 3. Be able to run ./ceph[-anything] from the build dir, or put the build
> dir in the path. (I suppose we could rely on a make install step, but
> that seems like more hassle... hopefully it's not necessary?)

Shall we just put all our libs and binaries in one place?
This works for me:

  set(CMAKE_ARCHIVE_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}/lib)
  set(CMAKE_LIBRARY_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}/lib)
  set(CMAKE_RUNTIME_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}/bin)

(to get a bin/ and a lib/ with absolutely everything in)

That way folks can either get used to typing bin/foo instead of ./foo, or
add bin/ to their path.

> 4. make check has to work
>
> 5. Use make-dist.sh to generate a release tarball (not make dist)
>
> 6. gitbuilders use make-dist.sh and cmake to build packages
>
> 7. release process uses make-dist.sh and cmake to build a release
>
> I'm probably missing something?
>
> Should we set a target of doing the 10.0.2 or .3 with cmake?
>
> sage

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Issue with Ceph File System and LIO
On Tue, Dec 15, 2015 at 9:26 AM, Mike Christie wrote:
> On 12/15/2015 12:08 AM, Eric Eastman wrote:
>> I am testing Linux Target SCSI, LIO, with a Ceph File System backstore
>> and I am seeing this error on my LIO gateway. I am using Ceph v9.2.0
>> on a 4.4rc4 kernel, on Trusty, using a kernel-mounted Ceph File
>> System. A file on the Ceph File System is exported via iSCSI to a
>> VMware ESXi 5.0 server, and I am seeing this error when doing a lot of
>> I/O on the ESXi server. Is this a LIO or a Ceph issue?
>>
>> [Tue Dec 15 00:46:55 2015] [ cut here ]
>> [Tue Dec 15 00:46:55 2015] WARNING: CPU: 0 PID: 1123421 at
>> /home/kernel/COD/linux/fs/ceph/addr.c:125
>> ceph_set_page_dirty+0x230/0x240 [ceph]()
>> [Tue Dec 15 00:46:55 2015] Modules linked in: iptable_filter ip_tables
>> x_tables xfs rbd iscsi_target_mod vhost_scsi tcm_qla2xxx ib_srpt
>> tcm_fc tcm_usb_gadget tcm_loop target_core_file target_core_iblock
>> target_core_pscsi target_core_user target_core_mod ipmi_devintf vhost
>> qla2xxx ib_cm ib_sa ib_mad ib_core ib_addr libfc scsi_transport_fc
>> libcomposite udc_core uio configfs ipmi_ssif ttm drm_kms_helper
>> gpio_ich drm i2c_algo_bit fb_sys_fops coretemp syscopyarea ipmi_si
>> sysfillrect ipmi_msghandler sysimgblt kvm acpi_power_meter 8250_fintek
>> irqbypass hpilo shpchp input_leds serio_raw lpc_ich i7core_edac
>> edac_core mac_hid ceph libceph libcrc32c fscache bonding lp parport
>> mlx4_en vxlan ip6_udp_tunnel udp_tunnel ptp pps_core hid_generic
>> usbhid hid hpsa mlx4_core psmouse bnx2 scsi_transport_sas fjes [last
>> unloaded: target_core_mod]
>> [Tue Dec 15 00:46:55 2015] CPU: 0 PID: 1123421 Comm: iscsi_trx
>> Tainted: G W I 4.4.0-040400rc4-generic #201512061930
>> [Tue Dec 15 00:46:55 2015] Hardware name: HP ProLiant DL360 G6, BIOS
>> P64 01/22/2015
>> [Tue Dec 15 00:46:55 2015] fdc0ce43 880bf38c38c0 813c8ab4
>> [Tue Dec 15 00:46:55 2015] 880bf38c38f8 8107d772 ea00127a8680
>> [Tue Dec 15 00:46:55 2015] 8804e52c1448 8804e52c15b0 8804e52c10f0 0200
>> [Tue Dec 15 00:46:55 2015] Call Trace:
>> [Tue Dec 15 00:46:55 2015] [] dump_stack+0x44/0x60
>> [Tue Dec 15 00:46:55 2015] [] warn_slowpath_common+0x82/0xc0
>> [Tue Dec 15 00:46:55 2015] [] warn_slowpath_null+0x1a/0x20
>> [Tue Dec 15 00:46:55 2015] [] ceph_set_page_dirty+0x230/0x240 [ceph]
>> [Tue Dec 15 00:46:55 2015] [] ? pagecache_get_page+0x150/0x1c0
>> [Tue Dec 15 00:46:55 2015] [] ? ceph_pool_perm_check+0x48/0x700 [ceph]
>> [Tue Dec 15 00:46:55 2015] [] set_page_dirty+0x3d/0x70
>> [Tue Dec 15 00:46:55 2015] [] ceph_write_end+0x5e/0x180 [ceph]
>> [Tue Dec 15 00:46:55 2015] [] ? iov_iter_copy_from_user_atomic+0x156/0x220
>> [Tue Dec 15 00:46:55 2015] [] generic_perform_write+0x114/0x1c0
>> [Tue Dec 15 00:46:55 2015] [] ceph_write_iter+0xf8a/0x1050 [ceph]
>> [Tue Dec 15 00:46:55 2015] [] ? ceph_put_cap_refs+0x143/0x320 [ceph]
>> [Tue Dec 15 00:46:55 2015] [] ? check_preempt_wakeup+0xfa/0x220
>> [Tue Dec 15 00:46:55 2015] [] ? zone_statistics+0x7c/0xa0
>> [Tue Dec 15 00:46:55 2015] [] ? copy_page_to_iter+0x5e/0xa0
>> [Tue Dec 15 00:46:55 2015] [] ? skb_copy_datagram_iter+0x122/0x250
>> [Tue Dec 15 00:46:55 2015] [] vfs_iter_write+0x76/0xc0
>> [Tue Dec 15 00:46:55 2015] [] fd_do_rw.isra.5+0xd8/0x1e0 [target_core_file]
>> [Tue Dec 15 00:46:55 2015] [] fd_execute_rw+0xc5/0x2a0 [target_core_file]
>> [Tue Dec 15 00:46:55 2015] [] sbc_execute_rw+0x22/0x30 [target_core_mod]
>> [Tue Dec 15 00:46:55 2015] [] __target_execute_cmd+0x1f/0x70 [target_core_mod]
>> [Tue Dec 15 00:46:55 2015] [] target_execute_cmd+0x195/0x2a0 [target_core_mod]
>> [Tue Dec 15 00:46:55 2015] [] iscsit_execute_cmd+0x20a/0x270 [iscsi_target_mod]
>> [Tue Dec 15 00:46:55 2015] [] iscsit_sequence_cmd+0xda/0x190 [iscsi_target_mod]
>> [Tue Dec 15 00:46:55 2015] [] iscsi_target_rx_thread+0x51d/0xe30 [iscsi_target_mod]
>> [Tue Dec 15 00:46:55 2015] [] ? __switch_to+0x1dc/0x5a0
>> [Tue Dec 15 00:46:55 2015] [] ? iscsi_target_tx_thread+0x1e0/0x1e0 [iscsi_target_mod]
>> [Tue Dec 15 00:46:55 2015] [] kthread+0xd8/0xf0
>> [Tue Dec 15 00:46:55 2015] [] ? kthread_create_on_node+0x1a0/0x1a0
>> [Tue Dec 15 00:46:55 2015] [] ret_from_fork+0x3f/0x70
>> [Tue Dec 15 00:46:55 2015] [] ? kthread_create_on_node+0x1a0/0x1a0
>> [Tue Dec 15 00:46:55 2015] ---[ end trace 4079437668c77cbb ]---
>> [Tue Dec 15 00:47:45 2015] ABORT_TASK: Found referenced iSCSI task_tag: 95784927
>> [Tue Dec 15 00:47:45 2015] ABORT_TASK: ref_tag: 95784927 already
>> complete, skipping

Looks likely to be a kclient bug, as it's in the newish pool_perm_check
path. Perhaps we don't usually see this because we'd usually hit the
permissions checks earlier (or during a read).

CCing zyan, who will have a better idea than me.

Eric: you should probably go ahead and
Re: [ceph-users] Re: Re: how to see file object-mappings for ceph-fuse client
On Mon, Dec 7, 2015 at 9:13 AM, Wuxiangwei wrote:
> It looks simple if everything stays at its default value. However, we do
> want to change the stripe unit size and stripe count to achieve possibly
> higher performance. If so, it would be too troublesome to manually do
> the calculation every time we want to locate a given offset (and maybe a
> certain interval). The 'cephfs map' and 'cephfs show_location' commands
> can provide the information we want, but sadly not for ceph-fuse. That's
> why I ask for a similar tool.

If you are interested in writing this, you could look at
Client::get_file_stripe_address and Striper::file_to_extents.

Currently in libcephfs we expose methods for getting the OSDs relevant to
a particular file, in case something like Hadoop wants to exploit
locality, but we don't expose the intermediate knowledge of the object
names. I am curious about why you need this?

John
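For anyone wanting to do the calculation by hand, the offset-to-object
mapping that Striper::file_to_extents performs can be sketched in Python.
This is a simplified model of the CephFS striping rules (stripe-unit
blocks dealt round-robin across stripe_count objects per object set);
the function and parameter names are illustrative, not a real API:

```python
def file_offset_to_object(off, stripe_unit, stripe_count, object_size, inode_no):
    """Map a file byte offset to (data object name, offset within object).

    Simplified model of CephFS striping: the file is cut into
    stripe_unit-sized blocks, dealt round-robin across stripe_count
    objects; each object holds object_size / stripe_unit such blocks.
    """
    assert object_size % stripe_unit == 0
    stripes_per_object = object_size // stripe_unit

    blockno = off // stripe_unit           # which stripe-unit block of the file
    stripeno = blockno // stripe_count     # which stripe (row)
    stripepos = blockno % stripe_count     # which object within the object set
    objectsetno = stripeno // stripes_per_object
    objectno = objectsetno * stripe_count + stripepos

    off_in_object = (stripeno % stripes_per_object) * stripe_unit + off % stripe_unit
    # CephFS data objects are named <inode hex>.<object number, 8 hex digits>
    return "%x.%08x" % (inode_no, objectno), off_in_object
```

With the default layout (4 MB stripe unit, stripe count 1, 4 MB objects)
this degenerates to a simple division by the object size; with a wider
stripe count it interleaves as above.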
Re: proposal to run Ceph tests on pull requests
On Sat, Dec 5, 2015 at 11:49 AM, Loic Dachary wrote:
> Hi Ceph,
>
> TL;DR: a ceph-qa-suite bot running on pull requests is sustainable and
> is an incentive for contributors to use teuthology-openstack independently

A bot for scheduling a named suite on a named PR, and posting the results
back to the PR, is definitely a good thing.

Thinking further about using commit messages to toggle the testing, I
think that this could get awkward when it's coupled to the human side of
code review. When someone pushes a "how about this?" modification they
don't necessarily want to re-run the test suite until the reviewer has
okayed it, but then that means they have to push again, and the final
thing that's tested would be a different SHA1 (hopefully the same code)
than what the human last reviewed. We'll also have e.g. rebases, where
there tends to be some discretion about whether a rebase requires a
re-test.

When you were talking about having the suite selected in the qa: tag,
there was the motivation to put it in the commit message so that it would
be preserved in backports. However, if the "Needs-qa:" flag is just a
boolean, then I think it makes more sense to control it with a github
label or by posting a command in a PR comment.

I'm not sure how this really helps with the resource issues; for example
with the fs suite we would probably not be able to make a finer-grained
choice about what tests to run based on the diff. The part about randomly
dropping a subset of tests when resources are low doesn't make sense to
me -- I think the bot should either give up or enqueue itself.

Cheers,
John

> When a pull request is submitted, it is compiled, some tests are run [1]
> and the result is added to the pull request to confirm that it does not
> introduce a trivial problem. Such tests are however limited because they
> must:
>
> * run within a few minutes at most
> * not require multiple machines
> * not require root privileges
>
> More extensive tests (primarily integration tests) are needed before a
> contribution can be merged into Ceph [2], to verify it does not
> introduce a subtle regression. It would be ideal to run these
> integration tests on each pull request but there are two obstacles:
>
> * each test takes ~1.5 hours
> * each test costs ~0.30 euros
>
> On the current master, running all tests would require ~1000 jobs [3].
> That would cost ~300 euros on each pull request and take ~10 hours,
> assuming 100 jobs can run in parallel. We could resolve that problem by:
>
> * maintaining a ceph-qa-suite map to be used as a white list mapping a
> diff to a set of tests. For instance, if the diff modifies the
> src/ceph-disk file, it outputs the ceph-disk suite [4]. This would
> effectively trim the tests that are unrelated to the contribution and
> reduce the number of tests to a maximum of ~100 [4], and most likely a
> dozen.
> * tests are run if one of the commits of the pull request has the
> *Needs-qa: true* flag in the commit message [5]
> * limiting the number of tests to fit in the allocated budget. If there
> was enough funding for 10,000 jobs during the previous period and there
> was a total of 1,000 test runs required (a test run is a set of tests as
> produced by the ceph-qa-suite map), each run is trimmed to a maximum of
> ten tests, regardless.
>
> Here is an example:
>
> Joe submits a pull request to fix a bug in the librados API.
> The make check bot compiles and fails make check because it introduces a bug.
> Joe uses run-make-check.sh locally to repeat the failure, fixes it and repushes.
> The make check bot compiles and passes make check.
> Joe amends the commit message to add *Needs-qa: true* and repushes.
> The ceph-qa-suite map script finds a change on the librados API and
> outputs smoke/basic/tasks/rados_api_tests.yaml.
> The ceph-qa-suite bot runs the test smoke/basic/tasks/rados_api_tests.yaml,
> which fails.
> Joe examines the logs found at http://teuthology-logs.public.ceph.com/
> and decides to debug by running the test himself.
> Joe runs teuthology-openstack --suite smoke/basic/tasks/rados_api_tests.yaml
> against his own OpenStack tenant [6].
> Joe repushes with a fix.
> The ceph-qa-suite bot runs the test smoke/basic/tasks/rados_api_tests.yaml,
> which succeeds.
> Kefu reviews the pull request and has a link to the successful test runs
> in the comments.
>
> This approach scales with the size of the Ceph developer community [7]
> because regular contributors benefit directly from funding the
> ceph-qa-suite bot. New contributors can focus on learning how to
> interpret the ceph-qa-suite error logs for their contribution, and learn
> how to debug via teuthology-openstack if needed, which is a better user
> experience than trying to figure out which ceph-qa-suite job to run,
> learning about teuthology, scheduling the test and interpreting the
> results.
>
> The maintenance workload of a ceph-qa-suite bot probably requires one
> work day a week, to
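The white-list map described in the proposal (diff paths to suites) could
be prototyped as a simple prefix table. The paths and suite names below
are illustrative only, not an actual ceph-qa-suite layout:

```python
# Hypothetical path-prefix -> suite map; a real one would need curation.
SUITE_MAP = [
    ("src/ceph-disk", "ceph-disk"),
    ("src/librados/", "smoke/basic/tasks/rados_api_tests.yaml"),
    ("src/mds/", "fs"),
    ("src/osd/", "rados"),
]

def suites_for(changed_paths):
    """Return the set of suites covering the files touched by a diff."""
    suites = set()
    for path in changed_paths:
        for prefix, suite in SUITE_MAP:
            if path.startswith(prefix):
                suites.add(suite)
    return suites
```

A diff touching only documentation would map to an empty set (no
integration tests scheduled), which is where most of the cost saving in
the proposal comes from.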
Re: Suggestions on tracker 13578
On Wed, Dec 2, 2015 at 7:54 PM, Paul Von-Stamwitz wrote:
>> -----Original Message-----
>> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
>> ow...@vger.kernel.org] On Behalf Of Mark Nelson
>> Sent: Wednesday, December 02, 2015 11:04 AM
>> To: Gregory Farnum; Vimal
>> Cc: ceph-devel
>> Subject: Re: Suggestions on tracker 13578
>>
>> On 12/02/2015 12:23 PM, Gregory Farnum wrote:
>> > On Tue, Dec 1, 2015 at 5:23 AM, Vimal wrote:
>> >> Hello,
>> >>
>> >> This mail is to discuss the feature request at
>> >> http://tracker.ceph.com/issues/13578.
>> >>
>> >> If done, such a tool should help point out several mis-configurations
>> >> that may cause problems in a cluster later.
>> >>
>> >> Some of the suggestions are:
>> >>
>> >> a) A check to understand if the MONs and OSD nodes are on the same
>> >> machines.
>> >>
>> >> b) If /var is a separate partition or not, to prevent the root
>> >> filesystem from being filled up.
>> >>
>> >> c) If monitors are deployed in different failure domains or not.
>> >>
>> >> d) If the OSDs are deployed in different failure domains.
>> >>
>> >> e) If a journal disk is used for more than six OSDs. Right now, the
>> >> documentation suggests up to 6 OSD journals on a single journal disk.
>> >>
>> >> f) Failure domains depending on the power source.
>> >>
>> >> There can be several more checks, and it can be a useful tool to test
>> >> the problems of an existing cluster or a new installation.
>> >>
>> >> But I'd like to know how the engineering community sees this, whether
>> >> it seems to be worth pursuing, and what suggestions you have for
>> >> improving/adding to this.
>> >
>> > This is a user experience and support tool; I don't think the
>> > engineering community can really judge its value. ;)
>> >
>> > So sure, sounds good to me. It'll need to get into the hands of users
>> > before we find out if it's a good plan or not. I was at the SDI Summit
>> > yesterday and was hearing about how some of our choices (like
>> > HEALTH_WARN on pg counts) are *really* scary for users who think
>> > they're in danger of losing data. I suspect the difficulty of a tool
>> > like this will be more in the communication of issues and severity
>> > than in what exactly we choose to check.
>>
>> Frankly, I've never been a big fan of how we report warnings like this
>> through the health check. It's important to let users know if they've
>> set things up sub-optimally, but I don't think ceph health is the way
>> to do it. Consider the difference between your doctor telling you that
>> you should exercise more and lose a few pounds, vs. that you have Ebola
>> and are going to suffer an incredibly gruesome and painful death in the
>> next 48 hours. :)
>
> Since I was the one at the SDI Summit that took issue with some of these
> warnings, I whole-heartedly agree with Greg's and Mark's comments. A
> warning at health check should indicate to the user that some corrective
> action should be taken, besides turning the warning off :-) I do not
> have an issue reporting advisories, but they should be kept separate
> from true warnings. If we want to notify the user of variances from best
> practices, I suggest a separate method, i.e. "ceph advise", rather than
> constantly repeating them on health checks.

Separating things into "advise" vs. "health" probably doesn't solve the
problem, because one has to decide what goes in which section, and ends
up with the same problem as INFO/WARN/ERR categorisation -- the idea of
having different categories is fine; the hard part is assigning
particular items to a category in a way that makes sense for different
users. IMHO the core problems are attempting to collapse all these
notifications into a global indicator, and attempting to do that in the
same way for all systems. It needs to be finer grained than that.

I never got around to doing anything with #7192 [1], but it outlines a
way to change the health output into a form where it's easier to
selectively ignore particular items. Once you break down the health
output into a set of known status codes, a natural extension would be to
have user-configurable masks, so that users could cancel particular
warnings if they wanted to. Think of it like having the ability to press
the warning lights in an aeroplane cockpit to turn off the alarm sound.

John

1. http://tracker.ceph.com/issues/7192
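The status-code-plus-mask idea can be sketched briefly. The codes and
severities below are invented for illustration (ticket #7192 describes the
actual proposal; Ceph's real health machinery works differently):

```python
# Hypothetical health codes and a three-level severity scale.
SEVERITY = {"OK": 0, "WARN": 1, "ERR": 2}

def overall_health(items, mask=frozenset()):
    """Collapse per-item health into one global indicator, skipping
    any codes the operator has chosen to silence.

    items: iterable of (code, severity) pairs, e.g. ("PG_COUNT_LOW", "WARN").
    mask:  set of codes to ignore -- the "pressed warning light".
    """
    worst = "OK"
    for code, sev in items:
        if code in mask:
            continue  # operator acknowledged this one; don't raise the alarm
        if SEVERITY[sev] > SEVERITY[worst]:
            worst = sev
    return worst
```

The point of the sketch is that once health is a set of discrete codes,
masking becomes a per-code set operation rather than a global on/off
switch, which is exactly the finer granularity argued for above.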
Re: RFC: teuthology field in commit messages
On Sun, Nov 29, 2015 at 8:25 PM, Loic Dachary <l...@dachary.org> wrote:
>
> On 29/11/2015 21:08, John Spray wrote:
>> On Sat, Nov 28, 2015 at 3:56 PM, Loic Dachary <l...@dachary.org> wrote:
>>> Hi Ceph,
>>>
>>> An optional teuthology field could be added to a commit message like so:
>>>
>>> teuthology: --suite rbd
>>>
>>> to state that this commit should be tested with the rbd suite. It could
>>> be parsed by bots and humans.
>>>
>>> It would make it easy and cost effective to run partial teuthology
>>> suites automatically on pull requests.
>>>
>>> What do you think ?
>>
>> Hmm, we are usually testing things at the branch/PR level rather than
>> on the per-commit level, so it feels a bit strange to have this in the
>> commit message.
>
> Indeed. But what is a branch if not the HEAD commit ?

It's the HEAD commit, and its ancestors. So in a typical PR (or branch)
there are several commits since the base (i.e. since master), and perhaps
only one of them has a test suite marked on it, or maybe they have
different test suites marked on different commits in the branch.

It's not necessarily a problem, just something that would need to have a
defined behaviour (maybe when testing a PR, collect the "teuthology:"
tags from all commits in the PR, and run all the suites mentioned?).

John

>> However, if a system existed that would auto-test things when I put
>> something magic in a commit message, I would probably use it!
>>
>> John
>>
>>> --
>>> Loïc Dachary, Artisan Logiciel Libre
>
> --
> Loïc Dachary, Artisan Logiciel Libre
Re: RFC: teuthology field in commit messages
On Sat, Nov 28, 2015 at 3:56 PM, Loic Dachary wrote:
> Hi Ceph,
>
> An optional teuthology field could be added to a commit message like so:
>
> teuthology: --suite rbd
>
> to state that this commit should be tested with the rbd suite. It could
> be parsed by bots and humans.
>
> It would make it easy and cost effective to run partial teuthology
> suites automatically on pull requests.
>
> What do you think ?

Hmm, we are usually testing things at the branch/PR level rather than on
the per-commit level, so it feels a bit strange to have this in the
commit message.

However, if a system existed that would auto-test things when I put
something magic in a commit message, I would probably use it!

John

> --
> Loïc Dachary, Artisan Logiciel Libre
Re: RFC: teuthology field in commit messages
On Sun, Nov 29, 2015 at 9:25 PM, Loic Dachary <l...@dachary.org> wrote:
>
> On 29/11/2015 21:47, John Spray wrote:
>> On Sun, Nov 29, 2015 at 8:25 PM, Loic Dachary <l...@dachary.org> wrote:
>>>
>>> On 29/11/2015 21:08, John Spray wrote:
>>>> On Sat, Nov 28, 2015 at 3:56 PM, Loic Dachary <l...@dachary.org> wrote:
>>>>> Hi Ceph,
>>>>>
>>>>> An optional teuthology field could be added to a commit message like so:
>>>>>
>>>>> teuthology: --suite rbd
>>>>>
>>>>> to state that this commit should be tested with the rbd suite. It
>>>>> could be parsed by bots and humans.
>>>>>
>>>>> It would make it easy and cost effective to run partial teuthology
>>>>> suites automatically on pull requests.
>>>>>
>>>>> What do you think ?
>>>>
>>>> Hmm, we are usually testing things at the branch/PR level rather than
>>>> on the per-commit level, so it feels a bit strange to have this in
>>>> the commit message.
>>>
>>> Indeed. But what is a branch if not the HEAD commit ?
>>
>> It's the HEAD commit, and its ancestors. So in a typical PR (or
>> branch) there are several commits since the base (i.e. since master),
>> and perhaps only one of them has a test suite marked on it, or maybe
>> they have different test suites marked on different commits in the
>> branch.
>>
>> It's not necessarily a problem, just something that would need to have
>> a defined behaviour (maybe when testing a PR, collect the "teuthology:"
>> tags from all commits in the PR, and run all the suites mentioned?).
>
> That's an interesting idea :-) My understanding is that we currently
> test a PR by scheduling suites on its HEAD. But maybe you sometimes
> schedule suites using a commit that's in the middle of a PR?

I think I've made this too complicated... What I meant was that while one
would schedule suites against the HEAD of the PR, that might not be the
same commit that has the logical testing information in it.

For example, I might have a main commit that has the "Fixes: " and
"teuthology: " tags, and then a second commit (that would be HEAD) which
e.g. tweaks a unit test. It would be weird if I had to put the
teuthology: tag on the unit-test commit rather than the functional one,
so I guess it would make sense to look at the teuthology: tags from all
the commits in a PR when scheduling it.

John
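The "collect tags from all commits in a PR" behaviour discussed in this
thread could be sketched as follows. The tag format follows the proposal;
the list of commit messages stands in for a real git/GitHub query:

```python
import re

# Matches a "teuthology: --suite <name>" line anywhere in a commit message.
TAG_RE = re.compile(r"^teuthology:\s*--suite\s+(\S+)", re.MULTILINE)

def suites_from_commits(commit_messages):
    """Gather every suite named by a teuthology: tag across all commits
    in a PR, so the tag can live on any commit, not just HEAD."""
    suites = set()
    for msg in commit_messages:
        suites.update(TAG_RE.findall(msg))
    return suites
```

With this behaviour a bot testing the PR's HEAD would still pick up a tag
placed on the main functional commit further down the branch.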
Re: nfsv41 over AF_VSOCK (nfs-ganesha)
On Mon, Oct 19, 2015 at 5:13 PM, John Spray <jsp...@redhat.com> wrote:
>> If you try this, send feedback.

OK, got this up and running. I've shared the kernel/qemu/nfsutils
packages I built here:
https://copr.fedoraproject.org/coprs/jspray/vsock-nfs/builds/
(at time of writing the kernel one is still building, and I'm running
with ganesha out of a source tree)

Observations:

* Running the VM as the qemu user gives EPERM opening the vsock device,
even after changing permissions on the device node (for which I guess
we'll want udev rules at some stage) -- is there a particular capability
that we need to grant the qemu user? I was looking into this to make it
convenient to run inside libvirt.

* NFS writes from the guest lag for around a minute before completing. My
hunch is that this is something in the NFS client recovery stuff (in
ganesha) that's not coping with vsock; the operations seem to complete at
the point where the server declares itself "NOT IN GRACE".

* For those (like myself) unaccustomed to running ganesha: do not run it
straight out of a source tree and expect everything to work. By default
even VFS exports won't work that way (mounts work but clients see an
empty tree) because it can't find the built FSAL .so. You can write a
config file that works, but it's easier just to make install it.

* (Anecdotal, seen while messing with other stuff) the client mount seems
to hang if I kill ganesha and then start it again; not sure if this is a
ganesha issue or a general vsock issue.

Cheers,
John
Re: MDS stuck in a crash loop
On Thu, Oct 22, 2015 at 1:43 PM, Milosz Tanski <mil...@adfin.com> wrote:
> On Wed, Oct 21, 2015 at 5:33 PM, John Spray <jsp...@redhat.com> wrote:
>> On Wed, Oct 21, 2015 at 10:33 PM, John Spray <jsp...@redhat.com> wrote:
>>>> John, I know you've got
>>>> https://github.com/ceph/ceph-qa-suite/pull/647. I think that's
>>>> supposed to be for this, but I'm not sure if you spotted any issues
>>>> with it or if we need to do some more diagnosing?
>>>
>>> That test path is just verifying that we do handle dirs without dying
>>> in at least one case -- it passes with the existing ceph code, so it's
>>> not reproducing this issue.
>>
>> Clicked send too soon, I was about to add...
>>
>> Milosz mentioned that they don't have the data from the system in the
>> broken state, so I don't have any bright ideas about learning more
>> about what went wrong here, unfortunately.
>
> Sorry about that, wasn't thinking at the time and just wanted to get
> this up and going as quickly as possible :(
>
> If this happens next time I'll be more careful to keep more evidence.
> I think multi-fs in the same rados namespace support would actually
> have helped here, since it makes it easier to create a new fs and leave
> the other one around (for investigation).

Yep, good point. I am a known enthusiast for multi-filesystem support :-)

> But it makes me wonder whether the broken dir scenario could be
> replicated by hand using rados calls. There's a pretty generic ticket
> there for "don't die on dir errors", but I imagine the code can be
> audited and steps to cause a synthetic error can be produced.

Yes, that part I have done (and will build into the automated tests in
due course) -- the bit that is still a mystery is how the damage occurred
to begin with.

John

> --
> Milosz Tanski
> CTO
> 16 East 34th Street, 15th floor
> New York, NY 10016
>
> p: 646-253-9055
> e: mil...@adfin.com
Re: MDS stuck in a crash loop
> John, I know you've got
> https://github.com/ceph/ceph-qa-suite/pull/647. I think that's
> supposed to be for this, but I'm not sure if you spotted any issues
> with it or if we need to do some more diagnosing?

That test path is just verifying that we do handle dirs without dying in
at least one case -- it passes with the existing ceph code, so it's not
reproducing this issue.
Re: MDS stuck in a crash loop
On Wed, Oct 21, 2015 at 10:33 PM, John Spray <jsp...@redhat.com> wrote:
>> John, I know you've got
>> https://github.com/ceph/ceph-qa-suite/pull/647. I think that's
>> supposed to be for this, but I'm not sure if you spotted any issues
>> with it or if we need to do some more diagnosing?
>
> That test path is just verifying that we do handle dirs without dying
> in at least one case -- it passes with the existing ceph code, so it's
> not reproducing this issue.

Clicked send too soon, I was about to add...

Milosz mentioned that they don't have the data from the system in the
broken state, so I don't have any bright ideas about learning more about
what went wrong here, unfortunately.

John
Re: nfsv41 over AF_VSOCK (nfs-ganesha)
On Fri, Oct 16, 2015 at 10:08 PM, Matt Benjamin wrote:
> Hi devs (CC Bruce--here is a use case for vmci sockets transport)
>
> One of Sage's possible plans for Manilla integration would use nfs over
> the new Linux vmci sockets transport integration in qemu (below) to
> access Cephfs via an nfs-ganesha server running in the host vm.
>
> This now experimentally works.

Very cool! Thank you for the detailed instructions, I look forward to
trying this out soon.

John

> Some notes on running nfs-ganesha over AF_VSOCK:
>
> 1. need Stefan Hajnoczi's patches for
>    * linux kernel (build w/ vhost-vsock support)
>    * qemu (build w/ vhost-vsock support)
>    * nfs-utils (in vm guest)
>
>    all linked from https://github.com/stefanha?tab=repositories
>
> 2. host and vm guest kernels must include vhost-vsock
>    * host kernel should load vhost-vsock.ko
>
> 3. start a qemu(-kvm) guest (w/ patched kernel) with a vhost-vsock-pci
> device, e.g.
>
>    /opt/qemu-vsock/bin/qemu-system-x86_64 -m 2048 -usb -name vsock1
>      --enable-kvm
>      -drive file=/opt/images/vsock.qcow,if=virtio,index=0,format=qcow2
>      -drive file=/opt/isos/f22.iso,media=cdrom
>      -net nic,model=virtio,macaddr=02:36:3e:41:1b:78 -net bridge,br=br0
>      -parallel none -serial mon:stdio
>      -device vhost-vsock-pci,id=vhost-vsock-pci0,addr=4.0,guest-cid=4
>      -boot c
>
> 4. nfs-ganesha (in host)
>    * need nfs-ganesha and its ntirpc rpc provider with vsock support
>      https://github.com/linuxbox2/nfs-ganesha (vsock branch)
>      https://github.com/linuxbox2/ntirpc (vsock branch)
>
>    * configure ganesha w/ vsock support
>      cmake -DCMAKE_INSTALL_PREFIX=/cache/nfs-vsock -DUSE_FSAL_VFS=ON
>        -DUSE_VSOCK -DCMAKE_C_FLAGS="-O0 -g3 -gdwarf-4" ../src
>
>    in ganesha.conf, add "nfsvsock" to the Protocols list in the EXPORT
>    block
>
> 5. mount in guest w/ nfs41 (e.g., in fstab):
>
>    2:// /vsock41 nfs noauto,soft,nfsvers=4.1,sec=sys,proto=vsock,clientaddr=4,rsize=1048576,wsize=1048576 0 0
>
> If you try this, send feedback.
>
> Thanks!
>
> Matt
>
> --
> Matt Benjamin
> Red Hat, Inc.
> 315 West Huron Street, Suite 140A
> Ann Arbor, Michigan 48103
>
> http://www.redhat.com/en/technologies/storage
>
> tel. 734-707-0660
> fax. 734-769-8938
> cel. 734-216-5309
Re: newstore direction
On Mon, Oct 19, 2015 at 8:49 PM, Sage Weil wrote:
> - We have to size the kv backend storage (probably still an XFS
> partition) vs the block storage. Maybe we do this anyway (put metadata
> on SSD!) so it won't matter. But what happens when we are storing gobs
> of rgw index data or cephfs metadata? Suddenly we are pulling storage
> out of a different pool and those aren't currently fungible.

This is the concerning bit for me -- for the other parts one "just" has
to get the code right, but this problem could linger and be something we
have to keep explaining to users indefinitely. It reminds me of cases in
other systems where users had to make an educated guess about inode size
up front, depending on whether they expected to efficiently store a lot
of xattrs.

In practice it's rare for users to make these kinds of decisions well
up-front: it really needs to be adjustable later, ideally automatically.
That could be pretty straightforward if the KV part was stored directly
on block storage, instead of having XFS in the mix.

I'm not quite up with the state of the art in this area: are there any
reasonable alternatives for the KV part that would consume some defined
range of a block device from userspace, instead of sitting on top of a
filesystem?

John
Re: MDS stuck in a crash loop
On Mon, Oct 12, 2015 at 3:36 AM, Milosz Tanski wrote:
> On Sun, Oct 11, 2015 at 6:44 PM, Milosz Tanski wrote:
>> On Sun, Oct 11, 2015 at 6:01 PM, Milosz Tanski wrote:
>>> On Sun, Oct 11, 2015 at 5:33 PM, Milosz Tanski wrote:

On Sun, Oct 11, 2015 at 5:24 PM, Milosz Tanski wrote:
> On Sun, Oct 11, 2015 at 1:16 PM, Gregory Farnum wrote:
>> On Sun, Oct 11, 2015 at 10:09 AM, Milosz Tanski wrote:
>>> About an hour ago my MDSs (primary and follower) started ping-pong
>>> crashing with this message. I've spent about 30 minutes looking into
>>> it but nothing yet.
>>>
>>> This is from a 0.94.3 MDS
>>>
>>> 0> 2015-10-11 17:01:23.596008 7fd4f52ad700 -1 mds/SessionMap.cc:
>>> In function 'virtual void C_IO_SM_Save::finish(int)' thread
>>> 7fd4f52ad700 time 2015-10-11 17:01:23.594089
>>> mds/SessionMap.cc: 120: FAILED assert(r == 0)
>>
>> These "r == 0" asserts pretty much always mean that the MDS did a
>> read or write to RADOS (the OSDs) and got an error of some kind back.
>> (Or in the case of the OSDs, access to the local filesystem returned
>> an error, etc.) I don't think these writes include any safety checks
>> which would let the MDS break it, which means that probably the OSD is
>> actually returning an error — odd, but not impossible.
>>
>> Notice that the assert happened in thread 7fd4f52ad700, and look for
>> the stuff in that thread. You should be able to find an OSD op reply
>> (on the SessionMap object) coming in and reporting an error code.
>> -Greg
>
> I only see two error ops in that whole MDS session. Neither one happened
> on the same thread (7f5ab6000700 in this file). But it looks like the
> only session map error is the -90 "Message too long" one.
> mtanski@tiny:~$ cat single_crash.log | grep 'osd_op_reply' | grep -v 'ondisk = 0'
> -3946> 2015-10-11 20:51:11.013965 7f5ab20f2700 1 --
> 10.0.5.31:6802/27121 <== osd.25 10.0.5.57:6804/32341 6163
> osd_op_reply(46349 mds0_sessionmap [writefull 0~95168363] v0'0 uv0
> ondisk = -90 ((90) Message too long)) v6 182+0+0 (2955408122 0 0)
> 0x3a55d340 con 0x3d5a3c0
> -705> 2015-10-11 20:51:11.374132 7f5ab22f4700 1 --
> 10.0.5.31:6802/27121 <== osd.28 10.0.5.50:6801/1787 5297
> osd_op_reply(48004 300.e274 [delete] v0'0 uv1349638 ondisk = -2
> ((2) No such file or directory)) v6 179+0+0 (1182549251 0 0)
> 0x66c5c80 con 0x3d5a7e0
>
> Any idea what this could be, Greg?

To follow this up, I found this ticket from 9 months ago: http://tracker.ceph.com/issues/10449

In there Yan says: "it's a kernel bug. hang request prevents mds from trimming completed_requests in sessionmap. there is nothing to do with mds. (maybe we should add some code to MDS to show warning when this bug happens)"

When I was debugging this I saw an OSD (not cephfs client) operation stuck for a long time, along with the MDS error:

HEALTH_WARN 1 requests are blocked > 32 sec; 1 osds have slow requests; mds cluster is degraded; mds0: Behind on trimming (709/30)
1 ops are blocked > 16777.2 sec
1 ops are blocked > 16777.2 sec on osd.28

I did eventually bounce the OSD in question and it hasn't become stuck since, but the MDS is still eating it every time with the "Message too long" error on the session map. I'm not quite sure where to go from here.

>>> First time I had a chance to use the new recovery tools. I was able to
>>> replay the journal, reset it, and then reset the sessionmap. The MDS
>>> returned back to life and so far everything looks good. Yay.
>>>
>>> Triggering this bug/issue is a pretty interesting set of steps.
>>
>> Spoke too soon, a missing dir is now causing the MDS to restart itself.
>> -6> 2015-10-11 22:40:47.300169 7f580c7b9700 5 -- op tracker --
>> seq: 4, time: 2015-10-11 22:40:47.300168, event: finishing request,
>> op: client_request(client.3597476:21480382 rmdir #100015e0be2/58
>> 2015-10-11 21:34:49.224905 RETRY=36)
>> -5> 2015-10-11 22:40:47.300208 7f580c7b9700 5 -- op tracker --
>> seq: 4, time: 2015-10-11 22:40:47.300208, event: cleaned up request,
>> op: client_request(client.3597476:21480382 rmdir #100015e0be2/58
>> 2015-10-11 21:34:49.224905 RETRY=36)
>> -4> 2015-10-11 22:40:47.300231 7f580c7b9700 5 -- op tracker --
>> seq: 4, time: 2015-10-11 22:40:47.300231, event: done, op:
>> client_request(client.3597476:21480382 rmdir #100015e0be2/58
>> 2015-10-11 21:34:49.224905 RETRY=36)
>> -3> 2015-10-11 22:40:47.300284 7f580e0bd700 0
>> mds.0.cache.dir(100048df076) _fetched missing object for [dir
>> 100048df076 /petabucket/beta/6d/f6/ [2,head] auth v=0 cv=0/0 ap=1+0+0
>> state=1073741952 f() n()
Re: advice on indexing sequential data?
On Thu, Oct 1, 2015 at 11:44 AM, Tom Nakamura wrote:
> Hello ceph-devel,
>
> My lab is concerned with developing a data mining application for
> detecting and 'deanonymizing' spamming botnets from high-volume spam
> feeds.
>
> Currently, we just store everything in large mbox files in directories
> separated by day (think: /data/feed1/2015/09/13.mbox) on a single NFS
> server. We have ad-hoc scripts to extract features from these mboxes and
> pass them to our analysis pipelines (written in a mixture of
> R/matlab/python/etc). This system is reaching its limit.
>
> We already have a small Ceph installation with which we've had good luck
> for storing other data, and would like to see how we can use it to solve
> our mail problem. Our basic requirements are that:
>
> - We need to be able to access each message by its extracted features.
>   These features include simple information found in its header (for
>   example From: and To:) as well as more complex information like
>   signatures from attachments and network information (for example,
>   presence in blacklists).
> - We will frequently add/remove features.
> - Faster access to recent data is more important than to older data.
> - Maintaining strict ordering of incoming messages is not necessary. In
>   other words, if we received two spam messages on our feeds, it doesn't
>   matter too much if they are stored in that order, as long as we can have
>   coarse-grained temporal accuracy (say, 5 minutes). So we don't need
>   anything as sophisticated as Zlog.
> - We need to be able to remove messages older than some specific age,
>   due to storage constraints.
>
> Any advice on how to use Ceph and librados to accomplish this? Here are
> my initial thoughts:
>
> - Each message is an object with some unique ID. Use omap to store all
>   its features in the same object.
> - For each time period (which will have to be pre-specified to, say, an
>   hour), we have an object which contains a list of IDs, as a bytestring
>   of concatenated IDs. This should make expiring old messages trivial.
> - For each feature, we have a timestamped index (like
>   20150930-from-...@bar.com or
>   20150813-has-attachment-with-hash-123abddeadbeef) which contains a
>   list of IDs.
> - Hopefully use RADOS classes to index/feature-extract on the OSDs.
>
> How does this sound? One glaring omission is that I do not know how to
> create indices which would support querying by inequality/ranges ('find
> all messages between 1000 and 2000 bytes').

I would suggest some sort of hybrid approach, where you store your messages and your time index in ceph (so that you can insert data and expire data all within ceph), then use an external database for the queries your application layer is interested in. That way the external database becomes somewhat disposable (you can always rebuild it efficiently for any given time period by consulting your time index in ceph), but you don't have to implement any multi-axis querying inside ceph.

With that kind of approach, you don't have to worry about implementing indices (let the existing database do it), but you do still have to worry about recovery from failure, i.e. keeping the ceph store and the database index in sync. You might need a "regenerate data for this time period" call that re-inserts the last 5 minutes of emails into the database after a failure of whatever is injecting the data.

John
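A minimal sketch of the time-period index idea above. The object-name prefix, 5-minute bucket width, and 16-byte fixed-width ID format are all made up for illustration; in librados the per-bucket bytestring would be built with append operations on the bucket object, and expiry would be deleting whole bucket objects older than the cutoff.

```cpp
// Sketch: message IDs are grouped into one index object per coarse time
// bucket, so expiring old data means deleting whole bucket objects.
// Names and the fixed 16-byte ID width are illustrative assumptions.
#include <cstdint>
#include <cstdio>
#include <string>

// Name of the bucket object covering a given unix timestamp.
std::string bucket_name(uint64_t unix_ts, uint64_t bucket_secs = 300) {
  char buf[32];
  std::snprintf(buf, sizeof(buf), "msgidx.%012llu",
                static_cast<unsigned long long>(unix_ts / bucket_secs));
  return buf;
}

// Append a fixed-width ID to a bucket's concatenated-ID bytestring.
void append_id(std::string& bucket_data, const std::string& id) {
  std::string fixed = id;
  fixed.resize(16, '\0');  // pad/truncate to 16 bytes so entries are scannable
  bucket_data += fixed;
}

// Number of IDs recoverable from a bucket object's contents.
size_t id_count(const std::string& bucket_data) {
  return bucket_data.size() / 16;
}
```

Because bucket names sort lexicographically by time, "remove everything older than N days" is a prefix-ordered scan-and-delete over bucket objects, with no per-message bookkeeping.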
Re: libcephfs invalidate upcalls
On Sat, Sep 26, 2015 at 8:03 PM, Matt Benjamin wrote:
> Hi John,
>
> I prototyped an invalidate upcall for libcephfs and the Ganesha Ceph fsal,
> building on the Client invalidation callback registrations.
>
> As you suggested, NFS (or AFS, or DCE) minimally expect a more generic
> "cached vnode may have changed" trigger than the current inode and dentry
> invalidates, so I extended the model slightly to hook cap revocation;
> feedback appreciated.

In cap_release, we probably need to be a bit more discriminating about when to drop; e.g. if we've only lost our exclusive write caps, the rest of our metadata might all still be fine to cache.

Is ganesha in general doing any data caching? I think I had implicitly assumed that we were only worrying about metadata here, but now I realise I never checked that.

The awkward part is Client::trim_caps. In that case, the lru_is_expirable part won't be true until something has already been invalidated, so there needs to be an explicit hook there -- rather than invalidating in response to cap release, we need to invalidate in order to get ganesha to drop its handle, which will render something expirable, and finally when we expire it, the cap gets released.

In that case maybe we need a hook in ganesha to say "invalidate everything you can" so that we don't have to make a very large number of function calls to invalidate things. In the fuse/kernel case we can only sometimes invalidate a piece of metadata (e.g. we can't if it's flocked or whatever), so we ask it to invalidate everything. But perhaps in the NFS case we can always expect our invalidate calls to be respected, so we could just invalidate a smaller number of things (the difference between actual cache size and desired)?

John

> g...@github.com:linuxbox2/ceph.git , branch invalidate
> g...@github.com:linuxbox2/nfs-ganesha.git , branch ceph-invalidates
>
> thanks,
>
> Matt
>
> --
> Matt Benjamin
> Red Hat, Inc.
> 315 West Huron Street, Suite 140A
> Ann Arbor, Michigan 48103
>
> http://www.redhat.com/en/technologies/storage
>
> tel. 734-761-4689
> fax. 734-769-8938
> cel. 734-216-5309
[Manila] CephFS native driver
Hi all,

I've recently started work on a CephFS driver for Manila. The (early) code is here:
https://github.com/openstack/manila/compare/master...jcsp:ceph

It requires a special branch of ceph, which is here:
https://github.com/ceph/ceph/compare/master...jcsp:wip-manila

This isn't done yet (hence this email rather than a gerrit review), but I wanted to give everyone a heads up that this work is going on, along with a brief status update.

This is the 'native' driver in the sense that clients use the CephFS client to access the share, rather than re-exporting it over NFS. The idea is that this driver will be useful for anyone who has such clients, as well as acting as the basis for a later NFS-enabled driver.

The export location returned by the driver gives the client the Ceph mon IP addresses, the share path, and an authentication token. This authentication token is what permits the clients access (Ceph does not do access control based on IP addresses).

It's just capable of the minimal functionality of creating and deleting shares so far, but I will shortly be looking into hooking up snapshots/consistency groups, albeit for read-only snapshots only (cephfs does not have writeable snapshots). Currently deletion is just a move into a 'trash' directory; the idea is to add something later that cleans this up in the background. The downside to the "shares are just directories" approach is that clearing them up has an "rm -rf" cost!

A note on the implementation: cephfs recently got the ability (not yet in master) to restrict client metadata access based on path, so this driver is simply creating shares by creating directories within a cluster-wide filesystem, and issuing credentials to clients that restrict them to their own directory. They then mount that subpath, so that from the client's point of view it's like having their own filesystem.

We also have a quota mechanism that I'll hook in later to enforce the share size. Currently the security here requires clients (i.e.
the ceph-fuse code on client hosts, not the userspace applications) to be trusted, as quotas are enforced on the client side. The OSD access control operates on a per-pool basis, and creating a separate pool for each share is inefficient. In the future it is expected that CephFS will be extended to support file layouts that use RADOS namespaces, which are cheap, such that we can issue a new namespace to each share and enforce the separation between shares on the OSD side.

However, for many people the ultimate access control solution will be to use an NFS gateway in front of their CephFS filesystem: it is expected that an NFS-enabled cephfs driver will follow this native driver in the not-too-distant future.

This will be my first openstack contribution, so please bear with me while I come up to speed with the submission process. I'll also be in Tokyo for the summit next month, so I hope to meet other interested parties there.

All the best,
John
Re: full cluster/pool handling
On Thu, Sep 24, 2015 at 7:26 PM, Gregory Farnum wrote:
> On Thu, Sep 24, 2015 at 8:04 AM, Sage Weil wrote:
>> On Thu, 24 Sep 2015, Robert LeBlanc wrote:
>>> -----BEGIN PGP SIGNED MESSAGE-----
>>> Hash: SHA256
>>>
>>> On Thu, Sep 24, 2015 at 6:30 AM, Sage Weil wrote:
>>> > Xuan Liu recently pointed out that there is a problem with our handling
>>> > for full clusters/pools: we don't allow any writes when full,
>>> > including delete operations.
>>> >
>>> > While fixing a separate full issue I ended up making several fixes and
>>> > cleanups in the full handling code in
>>> >
>>> > https://github.com/ceph/ceph/pull/6052
>>> >
>>> > The interesting part of that is that we will allow a write as long as it
>>> > doesn't increase the overall utilization of bytes or objects (according to
>>> > the pg stats we're maintaining). That will include remove ops, of course,
>>> > but will also allow overwrites while full, which seems fair.
>>>
>>> What about overwrites on a COW FS, won't that still increase used
>>> space? Maybe if it is a COW FS, don't allow overwrites?
>>
>> Yeah, we could strengthen (optionally?) the check so that only operations
>> that result in a net decrease are allowed.
>
> It's not just COW filesystems, anything that modifies
> leveldb/rocksdb/whatever in any way will also increase the space used
> — including regular object deletes, which additionally get added to the
> PG log, although *hopefully* that's not a problem since we have our
> extra buffers to handle this sort of thing. While right now we might
> have some hope of being able to tag ops as "net deletes" or "net
> adds", I don't see that happening once we have widespread third-party
> object classes or that Lua work gets in or something...
>
> So, I'd be really leery of trying to do anything more advanced than
> letting clients execute delete operations, and letting privileged
> clients keep doing real work. (Or maybe restricting it entirely to the
> second half there.)
> That latter switch already exists, by the way, although I don't think
> it's actually enforced via cephx caps (it should be) — the Objecter
> has an honor_full_flag setting which the MDS sets to false. I don't
> think the library interfaces are there to specify it per-op but IIRC
> it is part of the data sent to OSDs so it wouldn't require a wire
> protocol change.

I don't think it's sent with the ops; the OSDs simply check if the requester's entity type is MDS.

John

> -Greg
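The admission rule being discussed above can be sketched in a few lines: when the full flag is set, allow an op only if its projected net effect does not increase utilization, with a bypass for privileged requesters such as the MDS. This is a simplified illustration of the idea in the thread, not the actual OSD code path, and as Greg notes the projected deltas can understate real backend growth (COW filesystems, leveldb/rocksdb overhead).

```cpp
// Simplified sketch of full-pool admission: permit only net-non-increasing
// ops, or anything from an MDS (which bypasses the full check per the
// honor_full_flag discussion). Illustrative only.
#include <cstdint>

struct OpEffect {
  int64_t byte_delta;    // projected change in stored bytes
  int64_t object_delta;  // projected change in object count
};

bool allow_when_full(const OpEffect& op, bool requester_is_mds) {
  if (requester_is_mds)
    return true;  // privileged: the MDS must make progress (e.g. journaling)
  // Deletes and same-size overwrites pass; anything that grows is refused.
  return op.byte_delta <= 0 && op.object_delta <= 0;
}
```

Under this rule a delete (`{-100, -1}`) and an in-place overwrite (`{0, 0}`) are admitted, while any growing write is refused unless it comes from the MDS.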
Re: Review Request
This is neat. I had been hankering to lambda-ize various things, but hadn't worked through what the allocation would look like in practice.

Do you know if there's a reason the standard is defined so as to not let us override the reserved size of a std::function? Are we taking any kind of hit by doing this?

John

P.S. I have just gone on Amazon to order a C++11 book because I need to start using more than just lambda, auto, and pretty for loops ;-)

On Wed, Sep 16, 2015 at 8:06 PM, Adam C. Emerson wrote:
> On 09/16/2015 02:31 PM, Gregory Farnum wrote:
>> Can you provide a little background (links or text) for those of us
>> who aren't up on C++11x/14x? I looked at them briefly but having only
>> the vaguest idea about some of them quickly got lost. :)
>
> Surely, sir.
>
> So, in summary, Context* is the means we have been using to handle
> callbacks. It has two big problems:
>
> (1) Every Context is allocated at the point of use and freed when it's
> called. So by their nature they make heavy use of the heap. (If you
> overload complete there's a few exceptions, but by and large Context is
> very heapy.)
>
> (2) Context accepts an int. That's it. You can get around this by
> storing other things in it and passing pointers to it and so forth. But
> it would be nice to have more variety of type in our callbacks.
>
> C++11 (and following) have a whole pile of things that can be used as
> functions. There are function objects (objects with an operator()
> defined on them), lambdas with variable capture, function member
> pointers, reference wrappers pointing to function objects, and the like.
> This gives you a good bit of flexibility, allows things like variable
> capture, lets you allocate one object once and just pass references to
> it, and so forth.
>
> Now, these objects all have different sizes and different types. So you
> can't just shove them in an object naively.
> Because a class one of whose
> members is a function pointer is going to be a different type than a
> class holding a function object. (Which itself is different to a type
> holding a reference to a function object.)
>
> std::function exists as a solution to this problem. It provides a
> uniform type that can hold any object satisfying the requirement of
> being callable with given argument types and a given return value. So,
> for example, if you have some 'stat'-like call, you could specify it as
> taking a function taking an error code, a size, and a date. Anything
> that can be called with such arguments would be accepted; anything that
> can't (wrong argument types, say) would be rejected at compile time. And
> it could be uniformly stored in a list of Operations.
>
> The downside is that if the thing you provide is too big, std::function
> will allocate. 'Too big' depends on your library vendor and there's no
> way to find out what it is. Thus the purpose of the ceph::function class.
> If we have a pretty good idea how large most of the callback function
> objects we expect to hold are, we can tell it to statically allocate
> that much space. This gives us a tool to get allocations out of our fast
> path. (For example, if we preallocate a bunch of classes with a
> ceph::function that preallocates enough space to hold likely callbacks,
> we can just pick them off and have no allocations, in the usual case.)
>
> If we know /everything/ we'll ever get, we can disable allocations
> entirely. This is mostly so we can catch situations where we're trying
> to shove something unexpectedly huge somewhere it ought not go. An
> internal sanity check.
>
> Now, ceph::function, like std::function, is still an abstraction with a
> virtual call in it, and because it copies things around it reduces
> opportunities for inlining. Thus, if you aren't storing a callback on an
> object that's supposed to be in a queue, you shouldn't use it.
> You should do something like:
>
>   template <typename Fun>
>   void do_stuff(Fun&& f) {
>     ...;
>     f(some, values);
>   }
>
> This allows inlining, and (based on my experiments trying multiple
> implementations and looking at the generated assembly) if f is called in
> a loop, the code is just as good as if you'd open-coded the loop and
> written the body of 'f' in it. Functions like this have the
> potential to use closures (lambda expressions) effectively for free.
>
> So, the summary is:
>
> Context* is very heapy. We have less heapy alternatives. This implements
> a foundation for one of them. We also get a bit more flexibility.
>
> Thank you.
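The two callback-passing styles Adam contrasts can be shown side by side. This sketch only illustrates the trade-off being described (template parameter = inlinable, no type erasure; std::function = uniformly storable, but may heap-allocate when the callable outgrows the library's small-buffer size); it does not show ceph::function itself.

```cpp
// Style 1 vs. style 2 from the thread, on a trivial workload.
#include <functional>
#include <vector>

// Style 1: template parameter, as in the do_stuff() example above.
// The compiler sees the concrete callable type and can inline the call.
template <typename Fun>
int sum_with(const std::vector<int>& v, Fun&& f) {
  int total = 0;
  for (int x : v)
    total += f(x);
  return total;
}

// Style 2: std::function, needed when callbacks of differing concrete
// types must be stored uniformly (e.g. a queue of pending Operations).
// A large capture here may trigger a heap allocation inside std::function.
int sum_stored(const std::vector<int>& v, const std::function<int(int)>& f) {
  int total = 0;
  for (int x : v)
    total += f(x);
  return total;
}
```

Both produce identical results; the difference is purely in storability versus inlining and allocation behavior, which is the gap ceph::function is described as filling.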
Re: Backporting from Infernalis and c++11
On Tue, Sep 15, 2015 at 8:21 AM, Loic Dachary wrote:
> With Infernalis, Ceph moves to c++11 (and CMake), so we will see more conflicts
> when backporting bug fixes to Hammer. Any ideas you may have to better deal
> with this would be most welcome. Since these conflicts will be mostly
> cosmetic, they should not be too difficult to resolve. The trick will be for
> someone not familiar with the codebase to separate what is cosmetic and what
> is not.
>
> This does not happen yet, so no immediate concern :-) Maybe if we think about
> it well in advance we'll be in a better position to deal with it later on?

I think this came up in conversation but wasn't necessarily made official policy yet -- my understanding is that we are (already) endeavouring to avoid c++11isms in bug fixes, along with the usual principle of fixing bugs in the smallest/neatest patch we can.

Perhaps in cases where those of us working on master mistakenly put something un-backportable in a bug fix, it would be reasonable for the backporter to point it out and poke us for a clean version of the patch.

John
Re: Brewer's theorem also known as CAP theorem
On Tue, Sep 15, 2015 at 1:38 PM, Owen Synge wrote:
> On Mon, 14 Sep 2015 13:57:26 -0700
> Gregory Farnum wrote:
>
>> The OSD is supposed to stay down if any of the networks are missing.
>> Ceph is a CP system in CAP parlance; there's no such thing as a CA
>> system. ;)
>
> I know I am being fussy, but within my team your email was cited as saying
> that you cannot consider ceph a CA system. Hence I make my argument in
> public so I can be humbled in public.
>
> Just to clarify your opinion, I cite
>
> http://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed
>
> which suggests:
>
>   The CAP theorem states that any networked shared-data system can have
>   at most two of three desirable properties:
>
>   * consistency (C), equivalent to having a single up-to-date copy of
>     the data;
>   * high availability (A) of that data (for updates);
>   * tolerance to network partitions (P).
>
> So I dispute that a CA system cannot exist.
>
> I think you are too absolute even in interpretation of this vague
> theory. A further quote from the author of said theorem, from the same
> article:
>
>   The "2 of 3" formulation was always misleading because it tended to
>   oversimplify the tensions among properties.
>
> As I understand it:
>
> Ceph as a cluster always provides "Consistency" (or else you found a
> bug).
>
> If a ceph cluster is operating, it will always provide acknowledgment
> (it may block) to the client of whether the operation has succeeded
> or failed, hence provides "Availability".
>
> If a ceph cluster is partitioned, only one partition will continue
> operation; hence you cannot consider the system "Partition" tolerant,
> as multiple parts of the system cannot operate when partitioned.

The technical meaning of partition tolerance is that the system continues to provide a service to clients in the face of a partition, not that it splits into multiple separately operating units.
Given the general definition of partition (any network failure, any host failure), any real physical network has the 'P' part, so only CP and AP systems are physically meaningful. In order to imagine a CA system you have to imagine a perfect network.

There has been plenty of confusion in the past as CAP terminology went mainstream, so there are also plenty of blogs providing longer explanations:
http://codahale.com/you-cant-sacrifice-partition-tolerance/
http://www.quora.com/Whats-the-difference-between-CA-and-CP-systems-in-the-context-of-CAP-Consistency-Availability-and-Partition-Tolerance

> Hence as a cluster ceph is CA.
>
> Alternatively, if you look at it from an OSD rather than cluster
> perspective, you can get the perspective you take: OSDs are a CP system
> in CAP parlance.
>
> I would argue it is all a matter of perspective

Not so much a matter of perspective, more that the words involved are used in quite specific technical ways. If you try to understand it in plain English it seems ambiguous, but the way the terms are used within this field is quite clear cut.

John

> , and believe that to
> call Brewer's theorem anything other than guidance without strong
> discussion of your understanding of consistency and its interaction with
> partitioning is to confuse and over simplify.
>
> Best regards
>
> Owen
Should pgls filters be part of object classes?
So, I've got this very-cephfs-specific piece of pgls filtering code in ReplicatedPG:
https://github.com/ceph/ceph/commit/907a3c5a2ba8e3edda18d7edf89ccae7b9d91dc5

I'm not sure I'm sufficiently motivated to create some whole new plugin framework for these, but what about piggy-backing on object classes? I guess it would be an additional cls_register_filter(myfilter, my_callback_constructor) fn.

tl;dr: How do people feel about extending object classes to include providing PGLS filters as well?

John
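To show roughly what piggy-backing on object classes could mean, here is a hedged sketch of a name-to-constructor filter registry. The `cls_register_filter` signature, the `PGLSFilter` shape, and the filter names are all hypothetical, mirroring the cls method-registration style rather than any real API.

```cpp
// Hypothetical registry: object classes register pgls filters by name;
// the OSD would look one up when a client names it in a pgls call.
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>

struct PGLSFilter {
  virtual ~PGLSFilter() = default;
  // Return true if the object should be included in the listing.
  virtual bool filter(const std::string& obj_name) const = 0;
};

using FilterCtor = std::function<std::unique_ptr<PGLSFilter>()>;

std::map<std::string, FilterCtor>& filter_registry() {
  static std::map<std::string, FilterCtor> reg;  // name -> constructor
  return reg;
}

// Analogue of the proposed cls_register_filter(myfilter, ctor).
void cls_register_filter(const std::string& name, FilterCtor ctor) {
  filter_registry()[name] = std::move(ctor);
}

// Instantiate the filter a client asked for, or nullptr if unknown.
std::unique_ptr<PGLSFilter> make_filter(const std::string& name) {
  auto it = filter_registry().find(name);
  return it == filter_registry().end() ? nullptr : it->second();
}

// Example filter: include only objects whose name starts with a prefix.
struct PrefixFilter : PGLSFilter {
  std::string prefix;
  explicit PrefixFilter(std::string p) : prefix(std::move(p)) {}
  bool filter(const std::string& n) const override {
    return n.rfind(prefix, 0) == 0;
  }
};

// Example registration, as an object class's init hook might do it.
inline bool register_example_filter() {
  cls_register_filter("example.prefix",
                      [] { return std::make_unique<PrefixFilter>("parent_"); });
  return true;
}
inline const bool example_registered = register_example_filter();
```

The appeal of this route is that it reuses the existing cls loading and naming machinery instead of inventing a parallel plugin framework just for listing filters.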
Re: Transitioning Ceph from Autotools to CMake
OK, here are vstart + ceph.in changes that work well enough in my out-of-tree build:

https://github.com/ceph/ceph/pull/5457

John

On Mon, Aug 3, 2015 at 11:09 AM, John Spray <jsp...@redhat.com> wrote:
> On Sat, Aug 1, 2015 at 8:24 PM, Orit Wasserman <owass...@redhat.com> wrote:
>> 3. no vstart.sh; started working on this too but have less progress here.
>> At the moment, in order to use vstart I copy the exes and libs to the src dir.
>
> I just started playing with CMake on Friday, adding some missing cephfs
> bits. I was going to fix (3) as well, but I don't want to duplicate work --
> do you have an existing branch at all? Presumably this will mostly be a
> case of adding appropriate prefixes to commands.
>
> Cheers,
> John
Re: Transitioning Ceph from Autotools to CMake
On Sat, Aug 1, 2015 at 8:24 PM, Orit Wasserman <owass...@redhat.com> wrote:
> 3. no vstart.sh; started working on this too but have less progress here.
> At the moment, in order to use vstart I copy the exes and libs to the src dir.

I just started playing with CMake on Friday, adding some missing cephfs bits. I was going to fix (3) as well, but I don't want to duplicate work -- do you have an existing branch at all? Presumably this will mostly be a case of adding appropriate prefixes to commands.

Cheers,
John
Re: Installing Ceph without root privilege
On 31/07/15 07:10, Piyawath Boukom wrote:
> Dear Ceph-dev team,
>
> My name is Piyawath. I'm trying to set up Ceph on a university machine.
> Unfortunately, I am only able to install/set it up in user mode (not
> root), including creating the Ceph user and other processes. Is it
> possible to do the setup without root privilege? I'm quite new in this
> field, please forgive me if I have ignorantly asked you.

Hi,

I guess if you're installing as non-root then you probably just want something to experiment with a bit? The easiest thing may be to compile from source:
http://ceph.com/docs/next/install/build-ceph/

...and then create a temporary vstart cluster:
http://ceph.com/docs/next/dev/quick_guide/

This is not at all suitable for putting any real data in, but will let you play around a bit with the ceph command line etc. If you need a more realistic system, I would suggest asking your administrator for a way to create virtual machines, so that you can have root in the virtual machines and install packages there.

John
Re: vstart runner for cephfs tests
On 23/07/15 12:56, Mark Nelson wrote:
> I had similar thoughts on the benchmarking side, which is why I started
> writing cbt a couple of years ago. I needed the ability to quickly spin up
> clusters and run benchmarks on arbitrary sets of hardware. The outcome
> isn't perfect, but it's been extremely useful for running benchmarks, and
> it sort of exists as a half-way point between vstart and teuthology.
>
> The basic idea is that you give it a yaml file that looks a little bit like
> a teuthology yaml file and cbt will (optionally) build a cluster across a
> number of user-defined nodes with pdsh, start various monitoring tools
> (this is ugly right now, I'm working on making it modular), and then sweep
> through user-defined benchmarks and sets of parameter spaces. I have a
> separate tool that will sweep through ceph parameters, create ceph.conf
> files for each space, and run cbt with each one, but the eventual goal is
> to integrate that into cbt itself.
>
> Though I never really intended it to run functional tests, I just added
> something that looks very similar to the rados suite so I can benchmark
> ceph_test_rados for the new community lab hardware. I already had a
> mechanism to inject OSD down/out up/in events, so with a bit of squinting
> it can give you a very rough approximation of a workload using the osd
> thrasher. If you are interested, I'd be game to see if we could integrate
> your cephfs tests as well (I eventually wanted to add cephfs benchmark
> capabilities anyway).

Cool -- my focus is very much on tightening the code-build-test loop for developers, but I can see us needing to extend that into a code-build-test-bench loop as we do performance work on cephfs in the future.

Does cbt rely on having ceph packages built, or does it blast the binaries directly from src/ onto the test nodes?

John
Re: vstart runner for cephfs tests
On 23/07/15 12:23, Loic Dachary wrote:
> You may be interested by https://github.com/ceph/ceph/blob/master/src/test/ceph-disk-root.sh which is conditionally included https://github.com/ceph/ceph/blob/master/src/test/Makefile.am#L86 by --enable-root-make-check https://github.com/ceph/ceph/blob/master/configure.ac#L414
>
> If you're reckless and trust the tests not to break (a crazy proposition by definition IMHO ;-), you can
>
>   make TESTS=test/ceph-disk-root.sh check
>
> If you want protection, you do the same in a docker container with
>
>   test/docker-test.sh --os-type centos --os-version 7 --dev make TESTS=test/ceph-disk-root.sh check
>
> I tried various strategies to make tests requiring root access more accessible and less scary and that's the best compromise I found. test/docker-test.sh is what the make check bot uses.

Interesting, I didn't realise we already had root-ish tests in there. At some stage the need for root may go away in ceph-fuse, as in principle fuse mount/unmounts shouldn't require root. If not, then putting an outer docker wrapper around this could make sense, if we publish the built binaries into the docker container via a volume or somesuch. I am behind on familiarizing myself with the dockerised tests.

> When a test can be used both from sources and from teuthology, I found it more convenient to have it in the qa/workunits directory which is available in both environments. Who knows, maybe you will want a vstart based cephfs test to run as part of make check, in the same way https://github.com/ceph/ceph/blob/master/src/test/cephtool-test-mds.sh does.

Yes, this crossed my mind. At the moment, even many of the quick tests/cephfs tests take tens of seconds, so they are probably a bit too big to go in a default make check, but for some of the really simple things that are currently done in cephtool/test.sh, I would be tempted to move them into the python world to make them a bit less fiddly.
The test location is a bit challenging, because we essentially have two not-completely-stable interfaces here, vstart and teuthology. Because teuthology is the more complicated, for the moment it makes sense for the tests to live in that git repo. Long term it would be nice if fine-grained functional tests lived in the same git repo as the code they're testing, but I don't really have a plan for that right now outside of the probably-too-radical step of merging ceph-qa-suite into the ceph repo.

John
vstart runner for cephfs tests
Audience: anyone working on cephfs, general testing interest.

The tests in ceph-qa-suite/tasks/cephfs are growing in number, but are kind of inconvenient to run because they require teuthology (and therefore require built packages, locked nodes, etc). Most of them don't actually require anything beyond what you already have in a vstart cluster, so I've adapted them to optionally run that way. The idea is that we can iterate a lot faster when writing new tests (one less excuse not to write them) and get better use out of the tests when debugging things and testing fixes. teuthology is fine for mass-running the nightlies etc, but it's overkill for testing individual bits of MDS/client functionality.

The code is currently on the wip-vstart-runner ceph-qa-suite branch, and the two magic commands are:

1. Start a vstart cluster with a couple of MDSs, as your normal user:

$ make -j4 rados ceph-fuse ceph-mds ceph-mon ceph-osd cephfs-data-scan cephfs-journal-tool cephfs-table-tool && ./stop.sh ; rm -rf out dev ; MDS=2 OSD=3 MON=1 ./vstart.sh -d -n

2. Invoke the test runner, as root (replace paths and test name as appropriate; leave off the test name to run everything):

# PYTHONPATH=/home/jspray/git/teuthology/:/home/jspray/git/ceph-qa-suite/ python /home/jspray/git/ceph-qa-suite/tasks/cephfs/vstart_runner.py tasks.cephfs.test_strays.TestStrays.test_migration_on_shutdown

test_migration_on_shutdown (tasks.cephfs.test_strays.TestStrays) ... ok
----------------------------------------------------------------------
Ran 1 test in 121.982s

OK

^^^ see! two minutes, and no waiting for gitbuilders!

The main caveat here is that it needs to run as root in order to mount/unmount things, which is a little scary. My plan is to split it out into a little root service for doing mount operations, and then let the main test part run as a normal user and call out to the mounter service when needed.
Cheers,
John
CephFS fsck tickets
Lots of new tickets in tracker related to forward scrub; this is a handy list (mainly for Greg and myself) that maps to the notes from our design discussion. They're all under the 'fsck' category in tracker.

John

Tagging prerequisite:
- forward scrub (traverse tree, at least)
#12255 - create scrub header or scrub map in MDLog, like subtreemap, make scrub startup wait until this is written before going ahead
#12257 - add recovery of scrub header during MDS replay, and re-start any scrubs that were ongoing
#12258 - add tagged-scrub command taking path tag
#12273 - actually apply the tag during the RADOS op for reading something
#12274 - start the process from all subtree roots on all MDSs, and skip non-auth regions
#12275 - block migration during tagging scrub OR
- when migrating, look up to parent subtree and send along info about the scrub if we haven't been scrubbed in this tag yet
- handle MDS rank shutdown vs. ongoing scrub

Backtrace handling
#12277 - during backward scrub, tag our parent with the most recent backtrace seen, even if we already created it with a less recent backtrace (set_if_greater on a backtrace xattr) as a hint to a subsequent forward scrub step (ready for this now)
#12278 - during forward scrub, look at these hints, and move folders around if we have more recent linkage information from the backtrace (maybe wait to do this later, it's a corner)

Backward scrub online
#12279 - Add hooks to MDS to enable backward scrub to lookup_ino and work out if an orphaned ino is actually orphaned or just more recently created than the last scrub.
- Add hooks to MDS to enable injection to be done online (i.e. inject linkage RPC)
#12280 - Use hooks from backward scrub in a new mode that is readonly on the RADOS pools, and mediates all writes through a running MDS.
Handling purged strays:
#12281
* StrayManager needs to respect PIN_SCRUBQUEUE (remove it from the scrub stack before purging it) and the Locker locks that scrubstack uses (don't purge it until the ongoing scrub RADOS ops are done) (added a note to #11950)
* if/when we add purgequeue (#11950), backward scrub will also need to interrogate that to determine non-orphan-ness of an inode, or enforce that it gets flushed before doing anything.

#11859 - DamageTable

Status & scheduling
#12282 - List/progress/abort/pause commands for ongoing scrub
#12283 - Time of day scheduling for scrub
#12284 - Deprioritise/pause scrub on highly loaded systems (loaded MDS, loaded RADOS)
Re: python facade pattern implementation in ceph and ceph-deploy is bad practice?
Owen,

Please can you say what your overall goal is with the recent ceph-deploy patches? Whether a given structure is right or wrong is really a function of the overall direction. If you are aiming to make ceph-deploy into an extensible framework that can do things in parallel, then you need to say so, and that's a bigger conversation about whether that's a reasonable goal for a tool which has so far made a virtue of maintaining modest ambitions.

Along these lines, I noticed that you recently submitted a pull request to ceph-deploy that added a dependency on SQLAlchemy, and several hundred lines of extra boilerplate code -- this kind of thing is unlikely to get a warm reception without a stronger justification for the extra weight. I don't know how related that is to the points you're making in this post, but it certainly inspires some curiosity about where you're going with this.

I had not seen your wip_services_abstraction branch before; I've just taken a quick look now. More comments would probably have made it easier to read, as would following PEP8. I don't think there's anything problematic about having a class that knows how to start and stop a service, but I don't know what comments you've received elsewhere (there aren't any on the PR).

John

On 09/07/15 11:08, Owen Synge wrote:

Dear all,

The facade pattern (or façade pattern) is a software design pattern commonly used with object-oriented programming. The name is by analogy to an architectural facade. (wikipedia)

I am frustrated with the desire to standardise on the bad-practice implementation of the facade pattern in python that is used in ceph-deploy and even in ceph. The current standard of selectively calling modules with functions has a series of complexities to it.
Ceph example: https://github.com/ceph/ceph/tree/master/src/ceph-detect-init/ceph_detect_init

ceph-deploy example: https://github.com/ceph/ceph-deploy/tree/master/ceph_deploy/hosts

So I guess _some_ of you don't immediately see why this is VERY bad practice, and wonder why this makes me feel like the orange in this story: http://www.dailymail.co.uk/news/article-2540670/The-perfect-half-time-oranges-five-football-matches-Farmers-create-pentagon-shaped-fruit.html when I try to code with ceph-deploy in particular.

The ceph/ceph-deploy way of implementing a facade in python causes this list of problems:

1) facade cannot be instantiated twice.
2) facade requires code layout inflexibility.
3) facade implementation can have side effects when the implementation is changed.

And probably others I have not thought of. In consequence, from the points above:

From (1)
(1A) no concurrency, so you can't configure more than one ceph-node at a time.
(1B) You have to close one facade to start another, e.g. in ceph-deploy you have to close each connection before connecting to the next server, making it slow to use as all state has to be gathered.

From (2)
(2A) No nesting of facades without code duplication, e.g. reuse of systemd support between distributions.
(2B) inflexible / complex include paths for shared code between facade implementations.

From (3)
(3A) Since all variables in a facade implementation are global, but isolated to an implementation, we cannot have variables in the implementation; any cross-function variables that are not passed as parameters or return values will lead to side effects that are difficult to test.

So this set of general points has complex side effects, related to where the facade is implemented, that make you feel like the pictured orange when developing. About this point you will say: well, it's open source, so fix it?
My answer to this is that when I try to do this, as in this patch: https://github.com/osynge/ceph-deploy/commit/b82f89f35b27814ed4aba1082efd134c24ecf21f the responses, more than once, seem to suggest I should use the much more complex multi-file implementation of a façade.

The only advantages of implementing façades in the standard ceph / ceph-deploy way that I can see are:

(I) it is how ceph-deploy has always done it.
(II) It allows you to continue using nearly no custom objects in python.
(III) We like our oranges in funny shapes.

It seems to me the current implementation could have been created due to the misunderstanding that modules are like objects, when in fact they are like classes. Issues (1) and (3) can be solved simply by importing methods as objects rather than classes, but this does nothing to solve the bigger issue (2), which is more serious; it's a simple step forward that might be very simple to patch, but there is little point in a POC unless people agree that issues (1), (2) and (3) are serious.

Please can we not spread this boat anchor* implementation of a facade further around the code base, and ideally migrate away from this bad practice, and help us all feel like happy round oranges.
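To make the module-vs-class distinction concrete, here is a small illustrative sketch (not actual ceph-deploy code; all names are invented). A module used as a facade is effectively a singleton, which is issues (1) and (3) above, while a class-based facade can be instantiated once per host:

```python
# Module-style facade: state is module-global, so only one "connection"
# can exist at a time (issue 1), and cross-function state is a de facto
# global (issue 3).
_current_host = None

def connect(host):
    global _current_host
    _current_host = host

def start_service(name):
    return "%s: started %s" % (_current_host, name)

# Class-style facade: each instance carries its own state, so two hosts
# can be driven concurrently.
class HostFacade:
    def __init__(self, host):
        self.host = host

    def start_service(self, name):
        return "%s: started %s" % (self.host, name)

a = HostFacade("node1")
b = HostFacade("node2")
results = [a.start_service("osd"), b.start_service("mon")]
```

With the module version, driving "node2" requires first tearing down the "node1" state; with the class version both facades coexist.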
Re: Preferred location for utility execution
On 26/06/2015 21:14, Handzik, Joe wrote:

Hey ceph-devel, Gregory Meno (of Calamari fame) and I are working on what is now officially a blueprint for Jewel (http://tracker.ceph.com/projects/ceph/wiki/Calamariapihardwarestorage), and we'd like some feedback. Some of this has been addressed via separate conversations about the feature that some of this work started out as (identifying drives in a cluster by toggling their LED states), but we wanted to ask a more direct question: What is the preferred location/mechanism to execute operations on storage hardware? We see two clear options:

1. Make Calamari responsible for executing commands using various linux utilities (and /sys, when applicable).
2. Build a command set into RADOS to execute commands using various linux utilities. These commands could then be executed by Calamari using the rest api.

The big win for #1 is the ability to rapidly iterate on the capabilities of the Calamari toolset (it is almost certainly going to be faster to create a set of scripts similar to Gregory's initial commit for SMART polling than to add that functionality inside RADOS. See: https://github.com/ceph/calamari/pull/267). For #2, we'd pick up the ability to run those same commands via the cli, which would give users a lot more flexibility in how they troubleshoot their cluster (Calamari wouldn't be required, it would just make life easier).

Hi Joe,

I'd reiterate my earlier comments[1] in favour of option 2. I would be cautious about implementing any of this in Calamari until there are at least upstream packages available for folks to use, and broader uptake. In the current situation, it's hard to ask people to try something out in Calamari, and much more straightforward to distribute something as part of Ceph. Hardware is pretty varied; I would expect you'll need help from others in the community to ensure any hardware handling works as expected in diverse environments, which will be much simpler with ceph than calamari.
The part where some central python (calamari or otherwise) would really come into its own is in the fusion of information from multiple hosts, and exposing it to a user interface. On that aspect, I left some comments last time this came up: http://lists.ceph.com/pipermail/ceph-calamari-ceph.com/2015-May/73.html

Ceph itself is getting a bit smarter with some of this stuff, e.g. the new "node ls" stuff gives you metadata about hosts and services without the need for calamari. Hanging device info off these new structures would be a pretty reasonable thing to do, and if someone later has a GUI that they want to pipe that into, they can grab it via the mon along with everything else.

Cheers,
John

1. https://www.mail-archive.com/ceph-devel@vger.kernel.org/msg23186.html
Re: cephfs obsolescence and object location
Since you're only looking up the ID of the first object, it's really simple. It's just the hex printed inode number followed by ".". That's not guaranteed to always be the case in the future, but it's likely to be true longer than the deprecated ioctls exist. If I was you, I would hard-code the object naming convention rather than writing in a dependency on the ioctl. As Greg says, you can also query all the layout stuff (via supported interfaces) and do the full calculation of object names for arbitrary offsets into the file if you need to.

John

On 22/06/2015 22:18, Bill Sharer wrote:

I'm currently running giant on gentoo and was wondering about the stability of the api for mapping MDS files to rados objects. The cephfs binary complains that it is obsolete for getting layout information, but it also provides object location info. AFAICT this is the only way to map files in a cephfs filesystem to object locations if I want to take advantage of the UFO nature of ceph's stores in order to access via both cephfs and rados methods.

I have a content store that scans files, calculates their sha1 hash and then stores them on a cephfs filesystem tree with their filenames set to their sha1 hash name. I can then build views of this content using an external local filesystem and symlinks pointing into the cephfs store. At the same time, I want to be able to use this store via rados either through the gateway or my own software that is rados aware. The store is being treated as a write-once, read-many style system. Towards this end, I started writing a QT4 based library that includes this little Location routine (which currently works) to grab the rados object location from a hash object in this store.
I'm just wondering whether this is all going to break horribly in the future when ongoing MDS development decides to break the code I borrowed from cephfs :-)

QString Shastore::Location(const QString hash)
{
    QString result = "";
    QString cache_path = this->dbcache + "/" + hash.left(2) + "/" + hash.mid(2,2) + "/" + hash;
    QFile cache_file(cache_path);
    if (cache_file.exists()) {
        if (cache_file.open(QIODevice::ReadOnly)) {
            /*
             * Ripped from cephfs code, grab the handle and use the ceph version of ioctl to
             * rummage through the file's xattrs for rados location.  cephfs whines about being
             * obsolete to get layout this way, but this appears to be the only way to get location.
             * This may all break horribly in a future release since MDS is undergoing heavy development.
             *
             * cephfs lets the user pass file_offset in argv but it defaults to 0.  Presumably this is
             * the first extent of the pile of extents (4mb each?) and shards for the file.  If the user
             * wants to jump elsewhere with a non-zero offset, the resulting rados object location may
             * be different.
             */
            int fd = cache_file.handle();
            struct ceph_ioctl_dataloc location;
            location.file_offset = 0;
            int err = ioctl(fd, CEPH_IOC_GET_DATALOC, (unsigned long)&location);
            if (err) {
                qDebug() << "Location: Error getting rados location for" << cache_path;
            } else {
                result = QString(location.object_name);
            }
            cache_file.close();
        } else {
            qDebug() << "Location: unable to open" << cache_path << "readonly";
        }
    } else {
        qDebug() << "Location: cache file" << cache_path << "does not exist";
    }
    return result;
}
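For illustration, the naming convention John suggests hard-coding can be sketched in a few lines of Python. The assumption here is that CephFS data objects are named "<inode hex>.<extent index hex>" with a fixed-width 8-digit hex extent index; that matches current behaviour but, as noted above, is a convention rather than a promised API:

```python
# Sketch of CephFS data object naming, assuming objects are named
# "<inode hex>.<extent index hex>" (8 hex digits), and a simple
# non-striped layout where each object holds `object_size` bytes.

def cephfs_object_name(ino, file_offset=0, object_size=4 * 1024 * 1024):
    """Return the RADOS object name holding `file_offset` of inode `ino`."""
    extent = file_offset // object_size
    return "%x.%08x" % (ino, extent)

first = cephfs_object_name(0x10000003456)                     # first object
later = cephfs_object_name(0x10000003456, 9 * 1024 * 1024)    # third extent
```

For non-default (striped) layouts the offset-to-object calculation is more involved, which is why querying the layout via supported interfaces is the safer route for arbitrary offsets.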
Re: rbd top
On 17/06/2015 18:06, Robert LeBlanc wrote:
> Well, I think this has gone well past my ability to implement. Should this be turned into a BP and see if someone is able to work on it?

Sorry, didn't mean to hijack your thread :-)

It might still be useful to discuss the simpler case of tracking top clients/top objects (i.e. just native RADOS concepts) with an LRU table of stats (like Sage described) as a simpler alternative to my custom-querying proposal. I'm going to write a blueprint for the custom query thing anyway though, as I'm kind of hot on the idea, though I don't know who/when will have time to take it on as it's a bit heavyweight.

John
Re: rbd top
On 15/06/2015 14:52, Sage Weil wrote:
> I seem to remember having a short conversation about something like this a few CDS's back... although I think it was 'rados top'. IIRC the basic idea we had was for each OSD to track its top clients (using some approximate LRU type algorithm) and then either feed this relatively small amount of info (say, top 10-100 clients) back to the mon for summation, or dump via the admin socket for calamari to aggregate. This doesn't give you the rbd image name, but I bet we could infer that without too much trouble (e.g., include a recent object or two with the client). Or, just assume that client id is enough (it'll include an IP and PID... enough info to find the /var/run/ceph admin socket or the VM process).

If we were going to do top clients, I think it'd make sense to also have a top objects list as well, so you can see what the hottest objects in the cluster are.

The following is a bit of a tangent... A few weeks ago I was thinking about general solutions to this problem (for the filesystem). I played (very briefly, on wip-live-query) with the idea of publishing a list of queries to the MDSs/OSDs, that would allow runtime configuration of what kind of thing we're interested in and how we want it broken down. If we think of it as an SQL-like syntax, then for the RBD case we would have something like:

SELECT read_bytes, write_bytes WHERE pool=rbd GROUP BY rbd_image

(You'd need a protocol-specific module of some kind to define what rbd_image meant here, which would do a simple mapping from object attributes to an identifier (similar would exist for e.g. cephfs inode).)

Each time an OSD does an operation, it consults the list of active performance queries and updates counters according to the value of the GROUP BY parameter for the query (so in the above example each OSD would be keeping a result row for each rbd image touched).
The LRU part could be implemented as LIMIT BY + SORT parameters, such that the result rows would be periodically sorted and the least-touched results would drop off the list. That would probably be used in conjunction with a decay operator on the sorted-by field, like:

SELECT read_bytes, write_bytes, ops WHERE pool=rbd GROUP BY rbd_image SORT BY movingAverage(derivative(ops)) LIMIT 100

Combining WHERE clauses would let the user drill down (apologies for buzzword) by doing things like identifying the most busy clients, and then for each of those clients identifying which images/files/objects the client is most active on, or vice versa identifying busy objects and then seeing which clients are hitting them. Usually keeping around enough stats to enable this is prohibitive at scale, but it's fine when you're actively creating custom queries for the results you're really interested in, instead of keeping N_clients*N_objects stats, and when you have the LIMIT part to ensure results never get oversized.

The GROUP BY options would also include metadata sent from clients, e.g. the obvious cases like VM instance names, or rack IDs, or HPC job IDs. Maybe also some less obvious ones, like decorating cephfs IOs with the inode of the directory containing the file, so that OSDs could accumulate per-directory bandwidth numbers, and the user could ask "which directory is bandwidth-hottest?" as well as "which file is bandwidth-hottest?".

Then, after implementing all that craziness, you get some kind of wild multicolored GUI that shows you where the action is in your system at a cephfs/rgw/rbd level.

Cheers,
John
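A minimal sketch of the per-OSD side of such a query might look as follows. All names here (`PerfQuery`, the decay factor, the key function) are invented for illustration, not the wip-live-query code: each op is matched against the WHERE clause, accumulated under its GROUP BY key, and the table is periodically trimmed to the LIMIT by a decayed ops count:

```python
import heapq

class PerfQuery:
    """Toy version of: SELECT read_bytes, write_bytes WHERE pool=X
       GROUP BY <key_fn> SORT BY decayed ops LIMIT <limit>."""

    def __init__(self, pool, key_fn, limit=100, decay=0.9):
        self.pool = pool          # WHERE pool=...
        self.key_fn = key_fn      # GROUP BY: maps an op to e.g. an rbd image
        self.limit = limit        # LIMIT: max result rows kept
        self.decay = decay        # decay factor applied to the sort field
        self.rows = {}            # key -> [read_bytes, write_bytes, decayed_ops]

    def account(self, op):
        if op["pool"] != self.pool:
            return
        row = self.rows.setdefault(self.key_fn(op), [0, 0, 0.0])
        row[0] += op.get("read_bytes", 0)
        row[1] += op.get("write_bytes", 0)
        row[2] += 1

    def trim(self):
        # Periodic pass: decay the ops counter, then drop least-touched rows.
        for row in self.rows.values():
            row[2] *= self.decay
        if len(self.rows) > self.limit:
            keep = heapq.nlargest(self.limit, self.rows,
                                  key=lambda k: self.rows[k][2])
            self.rows = {k: self.rows[k] for k in keep}

# Example: group by the middle token of an rbd_data-style object name.
q = PerfQuery("rbd", key_fn=lambda op: op["object"].split(".")[1], limit=2)
for name, n in [("img_a", 5), ("img_b", 3), ("img_c", 1)]:
    for _ in range(n):
        q.account({"pool": "rbd",
                   "object": "rbd_data.%s.0" % name,
                   "write_bytes": 4096})
q.trim()   # img_c, the least-touched row, falls off the list
```

The memory bound is what makes this viable at scale: the OSD only ever keeps `limit` rows per active query, rather than stats for every client/object pair.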
Re: rbd top
On 15/06/2015 17:10, Robert LeBlanc wrote:
> John, let me see if I understand what you are saying... When a person runs `rbd top`, each OSD would receive a message saying "please capture all the performance, grouped by RBD and limit it to 'X'". That way the OSD doesn't have to constantly update performance for each object, but when it is requested it starts tracking it?

Right, initially the OSD isn't collecting anything; it starts as soon as it sees a query get loaded up (published via OSDMap or some other mechanism). That said, in practice I can see people having some set of queries that they always have loaded and feeding into graphite in the background.

> If so, that is an interesting idea. I wonder if that would be simpler than tracking the performance of each/MRU objects in some format like /proc/diskstats where it is in memory and not necessarily consistent. The benefit is that you could have lifelong stats that show up like iostat and it would be a simple operation.

Hmm, not sure we're on the same page about this part. What I'm talking about is all in memory and would be lost across daemon restarts. Some other component would be responsible for gathering the stats across all the daemons in one place (that central part could persist stats if desired).

> Each object should be able to reference back to RBD/CephFS upon request and the client could even be responsible for that load. Client performance data would need stats in addition to the object stats.

You could extend the mechanism to clients. However, as much as possible it's a good thing to keep it server side, as servers are generally fewer (still have to reduce these stats across N servers to present to the user), and we have multiple client implementations (kernel/userspace). What kind of thing do you want to get from clients?

> My concern is that adding additional SQL like logic to each op is going to get very expensive.
> I guess if we could push that to another thread early in the op, then it might not be too bad. I'm enjoying the discussion and new ideas.

Hopefully in most cases the query can be applied very cheaply, for operations like comparing pool ID or grouping by client ID. However, I would also envisage an optional sampling number, such that e.g. only 1 in every 100 ops would go through the query processing. Useful for systems where keeping the highest throughput is paramount, and the numbers will still be useful if clients are doing many thousands of ops per second.

Cheers,
John
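The sampling idea reduces per-op cost to a single cheap random check on the fast path. A sketch (the class and parameter names are invented for illustration); note the sampled op is weighted by the sampling factor so the accumulated counters remain unbiased estimates of the true totals:

```python
import random

class SampledQueryGate:
    """Run expensive per-op query processing on only ~1 in `n` ops,
    scaling each sampled op by `n` so accumulated totals stay unbiased."""

    def __init__(self, n, process):
        self.n = n                # sampling factor, e.g. 100
        self.process = process    # the expensive per-op query evaluation

    def on_op(self, op):
        # Fast path: one random draw; most ops skip the query machinery.
        if random.randrange(self.n) == 0:
            self.process(op, weight=self.n)

# With n=1 every op is processed, which makes the behaviour deterministic
# for this demonstration.
counter = []
gate = SampledQueryGate(1, lambda op, weight: counter.append(op["bytes"] * weight))
for _ in range(3):
    gate.on_op({"bytes": 10})
```

With n=100 the counters become statistical estimates, which is fine at thousands of ops per second but would be noisy for rarely-touched objects; that trade-off is why it should be optional per query.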
Re: Thoughts about metadata exposure in Calamari
On 05/06/2015 20:33, Handzik, Joe wrote:
> I err in the direction of calling 'osd metadata' too, but it does mean that Calamari will need to add that call in (I'll leave it to Gregory to say if that is particularly undesirable). Do you think it would be worthwhile to better define the metadata bundle into a structure, or is it ok to leave it as a set of string pairs?

Versioning of the metadata is something to consider. The osd metadata stuff is outside the osdmap epochs, so anything that is consuming updates to it is stuck with doing some kind of full polling as it stands. It might be that some better interface with versions+deltas is needed for a management layer to efficiently consume it. A version concept, where the version is incremented when an OSD starts or updates its metadata, could make synchronization with a management layer much more efficient. Efficiency matters here when we're calling on the mons to serialize data for potentially 1000s of OSDs into JSON whenever the management layer wants an update.

John
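The versions+deltas idea can be sketched as follows (all names invented; this is not a Ceph API): the store bumps a global version whenever any OSD registers or updates its metadata, and the management layer polls with its last-seen version, receiving only the entries that changed since then:

```python
# Toy model of versioned metadata with delta polling.

class MetadataStore:
    def __init__(self):
        self.version = 0
        self.items = {}    # osd_id -> metadata dict
        self.changed = {}  # osd_id -> version at which it last changed

    def update(self, osd_id, metadata):
        # Called when an OSD starts or re-registers its metadata.
        self.version += 1
        self.items[osd_id] = metadata
        self.changed[osd_id] = self.version

    def delta_since(self, since):
        """Return (current_version, {osd_id: metadata changed after `since`})."""
        return self.version, {
            osd_id: self.items[osd_id]
            for osd_id, v in self.changed.items() if v > since
        }

store = MetadataStore()
store.update(0, {"hostname": "node1"})
store.update(1, {"hostname": "node2"})
v1, full = store.delta_since(0)        # first poll: everything
store.update(1, {"hostname": "node2b"})
v2, delta = store.delta_since(v1)      # second poll: only osd.1
```

The saving is on the mon side: instead of serializing metadata for every OSD on each poll, it serializes only the handful that changed, and an unchanged cluster costs a single version comparison.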
Re: sharded collection list
Talked about this elsewhere, but for the benefit of the list:

* The API suggested here looks nicer to me too
* This depends on the new PGLS ordering OSD side, so that has to land before this
* In the meantime I've rebased the #9964 (rados import/export) branch to not depend on sharded pgls

Cheers,
John

On 02/06/2015 23:54, Sage Weil wrote:

Hey John-

So the sharded pgls stuff has collided a bit with the looming hobject sorting changes. Sam and I just talked about it a bit and came up with what librados API would be most appealing:

- the listing API would have start/end markers
- it would be driven by a new opaque type rados_list_cursor_t, which is just data, no state, and internally is just an hobject_t.
- it would be totally stateless.. kill the [N]ListContext stuff in Objecter (and reimplement a simple wrapper in librados.cc or even .h). Note that the important bits of state there now are: epoch (needed for detecting split; this will go away with a better cursor), result buffer (we can drop this), nspace (part of the ioctx, it just tags each request), cookie (this basically becomes the cursor.. it's just an hobject_t typedef)
- the list could take a start cursor, optional end cursor, and output the next cursor to continue from.
- we'd lose the buffering that ListContext currently does, which means that the request that goes over the wire will return the same number of entries that the C caller asks for. The C++ interface is an iterator so it'll have to do its own buffering, but that should be pretty trivial...
- we should kill these calls, which were never used:

  CEPH_RADOS_API uint32_t rados_nobjects_list_get_pg_hash_position(rados_list_ctx_t ctx);
  CEPH_RADOS_API uint32_t rados_nobjects_list_seek(rados_list_ctx_t ctx, uint32_t pos);

- we'd add a new call that is something like

  int rados_construct_iterator(ioctx, int n, int m, cursor *out);

  so that you can get a position partway through the pg.

What do you think?
Unfortunately it is quite a departure from what you implemented already, but I think it'll be a net simplification *and* let you do all the things we want, like:

- get a set of ranges to list from
- change our mind partway through to break things into smaller shards without losing previous work
- start listing from a random position in the pool

You could even list a single hash value by constructing cursors with n=hash and n=hash+1 and m=2^32. What do you think?

sage
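The stateless-cursor model Sage describes can be illustrated with a toy in Python (names invented; a sorted list of strings stands in for the hobject_t sort order). The essential property is that a cursor is plain data marking a position, so callers can resume, re-shard, or bound a listing without any server-side iterator state:

```python
import bisect

def list_objects(pool, start, end=None, max_entries=2):
    """Return (entries, next_cursor); next_cursor is None when done.
    `pool` is a sorted list standing in for a PG's object order; a cursor
    is just an object name (plain data), holding no server state."""
    lo = bisect.bisect_left(pool, start)
    hi = bisect.bisect_left(pool, end) if end is not None else len(pool)
    entries = pool[lo:min(lo + max_entries, hi)]
    # The next cursor sorts immediately after the last entry returned.
    nxt = entries[-1] + "\0" if lo + len(entries) < hi else None
    return entries, nxt

pool = sorted("obj%02d" % i for i in range(5))

# Resumable listing: each call starts from the cursor the last one returned.
out, cursor = [], ""
while cursor is not None:
    entries, cursor = list_objects(pool, cursor)
    out.extend(entries)

# Bounded listing: an end cursor limits the range, e.g. for one shard.
shard, _ = list_objects(pool, "obj01", end="obj03")
```

Because the cursor is just data, a caller can also split the remaining range into smaller shards mid-listing without losing the work already done, which is exactly the rados import/export use case.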
Re: RBD journal draft design
On 02/06/2015 16:11, Jason Dillaman wrote:
> I am posting to get wider review/feedback on this draft design. In support of the RBD mirroring feature [1], a new client-side journaling class will be developed for use by librbd. The implementation is designed to carry opaque journal entry payloads so it will be possible for it to be re-used in other applications as well in the future. It will also use the librados API for all operations. At a high level, a single journal will be composed of a journal header to store metadata and multiple journal objects to contain the individual journal entries.
>
> ...
>
> A new journal object class method will be used to submit journal entry append requests. This will act as a gatekeeper for the concurrent client case. A successful append will indicate whether or not the journal is now full (larger than the max object size), indicating to the client that a new journal object should be used. If the journal is too large, an error code response would alert the client that it needs to write to the current active journal object. In practice, the only time the journaler should expect to see such a response would be in the case where multiple clients are using the same journal and the active object update notification has yet to be received.

Can you clarify the procedure when a client write gets an "I'm full" return code from a journal object? The key part I'm not clear on is whether the client will first update the header to add an object to the active set (and then write it), or whether it goes ahead and writes objects and then lazily updates the header.

* If it's object first, header later, what bounds how far ahead of the active set we have to scan when doing recovery?
* If it's header first, object later, that's an uncomfortable bit of latency whenever we cross an object bound

Nothing intractable about mitigating either case, just wondering what the idea is in this design.
> In contrast to the current journal code used by CephFS, the new journal code will use sequence numbers to identify journal entries, instead of offsets within the journal. Additionally, a given journal entry will not be striped across multiple journal objects. Journal entries will be mapped to journal objects using the sequence number: sequence number mod splay count == object number mod splay count for active journal objects. The rationale for this difference is to facilitate parallelism for appends as journal entries will be splayed across a configurable number of journal objects. The journal API for appending a new journal entry will return a future which can be used to retrieve the assigned sequence number for the submitted journal entry payload once committed to disk. The use of a future allows for asynchronous journal entry submissions by default and can be used to simplify integration with the client-side cache writeback handler (and as a potential future enhancement to delay appends to the journal in order to satisfy EC-pool alignment requirements).
When two clients are both doing splayed writes, and they both send writes in parallel, it seems like the per-object fullness check via the object class could result in the writes getting staggered across different objects. E.g. if we have two objects that both only have one slot left, then A could end up taking the slot in one (call it 1) and B could end up taking the slot in the other (call it 2). Then when B's write lands at object 1, it gets an "I'm full" response and has to send the entry... where? I guess to some arbitrarily-higher-numbered journal object depending on how many other writes landed in the meantime. This potentially leads to the stripes (splays?)
of a given journal entry being separated arbitrarily far across different journal objects, which would be fine as long as everything was well formed, but will make detecting issues during replay harder (we would have to remember partially-read entries when looking for their remaining stripes through the rest of the journal). You could apply the object class behaviour only to the object containing the 0th splay, but then you'd have to wait for the write there to complete before writing to the rest of the splays, so the latency benefit would go away. Or it's equally possible that there's a trick in the design that has gone over my head :-)
Cheers,
John
-- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
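If it helps to make the splay mapping concrete, here is a toy sketch of the invariant "sequence number mod splay count == object number mod splay count" (my own illustration, not the proposed implementation):

```python
def object_for_seq(seq, splay_count, active_base):
    """Map a journal entry's sequence number to an active journal object.

    active_base is the lowest object number in the active set; it is
    assumed here to be a multiple of splay_count, so that the invariant
    seq % splay_count == object_num % splay_count holds.
    """
    assert active_base % splay_count == 0
    return active_base + (seq % splay_count)
```

For example, with a splay count of 4 and active objects 8..11, sequence numbers 4..7 land on objects 8..11 respectively, so consecutive entries can be appended in parallel.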
Re: RBD mirroring design draft
On 28/05/2015 06:37, Gregory Farnum wrote:
> On Tue, May 12, 2015 at 5:42 PM, Josh Durgin jdur...@redhat.com wrote:
>> It will need some metadata regarding positions in the journal. These could be stored as omap values in a 'journal header' object in a replicated pool, for rbd perhaps the same pool as the image for simplicity. The header would contain at least:
>> * pool_id - where journal data is stored
>> * journal_object_prefix - unique prefix for journal data objects
>> * positions - (zone, purpose, object num, offset) tuples indexed by zone
>> * object_size - approximate size of each data object
>> * object_num_begin - current earliest object in the log
>> * object_num_end - max potential object in the log
>> Similar to rbd images, journal data would be stored in objects named after the journal_object_prefix and their object number. To avoid issues of padding or splitting journal entries, and to make it simpler to keep append-only, it's easier to allow the objects to be near object_size before moving to the next object number instead of sticking with an exact object size. Ideally this underlying structure could be used for both rbd and cephfs. Variable sized objects are different from the existing cephfs journal, which uses fixed-size objects for striping. The default is still 4MB chunks though. How important is striping the journal to cephfs? For rbd it seems unlikely to help much, since updates need to be batched up by the client cache anyway.
> I think the journaling v2 stuff that John did actually made objects variably-sized as you've described here. We've never done any sort of striping on the MDS journal, although I think it was possible previously.
The objects are still fixed size: we talked about changing it so that journal events would never span an object boundary, but didn't do it -- it still uses Filer.
Parallelism
^^^
Mirroring many images is embarrassingly parallel. A simple unit of work is an image (more specifically a journal, if e.g.
a group of images shared a journal as part of a consistency group in the future). Spreading this work across threads within a single process is relatively simple. For HA, and to avoid a single NIC becoming a bottleneck, we'll want to spread out the work across multiple processes (and probably multiple hosts). rbd-mirror should have no local state, so we just need a mechanism to coordinate the division of work across multiple processes. One way to do this would be layering on top of watch/notify. Each rbd-mirror process in a zone could watch the same object, and shard the set of images to mirror based on a hash of image ids onto the current set of rbd-mirror processes sorted by client gid. The set of rbd-mirror processes could be determined by listing watchers. You're going to have some tricky cases here when reassigning authority as watchers come and go, but I think it should be doable. I've been fantasizing about something similar to this for CephFS backward scrub/recovery. My current code supports parallelism, but relies on the user to script their population of workers across client nodes. I had been thinking of more of a master/slaves model, where one guy would get to be the master by e.g. taking the lock on an object, and he would then hand out work to everyone else that was a watch/notify subscriber to the magic object. It seems like that could be simpler than having workers have to work out independently what their workload should be, and have the added bonus of providing a command-like mechanism in addition to continuous operation. Cheers, John -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
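The listing-watchers sharding idea can be sketched like this (a hypothetical helper, not rbd-mirror code; the choice of hash function is my assumption):

```python
import hashlib

def shard_images(image_ids, watcher_gids):
    """Deterministically shard images across rbd-mirror processes: every
    process lists the watchers on the shared object, sorts them by client
    gid, and takes the images that hash onto its own slot."""
    workers = sorted(watcher_gids)
    assignment = {gid: [] for gid in workers}
    for image_id in image_ids:
        # A stable (non-randomized) hash, so every process computes the
        # same assignment independently without coordination.
        h = int(hashlib.sha1(image_id.encode()).hexdigest(), 16)
        assignment[workers[h % len(workers)]].append(image_id)
    return assignment
```

When the watcher set changes, each process recomputes and picks up its new share; the races during that reassignment are exactly the tricky part John mentions.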
Re: RFC: progress bars
On 28/05/2015 17:41, Robert LeBlanc wrote:
> Let me see if I understand this... Your idea is to have a progress bar that shows (active+clean + active+scrub + active+deep-scrub) / pgs and then estimate time remaining?
Not quite: it's not about doing a calculation on the global PG state counts. The code identifies specific PGs affected by specific operations, and then watches the status of those PGs.
> So if PGs are split the numbers change and the progress bars go backwards, is that a big deal?
I don't see a case where the progress bars go backwards with the code I have so far? In the case of operations on PGs that split, it'll just ignore the new PGs, but you'll get a separate event tracking the creation of the new ones. In general, progress bars going backwards isn't something we should allow to happen (happy to hear counter examples though, I'm mainly speaking from intuition on that point!) If this was extended to track operations across PG splits (it's unclear to me that that complexity is worthwhile), then the bar still wouldn't need to go backwards, as whatever stat was being tracked would remain the same when summed across the newly split PGs.
> I don't think so, it might take a little time to recalculate how long it will take, but no big deal. I do like the idea of the progress bar even if it is fuzzy. I keep running ceph status or ceph -w to watch things and have to imagine it in my mind.
Right, the idea is to save the admin from having to interpret PG counts mentally.
> It might be nice to have some other stats like client I/O and rebuild I/O so that I can see if recovery is impacting production I/O.
We already have some of these stats globally, but it would be nice to be able to reason about what proportion of I/O is associated with specific operations, e.g. "I have some total recovery IO number, what proportion of that is due to a particular drive failure?".
Without going and looking at current pg stat structures I don't know if there is enough data in the mon right now to guess those numbers. This would *definitely* be heuristic rather than exact, in any case. Cheers, John -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
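The PG-tracking approach described above amounts to something like the following (an illustrative sketch, not the actual ProgressEvent code):

```python
def progress(tracked_pgs, pg_state):
    """Fraction of the specific PGs captured at init() that have reached
    their final state.  Because the tracked set is frozen when the event
    is created, newly split PGs are simply ignored, and the fraction does
    not regress as long as PGs only move towards active+clean."""
    if not tracked_pgs:
        return 1.0
    done = sum(1 for pg in tracked_pgs if pg_state.get(pg) == "active+clean")
    return done / len(tracked_pgs)
```

This is why the bar need not go backwards on a split: the denominator is fixed at init() time rather than derived from the live global PG counts.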
Re: Slow file creating and deleting using bonnie ++ on Hammer
On 26/05/2015 15:50, Barclay Jameson wrote:
> Thank you for the great explanation Zheng! That definitely shows what I was seeing with the bonnie++ test. What bad things would happen if I modified the config option mds_tick_interval to flush the journal to a second or less?
The MDS does various pieces of housekeeping according to that interval, so setting it extremely low will cause some CPU cycles to be wasted, and flushing the log more often will cause a larger number of smaller IOs to get generated. I would be very surprised if decreasing it to approx 1s was harmful though. On a busy real world system, other metadata operations will often drive log writes through faster than waiting for a tick.
> Does this also mean any custom code written should avoid use of fsync() if writing a large number of files?
You should call it only when your application requires it for consistency, and always expect it to be a high latency operation. Add up the latency from your client to your server and from the server to the disk, and the length of the IO queue on the disk, and then the return leg -- that is the *minimum* time you should expect to wait for an fsync. For example, a real world workload creating N files in a directory would hopefully call fsync on the directory once at the end, rather than in between every file, unless you really do need to be sure that the dentry for the preceding file will be persistent before you start writing the next file. Sometimes it's easier to reason about it in terms of concurrency: if you have a bunch of IOs that you could safely run in parallel in a thread each, then you shouldn't be fsyncing between them, just at the point you join them.
John
-- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
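The fsync-batching advice can be illustrated with a small sketch (plain POSIX calls, nothing CephFS-specific):

```python
import os

def create_files(dirpath, names, fsync_each=False):
    """Create empty files under dirpath.  With fsync_each=True this
    mimics the bonnie++ pattern (one full round trip per file); with the
    default, a single fsync on the directory at the end persists all the
    new dentries in one wait."""
    for name in names:
        fd = os.open(os.path.join(dirpath, name), os.O_CREAT | os.O_WRONLY, 0o644)
        if fsync_each:
            os.fsync(fd)  # pay the create -> journal-flush latency per file
        os.close(fd)
    dfd = os.open(dirpath, os.O_RDONLY)
    try:
        os.fsync(dfd)  # one directory fsync covers all the creates above
    finally:
        os.close(dfd)
```

The fsync_each path serializes N high-latency waits; the batched path is the "join point" pattern described above.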
Re: Slow file creating and deleting using bonnie ++ on Hammer
On 26/05/2015 07:55, Yan, Zheng wrote:
> the reason for slow file creations is that bonnie++ calls fsync(2) after each creat(2). fsync() waits for safe replies of the create requests. The MDS sends a safe reply when the log event for the request gets journaled safely. The MDS flushes the journal every 5 seconds (mds_tick_interval). So the speed of file creation for bonnie++ is one file every five seconds.
Ah, I hadn't noticed that the benchmark called... I wonder if I'm seeing the fuse client return quickly because it simply doesn't implement the fsyncdir call. We should fix that! It looks like we used to have an OP_FSYNC in the client-server protocol (perhaps for flushing the log immediately on fsyncs), anyone have any background on why that went away?
Cheers,
John
-- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: MDS auth caps for cephfs
On 21/05/2015 01:14, Sage Weil wrote:
> Looking at the MDSAuthCaps again, I think there are a few things we might need to clean up first. The way it is currently structured, the idea is that you have an array of grants (MDSCapGrant). For any operation, you'd look at each grant until one says that what you're trying to do is okay. If none match, you fail. (i.e., they're additive only.)
> Each MDSCapGrant has a 'spec' and a 'match'. The 'match' is a check to see if the current grant applies to a given operation, and the 'spec' says what you're allowed to do. Currently MDSCapMatch is just
>   int uid;          // Require UID to be equal to this, if != MDS_AUTH_UID_ANY
>   std::string path; // Require path to be child of this (may be / for any)
> I think path is clearly right. UID I'm not sure makes sense here... I'm inclined to ignore it (instead of removing it) until we decide how to restrict a mount to be a single user. The spec is
>   bool read;
>   bool write;
>   bool any;
> I'm not quite sure what 'any' means, but read/write are pretty clear.
Ah, I added that when implementing 'tell' -- 'any' is checked when handling incoming MCommand in the MDS, so it's effectively the admin permission.
> The root_squash option clearly belongs in spec, and Nistha's first patch adds it there. What about the other NFS options.. should we mirror those too?
> root_squash: Map requests from uid/gid 0 to the anonymous uid/gid. Note that this does not apply to any other uids or gids that might be equally sensitive, such as user bin or group staff.
> no_root_squash: Turn off root squashing. This option is mainly useful for diskless clients.
> all_squash: Map all uids and gids to the anonymous user. Useful for NFS-exported public FTP directories, news spool directories, etc. The opposite option is no_all_squash, which is the default setting.
> anonuid and anongid: These options explicitly set the uid and gid of the anonymous account.
> This option is primarily useful for PC/NFS clients, where you might want all requests to appear to be from one user. As an example, consider the export entry for /home/joe in the example section below, which maps all requests to uid 150 (which is supposedly that of user joe).
Yes, I think we should. Part of me wants to say that people who want NFS-like behaviour should be using NFS gateways. However, these are all probably straightforward enough to implement that it's worth maintaining them in cephfs too. We probably need to mirror these in our mount options too, so that e.g. someone with an admin key can still enable root_squash at will, rather than having to craft an authentication token with the desired behaviour.
> We could also do an all_squash bool at the same time (or a flags field for more efficient encoding), and anonuid/gid so that we don't hard-code 65534. In order to add these to the grammar, I suspect we should go back to root_squash (not squash_root), and add an 'options' tag. e.g.,
>   allow path /foo rw options no_root_squash anonuid=123 anongid=123
> (having them live next to rw was breaking the spirit parser, bah).
Looks good to me.
John
-- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
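To make the squash semantics concrete, here is a toy sketch of the uid/gid mapping the options above describe (illustrative only; the names and the 65534 default mirror NFS convention, not the MDSAuthCaps implementation):

```python
NOBODY = 65534  # conventional anonymous uid/gid, overridable via anonuid/anongid

def squash(uid, gid, root_squash=False, all_squash=False,
           anonuid=NOBODY, anongid=NOBODY):
    """Map an incoming client uid/gid per the NFS-style options:
    root_squash remaps only uid/gid 0; all_squash remaps everyone."""
    if all_squash or (root_squash and uid == 0):
        uid = anonuid
    if all_squash or (root_squash and gid == 0):
        gid = anongid
    return uid, gid
```

So squash(0, 0, root_squash=True) yields the anonymous identity, while non-root users pass through untouched unless all_squash is set.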
Re: Slow file creating and deleting using bonnie ++ on Hammer
On 22/05/2015 16:25, Barclay Jameson wrote:
> The Bonnie++ job _FINALLY_ finished. If I am reading this correctly it took days to create, stat, and delete 16 files??
> [root@blarg cephfs]# ~/bonnie++-1.03e/bonnie++ -u root:root -s 256g -r 131072 -d /cephfs/ -m CephBench -f -b
> Using uid:0, gid:0.
> Writing intelligently...done
> Rewriting...done
> Reading intelligently...done
> start 'em...done...done...done...
> Create files in sequential order...done.
> Stat files in sequential order...done.
> Delete files in sequential order...done.
> Create files in random order...done.
> Stat files in random order...done.
> Delete files in random order...done.
> Version 1.03e  --Sequential Output-- --Sequential Input- --Random-
>                -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> Machine   Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
> CephBench 256G           1006417 76  90114 13            137110  8 329.8  7
>                --Sequential Create-- --Random Create--
>                -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
> files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>    16     0   0     + +++     0   0     0   0  5267  19     0   0
> CephBench,256G,,,1006417,76,90114,13,,,137110,8,329.8,7,16,0,0,+,+++,0,0,0,0,5267,19,0,0
> Any thoughts?
It's 16000 files by default (not 16), but this usually takes only a few minutes. FWIW I tried running a quick bonnie++ (with -s 0 to skip the IO phase) on a development (vstart.sh) cluster with a fuse client, and it readily handles several hundred client requests per second (checked with ceph daemonperf mds.id). Nothing immediately leapt out at me from a quick look at the log you posted, but with issues like these it is always worth trying to narrow it down by trying the fuse client instead of the kernel client, and/or different kernel versions. You may also want to check that your underlying RADOS cluster is performing reasonably by doing a rados bench too.
Cheers, John -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: negative pool stats (was: Bug?)
On 10/04/2015 19:27, Barclay Jameson wrote:
> Watching osd pool stats (watch --interval=.5 -d 'ceph osd pool stats') while restarting 1/3 of my OSDs gives some odd numbers:
> pool cephfs_data id 1
>   -768/9 objects degraded (-8533.333%)
>   recovery io 18846 B/s, 1 objects/s
>   client io 15356 B/s wr, 2 op/s
> pool cephfs_metadata id 2
>   -1/0 objects degraded (-inf%)
Negative stats are: http://tracker.ceph.com/issues/7737 The fix appears to have just missed 0.94, but should be in the next stable releases.
John
-- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RFC: progress bars
Hi all,
[this is a re-send of a mail from yesterday that didn't make it, probably due to an attachment]
It has always annoyed me that we don't provide a simple progress bar indicator for things like the migration of data from an OSD when it's marked out, the rebalance that happens when we add a new OSD, or scrubbing the PGs on an OSD. I've experimented a bit with adding user-visible progress bars for some of the simple cases (screenshot at http://imgur.com/OaifxMf). The code is here:
https://github.com/ceph/ceph/blob/wip-progress-events/src/mon/ProgressEvent.cc
This is based on a series of ProgressEvent classes that are instantiated when certain things happen, like marking an OSD in or out. They provide an init() hook that captures whatever state is needed at the start of the operation (generally noting which PGs are affected) and a tick() hook that checks whether the affected PGs have reached their final state.
Clearly, while this is simple for the simple cases, there are lots of instances where things will overlap: a PG can get moved again while it's being backfilled following a particular OSD going out. These progress indicators don't have to capture that complexity, but the goal would be to make sure they did complete eventually rather than getting stuck/confused in those cases.
This is just a rough cut to play with the idea, there's no persistence of the ProgressEvents, and the init()/tick() methods are peppered with correctness issues. Still, it gives a flavour of how we could add something friendlier like this to expose simplified progress indicators.
Ideas for further work:
* Add in an MDS handler to capture the progress of an MDS rank as it goes through replay/reconnect/clientreplay
* A handler for overall cluster restart, that noticed when the mon quorum was established and all the map timestamps were some time in the past, and then generated progress based on OSDs coming up and PGs peering.
* Simple: a handler for PG creation after pool creation
* Generate estimated completion times from the rate of progress so far
* Friendlier PGMap output, by hiding all PG states that are explained by an ongoing ProgressEvent, to only indicate low-level PG status for things that the ProgressEvents don't understand.
Cheers,
John
-- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Advice for implementation of LED behavior in Ceph ecosystem
On 01/04/2015 22:17, John Spray wrote: Once you have found the block device and reported it in the OSD metadata, you can use that information to go poke its LEDs using enclosure services hooks as you suggest, and wrap that in an OSD 'tell' command (OSD::do_command). In a similar vein to finding the block device, it would be a good thing to have a config option here so that admins can optionally specify a custom command for flashing a particular OSD's LED. Admins might not bother setting that, but it would mean a system integrator could optionally configure ceph to work with whatever exotic custom stuff they have. One more thought occurs to me -- one of the main cases where you'd want to flash an LED would be to identify the drive of an OSD that is down/out due to a dead drive. In that instance, the ceph-osd process wouldn't actually be running, so you wouldn't be able to send it the 'tell' to flash the LED. I guess in this interesting case you could either: * Allow other OSDs on the same host to handle the 'tell blink' command for the dead OSD's drive * Leave this to calamari/whoever to read the dead OSD's block device path from ceph osd metadata, and go blink the LEDs themselves. Cheers, John -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Advice for implementation of LED behavior in Ceph ecosystem
On 01/04/2015 22:57, Mark Nelson wrote:
> It seems to me that the OSD potentially would flash the LED on its way down if it thinks its drive is dead/dying?
That's a good idea for the case where ceph-osd is proactively identifying a failing drive. I'm also thinking about the case where we come back from a reboot and a drive is sufficiently unreadable that ceph-disk doesn't see the OSD partitions and ceph-osd never gets started, or the OSD's local filesystem is unmountable. Because the keyring lives on that local filesystem, OSDs couldn't phone home in that case, even to report a failure.
John
-- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Advice for implementation of LED behavior in Ceph ecosystem
On 01/04/2015 23:04, John Spray wrote:
> On 01/04/2015 22:57, Mark Nelson wrote:
>> It seems to me that the OSD potentially would flash the LED on its way down if it thinks its drive is dead/dying?
> That's a good idea for the case where ceph-osd is proactively identifying a failing drive. I'm also thinking about the case where we come back from a reboot and a drive is sufficiently unreadable that ceph-disk doesn't see the OSD partitions and ceph-osd never gets started, or the OSD's local filesystem is unmountable. Because the keyring lives on that local filesystem, OSDs couldn't phone home in that case, even to report a failure.
Sorry, mental lapse: we're not talking about phoning home, we're talking about flashing the LED. So perhaps ceph-disk itself could be modified to flash an LED on a drive if it has a GPT partition ID for a ceph osd but we can't mount it or start an OSD service.
John
-- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Advice for implementation of LED behavior in Ceph ecosystem
On 01/04/2015 19:56, Handzik, Joe wrote: 1. Stick everything in Calamari via Salt calls similar to what Gregory is showing. I have concerns about this, I think I'd still need extra information from the OSDs themselves. I might need to implement the first half of option #2 anyway. 2. Scatter it across the codebases (would probably require changes in Ceph, Calamari, and Calamari-clients). Expose the storage target data via the OSDs, and move that information upward via the RESTful API. Then, expose another RESTful API behavior that allows a user to change the LED state. Implementing as much as possible in the Ceph codebase itself has an added benefit (as far as I see it, at least) if someone ever decides that the fault LED should be toggled on based on the state of the OSD or backing storage device. It should be easier for Ceph to hook into that kind of functionality if Calamari doesn't need to be involved. Dan mentioned something I thought about too...not EVERY OSD's backing storage is going to be able to use this (Kinetic drives, NVDIMMs, M.2, etc etc), I'd need to implement some way to filter devices and communicate via the Calamari GUI that the device doesn't have an LED to toggle or doesn't understand SCSI Enclosure Services (I'm targeting industry standard HBAs first, and I'll deal with RAID controllers like Smart Array later). I'm trying to get this out there early so anyone with particularly strong implementation opinions can give feedback. Any advice would be appreciated! I'm still new to the Ceph source base, and probably understand Calamari and Calamari-clients better than Ceph proper at the moment. Similar to Mark's comment, I would lean towards option 2 -- it would be great to have a CLI-driven ability to flash the LEDs for an OSD, and work on integrating that with a GUI afterwards. 
Currently the OSD metadata on drives is pretty limited, it'll just tell you the /var/lib/ceph/osd/ceph-X path for the data and journal -- the task of resolving that to a physical device is left as an exercise to the reader, so to speak. I would suggest extending osd metadata to also report the block device, but only for the simple case where an OSD is a GPT partition on a raw /dev/sdX block device. Resolving block device to underlying disks in configurations like LVM/MDRAID/multipath is complex in the general case (I've done it, I don't recommend it), and most ceph clusters don't use those layers. You could add a fallback ability for users to specify their block device in ceph.conf, in case the simple GPT-assuming OSD probing code can't find it from the mount point. Once you have found the block device and reported it in the OSD metadata, you can use that information to go poke its LEDs using enclosure services hooks as you suggest, and wrap that in an OSD 'tell' command (OSD::do_command). In a similar vein to finding the block device, it would be a good thing to have a config option here so that admins can optionally specify a custom command for flashing a particular OSD's LED. Admins might not bother setting that, but it would mean a system integrator could optionally configure ceph to work with whatever exotic custom stuff they have. Hopefully that's some help, it sounds like you've already thought it through a fair bit anyway. Cheers, John -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
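A minimal sketch of the "simple case" lookup described above, resolving a data path to its backing device by longest-prefix match over the mount table (my illustration; the real probing would live in the OSD's C++ metadata collection and handle the GPT specifics):

```python
def block_device_for(path, mounts_lines=None):
    """Best-effort: find the device backing the filesystem containing
    path, via longest-prefix match over /proc/mounts entries.
    Deliberately does not chase LVM/MDRAID/multipath layers, per the
    discussion above."""
    if mounts_lines is None:
        with open("/proc/mounts") as f:
            mounts_lines = f.readlines()
    best_mnt, best_dev = "", None
    for line in mounts_lines:
        fields = line.split()
        if len(fields) < 2:
            continue
        dev, mnt = fields[0], fields[1]
        inside = path == mnt or path.startswith(mnt.rstrip("/") + "/")
        if inside and len(mnt) > len(best_mnt):
            best_mnt, best_dev = mnt, dev
    return best_dev
```

A ceph.conf fallback, as suggested, would simply override whatever this probe returns.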
Re: adding {mds,mon} metadata asok command
On 24/03/2015 10:11, Joao Eduardo Luis wrote: I don't think people change hostnames for sport Sounds interesting, I might buy tickets to a game :-D John -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: adding {mds,mon} metadata asok command
On 23/03/2015 10:04, Joao Eduardo Luis wrote: I agree. And I don't think we need a new service for this, and I also don't think we need to write stuff to the store. We can generate this information when the monitor hits 'bootstrap()' and share it with the rest of the quorum once an election finishes, and always keep it in memory (unless there's some information that needs to be persisted, but I was under the impression that was not the case). Just to clarify, you mean we don't need to write the mon metadata to the store, but we'd still want to persist the MDS/OSD metadata - right? John -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: adding {mds,mon} metadata asok command
On 20/03/2015 05:39, kefu chai wrote:
> to pave the road to http://tracker.ceph.com/issues/10904, where we need to add a command to list the hostname of nodes in a ceph cluster, i would like to add the {mds,mon} metadata commands to print the system information including, but not limited to, hostname, mem_{total,swap}_kb, and distro info, of a specified mds and mon. the implementation follows the mechanism of osd metadata. on the mds side i would like to reuse the MDSMonitor service:
> 1. piggy-back a map for the metadata in the MMDSBeacon message,
> 2. put the metadata into the same DBStore transaction but with another prefix when storing the pending inc into local storage,
> 3. and expose it using mds metadata and later on service ls (not sure about the name ...)
> @greg and @zyan, are you good with this? not sure this will overburden the mds or not. i will use uname(2) and grep /proc/meminfo to get the metadata in the same way as the OSD.
It should be straightforward to include the metadata in MMDSBeacon only once per daemon lifetime, by checking if the state is CEPH_MDS_STATE_BOOT -- that way we don't have to worry about any ongoing costs. I expect that change can live entirely in Beacon.cc without touching any other MDS code. As for the means of getting the information, I expect the generic kernel/mem/cpu/distro stuff from OSD::_collect_metadata can be moved up into common/ somewhere and reused as-is from mon+mds.
John
-- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
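For illustration, the generic collection could look something like this (a Python sketch of what a shared helper might gather; the key names mimic the osd metadata style but are assumptions here):

```python
import os
import socket

def collect_sys_metadata():
    """Gather generic host metadata: uname fields, hostname, and memory
    totals from /proc/meminfo (skipped on non-Linux systems)."""
    uname = os.uname()
    md = {
        "hostname": socket.gethostname(),
        "os": uname.sysname,
        "kernel_version": uname.release,
        "arch": uname.machine,
    }
    try:
        with open("/proc/meminfo") as f:
            for line in f:
                key, _, rest = line.partition(":")
                if key == "MemTotal":
                    md["mem_total_kb"] = int(rest.split()[0])
                elif key == "SwapTotal":
                    md["mem_swap_kb"] = int(rest.split()[0])
    except OSError:
        pass  # /proc/meminfo is Linux-only
    return md
```

Collecting this once at daemon startup (e.g. alongside the BOOT beacon) keeps the ongoing cost at zero, matching the once-per-lifetime suggestion above.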
Persistence of completed_requests in sessionmap (do we need it?)
Zheng noticed on my new sessionmap code [1] that sessions weren't getting dirtied on trim_completed_requests. I had missed that, because I was only updating the places that we already incremented the sessionmap version while modifying something. I went and looked at how this worked in the existing code, and it appears that we don't actually bother persisting updates to the sessionmap if completed_requests is the only thing that changed. We would *tend* to persist it as a consequence to other session updates like prealloc_inos, but if one is simply issuing lots of metadata updates to existing files in a loop, the sessionmap never gets written back (even when expiring log segments). During replay, we rebuild completed_requests from EMetaBlob::replay, and we've made it this far without reliably persisting it in sessionmap, so I wonder if we ever needed to save this at all? Thoughts? Cheers, John 1. https://github.com/ceph/ceph/pull/3718 -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ceph versions
On 26/02/2015 23:12, Sage Weil wrote:
> Hammer will most likely be v0.94[.x]. We're getting awfully close to 0.99, though, which makes many people think 1.0 or 1.00 (instead of 0.100), and the current versioning is getting a bit silly. So let's talk about alternatives!
I'm late to this thread, but... I find option B preferable, because it puts the most important information (which series, and is it stable) within the X.Y part that people will typically use in normal speech. In an ideal world option D, but I have found historically that it gets very confusing to have multiple releases with the same number, differentiated only by a trailing -dev/-rc. Folks are prone to drop the critical trailing qualifier when they say "I'm running 1.2". However, option D would be my second choice to option B, as it's the most explicit of the alternatives.
As for the others...
Option A (doubles and triples) has a similar abbreviation problem, that people will say "I'm running 1.2" whether they were running a 1.2 dev or a 1.2.1 stable.
Option C (semantic) is nice for APIs, but for software releases will confuse ordinary humans who associate big version jumps with big features etc.
Option E differentiates between stable/dev with double/triple, which means it too has the abbreviation problem when spoken about colloquially.
Option F is confusing because it requires the reader to differentiate between 8 (not 0!) and 9 (point 0) to see stability.
Cheers, John
> Here are a few options:
-- Option A -- doubles and triples
X.Y[.Z]
- Increment X at the start of each major dev cycle (hammer, infernalis)
- Increment Y for each iteration during that cycle
- Eventually decide it's good and start adding .Z for the stable fixes.
For example,
1.0 first infernalis dev release
1.1 dev release
...
1.8 infernalis rc
1.9 infernalis final
1.9.1 stable update
1.9.2 stable update
...
2.0 first j (jewel?) dev release
2.1 next dev release
...
2.8 final j 2.8.x stable j releases Q: How do I tell if it's a stable release? A: It is a triple instead of a double. Q: How do I tell if this is the final release in the series? A: Nobody knows that until we start doing stable updates; see above. -- Option B -- even/odd X.Y.Z - if Y is even, this is a stable series - if Y is odd, this is a dev release - increment X when something major happens 1.0 hammer final 1.0.1 stable/bugfix 1.0.2 stable 1.0.3 stable ... 1.1.0 infernalis dev release 1.1.1 infernalis dev release 1.1.2 infernalis dev release ... 1.2.0 infernalis final 1.2.1 stable branch ... 1.3.0 j-release dev 1.3.1 j-release dev 1.3.2 j-release dev ... 1.4.0 j-release final 1.4.1 stable 1.4.1 stable Q: How do I tell if it's a stable release? A: Second item is even. -- Option C -- semantic major.minor.patch - MAJOR version when you make incompatible API changes, - MINOR version when you add functionality in a backwards-compatible manner, and - PATCH version when you make backwards-compatible bug fixes. 1.0.0 hammer final 1.0.1 bugfix 1.0.1 bugfix 1.0.1 bugfix 1.1 infernalis dev release 1.2 infernalis dev release 2.0 infernalis dev release 2.1 infernalis dev release 2.2 infernalis dev release 2.3 infernalis final 2.3.0 bugfix 2.3.1 bugfix 2.3.2 bugfix 2.4 j dev release ... 2.14 j final 2.14.0 bugfix 2.14.1 bugfix ... 2.15 k dev .. 3.3 k final 3.3.1 bugfix ... Q: How do I tell what named release series this is? A: As with the others, you just have to know. Q: How do we distinguish between stable-series updates and dev updates? A: Stable series are triples. Q: How do I know if I can downgrade? A: The major hasn't changed. Q: Really? A: Well, maybe. We haven't dealt with downgrades yet so this assumes we get it right (and test it). We may not realize there is a backward-incompatible change right away and only discover it later during testing, at which point the versions are fixed; we'd probably bump the *next* release in response. 
-- Option D -- labeled X.Y-{dev,rc,release}Z - Increment Y on each major named release - Increment X if it's a major major named release (bigger change than usual) - Use dev, rc, or release prefix to clearly label what type of release this is - Increment Z for stable updates 1.0-dev1 first infernalis dev release 1.0-dev2 another dev release ... 1.0-rc1 first rc 1.0-rc2 next rc 1.0-release1 final release 1.0-release2 stable update 1.0-release3 stable update 1.1-dev1 first cut for j-release 1.1-dev2 ... ... 1.1-rc1 1.1-release1 stable 1.1-release2 stable 1.1-release3 stable Q: How do I tell what kind of release this is? A: Look at the string embedded in the version Q: Will these funny strings confuse things that sort by version? A: I don't think so. -- Option E -- ubuntu YY.MM[.Z] - YY is year, MM is month of release - Z for stable updates 15.03 hammer
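To make the even/odd rule in Option B concrete, a version classifier could look like this (a hypothetical helper, purely to illustrate the scheme):

```python
def classify(version):
    """Classify a version string under Option B: even second
    component means a stable series, odd means a dev release."""
    parts = [int(p) for p in version.split(".")]
    return "stable" if parts[1] % 2 == 0 else "dev"

assert classify("1.0.1") == "stable"   # hammer stable update
assert classify("1.1.2") == "dev"      # infernalis dev release
assert classify("1.2.0") == "stable"   # infernalis final
```
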
Re: MDS crashes (80.8)
On 26/02/2015 15:26, Wyllys Ingersoll wrote: Trying to run ceph-mds on a freshly installed firefly cluster with no ceph FS created yet. It consistently crashes upon startup. Below is debug output showing the point of the crash. Something is obviously misconfigured or broken but I'm at a loss as to where the issue would be. Any ideas? The MDS thinks that its on-disk tables should already exist (the MDSMap structure on the mons is telling it so), but when it goes to read them from RADOS it's finding that they aren't there. This is reminiscent of http://tracker.ceph.com/issues/7485, although I just checked and the v0.80.8 tag does include the fix for that. Please could you see if you have the MDS log from the very first time it ran? There is probably a clue there. Once you have recovered that and any other interesting diagnostic information, you can try to get out of this bad state by using ceph mds newfs to reset the MDSMap. Thanks, John
Re: MDS crashes (80.8)
On 26/02/2015 17:19, Wyllys Ingersoll wrote: OK, attached is the initial log, or at least the earliest log I can find. Ah, now that I look more closely at the backtrace, I realise that creation succeeded, but it is now failing on subsequent runs because it can't find the metadata pool. I guess you probably deleted it at some point between Monday and today. You will need to create yourself some filesystem pools (metadata and data) before using newfs to reset your filesystem. Cheers, John
Re: MDS crashes (80.8)
On 26/02/2015 17:58, Wyllys Ingersoll wrote: Yeah, I noticed that too, so I recreated both of those pools and it still won't start. It crashes in a different place now, but still won't start, even after running 'newfs'. Attached is the debug log output when I start ceph-mds ... common/Thread.cc: In function 'int Thread::join(void**)' thread 7f0127612700 time 2015-02-26 07:55:09.734802 common/Thread.cc: 141: FAILED assert(status == 0) ceph version 0.80.8 (69eaad7f8308f21573c604f121956e64679a52a7) 1: (Thread::detach()+0) [0x8c30d0] 2: (MonClient::shutdown()+0x50f) [0x881d5f] 3: (MDS::suicide()+0xe5) [0x576285] 4: (MDLog::handle_journaler_write_error(int)+0x5f) [0x7aab2f] 5: (Context::complete(int)+0x9) [0x56d9a9] 6: (Journaler::handle_write_error(int)+0x5e) [0x7b949e] 7: (Journaler::_finish_write_head(int, Journaler::Header, Context*)+0x306) [0x7b9946] 8: (Context::complete(int)+0x9) [0x56d9a9] 9: (Objecter::check_op_pool_dne(Objecter::Op*)+0x214) [0x7ce6a4] 10: (Objecter::C_Op_Map_Latest::finish(int)+0x124) [0x7cea04] 11: (Context::complete(int)+0x9) [0x56d9a9] 12: (Finisher::finisher_thread_entry()+0x1b8) [0x9aced8] 13: (()+0x8182) [0x7f012db95182] 14: (clone()+0x6d) [0x7f012c50befd] NOTE: a copy of the executable, or `objdump -rdS executable` is needed to interpret this. 2015-02-26 07:55:09.736256 7f0127612700 -1 common/Thread.cc: In function 'int Thread::join(void**)' thread 7f0127612700 time 2015-02-26 07:55:09.734802 common/Thread.cc: 141: FAILED assert(status == 0) That is still looking like a missing pool (the check_op_pool_dne frame in the trace). Can you post ceph osd dump and ceph mds dump? John
Re: MDS crashes (80.8)
On 26/02/2015 18:07, Wyllys Ingersoll wrote: Here is 'ceph df', followed by mds dump and osd dump Your MDS map is trying to use pool 0 for both data and metadata. Firstly, these should be different pools. Secondly, you have no pool 0. You do have metadata and data pools with ids 6 and 7, so you should be typing ceph mds newfs 6 7. Note that this situation can't occur on more recent versions of Ceph, where the filesystem-pool relations are enforced strictly. If you are intent on using the filesystem, you should consider using more recent versions to benefit from the numerous other filesystem bugfixes. Cheers, John $ ceph df GLOBAL: SIZE AVAIL RAW USED %RAW USED 1307T 1302T5039G 0.38 POOLS: NAME ID USED %USED MAX AVAIL OBJECTS ks3backup 3 160M 0 0 9 test 4 0 0 0 0 TEST 5 2501G 0.19 0 320263 metadata 6 0 0 0 0 data 7 0 0 0 0 rbd 8 0 0 0 0 $ ceph mds dump dumped mdsmap epoch 220 epoch 220 flags 0 created 2015-02-26 07:43:23.410584 modified 2015-02-26 07:55:28.817383 tableserver 0 root 0 session_timeout 60 session_autoclose 300 max_file_size 1099511627776 last_failure 0 last_failure_osd_epoch 867 compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap} max_mds 1 in 0 up {0=20039} failed stopped data_pools 0 metadata_pool 0 inline_data disabled 20039: 10.2.3.33:6800/5098 'mdc03' mds.0.91 up:creating seq 1 laggy since 2015-02-26 07:55:28.817341 $ ceph osd dump epoch 872 fsid e73bfab8-01f1-4534-9a66-1b425d5c3341 created 2015-02-23 07:53:05.626725 modified 2015-02-26 08:01:48.834331 flags pool 3 'ks3backup' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 512 pgp_num 512 last_change 251 flags hashpspool stripe_width 0 pool 4 'test' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 512 pgp_num 512 last_change 363 flags hashpspool stripe_width 0 pool 5 'TEST' replicated size 2 min_size 1 
crush_ruleset 0 object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 473 flags hashpspool stripe_width 0 pool 6 'metadata' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 808 flags hashpspool stripe_width 0 pool 7 'data' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 810 flags hashpspool stripe_width 0 pool 8 'rbd' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 512 pgp_num 512 last_change 833 flags hashpspool stripe_width 0 max_osd 242 osd.0 up in weight 1 up_from 774 up_thru 833 down_at 748 last_clean_interval [692,747) 10.2.2.109:6851/6280 10.3.2.109:6834/6280 10.3.2.109:6835/6280 10.2.2.109:6852/6280 exists,up 43999652-2966-4b1b-a8f0-c33ef5db0970 osd.1 up in weight 1 up_from 648 up_thru 833 down_at 630 last_clean_interval [598,629) 10.2.3.105:6857/6775 10.3.3.105:6838/6775 10.3.3.105:6839/6775 10.2.3.105:6858/6775 exists,up 1e4b6fa3-044c-40b8-9f5b-968bdfae5dd7 osd.2 up in weight 1 up_from 627 up_thru 833 down_at 619 last_clean_interval [455,618) 10.2.1.108:6848/5782 10.3.1.108:6832/5782 10.3.1.108:6833/5782 10.2.1.108:6849/5782 exists,up 95000872-72f2-46e8-8d4a-f9b1422032cb osd.3 up in weight 1 up_from 773 up_thru 833 down_at 752 last_clean_interval [693,751) 10.2.2.110:6833/4189 10.3.2.110:6822/4189 10.3.2.110:6823/4189 10.2.2.110:6834/4189 exists,up 980182ab-51bf-4cd9-a9a4-cfca05123592 osd.4 up in weight 1 up_from 732 up_thru 833 down_at 722 last_clean_interval [718,721) 10.2.1.101:6806/2789 10.3.1.101:6804/2789 10.3.1.101:6805/2789 10.2.1.101:6807/2789 exists,up 92a8fe48-082d-4ad1-93ba-5a4f9598d866 osd.5 up in weight 1 up_from 766 up_thru 810 down_at 739 last_clean_interval [663,738) 10.2.3.111:6800/4353 10.3.3.111:6800/4353 10.3.3.111:6801/4353 10.2.3.111:6801/4353 exists,up d6539347-2c2a-4fba-867d-74c84b2018ad osd.6 up in weight 1 up_from 781 up_thru 833 down_at 760 last_clean_interval [711,759) 10.2.2.104:6830/4317 
10.3.2.104:6820/4317 10.3.2.104:6821/4317 10.2.2.104:6831/4317 exists,up 55b05a0f-2b17-44e2-84c5-78db81ea862b osd.7 up in weight 1 up_from 639 up_thru 833 down_at 633 last_clean_interval [608,632) 10.2.3.106:6842/4762 10.3.3.106:6828/4762 10.3.3.106:6829/4762 10.2.3.106:6843/4762 exists,up 6a62d4fc-487c-4b39-a64a-823d675969a9 osd.8 up in weight 1 up_from 769 up_thru 833 down_at 744 last_clean_interval [671,743) 10.2.3.112:6845/5168 10.3.3.112:6830/5168 10.3.3.112:6831/5168 10.2.3.112:6846/5168 exists,up 9c4202e6-7749-4436-adb1-9c8bf1dc7f04 osd.9 up in weight 1 up_from 734
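Putting John's advice together, the recovery sequence for this cluster would look roughly as follows (pool names and ids are taken from the dump above; the --yes-i-really-mean-it guard may or may not be required depending on the exact 0.80.x build, so this is a sketch rather than a verified transcript):

```shell
# Create dedicated filesystem pools if they don't already exist
ceph osd pool create metadata 1024
ceph osd pool create data 1024

# Confirm their pool ids (6 and 7 in this cluster)
ceph osd dump | grep -E "pool (6|7) "

# Point the MDS map at them: newfs <metadata-id> <data-id>
ceph mds newfs 6 7 --yes-i-really-mean-it
```
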
Re: MDS crashes (80.8)
On 26/02/2015 18:27, Wyllys Ingersoll wrote: Excellent, thanks. It starts now without crashing, but I see lots of errors like this: 2015-02-26 08:24:19.134693 7f78091aa700 0 cephx: verify_reply couldn't decrypt with error: error decoding block for decryption 2015-02-26 08:24:19.134695 7f78091aa700 0 -- 10.2.3.33:6800/5236 >> 10.2.2.104:6857/8883 pipe(0x26f5700 sd=33 :50680 s=1 pgs=0 cs=0 l=1 c=0x280b1e0).failed verifying authorize reply 2015-02-26 08:24:19.134717 7f7809bb4700 0 cephx: verify_reply couldn't decrypt with error: error decoding block for decryption 2015-02-26 08:24:19.134719 7f7809bb4700 0 -- 10.2.3.33:6800/5236 >> 10.2.3.111:6848/4652 pipe(0x26f4300 sd=23 :58785 s=1 pgs=0 cs=0 l=1 c=0x2694c00).failed verifying authorize reply 2015-02-26 08:24:19.134761 7f78093ac700 0 cephx: verify_reply couldn't decrypt with error: error decoding block for decryption 2015-02-26 08:24:19.134764 7f78093ac700 0 -- 10.2.3.33:6800/5236 >> 10.2.3.105:6842/5408 pipe(0x26f5c00 sd=30 :38240 s=1 pgs=0 cs=0 l=1 c=0x280a000).failed verifying authorize reply It would appear that something is wrong with your deployment, such as missing keys. You might want to take this issue to ceph-users to see if anyone can help with whatever system you were using to deploy these services. John
Negative stats on lab cluster
The lab cluster hosting teuthology logs is currently exhibiting negative statistics (#5884, #7737). Could be a good time for someone with more low-level RADOS expertise than me to take a look at the status and see if they can work out why it's happening. Cheers, John
Re: [ceph-calamari] disk failure prediction
On 18/02/2015 23:20, Sage Weil wrote: We wouldn't see quite the same results since our raid sets are effectively entire pools I think we could do better than pool-wide, e.g. if multiple drives in one chassis are at risk (where a PG stores at most one copy per chassis), we can identify that as less severe than the general case where multiple at-risk drives might be in the same PG. Making it CRUSH-aware like this would be a good hook for users to take advantage of the ceph/calamari SMART monitoring rather than rolling their own. John
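A rough sketch of the CRUSH-aware idea, with made-up structures rather than real CRUSH code: if every at-risk drive sits in a single chassis and placement guarantees at most one replica per chassis, no PG can lose more than one copy, so the situation is less severe.

```python
def risk_severity(at_risk_osds, osd_to_chassis, one_copy_per_chassis=True):
    """Classify the severity of a set of at-risk drives.

    Hypothetical helper: osd_to_chassis maps osd id -> chassis name,
    and one_copy_per_chassis says whether the CRUSH rule places at
    most one replica of a PG per chassis.
    """
    chassis = {osd_to_chassis[o] for o in at_risk_osds}
    if len(at_risk_osds) <= 1:
        return "low"    # losing one drive is what replication is for
    if one_copy_per_chassis and len(chassis) == 1:
        return "low"    # all at-risk drives share one failure domain
    return "high"       # at-risk drives could back the same PG

topology = {0: "chassis-a", 1: "chassis-a", 2: "chassis-b"}
assert risk_severity({0, 1}, topology) == "low"
assert risk_severity({0, 2}, topology) == "high"
```
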
ceph-fuse remount issues
Background: a while ago, we found (#10277) that the existing cache expiration mechanism wasn't working with latest kernels. We used to invalidate the top level dentries, which caused fuse to invalidate everything, but an implementation detail in fuse caused it to start ignoring our repeated invalidate calls, so this doesn't work any more. To persuade fuse to dirty its entire metadata cache, Zheng added in a system() call to mount -o remount after we expire things from our client side cache. However, this was a bit of a hack and has created problems: * You can't call mount -o remount unless you're root, so we are less flexible than we used to be (#10542) * While the remount is happening, unmounts sporadically fail and the fuse process can become unresponsive to SIGKILL (#10916) The first issue was maybe an acceptable compromise, but the second issue is just painful, and it seems like we might not have seen the last of the knock-on effects -- upstream maintainers certainly aren't expecting filesystems to remount themselves quite so frequently. We probably have an opportunity to get something upstream in fuse to support a direct call to trigger the invalidation we want, if we can work out what that should look like. Thoughts? John
Re: Standardization of perf counters comments
On Wed, Feb 11, 2015 at 6:02 PM, Gregory Farnum g...@gregs42.com wrote: On Wed, Feb 11, 2015 at 9:33 AM, Alyona Kiseleva akisely...@mirantis.com wrote: Hi, I would like to propose something. There are a lot of perf counters in different places in the code, but most of them are undocumented. I found only one commented counter in the OSD.cc code, and not for all metrics. The name of a counter is not as clear as a description would be, and sometimes isn't clear at all. So I have an idea: it would be great if the perf schema contained not only the counter type, but some description too. It could be added in the PerfCountersBuilder methods -- at first as an optional parameter with an empty string by default, later, maybe, as a required parameter. This short description could be saved in the perf_counter_data_any_d struct together with other counter properties and appear in the perf schema as the second counter property. There will be lots of counters that aren't usefully describable in a single sentence. That doesn't excuse them from being documented, but we should be careful to avoid generating useless tautological strings like num_strays_delayed - Number of strays that are currently delayed. For lots of things, an understandable definition will require some level of introduction of terms and concepts -- in my example, what's a stray? what does it mean for it to be delayed? While we shouldn't hold up the descriptions waiting for documentation that explains all the needed concepts, we should think about how that will fit together. Perhaps, rather than having a single string in the code, we should look to create a separate metadata file that allows richer RST docs for each setting, and verify that all settings are described during the docs build (i.e. a gitbuilder fail will tell us if someone added a setting without the associated documentation). That way, the short perfcounter descriptions could include hyperlinks to related concepts elsewhere in the docs.
John
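The "verify that all settings are described during the docs build" step could be as simple as the following sketch (a hypothetical build script, not existing Ceph tooling): collect the counter names from the code, load the description metadata file, and fail the build on any gap.

```python
def check_counter_docs(counters, documented):
    """Fail the docs build if any perf counter lacks a description.

    Hypothetical build step: 'counters' is the set of counter names
    extracted from the code, 'documented' maps name -> RST description
    loaded from a separate metadata file.
    """
    missing = sorted(c for c in counters if not documented.get(c, "").strip())
    if missing:
        raise SystemExit("undocumented perf counters: %s" % ", ".join(missing))

counters = {"mds.num_strays_delayed", "mds.request_rate"}
docs = {
    "mds.num_strays_delayed": "Strays whose purge is deferred; see :ref:`strays`.",
    "mds.request_rate": "Client metadata requests per second.",
}
check_counter_docs(counters, docs)  # passes silently when everything is documented
```
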
Re: Ceph bindings for go docker
It would also be pretty interesting to do a CephFS driver here using subdirs as volumes. Just a thought: if anyone writes the RBD driver, could they put rbd in the name so that we can disambiguate them in future? Cheers, John On Mon, Feb 9, 2015 at 6:23 PM, Noah Watkins nwatk...@redhat.com wrote: Hi Loic, This sounds great. The librados bindings have good test coverage, but I merged a PR for RBD support a couple weeks ago and haven't had time to get it cleaned up and tests written. Do you need support for the AIO interface in librbd? -Noah - Original Message - From: Loic Dachary l...@dachary.org To: Noah Watkins noah.watk...@inktank.com Cc: Ceph Development ceph-devel@vger.kernel.org, Vincent Batts vba...@redhat.com, Johan Euphrosine pro...@google.com Sent: Monday, February 9, 2015 9:15:02 AM Subject: Ceph bindings for go docker Hi, I discovered https://github.com/noahdesu/go-ceph today :-) It would be useful in the context of a Ceph volume driver for docker ( see https://github.com/docker/docker/issues/10661 https://github.com/docker/docker/pull/8484 ). Are you a docker user by any chance ? -- Loïc Dachary, Artisan Logiciel Libre
Re: flock() on libcephfs ?
On Fri, Feb 6, 2015 at 11:39 AM, Xavier Roche roche+k...@exalead.com wrote: I'm not sure this is the right place to post, so do not hesitate to redirect me to a more appropriate list if necessary! You're in the right place! New to ceph, I naively attempted to add a flock (int ceph_flock(struct ceph_mount_info *cmount, int fd, int operation)) function in libcephfs, but I could not find the proper way to figure out what was the owner identifier used by Client::ll_flock() (an uint64_t integer) - this is not the pid, I presume? You should let the caller pass in the owner ID. For example, if you were using libcephfs to implement a filesystem interface, it would be up to that interface to work out the unique ID from the calling layer above it. So you can pass it through libcephfs, no logic needed. Thanks in advance for any hints! By the way - would this new feature (flock()) make sense? Definitely; the flock support in the userspace client is new, and the libcephfs bindings just didn't catch up yet: it was added in October: commit a1b2c8ff955b30807ac53ce6bdc97cf61a7262ca Author: Yan, Zheng z...@redhat.com Date: Thu Oct 2 19:07:41 2014 +0800 client: posix file lock support Signed-off-by: Yan, Zheng z...@redhat.com Cheers, John
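The "owner" notion matches what flock() itself does with open file descriptions: two separate open()s of the same file are distinct owners and contend for the lock, while a dup()ed fd shares one. A quick illustration with plain POSIX flock from Python (a libcephfs caller would pass an analogous owner id through ceph_flock):

```python
import fcntl
import os
import tempfile

path = tempfile.NamedTemporaryFile(delete=False).name
f1 = open(path, "w")
f2 = open(path, "w")  # separate open file description => separate lock owner

fcntl.flock(f1, fcntl.LOCK_EX)
try:
    # A different owner asking for the same exclusive lock fails...
    fcntl.flock(f2, fcntl.LOCK_EX | fcntl.LOCK_NB)
    conflict = False
except BlockingIOError:
    conflict = True
assert conflict

# ...until the first owner releases it.
fcntl.flock(f1, fcntl.LOCK_UN)
fcntl.flock(f2, fcntl.LOCK_EX | fcntl.LOCK_NB)  # now succeeds

f1.close(); f2.close(); os.unlink(path)
```
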
Re: CDS process
+1 I've always found the sometimes-populated-sometimes-not blueprint docs confusing. John On Thu, Feb 5, 2015 at 2:50 PM, Sage Weil sw...@redhat.com wrote: I wonder if we should simplify the cds workflow a bit to go straight to an etherpad outline of the blueprint instead of the wiki blueprint doc. I find it a bit disorienting to be flipping between the two, and after the fact find it frustrating that there isn't a single reference to go back to for the outcome of the session (you have to look at both the pad and the bp). Perhaps just using the pad from the get-go will streamline things a bit and make it a little more lightweight? What does everyone think? sage
Re: using ceph-deploy on build after make install
I suspect that your clue is Failed to execute command: /usr/sbin/service ceph -- as a rule, service scripts are part of the per-distro packaging rather than make install. Personally, if I'm installing systemwide I always start with some built packages, and if I need to substitute a home-built binary for debugging then I do so by directly overwriting it in /usr/bin/. John On Wed, Feb 4, 2015 at 12:58 AM, Deneau, Tom tom.den...@amd.com wrote: New to ceph building but here is my situation... I have been successfully able to build ceph starting from git checkout firefly (also successful from git checkout master). After building, I am able to run vstart.sh from the source directory as ./vstart.sh -d -n -x (or with -X). I can then do rados commands such as rados bench. I should also add that when I have installed binaries from rpms (this is a fedora21 aarch64 system), I have been successfully able to deploy a cluster using various ceph-deploy commands. Now I would like to do make install to install my built version and then use the installed version with my ceph-deploy commands. In this case I installed ceph-deploy with pip install ceph-deploy which gives me 5.21. The first ceph-deploy command I use is: ceph-deploy new myhost which seems to work fine. The next command however is ceph-deploy mon create-initial which ends up failing with [INFO ] Running command: /usr/sbin/service ceph -c /etc/ceph/ceph.conf start mon.hostname [WARNIN] The service command supports only basic LSB actions (start, stop, restart, try-restart, reload, force-reload, status). For other actions, please try to use systemctl.
[ERROR ] RuntimeError: command returned non-zero exit status: 2 [ERROR ] Failed to execute command: /usr/sbin/service ceph -c /etc/ceph/ceph.conf start mon.seattle-tdeneau [ERROR ] GenericError: Failed to create 1 monitors and even the ceph status command fails # ceph -c ./ceph.conf status -1 monclient(hunting): ERROR: missing keyring, cannot use cephx for authentication 0 librados: client.admin initialization error (2) No such file or directory Error connecting to cluster: ObjectNotFound Whereas this all worked fine when I used binaries from rpms. Is there some install step that I am missing? -- Tom Deneau, AMD
Re: [ceph-users] chattr +i not working with cephfs
On Wed, Jan 28, 2015 at 5:23 PM, Gregory Farnum g...@gregs42.com wrote: My concern is whether we as the FS are responsible for doing anything more than storing and returning that immutable flag — are we supposed to block writes to anything that has it set? That could be much trickier... The VFS layer is checking the flag for us, but some filesystems do have paths where they need to do their own checks too (e.g. XFS has various ioctls that do explicit checks). It's also up to us to publish the S_IMMUTABLE bit to the i_flags attribute of the generic inode, based on wherever/however we store the flag ourselves. Fuse doesn't seem to have a path for us to update i_flags though, so it might be that we either have to extend that interface or do the checking ourselves in userspace in order to support it there. John
Re: [ceph-users] chattr +i not working with cephfs
We don't implement the GETFLAGS and SETFLAGS ioctls used for +i. Adding the ioctls is pretty easy, but then we need somewhere to put the flags. Currently we don't store a flags attribute on inodes, but maybe we could borrow the high bits of the mode attribute for this if we wanted to implement it? CCing ceph-devel to see if Sage/Greg can offer any more background. John On Wed, Jan 28, 2015 at 1:24 AM, Eric Eastman eric.east...@keepertech.com wrote: Should chattr +i work with cephfs? Using ceph v0.91 and a 3.18 kernel on the CephFS client, I tried this: # mount | grep ceph 172.16.30.10:/ on /cephfs/test01 type ceph (name=cephfs,key=client.cephfs) # echo 1 > /cephfs/test01/test.1 # ls -l /cephfs/test01/test.1 -rw-r--r-- 1 root root 2 Jan 27 19:09 /cephfs/test01/test.1 # chattr +i /cephfs/test01/test.1 chattr: Inappropriate ioctl for device while reading flags on /cephfs/test01/test.1 I also tried it using the FUSE interface: # ceph-fuse -m 172.16.30.10 /cephfs/fuse01/ ceph-fuse[5326]: starting ceph client 2015-01-27 19:54:59.002563 7f6f8fbcb7c0 -1 init, newargv = 0x2ec2be0 newargc=11 ceph-fuse[5326]: starting fuse # mount | grep ceph ceph-fuse on /cephfs/fuse01 type fuse.ceph-fuse (rw,nosuid,nodev,allow_other,default_permissions) # echo 1 > /cephfs/fuse01/test02.dat # chattr +i /cephfs/fuse01/test02.dat chattr: Invalid argument while reading flags on /cephfs/fuse01/test02.dat Eric ___ ceph-users mailing list ceph-us...@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
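The "borrow the high bits of the mode attribute" idea could be sketched like this; the flag value and helpers are hypothetical illustrations, not an agreed on-disk format. POSIX type and permission bits fit in the low 16 bits of the mode, so a high bit of a 32-bit stored mode is free:

```python
POSIX_MODE_MASK = 0o177777        # low 16 bits: S_IFMT + permissions
CEPH_FLAG_IMMUTABLE = 1 << 31     # assumption: an otherwise-unused high bit

def set_immutable(stored_mode, immutable):
    if immutable:
        return stored_mode | CEPH_FLAG_IMMUTABLE
    return stored_mode & ~CEPH_FLAG_IMMUTABLE

def is_immutable(stored_mode):
    return bool(stored_mode & CEPH_FLAG_IMMUTABLE)

def posix_mode(stored_mode):
    # What we'd report to stat()/getattr: borrowed bits stripped out.
    return stored_mode & POSIX_MODE_MASK

mode = 0o100644                   # regular file, rw-r--r--
mode = set_immutable(mode, True)
assert is_immutable(mode)
assert posix_mode(mode) == 0o100644
```
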
Re: MDS aborted after recovery and active, FAILED assert (r >= 0)
Hmm, upgrading should help here, as the problematic data structure (anchortable) no longer exists in the latest version. I haven't checked, but hopefully we don't try to write it during upgrades. The bug you're hitting is more or less the same as a similar one we have with the sessiontable in the latest ceph, but you won't hit it there unless you're very unlucky! John On Fri, Jan 16, 2015 at 7:37 AM, Mohd Bazli Ab Karim bazli.abka...@mimos.my wrote: Dear Ceph-Users, Ceph-Devel, Apologize me if you get double post of this email. I am running a ceph cluster version 0.72.2 and one MDS (in fact, it's 3, 2 down and only 1 up) at the moment. Plus I have one CephFS client mounted to it. Now, the MDS always gets aborted after recovery and active for 4 secs. Some parts of the log are as below: -3 2015-01-15 14:10:28.464706 7fbcc8226700 1 -- 10.4.118.21:6800/5390 <== osd.19 10.4.118.32:6821/243161 73 osd_op_reply(3742 1000240c57e. [create 0~0,setxattr (99)] v56640'1871414 uv1871414 ondisk = 0) v6 221+0+0 (261801329 0 0) 0x7770bc80 con 0x69c7dc0 -2 2015-01-15 14:10:28.464730 7fbcc8226700 1 -- 10.4.118.21:6800/5390 <== osd.18 10.4.118.32:6818/243072 67 osd_op_reply(3645 107941c.
[tmapup 0~0] v56640'1769567 uv1769567 ondisk = 0) v6 179+0+0 (3759887079 0 0) 0x7757ec80 con 0x1c6bb00 -1 2015-01-15 14:10:28.464754 7fbcc8226700 1 -- 10.4.118.21:6800/5390 <== osd.47 10.4.118.35:6809/8290 79 osd_op_reply(3419 mds_anchortable [writefull 0~94394932] v0'0 uv0 ondisk = -90 (Message too long)) v6 174+0+0 (3942056372 0 0) 0x69f94a00 con 0x1c6b9a0 0 2015-01-15 14:10:28.471684 7fbcc8226700 -1 mds/MDSTable.cc: In function 'void MDSTable::save_2(int, version_t)' thread 7fbcc8226700 time 2015-01-15 14:10:28.46 mds/MDSTable.cc: 83: FAILED assert(r >= 0) ceph version () 1: (MDSTable::save_2(int, unsigned long)+0x325) [0x769e25] 2: (Context::complete(int)+0x9) [0x568d29] 3: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0x1097) [0x7c15d7] 4: (MDS::handle_core_message(Message*)+0x5a0) [0x588900] 5: (MDS::_dispatch(Message*)+0x2f) [0x58908f] 6: (MDS::ms_dispatch(Message*)+0x1e3) [0x58ab93] 7: (DispatchQueue::entry()+0x549) [0x975739] 8: (DispatchQueue::DispatchThread::entry()+0xd) [0x8902dd] 9: (()+0x7e9a) [0x7fbcccb0de9a] 10: (clone()+0x6d) [0x7fbccb4ba3fd] NOTE: a copy of the executable, or `objdump -rdS executable` is needed to interpret this. Is there any workaround/patch to fix this issue? Let me know if you need to see the log at a certain debug-mds level as well. Any help would be very much appreciated. Thanks. Bazli DISCLAIMER: This e-mail (including any attachments) is for the addressee(s) only and may be confidential, especially as regards personal data. If you are not the intended recipient, please note that any dealing, review, distribution, printing, copying or use of this e-mail is strictly prohibited. If you have received this email in error, please notify the sender immediately and delete the original message (including any attachments). MIMOS Berhad is a research and development institution under the purview of the Malaysian Ministry of Science, Technology and Innovation.
Opinions, conclusions and other information in this e-mail that do not relate to the official business of MIMOS Berhad and/or its subsidiaries shall be understood as neither given nor endorsed by MIMOS Berhad and/or its subsidiaries and neither MIMOS Berhad nor its subsidiaries accepts responsibility for the same. All liability arising from or in connection with computer viruses and/or corrupted e-mails is excluded to the fullest extent permitted by law.
Re: MDS aborted after recovery and active, FAILED assert (r >= 0)
It has just been pointed out to me that you can also work around this issue on your existing system by increasing the osd_max_write_size setting on your OSDs (default 90MB) to something higher, but still smaller than your osd journal size. That might get you on a path to having an accessible filesystem before you consider an upgrade. John

On Fri, Jan 16, 2015 at 10:57 AM, John Spray john.sp...@redhat.com wrote: Hmm, upgrading should help here, as the problematic data structure (anchortable) no longer exists in the latest version. I haven't checked, but hopefully we don't try to write it during upgrades. The bug you're hitting is more or less the same as a similar one we have with the sessiontable in the latest ceph, but you won't hit it there unless you're very unlucky! John

On Fri, Jan 16, 2015 at 7:37 AM, Mohd Bazli Ab Karim bazli.abka...@mimos.my wrote: Dear Ceph-Users, Ceph-Devel, Apologies if you get a double post of this email. I am running a ceph cluster version 0.72.2 and one MDS (in fact, it's 3: 2 down and only 1 up) at the moment. Plus I have one CephFS client mounted to it. Now, the MDS always gets aborted after recovery and being active for 4 secs. Some parts of the log are as below:

-3 2015-01-15 14:10:28.464706 7fbcc8226700 1 -- 10.4.118.21:6800/5390 <== osd.19 10.4.118.32:6821/243161 73 ==== osd_op_reply(3742 1000240c57e. [create 0~0,setxattr (99)] v56640'1871414 uv1871414 ondisk = 0) v6 ==== 221+0+0 (261801329 0 0) 0x7770bc80 con 0x69c7dc0
-2 2015-01-15 14:10:28.464730 7fbcc8226700 1 -- 10.4.118.21:6800/5390 <== osd.18 10.4.118.32:6818/243072 67 ==== osd_op_reply(3645 107941c. [tmapup 0~0] v56640'1769567 uv1769567 ondisk = 0) v6 ==== 179+0+0 (3759887079 0 0) 0x7757ec80 con 0x1c6bb00
-1 2015-01-15 14:10:28.464754 7fbcc8226700 1 -- 10.4.118.21:6800/5390 <== osd.47 10.4.118.35:6809/8290 79 ==== osd_op_reply(3419 mds_anchortable [writefull 0~94394932] v0'0 uv0 ondisk = -90 (Message too long)) v6 ==== 174+0+0 (3942056372 0 0) 0x69f94a00 con 0x1c6b9a0
0 2015-01-15 14:10:28.471684 7fbcc8226700 -1 mds/MDSTable.cc: In function 'void MDSTable::save_2(int, version_t)' thread 7fbcc8226700 time 2015-01-15 14:10:28.46
mds/MDSTable.cc: 83: FAILED assert(r >= 0)
ceph version ()
1: (MDSTable::save_2(int, unsigned long)+0x325) [0x769e25]
2: (Context::complete(int)+0x9) [0x568d29]
3: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0x1097) [0x7c15d7]
4: (MDS::handle_core_message(Message*)+0x5a0) [0x588900]
5: (MDS::_dispatch(Message*)+0x2f) [0x58908f]
6: (MDS::ms_dispatch(Message*)+0x1e3) [0x58ab93]
7: (DispatchQueue::entry()+0x549) [0x975739]
8: (DispatchQueue::DispatchThread::entry()+0xd) [0x8902dd]
9: (()+0x7e9a) [0x7fbcccb0de9a]
10: (clone()+0x6d) [0x7fbccb4ba3fd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Is there any workaround/patch to fix this issue? Let me know if you need to see the log with debug-mds at a certain level as well. Any help would be very much appreciated. Thanks. Bazli DISCLAIMER: This e-mail (including any attachments) is for the addressee(s) only and may be confidential, especially as regards personal data. If you are not the intended recipient, please note that any dealing, review, distribution, printing, copying or use of this e-mail is strictly prohibited. If you have received this email in error, please notify the sender immediately and delete the original message (including any attachments). MIMOS Berhad is a research and development institution under the purview of the Malaysian Ministry of Science, Technology and Innovation.
Opinions, conclusions and other information in this e-mail that do not relate to the official business of MIMOS Berhad and/or its subsidiaries shall be understood as neither given nor endorsed by MIMOS Berhad and/or its subsidiaries and neither MIMOS Berhad nor its subsidiaries accepts responsibility for the same. All liability arising from or in connection with computer viruses and/or corrupted e-mails is excluded to the fullest extent permitted by law. -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
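The workaround John describes corresponds to a ceph.conf fragment along these lines. The value of 150 MB is purely illustrative (an assumption, not from the thread): it must exceed the failing ~90 MB anchortable write while staying below the OSD journal size.

```ini
[osd]
; Raise the per-op write size cap (default 90 MB). Keep this below the
; configured OSD journal size or journaling will fail instead.
osd max write size = 150
```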
Re: New Defects reported by Coverity Scan for ceph
Hmm, maybe it's just because they're in a main() function? I notice that an exception handler was added to ceph_authtool.cc to handle the same coverity complaint there a few months ago. John

On Fri, Jan 16, 2015 at 3:17 PM, Gregory Farnum g...@gregs42.com wrote: Sage, are these uncaught assertion errors something we normally ignore? I'm not familiar with any code that tries to catch errors in our standard init patterns, which is what looks to be the problem on these new coverity issues in cephfs-table-tool. -Greg

On Fri, Jan 16, 2015 at 6:39 AM, scan-ad...@coverity.com wrote: Hi, Please find the latest report on new defect(s) introduced to ceph found with Coverity Scan. 4 new defect(s) introduced to ceph found with Coverity Scan. 19 defect(s), reported by Coverity Scan earlier, were marked fixed in the recent build analyzed by Coverity Scan.

New defect(s) Reported-by: Coverity Scan
Showing 4 of 4 defect(s)

** CID 1264457: Uncaught exception (UNCAUGHT_EXCEPT) /tools/cephfs/cephfs-table-tool.cc: 11 in main()
** CID 1264458: Uninitialized scalar field (UNINIT_CTOR) /test/librbd/test_ImageWatcher.cc: 47 in TestImageWatcher::WatchCtx::WatchCtx(TestImageWatcher &)()
** CID 1264459: Uninitialized scalar field (UNINIT_CTOR) /test/librbd/test_fixture.cc: 44 in TestFixture::TestFixture()()
** CID 1264460: Structurally dead code (UNREACHABLE) /common/sync_filesystem.h: 51 in sync_filesystem(int)()

*** CID 1264457: Uncaught exception (UNCAUGHT_EXCEPT)
/tools/cephfs/cephfs-table-tool.cc: 11 in main()
5 #include "common/errno.h"
6 #include "global/global_init.h"
7
8 #include "TableTool.h"
9
10
CID 1264457: Uncaught exception (UNCAUGHT_EXCEPT) In function main(int, char const **) an exception of type ceph::FailedAssertion is thrown and never caught.
11 int main(int argc, const char **argv)
12 {
13   vector<const char*> args;
14   argv_to_vec(argc, argv, args);
15   env_to_vec(args);
16

*** CID 1264458: Uninitialized scalar field (UNINIT_CTOR)
/test/librbd/test_ImageWatcher.cc: 47 in TestImageWatcher::WatchCtx::WatchCtx(TestImageWatcher &)()
41   NOTIFY_OP_REQUEST_LOCK = 2,
42   NOTIFY_OP_HEADER_UPDATE = 3
43 };
44
45 class WatchCtx : public librados::WatchCtx2 {
46 public:
CID 1264458: Uninitialized scalar field (UNINIT_CTOR) Non-static class member m_handle is not initialized in this constructor nor in any functions that it calls.
47   WatchCtx(TestImageWatcher &parent) : m_parent(parent) {}
48
49   int watch(const
Re: 'Immutable bit' on pools to prevent deletion
On Thu, Jan 15, 2015 at 6:07 PM, Sage Weil sw...@redhat.com wrote: What would that buy us? Preventing injectargs on it would require mon restarts, which is unfortunate -- and makes it sound more like a security feature than a safety blanket. I meant 'ceph tell mon.* injectargs ...' as distinct from 'ceph daemon ... config set', which requires access to the host. But yeah, if we went to the effort to limit injectargs (maybe a blanket option that disables injectargs on mons?), it could double as a security feature. But whether it may also be useful for security doesn't change whether it is a good safety blanket. I like it because it's simple, easy to implement, and easy to disable for testing... :) The trouble with the admin socket part is that any tool that manages Ceph must then use the admin socket interface as well as the normal over-the-network command interface, and by extension must be able to execute locally on a mon. We would no longer have a comprehensive remote management interface for the mon: management tools would have to run some code locally too. I think it's sufficient to require two API calls (set the flag or config option, then do the delete) within the remote API, rather than requiring that anyone driving the interface knows how to speak two network protocols (the usual mon remote command + SSH-to-asok). John -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 'Immutable bit' on pools to prevent deletion
On Thu, Jan 15, 2015 at 7:07 PM, Sage Weil sw...@redhat.com wrote: The trouble with the admin socket part is that any tool that manages Ceph must use the admin socket interface as well as the normal over-the-network command interface, and by extension must be able to execute locally on a mon. We would no longer have a comprehensive remote management interface for the mon: management tools would have to run some code locally too. True... if we make that option enabled by default. If it's off by default then it's an opt-in layer of protection. Most clusters don't have ephemeral pools so I think lots of people would want this. +1, the problem goes away if it's opt-in; it should be easy enough for API consumers to inspect the conf setting and give a nice informative error if the safety catch is engaged. I can imagine wanting to engage this ahead of plugging in a GUI or some config management recipes that you didn't quite trust yet. John -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
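The opt-in "safety catch" behaviour discussed above -- an API consumer checking the config setting and producing an informative error before attempting a delete -- might look like this sketch. The option name mon_pool_delete_protection and the FakeMon class are invented for illustration; they are not real Ceph APIs.

```python
class FakeMon:
    """Stand-in for a mon connection, for demonstration only."""
    def __init__(self, protected):
        self._conf = {"mon_pool_delete_protection":
                      "true" if protected else "false"}
        self.pools = {"rbd"}

    def conf_get(self, key):
        return self._conf[key]

    def _delete_pool(self, pool):
        self.pools.discard(pool)


def delete_pool(mon, pool):
    # Check the safety catch first and fail with a friendly message.
    if mon.conf_get("mon_pool_delete_protection") == "true":
        raise RuntimeError(
            "pool deletion is disabled; clear mon_pool_delete_protection first")
    mon._delete_pool(pool)


guarded = FakeMon(protected=True)
try:
    delete_pool(guarded, "rbd")
except RuntimeError as e:
    print("refused:", e)

open_mon = FakeMon(protected=False)
delete_pool(open_mon, "rbd")
print("rbd" in open_mon.pools)
```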
Re: Ceph backports
Sounds sane -- is the new plan to always do backports via this process? i.e. if I see a backport PR which has not been through integration testing, should I refrain from merging it? John

On Mon, Jan 5, 2015 at 11:53 AM, Loic Dachary l...@dachary.org wrote: Hi Ceph, I'm going to spend time to care for the Ceph backports (i.e. help reduce the time they stay in pull requests or redmine tickets). It should roughly go as follows:

0. Developer follows the normal process to land a PR to master. Once complete and the ticket is marked Pending Backport, this process initiates.
1. I periodically poll Redmine to look for tickets in the Pending Backport state.
2. I find the commit associated with the Redmine ticket and cherry-pick it to a backport integration branch off of the desired maintenance branch (Dumpling, Firefly, etc). (Note: a patch may require backporting to multiple branches.)
3. I resolve any merge conflicts with the cherry-picked commit.
4. Once satisfied with the group of backported commits on the integration branch, I notify QE.
5. QE tests the backport integration branch against the appropriate suites.
6a. If QE is satisfied with the test results, they merge the backport integration branch.
6b. If QE is NOT satisfied with the test results, they indicate the backport integration branch is NOT ready to merge and return it to me to work with the original developer to resolve the issue, returning to steps 2/3.
7. The ticket is moved to Resolved once the backport integration branch containing the cherry-picked backport is merged to the desired maintenance branch(es).

I'll first try to implement this semi-manually and document / script when convenient. If anyone has ideas to improve this tentative process, now is the time :-) Cheers -- Loïc Dachary, Artisan Logiciel Libre -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
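The cherry-pick step of the process above can be sketched with plain git in a throwaway repository. Everything here (branch names, commit messages, the temp repo) is invented for illustration; the one substantive detail is `git cherry-pick -x`, which records the original master commit id in the backported commit's message so its provenance is traceable.

```shell
set -e
workdir=$(mktemp -d)
cd "$workdir"
git init -q repo && cd repo
git config user.email backporter@example.com
git config user.name backporter

# Simulate history: a base commit shared with the maintenance branch...
echo base > f.txt && git add f.txt && git commit -qm "base"
git branch firefly                              # pretend maintenance branch

# ...then a fix lands on master and is marked Pending Backport.
echo fix >> f.txt && git add f.txt && git commit -qm "fix: bug 123"
fix_sha=$(git rev-parse HEAD)

# Create the integration branch off the maintenance branch and cherry-pick.
git checkout -q -b firefly-backports firefly
git cherry-pick -x "$fix_sha"

# -x appended "(cherry picked from commit ...)" to the commit message.
git log -1 --format=%B
```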
Re: [ceph-calamari] 'pg stat' structured output
From a quick read of the code, calamari just uses pg dump pgs_brief, and calculates its own totals etc -- this change shouldn't affect it. John

On Sun, Dec 21, 2014 at 3:18 PM, Sage Weil sw...@redhat.com wrote: I recently switched this around to not do a full 'pg dump' for the formatted 'pg stat' command. The problem is that the current code that does the num_pg_by_state uses the state name as the key. This includes 'active+clean' and other instances of '+', which is not a valid character for an XML token. Is calamari relying on this code anywhere or can we switch this around to be

<state><name>active+clean</name><num>123</num></state>

(or equivalent JSON)? sage

GET pg/stat: json 200
GET pg/stat: xml 200
FAILURE: url http://localhost:5000/api/v0.1/pg/stat Invalid XML returned: not well-formed (invalid token): line 4, column 40
Response content:

<response>
<output>
<pg_summary><num_pg_by_state><active+clean>24</active+clean></num_pg_by_state><version>55</version><num_pgs>24</num_pgs><num_bytes>3892</num_bytes><raw_bytes_used>491817414656</raw_bytes_used><raw_bytes_avail>663907856384</raw_bytes_avail><raw_bytes>1217698979840</raw_bytes></pg_summary>
</output>
<status>OK</status>
</response>

___ ceph-calamari mailing list ceph-calam...@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-calamari-ceph.com -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
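The '+' problem, and why Sage's proposed shape fixes it, is easy to demonstrate with Python's XML parser: an element *name* may not contain '+', but element *content* may.

```python
import xml.etree.ElementTree as ET

# Current output: the state name itself is used as the element name.
# '+' is not a legal character in an XML name, so parsing fails.
bad = "<num_pg_by_state><active+clean>24</active+clean></num_pg_by_state>"
try:
    ET.fromstring(bad)
    ok = True
except ET.ParseError:
    ok = False
print("bad form parses:", ok)

# Proposed form: the state name moves into element content, which is fine.
good = ("<num_pg_by_state><state><name>active+clean</name>"
        "<num>24</num></state></num_pg_by_state>")
state = ET.fromstring(good).find("state")
print(state.find("name").text, state.find("num").text)
```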
Re: Where do MDSHealthMetrics show up?
On Fri, Nov 21, 2014 at 1:54 AM, Michael Sevilla mikesevil...@gmail.com wrote: Hi. Where do the MDSHealthMetrics in MMDSBeacon (e.g., MDS_HEALTH_TRIM) show up in the monitors? When we run ceph -s? I suspect I don't see them because I'd have to run ceph -s at the exact moment when the MDS is trimming. Is there an easier way to see these warnings or is there some debug flag I need to turn on? In the specific case of MDS_HEALTH_TRIM, this is aimed at detecting systems that are trimming at a pathologically bad rate (or perhaps stuck entirely due to a bug), so in such an unhealthy system we would expect the state to stick around for a while -- it shouldn't just be a blink-and-you-miss-it status. However, you would have to look at the status sometime in the unhealthy period: there's currently nothing in the cluster log for that health check. For the new MDS health warnings, we have some overlapping coverage between health indications (i.e. things that show up in ceph -s) and cluster log messages (i.e. things that show up in ceph -w). There is a general problem here for the health stuff (not just for the MDS things) that it is only generated on-demand when someone looks at it -- e.g. things like clock skew also only show up if you happen to run ceph -s at the right moment. Internally this corresponds to the various get_health() functions in the mon subsystems. It would be good to have a generic way for health indicators (MDS and beyond) to emit clog messages when they appear and disappear, so that you don't have to look at the status at the right moment. That would be a little hard to implement at the moment because the health messages are just freeform strings, but I put some notes on cleaning up health reporting here a while back: http://tracker.ceph.com/issues/7192 Cheers, John -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
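The generic mechanism John describes -- emitting cluster log messages when a health indicator appears or disappears, rather than only reporting on demand -- could be as simple as diffing the set of active health checks between refreshes. A sketch, with all names invented for illustration:

```python
def clog_transitions(prev, curr):
    """Given the previous and current sets of active health checks,
    return the cluster-log messages to emit for this transition."""
    msgs = ["Health check failed: %s" % c for c in sorted(curr - prev)]
    msgs += ["Health check cleared: %s" % c for c in sorted(prev - curr)]
    return msgs


# MDS_HEALTH_TRIM appears on one refresh, disappears on a later one:
tick1 = clog_transitions(set(), {"MDS_HEALTH_TRIM"})
tick2 = clog_transitions({"MDS_HEALTH_TRIM"}, set())
print(tick1)
print(tick2)
```

This only works if health states are structured identifiers rather than freeform strings, which is exactly the cleanup the linked tracker issue proposes.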
Re: Consul
Consul is an interesting thing. It had crossed my mind as a service monitoring/discovery thing for cases where: * We have services other than Ceph in the IO path (e.g. apache, samba, nfs) * The mons aren't happy (something to tell me which of my mons are up even if there is no quorum) * There might be multiple Ceph clusters and you want one health state reflecting both clusters Most other monitoring tools (e.g. calamari) take the shortcut of having a single central monitoring server -- something consul-esque that is lighter-weight and more resilient could be a step forward for cluster monitoring applications in more flexible and less enterprisey environments. The whole "separate service, but it's lightweight so that's okay" approach is embodied by Consul. I think there is an alternative path available, which I think of as "we already have a consensus system, let's make a way to plug monitoring applications on top of it" -- a way to plug extra smarts into the mons could be interesting too. John On Wed, Nov 5, 2014 at 12:38 AM, Loic Dachary l...@dachary.org wrote: Hi Ceph, While at the OpenStack summit Dan Bode spoke highly of Consul ( https://consul.io/intro/index.html ). Its scope is new to me. Each individual feature is familiar but I'm not entirely sure if combining them into a single piece of software is necessary. And I wonder how it could relate to Ceph. It is entirely possible that it does not even make sense to ask these questions ;-) Cheers -- Loïc Dachary, Artisan Logiciel Libre -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/7] ceph: remove unused `map_waiters` from osdc client
This is initialized but never used.

Signed-off-by: John Spray john.sp...@redhat.com
---
 include/linux/ceph/osd_client.h | 1 -
 net/ceph/osd_client.c | 1 -
 2 files changed, 2 deletions(-)

diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
index 03aeb27..7cb5cea 100644
--- a/include/linux/ceph/osd_client.h
+++ b/include/linux/ceph/osd_client.h
@@ -197,7 +197,6 @@ struct ceph_osd_client {
 	struct ceph_osdmap *osdmap; /* current map */
 	struct rw_semaphore map_sem;
-	struct completion map_waiters;
 	u64 last_requested_map;

 	struct mutex request_mutex;
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 5a75395..75ab07c 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -2537,7 +2537,6 @@ int ceph_osdc_init(struct ceph_osd_client *osdc, struct ceph_client *client)
 	osdc->client = client;
 	osdc->osdmap = NULL;
 	init_rwsem(&osdc->map_sem);
-	init_completion(&osdc->map_waiters);
 	osdc->last_requested_map = 0;
 	mutex_init(&osdc->request_mutex);
 	osdc->last_tid = 0;
--
1.9.3
-- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/7] ceph: update ceph_msg_header structure
2 bytes of what was reserved space is now used by userspace for the compat_version field.

Signed-off-by: John Spray john.sp...@redhat.com
---
 include/linux/ceph/msgr.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/ceph/msgr.h b/include/linux/ceph/msgr.h
index cac4b28..1c18872 100644
--- a/include/linux/ceph/msgr.h
+++ b/include/linux/ceph/msgr.h
@@ -152,7 +152,8 @@ struct ceph_msg_header {
 	   receiver: mask against ~PAGE_MASK */

 	struct ceph_entity_name src;
-	__le32 reserved;
+	__le16 compat_version;
+	__le16 reserved;
 	__le32 crc;       /* header crc32c */
 } __attribute__ ((packed));
--
1.9.3
-- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 7/7] ceph: include osd epoch barrier in debugfs
This is useful in our automated testing, so that we can verify that the barrier is propagating correctly between servers and clients.

Signed-off-by: John Spray john.sp...@redhat.com
---
 fs/ceph/debugfs.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/ceph/debugfs.c b/fs/ceph/debugfs.c
index 5d5a4c8..60db629 100644
--- a/fs/ceph/debugfs.c
+++ b/fs/ceph/debugfs.c
@@ -174,6 +174,9 @@ static int mds_sessions_show(struct seq_file *s, void *ptr)
 	/* The -o name mount argument */
 	seq_printf(s, "name \"%s\"\n", opt->name ? opt->name : "");

+	/* The latest OSD epoch barrier known to this client */
+	seq_printf(s, "osd_epoch_barrier \"%d\"\n", mdsc->cap_epoch_barrier);
+
 	/* The list of MDS session rank+state */
 	for (mds = 0; mds < mdsc->max_sessions; mds++) {
 		struct ceph_mds_session *session =
--
1.9.3
-- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 5/7] ceph: update CAPRELEASE message format
Version 2 includes the new osd epoch barrier field. This allows clients to inform servers that their released caps may not be used until a particular OSD map epoch.

Signed-off-by: John Spray john.sp...@redhat.com
---
 fs/ceph/mds_client.c | 13 +
 fs/ceph/mds_client.h | 8 ++-
 2 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index dce7977..3f5bc23 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -1508,12 +1508,25 @@ void ceph_send_cap_releases(struct ceph_mds_client *mdsc,
 			    struct ceph_mds_session *session)
 {
 	struct ceph_msg *msg;
+	u32 *cap_barrier;

 	dout("send_cap_releases mds%d\n", session->s_mds);
 	spin_lock(&session->s_cap_lock);
 	while (!list_empty(&session->s_cap_releases_done)) {
 		msg = list_first_entry(&session->s_cap_releases_done,
 				       struct ceph_msg, list_head);
+
+		BUG_ON(msg->front.iov_len + sizeof(*cap_barrier) >
+		       PAGE_CACHE_SIZE);
+
+		// Append cap_barrier field
+		cap_barrier = msg->front.iov_base + msg->front.iov_len;
+		*cap_barrier = cpu_to_le32(mdsc->cap_epoch_barrier);
+		msg->front.iov_len += sizeof(*cap_barrier);
+
+		msg->hdr.version = cpu_to_le16(2);
+		msg->hdr.compat_version = cpu_to_le16(1);
+
 		list_del_init(&msg->list_head);
 		spin_unlock(&session->s_cap_lock);
 		msg->hdr.front_len = cpu_to_le32(msg->front.iov_len);
diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
index 7b40568..b9412a8 100644
--- a/fs/ceph/mds_client.h
+++ b/fs/ceph/mds_client.h
@@ -92,10 +92,14 @@ struct ceph_mds_reply_info_parsed {

 /*
  * cap releases are batched and sent to the MDS en masse.
+ *
+ * Account for per-message overhead of mds_cap_release header
+ * and u32 for osd epoch barrier trailing field.
  */
 #define CEPH_CAPS_PER_RELEASE ((PAGE_CACHE_SIZE -		\
-				sizeof(struct ceph_mds_cap_release)) /	\
-				sizeof(struct ceph_mds_cap_item))
+				sizeof(struct ceph_mds_cap_release) -	\
+				sizeof(u32)) /				\
+				sizeof(struct ceph_mds_cap_item))

 /*
--
1.9.3
-- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
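The macro change above just reserves one extra u32 per page for the trailing barrier. With made-up sizes (assumptions for the sake of the arithmetic, not the real kernel struct sizes), the calculation looks like:

```python
# Illustrative recomputation of CEPH_CAPS_PER_RELEASE. The struct sizes
# below are assumptions for this sketch, not the real kernel values.
PAGE_CACHE_SIZE = 4096
SIZEOF_CAP_RELEASE_HEAD = 4   # assumed sizeof(struct ceph_mds_cap_release)
SIZEOF_U32 = 4                # the appended osd epoch barrier field
SIZEOF_CAP_ITEM = 24          # assumed sizeof(struct ceph_mds_cap_item)

caps_per_release = (PAGE_CACHE_SIZE
                    - SIZEOF_CAP_RELEASE_HEAD
                    - SIZEOF_U32) // SIZEOF_CAP_ITEM
print(caps_per_release)
```

Sizing the message this way is what makes the later `BUG_ON(msg->front.iov_len + sizeof(*cap_barrier) > PAGE_CACHE_SIZE)` hold: the barrier always fits in the space the macro set aside.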
[PATCH 3/7] ceph: add ceph_osdc_cancel_writes
To allow us to abort writes in ENOSPC conditions, instead of having them block indefinitely.

Signed-off-by: John Spray john.sp...@redhat.com
---
 include/linux/ceph/osd_client.h |  8 +
 net/ceph/osd_client.c | 67 +
 2 files changed, 75 insertions(+)

diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
index 7cb5cea..f82000c 100644
--- a/include/linux/ceph/osd_client.h
+++ b/include/linux/ceph/osd_client.h
@@ -21,6 +21,7 @@ struct ceph_authorizer;
 /*
  * completion callback for async writepages
  */
+typedef void (*ceph_osdc_full_callback_t)(struct ceph_osd_client *, void *);
 typedef void (*ceph_osdc_callback_t)(struct ceph_osd_request *,
 				     struct ceph_msg *);
 typedef void (*ceph_osdc_unsafe_callback_t)(struct ceph_osd_request *, bool);
@@ -226,6 +227,9 @@ struct ceph_osd_client {
 	u64 event_count;

 	struct workqueue_struct	*notify_wq;
+
+	ceph_osdc_full_callback_t map_cb;
+	void *map_p;
 };

 extern int ceph_osdc_setup(void);
@@ -331,6 +335,7 @@ extern void ceph_osdc_put_request(struct ceph_osd_request *req);
 extern int ceph_osdc_start_request(struct ceph_osd_client *osdc,
 				   struct ceph_osd_request *req,
 				   bool nofail);
+extern u32 ceph_osdc_cancel_writes(struct ceph_osd_client *osdc, int r);
 extern void ceph_osdc_cancel_request(struct ceph_osd_request *req);
 extern int ceph_osdc_wait_request(struct ceph_osd_client *osdc,
 				  struct ceph_osd_request *req);
@@ -361,5 +366,8 @@ extern int ceph_osdc_create_event(struct ceph_osd_client *osdc,
 				  void *data, struct ceph_osd_event **pevent);
 extern void ceph_osdc_cancel_event(struct ceph_osd_event *event);
 extern void ceph_osdc_put_event(struct ceph_osd_event *event);
+
+extern void ceph_osdc_register_map_cb(struct ceph_osd_client *osdc,
+				      ceph_osdc_full_callback_t cb, void *data);
 #endif
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 75ab07c..eb7e735 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -836,6 +836,59 @@ __lookup_request_ge(struct ceph_osd_client *osdc,
 	return NULL;
 }

+/*
+ * Drop all pending write/modify requests and complete
+ * them with the `r` as return code.
+ *
+ * Returns the highest OSD map epoch of a request that was
+ * cancelled, or 0 if none were cancelled.
+ */
+u32 ceph_osdc_cancel_writes(
+	struct ceph_osd_client *osdc,
+	int r)
+{
+	struct ceph_osd_request *req;
+	struct rb_node *n = osdc->requests.rb_node;
+	u32 latest_epoch = 0;
+
+	dout("enter cancel_writes r=%d", r);
+
+	mutex_lock(&osdc->request_mutex);
+
+	while (n) {
+		req = rb_entry(n, struct ceph_osd_request, r_node);
+		n = rb_next(n);
+
+		ceph_osdc_get_request(req);
+		if (req->r_flags & CEPH_OSD_FLAG_WRITE) {
+			req->r_result = r;
+			complete_all(&req->r_completion);
+			complete_all(&req->r_safe_completion);
+
+			if (req->r_callback) {
+				// Requires callbacks used for write ops are
+				// amenable to being called with NULL msg
+				// (e.g. writepages_finish)
+				req->r_callback(req, NULL);
+			}
+
+			__unregister_request(osdc, req);
+
+			if (*req->r_request_osdmap_epoch > latest_epoch) {
+				latest_epoch = *req->r_request_osdmap_epoch;
+			}
+		}
+		ceph_osdc_put_request(req);
+	}
+
+	mutex_unlock(&osdc->request_mutex);
+
+	dout("complete cancel_writes latest_epoch=%ul", latest_epoch);
+
+	return latest_epoch;
+}
+EXPORT_SYMBOL(ceph_osdc_cancel_writes);
+
 static void __kick_linger_request(struct ceph_osd_request *req)
 {
 	struct ceph_osd_client *osdc = req->r_osdc;
@@ -2102,6 +2155,10 @@ done:
 	downgrade_write(&osdc->map_sem);
 	ceph_monc_got_osdmap(&osdc->client->monc, osdc->osdmap->epoch);

+	if (osdc->map_cb) {
+		osdc->map_cb(osdc, osdc->map_p);
+	}
+
 	/*
 	 * subscribe to subsequent osdmap updates if full to ensure
 	 * we find out when we are no longer full and stop returning
@@ -2125,6 +2182,14 @@ bad:
 	up_write(&osdc->map_sem);
 }

+void ceph_osdc_register_map_cb(struct ceph_osd_client *osdc,
+			       ceph_osdc_full_callback_t cb, void *data)
+{
+	osdc->map_cb = cb;
+	osdc->map_p = data;
+}
+EXPORT_SYMBOL(ceph_osdc_register_map_cb);
+
 /*
  * watch/notify callback event infrastructure
  *
@@ -2553,6 +2618,8 @@ int ceph_osdc_init(struct ceph_osd_client *osdc, struct ceph_client *client)
 	spin_lock_init(&osdc->event_lock);
 	osdc->event_tree = RB_ROOT;
 	osdc->event_count = 0;
+	osdc->map_cb = NULL;
+	osdc->map_p = NULL;

 	schedule_delayed_work(&osdc->osds_timeout_work
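The core of ceph_osdc_cancel_writes can be modelled in a few lines: walk the pending requests, complete every *write* with the given error, and remember the newest osdmap epoch among the cancelled ones. The flag value and dict-based request representation below are invented for this sketch (the real code walks an rbtree of `ceph_osd_request` structs).

```python
FLAG_WRITE = 0x20  # illustrative stand-in for CEPH_OSD_FLAG_WRITE


def cancel_writes(requests, r):
    """Complete all pending writes with error r; return the highest
    osdmap epoch among the cancelled requests (0 if none)."""
    latest_epoch = 0
    for req in list(requests):
        if req["flags"] & FLAG_WRITE:
            req["result"] = r        # complete the request with the error
            requests.remove(req)     # analogous to __unregister_request()
            latest_epoch = max(latest_epoch, req["epoch"])
    return latest_epoch


pending = [
    {"flags": FLAG_WRITE, "epoch": 100, "result": None},
    {"flags": 0,          "epoch": 104, "result": None},  # a read: untouched
    {"flags": FLAG_WRITE, "epoch": 105, "result": None},
]
e = cancel_writes(pending, -28)  # -28 standing in for -ENOSPC
print(e, len(pending))
```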
[PATCH 4/7] ceph: handle full condition by cancelling ops
While cancelling, we store the OSD epoch at time of cancellation; this will be used later in CAPRELEASE messages.

Signed-off-by: John Spray john.sp...@redhat.com
---
 fs/ceph/mds_client.c | 21 +
 fs/ceph/mds_client.h | 1 +
 2 files changed, 22 insertions(+)

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 9f00853..dce7977 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -3265,6 +3265,23 @@ static void delayed_work(struct work_struct *work)
 	schedule_delayed(mdsc);
 }

+/**
+ * Call this with map_sem held for read
+ */
+static void handle_osd_map(struct ceph_osd_client *osdc, void *p)
+{
+	struct ceph_mds_client *mdsc = (struct ceph_mds_client*)p;
+	u32 cancelled_epoch = 0;
+
+	if (osdc->osdmap->flags & CEPH_OSDMAP_FULL) {
+		cancelled_epoch = ceph_osdc_cancel_writes(osdc, -ENOSPC);
+		if (cancelled_epoch) {
+			mdsc->cap_epoch_barrier = max(cancelled_epoch + 1,
+						      mdsc->cap_epoch_barrier);
+		}
+	}
+}
+
 int ceph_mdsc_init(struct ceph_fs_client *fsc)
 {
@@ -3311,6 +3328,10 @@ int ceph_mdsc_init(struct ceph_fs_client *fsc)
 	ceph_caps_init(mdsc);
 	ceph_adjust_min_caps(mdsc, fsc->min_caps);

+	mdsc->cap_epoch_barrier = 0;
+
+	ceph_osdc_register_map_cb(&fsc->client->osdc,
+				  handle_osd_map, (void*)mdsc);
+
 	return 0;
 }
diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
index 230bda7..7b40568 100644
--- a/fs/ceph/mds_client.h
+++ b/fs/ceph/mds_client.h
@@ -298,6 +298,7 @@ struct ceph_mds_client {
 	int num_cap_flushing; /* # caps we are flushing */
 	spinlock_t cap_dirty_lock; /* protects above items */
 	wait_queue_head_t cap_flushing_wq;
+	u32 cap_epoch_barrier;

 	/*
 	 * Cap reservations
--
1.9.3
-- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
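The barrier update in handle_osd_map is a simple monotonic max: after cancelling writes at epoch E, released caps must not be trusted until at least epoch E+1, and the barrier never moves backwards. A sketch of that invariant:

```python
def update_cap_epoch_barrier(barrier, cancelled_epoch):
    """Mirror of the patch's logic: advance the barrier past the epoch
    of any cancelled writes, but never lower it."""
    if cancelled_epoch:
        return max(cancelled_epoch + 1, barrier)
    return barrier


print(update_cap_epoch_barrier(0, 10))   # first cancellation: barrier -> 11
print(update_cap_epoch_barrier(50, 10))  # an older epoch cannot lower it
print(update_cap_epoch_barrier(7, 0))    # no cancellations: unchanged
```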
Re: [ceph-users] the state of cephfs in giant
On Thu, Oct 30, 2014 at 10:55 AM, Florian Haas flor...@hastexo.com wrote: * ganesha NFS integration: implemented, no test coverage. I understood from a conversation I had with John in London that flock() and fcntl() support had recently been added to ceph-fuse, can this be expected to Just Work™ in Ganesha as well? To clarify this comment: flock in ceph-fuse was recently implemented (by Yan Zheng) in *master* rather than giant, so it's in line for hammer. Cheers, John -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Request for comments
We do have a lot of functions that bubble up standard error numbers -- in general, callers of these functions have to accommodate the possibility of any error code (including errors they don't understand). I notice that ::mount is a bit unusual in additionally having some explicit 'magic' return codes to indicate which stage of init failed. A pull request to document the magic -1001, -1002 etc. returns from that function would be welcome; it might not be useful to try and enumerate all possible error numbers for all API calls though -- it could be hard to prove that the list was comprehensive, and callers should always handle unexpected codes too. Cheers, John On Fri, Oct 10, 2014 at 5:22 PM, Barclay Jameson almightybe...@gmail.com wrote: When reading the code for libcephfs.cc (giant branch) it wasn't apparent to me, for lines 95 and 101, what return values were expected from the init and mount function calls other than 0. It was only after tracing a bit of code did I see that error codes such as -ENOENT and -EEXIST could be returned as well. It would be awesome if comments were added for these functions to show what the expected return values are going to be. Thanks, almightybeeij -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
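For the standard "bubbled up" error numbers, a caller can recover the name and description of a negative return mechanically, which is one reason enumerating them per-call adds little. The ceph-specific magic codes (-1001, -1002) have no errno name and do need explicit documentation. In Python, for example:

```python
import errno
import os

# Map the negative returns that tracing revealed (-ENOENT, -EEXIST)
# back to their symbolic names and descriptions.
for ret in (-errno.ENOENT, -errno.EEXIST):
    print(errno.errorcode[-ret], "->", os.strerror(-ret))
```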
Re: New Defects reported by Coverity Scan for ceph (fwd)
Nice to see that coverity and lockdep agree :-) This should go away with the fix for #9562. John

On Thu, Sep 25, 2014 at 4:02 PM, Sage Weil sw...@redhat.com wrote: -- Forwarded message -- From: scan-ad...@coverity.com To: undisclosed-recipients:; Cc: Date: Thu, 25 Sep 2014 06:18:46 -0700 Subject: New Defects reported by Coverity Scan for ceph Hi, Please find the latest report on new defect(s) introduced to ceph found with Coverity Scan.

Defect(s) Reported-by: Coverity Scan
Showing 1 of 1 defect(s)

** CID 1241497: Thread deadlock (ORDER_REVERSAL)

*** CID 1241497: Thread deadlock (ORDER_REVERSAL)
/osdc/Filer.cc: 314 in Filer::_do_purge_range(PurgeRange *, int)()
308     return;
309   }
310
311   int max = 10 - pr->uncommitted;
312   while (pr->num > 0 && max > 0) {
313     object_t oid = file_object_t(pr->ino, pr->first);
CID 1241497: Thread deadlock (ORDER_REVERSAL) Calling get_osdmap_read acquires lock RWLock.L while holding lock Mutex._m (count: 15 / 30).
314     const OSDMap *osdmap = objecter->get_osdmap_read();
315     object_locator_t oloc = osdmap->file_to_object_locator(pr->layout);
316     objecter->put_osdmap_read();
317     objecter->remove(oid, oloc, pr->snapc, pr->mtime, pr->flags,
318                      NULL, new C_PurgeRange(this, pr));
319     pr->uncommitted++;

To view the defects in Coverity Scan visit, http://scan.coverity.com/projects/25?tab=overview To unsubscribe from the email notification for new defects, http://scan5.coverity.com/cgi-bin/unsubscribe.py -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
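The ORDER_REVERSAL complaint is about inconsistent acquisition order between the osdmap RWLock and a mutex: if one path takes A then B while another takes B then A, the two can deadlock. The general rule the tools enforce -- pick one global order and take locks in that order on every path -- can be shown with a toy example (names invented; plain Python Locks stand in for the RWLock and Mutex):

```python
import threading

# One agreed global order: rwlock (outer) before mutex (inner), everywhere.
rwlock = threading.Lock()
mutex = threading.Lock()
completed = []


def purge_range_worker(i):
    with rwlock:        # outer lock first on every path...
        with mutex:     # ...inner lock second: no order reversal possible
            completed.append(i)


threads = [threading.Thread(target=purge_range_worker, args=(i,))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(completed))
```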
Re: [ceph-users] Calamari Goes Open Source
On Wed, Jul 30, 2014 at 3:40 PM, Dan Ryder (daryder) dary...@cisco.com wrote:
> I had similar issues, I tried many different ways to use vagrant but couldn't build packages successfully. I'm not sure how reliable this is, but if you are looking to get Calamari packages quickly, you can skip the Vagrant install steps and just use the Makefile.

A note of warning here -- the vagrant stuff exists for a reason. The ease of building things directly depends very much on what distro you're on and what external packages are installed. Because of the way the virtualenv for /opt/calamari is built, it can be sensitive not just to what packages you have installed, but to what packages you *don't* have installed -- if something is installed systemwide, that can prevent pip from realizing it needs to build it into the virtualenv.

BTW, the configuration of the build virtual machines is separable from vagrant itself: within each folder in vagrant you'll see a salt/roots subdir. Those salt states can also be used on a virtual machine set up using your choice of provisioner, if you're running salt there.

Finally, anyone working in this area should join the ceph-calamari mailing list at http://lists.ceph.com/listinfo.cgi/ceph-calamari-ceph.com -- it would be good to hear there about any specific steps in the build instructions that are failing.

Cheers,
John
Re: wip-objecter
On Mon, Jul 28, 2014 at 10:48 PM, Sage Weil sw...@redhat.com wrote:
> Anyway, I rebased and resolved conflicts and pushed a new wip-objecter-rebased which looks like it has the right diff between the old and new versions. Take a look? The main change is that the "set data pool" virtual xattrs do objecter->wait_for_latest_osdmap() and the flawed MDS::request_osdmap() helper is now gone.

Yep, looks right. I was going to do about the same squashup once things were passing unit tests, but I had missed the leak of the contexts.

John
Re: wip-objecter
On Mon, Jul 28, 2014 at 11:17 PM, Sage Weil sw...@redhat.com wrote:
> I wonder if a safer approach is to make a subclass of Context called RelockingContext, which takes an mds_lock * in the ctor, and then make all of the other completions subclasses of that. That way we can use type checking to ensure that all Contexts passed to MDLog::submit_entry() (or other places that make sense) are of the right type. Hopefully there aren't any classes that are used in both locking and non-locking contexts...

Greg and I chatted about this sort of thing a bit last week, where the same sort of mechanism might be used to distinguish completions that need to go via a finisher vs. ones that are safe to call directly. I convinced myself that wasn't going to work last week (because we have to call completions from the journaler in the right order), but the general idea of declaratively distinguishing contexts like this is still appealing.

The following are the interesting attributes of contexts:
1. Contexts that just don't care how they're called (e.g. C_OnFinisher can be called any which way)
2. Contexts that will take mds_lock (if you're holding it, send via a finisher thread)
3. Contexts that require mds_lock to already be held
4. Contexts that may call into Journaler (you may not call these while holding Journaler.lock; if you're holding it, send via a finisher thread)
5. Contexts that may do Objecter I/O (you may not call these from an objecter callback; if you're in that situation, you must go via a finisher thread)

These are not all exclusive. Number 5 goes away when we switch to librados, because it's already going to be going via a finisher thread before calling back our I/O completions. For the moment we get the same effect by using C_OnFinisher with all the C_IO_* contexts. Number 4 goes away if Journaler sends all its completions via a finisher, as seems to be the simplest thing to do right now.

> Alternatively, we could make a LockingFinisher class that takes the specified lock (mds_lock) before calling each Context. That might be simpler?

Hmm, if we did that then completions might sometimes hop up multiple finishers, if they went e.g. first into a "take the mds lock" finisher and then subsequently via a "the mds lock must be held" finisher (the second one being strictly redundant, but it sure would be nice to have the check in there somehow).

The explicit Context subclasses have lots of appeal to me for the compile-time assurance we would have that we were using the right kind of thing in the right kind of place -- e.g. when we do add_waiter calls on an inode's lock, we can check at compile time that we're passing an "mds lock already held" subclass.

John
Re: wip-objecter
I think that for the calls where we check the epoch and then conditionally call wait_for_map with the context, we have to change them to do the wait_for_map call first and check the boolean response. Otherwise, if the map we have is updated between the read of the epoch and the call to wait_for_map, it could already be ready and never call back our context.

My version is here (on top of a rebase of wip-objecter to master, on branch wip-objecter-rebase):
https://github.com/ceph/ceph/commit/3cd82464ed6f13ec5b44da303849061648b9e3a1

John

On Sat, Jul 26, 2014 at 6:42 AM, Sage Weil sw...@redhat.com wrote:
> Hey John,
> I fixed up the mds osdmap wait stuff and squashed it into wip-objecter. That rebased out from underneath your branch; sorry. Will try to look at the other patches shortly!
> sage
Re: collectd / graphite / grafana .. calamari?
Hi Ricardo,

Let me share a few notes on metrics in calamari:
* We're bundling graphite, and using diamond to send home metrics. The diamond collector used in calamari has always been open source [1].
* The Calamari UI has its own graphs page that talks directly to the graphite API (the calamari REST API does not duplicate any of the graphing interface).
* We also bundle the default graphite dashboard, so that folks can go to /graphite/dashboard/ on the calamari server to plot anything custom they want to.

It could be quite interesting to hook Grafana in there in the same way that we currently hook in the default graphite dashboard, as Grafana is definitely nicer and would give us a roadmap to influxdb (a project I am quite excited about).

Cheers,
John

1. https://github.com/ceph/Diamond/commits/calamari

On Fri, May 23, 2014 at 1:58 AM, Ricardo Rocha rocha.po...@gmail.com wrote:
> Hi. I saw the thread a couple days ago on ceph-users regarding collectd... and yes, I've been working on something similar for the last few days :)
> https://github.com/rochaporto/collectd-ceph
> It has a set of collectd plugins pushing metrics which mostly map what the ceph commands return. In the setup we have, it pushes them to graphite and the displays rely on grafana (check for a screenshot in the link above). As it relies on common building blocks, it's easily extensible and we'll come up with new dashboards soon -- things like plotting osd data against the metrics from the collectd disk plugin, which we also deploy.
> This email is mostly to share the work, but also to check on Calamari. I asked Patrick after the RedHat/Inktank news and have no idea what it provides, but I'm sure it comes with lots of extra sauce -- he suggested asking on the list. What's the timeline to have it open sourced? It would be great to have a look at it, and as there's work from different people in this area, maybe start working together on some fancier monitoring tools.
>
> Regards,
> Ricardo
Feedback: default FS pools and newfs behavior
In response to #8010 [1], I'm looking at making it possible to explicitly disable CephFS, so that the (often unused) filesystem pools don't hang around if they're unwanted. The administrative behavior would change such that:
* To enable the filesystem, it is necessary to create two pools and use "ceph newfs <metadata> <data>"
* There's a new "ceph rmfs" command to disable the filesystem and allow removing its pools
* Initially, the filesystem is disabled and the 'data' and 'metadata' pools are not created by default

There's an initial cut of this on a branch:
https://github.com/ceph/ceph/commits/wip-nullfs

Questions:
* Are there strong opinions about whether the CephFS pools should exist by default? I think it makes life simpler if they don't, avoiding "what the heck is the 'data' pool?"-type questions from newcomers.
* Is it too unfriendly to require users to explicitly create pools before running newfs, or do we need to auto-create pools when they run newfs? Auto-creating pools from newfs is a bit awkward internally because it requires modifying both OSD and MDS maps in one command.

Cheers,
John

1. http://tracker.ceph.com/issues/8010
Re: A plea for more tooling usage
A couple of notes for anyone following along on ubuntu precise or another older system:
* make sure you're using clang >= 3.3 to get support for suppressing all the warning flags in Daniel's command line.
* libcrypto++ 5.6.1 (the version in ubuntu) doesn't compile with clang, unless you hack

I was looking through the warnings out of interest and made a couple of notes.

On Tue, May 6, 2014 at 2:48 PM, Daniel Hofmann dan...@trvx.org wrote:

warning: anonymous types declared in an anonymous union are an extension [-Wnested-anon-types]
$ grep Wnested-anon-types uniq.txt | wc -l
28

This is mostly ceph_osd_op in rados.h -- the code is fine but nonstandard. We can avoid this particular warning by explicitly typedef'ing the struct types before using them, but we would still have the warning that anonymous unions are themselves nonstandard. Anonymous unions are nice...

warning: zero size arrays are an extension [-Wzero-length-array]
$ grep Wzero-length-array uniq.txt | wc -l
8134

This is mostly dout_impl, I think -- it is using an array whose size is defined as 0 or -1 conditionally, apparently in order to detect out-of-bounds severity numbers. I wonder if there is a neater way to do this -- I am not a preprocessor guru.

warning: '%' may not be nested in a struct due to flexible array member [-Wflexible-array-extensions]
$ grep Wflexible-array-extensions uniq.txt | wc -l
2

This is true (ceph_frag_tree in ceph_mds_reply_inode)... I don't see a nice way around it. Perhaps it's a useful enough language extension that we should continue to use it.

warning: cast between pointer-to-function and pointer-to-object is an extension [-Wpedantic]
warning: use of non-standard escape character '\%' [-Wpedantic]
$ grep Wpedantic uniq.txt | wc -l
15

The pointer cast thing is overly pedantic in the cases where we're using dlsym (which returns a void*): an object-pointer-to-function-pointer cast is illegal in the language standard but guaranteed to work in POSIX. The \% thing is easily fixed.

warning: use of GNU old-style field designator extension [-Wgnu-designator]
$ grep Wgnu-designator uniq.txt | wc -l
43

fuse_ll.cc:fuse_ll_oper and config.cc:g_default_file_layout. We could easily just remove the field designators and initialize as {val1, val2}, but that's much less readable :-/ I don't think nice struct initialization becomes a thing until C++11.

warning: using namespace directive in global context in header [-Wheader-hygiene]
$ grep Wheader-hygiene uniq.txt | wc -l
87

We should be able to solve these easily with a fairly mechanical procedure:
* Remove the 'using's from the headers
* Change type references in headers to use the appropriate prefix (mostly std::)
* Add the 'using's to any .cc files that relied on them

warning: struct '%' was previously declared as a class [-Wmismatched-tags]
$ grep Wmismatched-tags uniq.txt | wc -l
93

At least some of these (especially Inode in libcephfs.h) appear to have the intention of exposing POD classes in C-compatible headers. We should probably change the C++ side to also use struct for the things that are going to be exposed to C land.

warning: missing field '%' initializer [-Wmissing-field-initializers]
$ grep Wmissing-field-initializer uniq.txt | wc -l
3

Hmm, I only saw one of these:
test/osd/TestRados.cc:260:36: warning: missing field 'ec_pool_valid' initializer [-Wmissing-field-initializers]
...but I'm on an older clang (3.3), so perhaps that explains it.

warning: private field '%' is not used [-Wunused-private-field]
$ grep Wunused-private-field uniq.txt | wc -l
11

This is a handy warning indeed; it appears to be accurately pointing out unused fields.

John
Re: RADOS translator for GlusterFS
In terms of making something work really quickly, one approach would be to base off the existing POSIX translator, use a local FS backed by an RBD volume for the metadata, and store the file content directly using librados. That would avoid the need to invent a way to map filesystem-style metadata to librados calls, while still getting reasonably efficient data operations through to rados. I doubt this would be very slick, but it could be a fun hack!

John

On Mon, May 5, 2014 at 4:21 PM, Jeff Darcy jda...@redhat.com wrote:
> Now that we're all one big happy family, I've been mulling over different ways that the two technology stacks could work together. One idea would be to use some of the GlusterFS upper layers for their interface and integration possibilities, but then fall down to RADOS instead of GlusterFS's own distribution and replication. I must emphasize that I don't necessarily think this is The Right Way for anything real, but I think it's an important experiment just to see what the problems are and how well it performs.
>
> So here's what I'm thinking. For the Ceph folks, I'll describe just a tiny bit of how GlusterFS works. The core concept in GlusterFS is a translator, which accepts file system requests and generates file system requests in exactly the same form. This allows them to be stacked in arbitrary orders, moved back and forth across the server/client divide, etc. There are several broad classes of translators:
> * Some, such as FUSE or GFAPI, inject new requests into the translator stack.
> * Some, such as posix, satisfy requests by calling a server-local FS.
> * The client and server translators together get requests from one machine to another.
> * Some translators *route* requests (one in to one of several out).
> * Some translators *fan out* requests (one in to all of several out).
> * Most are one in, one out, to add e.g. locks or caching etc.
>
> Of particular interest here are the DHT (routing/distribution) and AFR (fan-out/replication) translators, which mirror functionality in RADOS. My idea is to cut out everything from these on down, in favor of a translator based on librados instead. How this works is pretty obvious for file data -- just read and write to RADOS objects instead of to files. It's a bit less obvious for metadata, especially directory entries. One really simple idea is to store metadata as data, in some format defined by the translator itself, and have it handle the read/modify/write for adding/deleting entries and such. That would be enough to get some basic performance tests done. A slightly more sophisticated idea might be to use OSD class methods to do the read/modify/write, but I don't know much about that mechanism, so I'm not sure it's even feasible.
>
> This is not something I'm going to be working on as part of my main job, but I'd like to get the experiment started in some of my spare time. Is there anyone else interested in collaborating, or are there any other obvious ideas I'm missing?
Full OSD handling (CephFS and in general)
Having found that our full-space handling in CephFS wasn't working right [1], there was some discussion on the CephFS standup about how to improve the free space handling in a more general way.

Currently (once #7780 is fixed), we just give the MDS a pass on all the fullness checks, so that it can journal file deletions to free up space. This is a halfway solution, because there are still ways for the MDS to fill up the remaining space with operations other than deletions, especially with the advent of inlining for small files. It's also hacky, because there is code inside the OSD that special-cases writes from the MDS.

Changes discussed
===

In the CephFS layer:
* We probably need to do some work to blacklist client requests other than housekeeping and deletions when we are in a low-space situation.

In the RADOS layer:
* Per-pool full flag: for situations where metadata+data pools are on separate OSDs, a per-pool full flag (instead of the current global one), so that we can distinguish between situations where we need to be conservative with metadata operations (low space on the metadata pool) vs. situations where only client data IO is blocked (low space on the data pool). This seems fairly uncontroversial, as the current global full flag doesn't reflect that different pools can be on entirely separate storage.
* Per-pool full ratio: for situations where metadata+data pools are on the same OSDs, separate full ratios per pool, so that once the data pool's threshold is reached, the remaining reserved space is given over to the metadata pool (assuming the metadata pool has a higher full ratio, possibly just set to 100%).

Throwing it out to the list for thoughts.

Cheers,
John

1. http://tracker.ceph.com/issues/7780
Re: Problem with MDS respawning (up:replay)
Hi Luke,

I've responded to your colleague on ceph-users (I'm assuming this is the same issue).

John

On Mon, Mar 17, 2014 at 4:00 PM, Luke Jing Yuan jyl...@mimos.my wrote:
> Dear all,
>
> We had been running our cluster for at least 1.5 months without any issues, but something really bad happened yesterday with the MDS, and we would really appreciate some guidance/pointers on how this may be resolved urgently. We started to notice the following messages repeating in the MDS log:
>
> # cat /var/log/ceph/ceph-mds.mon01.log
> 2014-03-16 18:40:41.894404 7f0f2875c700  0 mds.0.server handle_client_file_setlock: start: 0, length: 0, client: 324186, pid: 30684, pid_ns: 18446612141968944256, type: 4
> 2014-03-16 18:49:09.993985 7f0f24645700  0 -- x.x.x.x:6801/3739 >> y.y.y.y:0/1662262473 pipe(0x728d2780 sd=26 :6801 s=0 pgs=0 cs=0 l=0 c=0x100adc6e0).accept peer addr is really y.y.y.y:0/1662262473 (socket is y.y.y.y:33592/0)
> 2014-03-16 18:49:10.000197 7f0f24645700  0 -- x.x.x.x:6801/3739 >> y.y.y.y:0/1662262473 pipe(0x728d2780 sd=26 :6801 s=0 pgs=0 cs=0 l=0 c=0x100adc6e0).accept connect_seq 0 vs existing 1 state standby
> 2014-03-16 18:49:10.000239 7f0f24645700  0 -- x.x.x.x:6801/3739 >> y.y.y.y:0/1662262473 pipe(0x728d2780 sd=26 :6801 s=0 pgs=0 cs=0 l=0 c=0x100adc6e0).accept peer reset, then tried to connect to us, replacing
> 2014-03-16 18:49:10.550726 7f4c34671780  0 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mds, pid 13282
> 2014-03-16 18:49:10.826713 7f4c2f6f8700  1 mds.-1.0 handle_mds_map standby
> 2014-03-16 18:49:10.984992 7f4c2f6f8700  1 mds.0.14 handle_mds_map i am now mds.0.14
> 2014-03-16 18:49:10.985010 7f4c2f6f8700  1 mds.0.14 handle_mds_map state change up:standby --> up:replay
> 2014-03-16 18:49:10.985017 7f4c2f6f8700  1 mds.0.14 replay_start
> 2014-03-16 18:49:10.985024 7f4c2f6f8700  1 mds.0.14 recovery set is
> 2014-03-16 18:49:10.985027 7f4c2f6f8700  1 mds.0.14 need osdmap epoch 3446, have 3445
> 2014-03-16 18:49:10.985030 7f4c2f6f8700  1 mds.0.14 waiting for osdmap 3446 (which blacklists prior instance)
> 2014-03-16 18:49:16.945500 7f4c2f6f8700  0 mds.0.cache creating system inode with ino:100
> 2014-03-16 18:49:16.945747 7f4c2f6f8700  0 mds.0.cache creating system inode with ino:1
> 2014-03-16 18:49:17.358681 7f4c2b5e1700 -1 mds/journal.cc: In function 'void EMetaBlob::replay(MDS*, LogSegment*, MDSlaveUpdate*)' thread 7f4c2b5e1700 time 2014-03-16 18:49:17.356336
> mds/journal.cc: 1316: FAILED assert(i == used_preallocated_ino)
> ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60)
> 1: (EMetaBlob::replay(MDS*, LogSegment*, MDSlaveUpdate*)+0x7587) [0x5af5e7]
> 2: (EUpdate::replay(MDS*)+0x3a) [0x5b67ea]
> 3: (MDLog::_replay_thread()+0x678) [0x79dbb8]
> 4: (MDLog::ReplayThread::entry()+0xd) [0x58bded]
> 5: (()+0x7e9a) [0x7f4c33a96e9a]
> 6: (clone()+0x6d) [0x7f4c3298b3fd]
>
> From ceph -s, we didn't notice any stuck PGs or whatnot, but the following:
>
> # ceph -s
>     cluster x
>      health HEALTH_WARN mds cluster is degraded
>      monmap e1: 3 mons at {mon01=x.x.x.x:6789/0,mon02=x.x.x.y:6789/0,mon03=x.x.x.z:6789/0}, election epoch 1210, quorum 0,1,2 mon01,mon02,mon03
>      mdsmap e17020: 1/1/1 up {0=mon01=up:replay}, 2 up:standby
>      osdmap e20195: 24 osds: 24 up, 24 in
>       pgmap v1424671: 3300 pgs, 6 pools, 793 GB data, 3284 kobjects
>             1611 GB used, 87636 GB / 89248 GB avail
>                 3300 active+clean
>   client io 2750 kB/s rd, 0 op/s
>
> We also noticed in our syslog (dmesg actually) that the MDS services had been flapping:
>
> [5165030.941804] init: ceph-mds (ceph/mon01) main process (2264) killed by ABRT signal
> [5165030.941919] init: ceph-mds (ceph/mon01) main process ended, respawning
> [5165040.907291] init: ceph-mds (ceph/mon01) main process (2302) killed by ABRT signal
> [5165040.907363] init: ceph-mds (ceph/mon01) main process ended, respawning
> [5165050.860593] init: ceph-mds (ceph/mon01) main process (2346) killed by ABRT signal
> [5165050.860670] init: ceph-mds (ceph/mon01) main process ended, respawning
>
> More info from ceph df:
>
> GLOBAL:
>     SIZE    AVAIL   RAW USED  %RAW USED
>     89248G  87636G  1611G     1.81
>
> POOLS:
>     NAME      ID  USED   %USED  OBJECTS
>     data      0   9387M  0.01   2350
>     metadata  1   941M   0      547003
>     rbd       2   0      0      0
>     backuppc  4   783G   0.88   2813040
>     mysqlfs   5   114M   0      1278
>     mysqlrbd  6   0      0      0
>
> We would appreciate it if someone were able to enlighten us on a possible solution to this. Thanks in advance.
>
> Regards,
> Luke
>
> DISCLAIMER: This e-mail (including any attachments) is for the addressee(s) only and may be confidential, especially as regards personal data. If you are not the intended recipient, please note that any
Re: Erasure code properties in OSDMap
I am sure all of that will work, but it doesn't explain why these properties must be stored and named separately from crush rulesets. To flesh this out, one also needs get and list operations for the sets of properties, which feels like overkill if there is an existing place we could be storing these things. The reason I'm taking such an interest in what may seem like something minor is that once this has been added, we will be stuck with it for some time once external tools start depending on the interface.

The ruleset-based approach doesn't have to be more complicated for CLI users; we would essentially replace any "myproperties" above with a ruleset name instead:

  osd pool create mypool <pgnum> <pgpnum> <ruleset>
  osd set ruleset-properties <ruleset> key=val [key=val...]

The simple default cases of "pool create mypool <pgnum> <pgpnum> erasure" could be handled by making sure there exist default rulesets called "erasure" and "replicated", rather than having these be magic words to the commands that cause ruleset creation. Rulesets currently just have numbers instead of names, but it would be simpler to add names to rulesets than to introduce a whole new type of object to the interface.

John

On Tue, Mar 11, 2014 at 2:03 PM, Loic Dachary loic.dach...@cloudwatt.com wrote:
> On 11/03/2014 13:21, John Spray wrote:
>> From a high level view, what is the logical difference between the crush ruleset and the properties object? I'm thinking about how this is exposed to users and tools, and it seems like both would be defined as "the settings about data placement and encoding". I certainly understand the separation internally; I am just concerned about making the interface we expose upwards more complicated by adding a new type of object. Is there really a need for a new type of properties object, instead of storing these properties under the existing ruleset ID?
>
> These properties are used to configure the new feature that was introduced in Firefly: erasure coded pools.
>
> From a user point of view, the simplest would be to "ceph osd pool create mypool erasure" and rely on the fact that a default ruleset will be created using the default erasure code plugin with the default parameters. If the sysadmin wants to tweak the K+M factors, (s)he could:
>   ceph osd set properties myproperties k=10 m=4
> and then
>   ceph osd pool create mypool erasure myproperties
> which would implicitly ask the default erasure code plugin to create a ruleset named mypool-ruleset after configuring it with myproperties.
>
> If the sysadmin wants to share rulesets between pools instead of relying on their implicit creation, (s)he could
>   ceph osd create-serasure myruleset myproperties
> and then ceph osd set crush_ruleset as per usual. And if (s)he really wants fine tuning, manually adding the ruleset is also possible.
>
> I feel comfortable explaining this, but I'm probably much too familiar with the subject to be a good judge of what makes sense to someone new or not ;-)
>
> Cheers
>
>> John
>>
>> On Sun, Mar 9, 2014 at 12:13 PM, Loic Dachary loic.dach...@cloudwatt.com wrote:
>>> Hi Sage & Sam,
>>> I quickly sketched the replacement of the pg_pool_t::properties map with an OSDMap::properties list of maps at https://github.com/dachary/ceph/commit/fe3819a62eb139fc3f0fa4282b4d22aecd8cd398 and explained how I see it at http://tracker.ceph.com/issues/7662#note-2
>>> It indeed makes things simpler, more consistent and easier to explain. I can provide an implementation this week if this seems reasonable to you.
>>> Cheers
>>> --
>>> Loïc Dachary, Senior Developer
>
> --
> Loïc Dachary, Senior Developer