Re: cmake
On Wed, Dec 16, 2015 at 5:33 PM, Sage Weil wrote:
> The work to transition to cmake has stalled somewhat. I've tried to use
> it a few times but keep running into issues that make it unusable for me.
> Not having make check is a big one, but I think the hackery required to
> get that going points to the underlying problem(s).
>
> It seems like the main problem is that automake puts all build targets in
> src/ and cmake spreads them all over build/*. This means that you can't
> just add ./ to anything that would normally be in your path (or
> PATH=.:$PATH, and then run, say, ../qa/workunits/cephtool/test.sh).
> There's a bunch of kludges in vstart.sh to make it work that I think
> mostly point to this issue (and the .libs things). Is there simply an
> option we can give cmake to make it put built binaries directly in build/?
>
> Stepping back a bit, it seems like the goals should be
>
> 1. Be able to completely replace autotools. I don't fancy maintaining
> both in parallel.

Yes!

> 2. Be able to run vstart etc from the build dir.

I'm currently doing this (i.e. being in the build dir and running
../src/vstart.sh), along with the vstart_runner.py for cephfs tests. I did
indeed have to make sure that vstart_runner was aware of the differing
binary paths, though. Though I'm obviously using just MDS+OSD, so I might
be overstating the extent to which it currently works.

> 3. Be able to run ./ceph[-anything] from the build dir, or put the build
> dir in the path. (I suppose we could rely on a make install step, but
> that seems like more hassle... hopefully it's not necessary?)

Shall we just put all our libs and binaries in one place?
This works for me:

  set(CMAKE_ARCHIVE_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}/lib)
  set(CMAKE_LIBRARY_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}/lib)
  set(CMAKE_RUNTIME_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}/bin)

(to get a bin/ and a lib/ with absolutely everything in)

That way folks can either get used to typing bin/foo instead of ./foo, or
add bin/ to their path.

> 4. make check has to work
>
> 5. Use make-dist.sh to generate a release tarball (not make dist)
>
> 6. gitbuilders use make-dist.sh and cmake to build packages
>
> 7. release process uses make-dist.sh and cmake to build a release
>
> I'm probably missing something?
>
> Should we set a target of doing the 10.0.2 or .3 with cmake?
>
> sage

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Issue with Ceph File System and LIO
On Tue, Dec 15, 2015 at 9:26 AM, Mike Christie wrote:
> On 12/15/2015 12:08 AM, Eric Eastman wrote:
>> I am testing Linux Target SCSI, LIO, with a Ceph File System backstore
>> and I am seeing this error on my LIO gateway. I am using Ceph v9.2.0
>> on a 4.4rc4 kernel, on Trusty, using a kernel-mounted Ceph File
>> System. A file on the Ceph File System is exported via iSCSI to a
>> VMware ESXi 5.0 server, and I am seeing this error when doing a lot of
>> I/O on the ESXi server. Is this a LIO or a Ceph issue?
>>
>> [Tue Dec 15 00:46:55 2015] [ cut here ]
>> [Tue Dec 15 00:46:55 2015] WARNING: CPU: 0 PID: 1123421 at
>> /home/kernel/COD/linux/fs/ceph/addr.c:125
>> ceph_set_page_dirty+0x230/0x240 [ceph]()
>> [Tue Dec 15 00:46:55 2015] Modules linked in: iptable_filter ip_tables
>> x_tables xfs rbd iscsi_target_mod vhost_scsi tcm_qla2xxx ib_srpt
>> tcm_fc tcm_usb_gadget tcm_loop target_core_file target_core_iblock
>> target_core_pscsi target_core_user target_core_mod ipmi_devintf vhost
>> qla2xxx ib_cm ib_sa ib_mad ib_core ib_addr libfc scsi_transport_fc
>> libcomposite udc_core uio configfs ipmi_ssif ttm drm_kms_helper
>> gpio_ich drm i2c_algo_bit fb_sys_fops coretemp syscopyarea ipmi_si
>> sysfillrect ipmi_msghandler sysimgblt kvm acpi_power_meter 8250_fintek
>> irqbypass hpilo shpchp input_leds serio_raw lpc_ich i7core_edac
>> edac_core mac_hid ceph libceph libcrc32c fscache bonding lp parport
>> mlx4_en vxlan ip6_udp_tunnel udp_tunnel ptp pps_core hid_generic
>> usbhid hid hpsa mlx4_core psmouse bnx2 scsi_transport_sas fjes [last
>> unloaded: target_core_mod]
>> [Tue Dec 15 00:46:55 2015] CPU: 0 PID: 1123421 Comm: iscsi_trx
>> Tainted: G W I 4.4.0-040400rc4-generic #201512061930
>> [Tue Dec 15 00:46:55 2015] Hardware name: HP ProLiant DL360 G6, BIOS
>> P64 01/22/2015
>> [Tue Dec 15 00:46:55 2015] fdc0ce43 880bf38c38c0 813c8ab4
>> [Tue Dec 15 00:46:55 2015] 880bf38c38f8 8107d772 ea00127a8680
>> [Tue Dec 15 00:46:55 2015] 8804e52c1448 8804e52c15b0 8804e52c10f0 0200
>> [Tue Dec 15 00:46:55 2015] Call Trace:
>> [Tue Dec 15 00:46:55 2015] [] dump_stack+0x44/0x60
>> [Tue Dec 15 00:46:55 2015] [] warn_slowpath_common+0x82/0xc0
>> [Tue Dec 15 00:46:55 2015] [] warn_slowpath_null+0x1a/0x20
>> [Tue Dec 15 00:46:55 2015] [] ceph_set_page_dirty+0x230/0x240 [ceph]
>> [Tue Dec 15 00:46:55 2015] [] ? pagecache_get_page+0x150/0x1c0
>> [Tue Dec 15 00:46:55 2015] [] ? ceph_pool_perm_check+0x48/0x700 [ceph]
>> [Tue Dec 15 00:46:55 2015] [] set_page_dirty+0x3d/0x70
>> [Tue Dec 15 00:46:55 2015] [] ceph_write_end+0x5e/0x180 [ceph]
>> [Tue Dec 15 00:46:55 2015] [] ? iov_iter_copy_from_user_atomic+0x156/0x220
>> [Tue Dec 15 00:46:55 2015] [] generic_perform_write+0x114/0x1c0
>> [Tue Dec 15 00:46:55 2015] [] ceph_write_iter+0xf8a/0x1050 [ceph]
>> [Tue Dec 15 00:46:55 2015] [] ? ceph_put_cap_refs+0x143/0x320 [ceph]
>> [Tue Dec 15 00:46:55 2015] [] ? check_preempt_wakeup+0xfa/0x220
>> [Tue Dec 15 00:46:55 2015] [] ? zone_statistics+0x7c/0xa0
>> [Tue Dec 15 00:46:55 2015] [] ? copy_page_to_iter+0x5e/0xa0
>> [Tue Dec 15 00:46:55 2015] [] ? skb_copy_datagram_iter+0x122/0x250
>> [Tue Dec 15 00:46:55 2015] [] vfs_iter_write+0x76/0xc0
>> [Tue Dec 15 00:46:55 2015] [] fd_do_rw.isra.5+0xd8/0x1e0 [target_core_file]
>> [Tue Dec 15 00:46:55 2015] [] fd_execute_rw+0xc5/0x2a0 [target_core_file]
>> [Tue Dec 15 00:46:55 2015] [] sbc_execute_rw+0x22/0x30 [target_core_mod]
>> [Tue Dec 15 00:46:55 2015] [] __target_execute_cmd+0x1f/0x70 [target_core_mod]
>> [Tue Dec 15 00:46:55 2015] [] target_execute_cmd+0x195/0x2a0 [target_core_mod]
>> [Tue Dec 15 00:46:55 2015] [] iscsit_execute_cmd+0x20a/0x270 [iscsi_target_mod]
>> [Tue Dec 15 00:46:55 2015] [] iscsit_sequence_cmd+0xda/0x190 [iscsi_target_mod]
>> [Tue Dec 15 00:46:55 2015] [] iscsi_target_rx_thread+0x51d/0xe30 [iscsi_target_mod]
>> [Tue Dec 15 00:46:55 2015] [] ? __switch_to+0x1dc/0x5a0
>> [Tue Dec 15 00:46:55 2015] [] ? iscsi_target_tx_thread+0x1e0/0x1e0 [iscsi_target_mod]
>> [Tue Dec 15 00:46:55 2015] [] kthread+0xd8/0xf0
>> [Tue Dec 15 00:46:55 2015] [] ? kthread_create_on_node+0x1a0/0x1a0
>> [Tue Dec 15 00:46:55 2015] [] ret_from_fork+0x3f/0x70
>> [Tue Dec 15 00:46:55 2015] [] ? kthread_create_on_node+0x1a0/0x1a0
>> [Tue Dec 15 00:46:55 2015] ---[ end trace 4079437668c77cbb ]---
>> [Tue Dec 15 00:47:45 2015] ABORT_TASK: Found referenced iSCSI task_tag: 95784927
>> [Tue Dec 15 00:47:45 2015] ABORT_TASK: ref_tag: 95784927 already
>> complete, skipping

Looks likely to be a kclient bug, as it's in the newish pool_perm_check
path. Perhaps we don't usually see this because we'd usually hit the
permissions checks earlier (or during a read).

CCing zyan, who will have a better idea than me.

Eric: you should probably go ahead and
Re: [ceph-users] Re: Re: how to see file object-mappings for ceph-fuse client
On Mon, Dec 7, 2015 at 9:13 AM, Wuxiangwei wrote:
> It looks simple if everything stays at its default value. However, we do
> want to change the stripe unit size and stripe count to achieve possibly
> higher performance. If so, it would be too troublesome to manually do
> the calculation every time we want to locate a given offset (and maybe a
> certain interval). The 'cephfs map' and 'cephfs show_location' commands
> can provide the information we want, but sadly not for ceph-fuse. That's
> why I ask for a similar tool.

If you are interested in writing this, you could look at
Client::get_file_stripe_address and Striper::file_to_extents.

Currently in libcephfs we expose methods for getting the OSDs relevant to
a particular file, in case something like Hadoop wants to exploit
locality, but we don't expose the intermediate knowledge of the object
names. I am curious about why you need this?

John
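For anyone wanting to do the calculation by hand, the offset-to-object
mapping that Striper::file_to_extents performs can be sketched in Python.
This is a simplified model of the CephFS striping rules (stripe-unit
blocks dealt round-robin across stripe_count objects per object set);
the function and parameter names are illustrative, not a real API:

```python
def file_offset_to_object(off, stripe_unit, stripe_count, object_size, inode_no):
    """Map a file byte offset to (data object name, offset within object).

    Simplified model of CephFS striping: the file is cut into
    stripe_unit-sized blocks, dealt round-robin across stripe_count
    objects; each object holds object_size / stripe_unit such blocks.
    """
    assert object_size % stripe_unit == 0
    stripes_per_object = object_size // stripe_unit

    blockno = off // stripe_unit           # which stripe-unit block of the file
    stripeno = blockno // stripe_count     # which stripe (row)
    stripepos = blockno % stripe_count     # which object within the object set
    objectsetno = stripeno // stripes_per_object
    objectno = objectsetno * stripe_count + stripepos

    off_in_object = (stripeno % stripes_per_object) * stripe_unit + off % stripe_unit
    # CephFS data objects are named <inode hex>.<object number, 8 hex digits>
    return "%x.%08x" % (inode_no, objectno), off_in_object
```

With the default layout (4 MB stripe unit, stripe count 1, 4 MB objects)
this degenerates to a simple division by the object size; with a wider
stripe count it interleaves as above.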
Re: proposal to run Ceph tests on pull requests
On Sat, Dec 5, 2015 at 11:49 AM, Loic Dachary wrote:
> Hi Ceph,
>
> TL;DR: a ceph-qa-suite bot running on pull requests is sustainable and
> is an incentive for contributors to use teuthology-openstack independently

A bot for scheduling a named suite on a named PR, and posting the results
back to the PR, is definitely a good thing.

Thinking further about using commit messages to toggle the testing, I
think that this could get awkward when it's coupled to the human side of
code review. When someone pushes a "how about this?" modification they
don't necessarily want to re-run the test suite until the reviewer has
okayed it, but then that means they have to push again, and the final
thing that's tested would be a different SHA1 (hopefully the same code)
than what the human last reviewed. We'll also have e.g. rebases, where
there tends to be some discretion about whether a rebase requires a
re-test.

When you were talking about having the suite selected in the qa: tag,
there was the motivation to put it in the commit message so that it would
be preserved in backports. However, if the "Needs-qa:" flag is just a
boolean, then I think it makes more sense to control it with a github
label or by posting a command in a PR comment.

I'm not sure how this really helps with the resource issues; for example
with the fs suite we would probably not be able to make a finer-grained
choice about what tests to run based on the diff. The part about randomly
dropping a subset of tests when resources are low doesn't make sense to
me -- I think the bot should either give up or enqueue itself.

Cheers,
John

> When a pull request is submitted, it is compiled, some tests are run [1]
> and the result is added to the pull request to confirm that it does not
> introduce a trivial problem. Such tests are however limited because they
> must:
>
> * run within a few minutes at most
> * not require multiple machines
> * not require root privileges
>
> More extensive tests (primarily integration tests) are needed before a
> contribution can be merged into Ceph [2], to verify it does not
> introduce a subtle regression. It would be ideal to run these
> integration tests on each pull request but there are two obstacles:
>
> * each test takes ~1.5 hours
> * each test costs ~0.30 euros
>
> On the current master, running all tests would require ~1000 jobs [3].
> That would cost ~300 euros on each pull request and take ~10 hours,
> assuming 100 jobs can run in parallel. We could resolve that problem by:
>
> * maintaining a ceph-qa-suite map to be used as a white list mapping a
> diff to a set of tests. For instance, if the diff modifies the
> src/ceph-disk file, it outputs the ceph-disk suite [4]. This would
> effectively trim the tests that are unrelated to the contribution and
> reduce the number of tests to a maximum of ~100 [4], and most likely a
> dozen.
> * tests are run if one of the commits of the pull request has the
> *Needs-qa: true* flag in the commit message [5]
> * limiting the number of tests to fit in the allocated budget. If there
> was enough funding for 10,000 jobs during the previous period and there
> was a total of 1,000 test runs required (a test run is a set of tests as
> produced by the ceph-qa-suite map), each run is trimmed to a maximum of
> ten tests, regardless.
>
> Here is an example:
>
> Joe submits a pull request to fix a bug in the librados API.
> The make check bot compiles and fails make check because it introduces a bug.
> Joe uses run-make-check.sh locally to repeat the failure, fixes it and repushes.
> The make check bot compiles and passes make check.
> Joe amends the commit message to add *Needs-qa: true* and repushes.
> The ceph-qa-suite map script finds a change on the librados API and
> outputs smoke/basic/tasks/rados_api_tests.yaml.
> The ceph-qa-suite bot runs the test smoke/basic/tasks/rados_api_tests.yaml,
> which fails.
> Joe examines the logs found at http://teuthology-logs.public.ceph.com/
> and decides to debug by running the test himself.
> Joe runs teuthology-openstack --suite smoke/basic/tasks/rados_api_tests.yaml
> against his own OpenStack tenant [6].
> Joe repushes with a fix.
> The ceph-qa-suite bot runs the test smoke/basic/tasks/rados_api_tests.yaml,
> which succeeds.
> Kefu reviews the pull request and has a link to the successful test runs
> in the comments.
>
> This approach scales with the size of the Ceph developer community [7]
> because regular contributors benefit directly from funding the
> ceph-qa-suite bot. New contributors can focus on learning how to
> interpret the ceph-qa-suite error logs for their contribution, and learn
> how to debug via teuthology-openstack if needed, which is a better user
> experience than trying to figure out which ceph-qa-suite job to run,
> learning about teuthology, scheduling the test and interpreting the
> results.
>
> The maintenance workload of a ceph-qa-suite bot probably requires one
> work day a week, to
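The white-list map described in the proposal (diff paths to suites) could
be prototyped as a simple prefix table. The paths and suite names below
are illustrative only, not an actual ceph-qa-suite layout:

```python
# Hypothetical path-prefix -> suite map; a real one would need curation.
SUITE_MAP = [
    ("src/ceph-disk", "ceph-disk"),
    ("src/librados/", "smoke/basic/tasks/rados_api_tests.yaml"),
    ("src/mds/", "fs"),
    ("src/osd/", "rados"),
]

def suites_for(changed_paths):
    """Return the set of suites covering the files touched by a diff."""
    suites = set()
    for path in changed_paths:
        for prefix, suite in SUITE_MAP:
            if path.startswith(prefix):
                suites.add(suite)
    return suites
```

A diff touching only documentation would map to an empty set (no
integration tests scheduled), which is where most of the cost saving in
the proposal comes from.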
Re: Suggestions on tracker 13578
On Wed, Dec 2, 2015 at 7:54 PM, Paul Von-Stamwitz wrote:
>> -----Original Message-----
>> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
>> ow...@vger.kernel.org] On Behalf Of Mark Nelson
>> Sent: Wednesday, December 02, 2015 11:04 AM
>> To: Gregory Farnum; Vimal
>> Cc: ceph-devel
>> Subject: Re: Suggestions on tracker 13578
>>
>> On 12/02/2015 12:23 PM, Gregory Farnum wrote:
>> > On Tue, Dec 1, 2015 at 5:23 AM, Vimal wrote:
>> >> Hello,
>> >>
>> >> This mail is to discuss the feature request at
>> >> http://tracker.ceph.com/issues/13578.
>> >>
>> >> If done, such a tool should help point out several mis-configurations
>> >> that may cause problems in a cluster later.
>> >>
>> >> Some of the suggestions are:
>> >>
>> >> a) A check to understand if the MONs and OSD nodes are on the same
>> >> machines.
>> >>
>> >> b) If /var is a separate partition or not, to prevent the root
>> >> filesystem from being filled up.
>> >>
>> >> c) If monitors are deployed in different failure domains or not.
>> >>
>> >> d) If the OSDs are deployed in different failure domains.
>> >>
>> >> e) If a journal disk is used for more than six OSDs. Right now, the
>> >> documentation suggests up to 6 OSD journals on a single journal disk.
>> >>
>> >> f) Failure domains depending on the power source.
>> >>
>> >> There can be several more checks, and it can be a useful tool to test
>> >> the problems of an existing cluster or a new installation.
>> >>
>> >> But I'd like to know how the engineering community sees this, whether
>> >> it seems to be worth pursuing, and what suggestions you have for
>> >> improving/adding to this.
>> >
>> > This is a user experience and support tool; I don't think the
>> > engineering community can really judge its value. ;)
>> >
>> > So sure, sounds good to me. It'll need to get into the hands of users
>> > before we find out if it's a good plan or not. I was at the SDI Summit
>> > yesterday and was hearing about how some of our choices (like
>> > HEALTH_WARN on pg counts) are *really* scary for users who think
>> > they're in danger of losing data. I suspect the difficulty of a tool
>> > like this will be more in the communication of issues and severity
>> > than in what exactly we choose to check.
>>
>> Frankly, I've never been a big fan of how we report warnings like this
>> through the health check. It's important to let users know if they've
>> set things up sub-optimally, but I don't think ceph health is the way
>> to do it. Consider the difference between your doctor telling you that
>> you should exercise more and lose a few pounds, vs. that you have Ebola
>> and are going to suffer an incredibly gruesome and painful death in the
>> next 48 hours. :)
>
> Since I was the one at the SDI Summit that took issue with some of these
> warnings, I whole-heartedly agree with Greg's and Mark's comments. A
> warning at health check should indicate to the user that some corrective
> action should be taken, besides turning the warning off :-) I do not
> have an issue reporting advisories, but they should be kept separate
> from true warnings. If we want to notify the user of variances from best
> practices, I suggest a separate method, i.e. "ceph advise", rather than
> constantly repeating them on health checks.

Separating things into "advise" vs. "health" probably doesn't solve the
problem, because one has to decide what goes in which section, and ends
up with the same problem as INFO/WARN/ERR categorisation -- the idea of
having different categories is fine; the hard part is assigning
particular items to a category in a way that makes sense for different
users. IMHO the core problems are attempting to collapse all these
notifications into a global indicator, and attempting to do that in the
same way for all systems. It needs to be finer grained than that.

I never got around to doing anything with #7192 [1], but it outlines a
way to change the health output into a form where it's easier to
selectively ignore particular items. Once you break down the health
output into a set of known status codes, a natural extension would be to
have user-configurable masks, so that users could cancel particular
warnings if they wanted to. Think of it like having the ability to press
the warning lights in an aeroplane cockpit to turn off the alarm sound.

John

1. http://tracker.ceph.com/issues/7192
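The status-code-plus-mask idea can be sketched briefly. The codes and
severities below are invented for illustration (ticket #7192 describes the
actual proposal; Ceph's real health machinery works differently):

```python
# Hypothetical health codes and a three-level severity scale.
SEVERITY = {"OK": 0, "WARN": 1, "ERR": 2}

def overall_health(items, mask=frozenset()):
    """Collapse per-item health into one global indicator, skipping
    any codes the operator has chosen to silence.

    items: iterable of (code, severity) pairs, e.g. ("PG_COUNT_LOW", "WARN").
    mask:  set of codes to ignore -- the "pressed warning light".
    """
    worst = "OK"
    for code, sev in items:
        if code in mask:
            continue  # operator acknowledged this one; don't raise the alarm
        if SEVERITY[sev] > SEVERITY[worst]:
            worst = sev
    return worst
```

The point of the sketch is that once health is a set of discrete codes,
masking becomes a per-code set operation rather than a global on/off
switch, which is exactly the finer granularity argued for above.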
Re: RFC: teuthology field in commit messages
On Sun, Nov 29, 2015 at 8:25 PM, Loic Dachary <l...@dachary.org> wrote:
>
> On 29/11/2015 21:08, John Spray wrote:
>> On Sat, Nov 28, 2015 at 3:56 PM, Loic Dachary <l...@dachary.org> wrote:
>>> Hi Ceph,
>>>
>>> An optional teuthology field could be added to a commit message like so:
>>>
>>> teuthology: --suite rbd
>>>
>>> to state that this commit should be tested with the rbd suite. It could
>>> be parsed by bots and humans.
>>>
>>> It would make it easy and cost effective to run partial teuthology
>>> suites automatically on pull requests.
>>>
>>> What do you think ?
>>
>> Hmm, we are usually testing things at the branch/PR level rather than
>> on the per-commit level, so it feels a bit strange to have this in the
>> commit message.
>
> Indeed. But what is a branch if not the HEAD commit ?

It's the HEAD commit, and its ancestors. So in a typical PR (or branch)
there are several commits since the base (i.e. since master), and perhaps
only one of them has a test suite marked on it, or maybe they have
different test suites marked on different commits in the branch.

It's not necessarily a problem, just something that would need to have a
defined behaviour (maybe when testing a PR, collect the "teuthology:"
tags from all commits in the PR, and run all the suites mentioned?).

John

>> However, if a system existed that would auto-test things when I put
>> something magic in a commit message, I would probably use it!
>>
>> John
>>
>>> --
>>> Loïc Dachary, Artisan Logiciel Libre
>
> --
> Loïc Dachary, Artisan Logiciel Libre
Re: RFC: teuthology field in commit messages
On Sat, Nov 28, 2015 at 3:56 PM, Loic Dachary wrote:
> Hi Ceph,
>
> An optional teuthology field could be added to a commit message like so:
>
> teuthology: --suite rbd
>
> to state that this commit should be tested with the rbd suite. It could
> be parsed by bots and humans.
>
> It would make it easy and cost effective to run partial teuthology
> suites automatically on pull requests.
>
> What do you think ?

Hmm, we are usually testing things at the branch/PR level rather than on
the per-commit level, so it feels a bit strange to have this in the
commit message.

However, if a system existed that would auto-test things when I put
something magic in a commit message, I would probably use it!

John

> --
> Loïc Dachary, Artisan Logiciel Libre
Re: RFC: teuthology field in commit messages
On Sun, Nov 29, 2015 at 9:25 PM, Loic Dachary <l...@dachary.org> wrote:
>
> On 29/11/2015 21:47, John Spray wrote:
>> On Sun, Nov 29, 2015 at 8:25 PM, Loic Dachary <l...@dachary.org> wrote:
>>>
>>> On 29/11/2015 21:08, John Spray wrote:
>>>> On Sat, Nov 28, 2015 at 3:56 PM, Loic Dachary <l...@dachary.org> wrote:
>>>>> Hi Ceph,
>>>>>
>>>>> An optional teuthology field could be added to a commit message like so:
>>>>>
>>>>> teuthology: --suite rbd
>>>>>
>>>>> to state that this commit should be tested with the rbd suite. It
>>>>> could be parsed by bots and humans.
>>>>>
>>>>> It would make it easy and cost effective to run partial teuthology
>>>>> suites automatically on pull requests.
>>>>>
>>>>> What do you think ?
>>>>
>>>> Hmm, we are usually testing things at the branch/PR level rather than
>>>> on the per-commit level, so it feels a bit strange to have this in
>>>> the commit message.
>>>
>>> Indeed. But what is a branch if not the HEAD commit ?
>>
>> It's the HEAD commit, and its ancestors. So in a typical PR (or
>> branch) there are several commits since the base (i.e. since master),
>> and perhaps only one of them has a test suite marked on it, or maybe
>> they have different test suites marked on different commits in the
>> branch.
>>
>> It's not necessarily a problem, just something that would need to have
>> a defined behaviour (maybe when testing a PR, collect the "teuthology:"
>> tags from all commits in the PR, and run all the suites mentioned?).
>
> That's an interesting idea :-) My understanding is that we currently
> test a PR by scheduling suites on its HEAD. But maybe you sometimes
> schedule suites using a commit that's in the middle of a PR?

I think I've made this too complicated... What I meant was that while one
would schedule suites against the HEAD of the PR, that might not be the
same commit that has the logical testing information in it.

For example, I might have a main commit that has the "Fixes: " and
"teuthology: " tags, and then a second commit (that would be HEAD) which
e.g. tweaks a unit test. It would be weird if I had to put the
teuthology: tag on the unit-test commit rather than the functional one,
so I guess it would make sense to look at the teuthology: tags from all
the commits in a PR when scheduling it.

John
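The "collect tags from all commits in a PR" behaviour discussed in this
thread could be sketched as follows. The tag format follows the proposal;
the list of commit messages stands in for a real git/GitHub query:

```python
import re

# Matches a "teuthology: --suite <name>" line anywhere in a commit message.
TAG_RE = re.compile(r"^teuthology:\s*--suite\s+(\S+)", re.MULTILINE)

def suites_from_commits(commit_messages):
    """Gather every suite named by a teuthology: tag across all commits
    in a PR, so the tag can live on any commit, not just HEAD."""
    suites = set()
    for msg in commit_messages:
        suites.update(TAG_RE.findall(msg))
    return suites
```

With this behaviour a bot testing the PR's HEAD would still pick up a tag
placed on the main functional commit further down the branch.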
Re: nfsv41 over AF_VSOCK (nfs-ganesha)
On Mon, Oct 19, 2015 at 5:13 PM, John Spray <jsp...@redhat.com> wrote:
>> If you try this, send feedback.

OK, got this up and running. I've shared the kernel/qemu/nfsutils
packages I built here:
https://copr.fedoraproject.org/coprs/jspray/vsock-nfs/builds/
(at time of writing the kernel one is still building, and I'm running
with ganesha out of a source tree)

Observations:

* Running the VM as the qemu user gives EPERM opening the vsock device,
even after changing permissions on the device node (for which I guess
we'll want udev rules at some stage) -- is there a particular capability
that we need to grant the qemu user? I was looking into this to make it
convenient to run inside libvirt.

* NFS writes from the guest lag for around a minute before completing. My
hunch is that this is something in the NFS client recovery stuff (in
ganesha) that's not coping with vsock; the operations seem to complete at
the point where the server declares itself "NOT IN GRACE".

* For those (like myself) unaccustomed to running ganesha: do not run it
straight out of a source tree and expect everything to work. By default
even VFS exports won't work that way (mounts work but clients see an
empty tree) because it can't find the built FSAL .so. You can write a
config file that works, but it's easier just to make install it.

* (Anecdotal, seen while messing with other stuff) the client mount seems
to hang if I kill ganesha and then start it again; not sure if this is a
ganesha issue or a general vsock issue.

Cheers,
John
Re: MDS stuck in a crash loop
On Thu, Oct 22, 2015 at 1:43 PM, Milosz Tanski <mil...@adfin.com> wrote:
> On Wed, Oct 21, 2015 at 5:33 PM, John Spray <jsp...@redhat.com> wrote:
>> On Wed, Oct 21, 2015 at 10:33 PM, John Spray <jsp...@redhat.com> wrote:
>>>> John, I know you've got
>>>> https://github.com/ceph/ceph-qa-suite/pull/647. I think that's
>>>> supposed to be for this, but I'm not sure if you spotted any issues
>>>> with it or if we need to do some more diagnosing?
>>>
>>> That test path is just verifying that we do handle dirs without dying
>>> in at least one case -- it passes with the existing ceph code, so it's
>>> not reproducing this issue.
>>
>> Clicked send too soon, I was about to add...
>>
>> Milosz mentioned that they don't have the data from the system in the
>> broken state, so I don't have any bright ideas about learning more
>> about what went wrong here, unfortunately.
>
> Sorry about that, wasn't thinking at the time and just wanted to get
> this up and going as quickly as possible :(
>
> If this happens next time I'll be more careful to keep more evidence.
> I think multi-fs in the same rados namespace support would actually
> have helped here, since it makes it easier to create a new fs and leave
> the other one around (for investigation).

Yep, good point. I am a known enthusiast for multi-filesystem support :-)

> But it makes me wonder whether the broken dir scenario could be
> replicated by hand using rados calls. There's a pretty generic ticket
> there for "don't die on dir errors", but I imagine the code can be
> audited and steps to cause a synthetic error can be produced.

Yes, that part I have done (and will build into the automated tests in
due course) -- the bit that is still a mystery is how the damage occurred
to begin with.

John

> --
> Milosz Tanski
> CTO
> 16 East 34th Street, 15th floor
> New York, NY 10016
>
> p: 646-253-9055
> e: mil...@adfin.com
Re: MDS stuck in a crash loop
> John, I know you've got
> https://github.com/ceph/ceph-qa-suite/pull/647. I think that's
> supposed to be for this, but I'm not sure if you spotted any issues
> with it or if we need to do some more diagnosing?

That test path is just verifying that we do handle dirs without dying in
at least one case -- it passes with the existing ceph code, so it's not
reproducing this issue.
Re: MDS stuck in a crash loop
On Wed, Oct 21, 2015 at 10:33 PM, John Spray <jsp...@redhat.com> wrote:
>> John, I know you've got
>> https://github.com/ceph/ceph-qa-suite/pull/647. I think that's
>> supposed to be for this, but I'm not sure if you spotted any issues
>> with it or if we need to do some more diagnosing?
>
> That test path is just verifying that we do handle dirs without dying
> in at least one case -- it passes with the existing ceph code, so it's
> not reproducing this issue.

Clicked send too soon, I was about to add...

Milosz mentioned that they don't have the data from the system in the
broken state, so I don't have any bright ideas about learning more about
what went wrong here, unfortunately.

John
Re: nfsv41 over AF_VSOCK (nfs-ganesha)
On Fri, Oct 16, 2015 at 10:08 PM, Matt Benjamin wrote:
> Hi devs (CC Bruce--here is a use case for vmci sockets transport)
>
> One of Sage's possible plans for Manilla integration would use nfs over
> the new Linux vmci sockets transport integration in qemu (below) to
> access Cephfs via an nfs-ganesha server running in the host vm.
>
> This now experimentally works.

Very cool! Thank you for the detailed instructions, I look forward to
trying this out soon.

John

> Some notes on running nfs-ganesha over AF_VSOCK:
>
> 1. need Stefan Hajnoczi's patches for
>    * linux kernel (build w/ vhost-vsock support)
>    * qemu (build w/ vhost-vsock support)
>    * nfs-utils (in vm guest)
>
>    all linked from https://github.com/stefanha?tab=repositories
>
> 2. host and vm guest kernels must include vhost-vsock
>    * host kernel should load vhost-vsock.ko
>
> 3. start a qemu(-kvm) guest (w/ patched kernel) with a vhost-vsock-pci
> device, e.g.
>
>    /opt/qemu-vsock/bin/qemu-system-x86_64 -m 2048 -usb -name vsock1
>      --enable-kvm
>      -drive file=/opt/images/vsock.qcow,if=virtio,index=0,format=qcow2
>      -drive file=/opt/isos/f22.iso,media=cdrom
>      -net nic,model=virtio,macaddr=02:36:3e:41:1b:78 -net bridge,br=br0
>      -parallel none -serial mon:stdio
>      -device vhost-vsock-pci,id=vhost-vsock-pci0,addr=4.0,guest-cid=4
>      -boot c
>
> 4. nfs-ganesha (in host)
>    * need nfs-ganesha and its ntirpc rpc provider with vsock support
>      https://github.com/linuxbox2/nfs-ganesha (vsock branch)
>      https://github.com/linuxbox2/ntirpc (vsock branch)
>
>    * configure ganesha w/ vsock support
>      cmake -DCMAKE_INSTALL_PREFIX=/cache/nfs-vsock -DUSE_FSAL_VFS=ON
>        -DUSE_VSOCK -DCMAKE_C_FLAGS="-O0 -g3 -gdwarf-4" ../src
>
>    in ganesha.conf, add "nfsvsock" to the Protocols list in the EXPORT
>    block
>
> 5. mount in guest w/ nfs41 (e.g., in fstab):
>
>    2:// /vsock41 nfs noauto,soft,nfsvers=4.1,sec=sys,proto=vsock,clientaddr=4,rsize=1048576,wsize=1048576 0 0
>
> If you try this, send feedback.
>
> Thanks!
>
> Matt
>
> --
> Matt Benjamin
> Red Hat, Inc.
> 315 West Huron Street, Suite 140A
> Ann Arbor, Michigan 48103
>
> http://www.redhat.com/en/technologies/storage
>
> tel. 734-707-0660
> fax. 734-769-8938
> cel. 734-216-5309
Re: newstore direction
On Mon, Oct 19, 2015 at 8:49 PM, Sage Weil wrote:
> - We have to size the kv backend storage (probably still an XFS
> partition) vs the block storage. Maybe we do this anyway (put metadata
> on SSD!) so it won't matter. But what happens when we are storing gobs
> of rgw index data or cephfs metadata? Suddenly we are pulling storage
> out of a different pool and those aren't currently fungible.

This is the concerning bit for me -- for the other parts one "just" has
to get the code right, but this problem could linger and be something we
have to keep explaining to users indefinitely. It reminds me of cases in
other systems where users had to make an educated guess about inode size
up front, depending on whether they expected to efficiently store a lot
of xattrs.

In practice it's rare for users to make these kinds of decisions well
up-front: it really needs to be adjustable later, ideally automatically.
That could be pretty straightforward if the KV part was stored directly
on block storage, instead of having XFS in the mix.

I'm not quite up with the state of the art in this area: are there any
reasonable alternatives for the KV part that would consume some defined
range of a block device from userspace, instead of sitting on top of a
filesystem?

John
Re: MDS stuck in a crash loop
On Mon, Oct 12, 2015 at 3:36 AM, Milosz Tanski wrote:
> On Sun, Oct 11, 2015 at 6:44 PM, Milosz Tanski wrote:
>> On Sun, Oct 11, 2015 at 6:01 PM, Milosz Tanski wrote:
>>> On Sun, Oct 11, 2015 at 5:33 PM, Milosz Tanski wrote:

On Sun, Oct 11, 2015 at 5:24 PM, Milosz Tanski wrote:
> On Sun, Oct 11, 2015 at 1:16 PM, Gregory Farnum wrote:
>> On Sun, Oct 11, 2015 at 10:09 AM, Milosz Tanski wrote:
>>> About an hour ago my MDSs (primary and follower) started ping-pong
>>> crashing with this message. I've spent about 30 minutes looking into
>>> it but nothing yet.
>>>
>>> This is from a 0.94.3 MDS
>>>
>>> 0> 2015-10-11 17:01:23.596008 7fd4f52ad700 -1 mds/SessionMap.cc:
>>> In function 'virtual void C_IO_SM_Save::finish(int)' thread
>>> 7fd4f52ad700 time 2015-10-11 17:01:23.594089
>>> mds/SessionMap.cc: 120: FAILED assert(r == 0)
>>
>> These "r == 0" asserts pretty much always mean that the MDS did a
>> read or write to RADOS (the OSDs) and got an error of some kind back.
>> (Or in the case of the OSDs, access to the local filesystem returned
>> an error, etc.) I don't think these writes include any safety checks
>> which would let the MDS break it, which means that probably the OSD is
>> actually returning an error — odd, but not impossible.
>>
>> Notice that the assert happened in thread 7fd4f52ad700, and look for
>> the stuff in that thread. You should be able to find an OSD op reply
>> (on the SessionMap object) coming in and reporting an error code.
>> -Greg
>
> I only see two error ops in that whole MDS session. Neither one happened
> on the same thread (7f5ab6000700 in this file). But it looks like the
> only session map error is the -90 "Message too long" one.
> mtanski@tiny:~$ cat single_crash.log | grep 'osd_op_reply' | grep -v 'ondisk = 0'
> -3946> 2015-10-11 20:51:11.013965 7f5ab20f2700 1 --
> 10.0.5.31:6802/27121 <== osd.25 10.0.5.57:6804/32341 6163
> osd_op_reply(46349 mds0_sessionmap [writefull 0~95168363] v0'0 uv0
> ondisk = -90 ((90) Message too long)) v6 182+0+0 (2955408122 0 0)
> 0x3a55d340 con 0x3d5a3c0
> -705> 2015-10-11 20:51:11.374132 7f5ab22f4700 1 --
> 10.0.5.31:6802/27121 <== osd.28 10.0.5.50:6801/1787 5297
> osd_op_reply(48004 300.e274 [delete] v0'0 uv1349638 ondisk = -2
> ((2) No such file or directory)) v6 179+0+0 (1182549251 0 0)
> 0x66c5c80 con 0x3d5a7e0
>
> Any idea what this could be, Greg?

To follow this up, I found this ticket from 9 months ago: http://tracker.ceph.com/issues/10449

In there Yan says: "it's a kernel bug. hang request prevents mds from trimming completed_requests in sessionmap. there is nothing to do with mds. (maybe we should add some code to MDS to show warning when this bug happens)"

When I was debugging this I saw an OSD (not cephfs client) operation stuck for a long time, along with the MDS error:

HEALTH_WARN 1 requests are blocked > 32 sec; 1 osds have slow requests; mds cluster is degraded; mds0: Behind on trimming (709/30)
1 ops are blocked > 16777.2 sec
1 ops are blocked > 16777.2 sec on osd.28

I did eventually bounce the OSD in question and it hasn't become stuck since, but the MDS is still eating it every time with the "Message too long" error on the session map. I'm not quite sure where to go from here.

>>> First time I had a chance to use the new recovery tools. I was able to
>>> replay the journal, reset it, and then reset the sessionmap. The MDS
>>> returned back to life and so far everything looks good. Yay.
>>>
>>> Triggering this bug/issue is a pretty interesting set of steps.
>>
>> Spoke too soon, a missing dir is now causing the MDS to restart itself.
>> -6> 2015-10-11 22:40:47.300169 7f580c7b9700 5 -- op tracker --
>> seq: 4, time: 2015-10-11 22:40:47.300168, event: finishing request,
>> op: client_request(client.3597476:21480382 rmdir #100015e0be2/58
>> 2015-10-11 21:34:49.224905 RETRY=36)
>> -5> 2015-10-11 22:40:47.300208 7f580c7b9700 5 -- op tracker --
>> seq: 4, time: 2015-10-11 22:40:47.300208, event: cleaned up request,
>> op: client_request(client.3597476:21480382 rmdir #100015e0be2/58
>> 2015-10-11 21:34:49.224905 RETRY=36)
>> -4> 2015-10-11 22:40:47.300231 7f580c7b9700 5 -- op tracker --
>> seq: 4, time: 2015-10-11 22:40:47.300231, event: done, op:
>> client_request(client.3597476:21480382 rmdir #100015e0be2/58
>> 2015-10-11 21:34:49.224905 RETRY=36)
>> -3> 2015-10-11 22:40:47.300284 7f580e0bd700 0
>> mds.0.cache.dir(100048df076) _fetched missing object for [dir
>> 100048df076 /petabucket/beta/6d/f6/ [2,head] auth v=0 cv=0/0 ap=1+0+0
>> state=1073741952 f() n()
Re: advice on indexing sequential data?
On Thu, Oct 1, 2015 at 11:44 AM, Tom Nakamura wrote:
> Hello ceph-devel,
>
> My lab is concerned with developing a data mining application for
> detecting and 'deanonymizing' spamming botnets from high-volume spam
> feeds.
>
> Currently, we just store everything in large mbox files in directories
> separated by day (think: /data/feed1/2015/09/13.mbox) on a single NFS
> server. We have ad-hoc scripts to extract features from these mboxes and
> pass them to our analysis pipelines (written in a mixture of
> R/matlab/python/etc). This system is reaching its limit.
>
> We already have a small Ceph installation with which we've had good luck
> for storing other data, and would like to see how we can use it to solve
> our mail problem. Our basic requirements are that:
>
> - We need to be able to access each message by its extracted features.
>   These features include simple information found in its header (for
>   example From: and To:) as well as more complex information like
>   signatures from attachments and network information (for example,
>   presence in blacklists).
> - We will frequently add/remove features.
> - Faster access to recent data is more important than to older data.
> - Maintaining strict ordering of incoming messages is not necessary. In
>   other words, if we received two spam messages on our feeds, it doesn't
>   matter too much if they are stored in that order, as long as we can have
>   coarse-grained temporal accuracy (say, 5 minutes). So we don't need
>   anything as sophisticated as Zlog.
> - We need to be able to remove messages older than some specific age,
>   due to storage constraints.
>
> Any advice on how to use Ceph and librados to accomplish this? Here are
> my initial thoughts:
>
> - Each message is an object with some unique ID. Use omap to store all
>   its features in the same object.
> - For each time period (which will have to be pre-specified to, say, an
>   hour), we have an object which contains a list of IDs, as a bytestring
>   of concatenated IDs. This should make expiring old messages trivial.
> - For each feature, we have a timestamped index (like
>   20150930-from-...@bar.com or
>   20150813-has-attachment-with-hash-123abddeadbeef) which contains a
>   list of IDs.
> - Hopefully use RADOS classes to index/feature-extract on the OSDs.
>
> How does this sound? One glaring omission is that I do not know how to
> create indices which would support querying by inequality/ranges ('find
> all messages between 1000 and 2000 bytes').

I would suggest some sort of hybrid approach, where you store your messages and your time index in ceph (so that you can insert data and expire data all within ceph), then use an external database for the queries your application layer is interested in. That way the external database becomes somewhat disposable (you can always rebuild it efficiently for any given time period by consulting your time index in ceph), but you don't have to implement any multi-axis querying inside ceph.

With that kind of approach, you don't have to worry about implementing indices (let the existing database do it), but you do still have to worry about recovery from failure, i.e. keeping the ceph store and the database index in sync. You might need a "regenerate data for this time period" call that re-inserts the last 5 minutes of emails into the database after a failure of whatever is injecting the data.

John
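A minimal sketch of the time-period index idea above. The object-name prefix, 5-minute bucket width, and 16-byte fixed-width ID format are all made up for illustration; in librados the per-bucket bytestring would be built with append operations on the bucket object, and expiry would be deleting whole bucket objects older than the cutoff.

```cpp
// Sketch: message IDs are grouped into one index object per coarse time
// bucket, so expiring old data means deleting whole bucket objects.
// Names and the fixed 16-byte ID width are illustrative assumptions.
#include <cstdint>
#include <cstdio>
#include <string>

// Name of the bucket object covering a given unix timestamp.
std::string bucket_name(uint64_t unix_ts, uint64_t bucket_secs = 300) {
  char buf[32];
  std::snprintf(buf, sizeof(buf), "msgidx.%012llu",
                static_cast<unsigned long long>(unix_ts / bucket_secs));
  return buf;
}

// Append a fixed-width ID to a bucket's concatenated-ID bytestring.
void append_id(std::string& bucket_data, const std::string& id) {
  std::string fixed = id;
  fixed.resize(16, '\0');  // pad/truncate to 16 bytes so entries are scannable
  bucket_data += fixed;
}

// Number of IDs recoverable from a bucket object's contents.
size_t id_count(const std::string& bucket_data) {
  return bucket_data.size() / 16;
}
```

Because bucket names sort lexicographically by time, "remove everything older than N days" is a prefix-ordered scan-and-delete over bucket objects, with no per-message bookkeeping.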
Re: libcephfs invalidate upcalls
On Sat, Sep 26, 2015 at 8:03 PM, Matt Benjamin wrote:
> Hi John,
>
> I prototyped an invalidate upcall for libcephfs and the Ganesha Ceph fsal,
> building on the Client invalidation callback registrations.
>
> As you suggested, NFS (or AFS, or DCE) minimally expect a more generic
> "cached vnode may have changed" trigger than the current inode and dentry
> invalidates, so I extended the model slightly to hook cap revocation;
> feedback appreciated.

In cap_release, we probably need to be a bit more discriminating about when to drop; e.g. if we've only lost our exclusive write caps, the rest of our metadata might all still be fine to cache.

Is ganesha in general doing any data caching? I think I had implicitly assumed that we were only worrying about metadata here, but now I realise I never checked that.

The awkward part is Client::trim_caps. In that case, the lru_is_expirable part won't be true until something has already been invalidated, so there needs to be an explicit hook there -- rather than invalidating in response to cap release, we need to invalidate in order to get ganesha to drop its handle, which will render something expirable, and finally when we expire it, the cap gets released.

In that case maybe we need a hook in ganesha to say "invalidate everything you can" so that we don't have to make a very large number of function calls to invalidate things. In the fuse/kernel case we can only sometimes invalidate a piece of metadata (e.g. we can't if it's flocked or whatever), so we ask it to invalidate everything. But perhaps in the NFS case we can always expect our invalidate calls to be respected, so we could just invalidate a smaller number of things (the difference between actual cache size and desired)?

John

> g...@github.com:linuxbox2/ceph.git , branch invalidate
> g...@github.com:linuxbox2/nfs-ganesha.git , branch ceph-invalidates
>
> thanks,
>
> Matt
>
> --
> Matt Benjamin
> Red Hat, Inc.
> 315 West Huron Street, Suite 140A
> Ann Arbor, Michigan 48103
>
> http://www.redhat.com/en/technologies/storage
>
> tel. 734-761-4689
> fax. 734-769-8938
> cel. 734-216-5309
[Manila] CephFS native driver
Hi all,

I've recently started work on a CephFS driver for Manila. The (early) code is here:
https://github.com/openstack/manila/compare/master...jcsp:ceph

It requires a special branch of ceph, which is here:
https://github.com/ceph/ceph/compare/master...jcsp:wip-manila

This isn't done yet (hence this email rather than a gerrit review), but I wanted to give everyone a heads up that this work is going on, along with a brief status update.

This is the 'native' driver in the sense that clients use the CephFS client to access the share, rather than re-exporting it over NFS. The idea is that this driver will be useful for anyone who has such clients, as well as acting as the basis for a later NFS-enabled driver.

The export location returned by the driver gives the client the Ceph mon IP addresses, the share path, and an authentication token. This authentication token is what permits the clients access (Ceph does not do access control based on IP addresses).

It's just capable of the minimal functionality of creating and deleting shares so far, but I will shortly be looking into hooking up snapshots/consistency groups, albeit for read-only snapshots only (cephfs does not have writeable snapshots). Currently deletion is just a move into a 'trash' directory; the idea is to add something later that cleans this up in the background. The downside to the "shares are just directories" approach is that clearing them up has an "rm -rf" cost!

A note on the implementation: cephfs recently got the ability (not yet in master) to restrict client metadata access based on path, so this driver is simply creating shares by creating directories within a cluster-wide filesystem, and issuing credentials to clients that restrict them to their own directory. They then mount that subpath, so that from the client's point of view it's like having their own filesystem.

We also have a quota mechanism that I'll hook in later to enforce the share size. Currently the security here requires clients (i.e.
the ceph-fuse code on client hosts, not the userspace applications) to be trusted, as quotas are enforced on the client side. The OSD access control operates on a per-pool basis, and creating a separate pool for each share is inefficient. In the future it is expected that CephFS will be extended to support file layouts that use RADOS namespaces, which are cheap, such that we can issue a new namespace to each share and enforce the separation between shares on the OSD side.

However, for many people the ultimate access control solution will be to use an NFS gateway in front of their CephFS filesystem: it is expected that an NFS-enabled cephfs driver will follow this native driver in the not-too-distant future.

This will be my first openstack contribution, so please bear with me while I come up to speed with the submission process. I'll also be in Tokyo for the summit next month, so I hope to meet other interested parties there.

All the best,
John
Re: full cluster/pool handling
On Thu, Sep 24, 2015 at 7:26 PM, Gregory Farnum wrote:
> On Thu, Sep 24, 2015 at 8:04 AM, Sage Weil wrote:
>> On Thu, 24 Sep 2015, Robert LeBlanc wrote:
>>> -----BEGIN PGP SIGNED MESSAGE-----
>>> Hash: SHA256
>>>
>>> On Thu, Sep 24, 2015 at 6:30 AM, Sage Weil wrote:
>>> > Xuan Liu recently pointed out that there is a problem with our handling
>>> > for full clusters/pools: we don't allow any writes when full,
>>> > including delete operations.
>>> >
>>> > While fixing a separate full issue I ended up making several fixes and
>>> > cleanups in the full handling code in
>>> >
>>> > https://github.com/ceph/ceph/pull/6052
>>> >
>>> > The interesting part of that is that we will allow a write as long as it
>>> > doesn't increase the overall utilization of bytes or objects (according to
>>> > the pg stats we're maintaining). That will include remove ops, of course,
>>> > but will also allow overwrites while full, which seems fair.
>>>
>>> What about overwrites on a COW FS, won't that still increase used
>>> space? Maybe if it is a COW FS, don't allow overwrites?
>>
>> Yeah, we could strengthen (optionally?) the check so that only operations
>> that result in a net decrease are allowed.
>
> It's not just COW filesystems, anything that modifies
> leveldb/rocksdb/whatever in any way will also increase the space used
> — including regular object deletes, which additionally get added to the
> PG log, although *hopefully* that's not a problem since we have our
> extra buffers to handle this sort of thing. While right now we might
> have some hope of being able to tag ops as "net deletes" or "net
> adds", I don't see that happening once we have widespread third-party
> object classes or that Lua work gets in or something...
>
> So, I'd be really leery of trying to do anything more advanced than
> letting clients execute delete operations, and letting privileged
> clients keep doing real work. (Or maybe restricting it entirely to the
> second half there.)
> That latter switch already exists, by the way, although I don't think
> it's actually enforced via cephx caps (it should be) — the Objecter
> has an honor_full_flag setting which the MDS sets to false. I don't
> think the library interfaces are there to specify it per-op but IIRC
> it is part of the data sent to OSDs so it wouldn't require a wire
> protocol change.

I don't think it's sent with the ops; the OSDs simply check if the requester's entity type is MDS.

John

> -Greg
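The admission rule being discussed above can be sketched in a few lines: when the full flag is set, allow an op only if its projected net effect does not increase utilization, with a bypass for privileged requesters such as the MDS. This is a simplified illustration of the idea in the thread, not the actual OSD code path, and as Greg notes the projected deltas can understate real backend growth (COW filesystems, leveldb/rocksdb overhead).

```cpp
// Simplified sketch of full-pool admission: permit only net-non-increasing
// ops, or anything from an MDS (which bypasses the full check per the
// honor_full_flag discussion). Illustrative only.
#include <cstdint>

struct OpEffect {
  int64_t byte_delta;    // projected change in stored bytes
  int64_t object_delta;  // projected change in object count
};

bool allow_when_full(const OpEffect& op, bool requester_is_mds) {
  if (requester_is_mds)
    return true;  // privileged: the MDS must make progress (e.g. journaling)
  // Deletes and same-size overwrites pass; anything that grows is refused.
  return op.byte_delta <= 0 && op.object_delta <= 0;
}
```

Under this rule a delete (`{-100, -1}`) and an in-place overwrite (`{0, 0}`) are admitted, while any growing write is refused unless it comes from the MDS.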
Re: Review Request
This is neat. I had been hankering to lambda-ize various things, but hadn't worked through what the allocation would look like in practice.

Do you know if there's a reason the standard is defined so as to not let us override the reserved size of a std::function? Are we taking any kind of hit by doing this?

John

P.S. I have just gone on Amazon to order a C++11 book because I need to start using more than just lambda, auto, and pretty for loops ;-)

On Wed, Sep 16, 2015 at 8:06 PM, Adam C. Emerson wrote:
> On 09/16/2015 02:31 PM, Gregory Farnum wrote:
>> Can you provide a little background (links or text) for those of us
>> who aren't up on C++11x/14x? I looked at them briefly but having only
>> the vaguest idea about some of them quickly got lost. :)
>
> Surely, sir.
>
> So, in summary, Context* is the means we have been using to handle
> callbacks. It has two big problems:
>
> (1) Every Context is allocated at the point of use and freed when it's
> called. So by their nature they make heavy use of the heap. (If you
> overload complete there's a few exceptions, but by and large Context is
> very heapy.)
>
> (2) Context accepts an int. That's it. You can get around this by
> storing other things in it and passing pointers to it and so forth. But
> it would be nice to have more variety of type in our callbacks.
>
> C++11 (and following) have a whole pile of things that can be used as
> functions. There are function objects (objects with an operator()
> defined on them), lambdas with variable capture, function member
> pointers, reference wrappers pointing to function objects, and the like.
> This gives you a good bit of flexibility, allows things like variable
> capture, lets you allocate one object once and just pass references to
> it, and so forth.
>
> Now, these objects all have different sizes and different types. So you
> can't just shove them in an object naively.
> Because a class one of whose
> members is a function pointer is going to be a different type than a
> class holding a function object. (Which itself is different to a type
> holding a reference to a function object.)
>
> std::function exists as a solution to this problem. It provides a
> uniform type that can hold any object satisfying the requirement of
> being callable with given argument types and a given return value. So,
> for example, if you have some 'stat'-like call, you could specify it as
> taking a function taking an error code, a size, and a date. Anything
> that can be called with such arguments would be accepted; anything that
> can't (wrong argument types, say) would be rejected at compile time. And
> it could be uniformly stored in a list of Operations.
>
> The downside is that if the thing you provide is too big, std::function
> will allocate. 'Too big' depends on your library vendor and there's no
> way to find out what it is. Thus the purpose of the ceph::function class.
> If we have a pretty good idea how large most of the callback function
> objects we expect to hold are, we can tell it to statically allocate
> that much space. This gives us a tool to get allocations out of our fast
> path. (For example, if we preallocate a bunch of classes with a
> ceph::function that preallocates enough space to hold likely callbacks,
> we can just pick them off and have no allocations, in the usual case.)
>
> If we know /everything/ we'll ever get, we can disable allocations
> entirely. This is mostly so we can catch situations where we're trying
> to shove something unexpectedly huge somewhere it ought not go. An
> internal sanity check.
>
> Now, ceph::function, like std::function, is still an abstraction with a
> virtual call in it, and because it copies things around it reduces
> opportunities for inlining. Thus, if you aren't storing a callback on an
> object that's supposed to be in a queue, you shouldn't use it.
> You should do something like:
>
>   template <typename Fun>
>   void do_stuff(Fun&& f) {
>     ...;
>     f(some, values);
>   }
>
> This allows inlining, and (based on my experiments trying multiple
> implementations and looking at the generated assembly) if f is called in
> a loop, the code is just as good as if you'd open-coded the loop and
> written the body of 'f' in it. Functions like this have the
> potential to use closures (lambda expressions) effectively for free.
>
> So, the summary is:
>
> Context* is very heapy. We have less heapy alternatives. This implements
> a foundation for one of them. We also get a bit more flexibility.
>
> Thank you.
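The two callback-passing styles Adam contrasts can be shown side by side. This sketch only illustrates the trade-off being described (template parameter = inlinable, no type erasure; std::function = uniformly storable, but may heap-allocate when the callable outgrows the library's small-buffer size); it does not show ceph::function itself.

```cpp
// Style 1 vs. style 2 from the thread, on a trivial workload.
#include <functional>
#include <vector>

// Style 1: template parameter, as in the do_stuff() example above.
// The compiler sees the concrete callable type and can inline the call.
template <typename Fun>
int sum_with(const std::vector<int>& v, Fun&& f) {
  int total = 0;
  for (int x : v)
    total += f(x);
  return total;
}

// Style 2: std::function, needed when callbacks of differing concrete
// types must be stored uniformly (e.g. a queue of pending Operations).
// A large capture here may trigger a heap allocation inside std::function.
int sum_stored(const std::vector<int>& v, const std::function<int(int)>& f) {
  int total = 0;
  for (int x : v)
    total += f(x);
  return total;
}
```

Both produce identical results; the difference is purely in storability versus inlining and allocation behavior, which is the gap ceph::function is described as filling.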
Re: Backporting from Infernalis and c++11
On Tue, Sep 15, 2015 at 8:21 AM, Loic Dachary wrote:
> With Infernalis, Ceph moves to c++11 (and CMake), so we will see more conflicts
> when backporting bug fixes to Hammer. Any ideas you may have to better deal
> with this would be most welcome. Since these conflicts will be mostly
> cosmetic, they should not be too difficult to resolve. The trick will be for
> someone not familiar with the codebase to separate what is cosmetic and what
> is not.
>
> This does not happen yet, so no immediate concern :-) Maybe if we think about
> it well in advance we'll be in a better position to deal with it later on?

I think this came up in conversation but wasn't necessarily made official policy yet -- my understanding is that we are (already) endeavouring to avoid c++11isms in bug fixes, along with the usual principle of fixing bugs in the smallest/neatest patch we can.

Perhaps in cases where those of us working on master mistakenly put something un-backportable in a bug fix, it would be reasonable for the backporter to point it out and poke us for a clean version of the patch.

John
Re: Brewer's theorem also known as CAP theorem
On Tue, Sep 15, 2015 at 1:38 PM, Owen Synge wrote:
> On Mon, 14 Sep 2015 13:57:26 -0700
> Gregory Farnum wrote:
>
>> The OSD is supposed to stay down if any of the networks are missing.
>> Ceph is a CP system in CAP parlance; there's no such thing as a CA
>> system. ;)
>
> I know I am being fussy, but within my team your email was cited as saying
> that you cannot consider ceph a CA system. Hence I make my argument in
> public so I can be humbled in public.
>
> Just to clarify your opinion, I cite
>
> http://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed
>
> which suggests:
>
>   The CAP theorem states that any networked shared-data system can have
>   at most two of three desirable properties:
>
>   * consistency (C), equivalent to having a single up-to-date copy of
>     the data;
>   * high availability (A) of that data (for updates);
>   * tolerance to network partitions (P).
>
> So I dispute that a CA system cannot exist.
>
> I think you are too absolute even in interpretation of this vague
> theory. A further quote from the author of said theorem, from the same
> article:
>
>   The "2 of 3" formulation was always misleading because it tended to
>   oversimplify the tensions among properties.
>
> As I understand it:
>
> Ceph as a cluster always provides "Consistency" (or else you found a
> bug).
>
> If a ceph cluster is operating, it will always provide acknowledgment
> (it may block) to the client of whether the operation has succeeded
> or failed, hence provides "Availability".
>
> If a ceph cluster is partitioned, only one partition will continue
> operation; hence you cannot consider the system "Partition" tolerant,
> as multiple parts of the system cannot operate when partitioned.

The technical meaning of partition tolerance is that the system continues to provide a service to clients in the face of a partition, not that it splits into multiple separately operating units.
Given the general definition of partition (any network failure, any host failure), any real physical network has the 'P' part, so only CP and AP systems are physically meaningful. In order to imagine a CA system you have to imagine a perfect network.

There has been plenty of confusion in the past as CAP terminology went mainstream, so there are also plenty of blogs providing longer explanations:
http://codahale.com/you-cant-sacrifice-partition-tolerance/
http://www.quora.com/Whats-the-difference-between-CA-and-CP-systems-in-the-context-of-CAP-Consistency-Availability-and-Partition-Tolerance

> Hence as a cluster ceph is CA.
>
> Alternatively, if you look at it from an OSD rather than cluster
> perspective, you can get the perspective you take: OSDs are a CP system
> in CAP parlance.
>
> I would argue it is all a matter of perspective

Not so much a matter of perspective, more that the words involved are used in quite specific technical ways. If you try to understand it in plain English it seems ambiguous, but the way the terms are used within this field is quite clear cut.

John

> , and believe that to
> call Brewer's theorem anything other than guidance without strong
> discussion of your understanding of consistency and its interaction with
> partitioning is to confuse and over simplify.
>
> Best regards
>
> Owen
Should pgls filters be part of object classes?
So, I've got this very-cephfs-specific piece of pgls filtering code in ReplicatedPG:
https://github.com/ceph/ceph/commit/907a3c5a2ba8e3edda18d7edf89ccae7b9d91dc5

I'm not sure I'm sufficiently motivated to create some whole new plugin framework for these, but what about piggy-backing on object classes? I guess it would be an additional cls_register_filter(myfilter, my_callback_constructor) fn.

tl;dr: How do people feel about extending object classes to include providing PGLS filters as well?

John
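To show roughly what piggy-backing on object classes could mean, here is a hedged sketch of a name-to-constructor filter registry. The `cls_register_filter` signature, the `PGLSFilter` shape, and the filter names are all hypothetical, mirroring the cls method-registration style rather than any real API.

```cpp
// Hypothetical registry: object classes register pgls filters by name;
// the OSD would look one up when a client names it in a pgls call.
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>

struct PGLSFilter {
  virtual ~PGLSFilter() = default;
  // Return true if the object should be included in the listing.
  virtual bool filter(const std::string& obj_name) const = 0;
};

using FilterCtor = std::function<std::unique_ptr<PGLSFilter>()>;

std::map<std::string, FilterCtor>& filter_registry() {
  static std::map<std::string, FilterCtor> reg;  // name -> constructor
  return reg;
}

// Analogue of the proposed cls_register_filter(myfilter, ctor).
void cls_register_filter(const std::string& name, FilterCtor ctor) {
  filter_registry()[name] = std::move(ctor);
}

// Instantiate the filter a client asked for, or nullptr if unknown.
std::unique_ptr<PGLSFilter> make_filter(const std::string& name) {
  auto it = filter_registry().find(name);
  return it == filter_registry().end() ? nullptr : it->second();
}

// Example filter: include only objects whose name starts with a prefix.
struct PrefixFilter : PGLSFilter {
  std::string prefix;
  explicit PrefixFilter(std::string p) : prefix(std::move(p)) {}
  bool filter(const std::string& n) const override {
    return n.rfind(prefix, 0) == 0;
  }
};

// Example registration, as an object class's init hook might do it.
inline bool register_example_filter() {
  cls_register_filter("example.prefix",
                      [] { return std::make_unique<PrefixFilter>("parent_"); });
  return true;
}
inline const bool example_registered = register_example_filter();
```

The appeal of this route is that it reuses the existing cls loading and naming machinery instead of inventing a parallel plugin framework just for listing filters.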
Re: Transitioning Ceph from Autotools to CMake
OK, here are vstart + ceph.in changes that work well enough in my out-of-tree build:

https://github.com/ceph/ceph/pull/5457

John

On Mon, Aug 3, 2015 at 11:09 AM, John Spray <jsp...@redhat.com> wrote:
> On Sat, Aug 1, 2015 at 8:24 PM, Orit Wasserman <owass...@redhat.com> wrote:
>> 3. no vstart.sh; started working on this too but have less progress here.
>> At the moment, in order to use vstart I copy the exes and libs to the src dir.
>
> I just started playing with CMake on Friday, adding some missing cephfs
> bits. I was going to fix (3) as well, but I don't want to duplicate work --
> do you have an existing branch at all? Presumably this will mostly be a
> case of adding appropriate prefixes to commands.
>
> Cheers,
> John
Re: Transitioning Ceph from Autotools to CMake
On Sat, Aug 1, 2015 at 8:24 PM, Orit Wasserman <owass...@redhat.com> wrote:
> 3. no vstart.sh; started working on this too but have less progress here.
> At the moment, in order to use vstart I copy the exes and libs to the src dir.

I just started playing with CMake on Friday, adding some missing cephfs bits. I was going to fix (3) as well, but I don't want to duplicate work -- do you have an existing branch at all? Presumably this will mostly be a case of adding appropriate prefixes to commands.

Cheers,
John
Re: Installing Ceph without root privilege
On 31/07/15 07:10, Piyawath Boukom wrote:
> Dear Ceph-dev team,
>
> My name is Piyawath. I'm trying to set up Ceph on a university machine.
> Unfortunately, I am only able to install/set it up in user mode (not
> root), including creating the Ceph user and other processes. Is it
> possible to do the setup without root privilege? I'm quite new in this
> field, please forgive me if I have ignorantly asked you.

Hi,

I guess if you're installing as non-root then you probably just want something to experiment with a bit? The easiest thing may be to compile from source:
http://ceph.com/docs/next/install/build-ceph/

...and then create a temporary vstart cluster:
http://ceph.com/docs/next/dev/quick_guide/

This is not at all suitable for putting any real data in, but will let you play around a bit with the ceph command line etc. If you need a more realistic system, I would suggest asking your administrator for a way to create virtual machines, so that you can have root in the virtual machines and install packages there.

John
Re: vstart runner for cephfs tests
On 23/07/15 12:56, Mark Nelson wrote:
> I had similar thoughts on the benchmarking side, which is why I started
> writing cbt a couple of years ago. I needed the ability to quickly spin up
> clusters and run benchmarks on arbitrary sets of hardware. The outcome
> isn't perfect, but it's been extremely useful for running benchmarks, and
> it sort of exists as a half-way point between vstart and teuthology.
>
> The basic idea is that you give it a yaml file that looks a little bit like
> a teuthology yaml file and cbt will (optionally) build a cluster across a
> number of user-defined nodes with pdsh, start various monitoring tools
> (this is ugly right now, I'm working on making it modular), and then sweep
> through user-defined benchmarks and sets of parameter spaces. I have a
> separate tool that will sweep through ceph parameters, create ceph.conf
> files for each space, and run cbt with each one, but the eventual goal is
> to integrate that into cbt itself.
>
> Though I never really intended it to run functional tests, I just added
> something that looks very similar to the rados suite so I can benchmark
> ceph_test_rados for the new community lab hardware. I already had a
> mechanism to inject OSD down/out up/in events, so with a bit of squinting
> it can give you a very rough approximation of a workload using the osd
> thrasher. If you are interested, I'd be game to see if we could integrate
> your cephfs tests as well (I eventually wanted to add cephfs benchmark
> capabilities anyway).

Cool -- my focus is very much on tightening the code-build-test loop for developers, but I can see us needing to extend that into a code-build-test-bench loop as we do performance work on cephfs in the future.

Does cbt rely on having ceph packages built, or does it blast the binaries directly from src/ onto the test nodes?

John
Re: vstart runner for cephfs tests
On 23/07/15 12:23, Loic Dachary wrote:
> You may be interested by https://github.com/ceph/ceph/blob/master/src/test/ceph-disk-root.sh which is conditionally included https://github.com/ceph/ceph/blob/master/src/test/Makefile.am#L86 by --enable-root-make-check https://github.com/ceph/ceph/blob/master/configure.ac#L414
>
> If you're reckless and trust the tests not to break (a crazy proposition by definition IMHO ;-), you can
>
>   make TESTS=test/ceph-disk-root.sh check
>
> If you want protection, you do the same in a docker container with
>
>   test/docker-test.sh --os-type centos --os-version 7 --dev make TESTS=test/ceph-disk-root.sh check
>
> I tried various strategies to make tests requiring root access more accessible and less scary and that's the best compromise I found. test/docker-test.sh is what the make check bot uses.

Interesting, I didn't realise we already had root-ish tests in there. At some stage the need for root may go away in ceph-fuse, as in principle fuse mount/unmounts shouldn't require root. If not, then putting an outer docker wrapper around this could make sense, if we publish the built binaries into the docker container via a volume or somesuch. I am behind on familiarizing myself with the dockerised tests.

> When a test can be used both from sources and from teuthology, I found it more convenient to have it in the qa/workunits directory which is available in both environments. Who knows, maybe you will want a vstart based cephfs test to run as part of make check, in the same way https://github.com/ceph/ceph/blob/master/src/test/cephtool-test-mds.sh does.

Yes, this crossed my mind. At the moment, even many of the quick tests/cephfs tests take tens of seconds, so they are probably a bit too big to go in a default make check, but for some of the really simple things that are currently done in cephtool/test.sh, I would be tempted to move them into the python world to make them a bit less fiddly.
The test location is a bit challenging, because we essentially have two not-completely-stable interfaces here, vstart and teuthology. Because teuthology is the more complicated, for the moment it makes sense for the tests to live in that git repo. Long term it would be nice if fine-grained functional tests lived in the same git repo as the code they're testing, but I don't really have a plan for that right now outside of the probably-too-radical step of merging ceph-qa-suite into the ceph repo.

John
vstart runner for cephfs tests
Audience: anyone working on cephfs, general testing interest.

The tests in ceph-qa-suite/tasks/cephfs are growing in number, but are kind of inconvenient to run because they require teuthology (and therefore require built packages, locked nodes, etc). Most of them don't actually require anything beyond what you already have in a vstart cluster, so I've adapted them to optionally run that way. The idea is that we can iterate a lot faster when writing new tests (one less excuse not to write them) and get better use out of the tests when debugging things and testing fixes. teuthology is fine for mass-running the nightlies etc, but it's overkill for testing individual bits of MDS/client functionality.

The code is currently on the wip-vstart-runner ceph-qa-suite branch, and the two magic commands are:

1. Start a vstart cluster with a couple of MDSs, as your normal user:

$ make -j4 rados ceph-fuse ceph-mds ceph-mon ceph-osd cephfs-data-scan cephfs-journal-tool cephfs-table-tool && ./stop.sh ; rm -rf out dev ; MDS=2 OSD=3 MON=1 ./vstart.sh -d -n

2. Invoke the test runner, as root (replace paths and test name as appropriate; leave off the test name to run everything):

# PYTHONPATH=/home/jspray/git/teuthology/:/home/jspray/git/ceph-qa-suite/ python /home/jspray/git/ceph-qa-suite/tasks/cephfs/vstart_runner.py tasks.cephfs.test_strays.TestStrays.test_migration_on_shutdown

test_migration_on_shutdown (tasks.cephfs.test_strays.TestStrays) ... ok
----------------------------------------------------------------------
Ran 1 test in 121.982s

OK

^^^ see! two minutes, and no waiting for gitbuilders!

The main caveat here is that it needs to run as root in order to mount/unmount things, which is a little scary. My plan is to split it out into a little root service for doing mount operations, and then let the main test part run as a normal user and call out to the mounter service when needed.
Cheers,
John
CephFS fsck tickets
Lots of new tickets in tracker related to forward scrub; this is a handy list (mainly for Greg and myself) that maps to the notes from our design discussion. They're all under the 'fsck' category in tracker.

John

Tagging prerequisite:
- forward scrub (traverse tree, at least)
#12255 - create scrub header or scrub map in MDLog, like subtreemap, make scrub startup wait until this is written before going ahead
#12257 - add recovery of scrub header during MDS replay, and re-start any scrubs that were ongoing
#12258 - add tagged-scrub command taking path tag
#12273 - actually apply the tag during the RADOS op for reading something
#12274 - start the process from all subtree roots on all MDSs, and skip non-auth regions
#12275 - block migration during tagging scrub OR
- when migrating, look up to parent subtree and send along info about the scrub if we haven't been scrubbed in this tag yet
- handle MDS rank shutdown vs. ongoing scrub

Backtrace handling
#12277 - during backward scrub, tag our parent with the most recent backtrace seen, even if we already created it with a less recent backtrace (set_if_greater on a backtrace xattr) as a hint to a subsequent forward scrub step (ready for this now)
#12278 - during forward scrub, look at these hints, and move folders around if we have more recent linkage information from the backtrace (maybe wait to do this later, it's a corner)

Backward scrub online
#12279 - Add hooks to MDS to enable backward scrub to lookup_ino and work out if an orphaned ino is actually orphaned or just more recently created than the last scrub.
- Add hooks to MDS to enable injection to be done online (i.e. inject linkage RPC)
#12280 - Use hooks from backward scrub in a new mode that is readonly on the RADOS pools, and mediates all writes through a running MDS.
Handling purged strays:
#12281
* StrayManager needs to respect PIN_SCRUBQUEUE (remove it from the scrub stack before purging it) and the Locker locks that scrubstack uses (don't purge it until the ongoing scrub RADOS ops are done) (added a note to #11950)
* if/when we add purgequeue (#11950), backward scrub will also need to interrogate that to determine non-orphan-ness of an inode, or enforce that it gets flushed before doing anything.

#11859 - DamageTable

Status & scheduling
#12282 - List/progress/abort/pause commands for ongoing scrub
#12283 - Time of day scheduling for scrub
#12284 - Deprioritise/pause scrub on highly loaded systems (loaded MDS, loaded RADOS)
Re: python facade pattern implementation in ceph and ceph-deploy is bad practice?
Owen,

Please can you say what your overall goal is with the recent ceph-deploy patches? Whether a given structure is right or wrong is really a function of the overall direction. If you are aiming to make ceph-deploy into an extensible framework that can do things in parallel, then you need to say so, and that's a bigger conversation about whether that's a reasonable goal for a tool which has so far made a virtue of maintaining modest ambitions.

Along these lines, I noticed that you recently submitted a pull request to ceph-deploy that added a dependency on SQLAlchemy, and several hundred lines of extra boilerplate code -- this kind of thing is unlikely to get a warm reception without a stronger justification for the extra weight. I don't know how related that is to the points you're making in this post, but it certainly inspires some curiosity about where you're going with this.

I had not seen your wip_services_abstraction branch before; I've just taken a quick look now. More comments would probably have made it easier to read, as would following PEP8. I don't think there's anything problematic about having a class that knows how to start and stop a service, but I don't know what comments you've received elsewhere (there aren't any on the PR).

John

On 09/07/15 11:08, Owen Synge wrote:

Dear all,

The facade pattern (or façade pattern) is a software design pattern commonly used with object-oriented programming. The name is by analogy to an architectural facade. (wikipedia)

I am frustrated with the desire to standardise on the bad-practice implementation of the facade pattern in python that is used in ceph-deploy and even in ceph. The current standard of selectively calling modules with functions has a series of complexities to it.
Ceph example: https://github.com/ceph/ceph/tree/master/src/ceph-detect-init/ceph_detect_init

ceph-deploy example: https://github.com/ceph/ceph-deploy/tree/master/ceph_deploy/hosts

So I guess _some_ of you don't immediately see why this is VERY bad practice, and wonder why this makes me feel like the orange in this story: http://www.dailymail.co.uk/news/article-2540670/The-perfect-half-time-oranges-five-football-matches-Farmers-create-pentagon-shaped-fruit.html when I try to code with ceph-deploy in particular.

The ceph/ceph-deploy way of implementing a facade in python causes this list of problems:

1) facade cannot be instantiated twice.
2) facade requires code layout inflexibility.
3) facade implementation can have side effects when the implementation is changed.

And probably others I have not thought of. In consequence, from the points above:

From (1)
(1A) no concurrency, so you can't configure more than one ceph-node at a time.
(1B) You have to close one facade to start another, e.g. in ceph-deploy you have to close each connection before connecting to the next server, making it slow to use as all state has to be gathered.

From (2)
(2A) No nesting of facades without code duplication, e.g. reuse of systemd support between distributions.
(2B) inflexible / complex include paths for shared code between facade implementations.

From (3)
(3A) Since all variables in a facade implementation are global, but isolated to an implementation, we cannot have variables in the implementation; any cross-function variables that are not passed as parameters or return values will lead to side effects that are difficult to test.

So this set of general points has complex side effects, related to where the facade is implemented, that make you feel like the pictured orange when developing. About this point you will say: well, it's open source, so fix it?
My answer to this is that when I try to do this, as in this patch: https://github.com/osynge/ceph-deploy/commit/b82f89f35b27814ed4aba1082efd134c24ecf21f the responses, more than once, seem to suggest I should use the much more complex multi-file implementation of a façade.

The only advantages of implementing façades in the standard ceph / ceph-deploy way that I can see are:

(I) it is how ceph-deploy has always done it.
(II) It allows you to continue using nearly no custom objects in python.
(III) We like our oranges in funny shapes.

It seems to me the current implementation could have been created due to the misunderstanding that modules are like objects, when in fact they are like classes. Issues (1) and (3) can be solved simply by importing methods as objects rather than classes, but this does nothing to solve the bigger issue (2), which is more serious; it's a simple step forward that might be very simple to patch, but there is little point in a POC unless people agree that issues (1), (2) and (3) are serious.

Please can we not spread this boat anchor* implementation of a facade further around the code base, and ideally migrate away from this bad practice, and help us all feel like happy round oranges.
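To make the module-vs-class distinction concrete, here is a small illustrative sketch (not actual ceph-deploy code; all names are invented). A module used as a facade is effectively a singleton, which is issues (1) and (3) above, while a class-based facade can be instantiated once per host:

```python
# Module-style facade: state is module-global, so only one "connection"
# can exist at a time (issue 1), and cross-function state is a de facto
# global (issue 3).
_current_host = None

def connect(host):
    global _current_host
    _current_host = host

def start_service(name):
    return "%s: started %s" % (_current_host, name)

# Class-style facade: each instance carries its own state, so two hosts
# can be driven concurrently.
class HostFacade:
    def __init__(self, host):
        self.host = host

    def start_service(self, name):
        return "%s: started %s" % (self.host, name)

a = HostFacade("node1")
b = HostFacade("node2")
results = [a.start_service("osd"), b.start_service("mon")]
```

With the module version, driving "node2" requires first tearing down the "node1" state; with the class version both facades coexist.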
Re: Preferred location for utility execution
On 26/06/2015 21:14, Handzik, Joe wrote:

Hey ceph-devel, Gregory Meno (of Calamari fame) and I are working on what is now officially a blueprint for Jewel (http://tracker.ceph.com/projects/ceph/wiki/Calamariapihardwarestorage), and we'd like some feedback. Some of this has been addressed via separate conversations about the feature that some of this work started out as (identifying drives in a cluster by toggling their LED states), but we wanted to ask a more direct question: What is the preferred location/mechanism to execute operations on storage hardware? We see two clear options:

1. Make Calamari responsible for executing commands using various linux utilities (and /sys, when applicable).
2. Build a command set into RADOS to execute commands using various linux utilities. These commands could then be executed by Calamari using the rest api.

The big win for #1 is the ability to rapidly iterate on the capabilities of the Calamari toolset (it is almost certainly going to be faster to create a set of scripts similar to Gregory's initial commit for SMART polling than to add that functionality inside RADOS. See: https://github.com/ceph/calamari/pull/267). For #2, we'd pick up the ability to run those same commands via the cli, which would give users a lot more flexibility in how they troubleshoot their cluster (Calamari wouldn't be required, it would just make life easier).

Hi Joe,

I'd reiterate my earlier comments[1] in favour of option 2. I would be cautious about implementing any of this in Calamari until there are at least upstream packages available for folks to use, and broader uptake. In the current situation, it's hard to ask people to try something out in Calamari, and much more straightforward to distribute something as part of Ceph. Hardware is pretty varied; I would expect you'll need help from others in the community to ensure any hardware handling works as expected in diverse environments, which will be much simpler with ceph than calamari.
The part where some central python (calamari or otherwise) would really come into its own is in the fusion of information from multiple hosts, and exposing it to a user interface. On that aspect, I left some comments last time this came up: http://lists.ceph.com/pipermail/ceph-calamari-ceph.com/2015-May/73.html

Ceph itself is getting a bit smarter with some of this stuff, e.g. the new "node ls" stuff gives you metadata about hosts and services without the need for calamari. Hanging device info off these new structures would be a pretty reasonable thing to do, and if someone later has a GUI that they want to pipe that into, they can grab it via the mon along with everything else.

Cheers,
John

1. https://www.mail-archive.com/ceph-devel@vger.kernel.org/msg23186.html
Re: cephfs obsolescence and object location
Since you're only looking up the ID of the first object, it's really simple. It's just the hex printed inode number followed by ".". That's not guaranteed to always be the case in the future, but it's likely to be true longer than the deprecated ioctls exist. If I was you, I would hard-code the object naming convention rather than writing in a dependency on the ioctl. As Greg says, you can also query all the layout stuff (via supported interfaces) and do the full calculation of object names for arbitrary offsets into the file if you need to.

John

On 22/06/2015 22:18, Bill Sharer wrote:

I'm currently running giant on gentoo and was wondering about the stability of the api for mapping MDS files to rados objects. The cephfs binary complains that it is obsolete for getting layout information, but it also provides object location info. AFAICT this is the only way to map files in a cephfs filesystem to object locations if I want to take advantage of the UFO nature of ceph's stores in order to access via both cephfs and rados methods.

I have a content store that scans files, calculates their sha1 hash and then stores them on a cephfs filesystem tree with their filenames set to their sha1 hash name. I can then build views of this content using an external local filesystem and symlinks pointing into the cephfs store. At the same time, I want to be able to use this store via rados either through the gateway or my own software that is rados aware. The store is being treated as a write-once, read-many style system. Towards this end, I started writing a QT4 based library that includes this little Location routine (which currently works) to grab the rados object location from a hash object in this store.
I'm just wondering whether this is all going to break horribly in the future when ongoing MDS development decides to break the code I borrowed from cephfs :-)

QString Shastore::Location(const QString hash)
{
    QString result = "";
    QString cache_path = this->dbcache + "/" + hash.left(2) + "/" + hash.mid(2,2) + "/" + hash;
    QFile cache_file(cache_path);
    if (cache_file.exists()) {
        if (cache_file.open(QIODevice::ReadOnly)) {
            /*
             * Ripped from cephfs code, grab the handle and use the ceph version of ioctl to
             * rummage through the file's xattrs for rados location.  cephfs whines about being
             * obsolete to get layout this way, but this appears to be the only way to get location.
             * This may all break horribly in a future release since MDS is undergoing heavy development.
             *
             * cephfs lets the user pass file_offset in argv but it defaults to 0.  Presumably this is
             * the first extent of the pile of extents (4mb each?) and shards for the file.  If the user
             * wants to jump elsewhere with a non-zero offset, the resulting rados object location may
             * be different.
             */
            int fd = cache_file.handle();
            struct ceph_ioctl_dataloc location;
            location.file_offset = 0;
            int err = ioctl(fd, CEPH_IOC_GET_DATALOC, (unsigned long)&location);
            if (err) {
                qDebug() << "Location: Error getting rados location for" << cache_path;
            } else {
                result = QString(location.object_name);
            }
            cache_file.close();
        } else {
            qDebug() << "Location: unable to open" << cache_path << "readonly";
        }
    } else {
        qDebug() << "Location: cache file" << cache_path << "does not exist";
    }
    return result;
}
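For illustration, the naming convention John suggests hard-coding can be sketched in a few lines of Python. The assumption here is that CephFS data objects are named "<inode hex>.<extent index hex>" with a fixed-width 8-digit hex extent index; that matches current behaviour but, as noted above, is a convention rather than a promised API:

```python
# Sketch of CephFS data object naming, assuming objects are named
# "<inode hex>.<extent index hex>" (8 hex digits), and a simple
# non-striped layout where each object holds `object_size` bytes.

def cephfs_object_name(ino, file_offset=0, object_size=4 * 1024 * 1024):
    """Return the RADOS object name holding `file_offset` of inode `ino`."""
    extent = file_offset // object_size
    return "%x.%08x" % (ino, extent)

first = cephfs_object_name(0x10000003456)                     # first object
later = cephfs_object_name(0x10000003456, 9 * 1024 * 1024)    # third extent
```

For non-default (striped) layouts the offset-to-object calculation is more involved, which is why querying the layout via supported interfaces is the safer route for arbitrary offsets.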
Re: rbd top
On 17/06/2015 18:06, Robert LeBlanc wrote:
> Well, I think this has gone well past my ability to implement. Should this be turned into a BP and see if someone is able to work on it?

Sorry, didn't mean to hijack your thread :-)

It might still be useful to discuss the simpler case of tracking top clients/top objects (i.e. just native RADOS concepts) with an LRU table of stats (like Sage described) as a simpler alternative to my custom-querying proposal. I'm going to write a blueprint for the custom query thing anyway though, as I'm kind of hot on the idea, though I don't know who/when will have time to take it on as it's a bit heavyweight.

John
Re: rbd top
On 15/06/2015 14:52, Sage Weil wrote:
> I seem to remember having a short conversation about something like this a few CDS's back... although I think it was 'rados top'. IIRC the basic idea we had was for each OSD to track its top clients (using some approximate LRU type algorithm) and then either feed this relatively small amount of info (say, top 10-100 clients) back to the mon for summation, or dump via the admin socket for calamari to aggregate. This doesn't give you the rbd image name, but I bet we could infer that without too much trouble (e.g., include a recent object or two with the client). Or, just assume that client id is enough (it'll include an IP and PID... enough info to find the /var/run/ceph admin socket or the VM process).

If we were going to do top clients, I think it'd make sense to also have a top objects list as well, so you can see what the hottest objects in the cluster are.

The following is a bit of a tangent... A few weeks ago I was thinking about general solutions to this problem (for the filesystem). I played (very briefly, on wip-live-query) with the idea of publishing a list of queries to the MDSs/OSDs, that would allow runtime configuration of what kind of thing we're interested in and how we want it broken down. If we think of it as an SQL-like syntax, then for the RBD case we would have something like:

SELECT read_bytes, write_bytes WHERE pool=rbd GROUP BY rbd_image

(You'd need a protocol-specific module of some kind to define what rbd_image meant here, which would do a simple mapping from object attributes to an identifier (similar would exist for e.g. cephfs inode).)

Each time an OSD does an operation, it consults the list of active performance queries and updates counters according to the value of the GROUP BY parameter for the query (so in the above example each OSD would be keeping a result row for each rbd image touched).
The LRU part could be implemented as LIMIT BY + SORT parameters, such that the result rows would be periodically sorted and the least-touched results would drop off the list. That would probably be used in conjunction with a decay operator on the sorted-by field, like:

SELECT read_bytes, write_bytes, ops WHERE pool=rbd GROUP BY rbd_image SORT BY movingAverage(derivative(ops)) LIMIT 100

Combining WHERE clauses would let the user drill down (apologies for buzzword) by doing things like identifying the most busy clients, and then for each of those clients identifying which images/files/objects the client is most active on, or vice versa identifying busy objects and then seeing which clients are hitting them. Usually keeping around enough stats to enable this is prohibitive at scale, but it's fine when you're actively creating custom queries for the results you're really interested in, instead of keeping N_clients*N_objects stats, and when you have the LIMIT part to ensure results never get oversized.

The GROUP BY options would also include metadata sent from clients, e.g. the obvious cases like VM instance names, or rack IDs, or HPC job IDs. Maybe also some less obvious ones, like decorating cephfs IOs with the inode of the directory containing the file, so that OSDs could accumulate per-directory bandwidth numbers, and the user could ask "which directory is bandwidth-hottest?" as well as "which file is bandwidth-hottest?".

Then, after implementing all that craziness, you get some kind of wild multicolored GUI that shows you where the action is in your system at a cephfs/rgw/rbd level.

Cheers,
John
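A minimal sketch of the per-OSD side of such a query might look as follows. All names here (`PerfQuery`, the decay factor, the key function) are invented for illustration, not the wip-live-query code: each op is matched against the WHERE clause, accumulated under its GROUP BY key, and the table is periodically trimmed to the LIMIT by a decayed ops count:

```python
import heapq

class PerfQuery:
    """Toy version of: SELECT read_bytes, write_bytes WHERE pool=X
       GROUP BY <key_fn> SORT BY decayed ops LIMIT <limit>."""

    def __init__(self, pool, key_fn, limit=100, decay=0.9):
        self.pool = pool          # WHERE pool=...
        self.key_fn = key_fn      # GROUP BY: maps an op to e.g. an rbd image
        self.limit = limit        # LIMIT: max result rows kept
        self.decay = decay        # decay factor applied to the sort field
        self.rows = {}            # key -> [read_bytes, write_bytes, decayed_ops]

    def account(self, op):
        if op["pool"] != self.pool:
            return
        row = self.rows.setdefault(self.key_fn(op), [0, 0, 0.0])
        row[0] += op.get("read_bytes", 0)
        row[1] += op.get("write_bytes", 0)
        row[2] += 1

    def trim(self):
        # Periodic pass: decay the ops counter, then drop least-touched rows.
        for row in self.rows.values():
            row[2] *= self.decay
        if len(self.rows) > self.limit:
            keep = heapq.nlargest(self.limit, self.rows,
                                  key=lambda k: self.rows[k][2])
            self.rows = {k: self.rows[k] for k in keep}

# Example: group by the middle token of an rbd_data-style object name.
q = PerfQuery("rbd", key_fn=lambda op: op["object"].split(".")[1], limit=2)
for name, n in [("img_a", 5), ("img_b", 3), ("img_c", 1)]:
    for _ in range(n):
        q.account({"pool": "rbd",
                   "object": "rbd_data.%s.0" % name,
                   "write_bytes": 4096})
q.trim()   # img_c, the least-touched row, falls off the list
```

The memory bound is what makes this viable at scale: the OSD only ever keeps `limit` rows per active query, rather than stats for every client/object pair.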
Re: rbd top
On 15/06/2015 17:10, Robert LeBlanc wrote:
> John, let me see if I understand what you are saying... When a person runs `rbd top`, each OSD would receive a message saying "please capture all the performance, grouped by RBD and limit it to 'X'". That way the OSD doesn't have to constantly update performance for each object, but when it is requested it starts tracking it?

Right, initially the OSD isn't collecting anything; it starts as soon as it sees a query get loaded up (published via OSDMap or some other mechanism). That said, in practice I can see people having some set of queries that they always have loaded and feeding into graphite in the background.

> If so, that is an interesting idea. I wonder if that would be simpler than tracking the performance of each/MRU objects in some format like /proc/diskstats where it is in memory and not necessarily consistent. The benefit is that you could have lifelong stats that show up like iostat and it would be a simple operation.

Hmm, not sure we're on the same page about this part. What I'm talking about is all in memory and would be lost across daemon restarts. Some other component would be responsible for gathering the stats across all the daemons in one place (that central part could persist stats if desired).

> Each object should be able to reference back to RBD/CephFS upon request and the client could even be responsible for that load. Client performance data would need stats in addition to the object stats.

You could extend the mechanism to clients. However, as much as possible it's a good thing to keep it server side, as servers are generally fewer (still have to reduce these stats across N servers to present to the user), and we have multiple client implementations (kernel/userspace). What kind of thing do you want to get from clients?

> My concern is that adding additional SQL like logic to each op is going to get very expensive.
> I guess if we could push that to another thread early in the op, then it might not be too bad. I'm enjoying the discussion and new ideas.

Hopefully in most cases the query can be applied very cheaply, for operations like comparing pool ID or grouping by client ID. However, I would also envisage an optional sampling number, such that e.g. only 1 in every 100 ops would go through the query processing. Useful for systems where keeping the highest throughput is paramount, and the numbers will still be useful if clients are doing many thousands of ops per second.

Cheers,
John
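The sampling idea reduces per-op cost to a single cheap random check on the fast path. A sketch (the class and parameter names are invented for illustration); note the sampled op is weighted by the sampling factor so the accumulated counters remain unbiased estimates of the true totals:

```python
import random

class SampledQueryGate:
    """Run expensive per-op query processing on only ~1 in `n` ops,
    scaling each sampled op by `n` so accumulated totals stay unbiased."""

    def __init__(self, n, process):
        self.n = n                # sampling factor, e.g. 100
        self.process = process    # the expensive per-op query evaluation

    def on_op(self, op):
        # Fast path: one random draw; most ops skip the query machinery.
        if random.randrange(self.n) == 0:
            self.process(op, weight=self.n)

# With n=1 every op is processed, which makes the behaviour deterministic
# for this demonstration.
counter = []
gate = SampledQueryGate(1, lambda op, weight: counter.append(op["bytes"] * weight))
for _ in range(3):
    gate.on_op({"bytes": 10})
```

With n=100 the counters become statistical estimates, which is fine at thousands of ops per second but would be noisy for rarely-touched objects; that trade-off is why it should be optional per query.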
Re: Thoughts about metadata exposure in Calamari
On 05/06/2015 20:33, Handzik, Joe wrote:
> I err in the direction of calling 'osd metadata' too, but it does mean that Calamari will need to add that call in (I'll leave it to Gregory to say if that is particularly undesirable). Do you think it would be worthwhile to better define the metadata bundle into a structure, or is it ok to leave it as a set of string pairs?

Versioning of the metadata is something to consider. The osd metadata stuff is outside the osdmap epochs, so anything that is consuming updates to it is stuck with doing some kind of full polling as it stands. It might be that some better interface with versions+deltas is needed for a management layer to efficiently consume it. A version concept, where the version is incremented when an OSD starts or updates its metadata, could make synchronization with a management layer much more efficient. Efficiency matters here when we're calling on the mons to serialize data for potentially 1000s of OSDs into JSON whenever the management layer wants an update.

John
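The versions+deltas idea can be sketched as follows (all names invented; this is not a Ceph API): the store bumps a global version whenever any OSD registers or updates its metadata, and the management layer polls with its last-seen version, receiving only the entries that changed since then:

```python
# Toy model of versioned metadata with delta polling.

class MetadataStore:
    def __init__(self):
        self.version = 0
        self.items = {}    # osd_id -> metadata dict
        self.changed = {}  # osd_id -> version at which it last changed

    def update(self, osd_id, metadata):
        # Called when an OSD starts or re-registers its metadata.
        self.version += 1
        self.items[osd_id] = metadata
        self.changed[osd_id] = self.version

    def delta_since(self, since):
        """Return (current_version, {osd_id: metadata changed after `since`})."""
        return self.version, {
            osd_id: self.items[osd_id]
            for osd_id, v in self.changed.items() if v > since
        }

store = MetadataStore()
store.update(0, {"hostname": "node1"})
store.update(1, {"hostname": "node2"})
v1, full = store.delta_since(0)        # first poll: everything
store.update(1, {"hostname": "node2b"})
v2, delta = store.delta_since(v1)      # second poll: only osd.1
```

The saving is on the mon side: instead of serializing metadata for every OSD on each poll, it serializes only the handful that changed, and an unchanged cluster costs a single version comparison.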
Re: sharded collection list
Talked about this elsewhere, but for the benefit of the list:

* The API suggested here looks nicer to me too
* This depends on the new PGLS ordering OSD side, so that has to land before this
* In the meantime I've rebased the #9964 (rados import/export) branch to not depend on sharded pgls

Cheers,
John

On 02/06/2015 23:54, Sage Weil wrote:

Hey John-

So the sharded pgls stuff has collided a bit with the looming hobject sorting changes. Sam and I just talked about it a bit and came up with what librados API would be most appealing:

- the listing API would have start/end markers
- it would be driven by a new opaque type rados_list_cursor_t, which is just data, no state, and internally is just an hobject_t.
- it would be totally stateless.. kill the [N]ListContext stuff in Objecter (and reimplement a simple wrapper in librados.cc or even .h). Note that the important bits of state there now are: epoch (needed for detecting split; this will go away with a better cursor), result buffer (we can drop this), nspace (part of the ioctx, it just tags each request), cookie (this basically becomes the cursor.. it's just an hobject_t typedef)
- the list could take a start cursor, optional end cursor, and output the next cursor to continue from.
- we'd lose the buffering that ListContext currently does, which means that the request that goes over the wire will return the same number of entries that the C caller asks for. The C++ interface is an iterator so it'll have to do its own buffering, but that should be pretty trivial...
- we should kill these calls, which were never used:

  CEPH_RADOS_API uint32_t rados_nobjects_list_get_pg_hash_position(rados_list_ctx_t ctx);
  CEPH_RADOS_API uint32_t rados_nobjects_list_seek(rados_list_ctx_t ctx, uint32_t pos);

- we'd add a new call that is something like

  int rados_construct_iterator(ioctx, int n, int m, cursor *out);

  so that you can get a position partway through the pg.

What do you think?
Unfortunately it is quite a departure from what you implemented already, but I think it'll be a net simplification *and* let you do all the things we want, like:

- get a set of ranges to list from
- change our mind partway through to break things into smaller shards without losing previous work
- start listing from a random position in the pool

You could even list a single hash value by constructing cursors with n=hash and n=hash+1 and m=2^32. What do you think?

sage
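The stateless-cursor model Sage describes can be illustrated with a toy in Python (names invented; a sorted list of strings stands in for the hobject_t sort order). The essential property is that a cursor is plain data marking a position, so callers can resume, re-shard, or bound a listing without any server-side iterator state:

```python
import bisect

def list_objects(pool, start, end=None, max_entries=2):
    """Return (entries, next_cursor); next_cursor is None when done.
    `pool` is a sorted list standing in for a PG's object order; a cursor
    is just an object name (plain data), holding no server state."""
    lo = bisect.bisect_left(pool, start)
    hi = bisect.bisect_left(pool, end) if end is not None else len(pool)
    entries = pool[lo:min(lo + max_entries, hi)]
    # The next cursor sorts immediately after the last entry returned.
    nxt = entries[-1] + "\0" if lo + len(entries) < hi else None
    return entries, nxt

pool = sorted("obj%02d" % i for i in range(5))

# Resumable listing: each call starts from the cursor the last one returned.
out, cursor = [], ""
while cursor is not None:
    entries, cursor = list_objects(pool, cursor)
    out.extend(entries)

# Bounded listing: an end cursor limits the range, e.g. for one shard.
shard, _ = list_objects(pool, "obj01", end="obj03")
```

Because the cursor is just data, a caller can also split the remaining range into smaller shards mid-listing without losing the work already done, which is exactly the rados import/export use case.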
Re: RBD journal draft design
On 02/06/2015 16:11, Jason Dillaman wrote:
> I am posting to get wider review/feedback on this draft design. In support of the RBD mirroring feature [1], a new client-side journaling class will be developed for use by librbd. The implementation is designed to carry opaque journal entry payloads so it will be possible for it to be re-used in other applications as well in the future. It will also use the librados API for all operations. At a high level, a single journal will be composed of a journal header to store metadata and multiple journal objects to contain the individual journal entries.
>
> ...
>
> A new journal object class method will be used to submit journal entry append requests. This will act as a gatekeeper for the concurrent client case. A successful append will indicate whether or not the journal is now full (larger than the max object size), indicating to the client that a new journal object should be used. If the journal is too large, an error code response would alert the client that it needs to write to the current active journal object. In practice, the only time the journaler should expect to see such a response would be in the case where multiple clients are using the same journal and the active object update notification has yet to be received.

Can you clarify the procedure when a client write gets an "I'm full" return code from a journal object? The key part I'm not clear on is whether the client will first update the header to add an object to the active set (and then write it), or whether it goes ahead and writes objects and then lazily updates the header.

* If it's object first, header later, what bounds how far ahead of the active set we have to scan when doing recovery?
* If it's header first, object later, that's an uncomfortable bit of latency whenever we cross an object bound

Nothing intractable about mitigating either case, just wondering what the idea is in this design.
> In contrast to the current journal code used by CephFS, the new journal code will use sequence numbers to identify journal entries, instead of offsets within the journal. Additionally, a given journal entry will not be striped across multiple journal objects. Journal entries will be mapped to journal objects using the sequence number: sequence number mod splay count == object number mod splay count for active journal objects. The rationale for this difference is to facilitate parallelism for appends as journal entries will be splayed across a configurable number of journal objects. The journal API for appending a new journal entry will return a future which can be used to retrieve the assigned sequence number for the submitted journal entry payload once committed to disk. The use of a future allows for asynchronous journal entry submissions by default and can be used to simplify integration with the client-side cache writeback handler (and as a potential future enhancement to delay appends to the journal in order to satisfy EC-pool alignment requirements).
When two clients are both doing splayed writes, and they both send writes in parallel, it seems like the per-object fullness check via the object class could result in the writes getting staggered across different objects. E.g. if we have two objects that both only have one slot left, then A could end up taking the slot in one (call it 1) and B could end up taking the slot in the other (call it 2). Then when B's write lands at object 1, it gets an "I'm full" response and has to send the entry... where? I guess to some arbitrarily-higher-numbered journal object depending on how many other writes landed in the meantime. This potentially leads to the stripes (splays?)
of a given journal entry being separated arbitrarily far across different journal objects, which would be fine as long as everything was well formed, but will make detecting issues during replay harder (we would have to remember partially-read entries when looking for their remaining stripes through the rest of the journal). You could apply the object class behaviour only to the object containing the 0th splay, but then you'd have to wait for the write there to complete before writing to the rest of the splays, so the latency benefit would go away. Or it's equally possible that there's a trick in the design that has gone over my head :-)
Cheers,
John
-- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
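If it helps to make the splay mapping concrete, here is a toy sketch of the invariant "sequence number mod splay count == object number mod splay count" (my own illustration, not the proposed implementation):

```python
def object_for_seq(seq, splay_count, active_base):
    """Map a journal entry's sequence number to an active journal object.

    active_base is the lowest object number in the active set; it is
    assumed here to be a multiple of splay_count, so that the invariant
    seq % splay_count == object_num % splay_count holds.
    """
    assert active_base % splay_count == 0
    return active_base + (seq % splay_count)
```

For example, with a splay count of 4 and active objects 8..11, sequence numbers 4..7 land on objects 8..11 respectively, so consecutive entries can be appended in parallel.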
Re: RBD mirroring design draft
On 28/05/2015 06:37, Gregory Farnum wrote:
> On Tue, May 12, 2015 at 5:42 PM, Josh Durgin jdur...@redhat.com wrote:
>> It will need some metadata regarding positions in the journal. These could be stored as omap values in a 'journal header' object in a replicated pool, for rbd perhaps the same pool as the image for simplicity. The header would contain at least:
>> * pool_id - where journal data is stored
>> * journal_object_prefix - unique prefix for journal data objects
>> * positions - (zone, purpose, object num, offset) tuples indexed by zone
>> * object_size - approximate size of each data object
>> * object_num_begin - current earliest object in the log
>> * object_num_end - max potential object in the log
>> Similar to rbd images, journal data would be stored in objects named after the journal_object_prefix and their object number. To avoid issues of padding or splitting journal entries, and to make it simpler to keep append-only, it's easier to allow the objects to be near object_size before moving to the next object number instead of sticking with an exact object size. Ideally this underlying structure could be used for both rbd and cephfs. Variable sized objects are different from the existing cephfs journal, which uses fixed-size objects for striping. The default is still 4MB chunks though. How important is striping the journal to cephfs? For rbd it seems unlikely to help much, since updates need to be batched up by the client cache anyway.
> I think the journaling v2 stuff that John did actually made objects variably-sized as you've described here. We've never done any sort of striping on the MDS journal, although I think it was possible previously.
The objects are still fixed size: we talked about changing it so that journal events would never span an object boundary, but didn't do it -- it still uses Filer.
Parallelism
^^^
Mirroring many images is embarrassingly parallel. A simple unit of work is an image (more specifically a journal, if e.g.
a group of images shared a journal as part of a consistency group in the future). Spreading this work across threads within a single process is relatively simple. For HA, and to avoid a single NIC becoming a bottleneck, we'll want to spread out the work across multiple processes (and probably multiple hosts). rbd-mirror should have no local state, so we just need a mechanism to coordinate the division of work across multiple processes. One way to do this would be layering on top of watch/notify. Each rbd-mirror process in a zone could watch the same object, and shard the set of images to mirror based on a hash of image ids onto the current set of rbd-mirror processes sorted by client gid. The set of rbd-mirror processes could be determined by listing watchers. You're going to have some tricky cases here when reassigning authority as watchers come and go, but I think it should be doable. I've been fantasizing about something similar to this for CephFS backward scrub/recovery. My current code supports parallelism, but relies on the user to script their population of workers across client nodes. I had been thinking of more of a master/slaves model, where one guy would get to be the master by e.g. taking the lock on an object, and he would then hand out work to everyone else that was a watch/notify subscriber to the magic object. It seems like that could be simpler than having workers have to work out independently what their workload should be, and have the added bonus of providing a command-like mechanism in addition to continuous operation. Cheers, John -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
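The listing-watchers sharding idea can be sketched like this (a hypothetical helper, not rbd-mirror code; the choice of hash function is my assumption):

```python
import hashlib

def shard_images(image_ids, watcher_gids):
    """Deterministically shard images across rbd-mirror processes: every
    process lists the watchers on the shared object, sorts them by client
    gid, and takes the images that hash onto its own slot."""
    workers = sorted(watcher_gids)
    assignment = {gid: [] for gid in workers}
    for image_id in image_ids:
        # A stable (non-randomized) hash, so every process computes the
        # same assignment independently without coordination.
        h = int(hashlib.sha1(image_id.encode()).hexdigest(), 16)
        assignment[workers[h % len(workers)]].append(image_id)
    return assignment
```

When the watcher set changes, each process recomputes and picks up its new share; the races during that reassignment are exactly the tricky part John mentions.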
Re: RFC: progress bars
On 28/05/2015 17:41, Robert LeBlanc wrote:
> Let me see if I understand this... Your idea is to have a progress bar that shows (active+clean + active+scrub + active+deep-scrub) / pgs and then estimate time remaining?
Not quite: it's not about doing a calculation on the global PG state counts. The code identifies specific PGs affected by specific operations, and then watches the status of those PGs.
> So if PGs are split the numbers change and the progress bars go backwards, is that a big deal?
I don't see a case where the progress bars go backwards with the code I have so far? In the case of operations on PGs that split, it'll just ignore the new PGs, but you'll get a separate event tracking the creation of the new ones. In general, progress bars going backwards isn't something we should allow to happen (happy to hear counter examples though, I'm mainly speaking from intuition on that point!) If this was extended to track operations across PG splits (it's unclear to me that that complexity is worthwhile), then the bar still wouldn't need to go backwards, as whatever stat was being tracked would remain the same when summed across the newly split PGs.
> I don't think so, it might take a little time to recalculate how long it will take, but no big deal. I do like the idea of the progress bar even if it is fuzzy. I keep running ceph status or ceph -w to watch things and have to imagine it in my mind.
Right, the idea is to save the admin from having to interpret PG counts mentally.
> It might be nice to have some other stats like client I/O and rebuild I/O so that I can see if recovery is impacting production I/O.
We already have some of these stats globally, but it would be nice to be able to reason about what proportion of I/O is associated with specific operations, e.g. "I have some total recovery IO number, what proportion of that is due to a particular drive failure?".
Without going and looking at current pg stat structures I don't know if there is enough data in the mon right now to guess those numbers. This would *definitely* be heuristic rather than exact, in any case. Cheers, John -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
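The PG-tracking approach described above amounts to something like the following (an illustrative sketch, not the actual ProgressEvent code):

```python
def progress(tracked_pgs, pg_state):
    """Fraction of the specific PGs captured at init() that have reached
    their final state.  Because the tracked set is frozen when the event
    is created, newly split PGs are simply ignored, and the fraction does
    not regress as long as PGs only move towards active+clean."""
    if not tracked_pgs:
        return 1.0
    done = sum(1 for pg in tracked_pgs if pg_state.get(pg) == "active+clean")
    return done / len(tracked_pgs)
```

This is why the bar need not go backwards on a split: the denominator is fixed at init() time rather than derived from the live global PG counts.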
Re: Slow file creating and deleting using bonnie ++ on Hammer
On 26/05/2015 15:50, Barclay Jameson wrote:
> Thank you for the great explanation Zheng! That definitely shows what I was seeing with the bonnie++ test. What bad things would happen if I modified the config option mds_tick_interval to flush the journal to a second or less?
The MDS does various pieces of housekeeping according to that interval, so setting it extremely low will cause some CPU cycles to be wasted, and flushing the log more often will cause a larger number of smaller IOs to get generated. I would be very surprised if decreasing it to approx 1s was harmful though. On a busy real world system, other metadata operations will often drive log writes through faster than waiting for a tick.
> Does this also mean any custom code written should avoid use of fsync() if writing a large number of files?
You should call it only when your application requires it for consistency, and always expect it to be a high latency operation. Add up the latency from your client to your server and from the server to the disk, and the length of the IO queue on the disk, and then the return leg -- that is the *minimum* time you should expect to wait for an fsync. For example, a real world workload creating N files in a directory would hopefully call fsync on the directory once at the end, rather than in between every file, unless you really do need to be sure that the dentry for the preceding file will be persistent before you start writing the next file. Sometimes it's easier to reason about it in terms of concurrency: if you have a bunch of IOs that you could safely run in parallel in a thread each, then you shouldn't be fsyncing between them, just at the point you join them.
John
-- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
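The fsync-batching advice can be illustrated with a small sketch (plain POSIX calls, nothing CephFS-specific):

```python
import os

def create_files(dirpath, names, fsync_each=False):
    """Create empty files under dirpath.  With fsync_each=True this
    mimics the bonnie++ pattern (one full round trip per file); with the
    default, a single fsync on the directory at the end persists all the
    new dentries in one wait."""
    for name in names:
        fd = os.open(os.path.join(dirpath, name), os.O_CREAT | os.O_WRONLY, 0o644)
        if fsync_each:
            os.fsync(fd)  # pay the create -> journal-flush latency per file
        os.close(fd)
    dfd = os.open(dirpath, os.O_RDONLY)
    try:
        os.fsync(dfd)  # one directory fsync covers all the creates above
    finally:
        os.close(dfd)
```

The fsync_each path serializes N high-latency waits; the batched path is the "join point" pattern described above.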
Re: Slow file creating and deleting using bonnie ++ on Hammer
On 26/05/2015 07:55, Yan, Zheng wrote:
> the reason for slow file creations is that bonnie++ calls fsync(2) after each creat(2). fsync() waits for safe replies of the create requests. The MDS sends a safe reply when the log event for the request gets journaled safely. The MDS flushes the journal every 5 seconds (mds_tick_interval). So the speed of file creation for bonnie++ is one file every five seconds.
Ah, I hadn't noticed that the benchmark called... I wonder if I'm seeing the fuse client return quickly because it simply doesn't implement the fsyncdir call. We should fix that! It looks like we used to have an OP_FSYNC in the client-server protocol (perhaps for flushing the log immediately on fsyncs), anyone have any background on why that went away?
Cheers,
John
-- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: MDS auth caps for cephfs
On 21/05/2015 01:14, Sage Weil wrote:
> Looking at the MDSAuthCaps again, I think there are a few things we might need to clean up first. The way it is currently structured, the idea is that you have an array of grants (MDSCapGrant). For any operation, you'd look at each grant until one says that what you're trying to do is okay. If none match, you fail. (i.e., they're additive only.)
> Each MDSCapGrant has a 'spec' and a 'match'. The 'match' is a check to see if the current grant applies to a given operation, and the 'spec' says what you're allowed to do. Currently MDSCapMatch is just
>   int uid;          // Require UID to be equal to this, if != MDS_AUTH_UID_ANY
>   std::string path; // Require path to be child of this (may be / for any)
> I think path is clearly right. UID I'm not sure makes sense here... I'm inclined to ignore it (instead of removing it) until we decide how to restrict a mount to be a single user. The spec is
>   bool read;
>   bool write;
>   bool any;
> I'm not quite sure what 'any' means, but read/write are pretty clear.
Ah, I added that when implementing 'tell' -- 'any' is checked when handling incoming MCommand in the MDS, so it's effectively the admin permission.
> The root_squash option clearly belongs in spec, and Nistha's first patch adds it there. What about the other NFS options.. should we mirror those too?
> root_squash: Map requests from uid/gid 0 to the anonymous uid/gid. Note that this does not apply to any other uids or gids that might be equally sensitive, such as user bin or group staff.
> no_root_squash: Turn off root squashing. This option is mainly useful for diskless clients.
> all_squash: Map all uids and gids to the anonymous user. Useful for NFS-exported public FTP directories, news spool directories, etc. The opposite option is no_all_squash, which is the default setting.
> anonuid and anongid: These options explicitly set the uid and gid of the anonymous account.
> This option is primarily useful for PC/NFS clients, where you might want all requests to appear to be from one user. As an example, consider the export entry for /home/joe in the example section below, which maps all requests to uid 150 (which is supposedly that of user joe).
Yes, I think we should. Part of me wants to say that people who want NFS-like behaviour should be using NFS gateways. However, these are all probably straightforward enough to implement that it's worth maintaining them in cephfs too. We probably need to mirror these in our mount options too, so that e.g. someone with an admin key can still enable root_squash at will, rather than having to craft an authentication token with the desired behaviour.
> We could also do an all_squash bool at the same time (or a flags field for more efficient encoding), and anonuid/gid so that we don't hard-code 65534. In order to add these to the grammar, I suspect we should go back to root_squash (not squash_root), and add an 'options' tag. e.g.,
>   allow path /foo rw options no_root_squash anonuid=123 anongid=123
> (having them live next to rw was breaking the spirit parser, bah).
Looks good to me.
John
-- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
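To make the squash semantics concrete, here is a toy sketch of the uid/gid mapping the options above describe (illustrative only; the names and the 65534 default mirror NFS convention, not the MDSAuthCaps implementation):

```python
NOBODY = 65534  # conventional anonymous uid/gid, overridable via anonuid/anongid

def squash(uid, gid, root_squash=False, all_squash=False,
           anonuid=NOBODY, anongid=NOBODY):
    """Map an incoming client uid/gid per the NFS-style options:
    root_squash remaps only uid/gid 0; all_squash remaps everyone."""
    if all_squash or (root_squash and uid == 0):
        uid = anonuid
    if all_squash or (root_squash and gid == 0):
        gid = anongid
    return uid, gid
```

So squash(0, 0, root_squash=True) yields the anonymous identity, while non-root users pass through untouched unless all_squash is set.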
Re: Slow file creating and deleting using bonnie ++ on Hammer
On 22/05/2015 16:25, Barclay Jameson wrote:
> The Bonnie++ job _FINALLY_ finished. If I am reading this correctly it took days to create, stat, and delete 16 files??
> [root@blarg cephfs]# ~/bonnie++-1.03e/bonnie++ -u root:root -s 256g -r 131072 -d /cephfs/ -m CephBench -f -b
> Using uid:0, gid:0.
> Writing intelligently...done
> Rewriting...done
> Reading intelligently...done
> start 'em...done...done...done...
> Create files in sequential order...done.
> Stat files in sequential order...done.
> Delete files in sequential order...done.
> Create files in random order...done.
> Stat files in random order...done.
> Delete files in random order...done.
> Version 1.03e  --Sequential Output-- --Sequential Input- --Random-
>                -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> Machine   Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
> CephBench 256G           1006417 76  90114 13            137110  8 329.8  7
>                --Sequential Create-- --Random Create--
>                -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
> files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>    16     0   0     + +++     0   0     0   0  5267  19     0   0
> CephBench,256G,,,1006417,76,90114,13,,,137110,8,329.8,7,16,0,0,+,+++,0,0,0,0,5267,19,0,0
> Any thoughts?
It's 16000 files by default (not 16), but this usually takes only a few minutes. FWIW I tried running a quick bonnie++ (with -s 0 to skip the IO phase) on a development (vstart.sh) cluster with a fuse client, and it readily handles several hundred client requests per second (checked with ceph daemonperf mds.id). Nothing immediately leapt out at me from a quick look at the log you posted, but with issues like these it is always worth trying to narrow it down by trying the fuse client instead of the kernel client, and/or different kernel versions. You may also want to check that your underlying RADOS cluster is performing reasonably by doing a rados bench too.
Cheers, John -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: negative pool stats (was: Bug?)
On 10/04/2015 19:27, Barclay Jameson wrote:
> Watching osd pool stats (watch --interval=.5 -d 'ceph osd pool stats') while restarting 1/3 of my OSDs gives some odd numbers:
> pool cephfs_data id 1
>   -768/9 objects degraded (-8533.333%)
>   recovery io 18846 B/s, 1 objects/s
>   client io 15356 B/s wr, 2 op/s
> pool cephfs_metadata id 2
>   -1/0 objects degraded (-inf%)
Negative stats are: http://tracker.ceph.com/issues/7737 The fix appears to have just missed 0.94, but should be in the next stable releases.
John
-- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RFC: progress bars
Hi all,
[this is a re-send of a mail from yesterday that didn't make it, probably due to an attachment]
It has always annoyed me that we don't provide a simple progress bar indicator for things like the migration of data from an OSD when it's marked out, the rebalance that happens when we add a new OSD, or scrubbing the PGs on an OSD. I've experimented a bit with adding user-visible progress bars for some of the simple cases (screenshot at http://imgur.com/OaifxMf). The code is here:
https://github.com/ceph/ceph/blob/wip-progress-events/src/mon/ProgressEvent.cc
This is based on a series of ProgressEvent classes that are instantiated when certain things happen, like marking an OSD in or out. They provide an init() hook that captures whatever state is needed at the start of the operation (generally noting which PGs are affected) and a tick() hook that checks whether the affected PGs have reached their final state.
Clearly, while this is simple for the simple cases, there are lots of instances where things will overlap: a PG can get moved again while it's being backfilled following a particular OSD going out. These progress indicators don't have to capture that complexity, but the goal would be to make sure they did complete eventually rather than getting stuck/confused in those cases.
This is just a rough cut to play with the idea, there's no persistence of the ProgressEvents, and the init()/tick() methods are peppered with correctness issues. Still, it gives a flavour of how we could add something friendlier like this to expose simplified progress indicators.
Ideas for further work:
* Add in an MDS handler to capture the progress of an MDS rank as it goes through replay/reconnect/clientreplay
* A handler for overall cluster restart, that noticed when the mon quorum was established and all the map timestamps were some time in the past, and then generated progress based on OSDs coming up and PGs peering.
* Simple: a handler for PG creation after pool creation
* Generate estimated completion times from the rate of progress so far
* Friendlier PGMap output, by hiding all PG states that are explained by an ongoing ProgressEvent, to only indicate low-level PG status for things that the ProgressEvents don't understand.
Cheers,
John
-- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Advice for implementation of LED behavior in Ceph ecosystem
On 01/04/2015 22:17, John Spray wrote: Once you have found the block device and reported it in the OSD metadata, you can use that information to go poke its LEDs using enclosure services hooks as you suggest, and wrap that in an OSD 'tell' command (OSD::do_command). In a similar vein to finding the block device, it would be a good thing to have a config option here so that admins can optionally specify a custom command for flashing a particular OSD's LED. Admins might not bother setting that, but it would mean a system integrator could optionally configure ceph to work with whatever exotic custom stuff they have. One more thought occurs to me -- one of the main cases where you'd want to flash an LED would be to identify the drive of an OSD that is down/out due to a dead drive. In that instance, the ceph-osd process wouldn't actually be running, so you wouldn't be able to send it the 'tell' to flash the LED. I guess in this interesting case you could either: * Allow other OSDs on the same host to handle the 'tell blink' command for the dead OSD's drive * Leave this to calamari/whoever to read the dead OSD's block device path from ceph osd metadata, and go blink the LEDs themselves. Cheers, John -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Advice for implementation of LED behavior in Ceph ecosystem
On 01/04/2015 22:57, Mark Nelson wrote:
> It seems to me that the OSD potentially would flash the LED on its way down if it thinks its drive is dead/dying?
That's a good idea for the case where ceph-osd is proactively identifying a failing drive. I'm also thinking about the case where we come back from a reboot and a drive is sufficiently unreadable that ceph-disk doesn't see the OSD partitions and ceph-osd never gets started, or the OSD's local filesystem is unmountable. Because the keyring lives on that local filesystem, OSDs couldn't phone home in that case, even to report a failure.
John
-- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Advice for implementation of LED behavior in Ceph ecosystem
On 01/04/2015 23:04, John Spray wrote:
> On 01/04/2015 22:57, Mark Nelson wrote:
>> It seems to me that the OSD potentially would flash the LED on its way down if it thinks its drive is dead/dying?
> That's a good idea for the case where ceph-osd is proactively identifying a failing drive. I'm also thinking about the case where we come back from a reboot and a drive is sufficiently unreadable that ceph-disk doesn't see the OSD partitions and ceph-osd never gets started, or the OSD's local filesystem is unmountable. Because the keyring lives on that local filesystem, OSDs couldn't phone home in that case, even to report a failure.
Sorry, mental lapse: we're not talking about phoning home, we're talking about flashing the LED. So perhaps ceph-disk itself could be modified to flash an LED on a drive if it has a GPT partition ID for a ceph osd but we can't mount it or start an OSD service.
John
-- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Advice for implementation of LED behavior in Ceph ecosystem
On 01/04/2015 19:56, Handzik, Joe wrote: 1. Stick everything in Calamari via Salt calls similar to what Gregory is showing. I have concerns about this, I think I'd still need extra information from the OSDs themselves. I might need to implement the first half of option #2 anyway. 2. Scatter it across the codebases (would probably require changes in Ceph, Calamari, and Calamari-clients). Expose the storage target data via the OSDs, and move that information upward via the RESTful API. Then, expose another RESTful API behavior that allows a user to change the LED state. Implementing as much as possible in the Ceph codebase itself has an added benefit (as far as I see it, at least) if someone ever decides that the fault LED should be toggled on based on the state of the OSD or backing storage device. It should be easier for Ceph to hook into that kind of functionality if Calamari doesn't need to be involved. Dan mentioned something I thought about too...not EVERY OSD's backing storage is going to be able to use this (Kinetic drives, NVDIMMs, M.2, etc etc), I'd need to implement some way to filter devices and communicate via the Calamari GUI that the device doesn't have an LED to toggle or doesn't understand SCSI Enclosure Services (I'm targeting industry standard HBAs first, and I'll deal with RAID controllers like Smart Array later). I'm trying to get this out there early so anyone with particularly strong implementation opinions can give feedback. Any advice would be appreciated! I'm still new to the Ceph source base, and probably understand Calamari and Calamari-clients better than Ceph proper at the moment. Similar to Mark's comment, I would lean towards option 2 -- it would be great to have a CLI-driven ability to flash the LEDs for an OSD, and work on integrating that with a GUI afterwards. 
Currently the OSD metadata on drives is pretty limited, it'll just tell you the /var/lib/ceph/osd/ceph-X path for the data and journal -- the task of resolving that to a physical device is left as an exercise to the reader, so to speak. I would suggest extending osd metadata to also report the block device, but only for the simple case where an OSD is a GPT partition on a raw /dev/sdX block device. Resolving block device to underlying disks in configurations like LVM/MDRAID/multipath is complex in the general case (I've done it, I don't recommend it), and most ceph clusters don't use those layers. You could add a fallback ability for users to specify their block device in ceph.conf, in case the simple GPT-assuming OSD probing code can't find it from the mount point. Once you have found the block device and reported it in the OSD metadata, you can use that information to go poke its LEDs using enclosure services hooks as you suggest, and wrap that in an OSD 'tell' command (OSD::do_command). In a similar vein to finding the block device, it would be a good thing to have a config option here so that admins can optionally specify a custom command for flashing a particular OSD's LED. Admins might not bother setting that, but it would mean a system integrator could optionally configure ceph to work with whatever exotic custom stuff they have. Hopefully that's some help, it sounds like you've already thought it through a fair bit anyway. Cheers, John -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
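A minimal sketch of the "simple case" lookup described above, resolving a data path to its backing device by longest-prefix match over the mount table (my illustration; the real probing would live in the OSD's C++ metadata collection and handle the GPT specifics):

```python
def block_device_for(path, mounts_lines=None):
    """Best-effort: find the device backing the filesystem containing
    path, via longest-prefix match over /proc/mounts entries.
    Deliberately does not chase LVM/MDRAID/multipath layers, per the
    discussion above."""
    if mounts_lines is None:
        with open("/proc/mounts") as f:
            mounts_lines = f.readlines()
    best_mnt, best_dev = "", None
    for line in mounts_lines:
        fields = line.split()
        if len(fields) < 2:
            continue
        dev, mnt = fields[0], fields[1]
        inside = path == mnt or path.startswith(mnt.rstrip("/") + "/")
        if inside and len(mnt) > len(best_mnt):
            best_mnt, best_dev = mnt, dev
    return best_dev
```

A ceph.conf fallback, as suggested, would simply override whatever this probe returns.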
Re: adding {mds,mon} metadata asok command
On 24/03/2015 10:11, Joao Eduardo Luis wrote: I don't think people change hostnames for sport Sounds interesting, I might buy tickets to a game :-D John -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: adding {mds,mon} metadata asok command
On 23/03/2015 10:04, Joao Eduardo Luis wrote: I agree. And I don't think we need a new service for this, and I also don't think we need to write stuff to the store. We can generate this information when the monitor hits 'bootstrap()' and share it with the rest of the quorum once an election finishes, and always keep it in memory (unless there's some information that needs to be persisted, but I was under the impression that was not the case). Just to clarify, you mean we don't need to write the mon metadata to the store, but we'd still want to persist the MDS/OSD metadata - right? John -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: adding {mds,mon} metadata asok command
On 20/03/2015 05:39, kefu chai wrote:
> to pave the road to http://tracker.ceph.com/issues/10904, where we need to add a command to list the hostname of nodes in a ceph cluster, i would like to add the {mds,mon} metadata commands to print the system information including, but not limited to, hostname, mem_{total,swap}_kb, and distro info, of a specified mds and mon. the implementation follows the mechanism of osd metadata. on the mds side i would like to reuse the MDSMonitor service:
> 1. piggy-back a map for the metadata in the MMDSBeacon message,
> 2. put the metadata into the same DBStore transaction but with another prefix when storing the pending inc into local storage,
> 3. and expose it using mds metadata and later on service ls (not sure about the name ...)
> @greg and @zyan, are you good with this? not sure this will overburden the mds or not. i will use uname(2) and grep /proc/meminfo to get the metadata in the same way as the OSD.
It should be straightforward to include the metadata in MMDSBeacon only once per daemon lifetime, by checking if the state is CEPH_MDS_STATE_BOOT -- that way we don't have to worry about any ongoing costs. I expect that change can live entirely in Beacon.cc without touching any other MDS code. As for the means of getting the information, I expect the generic kernel/mem/cpu/distro stuff from OSD::_collect_metadata can be moved up into common/ somewhere and reused as-is from mon+mds.
John
-- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
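For illustration, the generic collection could look something like this (a Python sketch of what a shared helper might gather; the key names mimic the osd metadata style but are assumptions here):

```python
import os
import socket

def collect_sys_metadata():
    """Gather generic host metadata: uname fields, hostname, and memory
    totals from /proc/meminfo (skipped on non-Linux systems)."""
    uname = os.uname()
    md = {
        "hostname": socket.gethostname(),
        "os": uname.sysname,
        "kernel_version": uname.release,
        "arch": uname.machine,
    }
    try:
        with open("/proc/meminfo") as f:
            for line in f:
                key, _, rest = line.partition(":")
                if key == "MemTotal":
                    md["mem_total_kb"] = int(rest.split()[0])
                elif key == "SwapTotal":
                    md["mem_swap_kb"] = int(rest.split()[0])
    except OSError:
        pass  # /proc/meminfo is Linux-only
    return md
```

Collecting this once at daemon startup (e.g. alongside the BOOT beacon) keeps the ongoing cost at zero, matching the once-per-lifetime suggestion above.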
Persistence of completed_requests in sessionmap (do we need it?)
Zheng noticed on my new sessionmap code [1] that sessions weren't getting dirtied on trim_completed_requests. I had missed that, because I was only updating the places that we already incremented the sessionmap version while modifying something. I went and looked at how this worked in the existing code, and it appears that we don't actually bother persisting updates to the sessionmap if completed_requests is the only thing that changed. We would *tend* to persist it as a consequence to other session updates like prealloc_inos, but if one is simply issuing lots of metadata updates to existing files in a loop, the sessionmap never gets written back (even when expiring log segments). During replay, we rebuild completed_requests from EMetaBlob::replay, and we've made it this far without reliably persisting it in sessionmap, so I wonder if we ever needed to save this at all? Thoughts? Cheers, John 1. https://github.com/ceph/ceph/pull/3718 -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ceph versions
On 26/02/2015 23:12, Sage Weil wrote:
> Hammer will most likely be v0.94[.x]. We're getting awfully close to 0.99, though, which makes many people think 1.0 or 1.00 (instead of 0.100), and the current versioning is getting a bit silly. So let's talk about alternatives!
I'm late to this thread, but... I find option B preferable, because it puts the most important information (which series, and is it stable) within the X.Y part that people will typically use in normal speech. In an ideal world option D, but I have found historically that it gets very confusing to have multiple releases with the same number, differentiated only by a trailing -dev/-rc. Folks are prone to drop the critical trailing qualifier when they say "I'm running 1.2". However, option D would be my second choice to option B, as it's the most explicit of the alternatives.
As for the others...
Option A (doubles and triples) has a similar abbreviation problem, that people will say "I'm running 1.2" whether they were running a 1.2 dev or a 1.2.1 stable.
Option C (semantic) is nice for APIs, but for software releases will confuse ordinary humans who associate big version jumps with big features etc.
Option E differentiates between stable/dev with double/triple, which means it too has the abbreviation problem when spoken about colloquially.
Option F is confusing because it requires the reader to differentiate between 8 (not 0!) and 9 (point 0) to see stability.
Cheers, John
> Here are a few options:
-- Option A -- doubles and triples
X.Y[.Z]
- Increment X at the start of each major dev cycle (hammer, infernalis)
- Increment Y for each iteration during that cycle
- Eventually decide it's good and start adding .Z for the stable fixes.
For example,
1.0 first infernalis dev release
1.1 dev release
...
1.8 infernalis rc
1.9 infernalis final
1.9.1 stable update
1.9.2 stable update
...
2.0 first j (jewel?) dev release
2.1 next dev release
...
2.8 final j 2.8.x stable j releases Q: How do I tell if it's a stable release? A: It is a triple instead of a double. Q: How do I tell if this is the final release in the series? A: Nobody knows that until we start doing stable updates; see above. -- Option B -- even/odd X.Y.Z - if Y is even, this is a stable series - if Y is odd, this is a dev release - increment X when something major happens 1.0 hammer final 1.0.1 stable/bugfix 1.0.2 stable 1.0.3 stable ... 1.1.0 infernalis dev release 1.1.1 infernalis dev release 1.1.2 infernalis dev release ... 1.2.0 infernalis final 1.2.1 stable branch ... 1.3.0 j-release dev 1.3.1 j-release dev 1.3.2 j-release dev ... 1.4.0 j-release final 1.4.1 stable 1.4.1 stable Q: How do I tell if it's a stable release? A: Second item is even. -- Option C -- semantic major.minor.patch - MAJOR version when you make incompatible API changes, - MINOR version when you add functionality in a backwards-compatible manner, and - PATCH version when you make backwards-compatible bug fixes. 1.0.0 hammer final 1.0.1 bugfix 1.0.1 bugfix 1.0.1 bugfix 1.1 infernalis dev release 1.2 infernalis dev release 2.0 infernalis dev release 2.1 infernalis dev release 2.2 infernalis dev release 2.3 infernalis final 2.3.0 bugfix 2.3.1 bugfix 2.3.2 bugfix 2.4 j dev release ... 2.14 j final 2.14.0 bugfix 2.14.1 bugfix ... 2.15 k dev .. 3.3 k final 3.3.1 bugfix ... Q: How do I tell what named release series this is? A: As with the others, you just have to know. Q: How do we distinguish between stable-series updates and dev updates? A: Stable series are triples. Q: How do I know if I can downgrade? A: The major hasn't changed. Q: Really? A: Well, maybe. We haven't dealt with downgrades yet so this assumes we get it right (and test it). We may not realize there is a backward-incompatible change right away and only discover it later during testing, at which point the versions are fixed; we'd probably bump the *next* release in response. 
-- Option D -- labeled X.Y-{dev,rc,release}Z - Increment Y on each major named release - Increment X if it's a major major named release (bigger change than usual) - Use dev, rc, or release prefix to clearly label what type of release this is - Increment Z for stable updates 1.0-dev1 first infernalis dev release 1.0-dev2 another dev release ... 1.0-rc1 first rc 1.0-rc2 next rc 1.0-release1 final release 1.0-release2 stable update 1.0-release3 stable update 1.1-dev1 first cut for j-release 1.1-dev2 ... ... 1.1-rc1 1.1-release1 stable 1.1-release2 stable 1.1-release3 stable Q: How do I tell what kind of release this is? A: Look at the string embedded in the version Q: Will these funny strings confuse things that sort by version? A: I don't think so. -- Option E -- ubuntu YY.MM[.Z] - YY is year, MM is month of release - Z for stable updates 15.03 hammer
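To make the even/odd rule in Option B concrete, a version classifier could look like this (a hypothetical helper, purely to illustrate the scheme):

```python
def classify(version):
    """Classify a version string under Option B: even second
    component means a stable series, odd means a dev release."""
    parts = [int(p) for p in version.split(".")]
    return "stable" if parts[1] % 2 == 0 else "dev"

assert classify("1.0.1") == "stable"   # hammer stable update
assert classify("1.1.2") == "dev"      # infernalis dev release
assert classify("1.2.0") == "stable"   # infernalis final
```
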
Re: MDS crashes (80.8)
On 26/02/2015 15:26, Wyllys Ingersoll wrote: Trying to run ceph-mds on a freshly installed firefly cluster with no ceph FS created yet. It consistently crashes upon startup. Below is debug output showing the point of the crash. Something is obviously misconfigured or broken but I'm at a loss as to where the issue would be. Any ideas? The MDS thinks that its on-disk tables should already exist (the MDSMap structure on the mons is telling it so), but when it goes to read them from RADOS it's finding that they aren't there. This is reminiscent of http://tracker.ceph.com/issues/7485, although I just checked and the v0.80.8 tag does include the fix for that. Please could you see if you have the MDS log from the very first time it ran? There is probably a clue there. Once you have recovered that and any other interesting diagnostic information, you can try to get out of this bad state by using ceph mds newfs to reset the MDSMap. Thanks, John
Re: MDS crashes (80.8)
On 26/02/2015 17:19, Wyllys Ingersoll wrote: OK, attached is the initial log, or at least the earliest log I can find. Ah, now that I look more closely at the backtrace, I realise that creation succeeded, but it is now failing on subsequent runs because it can't find the metadata pool. I guess you probably deleted it at some point between Monday and today. You will need to create yourself some filesystem pools (metadata and data) before using newfs to reset your filesystem. Cheers, John
Re: MDS crashes (80.8)
On 26/02/2015 17:58, Wyllys Ingersoll wrote: Yeah, I noticed that too, so I recreated both of those pools and it still won't start. It crashes in a different place now, but still won't start, even after running 'newfs'. Attached is the debug log output when I start ceph-mds ... common/Thread.cc: In function 'int Thread::join(void**)' thread 7f0127612700 time 2015-02-26 07:55:09.734802 common/Thread.cc: 141: FAILED assert(status == 0) ceph version 0.80.8 (69eaad7f8308f21573c604f121956e64679a52a7) 1: (Thread::detach()+0) [0x8c30d0] 2: (MonClient::shutdown()+0x50f) [0x881d5f] 3: (MDS::suicide()+0xe5) [0x576285] 4: (MDLog::handle_journaler_write_error(int)+0x5f) [0x7aab2f] 5: (Context::complete(int)+0x9) [0x56d9a9] 6: (Journaler::handle_write_error(int)+0x5e) [0x7b949e] 7: (Journaler::_finish_write_head(int, Journaler::Header, Context*)+0x306) [0x7b9946] 8: (Context::complete(int)+0x9) [0x56d9a9] 9: (Objecter::check_op_pool_dne(Objecter::Op*)+0x214) [0x7ce6a4] 10: (Objecter::C_Op_Map_Latest::finish(int)+0x124) [0x7cea04] 11: (Context::complete(int)+0x9) [0x56d9a9] 12: (Finisher::finisher_thread_entry()+0x1b8) [0x9aced8] 13: (()+0x8182) [0x7f012db95182] 14: (clone()+0x6d) [0x7f012c50befd] NOTE: a copy of the executable, or `objdump -rdS executable` is needed to interpret this. 2015-02-26 07:55:09.736256 7f0127612700 -1 common/Thread.cc: In function 'int Thread::join(void**)' thread 7f0127612700 time 2015-02-26 07:55:09.734802 common/Thread.cc: 141: FAILED assert(status == 0) That is still looking like a missing pool (the check_op_pool_dne frame in the trace). Can you post ceph osd dump and ceph mds dump? John
Re: MDS crashes (80.8)
On 26/02/2015 18:07, Wyllys Ingersoll wrote: Here is 'ceph df', followed by mds dump and osd dump Your MDS map is trying to use pool 0 for both data and metadata. Firstly, these should be different pools. Secondly, you have no pool 0. You do have metadata and data pools with ids 6 and 7, so you should be typing ceph mds newfs 6 7. Note that this situation can't occur on more recent versions of Ceph, where the filesystem-pool relations are enforced strictly. If you are intent on using the filesystem, you should consider using more recent versions to benefit from the numerous other filesystem bugfixes. Cheers, John $ ceph df GLOBAL: SIZE AVAIL RAW USED %RAW USED 1307T 1302T5039G 0.38 POOLS: NAME ID USED %USED MAX AVAIL OBJECTS ks3backup 3 160M 0 0 9 test 4 0 0 0 0 TEST 5 2501G 0.19 0 320263 metadata 6 0 0 0 0 data 7 0 0 0 0 rbd 8 0 0 0 0 $ ceph mds dump dumped mdsmap epoch 220 epoch 220 flags 0 created 2015-02-26 07:43:23.410584 modified 2015-02-26 07:55:28.817383 tableserver 0 root 0 session_timeout 60 session_autoclose 300 max_file_size 1099511627776 last_failure 0 last_failure_osd_epoch 867 compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap} max_mds 1 in 0 up {0=20039} failed stopped data_pools 0 metadata_pool 0 inline_data disabled 20039: 10.2.3.33:6800/5098 'mdc03' mds.0.91 up:creating seq 1 laggy since 2015-02-26 07:55:28.817341 $ ceph osd dump epoch 872 fsid e73bfab8-01f1-4534-9a66-1b425d5c3341 created 2015-02-23 07:53:05.626725 modified 2015-02-26 08:01:48.834331 flags pool 3 'ks3backup' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 512 pgp_num 512 last_change 251 flags hashpspool stripe_width 0 pool 4 'test' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 512 pgp_num 512 last_change 363 flags hashpspool stripe_width 0 pool 5 'TEST' replicated size 2 min_size 1 
crush_ruleset 0 object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 473 flags hashpspool stripe_width 0 pool 6 'metadata' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 808 flags hashpspool stripe_width 0 pool 7 'data' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 810 flags hashpspool stripe_width 0 pool 8 'rbd' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 512 pgp_num 512 last_change 833 flags hashpspool stripe_width 0 max_osd 242 osd.0 up in weight 1 up_from 774 up_thru 833 down_at 748 last_clean_interval [692,747) 10.2.2.109:6851/6280 10.3.2.109:6834/6280 10.3.2.109:6835/6280 10.2.2.109:6852/6280 exists,up 43999652-2966-4b1b-a8f0-c33ef5db0970 osd.1 up in weight 1 up_from 648 up_thru 833 down_at 630 last_clean_interval [598,629) 10.2.3.105:6857/6775 10.3.3.105:6838/6775 10.3.3.105:6839/6775 10.2.3.105:6858/6775 exists,up 1e4b6fa3-044c-40b8-9f5b-968bdfae5dd7 osd.2 up in weight 1 up_from 627 up_thru 833 down_at 619 last_clean_interval [455,618) 10.2.1.108:6848/5782 10.3.1.108:6832/5782 10.3.1.108:6833/5782 10.2.1.108:6849/5782 exists,up 95000872-72f2-46e8-8d4a-f9b1422032cb osd.3 up in weight 1 up_from 773 up_thru 833 down_at 752 last_clean_interval [693,751) 10.2.2.110:6833/4189 10.3.2.110:6822/4189 10.3.2.110:6823/4189 10.2.2.110:6834/4189 exists,up 980182ab-51bf-4cd9-a9a4-cfca05123592 osd.4 up in weight 1 up_from 732 up_thru 833 down_at 722 last_clean_interval [718,721) 10.2.1.101:6806/2789 10.3.1.101:6804/2789 10.3.1.101:6805/2789 10.2.1.101:6807/2789 exists,up 92a8fe48-082d-4ad1-93ba-5a4f9598d866 osd.5 up in weight 1 up_from 766 up_thru 810 down_at 739 last_clean_interval [663,738) 10.2.3.111:6800/4353 10.3.3.111:6800/4353 10.3.3.111:6801/4353 10.2.3.111:6801/4353 exists,up d6539347-2c2a-4fba-867d-74c84b2018ad osd.6 up in weight 1 up_from 781 up_thru 833 down_at 760 last_clean_interval [711,759) 10.2.2.104:6830/4317 
10.3.2.104:6820/4317 10.3.2.104:6821/4317 10.2.2.104:6831/4317 exists,up 55b05a0f-2b17-44e2-84c5-78db81ea862b osd.7 up in weight 1 up_from 639 up_thru 833 down_at 633 last_clean_interval [608,632) 10.2.3.106:6842/4762 10.3.3.106:6828/4762 10.3.3.106:6829/4762 10.2.3.106:6843/4762 exists,up 6a62d4fc-487c-4b39-a64a-823d675969a9 osd.8 up in weight 1 up_from 769 up_thru 833 down_at 744 last_clean_interval [671,743) 10.2.3.112:6845/5168 10.3.3.112:6830/5168 10.3.3.112:6831/5168 10.2.3.112:6846/5168 exists,up 9c4202e6-7749-4436-adb1-9c8bf1dc7f04 osd.9 up in weight 1 up_from 734
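Putting John's advice together, the recovery sequence for this cluster would look roughly as follows (pool names and ids are taken from the dump above; the --yes-i-really-mean-it guard may or may not be required depending on the exact 0.80.x build, so this is a sketch rather than a verified transcript):

```shell
# Create dedicated filesystem pools if they don't already exist
ceph osd pool create metadata 1024
ceph osd pool create data 1024

# Confirm their pool ids (6 and 7 in this cluster)
ceph osd dump | grep -E "pool (6|7) "

# Point the MDS map at them: newfs <metadata-id> <data-id>
ceph mds newfs 6 7 --yes-i-really-mean-it
```
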
Re: MDS crashes (80.8)
On 26/02/2015 18:27, Wyllys Ingersoll wrote: Excellent, thanks. It starts now without crashing, but I see lots of errors like this: 2015-02-26 08:24:19.134693 7f78091aa700 0 cephx: verify_reply couldn't decrypt with error: error decoding block for decryption 2015-02-26 08:24:19.134695 7f78091aa700 0 -- 10.2.3.33:6800/5236 >> 10.2.2.104:6857/8883 pipe(0x26f5700 sd=33 :50680 s=1 pgs=0 cs=0 l=1 c=0x280b1e0).failed verifying authorize reply 2015-02-26 08:24:19.134717 7f7809bb4700 0 cephx: verify_reply couldn't decrypt with error: error decoding block for decryption 2015-02-26 08:24:19.134719 7f7809bb4700 0 -- 10.2.3.33:6800/5236 >> 10.2.3.111:6848/4652 pipe(0x26f4300 sd=23 :58785 s=1 pgs=0 cs=0 l=1 c=0x2694c00).failed verifying authorize reply 2015-02-26 08:24:19.134761 7f78093ac700 0 cephx: verify_reply couldn't decrypt with error: error decoding block for decryption 2015-02-26 08:24:19.134764 7f78093ac700 0 -- 10.2.3.33:6800/5236 >> 10.2.3.105:6842/5408 pipe(0x26f5c00 sd=30 :38240 s=1 pgs=0 cs=0 l=1 c=0x280a000).failed verifying authorize reply It would appear that something is wrong with your deployment, such as missing keys. You might want to take this issue to ceph-users to see if anyone can help with whatever system you were using to deploy these services. John
Negative stats on lab cluster
The lab cluster hosting teuthology logs is currently exhibiting negative statistics (#5884, #7737). Could be a good time for someone with more low-level RADOS expertise than me to take a look at the status and see if they can work out why it's happening. Cheers, John
Re: [ceph-calamari] disk failure prediction
On 18/02/2015 23:20, Sage Weil wrote: We wouldn't see quite the same results since our raid sets are effectively entire pools I think we could do better than pool-wide, e.g. if multiple drives in one chassis are at risk (where a PG stores at most one copy per chassis), we can identify that as less severe than the general case where multiple at-risk drives might be in the same PG. Making it CRUSH-aware like this would be a good hook for users to take advantage of the ceph/calamari SMART monitoring rather than rolling their own. John
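A rough sketch of the CRUSH-aware idea, with made-up structures rather than real CRUSH code: if every at-risk drive sits in a single chassis and placement guarantees at most one replica per chassis, no PG can lose more than one copy, so the situation is less severe.

```python
def risk_severity(at_risk_osds, osd_to_chassis, one_copy_per_chassis=True):
    """Classify the severity of a set of at-risk drives.

    Hypothetical helper: osd_to_chassis maps osd id -> chassis name,
    and one_copy_per_chassis says whether the CRUSH rule places at
    most one replica of a PG per chassis.
    """
    chassis = {osd_to_chassis[o] for o in at_risk_osds}
    if len(at_risk_osds) <= 1:
        return "low"    # losing one drive is what replication is for
    if one_copy_per_chassis and len(chassis) == 1:
        return "low"    # all at-risk drives share one failure domain
    return "high"       # at-risk drives could back the same PG

topology = {0: "chassis-a", 1: "chassis-a", 2: "chassis-b"}
assert risk_severity({0, 1}, topology) == "low"
assert risk_severity({0, 2}, topology) == "high"
```
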
ceph-fuse remount issues
Background: a while ago, we found (#10277) that the existing cache expiration mechanism wasn't working with latest kernels. We used to invalidate the top level dentries, which caused fuse to invalidate everything, but an implementation detail in fuse caused it to start ignoring our repeated invalidate calls, so this doesn't work any more. To persuade fuse to dirty its entire metadata cache, Zheng added in a system() call to mount -o remount after we expire things from our client side cache. However, this was a bit of a hack and has created problems: * You can't call mount -o remount unless you're root, so we are less flexible than we used to be (#10542) * While the remount is happening, unmounts sporadically fail and the fuse process can become unresponsive to SIGKILL (#10916) The first issue was maybe an acceptable compromise, but the second issue is just painful, and it seems like we might not have seen the last of the knock-on effects -- upstream maintainers certainly aren't expecting filesystems to remount themselves quite so frequently. We probably have an opportunity to get something upstream in fuse to support a direct call to trigger the invalidation we want, if we can work out what that should look like. Thoughts? John
Re: Standardization of perf counters comments
On Wed, Feb 11, 2015 at 6:02 PM, Gregory Farnum g...@gregs42.com wrote: On Wed, Feb 11, 2015 at 9:33 AM, Alyona Kiseleva akisely...@mirantis.com wrote: Hi, I would like to propose something. There are a lot of perf counters in different places in the code, but most of them are undocumented. I found only one commented counter in the OSD.cc code, and not for all metrics. The name of a counter is not as clear as a description would be, and sometimes isn't clear at all. So I have an idea: it would be great if the perf schema contained not only the counter type, but some description too. It could be added in the PerfCountersBuilder methods -- at first as an optional parameter with an empty string by default, later, maybe, as a required parameter. This short description could be saved in the perf_counter_data_any_d struct together with other counter properties and appear in the perf schema as the second counter property. There will be lots of counters that aren't usefully describable in a single sentence. That doesn't excuse them from being documented, but we should be careful to avoid generating useless tautological strings like num_strays_delayed - Number of strays that are currently delayed. For lots of things, an understandable definition will require some level of introduction of terms and concepts -- in my example, what's a stray? what does it mean for it to be delayed? While we shouldn't hold up the descriptions waiting for documentation that explains all the needed concepts, we should think about how that will fit together. Perhaps, rather than having a single string in the code, we should look to create a separate metadata file that allows richer RST docs for each setting, and verify that all settings are described during the docs build (i.e. a gitbuilder fail will tell us if someone added a setting without the associated documentation). That way, the short perfcounter descriptions could include hyperlinks to related concepts elsewhere in the docs.
John
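The "verify that all settings are described during the docs build" step could be as simple as the following sketch (a hypothetical build script, not existing Ceph tooling): collect the counter names from the code, load the description metadata file, and fail the build on any gap.

```python
def check_counter_docs(counters, documented):
    """Fail the docs build if any perf counter lacks a description.

    Hypothetical build step: 'counters' is the set of counter names
    extracted from the code, 'documented' maps name -> RST description
    loaded from a separate metadata file.
    """
    missing = sorted(c for c in counters if not documented.get(c, "").strip())
    if missing:
        raise SystemExit("undocumented perf counters: %s" % ", ".join(missing))

counters = {"mds.num_strays_delayed", "mds.request_rate"}
docs = {
    "mds.num_strays_delayed": "Strays whose purge is deferred; see :ref:`strays`.",
    "mds.request_rate": "Client metadata requests per second.",
}
check_counter_docs(counters, docs)  # passes silently when everything is documented
```
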
Re: Ceph bindings for go docker
It would also be pretty interesting to do a CephFS driver here using subdirs as volumes. Just a thought: if anyone writes the RBD driver, could they put rbd in the name so that we can disambiguate them in future? Cheers, John On Mon, Feb 9, 2015 at 6:23 PM, Noah Watkins nwatk...@redhat.com wrote: Hi Loic, This sounds great. The librados bindings have good test coverage, but I merged a PR for RBD support a couple weeks ago and haven't had time to get it cleaned up and tests written. Do you need support for the AIO interface in librbd? -Noah - Original Message - From: Loic Dachary l...@dachary.org To: Noah Watkins noah.watk...@inktank.com Cc: Ceph Development ceph-devel@vger.kernel.org, Vincent Batts vba...@redhat.com, Johan Euphrosine pro...@google.com Sent: Monday, February 9, 2015 9:15:02 AM Subject: Ceph bindings for go docker Hi, I discovered https://github.com/noahdesu/go-ceph today :-) It would be useful in the context of a Ceph volume driver for docker ( see https://github.com/docker/docker/issues/10661 https://github.com/docker/docker/pull/8484 ). Are you a docker user by any chance ? -- Loïc Dachary, Artisan Logiciel Libre
Re: flock() on libcephfs ?
On Fri, Feb 6, 2015 at 11:39 AM, Xavier Roche roche+k...@exalead.com wrote: I'm not sure this is the right place to post, so do not hesitate to redirect me to a more appropriate list if necessary! You're in the right place! New to ceph, I naively attempted to add a flock (int ceph_flock(struct ceph_mount_info *cmount, int fd, int operation)) function in libcephfs, but I could not find the proper way to figure out what was the owner identifier used by Client::ll_flock() (an uint64_t integer) - this is not the pid, I presume? You should let the caller pass in the owner ID. For example, if you were using libcephfs to implement a filesystem interface, it would be up to that interface to work out the unique ID from the calling layer above it. So you can pass it through libcephfs, no logic needed. Thanks in advance for any hints! By the way - would this new feature (flock()) make sense? Definitely; the flock support in the userspace client is new, and the libcephfs bindings just didn't catch up yet: it was added in October: commit a1b2c8ff955b30807ac53ce6bdc97cf61a7262ca Author: Yan, Zheng z...@redhat.com Date: Thu Oct 2 19:07:41 2014 +0800 client: posix file lock support Signed-off-by: Yan, Zheng z...@redhat.com Cheers, John
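The "owner" notion matches what flock() itself does with open file descriptions: two separate open()s of the same file are distinct owners and contend for the lock, while a dup()ed fd shares one. A quick illustration with plain POSIX flock from Python (a libcephfs caller would pass an analogous owner id through ceph_flock):

```python
import fcntl
import os
import tempfile

path = tempfile.NamedTemporaryFile(delete=False).name
f1 = open(path, "w")
f2 = open(path, "w")  # separate open file description => separate lock owner

fcntl.flock(f1, fcntl.LOCK_EX)
try:
    # A different owner asking for the same exclusive lock fails...
    fcntl.flock(f2, fcntl.LOCK_EX | fcntl.LOCK_NB)
    conflict = False
except BlockingIOError:
    conflict = True
assert conflict

# ...until the first owner releases it.
fcntl.flock(f1, fcntl.LOCK_UN)
fcntl.flock(f2, fcntl.LOCK_EX | fcntl.LOCK_NB)  # now succeeds

f1.close(); f2.close(); os.unlink(path)
```
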
Re: CDS process
+1 I've always found the sometimes-populated-sometimes-not blueprint docs confusing. John On Thu, Feb 5, 2015 at 2:50 PM, Sage Weil sw...@redhat.com wrote: I wonder if we should simplify the cds workflow a bit to go straight to an etherpad outline of the blueprint instead of the wiki blueprint doc. I find it a bit disorienting to be flipping between the two, and after the fact find it frustrating that there isn't a single reference to go back to for the outcome of the session (you have to look at both the pad and the bp). Perhaps just using the pad from the get-go will streamline things a bit and make it a little more lightweight? What does everyone think? sage
Re: using ceph-deploy on build after make install
I suspect that your clue is Failed to execute command: /usr/sbin/service ceph -- as a rule, service scripts are part of the per-distro packaging rather than make install. Personally, if I'm installing systemwide I always start with some built packages, and if I need to substitute a home-built binary for debugging then I do so by directly overwriting it in /usr/bin/. John On Wed, Feb 4, 2015 at 12:58 AM, Deneau, Tom tom.den...@amd.com wrote: New to ceph building but here is my situation... I have been successfully able to build ceph starting from git checkout firefly (also successful from git checkout master). After building, I am able to run vstart.sh from the source directory as ./vstart.sh -d -n -x (or with -X). I can then do rados commands such as rados bench. I should also add that when I have installed binaries from rpms (this is a fedora21 aarch64 system), I have been successfully able to deploy a cluster using various ceph-deploy commands. Now I would like to do make install to install my built version and then use the installed version with my ceph-deploy commands. In this case I installed ceph-deploy with pip install ceph-deploy which gives me 5.21. The first ceph-deploy command I use is: ceph-deploy new myhost which seems to work fine. The next command however is ceph-deploy mon create-initial which ends up failing with [INFO ] Running command: /usr/sbin/service ceph -c /etc/ceph/ceph.conf start mon.hostname [WARNIN] The service command supports only basic LSB actions (start, stop, restart, try-restart, reload, force-reload, status). For other actions, please try to use systemctl.
[ERROR ] RuntimeError: command returned non-zero exit status: 2 [ERROR ] Failed to execute command: /usr/sbin/service ceph -c /etc/ceph/ceph.conf start mon.seattle-tdeneau [ERROR ] GenericError: Failed to create 1 monitors and even the ceph status command fails # ceph -c ./ceph.conf status -1 monclient(hunting): ERROR: missing keyring, cannot use cephx for authentication 0 librados: client.admin initialization error (2) No such file or directory Error connecting to cluster: ObjectNotFound Whereas this all worked fine when I used binaries from rpms. Is there some install step that I am missing? -- Tom Deneau, AMD
Re: [ceph-users] chattr +i not working with cephfs
On Wed, Jan 28, 2015 at 5:23 PM, Gregory Farnum g...@gregs42.com wrote: My concern is whether we as the FS are responsible for doing anything more than storing and returning that immutable flag — are we supposed to block writes to anything that has it set? That could be much trickier... The VFS layer is checking the flag for us, but some filesystems do have paths where they need to do their own checks too (e.g. XFS has various ioctls that do explicit checks). It's also up to us to publish the S_IMMUTABLE bit to the i_flags attribute of the generic inode, based on wherever/however we store the flag ourselves. Fuse doesn't seem to have a path for us to update i_flags though, so it might be that we either have to extend that interface or do the checking ourselves in userspace in order to support it there. John
Re: [ceph-users] chattr +i not working with cephfs
We don't implement the GETFLAGS and SETFLAGS ioctls used for +i. Adding the ioctls is pretty easy, but then we need somewhere to put the flags. Currently we don't store a flags attribute on inodes, but maybe we could borrow the high bits of the mode attribute for this if we wanted to implement it? CCing ceph-devel to see if Sage/Greg can offer any more background. John On Wed, Jan 28, 2015 at 1:24 AM, Eric Eastman eric.east...@keepertech.com wrote: Should chattr +i work with cephfs? Using ceph v0.91 and a 3.18 kernel on the CephFS client, I tried this: # mount | grep ceph 172.16.30.10:/ on /cephfs/test01 type ceph (name=cephfs,key=client.cephfs) # echo 1 > /cephfs/test01/test.1 # ls -l /cephfs/test01/test.1 -rw-r--r-- 1 root root 2 Jan 27 19:09 /cephfs/test01/test.1 # chattr +i /cephfs/test01/test.1 chattr: Inappropriate ioctl for device while reading flags on /cephfs/test01/test.1 I also tried it using the FUSE interface: # ceph-fuse -m 172.16.30.10 /cephfs/fuse01/ ceph-fuse[5326]: starting ceph client 2015-01-27 19:54:59.002563 7f6f8fbcb7c0 -1 init, newargv = 0x2ec2be0 newargc=11 ceph-fuse[5326]: starting fuse # mount | grep ceph ceph-fuse on /cephfs/fuse01 type fuse.ceph-fuse (rw,nosuid,nodev,allow_other,default_permissions) # echo 1 > /cephfs/fuse01/test02.dat # chattr +i /cephfs/fuse01/test02.dat chattr: Invalid argument while reading flags on /cephfs/fuse01/test02.dat Eric ___ ceph-users mailing list ceph-us...@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
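The "borrow the high bits of the mode attribute" idea could be sketched like this; the flag value and helpers are hypothetical illustrations, not an agreed on-disk format. POSIX type and permission bits fit in the low 16 bits of the mode, so a high bit of a 32-bit stored mode is free:

```python
POSIX_MODE_MASK = 0o177777        # low 16 bits: S_IFMT + permissions
CEPH_FLAG_IMMUTABLE = 1 << 31     # assumption: an otherwise-unused high bit

def set_immutable(stored_mode, immutable):
    if immutable:
        return stored_mode | CEPH_FLAG_IMMUTABLE
    return stored_mode & ~CEPH_FLAG_IMMUTABLE

def is_immutable(stored_mode):
    return bool(stored_mode & CEPH_FLAG_IMMUTABLE)

def posix_mode(stored_mode):
    # What we'd report to stat()/getattr: borrowed bits stripped out.
    return stored_mode & POSIX_MODE_MASK

mode = 0o100644                   # regular file, rw-r--r--
mode = set_immutable(mode, True)
assert is_immutable(mode)
assert posix_mode(mode) == 0o100644
```
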
Re: MDS aborted after recovery and active, FAILED assert (r >= 0)
Hmm, upgrading should help here, as the problematic data structure (anchortable) no longer exists in the latest version. I haven't checked, but hopefully we don't try to write it during upgrades. The bug you're hitting is more or less the same as a similar one we have with the sessiontable in the latest ceph, but you won't hit it there unless you're very unlucky! John On Fri, Jan 16, 2015 at 7:37 AM, Mohd Bazli Ab Karim bazli.abka...@mimos.my wrote: Dear Ceph-Users, Ceph-Devel, Apologize me if you get double post of this email. I am running a ceph cluster version 0.72.2 and one MDS (in fact, it's 3, 2 down and only 1 up) at the moment. Plus I have one CephFS client mounted to it. Now, the MDS always gets aborted after recovery and active for 4 secs. Some parts of the log are as below: -3 2015-01-15 14:10:28.464706 7fbcc8226700 1 -- 10.4.118.21:6800/5390 <== osd.19 10.4.118.32:6821/243161 73 osd_op_reply(3742 1000240c57e. [create 0~0,setxattr (99)] v56640'1871414 uv1871414 ondisk = 0) v6 221+0+0 (261801329 0 0) 0x7770bc80 con 0x69c7dc0 -2 2015-01-15 14:10:28.464730 7fbcc8226700 1 -- 10.4.118.21:6800/5390 <== osd.18 10.4.118.32:6818/243072 67 osd_op_reply(3645 107941c.
[tmapup 0~0] v56640'1769567 uv1769567 ondisk = 0) v6 179+0+0 (3759887079 0 0) 0x7757ec80 con 0x1c6bb00 -1 2015-01-15 14:10:28.464754 7fbcc8226700 1 -- 10.4.118.21:6800/5390 <== osd.47 10.4.118.35:6809/8290 79 osd_op_reply(3419 mds_anchortable [writefull 0~94394932] v0'0 uv0 ondisk = -90 (Message too long)) v6 174+0+0 (3942056372 0 0) 0x69f94a00 con 0x1c6b9a0 0 2015-01-15 14:10:28.471684 7fbcc8226700 -1 mds/MDSTable.cc: In function 'void MDSTable::save_2(int, version_t)' thread 7fbcc8226700 time 2015-01-15 14:10:28.46 mds/MDSTable.cc: 83: FAILED assert(r >= 0) ceph version () 1: (MDSTable::save_2(int, unsigned long)+0x325) [0x769e25] 2: (Context::complete(int)+0x9) [0x568d29] 3: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0x1097) [0x7c15d7] 4: (MDS::handle_core_message(Message*)+0x5a0) [0x588900] 5: (MDS::_dispatch(Message*)+0x2f) [0x58908f] 6: (MDS::ms_dispatch(Message*)+0x1e3) [0x58ab93] 7: (DispatchQueue::entry()+0x549) [0x975739] 8: (DispatchQueue::DispatchThread::entry()+0xd) [0x8902dd] 9: (()+0x7e9a) [0x7fbcccb0de9a] 10: (clone()+0x6d) [0x7fbccb4ba3fd] NOTE: a copy of the executable, or `objdump -rdS executable` is needed to interpret this. Is there any workaround/patch to fix this issue? Let me know if you need to see the log at a certain debug-mds level as well. Any help would be very much appreciated. Thanks. Bazli DISCLAIMER: This e-mail (including any attachments) is for the addressee(s) only and may be confidential, especially as regards personal data. If you are not the intended recipient, please note that any dealing, review, distribution, printing, copying or use of this e-mail is strictly prohibited. If you have received this email in error, please notify the sender immediately and delete the original message (including any attachments). MIMOS Berhad is a research and development institution under the purview of the Malaysian Ministry of Science, Technology and Innovation.
Opinions, conclusions and other information in this e-mail that do not relate to the official business of MIMOS Berhad and/or its subsidiaries shall be understood as neither given nor endorsed by MIMOS Berhad and/or its subsidiaries and neither MIMOS Berhad nor its subsidiaries accepts responsibility for the same. All liability arising from or in connection with computer viruses and/or corrupted e-mails is excluded to the fullest extent permitted by law.
Re: MDS aborted after recovery and active, FAILED assert (r >= 0)
It has just been pointed out to me that you can also work around this issue on your existing system by increasing the osd_max_write_size setting on your OSDs (default 90MB) to something higher, but still smaller than your osd journal size. That might get you on a path to having an accessible filesystem before you consider an upgrade. John

On Fri, Jan 16, 2015 at 10:57 AM, John Spray john.sp...@redhat.com wrote: Hmm, upgrading should help here, as the problematic data structure (anchortable) no longer exists in the latest version. I haven't checked, but hopefully we don't try to write it during upgrades. The bug you're hitting is more or less the same as a similar one we have with the sessiontable in the latest ceph, but you won't hit it there unless you're very unlucky! John

On Fri, Jan 16, 2015 at 7:37 AM, Mohd Bazli Ab Karim bazli.abka...@mimos.my wrote: Dear Ceph-Users, Ceph-Devel, Apologies if you get a double post of this email. I am running a ceph cluster version 0.72.2 and one MDS (in fact, it's 3: 2 down and only 1 up) at the moment. Plus I have one CephFS client mounted to it. Now, the MDS always gets aborted after recovery and being active for 4 secs. Some parts of the log are as below:

-3 2015-01-15 14:10:28.464706 7fbcc8226700 1 -- 10.4.118.21:6800/5390 <== osd.19 10.4.118.32:6821/243161 73 ==== osd_op_reply(3742 1000240c57e. [create 0~0,setxattr (99)] v56640'1871414 uv1871414 ondisk = 0) v6 ==== 221+0+0 (261801329 0 0) 0x7770bc80 con 0x69c7dc0
-2 2015-01-15 14:10:28.464730 7fbcc8226700 1 -- 10.4.118.21:6800/5390 <== osd.18 10.4.118.32:6818/243072 67 ==== osd_op_reply(3645 107941c. [tmapup 0~0] v56640'1769567 uv1769567 ondisk = 0) v6 ==== 179+0+0 (3759887079 0 0) 0x7757ec80 con 0x1c6bb00
-1 2015-01-15 14:10:28.464754 7fbcc8226700 1 -- 10.4.118.21:6800/5390 <== osd.47 10.4.118.35:6809/8290 79 ==== osd_op_reply(3419 mds_anchortable [writefull 0~94394932] v0'0 uv0 ondisk = -90 (Message too long)) v6 ==== 174+0+0 (3942056372 0 0) 0x69f94a00 con 0x1c6b9a0
0 2015-01-15 14:10:28.471684 7fbcc8226700 -1 mds/MDSTable.cc: In function 'void MDSTable::save_2(int, version_t)' thread 7fbcc8226700 time 2015-01-15 14:10:28.46
mds/MDSTable.cc: 83: FAILED assert(r >= 0)
ceph version ()
1: (MDSTable::save_2(int, unsigned long)+0x325) [0x769e25]
2: (Context::complete(int)+0x9) [0x568d29]
3: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0x1097) [0x7c15d7]
4: (MDS::handle_core_message(Message*)+0x5a0) [0x588900]
5: (MDS::_dispatch(Message*)+0x2f) [0x58908f]
6: (MDS::ms_dispatch(Message*)+0x1e3) [0x58ab93]
7: (DispatchQueue::entry()+0x549) [0x975739]
8: (DispatchQueue::DispatchThread::entry()+0xd) [0x8902dd]
9: (()+0x7e9a) [0x7fbcccb0de9a]
10: (clone()+0x6d) [0x7fbccb4ba3fd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Is there any workaround/patch to fix this issue? Let me know if you need to see the log with debug-mds at a certain level as well. Any help would be very much appreciated. Thanks. Bazli DISCLAIMER: This e-mail (including any attachments) is for the addressee(s) only and may be confidential, especially as regards personal data. If you are not the intended recipient, please note that any dealing, review, distribution, printing, copying or use of this e-mail is strictly prohibited. If you have received this email in error, please notify the sender immediately and delete the original message (including any attachments). MIMOS Berhad is a research and development institution under the purview of the Malaysian Ministry of Science, Technology and Innovation.
Opinions, conclusions and other information in this e-mail that do not relate to the official business of MIMOS Berhad and/or its subsidiaries shall be understood as neither given nor endorsed by MIMOS Berhad and/or its subsidiaries and neither MIMOS Berhad nor its subsidiaries accepts responsibility for the same. All liability arising from or in connection with computer viruses and/or corrupted e-mails is excluded to the fullest extent permitted by law. -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
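The workaround John describes corresponds to a ceph.conf fragment along these lines. The value of 150 MB is purely illustrative (an assumption, not from the thread): it must exceed the failing ~90 MB anchortable write while staying below the OSD journal size.

```ini
[osd]
; Raise the per-op write size cap (default 90 MB). Keep this below the
; configured OSD journal size or journaling will fail instead.
osd max write size = 150
```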
Re: New Defects reported by Coverity Scan for ceph
Hmm, maybe it's just because they're in a main() function? I notice that an exception handler was added to ceph_authtool.cc to handle the same coverity complaint there a few months ago. John

On Fri, Jan 16, 2015 at 3:17 PM, Gregory Farnum g...@gregs42.com wrote: Sage, are these uncaught assertion errors something we normally ignore? I'm not familiar with any code that tries to catch errors in our standard init patterns, which is what looks to be the problem on these new coverity issues in cephfs-table-tool. -Greg

On Fri, Jan 16, 2015 at 6:39 AM, scan-ad...@coverity.com wrote: Hi, Please find the latest report on new defect(s) introduced to ceph found with Coverity Scan. 4 new defect(s) introduced to ceph found with Coverity Scan. 19 defect(s), reported by Coverity Scan earlier, were marked fixed in the recent build analyzed by Coverity Scan.

New defect(s) Reported-by: Coverity Scan
Showing 4 of 4 defect(s)

** CID 1264457: Uncaught exception (UNCAUGHT_EXCEPT) /tools/cephfs/cephfs-table-tool.cc: 11 in main()
** CID 1264458: Uninitialized scalar field (UNINIT_CTOR) /test/librbd/test_ImageWatcher.cc: 47 in TestImageWatcher::WatchCtx::WatchCtx(TestImageWatcher &)()
** CID 1264459: Uninitialized scalar field (UNINIT_CTOR) /test/librbd/test_fixture.cc: 44 in TestFixture::TestFixture()()
** CID 1264460: Structurally dead code (UNREACHABLE) /common/sync_filesystem.h: 51 in sync_filesystem(int)()

*** CID 1264457: Uncaught exception (UNCAUGHT_EXCEPT)
/tools/cephfs/cephfs-table-tool.cc: 11 in main()
5 #include "common/errno.h"
6 #include "global/global_init.h"
7
8 #include "TableTool.h"
9
10
CID 1264457: Uncaught exception (UNCAUGHT_EXCEPT) In function main(int, char const **) an exception of type ceph::FailedAssertion is thrown and never caught.
11 int main(int argc, const char **argv)
12 {
13   vector<const char*> args;
14   argv_to_vec(argc, argv, args);
15   env_to_vec(args);
16

*** CID 1264458: Uninitialized scalar field (UNINIT_CTOR)
/test/librbd/test_ImageWatcher.cc: 47 in TestImageWatcher::WatchCtx::WatchCtx(TestImageWatcher &)()
41   NOTIFY_OP_REQUEST_LOCK = 2,
42   NOTIFY_OP_HEADER_UPDATE = 3
43 };
44
45 class WatchCtx : public librados::WatchCtx2 {
46 public:
CID 1264458: Uninitialized scalar field (UNINIT_CTOR) Non-static class member m_handle is not initialized in this constructor nor in any functions that it calls.
47   WatchCtx(TestImageWatcher &parent) : m_parent(parent) {}
48
49   int watch(const
Re: 'Immutable bit' on pools to prevent deletion
On Thu, Jan 15, 2015 at 6:07 PM, Sage Weil sw...@redhat.com wrote: What would that buy us? Preventing injectargs on it would require mon restarts, which is unfortunate -- and makes it sound more like a security feature than a safety blanket. I meant 'ceph tell mon.* injectargs ...' as distinct from 'ceph daemon ... config set', which requires access to the host. But yeah, if we went to the effort to limit injectargs (maybe a blanket option that disables injectargs on mons?), it could double as a security feature. But whether it may also be useful for security doesn't change whether it is a good safety blanket. I like it because it's simple, easy to implement, and easy to disable for testing... :) The trouble with the admin socket part is that any tool that manages Ceph must then use the admin socket interface as well as the normal over-the-network command interface, and by extension must be able to execute locally on a mon. We would no longer have a comprehensive remote management interface for the mon: management tools would have to run some code locally too. I think it's sufficient to require two API calls (set the flag or config option, then do the delete) within the remote API, rather than requiring that anyone driving the interface knows how to speak two network protocols (the usual mon remote command + SSH-to-asok). John -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 'Immutable bit' on pools to prevent deletion
On Thu, Jan 15, 2015 at 7:07 PM, Sage Weil sw...@redhat.com wrote: The trouble with the admin socket part is that any tool that manages Ceph must use the admin socket interface as well as the normal over-the-network command interface, and by extension must be able to execute locally on a mon. We would no longer have a comprehensive remote management interface for the mon: management tools would have to run some code locally too. True... if we make that option enabled by default. If it's off by default then it's an opt-in layer of protection. Most clusters don't have ephemeral pools so I think lots of people would want this. +1, the problem goes away if it's opt-in; it should be easy enough for API consumers to inspect the conf setting and give a nice informative error if the safety catch is engaged. I can imagine wanting to engage this ahead of plugging in a GUI or some config management recipes that you didn't quite trust yet. John -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
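The opt-in "safety catch" behaviour discussed above -- an API consumer checking the config setting and producing an informative error before attempting a delete -- might look like this sketch. The option name mon_pool_delete_protection and the FakeMon class are invented for illustration; they are not real Ceph APIs.

```python
class FakeMon:
    """Stand-in for a mon connection, for demonstration only."""
    def __init__(self, protected):
        self._conf = {"mon_pool_delete_protection":
                      "true" if protected else "false"}
        self.pools = {"rbd"}

    def conf_get(self, key):
        return self._conf[key]

    def _delete_pool(self, pool):
        self.pools.discard(pool)


def delete_pool(mon, pool):
    # Check the safety catch first and fail with a friendly message.
    if mon.conf_get("mon_pool_delete_protection") == "true":
        raise RuntimeError(
            "pool deletion is disabled; clear mon_pool_delete_protection first")
    mon._delete_pool(pool)


guarded = FakeMon(protected=True)
try:
    delete_pool(guarded, "rbd")
except RuntimeError as e:
    print("refused:", e)

open_mon = FakeMon(protected=False)
delete_pool(open_mon, "rbd")
print("rbd" in open_mon.pools)
```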
Re: Ceph backports
Sounds sane -- is the new plan to always do backports via this process? i.e. if I see a backport PR which has not been through integration testing, should I refrain from merging it? John

On Mon, Jan 5, 2015 at 11:53 AM, Loic Dachary l...@dachary.org wrote: Hi Ceph, I'm going to spend time to care for the Ceph backports (i.e. help reduce the time they stay in pull requests or redmine tickets). It should roughly go as follows:

0. Developer follows the normal process to land a PR to master. Once complete and the ticket is marked Pending Backport, this process initiates.
1. I periodically poll Redmine to look for tickets in the Pending Backport state.
2. I find the commit associated with the Redmine ticket and cherry-pick it to a backport integration branch off of the desired maintenance branch (Dumpling, Firefly, etc). (Note: a patch may require backporting to multiple branches.)
3. I resolve any merge conflicts with the cherry-picked commit.
4. Once satisfied with the group of backported commits on the integration branch, I notify QE.
5. QE tests the backport integration branch against the appropriate suites.
6a. If QE is satisfied with the test results, they merge the backport integration branch.
6b. If QE is NOT satisfied with the test results, they indicate the backport integration branch is NOT ready to merge and return it to me to work with the original developer to resolve the issue, returning to steps 2/3.
7. The ticket is moved to Resolved once the backport integration branch containing the cherry-picked backport is merged to the desired maintenance branch(es).

I'll first try to implement this semi-manually and document / script when convenient. If anyone has ideas to improve this tentative process, now is the time :-) Cheers -- Loïc Dachary, Artisan Logiciel Libre -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
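The cherry-pick step of the process above can be sketched with plain git in a throwaway repository. Everything here (branch names, commit messages, the temp repo) is invented for illustration; the one substantive detail is `git cherry-pick -x`, which records the original master commit id in the backported commit's message so its provenance is traceable.

```shell
set -e
workdir=$(mktemp -d)
cd "$workdir"
git init -q repo && cd repo
git config user.email backporter@example.com
git config user.name backporter

# Simulate history: a base commit shared with the maintenance branch...
echo base > f.txt && git add f.txt && git commit -qm "base"
git branch firefly                              # pretend maintenance branch

# ...then a fix lands on master and is marked Pending Backport.
echo fix >> f.txt && git add f.txt && git commit -qm "fix: bug 123"
fix_sha=$(git rev-parse HEAD)

# Create the integration branch off the maintenance branch and cherry-pick.
git checkout -q -b firefly-backports firefly
git cherry-pick -x "$fix_sha"

# -x appended "(cherry picked from commit ...)" to the commit message.
git log -1 --format=%B
```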
Re: [ceph-calamari] 'pg stat' structured output
From a quick read of the code, calamari just uses pg dump pgs_brief, and calculates its own totals etc -- this change shouldn't affect it. John

On Sun, Dec 21, 2014 at 3:18 PM, Sage Weil sw...@redhat.com wrote: I recently switched this around to not do a full 'pg dump' for the formatted 'pg stat' command. The problem is that the current code that does the num_pg_by_state uses the state name as the key. This includes 'active+clean' and other instances of '+', which is not a valid character for an XML token. Is calamari relying on this code anywhere or can we switch this around to be

<state><name>active+clean</name><num>123</num></state>

(or equivalent JSON)? sage

GET pg/stat: json 200
GET pg/stat: xml 200
FAILURE: url http://localhost:5000/api/v0.1/pg/stat Invalid XML returned: not well-formed (invalid token): line 4, column 40
Response content:

<response>
<output>
<pg_summary><num_pg_by_state><active+clean>24</active+clean></num_pg_by_state><version>55</version><num_pgs>24</num_pgs><num_bytes>3892</num_bytes><raw_bytes_used>491817414656</raw_bytes_used><raw_bytes_avail>663907856384</raw_bytes_avail><raw_bytes>1217698979840</raw_bytes></pg_summary>
</output>
<status>OK</status>
</response>

___ ceph-calamari mailing list ceph-calam...@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-calamari-ceph.com -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
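The '+' problem, and why Sage's proposed shape fixes it, is easy to demonstrate with Python's XML parser: an element *name* may not contain '+', but element *content* may.

```python
import xml.etree.ElementTree as ET

# Current output: the state name itself is used as the element name.
# '+' is not a legal character in an XML name, so parsing fails.
bad = "<num_pg_by_state><active+clean>24</active+clean></num_pg_by_state>"
try:
    ET.fromstring(bad)
    ok = True
except ET.ParseError:
    ok = False
print("bad form parses:", ok)

# Proposed form: the state name moves into element content, which is fine.
good = ("<num_pg_by_state><state><name>active+clean</name>"
        "<num>24</num></state></num_pg_by_state>")
state = ET.fromstring(good).find("state")
print(state.find("name").text, state.find("num").text)
```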
Re: Where do MDSHealthMetrics show up?
On Fri, Nov 21, 2014 at 1:54 AM, Michael Sevilla mikesevil...@gmail.com wrote: Hi. Where do the MDSHealthMetrics in MMDSBeacon (e.g., MDS_HEALTH_TRIM) show up in the monitors? When we run ceph -s? I suspect I don't see them because I'd have to run ceph -s at the exact moment when the MDS is trimming. Is there an easier way to see these warnings or is there some debug flag I need to turn on? In the specific case of MDS_HEALTH_TRIM, this is aimed at detecting systems that are trimming at a pathologically bad rate (or perhaps stuck entirely due to a bug), so in such an unhealthy system we would expect the state to stick around for a while -- it shouldn't just be a blink-and-you-miss-it status. However, you would have to look at the status sometime in the unhealthy period: there's currently nothing in the cluster log for that health check. For the new MDS health warnings, we have some overlapping coverage between health indications (i.e. things that show up in ceph -s) and cluster log messages (i.e. things that show up in ceph -w). There is a general problem here for the health stuff (not just for the MDS things) that it is only generated on-demand when someone looks at it -- e.g. things like clock skew also only show up if you happen to run ceph -s at the right moment. Internally this corresponds to the various get_health() functions in the mon subsystems. It would be good to have a generic way for health indicators (MDS and beyond) to emit clog messages when they appear and disappear, so that you don't have to look at the status at the right moment. That would be a little hard to implement at the moment because the health messages are just freeform strings, but I put some notes on cleaning up health reporting here a while back: http://tracker.ceph.com/issues/7192 Cheers, John -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
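The generic mechanism John describes -- emitting cluster log messages when a health indicator appears or disappears, rather than only reporting on demand -- could be as simple as diffing the set of active health checks between refreshes. A sketch, with all names invented for illustration:

```python
def clog_transitions(prev, curr):
    """Given the previous and current sets of active health checks,
    return the cluster-log messages to emit for this transition."""
    msgs = ["Health check failed: %s" % c for c in sorted(curr - prev)]
    msgs += ["Health check cleared: %s" % c for c in sorted(prev - curr)]
    return msgs


# MDS_HEALTH_TRIM appears on one refresh, disappears on a later one:
tick1 = clog_transitions(set(), {"MDS_HEALTH_TRIM"})
tick2 = clog_transitions({"MDS_HEALTH_TRIM"}, set())
print(tick1)
print(tick2)
```

This only works if health states are structured identifiers rather than freeform strings, which is exactly the cleanup the linked tracker issue proposes.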
Re: Consul
Consul is an interesting thing. It had crossed my mind as a service monitoring/discovery thing for cases where: * We have services other than Ceph in the IO path (e.g. apache, samba, nfs) * The mons aren't happy (something to tell me which of my mons are up even if there is no quorum) * There might be multiple Ceph clusters and you want one health state reflecting both clusters Most other monitoring tools (e.g. calamari) take the shortcut of having a single central monitoring server -- something consul-esque that is lighter-weight and more resilient could be a step forward for cluster monitoring applications in more flexible and less enterprisey environments. The whole "separate service, but it's lightweight so that's okay" approach is embodied by Consul. I think there is an alternative path available, which I think of as "we already have a consensus system, let's make a way to plug monitoring applications on top of it" -- a way to plug extra smarts into the mons could be interesting too. John On Wed, Nov 5, 2014 at 12:38 AM, Loic Dachary l...@dachary.org wrote: Hi Ceph, While at the OpenStack summit Dan Bode spoke highly of Consul ( https://consul.io/intro/index.html ). Its scope is new to me. Each individual feature is familiar but I'm not entirely sure if combining them into a single piece of software is necessary. And I wonder how it could relate to Ceph. It is entirely possible that it does not even make sense to ask these questions ;-) Cheers -- Loïc Dachary, Artisan Logiciel Libre -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/7] ceph: remove unused `map_waiters` from osdc client
This is initialized but never used.

Signed-off-by: John Spray john.sp...@redhat.com
---
 include/linux/ceph/osd_client.h | 1 -
 net/ceph/osd_client.c | 1 -
 2 files changed, 2 deletions(-)

diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
index 03aeb27..7cb5cea 100644
--- a/include/linux/ceph/osd_client.h
+++ b/include/linux/ceph/osd_client.h
@@ -197,7 +197,6 @@ struct ceph_osd_client {
 	struct ceph_osdmap *osdmap; /* current map */
 	struct rw_semaphore map_sem;
-	struct completion map_waiters;
 	u64 last_requested_map;

 	struct mutex request_mutex;
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 5a75395..75ab07c 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -2537,7 +2537,6 @@ int ceph_osdc_init(struct ceph_osd_client *osdc, struct ceph_client *client)
 	osdc->client = client;
 	osdc->osdmap = NULL;
 	init_rwsem(&osdc->map_sem);
-	init_completion(&osdc->map_waiters);
 	osdc->last_requested_map = 0;
 	mutex_init(&osdc->request_mutex);
 	osdc->last_tid = 0;
--
1.9.3
-- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/7] ceph: update ceph_msg_header structure
2 bytes of what was reserved space is now used by userspace for the compat_version field.

Signed-off-by: John Spray john.sp...@redhat.com
---
 include/linux/ceph/msgr.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/ceph/msgr.h b/include/linux/ceph/msgr.h
index cac4b28..1c18872 100644
--- a/include/linux/ceph/msgr.h
+++ b/include/linux/ceph/msgr.h
@@ -152,7 +152,8 @@ struct ceph_msg_header {
 	   receiver: mask against ~PAGE_MASK */

 	struct ceph_entity_name src;
-	__le32 reserved;
+	__le16 compat_version;
+	__le16 reserved;
 	__le32 crc;       /* header crc32c */
 } __attribute__ ((packed));
--
1.9.3
-- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 7/7] ceph: include osd epoch barrier in debugfs
This is useful in our automated testing, so that we can verify that the barrier is propagating correctly between servers and clients.

Signed-off-by: John Spray john.sp...@redhat.com
---
 fs/ceph/debugfs.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/ceph/debugfs.c b/fs/ceph/debugfs.c
index 5d5a4c8..60db629 100644
--- a/fs/ceph/debugfs.c
+++ b/fs/ceph/debugfs.c
@@ -174,6 +174,9 @@ static int mds_sessions_show(struct seq_file *s, void *ptr)
 	/* The -o name mount argument */
 	seq_printf(s, "name \"%s\"\n", opt->name ? opt->name : "");

+	/* The latest OSD epoch barrier known to this client */
+	seq_printf(s, "osd_epoch_barrier \"%d\"\n", mdsc->cap_epoch_barrier);
+
 	/* The list of MDS session rank+state */
 	for (mds = 0; mds < mdsc->max_sessions; mds++) {
 		struct ceph_mds_session *session =
--
1.9.3
-- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 5/7] ceph: update CAPRELEASE message format
Version 2 includes the new osd epoch barrier field. This allows clients to inform servers that their released caps may not be used until a particular OSD map epoch.

Signed-off-by: John Spray john.sp...@redhat.com
---
 fs/ceph/mds_client.c | 13 +
 fs/ceph/mds_client.h | 8 ++-
 2 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index dce7977..3f5bc23 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -1508,12 +1508,25 @@ void ceph_send_cap_releases(struct ceph_mds_client *mdsc,
 			    struct ceph_mds_session *session)
 {
 	struct ceph_msg *msg;
+	u32 *cap_barrier;

 	dout("send_cap_releases mds%d\n", session->s_mds);
 	spin_lock(&session->s_cap_lock);
 	while (!list_empty(&session->s_cap_releases_done)) {
 		msg = list_first_entry(&session->s_cap_releases_done,
 				       struct ceph_msg, list_head);
+
+		BUG_ON(msg->front.iov_len + sizeof(*cap_barrier) >
+		       PAGE_CACHE_SIZE);
+
+		// Append cap_barrier field
+		cap_barrier = msg->front.iov_base + msg->front.iov_len;
+		*cap_barrier = cpu_to_le32(mdsc->cap_epoch_barrier);
+		msg->front.iov_len += sizeof(*cap_barrier);
+
+		msg->hdr.version = cpu_to_le16(2);
+		msg->hdr.compat_version = cpu_to_le16(1);
+
 		list_del_init(&msg->list_head);
 		spin_unlock(&session->s_cap_lock);
 		msg->hdr.front_len = cpu_to_le32(msg->front.iov_len);
diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
index 7b40568..b9412a8 100644
--- a/fs/ceph/mds_client.h
+++ b/fs/ceph/mds_client.h
@@ -92,10 +92,14 @@ struct ceph_mds_reply_info_parsed {

 /*
  * cap releases are batched and sent to the MDS en masse.
+ *
+ * Account for per-message overhead of mds_cap_release header
+ * and u32 for osd epoch barrier trailing field.
  */
 #define CEPH_CAPS_PER_RELEASE ((PAGE_CACHE_SIZE -		\
-				sizeof(struct ceph_mds_cap_release)) /	\
-				sizeof(struct ceph_mds_cap_item))
+				sizeof(struct ceph_mds_cap_release) -	\
+				sizeof(u32)) /				\
+				sizeof(struct ceph_mds_cap_item))

 /*
--
1.9.3
-- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
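The macro change above just reserves one extra u32 per page for the trailing barrier. With made-up sizes (assumptions for the sake of the arithmetic, not the real kernel struct sizes), the calculation looks like:

```python
# Illustrative recomputation of CEPH_CAPS_PER_RELEASE. The struct sizes
# below are assumptions for this sketch, not the real kernel values.
PAGE_CACHE_SIZE = 4096
SIZEOF_CAP_RELEASE_HEAD = 4   # assumed sizeof(struct ceph_mds_cap_release)
SIZEOF_U32 = 4                # the appended osd epoch barrier field
SIZEOF_CAP_ITEM = 24          # assumed sizeof(struct ceph_mds_cap_item)

caps_per_release = (PAGE_CACHE_SIZE
                    - SIZEOF_CAP_RELEASE_HEAD
                    - SIZEOF_U32) // SIZEOF_CAP_ITEM
print(caps_per_release)
```

Sizing the message this way is what makes the later `BUG_ON(msg->front.iov_len + sizeof(*cap_barrier) > PAGE_CACHE_SIZE)` hold: the barrier always fits in the space the macro set aside.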
[PATCH 3/7] ceph: add ceph_osdc_cancel_writes
To allow us to abort writes in ENOSPC conditions, instead of having them block indefinitely.

Signed-off-by: John Spray john.sp...@redhat.com
---
 include/linux/ceph/osd_client.h |  8 +
 net/ceph/osd_client.c | 67 +
 2 files changed, 75 insertions(+)

diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
index 7cb5cea..f82000c 100644
--- a/include/linux/ceph/osd_client.h
+++ b/include/linux/ceph/osd_client.h
@@ -21,6 +21,7 @@ struct ceph_authorizer;
 /*
  * completion callback for async writepages
  */
+typedef void (*ceph_osdc_full_callback_t)(struct ceph_osd_client *, void *);
 typedef void (*ceph_osdc_callback_t)(struct ceph_osd_request *,
 				     struct ceph_msg *);
 typedef void (*ceph_osdc_unsafe_callback_t)(struct ceph_osd_request *, bool);
@@ -226,6 +227,9 @@ struct ceph_osd_client {
 	u64 event_count;

 	struct workqueue_struct	*notify_wq;
+
+	ceph_osdc_full_callback_t map_cb;
+	void *map_p;
 };

 extern int ceph_osdc_setup(void);
@@ -331,6 +335,7 @@ extern void ceph_osdc_put_request(struct ceph_osd_request *req);
 extern int ceph_osdc_start_request(struct ceph_osd_client *osdc,
 				   struct ceph_osd_request *req,
 				   bool nofail);
+extern u32 ceph_osdc_cancel_writes(struct ceph_osd_client *osdc, int r);
 extern void ceph_osdc_cancel_request(struct ceph_osd_request *req);
 extern int ceph_osdc_wait_request(struct ceph_osd_client *osdc,
 				  struct ceph_osd_request *req);
@@ -361,5 +366,8 @@ extern int ceph_osdc_create_event(struct ceph_osd_client *osdc,
 				  void *data, struct ceph_osd_event **pevent);
 extern void ceph_osdc_cancel_event(struct ceph_osd_event *event);
 extern void ceph_osdc_put_event(struct ceph_osd_event *event);
+
+extern void ceph_osdc_register_map_cb(struct ceph_osd_client *osdc,
+				      ceph_osdc_full_callback_t cb, void *data);
 #endif
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 75ab07c..eb7e735 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -836,6 +836,59 @@ __lookup_request_ge(struct ceph_osd_client *osdc,
 	return NULL;
 }

+/*
+ * Drop all pending write/modify requests and complete
+ * them with the `r` as return code.
+ *
+ * Returns the highest OSD map epoch of a request that was
+ * cancelled, or 0 if none were cancelled.
+ */
+u32 ceph_osdc_cancel_writes(
+	struct ceph_osd_client *osdc,
+	int r)
+{
+	struct ceph_osd_request *req;
+	struct rb_node *n = osdc->requests.rb_node;
+	u32 latest_epoch = 0;
+
+	dout("enter cancel_writes r=%d", r);
+
+	mutex_lock(&osdc->request_mutex);
+
+	while (n) {
+		req = rb_entry(n, struct ceph_osd_request, r_node);
+		n = rb_next(n);
+
+		ceph_osdc_get_request(req);
+		if (req->r_flags & CEPH_OSD_FLAG_WRITE) {
+			req->r_result = r;
+			complete_all(&req->r_completion);
+			complete_all(&req->r_safe_completion);
+
+			if (req->r_callback) {
+				// Requires callbacks used for write ops are
+				// amenable to being called with NULL msg
+				// (e.g. writepages_finish)
+				req->r_callback(req, NULL);
+			}
+
+			__unregister_request(osdc, req);
+
+			if (*req->r_request_osdmap_epoch > latest_epoch) {
+				latest_epoch = *req->r_request_osdmap_epoch;
+			}
+		}
+		ceph_osdc_put_request(req);
+	}
+
+	mutex_unlock(&osdc->request_mutex);
+
+	dout("complete cancel_writes latest_epoch=%ul", latest_epoch);
+
+	return latest_epoch;
+}
+EXPORT_SYMBOL(ceph_osdc_cancel_writes);
+
 static void __kick_linger_request(struct ceph_osd_request *req)
 {
 	struct ceph_osd_client *osdc = req->r_osdc;
@@ -2102,6 +2155,10 @@ done:
 	downgrade_write(&osdc->map_sem);
 	ceph_monc_got_osdmap(&osdc->client->monc, osdc->osdmap->epoch);

+	if (osdc->map_cb) {
+		osdc->map_cb(osdc, osdc->map_p);
+	}
+
 	/*
 	 * subscribe to subsequent osdmap updates if full to ensure
 	 * we find out when we are no longer full and stop returning
@@ -2125,6 +2182,14 @@ bad:
 	up_write(&osdc->map_sem);
 }

+void ceph_osdc_register_map_cb(struct ceph_osd_client *osdc,
+			       ceph_osdc_full_callback_t cb, void *data)
+{
+	osdc->map_cb = cb;
+	osdc->map_p = data;
+}
+EXPORT_SYMBOL(ceph_osdc_register_map_cb);
+
 /*
  * watch/notify callback event infrastructure
  *
@@ -2553,6 +2618,8 @@ int ceph_osdc_init(struct ceph_osd_client *osdc, struct ceph_client *client)
 	spin_lock_init(&osdc->event_lock);
 	osdc->event_tree = RB_ROOT;
 	osdc->event_count = 0;
+	osdc->map_cb = NULL;
+	osdc->map_p = NULL;

 	schedule_delayed_work(&osdc->osds_timeout_work
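The core of ceph_osdc_cancel_writes can be modelled in a few lines: walk the pending requests, complete every *write* with the given error, and remember the newest osdmap epoch among the cancelled ones. The flag value and dict-based request representation below are invented for this sketch (the real code walks an rbtree of `ceph_osd_request` structs).

```python
FLAG_WRITE = 0x20  # illustrative stand-in for CEPH_OSD_FLAG_WRITE


def cancel_writes(requests, r):
    """Complete all pending writes with error r; return the highest
    osdmap epoch among the cancelled requests (0 if none)."""
    latest_epoch = 0
    for req in list(requests):
        if req["flags"] & FLAG_WRITE:
            req["result"] = r        # complete the request with the error
            requests.remove(req)     # analogous to __unregister_request()
            latest_epoch = max(latest_epoch, req["epoch"])
    return latest_epoch


pending = [
    {"flags": FLAG_WRITE, "epoch": 100, "result": None},
    {"flags": 0,          "epoch": 104, "result": None},  # a read: untouched
    {"flags": FLAG_WRITE, "epoch": 105, "result": None},
]
e = cancel_writes(pending, -28)  # -28 standing in for -ENOSPC
print(e, len(pending))
```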
[PATCH 4/7] ceph: handle full condition by cancelling ops
While cancelling, we store the OSD epoch at time of cancellation; this will be used later in CAPRELEASE messages.

Signed-off-by: John Spray john.sp...@redhat.com
---
 fs/ceph/mds_client.c | 21 +
 fs/ceph/mds_client.h | 1 +
 2 files changed, 22 insertions(+)

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 9f00853..dce7977 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -3265,6 +3265,23 @@ static void delayed_work(struct work_struct *work)
 	schedule_delayed(mdsc);
 }

+/**
+ * Call this with map_sem held for read
+ */
+static void handle_osd_map(struct ceph_osd_client *osdc, void *p)
+{
+	struct ceph_mds_client *mdsc = (struct ceph_mds_client*)p;
+	u32 cancelled_epoch = 0;
+
+	if (osdc->osdmap->flags & CEPH_OSDMAP_FULL) {
+		cancelled_epoch = ceph_osdc_cancel_writes(osdc, -ENOSPC);
+		if (cancelled_epoch) {
+			mdsc->cap_epoch_barrier = max(cancelled_epoch + 1,
+						      mdsc->cap_epoch_barrier);
+		}
+	}
+}
+
 int ceph_mdsc_init(struct ceph_fs_client *fsc)
 {
@@ -3311,6 +3328,10 @@ int ceph_mdsc_init(struct ceph_fs_client *fsc)
 	ceph_caps_init(mdsc);
 	ceph_adjust_min_caps(mdsc, fsc->min_caps);

+	mdsc->cap_epoch_barrier = 0;
+
+	ceph_osdc_register_map_cb(&fsc->client->osdc,
+				  handle_osd_map, (void*)mdsc);
+
 	return 0;
 }
diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
index 230bda7..7b40568 100644
--- a/fs/ceph/mds_client.h
+++ b/fs/ceph/mds_client.h
@@ -298,6 +298,7 @@ struct ceph_mds_client {
 	int num_cap_flushing; /* # caps we are flushing */
 	spinlock_t cap_dirty_lock; /* protects above items */
 	wait_queue_head_t cap_flushing_wq;
+	u32 cap_epoch_barrier;

 	/*
 	 * Cap reservations
--
1.9.3
-- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
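The barrier update in handle_osd_map is a simple monotonic max: after cancelling writes at epoch E, released caps must not be trusted until at least epoch E+1, and the barrier never moves backwards. A sketch of that invariant:

```python
def update_cap_epoch_barrier(barrier, cancelled_epoch):
    """Mirror of the patch's logic: advance the barrier past the epoch
    of any cancelled writes, but never lower it."""
    if cancelled_epoch:
        return max(cancelled_epoch + 1, barrier)
    return barrier


print(update_cap_epoch_barrier(0, 10))   # first cancellation: barrier -> 11
print(update_cap_epoch_barrier(50, 10))  # an older epoch cannot lower it
print(update_cap_epoch_barrier(7, 0))    # no cancellations: unchanged
```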
Re: [ceph-users] the state of cephfs in giant
On Thu, Oct 30, 2014 at 10:55 AM, Florian Haas flor...@hastexo.com wrote: * ganesha NFS integration: implemented, no test coverage. I understood from a conversation I had with John in London that flock() and fcntl() support had recently been added to ceph-fuse, can this be expected to Just Work™ in Ganesha as well? To clarify this comment: flock in ceph-fuse was recently implemented (by Yan Zheng) in *master* rather than giant, so it's in line for hammer. Cheers, John -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Request for comments
We do have a lot of functions that bubble up standard error numbers -- in general, callers of these functions have to accommodate the possibility of any error code (including errors they don't understand). I notice that ::mount is a bit unusual in additionally having some explicit 'magic' return codes to indicate which stage of init failed. A pull request to document the magic -1001, -1002 etc. returns from that function would be welcome; it might not be useful to try and enumerate all possible error numbers for all API calls though -- it could be hard to prove that the list was comprehensive, and callers should always handle unexpected codes too. Cheers, John On Fri, Oct 10, 2014 at 5:22 PM, Barclay Jameson almightybe...@gmail.com wrote: When reading the code for libcephfs.cc (giant branch) it wasn't apparent to me, for lines 95 and 101, what return values were expected from the init and mount function calls other than 0. It was only after tracing a bit of code did I see that error codes such as -ENOENT and -EEXIST could be returned as well. It would be awesome if comments were added for these functions to show what the expected return values are going to be. Thanks, almightybeeij -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
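For the standard "bubbled up" error numbers, a caller can recover the name and description of a negative return mechanically, which is one reason enumerating them per-call adds little. The ceph-specific magic codes (-1001, -1002) have no errno name and do need explicit documentation. In Python, for example:

```python
import errno
import os

# Map the negative returns that tracing revealed (-ENOENT, -EEXIST)
# back to their symbolic names and descriptions.
for ret in (-errno.ENOENT, -errno.EEXIST):
    print(errno.errorcode[-ret], "->", os.strerror(-ret))
```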
Re: New Defects reported by Coverity Scan for ceph (fwd)
Nice to see that coverity and lockdep agree :-) This should go away with the fix for #9562. John

On Thu, Sep 25, 2014 at 4:02 PM, Sage Weil sw...@redhat.com wrote: -- Forwarded message -- From: scan-ad...@coverity.com To: undisclosed-recipients:; Cc: Date: Thu, 25 Sep 2014 06:18:46 -0700 Subject: New Defects reported by Coverity Scan for ceph Hi, Please find the latest report on new defect(s) introduced to ceph found with Coverity Scan.

Defect(s) Reported-by: Coverity Scan
Showing 1 of 1 defect(s)

** CID 1241497: Thread deadlock (ORDER_REVERSAL)

*** CID 1241497: Thread deadlock (ORDER_REVERSAL)
/osdc/Filer.cc: 314 in Filer::_do_purge_range(PurgeRange *, int)()
308     return;
309   }
310
311   int max = 10 - pr->uncommitted;
312   while (pr->num > 0 && max > 0) {
313     object_t oid = file_object_t(pr->ino, pr->first);
CID 1241497: Thread deadlock (ORDER_REVERSAL) Calling get_osdmap_read acquires lock RWLock.L while holding lock Mutex._m (count: 15 / 30).
314     const OSDMap *osdmap = objecter->get_osdmap_read();
315     object_locator_t oloc = osdmap->file_to_object_locator(pr->layout);
316     objecter->put_osdmap_read();
317     objecter->remove(oid, oloc, pr->snapc, pr->mtime, pr->flags,
318                      NULL, new C_PurgeRange(this, pr));
319     pr->uncommitted++;

To view the defects in Coverity Scan visit, http://scan.coverity.com/projects/25?tab=overview To unsubscribe from the email notification for new defects, http://scan5.coverity.com/cgi-bin/unsubscribe.py -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
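The ORDER_REVERSAL complaint is about inconsistent acquisition order between the osdmap RWLock and a mutex: if one path takes A then B while another takes B then A, the two can deadlock. The general rule the tools enforce -- pick one global order and take locks in that order on every path -- can be shown with a toy example (names invented; plain Python Locks stand in for the RWLock and Mutex):

```python
import threading

# One agreed global order: rwlock (outer) before mutex (inner), everywhere.
rwlock = threading.Lock()
mutex = threading.Lock()
completed = []


def purge_range_worker(i):
    with rwlock:        # outer lock first on every path...
        with mutex:     # ...inner lock second: no order reversal possible
            completed.append(i)


threads = [threading.Thread(target=purge_range_worker, args=(i,))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(completed))
```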
Re: [ceph-users] Calamari Goes Open Source
On Wed, Jul 30, 2014 at 3:40 PM, Dan Ryder (daryder) dary...@cisco.com wrote:
> I had similar issues, I tried many different ways to use vagrant but couldn't build packages successfully. I'm not sure how reliable this is, but if you are looking to get Calamari packages quickly, you can skip the Vagrant install steps and just use the Makefile.

A note of warning here -- the vagrant stuff exists for a reason. The ease of building things directly depends very much on what distro you're on and what external packages are installed. Because of the way the virtualenv for /opt/calamari is built, it can be sensitive not just to what packages you have installed, but to what packages you *don't* have installed -- if something is installed systemwide, that can prevent pip from realizing it needs to build it into the virtualenv.

BTW, the configuration of the build virtual machines is separable from vagrant itself: within each folder in vagrant you'll see a salt/roots subdir. Those salt states can also be used on a virtual machine set up using your choice of provisioner, if you're running salt there.

Finally, anyone working in this area should join the ceph-calamari mailing list at http://lists.ceph.com/listinfo.cgi/ceph-calamari-ceph.com -- it would be good to hear there about any specific steps in the build instructions that are failing.

Cheers,
John
Re: wip-objecter
On Mon, Jul 28, 2014 at 10:48 PM, Sage Weil sw...@redhat.com wrote:
> Anyway, I rebased and resolved conflicts and pushed a new wip-objecter-rebased which looks like it has the right diff between the old and new versions. Take a look? The main change is that the "set data pool" virtual xattrs do objecter->wait_for_latest_osdmap() and the flawed MDS::request_osdmap() helper is now gone.

Yep, looks right. I was going to do about the same squashup once things were passing unit tests, but I had missed the leak of the contexts.

John
Re: wip-objecter
On Mon, Jul 28, 2014 at 11:17 PM, Sage Weil sw...@redhat.com wrote:
> I wonder if a safer approach is to make a subclass of Context called RelockingContext, which takes an mds_lock * in the ctor, and then make all of the other completions subclasses of that. That way we can use type checking to ensure that all Contexts passed to MDLog::submit_entry() (or other places that make sense) are of the right type. Hopefully there aren't any classes that are used in both locking and non-locking contexts...

Greg and I chatted about this sort of thing a bit last week, where the same sort of mechanism might be used to distinguish completions that need to go via a finisher vs. ones that are safe to call directly. I convinced myself that wasn't going to work last week (because we have to call completions from the journaler in the right order), but the general idea of declaratively distinguishing contexts like this is still appealing.

The following are the interesting attributes of contexts:
1. Contexts that just don't care how they're called (e.g. C_OnFinisher can be called any which way)
2. Contexts that will take mds_lock (if you're holding it, send via a finisher thread)
3. Contexts that require mds_lock to already be held
4. Contexts that may call into Journaler (you may not call these while holding Journaler.lock; if you're holding it, send via a finisher thread)
5. Contexts that may do Objecter I/O (you may not call these from an objecter callback; if you're in that situation, you must go via a finisher thread)

These are not all exclusive. Number 5 goes away when we switch to librados, because it's already going to be going via a finisher thread before calling back our I/O completions. For the moment we get the same effect by using C_OnFinisher with all the C_IO_* contexts. Number 4 goes away if Journaler sends all its completions via a finisher, as seems to be the simplest thing to do right now.

> Alternatively, we could make a LockingFinisher class that takes the specified lock (mds_lock) before calling each Context. That might be simpler?

Hmm, if we did that then completions might sometimes hop up multiple finishers, if they went e.g. first into a "take the mds lock" finisher and then subsequently via a "the mds lock must be held" finisher (the second one being strictly redundant, but it sure would be nice to have the check in there somehow).

The explicit Context subclasses have lots of appeal to me for the compile-time assurance we would have that we were using the right kind of thing in the right kind of place -- e.g. when we do add_waiter calls on an inode's lock, we can check at compile time that we're passing an "mds lock already held" subclass.

John
Re: wip-objecter
I think that for the calls where we check the epoch and then conditionally call wait_for_map with the context, we have to change them to do the wait_for_map call first and check the boolean response. Otherwise, if the map we have is updated between the read of the epoch and the call to wait_for_map, it could already be ready and never call back our context.

My version is here (on top of a rebase of wip-objecter to master, on branch wip-objecter-rebase):
https://github.com/ceph/ceph/commit/3cd82464ed6f13ec5b44da303849061648b9e3a1

John

On Sat, Jul 26, 2014 at 6:42 AM, Sage Weil sw...@redhat.com wrote:
> Hey John,
> I fixed up the mds osdmap wait stuff and squashed it into wip-objecter. That rebased out from underneath your branch; sorry. Will try to look at the other patches shortly!
> sage
Re: collectd / graphite / grafana .. calamari?
Hi Ricardo,

Let me share a few notes on metrics in calamari:
* We're bundling graphite, and using diamond to send home metrics. The diamond collector used in calamari has always been open source [1].
* The Calamari UI has its own graphs page that talks directly to the graphite API (the calamari REST API does not duplicate any of the graphing interface).
* We also bundle the default graphite dashboard, so that folks can go to /graphite/dashboard/ on the calamari server to plot anything custom they want to.

It could be quite interesting to hook Grafana in there in the same way that we currently hook in the default graphite dashboard, as Grafana is definitely nicer and would give us a roadmap to influxdb (a project I am quite excited about).

Cheers,
John

1. https://github.com/ceph/Diamond/commits/calamari

On Fri, May 23, 2014 at 1:58 AM, Ricardo Rocha rocha.po...@gmail.com wrote:
> Hi. I saw the thread a couple days ago on ceph-users regarding collectd... and yes, I've been working on something similar for the last few days :)
> https://github.com/rochaporto/collectd-ceph
> It has a set of collectd plugins pushing metrics which mostly map what the ceph commands return. In the setup we have, it pushes them to graphite and the displays rely on grafana (check for a screenshot in the link above). As it relies on common building blocks, it's easily extensible and we'll come up with new dashboards soon -- things like plotting osd data against the metrics from the collectd disk plugin, which we also deploy.
> This email is mostly to share the work, but also to check on Calamari. I asked Patrick after the RedHat/Inktank news and have no idea what it provides, but I'm sure it comes with lots of extra sauce -- he suggested asking on the list. What's the timeline to have it open sourced? It would be great to have a look at it, and as there's work from different people in this area, maybe start working together on some fancier monitoring tools.
>
> Regards,
> Ricardo
Feedback: default FS pools and newfs behavior
In response to #8010 [1], I'm looking at making it possible to explicitly disable CephFS, so that the (often unused) filesystem pools don't hang around if they're unwanted. The administrative behavior would change such that:
* To enable the filesystem, it is necessary to create two pools and use "ceph newfs <metadata> <data>"
* There's a new "ceph rmfs" command to disable the filesystem and allow removing its pools
* Initially, the filesystem is disabled and the 'data' and 'metadata' pools are not created by default

There's an initial cut of this on a branch:
https://github.com/ceph/ceph/commits/wip-nullfs

Questions:
* Are there strong opinions about whether the CephFS pools should exist by default? I think it makes life simpler if they don't, avoiding "what the heck is the 'data' pool?"-type questions from newcomers.
* Is it too unfriendly to require users to explicitly create pools before running newfs, or do we need to auto-create pools when they run newfs? Auto-creating pools from newfs is a bit awkward internally because it requires modifying both OSD and MDS maps in one command.

Cheers,
John

1. http://tracker.ceph.com/issues/8010
Re: A plea for more tooling usage
A couple of notes for anyone following along on ubuntu precise or another older system:
* make sure you're using clang >= 3.3 to get support for suppressing all the warning flags in Daniel's command line.
* libcrypto++ 5.6.1 (the version in ubuntu) doesn't compile with clang, unless you hack

I was looking through the warnings out of interest and made a couple of notes.

On Tue, May 6, 2014 at 2:48 PM, Daniel Hofmann dan...@trvx.org wrote:

warning: anonymous types declared in an anonymous union are an extension [-Wnested-anon-types]
$ grep Wnested-anon-types uniq.txt | wc -l
28

This is mostly ceph_osd_op in rados.h -- the code is fine but nonstandard. We can avoid this particular warning by explicitly typedef'ing the struct types before using them, but we would still have the warning that anonymous unions are themselves nonstandard. Anonymous unions are nice...

warning: zero size arrays are an extension [-Wzero-length-array]
$ grep Wzero-length-array uniq.txt | wc -l
8134

This is mostly dout_impl, I think -- it is using an array whose size is defined as 0 or -1 conditionally, apparently in order to detect out-of-bounds severity numbers. I wonder if there is a neater way to do this -- I am not a preprocessor guru.

warning: '%' may not be nested in a struct due to flexible array member [-Wflexible-array-extensions]
$ grep Wflexible-array-extensions uniq.txt | wc -l
2

This is true (ceph_frag_tree in ceph_mds_reply_inode)... I don't see a nice way around it. Perhaps it's a useful enough language extension that we should continue to use it.

warning: cast between pointer-to-function and pointer-to-object is an extension [-Wpedantic]
warning: use of non-standard escape character '\%' [-Wpedantic]
$ grep Wpedantic uniq.txt | wc -l
15

The pointer cast thing is overly pedantic in the cases where we're using dlsym (which returns a void*): an object-pointer-to-function-pointer cast is illegal in the language standard but guaranteed to work in POSIX. The \% thing is easily fixed.

warning: use of GNU old-style field designator extension [-Wgnu-designator]
$ grep Wgnu-designator uniq.txt | wc -l
43

fuse_ll.cc:fuse_ll_oper and config.cc:g_default_file_layout. We could easily just remove the field designators and initialize as {val1, val2}, but that's much less readable :-/ I don't think nice struct initialization becomes a thing until C++11.

warning: using namespace directive in global context in header [-Wheader-hygiene]
$ grep Wheader-hygiene uniq.txt | wc -l
87

We should be able to solve these easily with a fairly mechanical procedure:
* Remove the 'using's from the headers
* Change type references in headers to use the appropriate prefix (mostly std::)
* Add the 'using's to any .cc files that relied on them

warning: struct '%' was previously declared as a class [-Wmismatched-tags]
$ grep Wmismatched-tags uniq.txt | wc -l
93

At least some of these (especially Inode in libcephfs.h) appear to have the intention of exposing POD classes in C-compatible headers. We should probably change the C++ side to also use struct for the things that are going to be exposed to C land.

warning: missing field '%' initializer [-Wmissing-field-initializers]
$ grep Wmissing-field-initializer uniq.txt | wc -l
3

Hmm, I only saw one of these:
test/osd/TestRados.cc:260:36: warning: missing field 'ec_pool_valid' initializer [-Wmissing-field-initializers]
...but I'm on an older clang (3.3), so perhaps that explains it.

warning: private field '%' is not used [-Wunused-private-field]
$ grep Wunused-private-field uniq.txt | wc -l
11

This is a handy warning indeed; it appears to be accurately pointing out unused fields.

John
Re: RADOS translator for GlusterFS
In terms of making something work really quickly, one approach would be to base off the existing POSIX translator, use a local FS backed by an RBD volume for the metadata, and store the file content directly using librados. That would avoid the need to invent a way to map filesystem-style metadata to librados calls, while still getting reasonably efficient data operations through to rados. I doubt this would be very slick, but it could be a fun hack!

John

On Mon, May 5, 2014 at 4:21 PM, Jeff Darcy jda...@redhat.com wrote:
> Now that we're all one big happy family, I've been mulling over different ways that the two technology stacks could work together. One idea would be to use some of the GlusterFS upper layers for their interface and integration possibilities, but then fall down to RADOS instead of GlusterFS's own distribution and replication. I must emphasize that I don't necessarily think this is The Right Way for anything real, but I think it's an important experiment just to see what the problems are and how well it performs.
>
> So here's what I'm thinking. For the Ceph folks, I'll describe just a tiny bit of how GlusterFS works. The core concept in GlusterFS is a translator, which accepts file system requests and generates file system requests in exactly the same form. This allows them to be stacked in arbitrary orders, moved back and forth across the server/client divide, etc. There are several broad classes of translators:
> * Some, such as FUSE or GFAPI, inject new requests into the translator stack.
> * Some, such as posix, satisfy requests by calling a server-local FS.
> * The client and server translators together get requests from one machine to another.
> * Some translators *route* requests (one in to one of several out).
> * Some translators *fan out* requests (one in to all of several out).
> * Most are one in, one out, to add e.g. locks or caching etc.
>
> Of particular interest here are the DHT (routing/distribution) and AFR (fan-out/replication) translators, which mirror functionality in RADOS. My idea is to cut out everything from these on down, in favor of a translator based on librados instead. How this works is pretty obvious for file data -- just read and write to RADOS objects instead of to files. It's a bit less obvious for metadata, especially directory entries. One really simple idea is to store metadata as data, in some format defined by the translator itself, and have it handle the read/modify/write for adding/deleting entries and such. That would be enough to get some basic performance tests done. A slightly more sophisticated idea might be to use OSD class methods to do the read/modify/write, but I don't know much about that mechanism, so I'm not sure it's even feasible.
>
> This is not something I'm going to be working on as part of my main job, but I'd like to get the experiment started in some of my spare time. Is there anyone else interested in collaborating, or are there any other obvious ideas I'm missing?
Full OSD handling (CephFS and in general)
Having found that our full-space handling in CephFS wasn't working right [1], there was some discussion on the CephFS standup about how to improve the free space handling in a more general way.

Currently (once #7780 is fixed), we just give the MDS a pass on all the fullness checks, so that it can journal file deletions to free up space. This is a halfway solution, because there are still ways for the MDS to fill up the remaining space with operations other than deletions, especially with the advent of inlining for small files. It's also hacky, because there is code inside the OSD that special-cases writes from the MDS.

Changes discussed
===

In the CephFS layer:
* We probably need to do some work to blacklist client requests other than housekeeping and deletions when we are in a low-space situation.

In the RADOS layer:
* Per-pool full flag: for situations where metadata+data pools are on separate OSDs, a per-pool full flag (instead of the current global one), so that we can distinguish between situations where we need to be conservative with metadata operations (low space on the metadata pool) vs. situations where only client data IO is blocked (low space on the data pool). This seems fairly uncontroversial, as the current global full flag doesn't reflect that different pools can be on entirely separate storage.
* Per-pool full ratio: for situations where metadata+data pools are on the same OSDs, separate full ratios per pool, so that once the data pool's threshold is reached, the remaining reserved space is given over to the metadata pool (assuming the metadata pool has a higher full ratio, possibly just set to 100%).

Throwing it out to the list for thoughts.

Cheers,
John

1. http://tracker.ceph.com/issues/7780
Re: Problem with MDS respawning (up:replay)
Hi Luke,

I've responded to your colleague on ceph-users (I'm assuming this is the same issue).

John

On Mon, Mar 17, 2014 at 4:00 PM, Luke Jing Yuan jyl...@mimos.my wrote:
> Dear all,
>
> We had been running our cluster for at least 1.5 months without any issues, but something really bad happened yesterday with the MDS, and we would really appreciate some guidance/pointers on how this may be resolved urgently. We started to notice the following messages repeating in the MDS log:
>
> # cat /var/log/ceph/ceph-mds.mon01.log
> 2014-03-16 18:40:41.894404 7f0f2875c700  0 mds.0.server handle_client_file_setlock: start: 0, length: 0, client: 324186, pid: 30684, pid_ns: 18446612141968944256, type: 4
> 2014-03-16 18:49:09.993985 7f0f24645700  0 -- x.x.x.x:6801/3739 >> y.y.y.y:0/1662262473 pipe(0x728d2780 sd=26 :6801 s=0 pgs=0 cs=0 l=0 c=0x100adc6e0).accept peer addr is really y.y.y.y:0/1662262473 (socket is y.y.y.y:33592/0)
> 2014-03-16 18:49:10.000197 7f0f24645700  0 -- x.x.x.x:6801/3739 >> y.y.y.y:0/1662262473 pipe(0x728d2780 sd=26 :6801 s=0 pgs=0 cs=0 l=0 c=0x100adc6e0).accept connect_seq 0 vs existing 1 state standby
> 2014-03-16 18:49:10.000239 7f0f24645700  0 -- x.x.x.x:6801/3739 >> y.y.y.y:0/1662262473 pipe(0x728d2780 sd=26 :6801 s=0 pgs=0 cs=0 l=0 c=0x100adc6e0).accept peer reset, then tried to connect to us, replacing
> 2014-03-16 18:49:10.550726 7f4c34671780  0 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mds, pid 13282
> 2014-03-16 18:49:10.826713 7f4c2f6f8700  1 mds.-1.0 handle_mds_map standby
> 2014-03-16 18:49:10.984992 7f4c2f6f8700  1 mds.0.14 handle_mds_map i am now mds.0.14
> 2014-03-16 18:49:10.985010 7f4c2f6f8700  1 mds.0.14 handle_mds_map state change up:standby --> up:replay
> 2014-03-16 18:49:10.985017 7f4c2f6f8700  1 mds.0.14 replay_start
> 2014-03-16 18:49:10.985024 7f4c2f6f8700  1 mds.0.14 recovery set is
> 2014-03-16 18:49:10.985027 7f4c2f6f8700  1 mds.0.14 need osdmap epoch 3446, have 3445
> 2014-03-16 18:49:10.985030 7f4c2f6f8700  1 mds.0.14 waiting for osdmap 3446 (which blacklists prior instance)
> 2014-03-16 18:49:16.945500 7f4c2f6f8700  0 mds.0.cache creating system inode with ino:100
> 2014-03-16 18:49:16.945747 7f4c2f6f8700  0 mds.0.cache creating system inode with ino:1
> 2014-03-16 18:49:17.358681 7f4c2b5e1700 -1 mds/journal.cc: In function 'void EMetaBlob::replay(MDS*, LogSegment*, MDSlaveUpdate*)' thread 7f4c2b5e1700 time 2014-03-16 18:49:17.356336
> mds/journal.cc: 1316: FAILED assert(i == used_preallocated_ino)
> ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60)
> 1: (EMetaBlob::replay(MDS*, LogSegment*, MDSlaveUpdate*)+0x7587) [0x5af5e7]
> 2: (EUpdate::replay(MDS*)+0x3a) [0x5b67ea]
> 3: (MDLog::_replay_thread()+0x678) [0x79dbb8]
> 4: (MDLog::ReplayThread::entry()+0xd) [0x58bded]
> 5: (()+0x7e9a) [0x7f4c33a96e9a]
> 6: (clone()+0x6d) [0x7f4c3298b3fd]
>
> From ceph -s, we didn't notice any stuck PGs or whatnot, but the following:
>
> # ceph -s
>     cluster x
>      health HEALTH_WARN mds cluster is degraded
>      monmap e1: 3 mons at {mon01=x.x.x.x:6789/0,mon02=x.x.x.y:6789/0,mon03=x.x.x.z:6789/0}, election epoch 1210, quorum 0,1,2 mon01,mon02,mon03
>      mdsmap e17020: 1/1/1 up {0=mon01=up:replay}, 2 up:standby
>      osdmap e20195: 24 osds: 24 up, 24 in
>       pgmap v1424671: 3300 pgs, 6 pools, 793 GB data, 3284 kobjects
>             1611 GB used, 87636 GB / 89248 GB avail
>                 3300 active+clean
>   client io 2750 kB/s rd, 0 op/s
>
> We also noticed in our syslog (dmesg actually) that the MDS services had been flapping:
>
> [5165030.941804] init: ceph-mds (ceph/mon01) main process (2264) killed by ABRT signal
> [5165030.941919] init: ceph-mds (ceph/mon01) main process ended, respawning
> [5165040.907291] init: ceph-mds (ceph/mon01) main process (2302) killed by ABRT signal
> [5165040.907363] init: ceph-mds (ceph/mon01) main process ended, respawning
> [5165050.860593] init: ceph-mds (ceph/mon01) main process (2346) killed by ABRT signal
> [5165050.860670] init: ceph-mds (ceph/mon01) main process ended, respawning
>
> More info from ceph df:
>
> GLOBAL:
>     SIZE    AVAIL   RAW USED  %RAW USED
>     89248G  87636G  1611G     1.81
>
> POOLS:
>     NAME      ID  USED   %USED  OBJECTS
>     data      0   9387M  0.01   2350
>     metadata  1   941M   0      547003
>     rbd       2   0      0      0
>     backuppc  4   783G   0.88   2813040
>     mysqlfs   5   114M   0      1278
>     mysqlrbd  6   0      0      0
>
> We would appreciate it if someone were able to enlighten us on a possible solution to this. Thanks in advance.
>
> Regards,
> Luke
>
> DISCLAIMER: This e-mail (including any attachments) is for the addressee(s) only and may be confidential, especially as regards personal data. If you are not the intended recipient, please note that any
Re: Erasure code properties in OSDMap
I am sure all of that will work, but it doesn't explain why these properties must be stored and named separately from crush rulesets. To flesh this out, one also needs get and list operations for the sets of properties, which feels like overkill if there is an existing place we could be storing these things. The reason I'm taking such an interest in what may seem like something minor is that once this has been added, we will be stuck with it for some time once external tools start depending on the interface.

The ruleset-based approach doesn't have to be more complicated for CLI users; we would essentially replace any "myproperties" above with a ruleset name instead:

  osd pool create mypool <pgnum> <pgpnum> <ruleset>
  osd set ruleset-properties <ruleset> key=val [key=val...]

The simple default cases of "pool create mypool <pgnum> <pgpnum> erasure" could be handled by making sure there exist default rulesets called "erasure" and "replicated", rather than having these be magic words to the commands that cause ruleset creation. Rulesets currently just have numbers instead of names, but it would be simpler to add names to rulesets than to introduce a whole new type of object to the interface.

John

On Tue, Mar 11, 2014 at 2:03 PM, Loic Dachary loic.dach...@cloudwatt.com wrote:
> On 11/03/2014 13:21, John Spray wrote:
>> From a high level view, what is the logical difference between the crush ruleset and the properties object? I'm thinking about how this is exposed to users and tools, and it seems like both would be defined as "the settings about data placement and encoding". I certainly understand the separation internally; I am just concerned about making the interface we expose upwards more complicated by adding a new type of object. Is there really a need for a new type of properties object, instead of storing these properties under the existing ruleset ID?
>
> These properties are used to configure the new feature that was introduced in Firefly: erasure coded pools.
>
> From a user point of view, the simplest would be to "ceph osd pool create mypool erasure" and rely on the fact that a default ruleset will be created using the default erasure code plugin with the default parameters. If the sysadmin wants to tweak the K+M factors, (s)he could:
>   ceph osd set properties myproperties k=10 m=4
> and then
>   ceph osd pool create mypool erasure myproperties
> which would implicitly ask the default erasure code plugin to create a ruleset named mypool-ruleset after configuring it with myproperties.
>
> If the sysadmin wants to share rulesets between pools instead of relying on their implicit creation, (s)he could
>   ceph osd create-serasure myruleset myproperties
> and then ceph osd set crush_ruleset as per usual. And if (s)he really wants fine tuning, manually adding the ruleset is also possible.
>
> I feel comfortable explaining this, but I'm probably much too familiar with the subject to be a good judge of what makes sense to someone new or not ;-)
>
> Cheers
>
>> John
>>
>> On Sun, Mar 9, 2014 at 12:13 PM, Loic Dachary loic.dach...@cloudwatt.com wrote:
>>> Hi Sage & Sam,
>>> I quickly sketched the replacement of the pg_pool_t::properties map with an OSDMap::properties list of maps at https://github.com/dachary/ceph/commit/fe3819a62eb139fc3f0fa4282b4d22aecd8cd398 and explained how I see it at http://tracker.ceph.com/issues/7662#note-2
>>> It indeed makes things simpler, more consistent and easier to explain. I can provide an implementation this week if this seems reasonable to you.
>>> Cheers
>>> --
>>> Loïc Dachary, Senior Developer
>
> --
> Loïc Dachary, Senior Developer