Re: [ceph-users] Living with huge bucket sizes

2017-06-09 Thread Yehuda Sadeh-Weinraub
On Fri, Jun 9, 2017 at 2:21 AM, Dan van der Ster  wrote:
> Hi Bryan,
>
> On Fri, Jun 9, 2017 at 1:55 AM, Bryan Stillwell  
> wrote:
>> This has come up quite a few times before, but since I was only working with
>> RBD before I didn't pay too close attention to the conversation.  I'm
>> looking
>> for the best way to handle existing clusters that have buckets with a large
>> number of objects (>20 million) in them.  The cluster I'm doing tests on is
>> currently running hammer (0.94.10), so if things got better in jewel I would
>> love to hear about it!
>> ...
>> Has anyone found a good solution for this for existing large buckets?  I
>> know sharding is the solution going forward, but afaik it can't be done
>> on existing buckets yet (although the dynamic resharding work mentioned
>> on today's performance call sounds promising).
>
> I haven't tried it myself, but 0.94.10 should have the (offline)
> resharding feature. From the release notes:
>

Right. We did add automatic dynamic resharding to Luminous, but
offline resharding should be enough.


>> * In RADOS Gateway, it is now possible to reshard an existing bucket's index
>> using an off-line tool.
>>
>> Usage:
>>
>> $ radosgw-admin bucket reshard --bucket=<bucket_name> --num_shards=<num_shards>
>>
>> This will create a new linked bucket instance that points to the newly 
>> created
>> index objects. The old bucket instance still exists and currently it's up to
>> the user to manually remove the old bucket index objects. (Note that bucket
>> resharding currently requires that all IO (especially writes) to the specific
>> bucket is quiesced.)

Once resharding is done, use the radosgw-admin bi purge command to
remove the old bucket indexes.
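For example (a sketch; the bucket name and instance ID are placeholders, with
the old instance ID being the one reported by "radosgw-admin metadata get
bucket:<bucket_name>" before the reshard):

$ radosgw-admin bi purge --bucket=<bucket_name> --bucket-id=<old_bucket_instance_id>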

Yehuda

>
> -- Dan
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RGW: Auth error with hostname instead of IP

2017-06-09 Thread Eric Choi
When I send an RGW request using the hostname (with a port other than 80), I
get a "SignatureDoesNotMatch" error.

GET / HTTP/1.1
Host: cephrgw0002s2mdw1.sendgrid.net:50680
User-Agent: Minio (linux; amd64) minio-go/2.0.4 mc/2017-04-03T18:35:01Z
Authorization: AWS **REDACTED**:**REDACTED**


SignatureDoesNotMatchtx00093e0c1-00593b145c-996aae1-default996aae1-default-defaultmc:


However, this works fine when I send it with an IP address instead.  Is the
hostname part of the signature?  If so, how can I make it work with the
hostname as well?


Thank you,


Eric
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] disk mishap + bad disk and xfs corruption = stuck PG's

2017-06-09 Thread Mazzystr
Well, I did something bad; I just don't know how bad yet.  Before we get into
it: my critical data is backed up to CrashPlan.  I'd rather not lose all my
archive data, but losing some of the data is OK.

I added a bunch of disks to my ceph cluster, so I turned off the cluster and
dd'd the raw disks around so that the disks and OSDs were ordered by ID on
the HBA.  I fat-fingered one disk and overwrote it.  Another disk didn't
dd correctly... it seems it hadn't unmounted cleanly, plus it has some
failures according to smartctl.  An xfs_repair run put a whole bunch of
data into lost+found.

I brought the cluster up and let it settle down.  The result is 49 stuck
pg's and CephFS is halted.

ceph -s is here 
ceph osd tree is here 
ceph pg dump minus the active pg's is here 

OSD-2 is gone with no chance to restore it.

OSD-3 had the xfs corruption.  I have a bunch of
/var/lib/ceph/osd/ceph-3/lost+found/blah/DIR_[0-9]+/blah.blah__head_blah.blah
files after xfs_repair.  I looped these files through ceph osd map <pool> $file
and it seems they have all been replicated to other OSDs, so it appears safe
to delete this data.
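For reference, the loop was roughly the following (a sketch; <pool> is the
data pool, and the lost+found filenames may need the usual filestore
un-escaping to map back to object names):

for f in /var/lib/ceph/osd/ceph-3/lost+found/*/DIR_*/*__head_*; do
    obj=$(basename "$f" | sed 's/__head_.*//')   # strip the __head_<hash>__<pool> suffix
    ceph osd map <pool> "$obj"
done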

There are files named [0-9]+ in the top level of
/var/lib/ceph/osd/ceph-3/lost+found.  I don't know what to do with these
files.


I have a couple of questions:
1) Can the top-level lost+found files be used to recreate the stuck PGs?

2a) Can the PGs be dropped and recreated to bring the cluster to a healthy
state?
2b) If I do this, can CephFS be restored with only partial data loss?  The
CephFS documentation isn't quite clear on how to do this.

Thanks for your time and help!
/Chris
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OSD crash (hammer): osd/ReplicatedPG.cc: 7477: FAILED assert(repop_queue.front() == repop)

2017-06-09 Thread Ricardo J. Barberis
Hi list,

A few days ago we had some problems with our ceph cluster, and now we have some
OSDs crashing on start with messages like this right before crashing:

2017-06-09 15:35:02.226430 7fb46d9e4700 -1 log_channel(cluster) log [ERR] : 
trim_object Snap 4aae0 not in clones

I can start those OSDs if I set 'osd recovery max active = 0', but then the
PGs on those OSDs stay stuck in a degraded/unclean state.
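(For clarity, that's either in ceph.conf under [osd] before starting the
daemon, or injected at runtime once the OSD is up; a sketch:)

# ceph.conf
[osd]
osd recovery max active = 0

# or at runtime:
ceph tell osd.199 injectargs '--osd_recovery_max_active 0'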

Now, one of the problems we faced was missing clones, which we solved with
ceph-objectstore-tool's remove-clone-metadata option, but it doesn't seem to
work in this case (PG and object taken from the sample log posted below):

# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-199 --journal-path 
/var/lib/ceph/osd/ceph-199/journal --pgid 3.1cd 
rbd_data.6e30d2d518fee6d.008f remove-clone-metadata $((0x4aae0))
Clone 4aae0 not present


Any hints how can we debug/fix this problem?

Thanks in advance.


Our cluster:
Hammer (0.94.10)
5 mons
576 osds


Our clients:
Openstack, some Hammer (0.94.10), some Giant (0.87.2)
(We're in the process of upgrading everything to Hammer and then to Jewel)


Our ceph.conf:

[global]
fsid = 5ecc1509-4344-4087-9a1c-ac68fb085a75
mon initial members = ceph17, ceph19, ceph20, cephmon01, cephmon02
mon host = 172.17.22.17,172.17.22.19,172.17.22.20,172.17.23.251,172.17.23.252
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
filestore xattr use omap = true
osd pool default size = 2
osd pool default min size = 1
public network = 172.17.0.0/16
cluster network = 172.18.0.0/16
osd pool default pg num = 2048
osd pool default pgp num = 2048
mon osd down out subtree limit = host
rbd default format = 2
osd op threads = 2
osd disk threads = 1
osd max backfills = 2
osd recovery threads = 1
osd recovery max active = 2
osd recovery op priority = 2
[mon]
mon compact on start = true
[osd]
osd crush update on start = false


Here's a sample log, with [...] replacing very long text/lines, but I can
provide full logs if needed:

2017-06-08 16:47:14.519092 7fa61bee4700 -1 log_channel(cluster) log [ERR] : 
trim_object Snap 4aae0 not in clones
2017-06-08 16:47:16.197600 7fa62ee75700  0 osd.199 pg_epoch: 1779479 pg[3.1cd( 
v 1779479'105504641 (1779407'105501638,1779479'105504641] local-les=1779478 
n=1713 
ec=33 les/c 1779478/1779479 1779477/1779477/1779477) [199,134] r=0 lpr=1779477 
luod=1779479'105504639 crt=1779479'105504637 lcod 1779479'105504638 mlcod 
1779479'105504638 active+clean 
snaptrimq=[4aae0~1,4aae2~1,4aae4~1,4aae6~1,[...],4bc5f~4,4bc64~2,4bc68~2]]  
removing repgather(0xf9d09c0 1779479'105504639 
rep_tid=1753 committed?=1 applied?=1 lock=0 
op=osd_op(client.215404673.0:6234090 rbd_data.6e30d2d518fee6d.008f 
[set-alloc-hint object_size 8388608 
write_size 8388608,write 2101248~4096] 3.56a5e1cd snapc 4a86e=[] 
ack+ondisk+write+known_if_redirected e1779479) v4)
2017-06-08 16:47:16.197834 7fa62ee75700  0 osd.199 pg_epoch: 1779479 pg[3.1cd( 
v 1779479'105504641 (1779407'105501638,1779479'105504641] local-les=1779478 
n=1713 
ec=33 les/c 1779478/1779479 1779477/1779477/1779477) [199,134] r=0 lpr=1779477 
luod=1779479'105504639 crt=1779479'105504637 lcod 1779479'105504638 mlcod 
1779479'105504638 active+clean 
snaptrimq=[4aae0~1,4aae2~1,4aae4~1,[...],4bc5f~4,4bc64~2,4bc68~2]]q front 
is repgather(0x10908a00 0'0 rep_tid=1740 committed?=0 
applied?=0 lock=0)
2017-06-08 16:47:16.267386 7fa62ee75700 -1 osd/ReplicatedPG.cc: In function 
'void ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)' thread 7fa62ee75700 
time 
2017-06-08 16:47:16.197953
osd/ReplicatedPG.cc: 7477: FAILED assert(repop_queue.front() == repop)

 ceph version 0.94.10 (b1e0532418e4631af01acbc0cedd426f1905f4af)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) 
[0xbdf735]
 2: (ReplicatedPG::eval_repop(ReplicatedPG::RepGather*)+0x12e1) [0x853111]
 3: (ReplicatedPG::repop_all_committed(ReplicatedPG::RepGather*)+0xcc) 
[0x85341c]
 4: (Context::complete(int)+0x9) [0x6c6329]
 5: (ReplicatedBackend::op_commit(ReplicatedBackend::InProgressOp*)+0x1dc) 
[0xa11b4c]
 6: (Context::complete(int)+0x9) [0x6c6329]
 7: (ReplicatedPG::BlessedContext::finish(int)+0x94) [0x8ad2d4]
 8: (Context::complete(int)+0x9) [0x6c6329]
 9: (Finisher::finisher_thread_entry()+0x168) [0xb02168]
 10: (()+0x7dc5) [0x7fa63ce39dc5]
 11: (clone()+0x6d) [0x7fa63b91b76d]
 NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
interpret this.

--- begin dump of recent events ---
-1> 2017-06-09 15:35:27.341080 7fb4862fb700  5 osd.199 1779827 tick
 -> 2017-06-09 15:35:27.341102 7fb4862fb700 20 osd.199 1779827 
scrub_time_permit should run between 0 - 24 now 15 = yes
 -9998> 2017-06-09 15:35:27.341121 7fb4862fb700 20 osd.199 1779827 
scrub_load_below_threshold loadavg 1.35 >= max 0.5 = no, load too high
 -9997> 2017-06-09 15:35:27.341130 7fb4862fb700 20 osd.199 1779827 sched_scrub 
load_is_low=0
[...]
-5> 2017-06-09 15:35:30.894979 7fb480edd700 15 

Re: [ceph-users] OSD node type/count mixes in the cluster

2017-06-09 Thread Deepak Naidu
Thanks David for sharing your experience, appreciate it.

--
Deepak

From: David Turner [mailto:drakonst...@gmail.com]
Sent: Friday, June 09, 2017 5:38 AM
To: Deepak Naidu; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] OSD node type/count mixes in the cluster


I ran a cluster with 2 generations of the same vendor hardware. 24 osd 
supermicro and 32 osd supermicro (with faster/more RAM and CPU cores).  The 
cluster itself ran decently well, but the load differences were drastic between 
the 2 types of nodes. It required me to run the cluster with 2 separate config 
files for each type of node and was an utter PITA when troubleshooting 
bottlenecks.

Ultimately I moved around hardware and created a legacy cluster on the old 
hardware and created a new cluster using the newer configuration.  In general 
it was very hard to diagnose certain bottlenecks due to everything just looking 
so different.  The primary one I encountered was snap trimming due to deleting 
thousands of snapshots/day.

If you aren't pushing any limits of Ceph, you will probably be fine.  But if 
you have a really large cluster, use a lot of snapshots, or are pushing your 
cluster harder than the average user... Then I'd avoid mixing server 
configurations in a cluster.

On Fri, Jun 9, 2017, 1:36 AM Deepak Naidu 
> wrote:
Wanted to check if anyone has a ceph cluster with mixed-vendor servers, all
with the same disk size (i.e. 8TB) but different disk counts: for example, 10
OSD servers from Dell with 60 disks per server and another 10 OSD servers from
HP with 26 disks per server.

If so, does that change any performance dynamics, or is it not advisable?

--
Deepak
---
This email message is for the sole use of the intended recipient(s) and may 
contain
confidential information.  Any unauthorized review, use, disclosure or 
distribution
is prohibited.  If you are not the intended recipient, please contact the 
sender by
reply email and destroy all copies of the original message.
---
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] removing cluster name support

2017-06-09 Thread Sage Weil
On Fri, 9 Jun 2017, Dan van der Ster wrote:
> On Fri, Jun 9, 2017 at 5:58 PM, Vasu Kulkarni  wrote:
> > On Fri, Jun 9, 2017 at 6:11 AM, Wes Dillingham
> >  wrote:
> >> Similar to Dan's situation we utilize the --cluster name concept for our
> >> operations. Primarily for "datamover" nodes which do incremental rbd
> >> import/export between distinct clusters. This is entirely coordinated by
> >> utilizing the --cluster option throughout.
> >>
> >> The way we set it up is that all clusters are actually named "ceph" on the
> >> mons and osds etc, but the clients themselves get /etc/ceph/clusterA.conf
> >> and /etc/ceph/clusterB.conf so that we can differentiate. I would like to
> >> see the functionality of clients being able to specify which conf file to
> >> read preserved.
> >
> > ceph.conf along with keyring file can stay in any location, the
> > default location is /etc/ceph but one could use
> > other location for clusterB.conf (
> > http://docs.ceph.com/docs/jewel/rados/configuration/ceph-conf/ ), At
> > least
> > for client which doesn't run any daemon this should be sufficient to
> > make it talk to different clusters.
> 
> So we start with this:
> 
> > ceph --cluster=flax health
> HEALTH_OK
> 
> Then for example do:
> > cd /etc/ceph/
> > mkdir flax
> > cp flax.conf flax/ceph.conf
> > cp flax.client.admin.keyring flax/ceph.client.admin.keyring
> 
> Now this works:
> 
> > ceph --conf=/etc/ceph/flax/ceph.conf 
> > --keyring=/etc/ceph/flax/ceph.client.admin.keyring health
> HEALTH_OK
> 
> So --cluster is just convenient shorthand for the CLI.

Yeah, although it's used elsewhere too:

$ grep \$cluster ../src/common/config_opts.h 
OPTION(admin_socket, OPT_STR, "$run_dir/$cluster-$name.asok") // default 
changed by common_preinit()
OPTION(log_file, OPT_STR, "/var/log/ceph/$cluster-$name.log") // default 
changed by common_preinit()
"default=/var/log/ceph/$cluster.$channel.log 
cluster=/var/log/ceph/$cluster.log")

"/etc/ceph/$cluster.$name.keyring,/etc/ceph/$cluster.keyring,/etc/ceph/keyring,/etc/ceph/keyring.bin,"
 

"/usr/local/etc/ceph/$cluster.$name.keyring,/usr/local/etc/ceph/$cluster.keyring,"
OPTION(mon_data, OPT_STR, "/var/lib/ceph/mon/$cluster-$id")
OPTION(mon_debug_dump_location, OPT_STR, "/var/log/ceph/$cluster-$name.tdump")
OPTION(mds_data, OPT_STR, "/var/lib/ceph/mds/$cluster-$id")
OPTION(osd_data, OPT_STR, "/var/lib/ceph/osd/$cluster-$id")
OPTION(osd_journal, OPT_STR, "/var/lib/ceph/osd/$cluster-$id/journal")
OPTION(rgw_data, OPT_STR, "/var/lib/ceph/radosgw/$cluster-$id")
OPTION(mgr_data, OPT_STR, "/var/lib/ceph/mgr/$cluster-$id") // where to find 
keyring etc

The only non-daemon ones are admin_socket and log_file, so keep that in 
mind.
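
So a client pointed at a second cluster without --cluster may also want to
override those two explicitly, e.g. (a sketch; paths are illustrative only):

ceph --conf=/etc/ceph/flax/ceph.conf \
     --keyring=/etc/ceph/flax/ceph.client.admin.keyring \
     --log-file=/var/log/ceph/flax-client.log \
     --admin-socket=/var/run/ceph/flax-client.asok \
     health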

> I guess it won't be the end of the world if you drop it, but would it
> be so costly to keep that working? (CLI only -- no use-case for
> server-side named clusters over here).

But yeah... I don't think we'll change any of this except to make the 
deployment tools' lives easier by not supporting it there.

sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] removing cluster name support

2017-06-09 Thread Dan van der Ster
On Fri, Jun 9, 2017 at 5:58 PM, Vasu Kulkarni  wrote:
> On Fri, Jun 9, 2017 at 6:11 AM, Wes Dillingham
>  wrote:
>> Similar to Dan's situation we utilize the --cluster name concept for our
>> operations. Primarily for "datamover" nodes which do incremental rbd
>> import/export between distinct clusters. This is entirely coordinated by
>> utilizing the --cluster option throughout.
>>
>> The way we set it up is that all clusters are actually named "ceph" on the
>> mons and osds etc, but the clients themselves get /etc/ceph/clusterA.conf
>> and /etc/ceph/clusterB.conf so that we can differentiate. I would like to
>> see the functionality of clients being able to specify which conf file to
>> read preserved.
>
> ceph.conf along with keyring file can stay in any location, the
> default location is /etc/ceph but one could use
> other location for clusterB.conf (
> http://docs.ceph.com/docs/jewel/rados/configuration/ceph-conf/ ), At
> least
> for client which doesn't run any daemon this should be sufficient to
> make it talk to different clusters.

So we start with this:

> ceph --cluster=flax health
HEALTH_OK

Then for example do:
> cd /etc/ceph/
> mkdir flax
> cp flax.conf flax/ceph.conf
> cp flax.client.admin.keyring flax/ceph.client.admin.keyring

Now this works:

> ceph --conf=/etc/ceph/flax/ceph.conf 
> --keyring=/etc/ceph/flax/ceph.client.admin.keyring health
HEALTH_OK

So --cluster is just convenient shorthand for the CLI.

I guess it won't be the end of the world if you drop it, but would it
be so costly to keep that working? (CLI only -- no use-case for
server-side named clusters over here).

--
Dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] removing cluster name support

2017-06-09 Thread Sage Weil
On Fri, 9 Jun 2017, Erik McCormick wrote:
> On Fri, Jun 9, 2017 at 12:07 PM, Sage Weil  wrote:
> > On Thu, 8 Jun 2017, Sage Weil wrote:
> >> Questions:
> >>
> >>  - Does anybody on the list use a non-default cluster name?
> >>  - If so, do you have a reason not to switch back to 'ceph'?
> >
> > It sounds like the answer is "yes," but not for daemons. Several users use
> > it on the client side to connect to multiple clusters from the same host.
> >
> 
> I thought some folks said they were running with non-default naming
> for daemons, but if not, then count me as one who does. This was
> mainly a relic of the past, where I thought I would be running
> multiple clusters on one host. Before long I decided it would be a bad
> idea, but by then the cluster was already in heavy use and I couldn't
> undo it.
> 
> I will say that I am not opposed to renaming back to ceph, but it
> would be great to have a documented process for accomplishing this
> prior to deprecation. Even going so far as to remove --cluster from
> deployment tools will leave me unable to add OSDs if I want to upgrade
> when Luminous is released.

Note that even if the tool doesn't support it, the cluster name is a 
host-local thing, so you can always deploy ceph-named daemons on other 
hosts.

For an existing host, the removal process should be as simple as

 - stop the daemons on the host
 - rename /etc/ceph/foo.conf -> /etc/ceph/ceph.conf
 - rename /var/lib/ceph/*/foo-* -> /var/lib/ceph/*/ceph-* (this mainly 
matters for non-osds, since the osd dirs will get dynamically created by 
ceph-disk, but renaming will avoid leaving clutter behind)
 - comment out the CLUSTER= line in /etc/{sysconfig,default}/ceph (if 
you're on jewel)
 - reboot
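
In shell terms, roughly (an untested sketch for a host whose cluster was
named "foo"):

systemctl stop ceph.target
mv /etc/ceph/foo.conf /etc/ceph/ceph.conf
for d in /var/lib/ceph/*/foo-*; do mv "$d" "${d/foo-/ceph-}"; done
sed -i 's/^CLUSTER=/#CLUSTER=/' /etc/sysconfig/ceph   # or /etc/default/ceph
reboot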

If you wouldn't mind being a guinea pig and verifying that this is 
sufficient that would be really helpful!  We'll definitely want to 
document this process.

Thanks!
sage


> 
> > Nobody is colocating multiple daemons from different clusters on the same
> > host.  Some have in the past but stopped.  If they choose to in the
> > future, they can customize the systemd units themselves.
> >
> > The rbd-mirror daemon has a similar requirement to talk to multiple
> > clusters as a client.
> >
> > This makes me conclude our current path is fine:
> >
> >  - leave existing --cluster infrastructure in place in the ceph code, but
> >  - remove support for deploying daemons with custom cluster names from the
> > deployment tools.
> >
> > This neatly avoids the systemd limitations for all but the most
> > adventuresome admins and avoid the more common case of an admin falling
> > into the "oh, I can name my cluster? cool! [...] oh, i have to add
> > --cluster rover to every command? ick!" trap.
> >
> 
> Yeah, that was me in 2012. Oops.
> 
> -Erik
> 
> > sage
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] removing cluster name support

2017-06-09 Thread Erik McCormick
On Fri, Jun 9, 2017 at 12:07 PM, Sage Weil  wrote:
> On Thu, 8 Jun 2017, Sage Weil wrote:
>> Questions:
>>
>>  - Does anybody on the list use a non-default cluster name?
>>  - If so, do you have a reason not to switch back to 'ceph'?
>
> It sounds like the answer is "yes," but not for daemons. Several users use
> it on the client side to connect to multiple clusters from the same host.
>

I thought some folks said they were running with non-default naming
for daemons, but if not, then count me as one who does. This was
mainly a relic of the past, where I thought I would be running
multiple clusters on one host. Before long I decided it would be a bad
idea, but by then the cluster was already in heavy use and I couldn't
undo it.

I will say that I am not opposed to renaming back to ceph, but it
would be great to have a documented process for accomplishing this
prior to deprecation. Even going so far as to remove --cluster from
deployment tools will leave me unable to add OSDs if I want to upgrade
when Luminous is released.

> Nobody is colocating multiple daemons from different clusters on the same
> host.  Some have in the past but stopped.  If they choose to in the
> future, they can customize the systemd units themselves.
>
> The rbd-mirror daemon has a similar requirement to talk to multiple
> clusters as a client.
>
> This makes me conclude our current path is fine:
>
>  - leave existing --cluster infrastructure in place in the ceph code, but
>  - remove support for deploying daemons with custom cluster names from the
> deployment tools.
>
> This neatly avoids the systemd limitations for all but the most
> adventuresome admins and avoid the more common case of an admin falling
> into the "oh, I can name my cluster? cool! [...] oh, i have to add
> --cluster rover to every command? ick!" trap.
>

Yeah, that was me in 2012. Oops.

-Erik

> sage
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] removing cluster name support

2017-06-09 Thread Sage Weil
On Thu, 8 Jun 2017, Sage Weil wrote:
> Questions:
> 
>  - Does anybody on the list use a non-default cluster name?
>  - If so, do you have a reason not to switch back to 'ceph'?

It sounds like the answer is "yes," but not for daemons. Several users use 
it on the client side to connect to multiple clusters from the same host.

Nobody is colocating multiple daemons from different clusters on the same 
host.  Some have in the past but stopped.  If they choose to in the 
future, they can customize the systemd units themselves.

The rbd-mirror daemon has a similar requirement to talk to multiple 
clusters as a client.

This makes me conclude our current path is fine:

 - leave existing --cluster infrastructure in place in the ceph code, but
 - remove support for deploying daemons with custom cluster names from the 
deployment tools.

This neatly avoids the systemd limitations for all but the most 
adventuresome admins and avoids the more common case of an admin falling 
into the "oh, I can name my cluster? cool! [...] oh, i have to add 
--cluster rover to every command? ick!" trap.

sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] removing cluster name support

2017-06-09 Thread Vasu Kulkarni
On Fri, Jun 9, 2017 at 6:11 AM, Wes Dillingham
 wrote:
> Similar to Dan's situation we utilize the --cluster name concept for our
> operations. Primarily for "datamover" nodes which do incremental rbd
> import/export between distinct clusters. This is entirely coordinated by
> utilizing the --cluster option throughout.
>
> The way we set it up is that all clusters are actually named "ceph" on the
> mons and osds etc, but the clients themselves get /etc/ceph/clusterA.conf
> and /etc/ceph/clusterB.conf so that we can differentiate. I would like to
> see the functionality of clients being able to specify which conf file to
> read preserved.

The ceph.conf and keyring files can live in any location; the default
location is /etc/ceph, but one could use another location for clusterB.conf
(http://docs.ceph.com/docs/jewel/rados/configuration/ceph-conf/). At least
for a client which doesn't run any daemon, this should be sufficient to
make it talk to different clusters.
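
For example (a sketch with hypothetical paths):

ceph --conf=/etc/ceph/clusterB.conf \
     --keyring=/etc/ceph/clusterB.client.admin.keyring \
     health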

>
> As a note though we went the route of naming all clusters "ceph" to
> workaround difficulties in non-standard naming so this issue does need some
> attention.
It would be nice if you could add the steps to a tracker issue; they could
then be moved to the docs to help others follow the same procedure to rename
a cluster back to 'ceph'.


>
> On Fri, Jun 9, 2017 at 8:19 AM, Alfredo Deza  wrote:
>>
>> On Thu, Jun 8, 2017 at 3:54 PM, Sage Weil  wrote:
>> > On Thu, 8 Jun 2017, Bassam Tabbara wrote:
>> >> Thanks Sage.
>> >>
>> >> > At CDM yesterday we talked about removing the ability to name your
>> >> > ceph
>> >> > clusters.
>> >>
>> >> Just to be clear, it would still be possible to run multiple ceph
>> >> clusters on the same nodes, right?
>> >
>> > Yes, but you'd need to either (1) use containers (so that different
>> > daemons see a different /etc/ceph/ceph.conf) or (2) modify the systemd
>> > unit files to do... something.
>>
>> In the container case, I need to clarify that ceph-docker deployed
>> with ceph-ansible is not capable of doing this, since
>> the ad-hoc systemd units use the hostname as part of the identifier
>> for the daemon, e.g:
>>
>> systemctl enable ceph-mon@{{ ansible_hostname }}.service
>>
>>
>> >
>> > This is actually no different from Jewel. It's just that currently you
>> > can
>> > run a single cluster on a host (without containers) but call it 'foo'
>> > and
>> > knock yourself out by passing '--cluster foo' every time you invoke the
>> > CLI.
>> >
>> > I'm guessing you're in the (1) case anyway and this doesn't affect you
>> > at
>> > all :)
>> >
>> > sage
>> > --
>> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> > the body of a message to majord...@vger.kernel.org
>> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
>
> --
> Respectfully,
>
> Wes Dillingham
> wes_dilling...@harvard.edu
> Research Computing | Senior CyberInfrastructure Storage Engineer
> Harvard University | 38 Oxford Street, Cambridge, Ma 02138 | Room 102
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RGW radosgw-admin reshard bucket ends with ERROR: bi_list(): (4) Interrupted system call

2017-06-09 Thread Andreas Calminder
Hi,
I'm trying to reshard a rather large bucket (+13M objects) as per the
Red Hat documentation
(https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/2/html/object_gateway_guide_for_ubuntu/administration_cli#resharding-bucket-index)
to be able to delete it, the process starts and runs well until the
very end where it barfs out:

ERROR: bi_list(): (4) Interrupted system call

# radosgw-admin metadata get bucket:big_bucket
returns the old instance id and
# radosgw-admin metadata get bucket.instance:big_bucket:new_instance_id
returns, seemingly, a copy of the bucket but with shards and the new
instance id. So I seem to have a duplicate bucket instance id that
wasn't switched over to.

Is it possible to manually "attach" the bucket to the new instance id
and then carry on with the rest of the steps in the docs (bi purge the
old instance)? If not, how do I get rid of the duplicate?

Also, as the data in big_bucket isn't of any value I just want the
bucket and the objects gone, can I do this without affecting my
cluster performance?

Last time I tried to delete the bucket with --purge-objects it ran for
days, locking up OSDs left and right. If I remember correctly, write
actions to the index will block other operations on that particular OSD,
and when the index object is large, it'll take forever(tm).

regards,
Andreas
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] removing cluster name support

2017-06-09 Thread Wes Dillingham
Similar to Dan's situation we utilize the --cluster name concept for our
operations. Primarily for "datamover" nodes which do incremental rbd
import/export between distinct clusters. This is entirely coordinated by
utilizing the --cluster option throughout.

The way we set it up is that all clusters are actually named "ceph" on the
mons and osds etc, but the clients themselves get /etc/ceph/clusterA.conf
and /etc/ceph/clusterB.conf so that we can differentiate. I would like to
see the functionality of clients being able to specify which conf file to
read preserved.
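
In practice the datamover step looks roughly like this (a sketch with
made-up pool, image and snapshot names):

rbd --cluster clusterA export-diff --from-snap snap1 poolA/image1@snap2 - | \
    rbd --cluster clusterB import-diff - poolA/image1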

As a note, though, we went the route of naming all clusters "ceph" to
work around difficulties with non-standard naming, so this issue does need
some attention.

On Fri, Jun 9, 2017 at 8:19 AM, Alfredo Deza  wrote:

> On Thu, Jun 8, 2017 at 3:54 PM, Sage Weil  wrote:
> > On Thu, 8 Jun 2017, Bassam Tabbara wrote:
> >> Thanks Sage.
> >>
> >> > At CDM yesterday we talked about removing the ability to name your
> ceph
> >> > clusters.
> >>
> >> Just to be clear, it would still be possible to run multiple ceph
> >> clusters on the same nodes, right?
> >
> > Yes, but you'd need to either (1) use containers (so that different
> > daemons see a different /etc/ceph/ceph.conf) or (2) modify the systemd
> > unit files to do... something.
>
> In the container case, I need to clarify that ceph-docker deployed
> with ceph-ansible is not capable of doing this, since
> the ad-hoc systemd units use the hostname as part of the identifier
> for the daemon, e.g:
>
> systemctl enable ceph-mon@{{ ansible_hostname }}.service
>
>
> >
> > This is actually no different from Jewel. It's just that currently you
> can
> > run a single cluster on a host (without containers) but call it 'foo' and
> > knock yourself out by passing '--cluster foo' every time you invoke the
> > CLI.
> >
> > I'm guessing you're in the (1) case anyway and this doesn't affect you at
> > all :)
> >
> > sage
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majord...@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Respectfully,

Wes Dillingham
wes_dilling...@harvard.edu
Research Computing | Senior CyberInfrastructure Storage Engineer
Harvard University | 38 Oxford Street, Cambridge, Ma 02138 | Room 102
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD node type/count mixes in the cluster

2017-06-09 Thread David Turner
I ran a cluster with 2 generations of the same vendor hardware. 24 osd
supermicro and 32 osd supermicro (with faster/more RAM and CPU cores).  The
cluster itself ran decently well, but the load differences were drastic
between the 2 types of nodes. It required me to run the cluster with 2
separate config files for each type of node and was an utter PITA when
troubleshooting bottlenecks.
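
Roughly speaking (made-up values for illustration, not my actual configs),
the per-node-type split looked something like:

# [osd] fragment on the 24-osd nodes
[osd]
osd max backfills = 1
osd recovery max active = 1

# [osd] fragment on the 32-osd nodes (more CPU/RAM)
[osd]
osd max backfills = 3
osd recovery max active = 3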

Ultimately I moved around hardware and created a legacy cluster on the old
hardware and created a new cluster using the newer configuration.  In
general it was very hard to diagnose certain bottlenecks due to everything
just looking so different.  The primary one I encountered was snap trimming
due to deleting thousands of snapshots/day.

If you aren't pushing any limits of Ceph, you will probably be fine.  But
if you have a really large cluster, use a lot of snapshots, or are pushing
your cluster harder than the average user... Then I'd avoid mixing server
configurations in a cluster.

On Fri, Jun 9, 2017, 1:36 AM Deepak Naidu  wrote:

> Wanted to check if anyone has a ceph cluster with mixed-vendor servers, all
> with the same disk size (i.e. 8TB) but different disk counts: for example,
> 10 OSD servers from Dell with 60 disks per server and another 10 OSD servers
> from HP with 26 disks per server.
>
> If so, does that change any performance dynamics, or is it not advisable?
>
> --
> Deepak
>
> ---
> This email message is for the sole use of the intended recipient(s) and
> may contain
> confidential information.  Any unauthorized review, use, disclosure or
> distribution
> is prohibited.  If you are not the intended recipient, please contact the
> sender by
> reply email and destroy all copies of the original message.
>
> ---
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] removing cluster name support

2017-06-09 Thread Alfredo Deza
On Thu, Jun 8, 2017 at 3:54 PM, Sage Weil  wrote:
> On Thu, 8 Jun 2017, Bassam Tabbara wrote:
>> Thanks Sage.
>>
>> > At CDM yesterday we talked about removing the ability to name your ceph
>> > clusters.
>>
>> Just to be clear, it would still be possible to run multiple ceph
>> clusters on the same nodes, right?
>
> Yes, but you'd need to either (1) use containers (so that different
> daemons see a different /etc/ceph/ceph.conf) or (2) modify the systemd
> unit files to do... something.

In the container case, I need to clarify that ceph-docker deployed
with ceph-ansible is not capable of doing this, since
the ad-hoc systemd units use the hostname as part of the identifier
for the daemon, e.g:

systemctl enable ceph-mon@{{ ansible_hostname }}.service


>
> This is actually no different from Jewel. It's just that currently you can
> run a single cluster on a host (without containers) but call it 'foo' and
> knock yourself out by passing '--cluster foo' every time you invoke the
> CLI.
>
> I'm guessing you're in the (1) case anyway and this doesn't affect you at
> all :)
>
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] (no subject)

2017-06-09 Thread Steele, Tim
unsubscribe ceph-users

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] removing cluster name support

2017-06-09 Thread Tim Serong
On 06/09/2017 06:41 AM, Benjeman Meekhof wrote:
> Hi Sage,
> 
> We did at one time run multiple clusters on our OSD nodes and RGW
> nodes (with Jewel).  We accomplished this by putting code in our
> puppet-ceph module that would create additional systemd units with
> appropriate CLUSTER=name environment settings for clusters not named
> ceph.  IE, if the module were asked to configure OSD for a cluster
> named 'test' it would copy/edit the ceph-osd service to create a
> 'test-osd@.service' unit that would start instances with CLUSTER=test
> so they would point to the right config file, etc   Eventually on the
> RGW side I started doing instance-specific overrides like
> '/etc/systemd/system/ceph-rado...@client.name.d/override.conf' so as
> to avoid replicating the stock systemd unit.
> 
> We gave up on multiple clusters on the OSD nodes because it wasn't
> really that useful to maintain a separate 'test' cluster on the same
> hardware.  We continue to need ability to reference multiple clusters
> for RGW nodes and other clients. For the other example, users of our
> project might have their own Ceph clusters in addition to wanting to
> use ours.
> 
> If the daemon solution in the no-cluster-name future is to 'modify
> systemd unit files to do something', we're already doing that, so it's
> not a big issue.  However, the current approach of overriding
> CLUSTER in the environment section of systemd files does seem cleaner
> than overriding an exec command to specify a different config file
> and keyring path.  Maybe systemd units could ship with those
> arguments as variables for easy overriding.

systemd units can be templated/parameterized, but with only one
parameter, the instance ID, which we're already using
(ceph-mon@$(hostname), ceph-osd@$ID, etc.)
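
For reference, the drop-in override Ben describes would look roughly like
this (unit and instance names are illustrative):

# /etc/systemd/system/ceph-radosgw@client.rgw0.service.d/override.conf
[Service]
Environment=CLUSTER=test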

> 
> thanks,
> Ben
> 
> On Thu, Jun 8, 2017 at 3:37 PM, Sage Weil  wrote:
>> At CDM yesterday we talked about removing the ability to name your ceph
>> clusters.  There are a number of hurdles that make it difficult to fully
>> get rid of this functionality, not the least of which is that some
>> (many?) deployed clusters make use of it.  We decided that the most we can
>> do at this point is remove support for it in ceph-deploy and ceph-ansible
>> so that no new clusters or deployed nodes use it.
>>
>> The first PR in this effort:
>>
>> https://github.com/ceph/ceph-deploy/pull/441
>>
>> Background:
>>
>> The cluster name concept was added to allow multiple clusters to have
>> daemons coexist on the same host.  At the time it was a hypothetical
>> requirement for a user that never actually made use of it, and the
>> support is kludgey:
>>
>>  - default cluster name is 'ceph'
>>  - default config is /etc/ceph/$cluster.conf, so that the normal
>> 'ceph.conf' still works
>>  - daemon data paths include the cluster name,
>>  /var/lib/ceph/osd/$cluster-$id
>>which is weird (but mostly people are used to it?)
>>  - any CLI command that needs to touch a cluster with a non-default name
>> needs -C $name or --cluster $name passed to it.
>>
>> Also, as of jewel,
>>
>>  - systemd only supports a single cluster per host, as defined by $CLUSTER
>> in /etc/{sysconfig,default}/ceph
>>
>> which you'll notice removes support for the original "requirement".
>>
>> Also note that you can get the same effect by specifying the config path
>> explicitly (-c /etc/ceph/foo.conf) along with the various options that
>> substitute $cluster in (e.g., osd_data=/var/lib/ceph/osd/$cluster-$id).
>>
>>
>> Crap preventing us from removing this entirely:
>>
>>  - existing daemon directories for existing clusters
>>  - various scripts parse the cluster name out of paths
>>
>>
>> Converting an existing cluster "foo" back to "ceph":
>>
>>  - rename /etc/ceph/foo.conf -> ceph.conf
>>  - rename /var/lib/ceph/*/foo-* -> /var/lib/ceph/*/ceph-*
>>  - remove the CLUSTER=foo line in /etc/{default,sysconfig}/ceph
>>  - reboot
>>
>>
>> Questions:
>>
>>  - Does anybody on the list use a non-default cluster name?
>>  - If so, do you have a reason not to switch back to 'ceph'?
>>
>> Thanks!
>> sage
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 


-- 
Tim Serong
Senior Clustering Engineer
SUSE
tser...@suse.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Living with huge bucket sizes

2017-06-09 Thread Dan van der Ster
Hi Bryan,

On Fri, Jun 9, 2017 at 1:55 AM, Bryan Stillwell  wrote:
> This has come up quite a few times before, but since I was only working with
> RBD before I didn't pay too close attention to the conversation.  I'm
> looking
> for the best way to handle existing clusters that have buckets with a large
> number of objects (>20 million) in them.  The cluster I'm doing tests on is
> currently running hammer (0.94.10), so if things got better in jewel I would
> love to hear about it!
> ...
> Has anyone found a good solution for this for existing large buckets?  I
> know sharding is the solution going forward, but afaik it can't be done
> on existing buckets yet (although the dynamic resharding work mentioned
> on today's performance call sounds promising).

I haven't tried it myself, but 0.94.10 should have the (offline)
resharding feature. From the release notes:

> * In RADOS Gateway, it is now possible to reshard an existing bucket's index
> using an off-line tool.
>
> Usage:
>
> $ radosgw-admin bucket reshard --bucket=<bucket_name> --num_shards=<num_shards>
>
> This will create a new linked bucket instance that points to the newly created
> index objects. The old bucket instance still exists and currently it's up to
> the user to manually remove the old bucket index objects. (Note that bucket
> resharding currently requires that all IO (especially writes) to the specific
> bucket is quiesced.)
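
So on hammer the invocation would presumably be something like (bucket name
and shard count are placeholders):

$ radosgw-admin bucket reshard --bucket=my-big-bucket --num_shards=128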

-- Dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rados rm: device or resource busy

2017-06-09 Thread Jan Kasprzak
Hello,

Brad Hubbard wrote:
: I can reproduce this.
[...] 
: That's here where you will notice it is returning EBUSY which is error
: code 16, "Device or resource busy".
: 
: 
https://github.com/badone/ceph/blob/wip-ceph_test_admin_socket_output/src/cls/lock/cls_lock.cc#L189
: 
: In order to remove the existing parts of the file you should be able
: to just run "rados --pool testpool ls" and remove the listed objects
: belonging to "testfile".
: 
: Example:
: rados --pool testpool ls
: testfile.0004
: testfile.0001
: testfile.
: testfile.0003
: testfile.0005
: testfile.0002
: 
: rados --pool testpool rm testfile.
: rados --pool testpool rm testfile.0001
: ...

This works for me, thanks!
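
For many object parts, a loop like this (a sketch, using the pool and prefix
from the example above) saves the typing:

for obj in $(rados --pool testpool ls | grep '^testfile\.'); do
    rados --pool testpool rm "$obj"
done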

: Please open a tracker for this so it can be investigated further.

Done: http://tracker.ceph.com/issues/20233

-Yenya

-- 
| Jan "Yenya" Kasprzak  |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
> That's why this kind of vulnerability is a concern: deploying stuff is  <
> often about collecting an obscene number of .jar files and pushing them <
> up to the application server.  --pboddie at LWN <
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Effect of tunables on client system load

2017-06-09 Thread Nathanial Byrnes
Hi All,
   First, some background:
   I have been running a small (4 compute nodes) xen server cluster
backed by both a small ceph cluster (4 other nodes with a total of 18x
1-spindle OSDs) and a small gluster cluster (2 nodes, each with a 14-spindle
RAID array). I started with gluster 3-4 years ago, at first using NFS to
access gluster, then upgraded to the gluster FUSE client. However, I had been
fascinated with ceph since I first read about it, and probably added ceph as
soon as XCP released a kernel with RBD support, possibly approaching 2 years
ago.
   With Ceph, since I started out with the kernel RBD client, I believe it
locked me to Bobtail tunables. I connected to XCP via a project that tricks
XCP into running LVM on the RBDs, managing all this through the iSCSI mgmt
infrastructure somehow... Only recently I've switched to a newer project
that uses the RBD-NBD mapping instead. This should let me use whatever
tunables my client software supports, AFAIK. I have not yet changed my
tunables, as the data reorganization will probably take a day or two (only
1Gb networking...).

   Over this time period, I've observed that my gluster-backed guests tend
not to consume as much of domain-0's (the Xen VM management host's) resources
as my Ceph-backed guests do. To me, this is somewhat intuitive, as the ceph
client has to do more "thinking" than the gluster client. However, it seems
to me that the guest IO performance gap is larger than the difference in
spindle count alone would suggest. I am open to the notion that there are
probably quite a few sub-optimal design choices/constraints within the
environment. However, I don't have the resources to conduct all that many
experiments and benchmarks. So, over time I've ended up treating ceph as my
resilient storage (3x replication) and gluster as my more performant storage
(2x replication; and, as mentioned above, my gluster guests had quicker
guest IO and lower dom-0 load).

So, on to my questions:

   Would setting my tunables to jewel (my present release), or anything
newer than bobtail (which is what I think I am set to if I read the ceph
status warning correctly) reduce my dom-0 load and/or improve any aspects
of the client IO performance?
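
(For concreteness, I believe the change itself would be something like the
following, with the large data movement caveat mentioned above:)

ceph osd crush show-tunables   # check which profile/values are currently in effect
ceph osd crush tunables jewel  # switch profiles; expect a substantial rebalance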

   Will adding nodes to the ceph cluster reduce load on dom-0, and/or
improve client IO performance (I doubt the former and would expect the
latter...)?

   So, why did I bring up gluster at all? In an ideal world, I would like
to have just one storage environment that would satisfy all my
organization's needs. If forced to choose with the knowledge I have today, I
would have to select gluster. I am hoping to come up with some actionable
data points that might help me discover some of my mistakes which might
explain my experience to date and maybe even help remedy said mistakes. As
I mentioned earlier, I like ceph, more so than gluster, and would like to
employ more within my environment. But, given budgetary constraints, I need
to do what's best for my organization.

   Thanks in advance,
   Nate
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com