Re: [ceph-users] enterprise support

2019-07-15 Thread Brady Deetz
https://www.mirantis.com/software/ceph/

On Mon, Jul 15, 2019 at 2:53 PM Void Star Nill 
wrote:

> Hello,
>
> Other than Redhat and SUSE, are there other companies that provide
> enterprise support for Ceph?
>
> Thanks,
> Shridhar
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


Re: [ceph-users] Mimic 13.2.3?

2019-01-04 Thread Brady Deetz
I agree with the comments above. I don't feel comfortable upgrading because
I never know what's been deemed stable. We used to get an announcement at
the same time the packages hit the repo. What's going on? Frankly,
the entire release cycle of Mimic has seemed very haphazard.

On Fri, Jan 4, 2019 at 10:22 AM Daniel Baumann 
wrote:

> On 01/04/2019 05:07 PM, Matthew Vernon wrote:
> > how is it still the case that packages are being pushed onto the
> official ceph.com repos that people
> > shouldn't install?
>
> We're still on 12.2.5 because of this. Basically every 12.2.x after that
> had notes on the mailing list like "don't use, wait for ..."
>
> I don't dare update to 13.2.
>
> For the 10.2.x and 11.2.x cycles, we upgraded our production cluster
> within a matter of days after the release of an update. Since the second
> half of the 12.2.x releases, this no longer seems possible.
>
> Ceph is great and all, but this decrease of release quality seriously
> harms the image and perception of Ceph as a stable software platform in
> the enterprise environment and makes people do the wrong things (rotting
> systems update-wise, for the sake of stability).
>
> Regards,
> Daniel
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


Re: [ceph-users] Supermicro server 5019D8-TR12P for new Ceph cluster

2018-11-13 Thread Brady Deetz
I'd ensure the I/O performance you expect can be achieved. If your scopes
create tons of small files, you may have a problem. You mentioned 10TB/day.
But what is the scope's expectation with regard to dumping the data to
network storage? For instance, does the scope function normally while it is
transferring data?
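
A rough way to sanity-check the small-file side before committing to the
hardware is a short rados bench run with a small object size against a
throwaway pool (the pool name and PG count below are placeholders, not from
this thread):

# disposable test pool
ceph osd pool create bench 64 64
# 60s of 4KiB-object writes, 16 concurrent ops, keep objects for the read pass
rados bench -p bench 60 write -b 4096 -t 16 --no-cleanup
# random reads against what was just written
rados bench -p bench 60 rand -t 16
# remove the benchmark objects (delete the pool afterwards if desired)
rados -p bench cleanup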

On Tue, Nov 13, 2018, 8:15 AM Ashley Merrick wrote:
> I'd say those CPUs should be more than fine for your use case and
> requirements then.
>
> You have more than one thread per OSD, which seems to be the ongoing
> recommendation.
>
> On Tue, 13 Nov 2018 at 10:12 PM, Michal Zacek  wrote:
>
>> Hi,
>>
>> The server supports up to 128GB RAM, so upgrading the RAM will not be a problem.
>> The storage will be used for storing data from microscopes. Users will
>> download data from the storage to a local PC, make some changes, and then
>> upload the data back to the storage. We want to use the cluster for direct
>> computing in the future, but for now we only need to separate the microscope
>> data from normal office data. We are expecting up to 10TB
>> upload/download per day.
>>
>> Michal
>> Dne 13. 11. 18 v 14:50 Ashley Merrick napsal(a):
>>
>> Not sure about the CPU, but I would definitely suggest more than 64GB of RAM.
>>
>> With the next release of Mimic the default memory target will be 4GB per
>> OSD (if I am correct). This only includes the BlueStore layer, so I'd
>> easily expect to see you getting close to 64GB after OS caches etc., and
>> the last thing you want on a Ceph OSD box is an OOM.
>>
>> Are you looking at near-cold storage for these photos? Or is it
>> storage for designers working out of programs which require low latency and
>> quick performance?
>>
>> On Tue, Nov 13, 2018 at 9:43 PM Michal Zacek  wrote:
>>
>>> Hello,
>>>
>>> what do you think about this Supermicro server:
>>> http://www.supermicro.com/products/system/1U/5019/SSG-5019D8-TR12P.cfm
>>> ? We are considering eight or ten servers, each with twelve 10TB SATA
>>> drives, one M.2 SSD, and 64GB RAM. Public and cluster networks will be
>>> 10Gbit/s. The question is whether one Intel Xeon D-2146NT with eight cores
>>> (16 with HT) will be enough for 12 SATA disks. The cluster will be used for
>>> storing pictures. File sizes from 1MB to 2TB ;-).
>>>
>>> Thanks,
>>> Michal
>>>
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


[ceph-users] Intel S2600STB issues on new cluster

2018-10-31 Thread Brady Deetz
Not directly related to Ceph, so I apologize for being off topic. Is
anybody else running S2600STB with Skylake and experiencing issues?

I'm standing up a new cluster based on this mobo with Mellanox ConnectX-4
and Avago/Broadcom 9305-24i and experiencing extreme performance issues.
I'm not seeing any difference between the latest elrepo kernel and the
default CentOS 7.5 kernel. I'm running the next-to-latest BIOS. I'm
currently working on upgrading the HBA firmware/BIOS and the Mellanox
firmware. But this feels more like a mobo issue than a driver issue with
the HBA or network.

Any thoughts?
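
In case it helps anyone triaging similar hardware, a rough way to separate
the board/HBA/network from Ceph is to baseline each path on its own. A sketch
(host and device names are placeholders; the fio run is destructive, so only
point it at a disk with no data on it):

# raw network throughput between two of the hosts
iperf3 -s                           # on the first host
iperf3 -c <first-host> -P 4 -t 30   # on the second host
# raw sequential write to one HBA-attached disk (DESTROYS data on /dev/sdX)
fio --name=seqwrite --filename=/dev/sdX --rw=write --bs=4M \
--ioengine=libaio --iodepth=32 --direct=1 --runtime=60 --time_based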


Re: [ceph-users] odd osd id in ceph health

2018-10-24 Thread Brady Deetz
...then you need at least ten hosts.
> On Wed, Oct 24, 2018 at 9:39 PM Brady Deetz  wrote:
> >
> > My cluster (v12.2.8) is currently recovering and I noticed this odd OSD
> ID in ceph health detail:
> > "2147483647"
> >
> > [ceph-admin@admin libr-cluster]$ ceph health detail | grep 2147483647
> > pg 50.c3 is stuck undersized for 148638.689866, current state
> active+recovery_wait+undersized+degraded+remapped, last acting
> [275,282,330,25,154,98,239,2147483647,75,49]
> > pg 50.d4 is stuck undersized for 148638.649657, current state
> active+recovery_wait+undersized+degraded+remapped, last acting
> [239,275,307,49,184,25,281,2147483647,283,378]
> > pg 50.10b is stuck undersized for 148638.666901, current state
> active+undersized+degraded+remapped+backfill_wait, last acting
> [131,192,283,308,169,258,2147483647,75,306,25]
> > pg 50.110 is stuck undersized for 148638.684818, current state
> active+recovery_wait+undersized+degraded+remapped, last acting
> [169,377,2147483647,2,274,47,306,192,131,283]
> > pg 50.116 is stuck undersized for 148638.703043, current state
> active+recovery_wait+undersized+degraded+remapped, last acting
> [99,283,168,47,71,400,2147483647,108,239,2]
> > pg 50.121 is stuck undersized for 148638.700838, current state
> active+undersized+degraded+remapped+backfill_wait, last acting
> [71,2,75,307,286,73,168,2147483647,376,25]
> > pg 50.12a is stuck undersized for 145362.808035, current state
> active+undersized+degraded+remapped+backfill_wait, last acting
> [71,378,169,2147483647,192,308,131,108,239,97]
> >
> >
> > [ceph-admin@admin libr-cluster]$ ceph osd metadata 2147483647
> > Error ENOENT: osd.2147483647 does not exist
> >
> > Is this expected? If not, what should I do?
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


[ceph-users] odd osd id in ceph health

2018-10-24 Thread Brady Deetz
My cluster (v12.2.8) is currently recovering and I noticed this odd OSD ID
in ceph health detail:
"2147483647"

[ceph-admin@admin libr-cluster]$ ceph health detail | grep 2147483647
pg 50.c3 is stuck undersized for 148638.689866, current state
active+recovery_wait+undersized+degraded+remapped, last acting
[275,282,330,25,154,98,239,2147483647,75,49]
pg 50.d4 is stuck undersized for 148638.649657, current state
active+recovery_wait+undersized+degraded+remapped, last acting
[239,275,307,49,184,25,281,2147483647,283,378]
pg 50.10b is stuck undersized for 148638.666901, current state
active+undersized+degraded+remapped+backfill_wait, last acting
[131,192,283,308,169,258,2147483647,75,306,25]
pg 50.110 is stuck undersized for 148638.684818, current state
active+recovery_wait+undersized+degraded+remapped, last acting
[169,377,2147483647,2,274,47,306,192,131,283]
pg 50.116 is stuck undersized for 148638.703043, current state
active+recovery_wait+undersized+degraded+remapped, last acting
[99,283,168,47,71,400,2147483647,108,239,2]
pg 50.121 is stuck undersized for 148638.700838, current state
active+undersized+degraded+remapped+backfill_wait, last acting
[71,2,75,307,286,73,168,2147483647,376,25]
pg 50.12a is stuck undersized for 145362.808035, current state
active+undersized+degraded+remapped+backfill_wait, last acting
[71,378,169,2147483647,192,308,131,108,239,97]


[ceph-admin@admin libr-cluster]$ ceph osd metadata 2147483647
Error ENOENT: osd.2147483647 does not exist

Is this expected? If not, what should I do?
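
For context on the number itself: 2147483647 is 2^31-1, the placeholder CRUSH
reports when it cannot find an OSD for one slot of the acting set (i.e. the PG
is missing a shard rather than referencing a real OSD), which is why "osd
metadata" says it does not exist. Since these acting sets are 10 wide, a hedged
first check is whether the pool's EC profile and CRUSH rule can actually be
satisfied by the current hosts (pool/profile/rule names below are placeholders):

ceph osd pool get <pool> erasure_code_profile
ceph osd erasure-code-profile get <profile>
ceph osd pool get <pool> crush_rule
ceph osd crush rule dump <rule>
# compare k+m and the failure domain against what the tree actually has
ceph osd tree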


Re: [ceph-users] add existing rbd to new tcmu iscsi gateways

2018-10-10 Thread Brady Deetz
Looks like that may have recently been broken.

Unfortunately no real logs of use in rbd-target-api.log or
rbd-target-gw.log. Is there an increased log level I can enable for
whatever web-service is handling this?

[root@dc1srviscsi01 ~]# rbd -p vmware_ssd_metadata --data-pool vmware_ssd
--size 2T create ssd_test_0

[root@dc1srviscsi01 ~]# rbd -p vmware_ssd_metadata info ssd_test_0
rbd image 'ssd_test_0':
size 2 TiB in 524288 objects
order 22 (4 MiB objects)
id: b4343f6b8b4567
data_pool: vmware_ssd
block_name_prefix: rbd_data.56.b4343f6b8b4567
format: 2
features: layering, exclusive-lock, object-map, fast-diff,
deep-flatten, data-pool
op_features:
flags:
create_timestamp: Wed Oct 10 16:36:18 2018

[root@dc1srviscsi01 ~]# gwcli
/> cd disks/
/disks> create pool=vmware_ssd_metadata
image=vmware_ssd_metadata.ssd_test_0 size=2T
Failed : 500 INTERNAL SERVER ERROR
/disks>


[root@dc1srviscsi02 ~]# rpm -qa |egrep "ceph|iscsi|tcmu|rst|kernel-ml"
ceph-mds-13.2.2-0.el7.x86_64
ceph-mgr-13.2.2-0.el7.x86_64
ceph-release-1-1.el7.noarch
tcmu-runner-1.4.0-1.el7.x86_64
ceph-iscsi-cli-2.7-54.g9b18a3b.el7.noarch
ceph-common-13.2.2-0.el7.x86_64
ceph-mon-13.2.2-0.el7.x86_64
ceph-13.2.2-0.el7.x86_64
tcmu-runner-debuginfo-1.4.0-1.el7.x86_64
ceph-iscsi-config-2.6-42.gccca57d.el7.noarch
libcephfs2-13.2.2-0.el7.x86_64
ceph-base-13.2.2-0.el7.x86_64
ceph-osd-13.2.2-0.el7.x86_64
ceph-radosgw-13.2.2-0.el7.x86_64
kernel-ml-4.18.12-1.el7.elrepo.x86_64
python-cephfs-13.2.2-0.el7.x86_64
ceph-selinux-13.2.2-0.el7.x86_64
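
One place that sometimes has more detail than the two log files above is the
systemd journal for the API and gateway services (a guess at where to look,
not a confirmed fix):

journalctl -u rbd-target-api -f
journalctl -u rbd-target-gw -f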



On Tue, Oct 9, 2018 at 3:51 PM Jason Dillaman  wrote:

> On Tue, Oct 9, 2018 at 3:14 PM Brady Deetz  wrote:
> >
> > I am attempting to migrate to the new tcmu iscsi gateway. Is there a way
> to configure gwcli to export an rbd that was created outside gwcli?
>
> You should be able to just run "/disks create <pool>.<image> <size>"
> from within "gwcli" to have it add an existing image.
>
> > This is necessary for me because I have a lun exported from an old LIO
> gateway to a Windows host that I need to transition to the new tcmu based
> cluster.
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> --
> Jason
>


Re: [ceph-users] tcmu iscsi (failover not supported)

2018-10-10 Thread Brady Deetz
Thanks Jason,
That got us running. We'll see how it goes.

On Wed, Oct 10, 2018 at 2:41 PM Jason Dillaman  wrote:

> The latest master branch version on shaman should be functional:
>
> [1] https://shaman.ceph.com/repos/ceph-iscsi-config/
> [2] https://shaman.ceph.com/repos/ceph-iscsi-cli
> [3] https://shaman.ceph.com/repos/tcmu-runner/
>
> On Wed, Oct 10, 2018 at 3:39 PM Brady Deetz  wrote:
> >
> > Here's where we are now.
> >
> > By cherry-picking that patch into ceph-iscsi-config tags/v2.6 and
> cleaning up the merge conflicts, the rbd-target-gw service would not start.
> >
> > With the release of ceph-iscsi-config v2.6 (no cherry picked commits)
> and tcmu-runner v1.3.0 the originally described errors still exist.
> >
> > Is there a known working combination of releases of rtslib-fb,
> targetcli-fb, ceph-iscsi-config, ceph-iscsi-cli, and tcmu-runner?
> >
> > On Wed, Oct 10, 2018 at 1:31 PM Mike Christie 
> wrote:
> >>
> >> On 10/10/2018 01:13 PM, Brady Deetz wrote:
> >> > ceph-iscsi-config v2.6 https://github.com/ceph/ceph-iscsi-config.git
> >>
> >> 
> >>
> >> > Ignore that. ceph-iscsi-config 2.6 enabled explicit alua in
> anticipation
> >> > for the tcmu-runner support. We are about to release 2.7 which
> matches
> >> > tcmu-runner 1.4.0.
> >> >
> >>
> >> You need this patch which sets the failover type back to implicit to
> >> match tcmu-runner 1.4.0 and also makes it configurable for future
> versions:
> >>
> >> commit 8d66492b8c7134fb37b72b5e8e77d7c8109220d9
> >> Author: Mike Christie 
> >> Date:   Mon Jul 23 15:45:09 2018 -0500
> >>
> >> Allow alua failover type to be configurable
> >>
> >> in the ceph-iscsi-config git tree master branch. It will be in
> >> ceph-iscsi-config 2.7 that we are trying to finish up by Friday.
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> --
> Jason
>


Re: [ceph-users] tcmu iscsi (failover not supported)

2018-10-10 Thread Brady Deetz
Here's where we are now.

By cherry-picking that patch into ceph-iscsi-config tags/v2.6 and cleaning
up the merge conflicts, the rbd-target-gw service would not start.

With the release of ceph-iscsi-config v2.6 (no cherry picked commits) and
tcmu-runner v1.3.0 the originally described errors still exist.

Is there a known working combination of releases of rtslib-fb,
targetcli-fb, ceph-iscsi-config, ceph-iscsi-cli, and tcmu-runner?

On Wed, Oct 10, 2018 at 1:31 PM Mike Christie  wrote:

> On 10/10/2018 01:13 PM, Brady Deetz wrote:
> > ceph-iscsi-config v2.6 https://github.com/ceph/ceph-iscsi-config.git
>
> 
>
> > Ignore that. ceph-iscsi-config 2.6 enabled explicit alua in
> anticipation
> > for the tcmu-runner support. We are about to release 2.7 which
> matches
> > tcmu-runner 1.4.0.
> >
>
> You need this patch which sets the failover type back to implicit to
> match tcmu-runner 1.4.0 and also makes it configurable for future versions:
>
> commit 8d66492b8c7134fb37b72b5e8e77d7c8109220d9
> Author: Mike Christie 
> Date:   Mon Jul 23 15:45:09 2018 -0500
>
> Allow alua failover type to be configurable
>
> in the ceph-iscsi-config git tree master branch. It will be in
> ceph-iscsi-config 2.7 that we are trying to finish up by Friday.
>


Re: [ceph-users] Does anyone use interactive CLI mode?

2018-10-10 Thread Brady Deetz
I run 2 clusters and have never purposely executed the interactive CLI. I
say remove the code bloat.

On Wed, Oct 10, 2018 at 9:20 AM John Spray  wrote:

> Hi all,
>
> Since time immemorial, the Ceph CLI has had a mode where when run with
> no arguments, you just get an interactive prompt that lets you run
> commands without "ceph" at the start.
>
> I recently discovered that we actually broke this in Mimic[1], and it
> seems that nobody noticed!
>
> So the question is: does anyone actually use this feature?  It's not
> particularly expensive to maintain, but it might be nice to have one
> less path through the code if this is entirely unused.
>
> Cheers,
> John
>
> 1. https://github.com/ceph/ceph/pull/24521
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


Re: [ceph-users] tcmu iscsi (failover not supported)

2018-10-10 Thread Brady Deetz
dc1srviscsi02]
| o- lun 1
..
[rbd.EyeTracker1_1(60G), Owner: dc1srviscsi01]


On Wed, Oct 10, 2018 at 1:01 PM Mike Christie  wrote:

> On 10/10/2018 12:40 PM, Mike Christie wrote:
> > On 10/09/2018 05:09 PM, Brady Deetz wrote:
> >> I'm trying to replace my old single point of failure iscsi gateway with
> >> the shiny new tcmu-runner implementation. I've been fighting a Windows
> >> initiator all day. I haven't tested any other initiators, as Windows is
> >> currently all we use iscsi for.
> >>
> >> One issue I've considered is our Ceph cluster is running 12.2.8 but I
> >> built my iscsi gateways against 13.2.2 since we will be moving to mimic
> >> within the next month or so.
> >>
> >> I compiled tcmu-runner with default options against 13.2.2 on a fresh
> >> fully updated version of centos 7.5.1804 with elrepo kernel 4.18.12-1.
> >>
> >> syslog:
> >> Oct  9 16:55:31 dc1srviscsi01 tcmu-runner: tcmu_get_alua_grp:225
> >> rbd/rbd.test_0: Unsupported alua_access_type: Implicit and Explicit
> >> failover not supported.
> >
> > We do not yet support explicit failover. Did you use targetcli directly
> > to set this up or did you use the ceph-iscsi tools?
> >
> > If you are using targetlci then you need to set alua_access_type to 1.
> > In the rc releases for 1.4.0 we had explicit enabled but there were too
> > many bugs and never got to fully QA it so for the final release it was
> > disabled.
> >
> > If you used the ceph-iscsi tools did you use ansible or gwcli and what
> > versions of ceph-iscsi-config and ceph-iscsi-cli or ceph-ansible?
>
> Ignore that. ceph-iscsi-config 2.6 enabled explicit alua in anticipation
> for the tcmu-runner support. We are about to release 2.7 which matches
> tcmu-runner 1.4.0.
>


[ceph-users] tcmu iscsi (failover not supported)

2018-10-09 Thread Brady Deetz
I'm trying to replace my old single point of failure iscsi gateway with the
shiny new tcmu-runner implementation. I've been fighting a Windows
initiator all day. I haven't tested any other initiators, as Windows is
currently all we use iscsi for.

One issue I've considered is our Ceph cluster is running 12.2.8 but I built
my iscsi gateways against 13.2.2 since we will be moving to mimic within
the next month or so.

I compiled tcmu-runner with default options against 13.2.2 on a fresh fully
updated version of centos 7.5.1804 with elrepo kernel 4.18.12-1.

syslog:
Oct  9 16:55:31 dc1srviscsi01 tcmu-runner: tcmu_get_alua_grp:225
rbd/rbd.test_0: Unsupported alua_access_type: Implicit and Explicit
failover not supported.
Oct  9 16:55:31 dc1srviscsi01 tcmu-runner: tcmu_get_alua_grps:319
rbd/rbd.test_0: Could not get alua group ao.
Oct  9 16:55:31 dc1srviscsi01 tcmu-runner: tcmu_get_alua_grp:225
rbd/rbd.test_0: Unsupported alua_access_type: Implicit and Explicit
failover not supported.
Oct  9 16:55:31 dc1srviscsi01 tcmu-runner: tcmu_get_alua_grps:319
rbd/rbd.test_0: Could not get alua group ao.
Oct  9 16:55:31 dc1srviscsi01 tcmu-runner: tcmu_get_alua_grp:225
rbd/rbd.test_0: Unsupported alua_access_type: Implicit and Explicit
failover not supported.
Oct  9 16:55:31 dc1srviscsi01 tcmu-runner: tcmu_get_alua_grps:319
rbd/rbd.test_0: Could not get alua group ao.
O

Any thoughts would be appreciated.


[ceph-users] add existing rbd to new tcmu iscsi gateways

2018-10-09 Thread Brady Deetz
I am attempting to migrate to the new tcmu iscsi gateway. Is there a way to
configure gwcli to export an rbd that was created outside gwcli?

This is necessary for me because I have a lun exported from an old LIO
gateway to a Windows host that I need to transition to the new tcmu based
cluster.


Re: [ceph-users] cephfs issue with moving files between data pools gives Input/output error

2018-10-01 Thread Brady Deetz
I have a python script that is migrating my data from replicated to ec
pools for cephfs on files that haven't been accessed in a while. My process
involves setting the data_pool recursively for an existing replicated dir
to the new ec pool, copying the existing replicated file to a temporary
file in the same directory, then moving/renaming the ec file over the
replicated file. Ceph does correctly handle discarding the replicated file
data from the replicated pool.

Since mv operations are based on inode, you can't simply perform a mv to
migrate data to a new pool. Obviously it would be nice if Ceph was smart
enough to do this for us in the backend, but I feel like it's moderately
reasonable for it not to.
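
For anyone wanting the shape of that process, a minimal shell sketch, assuming
the directory layout has already been switched and ignoring error handling
(paths and pool names are placeholders):

TARGET_DIR=/mnt/cephfs/archive
TARGET_POOL=cephfs_ec_data
# new files created under this directory will land in the EC pool
setfattr -n ceph.dir.layout.pool -v "${TARGET_POOL}" "${TARGET_DIR}"
# rewrite one existing file so its objects move to the new pool
f="${TARGET_DIR}/somefile"
cp -p "${f}" "${f}.ec"   # the copy inherits the directory's new layout
mv "${f}.ec" "${f}"      # rename over the original; the old objects get purged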

On Mon, Oct 1, 2018 at 3:13 PM Gregory Farnum  wrote:

> On Mon, Oct 1, 2018 at 12:43 PM Marc Roos 
> wrote:
>
>> Hmmm, did not know that, so it becomes a soft link or so?
>>
>> totally new for me, also not what I would expect of a mv on a fs. I know
>> this is normal to expect coping between pools, also from the s3cmd
>> client. But I think more people will not expect this behaviour. Can't
>> the move be implemented as a move?
>>
>> How can users even know about what folders have a 'different layout'.
>> What happens if we export such mixed pool filesystem via smb. How would
>> smb deal with the 'move' between those directories?
>>
>
> Since the CephX permissions are thoroughly outside of POSIX, handling this
> is unfortunately just your problem. :(
>
> Consider it the other way around — what if a mv *did* copy the file data
> into a new pool, and somebody who had the file open was suddenly no longer
> able to access it? There's no feasible way for us to handle that with rules
> that fall inside of POSIX; what we have now is better.
>
> John's right; it would be great if we could do a server-side "re-stripe"
> or "re-layout" or something, but that will also be an "outside POSIX"
> operation and never the default.
> -Greg
>
>
>>
>>
>>
>>
>> -Original Message-
>> From: Gregory Farnum [mailto:gfar...@redhat.com]
>> Sent: maandag 1 oktober 2018 21:28
>> To: Marc Roos
>> Cc: ceph-users; jspray; ukernel
>> Subject: Re: [ceph-users] cephfs issue with moving files between data
>> pools gives Input/output error
>>
>> Moving a file into a directory with a different layout does not, and is
>> not intended to, copy the underlying file data into a different pool
>> with the new layout. If you want to do that you have to make it happen
>> yourself by doing a copy.
>>
>> On Mon, Oct 1, 2018 at 12:16 PM Marc Roos 
>> wrote:
>>
>>
>>
>> I will explain the test again, I think you might have some bug in
>> your
>> cephfs copy between data pools.
>>
>> c04 has mounted the root cephfs
>> /a (has data pool a, ec21)
>> /test (has data pool b, r1)
>>
>> test2 has mounted
>> /m  (nfs mount of cephfs /a)
>> /m2 (cephfs mount of /a)
>>
>> Creating the test file.
>> [root@c04 test]# echo asdfasdfasdfasdfasdfasdfasdfasdfasdf >
>> testfile.txt
>>
>> Then I am moving on c04 the test file from the test folder(pool
>> b)
>> to
>> the a folder/pool
>>
>> Now on test2
>> [root@test2 m]# ls -arlt
>> -rw-r--r--  1 nobody nobody21 Oct  1 20:48 r1.txt
>> -rw-r--r--  1 nobody nobody21 Oct  1 20:49 r1-copy.txt
>> -rw-r--r--  1 nobody nobody37 Oct  1 21:02
>> testfile.txt
>>
>> [root@test2 /]# cat /mnt/m/testfile.txt
>> cat: /mnt/m/old/testfile.txt: Input/output error
>>
>> [root@test2 /]# cat /mnt/m2/testfile.txt
>> cat: /mnt/m2/old/testfile.txt: Operation not permitted
>>
>> Now I am creating a copy of the test file in the same directory
>> back on
>> c04
>>
>> [root@c04 a]# cp testfile.txt testfile-copy.txt
>> [root@c04 a]# ls -alrt
>> -rw-r--r-- 1 root root 21 Oct  1 20:49 r1-copy.txt
>> -rw-r--r-- 1 root root 37 Oct  1 21:02 testfile.txt
>> -rw-r--r-- 1 root root 37 Oct  1 21:07 testfile-copy.txt
>>
>> Now I trying to access the copy of testfile.txt back on test2
>> (without
>> unmounting, or changing permissions)
>>
>> [root@test2 /]# cat /mnt/m/testfile-copy.txt
>> asdfasdfasdfasdfasdfasdfasdfasdfasdf
>> [root@test2 /]# cat /mnt/m2/testfile-copy.txt
>> asdfasdfasdfasdfasdfasdfasdfasdfasdf
>>
>>
>>
>>
>>
>>
>>
>> -Original Message-
>> From: Yan, Zheng [mailto:uker...@gmail.com]
>> Sent: zaterdag 29 september 2018 6:55
>> To: Marc Roos
>> Subject: Re: [ceph-users] cephfs issue with moving files between
>> data
>> pools gives Input/output error
>>
>> check_pool_perm on pool 30 ns  need Fr, but no read perm
>>
>> client does not permission to read the pool.  ceph-fuse did
>> return
>> EPERM
>> for the kernel readpage 

Re: [ceph-users] Omap warning in 12.2.6

2018-07-19 Thread Brady Deetz
12.2.6 has a regression. See "v12.2.7 Luminous released" and all of the
related disaster posts. Also in the release notes for .7 is a bug
disclosure for 12.2.5 that affects RGW users pretty badly during upgrade.
You might take a look there.
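
For tracking down which object actually tripped the warning, a hedged starting
point is the cluster log, since the OSD that finds it during deep scrub reports
the object name and key count there (the bucket name below is a placeholder):

# on a mon host
grep -i "large omap object" /var/log/ceph/ceph.log
# for rgw, per-bucket stats help identify an oversized bucket index
radosgw-admin bucket stats --bucket=<bucket-name>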

On Thu, Jul 19, 2018 at 2:13 PM Brent Kennedy  wrote:

> I just upgraded our cluster to 12.2.6 and now I see this warning about 1
> large omap object.  I looked and it seems this warning was just added in
> 12.2.6.  I found a few discussions on what is was but not much information
> on addressing it properly.  Our cluster uses rgw exclusively with just a
> few buckets in the .rgw.buckets pool.  Our largest bucket has millions of
> objects in it.
>
>
>
> Any thoughts or links on this?
>
>
>
>
>
> Regards,
>
> -Brent
>
>
>
> Existing Clusters:
>
> Test: Luminous 12.2.6 with 3 osd servers, 1 mon/man, 1 gateway ( all
> virtual )
>
> US Production: Firefly with 4 osd servers, 3 mons, 3 gateways behind
> haproxy LB
>
> UK Production: Luminous 12.2.6 with 8 osd servers, 3 mons/man, 3 gateways
> behind haproxy LB
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


[ceph-users] is upgrade from 12.2.5 to 12.2.7 an emergency for EC users

2018-07-18 Thread Brady Deetz
I'm trying to determine if I need to perform an emergency update on my 2PB
CephFS environment running on EC.

What triggers the corruption bug? Is it only at the time of an OSD restart
before data is quiesced?

When do you know if corruption has occurred? deep-scrub?
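
For checking what has been flagged so far (this only shows inconsistencies that
scrubbing has already recorded; it says nothing about what the bug can or
cannot hit), a sketch:

ceph health detail                      # PGs with scrub errors
rados list-inconsistent-pg <pool>       # PGs with recorded inconsistencies
rados list-inconsistent-obj <pgid> --format=json-pretty
ceph pg deep-scrub <pgid>               # force a fresh deep scrub of one PG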


Re: [ceph-users] Approaches for migrating to a much newer cluster

2018-07-13 Thread Brady Deetz
Just a thought: have you considered rbd replication?
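
For the export-diff approach described below, the commonly cited pattern is
roughly the following (pool, image, snapshot, and host names are placeholders;
this assumes the old cluster supports the snapshots involved):

# one-time full copy of a base snapshot to the new cluster
rbd -p volumes snap create image1@base
rbd -p volumes export image1@base - | ssh new-cluster rbd -p volumes import - image1
ssh new-cluster rbd -p volumes snap create image1@base
# later, ship only the delta accumulated since @base
rbd -p volumes snap create image1@cutover
rbd -p volumes export-diff --from-snap base image1@cutover - | \
ssh new-cluster rbd -p volumes import-diff - image1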

On Fri, Jul 13, 2018 at 9:30 AM r...@cleansafecloud.com <
r...@cleansafecloud.com> wrote:

>
> Hello folks,
>
> We have an old active Ceph cluster on Firefly (v0.80.9) which we use for
> OpenStack and have multiple live clients. We have been put in a position
> whereby we need to move to a brand new cluster under a new OpenStack
> deployment. The new cluster is on Luminous (v.12.2.5). Now we obviously do
> not want to migrate huge images across in one go if we can avoid it, so our
> current plan is to transfer base images well in advance of the migration,
> and use the rbd export-diff feature to apply incremental updates from that
> point forwards. I wanted to reach out to you experts to see if we are going
> down the right path here, what issues we might encounter, or if there might
> be any better options. Or does this sound like the right approach?
>
> Many thanks,
> Rob
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


[ceph-users] mimic cephfs snapshot in active/standby mds env

2018-06-06 Thread Brady Deetz
I've seen several mentions of stable snapshots in Mimic for cephfs in
multi-active mds environments. I'm currently running active/standby in
12.2.5 with no snapshots. If I upgrade to Mimic, is there any concern with
snapshots in an active/standby MDS environment? It seems like a silly
question since it is considered stable for multi-mds, but better safe than
sorry.
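
One related detail, if I recall correctly: snapshots still have to be enabled
per filesystem in Mimic before new ones can be taken. A sketch (the filesystem
name and path are placeholders):

ceph fs set <fs_name> allow_new_snaps true
# snapshots are then created by making a directory under .snap
mkdir /mnt/cephfs/somedir/.snap/before-upgrade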


Re: [ceph-users] Ceph EC profile, how are you using?

2018-06-02 Thread Brady Deetz
One thing I'd love to see are benchmarks for EC profiles in different
host/network configurations. We're building a new cluster and I will
definitely be putting together an extensive benchmark panel this
time around so that we know exactly what works best in what situations.
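
A sketch of one such comparison, using a throwaway EC pool (the profile values
and pool name are only examples):

ceph osd erasure-code-profile set ec42 k=4 m=2 crush-failure-domain=host
ceph osd pool create ecbench 128 128 erasure ec42
rados bench -p ecbench 60 write --no-cleanup
rados bench -p ecbench 60 seq
rados -p ecbench cleanup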

On Sat, Jun 2, 2018, 4:13 AM Marc Roos  wrote:

>
> It would be nicer to keep such things on the mailinglist for future
> reference external links expire etc.
>
>
>
>
>
> -Original Message-
> From: Vasu Kulkarni [mailto:vakul...@redhat.com]
> Sent: vrijdag 1 juni 2018 18:51
> To: ceph-users
> Subject: Re: [ceph-users] Ceph EC profile, how are you using?
>
> Thanks to those who have added their config,  Request anyone in list
> using EC profile in production to add high level config which will be
> helpful for tests.
>
> Thanks
>
> On Wed, May 30, 2018 at 12:16 PM, Vasu Kulkarni 
> wrote:
> > Hello Ceph Users,
> >
> > I would like to know how folks are using EC profile in the production
> > environment, what kind of EC configurations are you using (10+4, 5+3 ?
> > ) with other configuration options, If you can reply to this thread or
>
> > update in the shared excel sheet below that will help design better
> > tests that are run on nightly basis.
> >
> > https://docs.google.com/spreadsheets/d/1B7WLM3_6nV_DMf18POI7cWLWx6_vQJ
> > ABVC2-bbglNEM/edit?usp=sharing
> >
> > Thanks
> > Vasu
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


Re: [ceph-users] ceph , VMWare , NFS-ganesha

2018-05-28 Thread Brady Deetz
You might look into open vstorage as a gateway into ceph.

On Mon, May 28, 2018, 2:42 PM Steven Vacaroaia  wrote:

> Hi,
>
> I need to design and build a storage platform that will be "consumed"
> mainly by VMWare
>
> CEPH is my first choice
>
> As far as I can see, there are 3 ways CEPH storage can be made available
> to VMWare
>
> 1. iSCSI
> 2. NFS-Ganesha
> 3. mounted rbd to a lInux NFS server
>
> Any suggestions / advice as to which one is better ( and why) as well as
> links to doumentation/best practices will be truly appreciated
>
> Thanks
> Steven
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?

2018-05-25 Thread Brady Deetz
I'm not sure this is a cache issue. To me, this feels like a memory leak.
I'm now at 129GB (haven't had a window to upgrade yet) on a configured 80GB
cache.

[root@mds0 ceph-admin]# ceph daemon mds.mds0 cache status
{
"pool": {
"items": 166753076,
"bytes": 71766944952
}
}


ran a 10 minute heap profile.

[root@mds0 ceph-admin]# ceph tell mds.mds0 heap start_profiler
2018-05-25 08:15:04.428519 7f3f657fa700  0 client.127046191 ms_handle_reset
on 10.124.103.50:6800/2248223690
2018-05-25 08:15:04.447528 7f3f667fc700  0 client.127055541 ms_handle_reset
on 10.124.103.50:6800/2248223690
mds.mds0 started profiler


[root@mds0 ceph-admin]# ceph tell mds.mds0 heap dump
2018-05-25 08:25:14.265450 7f1774ff9700  0 client.127057266 ms_handle_reset
on 10.124.103.50:6800/2248223690
2018-05-25 08:25:14.356292 7f1775ffb700  0 client.127057269 ms_handle_reset
on 10.124.103.50:6800/2248223690
mds.mds0 dumping heap profile now.

MALLOC:   123658130320 (117929.6 MiB) Bytes in use by application
MALLOC: +0 (0.0 MiB) Bytes in page heap freelist
MALLOC: +   6969713096 ( 6646.8 MiB) Bytes in central cache freelist
MALLOC: + 26700832 (   25.5 MiB) Bytes in transfer cache freelist
MALLOC: + 54460040 (   51.9 MiB) Bytes in thread cache freelists
MALLOC: +531034272 (  506.4 MiB) Bytes in malloc metadata
MALLOC:   
MALLOC: = 131240038560 (125160.3 MiB) Actual memory used (physical + swap)
MALLOC: +   7426875392 ( 7082.8 MiB) Bytes released to OS (aka unmapped)
MALLOC:   
MALLOC: = 138666913952 (132243.1 MiB) Virtual address space used
MALLOC:
MALLOC:7434952  Spans in use
MALLOC: 20  Thread heaps in use
MALLOC:   8192  Tcmalloc page size

Call ReleaseFreeMemory() to release freelist memory to the OS (via
madvise()).
Bytes released to the OS take up virtual address space but no physical
memory.

[root@mds0 ceph-admin]# ceph tell mds.mds0 heap stop_profiler
2018-05-25 08:25:26.394877 7fbe48ff9700  0 client.127047898 ms_handle_reset
on 10.124.103.50:6800/2248223690
2018-05-25 08:25:26.736909 7fbe49ffb700  0 client.127035608 ms_handle_reset
on 10.124.103.50:6800/2248223690
mds.mds0 stopped profiler

[root@mds0 ceph-admin]# pprof --pdf /bin/ceph-mds
/var/log/ceph/mds.mds0.profile.000* > profile.pdf
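
For separating cache from everything else, the same interfaces also expose
tcmalloc's summary, a release command, and a per-subsystem breakdown (a
diagnostic sketch, not a fix):

ceph tell mds.mds0 heap stats       # tcmalloc summary without profiling
ceph tell mds.mds0 heap release     # ask tcmalloc to return freed pages to the OS
ceph daemon mds.mds0 dump_mempools  # per-subsystem memory accounting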



On Thu, May 10, 2018 at 2:11 PM, Patrick Donnelly <pdonn...@redhat.com>
wrote:

> On Thu, May 10, 2018 at 12:00 PM, Brady Deetz <bde...@gmail.com> wrote:
> > [ceph-admin@mds0 ~]$ ps aux | grep ceph-mds
> > ceph1841  3.5 94.3 133703308 124425384 ? Ssl  Apr04 1808:32
> > /usr/bin/ceph-mds -f --cluster ceph --id mds0 --setuser ceph --setgroup
> ceph
> >
> >
> > [ceph-admin@mds0 ~]$ sudo ceph daemon mds.mds0 cache status
> > {
> > "pool": {
> > "items": 173261056,
> > "bytes": 76504108600
> > }
> > }
> >
> > So, 80GB is my configured limit for the cache and it appears the mds is
> > following that limit. But, the mds process is using over 100GB RAM in my
> > 128GB host. I thought I was playing it safe by configuring at 80. What
> other
> > things consume a lot of RAM for this process?
> >
> > Let me know if I need to create a new thread.
>
> The cache size measurement is imprecise pre-12.2.5 [1]. You should upgrade
> ASAP.
>
> [1] https://tracker.ceph.com/issues/22972
>
> --
> Patrick Donnelly
>




[ceph-users] open vstorage

2018-05-23 Thread Brady Deetz
http://www.openvstorage.com
https://www.openvstorage.org

I came across this the other day and am curious if anybody has run it in
front of their Ceph cluster. I'm looking at it for a clean-ish Ceph
integration with VMWare.


Re: [ceph-users] multi site with cephfs

2018-05-21 Thread Brady Deetz
What is your expected behavior for when Client A writes to File B in
Datacenter 1 and Client C writes to File B in Datacenter 2 at the exact
same time?

I don't think you can perfectly achieve what you are requesting with Ceph
or many other storage solutions.

On Mon, May 21, 2018 at 9:33 AM, Up Safe  wrote:

> I'll explain.
> Right now we have 2 sites (racks) with several dozens of servers at each
> accessing a NAS (let's call it a NAS, although it's an IBM v7000 Unified
> that serves the files via NFS).
>
> The biggest problem is that it works active-passive, i.e. we always access
> one of the storages for read/write
> and the other one is replicated once every few hours, so it's more for
> backup needs.
>
> In this setup once the power goes down in our main site - we're stuck with
> a bit (several hours) outdated files
> and we need to remount all of the servers and what not.
>
> The multi site ceph was supposed to solve this problem for us. This way we
> would have only local mounts, i.e.
> each server would only access the filesystem that is in the same site. And
> if one of the sited go down - no pain.
>
> The files are rather small, pdfs and xml of 50-300KB mostly.
> The total size is about 25 TB right now.
>
> We're a low budget company, so your advise about developing is not going
> to happen as we have no such skills or resources for this.
> Plus, I want to make this transparent for the devs and everyone - just an
> infrastructure replacement that will buy me all of the ceph benefits and
> allow the company to survive the power outages or storage crashes.
>
>
>
> On Mon, May 21, 2018 at 5:12 PM, David Turner 
> wrote:
>
>> Not a lot of people use object storage multi-site.  I doubt anyone is
>> using this like you are.  In theory it would work, but even if somebody has
>> this setup running, it's almost impossible to tell if it would work for
>> your needs and use case.  You really should try it out for yourself to see
>> if it works to your needs.  And if you feel so inclined, report back here
>> with how it worked.
>>
>> If you're asking for advice, why do you need a networked posix
>> filesystem?  Unless you are using proprietary software with this
>> requirement, it's generally lazy coding that requires a mounted filesystem
>> like this and you should aim towards using object storage instead without
>> any sort of NFS layer.  It's a little more work for the developers, but is
>> drastically simpler to support and manage.
>>
>> On Mon, May 21, 2018 at 10:06 AM Up Safe  wrote:
>>
>>> guys,
>>> please tell me if I'm in the right direction.
>>> If ceph object storage can be set up in multi site configuration,
>>> and I add ganesha (which to my understanding is an "adapter"
>>> that serves s3 objects via nfs to clients) -
>>> won't this work as active-active?
>>>
>>>
>>> Thanks
>>>
>>> On Mon, May 21, 2018 at 11:48 AM, Up Safe  wrote:
>>>
 ok, thanks.
 but it seems to me that having pool replicas spread over sites is a bit
 too risky performance wise.
 how about ganesha? will it work with cephfs and multi site setup?

 I was previously reading about rgw with ganesha and it was full of
 limitations.
 with cephfs - there is only one and one I can live with.

 Will it work?


 On Mon, May 21, 2018 at 10:57 AM, Adrian Saul <
 adrian.s...@tpgtelecom.com.au> wrote:

>
>
> We run CephFS in a limited fashion in a stretched cluster of about
> 40km with redundant 10G fibre between sites – link latency is in the order
> of 1-2ms.  Performance is reasonable for our usage but is noticeably 
> slower
> than comparable local ceph based RBD shares.
>
>
>
> Essentially we just setup the ceph pools behind cephFS to have
> replicas on each site.  To export it we are simply using Linux kernel NFS
> and it gets exported from 4 hosts that act as CephFS clients.  Those 4
> hosts are then setup in an DNS record that resolves to all 4 IPs, and we
> then use automount to do automatic mounting and host failover on the NFS
> clients.  Automount takes care of finding the quickest and available NFS
> server.
>
>
>
> I stress this is a limited setup that we use for some fairly light
> duty, but we are looking to move things like user home directories onto
> this.  YMMV.
>
>
>
>
>
> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On
> Behalf Of *Up Safe
> *Sent:* Monday, 21 May 2018 5:36 PM
> *To:* David Turner 
> *Cc:* ceph-users 
> *Subject:* Re: [ceph-users] multi site with cephfs
>
>
>
> Hi,
>
> can you be a bit more specific?
>
> I need to understand whether this is doable at all.
>
> Other options would be using ganesha, but I understand it's 

Re: [ceph-users] multi site with cephfs

2018-05-21 Thread Brady Deetz
At this point in the conversation, based on what's already been said, I
have 2 recommendations.

If you haven't already, read a lot of the architecture documentation for
ceph. This will give you a good idea what capabilities exist and don't
exist.

If after reading the architecture documentation, you are still unsure,
don't invest in Ceph. It's a great platform for many people, but it isn't
for every team or problem.

On Mon, May 21, 2018, 9:56 AM Up Safe  wrote:

> Active-passive sounds not what I want.
>  But maybe I misunderstand.
>
> Does rbd mirror replicate both ways?
> And how do I do it with nfs?
>
> Thanks
>
> On Mon, May 21, 2018, 17:42 Paul Emmerich  wrote:
>
>> For active/passive and async replication with a POSIX filesystem:
>> Maybe two Ceph clusters with RBD mirror and re-exporting the RBD(s) via
>> NFS?
>>
>>
>> Paul
>>
>> 2018-05-21 16:33 GMT+02:00 Up Safe :
>>
>>> I'll explain.
>>> Right now we have 2 sites (racks) with several dozens of servers at each
>>> accessing a NAS (let's call it a NAS, although it's an IBM v7000 Unified
>>> that serves the files via NFS).
>>>
>>> The biggest problem is that it works active-passive, i.e. we always
>>> access one of the storages for read/write
>>> and the other one is replicated once every few hours, so it's more for
>>> backup needs.
>>>
>>> In this setup once the power goes down in our main site - we're stuck
>>> with a bit (several hours) outdated files
>>> and we need to remount all of the servers and what not.
>>>
>>> The multi site ceph was supposed to solve this problem for us. This way
>>> we would have only local mounts, i.e.
>>> each server would only access the filesystem that is in the same site.
>>> And if one of the sited go down - no pain.
>>>
>>> The files are rather small, pdfs and xml of 50-300KB mostly.
>>> The total size is about 25 TB right now.
>>>
>>> We're a low budget company, so your advise about developing is not going
>>> to happen as we have no such skills or resources for this.
>>> Plus, I want to make this transparent for the devs and everyone - just
>>> an infrastructure replacement that will buy me all of the ceph benefits and
>>> allow the company to survive the power outages or storage crashes.
>>>
>>>
>>>
>>> On Mon, May 21, 2018 at 5:12 PM, David Turner 
>>> wrote:
>>>
 Not a lot of people use object storage multi-site.  I doubt anyone is
 using this like you are.  In theory it would work, but even if somebody has
 this setup running, it's almost impossible to tell if it would work for
 your needs and use case.  You really should try it out for yourself to see
 if it works to your needs.  And if you feel so inclined, report back here
 with how it worked.

 If you're asking for advice, why do you need a networked posix
 filesystem?  Unless you are using proprietary software with this
 requirement, it's generally lazy coding that requires a mounted filesystem
 like this and you should aim towards using object storage instead without
 any sort of NFS layer.  It's a little more work for the developers, but is
 drastically simpler to support and manage.

 On Mon, May 21, 2018 at 10:06 AM Up Safe  wrote:

> guys,
> please tell me if I'm in the right direction.
> If ceph object storage can be set up in multi site configuration,
> and I add ganesha (which to my understanding is an "adapter"
> that serves s3 objects via nfs to clients) -
> won't this work as active-active?
>
>
> Thanks
>
> On Mon, May 21, 2018 at 11:48 AM, Up Safe  wrote:
>
>> ok, thanks.
>> but it seems to me that having pool replicas spread over sites is a
>> bit too risky performance wise.
>> how about ganesha? will it work with cephfs and multi site setup?
>>
>> I was previously reading about rgw with ganesha and it was full of
>> limitations.
>> with cephfs - there is only one and one I can live with.
>>
>> Will it work?
>>
>>
>> On Mon, May 21, 2018 at 10:57 AM, Adrian Saul <
>> adrian.s...@tpgtelecom.com.au> wrote:
>>
>>>
>>>
>>> We run CephFS in a limited fashion in a stretched cluster of about
>>> 40km with redundant 10G fibre between sites – link latency is in the 
>>> order
>>> of 1-2ms.  Performance is reasonable for our usage but is noticeably 
>>> slower
>>> than comparable local ceph based RBD shares.
>>>
>>>
>>>
>>> Essentially we just setup the ceph pools behind cephFS to have
>>> replicas on each site.  To export it we are simply using Linux kernel 
>>> NFS
>>> and it gets exported from 4 hosts that act as CephFS clients.  Those 4
>>> hosts are then setup in an DNS record that resolves to all 4 IPs, and we
>>> then use automount to do automatic mounting and host failover 

Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?

2018-05-10 Thread Brady Deetz
[ceph-admin@mds0 ~]$ ps aux | grep ceph-mds
ceph1841  3.5 94.3 133703308 124425384 ? Ssl  Apr04 1808:32
/usr/bin/ceph-mds -f --cluster ceph --id mds0 --setuser ceph --setgroup ceph


[ceph-admin@mds0 ~]$ sudo ceph daemon mds.mds0 cache status
{
"pool": {
"items": 173261056,
"bytes": 76504108600
}
}

So, 80GB is my configured limit for the cache and it appears the mds is
following that limit. But, the mds process is using over 100GB RAM in my
128GB host. I thought I was playing it safe by configuring at 80. What
other things consume a lot of RAM for this process?

Let me know if I need to create a new thread.




On Thu, May 10, 2018 at 12:40 PM, Patrick Donnelly <pdonn...@redhat.com>
wrote:

> Hello Brady,
>
> On Thu, May 10, 2018 at 7:35 AM, Brady Deetz <bde...@gmail.com> wrote:
> > I am now seeing the exact same issues you are reporting. A heap release
> did
> > nothing for me.
>
> I'm not sure it's the same issue...
>
> > [root@mds0 ~]# ceph daemon mds.mds0 config get mds_cache_memory_limit
> > {
> > "mds_cache_memory_limit": "80530636800"
> > }
>
> 80G right? What was the memory use from `ps aux | grep ceph-mds`?
>
> > [root@mds0 ~]# ceph daemon mds.mds0 perf dump
> > {
> > ...
> > "inode_max": 2147483647,
> > "inodes": 35853368,
> > "inodes_top": 23669670,
> > "inodes_bottom": 12165298,
> > "inodes_pin_tail": 18400,
> > "inodes_pinned": 2039553,
> > "inodes_expired": 142389542,
> > "inodes_with_caps": 831824,
> > "caps": 881384,
>
> Your cap count is 2% of the inodes in cache; the inodes pinned 5% of
> the total. Your cache should be getting trimmed assuming the cache
> size (as measured by the MDS, there are fixes in 12.2.5 which improve
> its precision) is larger than your configured limit.
>
> If the cache size is larger than the limit (use `cache status` admin
> socket command) then we'd be interested in seeing a few seconds of the
> MDS debug log with higher debugging set (`config set debug_mds 20`).
>
> --
> Patrick Donnelly
>


Re: [ceph-users] ceph mds memory usage 20GB : is it normal ?

2018-05-10 Thread Brady Deetz
I am now seeing the exact same issues you are reporting. A heap release did
nothing for me.

The only odd thing I'm doing is migrating data in cephfs from one pool to
another. The process looks something like the following:
TARGET_DIR=/media/cephfs/labs/
TARGET_POOL="cephfs_ec_data"
setfattr -n ceph.dir.layout.pool -v ${TARGET_POOL} ${TARGET_DIR}
#for every file
##NEWFILE="${file}.ec"
##cp "${file}" "${NEWFILE}"
##mv "${NEWFILE}" "${file}"

I have a fear that this process may not be releasing the inode of ${file}
and deleting the objects from RADOS. But I'm not sure that would have much
to do with the MDS beyond tracking an inode that isn't accessible anymore.
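
A couple of hedged checks for that: confirm each rewritten file's layout points
at the EC pool before the rename, and watch the replicated pool's usage and the
MDS stray counters to see that the old objects really are being purged:

getfattr -n ceph.file.layout "${NEWFILE}"   # should show pool=cephfs_ec_data
ceph df detail                              # replicated pool usage should shrink over time
ceph daemon mds.mds0 perf dump | grep -A 3 num_strays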



[root@mds0 ~]# rpm -qa | grep ceph
ceph-mgr-12.2.4-0.el7.x86_64
ceph-12.2.4-0.el7.x86_64
ceph-osd-12.2.4-0.el7.x86_64
ceph-release-1-1.el7.noarch
libcephfs2-12.2.4-0.el7.x86_64
ceph-base-12.2.4-0.el7.x86_64
ceph-mds-12.2.4-0.el7.x86_64
ceph-deploy-2.0.0-0.noarch
ceph-common-12.2.4-0.el7.x86_64
ceph-mon-12.2.4-0.el7.x86_64
ceph-radosgw-12.2.4-0.el7.x86_64
python-cephfs-12.2.4-0.el7.x86_64
ceph-selinux-12.2.4-0.el7.x86_64


[root@mds0 ~]# ceph daemon mds.mds0 config get mds_cache_memory_limit
{
"mds_cache_memory_limit": "80530636800"
}


[root@mds0 ~]# ceph daemon mds.mds0 perf dump
{
"AsyncMessenger::Worker-0": {
"msgr_recv_messages": 48568037,
"msgr_send_messages": 51895350,
"msgr_recv_bytes": 50001752194,
"msgr_send_bytes": 59667899407,
"msgr_created_connections": 28522,
"msgr_active_connections": 939,
"msgr_running_total_time": 9158.145665485,
"msgr_running_send_time": 3270.445768873,
"msgr_running_recv_time": 8951.883602486,
"msgr_running_fast_dispatch_time": 684.964408603
},
"AsyncMessenger::Worker-1": {
"msgr_recv_messages": 81557461,
"msgr_send_messages": 88149491,
"msgr_recv_bytes": 59543645402,
"msgr_send_bytes": 99790426210,
"msgr_created_connections": 28705,
"msgr_active_connections": 881,
"msgr_running_total_time": 14513.332929088,
"msgr_running_send_time": 5214.994372044,
"msgr_running_recv_time": 13891.320681575,
"msgr_running_fast_dispatch_time": 682.921363330
},
"AsyncMessenger::Worker-2": {
"msgr_recv_messages": 104018424,
"msgr_send_messages": 117265828,
"msgr_recv_bytes": 70248474177,
"msgr_send_bytes": 175930469394,
"msgr_created_connections": 30034,
"msgr_active_connections": 1043,
"msgr_running_total_time": 18836.813930876,
"msgr_running_send_time": 7227.884643396,
"msgr_running_recv_time": 17825.385233846,
"msgr_running_fast_dispatch_time": 692.710777921
},
"finisher-PurgeQueue": {
"queue_len": 0,
"complete_latency": {
"avgcount": 22554047,
"sum": 2515.425093728,
"avgtime": 0.000111528
}
},
"mds": {
"request": 156766118,
"reply": 156766111,
"reply_latency": {
"avgcount": 156766111,
"sum": 337276.533677320,
"avgtime": 0.002151463
},
"forward": 0,
"dir_fetch": 6468158,
"dir_commit": 539656,
"dir_split": 0,
"dir_merge": 0,
"inode_max": 2147483647,
"inodes": 35853368,
"inodes_top": 23669670,
"inodes_bottom": 12165298,
"inodes_pin_tail": 18400,
"inodes_pinned": 2039553,
"inodes_expired": 142389542,
"inodes_with_caps": 831824,
"caps": 881384,
"subtrees": 2,
"traverse": 167546977,
"traverse_hit": 53323050,
"traverse_forward": 0,
"traverse_discover": 0,
"traverse_dir_fetch": 4853,
"traverse_remote_ino": 0,
"traverse_lock": 39597,
"load_cent": 15676533928,
"q": 0,
"exported": 0,
"exported_inodes": 0,
"imported": 0,
"imported_inodes": 0
},
"mds_cache": {
"num_strays": 1369,
"num_strays_delayed": 12,
"num_strays_enqueuing": 0,
"strays_created": 2667808,
"strays_enqueued": 2666306,
"strays_reintegrated": 246,
"strays_migrated": 0,
"num_recovering_processing": 0,
"num_recovering_enqueued": 0,
"num_recovering_prioritized": 0,
"recovery_started": 524,
"recovery_completed": 524,
"ireq_enqueue_scrub": 0,
"ireq_exportdir": 0,
"ireq_flush": 0,
"ireq_fragmentdir": 0,
"ireq_fragstats": 0,
"ireq_inodestats": 0
},
"mds_log": {
"evadd": 34813343,
"evex": 34809732,
"evtrm": 34809732,
"ev": 22489,
"evexg": 0,
"evexd": 728,
"segadd": 47980,
"segex": 47980,
"segtrm": 47980,
"seg": 31,
"segexg": 0,
"segexd": 1,
"expos": 8687078876712,
"wrpos": 8687143594883,

Re: [ceph-users] Cluster degraded after Ceph Upgrade 12.2.1 => 12.2.2

2018-04-26 Thread Brady Deetz
I do it in production

On Thu, Apr 26, 2018, 2:47 AM John Hearns  wrote:

> Ronny, talking about reboots, has anyone had experience of live kernel
> patching with CEPH?  I am asking out of simple curiosity.
>
>
> On 25 April 2018 at 19:40, Ronny Aasen  wrote:
>
>> the difference in cost between 2 and 3 servers are not HUGE. but the
>> reliability  difference between a size 2/1 pool and a 3/2 pool is massive.
>> a 2/1 pool is just a single fault during maintenance away from dataloss.
>> but you need multiple simultaneous faults, and have very bad luck to break
>> a 3/2 pool
>>
>> I would recommend rather using 2/2 pools if you are willing to accept a
>> little downtime when a disk dies.  the cluster io would stop until the
>> disks backfill to cover for the lost disk.
>> but it is better then having inconsistent pg's or dataloss because a disk
>> crashed during a routine reboot, or 2 disks
>>
>> also worth to read this link
>> https://www.spinics.net/lists/ceph-users/msg32895.html   a good
>> explanation.
>>
>> you have good backups and are willing to restore the whole pool. And it
>> is of course your privilege to run 2/1 pools but be mind full of the risks
>> of doing so.
>>
>>
>> kind regards
>> Ronny Aasen
>>
>> BTW: i did not know ubuntu automagically rebooted after a upgrade. you
>> can probably avoid that reboot somehow in ubuntu. and do the restarts of
>> services manually. if you wish to maintain service during upgrade
>>
>>
>>
>>
>>
>> On 25.04.2018 11:52, Ranjan Ghosh wrote:
>>
>>> Thanks a lot for your detailed answer. The problem for us, however, was
>>> that we use the Ceph packages that come with the Ubuntu distribution. If
>>> you do a Ubuntu upgrade, all packages are upgraded in one go and the server
>>> is rebooted. You cannot influence anything or start/stop services
>>> one-by-one etc. This was concering me, because the upgrade instructions
>>> didn't mention anything about an alternative or what to do in this case.
>>> But someone here enlightened me that - in general - it all doesnt matter
>>> that much *if you are just accepting a downtime*. And, indeed, it all
>>> worked nicely. We stopped all services on all servers, upgraded the Ubuntu
>>> version, rebooted all servers and were ready to go again. Didn't encounter
>>> any problems there. The only problem turned out to be our own fault and
>>> simply a firewall misconfiguration.
>>>
>>> And, yes, we're running a "size:2 min_size:1" because we're on a very
>>> tight budget. If I understand correctly, this means: Make changes of files
>>> to one server. *Eventually* copy them to the other server. I hope this
>>> *eventually* means after a few minutes. Up until now I've never experienced
>>> *any* problems with file integrity with this configuration. In fact, Ceph
>>> is incredibly stable. Amazing. I have never ever had any issues whatsoever
>>> with broken files/partially written files, files that contain garbage etc.
>>> Even after starting/stopping services, rebooting etc. With GlusterFS and
>>> other Cluster file system I've experienced many such problems over the
>>> years, so this is what makes Ceph so great. I have now a lot of trust in
>>> Ceph, that it will eventually repair everything :-) And: If a file that has
>>> been written a few seconds ago is really lost it wouldnt be that bad for
>>> our use-case. It's a web-server. Most important stuff is in the DB. We have
>>> hourly backups of everything. In a huge emergency, we could even restore
>>> the backup from an hour ago if we really had to. Not nice, but if it
>>> happens every 6 years or sth due to some freak hardware failure, I think it
>>> is manageable. I accept it's not the recommended/perfect solution if you
>>> have infinite amounts of money at your hands, but in our case, I think it's
>>> not extremely audacious either to do it like this, right?
>>>
>>>
>>> Am 11.04.2018 um 19:25 schrieb Ronny Aasen:
>>>
 ceph upgrades are usualy not a problem:
 ceph have to be upgraded in the right order. normally when each service
 is on its own machine this is not difficult.
 but when you have mon, mgr, osd, mds, and klients on the same host you
 have to do it a bit carefully..

 i tend to have a terminal open with "watch ceph -s" running, and i
 never do another service until the health is ok again.

 first apt upgrade the packages on all the hosts. This only update the
 software on disk and not the running services.
 then do the restart of services in the right order.  and only on one
 host at the time

 mons: first you restart the mon service on all mon running hosts.
 all the 3 mons are active at the same time, so there is no "shifting
 around" but make sure the quorum is ok again before you do the next mon.

 mgr: then restart mgr on all hosts that run mgr. there is only one
 active mgr at the time now, so here there will be a bit of shifting 

Re: [ceph-users] Dying OSDs

2018-04-10 Thread Brady Deetz
What distribution and kernel are you running?

I recently found my cluster running the 3.10 CentOS kernel when I thought
it was running the elrepo kernel. After forcing it to boot correctly, my
flapping OSD issue went away.
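
For anyone else chasing the same thing on CentOS, the check and fix are roughly
the following (the kernel path is only an example):

uname -r                  # what is actually running
grubby --default-kernel   # what will boot next time
grubby --set-default /boot/vmlinuz-4.18.12-1.el7.elrepo.x86_64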

On Tue, Apr 10, 2018, 2:18 AM Jan Marquardt  wrote:

> Hi,
>
> we are experiencing massive problems with our Ceph setup. After starting
> a "repair pg" because of scrub errors OSDs started to crash, which we
> could not stop so far. We are running Ceph 12.2.4. Crashed OSDs are both
> bluestore and filestore.
>
> Our cluster currently looks like this:
>
> # ceph -s
>   cluster:
> id: c59e56df-2043-4c92-9492-25f05f268d9f
> health: HEALTH_ERR
> 1 osds down
> 73005/17149710 objects misplaced (0.426%)
> 5 scrub errors
> Reduced data availability: 2 pgs inactive, 2 pgs down
> Possible data damage: 1 pg inconsistent
> Degraded data redundancy: 611518/17149710 objects degraded
> (3.566%), 86 pgs degraded, 86 pgs undersized
>
>   services:
> mon: 3 daemons, quorum head1,head2,head3
> mgr: head3(active), standbys: head2, head1
> osd: 34 osds: 24 up, 25 in; 18 remapped pgs
>
>   data:
> pools:   1 pools, 768 pgs
> objects: 5582k objects, 19500 GB
> usage:   62030 GB used, 31426 GB / 93456 GB avail
> pgs: 0.260% pgs not active
>  611518/17149710 objects degraded (3.566%)
>  73005/17149710 objects misplaced (0.426%)
>  670 active+clean
>  75  active+undersized+degraded
>  8   active+undersized+degraded+remapped+backfill_wait
>  8   active+clean+remapped
>  2   down
>  2   active+undersized+degraded+remapped+backfilling
>  2   active+clean+scrubbing+deep
>  1   active+undersized+degraded+inconsistent
>
>   io:
> client:   10911 B/s rd, 118 kB/s wr, 0 op/s rd, 54 op/s wr
> recovery: 31575 kB/s, 8 objects/s
>
> # ceph osd tree
> ID  CLASS WEIGHT    TYPE NAME      STATUS REWEIGHT PRI-AFF
>  -1       124.07297 root default
>  -2        29.08960     host ceph1
>   0   hdd   3.63620         osd.0      up  1.0     1.0
>   1   hdd   3.63620         osd.1    down    0     1.0
>   2   hdd   3.63620         osd.2      up  1.0     1.0
>   3   hdd   3.63620         osd.3      up  1.0     1.0
>   4   hdd   3.63620         osd.4    down    0     1.0
>   5   hdd   3.63620         osd.5    down    0     1.0
>   6   hdd   3.63620         osd.6      up  1.0     1.0
>   7   hdd   3.63620         osd.7      up  1.0     1.0
>  -3         7.27240     host ceph2
>  14   hdd   3.63620         osd.14     up  1.0     1.0
>  15   hdd   3.63620         osd.15     up  1.0     1.0
>  -4        29.11258     host ceph3
>  16   hdd   3.63620         osd.16     up  1.0     1.0
>  18   hdd   3.63620         osd.18   down    0     1.0
>  19   hdd   3.63620         osd.19   down    0     1.0
>  20   hdd   3.65749         osd.20     up  1.0     1.0
>  21   hdd   3.63620         osd.21     up  1.0     1.0
>  22   hdd   3.63620         osd.22     up  1.0     1.0
>  23   hdd   3.63620         osd.23     up  1.0     1.0
>  24   hdd   3.63789         osd.24   down    0     1.0
>  -9        29.29919     host ceph4
>  17   hdd   3.66240         osd.17     up  1.0     1.0
>  25   hdd   3.66240         osd.25     up  1.0     1.0
>  26   hdd   3.66240         osd.26   down    0     1.0
>  27   hdd   3.66240         osd.27     up  1.0     1.0
>  28   hdd   3.66240         osd.28   down    0     1.0
>  29   hdd   3.66240         osd.29     up  1.0     1.0
>  30   hdd   3.66240         osd.30     up  1.0     1.0
>  31   hdd   3.66240         osd.31   down    0     1.0
> -11        29.29919     host ceph5
>  32   hdd   3.66240         osd.32     up  1.0     1.0
>  33   hdd   3.66240         osd.33     up  1.0     1.0
>  34   hdd   3.66240         osd.34     up  1.0     1.0
>  35   hdd   3.66240         osd.35     up  1.0     1.0
>  36   hdd   3.66240         osd.36   down  1.0     1.0
>  37   hdd   3.66240         osd.37     up  1.0     1.0
>  38   hdd   3.66240         osd.38     up  1.0     1.0
>  39   hdd   3.66240         osd.39     up  1.0     1.0
>
> The last OSDs that crashed are #28 and #36. Please find the
> corresponding log files here:
>
> http://af.janno.io/ceph/ceph-osd.28.log.1.gz
> http://af.janno.io/ceph/ceph-osd.36.log.1.gz
>
> The backtraces look almost the same for all crashed OSDs.
>
> Any help, hint or advice would really be appreciated. Please let me know
> if you need any further information.
>
> Best Regards
>
> Jan
>
> --
> Artfiles New Media GmbH | Zirkusweg 1 | 20359 Hamburg
> Tel: 040 - 32 02 72 90 | Fax: 040 - 32 02 72 95
> E-Mail: supp...@artfiles.de | Web: http://www.artfiles.de
> Geschäftsführer: Harald Oltmanns | Tim Evers
> 

Re: [ceph-users] ceph-deploy: recommended?

2018-04-04 Thread Brady Deetz
We use ceph-deploy in production. That said, our crush map is getting more
complex and we are starting to make use of other tooling as that occurs.
But we still use ceph-deploy to install ceph and bootstrap OSDs.

On Wed, Apr 4, 2018, 1:58 PM Robert Stanford 
wrote:

>
>  I read a couple of versions ago that ceph-deploy was not recommended for
> production clusters.  Why was that?  Is this still the case?  We have a lot
> of problems automating deployment without ceph-deploy.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs metadata dump

2018-03-09 Thread Brady Deetz
I'd also be very interested in this. At the moment I just use robinhood (
https://github.com/cea-hpc/robinhood) , which is less than optimal. I also
have a few scripts that use the xattrs instead of statting every file.



On Mar 9, 2018 8:09 AM, "Pavan, Krish"  wrote:

> Hi All,
>
> We have a CephFS of a large size (> 1 PB) that is expected to grow more. I need
> to dump the metadata (CInode, CDir with ACLs, size, ctime, …) weekly to
> find/report usage as well as ACLs.
>
> Is there any tool to dump and decode the metadata pool without going via the
> MDS servers?
>
> What is the best way to do?
>
>
>
> Regards
>
> Krish
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] New Ceph cluster design

2018-03-09 Thread Brady Deetz
I'd increase ram. 1GB per 1TB of disk is the recommendation.

Another thing you need to consider is your node density. 12x10TB is a lot
of data to have to rebalance if you aren't going to have 20+ nodes. I have
17 nodes with 24x6TB disks each. Rebuilds can take what seems like an
eternity. It may be worth looking at cheaper sockets and smaller disks in
order to increase your node count.

How many nodes will this cluster have?


On Mar 9, 2018 4:16 AM, "Ján Senko"  wrote:

I am planning a new Ceph deployment and I have a few questions that I could
not find good answers to yet.

Our nodes will be using Xeon-D machines with 12 HDDs and 64GB of RAM each.
Our target is to use 10TB drives for 120TB capacity per node.

1. We want to have small amount of SSDs in the machines. For OS and I guess
for WAL/DB of Bluestore. I am thinking about having a RAID 1 with two 400GB
2.5" SSD drives. Will this fit WAL/DB? We plan to store many small objects.
2. While doing scrub/deep scrub, is there any significant network traffic?
Assuming we are using Erasure coding pool, how do the nodes check the
consistency of an object? Do they transfer the whole object chunks or do
they only transfer the checksums?
3. We have to decide on which HDD to use, and there is a question of HGST
vs Seagate, 512e vs 4kn sectors, SATA vs SAS. Do you have some tips for
these decisions? We do not have very high IO, so we do not need performance
at any cost. As for manufacturer and the sector size, I haven't found any
guidelines/benchmarks that would steer me towards any.

Thank you for your insight
Jan



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] rbd mirror mechanics

2018-03-05 Thread Brady Deetz
While preparing a risk assessment for a DR solution involving RBD, I'm
increasingly unsure of a few things.

1) Does the failover from primary to secondary cluster occur automatically
in the case that the primary backing rados pool becomes inaccessible?

1.a) If the primary backing rados pool is unintentionally deleted, can the
client still failover to the secondary?


2) When an RBD image that is mirrored is deleted from the primary cluster,
is it automatically deleted from the secondary cluster?

2.a) If the primary RBD image is unintentionally deleted, can the client
still failover to the secondary?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD mirroring to DR site

2018-02-28 Thread Brady Deetz
Great. We are read heavy. I assume the journals do not replicate reads. Is
that correct?

On Wed, Feb 28, 2018 at 1:50 PM, Jason Dillaman <jdill...@redhat.com> wrote:

> On Wed, Feb 28, 2018 at 2:42 PM, Brady Deetz <bde...@gmail.com> wrote:
> > I'm considering doing one-way rbd mirroring to a DR site. The
> documentation
> > states that my link to the DR site should have sufficient throughput to
> > support replication.
> >
> > Our write activity is bursty. As such, we tend to see moments of high
> > throughput 4-6gbps followed by long bouts of basically no activity.
> >
> > 1) how sensitive is rbd mirroring to latency?
>
> It's not sensitive at all -- in the worst case, your journals will
> expand during the burst period and shrink again during the idle
> period.
>
> > 2) how sensitive is rbd mirroring to falling behind on replication and
> > having to catch up?
>
> It's designed to be asynchronous replication w/ consistency so it
> doesn't matter to rbd-mirror if it's behind. In fact, you can even
> configure it to always be X hours behind if you want to have a window
> for avoiding accidents from propagating to the DR site.
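
For reference, a minimal one-way mirroring sketch (pool, image, client and cluster names are placeholders; the replay-delay option is one way to get the "always X hours behind" window mentioned above):

# on the primary cluster: images to be mirrored need the journaling feature
rbd feature enable rbd/myimage journaling
rbd mirror pool enable rbd pool
# on the DR cluster, which runs the rbd-mirror daemon:
rbd mirror pool enable rbd pool
rbd mirror pool peer add rbd client.mirror@primary
# optional, in ceph.conf on the DR side, to replay e.g. 2 hours behind:
# [client]
#     rbd mirroring replay delay = 7200
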
>
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
>
>
> --
> Jason
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RBD mirroring to DR site

2018-02-28 Thread Brady Deetz
I'm considering doing one-way rbd mirroring to a DR site. The documentation
states that my link to the DR site should have sufficient throughput to
support replication.

Our write activity is bursty. As such, we tend to see moments of high
throughput 4-6gbps followed by long bouts of basically no activity.

1) how sensitive is rbd mirroring to latency?
2) how sensitive is rbd mirroring to falling behind on replication and
having to catch up?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] LVM+bluestore via ceph-volume vs bluestore via ceph-disk

2018-01-31 Thread Brady Deetz
I recently became aware that LVM has become a component of the preferred
OSD provisioning process when using ceph-volume. We'd already started our
migration to bluestore before ceph-disk's deprecation was announced and
decided to stick with the process with which we started.

I'm concerned my decision may turn out to be a drawback in the future. Are there any
plans for future features in Ceph to be dependent on LVM?

I'm specifically concerned about a dependency for CephFS snapshots once
they are announced as stable.

Aside from disk enumeration, what is driving the preference for LVM?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] iSCSI over RBD

2018-01-19 Thread Brady Deetz
I too experienced this with that kernel as well as the elrepo kernel.

On Jan 19, 2018 2:13 PM, "Steven Vacaroaia"  wrote:

Hi Joshua,

I was under the impression that kernel 3.10.0-693 would work with iSCSI.

Unfortunately  I still cannot create a disk because qfull_time_out is not
supported

What am I missing / doing wrong?

2018-01-19 15:06:45,216 INFO [lun.py:601:add_dev_to_lio()] -
(LUN.add_dev_to_lio) Adding image 'rbd.disk2' to LIO
2018-01-19 15:06:45,295 ERROR [lun.py:634:add_dev_to_lio()] - Could not
set LIO device attribute cmd_time_out/qfull_time_out for device: rbd.disk2.
Kernel not supported. - error(Cannot find attribute: qfull_time_out)
2018-01-19 15:06:45,300 ERROR [rbd-target-api:731:_disk()] - LUN alloc
problem - Could not set LIO device attribute cmd_time_out/qfull_time_out
for device: rbd.disk2. Kernel not supported. - error(Cannot find attribute:
qfull_time_out)


Many thanks

Steven

On 4 January 2018 at 22:40, Joshua Chen  wrote:

> Hello Steven,
>   I am using CentOS 7.4.1708 with kernel 3.10.0-693.el7.x86_64
>   and the following packages:
>
> ceph-iscsi-cli-2.5-9.el7.centos.noarch.rpm
> ceph-iscsi-config-2.3-12.el7.centos.noarch.rpm
> libtcmu-1.3.0-0.4.el7.centos.x86_64.rpm
> libtcmu-devel-1.3.0-0.4.el7.centos.x86_64.rpm
> python-rtslib-2.1.fb64-2.el7.centos.noarch.rpm
> python-rtslib-doc-2.1.fb64-2.el7.centos.noarch.rpm
> targetcli-2.1.fb47-0.1.20170815.git5bf3517.el7.centos.noarch.rpm
> tcmu-runner-1.3.0-0.4.el7.centos.x86_64.rpm
> tcmu-runner-debuginfo-1.3.0-0.4.el7.centos.x86_64.rpm
>
>
> Cheers
> Joshua
>
>
> On Fri, Jan 5, 2018 at 2:14 AM, Steven Vacaroaia  wrote:
>
>> Hi Joshua,
>>
>> How did you manage to use iSCSI gateway ?
>> I would like to do that but still waiting for a patched kernel
>>
>> What kernel/OS did you use and/or how did you patch it ?
>>
>> Tahnsk
>> Steven
>>
>> On 4 January 2018 at 04:50, Joshua Chen 
>> wrote:
>>
>>> Dear all,
>>>   Although I managed to run gwcli and create some IQNs and LUNs,
>>> I do need a working config example so that my initiator can
>>> connect and get the LUN.
>>>
>>>   I am familiar with targetcli, and I used to do the following ACL-style
>>> connection rather than password authentication;
>>> the targetcli settings tree is here:
>>>
>>> (or see this page)
>>>
>>> #targetcli ls
>>> o- / ................................................................ [...]
>>>   o- backstores ..................................................... [...]
>>>   | o- block ......................................... [Storage Objects: 1]
>>>   | | o- vmware_5t .. [/dev/rbd/rbd/vmware_5t (5.0TiB) write-thru activated]
>>>   | |   o- alua ........................................... [ALUA Groups: 1]
>>>   | |     o- default_tg_pt_gp ............... [ALUA state: Active/optimized]
>>>   | o- fileio ........................................ [Storage Objects: 0]
>>>   | o- pscsi ......................................... [Storage Objects: 0]
>>>   | o- ramdisk ....................................... [Storage Objects: 0]
>>>   | o- user:rbd ...................................... [Storage Objects: 0]
>>>   o- iscsi ................................................... [Targets: 1]
>>>   | o- iqn.2017-12.asiaa.cephosd1:vmware5t ...................... [TPGs: 1]
>>>   |   o- tpg1 ............................................ [gen-acls, no-auth]
>>>   |     o- acls ............................................... [ACLs: 12]
>>>   |     | o- iqn.1994-05.com.redhat:15dbed23be9e ......... [Mapped LUNs: 1]
>>>   |     | | o- mapped_lun0 ................... [lun0 block/vmware_5t (rw)]
>>>   |     | o- iqn.1994-05.com.redhat:15dbed23be9e-ovirt1 .. [Mapped LUNs: 1]
>>>   |     | | o- mapped_lun0 ...
>>>

Re: [ceph-users] Ceph-objectstore-tool import failure

2018-01-13 Thread Brady Deetz
It's not simply a zip. I recently went through an incomplete pg incident as
well. I'm not sure why your import is failing, but I do know that much.
Here's a note from our Slack from the effort to reverse-engineer the export format. I'm hoping to
explore this a bit more in the next week.

Data frames appear to have the following format:
ceff DTDT SIZE PAYLOAD

Size is probably in bytes? DTDT is frame type, 8-bits, repeated (so BEGIN=1
is encoded as 0101).

File format looks like:

Superblock - starts with:   ceff ceff 0200
PG_BEGINceff 0101 <64 bit little-Endian size>
PG_METADATA ceff 0909
OBJECT_BEGINceff 0303 <64 bit little-Endian size>
(represents first object of a file?)
TYPE_DATA   ceff 0505 (represents an object?)
(repeat TYPE_DATA frames until file completed)
TYPE_ATTRS  ceff 0606
TYPE_OMAP   ceff 0808
OBJECT_END  ceff 0404
(repeat above block for all files?)


enum {
TYPE_NONE = 0,
TYPE_PG_BEGIN,
TYPE_PG_END,
TYPE_OBJECT_BEGIN,
TYPE_OBJECT_END,
TYPE_DATA,
TYPE_ATTRS,
TYPE_OMAP_HDR,
TYPE_OMAP,
TYPE_PG_METADATA,
TYPE_POOL_BEGIN,
TYPE_POOL_END,
END_OF_TYPES, //Keep at the end
};
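
As a rough aid for poking at these export files (this is only the layout we inferred, not an authoritative description of the format; the 12-byte header width and the little-endian fields are assumptions), a scanner that walks a file and prints the frames it finds could look like:

import struct, sys

TYPE_NAMES = {1: "PG_BEGIN", 2: "PG_END", 3: "OBJECT_BEGIN", 4: "OBJECT_END",
              5: "DATA", 6: "ATTRS", 7: "OMAP_HDR", 8: "OMAP", 9: "PG_METADATA",
              10: "POOL_BEGIN", 11: "POOL_END"}

def scan(path):
    with open(path, "rb") as f:
        data = f.read()
    pos = data.find(b"\xce\xff")                 # first candidate frame magic
    while 0 <= pos <= len(data) - 12:
        magic, t1, t2, size = struct.unpack_from("<HBBQ", data, pos)
        if magic != 0xffce or t1 != t2:          # not a real header (superblock/payload bytes), keep searching
            pos = data.find(b"\xce\xff", pos + 2)
            continue
        print(f"offset {pos:#x}: {TYPE_NAMES.get(t1, t1)} payload {size} bytes")
        pos += 12 + size                         # 2 magic + 2 type + 8 size, then the payload

if __name__ == "__main__":
    scan(sys.argv[1])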



On Jan 14, 2018 1:27 AM, "Brent Kennedy"  wrote:

> I was able to bring a server back online for a short time and perform an
> export of the incomplete PGs I originally posted about last week.  The
> export showed the files it was exporting and then dropped them all to a
> PGID.export file.  I then SCP’ed the four PGID.export files to a server
> where I had an empty OSD weighted to 0.  I stopped that OSD and then tried
> to import all four PGs.  I then got the following messages for all four I
> tried:
>
>
>
> finish_remove_pgs 11.720_head removing 11.720
>
> Importing pgid 11.c13
>
> do_import threw exception error buffer::malformed_input: void
> object_stat_sum_t::decode(ceph::buffer::list::iterator&) decode past end
> of struct encoding
>
> Corrupt input for import
>
>
>
>
>
> Command I ran:
>
> ceph-objectstore-tool --op import --data-path /var/lib/ceph/osd/ceph-13
> --journal-path /var/lib/ceph/osd/ceph-13/block --file 11.c13.export
>
>
>
> The files match the space used by PGs on the disk.  As noted above, I saw
> it copy the PG to the export file successfully.  Both servers are running
> Ubuntu 14 with the newest ceph-objectstore-tool installed via the package
> from here: http://download.ceph.com/debian-luminous/pool/main/c/ceph/ceph-test_12.2.2-1trusty_amd64.deb
> (the cluster is Luminous 12.2.2). It's possible the PGs in question are on the
> Jewel version, as I wasn't able to complete the upgrade to Luminous on them.
>
>
>
> Am I missing something?  Can I just copy the files off the failing server
> via a zip operation locally and then a unzip operation at the destination
> server?
>
>
>
> -Brent
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] jemalloc on centos7

2018-01-13 Thread Brady Deetz
There are some documented issues with bluestore and jemalloc. At the
moment, I would avoid it.

On Jan 13, 2018 5:43 PM, "Marc Roos"  wrote:

>
> I was thinking of enabling this jemalloc. Is there a recommended procedure
> for a default centos7 cluster?
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] filestore to bluestore: osdmap epoch problem and is the documentation correct?

2018-01-11 Thread Brady Deetz
I hear you on time. I have 350 x 6TB drives to convert. I recently posted
about a disaster I created automating my migration. Good luck

On Jan 11, 2018 12:22 PM, "Reed Dier"  wrote:

> I am in the process of migrating my OSDs to bluestore finally and thought
> I would give you some input on how I am approaching it.
> Some of the saga you can find in another ML thread here:
> https://www.spinics.net/lists/ceph-users/msg41802.html
>
> My first OSD I was cautious, and I outed the OSD without downing it,
> allowing it to move data off.
> Some background on my cluster, for this OSD, it is an 8TB spinner, with an
> NVMe partition previously used for journaling in filestore, intending to be
> used for block.db in bluestore.
>
> Then I downed it, flushed the journal, destroyed it, zapped with
> ceph-volume, set norecover and norebalance flags, did ceph osd crush remove
> osd.$ID, ceph auth del osd.$ID, and ceph osd rm osd.$ID and used
> ceph-volume locally to create the new LVM target. Then unset the norecover
> and norebalance flags and it backfilled like normal.
>
> I initially ran into issues with specifying --osd-id causing my OSDs to
> fail to start, but after removing that I was able to get it to fill in the gap of
> the OSD I just removed.
>
> I’m now doing quicker, more destructive migrations in an attempt to reduce
> data movement.
> This way I don’t read from the OSD I’m replacing, write to another OSD
> temporarily, read back from the temp OSD, and write back to the ‘new’ OSD;
> I’m just reading from a replica and writing to the ‘new’ OSD.
>
> So I’m setting the norecover and norebalance flags, down the OSD (but not
> out, it stays in, also have the noout flag set), destroy/zap, recreate
> using ceph-volume, unset the flags, and it starts backfilling.
> For 8TB disks, and with 23 other 8TB disks in the pool, it takes a *long* time
> to offload it and then backfill back from them. I trust my disks enough to
> backfill from the other disks, and its going well. Also seeing very good
> write performance backfilling compared to previous drive replacements in
> filestore, so that's very promising.
>
> Reed
>
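
A rough sketch of the in-place replacement Reed describes (device names, the DB partition and the OSD id are placeholders; note his comment above about --osd-id, which is the bug tracked as issue 22642 further down this thread):

ceph osd set noout; ceph osd set norecover; ceph osd set norebalance
systemctl stop ceph-osd@42                    # down, but not out
ceph osd destroy 42 --yes-i-really-mean-it
ceph-volume lvm zap /dev/sdX
ceph-volume lvm create --bluestore --data /dev/sdX \
    --block.db /dev/nvme0n1p1 --osd-id 42     # --osd-id may hit the issue mentioned above
ceph osd unset norebalance; ceph osd unset norecover; ceph osd unset noout
# backfill then repopulates the OSD from the remaining replicas
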
> On Jan 10, 2018, at 8:29 AM, Jens-U. Mozdzen  wrote:
>
> Hi Alfredo,
>
> thank you for your comments:
>
> Zitat von Alfredo Deza :
>
> On Wed, Jan 10, 2018 at 8:57 AM, Jens-U. Mozdzen  wrote:
>
> Dear *,
>
> has anybody been successful migrating Filestore OSDs to Bluestore OSDs,
> keeping the OSD number? There have been a number of messages on the list,
> reporting problems, and my experience is the same. (Removing the existing
> OSD and creating a new one does work for me.)
>
> I'm working on an Ceph 12.2.2 cluster and tried following
> http://docs.ceph.com/docs/master/rados/operations/add-
> or-rm-osds/#replacing-an-osd
> - this basically says
>
> 1. destroy old OSD
> 2. zap the disk
> 3. prepare the new OSD
> 4. activate the new OSD
>
> I never got step 4 to complete. The closest I got was by doing the
> following
> steps (assuming OSD ID "999" on /dev/sdzz):
>
> 1. Stop the old OSD via systemd (osd-node # systemctl stop
> ceph-osd@999.service)
>
> 2. umount the old OSD (osd-node # umount /var/lib/ceph/osd/ceph-999)
>
> 3a. if the old OSD was Bluestore with LVM, manually clean up the old OSD's
> volume group
>
> 3b. zap the block device (osd-node # ceph-volume lvm zap /dev/sdzz)
>
> 4. destroy the old OSD (osd-node # ceph osd destroy 999
> --yes-i-really-mean-it)
>
> 5. create a new OSD entry (osd-node # ceph osd new $(cat
> /var/lib/ceph/osd/ceph-999/fsid) 999)
>
>
> Step 5 and 6 are problematic if you are going to be trying ceph-volume
> later on, which takes care of doing this for you.
>
>
> 6. add the OSD secret to Ceph authentication (osd-node # ceph auth add
> osd.999 mgr 'allow profile osd' osd 'allow *' mon 'allow profile osd' -i
> /var/lib/ceph/osd/ceph-999/keyring)
>
>
> I at first tried to follow the documented steps (without my steps 5 and
> 6), which did not work for me. The documented approach failed with "init
> authentication >> failed: (1) Operation not permitted", because actually
> ceph-volume did not add the auth entry for me.
>
> But even after manually adding the authentication, the "ceph-volume"
> approach failed, as the OSD was still marked "destroyed" in the osdmap
> epoch as used by ceph-osd (see the commented messages from ceph-osd.999.log
> below).
>
>
> 7. prepare the new OSD (osd-node # ceph-volume lvm prepare --bluestore
> --osd-id 999 --data /dev/sdzz)
>
>
> You are going to hit a bug in ceph-volume that is preventing you from
> specifying the osd id directly if the ID has been destroyed.
>
> See http://tracker.ceph.com/issues/22642
>
>
> If I read that bug description correctly, you're confirming why I needed
> step #6 above (manually adding the OSD auth entry). But even if ceph-volume
> had added it, the ceph-osd.log entries suggest that starting the OSD would
> still have failed, because of accessing the wrong osdmap epoch.

[ceph-users] Bluestore migration disaster - incomplete pgs recovery process and progress (in progress)

2018-01-07 Thread Brady Deetz
Below is the status of my disastrous self-inflicted journey. I will preface
this by admitting this could not have been prevented by software attempting
to keep me from being stupid.

I have a production cluster with over 350 XFS backed osds running Luminous.
We want to transition the cluster to Bluestore for the purpose of enabling
EC for CephFS. We are currently at 75+% utilization and EC coding could
really help us reclaim some much needed capacity. Formatting 1 osd at a
time and waiting on the cluster to backfill for every disk was going to
take a very long time (based on our observations an estimated 240+ days).
Formatting an entire host at once caused a little too much turbulence in
the cluster. Furthermore, we could start the transition to EC if we had
enough hosts with enough disks running Bluestore, before the entire cluster
was migrated. As such, I decided to parallelize. The general idea is that
we could format any osd that didn't have anything other than active+clean
pgs associated. I maintain that this method should work. But, something
went terribly wrong with the script and somehow we formatted disks in a
manner that brought PGs into an incomplete state. It's now pretty obvious
that the affected PGs were backfilling to other osds when the script
clobbered the last remaining good set of objects.

This cluster serves CephFS and a few RBD volumes.

mailing list submissions related to this outage:
cephfs-data-scan pg_files errors
finding and manually recovering objects in bluestore
Determine cephfs paths and rados objects affected by incomplete pg

Our recovery
1) We allowed the cluster to repair itself as much as possible.

2) Following self-healing we were left with 3 PGs incomplete. 2 were in the
cephfs data pool and 1 in an RBD pool.

3) Using ceph pg ${pgid} query, we found all disks known to have recently
contained some of that PG's data

4) For each osd listed in the pg query, we exported the remaining PG data
using ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-${osdid}/
--pgid ${pgid} --op export --file /media/ceph_recovery/ceph-${
osdid}/recover.${pgid}

5) After having all of the possible exports we compared the recovery files
and chose the largest. I would have appreciated the ability to do a merge
of some sort on these exports, but we'll take what we can get. We're just
going to assume the largest export was the most complete backfill at the
time disaster struck.
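
A sketch of steps 4 and 5 (OSD ids, the pg id and the recovery mount point are placeholders; each OSD has to be stopped while ceph-objectstore-tool reads it):

pgid=1.cb7
for osdid in 12 57 103; do                     # every OSD listed by "ceph pg ${pgid} query"
    systemctl stop ceph-osd@${osdid}
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-${osdid}/ \
        --pgid ${pgid} --op export \
        --file /media/ceph_recovery/ceph-${osdid}/recover.${pgid}
    systemctl start ceph-osd@${osdid}
done
# take the largest export as the most complete copy
ls -lS /media/ceph_recovery/ceph-*/recover.${pgid} | head -1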

6) We removed the nearly empty pg from the acting osds
using ceph-objectstore-tool --op remove --data-path
/var/lib/ceph/osd/ceph-${osdid} --pgid ${pgid}

7) We imported the largest export we had into the acting osds for the pg

8) We marked the pg as complete using the following on the primary
acting ceph-objectstore-tool --op mark-complete --data-path
/var/lib/ceph/osd/ceph-${osdid}/ --pgid ${pgid}

9) We were convinced that it would be possible for multiple exports of the same
partially backfilled PG to contain different objects. As such, we started
reverse-engineering the format of the export file to extract the objects from
the exports and compare them.

10) While our resident reverse engineer was hard at work, focus was shifted
toward tooling for the purpose of identifying corrupt files, rbds, and
appropriate actions for each
10a) A list of all rados objects was dumped for our most valuable data
(CephFS). Our first mechanism of detection is a skip in object sequence
numbers.
10b) Because our metadata pool was unaffected by this mess, we are trusting
that ls delivers correct file sizes even for corrupt files. As such, we
should be able to identify how many objects make up the file. If the count
of objects for that file's inode is less than that, there's a problem (see
the sketch below). More than the calculated amount??? The world definitely
explodes.
10c) Finally, the saddest check is if there are no objects in rados for
that inode.
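
A sketch of the checks in 10b/10c. Everything here is an assumption unless stated in the steps above: the default 4 MB CephFS object size, the usual <inode-hex>.<block-hex> data-object naming, an objects.txt dump from 10a, and file paths fed on stdin from the (metadata-intact) filesystem:

import math, os, sys
from collections import defaultdict

OBJECT_SIZE = 4 * 1024 * 1024                  # default layout; adjust for custom layouts

# count data objects per inode from the rados listing (10a)
objects_per_inode = defaultdict(int)
with open("objects.txt") as f:
    for line in f:
        objects_per_inode[line.strip().split(".")[0]] += 1

for path in sys.stdin:                         # e.g. piped from: find /cephfs -type f
    path = path.rstrip("\n")
    st = os.stat(path)
    if st.st_size == 0:
        continue
    expected = math.ceil(st.st_size / OBJECT_SIZE)
    found = objects_per_inode.get(format(st.st_ino, "x"), 0)
    if found == 0:
        print(f"MISSING  {path}")              # 10c: no objects at all for this inode
    elif found < expected:                     # 10b: fewer objects than the size implies
        print(f"PARTIAL  {path} ({found}/{expected} objects)")
# note: sparse files can legitimately have fewer objects than their size suggests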

That's where we are right now. I'll update this thread as we get closer to
recovery from backups and accepting data loss if necessary.

I will note that we wish there were some documentation on using
ceph-objectstore-tool. We understand that it's for emergencies, but that's
when concise documentation is most important. From what we've found, the
only documentation seems to be --help and the source code.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] cephfs-data-scan pg_files errors

2018-01-05 Thread Brady Deetz
When running the cephfs-data-scan tool to discover what files are affected
by my incomplete PGs, I get paths returned as expected. But, I also receive
2 different kinds of errors in the output.

2018-01-05 10:49:01.217218 7fc274fbb140 -1 pgeffects.hit_dir: Failed to
stat path
/homefolders/bdeetz-2/cephfs_datapool_migrator//virtenv/lib/python2.7/stat.py:
(2) No such file or directory

2018-01-05 10:49:38.298795 7fd47fd11140 -1 pgeffects.hit_dir: Failed to
open path: (13) Permission denied

Should I just assume that these paths are also affected by the incomplete
PGs?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] finding and manually recovering objects in bluestore

2018-01-03 Thread Brady Deetz
In filestore (XFS), you'd find files representing objects using traditional
bash commands like find. What tools do I have at my disposal for recovering
data in bluestore?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Determine cephfs paths and rados objects affected by incomplete pg

2018-01-03 Thread Brady Deetz
I would like to track down what objects are affected by an incomplete pg
and in the case of cephfs map those objects to file paths.

At the moment, the best I've come up with for mapping objects to a pg is
very very slow:
pool="pool"
incomplete="1.cb7"
for object in `rados -p ${pool} ls`; do
  ceph osd map one ${object} | egrep "${incomplete}"
done

Is there not a way to ask ceph for all objects that belong to a pg?

Finally, is there a way to determine what cephfs path maps to a rados
object?
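
One offline way to list a PG's objects, for what it's worth (the OSD has to be stopped while the tool runs; the osd and pg ids are placeholders):

systemctl stop ceph-osd@42
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-42 \
    --pgid 1.cb7 --op list
systemctl start ceph-osd@42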
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph luminous - performance issue

2018-01-03 Thread Brady Deetz
Can you provide more detail regarding the infrastructure backing this
environment? What hard drive, ssd, and processor are you using? Also, what
is providing networking?

I'm seeing 4k blocksize tests here. Latency is going to destroy you.

On Jan 3, 2018 8:11 AM, "Steven Vacaroaia"  wrote:

> Hi,
>
> I am doing a PoC with 3 Dell R620s, 12 OSDs, and 3 SSD drives (one on each
> server), using bluestore.
>
> I configured the OSD using the following ( /dev/sda is my SSD drive)
> ceph-disk prepare --zap-disk --cluster ceph  --bluestore /dev/sde
> --block.wal /dev/sda --block.db /dev/sda
>
> Unfortunately both fio and bench tests show much worse performance for the
> pools than for the individual disks
>
> Example:
> DISKS
> fio --filename=/dev/sda --direct=1 --sync=1 --rw=write --bs=4k
> --numjobs=14 --iodepth=1 --runtime=60 --time_based --group_reporting
> --name=journal-test
>
> SSD drive
> Jobs: 14 (f=14): [W(14)] [100.0% done] [0KB/465.2MB/0KB /s] [0/119K/0
> iops] [eta 00m:00s]
>
> HD drive
> Jobs: 14 (f=14): [W(14)] [100.0% done] [0KB/179.2MB/0KB /s] [0/45.9K/0
> iops] [eta 00m:00s]
>
> POOL
>
> fio write.fio
> Jobs: 1 (f=0): [w(1)] [100.0% done] [0KB/51428KB/0KB /s] [0/12.9K/0 iops]
>
>  cat write.fio
> [write-4M]
> description="write test with 4k block"
> ioengine=rbd
> clientname=admin
> pool=scbench
> rbdname=image01
> iodepth=32
> runtime=120
> rw=randwrite
> bs=4k
>
>
> rados bench -p scbench 12 write
>
> Max bandwidth (MB/sec): 224
> Min bandwidth (MB/sec): 0
> Average IOPS:   26
> Stddev IOPS:24
> Max IOPS:   56
> Min IOPS:   0
> Average Latency(s): 0.59819
> Stddev Latency(s):  1.64017
> Max latency(s): 10.8335
> Min latency(s): 0.00475139
>
>
>
>
> I must be missing something - any help/suggestions will be greatly
> appreciated
>
> Here are some specific info
>
> ceph -s
>   cluster:
> id: 91118dde-f231-4e54-a5f0-a1037f3d5142
> health: HEALTH_OK
>
>   services:
> mon: 1 daemons, quorum mon01
> mgr: mon01(active)
> osd: 12 osds: 12 up, 12 in
>
>   data:
> pools:   4 pools, 484 pgs
> objects: 70082 objects, 273 GB
> usage:   570 GB used, 6138 GB / 6708 GB avail
> pgs: 484 active+clean
>
>   io:
> client:   2558 B/s rd, 2 op/s rd, 0 op/s wr
>
> ceph osd pool ls detail
> pool 1 'test-replicated' replicated size 2 min_size 1 crush_rule 0
> object_hash rjenkins pg_num 128 pgp_num 128 last_change 157 flags
> hashpspool stripe_width 0 application rbd
> removed_snaps [1~3]
> pool 2 'test-erasure' erasure size 3 min_size 3 crush_rule 1 object_hash
> rjenkins pg_num 128 pgp_num 128 last_change 334 flags hashpspool
> stripe_width 8192 application rbd
> removed_snaps [1~5]
> pool 3 'rbd' replicated size 2 min_size 1 crush_rule 0 object_hash
> rjenkins pg_num 128 pgp_num 128 last_change 200 flags hashpspool
> stripe_width 0 application rbd
> removed_snaps [1~3]
> pool 4 'scbench' replicated size 2 min_size 1 crush_rule 0 object_hash
> rjenkins pg_num 100 pgp_num 100 last_change 330 flags hashpspool
> stripe_width 0
> removed_snaps [1~3]
>
> [cephuser@ceph ceph-config]$ ceph osd df tree
> ID CLASS WEIGHT  REWEIGHT SIZE  USE    AVAIL %USE  VAR  PGS TYPE NAME
> -1       6.55128        - 2237G   198G 2038G     0    0   - root default
> -7             0        -    0       0     0     0    0   - host ods03
> -3       2.18475        - 2237G   181G 2055G  8.12 0.96   - host osd01
>  3   hdd 0.54619  1.0  559G 53890M  506G  9.41 1.11  90 osd.3
>  4   hdd 0.54619  1.0  559G 30567M  529G  5.34 0.63  89 osd.4
>  5   hdd 0.54619  1.0  559G 59385M  501G 10.37 1.22  93 osd.5
>  6   hdd 0.54619  1.0  559G 42156M  518G  7.36 0.87  93 osd.6
> -5       2.18178        - 2234G   189G 2044G  8.50 1.00   - host osd02
>  0   hdd 0.54520  1.0  558G 32460M  526G  5.68 0.67  90 osd.0
>  1   hdd 0.54520  1.0  558G 54578M  504G  9.55 1.12  89 osd.1
>  2   hdd 0.54520  1.0  558G 47761M  511G  8.35 0.98  93 osd.2
>  7   hdd 0.54619  1.0  559G 59584M  501G 10.40 1.22  92 osd.7
> -9       2.18475        - 2237G   198G 2038G  8.88 1.04   - host osd03
>  8   hdd 0.54619  1.0  559G 52462M  508G  9.16 1.08  99 osd.8
> 10   hdd 0.54619  1.0  559G 35284M  524G  6.16 0.73  88 osd.10
> 11   hdd 0.54619  1.0  559G 71739M  489G 12.53 1.47  87 osd.11
> 12   hdd 0.54619  1.0  559G 43832M  516G  7.65 0.90  93 osd.12
> TOTAL 6708G   570G 6138G  8.50
> MIN/MAX VAR: 0.63/1.47  STDDEV: 2.06
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] formatting bytes and object counts in ceph status ouput

2018-01-02 Thread Brady Deetz
I'd implement byte counters in base 2 (KB, MB, etc). MiB is annoying to us
old grumpy folk, but I'd live with it.

But, I absolutely hate that object counts are in base 2. 1 kg is not 1024
grams. We have a reason for bytes to be in base 2. Very few other
things are expected to be in base 2. A normal person looking at ceph status
would interpret 1M objects as one million.


On Jan 2, 2018 4:43 AM, "Jan Fajerski"  wrote:

Hi lists,
Currently the ceph status output formats all numbers with binary unit
prefixes, i.e. 1MB equals 1048576 bytes and an object count of 1M equals
1048576 objects.  I received a bug report from a user that printing object
counts with a base 2 multiplier is confusing (I agree) so I opened a bug
and https://github.com/ceph/ceph/pull/19117.
In the PR discussion a couple of questions arose that I'd like to get some
opinions on:
- Should we print binary unit prefixes (MiB, GiB, ...) since that would be
technically correct?
- Should counters (like object counts) be formatted with a base 10
multiplier or a multiplier with base 2?

My proposal would be to both use binary unit prefixes and use base 10
multipliers for counters. I think this aligns with user expectations as
well as the relevant standard(s?).

Best,
Jan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD+LVM -> iSCSI -> VMWare

2017-12-09 Thread Brady Deetz
That's not a bad position. I have concerns with what I'm proposing, so a
hypervisor migration may actually bring less risk than a storage
abomination.

On Dec 9, 2017 7:09 PM, "Donny Davis" <do...@fortnebula.com> wrote:

> What I am getting at is that instead of sinking a bunch of time into this
> bandaid, why not sink that time into a hypervisor migration. Seems well
> timed if you ask me.
>
> There are even tools to make that migration easier
>
> http://libguestfs.org/virt-v2v.1.html
>
> You should ultimately move your hypervisor instead of building a one off
> case for ceph. Ceph works really well if you stay inside the box. So does
> KVM. They work like Gang Buster's together.
>
> I know that doesn't really answer your OP, but this is what I would do.
>
> ~D
>
> On Sat, Dec 9, 2017 at 7:56 PM Brady Deetz <bde...@gmail.com> wrote:
>
>> We have over 150 VMs running in vmware. We also have 2PB of Ceph for
>> filesystem. With our vmware storage aging and not providing the IOPs we
>> need, we are considering and hoping to use ceph. Ultimately, yes we will
>> move to KVM, but in the short term, we probably need to stay on VMware.
>> On Dec 9, 2017 6:26 PM, "Donny Davis" <do...@fortnebula.com> wrote:
>>
>>> Just curious but why not just use a hypervisor with rbd support? Are
>>> there VMware specific features you are reliant on?
>>>
>>> On Fri, Dec 8, 2017 at 4:08 PM Brady Deetz <bde...@gmail.com> wrote:
>>>
>>>> I'm testing using RBD as VMWare datastores. I'm currently testing with
>>>> krbd+LVM on a tgt target hosted on a hypervisor.
>>>>
>>>> My Ceph cluster is HDD backed.
>>>>
>>>> In order to help with write latency, I added an SSD drive to my
>>>> hypervisor and made it a writeback cache for the rbd via LVM. So far I've
>>>> managed to smooth out my 4k write latency and have some pleasing results.
>>>>
>>>> Architecturally, my current plan is to deploy an iSCSI gateway on each
>>>> hypervisor hosting that hypervisor's own datastore.
>>>>
>>>> Does anybody have any experience with this kind of configuration,
>>>> especially with regard to LVM writeback caching combined with RBD?
>>>> ___
>>>> ceph-users mailing list
>>>> ceph-users@lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>
>>>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD+LVM -> iSCSI -> VMWare

2017-12-09 Thread Brady Deetz
We have over 150 VMs running in vmware. We also have 2PB of Ceph for
filesystem. With our vmware storage aging and not providing the IOPs we
need, we are considering and hoping to use ceph. Ultimately, yes we will
move to KVM, but in the short term, we probably need to stay on VMware.

On Dec 9, 2017 6:26 PM, "Donny Davis" <do...@fortnebula.com> wrote:

> Just curious but why not just use a hypervisor with rbd support? Are there
> VMware specific features you are reliant on?
>
> On Fri, Dec 8, 2017 at 4:08 PM Brady Deetz <bde...@gmail.com> wrote:
>
>> I'm testing using RBD as VMWare datastores. I'm currently testing with
>> krbd+LVM on a tgt target hosted on a hypervisor.
>>
>> My Ceph cluster is HDD backed.
>>
>> In order to help with write latency, I added an SSD drive to my
>> hypervisor and made it a writeback cache for the rbd via LVM. So far I've
>> managed to smooth out my 4k write latency and have some pleasing results.
>>
>> Architecturally, my current plan is to deploy an iSCSI gateway on each
>> hypervisor hosting that hypervisor's own datastore.
>>
>> Does anybody have any experience with this kind of configuration,
>> especially with regard to LVM writeback caching combined with RBD?
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RBD+LVM -> iSCSI -> VMWare

2017-12-08 Thread Brady Deetz
I'm testing using RBD as VMWare datastores. I'm currently testing with
krbd+LVM on a tgt target hosted on a hypervisor.

My Ceph cluster is HDD backed.

In order to help with write latency, I added an SSD drive to my hypervisor
and made it a writeback cache for the rbd via LVM. So far I've managed to
smooth out my 4k write latency and have some pleasing results.

Architecturally, my current plan is to deploy an iSCSI gateway on each
hypervisor hosting that hypervisor's own datastore.

Does anybody have any experience with this kind of configuration,
especially with regard to LVM writeback caching combined with RBD?
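
For reference, a sketch of the LVM layering being tested here (device names and sizes are placeholders; /dev/rbd0 is the mapped image and /dev/nvme0n1 the local SSD):

pvcreate /dev/rbd0 /dev/nvme0n1
vgcreate vg_datastore /dev/rbd0 /dev/nvme0n1
lvcreate -n datastore -L 4T vg_datastore /dev/rbd0            # origin LV on the RBD only
lvcreate --type cache-pool -L 100G -n rbd_cache vg_datastore /dev/nvme0n1
lvconvert --type cache --cachemode writeback \
    --cachepool vg_datastore/rbd_cache vg_datastore/datastore
# note: with writeback, not-yet-flushed writes exist only on the local SSD
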
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] tcmu-runner failing during image creation

2017-12-04 Thread Brady Deetz
I thought I was good to go with tcmu-runner on Kernel 4.14, but I guess
not? Any thoughts on the output below?

2017-12-04 17:44:09,631 ERROR [rbd-target-api:665:_disk()] - LUN alloc
problem - Could not set LIO device attribute cmd_time_out/qfull_time_out
for device: iscsi-primary.primary00. Kernel not supported. - error(Cannot
find attribute: qfull_time_out)


[root@dc1srviscsi01 ~]# uname -a
Linux dc1srviscsi01.ceph.xxx.xxx 4.14.3-1.el7.elrepo.x86_64 #1 SMP Thu Nov
30 09:35:20 EST 2017 x86_64 x86_64 x86_64 GNU/Linux

ceph-iscsi-cli/
[root@dc1srviscsi01 ceph-iscsi-cli]# git branch
* (detached from 2.5)
  master

ceph-iscsi-cli/
[root@dc1srviscsi01 ceph-iscsi-cli]# git branch
* (detached from 2.5)
  master

ceph-iscsi-config/
[root@dc1srviscsi01 ceph-iscsi-config]# git branch
* (detached from 2.3)
  master

rtslib-fb/
[root@dc1srviscsi01 rtslib-fb]# git branch
* (detached from v2.1.fb64)
  master

targetcli-fb/
[root@dc1srviscsi01 targetcli-fb]# git branch
* (detached from v2.1.fb47)
  master

tcmu-runner/
[root@dc1srviscsi01 tcmu-runner]# git branch
* (detached from v1.3.0-rc4)
  master
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] injecting args output misleading

2017-12-04 Thread Brady Deetz
I'm not sure if this is a bug where ceph incorrectly reports to the user or
if this is just a matter of misleading language. Thought I might bring it
up in any case.

I under stand that "may require restart" is fairly direct in its ambiguity,
but this probably shouldn't be ambiguous without a good technical reason.
But, I find "not observed" to be quite misleading. These arg injections are
very clearly being observed. Maybe the output should be "not observed by
'component x', change may require restart." But, I'd still like a
definitive yes or no for service restarts required by arg injects.

I've run into this on osd args as well.

Ceph Luminous 12.2.1 (CentOS 7.4.1708)

[root@mon0 ceph-admin]# ceph --admin-daemon
/var/run/ceph/ceph-mon.mon0.asok config show | grep "mon_allow_pool_delete"
"mon_allow_pool_delete": "false",

[root@mon0 ceph-admin]# ceph tell mon.0 injectargs
'--mon_allow_pool_delete=true'
injectargs:mon_allow_pool_delete = 'true' (not observed, change may require
restart)
[root@mon0 ceph-admin]# ceph tell mon.1 injectargs
'--mon_allow_pool_delete=true'
injectargs:mon_allow_pool_delete = 'true' (not observed, change may require
restart)
[root@mon0 ceph-admin]# ceph tell mon.2 injectargs
'--mon_allow_pool_delete=true'
injectargs:mon_allow_pool_delete = 'true' (not observed, change may require
restart)

[root@mon0 ceph-admin]# ceph --admin-daemon
/var/run/ceph/ceph-mon.mon0.asok config show | grep "mon_allow_pool_delete"
"mon_allow_pool_delete": "true",

[root@mon0 ceph-admin]# ceph tell mon.0 injectargs
'--mon_allow_pool_delete=false'
injectargs:mon_allow_pool_delete = 'false' (not observed, change may
require restart)
[root@mon0 ceph-admin]# ceph tell mon.1 injectargs
'--mon_allow_pool_delete=false'
injectargs:mon_allow_pool_delete = 'false' (not observed, change may
require restart)
[root@mon0 ceph-admin]# ceph tell mon.2 injectargs
'--mon_allow_pool_delete=false'
injectargs:mon_allow_pool_delete = 'false' (not observed, change may
require restart)

[root@mon0 ceph-admin]# ceph --admin-daemon
/var/run/ceph/ceph-mon.mon0.asok config show | grep "mon_allow_pool_delete"
"mon_allow_pool_delete": "false",

Thanks for the hard work, devs!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph-mgr bug - zabbix module division by zero

2017-11-07 Thread Brady Deetz
I'm guessing this is not expected behavior


$ ceph zabbix send
Error EINVAL: Traceback (most recent call last):
  File "/usr/lib64/ceph/mgr/zabbix/module.py", line 234, in handle_command
self.send()
  File "/usr/lib64/ceph/mgr/zabbix/module.py", line 206, in send
data = self.get_data()
  File "/usr/lib64/ceph/mgr/zabbix/module.py", line 174, in get_data
osd_fill.append((float(osd['kb_used']) / float(osd['kb'])) * 100)
ZeroDivisionError: float division by zero


$rpm -qa |grep ceph
ceph-mon-12.2.1-0.el7.x86_64
ceph-12.2.1-0.el7.x86_64
libcephfs2-12.2.1-0.el7.x86_64
python-cephfs-12.2.1-0.el7.x86_64
ceph-base-12.2.1-0.el7.x86_64
ceph-common-12.2.1-0.el7.x86_64
ceph-osd-12.2.1-0.el7.x86_64
ceph-radosgw-12.2.1-0.el7.x86_64
ceph-deploy-1.5.39-0.noarch
ceph-selinux-12.2.1-0.el7.x86_64
ceph-mgr-12.2.1-0.el7.x86_64
ceph-release-1-1.el7.noarch
ceph-mds-12.2.1-0.el7.x86_64
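
The failing line divides by osd['kb'], so any OSD reporting zero capacity (for example a down/out or freshly created OSD) trips it. A minimal guard along these lines (a sketch against the loop shown in the traceback, with the surrounding variable names assumed, not the actual upstream fix) avoids the crash:

# in zabbix/module.py get_data(); "osds" stands in for whatever the module iterates over
for osd in osds:
    kb = float(osd.get('kb', 0))
    if kb == 0:                        # no capacity reported yet; would divide by zero
        continue
    osd_fill.append((float(osd['kb_used']) / kb) * 100)
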
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cephfs snapshot work

2017-11-07 Thread Brady Deetz
Are there any existing fuzzing tools you'd recommend? I know about ceph osd
thrash, which could be tested against, but what about on the client side? I
could just use something pre-built for posix, but that wouldn't coordinate
simulated failures on the storage side with actions against the fs. If
there is not any current tooling for coordinating server and client
simulation, maybe that's where I start.

On Nov 7, 2017 5:57 AM, "John Spray" <jsp...@redhat.com> wrote:

> On Sun, Nov 5, 2017 at 4:19 PM, Brady Deetz <bde...@gmail.com> wrote:
> > My organization has a production  cluster primarily used for cephfs
> upgraded
> > from jewel to luminous. We would very much like to have snapshots on that
> > filesystem, but understand that there are risks.
> >
> > What kind of work could cephfs admins do to help the devs stabilize this
> > feature?
>
> If you have a disposable test system, then you could install the
> latest master branch of Ceph (which has a stream of snapshot fixes in
> it) and run a replica of your intended workload.  If you can find
> snapshot bugs (especially crashes) on master then they will certainly
> attract interest.
>
> John
>
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Cephfs snapshot work

2017-11-05 Thread Brady Deetz
My organization has a production  cluster primarily used for cephfs
upgraded from jewel to luminous. We would very much like to have snapshots
on that filesystem, but understand that there are risks.

What kind of work could cephfs admins do to help the devs stabilize this
feature?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How dead is my ec pool?

2017-10-13 Thread Brady Deetz
At this point, before I go any further, I'm copying my pools to new pools
so that I can attempt manual rados operations.

My current thinking is I could compare all objects in the cache tier
against the ec pool. Then if the object doesn't exist, copy the object. If
the objects exist in both and are different replace the ec pool's object
with the cache tier's object.

thoughts?

On Fri, Oct 13, 2017 at 10:13 PM, Brady Deetz <bde...@gmail.com> wrote:

> TLDR; In Jewel, I briefly had 2 cache tiers assigned to an ec pool and I
> think that broke my ec pool. I then made a series of decisions attempting
> to repair that mistake. I now think I've caused further issues.
>
> Background:
>
> Following having some serious I/O issues with my ec pool's cache tier, I
> decided I wanted to use a cache tier hosted on a different set of disks
> than my current tier.
>
> My first potentially poor decision was not removing the original cache
> tier before adding the new one.
>
> Basically, the workflow was as follows:
>
> pools:
> data_ec
> data_cache
> data_new_cache
>
> ceph osd tier add data_ec data_new_cache
> ceph osd tier cache-mode data_new_cache writeback
>
> ceph osd tier set-overlay data_ec data_new_cache
> ceph osd pool set data_new_cache hit_set_type bloom
> ceph osd pool set data_new_cache hit_set_count 1
> ceph osd pool set data_new_cache hit_set_period 3600
> ceph osd pool set data_new_cache target_max_bytes 1
> ceph osd pool set data_new_cache min_read_recency_for_promote 1
> ceph osd pool set data_new_cache min_write_recency_for_promote 1
>
> #so now I decided to attempt to remove the old cache
> ceph osd tier cache-mode data_cache forward
>
> #here is where things got bad
> rados -p data_cache cache-flush-evict-all
>
> #every object rados attempted to flush from the cache, left errors of the
> following varieties
> #
> rados -p data_cache cache-flush-evict-all
> rbd_data.af81e6238e1f29.0001732e
> error listing snap shots /rbd_data.af81e6238e1f29.0001732e: (2)
> No such file or directory
> rbd_data.af81e6238e1f29.000143bb
> error listing snap shots /rbd_data.af81e6238e1f29.000143bb: (2)
> No such file or directory
> rbd_data.af81e6238e1f29.000cf89d
> failed to flush /rbd_data.af81e6238e1f29.000cf89d: (2) No such
> file or directory
> rbd_data.af81e6238e1f29.000cf82c
>
>
>
> #Following these errors, I thought maybe the world would become happy
> again if I just removed the newly added ecpool.
>
> ceph osd tier cache-mode data_new_cache forward
> rados -p data_new_cache cache-flush-evict-all
>
> #when running the evict against the new tier, I received no errors
> #and so begins potential mistake number 3
>
> ceph osd tier remove-overlay ec_data
> ceph osd tier remove data_ec data_new_cache
>
> #I received the same errors. while trying to evict
>
> #knowing my data had been untouched for over an hour, I made a terrible
> decison
> ceph osd tier remove data_ec data_cache
>
> #I then discovered that I couldn't add the new or the old cache back to
> the ec pool, even with --force-nonempty
>
> ceph osd tier add data_ec data_cache --force-nonempty
> Error ENOTEMPTY: tier pool 'data_cache' has snapshot state; it cannot be
> added as a tier without breaking the pool
>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] How dead is my ec pool?

2017-10-13 Thread Brady Deetz
TLDR; In Jewel, I briefly had 2 cache tiers assigned to an ec pool and I
think that broke my ec pool. I then made a series of decisions attempting
to repair that mistake. I now think I've caused further issues.

Background:

Following having some serious I/O issues with my ec pool's cache tier, I
decided I wanted to use a cache tier hosted on a different set of disks
than my current tier.

My first potentially poor decision was not removing the original cache tier
before adding the new one.

Basically, the workflow was as follows:

pools:
data_ec
data_cache
data_new_cache

ceph osd tier add data_ec data_new_cache
ceph osd tier cache-mode data_new_cache writeback

ceph osd tier set-overlay data_ec data_new_cache
ceph osd pool set data_new_cache hit_set_type bloom
ceph osd pool set data_new_cache hit_set_count 1
ceph osd pool set data_new_cache hit_set_period 3600
ceph osd pool set data_new_cache target_max_bytes 1
ceph osd pool set data_new_cache min_read_recency_for_promote 1
ceph osd pool set data_new_cache min_write_recency_for_promote 1

#so now I decided to attempt to remove the old cache
ceph osd tier cache-mode data_cache forward

#here is where things got bad
rados -p data_cache cache-flush-evict-all

#every object rados attempted to flush from the cache, left errors of the
following varieties
#
rados -p data_cache cache-flush-evict-all
rbd_data.af81e6238e1f29.0001732e
error listing snap shots /rbd_data.af81e6238e1f29.0001732e: (2) No
such file or directory
rbd_data.af81e6238e1f29.000143bb
error listing snap shots /rbd_data.af81e6238e1f29.000143bb: (2) No
such file or directory
rbd_data.af81e6238e1f29.000cf89d
failed to flush /rbd_data.af81e6238e1f29.000cf89d: (2) No such file
or directory
rbd_data.af81e6238e1f29.000cf82c



#Following these errors, I thought maybe the world would become happy again
if I just removed the newly added ecpool.

ceph osd tier cache-mode data_new_cache forward
rados -p data_new_cache cache-flush-evict-all

#when running the evict against the new tier, I received no errors
#and so begins potential mistake number 3

ceph osd tier remove-overlay ec_data
ceph osd tier remove data_ec data_new_cache

#I received the same errors. while trying to evict

#knowing my data had been untouched for over an hour, I made a terrible
decison
ceph osd tier remove data_ec data_cache

#I then discovered that I couldn't add the new or the old cache back to the
ec pool, even with --force-nonempty

ceph osd tier add data_ec data_cache --force-nonempty
Error ENOTEMPTY: tier pool 'data_cache' has snapshot state; it cannot be
added as a tier without breaking the pool
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph release cadence

2017-09-22 Thread Brady Deetz
I'll be first to admit that most of my comments are anecdotal. But, I
suspect when it comes to storage many of us don't require a lot to get
scared back into our dark corners. In short it seems that the dev team
should get better at selecting features and delivering on the existing
scheduled cadence before shortening it. To me, the odd releases represent
feature previews for the next even release. If that's a fair way to look at
them, they could play a very important role in the stability of the even
release.

On Sep 22, 2017 8:59 PM, "Sage Weil"  wrote:

On Fri, 22 Sep 2017, Gregory Farnum wrote:
> On Fri, Sep 22, 2017 at 3:28 PM, Sage Weil  wrote:
> > Here is a concrete proposal for everyone to summarily shoot down (or
> > heartily endorse, depending on how your friday is going):
> >
> > - 9 month cycle
> > - enforce a predictable release schedule with a freeze date and
> >   a release date.  (The actual .0 release of course depends on no
blocker
> >   bugs being open; not sure how zealous 'train' style projects do
> >   this.)
>
> Train projects basically commit to a feature freeze enough in advance
> of the expected release date that it's feasible, and don't let people
> fake it by rushing in stuff they "finished" the day before. I'm not
> sure if every-9-month LTSes will be more conducive to that or not — if
> we do scheduled releases, we still fundamentally need to be able to
> say "nope, that feature we've been saying for 9 months we hope to have
> out in this LTS won't make it until the next one". And we seem pretty
> bad at that.

I'll be the first to say I'm no small part of the "we" there.  But I'm
also suggesting that's not a reason not to try to do better.  As I
said I think this will be easier than in the past because we don't
have as many headline features we're trying to wedge in.


That's excellent as long as it actually happens. Otherwise the collective
you may end up pushing worse code on a 9mo cycle than the current
theoretical 12mo cycle that is delayed when necessary. We all know that
software development never happens on time or on budget.


In any case, is there an alternative way to get to the much-desired
regular cadence?

> > - no more even/odd pattern; all stable releases are created equal.
> > - support upgrades from up to 3 releases back.
> >
> > This shortens the cycle a bit to relieve the "this feature must go in"
> > stress, without making it so short as to make the release pointless
(e.g.,
> > infernalis, kraken).  (I also think that the feature pressure is much
> > lower now than it has been in the past.)
> >
> > This creates more work for the developers because there are more upgrade
> > paths to consider: we no longer have strict "choke points" (like all
> > upgrades must go through luminous).  We could reserve the option to pick
> > specific choke point releases in the future, perhaps taking care to make
> > sure these are the releases that go into downstream distros.  We'll need
> > to be more systematic about the upgrade testing.
>
> This sounds generally good to me — we did multiple-release upgrades
> for a long time, and stuff is probably more complicated now but I
> don't think it will actually be that big a deal.
>
> 3 releases back might be a bit much though — that's 27 months! (For
> luminous, the beginning of 2015. Hammer.)

I'm *much* happier with 2 :) so no complaint from me.  I just heard a lot
of "2 years" and 2 releases (18 months) doesn't quite cover it.  Maybe
it's best to start with that, though?  It's still an improvement over the
current ~12 months.


A lot of vulnerabilities and bugs can come out in one year. As such, I
upgrade anything in my environment, at minimum, once a year. The "if it
ain't broke don't fix it" mentality is usually more dangerous than an
upgrade between minor releases. But... I will say that as my Ceph
environment grows, upgrades become increasingly difficult to manage and
anxiety increases with every node I add to my growing 2PB cluster.


> > Somewhat separately, several people expressed concern about having
stable
> > releases to develop against.  This is somewhat orthogonal to what users
> > need.  To that end, we can do a dev checkpoint every 1 or 2 months
> > (preferences?), where we fork a 'next' branch and stabilize all of the
> > tests before moving on.  This is good practice anyway to avoid
> > accumulating low-frequency failures in the test suite that have to be
> > squashed at the end.
>
> So this sounds like a fine idea to me, but how do we distinguish this
> from the intermediate stable releases?
>
> By which I mean, are we *really* going to do a stabilization branch
> that will never get seen by users? What kind of testing and bug fixing
> are we going to commit to doing against it, and how do we balance that
> effort with feature work?
>
> It seems like the same conflict we have now, only since the dev
> checkpoints are less important they'll lose more often. Then we'll end

[ceph-users] Cephfs IO monitoring

2017-08-09 Thread Brady Deetz
Curious if there is a way I could see, in near real-time, the io
patterns for an fs. For instance, what files are currently being
read/written and the block sizes. I suspect this is a big ask. The only
thing I know of that can provide that level of detail for a filesystem is
dtrace with zfs.
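
The coarsest substitute I'm aware of within Ceph itself is the daemon admin
socket counters, which are per-MDS rather than per-file (the daemon name below
is an example, and daemonperf may or may not be present in your build):

# rolling per-second MDS counters: client request rates, journal activity, etc.
ceph daemonperf mds.mds0
# full one-shot dump of the same counters
ceph daemon mds.mds0 perf dump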
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs increase max file size

2017-08-04 Thread Brady Deetz
https://www.spinics.net/lists/ceph-users/msg36285.html
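
In short, the linked thread boils down to a single filesystem setting; a
sketch of the change for a 20 TB limit ("cephfs" is an assumed filesystem
name):

# 20 TB expressed in bytes (20 * 2^40)
ceph fs set cephfs max_file_size 21990232555520
# confirm the new value
ceph fs get cephfs | grep max_file_size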

On Aug 4, 2017 8:28 AM, "Rhian Resnick"  wrote:

> Morning,
>
>
> We ran into an issue with the default max file size of a cephfs file. Is
> it possible to increase this value to 20 TB from 1 TB without recreating
> the file system?
>
>
> Rhian Resnick
>
> Assistant Director Middleware and HPC
>
> Office of Information Technology
>
>
> Florida Atlantic University
>
> 777 Glades Road, CM22, Rm 173B
>
> Boca Raton, FL 33431
>
> Phone 561.297.2647 <(561)%20297-2647>
>
> Fax 561.297.0222 <(561)%20297-0222>
>
>  [image: image] 
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] oVirt/RHEV and Ceph

2017-07-25 Thread Brady Deetz
Thank you for the clarification. I'm excited to give it a shot.

On Jul 25, 2017 3:24 AM, "David" <dclistsli...@gmail.com> wrote:

> My understanding was Cinder is needed to create/delete/manage etc. on
> volumes but I/O to the volumes is direct from the hypervisors. In theory
> you could lose your Cinder service and VMs would stay up.
>
> On 25 Jul 2017 4:18 a.m., "Brady Deetz" <bde...@gmail.com> wrote:
>
> Thanks for pointing to some documentation. I'd seen that and it is
> certainly an option. From my understanding, with a Cinder deployment, you'd
> have the same failure domains and similar performance characteristics to an
> oVirt + NFS + RBD deployment. This is acceptable. But, the dream I have in
> my head is where the RBD images are mounted and controlled on each
> hypervisor instead of a central storage authority like Cinder. Does that
> exist for anything or is this a fundamentally flawed idea?
>
> On Mon, Jul 24, 2017 at 9:41 PM, Jason Dillaman <jdill...@redhat.com>
> wrote:
>
>> oVirt 3.6 added Cinder/RBD integration [1] and it looks like they are
>> currently working on integrating Cinder within a container to simplify
>> the integration [2].
>>
>> [1] http://www.ovirt.org/develop/release-management/features/sto
>> rage/cinder-integration/
>> [2] http://www.ovirt.org/develop/release-management/features/cin
>> derglance-docker-integration/
>>
>> On Mon, Jul 24, 2017 at 10:27 PM, Brady Deetz <bde...@gmail.com> wrote:
>> > Funny enough, I just had a call with Redhat where the OpenStack
>> engineer was
>> > voicing his frustration that there wasn't any movement on RBD for oVirt.
>> > This is important to me because I'm building out a user-facing private
>> cloud
>> > that just isn't going to be big enough to justify OpenStack and its
>> > administrative overhead. But, I already have 1.75PB (soon to be 2PB) of
>> > CephFS in production. So, it puts me in a really difficult design
>> position.
>> >
>> > On Mon, Jul 24, 2017 at 9:09 PM, Dino Yancey <dino2...@gmail.com>
>> wrote:
>> >>
>> >> I was as much as told by Redhat in a sales call that they push Gluster
>> >> for oVirt/RHEV and Ceph for OpenStack, and don't have any plans to
>> >> change that in the short term. (note this was about a year ago, i
>> >> think - so this isn't super current information).
>> >>
>> >> I seem to recall the hangup was that oVirt had no orchestration
>> >> capability for RBD comparable to OpenStack, and that CephFS wasn't
>> >> (yet?) viable for use as a "POSIX filesystem" oVirt storage domain.
>> >> Personally, I feel like Redhat is worried about competing with
>> >> themselves with GlusterFS versus CephFS and is choosing to focus on
>> >> Gluster as a filesystem, and Ceph as everything minus the filesystem.
>> >>
>> >> Which is a shame, as I'm a fan of both Ceph and oVirt and would love
>> >> to use my existing RHEV infrastructure to bring Ceph into my
>> >> environment.
>> >>
>> >>
>> >> On Mon, Jul 24, 2017 at 8:39 PM, Brady Deetz <bde...@gmail.com> wrote:
>> >> > I haven't seen much talk about direct integration with oVirt.
>> Obviously
>> >> > it
>> >> > kind of comes down to oVirt being interested in participating. But,
>> is
>> >> > the
>> >> > only hold-up getting development time toward an integration or is
>> there
>> >> > some
>> >> > kind of friction between the dev teams?
>> >> >
>> >> > ___
>> >> > ceph-users mailing list
>> >> > ceph-users@lists.ceph.com
>> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> __
>> >> Dino Yancey
>> >> 2GNT.com Admin
>> >
>> >
>> >
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >
>>
>>
>>
>> --
>> Jason
>>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] oVirt/RHEV and Ceph

2017-07-24 Thread Brady Deetz
Thanks for pointing to some documentation. I'd seen that and it is
certainly an option. From my understanding, with a Cinder deployment, you'd
have the same failure domains and similar performance characteristics to an
oVirt + NFS + RBD deployment. This is acceptable. But, the dream I have in
my head is where the RBD images are mounted and controlled on each
hypervisor instead of a central storage authority like Cinder. Does that
exist for anything or is this a fundamentally flawed idea?

On Mon, Jul 24, 2017 at 9:41 PM, Jason Dillaman <jdill...@redhat.com> wrote:

> oVirt 3.6 added Cinder/RBD integration [1] and it looks like they are
> currently working on integrating Cinder within a container to simplify
> the integration [2].
>
> [1] http://www.ovirt.org/develop/release-management/features/
> storage/cinder-integration/
> [2] http://www.ovirt.org/develop/release-management/features/
> cinderglance-docker-integration/
>
> On Mon, Jul 24, 2017 at 10:27 PM, Brady Deetz <bde...@gmail.com> wrote:
> > Funny enough, I just had a call with Redhat where the OpenStack engineer
> was
> > voicing his frustration that there wasn't any movement on RBD for oVirt.
> > This is important to me because I'm building out a user-facing private
> cloud
> > that just isn't going to be big enough to justify OpenStack and its
> > administrative overhead. But, I already have 1.75PB (soon to be 2PB) of
> > CephFS in production. So, it puts me in a really difficult design
> position.
> >
> > On Mon, Jul 24, 2017 at 9:09 PM, Dino Yancey <dino2...@gmail.com> wrote:
> >>
> >> I was as much as told by Redhat in a sales call that they push Gluster
> >> for oVirt/RHEV and Ceph for OpenStack, and don't have any plans to
> >> change that in the short term. (note this was about a year ago, i
> >> think - so this isn't super current information).
> >>
> >> I seem to recall the hangup was that oVirt had no orchestration
> >> capability for RBD comparable to OpenStack, and that CephFS wasn't
> >> (yet?) viable for use as a "POSIX filesystem" oVirt storage domain.
> >> Personally, I feel like Redhat is worried about competing with
> >> themselves with GlusterFS versus CephFS and is choosing to focus on
> >> Gluster as a filesystem, and Ceph as everything minus the filesystem.
> >>
> >> Which is a shame, as I'm a fan of both Ceph and oVirt and would love
> >> to use my existing RHEV infrastructure to bring Ceph into my
> >> environment.
> >>
> >>
> >> On Mon, Jul 24, 2017 at 8:39 PM, Brady Deetz <bde...@gmail.com> wrote:
> >> > I haven't seen much talk about direct integration with oVirt.
> Obviously
> >> > it
> >> > kind of comes down to oVirt being interested in participating. But, is
> >> > the
> >> > only hold-up getting development time toward an integration or is
> there
> >> > some
> >> > kind of friction between the dev teams?
> >> >
> >> > ___
> >> > ceph-users mailing list
> >> > ceph-users@lists.ceph.com
> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> >
> >>
> >>
> >>
> >> --
> >> __
> >> Dino Yancey
> >> 2GNT.com Admin
> >
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
>
>
> --
> Jason
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] oVirt/RHEV and Ceph

2017-07-24 Thread Brady Deetz
Funny enough, I just had a call with Redhat where the OpenStack engineer
was voicing his frustration that there wasn't any movement on RBD for
oVirt. This is important to me because I'm building out a user-facing
private cloud that just isn't going to be big enough to justify OpenStack
and its administrative overhead. But, I already have 1.75PB (soon to be
2PB) of CephFS in production. So, it puts me in a really difficult design
position.

On Mon, Jul 24, 2017 at 9:09 PM, Dino Yancey <dino2...@gmail.com> wrote:

> I was as much as told by Redhat in a sales call that they push Gluster
> for oVirt/RHEV and Ceph for OpenStack, and don't have any plans to
> change that in the short term. (note this was about a year ago, i
> think - so this isn't super current information).
>
> I seem to recall the hangup was that oVirt had no orchestration
> capability for RBD comparable to OpenStack, and that CephFS wasn't
> (yet?) viable for use as a "POSIX filesystem" oVirt storage domain.
> Personally, I feel like Redhat is worried about competing with
> themselves with GlusterFS versus CephFS and is choosing to focus on
> Gluster as a filesystem, and Ceph as everything minus the filesystem.
>
> Which is a shame, as I'm a fan of both Ceph and oVirt and would love
> to use my existing RHEV infrastructure to bring Ceph into my
> environment.
>
>
> On Mon, Jul 24, 2017 at 8:39 PM, Brady Deetz <bde...@gmail.com> wrote:
> > I haven't seen much talk about direct integration with oVirt. Obviously
> it
> > kind of comes down to oVirt being interested in participating. But, is
> the
> > only hold-up getting development time toward an integration or is there
> some
> > kind of friction between the dev teams?
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
>
>
> --
> __
> Dino Yancey
> 2GNT.com Admin
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] oVirt/RHEV and Ceph

2017-07-24 Thread Brady Deetz
I haven't seen much talk about direct integration with oVirt. Obviously it
kind of comes down to oVirt being interested in participating. But, is the
only hold-up getting development time toward an integration or is there
some kind of friction between the dev teams?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Re: How's cephfs going?

2017-07-19 Thread Brady Deetz
Thanks Greg. I thought it was impossible when I reported 34MB for 52
million files.
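
For anyone else trying to gauge their real metadata footprint after reading
the explanation below, a rough upper bound (journals and other overhead are
mixed into the same number, so treat it as a ceiling):

ceph df detail   # global used space and per-pool usage (omap is not broken out)
rados df         # per-pool object counts and sizes
# global used, minus the sum of per-pool usage, minus journal space,
# approximates the omap/RocksDB data where CephFS metadata actually lives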

On Jul 19, 2017 1:17 PM, "Gregory Farnum" <gfar...@redhat.com> wrote:

>
>
> On Wed, Jul 19, 2017 at 10:25 AM David <dclistsli...@gmail.com> wrote:
>
>> On Tue, Jul 18, 2017 at 6:54 AM, Blair Bethwaite <
>> blair.bethwa...@gmail.com> wrote:
>>
>>> We are a data-intensive university, with an increasingly large fleet
>>> of scientific instruments capturing various types of data (mostly
>>> imaging of one kind or another). That data typically needs to be
>>> stored, protected, managed, shared, connected/moved to specialised
>>> compute for analysis. Given the large variety of use-cases we are
>>> being somewhat more circumspect in our CephFS adoption and really only
>>> dipping toes in the water, ultimately hoping it will become a
>>> long-term default NAS choice from Luminous onwards.
>>>
>>> On 18 July 2017 at 15:21, Brady Deetz <bde...@gmail.com> wrote:
>>> > All of that said, you could also consider using rbd and zfs or
>>> whatever filesystem you like. That would allow you to gain the benefits of
>>> scaleout while still getting a feature rich fs. But, there are some down
>>> sides to that architecture too.
>>>
>>> We do this today (KVMs with a couple of large RBDs attached via
>>> librbd+QEMU/KVM), but the throughput able to be achieved this way is
>>> nothing like native CephFS - adding more RBDs doesn't seem to help
>>> increase overall throughput. Also, if you have NFS clients you will
>>> absolutely need SSD ZIL. And of course you then have a single point of
>>> failure and downtime for regular updates etc.
>>>
>>> In terms of small file performance I'm interested to hear about
>>> experiences with in-line file storage on the MDS.
>>>
>>> Also, while we're talking about CephFS - what size metadata pools are
>>> people seeing on their production systems with 10s-100s millions of
>>> files?
>>>
>>
>> On a system with 10.1 million files, metadata pool is 60MB
>>
>>
> Unfortunately that's not really an accurate assessment, for good but
> terrible reasons:
> 1) CephFS metadata is principally stored via the omap interface (which is
> designed for handling things like the directory storage CephFS needs)
> 2) omap is implemented via Level/RocksDB
> 3) there is not a good way to determine which pool is responsible for
> which portion of RocksDBs data
> 4) So the pool stats do not incorporate omap data usage at all in their
> reports (it's part of the overall space used, and is one of the things that
> can make that larger than the sum of the per-pool spaces)
>
> You could try and estimate it by looking at how much "lost" space there is
> (and subtracting out journal sizes and things, depending on setup). But I
> promise there's more than 60MB of CephFS metadata for 10.1 million files!
> -Greg
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Re: How's cephfs going?

2017-07-18 Thread Brady Deetz
We have a cephfs data pool with 52.8M files stored in 140.7M objects. That
translates to a metadata pool size of 34.6MB across 1.5M objects.

On Jul 18, 2017 12:54 AM, "Blair Bethwaite" <blair.bethwa...@gmail.com>
wrote:

> We are a data-intensive university, with an increasingly large fleet
> of scientific instruments capturing various types of data (mostly
> imaging of one kind or another). That data typically needs to be
> stored, protected, managed, shared, connected/moved to specialised
> compute for analysis. Given the large variety of use-cases we are
> being somewhat more circumspect in our CephFS adoption and really only
> dipping toes in the water, ultimately hoping it will become a
> long-term default NAS choice from Luminous onwards.
>
> On 18 July 2017 at 15:21, Brady Deetz <bde...@gmail.com> wrote:
> > All of that said, you could also consider using rbd and zfs or whatever
> filesystem you like. That would allow you to gain the benefits of scaleout
> while still getting a feature rich fs. But, there are some down sides to
> that architecture too.
>
> We do this today (KVMs with a couple of large RBDs attached via
> librbd+QEMU/KVM), but the throughput able to be achieved this way is
> nothing like native CephFS - adding more RBDs doesn't seem to help
> increase overall throughput. Also, if you have NFS clients you will
> absolutely need SSD ZIL. And of course you then have a single point of
> failure and downtime for regular updates etc.
>
> In terms of small file performance I'm interested to hear about
> experiences with in-line file storage on the MDS.
>
> Also, while we're talking about CephFS - what size metadata pools are
> people seeing on their production systems with 10s-100s millions of
> files?
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Re: How's cephfs going?

2017-07-17 Thread Brady Deetz
No problem. We are a functional MRI research institute. We have a fairly
mixed workload. But, I can tell you that we see 60+gbps of throughput when
multiple clients are reading sequentially on large files (1+GB) with 1-4MB
block sizes. IO involving small files and small block sizes is not very
good. SSD would help a lot with small io, but our hardware architecture is
not designed for that and we don't care too much about throughput when a
person opens a spreadsheet.

One of the greatest benefits we've gained from CephFS, one that wasn't
expected to be as consequential as it was, is the xattrs. Specifically,
ceph.dir.*: we use this feature to track usage and it has dramatically
reduced the number of metadata operations we perform while trying to
determine statistics about a directory.
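
For reference, this is the sort of query I mean (the mount point and path are
examples; the recursive attribute names assume a reasonably recent kernel or
FUSE client):

getfattr -n ceph.dir.rbytes   /mnt/cephfs/projects/study01   # recursive bytes under the dir
getfattr -n ceph.dir.rfiles   /mnt/cephfs/projects/study01   # recursive file count
getfattr -n ceph.dir.rentries /mnt/cephfs/projects/study01   # recursive files + subdirs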

But, we very much miss the ability to perform nightly snapshots. I think
snapshots are supposed to be marked stable soon, but for now it is my
understanding that they are still not listed as stable. The xattrs have
indirectly facilitated this, but it isn't as convenient as a filesystem
snapshot.

All of that said, you could also consider using rbd and zfs or whatever
filesystem you like. That would allow you to gain the benefits of scaleout
while still getting a feature rich fs. But, there are some down sides to
that architecture too.

On Jul 17, 2017 10:21 PM, "许雪寒" <xuxue...@360.cn> wrote:

Thanks, sir☺
You are really a lot of help☺

May I ask what kind of business are you using cephFS for? What's the io
pattern:-)

If answering this may involve any business secret, I really understand if
you don't answer:-)

Thanks again:-)

From: Brady Deetz [mailto:bde...@gmail.com]
Sent: 2017-07-18 8:01
To: 许雪寒
Cc: ceph-users
Subject: Re: [ceph-users] How's cephfs going?

I feel that the correct answer to this question is: it depends.

I've been running a 1.75PB Jewel based cephfs cluster in production for
about 2 years at Laureate Institute for Brain Research. Before that we
had a good 6-8 month planning and evaluation phase. I'm running with
active/standby dedicated mds servers, 3x dedicated mons, and 12 osd nodes
with 24 disks in each server. Every group of 12 disks has journals mapped
to 1x Intel P3700. Each osd node has dual 40gbps ethernet lagged with lacp.
In our evaluation we did find that the rumors are true. Your cpu choice
will influence performance.

Here's why my answer is "it depends." If you expect to get the same
complete feature set as you do with isilon, scale-io, gluster, or other
more established scaleout systems, it is not production ready. But, in
terms of stability, it is. Over the course of the past 2 years I've
triggered 1 mds bug that put my filesystem into read only mode. That bug
was patched in 8 hours thanks to this community. Also, that bug was triggered
by a stupid mistake on my part that the application did not validate before
the action was performed.

If you have a couple of people with a strong background in Linux,
networking, and architecture, I'd say Ceph may be a good fit for you. If
not, maybe not.

On Jul 16, 2017 9:59 PM, "许雪寒" <xuxue...@360.cn> wrote:
Hi, everyone.

We intend to use cephfs of Jewel version, however, we don’t know its
status. Is it production ready in Jewel? Does it still have lots of bugs?
Is it a major effort of the current ceph development? And who are using
cephfs now?

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How's cephfs going?

2017-07-17 Thread Brady Deetz
I feel that the correct answer to this question is: it depends.

I've been running a 1.75PB Jewel based cephfs cluster in production for
about 2 years at Laureate Institute for Brain Research. Before that we
had a good 6-8 month planning and evaluation phase. I'm running with
active/standby dedicated mds servers, 3x dedicated mons, and 12 osd nodes
with 24 disks in each server. Every group of 12 disks has journals mapped
to 1x Intel P3700. Each osd node has dual 40gbps ethernet lagged with lacp.
In our evaluation we did find that the rumors are true. Your cpu choice
will influence performance.

Here's why my answer is "it depends." If you expect to get the same
complete feature set as you do with isilon, scale-io, gluster, or other
more established scaleout systems, it is not production ready. But, in
terms of stability, it is. Over the course of the past 2 years I've
triggered 1 mds bug that put my filesystem into read only mode. That bug
was patched in 8 hours thanks to this community. Also, that bug was triggered
by a stupid mistake on my part that the application did not validate before
the action was performed.

If you have a couple of people with a strong background in Linux,
networking, and architecture, I'd say Ceph may be a good fit for you. If
not, maybe not.

On Jul 16, 2017 9:59 PM, "许雪寒"  wrote:

> Hi, everyone.
>
>
>
> We intend to use cephfs of Jewel version, however, we don’t know its
> status. Is it production ready in Jewel? Does it still have lots of bugs?
> Is it a major effort of the current ceph development? And who are using
> cephfs now?
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Access rights of /var/lib/ceph with Jewel

2017-07-10 Thread Brady Deetz
From a least privilege standpoint, o=rx seems bad. Instead, if you need a
user to have rx, why not set a default ACL on each osd to allow Nagios to
have rx?
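
Something along these lines, assuming a monitoring user named "nagios" (the
user name and paths are examples):

setfacl -m u:nagios:rx /var/lib/ceph         # grant just that user read/execute
setfacl -d -m u:nagios:rx /var/lib/ceph      # default ACL so new entries inherit it
getfacl /var/lib/ceph                        # verify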

I think it's designed to follow best practice. If a user wishes to accept
additional risk, that's their risk.

On Jul 10, 2017 8:10 AM, "Jens Rosenboom"  wrote:

> 2017-07-10 10:40 GMT+00:00 Christian Balzer :
> > On Mon, 10 Jul 2017 11:27:26 +0200 Marc Roos wrote:
> >
> >> Looks to me by design (from rpm install), and the settings of the
> >> directorys below are probably the result of a user umask setting.
> >
> > I know it's deliberate, I'm asking why.
>
> It seems to have been introduced in
> https://github.com/ceph/ceph/pull/4456 and Sage writes there:
>
> > need to validate the permissiong choices for /var/log/ceph adn
> /var/lib/ceph
>
> I agree with you that setting "o=rx" would be a more sensible choice.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Object repair not going as planned

2017-06-26 Thread Brady Deetz
Resolved.

After all of the involved OSDs had been down for a while, I brought them
back up and issued another ceph pg repair. We are clean now.
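
For the archives, the sequence was roughly the following (the osd id and pg id
are placeholders, and the object backups mentioned earlier were taken before
touching anything):

ceph osd set noout                 # avoid rebalancing while OSDs are down
systemctl stop ceph-osd@160        # stop each OSD in the pg's acting set
ceph-osd -i 160 --flush-journal    # flush its journal to the filestore
systemctl start ceph-osd@160       # bring the OSDs back up
ceph osd unset noout
ceph pg repair <pgid>              # re-issue the repair once they have rejoined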

On Sun, Jun 25, 2017 at 11:54 PM, Brady Deetz <bde...@gmail.com> wrote:

> I should have mentioned, I'm running ceph jewel 10.2.7
>
> On Sun, Jun 25, 2017 at 11:46 PM, Brady Deetz <bde...@gmail.com> wrote:
>
>> Over the course of the past year, I've had 3 instances where I had to
>> manually repair an object due to size. In this case, I was immediately
>> disappointed to discover what I think is evidence of only 1 of 3 replicas
>> good. It got worse when a segfault occurred I attempted to flush the
>> journal for one of the seemingly bad replicas.
>>
>> Below is a segfault from ceph-osd -i 160 --flush-journal
>> https://pastebin.com/GQkCn9T9
>>
>> More logs and command history can be found here:
>> https://pastebin.com/5knjNTd0
>>
>> So far, I've copied the object file to a tmp backup location, set noout,
>> stopped the osd service for the associated osds for that pg, flushed the
>> journals, and made a second copy of the objects post flush.
>>
>> Any help would be greatly appreciated.
>>
>> I'm considering just deleting the 2 known bad files and attempting a ceph
>> pg repair. But, I'm not really sure that will work with only 1 good replica.
>>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Object repair not going as planned

2017-06-25 Thread Brady Deetz
I should have mentioned, I'm running ceph jewel 10.2.7

On Sun, Jun 25, 2017 at 11:46 PM, Brady Deetz <bde...@gmail.com> wrote:

> Over the course of the past year, I've had 3 instances where I had to
> manually repair an object due to size. In this case, I was immediately
> disappointed to discover what I think is evidence of only 1 of 3 replicas
> good. It got worse when a segfault occurred I attempted to flush the
> journal for one of the seemingly bad replicas.
>
> Below is a segfault from ceph-osd -i 160 --flush-journal
> https://pastebin.com/GQkCn9T9
>
> More logs and command history can be found here:
> https://pastebin.com/5knjNTd0
>
> So far, I've copied the object file to a tmp backup location, set noout,
> stopped the osd service for the associated osds for that pg, flushed the
> journals, and made a second copy of the objects post flush.
>
> Any help would be greatly appreciated.
>
> I'm considering just deleting the 2 known bad files and attempting a ceph
> pg repair. But, I'm not really sure that will work with only 1 good replica.
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Object repair not going as planned

2017-06-25 Thread Brady Deetz
Over the course of the past year, I've had 3 instances where I had to
manually repair an object due to a size mismatch. In this case, I was
immediately disappointed to discover what I think is evidence that only 1 of
3 replicas is good. It got worse when a segfault occurred while I attempted
to flush the journal for one of the seemingly bad replicas.

Below is a segfault from ceph-osd -i 160 --flush-journal
https://pastebin.com/GQkCn9T9

More logs and command history can be found here:
https://pastebin.com/5knjNTd0

So far, I've copied the object file to a tmp backup location, set noout,
stopped the osd service for the associated osds for that pg, flushed the
journals, and made a second copy of the objects post flush.

Any help would be greatly appreciated.

I'm considering just deleting the 2 known bad files and attempting a ceph
pg repair. But, I'm not really sure that will work with only 1 good replica.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Transitioning to Intel P4600 from P3700 Journals

2017-06-21 Thread Brady Deetz
On Jun 21, 2017 8:15 PM, "Christian Balzer" <ch...@gol.com> wrote:

On Wed, 21 Jun 2017 19:44:08 -0500 Brady Deetz wrote:

> Hello,
> I'm expanding my 288 OSD, primarily cephfs, cluster by about 16%. I have
12
> osd nodes with 24 osds each. Each osd node has 2 P3700 400GB NVMe PCIe
> drives providing 10GB journals for groups of 12 6TB spinning rust drives
> and 2x lacp 40gbps ethernet.
>
> Our hardware provider is recommending that we start deploying P4600 drives
> in place of our P3700s due to availability.
>
Welcome to the club and make sure to express your displeasure about
Intel's "strategy" to your vendor.

The P4600s are a poor replacement for P3700s and also still just
"announced" according to ARK.

Are you happy with your current NVMes?
Firstly as in, what is their wearout, are you expecting them to easily
survive 5 years at the current rate?
Secondly, how about speed? with 12 HDDs and 1GB/s write capacity of the
NVMe I'd expect them to not be a bottleneck in nearly all real life
situations.

Keep in mind that 1.6TB P4600 is going to last about as long as your 400GB
P3700, so if wear-out is a concern, don't put more stress on them.


Oddly enough, the Intel tools are telling me that we've only used about 10%
of each drive's endurance over the past year. This honestly surprises me
due to our workload, but maybe I'm thinking my researchers are doing more
science than they actually are.
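
For anyone curious, that endurance figure comes straight from the drives'
NVMe health data; either of these shows it (device path is an example):

smartctl -a /dev/nvme0       # look for the "Percentage Used" endurance estimate
nvme smart-log /dev/nvme0    # same data via nvme-cli, if installed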


Also the P4600 is only slightly faster in writes than the P3700, so that's
where putting more workload onto them is going to be a notable issue.

> I've seen some talk on here regarding this, but wanted to throw an idea
> around. I was okay throwing away 280GB of fast capacity for the purpose of
> providing reliable journals. But with as much free capacity as we'd have
> with a 4600, maybe I could use that extra capacity as a cache tier for
> writes on an rbd ec pool. If I wanted to go that route, I'd probably
> replace several existing 3700s with 4600s to get additional cache
capacity.
> But, that sounds risky...
>
Risky as in high failure domain concentration and as mentioned above a
cache-tier with obvious inline journals and thus twice the bandwidth needs
will likely eat into the write speed capacity of the journals.


Agreed. On the topic of journals and double bandwidth, am I correct in
thinking that btrfs (as insane as it may be) does not require double
bandwidth like xfs? Furthermore with bluestore being close to stable, will
my architecture need to change?


If (and seems to be a big IF) you can find them, the Samsung PM1725a 1.6TB
seems to be a) cheaper and b) at 2GB/s write speed more likely to be
suitable for double duty.
Similar (slightly better on paper) endurance than then P4600, so keep that
in mind, too.


My vendor is an HPC vendor so /maybe/ they have access to these elusive
creatures. In which case, how many do you want? Haha


Christian
--
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Rakuten Communications
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Transitioning to Intel P4600 from P3700 Journals

2017-06-21 Thread Brady Deetz
Hello,
I'm expanding my 288 OSD, primarily cephfs, cluster by about 16%. I have 12
osd nodes with 24 osds each. Each osd node has 2 P3700 400GB NVMe PCIe
drives providing 10GB journals for groups of 12 6TB spinning rust drives
and 2x lacp 40gbps ethernet.

Our hardware provider is recommending that we start deploying P4600 drives
in place of our P3700s due to availability.

I've seen some talk on here regarding this, but wanted to throw an idea
around. I was okay throwing away 280GB of fast capacity for the purpose of
providing reliable journals. But with as much free capacity as we'd have
with a 4600, maybe I could use that extra capacity as a cache tier for
writes on an rbd ec pool. If I wanted to go that route, I'd probably
replace several existing 3700s with 4600s to get additional cache capacity.
But, that sounds risky...

What do you guys think?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs file size limit 0f 1.1TB?

2017-05-24 Thread Brady Deetz
Are there any repercussions to configuring this on an existing large fs?

On Wed, May 24, 2017 at 1:36 PM, John Spray  wrote:

> On Wed, May 24, 2017 at 7:19 PM, Jake Grimmett 
> wrote:
> > Dear All,
> >
> > I've been testing out cephfs, and bumped into what appears to be an upper
> > file size limit of ~1.1TB
> >
> > e.g:
> >
> > [root@cephfs1 ~]# time rsync --progress -av /ssd/isilon_melis.tar
> > /ceph/isilon_melis.tar
> > sending incremental file list
> > isilon_melis.tar
> > 1099341824000  54%  237.51MB/s1:02:05
> > rsync: writefd_unbuffered failed to write 4 bytes to socket [sender]:
> > Broken pipe (32)
> > rsync: write failed on "/ceph/isilon_melis.tar": File too large (27)
> > rsync error: error in file IO (code 11) at receiver.c(322)
> [receiver=3.0.9]
> > rsync: connection unexpectedly closed (28 bytes received so far) [sender]
> > rsync error: error in rsync protocol data stream (code 12) at io.c(605)
> > [sender=3.0.9]
> >
> > Firstly, is this expected?
>
> CephFS has a configurable maximum file size, it's 1TB by default.
>
> Change it with:
>   ceph fs set  max_file_size 
>
> John
>
>
>
>
>
> >
> > If not, then does anyone have any suggestions on where to start digging?
> >
> > I'm using erasure encoding (4+1, 50 x 8TB drives over 5 servers), with an
> > nvme hot pool of 4 drives (2 x replication).
> >
> > I've tried both Kraken (release), and the latest Luminous Dev.
> >
> > many thanks,
> >
> > Jake
> > --
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pg marked inconsistent while appearing to be consistent

2017-05-12 Thread Brady Deetz
This was a weird one. Eventually 2/3 files were the correct size and 1 of
them remained incorrect. At that point, I just followed the normal manual
repair process from the documentation at
http://ceph.com/geen-categorie/ceph-manually-repair-object/ .

Is it possible that the journal just hadn't flushed to disk yet? I thought
there was a timeout where the journal would flush even if it were not full
after some sensible amount of time.
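
If I remember right, the periodic flush is governed by the filestore sync
intervals rather than the journal itself; a quick way to see what an OSD is
actually running with (osd.0 is an example, and the stock defaults are small):

ceph daemon osd.0 config get filestore_max_sync_interval   # default on the order of 5 seconds
ceph daemon osd.0 config get filestore_min_sync_interval   # default on the order of 0.01 seconds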

On Fri, May 12, 2017 at 11:51 AM, Brady Deetz <bde...@gmail.com> wrote:

> I have a cluster with 1 inconsistent pg. I have attempted the following
> steps with no luck. What should my next move be?
>
> 1. executed ceph health detail to determine what pg was inconsistent
> [ceph-admin@admin libr-cluster]$ ceph health detail
> HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
> pg 1.959 is active+clean+inconsistent, acting [69,252,127]
> 1 scrub errors
>
> 2. executed ceph pg repair 1.959
>
> 3. after nothing happening for quite a while I decided to dig into it a
> bit. What strikes me most odd is that all of these files seem to be
> consistent using size, md5, and sha256. It's also a little concerning that
> they are all 0 for size.
> [root@osd5 ceph-admin]# rados list-inconsistent-pg cephfs_data
> ["1.959"]
> [root@osd5 ceph-admin]# rados list-inconsistent-pg cephfs_metadata
> []
> [root@osd5 ceph-admin]# rados list-inconsistent-pg rbd
> []
> [root@osd5 ceph-admin]# rados list-inconsistent-pg vmware_ecpool
> []
> [root@osd5 ceph-admin]# rados list-inconsistent-pg vmware_cache
> []
>
> [root@osd5 ceph-admin]# rados list-inconsistent-obj 1.959
> --format=json-pretty
> {
> "epoch": 178113,
> "inconsistents": []
> }
>
>
> [root@osd5 ceph-admin]# grep -Hn 'ERR' /var/log/ceph/ceph-osd.69.log
>
> [root@osd5 ceph]# zgrep -Hn 'ERR' ./ceph-osd.69.log-*
> ./ceph-osd.69.log-20170512.gz:717:2017-05-11 09:23:11.734142 7ff46cbe4700
> -1 log_channel(cluster) log [ERR] : scrub 1.959 
> 1:9a97a372:::10004313b01.0004:head
> on disk size (0) does not match object info size (1417216) adjusted for
> ondisk to (1417216)
> ./ceph-osd.69.log-20170512.gz:785:2017-05-11 09:26:02.877409 7ff46a3df700
> -1 log_channel(cluster) log [ERR] : 1.959 scrub 1 errors
>
>
> [root@osd0 ceph]# grep -Hn 'ERR' ./ceph-osd.127.log
>
> [root@osd0 ceph]# zgrep -Hn 'ERR' ./ceph-osd.127.log-*
>
>
> [root@osd11 ceph-admin]# grep -Hn 'ERR' /var/log/ceph/ceph-osd.252.log
>
> [root@osd11 ceph-admin]# zgrep -Hn 'ERR' /var/log/ceph/ceph-osd.252.log-*
>
>
>
>
> [root@osd5 ceph]# find /var/lib/ceph/osd/ceph-69/current/1.959_head/
> -name '10004313b01.0004*' -ls
> 27377764870 -rw-r--r--   1 ceph ceph0 May 10 07:01
> /var/lib/ceph/osd/ceph-69/current/1.959_head/DIR_9/DIR_
> 5/DIR_9/DIR_E/DIR_5/10004313b01.0004__head_4EC5E959__1
>
> [root@osd5 ceph-admin]# md5sum /var/lib/ceph/osd/ceph-69/
> current/1.959_head/DIR_9/DIR_5/DIR_9/DIR_E/DIR_5/
> 10004313b01.0004__head_4EC5E959__1
> d41d8cd98f00b204e9800998ecf8427e  /var/lib/ceph/osd/ceph-69/
> current/1.959_head/DIR_9/DIR_5/DIR_9/DIR_E/DIR_5/
> 10004313b01.0004__head_4EC5E959__1
>
> [root@osd5 ceph-admin]# sha256sum /var/lib/ceph/osd/ceph-69/
> current/1.959_head/DIR_9/DIR_5/DIR_9/DIR_E/DIR_5/
> 10004313b01.0004__head_4EC5E959__1
> e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
>  /var/lib/ceph/osd/ceph-69/current/1.959_head/DIR_9/DIR_
> 5/DIR_9/DIR_E/DIR_5/10004313b01.0004__head_4EC5E959__1
>
> [root@osd5 ceph-admin]# ls -l /var/lib/ceph/osd/ceph-69/
> current/1.959_head/DIR_9/DIR_5/DIR_9/DIR_E/DIR_5/
> 10004313b01.0004__head_4EC5E959__1
> -rw-r--r--. 1 ceph ceph 0 May 10 07:01 /var/lib/ceph/osd/ceph-69/
> current/1.959_head/DIR_9/DIR_5/DIR_9/DIR_E/DIR_5/
> 10004313b01.0004__head_4EC5E959__1
>
>
> [root@osd0 ceph-admin]# find /var/lib/ceph/osd/ceph-127/current/1.959_head/
> -name '10004313b01.0004*' -ls
> 26846610640 -rw-r--r--   1 ceph ceph0 May 10 07:01
> /var/lib/ceph/osd/ceph-127/current/1.959_head/DIR_9/DIR_
> 5/DIR_9/DIR_E/DIR_5/10004313b01.0004__head_4EC5E959__1
>
> [root@osd0 ceph-admin]# md5sum /var/lib/ceph/osd/ceph-127/
> current/1.959_head/DIR_9/DIR_5/DIR_9/DIR_E/DIR_5/
> 10004313b01.0004__head_4EC5E959__1
> d41d8cd98f00b204e9800998ecf8427e  /var/lib/ceph/osd/ceph-127/
> current/1.959_head/DIR_9/DIR_5/DIR_9/DIR_E/DIR_5/
> 10004313b01.0004__head_4EC5E959__1
>
> [root@osd0 ceph-admin]# sha256sum /var/lib/ceph/osd/ceph-127/
> current/1.959_head/DIR_9/DIR_5/DIR_9/DIR_E/DIR_5/
> 10004313b01.0004__head_4EC5E959__1
> e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
>  /var/lib/ceph/osd/ceph-127/current/1.9

[ceph-users] pg marked inconsistent while appearing to be consistent

2017-05-12 Thread Brady Deetz
h-252/current/1.959_head/DIR_9/DIR_5/DIR_9/DIR_E/DIR_5/10004313b01.0004__head_4EC5E959__1
e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
 
/var/lib/ceph/osd/ceph-252/current/1.959_head/DIR_9/DIR_5/DIR_9/DIR_E/DIR_5/10004313b01.0004__head_4EC5E959__1

[root@osd11 ceph-admin]# ls -l
/var/lib/ceph/osd/ceph-252/current/1.959_head/DIR_9/DIR_5/DIR_9/DIR_E/DIR_5/10004313b01.0004__head_4EC5E959__1
-rw-r--r--. 1 ceph ceph 0 May 10 07:01
/var/lib/ceph/osd/ceph-252/current/1.959_head/DIR_9/DIR_5/DIR_9/DIR_E/DIR_5/10004313b01.0004__head_4EC5E959__1





[ceph-admin@admin libr-cluster]$ ceph status
cluster 6f91f60c-7bc0-4aaa-a136-4a90851fbe10
 health HEALTH_ERR
1 pgs inconsistent
1 scrub errors
 monmap e17: 5 mons at {mon0=
10.124.103.60:6789/0,mon1=10.124.103.61:6789/0,mon2=10.124.103.62:6789/0,osd2=10.124.103.72:6789/0,osd3=10.124.103.73:6789/0
}
election epoch 454, quorum 0,1,2,3,4 mon0,mon1,mon2,osd2,osd3
  fsmap e7005: 1/1/1 up {0=mds0=up:active}, 1 up:standby
 osdmap e178115: 235 osds: 235 up, 235 in
flags sortbitwise,require_jewel_osds
  pgmap v21842842: 5892 pgs, 5 pools, 305 TB data, 119 Mobjects
917 TB used, 364 TB / 1282 TB avail
5863 active+clean
  16 active+clean+scrubbing+deep
  12 active+clean+scrubbing
   1 active+clean+inconsistent
  client io 4076 kB/s rd, 633 kB/s wr, 15 op/s rd, 58 op/s wr

Thanks for any advice!

-Brady Deetz
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS Performance

2017-05-09 Thread Brady Deetz
Re-adding list:

So with email, you're talking about lots of small reads and writes. In my
experience with dicom data (thousands of 20KB files per directory), cephfs
doesn't perform very well at all on platter drives. I haven't experimented
with pure ssd configurations, so I can't comment on that.

Somebody may correct me here, but small block io on writes just makes
latency all that much more important due to the need to wait for your
replicas to be written before moving on to the next block.

Without knowing exact hardware details, my brain is immediately jumping to
networking constraints. 2 or 3 spindle drives can pretty much saturate a
1gbps link. As soon as you create contention for that resource, you create
system load for iowait and latency.

You mentioned you don't control the network. Maybe you can scale down and
out.


On May 9, 2017 5:38 PM, "Webert de Souza Lima" <webert.b...@gmail.com>
wrote:


On Tue, May 9, 2017 at 4:40 PM, Brett Niver <bni...@redhat.com> wrote:

> What is your workload like?  Do you have a single or multiple active
> MDS ranks configured?


User traffic is heavy. I can't really say in terms of mb/s or iops but it's
an email server with 25k+ users, usually about 6k simultaneously connected
receiving and reading emails.
I have only one active MDS configured. The others are Stand-by.

On Tue, May 9, 2017 at 7:18 PM, Wido den Hollander <w...@42on.com> wrote:

>
> > Op 9 mei 2017 om 20:26 schreef Brady Deetz <bde...@gmail.com>:
> >
> >
> > If I'm reading your cluster diagram correctly, I'm seeing a 1gbps
> > interconnect, presumably cat6. Due to the additional latency of
> performing
> > metadata operations, I could see cephfs performing at those speeds. Are
> you
> > using jumbo frames? Also are you routing?
> >
> > If you're routing, the router will introduce additional latency that an
> l2
> > network wouldn't experience.
> >
>
> Partially true. I am running various Ceph clusters using L3 routing and
> with a decent router the latency for routing a packet is minimal, like 0.02
> ms or so.
>
> Ceph spends much more time in the CPU then it will take the network to
> forward that IP-packet.
>
> I wouldn't be too afraid to run Ceph over a L3 network.
>
> Wido
>
> > On May 9, 2017 12:01 PM, "Webert de Souza Lima" <webert.b...@gmail.com>
> > wrote:
> >
> > > Hello all,
> > >
> > > I'm been using cephfs for a while but never really evaluated its
> > > performance.
> > > As I put up a new ceph cluster, I though that I should run a benchmark
> to
> > > see if I'm going the right way.
> > >
> > > By the results I got, I see that RBD performs *a lot* better in
> > > comparison to cephfs.
> > >
> > > The cluster is like this:
> > >  - 2 hosts with one SSD OSD each.
> > >this hosts have 2 pools: cephfs_metadata and cephfs_cache (for
> > > cache tiering).
> > >  - 3 hosts with 5 HDD OSDs each.
> > >   this hosts have 1 pool: cephfs_data.
> > >
> > > all details, cluster set up and results can be seen here:
> > > https://justpaste.it/167fr
> > >
> > > I created the RBD pools the same way as the CEPHFS pools except for the
> > > number of PGs in the data pool.
> > >
> > > I wonder why that difference or if I'm doing something wrong.
> > >
> > > Regards,
> > >
> > > Webert Lima
> > > DevOps Engineer at MAV Tecnologia
> > > *Belo Horizonte - Brasil*
> > >
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >
> > >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS Performance

2017-05-09 Thread Brady Deetz
If I'm reading your cluster diagram correctly, I'm seeing a 1gbps
interconnect, presumably cat6. Due to the additional latency of performing
metadata operations, I could see cephfs performing at those speeds. Are you
using jumbo frames? Also are you routing?

If you're routing, the router will introduce additional latency that an l2
network wouldn't experience.
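
A quick way to confirm jumbo frames actually pass end to end, if you are using
them (hostname is an example; 8972 = 9000 MTU minus 20 bytes of IP header and
8 bytes of ICMP header):

ping -M do -s 8972 -c 3 osd1.example.net   # -M do forbids fragmentation, so a 1500 MTU hop fails fast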

On May 9, 2017 12:01 PM, "Webert de Souza Lima" 
wrote:

> Hello all,
>
> I'm been using cephfs for a while but never really evaluated its
> performance.
> As I put up a new ceph cluster, I though that I should run a benchmark to
> see if I'm going the right way.
>
> By the results I got, I see that RBD performs *a lot* better in
> comparison to cephfs.
>
> The cluster is like this:
>  - 2 hosts with one SSD OSD each.
>this hosts have 2 pools: cephfs_metadata and cephfs_cache (for
> cache tiering).
>  - 3 hosts with 5 HDD OSDs each.
>   this hosts have 1 pool: cephfs_data.
>
> all details, cluster set up and results can be seen here:
> https://justpaste.it/167fr
>
> I created the RBD pools the same way as the CEPHFS pools except for the
> number of PGs in the data pool.
>
> I wonder why that difference or if I'm doing something wrong.
>
> Regards,
>
> Webert Lima
> DevOps Engineer at MAV Tecnologia
> *Belo Horizonte - Brasil*
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd iscsi gateway question

2017-04-06 Thread Brady Deetz
I appreciate everybody's responses here. I remember the announcement of
Petasan a while back on here and some concerns about it.

Is anybody using it in production yet?

On Apr 5, 2017 9:58 PM, "Brady Deetz" <bde...@gmail.com> wrote:

> I apologize if this is a duplicate of something recent, but I'm not
> finding much. Does the issue still exist where dropping an OSD results in a
> LUN's I/O hanging?
>
> I'm attempting to determine if I have to move off of VMWare in order to
> safely use Ceph as my VM storage.
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] rbd iscsi gateway question

2017-04-05 Thread Brady Deetz
I apologize if this is a duplicate of something recent, but I'm not finding
much. Does the issue still exist where dropping an OSD results in a LUN's
I/O hanging?

I'm attempting to determine if I have to move off of VMWare in order to
safely use Ceph as my VM storage.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS Read-Only state in production CephFS

2017-03-28 Thread Brady Deetz
Thanks John. Since we're on 10.2.5, the mds package has a dependency on
10.2.6

Do you feel it is safe to perform a cluster upgrade to 10.2.6 in this state?

[root@mds0 ceph-admin]# rpm -Uvh ceph-mds-10.2.6-1.gdf5ca2d.el7.x86_64.rpm
error: Failed dependencies:
ceph-base = 1:10.2.6-1.gdf5ca2d.el7 is needed by
ceph-mds-1:10.2.6-1.gdf5ca2d.el7.x86_64
ceph-mds = 1:10.2.5-0.el7 is needed by (installed)
ceph-1:10.2.5-0.el7.x86_64


On Tue, Mar 28, 2017 at 2:37 PM, John Spray <jsp...@redhat.com> wrote:

> On Tue, Mar 28, 2017 at 7:12 PM, Brady Deetz <bde...@gmail.com> wrote:
> > Thank you very much. I've located the directory that's layout is against
> > that pool. I've dug around to attempt to create a pool with the same ID
> as
> > the deleted one, but for fairly obvious reasons, that doesn't seem to
> exist.
>
> So there's a candidate fix on a branch called wip-19401-jewel, you can
> see builds here:
> https://shaman.ceph.com/repos/ceph/wip-19401-jewel/
> df5ca2d8e3f930ddae5708c50c6495c03b3dc078/
> -- click through to one of those and do "repo url" to get to some
> built artifacts.
>
> Hopefully you're running one of centos 7, ubuntu xenial or ubuntu
> trusty, and therefore one of those builds will work for you (use the
> "default" variants rather than the "notcmalloc" variants) -- you
> should only need to pick out the ceph-mds package rather than
> upgrading everything.
>
> Cheers,
> John
>
>
> > On Tue, Mar 28, 2017 at 1:08 PM, John Spray <jsp...@redhat.com> wrote:
> >>
> >> On Tue, Mar 28, 2017 at 6:45 PM, Brady Deetz <bde...@gmail.com> wrote:
> >> > If I follow the recommendations of this doc, do you suspect we will
> >> > recover?
> >> >
> >> > http://docs.ceph.com/docs/jewel/cephfs/disaster-recovery/
> >>
> >> You might, but it's overkill and introduces its own risks -- your
> >> metadata isn't really corrupt, you're just hitting a bug in the
> >> running code where it's overreacting.  I'm writing a patch now.
> >>
> >> John
> >>
> >>
> >>
> >>
> >> > On Tue, Mar 28, 2017 at 12:37 PM, Brady Deetz <bde...@gmail.com>
> wrote:
> >> >>
> >> >> I did do that. We were experimenting with an ec backed pool on the
> fs.
> >> >> It
> >> >> was stuck in an incomplete+creating state over night for only 128 pgs
> >> >> so I
> >> >> deleted the pool this morning. At the time of deletion, the only
> issue
> >> >> was
> >> >> the stuck 128 pgs.
> >> >>
> >> >> On Tue, Mar 28, 2017 at 12:29 PM, John Spray <jsp...@redhat.com>
> wrote:
> >> >>>
> >> >>> Did you at some point add a new data pool to the filesystem, and
> then
> >> >>> remove the pool?  With a little investigation I've found that the
> MDS
> >> >>> currently doesn't handle that properly:
> >> >>> http://tracker.ceph.com/issues/19401
> >> >>>
> >> >>> John
> >> >>>
> >> >>> On Tue, Mar 28, 2017 at 6:11 PM, John Spray <jsp...@redhat.com>
> wrote:
> >> >>> > On Tue, Mar 28, 2017 at 5:54 PM, Brady Deetz <bde...@gmail.com>
> >> >>> > wrote:
> >> >>> >> Running Jewel 10.2.5 on my production cephfs cluster and came
> into
> >> >>> >> this ceph
> >> >>> >> status
> >> >>> >>
> >> >>> >> [ceph-admin@mds1 brady]$ ceph status
> >> >>> >> cluster 6f91f60c-7bc0-4aaa-a136-4a90851fbe10
> >> >>> >>  health HEALTH_WARN
> >> >>> >> mds0: Behind on trimming (2718/30)
> >> >>> >> mds0: MDS in read-only mode
> >> >>> >>  monmap e17: 5 mons at
> >> >>> >>
> >> >>> >>
> >> >>> >> {mon0=10.124.103.60:6789/0,mon1=10.124.103.61:6789/0,
> mon2=10.124.103.62:6789/0,osd2=10.124.103.72:6789/0,
> osd3=10.124.103.73:6789/0}
> >> >>> >> election epoch 378, quorum 0,1,2,3,4
> >> >>> >> mon0,mon1,mon2,osd2,osd3
> >> >>> >>   fsmap e6817: 1/1/1 up {0=mds0=up:active}, 1 up:standby
> >> >>> >>  osdmap e172126: 235 osds: 235 up, 235 in
> >> >>> >> flags sortbitwise,require_

Re: [ceph-users] MDS Read-Only state in production CephFS

2017-03-28 Thread Brady Deetz
Thank you very much. I've located the directory whose layout points at
that pool. I've dug around to attempt to create a pool with the same ID as
the deleted one, but for fairly obvious reasons, that doesn't seem to be
possible.

On Tue, Mar 28, 2017 at 1:08 PM, John Spray <jsp...@redhat.com> wrote:

> On Tue, Mar 28, 2017 at 6:45 PM, Brady Deetz <bde...@gmail.com> wrote:
> > If I follow the recommendations of this doc, do you suspect we will
> recover?
> >
> > http://docs.ceph.com/docs/jewel/cephfs/disaster-recovery/
>
> You might, but it's overkill and introduces its own risks -- your
> metadata isn't really corrupt, you're just hitting a bug in the
> running code where it's overreacting.  I'm writing a patch now.
>
> John
>
>
>
>
> > On Tue, Mar 28, 2017 at 12:37 PM, Brady Deetz <bde...@gmail.com> wrote:
> >>
> >> I did do that. We were experimenting with an ec backed pool on the fs.
> It
> >> was stuck in an incomplete+creating state over night for only 128 pgs
> so I
> >> deleted the pool this morning. At the time of deletion, the only issue
> was
> >> the stuck 128 pgs.
> >>
> >> On Tue, Mar 28, 2017 at 12:29 PM, John Spray <jsp...@redhat.com> wrote:
> >>>
> >>> Did you at some point add a new data pool to the filesystem, and then
> >>> remove the pool?  With a little investigation I've found that the MDS
> >>> currently doesn't handle that properly:
> >>> http://tracker.ceph.com/issues/19401
> >>>
> >>> John
> >>>
> >>> On Tue, Mar 28, 2017 at 6:11 PM, John Spray <jsp...@redhat.com> wrote:
> >>> > On Tue, Mar 28, 2017 at 5:54 PM, Brady Deetz <bde...@gmail.com>
> wrote:
> >>> >> Running Jewel 10.2.5 on my production cephfs cluster and came into
> >>> >> this ceph
> >>> >> status
> >>> >>
> >>> >> [ceph-admin@mds1 brady]$ ceph status
> >>> >> cluster 6f91f60c-7bc0-4aaa-a136-4a90851fbe10
> >>> >>  health HEALTH_WARN
> >>> >> mds0: Behind on trimming (2718/30)
> >>> >> mds0: MDS in read-only mode
> >>> >>  monmap e17: 5 mons at
> >>> >>
> >>> >> {mon0=10.124.103.60:6789/0,mon1=10.124.103.61:6789/0,
> mon2=10.124.103.62:6789/0,osd2=10.124.103.72:6789/0,
> osd3=10.124.103.73:6789/0}
> >>> >> election epoch 378, quorum 0,1,2,3,4
> >>> >> mon0,mon1,mon2,osd2,osd3
> >>> >>   fsmap e6817: 1/1/1 up {0=mds0=up:active}, 1 up:standby
> >>> >>  osdmap e172126: 235 osds: 235 up, 235 in
> >>> >> flags sortbitwise,require_jewel_osds
> >>> >>   pgmap v18008949: 5696 pgs, 2 pools, 291 TB data, 112 Mobjects
> >>> >> 874 TB used, 407 TB / 1282 TB avail
> >>> >> 5670 active+clean
> >>> >>   13 active+clean+scrubbing+deep
> >>> >>   13 active+clean+scrubbing
> >>> >>   client io 760 B/s rd, 0 op/s rd, 0 op/s wr
> >>> >>
> >>> >> I've tried rebooting both mds servers. I've started a rolling reboot
> >>> >> across
> >>> >> all of my osd nodes, but each node takes about 10 minutes fully
> >>> >> rejoin. so
> >>> >> it's going to take a while. Any recommendations other than reboot?
> >>> >
> >>> > As it says in the log, your MDSs are going read only because of
> errors
> >>> > writing to the OSDs:
> >>> > 2017-03-28 08:04:12.379747 7f25ed0af700 -1 log_channel(cluster) log
> >>> > [ERR] : failed to store backtrace on ino 10003a398a6 object, pool 20,
> >>> > errno -2
> >>> >
> >>> > These messages are also scary and indicates that something has gone
> >>> > seriously wrong, either with the storage of the metadata or
> internally
> >>> > with the MDS:
> >>> > 2017-03-28 08:04:12.251543 7f25ef2b5700 -1 log_channel(cluster) log
> >>> > [ERR] : bad/negative dir size on 608 f(v9 m2017-03-28 07:56:45.803267
> >>> > -223=-221+-2)
> >>> > 2017-03-28 08:04:12.251564 7f25ef2b5700 -1 log_channel(cluster) log
> >>> > [ERR] : unmatched fragstat on 608, inode has f(v10 m2017-03-28
> >>> > 07:56:45.803267 -223=-221+-2), dirfrags have f(v0 m2017-03-28
> >>> > 07:56:45.803267)
> >>> >
> >>> > The case that I know of that causes ENOENT on object writes is when
> >>> > the pool no longer exists.  You can set "debug objecter = 10" on the
> >>> > MDS and look for a message like "check_op_pool_dne tid 
> >>> > concluding pool  dne".
> >>> >
> >>> > Otherwise, go look at the OSD logs from the timestamp where the
> failed
> >>> > write is happening to see if there's anything there.
> >>> >
> >>> > John
> >>> >
> >>> >
> >>> >
> >>> >>
> >>> >> Attached are my mds logs during the failure.
> >>> >>
> >>> >> Any ideas?
> >>> >>
> >>> >> ___
> >>> >> ceph-users mailing list
> >>> >> ceph-users@lists.ceph.com
> >>> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>> >>
> >>
> >>
> >
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] MDS Read-Only state in production CephFS

2017-03-28 Thread Brady Deetz
Running Jewel 10.2.5 on my production cephfs cluster and came into this
ceph status

[ceph-admin@mds1 brady]$ ceph status
cluster 6f91f60c-7bc0-4aaa-a136-4a90851fbe10
 health HEALTH_WARN
mds0: Behind on trimming (2718/30)
mds0: MDS in read-only mode
 monmap e17: 5 mons at {mon0=
10.124.103.60:6789/0,mon1=10.124.103.61:6789/0,mon2=10.124.103.62:6789/0,osd2=10.124.103.72:6789/0,osd3=10.124.103.73:6789/0
}
election epoch 378, quorum 0,1,2,3,4 mon0,mon1,mon2,osd2,osd3
  fsmap e6817: 1/1/1 up {0=mds0=up:active}, 1 up:standby
 osdmap e172126: 235 osds: 235 up, 235 in
flags sortbitwise,require_jewel_osds
  pgmap v18008949: 5696 pgs, 2 pools, 291 TB data, 112 Mobjects
874 TB used, 407 TB / 1282 TB avail
5670 active+clean
  13 active+clean+scrubbing+deep
  13 active+clean+scrubbing
  client io 760 B/s rd, 0 op/s rd, 0 op/s wr

I've tried rebooting both mds servers. I've started a rolling reboot across
all of my osd nodes, but each node takes about 10 minutes to fully rejoin, so
it's going to take a while. Any recommendations other than reboot?

Attached are my mds logs during the failure.

Any ideas?


mds0
Description: Binary data


mds1
Description: Binary data
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Re: Does this indicate a "CPU bottleneck"?

2017-01-19 Thread Brady Deetz
Your switches may have limits on frames per second. Your journals may be
limited by iops. Can you fully describe the system?
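
When you rerun the test, it is worth capturing a few per-node numbers
alongside fio so the bottleneck shows itself (standard sysstat/procps tools,
nothing Ceph-specific):

iostat -x 1      # per-disk utilization and await on the journal and OSD devices
sar -n DEV 1     # NIC throughput and packets/s (switch pps limits show up here)
mpstat -P ALL 1  # per-core usage; a few saturated cores hide behind an averaged %idle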

On Jan 20, 2017 12:25 AM, "许雪寒"  wrote:

> The network is only about 10% full, and we tested the performance with
> different number of clients, and it turned out that no matter how we
> increase the number of clients, the result is the same.
>
> -----Original Message-----
> From: John Spray [mailto:jsp...@redhat.com]
> Sent: 19 January 2017 16:11
> To: 许雪寒
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Does this indicate a "CPU bottleneck"?
>
> On Thu, Jan 19, 2017 at 8:51 AM, 许雪寒  wrote:
> > Hi, everyone.
> >
> >
> >
> > Recently, we did some stress test on ceph using three machines. We
> > tested the IOPS of the whole small cluster when there are 1~8 OSDs per
> > machines separately and the result is as follows:
> >
> >
> >
> > OSD num per machine    fio iops
> > 1                      10k
> > 2                      16.5k
> > 3                      22k
> > 4                      23.5k
> > 5                      26k
> > 6                      27k
> > 7                      27k
> > 8                      28k
> >
> >
> >
> > As shown above, it seems that there is some kind of bottleneck when
> > there are more than 4 OSDs per machine. Meanwhile, we observed that
> > the CPU %idle during the test, shown below, has also some kind of
> > correlation with the number of OSDs per machine.
> >
> >
> >
> > OSD num per machine    CPU idle
> > 1                      74%
> > 2                      52%
> > 3                      30%
> > 4                      25%
> > 5                      24%
> > 6                      17%
> > 7                      14%
> > 8                      11%
> >
> >
> >
> > It seems that as the number of OSDs per machine increases, the CPU
> > idle time decreases and the rate of decrease also slows. Can we
> > come to the conclusion that the CPU is the performance bottleneck in
> > this test?
>
> Impossible to say without looking at what else was bottlenecked, such as
> the network or the client.
>
> John
>
> >
> >
> > Thank you :)
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pg remapped+peering forever and MDS trimming behind

2016-10-26 Thread Brady Deetz
Just before your response, I decided to take the chance of restarting the
primary osd for the pg (153).

At this point, the MDS trimming error is gone and I'm in a warning state
now. The pg has moved from peering+remapped
to active+degraded+remapped+backfilling.

I'd say we're probably nearly back to a normal state.  And, thanks for the
hint regarding pool ID.
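
For anyone landing here from a search, the recovery sequence amounted to the
following (assuming systemd-managed Jewel OSDs; osd.153 was the primary for
pg 1.efa in this thread):

# on the OSD host that owns osd.153
sudo systemctl restart ceph-osd@153

# then watch the pg leave remapped+peering and start backfilling
ceph -w
ceph pg 1.efa query | grep state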

Version Details:
[root@osd1 brady]# cat /etc/centos-release
CentOS Linux release 7.2.1511 (Core)

[root@osd1 brady]# uname -a
Linux osd1.ceph.laureateinstitute.org 3.10.0-327.22.2.el7.x86_64 #1 SMP Thu
Jun 23 17:05:11 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

[root@osd1 brady]# rpm -qa|grep ceph
ceph-base-10.2.3-0.el7.x86_64
ceph-10.2.3-0.el7.x86_64
ceph-release-1-1.el7.noarch
python-cephfs-10.2.3-0.el7.x86_64
ceph-selinux-10.2.3-0.el7.x86_64
ceph-osd-10.2.3-0.el7.x86_64
ceph-mds-10.2.3-0.el7.x86_64
ceph-radosgw-10.2.3-0.el7.x86_64
ceph-deploy-1.5.34-0.noarch
libcephfs1-10.2.3-0.el7.x86_64
ceph-common-10.2.3-0.el7.x86_64
ceph-mon-10.2.3-0.el7.x86_64

Thanks!

On Wed, Oct 26, 2016 at 2:02 PM, Wido den Hollander <w...@42on.com> wrote:

>
> > Op 26 oktober 2016 om 20:44 schreef Brady Deetz <bde...@gmail.com>:
> >
> >
> > Summary:
> > This is a production CephFS cluster. I had an OSD node crash. The cluster
> > rebalanced successfully. I brought the down node back online. Everything
> > has rebalanced except 1 hung pg and MDS trimming is now behind. No
> hardware
> > failures have become apparent yet.
> >
> > Questions:
> > 1) Is there a way to see what pool a placement group belongs to?
>
> The PG's ID always starts with the pool's ID. In your case it's '1'.
>
> # ceph osd dump|grep pool
>
> You will see the pool ID there.
>
> > 2) How should I move forward with unsticking my 1 pg in a constant
> > remapped+peering state?
> >
>
> Looking at the PG query have you tried to restart the primary OSD of the
> PG? And trying to restart the others: [153,162,5]
>
> Which version of Ceph are you running?
>
> > Based on the remapped+peering pg not going away and the mds trimming
> > getting further and further behind, I'm guessing that the pg belongs to
> the
> > cephfs metadata pool.
> >
>
> Probably the case indeed. The MDS is blocked by this single PG.
>
> > Any help you can provide is greatly appreciated.
> >
> > Details:
> > OSD Node Description:
> > -2 vlans going over 40gig ethernet for pub/priv nets
> > -256 GB RAM
> > -2x Xeon 2660v4
> > -2x P3700 (journal)
> > -24x OSD
> > Primary monitor is dedicated similar configuration to OSD
> > Primary MDS is dedicated similar configuration to OSD
> >
> > [brady@mon0 ~]$ ceph health detail
> > HEALTH_ERR 1 pgs are stuck inactive for more than 300 seconds; 1 pgs
> > peering; 1 pgs stuck inactive; 47 requests are blocked > 32 sec; 1 osds
> > have slow requests; mds0: Behind on trimming (76/30)
> > pg 1.efa is stuck inactive for 174870.396769, current state
> > remapped+peering, last acting [153,162,5]
> > pg 1.efa is remapped+peering, acting [153,162,5]
> > 34 ops are blocked > 268435 sec on osd.153
> > 13 ops are blocked > 134218 sec on osd.153
> > 1 osds have slow requests
> > mds0: Behind on trimming (76/30)(max_segments: 30, num_segments: 76)
> >
> >
> > [brady@mon0 ~]$ ceph pg dump_stuck
> > ok
> > pg_stat  state             up            up_primary  acting       acting_primary
> > 1.efa    remapped+peering  [153,10,162]  153         [153,162,5]  153
> >
> > [brady@mon0 ~]$ ceph pg 1.efa query
> > http://pastebin.com/Rz0ZRfSb
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] pg remapped+peering forever and MDS trimming behind

2016-10-26 Thread Brady Deetz
Summary:
This is a production CephFS cluster. I had an OSD node crash. The cluster
rebalanced successfully. I brought the down node back online. Everything
has rebalanced except 1 hung pg and MDS trimming is now behind. No hardware
failures have become apparent yet.

Questions:
1) Is there a way to see what pool a placement group belongs to?
2) How should I move forward with unsticking my 1 pg in a constant
remapped+peering state?

Based on the remapped+peering pg not going away and the mds trimming
getting further and further behind, I'm guessing that the pg belongs to the
cephfs metadata pool.

Any help you can provide is greatly appreciated.

Details:
OSD Node Description:
-2 vlans going over 40gig ethernet for pub/priv nets
-256 GB RAM
-2x Xeon 2660v4
-2x P3700 (journal)
-24x OSD
Primary monitor is dedicated similar configuration to OSD
Primary MDS is dedicated similar configuration to OSD

[brady@mon0 ~]$ ceph health detail
HEALTH_ERR 1 pgs are stuck inactive for more than 300 seconds; 1 pgs
peering; 1 pgs stuck inactive; 47 requests are blocked > 32 sec; 1 osds
have slow requests; mds0: Behind on trimming (76/30)
pg 1.efa is stuck inactive for 174870.396769, current state
remapped+peering, last acting [153,162,5]
pg 1.efa is remapped+peering, acting [153,162,5]
34 ops are blocked > 268435 sec on osd.153
13 ops are blocked > 134218 sec on osd.153
1 osds have slow requests
mds0: Behind on trimming (76/30)(max_segments: 30, num_segments: 76)


[brady@mon0 ~]$ ceph pg dump_stuck
ok
pg_stat  state             up            up_primary  acting       acting_primary
1.efa    remapped+peering  [153,10,162]  153         [153,162,5]  153

[brady@mon0 ~]$ ceph pg 1.efa query
http://pastebin.com/Rz0ZRfSb
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Yet another hardware planning question ...

2016-10-13 Thread Brady Deetz
6 SSD per nvme journal might leave your journal in contention. Can you
provide the specific models you will be using?

On Oct 13, 2016 10:23 AM, "Patrik Martinsson" <
patrik.martins...@trioptima.com> wrote:

> Hello everyone,
>
> We are in the process of buying hardware for our first ceph-cluster. We
> will start with some testing and do some performance measurements to
> see that we are on the right track, and once we are satisfied with our
> setup we'll continue to grow in it as time comes along.
>
> Now, I'm just seeking some thoughts on our future hardware, I know
> there are a lot of these kind of questions out there, so please forgive
> me for posting another one.
>
> Details,
> - Cluster will be in the same datacenter, multiple racks as we grow
> - Typical workload (this is incredible vague, forgive me again) would
> be an Openstack environment, hosting 150~200 vms, we'll have quite a
> few databases for Jira/Confluence/etc. Some workload coming from
> Stash/Bamboo agents, puppet master/foreman, and other typical "core
> infra stuff".
>
> Given this prerequisites just given, the going all SSD's (and NVME for
> journals) may seem as overkill(?), but we feel like we can afford it
> and it will be a benefit for us in the future.
>
> Planned hardware,
>
> Six nodes to begin with (which would give us a cluster size of ~46TB,
> with a default replica of three (although probably a bit bigger since
> the VMs would be backed by an erasure coded pool) will look something
> like,
>  - 1x  Intel E5-2695 v4 2.1GHz, 45M Cache, 18 Cores
>  - 2x  Dell 64 GB RDIMM 2400MT
>  - 12x Dell 1.92TB Mix Use MLC 12Gbps (separate OS disks)
>  - 2x  Dell 1.6TB NVMe Mixed usage (6 osd's per NVME)
>
> Network between all nodes within a rack will be 40Gbit (and 200Gbit
> between racks), backed by Junipers QFX5200-32C.
>
> Rather then asking the question,
> - "Does this seems reasonable for our workload ?",
>
> I want to ask,
> - "Is there any reason *not* have a setup like this, is there any
> obvious bottlenecks or flaws that we are missing or could this may very
> well work as good start (and the ability to grow with adding more
> servers) ?"
>
> When it comes to workload-wise-issues, I think we'll just have to see
> and grow as we learn.
>
> We'll be grateful for any input, thoughts, ideas suggestions, you name
> it.
>
> Best regards,
> Patrik Martinsson,
> Sweden
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Attempt to access beyond end of device

2016-09-28 Thread Brady Deetz
The question:
Is this something I need to investigate further, or am I being paranoid?
Seems bad to me.


I have a fairly new cluster built using ceph-deploy 1.5.34-0, ceph
10.2.2-0, and centos 7.2.1511.

I recently noticed on every one of my osd nodes alarming dmesg log entries
for each osd on each node on some kind of periodic basis:
attempt to access beyond end of device
sda1: rw=0, want=11721043088, limit=11721043087

For instance one node had entries at times:
Sep 27 05:40:34
Sep 27 07:10:32
Sep 27 08:10:30
Sep 27 09:40:28
Sep 27 12:40:24
Sep 27 15:40:19

In every case, the "want" is 1 sector greater than the "limit"... My first
thought was 'could this be an off-by-one bug somewhere in Ceph?' But, after
thinking about the way stuff works and the data below, that seems unlikely.

Digging around I found and followed this redhat article:
https://access.redhat.com/solutions/21135

--
Error Message Device Size:
11721043087 * 512 = 6001174060544


Current Device Size:
cat /proc/partitions | grep sda1
8 1 5860521543 sda1

5860521543 * 1024 = 6001174060032


Filesystem Size:
sudo xfs_info /dev/sda1 | grep data | grep blocks
data = bsize=4096 blocks=1465130385, imaxpct=5

1465130385 * 4096 = 6001174056960
--

(EMDS != CDS) == true
Redhat says device naming may have changed. All but 2 disks in the node are
identical. Those 2 disks are md raided and not exhibiting the issue. So, I
don't think this is the issue.

(FSS > CDS) == false
My filesystem is not larger than the device size or the error message
device size.

Thanks,
Brady
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Export nfs-ganesha from standby MDS and last MON

2016-08-22 Thread Brady Deetz
Is it an acceptable practice to configure my standby MDS and highest IP'd
MON as ganesha servers?

Since MDS is supposedly primarily bound to a single core, despite having
many threads, would exporting really cause any issues if the niceness of
the ganesha service was higher than the mds process?
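
If anyone tries this, the niceness part is easy to express with systemd
(assuming the unit is named nfs-ganesha; the value 10 is only an example):

# systemctl edit nfs-ganesha, then add:
[Service]
Nice=10
# followed by: systemctl restart nfs-ganesha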

What about the mon service?

Configuration:
3 dedicated mon
2 dedicated mds (active-standby)
8 osd node (24 disk + 2 nvme (journal))
40 gig interconnect
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS Fuse ACLs

2016-08-18 Thread Brady Deetz
apparently fuse_default_permission and client_acl_type have to be in the
fstab entry instead of the ceph.conf.
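
For the archives, the fix presumably ended up looking something like the line
below. Treat it as a sketch: it assumes the mount.fuse.ceph helper passes the
extra comma-separated device-string options through to ceph-fuse, and note the
option is spelled fuse_default_permissions (with an s) in current releases:

mount.fuse.ceph#id=admin,conf=/etc/ceph/ceph.conf,fuse_default_permissions=0,client_acl_type=posix_acl   /media/cephfs   fuse   defaults,_netdev   0   0

which should be equivalent to mounting by hand with:

ceph-fuse --id admin --conf /etc/ceph/ceph.conf \
    --fuse_default_permissions=0 --client_acl_type=posix_acl /media/cephfs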

Sorry for polluting the mailing list with an amateur mis-configuration.

On Thu, Aug 18, 2016 at 4:26 PM, Brady Deetz <bde...@gmail.com> wrote:

> I'm having an issue with ACLs on my CephFS test environment. Am I an idiot
> or is something weird going on?
>
> TLDR;
> I setfacl as root for a local user and the user still can't access the
> file.
>
> Example:
> root@test-client:/media/cephfs/storage/labs# touch test
> root@test-client:/media/cephfs/storage/labs# chown root:root test
> root@test-client:/media/cephfs/storage/labs# chmod 660 test
> root@test-client:/media/cephfs/storage/labs# setfacl -m u:brady:rwx test
>
> other shell as local user:
> brady@test-client:/media/cephfs/storage/labs$ getfacl test
> # file: test
> # owner: root
> # group: root
> user::rw-
> user:brady:rwx
> group::rw-
> mask::rwx
> other::---
>
> brady@test-client:/media/cephfs/storage/labs$ cat test
> cat: test: Permission denied
>
>
>
> Configuration details:
> Ubuntu 16.04.1
> fuse 2.9.4-1ubuntu3.1
> ceph-fuse 10.2.2-0ubuntu0.16.04.2
> acl 2.2.52-3
> kernel 4.4.0-34-generic (from ubuntu)
>
> fstab entry:
> mount.fuse.ceph#id=admin,conf=/etc/ceph/ceph.conf   /media/cephfs
> fusedefaults,_netdev0   0
>
> ceph.conf:
> [global]
> fsid = 6f91f60c-7bc0-4aaa-a136-4a90851fbe10
> mon_initial_members = mon0
> mon_host = 10.124.103.60
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
> public_network = 10.124.103.0/24
> cluster_network = 10.124.104.0/24
> osd_pool_default_size = 3
>
> [client]
> fuse_default_permission=0
> client_acl_type=posix_acl
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CephFS Fuse ACLs

2016-08-18 Thread Brady Deetz
I'm having an issue with ACLs on my CephFS test environment. Am I an idiot
or is something weird going on?

TLDR;
I setfacl as root for a local user and the user still can't access the file.

Example:
root@test-client:/media/cephfs/storage/labs# touch test
root@test-client:/media/cephfs/storage/labs# chown root:root test
root@test-client:/media/cephfs/storage/labs# chmod 660 test
root@test-client:/media/cephfs/storage/labs# setfacl -m u:brady:rwx test

other shell as local user:
brady@test-client:/media/cephfs/storage/labs$ getfacl test
# file: test
# owner: root
# group: root
user::rw-
user:brady:rwx
group::rw-
mask::rwx
other::---

brady@test-client:/media/cephfs/storage/labs$ cat test
cat: test: Permission denied



Configuration details:
Ubuntu 16.04.1
fuse 2.9.4-1ubuntu3.1
ceph-fuse 10.2.2-0ubuntu0.16.04.2
acl 2.2.52-3
kernel 4.4.0-34-generic (from ubuntu)

fstab entry:
mount.fuse.ceph#id=admin,conf=/etc/ceph/ceph.conf   /media/cephfs
fuse   defaults,_netdev   0   0

ceph.conf:
[global]
fsid = 6f91f60c-7bc0-4aaa-a136-4a90851fbe10
mon_initial_members = mon0
mon_host = 10.124.103.60
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
public_network = 10.124.103.0/24
cluster_network = 10.124.104.0/24
osd_pool_default_size = 3

[client]
fuse_default_permission=0
client_acl_type=posix_acl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS in the wild

2016-06-06 Thread Brady Deetz
This is an interesting idea that I hadn't yet considered testing.

My test cluster is also looking like 2K per object.

It looks like our hardware purchase for a one-half sized pilot is getting
approved and I don't really want to modify it when we're this close to
moving forward. So, using spare NVMe capacity is certainly an option, but
increasing my OS disk size or replacing OSDs is pretty much a no go for
this iteration of the cluster.

My single concern with the idea of using the NVMe capacity is the potential
to affect journal performance which is already cutting it close with each
NVMe supporting 12 journals. It seems to me what would probably be better
would be to replace 2 HDD OSDs with 2 SSD OSDs and put the metadata pool on
those dedicated SSDs. Even if testing goes well on the NVMe based pool,
dedicated SSDs seem like a safer play and may be what I implement when we
buy our second round of hardware to finish out the cluster and go live
(January-March 2017).
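
For the archives, a rough sketch of the dedicated-SSD idea on a Jewel-era
cluster. Every name below (the "ssd" CRUSH root, the rule name, the pool
names, the mount path) is hypothetical, and the SSD OSDs are assumed to
already sit under that root:

# pin the CephFS metadata pool to the SSD-only root
ceph osd crush rule create-simple ssd-metadata ssd host
ceph osd crush rule dump ssd-metadata          # note the rule_id it reports
ceph osd pool set cephfs_metadata crush_ruleset <rule_id>

# optionally steer hot directories at an SSD-backed data pool via file layouts
ceph fs add_data_pool cephfs cephfs-ssd-data
setfattr -n ceph.dir.layout.pool -v cephfs-ssd-data /mnt/cephfs/hotdir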



On Mon, Jun 6, 2016 at 12:02 PM, David <dclistsli...@gmail.com> wrote:

>
>
> On Mon, Jun 6, 2016 at 7:06 AM, Christian Balzer <ch...@gol.com> wrote:
>
>>
>> Hello,
>>
>> On Fri, 3 Jun 2016 15:43:11 +0100 David wrote:
>>
>> > I'm hoping to implement cephfs in production at some point this year so
>> > I'd be interested to hear your progress on this.
>> >
>> > Have you considered SSD for your metadata pool? You wouldn't need loads
>> > of capacity although even with reliable SSD I'd probably still do x3
>> > replication for metadata. I've been looking at the intel s3610's for
>> > this.
>> >
>> That's an interesting and potentially quite beneficial thought, but it
>> depends on a number of things (more below).
>>
>> I'm using S3610s (800GB) for a cache pool with 2x replication and am quite
>> happy with that, but then again I have a very predictable usage pattern
>> and am monitoring those SSDs religiously and I'm sure they will outlive
>> things by a huge margin.
>>
>> We didn't go for 3x replication due to (in order):
>> a) cost
>> b) rack space
>> c) increased performance with 2x
>
>
> I'd also be happy with 2x replication for data pools and that's probably
> what I'll do for the reasons you've given. I plan on using File Layouts to
> map some dirs to the ssd pool. I'm testing this at the moment and it's an
> awesome feature. I'm just very paranoid with the metadata and considering
> the relatively low capacity requirement I'd stick with the 3x replication
> although as you say that means a performance hit.
>
>
>>
>> Now for how useful/helpful a fast meta-data pool would be, I reckon it
>> depends on a number of things:
>>
>> a) Is the cluster write or read heavy?
>> b) Do reads, flocks, anything that is not directly considered a read
>>cause writes to the meta-data pool?
>> c) Anything else that might cause write storms to the meta-data pool, like
>>bit in the current NFS over CephFS thread with sync?
>>
>> A quick glance at my test cluster seems to indicate that CephFS meta data
>> per filesystem object is about 2KB, somebody with actual clues please
>> confirm this.
>>
>
> 2K per object appears to be the case on my test cluster too.
>
>
>> Brady has large amounts of NVMe space left over in his current design,
>> assuming 10GB journals about 2.8TB of raw space.
>> So if running the (verified) numbers indicates that the meta data can fit
>> in this space, I'd put it there.
>>
>> Otherwise larger SSDs (indeed S3610s) for OS and meta-data pool storage
>> may
>> be the way forward.
>>
>> Regards,
>>
>> Christian
>> >
>> >
>> > On Wed, Jun 1, 2016 at 9:50 PM, Brady Deetz <bde...@gmail.com> wrote:
>> >
>> > > Question:
>> > > I'm curious if there is anybody else out there running CephFS at the
>> > > scale I'm planning for. I'd like to know some of the issues you didn't
>> > > expect that I should be looking out for. I'd also like to simply see
>> > > when CephFS hasn't worked out and why. Basically, give me your war
>> > > stories.
>> > >
>> > >
>> > > Problem Details:
>> > > Now that I'm out of my design phase and finished testing on VMs, I'm
>> > > ready to drop $100k on a pilo. I'd like to get some sense of
>> > > confidence from the community that this is going to work before I pull
>> > > the trigger.
>> > >
>> > > I'm planning to replace my 110 disk 300TB (usable) Oracle ZFS 7320
>> with
>> > > CephF

Re: [ceph-users] CephFS in the wild

2016-06-02 Thread Brady Deetz
On Thu, Jun 2, 2016 at 8:58 PM, Christian Balzer <ch...@gol.com> wrote:

> On Thu, 2 Jun 2016 11:11:19 -0500 Brady Deetz wrote:
>
> > On Wed, Jun 1, 2016 at 8:18 PM, Christian Balzer <ch...@gol.com> wrote:
> >
> > >
> > > Hello,
> > >
> > > On Wed, 1 Jun 2016 15:50:19 -0500 Brady Deetz wrote:
> > >
> > > > Question:
> > > > I'm curious if there is anybody else out there running CephFS at the
> > > > scale I'm planning for. I'd like to know some of the issues you
> > > > didn't expect that I should be looking out for. I'd also like to
> > > > simply see when CephFS hasn't worked out and why. Basically, give me
> > > > your war stories.
> > > >
> > > Not me, but diligently search the archives, there are people with large
> > > CephFS deployments (despite the non-production status when they did
> > > them). Also look at the current horror story thread about what happens
> > > when you have huge directories.
> > >
> > > >
> > > > Problem Details:
> > > > Now that I'm out of my design phase and finished testing on VMs, I'm
> > > > ready to drop $100k on a pilo. I'd like to get some sense of
> > > > confidence from the community that this is going to work before I
> > > > pull the trigger.
> > > >
> > > > I'm planning to replace my 110 disk 300TB (usable) Oracle ZFS 7320
> > > > with CephFS by this time next year (hopefully by December). My
> > > > workload is a mix of small and vary large files (100GB+ in size). We
> > > > do fMRI analysis on DICOM image sets as well as other physio data
> > > > collected from subjects. We also have plenty of spreadsheets,
> > > > scripts, etc. Currently 90% of our analysis is I/O bound and
> > > > generally sequential.
> > > >
> > > There are other people here doing similar things (medical institutes,
> > > universities), again search the archives and maybe contact them
> > > directly.
> > >
> > > > In deploying Ceph, I am hoping to see more throughput than the 7320
> > > > can currently provide. I'm also looking to get away from traditional
> > > > file-systems that require forklift upgrades. That's where Ceph really
> > > > shines for us.
> > > >
> > > > I don't have a total file count, but I do know that we have about
> > > > 500k directories.
> > > >
> > > >
> > > > Planned Architecture:
> > > >
> > > Well, we talked about this 2 months ago, you seem to have changed only
> > > a few things.
> > > So lets dissect this again...
> > >
> > > > Storage Interconnect:
> > > > Brocade VDX 6940 (40 gig)
> > > >
> > > Is this a flat (single) network for all the storage nodes?
> > > And then from these 40Gb/s switches links to the access switches?
> > >
> >
> > This will start as a single 40Gb/s switch with a single link to each node
> > (upgraded in the future to dual-switch + dual-link). The 40Gb/s switch
> > will also be connected to several 10Gb/s and 1Gb/s access switches with
> > dual 40Gb/s uplinks.
> >
> So initially 80Gb/s and with the 2nd switch probably 160Gb/s for your
> clients.
> Network wise, your 8 storage servers outstrip that, actual storage
> bandwidth and IOPS wise, you're looking at 8x2GB/s aka 160Gb/s best case
> writes, so a match.
>
> > We do intend to segment the public and private networks using VLANs
> > untagged at the node. There are obviously many subnets on our network.
> > The 40Gb/s switch will handle routing for those networks.
> >
> > You can see list discussion in "Public and Private network over 1
> > interface" May 23,2016 regarding some of this.
> >
> And I did comment in that thread, the final one actually. ^o^
>
> Unless you can come up with a _very_ good reason not covered in that
> thread, I'd keep it to one network.
>
> Once the 2nd switch is in place and running vLAG (LACP on your servers)
> your network bandwidth per host VASTLY exceeds that of your storage.
>
>
My theory is that with a single switch, I can QoS traffic for the private
network in case of the situation where we do see massive client I/O at the
same time that a re-weight or something like that was happening. But... I
think you're right. KISS

My initial KISS thought was the opposite: that a single network is the
alternate and maybe less-tested configuration of Ceph. Perhaps
multi-netting is a better comp

Re: [ceph-users] CephFS in the wild

2016-06-02 Thread Brady Deetz
On Wed, Jun 1, 2016 at 8:18 PM, Christian Balzer <ch...@gol.com> wrote:

>
> Hello,
>
> On Wed, 1 Jun 2016 15:50:19 -0500 Brady Deetz wrote:
>
> > Question:
> > I'm curious if there is anybody else out there running CephFS at the
> > scale I'm planning for. I'd like to know some of the issues you didn't
> > expect that I should be looking out for. I'd also like to simply see
> > when CephFS hasn't worked out and why. Basically, give me your war
> > stories.
> >
> Not me, but diligently search the archives, there are people with large
> CephFS deployments (despite the non-production status when they did them).
> Also look at the current horror story thread about what happens when you
> have huge directories.
>
> >
> > Problem Details:
> > Now that I'm out of my design phase and finished testing on VMs, I'm
> > ready to drop $100k on a pilo. I'd like to get some sense of confidence
> > from the community that this is going to work before I pull the trigger.
> >
> > I'm planning to replace my 110 disk 300TB (usable) Oracle ZFS 7320 with
> > CephFS by this time next year (hopefully by December). My workload is a
> > mix of small and vary large files (100GB+ in size). We do fMRI analysis
> > on DICOM image sets as well as other physio data collected from
> > subjects. We also have plenty of spreadsheets, scripts, etc. Currently
> > 90% of our analysis is I/O bound and generally sequential.
> >
> There are other people here doing similar things (medical institutes,
> universities), again search the archives and maybe contact them directly.
>
> > In deploying Ceph, I am hoping to see more throughput than the 7320 can
> > currently provide. I'm also looking to get away from traditional
> > file-systems that require forklift upgrades. That's where Ceph really
> > shines for us.
> >
> > I don't have a total file count, but I do know that we have about 500k
> > directories.
> >
> >
> > Planned Architecture:
> >
> Well, we talked about this 2 months ago, you seem to have changed only a
> few things.
> So lets dissect this again...
>
> > Storage Interconnect:
> > Brocade VDX 6940 (40 gig)
> >
> Is this a flat (single) network for all the storage nodes?
> And then from these 40Gb/s switches links to the access switches?
>

This will start as a single 40Gb/s switch with a single link to each node
(upgraded in the future to dual-switch + dual-link). The 40Gb/s switch will
also be connected to several 10Gb/s and 1Gb/s access switches with dual
40Gb/s uplinks.

We do intend to segment the public and private networks using VLANs
untagged at the node. There are obviously many subnets on our network. The
40Gb/s switch will handle routing for those networks.

You can see list discussion in "Public and Private network over 1
interface" May 23,2016 regarding some of this.


>
> > Access Switches for clients (servers):
> > Brocade VDX 6740 (10 gig)
> >
> > Access Switches for clients (workstations):
> > Brocade ICX 7450
> >
> > 3x MON:
> > 128GB RAM
> > 2x 200GB SSD for OS
> > 2x 400GB P3700 for LevelDB
> > 2x E5-2660v4
> > 1x Dual Port 40Gb Ethernet
> >
> Total overkill in the CPU core arena, fewer but faster cores would be more
> suited for this task.
> A 6-8 core, 2.8-3GHz base speed would be nice, alas Intel has nothing like
> that, the closest one would be the E5-2643v4.
>
> Same for RAM, MON processes are pretty frugal.
>
> No need for NVMes for the leveldb, use 2 400GB DC S3710 for OS (and thus
> the leveldb) and that's being overly generous in the speed/IOPS department.
>
> Note also that 40Gb/s isn't really needed here, alas latency and KISS do
> speak in favor of it, especially if you can afford it.
>

Noted


>
> > 2x MDS:
> > 128GB RAM
> > 2x 200GB SSD for OS
> > 2x 400GB P3700 for LevelDB (is this necessary?)
> No, there isn't any persistent data with MDS, unlike what I assumed as
> well before reading up on it and trying it out for the first time.
>

That's what I thought. For some reason, my VAR keeps throwing these on the
config.


>
> > 2x E5-2660v4
> > 1x Dual Port 40Gb Ethernet
> >
> Dedicated MONs/MDS are often a waste, they are suggested to avoid people
> who don't know what they're doing from overloading things.
>
> So in your case, I'd (again) suggest to get 3 mixed MON/MDS nodes, make
> the first one a dedicated MON and give it the lowest IP.
> HW Specs as discussed above, make sure to use DIMMs that allow you to
> upgrade to 256GB RAM, as MDS can grow larger than the other Ceph demons
> (from my limited experience with CephFS)

[ceph-users] CephFS in the wild

2016-06-01 Thread Brady Deetz
Question:
I'm curious if there is anybody else out there running CephFS at the scale
I'm planning for. I'd like to know some of the issues you didn't expect
that I should be looking out for. I'd also like to simply see when CephFS
hasn't worked out and why. Basically, give me your war stories.


Problem Details:
Now that I'm out of my design phase and finished testing on VMs, I'm ready
to drop $100k on a pilot. I'd like to get some sense of confidence from the
community that this is going to work before I pull the trigger.

I'm planning to replace my 110 disk 300TB (usable) Oracle ZFS 7320 with
CephFS by this time next year (hopefully by December). My workload is a mix
of small and very large files (100GB+ in size). We do fMRI analysis on
DICOM image sets as well as other physio data collected from subjects. We
also have plenty of spreadsheets, scripts, etc. Currently 90% of our
analysis is I/O bound and generally sequential.

In deploying Ceph, I am hoping to see more throughput than the 7320 can
currently provide. I'm also looking to get away from traditional
file-systems that require forklift upgrades. That's where Ceph really
shines for us.

I don't have a total file count, but I do know that we have about 500k
directories.


Planned Architecture:

Storage Interconnect:
Brocade VDX 6940 (40 gig)

Access Switches for clients (servers):
Brocade VDX 6740 (10 gig)

Access Switches for clients (workstations):
Brocade ICX 7450

3x MON:
128GB RAM
2x 200GB SSD for OS
2x 400GB P3700 for LevelDB
2x E5-2660v4
1x Dual Port 40Gb Ethernet

2x MDS:
128GB RAM
2x 200GB SSD for OS
2x 400GB P3700 for LevelDB (is this necessary?)
2x E5-2660v4
1x Dual Port 40Gb Ethernet

8x OSD:
128GB RAM
2x 200GB SSD for OS
2x 400GB P3700 for Journals
24x 6TB Enterprise SATA
2x E5-2660v4
1x Dual Port 40Gb Ethernet
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Public and Private network over 1 interface

2016-05-23 Thread Brady Deetz
To be clear for future responders, separate mds and mon servers are in the
design. Everything is the same as the osd hardware except the chassis and
there aren't 24 hdds in there.
On May 23, 2016 4:27 PM, "Oliver Dzombic" <i...@ip-interactive.de> wrote:

> Hi,
>
> keep it simple, would be in my opinion devide different tasks to
> different servers and networks.
>
> The more stuff is running on one device, the higher is the chance that
> they will influence each other and this way make debugging harder.
>
> Our first setups had all (mon, osd, mds) on one server, which ended up
> being harder to debug, because you don't know if it's caused by the mon,
> or maybe the kernel, or maybe just a combination of kernel + osd when the
> mds is kernel dumping?!
>
> Same with the network. If you have big numbers of stuff flowing there,
> you have big numbers to keep an eye on, with cross-side effects, which
> will not be helpful to debug stuff fastly.
>
> So, if you want to keep stuff simply, make one device for one task.
>
> Of course, there is a natural balance between efficiency and deviding
> like that. I would also not ( like to ) buy a new switch for big money,
> just because i run out of ports ( while i have still so much bandwidth
> on it ).
>
> But on the other hand, in my humble opinion, the factor of how easy it
> is to debug and how big the chance of cross side effects is, is a big,
> considerable factor.
>
> --
> Mit freundlichen Gruessen / Best regards
>
> Oliver Dzombic
> IP-Interactive
>
> mailto:i...@ip-interactive.de
>
> Anschrift:
>
> IP Interactive UG ( haftungsbeschraenkt )
> Zum Sonnenberg 1-3
> 63571 Gelnhausen
>
> HRB 93402 beim Amtsgericht Hanau
> Geschäftsführung: Oliver Dzombic
>
> Steuer Nr.: 35 236 3622 1
> UST ID: DE274086107
>
>
> Am 23.05.2016 um 23:19 schrieb Wido den Hollander:
> >
> >> Op 23 mei 2016 om 21:53 schreef Brady Deetz <bde...@gmail.com>:
> >>
> >>
> >> TLDR;
> >> Has anybody deployed a Ceph cluster using a single 40 gig nic? This is
> >> discouraged in
> >>
> http://docs.ceph.com/docs/master/rados/configuration/network-config-ref/
> >>
> >> "One NIC OSD in a Two Network Cluster:
> >> Generally, we do not recommend deploying an OSD host with a single NIC
> in a
> >> cluster with two networks. --- [cut] --- Additionally, the public
> network
> >> and cluster network must be able to route traffic to each other, which
> we
> >> don’t recommend for security reasons."
> >>
> >
> > I still don't agree with that part of the docs.
> >
> > Imho the public and cluster network make 99% of the setups more complex.
> > This makes it harder to diagnose problems, while a simple, single, flat
> network is easier to work with.
> >
> > I like the approach of a single machine with a single NIC. That machine
> will be up or down. Not in a state where one of the networks might be
> failing.
> >
> > Keep it simple is my advice.
> >
> > Wido
> >
> >> 
> >> Reason for this question:
> >> My hope is that I can keep capital expenses down for this year then add
> a
> >> second switch and second 40 gig DAC to each node next year.
> >>
> >> Thanks for any wisdom you can provide.
> >> -
> >>
> >> Details:
> >> Planned configuration - 40 gig interconnect via Brocade VDX 6940 and 8x
> OSD
> >> nodes configured as follows:
> >> 2x E5-2660v4
> >> 8x 16GB ECC DDR4 (128 GB RAM)
> >> 1x dual port Mellanox ConnectX-3 Pro EN
> >> 24x 6TB enterprise sata
> >> 2x P3700 400GB pcie nvme (journals)
> >> 2x 200GB SSD (OS drive)
> >>
> >> 1) From a security perspective, why not keep the networks segmented all
> the
> >> way to the node using tagged VLANs or VXLANs then untag them at the
> node?
> >> From a security perspective, that's no different than sending 2
> networks to
> >> the same host on different interfaces.
> >>
> >> 2) By using VLANs, I wouldn't have to worry about the special
> configuration
> >> of Ceph mentioned in referenced documentation, since the untagged VLANs
> >> would show up as individual interfaces on the host.
> >>
> >> 3) From a performance perspective, has anybody observed a significant
> >> performance hit by untagging vlans on the node? This is something I
> can't
> >> test, since I do

[ceph-users] Public and Private network over 1 interface

2016-05-23 Thread Brady Deetz
TLDR;
Has anybody deployed a Ceph cluster using a single 40 gig nic? This is
discouraged in
http://docs.ceph.com/docs/master/rados/configuration/network-config-ref/

"One NIC OSD in a Two Network Cluster:
Generally, we do not recommend deploying an OSD host with a single NIC in a
cluster with two networks. --- [cut] --- Additionally, the public network
and cluster network must be able to route traffic to each other, which we
don’t recommend for security reasons."


Reason for this question:
My hope is that I can keep capital expenses down for this year then add a
second switch and second 40 gig DAC to each node next year.

Thanks for any wisdom you can provide.
-

Details:
Planned configuration - 40 gig interconnect via Brocade VDX 6940 and 8x OSD
nodes configured as follows:
2x E5-2660v4
8x 16GB ECC DDR4 (128 GB RAM)
1x dual port Mellanox ConnectX-3 Pro EN
24x 6TB enterprise sata
2x P3700 400GB pcie nvme (journals)
2x 200GB SSD (OS drive)

1) From a security perspective, why not keep the networks segmented all the
way to the node using tagged VLANs or VXLANs then untag them at the node?
From a security perspective, that's no different than sending 2 networks to
the same host on different interfaces.

2) By using VLANs, I wouldn't have to worry about the special configuration
of Ceph mentioned in referenced documentation, since the untagged VLANs
would show up as individual interfaces on the host.

3) From a performance perspective, has anybody observed a significant
performance hit by untagging vlans on the node? This is something I can't
test, since I don't currently own 40 gig gear.

3.a) If I used a VXLAN offloading nic, wouldn't this remove this potential
issue?

3.b) My back of napkin estimate shows that total OSD read throughput per
node could max out around 38gbps (4800MB/s). But in reality, with plenty of
random I/O, I'm expecting to see something more around 30gbps. So a single
40 gig connection ought to leave plenty of headroom. right?
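
For reference, that back-of-napkin math works out as follows, assuming roughly
200 MB/s of sequential throughput per SATA drive (the per-drive figure is an
assumption, not a measurement):

24 drives x 200 MB/s  = 4800 MB/s per node
4800 MB/s x 8 bits/B  = 38400 Mb/s, i.e. roughly 38.4 Gb/s
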
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Maximum MON Network Throughput Requirements

2016-05-02 Thread Brady Deetz
Thanks.
Our initial deployment will be 8 OSD nodes containing 24 OSDs each
(spinning rust, not ssd). Each node will contain 2 PCIe p3700 NVMe for
journals. I expect us to grow to a maximum of 15 OSD nodes.

I'll just keep 40 gig on everything for the sake of consistency and not
risk under-sizing my monitor nodes.
On May 2, 2016 6:17 PM, "Chris Jones" <cjo...@cloudm2.com> wrote:

> Mons and RGWs only use the public network but Mons can have a good deal of
> traffic. I would not recommend 1Gb but if looking for lower bandwidth then
> 10Gb would be good for most. It all depends in the overall size of the
> cluster. You mentioned 40Gb. If the nodes are high density then 40Gb but if
> they are lower density then 20Gb would be fine.
>
> -CJ
>
> On Mon, May 2, 2016 at 12:09 PM, Brady Deetz <bde...@gmail.com> wrote:
>
>> I'm working on finalizing designs for my Ceph deployment. I'm currently
>> leaning toward 40gbps ethernet for interconnect between OSD nodes and to my
>> MDS servers. But, I don't really want to run 40 gig to my mon servers
>> unless there is a reason. Would there be an issue with using 1 gig on my
>> monitor servers?
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
>
> --
> Best Regards,
> Chris Jones
>
> cjo...@cloudm2.com
> (p) 770.655.0770
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Maximum MON Network Throughput Requirements

2016-05-02 Thread Brady Deetz
I'm working on finalizing designs for my Ceph deployment. I'm currently
leaning toward 40gbps ethernet for interconnect between OSD nodes and to my
MDS servers. But, I don't really want to run 40 gig to my mon servers
unless there is a reason. Would there be an issue with using 1 gig on my
monitor servers?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mds segfault on cephfs snapshot creation

2016-04-20 Thread Brady Deetz
On Wed, Apr 20, 2016 at 4:09 AM, Yan, Zheng <uker...@gmail.com> wrote:

> On Wed, Apr 20, 2016 at 12:12 PM, Brady Deetz <bde...@gmail.com> wrote:
> > As soon as I create a snapshot on the root of my test cephfs deployment
> with
> > a single file within the root, my mds server kernel panics. I understand
> > that snapshots are not recommended. Is it beneficial to developers for
> me to
> > leave my cluster in its present state and provide whatever debugging
> > information they'd like? I'm not really looking for a solution to a
> mission
> > critical issue as much as providing an opportunity for developers to pull
> > stack traces, logs, etc from a system affected by some sort of bug in
> > cephfs/mds. This happens every time I create a directory inside my .snap
> > directory.
>
> It's likely your kernel is too old for the kernel mount. Which version of
> the kernel do you use?
>

All nodes in the cluster share the versions listed below. This actually
appears to be a cephfs client (native) issue (see stacktrace and kernel
dump below). I have my fs mounted on my mds which is why I thought it was
the mds causing a panic.

Linux mon0 3.13.0-77-generic #121-Ubuntu SMP Wed Jan 20 10:50:42 UTC 2016
x86_64 x86_64 x86_64 GNU/Linux

ceph-admin@mon0:~$ cat /etc/issue
Ubuntu 14.04.4 LTS \n \l

ceph-admin@mon0:~$ dpkg -l | grep ceph | tr -s ' ' | cut -d ' ' -f 2,3
ceph 0.80.11-0ubuntu1.14.04.1
ceph-common 0.80.11-0ubuntu1.14.04.1
ceph-deploy 1.4.0-0ubuntu1
ceph-fs-common 0.80.11-0ubuntu1.14.04.1
ceph-mds 0.80.11-0ubuntu1.14.04.1
libcephfs1 0.80.11-0ubuntu1.14.04.1
python-ceph 0.80.11-0ubuntu1.14.04.1


ceph-admin@mon0:~$ ceph status
cluster 186408c3-df8a-4e46-a397-a788fc380039
 health HEALTH_OK
 monmap e1: 1 mons at {mon0=192.168.1.120:6789/0}, election epoch 1,
quorum 0 mon0
 mdsmap e48: 1/1/1 up {0=mon0=up:active}
 osdmap e206: 15 osds: 15 up, 15 in
  pgmap v25298: 704 pgs, 5 pools, 123 MB data, 53 objects
1648 MB used, 13964 GB / 13965 GB avail
 704 active+clean


ceph-admin@mon0:~$ ceph osd tree
# id   weight   type name           up/down   reweight
-1     13.65    root default
-2     2.73         host osd0
0      0.91             osd.0       up        1
1      0.91             osd.1       up        1
2      0.91             osd.2       up        1
-3     2.73         host osd1
3      0.91             osd.3       up        1
4      0.91             osd.4       up        1
5      0.91             osd.5       up        1
-4     2.73         host osd2
6      0.91             osd.6       up        1
7      0.91             osd.7       up        1
8      0.91             osd.8       up        1
-5     2.73         host osd3
9      0.91             osd.9       up        1
10     0.91             osd.10      up        1
11     0.91             osd.11      up        1
-6     2.73         host osd4
12     0.91             osd.12      up        1
13     0.91             osd.13      up        1
14     0.91             osd.14      up        1


http://tech-hell.com/dump.201604201536

[ 5869.157340] [ cut here ]
[ 5869.157527] kernel BUG at
/build/linux-faWYrf/linux-3.13.0/fs/ceph/inode.c:928!
[ 5869.157797] invalid opcode:  [#1] SMP
[ 5869.157977] Modules linked in: kvm_intel kvm serio_raw ceph libceph
libcrc32c fscache psmouse floppy
[ 5869.158415] CPU: 0 PID: 46 Comm: kworker/0:1 Not tainted
3.13.0-77-generic #121-Ubuntu
[ 5869.158709] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
[ 5869.158925] Workqueue: ceph-msgr con_work [libceph]
[ 5869.159124] task: 8809abf3c800 ti: 8809abf46000 task.ti:
8809abf46000
[ 5869.159422] RIP: 0010:[]  []
splice_dentry+0xd5/0x190 [ceph]
[ 5869.159768] RSP: 0018:8809abf47b68  EFLAGS: 00010282
[ 5869.159963] RAX: 0004 RBX: 8809a08b2780 RCX:
0001
[ 5869.160224] RDX:  RSI: 8809a04f8370 RDI:
8809a08b2780
[ 5869.160484] RBP: 8809abf47ba8 R08: 8809a982c400 R09:
8809a99ef6e8
[ 5869.160550] R10: 000819d8 R11:  R12:
8809a04f8370
[ 5869.160550] R13: 8809a08b2780 R14: 8809aad5fc00 R15:

[ 5869.160550] FS:  () GS:8809e3c0()
knlGS:
[ 5869.160550] CS:  0010 DS:  ES:  CR0: 8005003b
[ 5869.160550] CR2: 7f60f37ff5c0 CR3: 0009a5f63000 CR4:
06f0
[ 5869.160550] Stack:
[ 5869.160550]  8809a5da1000 8809aad5fc00 8809a99ef408
8809a99ef400
[ 5869.160550]  8809a04f8370 8809a08b2780 8809aad5fc00

[ 5869.160550]  8809abf47c08 a00a0dc7 8809a982c544
8809ab3f5400
[ 5869.160550] Call Trace:
[ 5869.160550]  [] ceph_fill_trace+0x2a7/0x770 [ceph]
[ 5869.160550]  [] handle_reply+0x3d5/0xc70 [ceph]
[ 5869.160550]  [] dispatch+0xe7/0xa90 [ceph]
[

[ceph-users] mds segfault on cephfs snapshot creation

2016-04-19 Thread Brady Deetz
As soon as I create a snapshot on the root of my test cephfs deployment
with a single file within the root, my mds server kernel panics. I
understand that snapshots are not recommended. Is it beneficial to
developers for me to leave my cluster in its present state and provide
whatever debugging information they'd like? I'm not really looking for a
solution to a mission critical issue as much as providing an opportunity
for developers to pull stack traces, logs, etc from a system affected by
some sort of bug in cephfs/mds. This happens every time I create a
directory inside my .snap directory.
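
For anyone trying to reproduce or inspect this, the trigger is just the
standard CephFS snapshot mechanism (the mount point and snapshot name below
are examples):

mkdir /mnt/cephfs/.snap/testsnap    # creating a snapshot this way is what panics the client
rmdir /mnt/cephfs/.snap/testsnap    # how the snapshot would normally be removed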

Let me know if I should blow my cluster away?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

