Re: [ceph-users] CephFS MDS optimal setup on Google Cloud

2019-01-20 Thread Mahmoud Ismail
On Mon, 7 Jan 2019 at 21:04, Patrick Donnelly  wrote:

> Hello Mahmoud,
>
> On Fri, Dec 21, 2018 at 7:44 AM Mahmoud Ismail
>  wrote:
> > I'm doing benchmarks for metadata operations on CephFS, HDFS, and HopsFS
> on Google Cloud. In my current setup, i'm using 32 vCPU machines with 29 GB
> memory, and i have 1 MDS, 1 MON and 3 OSDs. The MDS and the MON nodes are
> co-located on one vm, while each of the OSDs is on a separate vm with 1 SSD
> disk attached. I'm using the default configuration for MDS, and OSDs.
> >
> > I'm running 300 clients on 10 machines (16 vCPU), each client creates a
> CephFileSystem using the CephFS hadoop plugin, and then writes empty files
> for 30 seconds followed by reading the empty files for another 30 seconds.
> The aggregated throughput is around 2000 file create operations/sec and
> 1 file read operations/sec. However, the MDS is not fully utilizing the
> 32 cores on the machine. Is there any configuration that I should consider
> to fully utilize the machine?
>
> The MDS is not yet very parallel; it can only utilize about 2.5 cores
> in the best circumstances. Make sure you allocate plenty of RAM for
> the MDS. 16GB or 32GB would be a good choice. See (and disregard the
> warning on that page):
> http://docs.ceph.com/docs/mimic/cephfs/cache-size-limits/
>
> You may also try using multiple active metadata servers to increase
> throughput. See: http://docs.ceph.com/docs/mimic/cephfs/multimds/
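
For reference, a minimal sketch of both suggestions (assuming a filesystem
named "cephfs", the Mimic centralized config, and a second MDS daemon already
running as a standby; adjust names for your deployment):

    # raise the MDS cache limit, here to 16 GiB (value is in bytes)
    ceph config set mds mds_cache_memory_limit 17179869184
    # allow a second active MDS for the filesystem
    ceph fs set cephfs max_mds 2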


How often does the dynamic subtree partitioning kick in? Can we control
this interval?
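
(From a quick look at the MDS options, the balancer tick appears to be
governed by mds_bal_interval, in seconds; treat the exact option name as
something to verify for your release. A sketch for inspecting and changing it,
assuming an MDS daemon named mds.a:

    ceph daemon mds.a config get mds_bal_interval
    ceph config set mds mds_bal_interval 10
)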


>
> > Also, I noticed that running more than 20-30 clients (on different
> > threads) per machine degrades the aggregated throughput for reads. Is
> > there a limitation in CephFileSystem or libceph on the number of clients
> > created per machine?
>
> No. Can't give you any hints without more information about the test
> setup. We also have not tested with the Hadoop plugin in years. There
> may be limitations we're not presently aware of.


On each machine, I'm running a simple Java program that creates 30
CephFileSystem instances using the Hadoop file system interface (hadoop
plugin), and then on each thread I'm doing write and then read operations
on empty files in a loop.


> > Another issue: are the MDS operations single-threaded, as pointed out here?
> > https://www.slideshare.net/XiaoxiChen3/cephfs-jewel-mds-performance-benchmark
>
> Yes, this is still the case.
>
> > Regarding the MDS global lock, is it a single lock per MDS or is it a
> > global distributed lock for all MDSs?
>
> per-MDS
>
>
> --
> Patrick Donnelly
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Problem with OSDs

2019-01-20 Thread Brian Topping
Hi all, looks like I might have pooched something. Between the two nodes I 
have, I moved all the PGs to one machine, reformatted the other machine, 
rebuilt that machine, and moved the PGs back. In both cases, I did this by 
taking the OSDs on the machine being moved from “out” and waiting for health to 
be restored, then took them down. 

This worked great up to the point where I had the mon/manager/rgw on the machine
where they started, and all the OSDs/PGs on the other machine that had been
rebuilt. The next step was to rebuild the master machine, copy /etc/ceph and
/var/lib/ceph with cpio, then re-add new OSDs on the master machine.

This didn't work so well. The master has come up just fine, but it's not
connecting to the OSDs. Of the four OSDs, only two came up, and the other two
did not (IDs 1 and 3). For its part, the OSD machine is reporting lines like
the following in its logs:

> [2019-01-20 16:22:10,106][systemd][WARNING] failed activating OSD, retries 
> left: 2
> [2019-01-20 16:22:15,111][ceph_volume.process][INFO  ] Running command: 
> /usr/sbin/ceph-volume lvm trigger 1-e3bfc69e-a145-4e19-aac2-5f888e1ed2ce
> [2019-01-20 16:22:15,271][ceph_volume.process][INFO  ] stderr -->  
> RuntimeError: could not find osd.1 with fsid 
> e3bfc69e-a145-4e19-aac2-5f888e1ed2ce


I see this for the volumes:

> [root@gw02 ceph]# ceph-volume lvm list 
> 
> == osd.1 ===
> 
>   [block]
> /dev/ceph-c7640f3e-0bf5-4d75-8dd4-00b6434c84d9/osd-block-4672bb90-8cea-4580-85f2-1e692811a05a
> 
>   type  block
>   osd id1
>   cluster fsid  1cf94ce9-1323-4c43-865f-68f4ae9e6af3
>   cluster name  ceph
>   osd fsid  4672bb90-8cea-4580-85f2-1e692811a05a
>   encrypted 0
>   cephx lockbox secret  
>   block uuid3M5fen-JgsL-t4vz-bh3m-k3pf-hjBV-4R7Cff
>   block device  
> /dev/ceph-c7640f3e-0bf5-4d75-8dd4-00b6434c84d9/osd-block-4672bb90-8cea-4580-85f2-1e692811a05a
>   vdo   0
>   crush device classNone
>   devices   /dev/sda3
> 
> == osd.3 ===
> 
>   [block]
> /dev/ceph-f5f453df-1d41-4883-b0f8-d662c6ba8bea/osd-block-084cf33d-8a38-4c82-884a-7c88e3161479
> 
>   type  block
>   osd id3
>   cluster fsid  1cf94ce9-1323-4c43-865f-68f4ae9e6af3
>   cluster name  ceph
>   osd fsid  084cf33d-8a38-4c82-884a-7c88e3161479
>   encrypted 0
>   cephx lockbox secret  
>   block uuidPSU2ba-6PbF-qhm7-RMER-lCkR-j58b-G9B6A7
>   block device  
> /dev/ceph-f5f453df-1d41-4883-b0f8-d662c6ba8bea/osd-block-084cf33d-8a38-4c82-884a-7c88e3161479
>   vdo   0
>   crush device classNone
>   devices   /dev/sdb3
> 
> == osd.5 ===
> 
>   [block]
> /dev/ceph-033e2bbe-5005-45d9-9ecd-4b541fe010bd/osd-block-e854930d-1617-4fe7-b3cd-98ef284643fd
> 
>   type  block
>   osd id5
>   cluster fsid  1cf94ce9-1323-4c43-865f-68f4ae9e6af3
>   cluster name  ceph
>   osd fsid  e854930d-1617-4fe7-b3cd-98ef284643fd
>   encrypted 0
>   cephx lockbox secret  
>   block uuidF5YIfz-quO4-gbmW-rxyP-qXxe-iN7a-Po1mL9
>   block device  
> /dev/ceph-033e2bbe-5005-45d9-9ecd-4b541fe010bd/osd-block-e854930d-1617-4fe7-b3cd-98ef284643fd
>   vdo   0
>   crush device classNone
>   devices   /dev/sdc3
> 
> == osd.7 ===
> 
>   [block]
> /dev/ceph-1f3d4406-af86-4813-8d06-a001c57408fa/osd-block-5c0d0404-390e-4801-94a9-da52c104206f
> 
>   type  block
>   osd id7
>   cluster fsid  1cf94ce9-1323-4c43-865f-68f4ae9e6af3
>   cluster name  ceph
>   osd fsid  5c0d0404-390e-4801-94a9-da52c104206f
>   encrypted 0
>   cephx lockbox secret  
>   block uuidwgfOqi-iCu0-WIGb-uZPb-0R3n-ClQ3-0IewMe
>   block device  
> /dev/ceph-1f3d4406-af86-4813-8d06-a001c57408fa/osd-block-5c0d0404-390e-4801-94a9-da52c104206f
>   vdo   0
>   crush device classNone
>   devices   /dev/sdd3

What I am wondering is if device mapper has lost something with a kernel or 
library change:

> [root@gw02 ceph]# ls -l /dev/dm*
> brw-rw. 1 root disk 253, 0 Jan 20 16:19 /dev/dm-0
> brw-rw. 1 ceph ceph 253, 1 Jan 20 16:19 /dev/dm-1
> brw-rw. 1 ceph ceph 253, 2 Jan 20 16:19 /dev/dm-2
> brw-rw. 1 ceph ceph 253, 3 Jan 20 16:19 /dev/dm-3
> brw-rw. 1 ceph ceph 253, 4 Jan 20 16:19 /dev/dm-4
> [root@gw02 ~]# dmsetup ls
> ceph--1f3d4406
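
(For the record, a sketch of the re-activation steps I would try, assuming the
LVs themselves are intact; adjust for your own layout:

vgchange -ay
ceph-volume lvm list
ceph-volume lvm activate --all
)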

Re: [ceph-users] MDS performance issue

2019-01-20 Thread Albert Yue
Hi Yan Zheng,

1. The mds cache limit is set to 64GB.
2. We got the size of the metadata pool by running `ceph df` and saw that it
only used about 200MB of space (see the commands below).
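
For reference (assuming the default admin socket path and an MDS daemon named
mds.a; adjust for your deployment):

    # effective cache limit on the running MDS (value is in bytes)
    ceph daemon mds.a config get mds_cache_memory_limit
    # per-OSD usage; on the metadata SSDs most of the data lives in OMAP,
    # which plain `ceph df` may not reflect
    ceph osd df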

Thanks,


On Mon, Jan 21, 2019 at 11:35 AM Yan, Zheng  wrote:

> On Mon, Jan 21, 2019 at 11:16 AM Albert Yue 
> wrote:
> >
> > Dear Ceph Users,
> >
> > We have set up a CephFS cluster with 6 OSD machines, each with 16 x 8TB
> > hard disks. The Ceph version is Luminous 12.2.5. We created one data pool
> > with these hard disks and another metadata pool with 3 SSDs. We created
> > an MDS with a 65GB cache size.
> >
> > But our users keep complaining that CephFS is too slow. What we observed
> > is that CephFS is fast when we switch to a new MDS instance, but once the
> > cache fills up (which happens very fast), clients become very slow when
> > performing basic filesystem operations such as `ls`.
> >
>
> what's your mds cache config?
>
> > What we know is that our users are putting lots of small files into
> > CephFS; there are now around 560 million files. We didn't see high CPU
> > wait on the MDS instance, and the metadata pool only used around 200MB.
>
> It's unlikely.  For the output of 'ceph osd df', you should take both
> DATA and OMAP into account.
>
> >
> > My question is: what is the relationship between the metadata pool and
> > the MDS? Is this performance issue caused by the hardware behind the
> > metadata pool? If the metadata pool only used 200MB of space, and we saw
> > 3k IOPS on each of these three SSDs, why can't the MDS cache all 200MB
> > in memory?
> >
> > Thanks very much!
> >
> >
> > Best Regards,
> >
> > Albert
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS performance issue

2019-01-20 Thread Yan, Zheng
On Mon, Jan 21, 2019 at 11:16 AM Albert Yue  wrote:
>
> Dear Ceph Users,
>
> We have set up a CephFS cluster with 6 OSD machines, each with 16 x 8TB
> hard disks. The Ceph version is Luminous 12.2.5. We created one data pool
> with these hard disks and another metadata pool with 3 SSDs. We created
> an MDS with a 65GB cache size.
>
> But our users keep complaining that CephFS is too slow. What we observed
> is that CephFS is fast when we switch to a new MDS instance, but once the
> cache fills up (which happens very fast), clients become very slow when
> performing basic filesystem operations such as `ls`.
>

what's your mds cache config?

> What we know is that our users are putting lots of small files into
> CephFS; there are now around 560 million files. We didn't see high CPU
> wait on the MDS instance, and the metadata pool only used around 200MB.

It's unlikely.  For the output of 'ceph osd df', you should take both
DATA and OMAP into account.

>
> My question is: what is the relationship between the metadata pool and
> the MDS? Is this performance issue caused by the hardware behind the
> metadata pool? If the metadata pool only used 200MB of space, and we saw
> 3k IOPS on each of these three SSDs, why can't the MDS cache all 200MB
> in memory?
>
> Thanks very much!
>
>
> Best Regards,
>
> Albert
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] MDS performance issue

2019-01-20 Thread Albert Yue
Dear Ceph Users,

We have set up a CephFS cluster with 6 OSD machines, each with 16 x 8TB
hard disks. The Ceph version is Luminous 12.2.5. We created one data pool
with these hard disks and another metadata pool with 3 SSDs. We created
an MDS with a 65GB cache size.

But our users keep complaining that CephFS is too slow. What we observed
is that CephFS is fast when we switch to a new MDS instance, but once the
cache fills up (which happens very fast), clients become very slow when
performing basic filesystem operations such as `ls`.

What we know is that our users are putting lots of small files into
CephFS; there are now around 560 million files. We didn't see high CPU
wait on the MDS instance, and the metadata pool only used around 200MB.

My question is: what is the relationship between the metadata pool and
the MDS? Is this performance issue caused by the hardware behind the
metadata pool? If the metadata pool only used 200MB of space, and we saw
3k IOPS on each of these three SSDs, why can't the MDS cache all 200MB
in memory?

Thanks very much!


Best Regards,

Albert
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph MDS laggy

2019-01-20 Thread Yan, Zheng
It's http://tracker.ceph.com/issues/37977. Thanks for your help.
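
(For reference, the gdb capture described below boils down to roughly the
following; a sketch, assuming gdb plus the ceph debuginfo packages are
installed and a single ceph-mds process per host:

    gdb -p $(pidof ceph-mds)
    (gdb) set logging file mds-backtrace.txt
    (gdb) set logging on
    (gdb) thread apply all bt
    (gdb) set logging off
    (gdb) detach
    (gdb) quit
)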

Regards
Yan, Zheng

On Sun, Jan 20, 2019 at 12:40 AM Adam Tygart  wrote:
>
> It worked for about a week, and then seems to have locked up again.
>
> Here is the back trace from the threads on the mds:
> http://people.cs.ksu.edu/~mozes/ceph-12.2.10-laggy-mds.gdb.txt
>
> --
> Adam
>
> On Sun, Jan 13, 2019 at 7:41 PM Yan, Zheng  wrote:
> >
> > On Sun, Jan 13, 2019 at 1:43 PM Adam Tygart  wrote:
> > >
> > > Restarting the nodes causes the hanging again. This means that this is
> > > workload dependent and not a transient state.
> > >
> > > I believe I've tracked down what is happening. One user was running
> > > 1500-2000 jobs in a single directory with 92000+ files in it. I am
> > > wondering if the cluster was getting ready to fragment the directory and
> > > something freaked out, perhaps not able to get all the caps back from
> > > the nodes (if that is even required).
> > >
> > > I've stopped that user's jobs for the time being, and will probably
> > > address it with them Monday. If it is the issue, can I tell the mds to
> > > pre-fragment the directory before I re-enable their jobs?
> > >
> >
> > The log shows the mds is in a busy loop, but doesn't show where it is. If it
> > happens again, please use gdb to attach to ceph-mds, then type 'set
> > logging on' and 'thread apply all bt' inside gdb, and send the output
> > to us.
> >
> > Yan, Zheng
> > > --
> > > Adam
> > >
> > > On Sat, Jan 12, 2019 at 7:53 PM Adam Tygart  wrote:
> > > >
> > > > On a hunch, I shutdown the compute nodes for our HPC cluster, and 10
> > > > minutes after that restarted the mds daemon. It replayed the journal,
> > > > evicted the dead compute nodes and is working again.
> > > >
> > > > This leads me to believe there was a broken transaction of some kind
> > > > coming from the compute nodes (also all running CentOS 7.6 and using
> > > > the kernel cephfs mount). I hope there is enough logging from before
> > > > to try to track this issue down.
> > > >
> > > > We are back up and running for the moment.
> > > > --
> > > > Adam
> > > >
> > > >
> > > >
> > > > On Sat, Jan 12, 2019 at 11:23 AM Adam Tygart  wrote:
> > > > >
> > > > > Hello all,
> > > > >
> > > > > I've got a 31 machine Ceph cluster running ceph 12.2.10 and CentOS 
> > > > > 7.6.
> > > > >
> > > > > We're using cephfs and rbd.
> > > > >
> > > > > Last night, one of our two active/active mds servers went laggy and
> > > > > upon restart once it goes active it immediately goes laggy again.
> > > > >
> > > > > I've got a log available here (debug_mds 20, debug_objecter 20):
> > > > > https://people.cs.ksu.edu/~mozes/ceph-mds-laggy-20190112.log.gz
> > > > >
> > > > > It looks like I might not have the right log levels. Thoughts on 
> > > > > debugging this?
> > > > >
> > > > > --
> > > > > Adam
> > > > > ___
> > > > > ceph-users mailing list
> > > > > ceph-users@lists.ceph.com
> > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > > ___
> > > > ceph-users mailing list
> > > > ceph-users@lists.ceph.com
> > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Process stuck in D+ on cephfs mount

2019-01-20 Thread Yan, Zheng
check /proc/<pid>/stack to find where it is stuck
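
A rough sketch of that (run as root; <pid> is the PID of the stuck process):

    # list processes in uninterruptible sleep (state D)
    ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'
    # dump the kernel stack of the stuck process
    cat /proc/<pid>/stack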

On Mon, Jan 21, 2019 at 5:51 AM Marc Roos  wrote:
>
>
> I have a process stuck in D+ writing to cephfs kernel mount. Anything
> can be done about this? (without rebooting)
>
>
> CentOS Linux release 7.5.1804 (Core)
> Linux 3.10.0-514.21.2.el7.x86_64
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph in OSPF environment

2019-01-20 Thread Volodymyr Litovka

Hi,

To be more precise, the netstat table looks like the following snippet:

tcp   0   0 10.10.200.5:6815   10.10.25.4:43788   ESTABLISHED 51981/ceph-osd
tcp   0   0 10.10.15.2:41020   10.10.200.8:6813   ESTABLISHED 51981/ceph-osd
tcp   0   0 10.10.15.2:48724   10.10.200.9:6809   ESTABLISHED 51981/ceph-osd
tcp   0   0 10.10.200.5:6813   10.10.25.4:34692   ESTABLISHED 51981/ceph-osd
tcp   0   0 10.10.25.2:36736   10.10.200.6:6814   ESTABLISHED 51981/ceph-osd
tcp   0   0 10.10.15.2:55250   10.10.200.8:6815   ESTABLISHED 51981/ceph-osd
tcp   0   0 10.10.200.5:6815   10.10.25.6:50778   ESTABLISHED 51981/ceph-osd
tcp   0   0 10.10.15.2:57880   10.10.200.7:6803   ESTABLISHED 51981/ceph-osd


From this you can see that the source IP for all connections is always in the
10.10.[12]5.x networks, which are the interface addresses.


The network configuration of the node is:

int vlo
  ip address 10.10.200.5/32
int enp97s0f1
  ip address 10.10.15.2/24
int enp19s0f0
  ip address 10.10.25.2/24

where interface vlo is an OVS bridge:

Bridge vlo
    Port "vnet0"
        Interface "vnet0"
    Port vlo
        Interface vlo
            type: internal

and routing is similar on all nodes, looking like this:

10.10.200.x proto bird metric 64
    nexthop via 10.10.15.x dev enp97s0f1 weight 1
    nexthop via 10.10.25.x dev enp19s0f0 weight 1

where x is the address of another node with a similar configuration.

As per your question, 'ceph daemon osd.0 config show' gives the following:

    "cluster_addr": "10.10.200.5:0/0",
    "cluster_network": "",
    "cluster_network_interface": "vlo",
    "public_addr": "10.10.200.5:0/0",
    "public_bind_addr": "-",
    "public_network": "",
    "public_network_interface": "vlo"

Tomorrow we'll check the source file as per your suggestion, but maybe
you'll find something wrong in this config.
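
A quick way to cross-check what the OSDs actually bound to against what the
cluster map records (a sketch, run on an OSD host; osd.0 is just an example):

    # local sockets the OSD processes are listening on
    ss -tnlp | grep ceph-osd
    # address registered in the OSD map for osd.0
    ceph osd find 0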


Thank you.

On 1/20/19 10:54 PM, Max Krasilnikov wrote:

Good day!

  Fri, Jan 18, 2019 at 11:02:51PM +, robbat2 wrote:


On Fri, Jan 18, 2019 at 12:21:07PM +, Max Krasilnikov wrote:

Dear colleagues,

we build L3 topology for use with CEPH, which is based on OSPF routing
between Loopbacks, in order to get reliable and ECMPed topology, like this:

...

CEPH configured in the way

You have a minor misconfiguration, but I've had trouble with the address
picking logic before, on a L3 routed ECMP BGP topography on IPv6 (using
the Cumulus magic link-local IPv6 BGP)


[global]
public_network = 10.10.200.0/24

Keep this, but see below.


[osd.0]
public bind addr = 10.10.200.5

public_bind_addr is only used by mons.


cluster bind addr = 10.10.200.5

There is no such option as 'cluster_bind_addr'; it's just 'cluster_addr'

Set the following in the OSD block:
| public_network = # keep empty; empty != unset
| cluster_network = # keep empty; empty != unset
| cluster_addr = 10.10.200.5
| public_addr = 10.10.200.5

Alternatively, see the code src/common/pick_address.cc to see about
using cluster_network_interface and public_network_interface.

Unfortunately, all OSDs continue to bind to interface addresses instead of the vlo
bridge address even after setting cluster_addr, public_addr,
cluster_network_interface and public_network_interface :(
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


--
Volodymyr Litovka
  "Vision without Execution is Hallucination." -- Thomas Edison

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Process stuck in D+ on cephfs mount

2019-01-20 Thread Marc Roos


I have a process stuck in D+ writing to cephfs kernel mount. Anything 
can be done about this? (without rebooting)


CentOS Linux release 7.5.1804 (Core)
Linux 3.10.0-514.21.2.el7.x86_64 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph in OSPF environment

2019-01-20 Thread Robin H. Johnson
On Sun, Jan 20, 2019 at 09:05:10PM +, Max Krasilnikov wrote:
> > Just checking, since it isn't mentioned here: Did you explicitly add
> > public_network+cluster_network as empty variables?
> > 
> > Trace the code in the sourcefile I mentioned, specific to your Ceph
> > version, as it has changed slightly over the years.
> 
> My config looks like this for one host:
> 
> [osd]
> # keep empty; empty != unset
> public network =
> cluster network =
> public_network_interface = vlo
> cluster_network_interface = vlo
> cluster_addr = 10.10.200.5
> public_addr = 10.10.200.5
If you tell the daemon to dump the config, does it still show these set
as you have in the config?

'ceph daemon osd.0 config show'

-- 
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation Treasurer
E-Mail   : robb...@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph in OSPF environment

2019-01-20 Thread Max Krasilnikov
Hello!

 Sun, Jan 20, 2019 at 09:00:22PM +, robbat2 wrote: 

> > > > we build L3 topology for use with CEPH, which is based on OSPF routing 
> > > > between Loopbacks, in order to get reliable and ECMPed topology, like 
> > > > this:
> > > ...
> > > > CEPH configured in the way
> > > You have a minor misconfiguration, but I've had trouble with the address
> > > picking logic before, on a L3 routed ECMP BGP topography on IPv6 (using
> > > the Cumulus magic link-local IPv6 BGP)
> > > 
> > > > 
> > > > [global]
> > > > public_network = 10.10.200.0/24
> > > Keep this, but see below.
> > > 
> > > > [osd.0]
> > > > public bind addr = 10.10.200.5
> > > public_bind_addr is only used by mons.
> > > 
> > > > cluster bind addr = 10.10.200.5
> > > There is no such option as 'cluster_bind_addr'; it's just 'cluster_addr'
> > > 
> > > Set the following in the OSD block:
> > > | public_network = # keep empty; empty != unset
> > > | cluster_network = # keep empty; empty != unset
> > > | cluster_addr = 10.10.200.5
> > > | public_addr = 10.10.200.5
> > > 
> > > Alternatively, see the code src/common/pick_address.cc to see about
> > > using cluster_network_interface and public_network_interface.
> > 
> > Unfortunately, all OSDs continue to bind to interface addresses instead of
> > vlo
> > bridge address even after setting cluster_addr, public_addr,
> > cluster_network_interface and public_network_interface :(
> Just checking, since it isn't mentioned here: Did you explicitly add
> public_network+cluster_network as empty variables?
> 
> Trace the code in the sourcefile I mentioned, specific to your Ceph
> version, as it has changed slightly over the years.

My config looks like this for one host:

[osd]
# keep empty; empty != unset
public network =
cluster network =
public_network_interface = vlo
cluster_network_interface = vlo
cluster_addr = 10.10.200.5
public_addr = 10.10.200.5

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph in OSPF environment

2019-01-20 Thread Robin H. Johnson
On Sun, Jan 20, 2019 at 08:54:57PM +, Max Krasilnikov wrote:
> Good day!
> 
>  Fri, Jan 18, 2019 at 11:02:51PM +, robbat2 wrote: 
> 
> > On Fri, Jan 18, 2019 at 12:21:07PM +, Max Krasilnikov wrote:
> > > Dear colleagues,
> > > 
> > > we build L3 topology for use with CEPH, which is based on OSPF routing 
> > > between Loopbacks, in order to get reliable and ECMPed topology, like 
> > > this:
> > ...
> > > CEPH configured in the way
> > You have a minor misconfiguration, but I've had trouble with the address
> > picking logic before, on a L3 routed ECMP BGP topography on IPv6 (using
> > the Cumulus magic link-local IPv6 BGP)
> > 
> > > 
> > > [global]
> > > public_network = 10.10.200.0/24
> > Keep this, but see below.
> > 
> > > [osd.0]
> > > public bind addr = 10.10.200.5
> > public_bind_addr is only used by mons.
> > 
> > > cluster bind addr = 10.10.200.5
> > There is no such option as 'cluster_bind_addr'; it's just 'cluster_addr'
> > 
> > Set the following in the OSD block:
> > | public_network = # keep empty; empty != unset
> > | cluster_network = # keep empty; empty != unset
> > | cluster_addr = 10.10.200.5
> > | public_addr = 10.10.200.5
> > 
> > Alternatively, see the code src/common/pick_address.cc to see about
> > using cluster_network_interface and public_network_interface.
> 
> Unfortunately, all OSDs continue to bind to interface addresses instead of
> vlo
> bridge address even after setting cluster_addr, public_addr,
> cluster_network_interface and public_network_interface :(
Just checking, since it isn't mentioned here: Did you explicitly add
public_network+cluster_network as empty variables?

Trace the code in the sourcefile I mentioned, specific to your Ceph
version, as it has changed slightly over the years.

-- 
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation Treasurer
E-Mail   : robb...@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph in OSPF environment

2019-01-20 Thread Max Krasilnikov
Good day!

 Fri, Jan 18, 2019 at 11:02:51PM +, robbat2 wrote: 

> On Fri, Jan 18, 2019 at 12:21:07PM +, Max Krasilnikov wrote:
> > Dear colleagues,
> > 
> > we build L3 topology for use with CEPH, which is based on OSPF routing 
> > between Loopbacks, in order to get reliable and ECMPed topology, like this:
> ...
> > CEPH configured in the way
> You have a minor misconfiguration, but I've had trouble with the address
> picking logic before, on a L3 routed ECMP BGP topography on IPv6 (using
> the Cumulus magic link-local IPv6 BGP)
> 
> > 
> > [global]
> > public_network = 10.10.200.0/24
> Keep this, but see below.
> 
> > [osd.0]
> > public bind addr = 10.10.200.5
> public_bind_addr is only used by mons.
> 
> > cluster bind addr = 10.10.200.5
> There is no such option as 'cluster_bind_addr'; it's just 'cluster_addr'
> 
> Set the following in the OSD block:
> | public_network = # keep empty; empty != unset
> | cluster_network = # keep empty; empty != unset
> | cluster_addr = 10.10.200.5
> | public_addr = 10.10.200.5
> 
> Alternatively, see the code src/common/pick_address.cc to see about
> using cluster_network_interface and public_network_interface.

Unfortunately, all OSDs continue to bind to interface addresses instead of the vlo
bridge address even after setting cluster_addr, public_addr,
cluster_network_interface and public_network_interface :(
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Boot volume on OSD device

2019-01-20 Thread Hector Martin
On 20/01/2019 05.50, Brian Topping wrote:
> My main constraint is I had four disks on a single machine to start with
> and any one of the disks should be able to fail without affecting the
> ability for the machine to boot, the bad disk replaced without requiring
> obscure admin skills, and the final recovery to the promised land of
> “HEALTH_OK”. A single machine Ceph deployment is not much better than
> just using local storage, except the ability to later scale out. That’s
> the use case I’m addressing here.

I assume partitioning the drive and using mdadm to add it to one or
more RAID arrays and then dealing with the Ceph side doesn't qualify as
"obscure admin skills", right? :-)

(I also use single-host Ceph deployments; I like its properties over
traditional RAID or things like ZFS).

> https://theithollow.com/2012/03/21/understanding-raid-penalty/ provided
> a good background that I did not previously have on the RAID write
> penalty. I combined this with what I learned
> in 
> https://serverfault.com/questions/685289/software-vs-hardware-raid-performance-and-cache-usage/685328#685328.
> By the end of these two articles, I felt like I knew all the tradeoffs,
> but the final decision really came down to the penalty table in the
> first article and a “RAID penalty” of 2 for RAID 10, which was the same
> as the penalty for RAID 1, but with 50% better storage efficiency.

FWIW, I disagree with that article on RAID write penalty. It's an
oversimplification and the math doesn't really add up. I don't like the
way they define the concept of "write penalty" relative to the sum of
disk performance. It should be relative to a single disk.

Here's my take on it. First of all, you need to consider three different
performance metrics for writes:

- Sequential writes (seq)
- Random writes < stripe size (small)
- Random writes >> stripe size or aligned (large)

* stripe size is the size across all disks for RAID5/6, but a single
disk for RAID0

And here is the performance, where n is the number of disks, relative to
a single disk of the same type:

         seq   small  large
RAID 0   n     n      1
RAID 1   1     1      1
RAID 5   n-1   0.5    1
RAID 6   n-2   0.5    1
RAID 10  n/2   n/2    1

RAID0 gives a throughput improvement proportional to the number of
disks, and the same small IOPS improvement *on average* (assuming your
I/Os hit all the disks equally, not like repeatedly hammering one stripe
chunk). There is also some loss of performance because whenever I/O hits
multiple disks the *slowest* disk becomes the bottleneck, so if the
worst case latency is 10ms for a single disk, your average latency is
5ms, but the average latency for the slowest of two disks is 6.6ms, for
three disks 7.5ms, etc. approaching 10ms as you add disks.
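
(Worked out: if each disk's latency is uniform on [0, 10] ms, the expected
latency of the slowest of n disks is E[max] = n/(n+1) * 10 ms, i.e. about
6.7 ms for two disks and 7.5 ms for three, approaching 10 ms as n grows.)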

RAID1 is just like using a single disk, really. All the disks do the
same thing in parallel. That's it.

RAID5 has the same sequential improvement as RAID0, except with one
fewer disk, because parity takes one disk. However, small writes become
read-modify-write operations (it has to read the old data and parity to
update the parity), so you get half the IOPS. If your write is
stripe-aligned this penalty goes away, and misaligned writes larger than
several stripes amortize the penalty (it only hits the beginning and
end), so the performance approaches 1 as your write size increases, and
exceeds it as the sequential effect starts to dominate.

RAID6 is like RAID5 but with two parity disks. You still need a
(parallel) read and a (parallel) write for every small write.

RAID10 is just a RAID0 of RAID1s, so you ignore half the disks (the
mirrors) and the rest behave like RAID0.

The large/aligned I/O performance is identical to a single disk across
all RAID levels, because when your I/Os are larger than one stripe, then
*all* disks across the RAID have to handle the I/O (in parallel).

This is all assuming no controller or CPU bottlenecking. Realistically,
with software RAID and a non-terrible HBA, this is a pretty reasonable
assumption. There will be some extra overhead, but not much. Also, some
of the impact of RAID5/6 will be reduced by caching (hardware cache with
hardware RAID, or software stripe cache with md-raid).

This is all still a simplification in some ways, but I think it's closer
to reality than that article.

(somewhat offtopic for this list, but after seeing that article I felt I
had to try my own attempt at doing the math here).

Personally, I've had two set-ups like yours and this is what I did:

- On a production cluster with several OSDs with 4 disks (and no boot
drive), I used a 4-disk RAID1 for /boot and a 2-disk RAID1 with 2 spares
for /. This provides possibly a bit more fail-safe reliability in that
the RAID will auto-recover to the spares when something goes wrong
(instead of having to wait for a human to fix things). You could have a
4-disk RAID1, but there is some minor penalty (not detailed in my
explanation above) for replicating all writes 

Re: [ceph-users] Ceph MDS laggy

2019-01-20 Thread Paul Emmerich
I've heard of the same(?) problem on another cluster; they upgraded
from 12.2.7 to 12.2.10 and suddenly got problems with their CephFS
(and only with the CephFS).
However, they downgraded the MDS to 12.2.8 before I could take a look
at it, so not sure what caused the issue. 12.2.8 works fine with the
same workload that also involves a relatively large number of files.


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Sun, Jan 20, 2019 at 3:26 AM Adam Tygart  wrote:
>
> Yes, we upgraded to 12.2.10 from 12.2.7 on the 27th of December. This didn't 
> happen before then.
>
> --
> Adam
>
> On Sat, Jan 19, 2019, 20:17 Paul Emmerich wrote:
>> Did this only start to happen after upgrading to 12.2.10?
>>
>> Paul
>>
>> --
>> Paul Emmerich
>>
>> Looking for help with your Ceph cluster? Contact us at https://croit.io
>>
>> croit GmbH
>> Freseniusstr. 31h
>> 81247 München
>> www.croit.io
>> Tel: +49 89 1896585 90
>>
>> On Sat, Jan 19, 2019 at 5:40 PM Adam Tygart  wrote:
>> >
>> > It worked for about a week, and then seems to have locked up again.
>> >
>> > Here is the back trace from the threads on the mds:
>> > http://people.cs.ksu.edu/~mozes/ceph-12.2.10-laggy-mds.gdb.txt
>> >
>> > --
>> > Adam
>> >
>> > On Sun, Jan 13, 2019 at 7:41 PM Yan, Zheng  wrote:
>> > >
>> > > On Sun, Jan 13, 2019 at 1:43 PM Adam Tygart  wrote:
>> > > >
>> > > > Restarting the nodes causes the hanging again. This means that this is
>> > > > workload dependent and not a transient state.
>> > > >
>> > > > I believe I've tracked down what is happening. One user was running
>> > > > 1500-2000 jobs in a single directory with 92000+ files in it. I am
>> > > > wondering if the cluster was getting ready to fragment the directory and
>> > > > something freaked out, perhaps not able to get all the caps back from
>> > > > the nodes (if that is even required).
>> > > >
>> > > > I've stopped that user's jobs for the time being, and will probably
>> > > > address it with them Monday. If it is the issue, can I tell the mds to
>> > > > pre-fragment the directory before I re-enable their jobs?
>> > > >
>> > >
>> > > The log shows the mds is in a busy loop, but doesn't show where it is. If it
>> > > happens again, please use gdb to attach to ceph-mds, then type 'set
>> > > logging on' and 'thread apply all bt' inside gdb, and send the output
>> > > to us.
>> > >
>> > > Yan, Zheng
>> > > > --
>> > > > Adam
>> > > >
>> > > > On Sat, Jan 12, 2019 at 7:53 PM Adam Tygart  wrote:
>> > > > >
>> > > > > On a hunch, I shutdown the compute nodes for our HPC cluster, and 10
>> > > > > minutes after that restarted the mds daemon. It replayed the journal,
>> > > > > evicted the dead compute nodes and is working again.
>> > > > >
>> > > > > This leads me to believe there was a broken transaction of some kind
>> > > > > coming from the compute nodes (also all running CentOS 7.6 and using
>> > > > > the kernel cephfs mount). I hope there is enough logging from before
>> > > > > to try to track this issue down.
>> > > > >
>> > > > > We are back up and running for the moment.
>> > > > > --
>> > > > > Adam
>> > > > >
>> > > > >
>> > > > >
>> > > > > On Sat, Jan 12, 2019 at 11:23 AM Adam Tygart  wrote:
>> > > > > >
>> > > > > > Hello all,
>> > > > > >
>> > > > > > I've got a 31 machine Ceph cluster running ceph 12.2.10 and CentOS 
>> > > > > > 7.6.
>> > > > > >
>> > > > > > We're using cephfs and rbd.
>> > > > > >
>> > > > > > Last night, one of our two active/active mds servers went laggy and
>> > > > > > upon restart once it goes active it immediately goes laggy again.
>> > > > > >
>> > > > > > I've got a log available here (debug_mds 20, debug_objecter 20):
>> > > > > > https://people.cs.ksu.edu/~mozes/ceph-mds-laggy-20190112.log.gz
>> > > > > >
>> > > > > > It looks like I might not have the right log levels. Thoughts on 
>> > > > > > debugging this?
>> > > > > >
>> > > > > > --
>> > > > > > Adam
>> > > > > > ___
>> > > > > > ceph-users mailing list
>> > > > > > ceph-users@lists.ceph.com
>> > > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> > > > > ___
>> > > > > ceph-users mailing list
>> > > > > ceph-users@lists.ceph.com
>> > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> > > > ___
>> > > > ceph-users mailing list
>> > > > ceph-users@lists.ceph.com
>> > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Salvage CEPHFS after lost PG

2019-01-20 Thread Marc Roos
 
If you have a backfillfull condition, no PGs will be able to migrate.
Better to just add hard drives, because at least one of your OSDs is
too full.

I know you can set the backfillfull ratios with commands like these:
ceph tell osd.* injectargs '--mon_osd_full_ratio=0.97'
ceph tell osd.* injectargs '--mon_osd_backfillfull_ratio=0.95'

ceph tell osd.* injectargs '--mon_osd_full_ratio=0.95'
ceph tell osd.* injectargs '--mon_osd_backfillfull_ratio=0.90'

Or maybe decrease the weight of the full OSD; check the OSDs with 'ceph
osd status' and make sure your nodes have an even distribution of the
storage.
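
On Luminous and later the full/backfillfull ratios live in the OSDMap, so the
rough equivalents (and more persistent settings) would be:

ceph osd set-nearfull-ratio 0.90
ceph osd set-backfillfull-ratio 0.95
ceph osd set-full-ratio 0.97

and to lower the weight of a single over-full OSD, something like:

ceph osd reweight <osd-id> 0.9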

-Original Message-
From: Rik [mailto:r...@kawaja.net] 
Sent: zondag 20 januari 2019 8:47
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Salvage CEPHFS after lost PG

Hi all,




I'm looking for some suggestions on how to do something inappropriate. 




In a nutshell, I've lost the WAL/DB for three bluestore OSDs on a small 
cluster and, as a result of those three OSDs going offline, I've lost a 
placement group (7.a7). How I achieved this feat is an embarrassing 
mistake, which I don't think has bearing on my question.




The OSDs were created a few months ago with ceph-deploy:

/usr/local/bin/ceph-deploy --overwrite-conf osd create --bluestore 
--data /dev/vdc1 --block-db /dev/vdf1 ceph-a




With the 3 OSDs out, I'm sitting at OSD_BACKFILLFULL.




First, the PG 7.a7 belongs to the data pool, rather than the metadata 
pool and if I run "cephfs-data-scan pg_files / 7.a7" then I get a list 
of 4149 files/objects but then it hangs. I don't understand why this 
would hang if it's only the data pool which is impacted (since pg_files 
only operates on the metadata pool?).




The ceph-log shows:

cluster [WRN] slow request 30.894832 seconds old, received at 2019-01-20
18:00:12.555398: client_request(client.25017730:218006 lookup
#0x10001c8ce15/01 2019-01-20 18:00:12.550421 caller_uid=0,
caller_gid=0{}) currently failed to rdlock, waiting




Is the hang perhaps related to the OSD_BACKFILLFULL? If so, I could add 
some completely new OSDs to fix that problem. I have held off doing that 
for now as that will trigger a whole lot of data movement which might be 
unnecessary.




Or is the hang indeed related to the missing PG?




Second, if I try to copy files out of the CEPHFS filesystem, I get a few 
hundred files and then it too hangs. None of the files I’m attempting 
to copy are listed in the pg_files output (although since the pg_files 
hangs, perhaps it hadn't got to those files yet). Again, should I not be 
able to access files which are not associated with the missing data
pool PG?
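
(One way to check whether a given file actually maps to the missing PG, a
sketch assuming the data pool is called "cephfs_data" and the file is still
visible in the mount, with paths adjusted for your setup; the file's first
backing object is its inode number in hex plus a ".00000000" suffix:

stat -c %i /mnt/cephfs/path/to/file      # inode number, decimal
printf '%x\n' <inode-number>             # convert to hex
ceph osd map cephfs_data <inode-hex>.00000000   # maps that object to a PG
)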




Lastly, I want to know if there is some way to recreate the WAL/DB while 
leaving the OSD data intact and/or fool one of the OSDs into thinking 
everything is OK, allowing it to serve up the data it has in the missing 
PG.




From reading the mailing list and documentation, I know that this is not 
a "safe" operation:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-October/021713.html

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-January/024268.html




However, my current status indicates an unusable CEPHFS and limited 
access to the data. I'd like to get as much data off it as possible and 
then I expect to have to recreate it. With a combination of the backups 
I have and what I can salvage from the cluster, I should hopefully have 
most of what I need.




I know what I *should* have done, but now I'm at this point, I know I'm 
asking for something which would never be required on a properly-run 
cluster.




If it really is not possible to get the (possibly corrupt) PG back 
again, can I get the cluster back so the remainder of the files are 
accessible?




Currently running mimic 13.2.4 on all nodes.




Status:

$ ceph health detail - 
https://gist.github.com/kawaja/f59d231179b3186748eca19aae26bcd4

$ ceph fs get main - 
https://gist.github.com/kawaja/a7ab0b285d53dee6a950a4310be4fa5a




Any advice on where I could go from here would be greatly appreciated.




thanks,

rik.


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com