Re: [ceph-users] cephfs compression?

2018-06-28 Thread Richard Bade
Oh, also, because the compression is at the OSD level you don't see it
in ceph df. You just see that your RAW USED is not increasing as much as
you'd expect. E.g.
$ sudo ceph df
GLOBAL:
SIZE AVAIL RAW USED %RAW USED
785T  300T 485T 61.73
POOLS:
NAME            ID USED  %USED MAX AVAIL   OBJECTS
cephfs-metadata 11 185M      0    68692G       178
cephfs-data     12 408T  75.26      134T 132641159

You can see that we've used 408TB in the pool but only 485TB RAW -
rather than the ~600TB RAW I'd expect for my k=4, m=2 pool settings.
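
If you want one number per host, the per-OSD counters from the admin socket
(the loop is shown below) can simply be summed; a rough sketch, run on the
OSD host with the seq range adjusted to your OSD ids:

$ for osd in `seq 0 11`; do sudo ceph daemon osd.$osd perf dump; done \
    | grep -E 'bluestore_compressed_(original|allocated)' \
    | awk -F'[:,]' '/original/{o+=$2} /allocated/{a+=$2} END{printf "ratio %.2f (%.0f GB -> %.0f GB)\n", o/a, o/1e9, a/1e9}'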
On Fri, 29 Jun 2018 at 17:08, Richard Bade  wrote:
>
> I'm using compression on a cephfs-data pool in luminous. I didn't do
> anything special
>
> $ sudo ceph osd pool get cephfs-data all | grep ^compression
> compression_mode: aggressive
> compression_algorithm: zlib
>
> You can check how much compression you're getting on the OSDs:
> $ for osd in `seq 0 11`; do echo osd.$osd; sudo ceph daemon osd.$osd
> perf dump | grep 'bluestore_compressed'; done
> osd.0
> "bluestore_compressed": 686487948225,
> "bluestore_compressed_allocated": 788659830784,
> "bluestore_compressed_original": 1660064620544,
> 
> osd.11
> "bluestore_compressed": 700999601387,
> "bluestore_compressed_allocated": 808854355968,
> "bluestore_compressed_original": 1752045551616,
>
> I can't say for mimic, but definitely for luminous v12.2.5 compression
> is working well with mostly default options.
>
> -Rich
>
> > For RGW, compression works very well. We use rgw to store crash dumps, in
> > most cases, the compression ratio is about 2.0 ~ 4.0.
>
> > I tried to enable compression for cephfs data pool:
>
> > # ceph osd pool get cephfs_data all | grep ^compression
> > compression_mode: force
> > compression_algorithm: lz4
> > compression_required_ratio: 0.95
> > compression_max_blob_size: 4194304
> > compression_min_blob_size: 4096
>
> > (we built ceph packages and enabled lz4.)
>
> > It doesn't seem to work. I copied an 8.7GB folder to cephfs, ceph df says it
> > used 8.7GB:
>
> > root@ceph-admin:~# ceph df
> > GLOBAL:
> > SIZE   AVAIL  RAW USED %RAW USED
> > 16 TiB 16 TiB  111 GiB  0.69
> > POOLS:
> > NAME            ID USED    %USED MAX AVAIL OBJECTS
> > cephfs_data 1  8.7 GiB  0.17   5.0 TiB  360545
> > cephfs_metadata 2  221 MiB 0   5.0 TiB   77707
>
> > I know this folder can be compressed to ~4.0GB under zfs lz4 compression.
>
> > Am I missing anything? How do I make cephfs compression work? Is there any
> > trick?
>
> > By the way, I am evaluating ceph mimic v13.2.0.
>
> > Thanks in advance,
> > --Youzhong
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs compression?

2018-06-28 Thread Richard Bade
I'm using compression on a cephfs-data pool in luminous. I didn't do
anything special

$ sudo ceph osd pool get cephfs-data all | grep ^compression
compression_mode: aggressive
compression_algorithm: zlib
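
For anyone wanting to do the same, these are just per-pool options, e.g.:

$ sudo ceph osd pool set cephfs-data compression_mode aggressive
$ sudo ceph osd pool set cephfs-data compression_algorithm zlib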

You can check how much compression you're getting on the OSDs:
$ for osd in `seq 0 11`; do echo osd.$osd; sudo ceph daemon osd.$osd
perf dump | grep 'bluestore_compressed'; done
osd.0
"bluestore_compressed": 686487948225,
"bluestore_compressed_allocated": 788659830784,
"bluestore_compressed_original": 1660064620544,

osd.11
"bluestore_compressed": 700999601387,
"bluestore_compressed_allocated": 808854355968,
"bluestore_compressed_original": 1752045551616,

I can't say for mimic, but definitely for luminous v12.2.5 compression
is working well with mostly default options.

-Rich

> For RGW, compression works very well. We use rgw to store crash dumps, in
> most cases, the compression ratio is about 2.0 ~ 4.0.

> I tried to enable compression for cephfs data pool:

> # ceph osd pool get cephfs_data all | grep ^compression
> compression_mode: force
> compression_algorithm: lz4
> compression_required_ratio: 0.95
> compression_max_blob_size: 4194304
> compression_min_blob_size: 4096

> (we built ceph packages and enabled lz4.)

> It doesn't seem to work. I copied an 8.7GB folder to cephfs, ceph df says it
> used 8.7GB:

> root@ceph-admin:~# ceph df
> GLOBAL:
> SIZE   AVAIL  RAW USED %RAW USED
> 16 TiB 16 TiB  111 GiB  0.69
> POOLS:
> NAME            ID USED    %USED MAX AVAIL OBJECTS
> cephfs_data 1  8.7 GiB  0.17   5.0 TiB  360545
> cephfs_metadata 2  221 MiB 0   5.0 TiB   77707

> I know this folder can be compressed to ~4.0GB under zfs lz4 compression.

> Am I missing anything? How do I make cephfs compression work? Is there any
> trick?

> By the way, I am evaluating ceph mimic v13.2.0.

> Thanks in advance,
> --Youzhong
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] In a High Availability setup, MON, OSD daemon take up the floating IP

2018-06-28 Thread Дробышевский, Владимир
Rahul,

  if you are using the whole drives for OSDs then ceph-deploy is a good
option in most cases.
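
With whole drives that boils down to a one-liner per disk, along the lines of
(host and device names here are just placeholders):

$ ceph-deploy osd create --data /dev/sdb osd-host1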

2018-06-28 18:12 GMT+05:00 Rahul S :

> Hi Vlad,
>
> Have not thoroughly tested my setup but so far things look good. Only
> problem is that I have to manually activate the OSDs using the ceph-deploy
> command. Manually mounting the OSD partition doesn't work.
>
> Thanks for replying.
>
> Regards,
> Rahul S
>
> On 27 June 2018 at 14:15, Дробышевский, Владимир  wrote:
>
>> Hello, Rahul!
>>
>>   Do you have your problem during initial cluster creation or on any
>> reboot/leadership transfer? If the former, try removing the floating IP
>> while creating the mons, and temporarily transfer the leadership away from
>> the server you're going to create OSDs on.
>>
>>   We are using the same configuration without any issues (though we have a
>> few more servers), but our Ceph cluster had been created before the
>> OpenNebula setup.
>>
>>   We have a number of physical/virtual interfaces on top of IPoIB _and_ an
>> ethernet network (with bonding).
>>
>>   So there are 3 interfaces for the internal communications:
>>
>>   ib0.8003 - 10.103.0.0/16 - ceph public network and opennebula raft
>> virtual ip
>>   ib0.8004 - 10.104.0.0/16 - ceph cluster network
>>   br0 (on top of ethernet bonding interface) - 10.101.0.0/16 - physical
>> "management" network
>>
>>   also we have a number of other virtual interfaces for per-tenant
>> intra-VM networks (vxlan on top of IP) and so on.
>>
>>
>>
>> in /etc/hosts we have only "fixed" IPs from 10.103.0.0/16 networks like:
>>
>> 10.103.0.1  e001n01.dc1..xxe001n01
>>
>>
>>
>>   /etc/one/oned.conf:
>>
>> # Executed when a server transits from follower->leader
>>  RAFT_LEADER_HOOK = [
>>  COMMAND = "raft/vip.sh",
>>  ARGUMENTS = "leader ib0.8003 10.103.255.254/16"
>>  ]
>>
>> # Executed when a server transits from leader->follower
>>  RAFT_FOLLOWER_HOOK = [
>>  COMMAND = "raft/vip.sh",
>>  ARGUMENTS = "follower ib0.8003 10.103.255.254/16"
>>  ]
>>
>>
>>
>>   /etc/ceph/ceph.conf:
>>
>> [global]
>> public_network = 10.103.0.0/16
>> cluster_network = 10.104.0.0/16
>>
>> mon_initial_members = e001n01, e001n02, e001n03
>> mon_host = 10.103.0.1,10.103.0.2,10.103.0.3
>>
>>
>>
>>   Cluster and mons were created with ceph-deploy; each OSD has been added via a
>> modified ceph-disk.py (as we have only 3 drive slots per server we had to
>> co-locate the system partition with the OSD partition on our SSDs) in a
>> per-host/drive manner:
>>
>> admin@:~$ sudo ./ceph-disk-mod.py -v prepare --dmcrypt
>> --dmcrypt-key-dir /etc/ceph/dmcrypt-keys --bluestore --cluster ceph
>> --fs-type xfs -- /dev/sda
>>
>>
>>   And the current state on the leader:
>>
>> oneadmin@e001n02:~/remotes/tm$ onezone show 0
>> ZONE 0 INFORMATION
>> ID: 0
>> NAME  : OpenNebula
>>
>>
>> ZONE SERVERS
>> ID NAMEENDPOINT
>>  0 e001n01 http://10.103.0.1:2633/RPC2
>>  1 e001n02 http://10.103.0.2:2633/RPC2
>>  2 e001n03 http://10.103.0.3:2633/RPC2
>>
>> HA & FEDERATION SYNC STATUS
>> ID NAMESTATE  TERM   INDEX  COMMIT VOTE
>> FED_INDEX
>>  0 e001n01 follower   1571   68250418   68250417   1 -1
>>  1 e001n02 leader 1571   68250418   68250418   1 -1
>>  2 e001n03 follower   1571   68250418   68250417   -1-1
>> ...
>>
>>
>> admin@e001n02:~$ ip addr show ib0.8003
>> 9: ib0.8003@ib0:  mtu 65520 qdisc mq
>> state UP group default qlen 256
>> link/infiniband 
>> a0:00:03:00:fe:80:00:00:00:00:00:00:00:1e:67:03:00:47:c1:1b
>> brd 00:ff:ff:ff:ff:12:40:1b:80:03:00:00:00:00:00:00:ff:ff:ff:ff
>> inet 10.103.0.2/16 brd 10.103.255.255 scope global ib0.8003
>>valid_lft forever preferred_lft forever
>> inet 10.103.255.254/16 scope global secondary ib0.8003
>>valid_lft forever preferred_lft forever
>> inet6 fe80::21e:6703:47:c11b/64 scope link
>>valid_lft forever preferred_lft forever
>>
>> admin@e001n02:~$ sudo netstat -anp | grep mon
>> tcp0  0 10.103.0.2:6789 0.0.0.0:*
>>  LISTEN  168752/ceph-mon
>> tcp0  0 10.103.0.2:6789 10.103.0.2:44270
>> ESTABLISHED 168752/ceph-mon
>> ...
>>
>> admin@e001n02:~$ sudo netstat -anp | grep osd
>> tcp0  0 10.104.0.2:6800 0.0.0.0:*
>>  LISTEN  6736/ceph-osd
>> tcp0  0 10.104.0.2:6801 0.0.0.0:*
>>  LISTEN  6736/ceph-osd
>> tcp0  0 10.103.0.2:6801 0.0.0.0:*
>>  LISTEN  6736/ceph-osd
>> tcp0  0 10.103.0.2:6802 0.0.0.0:*
>>  LISTEN  6736/ceph-osd
>> tcp0  0 10.104.0.2:6801 10.104.0.6:42868
>> ESTABLISHED 6736/ceph-osd
>> tcp0  0 10.104.0.2:5178810.104.0.1:6800
>>  ESTABLISHED 6736/ceph-osd
>> ...
>>
>> admin@e001n02:~$ sudo ceph -s
>>   cluster:
>> id: 
>> health: HEALTH_OK
>>
>> oneadmin@e001n02:~/remotes/tm$ onedatastore show 0
>> DATASTORE 0 

[ceph-users] Ceph FS (kernel driver) - Unable to set extended file attributes

2018-06-28 Thread Yu Haiyang
Hi,

I want to play around with my ceph.file.layout attributes such as stripe_unit
and object_size to see how they affect my Ceph FS performance.
However, I’ve been unable to set any attribute; I get the error below.

$ setfattr -n ceph.file.layout.stripe_unit -v 41943040 file1
setfattr: file1: Invalid argument

Using strace, I can see it failing at something related to a missing locale
language pack.
Any suggestion how to resolve this?

$ strace setfattr -n ceph.file.layout.stripe_unit -v 41943040 file
open("/usr/share/locale/en_HK/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale-langpack/en_HK/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale-langpack/en/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
write(2, "setfattr: file: Invalid argument"..., 33setfattr: file: Invalid argument
) = 33
exit_group(1)                           = ?
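
For completeness, the layout can also be set in one shot instead of field by
field; a rough sketch (the values are only an example, the file must be empty,
and as far as I know object_size has to be a multiple of stripe_unit):

$ setfattr -n ceph.file.layout \
    -v "stripe_unit=4194304 stripe_count=4 object_size=4194304" file1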

Many thanks,
Haiyang
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] HDD-only performance, how far can it be sped up ?

2018-06-28 Thread Horace
You need 1 core per SATA disk, otherwise your load average will skyrocket
when your system is at full load, rendering the cluster unstable, i.e. ceph-mon
unreachable, slow requests, etc.

Regards,
Horace Ng

- Original Message -
From: "Brian :" 
To: "Wladimir Mutel" , "ceph-users" 
Sent: Wednesday, June 20, 2018 4:17:29 PM
Subject: Re: [ceph-users] HDD-only performance, how far can it be sped up ?

Hi Wladimir,

A combination of a slowish clock speed, erasure coding, a single node
and SATA spinners is probably not going to lead to a really great
evaluation. Some of the experts will chime in here with answers to
your specific questions I'm sure, but this test really isn't ever going
to give great results.

Brian

On Wed, Jun 20, 2018 at 8:28 AM, Wladimir Mutel  wrote:
> Dear all,
>
> I set up a minimal 1-node Ceph cluster to evaluate its performance. We
> tried to save as much as possible on the hardware, so now the box has Asus
> P10S-M WS motherboard, Xeon E3-1235L v5 CPU, 64 GB DDR4 ECC RAM and 8x3TB
> HDDs (WD30EFRX) connected to on-board SATA ports. Also we are trying to save
> on storage redundancy, so for most of our RBD images we use erasure-coded
> data-pool (default profile, jerasure 2+1) instead of 3x replication. I
> started with Luminous/Xenial 12.2.5 setup which initialized my OSDs as
> Bluestore during deploy, then updated it to Mimic/Bionic 13.2.0. Base OS is
> Ubuntu 18.04 with kernel updated to 4.17.2 from Ubuntu mainline PPA.
>
> With this setup, I created a number of RBD images to test iSCSI, rbd-nbd
> and QEMU+librbd performance (running QEMU VMs on the same box). And that
> worked moderately well as far as data volume transferred within one session
> was limited. The fastest transfers I had with 'rbd import' which pulled an
> ISO image file at up to 25 MBytes/sec from the remote CIFS share over
> Gigabit Ethernet and stored it into EC data-pool. Windows 2008 R2 & 2016
> setup, update installation, Win 2008 upgrade to 2012 and to 2016 within QEMU
> VM also went through tolerably well. I found that cache=writeback gives the
> best performance with librbd, unlike cache=unsafe which gave the best
> performance with VMs on plain local SATA drives. Also I have a subjective
> feeling (not confirmed by exact measurements) that providing a huge
> libRBD cache (like, cache size = 1GB, max dirty = 7/8GB, max dirty age = 60)
> improved Windows VM performance on bursty writes (like, during Windows
> update installations) as well as on reboots (due to cached reads).
>
> Now, what discouraged me, was my next attempt to clone an NTFS partition
> of ~2TB from a physical drive (via USB3-SATA3 convertor) to a partition on
> an RBD image. I tried to map RBD image with rbd-nbd either locally or
> remotely over Gigabit Ethernet, and the fastest speed I got with ntfsclone
> was about 8 MBytes/sec. Which means that it could spend up to 3 days copying
> these ~2TB of NTFS data. I thought about running
> ntfsclone /dev/sdX1 -o - | rbd import ... - , but ntfsclone needs to rewrite
> a part of existing RBD image starting from certain offset, so I decided this
> was not a solution in my situation. Now I am thinking about taking out one
> of OSDs and using it as a 'bcache' for this operation, but I am not sure how
> good is bcache performance with cache on rotating HDD. I know that keeping
> OSD logs and RocksDB on the same HDD creates a seeky workload which hurts
> overall transfer performance.
>
> Also I am thinking about a number of next-close possibilities, and I
> would like to hear your opinions on the benefits and drawbacks of each of
> them.
>
> 1. Would iSCSI access to that RBD image improve my performance (compared
> to rbd-nbd) ? I did not check that yet, but I noticed that Windows
> transferred about 2.5 MBytes/sec while formatting NTFS volume on this RBD
> attached to it by iSCSI. So, for seeky/sparse workloads like NTFS formatting
> the performance was not great.
>
> 2. Would it help to run ntfsclone in Linux VM, with RBD image accessed
> through QEMU+librbd ? (also going to measure that myself)
>
> 3. Are there any performance benefits in using Ceph cache-tier pools with
> my setup? I hear use of this technique is now advised against, no?
>
> 4. We have an unused older box (Supermicro X8SIL-F mobo, Xeon X3430 CPU,
> 32 GB of DDR3 ECC RAM, 6 onboard SATA ports, used from 2010 to 2017, in
> perfectly working condition) which can be stuffed with up to 6 SATA HDDs and
> added to this Ceph cluster, so far with only Gigabit network interconnect.
> Like, move 4 OSDs out of first box into it, to have 2 boxes with 4 HDDs
> each. Is this going to improve Ceph performance with the setup described
> above ?
>
> 5. I hear that RAID controllers like Adaptec 5805, LSI 2108 provide
> better performance with SATA HDDs exported as JBODs than onboard SATA AHCI
> controllers due to more aggressive caching and reordering requests. Is this
> true ?
>
> 6. On the 

Re: [ceph-users] VMWARE and RBD

2018-06-28 Thread Horace
Seems there's no plan for that, and the VMware kernel documentation is only
shared with partners. You would be better off using iSCSI. By the way, I found
that the performance is much better with SCST than with ceph-iscsi. I don't
think ceph-iscsi is production-ready?

Regards, 
Horace Ng 


From: "Steven Vacaroaia"  
To: "ceph-users"  
Sent: Tuesday, June 19, 2018 12:08:40 AM 
Subject: [ceph-users] VMWARE and RBD 

Hi, 
I read somewhere that VMWare is planning to support RBD directly 

Does anyone here know more about this... maybe a tentative date / version?

Thanks 
Steven 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] cephfs compression?

2018-06-28 Thread Youzhong Yang
For RGW, compression works very well. We use rgw to store crash dumps, in
most cases, the compression ratio is about 2.0 ~ 4.0.

I tried to enable compression for cephfs data pool:

# ceph osd pool get cephfs_data all | grep ^compression
compression_mode: force
compression_algorithm: lz4
compression_required_ratio: 0.95
compression_max_blob_size: 4194304
compression_min_blob_size: 4096

(we built ceph packages and enabled lz4.)

It doesn't seem to work. I copied an 8.7GB folder to cephfs, ceph df says it
used 8.7GB:

root@ceph-admin:~# ceph df
GLOBAL:
SIZE   AVAIL  RAW USED %RAW USED
16 TiB 16 TiB  111 GiB  0.69
POOLS:
NAME            ID USED    %USED MAX AVAIL OBJECTS
cephfs_data 1  8.7 GiB  0.17   5.0 TiB  360545
cephfs_metadata 2  221 MiB 0   5.0 TiB   77707

I know this folder can be compressed to ~4.0GB under zfs lz4 compression.

Am I missing anything? How do I make cephfs compression work? Is there any
trick?

By the way, I am evaluating ceph mimic v13.2.0.

Thanks in advance,
--Youzhong
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph behavior on (lots of) small objects (RGW, RADOS + erasure coding)?

2018-06-28 Thread Gregory Farnum
On Wed, Jun 27, 2018 at 2:32 AM Nicolas Dandrimont <
ol...@softwareheritage.org> wrote:

> Hi,
>
> I would like to use ceph to store a lot of small objects. Our current usage
> pattern is 4.5 billion unique objects, ranging from 0 to 100MB, with a
> median
> size of 3-4kB. Overall, that's around 350 TB of raw data to store, which
> isn't
> much, but that's across a *lot* of tiny files.
>
> We expect a growth pattern of around a third per year, and the object size
> distribution to stay roughly the same (it's been stable for the past three
> years, and we don't see that changing).
>
> Our object access pattern is a very simple key -> value store, where the
> key
> happens to be the sha1 of the content we're storing. Any metadata are
> stored
> externally and we really only need a dumb object storage.
>
> Our redundancy requirement is to be able to withstand the loss of 2 OSDs.
>
> After looking at our options for storage in Ceph, I dismissed (perhaps
> hastily)
> RGW for its metadata overhead, and went straight to plain RADOS. I've
> setup an
> erasure coded storage pool, with default settings, with k=5 and m=2
> (expecting
> a 40% increase in storage use over plain contents).
>
> After storing objects in the pool, I see a storage usage of 700% instead of
> 140%. My understanding of the erasure code profile docs[1] is that objects
> that
> are below the stripe width (k * stripe_unit, which in my case is 20KB)
> can't be
> chunked for erasure coding, which makes RADOS fall back to plain object
> copying, with k+m copies.
>
> [1]
> http://docs.ceph.com/docs/master/rados/operations/erasure-code-profile/
>
> Is my understanding correct? Does anyone have experience with this kind of
> storage workload in Ceph?


That’s close but not *quite* right. It’s not that Ceph will explicitly
“fall back” to replication. In most (though perhaps not all) erasure codes,
what you’ll see is full-sized parity blocks, a full store of the data (in
the default Reed-Solomon that will just be full-sized chunks up to however
many are needed to store it fully in a single copy), and the remaining data
chunks (out of the k) will have no data. *But* Ceph will keep the “object
info” metadata in each shard, so all the OSDs in a PG will still witness
all the writes.



> If my understanding is correct, I'll end up adding size tiering on my
> object
> storage layer, shuffling objects in two pools with different settings
> according
> to their size. That's not too bad, but I'd like to make sure I'm not
> completely
> misunderstanding something.
>

That’s probably a reasonable response, especially if you are already
maintaining an index for other purposes!
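
As a very rough sketch of what that size-based routing could look like from a
client (pool names and the 20 KB threshold are made up, and $obj is the file
being stored):

$ size=$(stat -c%s "$obj")
$ if [ "$size" -lt 20480 ]; then pool=objects-small-rep; else pool=objects-big-ec; fi
$ rados -p "$pool" put "$(sha1sum "$obj" | cut -d' ' -f1)" "$obj"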
-Greg



> Thanks!
> --
> Nicolas Dandrimont
> Backend Engineer, Software Heritage
>
> BOFH excuse #170:
> popper unable to process jumbo kernel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph snapshots

2018-06-28 Thread Gregory Farnum
You may find my talk at OpenStack Boston’s Ceph day last year to be useful:
https://www.youtube.com/watch?v=rY0OWtllkn8
-Greg
On Wed, Jun 27, 2018 at 9:06 AM Marc Schöchlin  wrote:

> Hello list,
>
> I currently hold 3 snapshots per rbd image for my virtual systems.
>
> What I miss in the current documentation:
>
>   * details about the implementation of snapshots
>       o implementation details
>       o which scenarios create high overhead per snapshot
>       o what causes the really short performance degradation on snapshot
>         creation/deletion
>       o why I do not see a significant rbd performance degradation if
>         there are numerous snapshots
>       o ...
>   * details and recommendations about the overhead of snapshots
>       o what performance penalty I have to expect for a write/read IOP
>       o what are the edge cases of the implementation
>       o how many snapshots per image (i.e. virtual machine) might be a
>         good idea
>       o ...
>
> Regards
> Marc
>
>
> Am 27.06.2018 um 15:37 schrieb Brian ::
> > Hi John
> >
> > Have you looked at ceph documentation?
> >
> > RBD: http://docs.ceph.com/docs/luminous/rbd/rbd-snapshot/
> >
> > The ceph project documentation is really good for most areas. Have a
> > look at what you can find then come back with more specific questions!
> >
> > Thanks
> > Brian
> >
> >
> >
> >
> > On Wed, Jun 27, 2018 at 2:24 PM, John Molefe 
> wrote:
> >> Hi everyone
> >>
> >> I would like some advice and insight into how ceph snapshots work and
> how it
> >> can be setup.
> >>
> >> Responses will be much appreciated.
> >>
> >> Thanks
> >> John
> >>
> >> Vrywaringsklousule / Disclaimer:
> >> http://www.nwu.ac.za/it/gov-man/disclaimer.html
> >>
> >>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph FS Random Write 4KB block size only 2MB/s?!

2018-06-28 Thread Yan, Zheng
On Fri, Jun 29, 2018 at 10:01 AM Yu Haiyang  wrote:
>
> Ubuntu 16.04.3 LTS
>

4.4 kernel?  AIO on cephfs is not supported by the 4.4 kernel; AIO there is
actually synchronous IO.  4.5 is the first kernel version that supports AIO
on cephfs.
> On Jun 28, 2018, at 9:00 PM, Yan, Zheng  wrote:
>
> kernel version?
>
> On Thu, Jun 28, 2018 at 5:38 PM Yu Haiyang  wrote:
>>
>> Here you go. Below are the fio job options and the results.
>>
>> blocksize=4K
>> size=500MB
>> directory=[ceph_fs_mount_directory]
>> ioengine=libaio
>> iodepth=64
>> direct=1
>> runtime=60
>> time_based
>> group_reporting
>>
>> numjobs   Ceph FS Erasure Coding (k=2, m=1)   Ceph FS 3 Replica
>> 1 job     577KB/s                             765KB/s
>> 2 job     1.27MB/s                            793KB/s
>> 4 job     2.33MB/s                            1.36MB/s
>> 8 job     4.14MB/s                            2.36MB/s
>> 16 job    6.87MB/s                            4.40MB/s
>> 32 job    11.07MB/s                           8.17MB/s
>> 64 job    13.75MB/s                           15.84MB/s
>> 128 job   10.46MB/s                           26.82MB/s
>>
>> On Jun 28, 2018, at 5:01 PM, Yan, Zheng  wrote:
>>
>> On Thu, Jun 28, 2018 at 10:30 AM Yu Haiyang  wrote:
>>
>>
>> Hi Yan,
>>
>> Thanks for your suggestion.
>> No, I didn’t run fio on ceph-fuse. I mounted my Ceph FS in kernel mode.
>>
>>
>> command option of fio ?
>>
>> Regards,
>> Haiyang
>>
>> On Jun 27, 2018, at 9:45 PM, Yan, Zheng  wrote:
>>
>> On Wed, Jun 27, 2018 at 8:04 PM Yu Haiyang  wrote:
>>
>>
>> Hi All,
>>
>> Using fio with job number ranging from 1 to 128, the random write speed for 
>> 4KB block size has been consistently around 1MB/s to 2MB/s.
>> Random read of the same block size can reach 60MB/s with 32 jobs.
>>
>>
>> Did you run fio on ceph-fuse? If I remember right, fio does 1-byte writes;
>> the overhead of passing the 1 byte to ceph-fuse is too high.
>>
>>
>> Our ceph cluster consists of 4 OSDs all running on SSD connected through a 
>> switch with 9.06 Gbits/sec bandwidth.
>> Any suggestion please?
>>
>> Warmest Regards,
>> Haiyang
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous BlueStore OSD - Still a way to pinpoint an object?

2018-06-28 Thread Gregory Farnum
The ceph-objectstore-tool also has an (experimental?) mode to mount the OSD
store as a FUSE filesystem regardless of the backend. But I have to assume
what you’re really after here is repairing individual objects, and the way
that works is different enough in BlueStore that I really wouldn’t worry about
it. The advantage of looking at the raw FileStore filesystem was that you could
fix issues the FS had caused, but obviously BlueStore doesn’t experience
those.
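
A minimal sketch of both approaches (the OSD has to be stopped first, and the
exact flags are from memory, so double-check them against your version):

$ sudo systemctl stop ceph-osd@0
$ sudo ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --op list | head
$ sudo ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --op fuse --mountpoint /mnt/osd0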
-Greg
On Thu, Jun 28, 2018 at 1:54 PM Igor Fedotov  wrote:

> You can access offline OSD using ceph-objectstore-tool which allows to
> enumerate and access specific objects.
> Not sure this makes sense for any purposes other than low-level debugging
> though..
>
>
> Thanks,
>
> Igor
>
>
>
> On 6/28/2018 5:42 AM, Yu Haiyang wrote:
>
> Hi All,
>
> Previously I read this article about how to locate an object on the OSD
> disk.
> Apparently it was on a FileStore-backed disk partition.
>
> Now I have upgraded my Ceph to Luminous and hosted my OSDs on BlueStore
> partition, the OSD directory structure has completely changed.
> The data is mapped to a block device as below and that’s as far as I can
> trace.
>
> *lrwxrwxrwx 1 ceph ceph   93 Jun 24 17:03 block ->
> /dev/ceph-0ec01ce9-d397-43e7-ad62-93cd1c62f75a/osd-block-f590b656-e40c-42f7-8cf9-ca846632d046*
>
> Hence is there still a way to pinpoint an object on a BlueStore disk
> partition?
>
> Best,
> Haiyang
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous Bluestore performance, bcache

2018-06-28 Thread Richard Bade
Hi Andrei,
These are good questions. We have another cluster with filestore and
bcache but for this particular one I was interested in testing out
bluestore. So I have used bluestore both with and without bcache.
For my synthetic load on the VMs I'm using this fio command:
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1
--name=test --filename=test --bs=4k --iodepth=64 --size=4G
--readwrite=randwrite --rate_iops=50

Currently on bluestore with my synthetic load I'm getting a 7% hit ratio
(cat /sys/block/bcache*/bcache/stats_total/cache_hit_ratio).
On our filestore cluster with ~700 VMs of varied workload we're
getting about a 30-35% hit ratio.
In the hourly hit ratio I see as high as 50% on some OSDs in our
filestore cluster, and only 25% on my synthetic load on bluestore so far,
but I hadn't actually been checking this stat until now.
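
To eyeball this across all cache sets at once (total and last hour), something
like this works:

$ for d in /sys/block/bcache*/bcache; do echo "$d: $(cat $d/stats_total/cache_hit_ratio)% total, $(cat $d/stats_hour/cache_hit_ratio)% last hour"; done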

I hope that helps.
Regards,
Richard

> Hi Richard,
> It is an interesting test for me too as I am planning to migrate to
> Bluestore storage and was considering repurposing the ssd disks
> that we currently use for journals.
> I was wondering if you are using Filestore or Bluestore
> for the OSDs?
> Also, when you perform your testing, how good is the hit ratio
> that you have on the bcache?
> Are you using a lot of random data for your benchmarks? How
> large is your test file for each vm?
> We played around with a few caching scenarios a
> few years back (EnhanceIO and a few more which I can't
> remember now) and we saw a very poor hit ratio from the
> caching system. Was wondering if you see a different picture?
> Cheers
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] fixing unrepairable inconsistent PG

2018-06-28 Thread Brad Hubbard
On Fri, Jun 29, 2018 at 2:38 AM, Andrei Mikhailovsky  wrote:
> Hi Brad,
>
> This has helped to repair the issue. Many thanks for your help on this!!!

No problem.

>
> I had so many objects with broken omap checksum, that I spent at least a few 
> hours identifying those and using the commands you've listed to repair. They 
> were all related to one pool called .rgw.buckets.index . All other pools look 
> okay so far.

So originally you said you were having trouble with "one inconsistent
and stubborn PG". When did that become "so many objects"?

>
> I am wondering what could have got horribly wrong with the above pool?

Is that pool 18? I notice it seems to be size 2, what is min_size on that pool?
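
You can check with something like this (ceph osd lspools will confirm the id):

# ceph osd pool get .rgw.buckets.index size
# ceph osd pool get .rgw.buckets.index min_size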

As to working out what went wrong. What event(s) coincided with or
preceded the problem? What history can you provide? What data can you
provide from the time leading up to when the issue was first seen?

>
> Cheers
>
> Andrei
> - Original Message -
>> From: "Brad Hubbard" 
>> To: "Andrei Mikhailovsky" 
>> Cc: "ceph-users" 
>> Sent: Thursday, 28 June, 2018 01:08:34
>> Subject: Re: [ceph-users] fixing unrepairable inconsistent PG
>
>> Try the following. You can do this with all osds up and running.
>>
>> # rados -p [name_of_pool_18] setomapval .dir.default.80018061.2
>> temporary-key anything
>> # ceph pg deep-scrub 18.2
>>
>> Once you are sure the scrub has completed and the pg is no longer
>> inconsistent you can remove the temporary key.
>>
>> # rados -p [name_of_pool_18] rmomapkey .dir.default.80018061.2 temporary-key
>>
>>
>> On Wed, Jun 27, 2018 at 9:42 PM, Andrei Mikhailovsky  
>> wrote:
>>> Here is one more thing:
>>>
>>> rados list-inconsistent-obj 18.2
>>> {
>>>"inconsistents" : [
>>>   {
>>>  "object" : {
>>> "locator" : "",
>>> "version" : 632942,
>>> "nspace" : "",
>>> "name" : ".dir.default.80018061.2",
>>> "snap" : "head"
>>>  },
>>>  "union_shard_errors" : [
>>> "omap_digest_mismatch_info"
>>>  ],
>>>  "shards" : [
>>> {
>>>"osd" : 21,
>>>"primary" : true,
>>>"data_digest" : "0x",
>>>"omap_digest" : "0x25e8a1da",
>>>"errors" : [
>>>   "omap_digest_mismatch_info"
>>>],
>>>"size" : 0
>>> },
>>> {
>>>"data_digest" : "0x",
>>>"primary" : false,
>>>"osd" : 28,
>>>"errors" : [
>>>   "omap_digest_mismatch_info"
>>>],
>>>"omap_digest" : "0x25e8a1da",
>>>"size" : 0
>>> }
>>>  ],
>>>  "errors" : [],
>>>  "selected_object_info" : {
>>> "mtime" : "2018-06-19 16:31:44.759717",
>>> "alloc_hint_flags" : 0,
>>> "size" : 0,
>>> "last_reqid" : "client.410876514.0:1",
>>> "local_mtime" : "2018-06-19 16:31:44.760139",
>>> "data_digest" : "0x",
>>> "truncate_seq" : 0,
>>> "legacy_snaps" : [],
>>> "expected_write_size" : 0,
>>> "watchers" : {},
>>> "flags" : [
>>>"dirty",
>>>"data_digest",
>>>"omap_digest"
>>> ],
>>> "oid" : {
>>>"pool" : 18,
>>>"hash" : 1156456354,
>>>"key" : "",
>>>"oid" : ".dir.default.80018061.2",
>>>"namespace" : "",
>>>"snapid" : -2,
>>>"max" : 0
>>> },
>>> "truncate_size" : 0,
>>> "version" : "120985'632942",
>>> "expected_object_size" : 0,
>>> "omap_digest" : "0x",
>>> "lost" : 0,
>>> "manifest" : {
>>>"redirect_target" : {
>>>   "namespace" : "",
>>>   "snapid" : 0,
>>>   "max" : 0,
>>>   "pool" : -9223372036854775808,
>>>   "hash" : 0,
>>>   "oid" : "",
>>>   "key" : ""
>>>},
>>>"type" : 0
>>> },
>>> "prior_version" : "0'0",
>>> "user_version" : 632942
>>>  }
>>>   }
>>>],
>>>"epoch" : 121151
>>> }
>>>
>>> Cheers
>>>
>>> - Original Message -
 From: "Andrei Mikhailovsky" 
 To: "Brad Hubbard" 
 Cc: "ceph-users" 
 Sent: Wednesday, 27 June, 2018 09:10:07
 Subject: Re: [ceph-users] fixing unrepairable inconsistent PG
>>>
 Hi Brad,

 Thanks, that helped to get the query info on the inconsistent PG 18.2:

 {
"state": "active+clean+inconsistent",
"snap_trimq": "[]",
"snap_trimq_len": 0,
"epoch": 121293,
"up": [
21,

Re: [ceph-users] Ceph Luminous RocksDB vs WalDB?

2018-06-28 Thread Igor Fedotov
The idea is to avoid a separate WAL partition - it doesn't make sense for a
single NVMe device and just complicates things.


And if you don't specify the WAL explicitly, it co-exists with the DB.

Hence I vote for the second option :)
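
A quick way to confirm which layout an OSD actually ended up with is to look
at the symlinks in its data directory (paths assume the default locations):

$ ls -l /var/lib/ceph/osd/ceph-0/block*

If there is a block.db link but no block.wal, the WAL lives on the DB device.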


On 6/29/2018 12:07 AM, Kai Wagner wrote:


I'm also not 100% sure but I think that the first one is the right way 
to go. The second command only specifies the db partition but no 
dedicated WAL partition. The first one should do the trick.



On 28.06.2018 22:58, Igor Fedotov wrote:


I think the second variant is what you need. But I'm not the guru in 
ceph-deploy so there might be some nuances there...


Anyway the general idea is to have just a single NVME partition (for 
both WAL and DB) per OSD.


Thanks,

Igor


On 6/27/2018 11:28 PM, Pardhiv Karri wrote:

Thank you Igor for the response.

So do I need to use this,

ceph-deploy osd create --debug --bluestore --data /dev/sdb 
--block-wal /dev/nvme0n1p1 --block-db /dev/nvme0n1p2 cephdatahost1


or

ceph-deploy osd create --debug --bluestore --data /dev/sdb 
--block-db /dev/nvme0n1p2 cephdatahost1


where /dev/sdb is ssd disk for osd
/dev/nvmen0n1p1 is 10G partition
/dev/nvme0n1p2 is 25G partition


Thanks,
Pardhiv K

On Wed, Jun 27, 2018 at 9:08 AM Igor Fedotov > wrote:


Hi Pardhiv,

there is no WalDB in Ceph.

It's WAL (Write Ahead Log) that is a way to ensure write safety
in RocksDB. In other words - that's just a RocksDB subsystem
which can use separate volume though.

In general For BlueStore/BlueFS one can either allocate separate
volumes for WAL and DB or have them on the same volume. The
latter is the common option.

The separated layout makes sense when you have tiny but
super-fast device (for WAL) and less effective (but still fast)
larger drive for DB. Not to mention the third one for user data

E.g. HDD (user data) + SSD (DB) + NVMe (WAL) is such a layout.


So for you case IMO it's optimal to have merged WAL+DB at NVME
and data at SSD. Hence no need for separate WAL volume.


Regards,

Igor


On 6/26/2018 10:22 PM, Pardhiv Karri wrote:

Hi,

I am playing with Ceph Luminous and getting confused
information around usage of WalDB vs RocksDB.

I have 2TB NVMe drive which I want to use for Wal/Rocks DB and
have 5 2TB SSD's for OSD.
I am planning to create 5 30GB partitions for RocksDB on NVMe
drive, do I need to create partitions of WalDB also on NVMe
drive or does RocksDB does same work as WalDB plus having
metadata on it?

So my question is do I really need to use WalDB along with
RocksDB or having RocksDB only is fine?

Thanks,
Pardhiv K








--
*Pardhiv Karri*
"Rise and Rise again untilLAMBSbecome LIONS"








--
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 
(AG Nürnberg)




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Luminous RocksDB vs WalDB?

2018-06-28 Thread Kai Wagner
On 28.06.2018 23:25, Eric Jackson wrote:
> Recently, I learned that this is not necessary when both are on the same 
> device.  The wal for the Bluestore OSD will use the db device when set to 0.
That's good to know. Thanks for the input on this Eric.

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 
(AG Nürnberg)




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Luminous RocksDB vs WalDB?

2018-06-28 Thread Eric Jackson
I'm going to hope that Igor is correct since I have a PR for DeepSea to change 
this exact behavior.

With respect to ceph-deploy, if you specify --block-wal, your OSD will have a 
block.wal symlink.  Likewise, --block-db will give you a block.db symlink.

If you have both on the command line, you will get both.  That does work, but 
this also means twice the partitions to manage on a shared device.  We have 
been doing Bluestore this way in DeepSea since we started supporting 
Bluestore.

Recently, I learned that this is not necessary when both are on the same 
device.  The wal for the Bluestore OSD will use the db device when set to 0.  
I will soon verify that also means when the block.wal is absent.  

Eric

On Thursday, June 28, 2018 5:07:25 PM EDT Kai Wagner wrote:
> I'm also not 100% sure but I think that the first one is the right way
> to go. The second command only specifies the db partition but no
> dedicated WAL partition. The first one should do the trick.
> 
> On 28.06.2018 22:58, Igor Fedotov wrote:
> > I think the second variant is what you need. But I'm not the guru in
> > ceph-deploy so there might be some nuances there...
> > 
> > Anyway the general idea is to have just a single NVME partition (for
> > both WAL and DB) per OSD.
> > 
> > Thanks,
> > 
> > Igor
> > 
> > On 6/27/2018 11:28 PM, Pardhiv Karri wrote:
> >> Thank you Igor for the response.
> >> 
> >> So do I need to use this,
> >> 
> >> ceph-deploy osd create --debug --bluestore --data /dev/sdb
> >> --block-wal /dev/nvme0n1p1 --block-db /dev/nvme0n1p2 cephdatahost1
> >> 
> >> or 
> >> 
> >> ceph-deploy osd create --debug --bluestore --data /dev/sdb --block-db
> >> /dev/nvme0n1p2 cephdatahost1
> >> 
> >> where /dev/sdb is ssd disk for osd
> >> /dev/nvmen0n1p1 is 10G partition
> >> /dev/nvme0n1p2 is 25G partition
> >> 
> >> 
> >> Thanks,
> >> Pardhiv K
> >> 
> >> On Wed, Jun 27, 2018 at 9:08 AM Igor Fedotov  >> 
> >> > wrote:
> >> Hi Pardhiv,
> >> 
> >> there is no WalDB in Ceph.
> >> 
> >> It's WAL (Write Ahead Log) that is a way to ensure write safety
> >> in RocksDB. In other words - that's just a RocksDB subsystem
> >> which can use separate volume though.
> >> 
> >> In general For BlueStore/BlueFS one can either allocate separate
> >> volumes for WAL and DB or have them on the same volume. The
> >> latter is the common option.
> >> 
> >> The separated layout makes sense when you have tiny but
> >> super-fast device (for WAL) and less effective (but still fast)
> >> larger drive for DB. Not to mention the third one for user data
> >> 
> >> E.g. HDD (user data) + SSD (DB) + NVMe (WAL) is such a layout.
> >> 
> >> 
> >> So for you case IMO it's optimal to have merged WAL+DB at NVME
> >> and data at SSD. Hence no need for separate WAL volume.
> >> 
> >> 
> >> Regards,
> >> 
> >> Igor
> >> 
> >> On 6/26/2018 10:22 PM, Pardhiv Karri wrote:
> >>> Hi,
> >>> 
> >>> I am playing with Ceph Luminous and getting confused information
> >>> around usage of WalDB vs RocksDB.
> >>> 
> >>> I have 2TB NVMe drive which I want to use for Wal/Rocks DB and
> >>> have 5 2TB SSD's for OSD. 
> >>> I am planning to create 5 30GB partitions for RocksDB on NVMe
> >>> drive, do I need to create partitions of WalDB also on NVMe
> >>> drive or does RocksDB does same work as WalDB plus having
> >>> metadata on it? 
> >>> 
> >>> So my question is do I really need to use WalDB along with
> >>> RocksDB or having RocksDB only is fine?
> >>> 
> >>> Thanks,
> >>> Pardhiv K
> >>> 
> >>> 
> >>> 
> >>> 
> >>> 



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Luminous RocksDB vs WalDB?

2018-06-28 Thread Kai Wagner
I'm also not 100% sure but I think that the first one is the right way
to go. The second command only specifies the db partition but no
dedicated WAL partition. The first one should do the trick.


On 28.06.2018 22:58, Igor Fedotov wrote:
>
> I think the second variant is what you need. But I'm not the guru in
> ceph-deploy so there might be some nuances there...
>
> Anyway the general idea is to have just a single NVME partition (for
> both WAL and DB) per OSD.
>
> Thanks,
>
> Igor
>
>
> On 6/27/2018 11:28 PM, Pardhiv Karri wrote:
>> Thank you Igor for the response.
>>
>> So do I need to use this,
>>
>> ceph-deploy osd create --debug --bluestore --data /dev/sdb
>> --block-wal /dev/nvme0n1p1 --block-db /dev/nvme0n1p2 cephdatahost1
>>
>> or 
>>
>> ceph-deploy osd create --debug --bluestore --data /dev/sdb --block-db
>> /dev/nvme0n1p2 cephdatahost1
>>
>> where /dev/sdb is ssd disk for osd
>> /dev/nvmen0n1p1 is 10G partition
>> /dev/nvme0n1p2 is 25G partition
>>
>>
>> Thanks,
>> Pardhiv K
>>
>> On Wed, Jun 27, 2018 at 9:08 AM Igor Fedotov > > wrote:
>>
>> Hi Pardhiv,
>>
>> there is no WalDB in Ceph.
>>
>> It's WAL (Write Ahead Log) that is a way to ensure write safety
>> in RocksDB. In other words - that's just a RocksDB subsystem
>> which can use separate volume though.
>>
>> In general For BlueStore/BlueFS one can either allocate separate
>> volumes for WAL and DB or have them on the same volume. The
>> latter is the common option.
>>
>> The separated layout makes sense when you have tiny but
>> super-fast device (for WAL) and less effective (but still fast)
>> larger drive for DB. Not to mention the third one for user data
>>
>> E.g. HDD (user data) + SSD (DB) + NVMe (WAL) is such a layout.
>>
>>
>> So for you case IMO it's optimal to have merged WAL+DB at NVME
>> and data at SSD. Hence no need for separate WAL volume.
>>
>>
>> Regards,
>>
>> Igor
>>
>>
>> On 6/26/2018 10:22 PM, Pardhiv Karri wrote:
>>> Hi,
>>>
>>> I am playing with Ceph Luminous and getting confused information
>>> around usage of WalDB vs RocksDB.
>>>
>>> I have 2TB NVMe drive which I want to use for Wal/Rocks DB and
>>> have 5 2TB SSD's for OSD. 
>>> I am planning to create 5 30GB partitions for RocksDB on NVMe
>>> drive, do I need to create partitions of WalDB also on NVMe
>>> drive or does RocksDB does same work as WalDB plus having
>>> metadata on it? 
>>>
>>> So my question is do I really need to use WalDB along with
>>> RocksDB or having RocksDB only is fine?
>>>
>>> Thanks,
>>> Pardhiv K
>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>> -- 
>> *Pardhiv Karri*
>> "Rise and Rise again untilLAMBSbecome LIONS" 
>>
>>
>
>
>

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 
(AG Nürnberg)



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Luminous RocksDB vs WalDB?

2018-06-28 Thread Igor Fedotov
I think the second variant is what you need. But I'm not the guru in 
ceph-deploy so there might be some nuances there...


Anyway the general idea is to have just a single NVME partition (for 
both WAL and DB) per OSD.


Thanks,

Igor


On 6/27/2018 11:28 PM, Pardhiv Karri wrote:

Thank you Igor for the response.

So do I need to use this,

ceph-deploy osd create --debug --bluestore --data /dev/sdb --block-wal 
/dev/nvme0n1p1 --block-db /dev/nvme0n1p2 cephdatahost1


or

ceph-deploy osd create --debug --bluestore --data /dev/sdb --block-db 
/dev/nvme0n1p2 cephdatahost1


where /dev/sdb is ssd disk for osd
/dev/nvmen0n1p1 is 10G partition
/dev/nvme0n1p2 is 25G partition


Thanks,
Pardhiv K

On Wed, Jun 27, 2018 at 9:08 AM Igor Fedotov > wrote:


Hi Pardhiv,

there is no WalDB in Ceph.

It's WAL (Write Ahead Log) that is a way to ensure write safety in
RocksDB. In other words - that's just a RocksDB subsystem which
can use separate volume though.

In general For BlueStore/BlueFS one can either allocate separate
volumes for WAL and DB or have them on the same volume. The latter
is the common option.

The separated layout makes sense when you have tiny but super-fast
device (for WAL) and less effective (but still fast) larger drive
for DB. Not to mention the third one for user data

E.g. HDD (user data) + SSD (DB) + NVMe (WAL) is such a layout.


So for you case IMO it's optimal to have merged WAL+DB at NVME and
data at SSD. Hence no need for separate WAL volume.


Regards,

Igor


On 6/26/2018 10:22 PM, Pardhiv Karri wrote:

Hi,

I am playing with Ceph Luminous and getting confused information
around usage of WalDB vs RocksDB.

I have 2TB NVMe drive which I want to use for Wal/Rocks DB and
have 5 2TB SSD's for OSD.
I am planning to create 5 30GB partitions for RocksDB on NVMe
drive, do I need to create partitions of WalDB also on NVMe drive
or does RocksDB does same work as WalDB plus having metadata on it?

So my question is do I really need to use WalDB along with
RocksDB or having RocksDB only is fine?

Thanks,
Pardhiv K








--
*Pardhiv Karri*
"Rise and Rise again untilLAMBSbecome LIONS"




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous BlueStore OSD - Still a way to pinpoint an object?

2018-06-28 Thread Igor Fedotov
You can access offline OSD using ceph-objectstore-tool which allows to 
enumerate and access specific objects.


Not sure this makes sense for any purposes other than low-level 
debugging though..



Thanks,

Igor



On 6/28/2018 5:42 AM, Yu Haiyang wrote:

Hi All,

Previously I read this article about how to locate an object on the 
OSD disk.

Apparently it was on a FileStore-backed disk partition.

Now I have upgraded my Ceph to Luminous and hosted my OSDs on 
BlueStore partition, the OSD directory structure has completely changed.
The data is mapped to a block device as below and that’s as far as I 
can trace.


/lrwxrwxrwx 1 ceph ceph   93 Jun 24 17:03 block -> 
/dev/ceph-0ec01ce9-d397-43e7-ad62-93cd1c62f75a/osd-block-f590b656-e40c-42f7-8cf9-ca846632d046/


Hence is there still a way to pinpoint an object on a BlueStore 
disk partition?


Best,
Haiyang




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Many inconsistent PGs in EC pool, is this normal?

2018-06-28 Thread Paul Emmerich
Are you running tight on RAM?
You might be running into http://tracker.ceph.com/issues/22464
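
One quick check is to compare each OSD's actual memory footprint on the
affected hosts against what you expect, e.g. (osd.206 is just taken from your
pg listing below):

$ sudo ceph daemon osd.206 dump_mempools
$ ps -C ceph-osd -o pid,rss,cmd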


Paul

2018-06-28 17:17 GMT+02:00 Bryan Banister :

> Hi all,
>
>
>
> We started running an EC pool based object store, set up with a 4+2
> configuration, and we seem to be getting an almost constant report of
> inconsistent PGs during scrub operations.  For example:
>
> root@rook-tools:/# ceph pg ls inconsistent
>
> PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES
> LOG  DISK_LOG STATE STATE_STAMP
> VERSION   REPORTED  UP  UP_PRIMARY
> ACTING  ACTING_PRIMARY LAST_SCRUB
> SCRUB_STAMPLAST_DEEP_SCRUB DEEP_SCRUB_STAMP
>
> 19.26   104  00 0   0 436207616
> 1537 1537 active+clean+inconsistent 2018-06-28 15:04:54.054227
> 1811'3137 2079:5075 [206,116,68,31,193,156]206
> [206,116,68,31,193,156]206  1811'3137 2018-06-27
> 21:00:17.011611   1811'3137 2018-06-27 21:00:17.011611
>
> 19.234   98  00 0   0 406847488
> 1581 1581 active+clean+inconsistent 2018-06-28 15:05:18.077003
> 1811'2981 2080:4822  [28,131,229,180,84,68] 28
> [28,131,229,180,84,68] 28  1811'2981 2018-06-28
> 14:09:54.092401   1811'2981 2018-06-28 14:09:54.092401
>
> 19.2a8  116  00 0   0 486539264
> 1561 1561 active+clean+inconsistent 2018-06-28 15:04:54.073762
> 1811'3161 2079:4825 [177,68,222,13,131,107]177
> [177,68,222,13,131,107]177  1811'3161 2018-06-28
> 07:51:21.109587   1811'3161 2018-06-28 07:51:21.109587
>
> 19.406  126  00 0   0 520233399
> 1557 1557 active+clean+inconsistent 2018-06-28 15:04:57.142651
> 1811'3057 2080:4944  [230,199,128,68,92,11]230
> [230,199,128,68,92,11]230  1811'3057 2018-06-27
> 18:36:18.497899   1811'3057 2018-06-27 18:36:18.497899
>
> 19.46b  109  00 0   0 449840274
> 1558 1558 active+clean+inconsistent 2018-06-28 15:04:54.227970
> 1811'3058 2079:4986  [18,68,130,94,181,225] 18
> [18,68,130,94,181,225] 18  1811'3058 2018-06-27
> 14:32:17.800961   1811'3058 2018-06-27 14:32:17.800961
>
> [snip]
>
>
>
> We sometimes see that running a deep scrub on the PG resolves the issue
> but not all the time.
>
>
>
> We have been running the PG repair operation on them (e.g. ceph pg repair
> ), which clears the issue.  Is this the correct way to resolve this
> issue?
>
>
>
> Is this a normal behavior for a Ceph cluster?
>
>
>
> If so, why doesn’t it attempt to repair itself automatically?
>
>
>
> Thanks for the help understanding Ceph, we are very new to it still!
>
> -Bryan
>
>
>
>
>


-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] radosgw multi file upload failure

2018-06-28 Thread Melzer Pinto
Hello,
Recently we've observed on one of our ceph clusters that uploading a large
number of small files (~2000 x 2k) fails. The HTTP return code shows 200 but the
file upload fails. Here is an example from the log:

2018-06-27 07:34:40.624103 7f0dc67cc700  1 == starting new request 
req=0x7f0dc67c68a0 =
2018-06-27 07:34:40.645039 7f0dc3fc7700  1 == starting new request 
req=0x7f0dc3fc18a0 =
2018-06-27 07:34:40.682108 7f0dc3fc7700  0 WARNING: couldn't find acl header 
for object, generating default
2018-06-27 07:34:40.962674 7f0dcbfd7700  0 ERROR: client_io->complete_request() 
returned -5
2018-06-27 07:34:40.962689 7f0dcbfd7700  1 == req done req=0x7f0dcbfd18a0 
op status=0 http_status=200 ==
2018-06-27 07:34:40.962738 7f0dcbfd7700  1 civetweb: 0x7f0df4004160: 10.x.x.x. 
- - [27/Jun/2018:07:34:34 +] "POST - HTTP/1.1" 200 0 - 
aws-sdk-java/1.6.4 Linux/3.17.6-200.fc20.x86_64 
Java_HotSpot(TM)_64-Bit_Server_VM/25.73-b02

I tried tuning the performance using the parameters below, but the file upload
failures still occur, so I suspect this is not a concurrency issue.
rgw num rados handles = 8
rgw thread pool size = 512
rgw frontends = civetweb port=7480 num_threads=512

I also tried increasing the logging level for rgw and civetweb to 20/5, but I
don't see anything that points to the issue.

2018-06-28 18:00:24.575460 7f3d7dfc3700 20 get_obj_state: s->obj_tag was set 
empty
2018-06-28 18:00:24.575491 7f3d7dfc3700 20 get_obj_state: rctx=0x7f3d7dfbcff0 
obj=files:_multipart_-.error.2~Rh1AqHvzgCPc0NGWMl-FHE0Y-HvCcmk.1 
state=0x7f3e04024d88 s->prefetch_data=0
2018-06-28 18:00:24.575496 7f3d7dfc3700 20 prepare_atomic_modification: state 
is not atomic. state=0x7f3e04024d88
2018-06-28 18:00:24.57 7f3d7dfc3700 20 reading from 
default.rgw.data.root:.bucket.meta.files:-.6432.11
2018-06-28 18:00:24.575567 7f3d7dfc3700 20 get_system_obj_state: 
rctx=0x7f3d7dfbb5d0 
obj=default.rgw.data.root:.bucket.meta.files:-.6432.11 
state=0x7f3e04001228 s->prefetch_data
2018-06-28 18:00:24.575577 7f3d7dfc3700 10 cache get: 
name=default.rgw.data.root+.bucket.meta.files:-.6432.11 : hit 
(requested=22, cached=23)
2018-06-28 18:00:24.575586 7f3d7dfc3700 20 get_system_obj_state: s->obj_tag was 
set empty
2018-06-28 18:00:24.575592 7f3d7dfc3700 10 cache get: 
name=default.rgw.data.root+.bucket.meta.files:-.6432.11 : hit 
(requested=17, cached=23)
2018-06-28 18:00:24.575614 7f3d7dfc3700 20  bucket index object: 
.dir.-.6432.11
2018-06-28 18:00:24.606933 7f3d67796700  2 req 9567:5.505460:s3:POST 
-.error:init_multipart:completing
2018-06-28 18:00:24.607025 7f3d67796700  0 ERROR: client_io->complete_request() 
returned -5
2018-06-28 18:00:24.607036 7f3d67796700  2 req 9567:5.505572:s3:POST 
-.error:init_multipart:op status=0
2018-06-28 18:00:24.607040 7f3d67796700  2 req 9567:5.505578:s3:POST 
-.error:init_multipart:http status=200
2018-06-28 18:00:24.607046 7f3d67796700  1 == req done req=0x7f3d677908a0 
op status=0 http_status=200 ==

The cluster is a 12-node Ceph Jewel (10.2.10-1~bpo80+1) cluster. The operating
system is Debian 8.9.
ceph.conf
[global]
fsid = 314d4121-46b1-4433-9bae-fdd2803fc24b
mon_initial_members = ceph-1,ceph-2,ceph-3
mon_host = 10.x.x.x, 10.x.x.x, 10.x.x.x
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
public network = 10.x.x.x
osd_journal_size = 10240
osd_mount_options_xfs = rw,noexec,noatime,nodiratime,inode64
osd_pool_default_size = 3
osd_pool_default_min_size = 2
osd_pool_default_pg_num = 900
osd_pool_default_pgp_num = 900
log to syslog = true
err to syslog = true
clog to syslog = true
rgw dns name = xxx.com
rgw num rados handles = 8
rgw thread pool size = 512
rgw frontends = civetweb port=7480 num_threads=512
debug rgw = 20/5
debug civetweb = 20/5

[mon]
mon cluster log to syslog = true


Any idea what the issue could be here?

Thanks
Mel

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] fixing unrepairable inconsistent PG

2018-06-28 Thread Andrei Mikhailovsky
Hi Brad,

This has helped to repair the issue. Many thanks for your help on this!!!

I had so many objects with broken omap checksums that I spent at least a few
hours identifying them and repairing them with the commands you listed. They
were all related to one pool, .rgw.buckets.index. All other pools look
okay so far.
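
For anyone hitting the same thing, the affected objects can be listed per PG
with something along these lines (assumes jq is installed):

$ for pg in $(rados list-inconsistent-pg .rgw.buckets.index | jq -r '.[]'); do rados list-inconsistent-obj $pg | jq -r '.inconsistents[].object.name'; done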

I am wondering what could have gone horribly wrong with the above pool?

Cheers

Andrei
- Original Message -
> From: "Brad Hubbard" 
> To: "Andrei Mikhailovsky" 
> Cc: "ceph-users" 
> Sent: Thursday, 28 June, 2018 01:08:34
> Subject: Re: [ceph-users] fixing unrepairable inconsistent PG

> Try the following. You can do this with all osds up and running.
> 
> # rados -p [name_of_pool_18] setomapval .dir.default.80018061.2
> temporary-key anything
> # ceph pg deep-scrub 18.2
> 
> Once you are sure the scrub has completed and the pg is no longer
> inconsistent you can remove the temporary key.
> 
> # rados -p [name_of_pool_18] rmomapkey .dir.default.80018061.2 temporary-key
> 
> 
> On Wed, Jun 27, 2018 at 9:42 PM, Andrei Mikhailovsky  
> wrote:
>> Here is one more thing:
>>
>> rados list-inconsistent-obj 18.2
>> {
>>"inconsistents" : [
>>   {
>>  "object" : {
>> "locator" : "",
>> "version" : 632942,
>> "nspace" : "",
>> "name" : ".dir.default.80018061.2",
>> "snap" : "head"
>>  },
>>  "union_shard_errors" : [
>> "omap_digest_mismatch_info"
>>  ],
>>  "shards" : [
>> {
>>"osd" : 21,
>>"primary" : true,
>>"data_digest" : "0x",
>>"omap_digest" : "0x25e8a1da",
>>"errors" : [
>>   "omap_digest_mismatch_info"
>>],
>>"size" : 0
>> },
>> {
>>"data_digest" : "0x",
>>"primary" : false,
>>"osd" : 28,
>>"errors" : [
>>   "omap_digest_mismatch_info"
>>],
>>"omap_digest" : "0x25e8a1da",
>>"size" : 0
>> }
>>  ],
>>  "errors" : [],
>>  "selected_object_info" : {
>> "mtime" : "2018-06-19 16:31:44.759717",
>> "alloc_hint_flags" : 0,
>> "size" : 0,
>> "last_reqid" : "client.410876514.0:1",
>> "local_mtime" : "2018-06-19 16:31:44.760139",
>> "data_digest" : "0x",
>> "truncate_seq" : 0,
>> "legacy_snaps" : [],
>> "expected_write_size" : 0,
>> "watchers" : {},
>> "flags" : [
>>"dirty",
>>"data_digest",
>>"omap_digest"
>> ],
>> "oid" : {
>>"pool" : 18,
>>"hash" : 1156456354,
>>"key" : "",
>>"oid" : ".dir.default.80018061.2",
>>"namespace" : "",
>>"snapid" : -2,
>>"max" : 0
>> },
>> "truncate_size" : 0,
>> "version" : "120985'632942",
>> "expected_object_size" : 0,
>> "omap_digest" : "0x",
>> "lost" : 0,
>> "manifest" : {
>>"redirect_target" : {
>>   "namespace" : "",
>>   "snapid" : 0,
>>   "max" : 0,
>>   "pool" : -9223372036854775808,
>>   "hash" : 0,
>>   "oid" : "",
>>   "key" : ""
>>},
>>"type" : 0
>> },
>> "prior_version" : "0'0",
>> "user_version" : 632942
>>  }
>>   }
>>],
>>"epoch" : 121151
>> }
>>
>> Cheers
>>
>> - Original Message -
>>> From: "Andrei Mikhailovsky" 
>>> To: "Brad Hubbard" 
>>> Cc: "ceph-users" 
>>> Sent: Wednesday, 27 June, 2018 09:10:07
>>> Subject: Re: [ceph-users] fixing unrepairable inconsistent PG
>>
>>> Hi Brad,
>>>
>>> Thanks, that helped to get the query info on the inconsistent PG 18.2:
>>>
>>> {
>>>"state": "active+clean+inconsistent",
>>>"snap_trimq": "[]",
>>>"snap_trimq_len": 0,
>>>"epoch": 121293,
>>>"up": [
>>>21,
>>>28
>>>],
>>>"acting": [
>>>21,
>>>28
>>>],
>>>"actingbackfill": [
>>>"21",
>>>"28"
>>>],
>>>"info": {
>>>"pgid": "18.2",
>>>"last_update": "121290'698339",
>>>"last_complete": "121290'698339",
>>>"log_tail": "121272'696825",
>>>"last_user_version": 698319,
>>>"last_backfill": "MAX",
>>>"last_backfill_bitwise": 0,
>>>"purged_snaps": [],
>>>"history": {
>>>"epoch_created": 24431,
>>>"epoch_pool_created": 24431,
>>>"last_epoch_started": 121152,
>>>

[ceph-users] Ceph Tech Talk Jun 2018

2018-06-28 Thread Leonardo Vaz
Hey Cephers!

The Ceph Tech Talk of June starts in about 50 minutes; in this edition George
Mihaiuescu will talk about how Ceph is used for cancer research at OIRC.

Please check the URL below for the meeting details:

  https://ceph.com/ceph-tech-talks/

Kindest regards,

Leo

-- 
Leonardo Vaz
Ceph Community Manager
Open Source and Standards Team
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RDMA support in Ceph

2018-06-28 Thread Kamble, Nitin A


On 6/28/18, 12:11 AM, "kefu chai"  wrote:
> What is the state of the RDMA code in the Ceph Luminous and later releases?

in Ceph, the RDMA support has been constantly worked on. xio messenger
support was added 4 years ago, but i don't think it's maintained
anymore. and the async messenger has IB protocol support. i think that's
what you wanted to try out. recently, we added the iWARP support to
the async messenger, see https://github.com/ceph/ceph/pull/20297. that
change also brought better connection management by using rdma-cm to
Ceph. and i believe to get RDMA support we should have
https://github.com/ceph/ceph/pull/14681, which is still pending on
review.
>
> When will it be production ready?
the RDMA support in Ceph is completely driven by our community. and we
don't have the hardware (NIC) for testing RoCEv2/iWARP, not to mention
IB. so i can hardly tell from the maintainer's perspective.
> [1]: https://community.mellanox.com/docs/DOC-2721
-- 
Regards
Kefu Chai

Hi Kefu,
  Thanks for the detailed explanation. Looks like we will have to wait a few 
releases to get supported, production-ready RDMA in Ceph.

Thanks,
Nitin


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Problems setting up iSCSI

2018-06-28 Thread Jason Dillaman
Do you have the ansible backtrace from the "ceph-iscsi-gw : igw_lun |
configure luns (create/map rbds and add to lio)]" step? Have you tried
using the stock v4.16 kernel (no need to use the one on shaman)?

On Thu, Jun 28, 2018 at 11:29 AM Bernhard Dick  wrote:

> Hi Jason,
>
> Am 28.06.2018 um 14:33 schrieb Jason Dillaman:
> > You should have "/var/log/ansible-module-igw_config.log" on the target
> > machine that hopefully includes more information about why the RBD image
> > is missing from the TCM backend.
> I have the logfile, however there everything seems fine. The last lines
> are:
> 2018-06-28 17:25:36,876 igw_lun.py DEBUG: Check the rbd image size
> matches the request
> 2018-06-28 17:25:36,904 igw_lun.py DEBUG: rbd image rbd.iscsi2342
> size matches the configuration file request
> 2018-06-28 17:25:36,904 igw_lun.py DEBUG: Begin processing LIO mapping
> 2018-06-28 17:25:36,905 igw_lun.py INFO : (LUN.add_dev_to_lio)
> Adding image 'rbd.iscsi2342' to LIO
> 2018-06-28 17:25:36,905 igw_lun.py DEBUG: control="max_data_area_mb=8"
>
>Regards
>  Bernhard
>
> > In the past, I've seen issues w/ image
> > features and image size mismatch causing the process to abort
> >
> > On Thu, Jun 28, 2018 at 5:57 AM Bernhard Dick  > > wrote:
> >
> > Hi,
> >
> > I'm trying to setup iSCSI using ceph-ansible (stable-3.1) branch with
> > one iscsi gateway. The Host is a CentOS 7 host and I use the packages
> > from the Centos Storage SIG for luminous. Additionally I have
> installed:
> > ceph-iscsi-cli Version 2.7 Release 13.gb9e48a7.el7
> > ceph-iscsi-config Version 2.6 Release 15.ge016c6f.el7
> > from the iscsi project at shaman.ceph.com .
> > The python-rtslib version on the host is 2.1.fb67 Release 10.g7713d1e
> > also from shaman. The running kernel version is
> > 4.15.0-ceph-g1c778f43da52.
> > The ansible process stops at TASK [ceph-iscsi-gw : igw_lun | configure
> > luns (create/map rbds and add to lio)] complaining about "file not found",
> > which is caused by the self._enable function in tcm.py. That is because the
> > directory /sys/kernel/config/target/core/user_0/ is completely missing
> > at this step.
> > Do you have any ideas how to fix or debug this further?
> >
> > Regards
> >   Bernhard
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com 
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >
> >
> > --
> > Jason
>
>
> --
> Dipl.-Inf. Bernhard Dick
> Auf dem Anger 24
> DE-46485 Wesel
> www.BernhardDick.de
>
> jabber: bernh...@jabber.bdick.de
>
> Tel : +49.2812068620
> Mobil : +49.1747607927
> FAX : +49.2812068621
> USt-IdNr.: DE274728845
>


-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Problems setting up iSCSI

2018-06-28 Thread Bernhard Dick

Hi Jason,

Am 28.06.2018 um 14:33 schrieb Jason Dillaman:
You should have "/var/log/ansible-module-igw_config.log" on the target 
machine that hopefully includes more information about why the RBD image 
is missing from the TCM backend.

I have the logfile, however there everything seems fine. The last lines are:
2018-06-28 17:25:36,876 igw_lun.py DEBUG: Check the rbd image size 
matches the request
2018-06-28 17:25:36,904 igw_lun.py DEBUG: rbd image rbd.iscsi2342 
size matches the configuration file request

2018-06-28 17:25:36,904 igw_lun.py DEBUG: Begin processing LIO mapping
2018-06-28 17:25:36,905 igw_lun.py INFO : (LUN.add_dev_to_lio) 
Adding image 'rbd.iscsi2342' to LIO

2018-06-28 17:25:36,905 igw_lun.py DEBUG: control="max_data_area_mb=8"

  Regards
Bernhard

In the past, I've seen issues w/ image 
features and image size mismatch causing the process to abort


On Thu, Jun 28, 2018 at 5:57 AM Bernhard Dick > wrote:


Hi,

I'm trying to setup iSCSI using ceph-ansible (stable-3.1) branch with
one iscsi gateway. The Host is a CentOS 7 host and I use the packages
from the Centos Storage SIG for luminous. Additionally I have installed:
ceph-iscsi-cli Version 2.7 Release 13.gb9e48a7.el7
ceph-iscsi-config Version 2.6 Release 15.ge016c6f.el7
from the iscsi project at shaman.ceph.com .
The python-rtslib version on the host is 2.1.fb67 Release 10.g7713d1e
also from shaman. The running kernel version is
4.15.0-ceph-g1c778f43da52.
The ansible process stops at TASK [ceph-iscsi-gw : igw_lun | configure
luns (create/map rbds and add to lio)] complaining about "file not found", 
which is caused by the self._enable function in tcm.py. That is because the 
directory /sys/kernel/config/target/core/user_0/ is completely missing 
at this step.
Do you have any ideas how to fix or debug this further?

    Regards
      Bernhard
___
ceph-users mailing list
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



--
Jason



--
Dipl.-Inf. Bernhard Dick
Auf dem Anger 24
DE-46485 Wesel
www.BernhardDick.de

jabber: bernh...@jabber.bdick.de

Tel : +49.2812068620
Mobil : +49.1747607927
FAX : +49.2812068621
USt-IdNr.: DE274728845
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Many inconsistent PGs in EC pool, is this normal?

2018-06-28 Thread Bryan Banister
Hi all,

We started running an EC-pool-based object store, set up with a 4+2 
configuration, and we seem to be getting almost constant reports of 
inconsistent PGs during scrub operations.  For example:
root@rook-tools:/# ceph pg ls inconsistent
PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES LOG  
DISK_LOG STATE STATE_STAMPVERSION   
REPORTED  UP  UP_PRIMARY ACTING  
ACTING_PRIMARY LAST_SCRUB SCRUB_STAMPLAST_DEEP_SCRUB 
DEEP_SCRUB_STAMP
19.26   104  00 0   0 436207616 1537
 1537 active+clean+inconsistent 2018-06-28 15:04:54.054227 1811'3137 2079:5075 
[206,116,68,31,193,156]206 [206,116,68,31,193,156]206  
1811'3137 2018-06-27 21:00:17.011611   1811'3137 2018-06-27 21:00:17.011611
19.234   98  00 0   0 406847488 1581
 1581 active+clean+inconsistent 2018-06-28 15:05:18.077003 1811'2981 2080:4822  
[28,131,229,180,84,68] 28  [28,131,229,180,84,68] 28  
1811'2981 2018-06-28 14:09:54.092401   1811'2981 2018-06-28 14:09:54.092401
19.2a8  116  00 0   0 486539264 1561
 1561 active+clean+inconsistent 2018-06-28 15:04:54.073762 1811'3161 2079:4825 
[177,68,222,13,131,107]177 [177,68,222,13,131,107]177  
1811'3161 2018-06-28 07:51:21.109587   1811'3161 2018-06-28 07:51:21.109587
19.406  126  00 0   0 520233399 1557
 1557 active+clean+inconsistent 2018-06-28 15:04:57.142651 1811'3057 2080:4944  
[230,199,128,68,92,11]230  [230,199,128,68,92,11]230  
1811'3057 2018-06-27 18:36:18.497899   1811'3057 2018-06-27 18:36:18.497899
19.46b  109  00 0   0 449840274 1558
 1558 active+clean+inconsistent 2018-06-28 15:04:54.227970 1811'3058 2079:4986  
[18,68,130,94,181,225] 18  [18,68,130,94,181,225] 18  
1811'3058 2018-06-27 14:32:17.800961   1811'3058 2018-06-27 14:32:17.800961
[snip]

We sometimes see that running a deep scrub on the PG resolves the issue, but 
not always.

We have been running the PG repair operation on them (e.g. ceph pg repair 
<pg_id>), which clears the issue.  Is this the correct way to resolve it?

Is this a normal behavior for a Ceph cluster?

If so, why doesn't it attempt to repair itself automatically?
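
For what it's worth, there is a pair of OSD options that let a deep scrub kick 
off a repair on its own for small numbers of errors; a hedged sketch of the 
settings (option names as I remember them, defaults worth verifying for your 
release):

[osd]
# default false; repair automatically after a deep scrub finds errors
osd scrub auto repair = true
# skip the auto-repair if more than this many errors are found
osd scrub auto repair num errors = 5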

Thanks for the help understanding Ceph, we are very new to it still!
-Bryan





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] In a High Avaiability setup, MON, OSD daemon take up the floating IP

2018-06-28 Thread Rahul S
Hi Vlad,

I have not thoroughly tested my setup, but so far things look good. The only 
problem is that I have to manually activate the OSDs using the ceph-deploy 
command; manually mounting the OSD partition doesn't work.
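
For reference, a plain mount is not enough because activation also registers 
and starts the OSD daemon; a rough sketch of manual activation (the device 
name is just an example):

$ sudo ceph-disk activate /dev/sdb1
$ sudo ceph-disk activate-all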

Thanks for replying.

Regards,
Rahul S

On 27 June 2018 at 14:15, Дробышевский, Владимир  wrote:

> Hello, Rahul!
>
>   Do you hit the problem during initial cluster creation, or on any
> reboot/leadership transfer? If the former, try removing the floating IP
> while creating the mons, and temporarily transfer the leadership away from
> the server you're going to create OSDs on.
>
>   We are using the same configuration without any issues (though with a few
> more servers), but the ceph cluster had been created before the OpenNebula
> setup.
>
>   We have a number of physical\virtual interfaces on top of IPoIB _and_
> ethernet network (with bonding).
>
>   So there are 3 interfaces for the internal communications:
>
>   ib0.8003 - 10.103.0.0/16 - ceph public network and opennebula raft
> virtual ip
>   ib0.8004 - 10.104.0.0/16 - ceph cluster network
>   br0 (on top of ethernet bonding interface) - 10.101.0.0/16 - physical
> "management" network
>
>   also we have a number of other virtual interfaces for per-tenant
> intra-VM networks (vxlan on top of IP) and so on.
>
>
>
> in /etc/hosts we have only "fixed" IPs from 10.103.0.0/16 networks like:
>
> 10.103.0.1  e001n01.dc1..xxe001n01
>
>
>
>   /etc/one/oned.conf:
>
> # Executed when a server transits from follower->leader
>  RAFT_LEADER_HOOK = [
>  COMMAND = "raft/vip.sh",
>  ARGUMENTS = "leader ib0.8003 10.103.255.254/16"
>  ]
>
> # Executed when a server transits from leader->follower
>  RAFT_FOLLOWER_HOOK = [
>  COMMAND = "raft/vip.sh",
>  ARGUMENTS = "follower ib0.8003 10.103.255.254/16"
>  ]
>
>
>
>   /etc/ceph/ceph.conf:
>
> [global]
> public_network = 10.103.0.0/16
> cluster_network = 10.104.0.0/16
>
> mon_initial_members = e001n01, e001n02, e001n03
> mon_host = 10.103.0.1,10.103.0.2,10.103.0.3
>
>
>
>   Cluster and mons created with ceph-deploy, each OSD has been added via
> modified ceph-disk.py (as we have only 3 drive slots per server we had to
> co-locate system partition with OSD partition on our SSDs) on
> per-host\drive manner:
>
> admin@:~$ sudo ./ceph-disk-mod.py -v prepare --dmcrypt
> --dmcrypt-key-dir /etc/ceph/dmcrypt-keys --bluestore --cluster ceph
> --fs-type xfs -- /dev/sda
>
>
>   And the current state on the leader:
>
> oneadmin@e001n02:~/remotes/tm$ onezone show 0
> ZONE 0 INFORMATION
> ID: 0
> NAME  : OpenNebula
>
>
> ZONE SERVERS
> ID NAMEENDPOINT
>  0 e001n01 http://10.103.0.1:2633/RPC2
>  1 e001n02 http://10.103.0.2:2633/RPC2
>  2 e001n03 http://10.103.0.3:2633/RPC2
>
> HA & FEDERATION SYNC STATUS
> ID NAMESTATE  TERM   INDEX  COMMIT VOTE
> FED_INDEX
>  0 e001n01 follower   1571   68250418   68250417   1 -1
>  1 e001n02 leader 1571   68250418   68250418   1 -1
>  2 e001n03 follower   1571   68250418   68250417   -1-1
> ...
>
>
> admin@e001n02:~$ ip addr show ib0.8003
> 9: ib0.8003@ib0:  mtu 65520 qdisc mq
> state UP group default qlen 256
> link/infiniband 
> a0:00:03:00:fe:80:00:00:00:00:00:00:00:1e:67:03:00:47:c1:1b
> brd 00:ff:ff:ff:ff:12:40:1b:80:03:00:00:00:00:00:00:ff:ff:ff:ff
> inet 10.103.0.2/16 brd 10.103.255.255 scope global ib0.8003
>valid_lft forever preferred_lft forever
> inet 10.103.255.254/16 scope global secondary ib0.8003
>valid_lft forever preferred_lft forever
> inet6 fe80::21e:6703:47:c11b/64 scope link
>valid_lft forever preferred_lft forever
>
> admin@e001n02:~$ sudo netstat -anp | grep mon
> tcp0  0 10.103.0.2:6789 0.0.0.0:*
>  LISTEN  168752/ceph-mon
> tcp0  0 10.103.0.2:6789 10.103.0.2:44270
> ESTABLISHED 168752/ceph-mon
> ...
>
> admin@e001n02:~$ sudo netstat -anp | grep osd
> tcp0  0 10.104.0.2:6800 0.0.0.0:*
>  LISTEN  6736/ceph-osd
> tcp0  0 10.104.0.2:6801 0.0.0.0:*
>  LISTEN  6736/ceph-osd
> tcp0  0 10.103.0.2:6801 0.0.0.0:*
>  LISTEN  6736/ceph-osd
> tcp0  0 10.103.0.2:6802 0.0.0.0:*
>  LISTEN  6736/ceph-osd
> tcp0  0 10.104.0.2:6801 10.104.0.6:42868
> ESTABLISHED 6736/ceph-osd
> tcp0  0 10.104.0.2:5178810.104.0.1:6800
>  ESTABLISHED 6736/ceph-osd
> ...
>
> admin@e001n02:~$ sudo ceph -s
>   cluster:
> id: 
> health: HEALTH_OK
>
> oneadmin@e001n02:~/remotes/tm$ onedatastore show 0
> DATASTORE 0 INFORMATION
> ID : 0
> NAME   : system
> USER   : oneadmin
> GROUP  : oneadmin
> CLUSTERS   : 0
> TYPE   : SYSTEM
> DS_MAD : -
> TM_MAD : ceph_shared
> BASE PATH  : /var/lib/one//datastores/0
> DISK_TYPE  : RBD
> STATE  : 

Re: [ceph-users] Problems setting up iSCSI

2018-06-28 Thread Jason Dillaman
You should have "/var/log/ansible-module-igw_config.log" on the target
machine that hopefully includes more information about why the RBD image is
missing from the TCM backend. In the past, I've seen issues w/ image
features and image size mismatch causing the process to abort.
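
If it helps, a quick way to compare the image against what the playbook asked 
for is to dump its size and features; the pool/image names below are a guess 
based on the log snippet in this thread:

$ rbd info rbd/iscsi2342
(check the reported size and features against the values in your ansible 
group_vars / gwcli configuration)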

On Thu, Jun 28, 2018 at 5:57 AM Bernhard Dick  wrote:

> Hi,
>
> I'm trying to setup iSCSI using ceph-ansible (stable-3.1) branch with
> one iscsi gateway. The Host is a CentOS 7 host and I use the packages
> from the Centos Storage SIG for luminous. Additionally I have installed:
> ceph-iscsi-cli Version 2.7 Release 13.gb9e48a7.el7
> ceph-iscsi-config Version 2.6 Release 15.ge016c6f.el7
> from the iscsi project at shaman.ceph.com.
> The python-rtslib version on the host is 2.1.fb67 Release 10.g7713d1e
> also from shaman. The running kernel version is 4.15.0-ceph-g1c778f43da52.
> The ansible process stops at TASK [ceph-iscsi-gw : igw_lun | configure
> luns (create/map rbds and add to lio)] complaining about "file not found",
> which is caused by the self._enable function in tcm.py. That is because the
> directory /sys/kernel/config/target/core/user_0/ is completely missing
> at this step.
> Do you have any ideas how to fix or debug this further?
>
>Regards
>  Bernhard
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Problems setting up iSCSI

2018-06-28 Thread Bernhard Dick

Hi,

I'm trying to setup iSCSI using ceph-ansible (stable-3.1) branch with 
one iscsi gateway. The Host is a CentOS 7 host and I use the packages 
from the Centos Storage SIG for luminous. Additionally I have installed:

ceph-iscsi-cli Version 2.7 Release 13.gb9e48a7.el7
ceph-iscsi-config Version 2.6 Release 15.ge016c6f.el7
from the iscsi project at shaman.ceph.com.
The python-rtslib version on the host is 2.1.fb67 Release 10.g7713d1e 
also from shaman. The running kernel version is 4.15.0-ceph-g1c778f43da52.
The ansible process stops at TASK [ceph-iscsi-gw : igw_lun | configure 
luns (create/map rbds and add to lio)] complaining about "file not found", 
which is caused by the self._enable function in tcm.py. That is because the 
directory /sys/kernel/config/target/core/user_0/ is completely missing 
at this step.

Do you have any ideas how to fix or debug this further?
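
One thing worth checking first (an assumption on my part: a missing user_0 
directory usually means the TCMU kernel side never came up):

$ lsmod | grep target_core_user
$ sudo modprobe target_core_user
$ mount | grep configfs
$ ls /sys/kernel/config/target/core/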

  Regards
Bernhard
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph FS Random Write 4KB block size only 2MB/s?!

2018-06-28 Thread Yu Haiyang
Here you go. Below are the fio job options and the results.

blocksize=4K
size=500MB
directory=[ceph_fs_mount_directory]
ioengine=libaio
iodepth=64
direct=1
runtime=60
time_based
group_reporting

numjobs Ceph FS Erasure Coding (k=2, m=1)   Ceph FS 3 Replica
1 job   577KB/s 765KB/s
2 job   1.27MB/s793KB/s
4 job   2.33MB/s1.36MB/s
8 job   4.14MB/s2.36MB/s
16 job  6.87MB/s4.40MB/s
32 job  11.07MB/s   8.17MB/s
64 job  13.75MB/s   15.84MB/s
128 job 10.46MB/s   26.82MB/s
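
For completeness, the options above assembled into a full job file might look 
like the sketch below; the [randwrite] section, rw= line, numjobs value and 
mount point are assumptions, since the original listing omits them:

[global]
blocksize=4K
size=500MB
directory=/mnt/cephfs
ioengine=libaio
iodepth=64
direct=1
runtime=60
time_based
group_reporting

[randwrite]
rw=randwrite
numjobs=16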

On Jun 28, 2018, at 5:01 PM, Yan, Zheng <uker...@gmail.com> wrote:

On Thu, Jun 28, 2018 at 10:30 AM Yu Haiyang <haiya...@moqi.ai> wrote:

Hi Yan,

Thanks for your suggestion.
No, I didn’t run fio on ceph-fuse. I mounted my Ceph FS in kernel mode.


command option of fio ?

Regards,
Haiyang

On Jun 27, 2018, at 9:45 PM, Yan, Zheng <uker...@gmail.com> wrote:

On Wed, Jun 27, 2018 at 8:04 PM Yu Haiyang <haiya...@moqi.ai> wrote:

Hi All,

Using fio with job number ranging from 1 to 128, the random write speed for 4KB 
block size has been consistently around 1MB/s to 2MB/s.
Random read of the same block size can reach 60MB/s with 32 jobs.

run fio on ceph-fuse? If I remember right, fio does 1-byte writes. The
overhead of passing the 1 byte to ceph-fuse is too high.


Our ceph cluster consists of 4 OSDs all running on SSD connected through a 
switch with 9.06 Gbits/sec bandwidth.
Any suggestion please?

Warmest Regards,
Haiyang
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph FS Random Write 4KB block size only 2MB/s?!

2018-06-28 Thread Yan, Zheng
On Thu, Jun 28, 2018 at 10:30 AM Yu Haiyang  wrote:
>
> Hi Yan,
>
> Thanks for your suggestion.
> No, I didn’t run fio on ceph-fuse. I mounted my Ceph FS in kernel mode.
>

command option of fio ?

> Regards,
> Haiyang
>
> > On Jun 27, 2018, at 9:45 PM, Yan, Zheng  wrote:
> >
> > On Wed, Jun 27, 2018 at 8:04 PM Yu Haiyang  wrote:
> >>
> >> Hi All,
> >>
> >> Using fio with job number ranging from 1 to 128, the random write speed 
> >> for 4KB block size has been consistently around 1MB/s to 2MB/s.
> >> Random read of the same block size can reach 60MB/s with 32 jobs.
> >
> > run fio on ceph-fuse? If I remember right, fio does 1-byte writes. The
> > overhead of passing the 1 byte to ceph-fuse is too high.
> >
> >>
> >> Our ceph cluster consists of 4 OSDs all running on SSD connected through a 
> >> switch with 9.06 Gbits/sec bandwidth.
> >> Any suggestion please?
> >>
> >> Warmest Regards,
> >> Haiyang
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] FreeBSD Initiator with Ceph iscsi

2018-06-28 Thread Frank (lists)

Jason Dillaman wrote:
Conceptually, I would assume it should just work if configured 
correctly w/ multipath (to properly configure the ALUA settings on the 
LUNs). I don't run FreeBSD, but is there any particular issue you are seeing?


When logged in to both targets,  the following message floods the log

WARNING: 192.168.5.109 (iqn.2018-06.lan.x.iscsi-gw:ceph-igw): underflow 
mismatch: target indicates 0, we calculated 512

(da1:iscsi6:0:0:0): READ(10). CDB: 28 00 0c 7f ff ff 00 00 01 00
(da1:iscsi6:0:0:0): CAM status: SCSI Status Error
(da1:iscsi6:0:0:0): SCSI status: Check Condition
(da1:iscsi6:0:0:0): SCSI sense: NOT READY asc:4,b (Logical unit not 
accessible, target port in standby state)

(da1:iscsi6:0:0:0): Error 6, Unretryable error
(da1:iscsi6:0:0:0): Invalidating pack

For both sessions the message are the same (besides numbering of devices)

When trying to read from either of the devices (da1 and da2 in my case), 
FreeBSD gives the error 'Device not configured'. When using gmultipath 
(created manually, because FreeBSD is not able to write a label to either of 
the devices), the resulting multipath is not functional because it marks both 
devices as FAIL.
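
For reference, a manual (label-less) gmultipath setup of the kind described 
above looks roughly like this; device and array names are examples, and -A 
requests Active/Active:

# gmultipath create -A CEPHLUN da1 da2
# gmultipath status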






On Tue, Jun 26, 2018 at 6:06 PM Frank de Bot (lists) 
mailto:li...@searchy.net>> wrote:


Hi,

In my test setup I have a ceph iscsi gateway (configured as in
http://docs.ceph.com/docs/luminous/rbd/iscsi-overview/ )

I would like to use thie with a FreeBSD (11.1) initiator, but I
fail to
make a working setup in FreeBSD. Is it known if the FreeBSD initiator
(with gmultipath) can work with this gateway setup?


Regards,

Frank
___
ceph-users mailing list
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



--
Jason


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous Bluestore performance, bcache

2018-06-28 Thread Andrei Mikhailovsky
Hi Richard,

This is an interesting test for me too, as I am planning to migrate to 
Bluestore and was considering repurposing the SSDs that we currently use for 
journals.

I was wondering if you are using Filestore or Bluestore for the OSDs?

Also, when you perform your testing, how good is the hit ratio that you have on 
the bcache?

Are you using a lot of random data for your benchmarks? How large is your test 
file for each vm?

We played around with a few caching setups a few years back (EnhanceIO and a 
few more that I can't remember now) and saw a very poor hit ratio on the 
caching layer. I was wondering if you see a different picture?
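
In case it is useful, bcache exposes its hit ratio per backing device in 
sysfs; a quick sketch (the bcache0 name is an example):

$ cat /sys/block/bcache0/bcache/stats_hour/cache_hit_ratio
$ cat /sys/block/bcache0/bcache/stats_total/cache_hits
$ cat /sys/block/bcache0/bcache/stats_total/cache_misses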

Cheers

- Original Message -
> From: "Richard Bade" 
> To: "ceph-users" 
> Sent: Thursday, 28 June, 2018 05:42:34
> Subject: [ceph-users] Luminous Bluestore performance, bcache

> Hi Everyone,
> There's been a few threads go past around this but I haven't seen any
> that pointed me in the right direction.
> We've recently set up a new luminous (12.2.5) cluster with 5 hosts
> each with 12 4TB Seagate Constellation ES spinning disks for osd's. We
> also have 2x 400GB Intel DC P3700's per node. We're using this for rbd
> storage for VM's running under Proxmox VE.
> I firstly set these up with DB partition (approx 60GB per osd) on nvme
> and data directly onto the spinning disk using ceph-deploy create.
> This worked great and was very simple.
> However performance wasn't great. I fired up 20vm's each running fio
> trying to attain 50 iops. Ceph was only just able to keep up with the
> 1000iops this generated and vm's started to have trouble hitting their
> 50iops target.
> So I rebuilt all the osd's halving the DB space (~30GB per osd) and
> adding a 200GB BCache partition shared between 6 osd's. Again this
> worked great with ceph-deploy create and was very simple.
> I have had a vast improvement with my synthetic test. I can now run
> 100 50iops test vm's generating a constant 5000iops load and each one
> can keep up without any trouble.
> 
> The question I have is if the poor performance out of the box is
> expected? Or is there some kind of tweaking I should be doing to make
> this usable for rbd images? Are others able to work ok with this kind
> of config at a small scale like my 60osd's? Or is it only workable at
> a larger scale?
> 
> Regards,
> Rich
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RDMA support in Ceph

2018-06-28 Thread kefu chai
On Wed, Jun 27, 2018 at 1:21 AM Kamble, Nitin A
 wrote:
>
> I tried enabling the RDMA support in Ceph Luminous release following this [1] 
> guide.
>
> I used the released Luminous bits, and not the Mellanox branches mentioned in 
> the guide.
>
>
>
> I could see some RDMA traffic in the perf counters, but the ceph daemons were 
> still
>
> complaining that they are not able to talk with each other.
>
>
>
> AFAIK the RDMA support in Ceph is experimental.
>
>
>
> I would like to know…
>
> What is the state of the RDMA code in the Ceph Luminous and later releases?

in Ceph, the RDMA support has been constantly worked on. xio messenger
support was added 4 years ago, but i don't think it's maintained
anymore. and the async messenger has IB protocol support. i think that's
what you wanted to try out. recently, we added the iWARP support to
the async messenger, see https://github.com/ceph/ceph/pull/20297. that
change also brought better connection management by using rdma-cm to
Ceph. and i believe to get RDMA support we should have
https://github.com/ceph/ceph/pull/14681, which is still pending on
review.
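
For anyone following along, the async+rdma messenger is typically switched on 
with ceph.conf settings along these lines (a sketch only; the device name must 
match the local RNIC and is an assumption here):

[global]
ms_type = async+rdma
ms_async_rdma_device_name = mlx5_0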

>
> When will it be production ready?

the RDMA support in Ceph is completely driven by our community. and we
don't have the hardware (NIC) for testing RoCEv2/iWARP, not to mention
IB. so i can hardly tell from the maintainer's perspective.

>
>
>
> Thanks,
>
> Nitin
>
>
>
>
>
> [1]: https://community.mellanox.com/docs/DOC-2721
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Regards
Kefu Chai
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com