Re: [ceph-users] Recovery from 12.2.5 (corruption) -> 12.2.6 (hair on fire) -> 13.2.0 (some objects inaccessible and CephFS damaged)

2018-07-18 Thread Brad Hubbard
On Thu, Jul 19, 2018 at 12:47 PM, Troy Ablan  wrote:
>
>
> On 07/18/2018 06:37 PM, Brad Hubbard wrote:
>> On Thu, Jul 19, 2018 at 2:48 AM, Troy Ablan  wrote:
>>>
>>>
>>> On 07/17/2018 11:14 PM, Brad Hubbard wrote:

 On Wed, Jul 18, 2018 at 2:57 AM, Troy Ablan  wrote:
>
> I was on 12.2.5 for a couple weeks and started randomly seeing
> corruption, moved to 12.2.6 via yum update on Sunday, and all hell broke
> loose.  I panicked and moved to Mimic, and when that didn't solve the
> problem, only then did I start to root around in mailing lists archives.
>
> It appears I can't downgrade OSDs back to Luminous now that 12.2.7 is
> out, but I'm unsure how to proceed now that the damaged cluster is
> running under Mimic.  Is there anything I can do to get the cluster back
> online and objects readable?

 That depends on what the specific problem is. Can you provide some
 data that fills in the blanks around "randomly seeing corruption"?

>>> Thanks for the reply, Brad.  I have a feeling that almost all of this stems
>>> from the time the cluster spent running 12.2.6.  When booting VMs that use
>>> rbd as a backing store, they typically get I/O errors during boot and cannot
>>> read critical parts of the image.  I also get similar errors if I try to rbd
>>> export most of the images. Also, CephFS is not started as ceph -s indicates
>>> damage.
>>>
>>> Many of the OSDs have been crashing and restarting as I've tried to rbd
>>> export good versions of images (from older snapshots).  Here's one
>>> particular crash:
>>>
>>> 2018-07-18 15:52:15.809 7fcbaab77700 -1
>>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/h
>>> uge/release/13.2.0/rpm/el7/BUILD/ceph-13.2.0/src/os/bluestore/BlueStore.h:
>>> In function 'void
>>> BlueStore::SharedBlobSet::remove_last(BlueStore::SharedBlob*)' thread
>>> 7fcbaab7
>>> 7700 time 2018-07-18 15:52:15.750916
>>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.0/rpm/el7/BUILD/ceph-13
>>> .2.0/src/os/bluestore/BlueStore.h: 455: FAILED assert(sb->nref == 0)
>>>
>>>  ceph version 13.2.0 (79a10589f1f80dfe21e8f9794365ed98143071c4) mimic
>>> (stable)
>>>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>>> const*)+0xff) [0x7fcbc197a53f]
>>>  2: (()+0x286727) [0x7fcbc197a727]
>>>  3: (BlueStore::SharedBlob::put()+0x1da) [0x5641f39181ca]
>>>  4: (std::_Rb_tree,
>>> boost::intrusive_ptr,
>>> std::_Identity >,
>>> std::less >,
>>> std::allocator >
 ::_M_erase(std::_Rb_tree_node>> lueStore::SharedBlob> >*)+0x2d) [0x5641f3977cfd]
>>>  5: (std::_Rb_tree,
>>> boost::intrusive_ptr,
>>> std::_Identity >,
>>> std::less >,
>>> std::allocator >
 ::_M_erase(std::_Rb_tree_node>> lueStore::SharedBlob> >*)+0x1b) [0x5641f3977ceb]
>>>  6: (std::_Rb_tree,
>>> boost::intrusive_ptr,
>>> std::_Identity >,
>>> std::less >,
>>> std::allocator >
 ::_M_erase(std::_Rb_tree_node>> lueStore::SharedBlob> >*)+0x1b) [0x5641f3977ceb]
>>>  7: (std::_Rb_tree,
>>> boost::intrusive_ptr,
>>> std::_Identity >,
>>> std::less >,
>>> std::allocator >
 ::_M_erase(std::_Rb_tree_node>> lueStore::SharedBlob> >*)+0x1b) [0x5641f3977ceb]
>>>  8: (BlueStore::TransContext::~TransContext()+0xf7) [0x5641f3979297]
>>>  9: (BlueStore::_txc_finish(BlueStore::TransContext*)+0x610)
>>> [0x5641f391c9b0]
>>>  10: (BlueStore::_txc_state_proc(BlueStore::TransContext*)+0x9a)
>>> [0x5641f392a38a]
>>>  11: (BlueStore::_kv_finalize_thread()+0x41e) [0x5641f392b3be]
>>>  12: (BlueStore::KVFinalizeThread::entry()+0xd) [0x5641f397d85d]
>>>  13: (()+0x7e25) [0x7fcbbe4d2e25]
>>>  14: (clone()+0x6d) [0x7fcbbd5c3bad]
>>>  NOTE: a copy of the executable, or `objdump -rdS ` is needed to
>>> interpret this.
>>>
>>>
>>> Here's the output of ceph -s that might fill in some configuration
>>> questions.  Since osds are continually restarting if I try to put load on
>>> it, the cluster seems to be churning a bit.  That's why I set nodown for
>>> now.
>>>
>>>   cluster:
>>> id: b2873c9a-5539-4c76-ac4a-a6c9829bfed2
>>> health: HEALTH_ERR
>>> 1 filesystem is degraded
>>> 1 filesystem is offline
>>> 1 mds daemon damaged
>>> nodown,noscrub,nodeep-scrub flag(s) set
>>> 9 scrub errors
>>> Reduced data availability: 61 pgs inactive, 56 pgs peering, 4
>>> pgs stale
>>> Possible data damage: 3 pgs inconsistent
>>> 16 slow requests are blocked > 32 sec
>>> 26 stuck requests are blocked > 4096 sec
>>>
>>>   services:
>>> mon: 5 daemons, quorum a,b,c,d,e
>>> mgr: a(active), standbys: b, d, e, c
>>> mds: lcs-0/1/1 up , 2 up:standby, 1 damaged
>>> osd: 34 osds: 34 up, 34 in
>>>  flags nodown,noscrub,nodeep-scrub
>>>
>>>   data:
>>> pools:   15 pools, 640 pgs
>>> 

Re: [ceph-users] Crush Rules with multiple Device Classes

2018-07-18 Thread Konstantin Shalygin

Now my first question is:
1) Is there a way to specify "take default class (ssd or nvme)"?
Then we could just do this for the migration period, and at some point remove 
"ssd".

If multi-device-class in a crush rule is not supported yet, the only workaround 
which comes to my mind right now is to issue:
   $ ceph osd crush set-device-class nvme 
for all our old SSD-backed osds, and modify the crush rule to refer to class 
"nvme" straightaway.



My advice is to set the class of your current 'ssd'-class OSDs to 'nvme' and 
change the crush rule to use that class.


You will have to do it at some point anyway, so better sooner than later. The 
alternative is to keep using the 'ssd' class for your future drives, but then, 
once you have switched everything to nvme, you are left with a class named 
'ssd' and no actual ssd disks behind it.
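For example, something along these lines should do the reclassification (an 
untested sketch, assuming the Luminous crush device-class commands are available):

   # move every OSD currently in class 'ssd' to class 'nvme'
   for id in $(ceph osd crush class ls-osd ssd); do
       ceph osd crush rm-device-class osd.$id
       ceph osd crush set-device-class nvme osd.$id
   done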




Here my third question:
3) Are the tunables used for NVME devices the same as for SSD devices?
I do not find any NVME tunables here:
http://docs.ceph.com/docs/master/rados/configuration/osd-config-ref/
Only SSD, HDD and Hybrid are shown.


Ceph doesn't care about nvme vs ssd here. Ceph only cares whether the drive 
is rotational or not.
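For example, these are the rotational flags as reported in the osd metadata 
(visible via something like "ceph osd metadata <osd-id>"):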



"bluefs_db_rotational": "0",
    "bluefs_slow_rotational": "1",
    "bluefs_wal_rotational": "0",
    "bluestore_bdev_rotational": "1",
    "journal_rotational": "0",
    "rotational": "1"




k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Migrating EC pool to device-class crush rules

2018-07-18 Thread Konstantin Shalygin

So mostly I want to confirm that is is safe to change the crush rule for
the EC pool.


Changing the crush rule of a replicated or EC pool is safe.

One caveat: when I migrated from multi-root to device classes I recreated the 
EC pools and cloned the images with qemu-img (to get the ec_overwrites 
feature), so I don't have first-hand experience with changing erasure profiles 
in place.


Your old profile doesn't set a crush root, so it uses 'default', and the new 
rule will use 'default' too. So I don't see any disadvantage with this 
migration.
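The switch itself should then be a single pool setting, for example (assuming 
the pool carries the same name as the old rule; adjust to your actual pool name):

   ceph osd pool set .rgw.buckets.ec42 crush_rule ec42_hdd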




These are both ec42 but I'm not sure why the old rule has "max size 20"
(perhaps because it was generated a long time ago under hammer?).


Likely.





k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Crush Rules with multiple Device Classes

2018-07-18 Thread Oliver Freyermuth
Dear Cephalopodians,

we use an SSD-only pool to store the metadata of our CephFS. 
In the future, we will add a few NVMEs, and in the long-term view, replace the 
existing SSDs by NVMEs, too. 

Thinking this through, I came up with three questions which I do not find 
answered in the docs (yet). 

Currently, we use the following crush-rule:

rule cephfs_metadata {
id 1
type replicated
min_size 1
max_size 10
step take default class ssd
step choose firstn 0 type osd
step emit
}

As you can see, this uses "class ssd". 

Now my first question is: 
1) Is there a way to specify "take default class (ssd or nvme)"? 
   Then we could just do this for the migration period, and at some point 
remove "ssd". 

If multi-device-class in a crush rule is not supported yet, the only workaround 
which comes to my mind right now is to issue:
  $ ceph osd crush set-device-class nvme 
for all our old SSD-backed osds, and modify the crush rule to refer to class 
"nvme" straightaway. 

This leads to my second question:
2) Since the OSD IDs do not change, Ceph should not move any data around by 
changing both the device classes of the OSDs and the device class in the crush 
rule - correct? 

After this operation, adding NVMEs to our cluster should let them automatically 
join this crush rule, and once all SSDs are replaced with NVMEs, 
the workaround is automatically gone. 

As long as the SSDs are still there, some tunables might not fit well anymore 
out of the box, i.e. the "sleep" values for scrub and repair, though. 

Here my third question:
3) Are the tunables used for NVME devices the same as for SSD devices?
   I do not find any NVME tunables here:
   http://docs.ceph.com/docs/master/rados/configuration/osd-config-ref/
   Only SSD, HDD and Hybrid are shown. 

Cheers,
Oliver



smime.p7s
Description: S/MIME Cryptographic Signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RAID question for Ceph

2018-07-18 Thread Troy Ablan



On 07/18/2018 07:44 PM, Satish Patel wrote:
> If i have 8 OSD drives in server on P410i RAID controller (HP), If i
> want to make this server has OSD node in that case show should i
> configure RAID?
> 
> 1. Put all drives in RAID-0?
> 2. Put individual HDD in RAID-0 and create 8 individual RAID-0 so OS
> can see 8 separate HDD drives
> 
> What most people doing in production for Ceph (BleuStore)?

In my experience, using a RAID card is not ideal for storage systems
like Ceph.  Redundancy comes from replicating data across multiple
hosts, so there's no need for this functionality in a disk controller.
Even worse, the P410i doesn't appear to support a pass-thru (JBOD/HBA)
mode, so your only sane option for using this card is to create RAID-0s.
Whenever you need to replace a bad drive, you will need to go through
the extra step of creating a RAID-0 on the new drive.
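(On these controllers that step usually looks something like the following with 
ssacli/hpssacli, where the controller slot and drive address are only placeholders:

   ssacli ctrl slot=0 create type=ld drives=1I:1:1 raid=0
)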

In a production environment, I would recommend an HBA that exposes all
of the drives directly to the OS. It makes management and monitoring a
lot easier.

-Troy
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Recovery from 12.2.5 (corruption) -> 12.2.6 (hair on fire) -> 13.2.0 (some objects inaccessible and CephFS damaged)

2018-07-18 Thread Troy Ablan



On 07/18/2018 06:37 PM, Brad Hubbard wrote:
> On Thu, Jul 19, 2018 at 2:48 AM, Troy Ablan  wrote:
>>
>>
>> On 07/17/2018 11:14 PM, Brad Hubbard wrote:
>>>
>>> On Wed, Jul 18, 2018 at 2:57 AM, Troy Ablan  wrote:

 I was on 12.2.5 for a couple weeks and started randomly seeing
 corruption, moved to 12.2.6 via yum update on Sunday, and all hell broke
 loose.  I panicked and moved to Mimic, and when that didn't solve the
 problem, only then did I start to root around in mailing lists archives.

 It appears I can't downgrade OSDs back to Luminous now that 12.2.7 is
 out, but I'm unsure how to proceed now that the damaged cluster is
 running under Mimic.  Is there anything I can do to get the cluster back
 online and objects readable?
>>>
>>> That depends on what the specific problem is. Can you provide some
>>> data that fills in the blanks around "randomly seeing corruption"?
>>>
>> Thanks for the reply, Brad.  I have a feeling that almost all of this stems
>> from the time the cluster spent running 12.2.6.  When booting VMs that use
>> rbd as a backing store, they typically get I/O errors during boot and cannot
>> read critical parts of the image.  I also get similar errors if I try to rbd
>> export most of the images. Also, CephFS is not started as ceph -s indicates
>> damage.
>>
>> Many of the OSDs have been crashing and restarting as I've tried to rbd
>> export good versions of images (from older snapshots).  Here's one
>> particular crash:
>>
>> 2018-07-18 15:52:15.809 7fcbaab77700 -1
>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/h
>> uge/release/13.2.0/rpm/el7/BUILD/ceph-13.2.0/src/os/bluestore/BlueStore.h:
>> In function 'void
>> BlueStore::SharedBlobSet::remove_last(BlueStore::SharedBlob*)' thread
>> 7fcbaab7
>> 7700 time 2018-07-18 15:52:15.750916
>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.0/rpm/el7/BUILD/ceph-13
>> .2.0/src/os/bluestore/BlueStore.h: 455: FAILED assert(sb->nref == 0)
>>
>>  ceph version 13.2.0 (79a10589f1f80dfe21e8f9794365ed98143071c4) mimic
>> (stable)
>>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> const*)+0xff) [0x7fcbc197a53f]
>>  2: (()+0x286727) [0x7fcbc197a727]
>>  3: (BlueStore::SharedBlob::put()+0x1da) [0x5641f39181ca]
>>  4: (std::_Rb_tree,
>> boost::intrusive_ptr,
>> std::_Identity >,
>> std::less >,
>> std::allocator >
>>> ::_M_erase(std::_Rb_tree_node> lueStore::SharedBlob> >*)+0x2d) [0x5641f3977cfd]
>>  5: (std::_Rb_tree,
>> boost::intrusive_ptr,
>> std::_Identity >,
>> std::less >,
>> std::allocator >
>>> ::_M_erase(std::_Rb_tree_node> lueStore::SharedBlob> >*)+0x1b) [0x5641f3977ceb]
>>  6: (std::_Rb_tree,
>> boost::intrusive_ptr,
>> std::_Identity >,
>> std::less >,
>> std::allocator >
>>> ::_M_erase(std::_Rb_tree_node> lueStore::SharedBlob> >*)+0x1b) [0x5641f3977ceb]
>>  7: (std::_Rb_tree,
>> boost::intrusive_ptr,
>> std::_Identity >,
>> std::less >,
>> std::allocator >
>>> ::_M_erase(std::_Rb_tree_node> lueStore::SharedBlob> >*)+0x1b) [0x5641f3977ceb]
>>  8: (BlueStore::TransContext::~TransContext()+0xf7) [0x5641f3979297]
>>  9: (BlueStore::_txc_finish(BlueStore::TransContext*)+0x610)
>> [0x5641f391c9b0]
>>  10: (BlueStore::_txc_state_proc(BlueStore::TransContext*)+0x9a)
>> [0x5641f392a38a]
>>  11: (BlueStore::_kv_finalize_thread()+0x41e) [0x5641f392b3be]
>>  12: (BlueStore::KVFinalizeThread::entry()+0xd) [0x5641f397d85d]
>>  13: (()+0x7e25) [0x7fcbbe4d2e25]
>>  14: (clone()+0x6d) [0x7fcbbd5c3bad]
>>  NOTE: a copy of the executable, or `objdump -rdS ` is needed to
>> interpret this.
>>
>>
>> Here's the output of ceph -s that might fill in some configuration
>> questions.  Since osds are continually restarting if I try to put load on
>> it, the cluster seems to be churning a bit.  That's why I set nodown for
>> now.
>>
>>   cluster:
>> id: b2873c9a-5539-4c76-ac4a-a6c9829bfed2
>> health: HEALTH_ERR
>> 1 filesystem is degraded
>> 1 filesystem is offline
>> 1 mds daemon damaged
>> nodown,noscrub,nodeep-scrub flag(s) set
>> 9 scrub errors
>> Reduced data availability: 61 pgs inactive, 56 pgs peering, 4
>> pgs stale
>> Possible data damage: 3 pgs inconsistent
>> 16 slow requests are blocked > 32 sec
>> 26 stuck requests are blocked > 4096 sec
>>
>>   services:
>> mon: 5 daemons, quorum a,b,c,d,e
>> mgr: a(active), standbys: b, d, e, c
>> mds: lcs-0/1/1 up , 2 up:standby, 1 damaged
>> osd: 34 osds: 34 up, 34 in
>>  flags nodown,noscrub,nodeep-scrub
>>
>>   data:
>> pools:   15 pools, 640 pgs
>> objects: 9.73 M objects, 13 TiB
>> usage:   24 TiB used, 55 TiB / 79 TiB avail
>> pgs: 23.438% pgs not active
>>  487 active+clean
>>  73  

[ceph-users] RAID question for Ceph

2018-07-18 Thread Satish Patel
If I have 8 OSD drives in a server with a P410i RAID controller (HP), and I
want to use this server as an OSD node, how should I configure the RAID?

1. Put all drives in a single RAID-0?
2. Put each HDD in its own RAID-0, creating 8 individual RAID-0 volumes, so the
OS can see 8 separate HDD drives?

What are most people doing in production for Ceph (BlueStore)?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph rdma + IB network error

2018-07-18 Thread Will Zhao
Hi all:



By following the instructions:

(https://community.mellanox.com/docs/DOC-2721)

(https://community.mellanox.com/docs/DOC-2693)

(http://hwchiu.com/2017-05-03-ceph-with-rdma.html)



I'm trying to configure Ceph with the RDMA feature in the following environment:



CentOS Linux release 7.2.1511 (Core)

MLNX_OFED_LINUX-4.4-1.0.0.0:

Mellanox Technologies MT27500 Family [ConnectX-3]



rping works between all nodes, and I added these lines to ceph.conf to enable
RDMA:



public_network = 10.10.121.0/24

cluster_network = 10.10.121.0/24

ms_type = async+rdma

ms_async_rdma_device_name = mlx4_0

ms_async_rdma_port_num = 2



The IB network uses 10.10.121.0/24 addresses and the "ibdev2netdev" command
shows port 2 is up.

An error occurs when running "ceph-deploy --overwrite-conf mon
create-initial"; ceph-deploy log details:



[2018-07-12 17:53:48,943][ceph_deploy.conf][DEBUG ] found configuration
file at: /home/user1/.cephdeploy.conf

[2018-07-12 17:53:48,944][ceph_deploy.cli][INFO  ] Invoked (1.5.37):
/usr/bin/ceph-deploy --overwrite-conf mon create-initial

[2018-07-12 17:53:48,944][ceph_deploy.cli][INFO  ] ceph-deploy options:

[2018-07-12 17:53:48,944][ceph_deploy.cli][INFO  ]
username  : None

[2018-07-12 17:53:48,944][ceph_deploy.cli][INFO  ]
verbose   : False

[2018-07-12 17:53:48,944][ceph_deploy.cli][INFO  ]
overwrite_conf: True

[2018-07-12 17:53:48,944][ceph_deploy.cli][INFO  ]
subcommand: create-initial

[2018-07-12 17:53:48,944][ceph_deploy.cli][INFO  ]  quiet
  : False

[2018-07-12 17:53:48,945][ceph_deploy.cli][INFO  ]
cd_conf   : 

[2018-07-12 17:53:48,945][ceph_deploy.cli][INFO  ]
cluster   : ceph

[2018-07-12 17:53:48,945][ceph_deploy.cli][INFO  ]
func  : 

[2018-07-12 17:53:48,945][ceph_deploy.cli][INFO  ]
ceph_conf : None

[2018-07-12 17:53:48,945][ceph_deploy.cli][INFO  ]
default_release   : False

[2018-07-12 17:53:48,945][ceph_deploy.cli][INFO  ]
keyrings  : None

[2018-07-12 17:53:48,947][ceph_deploy.mon][DEBUG ] Deploying mon, cluster
ceph hosts node1

[2018-07-12 17:53:48,947][ceph_deploy.mon][DEBUG ] detecting platform for
host node1 ...

[2018-07-12 17:53:49,005][node1][DEBUG ] connection detected need for sudo

[2018-07-12 17:53:49,039][node1][DEBUG ] connected to host: node1

[2018-07-12 17:53:49,040][node1][DEBUG ] detect platform information from
remote host

[2018-07-12 17:53:49,073][node1][DEBUG ] detect machine type

[2018-07-12 17:53:49,078][node1][DEBUG ] find the location of an executable

[2018-07-12 17:53:49,079][ceph_deploy.mon][INFO  ] distro info: CentOS
Linux 7.2.1511 Core

[2018-07-12 17:53:49,079][node1][DEBUG ] determining if provided host has
same hostname in remote

[2018-07-12 17:53:49,079][node1][DEBUG ] get remote short hostname

[2018-07-12 17:53:49,080][node1][DEBUG ] deploying mon to node1

[2018-07-12 17:53:49,080][node1][DEBUG ] get remote short hostname

[2018-07-12 17:53:49,081][node1][DEBUG ] remote hostname: node1

[2018-07-12 17:53:49,083][node1][DEBUG ] write cluster configuration to
/etc/ceph/{cluster}.conf

[2018-07-12 17:53:49,084][node1][DEBUG ] create the mon path if it does not
exist

[2018-07-12 17:53:49,085][node1][DEBUG ] checking for done path:
/var/lib/ceph/mon/ceph-node1/done

[2018-07-12 17:53:49,085][node1][DEBUG ] create a done file to avoid
re-doing the mon deployment

[2018-07-12 17:53:49,086][node1][DEBUG ] create the init path if it does
not exist

[2018-07-12 17:53:49,089][node1][INFO  ] Running command: sudo systemctl
enable ceph.target

[2018-07-12 17:53:49,365][node1][INFO  ] Running command: sudo systemctl
enable ceph-mon@node1

[2018-07-12 17:53:49,588][node1][INFO  ] Running command: sudo systemctl
start ceph-mon@node1

[2018-07-12 17:53:51,762][node1][INFO  ] Running command: sudo ceph
--cluster=ceph --admin-daemon /var/run/ceph/ceph-mon.node1.asok mon_status

[2018-07-12 17:53:51,979][node1][DEBUG ]


[2018-07-12 17:53:51,979][node1][DEBUG ] status for monitor: mon.node1

[2018-07-12 17:53:51,980][node1][DEBUG ] {

[2018-07-12 17:53:51,980][node1][DEBUG ]   "election_epoch": 3,

[2018-07-12 17:53:51,980][node1][DEBUG ]   "extra_probe_peers": [],

[2018-07-12 17:53:51,980][node1][DEBUG ]   "feature_map": {

[2018-07-12 17:53:51,981][node1][DEBUG ] "mon": {

[2018-07-12 17:53:51,981][node1][DEBUG ]   "group": {

[2018-07-12 17:53:51,981][node1][DEBUG ] "features":
"0x1ffddff8eea4fffb",

[2018-07-12 17:53:51,981][node1][DEBUG ] "num": 1,

[2018-07-12 17:53:51,981][node1][DEBUG ] "release": "luminous"

[2018-07-12 17:53:51,981][node1][DEBUG ]   }

[2018-07-12 17:53:51,981][node1][DEBUG ] }

[2018-07-12 17:53:51,982][node1][DEBUG ]   },

[2018-07-12 17:53:51,982][node1][DEBUG ]   "features": {

[2018-07-12 

Re: [ceph-users] Recovery from 12.2.5 (corruption) -> 12.2.6 (hair on fire) -> 13.2.0 (some objects inaccessible and CephFS damaged)

2018-07-18 Thread Brad Hubbard
On Thu, Jul 19, 2018 at 2:48 AM, Troy Ablan  wrote:
>
>
> On 07/17/2018 11:14 PM, Brad Hubbard wrote:
>>
>> On Wed, Jul 18, 2018 at 2:57 AM, Troy Ablan  wrote:
>>>
>>> I was on 12.2.5 for a couple weeks and started randomly seeing
>>> corruption, moved to 12.2.6 via yum update on Sunday, and all hell broke
>>> loose.  I panicked and moved to Mimic, and when that didn't solve the
>>> problem, only then did I start to root around in mailing lists archives.
>>>
>>> It appears I can't downgrade OSDs back to Luminous now that 12.2.7 is
>>> out, but I'm unsure how to proceed now that the damaged cluster is
>>> running under Mimic.  Is there anything I can do to get the cluster back
>>> online and objects readable?
>>
>> That depends on what the specific problem is. Can you provide some
>> data that fills in the blanks around "randomly seeing corruption"?
>>
> Thanks for the reply, Brad.  I have a feeling that almost all of this stems
> from the time the cluster spent running 12.2.6.  When booting VMs that use
> rbd as a backing store, they typically get I/O errors during boot and cannot
> read critical parts of the image.  I also get similar errors if I try to rbd
> export most of the images. Also, CephFS is not started as ceph -s indicates
> damage.
>
> Many of the OSDs have been crashing and restarting as I've tried to rbd
> export good versions of images (from older snapshots).  Here's one
> particular crash:
>
> 2018-07-18 15:52:15.809 7fcbaab77700 -1
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/h
> uge/release/13.2.0/rpm/el7/BUILD/ceph-13.2.0/src/os/bluestore/BlueStore.h:
> In function 'void
> BlueStore::SharedBlobSet::remove_last(BlueStore::SharedBlob*)' thread
> 7fcbaab7
> 7700 time 2018-07-18 15:52:15.750916
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.0/rpm/el7/BUILD/ceph-13
> .2.0/src/os/bluestore/BlueStore.h: 455: FAILED assert(sb->nref == 0)
>
>  ceph version 13.2.0 (79a10589f1f80dfe21e8f9794365ed98143071c4) mimic
> (stable)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0xff) [0x7fcbc197a53f]
>  2: (()+0x286727) [0x7fcbc197a727]
>  3: (BlueStore::SharedBlob::put()+0x1da) [0x5641f39181ca]
>  4: (std::_Rb_tree,
> boost::intrusive_ptr,
> std::_Identity >,
> std::less >,
> std::allocator >
>>::_M_erase(std::_Rb_tree_node lueStore::SharedBlob> >*)+0x2d) [0x5641f3977cfd]
>  5: (std::_Rb_tree,
> boost::intrusive_ptr,
> std::_Identity >,
> std::less >,
> std::allocator >
>>::_M_erase(std::_Rb_tree_node lueStore::SharedBlob> >*)+0x1b) [0x5641f3977ceb]
>  6: (std::_Rb_tree,
> boost::intrusive_ptr,
> std::_Identity >,
> std::less >,
> std::allocator >
>>::_M_erase(std::_Rb_tree_node lueStore::SharedBlob> >*)+0x1b) [0x5641f3977ceb]
>  7: (std::_Rb_tree,
> boost::intrusive_ptr,
> std::_Identity >,
> std::less >,
> std::allocator >
>>::_M_erase(std::_Rb_tree_node lueStore::SharedBlob> >*)+0x1b) [0x5641f3977ceb]
>  8: (BlueStore::TransContext::~TransContext()+0xf7) [0x5641f3979297]
>  9: (BlueStore::_txc_finish(BlueStore::TransContext*)+0x610)
> [0x5641f391c9b0]
>  10: (BlueStore::_txc_state_proc(BlueStore::TransContext*)+0x9a)
> [0x5641f392a38a]
>  11: (BlueStore::_kv_finalize_thread()+0x41e) [0x5641f392b3be]
>  12: (BlueStore::KVFinalizeThread::entry()+0xd) [0x5641f397d85d]
>  13: (()+0x7e25) [0x7fcbbe4d2e25]
>  14: (clone()+0x6d) [0x7fcbbd5c3bad]
>  NOTE: a copy of the executable, or `objdump -rdS ` is needed to
> interpret this.
>
>
> Here's the output of ceph -s that might fill in some configuration
> questions.  Since osds are continually restarting if I try to put load on
> it, the cluster seems to be churning a bit.  That's why I set nodown for
> now.
>
>   cluster:
> id: b2873c9a-5539-4c76-ac4a-a6c9829bfed2
> health: HEALTH_ERR
> 1 filesystem is degraded
> 1 filesystem is offline
> 1 mds daemon damaged
> nodown,noscrub,nodeep-scrub flag(s) set
> 9 scrub errors
> Reduced data availability: 61 pgs inactive, 56 pgs peering, 4
> pgs stale
> Possible data damage: 3 pgs inconsistent
> 16 slow requests are blocked > 32 sec
> 26 stuck requests are blocked > 4096 sec
>
>   services:
> mon: 5 daemons, quorum a,b,c,d,e
> mgr: a(active), standbys: b, d, e, c
> mds: lcs-0/1/1 up , 2 up:standby, 1 damaged
> osd: 34 osds: 34 up, 34 in
>  flags nodown,noscrub,nodeep-scrub
>
>   data:
> pools:   15 pools, 640 pgs
> objects: 9.73 M objects, 13 TiB
> usage:   24 TiB used, 55 TiB / 79 TiB avail
> pgs: 23.438% pgs not active
>  487 active+clean
>  73  peering
>  70  activating
>  5   stale+peering
>  3   active+clean+inconsistent
>  2   stale+activating
>
>   io:
> 

Re: [ceph-users] [Ceph-maintainers] v12.2.7 Luminous released

2018-07-18 Thread Linh Vu
Awesome, thank you Sage! With that explanation, it's actually a lot easier and 
less impactful than I thought. :)


Cheers,

Linh


From: Sage Weil 
Sent: Thursday, 19 July 2018 9:35:33 AM
To: Linh Vu
Cc: Stefan Kooman; ceph-de...@vger.kernel.org; ceph-us...@ceph.com; 
ceph-maintain...@ceph.com; ceph-annou...@ceph.com
Subject: Re: [Ceph-maintainers] [ceph-users] v12.2.7 Luminous released

On Wed, 18 Jul 2018, Linh Vu wrote:
> Thanks for all your hard work in putting out the fixes so quickly! :)
>
> We have a cluster on 12.2.5 with Bluestore and EC pool but for CephFS,
> not RGW. In the release notes, it says RGW is a risk especially the
> garbage collection, and the recommendation is to either pause IO or
> disable RGW garbage collection.
>
> In our case with CephFS, not RGW, is it a lot less risky to perform the
> upgrade to 12.2.7 without the need to pause IO?

It is hard to quantify.  I think we only saw the problem with RGW, but
CephFS also sends deletes to non-existent objects when deleting or
truncating sparse files.  Those are probably not too common in most
environments...

> What does pause IO do? Do current sessions just get queued up and IO
> resume normally with no problem after unpausing?

Exactly.  As long as the application doesn't have some timeout coded where
it gives up when a read or write is taking too long, everything will just
pause.

> If we have to pause IO, is it better to do something like: pause IO,
> restart OSDs on one node, unpause IO - repeated for all the nodes
> involved in the EC pool?

Yes, that sounds like a great way to proceed!

sage

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Ceph-maintainers] v12.2.7 Luminous released

2018-07-18 Thread Sage Weil
On Wed, 18 Jul 2018, Linh Vu wrote:
> Thanks for all your hard work in putting out the fixes so quickly! :)
> 
> We have a cluster on 12.2.5 with Bluestore and EC pool but for CephFS, 
> not RGW. In the release notes, it says RGW is a risk especially the 
> garbage collection, and the recommendation is to either pause IO or 
> disable RGW garbage collection.
> 
> In our case with CephFS, not RGW, is it a lot less risky to perform the 
> upgrade to 12.2.7 without the need to pause IO?

It is hard to quantify.  I think we only saw the problem with RGW, but 
CephFS also sends deletes to non-existent objects when deleting or 
truncating sparse files.  Those are probably not too common in most 
environments...

> What does pause IO do? Do current sessions just get queued up and IO 
> resume normally with no problem after unpausing?

Exactly.  As long as the application doesn't have some timeout coded where 
it gives up when a read or write is taking too long, everything will just 
pause.

> If we have to pause IO, is it better to do something like: pause IO, 
> restart OSDs on one node, unpause IO - repeated for all the nodes 
> involved in the EC pool?

Yes, that sounds like a great way to proceed!
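(A rough sketch of that sequence using the cluster-wide pause flag; adjust to 
your own tooling:

   ceph osd set pause          # blocks client reads/writes cluster-wide
   systemctl restart ceph-osd.target   # on the node being restarted
   ceph osd unset pause
)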

sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: MDS memory usage is very high

2018-07-18 Thread Daniel Carrasco
Thanks again,

I tried using the fuse client instead of the Ubuntu 16.04 kernel module to see
if it might be a client-side problem, but CPU usage with the fuse client is
very high (100% and more on a two-core machine), so I had to revert to the
kernel client, which uses much less CPU.

It is a web server, so maybe that is the problem: PHP and Nginx open a lot of
files, and maybe that uses a lot of RAM.

For now I've rebooted the machine because that is the only way to free the
memory, but I cannot restart the machine every few hours...

Greetings!!

2018-07-19 1:00 GMT+02:00 Gregory Farnum :

> Wow, yep, apparently the MDS has another 9GB of allocated RAM outside of
> the cache! Hopefully one of the current FS users or devs has some idea. All
> I can suggest is looking to see if there are a bunch of stuck requests or
> something that are taking up memory which isn’t properly counted.
>
> On Wed, Jul 18, 2018 at 3:48 PM Daniel Carrasco 
> wrote:
>
>> Hello, thanks for your response.
>>
>> This is what I get:
>>
>> # ceph tell mds.kavehome-mgto-pro-fs01  heap stats
>> 2018-07-19 00:43:46.142560 7f5a7a7fc700  0 client.1318388 ms_handle_reset
>> on 10.22.0.168:6800/1129848128
>> 2018-07-19 00:43:46.181133 7f5a7b7fe700  0 client.1318391 ms_handle_reset
>> on 10.22.0.168:6800/1129848128
>> mds.kavehome-mgto-pro-fs01 tcmalloc heap stats:
>> 
>> MALLOC: 9982980144 ( 9520.5 MiB) Bytes in use by application
>> MALLOC: +0 (0.0 MiB) Bytes in page heap freelist
>> MALLOC: +172148208 (  164.2 MiB) Bytes in central cache freelist
>> MALLOC: + 19031168 (   18.1 MiB) Bytes in transfer cache freelist
>> MALLOC: + 23987552 (   22.9 MiB) Bytes in thread cache freelists
>> MALLOC: + 20869280 (   19.9 MiB) Bytes in malloc metadata
>> MALLOC:   
>> MALLOC: =  10219016352 ( 9745.6 MiB) Actual memory used (physical + swap)
>> MALLOC: +   3913687040 ( 3732.4 MiB) Bytes released to OS (aka unmapped)
>> MALLOC:   
>> MALLOC: =  14132703392 (13478.0 MiB) Virtual address space used
>> MALLOC:
>> MALLOC:  63875  Spans in use
>> MALLOC: 16  Thread heaps in use
>> MALLOC:   8192  Tcmalloc page size
>> 
>> Call ReleaseFreeMemory() to release freelist memory to the OS (via
>> madvise()).
>> Bytes released to the OS take up virtual address space but no physical
>> memory.
>>
>>
>> I've tried the release command but it keeps using the same memory.
>>
>> greetings!
>>
>>
>> 2018-07-19 0:25 GMT+02:00 Gregory Farnum :
>>
>>> The MDS think it's using 486MB of cache right now, and while that's
>>> not a complete accounting (I believe you should generally multiply by
>>> 1.5 the configured cache limit to get a realistic memory consumption
>>> model) it's obviously a long way from 12.5GB. You might try going in
>>> with the "ceph daemon" command and looking at the heap stats (I forget
>>> the exact command, but it will tell you if you run "help" against it)
>>> and seeing what those say — you may have one of the slightly-broken
>>> base systems and find that running the "heap release" (or similar
>>> wording) command will free up a lot of RAM back to the OS!
>>> -Greg
>>>
>>> On Wed, Jul 18, 2018 at 1:53 PM, Daniel Carrasco 
>>> wrote:
>>> > Hello,
>>> >
>>> > I've created a 3 nodes cluster with MON, MGR, OSD and MDS on all (2 MDS
>>> > actives), and I've noticed that MDS is using a lot of memory (just now
>>> is
>>> > using 12.5GB of RAM):
>>> > # ceph daemon mds.kavehome-mgto-pro-fs01 dump_mempools | jq -c
>>> '.mds_co';
>>> > ceph daemon mds.kavehome-mgto-pro-fs01 perf dump | jq '.mds_mem.rss'
>>> > {"items":9272259,"bytes":510032260}
>>> > 12466648
>>> >
>>> > I've configured the limit:
>>> > mds_cache_memory_limit = 536870912
>>> >
>>> > But looks like is ignored, because is about 512Mb and is using a lot
>>> more.
>>> >
>>> > Is there any way to limit the memory usage of MDS, because is giving a
>>> lot
>>> > of troubles because start to swap.
>>> > Maybe I've to limit the cached inodes?
>>> >
>>> > The other active MDS is using a lot less memory (2.5Gb). but also is
>>> using
>>> > more than 512Mb. The standby MDS is not using memory it all.
>>> >
>>> > I'm using the version:
>>> > ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5)
>>> luminous
>>> > (stable).
>>> >
>>> > Thanks!!
>>> > --
>>> > _
>>> >
>>> >   Daniel Carrasco Marín
>>> >   Ingeniería para la Innovación i2TIC, S.L.
>>> >   Tlf:  +34 911 12 32 84 Ext: 223
>>> >   www.i2tic.com
>>> > _
>>> >
>>> >
>>> >
>>> > --
>>> > _
>>> >
>>> >   Daniel Carrasco Marín
>>> >   Ingeniería para la Innovación i2TIC, S.L.
>>> >   Tlf:  +34 911 12 32 84 Ext: 223
>>> >   www.i2tic.com
>>> > _
>>> 

Re: [ceph-users] Fwd: MDS memory usage is very high

2018-07-18 Thread Gregory Farnum
Wow, yep, apparently the MDS has another 9GB of allocated RAM outside of
the cache! Hopefully one of the current FS users or devs has some idea. All
I can suggest is looking to see if there are a bunch of stuck requests or
something that are taking up memory which isn’t properly counted.
On Wed, Jul 18, 2018 at 3:48 PM Daniel Carrasco 
wrote:

> Hello, thanks for your response.
>
> This is what I get:
>
> # ceph tell mds.kavehome-mgto-pro-fs01  heap stats
> 2018-07-19 00:43:46.142560 7f5a7a7fc700  0 client.1318388 ms_handle_reset
> on 10.22.0.168:6800/1129848128
> 2018-07-19 00:43:46.181133 7f5a7b7fe700  0 client.1318391 ms_handle_reset
> on 10.22.0.168:6800/1129848128
> mds.kavehome-mgto-pro-fs01 tcmalloc heap
> stats:
> MALLOC: 9982980144 ( 9520.5 MiB) Bytes in use by application
> MALLOC: +0 (0.0 MiB) Bytes in page heap freelist
> MALLOC: +172148208 (  164.2 MiB) Bytes in central cache freelist
> MALLOC: + 19031168 (   18.1 MiB) Bytes in transfer cache freelist
> MALLOC: + 23987552 (   22.9 MiB) Bytes in thread cache freelists
> MALLOC: + 20869280 (   19.9 MiB) Bytes in malloc metadata
> MALLOC:   
> MALLOC: =  10219016352 ( 9745.6 MiB) Actual memory used (physical + swap)
> MALLOC: +   3913687040 ( 3732.4 MiB) Bytes released to OS (aka unmapped)
> MALLOC:   
> MALLOC: =  14132703392 (13478.0 MiB) Virtual address space used
> MALLOC:
> MALLOC:  63875  Spans in use
> MALLOC: 16  Thread heaps in use
> MALLOC:   8192  Tcmalloc page size
> 
> Call ReleaseFreeMemory() to release freelist memory to the OS (via
> madvise()).
> Bytes released to the OS take up virtual address space but no physical
> memory.
>
>
> I've tried the release command but it keeps using the same memory.
>
> greetings!
>
>
> 2018-07-19 0:25 GMT+02:00 Gregory Farnum :
>
>> The MDS think it's using 486MB of cache right now, and while that's
>> not a complete accounting (I believe you should generally multiply by
>> 1.5 the configured cache limit to get a realistic memory consumption
>> model) it's obviously a long way from 12.5GB. You might try going in
>> with the "ceph daemon" command and looking at the heap stats (I forget
>> the exact command, but it will tell you if you run "help" against it)
>> and seeing what those say — you may have one of the slightly-broken
>> base systems and find that running the "heap release" (or similar
>> wording) command will free up a lot of RAM back to the OS!
>> -Greg
>>
>> On Wed, Jul 18, 2018 at 1:53 PM, Daniel Carrasco 
>> wrote:
>> > Hello,
>> >
>> > I've created a 3 nodes cluster with MON, MGR, OSD and MDS on all (2 MDS
>> > actives), and I've noticed that MDS is using a lot of memory (just now
>> is
>> > using 12.5GB of RAM):
>> > # ceph daemon mds.kavehome-mgto-pro-fs01 dump_mempools | jq -c
>> '.mds_co';
>> > ceph daemon mds.kavehome-mgto-pro-fs01 perf dump | jq '.mds_mem.rss'
>> > {"items":9272259,"bytes":510032260}
>> > 12466648
>> >
>> > I've configured the limit:
>> > mds_cache_memory_limit = 536870912
>> >
>> > But looks like is ignored, because is about 512Mb and is using a lot
>> more.
>> >
>> > Is there any way to limit the memory usage of MDS, because is giving a
>> lot
>> > of troubles because start to swap.
>> > Maybe I've to limit the cached inodes?
>> >
>> > The other active MDS is using a lot less memory (2.5Gb). but also is
>> using
>> > more than 512Mb. The standby MDS is not using memory it all.
>> >
>> > I'm using the version:
>> > ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous
>> > (stable).
>> >
>> > Thanks!!
>> > --
>> > _
>> >
>> >   Daniel Carrasco Marín
>> >   Ingeniería para la Innovación i2TIC, S.L.
>> >   Tlf:  +34 911 12 32 84 Ext: 223
>> >   www.i2tic.com
>> > _
>> >
>> >
>> >
>> > --
>> > _
>> >
>> >   Daniel Carrasco Marín
>> >   Ingeniería para la Innovación i2TIC, S.L.
>> >   Tlf:  +34 911 12 32 84 Ext: 223
>> >   www.i2tic.com
>> > _
>> >
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >
>>
>
>
>
> --
> _
>
>   Daniel Carrasco Marín
>   Ingeniería para la Innovación i2TIC, S.L.
>   Tlf:  +34 911 12 32 84 Ext: 223
>   www.i2tic.com
> _
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: MDS memory usage is very high

2018-07-18 Thread Daniel Carrasco
Hello, thanks for your response.

This is what I get:

# ceph tell mds.kavehome-mgto-pro-fs01  heap stats
2018-07-19 00:43:46.142560 7f5a7a7fc700  0 client.1318388 ms_handle_reset
on 10.22.0.168:6800/1129848128
2018-07-19 00:43:46.181133 7f5a7b7fe700  0 client.1318391 ms_handle_reset
on 10.22.0.168:6800/1129848128
mds.kavehome-mgto-pro-fs01 tcmalloc heap
stats:
MALLOC: 9982980144 ( 9520.5 MiB) Bytes in use by application
MALLOC: +0 (0.0 MiB) Bytes in page heap freelist
MALLOC: +172148208 (  164.2 MiB) Bytes in central cache freelist
MALLOC: + 19031168 (   18.1 MiB) Bytes in transfer cache freelist
MALLOC: + 23987552 (   22.9 MiB) Bytes in thread cache freelists
MALLOC: + 20869280 (   19.9 MiB) Bytes in malloc metadata
MALLOC:   
MALLOC: =  10219016352 ( 9745.6 MiB) Actual memory used (physical + swap)
MALLOC: +   3913687040 ( 3732.4 MiB) Bytes released to OS (aka unmapped)
MALLOC:   
MALLOC: =  14132703392 (13478.0 MiB) Virtual address space used
MALLOC:
MALLOC:  63875  Spans in use
MALLOC: 16  Thread heaps in use
MALLOC:   8192  Tcmalloc page size

Call ReleaseFreeMemory() to release freelist memory to the OS (via
madvise()).
Bytes released to the OS take up virtual address space but no physical
memory.


I've tried the release command but it keeps using the same memory.

greetings!


2018-07-19 0:25 GMT+02:00 Gregory Farnum :

> The MDS think it's using 486MB of cache right now, and while that's
> not a complete accounting (I believe you should generally multiply by
> 1.5 the configured cache limit to get a realistic memory consumption
> model) it's obviously a long way from 12.5GB. You might try going in
> with the "ceph daemon" command and looking at the heap stats (I forget
> the exact command, but it will tell you if you run "help" against it)
> and seeing what those say — you may have one of the slightly-broken
> base systems and find that running the "heap release" (or similar
> wording) command will free up a lot of RAM back to the OS!
> -Greg
>
> On Wed, Jul 18, 2018 at 1:53 PM, Daniel Carrasco 
> wrote:
> > Hello,
> >
> > I've created a 3 nodes cluster with MON, MGR, OSD and MDS on all (2 MDS
> > actives), and I've noticed that MDS is using a lot of memory (just now is
> > using 12.5GB of RAM):
> > # ceph daemon mds.kavehome-mgto-pro-fs01 dump_mempools | jq -c '.mds_co';
> > ceph daemon mds.kavehome-mgto-pro-fs01 perf dump | jq '.mds_mem.rss'
> > {"items":9272259,"bytes":510032260}
> > 12466648
> >
> > I've configured the limit:
> > mds_cache_memory_limit = 536870912
> >
> > But looks like is ignored, because is about 512Mb and is using a lot
> more.
> >
> > Is there any way to limit the memory usage of MDS, because is giving a
> lot
> > of troubles because start to swap.
> > Maybe I've to limit the cached inodes?
> >
> > The other active MDS is using a lot less memory (2.5Gb). but also is
> using
> > more than 512Mb. The standby MDS is not using memory it all.
> >
> > I'm using the version:
> > ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous
> > (stable).
> >
> > Thanks!!
> > --
> > _
> >
> >   Daniel Carrasco Marín
> >   Ingeniería para la Innovación i2TIC, S.L.
> >   Tlf:  +34 911 12 32 84 Ext: 223
> >   www.i2tic.com
> > _
> >
> >
> >
> > --
> > _
> >
> >   Daniel Carrasco Marín
> >   Ingeniería para la Innovación i2TIC, S.L.
> >   Tlf:  +34 911 12 32 84 Ext: 223
> >   www.i2tic.com
> > _
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>



-- 
_

  Daniel Carrasco Marín
  Ingeniería para la Innovación i2TIC, S.L.
  Tlf:  +34 911 12 32 84 Ext: 223
  www.i2tic.com
_
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: MDS memory usage is very high

2018-07-18 Thread Gregory Farnum
The MDS thinks it's using 486MB of cache right now, and while that's
not a complete accounting (I believe you should generally multiply by
1.5 the configured cache limit to get a realistic memory consumption
model) it's obviously a long way from 12.5GB. You might try going in
with the "ceph daemon" command and looking at the heap stats (I forget
the exact command, but it will tell you if you run "help" against it)
and seeing what those say — you may have one of the slightly-broken
base systems and find that running the "heap release" (or similar
wording) command will free up a lot of RAM back to the OS!
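(For reference, the invocations are along the lines of:

   ceph tell mds.<name> heap stats
   ceph tell mds.<name> heap release
)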
-Greg

On Wed, Jul 18, 2018 at 1:53 PM, Daniel Carrasco  wrote:
> Hello,
>
> I've created a 3 nodes cluster with MON, MGR, OSD and MDS on all (2 MDS
> actives), and I've noticed that MDS is using a lot of memory (just now is
> using 12.5GB of RAM):
> # ceph daemon mds.kavehome-mgto-pro-fs01 dump_mempools | jq -c '.mds_co';
> ceph daemon mds.kavehome-mgto-pro-fs01 perf dump | jq '.mds_mem.rss'
> {"items":9272259,"bytes":510032260}
> 12466648
>
> I've configured the limit:
> mds_cache_memory_limit = 536870912
>
> But looks like is ignored, because is about 512Mb and is using a lot more.
>
> Is there any way to limit the memory usage of MDS, because is giving a lot
> of troubles because start to swap.
> Maybe I've to limit the cached inodes?
>
> The other active MDS is using a lot less memory (2.5Gb). but also is using
> more than 512Mb. The standby MDS is not using memory it all.
>
> I'm using the version:
> ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous
> (stable).
>
> Thanks!!
> --
> _
>
>   Daniel Carrasco Marín
>   Ingeniería para la Innovación i2TIC, S.L.
>   Tlf:  +34 911 12 32 84 Ext: 223
>   www.i2tic.com
> _
>
>
>
> --
> _
>
>   Daniel Carrasco Marín
>   Ingeniería para la Innovación i2TIC, S.L.
>   Tlf:  +34 911 12 32 84 Ext: 223
>   www.i2tic.com
> _
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Fwd: MDS memory usage is very high

2018-07-18 Thread Daniel Carrasco
Hello,

I've created a 3-node cluster with MON, MGR, OSD and MDS on all nodes (2
active MDS), and I've noticed that the MDS is using a lot of memory (right now
it is using 12.5GB of RAM):
# ceph daemon mds.kavehome-mgto-pro-fs01 dump_mempools | jq -c '.mds_co';
ceph daemon mds.kavehome-mgto-pro-fs01 perf dump | jq '.mds_mem.rss'
{"items":9272259,"bytes":510032260}
12466648

I've configured the limit:
mds_cache_memory_limit = 536870912

But it looks like it is ignored, because that is about 512MB and the MDS is using a lot more.

Is there any way to limit the memory usage of the MDS? It is causing a lot of
trouble because the machine starts to swap.
Maybe I have to limit the number of cached inodes?
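(For what it's worth, I assume the value the daemon actually runs with could be
double-checked with something like:

   ceph daemon mds.kavehome-mgto-pro-fs01 config get mds_cache_memory_limit
)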

The other active MDS is using a lot less memory (2.5GB), but it also uses
more than 512MB. The standby MDS is not using memory at all.

I'm using the version:
ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous
(stable).

Thanks!!
-- 
_

  Daniel Carrasco Marín
  Ingeniería para la Innovación i2TIC, S.L.
  Tlf:  +34 911 12 32 84 Ext: 223
  www.i2tic.com
_



-- 
_

  Daniel Carrasco Marín
  Ingeniería para la Innovación i2TIC, S.L.
  Tlf:  +34 911 12 32 84 Ext: 223
  www.i2tic.com
_
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Migrating EC pool to device-class crush rules

2018-07-18 Thread Graham Allan
Like many, we have a typical double root crush map, for hdd vs ssd-based 
pools. We've been running luminous for some time, so in preparation for a 
migration to new storage hardware, I wanted to migrate our pools to use 
the new device-class based rules; this way I shouldn't need to 
perpetuate the double hdd/ssd crush map for new hardware...


I understand how to migrate our replicated pools, by creating new 
replicated crush rules, and migrating them one at a time, but I'm 
confused on how to do this for erasure pools.


I can create a new class-aware EC profile something like:


ceph osd erasure-code-profile set ecprofile42_hdd k=4 m=2 
crush-device-class=hdd crush-failure-domain=host


then a new crush rule from this:


ceph osd crush rule create-erasure ec42_hdd ecprofile42_hdd


So mostly I want to confirm that it is safe to change the crush rule for 
the EC pool. It seems to make sense, but then, as I understand it, you 
can't change the erasure code profile for a pool after creation; but 
this seems to implicitly do so...


old rule:

rule .rgw.buckets.ec42 {
id 17
type erasure
min_size 3
max_size 20
step set_chooseleaf_tries 5
step take platter
step chooseleaf indep 0 type host
step emit
}


old ec profile:

# ceph osd erasure-code-profile get ecprofile42
crush-failure-domain=host
directory=/usr/lib/x86_64-linux-gnu/ceph/erasure-code
k=4
m=2
plugin=jerasure
technique=reed_sol_van


new rule:

rule ec42_hdd {
id 7
type erasure
min_size 3
max_size 6
step set_chooseleaf_tries 5
step set_choose_tries 100
step take default class hdd
step chooseleaf indep 0 type host
step emit
}


new ec profile:

# ceph osd erasure-code-profile get ecprofile42_hdd
crush-device-class=hdd
crush-failure-domain=host
crush-root=default
jerasure-per-chunk-alignment=false
k=4
m=2
plugin=jerasure
technique=reed_sol_van
w=8


These are both ec42 but I'm not sure why the old rule has "max size 20" 
(perhaps because it was generated a long time ago under hammer?).


Thanks for any feedback,

Graham
--
Graham Allan
Minnesota Supercomputing Institute - g...@umn.edu
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Need advice on Ceph design

2018-07-18 Thread Satish Patel
Thanks Sebastien,

Let me answer all of your questions which I missed earlier. Let me tell you,
this is my first cluster, so I have no idea what would be best or worst
here. Also, you said we don't need an SSD journal for BlueStore, but I have
heard people saying the WAL/RocksDB requires an SSD; can you explain?

If I have 500GB 7.5k SATA HDDs, will running the WAL/RocksDB on the same
OSD disk slow things down?




On Wed, Jul 18, 2018 at 2:42 PM, Sébastien VIGNERON
 wrote:
> Hello,
>
> What is your expected workload? VMs, primary storage, backup, objects 
> storage, ...?

All VMs only (we are running OpenStack and I need an HA solution,
live migration, etc.)

> How many disks do you plan to put in each OSD node?

6 disks per OSD node (I have Samsung 850 EVO Pro 500GB & SATA 500GB 7.5k)

> How many CPU cores? How many RAM per nodes?

2.9GHz (32 cores in /proc/cpuinfo)

> Ceph access protocol(s): CephFS, RBD or objects?

RBD only

> How do you plan to give access to the storage to you client? NFS, SMB, 
> CephFS, ...?

Openstack Nova / Cinder

> Replicated pools or EC pools? If EC, k and m factors?

I hadn't thought about it. This is my first cluster, so I don't know what would be best.

> What OS (for ceph nodes and clients)?

CentOS7.5  (Linux)

>
> Recommandations:
>  - For your information, Bluestore is not like Filestore, no need to have 
> journal SSD. It's recommended for Bluestore to use the same disk for both 
> WAL/RocksDB and datas.
>  - For production, it's recommended to have dedicated MON/MGR nodes.
>  - You may also need dedicated MDS nodes, depending the CEPH access 
> protocol(s) you choose.
>  - If you need commercial support afterward, you should see with a Redhat 
> representative.
>
> Samsung 850 pro is consumer grade, not great.
>
>
>> Le 18 juil. 2018 à 19:16, Satish Patel  a écrit :
>>
>> I have decided to setup 5 node Ceph storage and following is my
>> inventory, just tell me is it good to start first cluster for average
>> load.
>>
>> 0. Ceph Bluestore
>> 1. Journal SSD (Intel DC 3700)
>> 2. OSD disk Samsung 850 Pro 500GB
>> 3. OSD disk SATA 500GB (7.5k RPMS)
>> 4. 2x10G NIC (separate public/cluster with JumboFrame)
>>
>> Do you thin this combination is good for average load?
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph Community Manager

2018-07-18 Thread Sage Weil
Hi everyone,

Leo Vaz has moved on from his community manager role.  I'd like to take 
this opportunity to thank him for his efforts over the past year, and to 
wish him the best in his future ventures.  We've accomplished a lot during 
his tenure (including our first Cephalocon!) and Leo's efforts have helped 
make it all possible.  Thank you, Leo!

Until we identify someone else to fill the role, please direct any 
community related matters to either me or Stormy Peters 
.  Likewise, if you are interested in the role, or 
have someone in mind who would be a good fit for the Ceph community, 
please let us know!

Thanks-
sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Need advice on Ceph design

2018-07-18 Thread Sébastien VIGNERON
Hello,

What is your expected workload? VMs, primary storage, backup, objects storage, 
...? 
How many disks do you plan to put in each OSD node?
How many CPU cores? How many RAM per nodes?
Ceph access protocol(s): CephFS, RBD or objects?
How do you plan to give access to the storage to your clients? NFS, SMB, CephFS, 
...?
Replicated pools or EC pools? If EC, k and m factors?
What OS (for ceph nodes and clients)?

Recommandations:
 - For your information, Bluestore is not like Filestore: there is no need for a 
journal SSD. It's recommended for Bluestore to use the same disk for both 
WAL/RocksDB and data (see the sketch after this list).
 - For production, it's recommended to have dedicated MON/MGR nodes.
 - You may also need dedicated MDS nodes, depending on the Ceph access protocol(s) 
you choose.
 - If you need commercial support afterward, you should talk to a Redhat 
representative.
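A minimal sketch of the two BlueStore layouts with ceph-volume (the device paths 
are only examples):

   # data and WAL/RocksDB collocated on the same disk
   ceph-volume lvm create --bluestore --data /dev/sdb

   # or, if a faster device is available for the DB/WAL
   ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1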

Samsung 850 pro is consumer grade, not great.


> Le 18 juil. 2018 à 19:16, Satish Patel  a écrit :
> 
> I have decided to setup 5 node Ceph storage and following is my
> inventory, just tell me is it good to start first cluster for average
> load.
> 
> 0. Ceph Bluestore
> 1. Journal SSD (Intel DC 3700)
> 2. OSD disk Samsung 850 Pro 500GB
> 3. OSD disk SATA 500GB (7.5k RPMS)
> 4. 2x10G NIC (separate public/cluster with JumboFrame)
> 
> Do you thin this combination is good for average load?
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] krbd vs librbd performance with qemu

2018-07-18 Thread Alexandre DERUMIER
Hi,

qemu uses only 1 thread per disk; generally the performance limitation comes 
from the cpu.

(you can have 1 thread for each disk using iothread).

I'm not sure how it works with krbd, but with librbd and the qemu rbd driver, 
it only uses 1 core per disk.

So, you need a fast cpu frequency, and you should disable the rbd cache, disable 
client debug logging, or use other options which can lower cpu usage on the 
client side.
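For example, attaching an iothread to a virtio disk looks roughly like this on 
the qemu command line (the pool/image names are only placeholders):

   qemu-system-x86_64 ... \
     -object iothread,id=iothread0 \
     -drive file=rbd:nvme/xxx,format=raw,if=none,cache=none,id=drive0 \
     -device virtio-blk-pci,drive=drive0,iothread=iothread0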



- Mail original -
De: "Nikola Ciprich" 
À: "ceph-users" 
Cc: "nik" 
Envoyé: Mercredi 18 Juillet 2018 16:54:58
Objet: [ceph-users] krbd vs librbd performance with qemu

Hi, 

Historically I've found many discussions about this topic over the 
last few years, but it seems to me it is still a bit unresolved, 
so I'd like to open the question again.. 

In all-flash deployments, under 12.2.5 luminous and qemu 12.2.0 
using librbd, I'm getting much worse results regarding IOPS than 
with KRBD and direct block device access.. 

I'm testing on the same 100GB RBD volume, notable ceph settings: 

client rbd cache disabled 
osd_enable_op_tracker = False 
osd_op_num_shards = 64 
osd_op_num_threads_per_shard = 1 

osds are running bluestore, 2 replicas (it's just for testing) 

when I run FIO using librbd directly, I'm getting ~160k reads/s 
and ~60k writes/s which is not that bad. 

however when I run fio on block device under VM (qemu using librbd), 
I'm getting only 60/40K op/s which is a huge loss.. 

when I use VM with block access to krbd mapped device, numbers 
are much better, I'm getting something like 115/40K op/s which 
is not ideal, but still much better.. tried many optimisations 
and configuration variants (multiple queues, threads vs native aio 
etc), but krbd still performs much much better.. 

My question is whether this is expected, or should both access methods 
give more similar results? If possible, I'd like to stick to librbd 
(especially because krbd still lacks layering support, but there are 
more reasons) 

Interestingly, when I compare direct ceph access from fio, librbd performs 
better than KRBD, but this doesn't concern me that much.. 

another question: during the tests, I noticed that enabling the exclusive-lock 
feature degrades write iops a lot as well, is this expected? (the performance 
falls to something like 50%) 

I'm doing the tests on a small 2-node cluster; the VMs are running directly on 
the ceph nodes, and everything is CentOS 7 with a 4.14 kernel. (I know it's not 
recommended to run VMs directly on ceph nodes, but for small deployments it's 
necessary for us) 

if I could provide more details, I'll be happy to do so 

BR 

nik 


-- 
- 
Ing. Nikola CIPRICH 
LinuxBox.cz, s.r.o. 
28.rijna 168, 709 00 Ostrava 

tel.: +420 591 166 214 
fax: +420 596 621 273 
mobil: +420 777 093 799 
www.linuxbox.cz 

mobil servis: +420 737 238 656 
email servis: ser...@linuxbox.cz 
- 
___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] krbd vs librbd performance with qemu

2018-07-18 Thread Jason Dillaman
On Wed, Jul 18, 2018 at 1:08 PM Nikola Ciprich 
wrote:

> > Care to share your "bench-rbd" script (on pastebin or similar)?
> sure, no problem.. it's so short I hope nobody will get offended if I
> paste it right
> here :)
>
> #!/bin/bash
>
> #export LD_PRELOAD="/usr/lib64/libtcmalloc.so.4"
> numjobs=8
> pool=nvme
> vol=xxx
> time=30
>
> opts="--randrepeat=1 --ioengine=rbd --direct=1 --numjobs=${numjobs}
> --gtod_reduce=1 --name=test --pool=${pool} --rbdname=${vol} --invalidate=0
> --bs=4k --iodepth=64 --time_based --runtime=$time --group_reporting"
>

So that "--numjobs" parameter is what I was referring to when I said
multiple jobs will cause a huge performance it. This causes fio to open the
same image X images, so with (nearly) each write operation, the
exclusive-lock is being moved from client-to-client. Instead of multiple
jobs against the same image, you should use multiple images.


> sopts="--randrepeat=1 --ioengine=rbd --direct=1 --numjobs=1
> --gtod_reduce=1 --name=test --pool=${pool} --rbdname=${vol} --invalidate=0
> --bs=256k --iodepth=64 --time_based --runtime=$time --group_reporting"
>
> #fio $sopts --readwrite=read --output=rbd-fio-seqread.log
> echo
>
> #fio $sopts --readwrite=write --output=rbd-fio-seqwrite.log
> echo
>
> fio $opts --readwrite=randread --output=rbd-fio-randread.log
> echo
>
> fio $opts --readwrite=randwrite --output=rbd-fio-randwrite.log
> echo
>
>
> hope it's of some use..
>
> n.
>
>
> --
> -
> Ing. Nikola CIPRICH
> LinuxBox.cz, s.r.o.
> 28. rijna 168, 709 00 Ostrava
>
> tel.:   +420 591 166 214
> fax:+420 596 621 273
> mobil:  +420 777 093 799
>
> www.linuxbox.cz
>
> mobil servis: +420 737 238 656
> email servis: ser...@linuxbox.cz
> -
>


-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Exact scope of OSD heartbeating?

2018-07-18 Thread Anthony D'Atri
Thanks, Dan.  I thought so but wanted to verify.  I'll see if I can work up a 
doc PR to clarify this.

>> The documentation here:
>> 
>> http://docs.ceph.com/docs/master/rados/configuration/mon-osd-interaction/
>> 
>> says
>> 
>> "Each Ceph OSD Daemon checks the heartbeat of other Ceph OSD Daemons every 6 
>> seconds"
>> 
>> and
>> 
>> " If a neighboring Ceph OSD Daemon doesn’t show a heartbeat within a 20 
>> second grace period, the Ceph OSD Daemon may consider the neighboring Ceph 
>> OSD Daemon down and report it back to a Ceph Monitor,"
>> 
>> I've always thought that each OSD heartbeats with *every* other OSD, which 
>> of course means that total heartbeat traffic grows ~ quadratically.  However 
>> in extending test we've observed that the number of other OSDs that a 
>> subject heartbeat (heartbeated?) was < N, which has us wondering if perhaps 
>> only OSDs with which a given OSD shares are contacted -- or some other 
>> subset.
>> 
> 
> OSDs heartbeat with their peers, the set of osds with whom they share
> at least one PG.
> You can see the heartbeat peers (HB_PEERS) in ceph pg dump -- after
> the header "OSD_STAT USED  AVAIL TOTAL HB_PEERS..."
> 
> This is one of the nice features of the placement group concept --
> heartbeats and peering in general stays constant with the number of
> PGs per OSD, rather than scaling up with the total number of OSDs in a
> cluster.
> 
> Cheers, Dan
> 
> 
>> I plan to submit a doc fix for mon_osd_min_down_reporters and wanted to 
>> resolve this FUD first.
>> 
>> -- aad
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Need advice on Ceph design

2018-07-18 Thread Satish Patel
I have decided to set up a 5 node Ceph storage cluster and the following is my
inventory; just tell me whether it is good enough to start a first cluster for an
average load.

0. Ceph Bluestore
1. Journal SSD (Intel DC 3700)
2. OSD disk Samsung 850 Pro 500GB
3. OSD disk SATA 500GB (7.5k RPM) 
4. 2x10G NIC (separate public/cluster with JumboFrame)

Do you think this combination is good for an average load?
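
(For context, with Bluestore the DC S3700 would act as a block.db/WAL device
rather than a filestore journal; a sketch of creating one such OSD with
ceph-volume, where the device names are only placeholders:)

ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/sdc1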
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] krbd vs librbd performance with qemu

2018-07-18 Thread Nikola Ciprich
> Care to share your "bench-rbd" script (on pastebin or similar)?
sure, no problem.. it's so short I hope nobody will get offended if I paste it 
right
here :)

#!/bin/bash

#export LD_PRELOAD="/usr/lib64/libtcmalloc.so.4"
numjobs=8
pool=nvme
vol=xxx
time=30

opts="--randrepeat=1 --ioengine=rbd --direct=1 --numjobs=${numjobs} 
--gtod_reduce=1 --name=test --pool=${pool} --rbdname=${vol} --invalidate=0 
--bs=4k --iodepth=64 --time_based --runtime=$time --group_reporting"

sopts="--randrepeat=1 --ioengine=rbd --direct=1 --numjobs=1 --gtod_reduce=1 
--name=test --pool=${pool} --rbdname=${vol} --invalidate=0 --bs=256k 
--iodepth=64 --time_based --runtime=$time --group_reporting"

#fio $sopts --readwrite=read --output=rbd-fio-seqread.log
echo

#fio $sopts --readwrite=write --output=rbd-fio-seqwrite.log
echo

fio $opts --readwrite=randread --output=rbd-fio-randread.log
echo

fio $opts --readwrite=randwrite --output=rbd-fio-randwrite.log
echo


hope it's of some use..

n.


-- 
-
Ing. Nikola CIPRICH
LinuxBox.cz, s.r.o.
28. rijna 168, 709 00 Ostrava

tel.:   +420 591 166 214
fax:+420 596 621 273
mobil:  +420 777 093 799

www.linuxbox.cz

mobil servis: +420 737 238 656
email servis: ser...@linuxbox.cz
-


pgp1e1Nvjmsdz.pgp
Description: PGP signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] krbd vs librbd performance with qemu

2018-07-18 Thread Jason Dillaman
On Wed, Jul 18, 2018 at 12:58 PM Nikola Ciprich 
wrote:

> > What's the output from "rbd info nvme/centos7"?
> that was it! the parent had some of unsupported features
> enabled, therefore the child could not be mapped..
>
> so the error message is a bit confusing, but now after disabling
> the features on the parent it works for me, thanks!
>
> > Odd. The exclusive-lock code is only executed once (in general) upon the
> > first write IO (or immediately upon mapping the image if the "exclusive"
> > option is passed to the kernel). Therefore, it should have zero impact on
> > IO performance.
>
> hmm, then I might have found a bug..
>
> [root@v4a bench1]# sh bench-rbd
> Jobs: 8 (f=8): [r(8)][100.0%][r=671MiB/s,w=0KiB/s][r=172k,w=0 IOPS][eta
> 00m:00s]
> Jobs: 8 (f=8): [w(8)][100.0%][r=0KiB/s,w=230MiB/s][r=0,w=58.8k IOPS][eta
> 00m:00s]
>
> [root@v4a bench1]# rbd feature enable nvme/xxx exclusive-lock
> [root@v4a bench1]# sh bench-rbd
> Jobs: 8 (f=8): [r(8)][100.0%][r=651MiB/s,w=0KiB/s][r=167k,w=0 IOPS][eta
> 00m:00s]
> Jobs: 8 (f=8): [w(8)][100.0%][r=0KiB/s,w=45.9MiB/s][r=0,w=11.7k IOPS][eta
> 00m:00s]
>
> (as you can see, the performance impact is even worse..)
>
> I guess I should create a bug report for this one?
>

Care to share your "bench-rbd" script (on pastebin or similar)?


>
> nik
>
>
>
> >
> >
> > >
> > > BR
> > >
> > > nik
> > >
> > >
> > > --
> > > -
> > > Ing. Nikola CIPRICH
> > > LinuxBox.cz, s.r.o.
> > > 28. rijna 168, 709 00 Ostrava
> > >
> > > tel.:   +420 591 166 214
> > > fax:+420 596 621 273
> > > mobil:  +420 777 093 799
> > >
> > > www.linuxbox.cz
> > >
> > > mobil servis: +420 737 238 656
> > > email servis: ser...@linuxbox.cz
> > > -
> > >
> >
> >
> > --
> > Jason
>
> --
> -
> Ing. Nikola CIPRICH
> LinuxBox.cz, s.r.o.
> 28. rijna 168, 709 00 Ostrava
>
> tel.:   +420 591 166 214
> fax:+420 596 621 273
> mobil:  +420 777 093 799
>
> www.linuxbox.cz
>
> mobil servis: +420 737 238 656
> email servis: ser...@linuxbox.cz
> -
>


-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] krbd vs librbd performance with qemu

2018-07-18 Thread Nikola Ciprich
> What's the output from "rbd info nvme/centos7"?
that was it! the parent had some unsupported features
enabled, therefore the child could not be mapped..

so the error message is a bit confusing, but now after disabling
the features on the parent it works for me, thanks!
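
(for the record, the unsupported 0x38 from dmesg decodes to object-map,
fast-diff and deep-flatten, so the disable was something along the lines of:

rbd feature disable nvme/centos7 object-map fast-diff deep-flatten
)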

> Odd. The exclusive-lock code is only executed once (in general) upon the
> first write IO (or immediately upon mapping the image if the "exclusive"
> option is passed to the kernel). Therefore, it should have zero impact on
> IO performance.

hmm, then I might have found a bug..

[root@v4a bench1]# sh bench-rbd 
Jobs: 8 (f=8): [r(8)][100.0%][r=671MiB/s,w=0KiB/s][r=172k,w=0 IOPS][eta 00m:00s]
Jobs: 8 (f=8): [w(8)][100.0%][r=0KiB/s,w=230MiB/s][r=0,w=58.8k IOPS][eta 
00m:00s]

[root@v4a bench1]# rbd feature enable nvme/xxx exclusive-lock
[root@v4a bench1]# sh bench-rbd
Jobs: 8 (f=8): [r(8)][100.0%][r=651MiB/s,w=0KiB/s][r=167k,w=0 IOPS][eta 00m:00s]
Jobs: 8 (f=8): [w(8)][100.0%][r=0KiB/s,w=45.9MiB/s][r=0,w=11.7k IOPS][eta 
00m:00s]

(as you can see, the performance impact is even worse..)

I guess I should create a bug report for this one?

nik



> 
> 
> >
> > BR
> >
> > nik
> >
> >
> > --
> > -
> > Ing. Nikola CIPRICH
> > LinuxBox.cz, s.r.o.
> > 28. rijna 168, 709 00 Ostrava
> >
> > tel.:   +420 591 166 214
> > fax:+420 596 621 273
> > mobil:  +420 777 093 799
> >
> > www.linuxbox.cz
> >
> > mobil servis: +420 737 238 656
> > email servis: ser...@linuxbox.cz
> > -
> >
> 
> 
> -- 
> Jason

-- 
-
Ing. Nikola CIPRICH
LinuxBox.cz, s.r.o.
28. rijna 168, 709 00 Ostrava

tel.:   +420 591 166 214
fax:+420 596 621 273
mobil:  +420 777 093 799

www.linuxbox.cz

mobil servis: +420 737 238 656
email servis: ser...@linuxbox.cz
-


pgp1jfmQEJQAu.pgp
Description: PGP signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Recovery from 12.2.5 (corruption) -> 12.2.6 (hair on fire) -> 13.2.0 (some objects inaccessible and CephFS damaged)

2018-07-18 Thread Troy Ablan



On 07/17/2018 11:14 PM, Brad Hubbard wrote:

On Wed, Jul 18, 2018 at 2:57 AM, Troy Ablan  wrote:

I was on 12.2.5 for a couple weeks and started randomly seeing
corruption, moved to 12.2.6 via yum update on Sunday, and all hell broke
loose.  I panicked and moved to Mimic, and when that didn't solve the
problem, only then did I start to root around in mailing lists archives.

It appears I can't downgrade OSDs back to Luminous now that 12.2.7 is
out, but I'm unsure how to proceed now that the damaged cluster is
running under Mimic.  Is there anything I can do to get the cluster back
online and objects readable?

That depends on what the specific problem is. Can you provide some
data that fills in the blanks around "randomly seeing corruption"?

Thanks for the reply, Brad.  I have a feeling that almost all of this 
stems from the time the cluster spent running 12.2.6.  When booting VMs 
that use rbd as a backing store, they typically get I/O errors during 
boot and cannot read critical parts of the image.  I also get similar 
errors if I try to rbd export most of the images. Also, CephFS is not 
started as ceph -s indicates damage.


Many of the OSDs have been crashing and restarting as I've tried to rbd 
export good versions of images (from older snapshots).  Here's one 
particular crash:


2018-07-18 15:52:15.809 7fcbaab77700 -1 
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/h
uge/release/13.2.0/rpm/el7/BUILD/ceph-13.2.0/src/os/bluestore/BlueStore.h: 
In function 'void 
BlueStore::SharedBlobSet::remove_last(BlueStore::SharedBlob*)' thread 
7fcbaab7

7700 time 2018-07-18 15:52:15.750916
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.0/rpm/el7/BUILD/ceph-13
.2.0/src/os/bluestore/BlueStore.h: 455: FAILED assert(sb->nref == 0)

 ceph version 13.2.0 (79a10589f1f80dfe21e8f9794365ed98143071c4) mimic 
(stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0xff) [0x7fcbc197a53f]

 2: (()+0x286727) [0x7fcbc197a727]
 3: (BlueStore::SharedBlob::put()+0x1da) [0x5641f39181ca]
 4: (std::_Rb_tree, 
boost::intrusive_ptr, 
std::_Identity >,
std::less >, 
std::allocator > 
>::_M_erase(std::_Rb_tree_node
lueStore::SharedBlob> >*)+0x2d) [0x5641f3977cfd]
 5: (std::_Rb_tree, 
boost::intrusive_ptr, 
std::_Identity >,
std::less >, 
std::allocator > 
>::_M_erase(std::_Rb_tree_node
lueStore::SharedBlob> >*)+0x1b) [0x5641f3977ceb]
 6: (std::_Rb_tree, 
boost::intrusive_ptr, 
std::_Identity >,
std::less >, 
std::allocator > 
>::_M_erase(std::_Rb_tree_node
lueStore::SharedBlob> >*)+0x1b) [0x5641f3977ceb]
 7: (std::_Rb_tree, 
boost::intrusive_ptr, 
std::_Identity >,
std::less >, 
std::allocator > 
>::_M_erase(std::_Rb_tree_node
lueStore::SharedBlob> >*)+0x1b) [0x5641f3977ceb]
 8: (BlueStore::TransContext::~TransContext()+0xf7) [0x5641f3979297]
 9: (BlueStore::_txc_finish(BlueStore::TransContext*)+0x610) 
[0x5641f391c9b0]
 10: (BlueStore::_txc_state_proc(BlueStore::TransContext*)+0x9a) 
[0x5641f392a38a]

 11: (BlueStore::_kv_finalize_thread()+0x41e) [0x5641f392b3be]
 12: (BlueStore::KVFinalizeThread::entry()+0xd) [0x5641f397d85d]
 13: (()+0x7e25) [0x7fcbbe4d2e25]
 14: (clone()+0x6d) [0x7fcbbd5c3bad]
 NOTE: a copy of the executable, or `objdump -rdS ` is 
needed to interpret this.



Here's the output of ceph -s that might fill in some configuration 
questions.  Since osds are continually restarting if I try to put load 
on it, the cluster seems to be churning a bit.  That's why I set nodown 
for now.


  cluster:
    id: b2873c9a-5539-4c76-ac4a-a6c9829bfed2
    health: HEALTH_ERR
    1 filesystem is degraded
    1 filesystem is offline
    1 mds daemon damaged
    nodown,noscrub,nodeep-scrub flag(s) set
    9 scrub errors
    Reduced data availability: 61 pgs inactive, 56 pgs peering, 
4 pgs stale

    Possible data damage: 3 pgs inconsistent
    16 slow requests are blocked > 32 sec
    26 stuck requests are blocked > 4096 sec

  services:
    mon: 5 daemons, quorum a,b,c,d,e
    mgr: a(active), standbys: b, d, e, c
    mds: lcs-0/1/1 up , 2 up:standby, 1 damaged
    osd: 34 osds: 34 up, 34 in
 flags nodown,noscrub,nodeep-scrub

  data:
    pools:   15 pools, 640 pgs
    objects: 9.73 M objects, 13 TiB
    usage:   24 TiB used, 55 TiB / 79 TiB avail
    pgs: 23.438% pgs not active
 487 active+clean
 73  peering
 70  activating
 5   stale+peering
 3   active+clean+inconsistent
 2   stale+activating

  io:
    client:   1.3 KiB/s wr, 0 op/s rd, 0 op/s wr


If there's any other information I can provide that can help point to 
the problem, I'd be glad to share.


Thanks

-Troy
___
ceph-users mailing list

Re: [ceph-users] krbd vs librbd performance with qemu

2018-07-18 Thread Jason Dillaman
On Wed, Jul 18, 2018 at 12:36 PM Nikola Ciprich 
wrote:

> Hi Jason,
>
> > Just to clarify: modern / rebased krbd block drivers definitely support
> > layering. The only missing features right now are object-map/fast-diff,
> > deep-flatten, and journaling (for RBD mirroring).
>
> I thought it as well, but at least mapping clone does not work for me even
> under 4.17.6:
>
>
> [root@v4a ~]# rbd map nvme/xxx
> rbd: sysfs write failed
> RBD image feature set mismatch. You can disable features unsupported by
> the kernel with "rbd feature disable nvme/xxx".
> In some cases useful info is found in syslog - try "dmesg | tail".
> rbd: map failed: (6) No such device or address
>
> (note incorrect hint on how this is supposed to be fixed, with feature
> disable command without any feature)
>
> dmesg output:
>
> [  +3.919281] rbd: image xxx: WARNING: kernel layering is EXPERIMENTAL!
> [  +0.001266] rbd: id 36dde238e1f29: image uses unsupported features: 0x38
>
>
> [root@v4a ~]# rbd info nvme/xxx
> rbd image 'xxx':
> size 20480 MB in 5120 objects
> order 22 (4096 kB objects)
> block_name_prefix: rbd_data.6a71313887ee0
> format: 2
> features: layering
> flags:
> create_timestamp: Wed Jun 20 13:46:38 2018
> parent: nvme/centos7@template
> overlap: 20480 MB
>
> is trying 4.18-rc5 worth giving a try?
>

What's the output from "rbd info nvme/centos7"?


> > If you are running multiple fio jobs against the same image (or have the
> > krbd device mapped to multiple hosts w/ active IO), then I would expect a
> > huge performance hit since the lock needs to be transitioned between
> > clients.
>
> nope, only one running fio instance, no users on the other node..
>

Odd. The exclusive-lock code is only executed once (in general) upon the
first write IO (or immediately upon mapping the image if the "exclusive"
option is passed to the kernel). Therefore, it should have zero impact on
IO performance.
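
(For reference, that option looks something like this when mapping the image
from this thread:

rbd map -o exclusive nvme/xxx
)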


>
> BR
>
> nik
>
>
> --
> -
> Ing. Nikola CIPRICH
> LinuxBox.cz, s.r.o.
> 28. rijna 168, 709 00 Ostrava
>
> tel.:   +420 591 166 214
> fax:+420 596 621 273
> mobil:  +420 777 093 799
>
> www.linuxbox.cz
>
> mobil servis: +420 737 238 656
> email servis: ser...@linuxbox.cz
> -
>


-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] krbd vs librbd performance with qemu

2018-07-18 Thread Nikola Ciprich
Hi Jason,

> Just to clarify: modern / rebased krbd block drivers definitely support
> layering. The only missing features right now are object-map/fast-diff,
> deep-flatten, and journaling (for RBD mirroring).

I thought it as well, but at least mapping clone does not work for me even
under 4.17.6:


[root@v4a ~]# rbd map nvme/xxx
rbd: sysfs write failed
RBD image feature set mismatch. You can disable features unsupported by the 
kernel with "rbd feature disable nvme/xxx".
In some cases useful info is found in syslog - try "dmesg | tail".
rbd: map failed: (6) No such device or address

(note the incorrect hint on how this is supposed to be fixed: a feature disable 
command without naming any features)

dmesg output:

[  +3.919281] rbd: image xxx: WARNING: kernel layering is EXPERIMENTAL!
[  +0.001266] rbd: id 36dde238e1f29: image uses unsupported features: 0x38


[root@v4a ~]# rbd info nvme/xxx
rbd image 'xxx':
size 20480 MB in 5120 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.6a71313887ee0
format: 2
features: layering
flags: 
create_timestamp: Wed Jun 20 13:46:38 2018
parent: nvme/centos7@template
overlap: 20480 MB

is trying 4.18-rc5 worth giving a try?

> If you are running multiple fio jobs against the same image (or have the
> krbd device mapped to multiple hosts w/ active IO), then I would expect a
> huge performance hit since the lock needs to be transitioned between
> clients.

nope, only one running fio instance, no users on the other node..

BR

nik


-- 
-
Ing. Nikola CIPRICH
LinuxBox.cz, s.r.o.
28. rijna 168, 709 00 Ostrava

tel.:   +420 591 166 214
fax:+420 596 621 273
mobil:  +420 777 093 799

www.linuxbox.cz

mobil servis: +420 737 238 656
email servis: ser...@linuxbox.cz
-


pgp_UJKdcWlKC.pgp
Description: PGP signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] is upgrade from 12.2.5 to 12.2.7 an emergency for EC users

2018-07-18 Thread Brady Deetz
I'm trying to determine if I need to perform an emergency update on my 2PB
CephFS environment running on EC.

What triggers the corruption bug? Is it only at the time of an OSD restart
before data is quiesced?

When do you know if corruption has occurred? deep-scrub?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] krbd vs librbd performance with qemu

2018-07-18 Thread Jason Dillaman
On Wed, Jul 18, 2018 at 10:55 AM Nikola Ciprich 
wrote:

> Hi,
>
> historically I've found many discussions about this topic in
> last few years, but it seems  to me to be still a bit unresolved
> so I'd like to open the question again..
>
> In all flash deployments, under 12.2.5 luminous and qemu 12.2.0
> using lbirbd, I'm getting much worse results regarding IOPS then
> with KRBD and direct block device access..
>
> I'm testing on the same 100GB RBD volume, notable ceph settings:
>
> client rbd cache disabled
> osd_enable_op_tracker = False
> osd_op_num_shards = 64
> osd_op_num_threads_per_shard = 1
>
> osds are running bluestore, 2 replicas (it's just for testing)
>
> when I run FIO using librbd directly, I'm getting ~160k reads/s
> and ~60k writes/s which is not that bad.
>
> however when I run fio on block device under VM (qemu using librbd),
> I'm getting only 60/40K op/s which is a huge loss..
>
> when I use VM with block access to krbd mapped device, numbers
> are much better, I'm getting something like 115/40K op/s which
> is not ideal, but still much better.. tried many optimisations
> and configuration variants (multiple queues, threads vs native aio
> etc), but krbd still performs much much better..
>
> My question is whether this is expected, or should both access methods
> give more similar results? If possible, I'd like  to stick to librbd
> (especially because krbd still lacks layering support, but there are
> more reasons)
>

Just to clarify: modern / rebased krbd block drivers definitely support
layering. The only missing features right now are object-map/fast-diff,
deep-flatten, and journaling (for RBD mirroring).


> interesting is, that when I compare fio direct ceph access, librbd performs
> better then KRBD, but  this doesn't concern me that much..
>
> another question, during the tests, I noticed that enabling exclusive lock
> feature degrades write iops a lot as well, is this expected? (the
> performance
> falls to someting like 50%)
>

If you are running multiple fio jobs against the same image (or have the
krbd device mapped to multiple hosts w/ active IO), then I would expect a
huge performance hit since the lock needs to be transitioned between
clients.


>
> I'm doing the tests on small 2 node cluster, VMS are running directly on
> ceph nodes,
> all is centos 7 with 4.14 kernel. (I know it's not recommended to run VMs
> directly
> on ceph nodes, but for small deployments it's necessary for us)
>
> if I could provide more details, I'll be happy to do so
>
> BR
>
> nik
>
>
> --
> -
> Ing. Nikola CIPRICH
> LinuxBox.cz, s.r.o.
> 28.rijna 168, 709 00 Ostrava
>
> tel.:   +420 591 166 214
> fax:+420 596 621 273
> mobil:  +420 777 093 799
> www.linuxbox.cz
>
> mobil servis: +420 737 238 656
> email servis: ser...@linuxbox.cz
> -
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 10.2.6 upgrade

2018-07-18 Thread Glen Baars
Hello Sage,

Thanks for the response.

I am fairly new to Ceph. Are there any commands that would help confirm the 
issue?

Kind regards,
Glen Baars

T  1300 733 328
NZ +64 9280 3561
MOB +61 447 991 234


This e-mail may contain confidential and/or privileged information.If you are 
not the intended recipient (or have received this e-mail in error) please 
notify the sender immediately and destroy this e-mail. Any unauthorized 
copying, disclosure or distribution of the material in this e-mail is strictly 
forbidden.

-Original Message-
From: Sage Weil 
Sent: Wednesday, 18 July 2018 10:38 PM
To: Glen Baars 
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] 10.2.6 upgrade

On Wed, 18 Jul 2018, Glen Baars wrote:
> Hello Ceph Users,
>
> We installed 12.2.6 on a single node in the cluster ( new node added,
> 80TB moved ) Disabled scrub/deepscrub once the issues with 12.2.6 were 
> discovered.
>
>
> Today we upgrade the one affected node to 12.2.7 today, set osd skip data 
> digest = true and re enabled the scrubs. It's a 500TB all bluestore cluster.
>
>
> We are now seeing inconsistent PGs and scrub errors now the scrubbing has 
> resumed.

It is likely the inconsistencies were there from the period running 12.2.6, not 
due to 12.2.7.  I would suggest continuing the upgrade.  The scrub errors will 
either go away on their own or need to wait until 12.2.8 for scrub to learn how 
to repair them for you.

Can you share the scrub error you got to confirm it is the digest issue in
12.2.6 that is to blame?

sage

> What is the best way forward?
>
>
>   1.  Upgrade all nodes to 12.2.7?
>   2.  Remove the 12.2.7 node and rebuild?
> Kind regards,
> Glen Baars
> BackOnline Manager
> This e-mail is intended solely for the benefit of the addressee(s) and any 
> other named recipient. It is confidential and may contain legally privileged 
> or confidential information. If you are not the recipient, any use, 
> distribution, disclosure or copying of this e-mail is prohibited. The 
> confidentiality and legal privilege attached to this communication is not 
> waived or lost by reason of the mistaken transmission or delivery to you. If 
> you have received this e-mail in error, please notify us immediately.
>
This e-mail is intended solely for the benefit of the addressee(s) and any 
other named recipient. It is confidential and may contain legally privileged or 
confidential information. If you are not the recipient, any use, distribution, 
disclosure or copying of this e-mail is prohibited. The confidentiality and 
legal privilege attached to this communication is not waived or lost by reason 
of the mistaken transmission or delivery to you. If you have received this 
e-mail in error, please notify us immediately.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] krbd vs librbd performance with qemu

2018-07-18 Thread Nikola Ciprich
Hi,

historically I've found many discussions about this topic in
last few years, but it seems  to me to be still a bit unresolved
so I'd like to open the question again..

In all flash deployments, under 12.2.5 luminous and qemu 12.2.0
using librbd, I'm getting much worse results regarding IOPS than
with KRBD and direct block device access..

I'm testing on the same 100GB RBD volume, notable ceph settings:

client rbd cache disabled
osd_enable_op_tracker = False
osd_op_num_shards = 64
osd_op_num_threads_per_shard = 1

osds are running bluestore, 2 replicas (it's just for testing)

when I run FIO using librbd directly, I'm getting ~160k reads/s
and ~60k writes/s which is not that bad.

however when I run fio on block device under VM (qemu using librbd),
I'm getting only 60/40K op/s which is a huge loss.. 

when I use VM with block access to krbd mapped device, numbers
are much better, I'm getting something like 115/40K op/s which
is not ideal, but still much better.. tried many optimisations
and configuration variants (multiple queues, threads vs native aio
etc), but krbd still performs much much better..

My question is whether this is expected, or should both access methods
give more similar results? If possible, I'd like  to stick to librbd
(especially because krbd still lacks layering support, but there are
more reasons)

Interestingly, when I compare direct fio access to ceph, librbd performs
better than krbd, but this doesn't concern me that much..

another question, during the tests, I noticed that enabling exclusive lock
feature degrades write iops a lot as well, is this expected? (the performance
falls to something like 50%)

I'm doing the tests on a small 2-node cluster; VMs are running directly on the ceph 
nodes, all on CentOS 7 with a 4.14 kernel. (I know it's not recommended to run VMs 
directly on ceph nodes, but for small deployments it's necessary for us)

if I could provide more details, I'll be happy to do so

BR

nik


-- 
-
Ing. Nikola CIPRICH
LinuxBox.cz, s.r.o.
28.rijna 168, 709 00 Ostrava

tel.:   +420 591 166 214
fax:+420 596 621 273
mobil:  +420 777 093 799
www.linuxbox.cz

mobil servis: +420 737 238 656
email servis: ser...@linuxbox.cz
-
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v12.2.7 Luminous released

2018-07-18 Thread Nicolas Huillard
Hi all,

This is just to report that I upgraded smoothly from 12.2.6 to
12.2.7 (bluestore only, bitten by the "damaged mds" consequence of the
bad checksum on mds journal 0x200).
This was a really bad problem for CephFS. Fortunately, that cluster was
not in production yet (that's why I didn't ask myself too many
questions before upgrading to 12.2.6).

Many many many thanks to all who provided help, and to the brave
developers who might not have had fun those days too ;-)

Le mardi 17 juillet 2018 à 18:28 +0200, Abhishek Lekshmanan a écrit :
> This is the seventh bugfix release of Luminous v12.2.x long term
> stable release series. This release contains several fixes for
> regressions in the v12.2.6 and v12.2.5 releases.  We recommend that
> all users upgrade. 
> 
> *NOTE* The v12.2.6 release has serious known regressions, while
> 12.2.6
> wasn't formally announced in the mailing lists or blog, the packages
> were built and available on download.ceph.com since last week. If you
> installed this release, please see the upgrade procedure below.
> 
> *NOTE* The v12.2.5 release has a potential data corruption issue with
> erasure coded pools. If you ran v12.2.5 with erasure coding, please
> see
> below.
> 
> The full blog post alongwith the complete changelog is published at
> the
> official ceph blog at https://ceph.com/releases/12-2-7-luminous-relea
> sed/
> 
> Upgrading from v12.2.6
> --
> 
> v12.2.6 included an incomplete backport of an optimization for
> BlueStore OSDs that avoids maintaining both the per-object checksum
> and the internal BlueStore checksum.  Due to the accidental omission
> of a critical follow-on patch, v12.2.6 corrupts (fails to update) the
> stored per-object checksum value for some objects.  This can result
> in
> an EIO error when trying to read those objects.
> 
> #. If your cluster uses FileStore only, no special action is
> required.
>    This problem only affects clusters with BlueStore.
> 
> #. If your cluster has only BlueStore OSDs (no FileStore), then you
>    should enable the following OSD option::
> 
>  osd skip data digest = true
> 
>    This will avoid setting and start ignoring the full-object digests
>    whenever the primary for a PG is BlueStore.
> 
> #. If you have a mix of BlueStore and FileStore OSDs, then you should
>    enable the following OSD option::
> 
>  osd distrust data digest = true
> 
>    This will avoid setting and start ignoring the full-object digests
>    in all cases.  This weakens the data integrity checks for
>    FileStore (although those checks were always only opportunistic).
> 
> If your cluster includes BlueStore OSDs and was affected, deep scrubs
> will generate errors about mismatched CRCs for affected objects.
> Currently the repair operation does not know how to correct them
> (since all replicas do not match the expected checksum it does not
> know how to proceed).  These warnings are harmless in the sense that
> IO is not affected and the replicas are all still in sync.  The
> number
> of affected objects is likely to drop (possibly to zero) on their own
> over time as those objects are modified.  We expect to include a
> scrub
> improvement in v12.2.8 to clean up any remaining objects.
> 
> Additionally, see the notes below, which apply to both v12.2.5 and
> v12.2.6.
> 
> Upgrading from v12.2.5 or v12.2.6
> -
> 
> If you used v12.2.5 or v12.2.6 in combination with erasure coded
> pools, there is a small risk of corruption under certain workloads.
> Specifically, when:
> 
> * An erasure coded pool is in use
> * The pool is busy with successful writes
> * The pool is also busy with updates that result in an error result
> to
>   the librados user.  RGW garbage collection is the most common
>   example of this (it sends delete operations on objects that don't
>   always exist.)
> * Some OSDs are reasonably busy.  One known example of such load is
>   FileStore splitting, although in principle any load on the cluster
>   could also trigger the behavior.
> * One or more OSDs restarts.
> 
> This combination can trigger an OSD crash and possibly leave PGs in a
> state
> where they fail to peer.
> 
> Notably, upgrading a cluster involves OSD restarts and as such may
> increase the risk of encountering this bug.  For this reason, for
> clusters with erasure coded pools, we recommend the following upgrade
> procedure to minimize risk:
> 
> 1. Install the v12.2.7 packages.
> 2. Temporarily quiesce IO to cluster::
> 
>  ceph osd pause
> 
> 3. Restart all OSDs and wait for all PGs to become active.
> 4. Resume IO::
> 
>  ceph osd unpause
> 
> This will cause an availability outage for the duration of the OSD
> restarts.  If this in unacceptable, an *more risky* alternative is to
> disable RGW garbage collection (the primary known cause of these
> rados
> operations) for the duration of the upgrade::
> 
> 1. Set ``rgw_enable_gc_threads = false`` in ceph.conf
> 2. Restart all 

Re: [ceph-users] 10.2.6 upgrade

2018-07-18 Thread Sage Weil
On Wed, 18 Jul 2018, Glen Baars wrote:
> Hello Ceph Users,
> 
> We installed 12.2.6 on a single node in the cluster ( new node added, 80TB 
> moved )
> Disabled scrub/deepscrub once the issues with 12.2.6 were discovered.
> 
> 
> Today we upgrade the one affected node to 12.2.7 today, set osd skip data 
> digest = true and re enabled the scrubs. It's a 500TB all bluestore cluster.
> 
> 
> We are now seeing inconsistent PGs and scrub errors now the scrubbing has 
> resumed.

It is likely the inconsistencies were there from the period running 
12.2.6, not due to 12.2.7.  I would suggest continuing the upgrade.  The 
scrub errors will either go away on their own or need to wait until 12.2.8 
for scrub to learn how to repair them for you.

Can you share the scrub error you got to confirm it is the digest issue in 
12.2.6 that is to blame?
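
Something like the following should pull them out (substitute your own pool 
and pg ids):

ceph health detail | grep -i inconsistent
rados list-inconsistent-pg <pool>
rados list-inconsistent-obj <pgid> --format=json-pretty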

sage
 
> What is the best way forward?
> 
> 
>   1.  Upgrade all nodes to 12.2.7?
>   2.  Remove the 12.2.7 node and rebuild?
> Kind regards,
> Glen Baars
> BackOnline Manager
> This e-mail is intended solely for the benefit of the addressee(s) and any 
> other named recipient. It is confidential and may contain legally privileged 
> or confidential information. If you are not the recipient, any use, 
> distribution, disclosure or copying of this e-mail is prohibited. The 
> confidentiality and legal privilege attached to this communication is not 
> waived or lost by reason of the mistaken transmission or delivery to you. If 
> you have received this e-mail in error, please notify us immediately.
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] 10.2.6 upgrade

2018-07-18 Thread Glen Baars
Hello Ceph Users,

We installed 12.2.6 on a single node in the cluster ( new node added, 80TB 
moved )
Disabled scrub/deepscrub once the issues with 12.2.6 were discovered.


Today we upgraded the one affected node to 12.2.7, set osd skip data 
digest = true and re-enabled the scrubs. It's a 500TB all bluestore cluster.


We are now seeing inconsistent PGs and scrub errors now that the scrubbing has 
resumed.

What is the best way forward?


  1.  Upgrade all nodes to 12.2.7?
  2.  Remove the 12.2.7 node and rebuild?
Kind regards,
Glen Baars
BackOnline Manager
This e-mail is intended solely for the benefit of the addressee(s) and any 
other named recipient. It is confidential and may contain legally privileged or 
confidential information. If you are not the recipient, any use, distribution, 
disclosure or copying of this e-mail is prohibited. The confidentiality and 
legal privilege attached to this communication is not waived or lost by reason 
of the mistaken transmission or delivery to you. If you have received this 
e-mail in error, please notify us immediately.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v12.2.7 Luminous released

2018-07-18 Thread Oliver Freyermuth
Am 18.07.2018 um 16:20 schrieb Sage Weil:
> On Wed, 18 Jul 2018, Oliver Freyermuth wrote:
>> Am 18.07.2018 um 14:20 schrieb Sage Weil:
>>> On Wed, 18 Jul 2018, Linh Vu wrote:
 Thanks for all your hard work in putting out the fixes so quickly! :)

 We have a cluster on 12.2.5 with Bluestore and EC pool but for CephFS, 
 not RGW. In the release notes, it says RGW is a risk especially the 
 garbage collection, and the recommendation is to either pause IO or 
 disable RGW garbage collection.


 In our case with CephFS, not RGW, is it a lot less risky to perform the 
 upgrade to 12.2.7 without the need to pause IO?


 What does pause IO do? Do current sessions just get queued up and IO 
 resume normally with no problem after unpausing?


 If we have to pause IO, is it better to do something like: pause IO, 
 restart OSDs on one node, unpause IO - repeated for all the nodes 
 involved in the EC pool?
>>
>> Hi!
>>
>> sorry for asking again, but... 
>>
>>>
>>> CephFS can generate a problem rados workload too when files are deleted or 
>>> truncated.  If that isn't happening in your workload then you're probably 
>>> fine.  If deletes are mixed in, then you might consider pausing IO for the 
>>> upgrade.
>>>
>>> FWIW, if you have been running 12.2.5 for a while and haven't encountered 
>>> the OSD FileStore crashes with
>>>
>>> src/os/filestore/FileStore.cc: 5524: FAILED assert(0 == "ERROR: source must 
>>> exist")
>>>
>>> but have had OSDs go up/down then you are probably okay.
>>
>> => Does this issue only affect filestore, or also bluestore? 
>> In your "IMPORTANT" warning mail, you wrote:
>> "It seems to affect filestore and busy clusters with this specific 
>> workload."
>> concerning this issue. 
>> However, the release notes do not mention explicitly that only Filestore is 
>> affected. 
>>
>> Both Linh Vu and me are using Bluestore (exclusively). 
>> Are we potentially affected unless we pause I/O during the upgrade? 
> 
> The bug should apply to both FileStore and BlueStore, but we have only 
> seen crashes with FileStore.  I'm not entirely sure why that is.  One 
> theory is that the filestore apply timing is different and that makes the 
> bug more likely to happen.  Another is that filestore splitting is a 
> "good" source of that latency that tends to trigger the bug easily.
> 
> If it were me I would err on the safe side. :)

That's certainly the choice of a sage ;-). 

We'll do that, too - we informed our users just now I/O will be blocked for 
thirty minutes or so to give us some leeway for the upgrade... 
They will certainly survive the pause with the nice weather outside :-). 

Cheers and many thanks,
Oliver

> 
> sage
> 




smime.p7s
Description: S/MIME Cryptographic Signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v12.2.7 Luminous released

2018-07-18 Thread Sage Weil
On Wed, 18 Jul 2018, Oliver Freyermuth wrote:
> Am 18.07.2018 um 14:20 schrieb Sage Weil:
> > On Wed, 18 Jul 2018, Linh Vu wrote:
> >> Thanks for all your hard work in putting out the fixes so quickly! :)
> >>
> >> We have a cluster on 12.2.5 with Bluestore and EC pool but for CephFS, 
> >> not RGW. In the release notes, it says RGW is a risk especially the 
> >> garbage collection, and the recommendation is to either pause IO or 
> >> disable RGW garbage collection.
> >>
> >>
> >> In our case with CephFS, not RGW, is it a lot less risky to perform the 
> >> upgrade to 12.2.7 without the need to pause IO?
> >>
> >>
> >> What does pause IO do? Do current sessions just get queued up and IO 
> >> resume normally with no problem after unpausing?
> >>
> >>
> >> If we have to pause IO, is it better to do something like: pause IO, 
> >> restart OSDs on one node, unpause IO - repeated for all the nodes 
> >> involved in the EC pool?
> 
> Hi!
> 
> sorry for asking again, but... 
> 
> > 
> > CephFS can generate a problem rados workload too when files are deleted or 
> > truncated.  If that isn't happening in your workload then you're probably 
> > fine.  If deletes are mixed in, then you might consider pausing IO for the 
> > upgrade.
> > 
> > FWIW, if you have been running 12.2.5 for a while and haven't encountered 
> > the OSD FileStore crashes with
> > 
> > src/os/filestore/FileStore.cc: 5524: FAILED assert(0 == "ERROR: source must 
> > exist")
> > 
> > but have had OSDs go up/down then you are probably okay.
> 
> => Does this issue only affect filestore, or also bluestore? 
> In your "IMPORTANT" warning mail, you wrote:
> "It seems to affect filestore and busy clusters with this specific 
> workload."
> concerning this issue. 
> However, the release notes do not mention explicitly that only Filestore is 
> affected. 
> 
> Both Linh Vu and me are using Bluestore (exclusively). 
> Are we potentially affected unless we pause I/O during the upgrade? 

The bug should apply to both FileStore and BlueStore, but we have only 
seen crashes with FileStore.  I'm not entirely sure why that is.  One 
theory is that the filestore apply timing is different and that makes the 
bug more likely to happen.  Another is that filestore splitting is a 
"good" source of that latency that tends to trigger the bug easily.

If it were me I would err on the safe side. :)

sage

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v12.2.7 Luminous released

2018-07-18 Thread Oliver Freyermuth
Am 18.07.2018 um 14:20 schrieb Sage Weil:
> On Wed, 18 Jul 2018, Linh Vu wrote:
>> Thanks for all your hard work in putting out the fixes so quickly! :)
>>
>> We have a cluster on 12.2.5 with Bluestore and EC pool but for CephFS, 
>> not RGW. In the release notes, it says RGW is a risk especially the 
>> garbage collection, and the recommendation is to either pause IO or 
>> disable RGW garbage collection.
>>
>>
>> In our case with CephFS, not RGW, is it a lot less risky to perform the 
>> upgrade to 12.2.7 without the need to pause IO?
>>
>>
>> What does pause IO do? Do current sessions just get queued up and IO 
>> resume normally with no problem after unpausing?
>>
>>
>> If we have to pause IO, is it better to do something like: pause IO, 
>> restart OSDs on one node, unpause IO - repeated for all the nodes 
>> involved in the EC pool?

Hi!

sorry for asking again, but... 

> 
> CephFS can generate a problem rados workload too when files are deleted or 
> truncated.  If that isn't happening in your workload then you're probably 
> fine.  If deletes are mixed in, then you might consider pausing IO for the 
> upgrade.
> 
> FWIW, if you have been running 12.2.5 for a while and haven't encountered 
> the OSD FileStore crashes with
> 
> src/os/filestore/FileStore.cc: 5524: FAILED assert(0 == "ERROR: source must 
> exist")
> 
> but have had OSDs go up/down then you are probably okay.

=> Does this issue only affect filestore, or also bluestore? 
In your "IMPORTANT" warning mail, you wrote:
"It seems to affect filestore and busy clusters with this specific 
workload."
concerning this issue. 
However, the release notes do not mention explicitly that only Filestore is 
affected. 

Both Linh Vu and me are using Bluestore (exclusively). 
Are we potentially affected unless we pause I/O during the upgrade? 

All the best,
Oliver

> 
> Thanks!
> sage
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 



smime.p7s
Description: S/MIME Cryptographic Signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] multisite and link speed

2018-07-18 Thread Casey Bodley
On Tue, Jul 17, 2018 at 10:16 AM, Robert Stanford
 wrote:
>
>  I have ceph clusters in a zone configured as active/passive, or
> primary/backup.  If the network link between the two clusters is slower than
> the speed of data coming in to the active cluster, what will eventually
> happen?  Will data pool on the active cluster until memory runs out?
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

The primary zone does not queue up changes in memory to push to other
zones. Instead, the sync process on the backup zone reads updates from
the primary zone. Sync will make as much progress as the link allows,
but if the primary cluster is constantly ingesting data at a higher
rate, the backup cluster will fall behind.
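
You can watch how far behind the backup zone is with something like

radosgw-admin sync status

run on the backup zone; it reports, per shard, whether metadata and data sync
are caught up or lagging behind the source.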
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v12.2.7 Luminous released

2018-07-18 Thread Sage Weil
On Wed, 18 Jul 2018, Linh Vu wrote:
> Thanks for all your hard work in putting out the fixes so quickly! :)
> 
> We have a cluster on 12.2.5 with Bluestore and EC pool but for CephFS, 
> not RGW. In the release notes, it says RGW is a risk especially the 
> garbage collection, and the recommendation is to either pause IO or 
> disable RGW garbage collection.
> 
> 
> In our case with CephFS, not RGW, is it a lot less risky to perform the 
> upgrade to 12.2.7 without the need to pause IO?
> 
> 
> What does pause IO do? Do current sessions just get queued up and IO 
> resume normally with no problem after unpausing?
> 
> 
> If we have to pause IO, is it better to do something like: pause IO, 
> restart OSDs on one node, unpause IO - repeated for all the nodes 
> involved in the EC pool?

CephFS can generate a problem rados workload too when files are deleted or 
truncated.  If that isn't happening in your workload then you're probably 
fine.  If deletes are mixed in, then you might consider pausing IO for the 
upgrade.

FWIW, if you have been running 12.2.5 for a while and haven't encountered 
the OSD FileStore crashes with

src/os/filestore/FileStore.cc: 5524: FAILED assert(0 == "ERROR: source must 
exist")

but have had OSDs go up/down then you are probably okay.

Thanks!
sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v12.2.7 Luminous released

2018-07-18 Thread Oliver Freyermuth
Also many thanks from my side! 

Am 18.07.2018 um 03:04 schrieb Linh Vu:
> Thanks for all your hard work in putting out the fixes so quickly! :)
> 
> We have a cluster on 12.2.5 with Bluestore and EC pool but for CephFS, not 
> RGW. In the release notes, it says RGW is a risk especially the garbage 
> collection, and the recommendation is to either pause IO or disable RGW 
> garbage collection. 
> 
> 
> In our case with CephFS, not RGW, is it a lot less risky to perform the 
> upgrade to 12.2.7 without the need to pause IO? 
> 
> 
> What does pause IO do? Do current sessions just get queued up and IO resume 
> normally with no problem after unpausing? 

That's my understanding: pause blocks any reads and writes. If the processes 
accessing CephFS do not have any wallclock-related timeout handlers, they 
should be fine IMHO. 
I'm unsure how NFS Ganesha would react to a prolonged pause, though. 
But indeed I have the very same question - we also have a pure CephFS cluster, 
without RGW, EC-pool-backed, on 12.2.5. Should we pause IO during upgrade? 

I wonder whether it is risky / unrisky to upgrade without pausing I/O? 
The update notes in the blog do not state whether a pure CephFS setup is 
affected. 

Cheers,
Oliver

> 
> 
> If we have to pause IO, is it better to do something like: pause IO, restart 
> OSDs on one node, unpause IO - repeated for all the nodes involved in the EC 
> pool? 
> 
> 
> Regards,
> 
> Linh
> 
> --
> *From:* ceph-users  on behalf of Sage Weil 
> 
> *Sent:* Wednesday, 18 July 2018 4:42:41 AM
> *To:* Stefan Kooman
> *Cc:* ceph-annou...@ceph.com; ceph-de...@vger.kernel.org; 
> ceph-maintain...@ceph.com; ceph-us...@ceph.com
> *Subject:* Re: [ceph-users] v12.2.7 Luminous released
>  
> On Tue, 17 Jul 2018, Stefan Kooman wrote:
>> Quoting Abhishek Lekshmanan (abhis...@suse.com):
>>
>> > *NOTE* The v12.2.5 release has a potential data corruption issue with
>> > erasure coded pools. If you ran v12.2.5 with erasure coding, please see
> ^^^
>> > below.
>>
>> < snip >
>>
>> > Upgrading from v12.2.5 or v12.2.6
>> > -
>> >
>> > If you used v12.2.5 or v12.2.6 in combination with erasure coded
> ^
>> > pools, there is a small risk of corruption under certain workloads.
>> > Specifically, when:
>>
>> < snip >
>>
>> One section mentions Luminous clusters _with_ EC pools specifically, the 
>> other
>> section mentions Luminous clusters running 12.2.5.
> 
> I think they both do?
> 
>> I might be misreading this, but to make things clear for current Ceph
>> Luminous 12.2.5 users. Is the following statement correct?
>>
>> If you do _NOT_ use EC in your 12.2.5 cluster (only replicated pools), there 
>> is
>> no need to quiesce IO (ceph osd pause).
> 
> Correct.
> 
>> http://docs.ceph.com/docs/master/releases/luminous/#upgrading-from-other-versions
>> If your cluster did not run v12.2.5 or v12.2.6 then none of the above
>> issues apply to you and you should upgrade normally.
>>
>> ^^ Above section would indicate all 12.2.5 luminous clusters.
> 
> The intent here is to clarify that any cluster running 12.2.4 or
> older can upgrade without reading carefully. If the cluster
> does/did run 12.2.5 or .6, then read carefully because it may (or may not)
> be affected.
> 
> Does that help? Any suggested revisions to the wording in the release
> notes that make it clearer are welcome!
> 
> Thanks-
> sage
> 
> 
>>
>> Please clarify,
>>
>> Thanks,
>>
>> Stefan
>>
>> --
>> | BIT BV http://www.bit.nl/ Kamer van Koophandel 09090351
>> | GPG: 0xD14839C6 +31 318 648 688 / i...@bit.nl
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> ___
> ceph-users mailing 

Re: [ceph-users] Read/write statistics per RBD image

2018-07-18 Thread Jason Dillaman
Yup, on the host running librbd, you just need to enable the "admin socket"
in your ceph.conf and then use "ceph --admin-daemon
/path/to/image/admin/socket.asok perf dump" (i.e. not "ceph perf dump").

See the example in this tip window [1] for how to configure for a "libvirt"
CephX user.

[1] http://docs.ceph.com/docs/mimic/rbd/libvirt/#configuring-ceph
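
A minimal sketch of the ceph.conf side, assuming the librbd clients run as
client.libvirt (adjust the user and the paths to your setup):

[client.libvirt]
    admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok
    log file = /var/log/ceph/qemu-guest-$pid.log

Then, for each running librbd client (the pid/cctid numbers below are
placeholders):

ceph --admin-daemon /var/run/ceph/ceph-client.libvirt.12345.67890.asok perf dump

The dump includes a librbd-...-<pool>-<image> section per open image with
read/write operation and byte counters.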

On Wed, Jul 18, 2018 at 4:02 AM Mateusz Skala (UST, POL) <
mateusz.sk...@ust-global.com> wrote:

> Thanks  for response.
>
> In ‘ceph perf dump’ there is no statistics for read/write operations on
> specific RBD image, only for osd and total client operations. I need to get
> statistics on one specific RBD image, to get top used images. It is
> possible?
>
> Regards
>
> Mateusz
>
>
>
> *From:* Jason Dillaman [mailto:jdill...@redhat.com]
> *Sent:* Tuesday, July 17, 2018 3:29 PM
> *To:* Mateusz Skala (UST, POL) 
> *Cc:* ceph-users 
> *Subject:* Re: [ceph-users] Read/write statistics per RBD image
>
>
>
> Yes, you just need to enable the "admin socket" in your ceph.conf and then
> use "ceph --admin-daemon /path/to/image/admin/socket.asok perf dump".
>
>
>
> On Tue, Jul 17, 2018 at 8:53 AM Mateusz Skala (UST, POL) <
> mateusz.sk...@ust-global.com> wrote:
>
> Hi,
>
> It is possible to get statistics of issued reads/writes to specific RBD
> image? Best will be statistics like in /proc/diskstats in linux.
>
> Regards
>
> Mateusz
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
>
> --
>
> Jason
>


-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Jewel PG stuck inconsistent with 3 0-size objects

2018-07-18 Thread Matthew Vernon
Hi,

On 17/07/18 01:29, Brad Hubbard wrote:
> Your issue is different since not only do the omap digests of all
> replicas not match the omap digest from the auth object info but they
> are all different to each other.
> 
> What is min_size of pool 67 and what can you tell us about the events
> leading up to this?

min_size is 2 ; pool 67 is default.rgw.buckets.index.
This is a moderately-large (3060 OSD) cluster that's been running for a
while; we upgraded to 10.2.9 (from 10.2.6, also from Ubuntu) about a
week ago.

>> rados -p default.rgw.buckets.index setomapval
>> .dir.861ae926-7ff0-48c5-86d6-a6ba8d0a7a14.7130858.6 temporary-key anything
>> [deep-scrub]
>> rados -p default.rgw.buckets.index rmomapkey
>> .dir.861ae926-7ff0-48c5-86d6-a6ba8d0a7a14.7130858.6 temporary-key

We did this, and it does appear to have resolved the issue (the pg is
now happy).

Regards,

Matthew


-- 
 The Wellcome Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Balancer: change from crush-compat to upmap

2018-07-18 Thread Caspar Smit
Hi Xavier,

Not yet, I got a little anxious about changing anything major in the cluster
after reading about the 12.2.5 regressions, since I'm also using bluestore
and erasure coding.

So after this cluster is upgraded to 12.2.7 I'm proceeding forward with
this.
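
The sequence I have in mind for after the upgrade (as I understand the
commands -- corrections welcome):

ceph osd set-require-min-compat-client luminous   # upmap needs luminous+ clients
ceph balancer off
ceph osd crush weight-set rm-compat
ceph balancer mode upmap
ceph balancer on
ceph balancer status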

Kind regards,
Caspar

2018-07-16 8:34 GMT+02:00 Xavier Trilla :

> Hi Caspar,
>
>
>
> Did you find any information regarding the migration from crush-compat to
> unmap? I’m facing the same situation.
>
>
>
> Thanks!
>
>
>
>
>
> *De:* ceph-users  * En nombre de *Caspar
> Smit
> *Enviado el:* lunes, 25 de junio de 2018 12:25
> *Para:* ceph-users 
> *Asunto:* [ceph-users] Balancer: change from crush-compat to upmap
>
>
>
> Hi All,
>
>
>
> I've been using the balancer module in crush-compat mode for quite a while
> now and want to switch to upmap mode since all my clients are now luminous
> (v12.2.5)
>
>
>
> i've reweighted the compat weight-set back to as close as the original
> crush weights using 'ceph osd crush reweight-compat'
>
>
>
> Before i switch to upmap i presume i need to remove the compat weight set
> with:
>
>
>
> ceph osd crush weight-set rm-compat
>
>
>
> Will this have any significant impact (rebalancing lots of pgs) or does
> this have very little effect since i already reweighted everything back
> close to crush default weights?
>
>
>
> Thanks in advance and kind regards,
>
> Caspar
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Exact scope of OSD heartbeating?

2018-07-18 Thread Dan van der Ster
On Wed, Jul 18, 2018 at 3:20 AM Anthony D'Atri  wrote:
>
> The documentation here:
>
> http://docs.ceph.com/docs/master/rados/configuration/mon-osd-interaction/
>
> says
>
> "Each Ceph OSD Daemon checks the heartbeat of other Ceph OSD Daemons every 6 
> seconds"
>
> and
>
> " If a neighboring Ceph OSD Daemon doesn’t show a heartbeat within a 20 
> second grace period, the Ceph OSD Daemon may consider the neighboring Ceph 
> OSD Daemon down and report it back to a Ceph Monitor,"
>
> I've always thought that each OSD heartbeats with *every* other OSD, which of 
> course means that total heartbeat traffic grows ~ quadratically.  However in 
> extending test we've observed that the number of other OSDs that a subject 
> heartbeat (heartbeated?) was < N, which has us wondering if perhaps only OSDs 
> with which a given OSD shares are contacted -- or some other subset.
>

OSDs heartbeat with their peers, the set of osds with whom they share
at least one PG.
You can see the heartbeat peers (HB_PEERS) in ceph pg dump -- after
the header "OSD_STAT USED  AVAIL TOTAL HB_PEERS..."

This is one of the nice features of the placement group concept --
heartbeats and peering in general stays constant with the number of
PGs per OSD, rather than scaling up with the total number of OSDs in a
cluster.

Cheers, Dan


> I plan to submit a doc fix for mon_osd_min_down_reporters and wanted to 
> resolve this FUD first.
>
> -- aad
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Read/write statistics per RBD image

2018-07-18 Thread Mateusz Skala (UST, POL)
Thanks for the response.

In ‘ceph perf dump’ there are no statistics for read/write operations on a 
specific RBD image, only for OSDs and total client operations. I need to get 
statistics on one specific RBD image, to find the top-used images. Is this possible?

Regards

Mateusz



From: Jason Dillaman [mailto:jdill...@redhat.com]
Sent: Tuesday, July 17, 2018 3:29 PM
To: Mateusz Skala (UST, POL) 
Cc: ceph-users 
Subject: Re: [ceph-users] Read/write statistics per RBD image



Yes, you just need to enable the "admin socket" in your ceph.conf and then use 
"ceph --admin-daemon /path/to/image/admin/socket.asok perf dump".



On Tue, Jul 17, 2018 at 8:53 AM Mateusz Skala (UST, POL) 
mailto:mateusz.sk...@ust-global.com>> wrote:

   Hi,

   It is possible to get statistics of issued reads/writes to specific RBD 
image? Best will be statistics like in /proc/diskstats in linux.

   Regards

   Mateusz

   ___
   ceph-users mailing list
   ceph-users@lists.ceph.com
   http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com






   --

   Jason

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v12.2.7 Luminous released

2018-07-18 Thread Caspar Smit
2018-07-18 3:04 GMT+02:00 Linh Vu :

> Thanks for all your hard work in putting out the fixes so quickly! :)
>
> We have a cluster on 12.2.5 with Bluestore and EC pool but for CephFS, not
> RGW. In the release notes, it says RGW is a risk especially the garbage
> collection, and the recommendation is to either pause IO or disable RGW
> garbage collection.
>
>
> In our case with CephFS, not RGW, is it a lot less risky to perform the
> upgrade to 12.2.7 without the need to pause IO?
>
>
I have the same question, but now for a 12.2.5 EC cluster doing only RBD.
Am I still affected, or does this apply only to RGW workloads?

Furthermore, after upgrading the packages to 12.2.7, I presume the mons/mgrs
still need to be restarted first?

Kind regards,
Caspar
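
For reference, the usual rolling sequence on systemd hosts looks roughly like
this (a sketch only; it does not answer whether pausing is needed for an
RBD-only workload):

    ceph osd set noout                   # optional: avoid rebalancing during restarts
    systemctl restart ceph-mon.target    # on each monitor host, one at a time
    systemctl restart ceph-mgr.target    # then the managers
    systemctl restart ceph-osd.target    # then the OSDs, host by host
    ceph osd unset noout

    # "pause IO" refers to the cluster-wide pause flags:
    ceph osd pause                       # blocks client reads and writes
    ceph osd unpause                     # resumes them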

> What does pause IO do? Do current sessions just get queued up, and does IO
> resume normally with no problem after unpausing?
>
>
> If we have to pause IO, is it better to do something like: pause IO,
> restart OSDs on one node, unpause IO - repeated for all the nodes involved
> in the EC pool?
>
>
> Regards,
>
> Linh
> --
> *From:* ceph-users  on behalf of Sage
> Weil 
> *Sent:* Wednesday, 18 July 2018 4:42:41 AM
> *To:* Stefan Kooman
> *Cc:* ceph-annou...@ceph.com; ceph-de...@vger.kernel.org;
> ceph-maintain...@ceph.com; ceph-us...@ceph.com
> *Subject:* Re: [ceph-users] v12.2.7 Luminous released
>
> On Tue, 17 Jul 2018, Stefan Kooman wrote:
> > Quoting Abhishek Lekshmanan (abhis...@suse.com):
> >
> > > *NOTE* The v12.2.5 release has a potential data corruption issue with
> > > erasure coded pools. If you ran v12.2.5 with erasure coding, please see
> ^^^
> > > below.
> >
> > < snip >
> >
> > > Upgrading from v12.2.5 or v12.2.6
> > > -
> > >
> > > If you used v12.2.5 or v12.2.6 in combination with erasure coded
> ^
> > > pools, there is a small risk of corruption under certain workloads.
> > > Specifically, when:
> >
> > < snip >
> >
> > One section mentions Luminous clusters _with_ EC pools specifically, the
> other
> > section mentions Luminous clusters running 12.2.5.
>
> I think they both do?
>
> > I might be misreading this, but to make things clear for current Ceph
> > Luminous 12.2.5 users. Is the following statement correct?
> >
> > If you do _NOT_ use EC in your 12.2.5 cluster (only replicated pools),
> there is
> > no need to quiesce IO (ceph osd pause).
>
> Correct.
>
> > http://docs.ceph.com/docs/master/releases/luminous/#
> upgrading-from-other-versions
> > If your cluster did not run v12.2.5 or v12.2.6 then none of the above
> > issues apply to you and you should upgrade normally.
> >
> > ^^ Above section would indicate all 12.2.5 luminous clusters.
>
> The intent here is to clarify that any cluster running 12.2.4 or
> older can upgrade without reading carefully. If the cluster
> does/did run 12.2.5 or .6, then read carefully because it may (or may not)
> be affected.
>
> Does that help? Any suggested revisions to the wording in the release
> notes that make it clearer are welcome!
>
> Thanks-
> sage
>
>
> >
> > Please clarify,
> >
> > Thanks,
> >
> > Stefan
> >
> > --
> > | BIT BV http://www.bit.nl/ Kamer van Koophandel 09090351
> > | GPG: 0xD14839C6 +31 318 648 688 / i...@bit.nl
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majord...@vger.kernel.org
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
> >
> >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] config ceph with rdma error

2018-07-18 Thread Will Zhao
Hi all:



By following the instructions:

(https://community.mellanox.com/docs/DOC-2721)

(https://community.mellanox.com/docs/DOC-2693)

(http://hwchiu.com/2017-05-03-ceph-with-rdma.html)



I'm trying to configure Ceph with the RDMA feature in the following environment:



CentOS Linux release 7.2.1511 (Core)

MLNX_OFED_LINUX-4.4-1.0.0.0:

Mellanox Technologies MT27500 Family [ConnectX-3]



rping works between all nodes, and I added these lines to ceph.conf to enable
RDMA:



public_network = 10.10.121.0/24

cluster_network = 10.10.121.0/24

ms_type = async+rdma

ms_async_rdma_device_name = mlx4_0

ms_async_rdma_port_num = 2



The IB network uses 10.10.121.0/24 addresses, and the "ibdev2netdev" command
shows that port 2 is up.
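
A quick way to double-check that mlx4_0 port 2 is really the port carrying the
10.10.121.0/24 network (standard OFED tools; shown only as a sketch):

    ibdev2netdev              # maps mlx4_0 ports to netdevs; note which netdev holds 10.10.121.x
    ibv_devinfo -d mlx4_0     # port 2 should report state PORT_ACTIVE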

The error occurs when running "ceph-deploy --overwrite-conf mon
create-initial"; the ceph-deploy log details follow:



[2018-07-12 17:53:48,943][ceph_deploy.conf][DEBUG ] found configuration
file at: /home/user1/.cephdeploy.conf

[2018-07-12 17:53:48,944][ceph_deploy.cli][INFO  ] Invoked (1.5.37):
/usr/bin/ceph-deploy --overwrite-conf mon create-initial

[2018-07-12 17:53:48,944][ceph_deploy.cli][INFO  ] ceph-deploy options:

[2018-07-12 17:53:48,944][ceph_deploy.cli][INFO  ]
username  : None

[2018-07-12 17:53:48,944][ceph_deploy.cli][INFO  ]
verbose   : False

[2018-07-12 17:53:48,944][ceph_deploy.cli][INFO  ]
overwrite_conf: True

[2018-07-12 17:53:48,944][ceph_deploy.cli][INFO  ]
subcommand: create-initial

[2018-07-12 17:53:48,944][ceph_deploy.cli][INFO  ]  quiet
  : False

[2018-07-12 17:53:48,945][ceph_deploy.cli][INFO  ]
cd_conf   : 

[2018-07-12 17:53:48,945][ceph_deploy.cli][INFO  ]
cluster   : ceph

[2018-07-12 17:53:48,945][ceph_deploy.cli][INFO  ]
func  : 

[2018-07-12 17:53:48,945][ceph_deploy.cli][INFO  ]
ceph_conf : None

[2018-07-12 17:53:48,945][ceph_deploy.cli][INFO  ]
default_release   : False

[2018-07-12 17:53:48,945][ceph_deploy.cli][INFO  ]
keyrings  : None

[2018-07-12 17:53:48,947][ceph_deploy.mon][DEBUG ] Deploying mon, cluster
ceph hosts node1

[2018-07-12 17:53:48,947][ceph_deploy.mon][DEBUG ] detecting platform for
host node1 ...

[2018-07-12 17:53:49,005][node1][DEBUG ] connection detected need for sudo

[2018-07-12 17:53:49,039][node1][DEBUG ] connected to host: node1

[2018-07-12 17:53:49,040][node1][DEBUG ] detect platform information from
remote host

[2018-07-12 17:53:49,073][node1][DEBUG ] detect machine type

[2018-07-12 17:53:49,078][node1][DEBUG ] find the location of an executable

[2018-07-12 17:53:49,079][ceph_deploy.mon][INFO  ] distro info: CentOS
Linux 7.2.1511 Core

[2018-07-12 17:53:49,079][node1][DEBUG ] determining if provided host has
same hostname in remote

[2018-07-12 17:53:49,079][node1][DEBUG ] get remote short hostname

[2018-07-12 17:53:49,080][node1][DEBUG ] deploying mon to node1

[2018-07-12 17:53:49,080][node1][DEBUG ] get remote short hostname

[2018-07-12 17:53:49,081][node1][DEBUG ] remote hostname: node1

[2018-07-12 17:53:49,083][node1][DEBUG ] write cluster configuration to
/etc/ceph/{cluster}.conf

[2018-07-12 17:53:49,084][node1][DEBUG ] create the mon path if it does not
exist

[2018-07-12 17:53:49,085][node1][DEBUG ] checking for done path:
/var/lib/ceph/mon/ceph-node1/done

[2018-07-12 17:53:49,085][node1][DEBUG ] create a done file to avoid
re-doing the mon deployment

[2018-07-12 17:53:49,086][node1][DEBUG ] create the init path if it does
not exist

[2018-07-12 17:53:49,089][node1][INFO  ] Running command: sudo systemctl
enable ceph.target

[2018-07-12 17:53:49,365][node1][INFO  ] Running command: sudo systemctl
enable ceph-mon@node1

[2018-07-12 17:53:49,588][node1][INFO  ] Running command: sudo systemctl
start ceph-mon@node1

[2018-07-12 17:53:51,762][node1][INFO  ] Running command: sudo ceph
--cluster=ceph --admin-daemon /var/run/ceph/ceph-mon.node1.asok mon_status

[2018-07-12 17:53:51,979][node1][DEBUG ]


[2018-07-12 17:53:51,979][node1][DEBUG ] status for monitor: mon.node1

[2018-07-12 17:53:51,980][node1][DEBUG ] {

[2018-07-12 17:53:51,980][node1][DEBUG ]   "election_epoch": 3,

[2018-07-12 17:53:51,980][node1][DEBUG ]   "extra_probe_peers": [],

[2018-07-12 17:53:51,980][node1][DEBUG ]   "feature_map": {

[2018-07-12 17:53:51,981][node1][DEBUG ] "mon": {

[2018-07-12 17:53:51,981][node1][DEBUG ]   "group": {

[2018-07-12 17:53:51,981][node1][DEBUG ] "features":
"0x1ffddff8eea4fffb",

[2018-07-12 17:53:51,981][node1][DEBUG ] "num": 1,

[2018-07-12 17:53:51,981][node1][DEBUG ] "release": "luminous"

[2018-07-12 17:53:51,981][node1][DEBUG ]   }

[2018-07-12 17:53:51,981][node1][DEBUG ] }

[2018-07-12 17:53:51,982][node1][DEBUG ]   },

[2018-07-12 17:53:51,982][node1][DEBUG ]   "features": {

[2018-07-12 

Re: [ceph-users] Recovery from 12.2.5 (corruption) -> 12.2.6 (hair on fire) -> 13.2.0 (some objects inaccessible and CephFS damaged)

2018-07-18 Thread Brad Hubbard
On Wed, Jul 18, 2018 at 2:57 AM, Troy Ablan  wrote:
> I was on 12.2.5 for a couple weeks and started randomly seeing
> corruption, moved to 12.2.6 via yum update on Sunday, and all hell broke
> loose.  I panicked and moved to Mimic, and when that didn't solve the
> problem, only then did I start to root around in mailing lists archives.
>
> It appears I can't downgrade OSDs back to Luminous now that 12.2.7 is
> out, but I'm unsure how to proceed now that the damaged cluster is
> running under Mimic.  Is there anything I can do to get the cluster back
> online and objects readable?

That depends on what the specific problem is. Can you provide some
data that fills in the blanks around "randomly seeing corruption"?

>
> Everything is BlueStore and most of it is EC.
>
> Thanks.
>
> -Troy
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Cheers,
Brad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com