Re: [ceph-users] PG Balancer Upmap mode not working

2019-12-09 Thread Lars Täuber
Hi Anthony!

Mon, 9 Dec 2019 17:11:12 -0800
Anthony D'Atri  ==> ceph-users  
:
> > How is that possible? I dont know how much more proof I need to present 
> > that there's a bug.  
> 
> FWIW, your pastes are hard to read with all the ? in them.  Pasting 
> non-7-bit-ASCII?

I don't see many "?" in his posts. Maybe a display issue on your side?

> > |I increased PGs and see no difference.  
> 
> From what pgp_num to what new value?  Numbers that are not a power of 2 can 
> contribute to the sort of problem you describe.  Do you have host CRUSH fault 
> domain?
> 

Does the fault domain play a role in this situation? I can't see why it would. 
It would only matter if the OSDs weren't evenly distributed across the hosts.
Philippe, can you post your 'ceph osd tree'?

> > Raising PGs to 100 is an old statement anyway, anything 60+ should be fine. 
> >   
> 
> Fine in what regard?  To be sure, Wido’s advice means a *ratio* of at least 
> 100.  ratio = (pgp_num * replication) / #osds
> 
> The target used to be 200, a commit around 12.2.1 retconned that to 100.  
> Best I can tell the rationale is memory usage at the expense of performance.
> 
> Is your original except complete? Ie., do you only have 24 OSDs?  Across how 
> many nodes?
> 
> The old guidance for tiny clusters:
> 
> • Less than 5 OSDs set pg_num to 128
> 
> • Between 5 and 10 OSDs set pg_num to 512
> 
> • Between 10 and 50 OSDs set pg_num to 1024

This is what I thought too. But in this post
https://lists.ceph.io/hyperkitty/list/ceph-us...@ceph.io/message/TR6CJQKSMOHNGOMQO4JBDMGEL2RMWE36/
[Why are the mailing lists ceph.io and ceph.com not merged? It's hard to find 
the link to messages this way.]
Konstantin suggested reducing pg_num to 512. The cluster had 35 OSDs.
It is still merging the PGs, very slowly.

In the meantime I added 5 more OSDs and am thinking about raising pg_num back 
to 1024.
I wonder how fewer PGs can balance better than 512.
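For reference, the ratio Anthony mentions can be checked quickly like this (a rough 
sketch; the pool name is a placeholder and the commands are from memory):

# ceph osd pool get <pool> pg_num
# ceph osd pool get <pool> size      # replication factor (k+m for EC pools)
# ceph osd ls | wc -l                # number of OSDs
ratio = pg_num * size / #OSDs; if that ends up too low, raise pg_num with
# ceph osd pool set <pool> pg_num 1024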

I'm in a similar situation to Philippe's with my cluster.
ceph osd df class hdd:
[…]
MIN/MAX VAR: 0.73/1.21  STDDEV: 6.27

Attached is a picture of the dashboard with tiny bars showing the data distribution. 
The nearly empty OSDs are SSDs used for their own pool.
I think there might be a bug in the balancing algorithm.
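This is what I look at when checking the balancer (a sketch, commands from memory; 
adjust to your setup):

# ceph balancer status
# ceph balancer eval              # distribution score, lower is better
# ceph osd df class hdd           # VAR and PGS columns per OSD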

Thanks,
Lars

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Impact of a small DB size with Bluestore

2019-12-01 Thread Lars Täuber
Hi,

Tue, 26 Nov 2019 13:57:51 +
Simon Ironside  ==> ceph-users@lists.ceph.com :
> Mattia Belluco said back in May:
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-May/035086.html
> 
> "when RocksDB needs to compact a layer it rewrites it
> *before* deleting the old data; if you'd like to be sure you db does not
> spill over to the spindle you should allocate twice the size of the
> biggest layer to allow for compaction."
> 
> I didn't spot anyone disagreeing so I used 64GiB DB/WAL partitions on 
> the SSDs in my most recent clusters to allow for this and to be certain 
> that I definitely had room for the WAL on top and wouldn't get caught 
> out by people saying GB (x1000^3 bytes) when they mean GiB (x1024^3 
> bytes). I left the rest of the SSD empty to make the most of wear 
> leveling, garbage collection etc.
> 
> Simon


this is something I would like to get a comment on from a developer, too.
So what about doubling the size of block_db?
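In the meantime, a quick way to see whether the DB has already spilled over to the 
slow device (a sketch from memory; osd.0 is just an example, the counter names 
should be db_used_bytes and slow_used_bytes):

# ceph health detail | grep -i spillover
# ceph daemon osd.0 perf dump | grep -A 16 '"bluefs"'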

Thanks
Lars

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] NVMe disk - size

2019-11-17 Thread Lars Täuber
Hi Kristof,

may I add another choice?
I configured my SSDs this way.

Every host for OSDs has two fast and durable SSDs.
Both SSDs are in one RAID1 which then is split up into LVs.

I took 58 GB per OSD for DB & WAL (plus extra space for a special action of 
the DB, compaction I think).
There were a few hundred GB left on this RAID1, which I used to form a faster 
SSD-OSD. This OSD is put into its own device class.

So I have (slower) pools put onto OSDs of class "hdd" and (faster) pools put 
onto OSDs of class "ssd".
The faster pools are used for metadata of CephFS.
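To give an idea, pinning a pool to the "ssd" class looks roughly like this 
(a sketch; the rule and pool names are placeholders):

# ceph osd crush rule create-replicated ssd-rule default host ssd
# ceph osd pool set cephfs_metadata crush_rule ssd-rule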

Good luck,
Lars


Mon, 18 Nov 2019 07:46:23 +0100
Kristof Coucke  ==> vita...@yourcmc.ru :
> Hi all,
> 
> Thanks for the feedback.
> Though, just to be sure:
> 
> 1. There is no 30GB limit if I understand correctly for the RocksDB size.
> If metadata crosses that barrier, the L4 part will spillover to the primary
> device? Or will it just move the RocksDB completely? Or will it just stop
> and indicate it's full?
> 2. Since the WAL will also be written to that device, I assume a few
> additional GBs are still useful...
> 
> With my setup (13x 14TB + 2 NVMe of 1.6TB / host, 10 hosts) I have multiple
> possible scenarios:
> - Assigning 35GB of space of the NVMe disk (30GB for DB, 5 spare) would
> result in only 455GB being used (13 x 35GB). This is a pity, since I have
> 3.2TB of NVMe disk space...
> 
> Options line-up:
> 
> *Option a*: Not using the NVMe for block.db storage, but as RGW metadata
> pool.
> Advantages:
> - Impact of 1 defect NVMe is limited.
> - Fast storage for the metadata pool.
> Disadvantage:
> - RocksDB for each OSD is on the primary disk, resulting in slower
> performance of each OSD.
> 
> *Option b: *Hardware mirror of the NVMe drive
> Advantages:
> - Impact of 1 defect NVMe is limited
> - Fast KV lookup for each OSD
> Disadvantage:
> - I/O to NVMe is serialized for all OSDs on 1 host. Though the NVMe are
> fast, I imagine that there still is an impact.
> - 1 TB of NVMe is not used / host
> 
> *Option c: *Split the NVMe's across the OSDs
> Advantages:
> - Fast RocksDB access - up to L3 (assuming spillover does its job)
> Disadvantage:
> - 1 defect NVMe impacts max 7 OSDs (1 NVMe assigned to 7 or 6 OSD daemons
> per host)
> - 2.7TB of NVMe space not used per host
> 
> *Option d: *1 NVMe disk for OSDs, 1 for RGW metadata pool
> Advantages:
> - Fast RocksDB access - up to L3
> - Fast RGW metadata pool (though limited to 5,3TB (raw pool size will be
> 16TB, divided by 3 due to replication) I assume this already gives some
> possibilities
> Disadvantages:
> - 1 defect NVMe might impact a complete host (all OSDs might be using it
> for the RocksDB storage)
> - 1 TB of NVMe is not used
> 
> Tough menu to choose from, each with its possibilities... The initial idea
> was to assign 200GB of the NVMe space per OSD, but this would
> result in a lot of unused space. I don't know if there is anything on the
> roadmap to adapt the RocksDB sizing to make better use of the available
> NVMe disk space.
> With all the information, I would assume that the best option would be *option
> A*. Since we will be using erasure coding for the RGW data pool (k=6, m=3),
> the impact of a defect NVMe would be too significant. The other alternative
> would be option b, but then again we would be dealing with HW raid which is
> against all Ceph design rules.
> 
> Any other options or (dis)advantages I missed? Or any other opinions to
> choose another option?
> 
> Regards,
> 
> Kristof
> 
> Op vr 15 nov. 2019 om 18:22 schreef :
> 
> > Use 30 GB for all OSDs. Other values are pointless, because
> > https://yourcmc.ru/wiki/Ceph_performance#About_block.db_sizing
> >
> > You can use the rest of free NVMe space for bcache - it's much better
> > than just allocating it for block.db.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-ansible / block-db block-wal

2019-10-30 Thread Lars Täuber
I don't use ansible anymore. But this was my config for one of the hosts (onode2):

./host_vars/onode2.yml:

lvm_volumes:
  - data: /dev/sdb
    db: '1'
    db_vg: host-2-db
  - data: /dev/sdc
    db: '2'
    db_vg: host-2-db
  - data: /dev/sde
    db: '3'
    db_vg: host-2-db
  - data: /dev/sdf
    db: '4'
    db_vg: host-2-db
…

one config file per host. The LVs were created by hand on a PV on top of a 
RAID1 of two SSDs.
The hosts have empty slots for HDDs to be bought later, so I had to "partition" 
the PV by hand; otherwise ceph-ansible would have split the whole RAID1 across 
only the HDDs present now.
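The manual part was essentially this (a rough sketch; the VG and LV names match the 
config above, the RAID device name is illustrative):

# pvcreate /dev/md0                   # the RAID1 of the two SSDs (assumption)
# vgcreate host-2-db /dev/md0
# lvcreate -L 58G -n 1 host-2-db      # repeated for LVs '2', '3', '4', ...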

It is said that only certain sizes of DB & WAL partitions are sensible.
I now use 58GiB LVs.
The remaining space in the RAID1 is used for a faster OSD.


Lars


Wed, 30 Oct 2019 10:02:23 +
CUZA Frédéric  ==> "ceph-users@lists.ceph.com" 
 :
> Hi Everyone,
> 
> Does anyone know how to indicate block-db and block-wal to device on ansible ?
> In ceph-deploy it is quite easy :
> ceph-deploy osd create osd_host08 --data /dev/sdl --block-db /dev/sdm12 
> --block-wal /dev/sdn12 -bluestore
> 
> On my data nodes I have 12 HDDs and 2 SSDs I use those SSDs for block-db and 
> block-wal.
> How to indicate for each osd which partition to use ?
> 
> And finally, how do you handle the deployment if you have multiple data nodes 
> setup ?
> SSDs on sdm and sdn on one host and SSDs on sda and sdb on another ?
> 
> Thank you for your help.
> 
> Regards,


-- 
Informationstechnologie
Berlin-Brandenburgische Akademie der Wissenschaften
Jägerstraße 22-23  10117 Berlin
Tel.: +49 30 20370-352   http://www.bbaw.de
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS client hanging and cache issues

2019-10-30 Thread Lars Täuber
Hi.

Sounds like you use kernel clients with kernels from Canonical/Ubuntu.
Two kernels have a bug:
4.15.0-66
and
5.0.0-32

Updated kernels are said to have fixes.
Older kernels also work: 
4.15.0-65
and
5.0.0-31
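If it helps, the kernel versions of connected clients can also be checked from the 
MDS side (a sketch from memory; I believe the session metadata of kernel clients 
includes their kernel version):

# ceph daemon mds.<name> session ls | grep -i kernel_version
and on the clients simply:
# uname -r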


Lars

Wed, 30 Oct 2019 09:42:16 +
Bob Farrell  ==> ceph-users  :
> Hi. We are experiencing a CephFS client issue on one of our servers.
> 
> ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc) nautilus
> (stable)
> 
> Trying to access, `umount`, or `umount -f` a mounted CephFS volumes causes
> my shell to hang indefinitely.
> 
> After a reboot I can remount the volumes cleanly but they drop out after <
> 1 hour of use.
> 
> I see this log entry multiple times when I reboot the server:
> ```
> cache_from_obj: Wrong slab cache. inode_cache but object is from
> ceph_inode_info
> ```
> The machine then reboots after approx. 30 minutes.
> 
> All other Ceph/CephFS clients and servers seem perfectly happy. CephFS
> cluster is HEALTH_OK.
> 
> Any help appreciated. If I can provide any further details please let me
> know.
> 
> Thanks in advance,
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cluster network down

2019-10-01 Thread Lars Täuber
Mon, 30 Sep 2019 15:21:18 +0200
Janne Johansson  ==> Lars Täuber  :
> >
> > I don't remember where I read it, but it was told that the cluster is
> > migrating its complete traffic over to the public network when the cluster
> > networks goes down. So this seems not to be the case?
> >  
> 
> Be careful with generalizations like "when a network acts up, it will be
> completely down and noticeably unreachable for all parts", since networks
> can break in thousands of not-very-obvious ways which are not 0%-vs-100%
> but somewhere in between.
> 

Ok, let me ask my question in a new way.
What does Ceph do when I switch off all switches of the cluster network?
Does Ceph handle this silently without interruption? Does the heartbeat system 
use the public network as a failover automatically?
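In case it matters, the networks an OSD is actually configured with can be checked 
like this (a sketch; osd.0 is just an example):

# ceph daemon osd.0 config show | grep -E 'cluster_network|public_network'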

Thanks
Lars
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cluster network down

2019-09-30 Thread Lars Täuber
Mon, 30 Sep 2019 14:49:48 +0200
Burkhard Linke  ==> 
ceph-users@lists.ceph.com :
> Hi,
> 
> On 9/30/19 2:46 PM, Lars Täuber wrote:
> > Hi!
> >
> > What happens when the cluster network goes down completely?
> > Is the cluster silently using the public network without interruption, or 
> > does the admin has to act?  
> 
> The cluster network is used for OSD heartbeats and backfilling/recovery 
> traffic. If the heartbeats do not work anymore, the OSDs will start to 
> report the other OSDs as down, resulting in a completely confused cluster...
> 
> 
> I would avoid an extra cluster network unless it is absolutely necessary.
> 
> 
> Regards,
> 
> Burkhard

I don't remember where I read it, but it was said that the cluster migrates 
its complete traffic over to the public network when the cluster network goes 
down. So this seems not to be the case?

Thanks
Lars
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] cluster network down

2019-09-30 Thread Lars Täuber
Hi!

What happens when the cluster network goes down completely?
Does the cluster silently use the public network without interruption, or does 
the admin have to act?

Thanks
Lars
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] BlueStore.cc: 11208: ceph_abort_msg("unexpected error")

2019-08-23 Thread Lars Täuber
Hi Paul,

The result of the fgrep is attached.
Can you do something with it?

I can't read it myself. Maybe this is the relevant part:
" bluestore(/var/lib/ceph/osd/first-16) _txc_add_transaction error (39) 
Directory not empty not handled on operation 21 (op 1, counting from 0)"
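(For reference, the filtering Paul suggested was essentially this; the log file name 
is illustrative, ours was already rotated and gzipped:)

# zgrep 7f266bdc9700 /var/log/ceph/ceph-osd.16.log.*.gz | gzip > 7f266bdc9700.log.gz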

Later I tried it again and the osd is working again.

It feels like I hit a bug!?

Huge thanks for your help.

Cheers,
Lars

Fri, 23 Aug 2019 13:36:00 +0200
Paul Emmerich  ==> Lars Täuber  :
> Filter the log for "7f266bdc9700" which is the id of the crashed
> thread, it should contain more information on the transaction that
> caused the crash.
> 
> 
> Paul
> 


7f266bdc9700.log.gz
Description: application/gzip
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] BlueStore.cc: 11208: ceph_abort_msg("unexpected error")

2019-08-23 Thread Lars Täuber
Hi there!

In our test cluster there is an OSD that won't start anymore.

Here is a short part of the log:

-1> 2019-08-23 08:56:13.316 7f266bdc9700 -1 
/tmp/release/Debian/WORKDIR/ceph-14.2.2/src/os/bluestore/BlueStore.cc: In 
function 'void BlueStore::_txc_add_transaction(BlueStore::TransContext*, 
ObjectStore::Transaction*)' thread 7f266bdc9700 time 2019-08-23 08:56:13.318938
/tmp/release/Debian/WORKDIR/ceph-14.2.2/src/os/bluestore/BlueStore.cc: 11208: 
ceph_abort_msg("unexpected error")

 ceph version 14.2.2 (4f8fa0a0024755aae7d95567c63f11d6862d55be) nautilus 
(stable)
 1: (ceph::__ceph_abort(char const*, int, char const*, 
std::__cxx11::basic_string, std::allocator > 
const&)+0xdf) [0x564406ac153a]
 2: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, 
ObjectStore::Transaction*)+0x2830) [0x5644070e48d0]
 3: 
(BlueStore::queue_transactions(boost::intrusive_ptr&,
 std::vector 
>&, boost::intrusive_ptr, ThreadPool::TPHandle*)+0x42a) 
[0x5644070ec33a]
 4: 
(ObjectStore::queue_transaction(boost::intrusive_ptr&,
 ObjectStore::Transaction&&, boost::intrusive_ptr, 
ThreadPool::TPHandle*)+0x7f) [0x564406cd620f]
 5: (PG::_delete_some(ObjectStore::Transaction*)+0x945) [0x564406d32d85]
 6: (PG::RecoveryState::Deleting::react(PG::DeleteSome const&)+0x71) 
[0x564406d337d1]
 7: (boost::statechart::simple_state, 
(boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base 
const&, void const*)+0x109) [0x564406d81ec9]
 8: (boost::statechart::state_machine, 
boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base
 const&)+0x6b) [0x564406d4e7cb]
 9: (PG::do_peering_event(std::shared_ptr, 
PG::RecoveryCtx*)+0x2af) [0x564406d3f39f]
 10: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr, 
ThreadPool::TPHandle&)+0x1b4) [0x564406c7e644]
 11: (OSD::dequeue_delete(OSDShard*, PG*, unsigned int, 
ThreadPool::TPHandle&)+0xc4) [0x564406c7e8c4]
 12: (OSD::ShardedOpWQ::_process(unsigned int, 
ceph::heartbeat_handle_d*)+0x7d7) [0x564406c72667]
 13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5b4) 
[0x56440724f7d4]
 14: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x5644072521d0]
 15: (()+0x7fa3) [0x7f26862f6fa3]
 16: (clone()+0x3f) [0x7f2685ea64cf]


The log is so huge that I don't know which part may be of interest. The quote 
above is the part I think is most useful.
Is anybody able to read and explain this?


Thanks in advance,
Lars


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph status: pg backfill_toofull, but all OSDs have enough space

2019-08-22 Thread Lars Täuber
Hi there!

We also see this behaviour on our cluster while it is moving PGs.

# ceph health detail
HEALTH_ERR 1 MDSs report slow metadata IOs; Reduced data availability: 2 pgs 
inactive; Degraded data redundancy (low space): 1 pg backfill_toofull
MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs
mdsmds1(mds.0): 1 slow metadata IOs are blocked > 30 secs, oldest blocked 
for 359 secs
PG_AVAILABILITY Reduced data availability: 2 pgs inactive
pg 21.231 is stuck inactive for 878.224182, current state remapped, last 
acting [20,2147483647,13,2147483647,15,10]
pg 21.240 is stuck inactive for 878.123932, current state remapped, last 
acting [26,17,21,20,2147483647,2147483647]
PG_DEGRADED_FULL Degraded data redundancy (low space): 1 pg backfill_toofull
pg 21.376 is active+remapped+backfill_wait+backfill_toofull, acting 
[6,11,29,2,10,15]
# ceph pg map 21.376
osdmap e68016 pg 21.376 (21.376) -> up [6,5,23,21,10,11] acting 
[6,11,29,2,10,15]
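A quick way to list the affected PGs while this is going on (a sketch; both commands 
should work on Nautilus as far as I know):

# ceph pg dump_stuck inactive
# ceph pg dump pgs_brief 2>/dev/null | grep backfill_toofull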

# ceph osd dump | fgrep ratio
full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.85

This happens while the cluster is rebalancing the pgs after I manually mark a 
single osd out.
see here:
 Subject: [ceph-users] pg 21.1f9 is stuck inactive for 53316.902820,  current 
state remapped
 http://lists.ceph.com/pipermail/ceph-users-ceph.com/2019-August/036634.html


Mostly the cluster heals itself at least into state HEALTH_WARN:


# ceph health detail
HEALTH_WARN 1 MDSs report slow metadata IOs; Reduced data availability: 2 pgs 
inactive
MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs
mdsmds1(mds.0): 1 slow metadata IOs are blocked > 30 secs, oldest blocked 
for 1155 secs
PG_AVAILABILITY Reduced data availability: 2 pgs inactive
pg 21.231 is stuck inactive for 1677.312219, current state remapped, last 
acting [20,2147483647,13,2147483647,15,10]
pg 21.240 is stuck inactive for 1677.211969, current state remapped, last 
acting [26,17,21,20,2147483647,2147483647]



Cheers,
Lars


Wed, 21 Aug 2019 17:28:05 -0500
Reed Dier  ==> Vladimir Brik 
 :
> Just chiming in to say that I too had some issues with backfill_toofull PGs, 
> despite no OSD's being in a backfill_full state, albeit, there were some 
> nearfull OSDs.
> 
> I was able to get through it by reweighting down the OSD that was the target 
> reported by ceph pg dump | grep 'backfill_toofull'.
> 
> This was on 14.2.2.
> 
> Reed
> 
> > On Aug 21, 2019, at 2:50 PM, Vladimir Brik  
> > wrote:
> > 
> > Hello
> > 
> > After increasing number of PGs in a pool, ceph status is reporting 
> > "Degraded data redundancy (low space): 1 pg backfill_toofull", but I don't 
> > understand why, because all OSDs seem to have enough space.
> > 
> > ceph health detail says:
> > pg 40.155 is active+remapped+backfill_toofull, acting [20,57,79,85]
> > 
> > $ ceph pg map 40.155
> > osdmap e3952 pg 40.155 (40.155) -> up [20,57,66,85] acting [20,57,79,85]
> > 
> > So I guess Ceph wants to move 40.155 from 66 to 79 (or other way around?). 
> > According to "osd df", OSD 66's utilization is 71.90%, OSD 79's utilization 
> > is 58.45%. The OSD with least free space in the cluster is 81.23% full, and 
> > it's not any of the ones above.
> > 
> > OSD backfillfull_ratio is 90% (is there a better way to determine this?):
> > $ ceph osd dump | grep ratio
> > full_ratio 0.95
> > backfillfull_ratio 0.9
> > nearfull_ratio 0.7
> > 
> > Does anybody know why a PG could be in the backfill_toofull state if no OSD 
> > is in the backfillfull state?
> > 
> > 
> > Vlad
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com  
> 


-- 
Informationstechnologie
Berlin-Brandenburgische Akademie der Wissenschaften
Jägerstraße 22-23  10117 Berlin
Tel.: +49 30 20370-352   http://www.bbaw.de


smime.p7s
Description: S/MIME cryptographic signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pg 21.1f9 is stuck inactive for 53316.902820, current state remapped

2019-08-22 Thread Lars Täuber
All OSDs are up.

I manually marked one of the 30 OSDs "out", not "down".
The primary OSDs of the stuck PGs are marked neither out nor down.

Thanks
Lars

Thu, 22 Aug 2019 15:01:12 +0700
wahyu.muqs...@gmail.com ==> wahyu.muqs...@gmail.com, Lars Täuber 
 :
> I think you use too few osd. when you use erasure code, the probability of 
> primary pg being on the down osd will increase
> On 22 Aug 2019 14.51 +0700, Lars Täuber , wrote:
> > There are 30 osds.
> >
> > Thu, 22 Aug 2019 14:38:10 +0700
> > wahyu.muqs...@gmail.com ==> ceph-users@lists.ceph.com, Lars Täuber 
> >  :  
> > > how many osd do you use ?  
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pg 21.1f9 is stuck inactive for 53316.902820, current state remapped

2019-08-22 Thread Lars Täuber
There are 30 osds.

Thu, 22 Aug 2019 14:38:10 +0700
wahyu.muqs...@gmail.com ==> ceph-users@lists.ceph.com, Lars Täuber 
 :
> how many osd do you use ?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] pg 21.1f9 is stuck inactive for 53316.902820, current state remapped

2019-08-22 Thread Lars Täuber
Hi all,

we are using ceph in version 14.2.2 from 
https://mirror.croit.io/debian-nautilus/ on debian buster and experiencing 
problems with cephfs.

The mounted file system produces hanging processes due to PGs stuck inactive. 
This often happens after I mark single OSDs out manually.
A typical result is this:

HEALTH_WARN 1 MDSs report slow metadata IOs; 1 MDSs behind on trimming; Reduced 
data availability: 4 pgs inactive
MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs
mdsmds1(mds.0): 4 slow metadata IOs are blocked > 30 secs, oldest blocked 
for 51206 secs
MDS_TRIM 1 MDSs behind on trimming
mdsmds1(mds.0): Behind on trimming (4298/128) max_segments: 128, 
num_segments: 4298
PG_AVAILABILITY Reduced data availability: 4 pgs inactive
pg 21.1f9 is stuck inactive for 52858.655306, current state remapped, last 
acting [8,2147483647,2147483647,26,27,11]
pg 21.22f is stuck inactive for 52858.636207, current state remapped, last 
acting [27,26,4,2147483647,15,2147483647]
pg 21.2b5 is stuck inactive for 52865.857165, current state remapped, last 
acting [6,2147483647,21,27,11,2147483647]
pg 21.3ed is stuck inactive for 52865.852710, current state remapped, last 
acting [26,18,14,20,2147483647,2147483647]

The placement groups are from an erasure coded pool.

# ceph osd erasure-code-profile get CLAYje4_2_5
crush-device-class=
crush-failure-domain=host
crush-root=default
d=5
k=4
m=2
plugin=clay

Restarting the primary OSD of the stuck PGs helps to get them active again.
This problem keeps us from using this cluster as a production system.
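(Concretely, what I do is roughly this; the PG id is taken from the output above, 
the OSD id comes out of the query:)

# ceph pg 21.1f9 query | grep acting_primary   # find the primary OSD
# systemctl restart ceph-osd@8                 # on the host carrying that OSD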

I'm still a beginner with ceph and this cluster is still in its testing phase.

What am I doing wrong?
Is this problem a symptom of using the clay erasure code?

Thanks
Lars
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SOLVED - MDSs report damaged metadata

2019-08-20 Thread Lars Täuber
Hi all!

I solved this situation by restarting the active MDS. The next MDS took 
over and the error was gone.
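(For the archive, the restart was nothing special; either of these should do it, 
the daemon name is from my setup:)

# systemctl restart ceph-mds@mds3     # on the host running the active MDS
or
# ceph mds fail mds3                  # let a standby take over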

This is somehow a strange situation, similar to restarting primary OSDs 
when PGs have scrub errors.

Maybe this should be researched a bit deeper.

Thanks all for this great storage solution!

Cheers,
Lars


Tue, 20 Aug 2019 07:30:11 +0200
Lars Täuber  ==> ceph-users@lists.ceph.com :
> Hi there!
> 
> Does anyone else have an idea what I could do to get rid of this error?
> 
> BTW: it is the third time that the pg 20.0 is gone inconsistent.
> This is a pg from the metadata pool (cephfs).
> May this be related anyhow?
> 
> # ceph health detail
> HEALTH_ERR 1 MDSs report damaged metadata; 1 scrub errors; Possible data 
> damage: 1 pg inconsistent
> MDS_DAMAGE 1 MDSs report damaged metadata
> mdsmds3(mds.0): Metadata damage detected
> OSD_SCRUB_ERRORS 1 scrub errors
> PG_DAMAGED Possible data damage: 1 pg inconsistent
> pg 20.0 is active+clean+inconsistent, acting [9,27,15]
> 
> 
> Best regards,
> Lars
> 
> 
> Mon, 19 Aug 2019 13:51:59 +0200
> Lars Täuber  ==> Paul Emmerich  :
> > Hi Paul,
> > 
> > thanks for the hint.
> > 
> > I did a recursive scrub from "/". The log says there where some inodes with 
> > bad backtraces repaired. But the error remains.
> > May this have something to do with a deleted file? Or a file within a 
> > snapshot?
> > 
> > The path told by
> > 
> > # ceph tell mds.mds3 damage ls
> > 2019-08-19 13:43:04.608 7f563f7f6700  0 client.894552 ms_handle_reset on 
> > v2:192.168.16.23:6800/176704036
> > 2019-08-19 13:43:04.624 7f56407f8700  0 client.894558 ms_handle_reset on 
> > v2:192.168.16.23:6800/176704036
> > [
> > {
> > "damage_type": "backtrace",
> > "id": 3760765989,
> > "ino": 1099518115802,
> > "path": "~mds0/stray7/15161f7/dovecot.index.backup"
> > }
> > ]
> > 
> > starts a bit strange to me.
> > 
> > Are the snapshots also repaired with a recursive repair operation?
> > 
> > Thanks
> > Lars
> > 
> > 
> > Mon, 19 Aug 2019 13:30:53 +0200
> > Paul Emmerich  ==> Lars Täuber  :  
> > > Hi,
> > > 
> > > that error just says that the path is wrong. I unfortunately don't
> > > know the correct way to instruct it to scrub a stray path off the top
> > > of my head; you can always run a recursive scrub on / to go over
> > > everything, though
> > > 
> > > 
> > > Paul
> > > 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


-- 
Informationstechnologie
Berlin-Brandenburgische Akademie der Wissenschaften
Jägerstraße 22-23  10117 Berlin
Tel.: +49 30 20370-352   http://www.bbaw.de
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDSs report damaged metadata

2019-08-19 Thread Lars Täuber
Hi there!

Does anyone else have an idea what I could do to get rid of this error?

BTW: it is the third time that pg 20.0 has gone inconsistent.
This is a pg from the metadata pool (cephfs).
Might this be related somehow?

# ceph health detail
HEALTH_ERR 1 MDSs report damaged metadata; 1 scrub errors; Possible data 
damage: 1 pg inconsistent
MDS_DAMAGE 1 MDSs report damaged metadata
mdsmds3(mds.0): Metadata damage detected
OSD_SCRUB_ERRORS 1 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
pg 20.0 is active+clean+inconsistent, acting [9,27,15]


Best regards,
Lars


Mon, 19 Aug 2019 13:51:59 +0200
Lars Täuber  ==> Paul Emmerich  :
> Hi Paul,
> 
> thanks for the hint.
> 
> I did a recursive scrub from "/". The log says there where some inodes with 
> bad backtraces repaired. But the error remains.
> May this have something to do with a deleted file? Or a file within a 
> snapshot?
> 
> The path told by
> 
> # ceph tell mds.mds3 damage ls
> 2019-08-19 13:43:04.608 7f563f7f6700  0 client.894552 ms_handle_reset on 
> v2:192.168.16.23:6800/176704036
> 2019-08-19 13:43:04.624 7f56407f8700  0 client.894558 ms_handle_reset on 
> v2:192.168.16.23:6800/176704036
> [
> {
> "damage_type": "backtrace",
> "id": 3760765989,
> "ino": 1099518115802,
> "path": "~mds0/stray7/15161f7/dovecot.index.backup"
> }
> ]
> 
> starts a bit strange to me.
> 
> Are the snapshots also repaired with a recursive repair operation?
> 
> Thanks
> Lars
> 
> 
> Mon, 19 Aug 2019 13:30:53 +0200
> Paul Emmerich  ==> Lars Täuber  :
> > Hi,
> > 
> > that error just says that the path is wrong. I unfortunately don't
> > know the correct way to instruct it to scrub a stray path off the top
> > of my head; you can always run a recursive scrub on / to go over
> > everything, though
> > 
> > 
> > Paul
> >   
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDSs report damaged metadata - "return_code": -116

2019-08-19 Thread Lars Täuber
Hi Paul,

thanks for the hint.

I did a recursive scrub from "/". The log says there were some inodes with bad 
backtraces that got repaired. But the error remains.
Might this have something to do with a deleted file? Or a file within a snapshot?

The path told by

# ceph tell mds.mds3 damage ls
2019-08-19 13:43:04.608 7f563f7f6700  0 client.894552 ms_handle_reset on 
v2:192.168.16.23:6800/176704036
2019-08-19 13:43:04.624 7f56407f8700  0 client.894558 ms_handle_reset on 
v2:192.168.16.23:6800/176704036
[
{
"damage_type": "backtrace",
"id": 3760765989,
"ino": 1099518115802,
"path": "~mds0/stray7/15161f7/dovecot.index.backup"
}
]

looks a bit strange to me.

Are the snapshots also repaired with a recursive repair operation?

Thanks
Lars


Mon, 19 Aug 2019 13:30:53 +0200
Paul Emmerich  ==> Lars Täuber  :
> Hi,
> 
> that error just says that the path is wrong. I unfortunately don't
> know the correct way to instruct it to scrub a stray path off the top
> of my head; you can always run a recursive scrub on / to go over
> everything, though
> 
> 
> Paul
> 


-- 
Informationstechnologie
Berlin-Brandenburgische Akademie der Wissenschaften
Jägerstraße 22-23  10117 Berlin
Tel.: +49 30 20370-352   http://www.bbaw.de
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDSs report damaged metadata - "return_code": -116

2019-08-19 Thread Lars Täuber
Hi all!

Where can I look up what the error number means?
Or did I do something wrong in my command line?
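(For reference: the return code is a negative errno value; on Linux 116 is ESTALE, 
"Stale file handle". A quick way to look such numbers up:)

$ python3 -c 'import errno, os; print(errno.errorcode[116], os.strerror(116))'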

Thanks in advance,
Lars

Fri, 16 Aug 2019 13:31:38 +0200
Lars Täuber  ==> Paul Emmerich  :
> Hi Paul,
> 
> thank you for your help. But I get the following error:
> 
> # ceph tell mds.mds3 scrub start 
> "~mds0/stray7/15161f7/dovecot.index.backup" repair
> 2019-08-16 13:29:40.208 7f7e927fc700  0 client.881878 ms_handle_reset on 
> v2:192.168.16.23:6800/176704036
> 2019-08-16 13:29:40.240 7f7e937fe700  0 client.867786 ms_handle_reset on 
> v2:192.168.16.23:6800/176704036
> {
> "return_code": -116
> }
> 
> 
> 
> Lars
> 
> 
> Fri, 16 Aug 2019 13:17:08 +0200
> Paul Emmerich  ==> Lars Täuber  :
> > Hi,
> > 
> > damage_type backtrace is rather harmless and can indeed be repaired
> > with the repair command, but it's called scrub_path.
> > Also you need to pass the name and not the rank of the MDS as id, it should 
> > be
> > 
> > # (on the server where the MDS is actually running)
> > ceph daemon mds.mds3 scrub_path ...
> > 
> > But you should also be able to use ceph tell since nautilus which is a
> > little bit easier because it can be run from any node:
> > 
> > ceph tell mds.mds3 scrub start 'PATH' repair
> > 
> > 
> > Paul
> >   
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDSs report damaged metadata

2019-08-16 Thread Lars Täuber
Hi Paul,

thank you for your help. But I get the following error:

# ceph tell mds.mds3 scrub start 
"~mds0/stray7/15161f7/dovecot.index.backup" repair
2019-08-16 13:29:40.208 7f7e927fc700  0 client.881878 ms_handle_reset on 
v2:192.168.16.23:6800/176704036
2019-08-16 13:29:40.240 7f7e937fe700  0 client.867786 ms_handle_reset on 
v2:192.168.16.23:6800/176704036
{
"return_code": -116
}



Lars


Fri, 16 Aug 2019 13:17:08 +0200
Paul Emmerich  ==> Lars Täuber  :
> Hi,
> 
> damage_type backtrace is rather harmless and can indeed be repaired
> with the repair command, but it's called scrub_path.
> Also you need to pass the name and not the rank of the MDS as id, it should be
> 
> # (on the server where the MDS is actually running)
> ceph daemon mds.mds3 scrub_path ...
> 
> But you should also be able to use ceph tell since nautilus which is a
> little bit easier because it can be run from any node:
> 
> ceph tell mds.mds3 scrub start 'PATH' repair
> 
> 
> Paul
> 


-- 
Informationstechnologie
Berlin-Brandenburgische Akademie der Wissenschaften
Jägerstraße 22-23  10117 Berlin
Tel.: +49 30 20370-352   http://www.bbaw.de
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] MDSs report damaged metadata

2019-08-16 Thread Lars Täuber
Hi all!

The MDS of our ceph cluster produces a HEALTH_ERR state.
It is a Nautilus 14.2.2 on Debian buster, installed from the repo made by 
croit.io, with OSDs on BlueStore.

The symptom:
# ceph -s
  cluster:
health: HEALTH_ERR
1 MDSs report damaged metadata

  services:
mon: 3 daemons, quorum mon1,mon2,mon3 (age 2d)
mgr: mon3(active, since 2d), standbys: mon2, mon1
mds: cephfs_1:1 {0=mds3=up:active} 2 up:standby
osd: 30 osds: 30 up (since 17h), 29 in (since 19h)
 
  data:
pools:   3 pools, 1153 pgs
objects: 435.21k objects, 806 GiB
usage:   4.7 TiB used, 162 TiB / 167 TiB avail
pgs: 1153 active+clean


# ceph health detail
HEALTH_ERR 1 MDSs report damaged metadata
MDS_DAMAGE 1 MDSs report damaged metadata
mdsmds3(mds.0): Metadata damage detected

#ceph tell mds.0 damage ls
2019-08-16 07:20:09.415 7f1254ff9700  0 client.840758 ms_handle_reset on 
v2:192.168.16.23:6800/176704036
2019-08-16 07:20:09.431 7f1255ffb700  0 client.840764 ms_handle_reset on 
v2:192.168.16.23:6800/176704036
[
{
"damage_type": "backtrace",
"id": 3760765989,
"ino": 1099518115802,
"path": "~mds0/stray7/15161f7/dovecot.index.backup"
}
]



I tried this without much luck:
# ceph daemon mds.0 "~mds0/stray7/15161f7/dovecot.index.backup" recursive 
repair
admin_socket: exception getting command descriptions: [Errno 2] No such file or 
directory


Is there a way out of this error?

Thanks and best regards,
Lars
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] writable snapshots in cephfs? GDPR/DSGVO

2019-07-11 Thread Lars Täuber
Thu, 11 Jul 2019 10:24:16 +0200
"Marc Roos"  ==> ceph-users 
, lmb  :
> What about creating snaps on a 'lower level' in the directory structure 
> so you do not need to remove files from a snapshot as a work around?

Thanks for the idea. This might be a solution for our use case.
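(For anyone finding this later: per-directory snapshots in CephFS are just mkdir and 
rmdir in the hidden .snap directory, so "lower level" snapshots would look roughly 
like this; the paths are examples:)

# mkdir /cephfs/projects/projectX/.snap/2019-07-11
# rmdir /cephfs/projects/projectX/.snap/2019-07-11   # removes only this directory's snapshot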

Regards,
Lars
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] writable snapshots in cephfs? GDPR/DSGVO

2019-07-11 Thread Lars Täuber
Thu, 11 Jul 2019 10:21:16 +0200
Lars Marowsky-Bree  ==> ceph-users@lists.ceph.com :
> On 2019-07-10T09:59:08, Lars Täuber   wrote:
> 
> > Hi everbody!
> > 
> > Is it possible to make snapshots in cephfs writable?
> > We need to remove files because of this General Data Protection Regulation 
> > also from snapshots.  
> 
> Removing data from existing WORM storage is tricky, snapshots being a
> specific form thereof.

We would like it to be non-WORM storage. It is not meant to be used as an archive.

Thanks,
Lars

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] writable snapshots in cephfs? GDPR/DSGVO

2019-07-10 Thread Lars Täuber
Hi everybody!

Is it possible to make snapshots in cephfs writable?
We need to remove files, also from snapshots, because of the General Data 
Protection Regulation.

Thanks and best regards,
Lars
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What does the differences in osd benchmarks mean?

2019-06-28 Thread Lars Täuber
Hi Nathan,

yes, the OSD hosts are dual-socket machines. But does this make such a difference?

osd.0: bench: wrote 1 GiB in blocks of 4 MiB in 15.0133 sec at  68 MiB/sec 17 
IOPS
osd.1: bench: wrote 1 GiB in blocks of 4 MiB in 6.98357 sec at 147 MiB/sec 36 
IOPS

Doubling the IOPS?
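(If NUMA is the suspect, this is what I would check; commands from memory, and if I 
remember correctly the second one exists since Nautilus:)

# lscpu | grep -i numa        # NUMA node layout of the host
# ceph osd numa-status        # which NUMA node each OSD ended up on, if any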

Thanks,
Lars

Thu, 27 Jun 2019 11:16:31 -0400
Nathan Fish  ==> Ceph Users  :
> Are these dual-socket machines? Perhaps NUMA is involved?
> 
> On Thu., Jun. 27, 2019, 4:56 a.m. Lars Täuber,  wrote:
> 
> > Hi!
> >
> > In our cluster I ran some benchmarks.
> > The results are always similar but strange to me.
> > I don't know what the results mean.
> > The cluster consists of 7 (nearly) identical hosts for osds. Two of them
> > have one an additional hdd.
> > The hdds are from identical type. The ssds for the journal and wal are of
> > identical type. The configuration is identical (ssd-db-lv-size) for each
> > osd.
> > The hosts are connected the same way to the same switches.
> > This nautilus cluster was set up with ceph-ansible 4.0 on debian buster.
> >
> > This are the results of
> > # ceph --format plain tell osd.* bench
> >
> > osd.0: bench: wrote 1 GiB in blocks of 4 MiB in 15.0133 sec at 68 MiB/sec
> > 17 IOPS
> > osd.1: bench: wrote 1 GiB in blocks of 4 MiB in 6.98357 sec at 147 MiB/sec
> > 36 IOPS
> > osd.2: bench: wrote 1 GiB in blocks of 4 MiB in 6.80336 sec at 151 MiB/sec
> > 37 IOPS
> > osd.3: bench: wrote 1 GiB in blocks of 4 MiB in 12.0813 sec at 85 MiB/sec
> > 21 IOPS
> > osd.4: bench: wrote 1 GiB in blocks of 4 MiB in 8.51311 sec at 120 MiB/sec
> > 30 IOPS
> > osd.5: bench: wrote 1 GiB in blocks of 4 MiB in 6.61376 sec at 155 MiB/sec
> > 38 IOPS
> > osd.6: bench: wrote 1 GiB in blocks of 4 MiB in 14.7478 sec at 69 MiB/sec
> > 17 IOPS
> > osd.7: bench: wrote 1 GiB in blocks of 4 MiB in 12.9266 sec at 79 MiB/sec
> > 19 IOPS
> > osd.8: bench: wrote 1 GiB in blocks of 4 MiB in 15.2513 sec at 67 MiB/sec
> > 16 IOPS
> > osd.9: bench: wrote 1 GiB in blocks of 4 MiB in 9.26225 sec at 111 MiB/sec
> > 27 IOPS
> > osd.10: bench: wrote 1 GiB in blocks of 4 MiB in 13.6641 sec at 75 MiB/sec
> > 18 IOPS
> > osd.11: bench: wrote 1 GiB in blocks of 4 MiB in 13.8943 sec at 74 MiB/sec
> > 18 IOPS
> > osd.12: bench: wrote 1 GiB in blocks of 4 MiB in 13.235 sec at 77 MiB/sec
> > 19 IOPS
> > osd.13: bench: wrote 1 GiB in blocks of 4 MiB in 10.4559 sec at 98 MiB/sec
> > 24 IOPS
> > osd.14: bench: wrote 1 GiB in blocks of 4 MiB in 12.469 sec at 82 MiB/sec
> > 20 IOPS
> > osd.15: bench: wrote 1 GiB in blocks of 4 MiB in 17.434 sec at 59 MiB/sec
> > 14 IOPS
> > osd.16: bench: wrote 1 GiB in blocks of 4 MiB in 11.7184 sec at 87 MiB/sec
> > 21 IOPS
> > osd.17: bench: wrote 1 GiB in blocks of 4 MiB in 12.8702 sec at 80 MiB/sec
> > 19 IOPS
> > osd.18: bench: wrote 1 GiB in blocks of 4 MiB in 20.1894 sec at 51 MiB/sec
> > 12 IOPS
> > osd.19: bench: wrote 1 GiB in blocks of 4 MiB in 9.60049 sec at 107
> > MiB/sec 26 IOPS
> > osd.20: bench: wrote 1 GiB in blocks of 4 MiB in 15.0613 sec at 68 MiB/sec
> > 16 IOPS
> > osd.21: bench: wrote 1 GiB in blocks of 4 MiB in 17.6074 sec at 58 MiB/sec
> > 14 IOPS
> > osd.22: bench: wrote 1 GiB in blocks of 4 MiB in 16.39 sec at 62 MiB/sec
> > 15 IOPS
> > osd.23: bench: wrote 1 GiB in blocks of 4 MiB in 15.2747 sec at 67 MiB/sec
> > 16 IOPS
> > osd.24: bench: wrote 1 GiB in blocks of 4 MiB in 10.2462 sec at 100
> > MiB/sec 24 IOPS
> > osd.25: bench: wrote 1 GiB in blocks of 4 MiB in 13.5297 sec at 76 MiB/sec
> > 18 IOPS
> > osd.26: bench: wrote 1 GiB in blocks of 4 MiB in 7.46824 sec at 137
> > MiB/sec 34 IOPS
> > osd.27: bench: wrote 1 GiB in blocks of 4 MiB in 11.2216 sec at 91 MiB/sec
> > 22 IOPS
> > osd.28: bench: wrote 1 GiB in blocks of 4 MiB in 16.6205 sec at 62 MiB/sec
> > 15 IOPS
> > osd.29: bench: wrote 1 GiB in blocks of 4 MiB in 10.1477 sec at 101
> > MiB/sec 25 IOPS
> >
> >
> > The different runs differ by ±1 IOPS.
> > Why are the osds 1,2,4,5,9,19,26 faster than the others?
> >
> > Restarting an osd did change the result.
> >
> > Could someone give me hint where to look further to find the reason?
> >
> > Thanks
> > Lars
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >  


-- 
Informationstechnologie
Berlin-Brandenburgische Akademie der Wissenschaften
Jägerstraße 22-23  10117 Berlin
Tel.: +49 30 20370-352   http://www.bbaw.de
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] What does the differences in osd benchmarks mean?

2019-06-27 Thread Lars Täuber
Hi!

In our cluster I ran some benchmarks.
The results are always similar but strange to me.
I don't know what the results mean.
The cluster consists of 7 (nearly) identical hosts for OSDs. Two of them have 
an additional hdd.
The hdds are of identical type. The ssds for the journal and wal are of 
identical type. The configuration is identical (ssd-db-lv-size) for each osd.
The hosts are connected the same way to the same switches.
This nautilus cluster was set up with ceph-ansible 4.0 on debian buster.

This are the results of
# ceph --format plain tell osd.* bench

osd.0: bench: wrote 1 GiB in blocks of 4 MiB in 15.0133 sec at 68 MiB/sec 17 
IOPS
osd.1: bench: wrote 1 GiB in blocks of 4 MiB in 6.98357 sec at 147 MiB/sec 36 
IOPS
osd.2: bench: wrote 1 GiB in blocks of 4 MiB in 6.80336 sec at 151 MiB/sec 37 
IOPS
osd.3: bench: wrote 1 GiB in blocks of 4 MiB in 12.0813 sec at 85 MiB/sec 21 
IOPS
osd.4: bench: wrote 1 GiB in blocks of 4 MiB in 8.51311 sec at 120 MiB/sec 30 
IOPS
osd.5: bench: wrote 1 GiB in blocks of 4 MiB in 6.61376 sec at 155 MiB/sec 38 
IOPS
osd.6: bench: wrote 1 GiB in blocks of 4 MiB in 14.7478 sec at 69 MiB/sec 17 
IOPS
osd.7: bench: wrote 1 GiB in blocks of 4 MiB in 12.9266 sec at 79 MiB/sec 19 
IOPS
osd.8: bench: wrote 1 GiB in blocks of 4 MiB in 15.2513 sec at 67 MiB/sec 16 
IOPS
osd.9: bench: wrote 1 GiB in blocks of 4 MiB in 9.26225 sec at 111 MiB/sec 27 
IOPS
osd.10: bench: wrote 1 GiB in blocks of 4 MiB in 13.6641 sec at 75 MiB/sec 18 
IOPS
osd.11: bench: wrote 1 GiB in blocks of 4 MiB in 13.8943 sec at 74 MiB/sec 18 
IOPS
osd.12: bench: wrote 1 GiB in blocks of 4 MiB in 13.235 sec at 77 MiB/sec 19 
IOPS
osd.13: bench: wrote 1 GiB in blocks of 4 MiB in 10.4559 sec at 98 MiB/sec 24 
IOPS
osd.14: bench: wrote 1 GiB in blocks of 4 MiB in 12.469 sec at 82 MiB/sec 20 
IOPS
osd.15: bench: wrote 1 GiB in blocks of 4 MiB in 17.434 sec at 59 MiB/sec 14 
IOPS
osd.16: bench: wrote 1 GiB in blocks of 4 MiB in 11.7184 sec at 87 MiB/sec 21 
IOPS
osd.17: bench: wrote 1 GiB in blocks of 4 MiB in 12.8702 sec at 80 MiB/sec 19 
IOPS
osd.18: bench: wrote 1 GiB in blocks of 4 MiB in 20.1894 sec at 51 MiB/sec 12 
IOPS
osd.19: bench: wrote 1 GiB in blocks of 4 MiB in 9.60049 sec at 107 MiB/sec 26 
IOPS
osd.20: bench: wrote 1 GiB in blocks of 4 MiB in 15.0613 sec at 68 MiB/sec 16 
IOPS
osd.21: bench: wrote 1 GiB in blocks of 4 MiB in 17.6074 sec at 58 MiB/sec 14 
IOPS
osd.22: bench: wrote 1 GiB in blocks of 4 MiB in 16.39 sec at 62 MiB/sec 15 IOPS
osd.23: bench: wrote 1 GiB in blocks of 4 MiB in 15.2747 sec at 67 MiB/sec 16 
IOPS
osd.24: bench: wrote 1 GiB in blocks of 4 MiB in 10.2462 sec at 100 MiB/sec 24 
IOPS
osd.25: bench: wrote 1 GiB in blocks of 4 MiB in 13.5297 sec at 76 MiB/sec 18 
IOPS
osd.26: bench: wrote 1 GiB in blocks of 4 MiB in 7.46824 sec at 137 MiB/sec 34 
IOPS
osd.27: bench: wrote 1 GiB in blocks of 4 MiB in 11.2216 sec at 91 MiB/sec 22 
IOPS
osd.28: bench: wrote 1 GiB in blocks of 4 MiB in 16.6205 sec at 62 MiB/sec 15 
IOPS
osd.29: bench: wrote 1 GiB in blocks of 4 MiB in 10.1477 sec at 101 MiB/sec 25 
IOPS


The different runs differ by ±1 IOPS.
Why are the osds 1,2,4,5,9,19,26 faster than the others?

Restarting an osd did change the result.

Could someone give me a hint where to look further to find the reason?

Thanks
Lars
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Reduced data availability: 2 pgs inactive

2019-06-19 Thread Lars Täuber
Hi Paul,

thanks for the hint.
Restarting the primary osds of the inactive pgs resolved the problem:

Before restarting them they said:
2019-06-19 15:55:36.190 7fcd55c4e700 -1 osd.5 33858 get_health_metrics 
reporting 15 slow ops, oldest is osd_op(client.220116.0:967410 21.2e4s0 
21.d4e19ae4 (undecoded) ondisk+write+known_if_redirected e31569)
and
2019-06-19 15:53:31.214 7f9b946d1700 -1 osd.13 33849 get_health_metrics 
reporting 14560 slow ops, oldest is osd_op(mds.0.44294:99584053 23.5 
23.cad28605 (undecoded) ondisk+write+known_if_redirected+full_force e31562)

Is this something to worry about?
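(To see what those slow ops actually are, the admin socket can be asked directly; 
a sketch, with osd.5 as the example from above:)

# ceph daemon osd.5 dump_ops_in_flight
# ceph daemon osd.5 dump_historic_ops    # recently completed ops, including slow ones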

Regards,
Lars

Wed, 19 Jun 2019 15:04:06 +0200
Paul Emmerich  ==> Lars Täuber  :
> That shouldn't trigger the PG limit (yet), but increasing "mon max pg per
> osd" from the default of 200 is a good idea anyways since you are running
> with more than 200 PGs per OSD.
> 
> I'd try to restart all OSDs that are in the UP set for that PG:
> 
> 13,
> 21,
> 23
> 7,
> 29,
> 9,
> 28,
> 11,
> 8
> 
> 
> Maybe that solves it (technically it shouldn't), if that doesn't work
> you'll have to dig in deeper into the log files to see where exactly and
> why it is stuck activating.
> 
> Paul
> 


-- 
Informationstechnologie
Berlin-Brandenburgische Akademie der Wissenschaften
Jägerstraße 22-23  10117 Berlin
Tel.: +49 30 20370-352   http://www.bbaw.de
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Reduced data availability: 2 pgs inactive

2019-06-19 Thread Lars Täuber
Hi Paul,

thanks for your reply.

Wed, 19 Jun 2019 13:19:55 +0200
Paul Emmerich  ==> Lars Täuber  :
> Wild guess: you hit the PG hard limit, how many PGs per OSD do you have?
> If this is the case: increase "osd max pg per osd hard ratio"
> 
> Check "ceph pg  query" to see why it isn't activating.
> 
> Can you share the output of "ceph osd df tree" and "ceph pg  query"
> of the affected PGs?

The pg queries are attached. I can't read them - too much information.


Here is the osd df tree:
# osd df tree
ID  CLASS WEIGHTREWEIGHT SIZERAW USE DATAOMAPMETA AVAIL   
%USE VAR  PGS STATUS TYPE NAME   
 -1   167.15057- 167 TiB 4.7 TiB 1.2 TiB 952 MiB   57 GiB 162 TiB 
2.79 1.00   -root PRZ
-1772.43192-  72 TiB 2.0 TiB 535 GiB 393 MiB   25 GiB  70 TiB 
2.78 1.00   -rack 1-eins 
 -922.28674-  22 TiB 640 GiB 170 GiB  82 MiB  9.0 GiB  22 TiB 
2.80 1.01   -host onode1 
  2   hdd   5.57169  1.0 5.6 TiB 162 GiB  45 GiB  11 MiB  2.3 GiB 5.4 TiB 
2.84 1.02 224 up osd.2   
  9   hdd   5.57169  1.0 5.6 TiB 156 GiB  39 GiB  19 MiB  2.1 GiB 5.4 TiB 
2.74 0.98 201 up osd.9   
 14   hdd   5.57169  1.0 5.6 TiB 162 GiB  44 GiB  24 MiB  2.1 GiB 5.4 TiB 
2.84 1.02 230 up osd.14  
 21   hdd   5.57169  1.0 5.6 TiB 160 GiB  42 GiB  27 MiB  2.5 GiB 5.4 TiB 
2.80 1.00 219 up osd.21  
-1322.28674-  22 TiB 640 GiB 170 GiB 123 MiB  8.9 GiB  22 TiB 
2.80 1.00   -host onode4 
  4   hdd   5.57169  1.0 5.6 TiB 156 GiB  39 GiB  38 MiB  2.2 GiB 5.4 TiB 
2.73 0.98 205 up osd.4   
 11   hdd   5.57169  1.0 5.6 TiB 164 GiB  47 GiB  24 MiB  2.0 GiB 5.4 TiB 
2.87 1.03 241 up osd.11  
 18   hdd   5.57169  1.0 5.6 TiB 159 GiB  42 GiB  31 MiB  2.5 GiB 5.4 TiB 
2.79 1.00 221 up osd.18  
 22   hdd   5.57169  1.0 5.6 TiB 160 GiB  43 GiB  29 MiB  2.1 GiB 5.4 TiB 
2.81 1.01 225 up osd.22  
 -527.85843-  28 TiB 782 GiB 195 GiB 188 MiB  6.9 GiB  27 TiB 
2.74 0.98   -host onode7 
  5   hdd   5.57169  1.0 5.6 TiB 158 GiB  41 GiB  26 MiB  1.2 GiB 5.4 TiB 
2.77 0.99 213 up osd.5   
 12   hdd   5.57169  1.0 5.6 TiB 159 GiB  42 GiB  31 MiB  993 MiB 5.4 TiB 
2.79 1.00 222 up osd.12  
 20   hdd   5.57169  1.0 5.6 TiB 157 GiB  40 GiB  47 MiB  1.2 GiB 5.4 TiB 
2.76 0.99 212 up osd.20  
 27   hdd   5.57169  1.0 5.6 TiB 151 GiB  33 GiB  28 MiB  1.9 GiB 5.4 TiB 
2.64 0.95 179 up osd.27  
 29   hdd   5.57169  1.0 5.6 TiB 156 GiB  39 GiB  56 MiB  1.7 GiB 5.4 TiB 
2.74 0.98 203 up osd.29  
-1844.57349-  45 TiB 1.3 TiB 341 GiB 248 MiB   14 GiB  43 TiB 
2.81 1.01   -rack 2-zwei 
 -722.28674-  22 TiB 641 GiB 171 GiB 132 MiB  6.7 GiB  22 TiB 
2.81 1.01   -host onode2 
  1   hdd   5.57169  1.0 5.6 TiB 155 GiB  38 GiB  35 MiB  1.2 GiB 5.4 TiB 
2.72 0.97 203 up osd.1   
  8   hdd   5.57169  1.0 5.6 TiB 163 GiB  46 GiB  36 MiB  2.4 GiB 5.4 TiB 
2.86 1.02 243 up osd.8   
 16   hdd   5.57169  1.0 5.6 TiB 161 GiB  43 GiB  24 MiB 1000 MiB 5.4 TiB 
2.82 1.01 221 up osd.16  
 23   hdd   5.57169  1.0 5.6 TiB 162 GiB  45 GiB  37 MiB  2.1 GiB 5.4 TiB 
2.84 1.02 228 up osd.23  
 -322.28674-  22 TiB 640 GiB 170 GiB 116 MiB  7.6 GiB  22 TiB 
2.80 1.00   -host onode5 
  3   hdd   5.57169  1.0 5.6 TiB 154 GiB  36 GiB  14 MiB 1010 MiB 5.4 TiB 
2.70 0.97 186 up osd.3   
  7   hdd   5.57169  1.0 5.6 TiB 161 GiB  44 GiB  22 MiB  2.2 GiB 5.4 TiB 
2.82 1.01 221 up osd.7   
 15   hdd   5.57169  1.0 5.6 TiB 165 GiB  48 GiB  26 MiB  2.3 GiB 5.4 TiB 
2.89 1.04 249 up osd.15  
 24   hdd   5.57169  1.0 5.6 TiB 160 GiB  42 GiB  54 MiB  2.1 GiB 5.4 TiB 
2.80 1.00 223 up osd.24  
-1950.14517-  50 TiB 1.4 TiB 376 GiB 311 MiB   18 GiB  49 TiB 
2.79 1.00   -rack 3-drei 
-1522.28674-  22 TiB 649 GiB 179 GiB 112 MiB  8.2 GiB  22 TiB 
2.84 1.02   -host onode3 
  0   hdd   5.57169  1.0 5.6 TiB 162 GiB  45 GiB  28 MiB  996 MiB 5.4 TiB 
2.84 1.02 229 up osd.0   
 10   hdd   5.57169  1.0 5.6 TiB 159 GiB  42 GiB  21 MiB  2.2 GiB 5.4 TiB 
2.79 1.00 213 up osd.10  
 17   hdd   5.57169  1.0 5.6 TiB 165 GiB  47 GiB  19 MiB  2.5 GiB 5.4 TiB 
2.88 1.03 238 up osd.17  
 25   hdd   5.57169  1.0 5.6 TiB 163 GiB  46 GiB  44 MiB  2.5 GiB 5.4 TiB 
2.86 1.03 242 up osd.25  
-1127.85843-  28 TiB 784 GiB 197 GiB 199 MiB  9.4 

[ceph-users] Reduced data availability: 2 pgs inactive

2019-06-19 Thread Lars Täuber
Hi there!

Recently I made our cluster rack aware
by adding racks to the crush map.
The failure domain was and still is "host".

rule cephfs2_data {
        id 7
        type erasure
        min_size 3
        max_size 6
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take PRZ
        step chooseleaf indep 0 type host
        step emit
}


Then I sorted the hosts into the new
rack buckets of the crush map as they
are in reality, by:
  # osd crush move onodeX rack=XYZ
for all hosts.

The cluster started to reorder the data.

In the end the cluster has now:
HEALTH_WARN 1 filesystem is degraded; Reduced data availability: 2 pgs 
inactive; Degraded data redundancy: 678/2371785 objects degraded (0.029%), 2 
pgs degraded, 2 pgs undersized
FS_DEGRADED 1 filesystem is degraded
fs cephfs_1 is degraded
PG_AVAILABILITY Reduced data availability: 2 pgs inactive
pg 21.2e4 is stuck inactive for 142792.952697, current state 
activating+undersized+degraded+remapped+forced_backfill, last acting 
[5,2147483647,25,28,11,2]
pg 23.5 is stuck inactive for 142791.437243, current state 
activating+undersized+degraded+remapped+forced_backfill, last acting [13,21]
PG_DEGRADED Degraded data redundancy: 678/2371785 objects degraded (0.029%), 2 
pgs degraded, 2 pgs undersized
pg 21.2e4 is stuck undersized for 142779.321192, current state 
activating+undersized+degraded+remapped+forced_backfill, last acting 
[5,2147483647,25,28,11,2]
pg 23.5 is stuck undersized for 142789.747915, current state 
activating+undersized+degraded+remapped+forced_backfill, last acting [13,21]

The cluster hosts a cephfs which is not mountable anymore.

I tried a few things (as you can see: forced_backfill), but failed.

The cephfs_data pool is EC 4+2. Both inactive pgs seem to have enough copies 
to recalculate the contents for all osds.

Is there a chance to get both pgs clean again?

How can I force the pgs to recalculate all necessary copies?


Thanks
Lars
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] inconsistent number of pools

2019-05-28 Thread Lars Täuber
Yes, thanks. This helped.
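(For the archive, "restart the active manager" was just this; the daemon name is 
from my setup at the time:)

# ceph mgr fail mon1                  # a standby mgr takes over
or
# systemctl restart ceph-mgr@mon1     # on the host of the active mgr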

Regards,
Lars

Tue, 28 May 2019 11:50:01 -0700
Gregory Farnum  ==> Lars Täuber  :
> You’re the second report I’ve seen if this, and while it’s confusing, you
> should be Abel to resolve it by restarting your active manager daemon.
> 
> On Sun, May 26, 2019 at 11:52 PM Lars Täuber  wrote:
> 
> > Fri, 24 May 2019 21:41:33 +0200
> > Michel Raabe  ==> Lars Täuber ,
> > ceph-users@lists.ceph.com :  
> > >
> > > You can also try
> > >
> > > $ rados lspools
> > > $ ceph osd pool ls
> > >
> > > and verify that with the pgs
> > >
> > > $ ceph pg ls --format=json-pretty | jq -r '.pg_stats[].pgid' | cut -d.
> > > -f1 | uniq
> > >  
> >
> > Yes, now I know but I still get this:
> > $ sudo ceph -s
> > […]
> >   data:
> > pools:   5 pools, 1153 pgs
> > […]
> >
> >
> > and with all other means I get:
> > $ sudo ceph osd lspools | wc -l
> > 3
> >
> > Which is what I expect, because all other pools are removed.
> > But since this has no bad side effects I can live with it.
> >
> > Cheers,
> > Lars
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >  


-- 
Informationstechnologie
Berlin-Brandenburgische Akademie der Wissenschaften
Jägerstraße 22-23  10117 Berlin
Tel.: +49 30 20370-352   http://www.bbaw.de
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] inconsistent number of pools

2019-05-27 Thread Lars Täuber
Fri, 24 May 2019 21:41:33 +0200
Michel Raabe  ==> Lars Täuber , 
ceph-users@lists.ceph.com :
> 
> You can also try
> 
> $ rados lspools
> $ ceph osd pool ls
> 
> and verify that with the pgs
> 
> $ ceph pg ls --format=json-pretty | jq -r '.pg_stats[].pgid' | cut -d. 
> -f1 | uniq
> 

Yes, now I know but I still get this:
$ sudo ceph -s
[…]
  data:
pools:   5 pools, 1153 pgs
[…]


and with all other means I get:
$ sudo ceph osd lspools | wc -l
3

Which is what I expect, because all other pools are removed.
But since this has no bad side effects I can live with it.

Cheers,
Lars
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] inconsistent number of pools

2019-05-20 Thread Lars Täuber
Mon, 20 May 2019 10:52:14 +
Eugen Block  ==> ceph-users@lists.ceph.com :
> Hi, have you tried 'ceph health detail'?
> 

No I hadn't. Thanks for the hint.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] inconsistent number of pools

2019-05-20 Thread Lars Täuber
Hi everybody,

with the status report I get a HEALTH_WARN I don't know how to get rid of.
It may be connected to recently removed pools.

# ceph -s
  cluster:
id: 6cba13d1-b814-489c-9aac-9c04aaf78720
health: HEALTH_WARN
1 pools have many more objects per pg than average
 
  services:
mon: 3 daemons, quorum mon1,mon2,mon3 (age 4h)
mgr: mon1(active, since 4h), standbys: cephsible, mon2, mon3
mds: cephfs_1:1 {0=mds3=up:active} 2 up:standby
osd: 30 osds: 30 up (since 2h), 30 in (since 7w)
 
  data:
pools:   5 pools, 1029 pgs
objects: 315.51k objects, 728 GiB
usage:   4.6 TiB used, 163 TiB / 167 TiB avail
pgs: 1029 active+clean
 

!!! but:
# ceph osd lspools | wc -l
3

The status says there are 5 pools but the listing says there are only 3.
How do I find out which pool is the reason for the health warning?

Thanks
Lars
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] pool migration for cephfs?

2019-05-15 Thread Lars Täuber
Hi,

is there a way to migrate a cephfs to a new data pool, like there is for rbd on 
nautilus?
https://ceph.com/geen-categorie/ceph-pool-migration/
https://ceph.com/geen-categorie/ceph-pool-migration/

Thanks
Lars
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Multi Mds Trim Log Slow

2019-05-06 Thread Lars Täuber
I restarted the mds process which was in the "up:stopping" state.
Since then the MDS is no longer behind on trimming.
All (sub)directories are accessible as normal again.
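(What I actually did, roughly; the fs and daemon names are from my setup as shown 
in the quoted status below:)

# ceph fs set cephfs_1 max_mds 1      # the reduction to a single active MDS
# systemctl restart ceph-mds@mds2     # mds2 was the one hanging in up:stopping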

It seems there are stability issues with snapshots in a multi-mds cephfs on 
nautilus.
This has already been suspected here:
http://docs.ceph.com/docs/nautilus/cephfs/experimental-features/#snapshots


Regards,
Lars


Fri, 3 May 2019 11:45:41 +0200
Lars Täuber  ==> ceph-users@lists.ceph.com :
> Hi,
> 
> I'm still new to ceph. Here are similar problems with CephFS.
> 
> ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc) nautilus 
> (stable)
> on Debian GNU/Linux buster/sid
> 
> # ceph health detail
> HEALTH_WARN 1 MDSs report slow requests; 1 MDSs behind on trimming
> MDS_SLOW_REQUEST 1 MDSs report slow requests
> mdsmds3(mds.0): 13 slow requests are blocked > 30 secs
> MDS_TRIM 1 MDSs behind on trimming
> mdsmds3(mds.0): Behind on trimming (33924/125) max_segments: 125, 
> num_segments: 33924
> 
> 
> 
> The workload is "doveadm backup" of more than 500 mail folders from a local 
> ext4 to a cephfs.
> * There are ~180'000 files with a strange file size distribution:
> 
> # NumSamples = 181056; MIN_SEEN = 377; MAX_SEEN = 584835624
> # Mean = 4477785.646005; Variance = 31526763457775.421875; SD = 5614869.852256
> # bucket (bytes)            count    share
>       377 -    262502      56652   31.29%
>    262502 -    524627       4891    2.70%
>    524627 -    786752       3498    1.93%
>    786752 -   1048878       2770    1.53%
>   1048878 -   1311003       2460    1.36%
>   1311003 -   1573128       2197    1.21%
>   1573128 -   1835253       2014    1.11%
>   1835253 -   2097378       1961    1.08%
>   2097378 -   2359503       2244    1.24%
>   2359503 -   2621628       1890    1.04%
>   2621628 -   2883754       1897    1.05%
>   2883754 -   3145879       2188    1.21%
>   3145879 -   3408004       2579    1.42%
>   3408004 -   3670129       3396    1.88%
>   3670129 -   3932254       5173    2.86%
>   3932254 -   4194379      24847   13.72%
>   4194379 -   4456505       1512    0.84%
>   4456505 -   4718630       1394    0.77%
>   4718630 -   4980755       1412    0.78%
>   4980755 - 584835624      56081   30.97%
> 
> * There are two snapshots of the main directory the mails are backed up to.
> * There are three subdirectories from which a simple ls does not return.
> * The cephfs is mounted using the kernel driver of Ubuntu 18.04.2 LTS kernel 
> 4.15.0-48-generic.
> * Same behaviour with ceph-fuse 'FUSE library version: 2.9.7' with the 
> difference that I can't interrupt the ls.
> 
> The reduction of the number of mds working for our cephfs to 1 made no 
> difference.
> The number of segments is still rising.
> # ceph -w
>   cluster:
> id: 6cba13d1-b814-489c-9aac-9c04aaf78720
> health: HEALTH_WARN
> 1 MDSs report slow requests
> 1 MDSs behind on trimming
>  
>   services:
> mon: 3 daemons, quorum mon1,mon2,mon3 (age 3d)
> mgr: cephsible(active, since 27h), standbys: mon3, mon1
> mds: cephfs_1:2 {0=mds3=up:active,1=mds2=up:stopping} 1 up:standby
> osd: 30 osds: 30 up (since 4w), 30 in (since 5w)
>  
>   data:
> pools:   5 pools, 393 pgs
> objects: 607.74k objects, 1.5 TiB
> usage:   6.9 TiB used, 160 TiB / 167 TiB avail
> pgs: 393 active+clean
>  
> 
> 2019-05-03 11:40:17.916193 mds.mds3 [WRN] 15 slow requests, 0 included below; 
> oldest blocked for > 342610.193367 secs
> 
> It seems the stopping of one out of two mds doesn't come to an end.
> 
> How to debug this?
> 
> Thanks in advance.
> Lars
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Multi Mds Trim Log Slow

2019-05-03 Thread Lars Täuber
Hi,

I'm still new to ceph. Here are similar problems with CephFS.

ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc) nautilus (stable)
on Debian GNU/Linux buster/sid

# ceph health detail
HEALTH_WARN 1 MDSs report slow requests; 1 MDSs behind on trimming
MDS_SLOW_REQUEST 1 MDSs report slow requests
mdsmds3(mds.0): 13 slow requests are blocked > 30 secs
MDS_TRIM 1 MDSs behind on trimming
mdsmds3(mds.0): Behind on trimming (33924/125) max_segments: 125, 
num_segments: 33924
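
(For context, the requests and counters behind these two warnings can be
inspected via the admin socket on the host running the active MDS — a sketch:)

# ceph daemon mds.mds3 dump_ops_in_flight   # the blocked requests and what they wait on
# ceph daemon mds.mds3 session ls           # connected clients and their caps
# ceph daemon mds.mds3 perf dump mds_log    # journal counters behind num_segments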



The workload is "doveadm backup" of more than 500 mail folders from a local 
ext4 to a cephfs.
* There are ~180'000 files with a strange file size distribution:

# NumSamples = 181056; MIN_SEEN = 377; MAX_SEEN = 584835624
# Mean = 4477785.646005; Variance = 31526763457775.421875; SD = 5614869.852256
# bucket (bytes)            count    share
      377 -    262502      56652   31.29%
   262502 -    524627       4891    2.70%
   524627 -    786752       3498    1.93%
   786752 -   1048878       2770    1.53%
  1048878 -   1311003       2460    1.36%
  1311003 -   1573128       2197    1.21%
  1573128 -   1835253       2014    1.11%
  1835253 -   2097378       1961    1.08%
  2097378 -   2359503       2244    1.24%
  2359503 -   2621628       1890    1.04%
  2621628 -   2883754       1897    1.05%
  2883754 -   3145879       2188    1.21%
  3145879 -   3408004       2579    1.42%
  3408004 -   3670129       3396    1.88%
  3670129 -   3932254       5173    2.86%
  3932254 -   4194379      24847   13.72%
  4194379 -   4456505       1512    0.84%
  4456505 -   4718630       1394    0.77%
  4718630 -   4980755       1412    0.78%
  4980755 - 584835624      56081   30.97%

* There are two snapshots of the main directory the mails are backed up to.
* There are three subdirectories from which a simple ls does not return.
* The cephfs is mounted using the kernel driver of Ubuntu 18.04.2 LTS kernel 
4.15.0-48-generic.
* Same behaviour with ceph-fuse 'FUSE library version: 2.9.7' with the 
difference that I can't interrupt the ls.

The reduction of the number of mds working for our cephfs to 1 made no 
difference.
The number of segments is still rising.
# ceph -w
  cluster:
id: 6cba13d1-b814-489c-9aac-9c04aaf78720
health: HEALTH_WARN
1 MDSs report slow requests
1 MDSs behind on trimming
 
  services:
mon: 3 daemons, quorum mon1,mon2,mon3 (age 3d)
mgr: cephsible(active, since 27h), standbys: mon3, mon1
mds: cephfs_1:2 {0=mds3=up:active,1=mds2=up:stopping} 1 up:standby
osd: 30 osds: 30 up (since 4w), 30 in (since 5w)
 
  data:
pools:   5 pools, 393 pgs
objects: 607.74k objects, 1.5 TiB
usage:   6.9 TiB used, 160 TiB / 167 TiB avail
pgs: 393 active+clean
 

2019-05-03 11:40:17.916193 mds.mds3 [WRN] 15 slow requests, 0 included below; 
oldest blocked for > 342610.193367 secs

It seems the stopping of one out of two mds doesn't come to an end.

How to debug this?

Thanks in advance.
Lars
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how to judge the results? - rados bench comparison

2019-04-17 Thread Lars Täuber
Wed, 17 Apr 2019 20:01:28 +0900
Christian Balzer  ==> Ceph Users  :
> On Wed, 17 Apr 2019 11:22:08 +0200 Lars Täuber wrote:
> 
> > Wed, 17 Apr 2019 10:47:32 +0200
> > Paul Emmerich  ==> Lars Täuber  :  
> > > The standard argument that it helps prevent recovery traffic from
> > > clogging the network and impacting client traffic is misleading:
> > 
> > What do you mean by "it"? I don't know the standard argument.
> > Do you mean separating the networks or do you mean having both together in 
> > one switched network?
> >   
> He means separated networks, obviously.
> 
> > > 
> > > * write client traffic relies on the backend network for replication
> > > operations: your client (write) traffic is impacted anyways if the
> > > backend network is full
> > 
> > This I understand as an argument for separating the networks and the 
> > backend network being faster than the frontend network.
> > So in case of reconstruction there should be some bandwidth left in the 
> > backend for the traffic that is used for the client IO.
> >   
> You need to run the numbers and look at the big picture.
> As mentioned already, this is all moot in your case.
> 
> 6 HDDs at realistically 150MB/s each, if they were all doing sequential
> I/O, which they aren't.
> But for the sake of argument let's say that one of your nodes can read
> (or write, not both at the same time) 900MB/s.
> That's still less than half of a single 25Gb/s link.

Is this really true also with the WAL device (combined with the DB device) 
which is a (fast) SSD in our setup?
reading: 2150 MB/s
writing: 2120 MB/s
4K IOPS reading/writing: 440k/320k

If so, the HW requirements for the next generation of OSD hosts will be adjusted.


> And that very hypothetical data rate (it's not sequential, you will have
> concurrent operations and thus seeks) is all your node can handle. If it
> is all going into recovery/rebalancing, your clients are starved because of
> that, not because of bandwidth exhaustion.

If this also holds with our SSD WAL, the HW requirements for the next
generation of OSD hosts will be adjusted.

Thanks
Lars
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how to judge the results? - rados bench comparison

2019-04-17 Thread Lars Täuber
Wed, 17 Apr 2019 10:47:32 +0200
Paul Emmerich  ==> Lars Täuber  :
> The standard argument that it helps prevent recovery traffic from
> clogging the network and impacting client traffic is misleading:

What do you mean by "it"? I don't know the standard argument.
Do you mean separating the networks or do you mean having both together in one 
switched network?

> 
> * write client traffic relies on the backend network for replication
> operations: your client (write) traffic is impacted anyways if the
> backend network is full

This I understand as an argument for separating the networks and the backend 
network being faster than the frontend network.
So in case of reconstruction there should be some bandwidth left in the backend 
for the traffic that is used for the client IO.


> * you are usually not limited by network speed for recovery (except
> for 1 gbit networks), and if you are you probably want to reduce
> recovery speed anyways if you would run into that limit
> 
> Paul
> 

Lars
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how to judge the results? - rados bench comparison

2019-04-17 Thread Lars Täuber
Wed, 17 Apr 2019 09:52:29 +0200
Stefan Kooman  ==> Lars Täuber  :
> Quoting Lars Täuber (taeu...@bbaw.de):
> > > I'd probably only use the 25G network for both networks instead of
> > > using both. Splitting the network usually doesn't help.  
> > 
> > This is something I was told to do, because a reconstruction of failed
> > OSDs/disks would have a heavy impact on the backend network.  
> 
> Opinions vary on running "public" only versus "public" / "backend".
> Having a separate "backend" network might lead to difficult to debug
> issues when the "public" network is working fine, but the "backend" is
> having issues and OSDs can't peer with each other, while the clients can
> talk to all OSDs. You will get slow requests and OSDs marking each other
> down while they are still running etc.

This I was not aware of.


> In your case with only 6 spinners max per server there is no way you
> will ever fill the network capacity of a 25 Gb/s network: 6 * 250 MB/s
> (for large spinners) should be just enough to fill a 10 Gb/s link. A
> redundant 25 Gb/s link would provide 50 Gb/s of bandwith, enough for
> both OSD replication traffic and client IO.
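
(Quick sanity check of those numbers: 6 * 250 MB/s = 1500 MB/s, i.e. roughly
12 Gbit/s — slightly more than a single 10 Gbit/s link can carry, and well
under the 50 Gbit/s of the bonded 2 x 25 Gbit/s links.)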

The reason for choosing the 25GBit network was a remark from someone that the
latency of this Ethernet is way below that of 10GBit. I never double-checked
this.


> 
> My 2 cents,
> 
> Gr. Stefan
> 

Cheers,
Lars
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how to judge the results? - rados bench comparison

2019-04-16 Thread Lars Täuber
Thanks Paul for the judgement.

Tue, 16 Apr 2019 10:13:03 +0200
Paul Emmerich  ==> Lars Täuber  :
> Seems in line with what I'd expect for the hardware.
> 
> Your hardware seems to be way overspecced, you'd be fine with half the
> RAM, half the CPU and way cheaper disks.

Do you mean all the components of the cluster or only the OSD-nodes?
Before making the requirements I had only read about mirroring clusters. I was
afraid of the CPUs being too slow to calculate the erasure codes we planned to
use.


> In fact, a good SATA 4kn disk can be faster than a SAS 512e disk.

This is a really good hint, because we just started to plan the extension.

> 
> I'd probably only use the 25G network for both networks instead of
> using both. Splitting the network usually doesn't help.

This is something I was told to do, because a reconstruction of failed
OSDs/disks would have a heavy impact on the backend network.


> 
> Paul
> 

Thanks again.

Lars
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] how to judge the results? - rados bench comparison

2019-04-08 Thread Lars Täuber
Hi there,

I'm new to ceph and just got my first cluster running.
Now I'd like to know whether the performance we get is what one should expect.

Is there a website with benchmark results somewhere that I could look at to
compare with our HW and our results?

These are the results:
rados bench single threaded:
# rados bench 10 write --rbd-cache=false -t 1

Object size:4194304
Bandwidth (MB/sec): 53.7186
Stddev Bandwidth:   3.86437
Max bandwidth (MB/sec): 60
Min bandwidth (MB/sec): 48
Average IOPS:   13
Stddev IOPS:0.966092
Average Latency(s): 0.0744599
Stddev Latency(s):  0.00911778

nearly maxing out one (idle) client with 28 threads
# rados bench 10 write --rbd-cache=false -t 28

Bandwidth (MB/sec): 850.451
Stddev Bandwidth:   40.6699
Max bandwidth (MB/sec): 904
Min bandwidth (MB/sec): 748
Average IOPS:   212
Stddev IOPS:10.1675
Average Latency(s): 0.131309
Stddev Latency(s):  0.0318489

four concurrent benchmarks on four clients each with 24 threads:
Bandwidth (MB/sec): 396 376 381 389
Stddev Bandwidth:   30  25  22  22
Max bandwidth (MB/sec): 440 420 416 428
Min bandwidth (MB/sec): 352 348 344 364
Average IOPS:   99  94  95  97
Stddev IOPS:7.5 6.3 5.6 5.6
Average Latency(s): 0.24    0.25    0.25    0.24
Stddev Latency(s):  0.12    0.15    0.15    0.14

summing up: write mode
~1500 MB/sec Bandwidth
~385 IOPS
~0.25s Latency

rand mode:
~3500 MB/sec
~920 IOPS
~0.154s Latency
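
(The rand numbers require objects written beforehand; a run along these lines
reproduces them — the pool name is only an example:)

# rados bench -p bench 60 write -t 28 --no-cleanup   # keep the objects for the read test
# rados bench -p bench 10 rand -t 28
# rados -p bench cleanup                             # remove the benchmark objects afterwards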



Maybe someone could judge our numbers. I am actually very satisfied with the 
values.

The (mostly idle) cluster is built from these components:
* 10GBit frontend network, bonding two connections to mon-, mds- and osd-nodes
** no bonding to clients
* 25GBit backend network, bonding two connections to osd-nodes


cluster:
* 3x mon, 2x Intel(R) Xeon(R) Bronze 3104 CPU @ 1.70GHz, 64GB RAM
* 3x mds, 1x Intel(R) Xeon(R) Gold 5115 CPU @ 2.40GHz, 128GB RAM
* 7x OSD-nodes, 2x Intel(R) Xeon(R) Silver 4112 CPU @ 2.60GHz, 96GB RAM
** 4x 6TB SAS HDD HGST HUS726T6TAL5204 (5x on two nodes, max. 6x per chassis 
for later growth)
** 2x 800GB SAS SSD WDC WUSTM3280ASS200 => SW-RAID1 => LVM ~116 GiB per OSD for 
DB and WAL

erasure encoded pool: (made for CephFS)
* plugin=clay k=5 m=2 d=6 crush-failure-domain=host
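
(For reference, roughly how such a pool is set up — the profile and pool names
here are placeholders, not necessarily the ones we used:)

# ceph osd erasure-code-profile set clay_k5m2 plugin=clay k=5 m=2 d=6 crush-failure-domain=host
# ceph osd pool create cephfs_data_ec 512 512 erasure clay_k5m2
# ceph osd pool set cephfs_data_ec allow_ec_overwrites true   # needed for a CephFS data pool on EC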

Thanks and best regards
Lars
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] typo in news for PG auto-scaler

2019-04-05 Thread Lars Täuber
Hi everybody!

There is a small mistake in the news about the PG autoscaler
https://ceph.com/rados/new-in-nautilus-pg-merging-and-autotuning/

The command
 $ ceph osd pool set foo target_ratio .8
should actually be
 $ ceph osd pool set foo target_size_ratio .8
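
For context, a sketch of how the corrected command fits in on Nautilus (pool
name 'foo' as in the article):

 $ ceph mgr module enable pg_autoscaler
 $ ceph osd pool set foo pg_autoscale_mode on
 $ ceph osd pool set foo target_size_ratio .8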


Thanks for this great improvement!

Cheers,
Lars
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Support for buster with nautilus?

2019-03-18 Thread Lars Täuber
Hi there!

I just started to install a ceph cluster.
I'd like to take the nautilus release.
Because of hardware restrictions (network driver modules) I had to take the 
buster release of Debian.

Will there be buster packages of nautilus available after the release?

Thanks for this great storage!

Cheers,
Lars
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] luminous/bluestore osd memory requirements

2017-08-14 Thread Lars Täuber
Hi there,

can someone share her/his experiences regarding this question? Maybe 
differentiated according to the different available algorithms?

Sat, 12 Aug 2017 14:40:05 +0200
Stijn De Weirdt  ==> Gregory Farnum 
, Mark Nelson , 
"ceph-users@lists.ceph.com"  :
> also any indication how much more cpu EC uses (10%,
> 100%, ...)?


I would be interested in the hardware recommendations for the newly introduced 
ceph-mgr daemon also. The big search engines don't tell me anything about this 
yet.

Thanks in advance
Lars
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com