Re: [ceph-users] 3 OSDs stopped and unable to restart

2019-07-11 Thread Brett Chancellor
I did try to run sudo ceph-bluestore-tool --out-dir /mnt/ceph
bluefs-export, but it died after writing out 93 GB and filling up my root
partition.
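
(For reference, the export needs a target filesystem large enough to hold the
OSD's entire BlueFS/RocksDB contents, so checking free space first helps; a
minimal sketch, with the OSD id and paths as placeholders:

df -h /mnt/ceph    # should be a separate, large filesystem, not the root partition
sudo ceph-bluestore-tool bluefs-export --path /var/lib/ceph/osd/ceph-<id> --out-dir /mnt/ceph
)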

On Thu, Jul 11, 2019 at 3:32 PM Brett Chancellor 
wrote:

> We moved the .rgw.meta data pool over to SSDs to try to improve
> performance; during the backfill, SSDs began dying en masse. Log attached to
> this case:
> https://tracker.ceph.com/issues/40741
>
> Right now the SSDs won't come up with either allocator, and the cluster is
> pretty much dead.
>
> What are the consequences of deleting the .rgw.meta pool? Can it be
> recreated?
>
> On Wed, Jul 10, 2019 at 3:31 PM ifedo...@suse.de  wrote:
>
>> You might want to try manual rocksdb compaction using ceph-kvstore-tool..
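
(For reference, a minimal sketch of that compaction, assuming the OSD is stopped
and uses the default data path; the OSD id is a placeholder:

ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-<id> compact

This can take a while on a large RocksDB, and the OSD must stay down while it runs.)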
>>
>> Sent from my Huawei tablet
>>
>>
>>  Original Message 
>> Subject: Re: [ceph-users] 3 OSDs stopped and unable to restart
>> From: Brett Chancellor
>> To: Igor Fedotov
>> CC: Ceph Users
>>
>> Once backfilling finished, the cluster was super slow and most OSDs were
>> filled with heartbeat_map errors. When an OSD restarts, it causes a cascade
>> of other OSDs following suit and restarting, with logs like:
>>   -3> 2019-07-10 18:34:50.046 7f34abf5b700 -1 osd.69 1348581
>> get_health_metrics reporting 21 slow ops, oldest is
>> osd_op(client.115295041.0:17575966 15.c37fa482 15.c37fa482 (undecoded)
>> ack+ondisk+write+known_if_redirected e1348522)
>> -2> 2019-07-10 18:34:50.967 7f34acf5d700  1 heartbeat_map is_healthy
>> 'OSD::osd_op_tp thread 0x7f3493f2b700' had timed out after 90
>> -1> 2019-07-10 18:34:50.967 7f34acf5d700  1 heartbeat_map is_healthy
>> 'OSD::osd_op_tp thread 0x7f3493f2b700' had suicide timed out after 150
>>  0> 2019-07-10 18:34:51.025 7f3493f2b700 -1 *** Caught signal
>> (Aborted) **
>>  in thread 7f3493f2b700 thread_name:tp_osd_tp
>>
>>  ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus
>> (stable)
>>  1: (()+0xf5d0) [0x7f34b57c25d0]
>>  2: (pread64()+0x33) [0x7f34b57c1f63]
>>  3: (KernelDevice::read_random(unsigned long, unsigned long, char*,
>> bool)+0x238) [0x55bfdae5a448]
>>  4: (BlueFS::_read_random(BlueFS::FileReader*, unsigned long, unsigned
>> long, char*)+0xca) [0x55bfdae1271a]
>>  5: (BlueRocksRandomAccessFile::Read(unsigned long, unsigned long,
>> rocksdb::Slice*, char*) const+0x20) [0x55bfdae3b440]
>>  6: (rocksdb::RandomAccessFileReader::Read(unsigned long, unsigned long,
>> rocksdb::Slice*, char*) const+0x960) [0x55bfdb466ba0]
>>  7: (rocksdb::BlockFetcher::ReadBlockContents()+0x3e7) [0x55bfdb420c27]
>>  8: (()+0x11146a4) [0x55bfdb40d6a4]
>>  9:
>> (rocksdb::BlockBasedTable::MaybeLoadDataBlockToCache(rocksdb::FilePrefetchBuffer*,
>> rocksdb::BlockBasedTable::Rep*, rocksdb::ReadOptions const&,
>> rocksdb::BlockHandle const&, rocksdb::Slice,
>> rocksdb::BlockBasedTable::CachableEntry*, bool,
>> rocksdb::GetContext*)+0x2cc) [0x55bfdb40f63c]
>>  10: (rocksdb::DataBlockIter*
>> rocksdb::BlockBasedTable::NewDataBlockIterator(rocksdb::BlockBasedTable::Rep*,
>> rocksdb::ReadOptions const&, rocksdb::BlockHandle const&,
>> rocksdb::DataBlockIter*, bool, bool, bool, rocksdb::GetContext*,
>> rocksdb::Status, rocksdb::FilePrefetchBuffer*)+0x169) [0x55bfdb41cb29]
>>  11: (rocksdb::BlockBasedTableIterator> rocksdb::Slice>::InitDataBlock()+0xc8) [0x55bfdb41e588]
>>  12: (rocksdb::BlockBasedTableIterator> rocksdb::Slice>::FindKeyForward()+0x8d) [0x55bfdb41e89d]
>>  13: (()+0x10adde9) [0x55bfdb3a6de9]
>>  14: (rocksdb::MergingIterator::Next()+0x44) [0x55bfdb4357c4]
>>  15: (rocksdb::DBIter::FindNextUserEntryInternal(bool, bool)+0x762)
>> [0x55bfdb32a092]
>>  16: (rocksdb::DBIter::Next()+0x1d6) [0x55bfdb32b6c6]
>>  17: (RocksDBStore::RocksDBWholeSpaceIteratorImpl::next()+0x2d)
>> [0x55bfdad9fa8d]
>>  18: (BlueStore::_collection_list(BlueStore::Collection*, ghobject_t
>> const&, ghobject_t const&, int, std::vector> std::allocator >*, ghobject_t*)+0xdf6) [0x55bfdad12466]
>>  19:
>> (BlueStore::collection_list(boost::intrusive_ptr&,
>> ghobject_t const&, ghobject_t const&, int, std::vector> std::allocator >*, ghobject_t*)+0x9b) [0x55bfdad1393b]
>>  20: (PG::_delete_some(ObjectStore::Transaction*)+0x1e0) [0x55bfda984120]
>>  21: (PG::RecoveryState::Deleting::react(PG::DeleteSome const&)+0x38)
>> [0x55bfda985598]
>>  22: (boost::statechart::simple_state> PG::RecoveryState::ToDelete, boost::mpl::list> mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
>> mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
>> mpl_::na, mpl_::na, mpl_::na>,
>> (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base
>> const&, void const*)+0x16a) [0x55bfda9c45ca]
>>  23:
>> (boost::statechart::state_machine> PG::RecoveryState::Initial, std::allocator,
>> boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base
>> const&)+0x5a) [0x55bfda9a20ca]
>>  24: (PG::do_peering_event(std::shared_ptr,
>> PG::RecoveryCtx*)+0x119) [0x55bfda991389]
>>  25: 

Re: [ceph-users] "session established", "io error", "session lost, hunting for new mon" solution/fix

2019-07-11 Thread Marc Roos
 

Anyone know why I would get these? Is it not strange to get them in a 
'standard' setup?





-Original Message-
Subject: [ceph-users] "session established", "io error", "session lost, 
hunting for new mon" solution/fix


I have this on a CephFS client again (Luminous cluster, CentOS 7, only 32
OSDs!). Wanted to share the 'fix':

[Thu Jul 11 12:16:09 2019] libceph: mon0 192.168.10.111:6789 session 
established
[Thu Jul 11 12:16:09 2019] libceph: mon0 192.168.10.111:6789 io error
[Thu Jul 11 12:16:09 2019] libceph: mon0 192.168.10.111:6789 session 
lost, hunting for new mon
[Thu Jul 11 12:16:09 2019] libceph: mon2 192.168.10.113:6789 session 
established
[Thu Jul 11 12:16:09 2019] libceph: mon2 192.168.10.113:6789 io error
[Thu Jul 11 12:16:09 2019] libceph: mon2 192.168.10.113:6789 session 
lost, hunting for new mon
[Thu Jul 11 12:16:09 2019] libceph: mon0 192.168.10.111:6789 session 
established
[Thu Jul 11 12:16:09 2019] libceph: mon0 192.168.10.111:6789 io error
[Thu Jul 11 12:16:09 2019] libceph: mon0 192.168.10.111:6789 session 
lost, hunting for new mon
[Thu Jul 11 12:16:09 2019] libceph: mon1 192.168.10.112:6789 session 
established
[Thu Jul 11 12:16:09 2019] libceph: mon1 192.168.10.112:6789 io error
[Thu Jul 11 12:16:09 2019] libceph: mon1 192.168.10.112:6789 session 
lost, hunting for new mon

1) I blocked client access to the monitors with
iptables -I INPUT -p tcp -s 192.168.10.43 --dport 6789 -j REJECT
Resulting in 

[Thu Jul 11 12:34:16 2019] libceph: mon1 192.168.10.112:6789 socket 
closed (con state CONNECTING)
[Thu Jul 11 12:34:18 2019] libceph: mon1 192.168.10.112:6789 socket 
closed (con state CONNECTING)
[Thu Jul 11 12:34:22 2019] libceph: mon1 192.168.10.112:6789 socket 
closed (con state CONNECTING)
[Thu Jul 11 12:34:26 2019] libceph: mon2 192.168.10.113:6789 socket 
closed (con state CONNECTING)
[Thu Jul 11 12:34:27 2019] libceph: mon2 192.168.10.113:6789 socket 
closed (con state CONNECTING)
[Thu Jul 11 12:34:28 2019] libceph: mon2 192.168.10.113:6789 socket 
closed (con state CONNECTING)
[Thu Jul 11 12:34:30 2019] libceph: mon1 192.168.10.112:6789 socket 
closed (con state CONNECTING)
[Thu Jul 11 12:34:30 2019] libceph: mon2 192.168.10.113:6789 socket 
closed (con state CONNECTING)
[Thu Jul 11 12:34:34 2019] libceph: mon2 192.168.10.113:6789 socket 
closed (con state CONNECTING)
[Thu Jul 11 12:34:42 2019] libceph: mon2 192.168.10.113:6789 socket 
closed (con state CONNECTING)
[Thu Jul 11 12:34:44 2019] libceph: mon0 192.168.10.111:6789 socket 
closed (con state CONNECTING)
[Thu Jul 11 12:34:45 2019] libceph: mon0 192.168.10.111:6789 socket 
closed (con state CONNECTING)
[Thu Jul 11 12:34:46 2019] libceph: mon0 192.168.10.111:6789 socket 
closed (con state CONNECTING)

2) I applied the suggested changes to osd_map_message_max, mentioned in
earlier threads [0]:
ceph tell osd.* injectargs '--osd_map_message_max=10'
ceph tell mon.* injectargs '--osd_map_message_max=10'
[@c01 ~]# ceph daemon osd.0 config show|grep message_max
"osd_map_message_max": "10",
[@c01 ~]# ceph daemon mon.a config show|grep message_max
"osd_map_message_max": "10",

[0]
https://www.mail-archive.com/ceph-users@lists.ceph.com/msg54419.html
http://tracker.ceph.com/issues/38040

3) Allow access to a monitor with
iptables -D INPUT -p tcp -s 192.168.10.43 --dport 6789 -j REJECT

Getting 
[Thu Jul 11 12:39:26 2019] libceph: mon0 192.168.10.111:6789 session 
established
[Thu Jul 11 12:39:26 2019] libceph: osd0 down
[Thu Jul 11 12:39:26 2019] libceph: osd0 up

Problem solved; the unmount that was hung in D state was released.

I am not sure whether the solution was the prolonged disconnection from the
monitors, the osd_map_message_max=10, or both.





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




Re: [ceph-users] 3 OSDs stopped and unable to restart

2019-07-11 Thread Brett Chancellor
We moved the .rgw.meta data pool over to SSDs to try to improve
performance; during the backfill, SSDs began dying en masse. Log attached to
this case:
https://tracker.ceph.com/issues/40741

Right now the SSDs won't come up with either allocator, and the cluster is
pretty much dead.

What are the consequences of deleting the .rgw.meta pool? Can it be
recreated?

On Wed, Jul 10, 2019 at 3:31 PM ifedo...@suse.de  wrote:

> You might want to try manual rocksdb compaction using ceph-kvstore-tool..
>
> Sent from my Huawei tablet
>
>
>  Original Message 
> Subject: Re: [ceph-users] 3 OSDs stopped and unable to restart
> From: Brett Chancellor
> To: Igor Fedotov
> CC: Ceph Users
>
> Once backfilling finished, the cluster was super slow and most OSDs were
> filled with heartbeat_map errors. When an OSD restarts, it causes a cascade
> of other OSDs following suit and restarting, with logs like:
>   -3> 2019-07-10 18:34:50.046 7f34abf5b700 -1 osd.69 1348581
> get_health_metrics reporting 21 slow ops, oldest is
> osd_op(client.115295041.0:17575966 15.c37fa482 15.c37fa482 (undecoded)
> ack+ondisk+write+known_if_redirected e1348522)
> -2> 2019-07-10 18:34:50.967 7f34acf5d700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7f3493f2b700' had timed out after 90
> -1> 2019-07-10 18:34:50.967 7f34acf5d700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7f3493f2b700' had suicide timed out after 150
>  0> 2019-07-10 18:34:51.025 7f3493f2b700 -1 *** Caught signal
> (Aborted) **
>  in thread 7f3493f2b700 thread_name:tp_osd_tp
>
>  ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus
> (stable)
>  1: (()+0xf5d0) [0x7f34b57c25d0]
>  2: (pread64()+0x33) [0x7f34b57c1f63]
>  3: (KernelDevice::read_random(unsigned long, unsigned long, char*,
> bool)+0x238) [0x55bfdae5a448]
>  4: (BlueFS::_read_random(BlueFS::FileReader*, unsigned long, unsigned
> long, char*)+0xca) [0x55bfdae1271a]
>  5: (BlueRocksRandomAccessFile::Read(unsigned long, unsigned long,
> rocksdb::Slice*, char*) const+0x20) [0x55bfdae3b440]
>  6: (rocksdb::RandomAccessFileReader::Read(unsigned long, unsigned long,
> rocksdb::Slice*, char*) const+0x960) [0x55bfdb466ba0]
>  7: (rocksdb::BlockFetcher::ReadBlockContents()+0x3e7) [0x55bfdb420c27]
>  8: (()+0x11146a4) [0x55bfdb40d6a4]
>  9:
> (rocksdb::BlockBasedTable::MaybeLoadDataBlockToCache(rocksdb::FilePrefetchBuffer*,
> rocksdb::BlockBasedTable::Rep*, rocksdb::ReadOptions const&,
> rocksdb::BlockHandle const&, rocksdb::Slice,
> rocksdb::BlockBasedTable::CachableEntry*, bool,
> rocksdb::GetContext*)+0x2cc) [0x55bfdb40f63c]
>  10: (rocksdb::DataBlockIter*
> rocksdb::BlockBasedTable::NewDataBlockIterator(rocksdb::BlockBasedTable::Rep*,
> rocksdb::ReadOptions const&, rocksdb::BlockHandle const&,
> rocksdb::DataBlockIter*, bool, bool, bool, rocksdb::GetContext*,
> rocksdb::Status, rocksdb::FilePrefetchBuffer*)+0x169) [0x55bfdb41cb29]
>  11: (rocksdb::BlockBasedTableIterator rocksdb::Slice>::InitDataBlock()+0xc8) [0x55bfdb41e588]
>  12: (rocksdb::BlockBasedTableIterator rocksdb::Slice>::FindKeyForward()+0x8d) [0x55bfdb41e89d]
>  13: (()+0x10adde9) [0x55bfdb3a6de9]
>  14: (rocksdb::MergingIterator::Next()+0x44) [0x55bfdb4357c4]
>  15: (rocksdb::DBIter::FindNextUserEntryInternal(bool, bool)+0x762)
> [0x55bfdb32a092]
>  16: (rocksdb::DBIter::Next()+0x1d6) [0x55bfdb32b6c6]
>  17: (RocksDBStore::RocksDBWholeSpaceIteratorImpl::next()+0x2d)
> [0x55bfdad9fa8d]
>  18: (BlueStore::_collection_list(BlueStore::Collection*, ghobject_t
> const&, ghobject_t const&, int, std::vector std::allocator >*, ghobject_t*)+0xdf6) [0x55bfdad12466]
>  19:
> (BlueStore::collection_list(boost::intrusive_ptr&,
> ghobject_t const&, ghobject_t const&, int, std::vector std::allocator >*, ghobject_t*)+0x9b) [0x55bfdad1393b]
>  20: (PG::_delete_some(ObjectStore::Transaction*)+0x1e0) [0x55bfda984120]
>  21: (PG::RecoveryState::Deleting::react(PG::DeleteSome const&)+0x38)
> [0x55bfda985598]
>  22: (boost::statechart::simple_state PG::RecoveryState::ToDelete, boost::mpl::list mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
> mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
> mpl_::na, mpl_::na, mpl_::na>,
> (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base
> const&, void const*)+0x16a) [0x55bfda9c45ca]
>  23: (boost::statechart::state_machine PG::RecoveryState::Initial, std::allocator,
> boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base
> const&)+0x5a) [0x55bfda9a20ca]
>  24: (PG::do_peering_event(std::shared_ptr,
> PG::RecoveryCtx*)+0x119) [0x55bfda991389]
>  25: (OSD::dequeue_peering_evt(OSDShard*, PG*,
> std::shared_ptr, ThreadPool::TPHandle&)+0x1b4)
> [0x55bfda8cb3c4]
>  26: (OSD::dequeue_delete(OSDShard*, PG*, unsigned int,
> ThreadPool::TPHandle&)+0x234) [0x55bfda8cb804]
>  27: (OSD::ShardedOpWQ::_process(unsigned int,
> ceph::heartbeat_handle_d*)+0x9f4) [0x55bfda8bfb44]
>  28: 

Re: [ceph-users] P1 production down - 4 OSDs down will not start 14.2.1 nautilus

2019-07-11 Thread Edward Kalk
Production has been restored; it just took about 26 minutes for Linux to let me
execute the OSD start command this time. The longest yet.
sudo systemctl start ceph-osd@X
(Yes, this has happened to us about 4 times now.)
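
(If the wait is systemd's start-rate limiting kicking in after repeated crashes,
the counter can normally be cleared so the unit starts right away; a sketch,
with X as the OSD id:

sudo systemctl reset-failed ceph-osd@X
sudo systemctl start ceph-osd@X
)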
-Ed

> On Jul 11, 2019, at 11:38 AM, Edward Kalk  wrote:
> 
> Rebooted node 4; on nodes 1 and 2, 2 OSDs each crashed and will not start.
> 
> 
> The logs are similar; it seems to be the bug related to 38724.
> Tried to manually start the OSDs; that failed. It's been about 20 minutes with
> prod down.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] memory usage of: radosgw-admin bucket rm

2019-07-11 Thread Harald Staub

Created https://tracker.ceph.com/issues/40700 (sorry, forgot to mention).

On 11.07.19 16:41, Matt Benjamin wrote:

I don't think one has been created yet.  Eric Ivancich and Mark Kogan
of my team are investigating this behavior.

Matt

On Thu, Jul 11, 2019 at 10:40 AM Paul Emmerich  wrote:


Is there already a tracker issue?

I'm seeing the same problem here. Started deletion of a bucket with a few 
hundred million objects a week ago or so and I've now noticed that it's also 
leaking memory and probably going to crash.
Going to investigate this further...

Paul

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Tue, Jul 9, 2019 at 1:26 PM Matt Benjamin  wrote:


Hi Harald,

Please file a tracker issue, yes.  (Deletes do tend to be slower,
presumably due to rocksdb compaction.)

Matt

On Tue, Jul 9, 2019 at 7:12 AM Harald Staub  wrote:


Currently removing a bucket with a lot of objects:
radosgw-admin bucket rm --bucket=$BUCKET --bypass-gc --purge-objects

This process was killed by the out-of-memory killer. Then looking at the
graphs, we see a continuous increase of memory usage for this process,
about +24 GB per day. Removal rate is about 3 M objects per day.

It is not the fastest hardware, and this index pool is still without
SSDs. The bucket is sharded, 1024 shards. We are on Nautilus 14.2.1, now
about 500 OSDs.

So with this bucket with 60 M objects, we would need about 480 GB of RAM
to come through. Or is there a workaround? Should I open a tracker issue?

The killed remove command can just be called again, but it will be
killed again before it finishes. Also, it has to run some time until it
continues to actually remove objects. This "wait time" is also
increasing. Last time, after about 16 M objects already removed, the
wait time was nearly 9 hours. Also during this time, there is a memory
ramp, but not so steep.
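
(As an aside, a rough way to track progress between runs is to watch the
bucket's remaining object count, e.g.:

radosgw-admin bucket stats --bucket=$BUCKET | grep num_objects
)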

BTW, it feels strange that removing objects is about 3 times slower than
adding them.

   Harry
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





--

Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-821-5101
fax.  734-769-8938
cel.  734-216-5309
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com







Re: [ceph-users] memory usage of: radosgw-admin bucket rm [EXT]

2019-07-11 Thread Matthew Vernon

On 11/07/2019 15:40, Paul Emmerich wrote:

Is there already a tracker issue?

I'm seeing the same problem here. Started deletion of a bucket with a 
few hundred million objects a week ago or so and I've now noticed that 
it's also leaking memory and probably going to crash.

Going to investigate this further...


We had a bucket rm on a machine that OOM'd (and killed the relevant 
process), but I wasn't watching at the time to see if it was the thing 
eating all the RAM.


If someone's giving the bucket rm code some love, it'd be nice if
https://tracker.ceph.com/issues/40587 (and associated PR) got looked at 
- missing shadow objects shouldn't really cause a bucket rm to give up...


Regards,

Matthew


--
The Wellcome Sanger Institute is operated by Genome Research 
Limited, a charity registered in England with number 1021457 and a 
company registered in England with number 2742969, whose registered 
office is 215 Euston Road, London, NW1 2BE. 
___

ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] memory usage of: radosgw-admin bucket rm

2019-07-11 Thread Matt Benjamin
I don't think one has been created yet.  Eric Ivancich and Mark Kogan
of my team are investigating this behavior.

Matt

On Thu, Jul 11, 2019 at 10:40 AM Paul Emmerich  wrote:
>
> Is there already a tracker issue?
>
> I'm seeing the same problem here. Started deletion of a bucket with a few 
> hundred million objects a week ago or so and I've now noticed that it's also 
> leaking memory and probably going to crash.
> Going to investigate this further...
>
> Paul
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
>
>
> On Tue, Jul 9, 2019 at 1:26 PM Matt Benjamin  wrote:
>>
>> Hi Harald,
>>
>> Please file a tracker issue, yes.  (Deletes do tend to be slower,
>> presumably due to rocksdb compaction.)
>>
>> Matt
>>
>> On Tue, Jul 9, 2019 at 7:12 AM Harald Staub  wrote:
>> >
>> > Currently removing a bucket with a lot of objects:
>> > radosgw-admin bucket rm --bucket=$BUCKET --bypass-gc --purge-objects
>> >
>> > This process was killed by the out-of-memory killer. Then looking at the
>> > graphs, we see a continuous increase of memory usage for this process,
>> > about +24 GB per day. Removal rate is about 3 M objects per day.
>> >
>> > It is not the fastest hardware, and this index pool is still without
>> > SSDs. The bucket is sharded, 1024 shards. We are on Nautilus 14.2.1, now
>> > about 500 OSDs.
>> >
>> > So with this bucket with 60 M objects, we would need about 480 GB of RAM
>> > to come through. Or is there a workaround? Should I open a tracker issue?
>> >
>> > The killed remove command can just be called again, but it will be
>> > killed again before it finishes. Also, it has to run some time until it
>> > continues to actually remove objects. This "wait time" is also
>> > increasing. Last time, after about 16 M objects already removed, the
>> > wait time was nearly 9 hours. Also during this time, there is a memory
>> > ramp, but not so steep.
>> >
>> > BTW it feels strange that the removal of objects is slower (about 3
>> > times) than adding objects.
>> >
>> >   Harry
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >
>> >
>>
>>
>> --
>>
>> Matt Benjamin
>> Red Hat, Inc.
>> 315 West Huron Street, Suite 140A
>> Ann Arbor, Michigan 48103
>>
>> http://www.redhat.com/en/technologies/storage
>>
>> tel.  734-821-5101
>> fax.  734-769-8938
>> cel.  734-216-5309
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 

Matt Benjamin
Red Hat, Inc.
315 West Huron Street, Suite 140A
Ann Arbor, Michigan 48103

http://www.redhat.com/en/technologies/storage

tel.  734-821-5101
fax.  734-769-8938
cel.  734-216-5309
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] 14.2.1 Nautilus OSDs crash

2019-07-11 Thread Edward Kalk
http://tracker.ceph.com/issues/38724 
^ This bug seems to be related; I've added notes to it.
Triggers seem to be a node reboot, or removing or adding an OSD.

There seem to be backport duplicates for Mimic and Luminous:
Copied to RADOS - Backport #39692: mimic: _txc_add_transaction error (39) Directory not empty not handled on operation 21 (op 1, counting from 0) - New
Copied to RADOS - Backport #39693: nautilus: _txc_add_transaction error (39) Directory not empty not handled on operation 21 (op 1, counting from 0) - New
Copied to RADOS - Backport #39694: luminous: _txc_add_transaction error (39) Directory not empty not handled on operation 21 (op 1, counting from 0)

This may have an impact on production when multiple OSDs fail to start
repeatedly after hitting the bug: Linux stops the restarts due to too many
attempts. Our production VM becomes unresponsive for about 10 minutes, then the
OSD tries to start again and typically starts. Sometimes it does not, and we
wait another 10 minutes. I have had this happen and the prod VM crash.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] memory usage of: radosgw-admin bucket rm

2019-07-11 Thread Paul Emmerich
Is there already a tracker issue?

I'm seeing the same problem here. Started deletion of a bucket with a few
hundred million objects a week ago or so and I've now noticed that it's
also leaking memory and probably going to crash.
Going to investigate this further...

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Tue, Jul 9, 2019 at 1:26 PM Matt Benjamin  wrote:

> Hi Harald,
>
> Please file a tracker issue, yes.  (Deletes do tend to be slower,
> presumably due to rocksdb compaction.)
>
> Matt
>
> On Tue, Jul 9, 2019 at 7:12 AM Harald Staub 
> wrote:
> >
> > Currently removing a bucket with a lot of objects:
> > radosgw-admin bucket rm --bucket=$BUCKET --bypass-gc --purge-objects
> >
> > This process was killed by the out-of-memory killer. Then looking at the
> > graphs, we see a continuous increase of memory usage for this process,
> > about +24 GB per day. Removal rate is about 3 M objects per day.
> >
> > It is not the fastest hardware, and this index pool is still without
> > SSDs. The bucket is sharded, 1024 shards. We are on Nautilus 14.2.1, now
> > about 500 OSDs.
> >
> > So with this bucket with 60 M objects, we would need about 480 GB of RAM
> > to come through. Or is there a workaround? Should I open a tracker issue?
> >
> > The killed remove command can just be called again, but it will be
> > killed again before it finishes. Also, it has to run some time until it
> > continues to actually remove objects. This "wait time" is also
> > increasing. Last time, after about 16 M objects already removed, the
> > wait time was nearly 9 hours. Also during this time, there is a memory
> > ramp, but not so steep.
> >
> > BTW it feels strange that the removal of objects is slower (about 3
> > times) than adding objects.
> >
> >   Harry
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >
>
>
> --
>
> Matt Benjamin
> Red Hat, Inc.
> 315 West Huron Street, Suite 140A
> Ann Arbor, Michigan 48103
>
> http://www.redhat.com/en/technologies/storage
>
> tel.  734-821-5101
> fax.  734-769-8938
> cel.  734-216-5309
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW Beast crash 14.2.1

2019-07-11 Thread Casey Bodley



On 7/11/19 3:28 AM, EDH - Manuel Rios Fernandez wrote:


Hi Folks,

Last night RGW crashed for no apparent reason while using Beast as the frontend.

We worked around it by turning civetweb back on.

Should it be reported to the tracker?

Please do. It looks like this crashed during startup. Can you please 
include the rgw_frontends configuration?
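
(For reference, that setting lives in ceph.conf under the RGW client section; a
typical Beast line, with the port given only as an example, looks like:

[client.rgw.ceph-rgw03]
rgw_frontends = beast port=7480
)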



Regards

Manuel

Centos 7.6

Linux ceph-rgw03 3.10.0-957.21.3.el7.x86_64 #1 SMP Tue Jun 18 16:35:19 
UTC 2019 x86_64 x86_64 x86_64 GNU/Linux


fsid e1ee8086-7cce-43fd-a252-3d677af22428

last_changed 2019-06-17 22:35:18.946810

created 2018-04-17 01:37:27.768960

min_mon_release 14 (nautilus)

0: [v2:172.16.2.5:3300/0,v1:172.16.2.5:6789/0] mon.CEPH-MON01

1: [v2:172.16.2.11:3300/0,v1:172.16.2.11:6789/0] mon.CEPH002

2: [v2:172.16.2.12:3300/0,v1:172.16.2.12:6789/0] mon.CEPH003

3: [v2:172.16.2.10:3300/0,v1:172.16.2.10:6789/0] mon.CEPH001

   -18> 2019-07-11 09:05:01.995 7f8441aff700  4 set_mon_vals no 
callback set


   -17> 2019-07-11 09:05:01.995 7f845f6e47c0 10 monclient: _renew_subs

   -16> 2019-07-11 09:05:01.995 7f845f6e47c0 10 monclient: 
_send_mon_message to mon.CEPH003 at v2:172.16.2.12:3300/0


  -15> 2019-07-11 09:05:01.995 7f845f6e47c0  1 librados: init done

   -14> 2019-07-11 09:05:01.995 7f845f6e47c0  5 asok(0x55cd18bac000) 
register_command cr dump hook 0x55cd198247a8


   -13> 2019-07-11 09:05:01.996 7f8443302700  4 mgrc handle_mgr_map 
Got map version 774


   -12> 2019-07-11 09:05:01.996 7f8443302700  4 mgrc handle_mgr_map 
Active mgr is now [v2:172.16.2.10:6858/256331,v1:172.16.2.10:6859/256331]


   -11> 2019-07-11 09:05:01.996 7f8443302700  4 mgrc reconnect 
Starting new session with 
[v2:172.16.2.10:6858/256331,v1:172.16.2.10:6859/256331]


   -10> 2019-07-11 09:05:01.996 7f844c59d700 10 monclient: 
get_auth_request con 0x55cd19a62000 auth_method 0


    -9> 2019-07-11 09:05:01.997 7f844cd9e700 10 monclient: 
get_auth_request con 0x55cd19a62400 auth_method 0


    -8> 2019-07-11 09:05:01.997 7f844c59d700 10 monclient: 
get_auth_request con 0x55cd19a62800 auth_method 0


    -7> 2019-07-11 09:05:01.998 7f845f6e47c0  5 asok(0x55cd18bac000) 
register_command sync trace show hook 0x55cd19846c40


    -6> 2019-07-11 09:05:01.998 7f845f6e47c0  5 asok(0x55cd18bac000) 
register_command sync trace history hook 0x55cd19846c40


    -5> 2019-07-11 09:05:01.998 7f845f6e47c0  5 asok(0x55cd18bac000) 
register_command sync trace active hook 0x55cd19846c40


    -4> 2019-07-11 09:05:01.998 7f845f6e47c0  5 asok(0x55cd18bac000) 
register_command sync trace active_short hook 0x55cd19846c40


    -3> 2019-07-11 09:05:01.999 7f844d59f700 10 monclient: 
get_auth_request con 0x55cd19a62c00 auth_method 0


    -2> 2019-07-11 09:05:01.999 7f844cd9e700 10 monclient: 
get_auth_request con 0x55cd19a63000 auth_method 0


    -1> 2019-07-11 09:05:01.999 7f845f6e47c0  0 starting handler: beast

 0> 2019-07-11 09:05:02.001 7f845f6e47c0 -1 *** Caught signal 
(Aborted) **


in thread 7f845f6e47c0 thread_name:radosgw

ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) 
nautilus (stable)


1: (()+0xf5d0) [0x7f845293c5d0]

2: (gsignal()+0x37) [0x7f8451d77207]

3: (abort()+0x148) [0x7f8451d788f8]

4: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f84526867d5]

5: (()+0x5e746) [0x7f8452684746]

6: (()+0x5e773) [0x7f8452684773]

7: (()+0x5e993) [0x7f8452684993]

8: (void 
boost::throw_exception(boost::system::system_error 
const&)+0x173) [0x55cd16d9f863]


9: (boost::asio::detail::do_throw_error(boost::system::error_code 
const&, char const*)+0x5b) [0x55cd16d9f91b]


10: (()+0x2837fc) [0x55cd16d8b7fc]

11: (main()+0x2873) [0x55cd16d2a8b3]

12: (__libc_start_main()+0xf5) [0x7f8451d633d5]

13: (()+0x24a877) [0x55cd16d52877]

NOTE: a copy of the executable, or `objdump -rdS ` is 
needed to interpret this.


--- logging levels ---

   0/ 5 none

   0/ 1 lockdep

   0/ 1 context

   1/ 1 crush

   1/ 5 mds

   1/ 5 mds_balancer

   1/ 5 mds_locker

   1/ 5 mds_log

   1/ 5 mds_log_expire

   1/ 5 mds_migrator

   0/ 1 buffer

   0/ 1 timer

   0/ 1 filer

   0/ 1 striper

   0/ 1 objecter

   0/ 5 rados

   0/ 5 rbd

   0/ 5 rbd_mirror

   0/ 5 rbd_replay

   0/ 5 journaler

   0/ 5 objectcacher

   0/ 5 client

   0/ 0 osd

   0/ 5 optracker

   0/ 5 objclass

   1/ 3 filestore

   0/ 0 journal

   0/ 0 ms

   1/ 5 mon

   0/10 monc

   1/ 5 paxos

   0/ 5 tp

   1/ 5 auth

   1/ 5 crypto

   1/ 1 finisher

   1/ 1 reserver

   1/ 5 heartbeatmap

   1/ 5 perfcounter

   1/ 1 rgw

   1/ 5 rgw_sync

   1/10 civetweb

   1/ 5 javaclient

   1/ 5 asok

   1/ 1 throttle

   0/ 0 refs

   1/ 5 xio

   1/ 5 compressor

   1/ 5 bluestore

   1/ 5 bluefs

   1/ 3 bdev

   1/ 5 kstore

   4/ 5 rocksdb

   4/ 5 leveldb

   4/ 5 memdb

   1/ 5 kinetic

   1/ 5 fuse

   1/ 5 mgr

   1/ 5 mgrc

   1/ 5 dpdk

   1/ 5 eventtrace

  -2/-2 (syslog threshold)

  -1/-1 (stderr threshold)

  max_recent 1

  max_new 1000

  log_file /var/log/ceph/ceph-client.rgw.ceph-rgw03.log

--- end dump of recent 

Re: [ceph-users] shutdown down all monitors

2019-07-11 Thread Nathan Fish
The monitors determine quorum, so stopping all monitors will
immediately stop IO to prevent split-brain. I would not recommend
shutting down all mons at once in production, though it *should* come
back up fine. If you really need to, shut them down in a certain
order, and bring them back up in the opposite order.
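
(A sketch of doing that with systemd-managed mons, assuming the mon id is the
short hostname:

# on each monitor host, in your chosen order
sudo systemctl stop ceph-mon@$(hostname -s)
# later, bring them back in the reverse order
sudo systemctl start ceph-mon@$(hostname -s)
)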

On Thu, Jul 11, 2019 at 5:42 AM Marc Roos  wrote:
>
>
>
> Can I temporary shutdown all my monitors? This only affects new
> connections not? Existing will still keep running?
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Libceph clock drift or I guess kernel clock drift issue

2019-07-11 Thread Marc Roos


I noticed that dmesg -T gives incorrect times; the messages have a
timestamp in the future compared to the system time. Not sure if this is a
libceph issue or a kernel issue.

[Thu Jul 11 10:41:22 2019] libceph: mon2 192.168.10.113:6789 session 
lost, hunting for new mon
[Thu Jul 11 10:41:22 2019] libceph: osd22 192.168.10.111:6811 io error
[Thu Jul 11 10:41:22 2019] libceph: mon1 192.168.10.112:6789 session 
established
[Thu Jul 11 10:41:22 2019] libceph: mon1 192.168.10.112:6789 io error
[Thu Jul 11 10:41:22 2019] libceph: mon1 192.168.10.112:6789 session 
lost, hunting for new mon
[Thu Jul 11 10:41:22 2019] libceph: mon0 192.168.10.111:6789 session 
established


[@ ]# uptime
 10:39:17 up 50 days, 13:31,  2 users,  load average: 3.60, 3.02, 2.57


[@~]# uname -a
Linux c01 3.10.0-957.12.2.el7.x86_64 #1 SMP Tue May 14 21:24:32 UTC 2019 
x86_64 x86_64 x86_64 GNU/Linux
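
(Side note: dmesg -T reconstructs wall-clock times by adding the kernel's raw
monotonic timestamp to an estimated boot time, so clock steps, NTP adjustments
or suspend make it drift; a quick comparison, assuming util-linux dmesg:

dmesg | tail -n 1       # raw timestamp in seconds since boot
cat /proc/uptime; date  # seconds since boot and current wall-clock time
)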
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] "session established", "io error", "session lost, hunting for new mon" solution/fix

2019-07-11 Thread Marc Roos


I have this on a CephFS client again (Luminous cluster, CentOS 7, only 32
OSDs!). Wanted to share the 'fix':

[Thu Jul 11 12:16:09 2019] libceph: mon0 192.168.10.111:6789 session 
established
[Thu Jul 11 12:16:09 2019] libceph: mon0 192.168.10.111:6789 io error
[Thu Jul 11 12:16:09 2019] libceph: mon0 192.168.10.111:6789 session 
lost, hunting for new mon
[Thu Jul 11 12:16:09 2019] libceph: mon2 192.168.10.113:6789 session 
established
[Thu Jul 11 12:16:09 2019] libceph: mon2 192.168.10.113:6789 io error
[Thu Jul 11 12:16:09 2019] libceph: mon2 192.168.10.113:6789 session 
lost, hunting for new mon
[Thu Jul 11 12:16:09 2019] libceph: mon0 192.168.10.111:6789 session 
established
[Thu Jul 11 12:16:09 2019] libceph: mon0 192.168.10.111:6789 io error
[Thu Jul 11 12:16:09 2019] libceph: mon0 192.168.10.111:6789 session 
lost, hunting for new mon
[Thu Jul 11 12:16:09 2019] libceph: mon1 192.168.10.112:6789 session 
established
[Thu Jul 11 12:16:09 2019] libceph: mon1 192.168.10.112:6789 io error
[Thu Jul 11 12:16:09 2019] libceph: mon1 192.168.10.112:6789 session 
lost, hunting for new mon

1) I blocked client access to the monitors with
iptables -I INPUT -p tcp -s 192.168.10.43 --dport 6789 -j REJECT
Resulting in 

[Thu Jul 11 12:34:16 2019] libceph: mon1 192.168.10.112:6789 socket 
closed (con state CONNECTING)
[Thu Jul 11 12:34:18 2019] libceph: mon1 192.168.10.112:6789 socket 
closed (con state CONNECTING)
[Thu Jul 11 12:34:22 2019] libceph: mon1 192.168.10.112:6789 socket 
closed (con state CONNECTING)
[Thu Jul 11 12:34:26 2019] libceph: mon2 192.168.10.113:6789 socket 
closed (con state CONNECTING)
[Thu Jul 11 12:34:27 2019] libceph: mon2 192.168.10.113:6789 socket 
closed (con state CONNECTING)
[Thu Jul 11 12:34:28 2019] libceph: mon2 192.168.10.113:6789 socket 
closed (con state CONNECTING)
[Thu Jul 11 12:34:30 2019] libceph: mon1 192.168.10.112:6789 socket 
closed (con state CONNECTING)
[Thu Jul 11 12:34:30 2019] libceph: mon2 192.168.10.113:6789 socket 
closed (con state CONNECTING)
[Thu Jul 11 12:34:34 2019] libceph: mon2 192.168.10.113:6789 socket 
closed (con state CONNECTING)
[Thu Jul 11 12:34:42 2019] libceph: mon2 192.168.10.113:6789 socket 
closed (con state CONNECTING)
[Thu Jul 11 12:34:44 2019] libceph: mon0 192.168.10.111:6789 socket 
closed (con state CONNECTING)
[Thu Jul 11 12:34:45 2019] libceph: mon0 192.168.10.111:6789 socket 
closed (con state CONNECTING)
[Thu Jul 11 12:34:46 2019] libceph: mon0 192.168.10.111:6789 socket 
closed (con state CONNECTING)

2) I applied the suggested changes to osd_map_message_max, mentioned in
earlier threads [0]:
ceph tell osd.* injectargs '--osd_map_message_max=10'
ceph tell mon.* injectargs '--osd_map_message_max=10'
[@c01 ~]# ceph daemon osd.0 config show|grep message_max
"osd_map_message_max": "10",
[@c01 ~]# ceph daemon mon.a config show|grep message_max
"osd_map_message_max": "10",

[0]
https://www.mail-archive.com/ceph-users@lists.ceph.com/msg54419.html
http://tracker.ceph.com/issues/38040

3) Allow access to a monitor with
iptables -D INPUT -p tcp -s 192.168.10.43 --dport 6789 -j REJECT

Getting 
[Thu Jul 11 12:39:26 2019] libceph: mon0 192.168.10.111:6789 session 
established
[Thu Jul 11 12:39:26 2019] libceph: osd0 down
[Thu Jul 11 12:39:26 2019] libceph: osd0 up

Problem solved; the unmount that was hung in D state was released.

I am not sure whether the solution was the prolonged disconnection from the
monitors, the osd_map_message_max=10, or both.





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What's the best practice for Erasure Coding

2019-07-11 Thread Frank Schilder
Oh dear. Every occurrence of stripe_* is wrong :)

It should be stripe_count (option --stripe-count in rbd create) everywhere in 
my text.

What choices are legal depends on the restrictions on stripe_count*stripe_unit 
(=stripe_size=stripe_width?) imposed by ceph. I believe all of this ends up 
being powers of 2.
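
(For illustration only, with made-up pool and image names, the striping
parameters under discussion are set at image creation, e.g.

rbd create rbd/testimg --size 100G --data-pool ec_data \
    --object-size 8M --stripe-unit 4M --stripe-count 2
)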

Yes, the 6+2 is a bit surprising. I have no explanation for the observation. It 
just seems a good argument for "do not trust what you believe, gather facts". 
And to try things that seem non-obvious - just to be sure.

Best regards,

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: ceph-users  on behalf of Lars 
Marowsky-Bree 
Sent: 11 July 2019 12:17:37
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] What's the best practice for Erasure Coding

On 2019-07-11T09:46:47, Frank Schilder  wrote:

> Striping with stripe units other than 1 is something I also tested. I found 
> that with EC pools non-trivial striping should be avoided. Firstly, EC is 
> already a striped format and, secondly, striping on top of that with 
> stripe_unit>1 will make every write an ec_overwrite, because now shards are 
> rarely if ever written as a whole.

That's why I said that rbd's stripe_unit should match the EC pool's
stripe_width, or be a 2^n multiple of it. (Not sure what stripe_count
should be set to, probably also a small power of two.)

> The native striping in EC pools comes from k, data is striped over k disks. 
> The higher k the more throughput at the expense of cpu and network.

Increasing k also increases stripe_width though; this leads to more IO
suffering from the ec_overwrite penalty.

> In my long list, this should actually be point
>
> 6) Use stripe_unit=1 (default).

You mean stripe-count?

> To get back to your question, this is another argument for k=power-of-two. 
> Object sizes in ceph are always powers of 2 and stripe sizes contain k as a 
> factor. Hence, any prime factor other than 2 in k will imply a mismatch. How 
> badly a mismatch affects performance should be tested.

Yes, of course. Depending on the IO pattern, this means more IO will be
misaligned or have non-stripe_width portions. (Most IO patterns, if they
strive for alignment, aim for a power of two alignment, obviously.)

> Results with non-trivial striping (stripe_size>1) were so poor, I did not 
> even include them in my report.

stripe_size?

> We use the 8+2 pool for ceph fs, where throughput is important. The 6+2 pool 
> is used for VMs (RBD images), where IOP/s are more important. It also offers 
> a higher redundancy level. Its an acceptable compromise for us.

Especially with RBDs, I'm surprised that k=6 works well for you. Block
device IO is most commonly aligned on power-of-two boundaries.


Regards,
Lars

--
SUSE Linux GmbH, GF: Felix Imendörffer, Mary Higgins, Sri Rasiah, HRB 21284 (AG 
Nürnberg)
"Architects should open possibilities and not determine everything." (Ueli 
Zbinden)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] writable snapshots in cephfs? GDPR/DSGVO

2019-07-11 Thread Lars Täuber
Thu, 11 Jul 2019 10:24:16 +0200
"Marc Roos"  ==> ceph-users 
, lmb  :
> What about creating snaps on a 'lower level' in the directory structure 
> so you do not need to remove files from a snapshot as a work around?

Thanks for the idea. This might be a solution for our use case.

Regards,
Lars
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] writable snapshots in cephfs? GDPR/DSGVO

2019-07-11 Thread Lars Täuber
Thu, 11 Jul 2019 10:21:16 +0200
Lars Marowsky-Bree  ==> ceph-users@lists.ceph.com :
> On 2019-07-10T09:59:08, Lars Täuber   wrote:
> 
> > Hi everbody!
> > 
> > Is it possible to make snapshots in cephfs writable?
> > We need to remove files because of this General Data Protection Regulation 
> > also from snapshots.  
> 
> Removing data from existing WORM storage is tricky, snapshots being a
> specific form thereof.

We would like it to be non-WORM storage. It is not meant to be used as an archive.

Thanks,
Lars

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What's the best practice for Erasure Coding

2019-07-11 Thread Lars Marowsky-Bree
On 2019-07-11T09:46:47, Frank Schilder  wrote:

> Striping with stripe units other than 1 is something I also tested. I found 
> that with EC pools non-trivial striping should be avoided. Firstly, EC is 
> already a striped format and, secondly, striping on top of that with 
> stripe_unit>1 will make every write an ec_overwrite, because now shards are 
> rarely if ever written as a whole.

That's why I said that rbd's stripe_unit should match the EC pool's
stripe_width, or be a 2^n multiple of it. (Not sure what stripe_count
should be set to, probably also a small power of two.)

> The native striping in EC pools comes from k, data is striped over k disks. 
> The higher k the more throughput at the expense of cpu and network.

Increasing k also increases stripe_width though; this leads to more IO
suffering from the ec_overwrite penalty.

> In my long list, this should actually be point
> 
> 6) Use stripe_unit=1 (default).

You mean stripe-count?

> To get back to your question, this is another argument for k=power-of-two. 
> Object sizes in ceph are always powers of 2 and stripe sizes contain k as a 
> factor. Hence, any prime factor other than 2 in k will imply a mismatch. How 
> badly a mismatch affects performance should be tested.

Yes, of course. Depending on the IO pattern, this means more IO will be
misaligned or have non-stripe_width portions. (Most IO patterns, if they
strive for alignment, aim for a power of two alignment, obviously.)

> Results with non-trivial striping (stripe_size>1) were so poor, I did not 
> even include them in my report.

stripe_size?

> We use the 8+2 pool for ceph fs, where throughput is important. The 6+2 pool 
> is used for VMs (RBD images), where IOP/s are more important. It also offers 
> a higher redundancy level. Its an acceptable compromise for us.

Especially with RBDs, I'm surprised that k=6 works well for you. Block
device IO is most commonly aligned on power-of-two boundaries.


Regards,
Lars

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Mary Higgins, Sri Rasiah, HRB 21284 (AG 
Nürnberg)
"Architects should open possibilities and not determine everything." (Ueli 
Zbinden)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] shutdown down all monitors

2019-07-11 Thread Wido den Hollander



On 7/11/19 11:42 AM, Marc Roos wrote:
> 
> 
> Can I temporary shutdown all my monitors? This only affects new 
> connections not? Existing will still keep running?
> 

You can, but it will completely shut down your whole Ceph cluster.

All I/O will pause until the MONs are back and have reached quorum.

Wido

> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What's the best practice for Erasure Coding

2019-07-11 Thread Frank Schilder
Striping with stripe units other than 1 is something I also tested. I found 
that with EC pools non-trivial striping should be avoided. Firstly, EC is 
already a striped format and, secondly, striping on top of that with 
stripe_unit>1 will make every write an ec_overwrite, because now shards are 
rarely if ever written as a whole.

The native striping in EC pools comes from k: data is striped over k disks. The
higher k, the more throughput, at the expense of CPU and network.

In my long list, this should actually be point

6) Use stripe_unit=1 (default).

To get back to your question, this is another argument for k=power-of-two. 
Object sizes in ceph are always powers of 2 and stripe sizes contain k as a 
factor. Hence, any prime factor other than 2 in k will imply a mismatch. How 
badly a mismatch affects performance should be tested.

Example: on our 6+2 EC pool I have stripe_width  24576, which has 3 as a 
factor. The 3 comes from k=6=3*2 and will always be there. This implies a 
misalignment and some writes will have to be split/padded in the middle. This 
does not happen too often per object, so 6+2 performance is good, but not as 
good as 8+2 performance.
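
(For reference: stripe_width = k * stripe_unit, and with the default EC
stripe_unit of 4096 bytes that gives 6 * 4096 = 24576 for 6+2 and
8 * 4096 = 32768 for 8+2, which is where the factor 3 in 24576 comes from.)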

Some numbers:

1) rbd object size 8 MB, 4 servers writing with 1 process each (= 4 workers):
EC profile   4K random write     sequential write, 8M write size
             IOP/s aggregated    MB/s aggregated
 5+2          802.30              1156.05
 6+2         1188.26              1873.67
 8+2         1210.27              2510.78
10+4          421.80               681.22

2) rbd object size 8 MB, 4 servers writing with 4 processes each (= 16 workers):
EC profile   4K random write     sequential write, 8M write size
             IOP/s aggregated    MB/s aggregated
 6+2         1384.43             3139.14
 8+2         1343.34             4069.27

The EC-profiles with factor 5 are so bad that I didn't repeat the multi-process 
tests (2) with these. I had limited time and went for the discard-early 
strategy to find suitable parameters.

The 25% smaller throughput (6+2 vs 8+2) in test (2) is probably due to the fact 
that data is striped over 6 instead of 8 disks. There might be some impact of 
the factor 3 somewhere as well, but it seems negligible in the scenario I 
tested.

Results with non-trivial striping (stripe_size>1) were so poor, I did not even 
include them in my report.

We use the 8+2 pool for CephFS, where throughput is important. The 6+2 pool is
used for VMs (RBD images), where IOP/s are more important. It also offers a
higher redundancy level. It's an acceptable compromise for us.

Note that numbers will vary depending on hardware, OSD config, kernel 
parameters etc, etc. One needs to test what one has.

Best regards,

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: ceph-users  on behalf of Lars 
Marowsky-Bree 
Sent: 11 July 2019 10:14:04
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] What's the best practice for Erasure Coding

On 2019-07-09T07:27:28, Frank Schilder  wrote:

> Small addition:
>
> This result holds for rbd bench. It seems to imply good performance for 
> large-file IO on cephfs, since cephfs will split large files into many 
> objects of size object_size. Small-file IO is a different story.
>
> The formula should be N*alloc_size=object_size/k, where N is some integer;
> i.e., object_size/k should be an integer multiple of alloc_size.

If using rbd striping, I'd also assume that making rbd's stripe_unit be
equal to, or at least a multiple of, the stripe_width of the EC pool is
sensible.

(Similar for CephFS's layout.)
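
(As an illustrative aside, with made-up directory and mount point: a CephFS
directory layout can be aligned the same way via the layout vxattrs, e.g.

setfattr -n ceph.dir.layout.stripe_unit  -v 4194304 /mnt/cephfs/data
setfattr -n ceph.dir.layout.stripe_count -v 1 /mnt/cephfs/data
setfattr -n ceph.dir.layout.object_size  -v 4194304 /mnt/cephfs/data
getfattr -n ceph.dir.layout /mnt/cephfs/data
)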

Does this hold in your environment?


--
SUSE Linux GmbH, GF: Felix Imendörffer, Mary Higgins, Sri Rasiah, HRB 21284 (AG 
Nürnberg)
"Architects should open possibilities and not determine everything." (Ueli 
Zbinden)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] shutdown down all monitors

2019-07-11 Thread Marc Roos



Can I temporary shutdown all my monitors? This only affects new 
connections not? Existing will still keep running?



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous cephfs maybe not to stable as expected?

2019-07-11 Thread Marc Roos
 


I decided to restart osd.0; after that, the load from CephFS and on all OSD
nodes dropped. After this I still have, on the first server:


[@~]# cat 
/sys/kernel/debug/ceph/0f1701f5-453a-4a3b-928d-f652a2bbbcb0.client357431
0/osdc
REQUESTS 0 homeless 0
LINGER REQUESTS
BACKOFFS


[@~]# cat 
/sys/kernel/debug/ceph/0f1701f5-453a-4a3b-928d-f652a2bbbcb0.client358422
4/osdc
REQUESTS 2 homeless 0
317841  osd020.d6ec44c1 20.1[0,28,5]/0  [0,28,5]/0  
e65040  10001b44a70.0x40001c102023  read
317853  osd020.5956d31b 20.1b   [0,5,10]/0  [0,5,10]/0  
e65040  10001ad8962.0x40001c40731   read
LINGER REQUESTS
BACKOFFS

And dmesg -T keeps giving me these (again with wrong timestamps)

[Thu Jul 11 11:23:21 2019] libceph: mon1 192.168.10.112:6789 session 
established
[Thu Jul 11 11:23:21 2019] libceph: mon1 192.168.10.112:6789 io error
[Thu Jul 11 11:23:21 2019] libceph: mon1 192.168.10.112:6789 session 
lost, hunting for new mon
[Thu Jul 11 11:23:21 2019] libceph: mon0 192.168.10.111:6789 session 
established
[Thu Jul 11 11:23:21 2019] libceph: mon0 192.168.10.111:6789 io error
[Thu Jul 11 11:23:21 2019] libceph: mon0 192.168.10.111:6789 session 
lost, hunting for new mon
[Thu Jul 11 11:23:21 2019] libceph: mon2 192.168.10.113:6789 session 
established
[Thu Jul 11 11:23:21 2019] libceph: mon2 192.168.10.113:6789 io error
[Thu Jul 11 11:23:21 2019] libceph: mon2 192.168.10.113:6789 session 
lost, hunting for new mon
[Thu Jul 11 11:23:21 2019] libceph: mon1 192.168.10.112:6789 session 
established
[Thu Jul 11 11:23:21 2019] libceph: mon1 192.168.10.112:6789 io error
[Thu Jul 11 11:23:21 2019] libceph: mon1 192.168.10.112:6789 session 
lost, hunting for new mon
[Thu Jul 11 11:23:21 2019] libceph: mon0 192.168.10.111:6789 session 
established
[Thu Jul 11 11:23:21 2019] libceph: mon0 192.168.10.111:6789 io error
[Thu Jul 11 11:23:21 2019] libceph: mon0 192.168.10.111:6789 session 
lost, hunting for new mon

What to do now? Restarting the monitor did not help.


-Original Message-
Subject: Re: [ceph-users] Luminous cephfs maybe not to stable as 
expected?

 

Forgot to add these

[@ ~]# cat
/sys/kernel/debug/ceph/0f1701f5-453a-4a3b-928d-f652a2bbbcb0.client357431
0/osdc
REQUESTS 0 homeless 0
LINGER REQUESTS
BACKOFFS

[@~]# cat
/sys/kernel/debug/ceph/0f1701f5-453a-4a3b-928d-f652a2bbbcb0.client358422
4/osdc
REQUESTS 38 homeless 0
317841  osd020.d6ec44c1 20.1[0,28,5]/0  [0,28,5]/0  
e65040  10001b44a70.0x40001c101139  read
317853  osd020.5956d31b 20.1b   [0,5,10]/0  [0,5,10]/0  
e65040  10001ad8962.0x40001c39847   read
317835  osd320.ede889de 20.1e   [3,12,27]/3 [3,12,27]/3 
e65040  10001ad80f6.0x40001c87758   read
317838  osd320.7b730a4e 20.e[3,31,9]/3  [3,31,9]/3  
e65040  10001ad89d8.0x40001c83444   read
317844  osd320.feead84c 20.c[3,13,18]/3 [3,13,18]/3 
e65040  10001ad8733.0x40001c77267   read
317852  osd320.bd2658e  20.e[3,31,9]/3  [3,31,9]/3  
e65040  10001ad7e00.0x40001c39331   read
317830  osd420.922e6d04 20.4[4,16,27]/4 [4,16,27]/4 
e65040  10001ad80f2.0x40001c86326   read
317837  osd420.fe93d4ab 20.2b   [4,14,25]/4 [4,14,25]/4 
e65040  10001ad80fb.0x40001c78951   read
317839  osd420.d7af926b 20.2b   [4,14,25]/4 [4,14,25]/4 
e65040  10001ad80ee.0x40001c77556   read
317849  osd520.5fcb95c5 20.5[5,18,29]/5 [5,18,29]/5 
e65040  10001ad7f75.0x40001c61147   read
317857  osd520.28764e9a 20.1a   [5,7,28]/5  [5,7,28]/5  
e65040  10001ad8a10.0x40001c30369   read
317859  osd520.7bb79985 20.5[5,18,29]/5 [5,18,29]/5 
e65040  10001ad7fe8.0x40001c27942   read
317836  osd820.e7bf5cf4 20.34   [8,5,10]/8  [8,5,10]/8  
e65040  10001ad7d79.0x40001c133699  read
317842  osd820.abbb9df4 20.34   [8,5,10]/8  [8,5,10]/8  
e65040  10001d5903f.0x40001c125308  read
317850  osd820.ecd0034  20.34   [8,5,10]/8  [8,5,10]/8  
e65040  10001ad89b2.0x40001c68348   read
317854  osd820.cef50134 20.34   [8,5,10]/8  [8,5,10]/8  
e65040  10001ad8728.0x40001c57431   read
317861  osd820.3e859bb4 20.34   [8,5,10]/8  [8,5,10]/8  
e65040  10001ad8108.0x40001c50642   read
317847  osd920.fc9e9f43 20.3[9,29,17]/9 [9,29,17]/9 
e65040  10001ad8101.0x40001c88464   read
317848  osd920.d32b6ac3 20.3[9,29,17]/9 [9,29,17]/9 
e65040  10001ad8100.0x40001c85929   read
317862  osd11   

Re: [ceph-users] Luminous cephfs maybe not to stable as expected?

2019-07-11 Thread Marc Roos
 

Forgot to add these

[@ ~]# cat 
/sys/kernel/debug/ceph/0f1701f5-453a-4a3b-928d-f652a2bbbcb0.client357431
0/osdc
REQUESTS 0 homeless 0
LINGER REQUESTS
BACKOFFS

[@~]# cat 
/sys/kernel/debug/ceph/0f1701f5-453a-4a3b-928d-f652a2bbbcb0.client358422
4/osdc
REQUESTS 38 homeless 0
317841  osd020.d6ec44c1 20.1[0,28,5]/0  [0,28,5]/0  
e65040  10001b44a70.0x40001c101139  read
317853  osd020.5956d31b 20.1b   [0,5,10]/0  [0,5,10]/0  
e65040  10001ad8962.0x40001c39847   read
317835  osd320.ede889de 20.1e   [3,12,27]/3 [3,12,27]/3 
e65040  10001ad80f6.0x40001c87758   read
317838  osd320.7b730a4e 20.e[3,31,9]/3  [3,31,9]/3  
e65040  10001ad89d8.0x40001c83444   read
317844  osd320.feead84c 20.c[3,13,18]/3 [3,13,18]/3 
e65040  10001ad8733.0x40001c77267   read
317852  osd320.bd2658e  20.e[3,31,9]/3  [3,31,9]/3  
e65040  10001ad7e00.0x40001c39331   read
317830  osd420.922e6d04 20.4[4,16,27]/4 [4,16,27]/4 
e65040  10001ad80f2.0x40001c86326   read
317837  osd420.fe93d4ab 20.2b   [4,14,25]/4 [4,14,25]/4 
e65040  10001ad80fb.0x40001c78951   read
317839  osd420.d7af926b 20.2b   [4,14,25]/4 [4,14,25]/4 
e65040  10001ad80ee.0x40001c77556   read
317849  osd520.5fcb95c5 20.5[5,18,29]/5 [5,18,29]/5 
e65040  10001ad7f75.0x40001c61147   read
317857  osd520.28764e9a 20.1a   [5,7,28]/5  [5,7,28]/5  
e65040  10001ad8a10.0x40001c30369   read
317859  osd520.7bb79985 20.5[5,18,29]/5 [5,18,29]/5 
e65040  10001ad7fe8.0x40001c27942   read
317836  osd820.e7bf5cf4 20.34   [8,5,10]/8  [8,5,10]/8  
e65040  10001ad7d79.0x40001c133699  read
317842  osd820.abbb9df4 20.34   [8,5,10]/8  [8,5,10]/8  
e65040  10001d5903f.0x40001c125308  read
317850  osd820.ecd0034  20.34   [8,5,10]/8  [8,5,10]/8  
e65040  10001ad89b2.0x40001c68348   read
317854  osd820.cef50134 20.34   [8,5,10]/8  [8,5,10]/8  
e65040  10001ad8728.0x40001c57431   read
317861  osd820.3e859bb4 20.34   [8,5,10]/8  [8,5,10]/8  
e65040  10001ad8108.0x40001c50642   read
317847  osd920.fc9e9f43 20.3[9,29,17]/9 [9,29,17]/9 
e65040  10001ad8101.0x40001c88464   read
317848  osd920.d32b6ac3 20.3[9,29,17]/9 [9,29,17]/9 
e65040  10001ad8100.0x40001c85929   read
317862  osd11   20.ee6cc689 20.9[11,0,12]/11[11,0,12]/11
e65040  10001ad7d64.0x40001c40266   read
317843  osd12   20.a801f0e9 20.29   [12,26,8]/12[12,26,8]/12
e65040  10001ad7f07.0x40001c86610   read
317851  osd12   20.8bb48de9 20.29   [12,26,8]/12[12,26,8]/12
e65040  10001ad7e4f.0x40001c46746   read
317860  osd12   20.47815f36 20.36   [12,0,28]/12[12,0,28]/12
e65040  10001ad8035.0x40001c35249   read
317831  osd15   20.9e3acb53 20.13   [15,0,1]/15 [15,0,1]/15 
e65040  10001ad8978.0x40001c85329   read
317840  osd15   20.2a40efdf 20.1f   [15,4,17]/15[15,4,17]/15
e65040  10001ad7ef8.0x40001c76282   read
317846  osd15   20.8143f15f 20.1f   [15,4,17]/15[15,4,17]/15
e65040  10001ad89d1.0x40001c61297   read
317864  osd15   20.c889a49c 20.1c   [15,0,31]/15[15,0,31]/15
e65040  10001ad89fb.0x40001c24385   read
317832  osd18   20.f76227a  20.3a   [18,6,15]/18[18,6,15]/18
e65040  10001ad8020.0x40001c82852   read
317833  osd18   20.d8edab31 20.31   [18,29,14]/18   [18,29,14]/18   
e65040  10001ad8952.0x40001c82852   read
317858  osd18   20.8f69d231 20.31   [18,29,14]/18   [18,29,14]/18   
e65040  10001ad8176.0x40001c32400   read
317855  osd22   20.b3342c0f 20.f[22,18,31]/22   [22,18,31]/22   
e65040  10001ad8146.0x40001c51024   read
317863  osd23   20.cde0ce7b 20.3b   [23,1,6]/23 [23,1,6]/23 
e65040  10001ad856c.0x40001c34521   read
317865  osd23   20.702d2dfe 20.3e   [23,9,22]/23[23,9,22]/23
e65040  10001ad8a5e.0x40001c30664   read
317866  osd23   20.cb4a32fe 20.3e   [23,9,22]/23[23,9,22]/23
e65040  10001ad8575.0x40001c29683   read
317867  osd23   20.9a008910 20.10   [23,12,6]/23[23,12,6]/23
e65040  10001ad7d24.0x40001c29683   read
317834  osd25   20.6efd4911

[ceph-users] Luminous cephfs maybe not to stable as expected?

2019-07-11 Thread Marc Roos

Maybe this requires some attention. I have a default CentOS 7, Ceph Luminous
setup (maybe not the most recent kernel, though), i.e. no exotic kernels.

This is the 2nd or 3rd time that a VM has gone into a high load (151) and
stopped its services. I have two VMs, both mounting the same 2 CephFS
'shares'. After the last incident I unmounted the shares on the 2nd
server. (We are migrating to a new environment, so this 2nd server is not doing
anything.) Last time I thought this might be related to my work on the switch
from the stupid allocator to the bitmap allocator.

Anyway, yesterday I thought: let's mount the 2 shares on the 2nd server again
and see what happens. This morning the high load was back. AFAIK the 2nd
server only runs a cron job on the CephFS mounts, creating snapshots.

1) I still have increased load on the osd nodes, coming from cephfs. How 
can I see which client is doing this? I don't seem to get this from 
'ceph daemon mds.c session ls', but 'ceph osd pool stats | grep 
client -B 1' indicates it is cephfs. (A rough command sketch follows at 
the end of this list.)

2) ceph osd blacklist ls
No blacklist entries

3) The first server keeps generating messages like the ones below, while 
there is no issue with connectivity.

[Thu Jul 11 10:41:22 2019] libceph: mon0 192.168.10.111:6789 session 
lost, hunting for new mon
[Thu Jul 11 10:41:22 2019] libceph: mon1 192.168.10.112:6789 session 
established
[Thu Jul 11 10:41:22 2019] libceph: mon1 192.168.10.112:6789 io error
[Thu Jul 11 10:41:22 2019] libceph: mon1 192.168.10.112:6789 session 
lost, hunting for new mon
[Thu Jul 11 10:41:22 2019] libceph: mon0 192.168.10.111:6789 session 
established
[Thu Jul 11 10:41:22 2019] libceph: mon0 192.168.10.111:6789 io error
[Thu Jul 11 10:41:22 2019] libceph: mon0 192.168.10.111:6789 session 
lost, hunting for new mon
[Thu Jul 11 10:41:22 2019] libceph: mon2 192.168.10.113:6789 session 
established
[Thu Jul 11 10:41:22 2019] libceph: mon2 192.168.10.113:6789 io error
[Thu Jul 11 10:41:22 2019] libceph: mon2 192.168.10.113:6789 session 
lost, hunting for new mon
[Thu Jul 11 10:41:22 2019] libceph: mon0 192.168.10.111:6789 session 
established
[Thu Jul 11 10:41:22 2019] libceph: mon0 192.168.10.111:6789 io error
[Thu Jul 11 10:41:22 2019] libceph: mon0 192.168.10.111:6789 session 
lost, hunting for new mon
[Thu Jul 11 10:41:22 2019] libceph: mon2 192.168.10.113:6789 session 
established
[Thu Jul 11 10:41:22 2019] libceph: mon2 192.168.10.113:6789 io error
[Thu Jul 11 10:41:22 2019] libceph: mon2 192.168.10.113:6789 session 
lost, hunting for new mon
[Thu Jul 11 10:41:22 2019] libceph: osd25 192.168.10.114:6804 io error
[Thu Jul 11 10:41:22 2019] libceph: mon1 192.168.10.112:6789 session 
established
[Thu Jul 11 10:41:22 2019] libceph: mon1 192.168.10.112:6789 io error
[Thu Jul 11 10:41:22 2019] libceph: mon1 192.168.10.112:6789 session 
lost, hunting for new mon
[Thu Jul 11 10:41:22 2019] libceph: mon2 192.168.10.113:6789 session 
established
[Thu Jul 11 10:41:22 2019] libceph: mon2 192.168.10.113:6789 io error
[Thu Jul 11 10:41:22 2019] libceph: mon2 192.168.10.113:6789 session 
lost, hunting for new mon
[Thu Jul 11 10:41:22 2019] libceph: osd18 192.168.10.112:6802 io error
[Thu Jul 11 10:41:22 2019] libceph: mon1 192.168.10.112:6789 session 
established
[Thu Jul 11 10:41:22 2019] libceph: mon1 192.168.10.112:6789 io error
[Thu Jul 11 10:41:22 2019] libceph: mon1 192.168.10.112:6789 session 
lost, hunting for new mon
[Thu Jul 11 10:41:22 2019] libceph: mon2 192.168.10.113:6789 session 
established
[Thu Jul 11 10:41:22 2019] libceph: mon2 192.168.10.113:6789 io error
[Thu Jul 11 10:41:22 2019] libceph: mon2 192.168.10.113:6789 session 
lost, hunting for new mon
[Thu Jul 11 10:41:22 2019] libceph: osd22 192.168.10.111:6811 io error
[Thu Jul 11 10:41:22 2019] libceph: mon1 192.168.10.112:6789 session 
established
[Thu Jul 11 10:41:22 2019] libceph: mon1 192.168.10.112:6789 io error
[Thu Jul 11 10:41:22 2019] libceph: mon1 192.168.10.112:6789 session 
lost, hunting for new mon
[Thu Jul 11 10:41:22 2019] libceph: mon0 192.168.10.111:6789 session 
established

PS: dmesg -T gives me strange times; as you can see they are in the 
future, the os time is 2 min behind them (and the os time is the correct 
one, ntpd is in sync).
[@ ]# uptime
 10:39:17 up 50 days, 13:31,  2 users,  load average: 3.60, 3.02, 2.57

4) Unmounting the filesystem on the first server fails.

5) Evicting the cephfs sessions of the first server does not change the 
cephfs load on the osd nodes.

6) Unmounting all cephfs clients still leaves me with cephfs activity on 
the data pool and on the osd nodes.

[@c03 ~]# ceph daemon mds.c session ls
[] 

7) On the first server 
[@~]# ps -auxf| grep D
USER   PID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND
root      6716  3.0  0.0      0     0 ?        D    10:18   0:59  \_ [kworker/0:2]
root     20039  0.0  0.0 123520  1212 pts/0    D+   10:28   0:00  |   \_ umount /home/mail-archive/

[@ ~]# cat /proc/6716/stack
[] __wait_on_freeing_inode+0xb0/0xf0
[] find_inode+0x99/0xc0
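
(Sketch for question 1, not from the original thread: some admin-socket 
commands that might help pin down which cephfs client is generating the 
load. The mds name 'c' and osd id 25 are just the ones mentioned above; 
adjust to your own daemons.)

[@c03 ~]# ceph daemon mds.c session ls            # client ids, hostnames, caps held
[@c03 ~]# ceph daemon mds.c dump_ops_in_flight    # slow mds requests show the client.<id> they came from
[@c03 ~]# ceph daemon mds.c objecter_requests     # rados ops the mds itself has pending
[@ ]# ceph daemon osd.25 dump_ops_in_flight       # on an osd node: the client.<id> of each in-flight op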

Re: [ceph-users] writable snapshots in cephfs? GDPR/DSGVO

2019-07-11 Thread Marc Roos


What about creating snapshots at a 'lower level' in the directory structure, 
so you do not need to remove files from a snapshot, as a workaround?
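
(A minimal sketch of what I mean, with made-up paths: cephfs snapshots are 
per directory, created by mkdir in that directory's .snap folder, so you can 
take them below the level where the to-be-deleted data lives.)

[@ ]# mkdir /mnt/cephfs/archive/user42/.snap/daily-2019-07-11   # snapshot only this subtree
[@ ]# rmdir /mnt/cephfs/archive/user42/.snap/daily-2019-07-11   # later: drop the whole snapshot for that subtree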


-Original Message-
From: Lars Marowsky-Bree [mailto:l...@suse.com] 
Sent: donderdag 11 juli 2019 10:21
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] writable snapshots in cephfs? GDPR/DSGVO

On 2019-07-10T09:59:08, Lars Täuber   wrote:

> Hi everybody!
> 
> Is it possible to make snapshots in cephfs writable?
> We need to remove files, also from snapshots, because of the General 
> Data Protection Regulation.

Removing data from existing WORM storage is tricky, snapshots being a 
specific form thereof. If you want to avoid copying and altering all 
existing records - which might clash with the requirement from other 
fields that data needs to be immutable, but I guess you could store 
checksums externally somewhere? -, this is difficult.

I think what you'd need is an additional layer - say, one holding the 
decryption keys for the tenant/user (or whatever granularity you want to 
be able to remove data at) - that you can still modify.

Once the keys have been successfully and permanently wiped, the old data 
is effectively permanently deleted (from all media; whether Ceph snaps 
or tape or other immutable storage).

You may have a record that you *had* the data.

Now, of course, you've got to manage keys, but that's significantly less 
data to massage.

Not a lawyer, either.

Good luck.


Regards,
Lars

--
SUSE Linux GmbH, GF: Felix Imendörffer, Mary Higgins, Sri Rasiah, HRB 
21284 (AG Nürnberg) "Architects should open possibilities and not 
determine everything." (Ueli Zbinden) 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] writable snapshots in cephfs? GDPR/DSGVO

2019-07-11 Thread Lars Marowsky-Bree
On 2019-07-10T09:59:08, Lars Täuber   wrote:

> Hi everybody!
> 
> Is it possible to make snapshots in cephfs writable?
> We need to remove files, also from snapshots, because of the General Data 
> Protection Regulation.

Removing data from existing WORM storage is tricky, snapshots being a
specific form thereof. If you want to avoid copying and altering all
existing records - which might clash with the requirement from other
fields that data needs to be immutable, but I guess you could store
checksums externally somewhere? -, this is difficult.

I think what you'd need is an additional layer - say, one holding the
decryption keys for the tenant/user (or whatever granularity you want to
be able to remove data at) - that you can still modify.

Once the keys have been successfully and permanently wiped, the old data
is effectively permanently deleted (from all media; whether Ceph snaps
or tape or other immutable storage).

You may have a record that you *had* the data.

Now, of course, you've got to manage keys, but that's significantly less
data to massage.
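
(A minimal sketch of such a key layer, with assumed names and paths - this is 
not something Ceph provides out of the box: keep one key per user outside the 
snapshotted/WORM storage, write only ciphertext into it, and "erase" a user by 
destroying the key.)

openssl rand -out /keys/user42.key 32                       # per-user key, kept outside the snapshots
openssl enc -aes-256-cbc -pass file:/keys/user42.key \
    -in record.json -out /cephfs/archive/user42/record.json.enc   # only ciphertext reaches WORM/snapshots
shred -u /keys/user42.key                                   # wiping the key renders every copy unreadable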

Not a lawyer, either.

Good luck.


Regards,
Lars

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Mary Higgins, Sri Rasiah, HRB 21284 (AG 
Nürnberg)
"Architects should open possibilities and not determine everything." (Ueli 
Zbinden)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What's the best practice for Erasure Coding

2019-07-11 Thread Lars Marowsky-Bree
On 2019-07-09T07:27:28, Frank Schilder  wrote:

> Small addition:
> 
> This result holds for rbd bench. It seems to imply good performance for 
> large-file IO on cephfs, since cephfs will split large files into many 
> objects of size object_size. Small-file IO is a different story.
> 
> The formula should be N*alloc_size=object_size/k, where N is some integer, 
> i.e. object_size/k should be an integer multiple of alloc_size.

If using rbd striping, I'd also assume that making rbd's stripe_unit be
equal to, or at least a multiple of, the stripe_width of the EC pool is
sensible.

(Similar for CephFS's layout.)

Does this hold in your environment?
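
(A quick sanity check with assumed numbers, not values from this thread: 
k=4, 4 MiB objects, 64 KiB bluestore min_alloc_size.)

# object_size/k = 4 MiB / 4 = 1 MiB; N = 1 MiB / 64 KiB = 16, an integer, so the condition holds
echo $(( (4 * 1024 * 1024 / 4) / (64 * 1024) ))   # prints 16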


-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Mary Higgins, Sri Rasiah, HRB 21284 (AG 
Nürnberg)
"Architects should open possibilities and not determine everything." (Ueli 
Zbinden)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RGW Beast crash 14.2.1

2019-07-11 Thread EDH - Manuel Rios Fernandez
Hi Folks,

 

Last night RGW crashed for no apparent reason while using beast as the frontend.

We worked around it by switching back to civetweb.
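
(For reference, a hedged sketch of the switch - not the exact config used here; 
the instance name client.rgw.ceph-rgw03 is only inferred from the log file name 
further down and may differ:)

# in ceph.conf under [client.rgw.ceph-rgw03], or via the mon config store:
ceph config set client.rgw.ceph-rgw03 rgw_frontends "civetweb port=7480"
systemctl restart ceph-radosgw@rgw.ceph-rgw03
# to try beast again later:
#   ceph config set client.rgw.ceph-rgw03 rgw_frontends "beast port=7480"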

 

Should this be reported to the tracker?

 

Regards

Manuel

 

Centos 7.6

Linux ceph-rgw03 3.10.0-957.21.3.el7.x86_64 #1 SMP Tue Jun 18 16:35:19 UTC
2019 x86_64 x86_64 x86_64 GNU/Linux

 

 

fsid e1ee8086-7cce-43fd-a252-3d677af22428

last_changed 2019-06-17 22:35:18.946810

created 2018-04-17 01:37:27.768960

min_mon_release 14 (nautilus)

0: [v2:172.16.2.5:3300/0,v1:172.16.2.5:6789/0] mon.CEPH-MON01

1: [v2:172.16.2.11:3300/0,v1:172.16.2.11:6789/0] mon.CEPH002

2: [v2:172.16.2.12:3300/0,v1:172.16.2.12:6789/0] mon.CEPH003

3: [v2:172.16.2.10:3300/0,v1:172.16.2.10:6789/0] mon.CEPH001

 

   -18> 2019-07-11 09:05:01.995 7f8441aff700  4 set_mon_vals no callback set

   -17> 2019-07-11 09:05:01.995 7f845f6e47c0 10 monclient: _renew_subs

   -16> 2019-07-11 09:05:01.995 7f845f6e47c0 10 monclient: _send_mon_message
to mon.CEPH003 at v2:172.16.2.12:3300/0

  -15> 2019-07-11 09:05:01.995 7f845f6e47c0  1 librados: init done

   -14> 2019-07-11 09:05:01.995 7f845f6e47c0  5 asok(0x55cd18bac000)
register_command cr dump hook 0x55cd198247a8

   -13> 2019-07-11 09:05:01.996 7f8443302700  4 mgrc handle_mgr_map Got map
version 774

   -12> 2019-07-11 09:05:01.996 7f8443302700  4 mgrc handle_mgr_map Active
mgr is now [v2:172.16.2.10:6858/256331,v1:172.16.2.10:6859/256331]

   -11> 2019-07-11 09:05:01.996 7f8443302700  4 mgrc reconnect Starting new
session with [v2:172.16.2.10:6858/256331,v1:172.16.2.10:6859/256331]

   -10> 2019-07-11 09:05:01.996 7f844c59d700 10 monclient: get_auth_request
con 0x55cd19a62000 auth_method 0

-9> 2019-07-11 09:05:01.997 7f844cd9e700 10 monclient: get_auth_request
con 0x55cd19a62400 auth_method 0

-8> 2019-07-11 09:05:01.997 7f844c59d700 10 monclient: get_auth_request
con 0x55cd19a62800 auth_method 0

-7> 2019-07-11 09:05:01.998 7f845f6e47c0  5 asok(0x55cd18bac000)
register_command sync trace show hook 0x55cd19846c40

-6> 2019-07-11 09:05:01.998 7f845f6e47c0  5 asok(0x55cd18bac000)
register_command sync trace history hook 0x55cd19846c40

-5> 2019-07-11 09:05:01.998 7f845f6e47c0  5 asok(0x55cd18bac000)
register_command sync trace active hook 0x55cd19846c40

-4> 2019-07-11 09:05:01.998 7f845f6e47c0  5 asok(0x55cd18bac000)
register_command sync trace active_short hook 0x55cd19846c40

-3> 2019-07-11 09:05:01.999 7f844d59f700 10 monclient: get_auth_request
con 0x55cd19a62c00 auth_method 0

-2> 2019-07-11 09:05:01.999 7f844cd9e700 10 monclient: get_auth_request
con 0x55cd19a63000 auth_method 0

-1> 2019-07-11 09:05:01.999 7f845f6e47c0  0 starting handler: beast

 0> 2019-07-11 09:05:02.001 7f845f6e47c0 -1 *** Caught signal (Aborted)
**

in thread 7f845f6e47c0 thread_name:radosgw

 

ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus
(stable)

1: (()+0xf5d0) [0x7f845293c5d0]

2: (gsignal()+0x37) [0x7f8451d77207]

3: (abort()+0x148) [0x7f8451d788f8]

4: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f84526867d5]

5: (()+0x5e746) [0x7f8452684746]

6: (()+0x5e773) [0x7f8452684773]

7: (()+0x5e993) [0x7f8452684993]

8: (void boost::throw_exception(boost::system::system_error const&)+0x173) [0x55cd16d9f863]

9: (boost::asio::detail::do_throw_error(boost::system::error_code const&,
char const*)+0x5b) [0x55cd16d9f91b]

10: (()+0x2837fc) [0x55cd16d8b7fc]

11: (main()+0x2873) [0x55cd16d2a8b3]

12: (__libc_start_main()+0xf5) [0x7f8451d633d5]

13: (()+0x24a877) [0x55cd16d52877]

NOTE: a copy of the executable, or `objdump -rdS ` is needed to
interpret this.

 

--- logging levels ---

   0/ 5 none

   0/ 1 lockdep

   0/ 1 context

   1/ 1 crush

   1/ 5 mds

   1/ 5 mds_balancer

   1/ 5 mds_locker

   1/ 5 mds_log

   1/ 5 mds_log_expire

   1/ 5 mds_migrator

   0/ 1 buffer

   0/ 1 timer

   0/ 1 filer

   0/ 1 striper

   0/ 1 objecter

   0/ 5 rados

   0/ 5 rbd

   0/ 5 rbd_mirror

   0/ 5 rbd_replay

   0/ 5 journaler

   0/ 5 objectcacher

   0/ 5 client

   0/ 0 osd

   0/ 5 optracker

   0/ 5 objclass

   1/ 3 filestore

   0/ 0 journal

   0/ 0 ms

   1/ 5 mon

   0/10 monc

   1/ 5 paxos

   0/ 5 tp

   1/ 5 auth

   1/ 5 crypto

   1/ 1 finisher

   1/ 1 reserver

   1/ 5 heartbeatmap

   1/ 5 perfcounter

   1/ 1 rgw

   1/ 5 rgw_sync

   1/10 civetweb

   1/ 5 javaclient

   1/ 5 asok

   1/ 1 throttle

   0/ 0 refs

   1/ 5 xio

   1/ 5 compressor

   1/ 5 bluestore

   1/ 5 bluefs

   1/ 3 bdev

   1/ 5 kstore

   4/ 5 rocksdb

   4/ 5 leveldb

   4/ 5 memdb

   1/ 5 kinetic

   1/ 5 fuse

   1/ 5 mgr

   1/ 5 mgrc

   1/ 5 dpdk

   1/ 5 eventtrace

  -2/-2 (syslog threshold)

  -1/-1 (stderr threshold)

  max_recent 1

  max_new 1000

  log_file /var/log/ceph/ceph-client.rgw.ceph-rgw03.log

--- end dump of recent events ---

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com