Re: [ceph-users] Ceph -s shows that usage is 385GB after I delete my pools

2019-07-08 Thread 刘亮
Thank you for your reply.

I have 84 osds: 7 ssd as cache tier for pool cache, and 77 hdd as storage pool.


 Original message
From: Paul Emmerich
To: 刘亮
Cc: ceph-users
Sent: Tuesday, July 9, 2019 01:35
主题: Re: [ceph-users] Ceph -s shows that usage is 385GB after I delete my pools

That's very likely just metadata.

How many OSDs do you have? Minimum pre-allocated size for metadata is around 1 
GB per OSD. Could be more allocated but not yet in use space after deleting 
pools.
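
A quick way to see where that space sits (a sketch, assuming Nautilus-era
tooling; the OSD id below is just an example):

# per-OSD view; the META column is bluefs/rocksdb metadata space
ceph osd df
# per-OSD detail via the admin socket
ceph daemon osd.0 perf dump bluefs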

Paul

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Mon, Jul 8, 2019 at 2:07 PM 刘亮 <liang...@linkdoc.com> wrote:

HI:
Ceph -s shows that usage is 385GB after I delete my pools. Do you know
why? Can anyone help me?

Thank you!

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Erasure Coding performance for IO < stripe_width

2019-07-08 Thread Lars Marowsky-Bree
On 2019-07-08T19:37:13, Paul Emmerich  wrote:

> object_map can be a bottleneck for the first write in fresh images

We're working with CephFS here.


-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Mary Higgins, Sri Rasiah, HRB 21284 (AG 
Nürnberg)
"Architects should open possibilities and not determine everything." (Ueli 
Zbinden)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Erasure Coding performance for IO < stripe_width

2019-07-08 Thread Paul Emmerich
On Mon, Jul 8, 2019 at 2:42 PM Maged Mokhtar  wrote:

>
> On 08/07/2019 13:02, Lars Marowsky-Bree wrote:
> > On 2019-07-08T12:25:30, Dan van der Ster  wrote:
> >
> >> Is there a specific bench result you're concerned about?
> > We're seeing ~5800 IOPS, ~23 MiB/s on 4 KiB IO (stripe_width 8192) on a
> > pool that could do 3 GiB/s with 4M blocksize. So, yeah, well, that is
> > rather harsh, even for EC.
> >
> >> I would think that small write perf could be kept reasonable thanks to
> >> bluestore's deferred writes.
> > I believe we're being hit by the EC read-modify-write cycle on
> > overwrites.
> >
> >> FWIW, our bench results (all flash cluster) didn't show a massive
> >> performance difference between 3 replica and 4+2 EC.
> > I'm guessing that this was not 4 KiB but a more reasonable blocksize
> > that was a multiple of stripe_width?
> >
> >
> > Regards,
> >  Lars
>
> Hi Lars,
>
> Maybe not related, but we find with rbd, random 4k write iops start very
> low at first for a new image and then increase over time as we write. If
> we thick provision the image, it does not show this. This happens on
> random small block and not sequential or large. Probably related to
> initial object/chunk creation.
>

object_map can be a bottleneck for the first write in fresh images
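
If you want to check whether object_map is the culprit, a sketch (pool/image
names are hypothetical):

# create the benchmark image without object-map/fast-diff:
rbd create rbd/bench-img --size 100G --image-feature layering --image-feature exclusive-lock
# or strip the features from an existing test image (fast-diff depends on object-map):
rbd feature disable rbd/bench-img fast-diff object-map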

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


>
> Also we use the default stripe width, maybe you try a pool with default
> width and see if it is a factor.
>
>
> /Maged
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph -s shows that usage is 385GB after I delete my pools

2019-07-08 Thread Paul Emmerich
That's very likely just metadata.

How many OSDs do you have? Minimum pre-allocated size for metadata is
around 1 GB per OSD. Could be more allocated but not yet in use space after
deleting pools.

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Mon, Jul 8, 2019 at 2:07 PM 刘亮  wrote:

>
> HI:
> Ceph -s shows that usage is 385GB after I delete my pools. Do you
> know why? Can anyone help me?
>
> Thank you!
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 3 OSDs stopped and unable to restart

2019-07-08 Thread Brett Chancellor
I'll give that a try.  Is it something like...
ceph tell 'osd.*' bluestore_allocator stupid
ceph tell 'osd.*' bluefs_allocator stupid

And should I expect any issues doing this?
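
For reference, a minimal sketch of the workaround (assuming Nautilus; both
options are read at OSD startup, so a runtime "ceph tell" alone won't take
effect and each OSD needs a restart):

# centralized config (or put the options under [osd] in ceph.conf):
ceph config set osd bluestore_allocator stupid
ceph config set osd bluefs_allocator stupid
# then restart OSDs one at a time, letting the cluster settle in between:
systemctl restart ceph-osd@<id>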


On Mon, Jul 8, 2019 at 1:04 PM Igor Fedotov  wrote:

> I should read call stack more carefully... It's not about lacking free
> space - this is rather the bug from this ticket:
>
> http://tracker.ceph.com/issues/40080
>
>
> You should upgrade to v14.2.2 (once it's available) or temporarily switch
> to stupid allocator as a workaround.
>
>
> Thanks,
>
> Igor
>
>
>
> On 7/8/2019 8:00 PM, Igor Fedotov wrote:
>
> Hi Brett,
>
> looks like BlueStore is unable to allocate additional space for BlueFS at
> main device. It's either lacking free space or it's too fragmented...
>
> Would you share osd log, please?
>
> Also please run "ceph-bluestore-tool --path <path-to-osd!!!> bluefs-bdev-sizes" and share the output.
>
> Thanks,
>
> Igor
> On 7/3/2019 9:59 PM, Brett Chancellor wrote:
>
> Hi All! Today I've had 3 OSDs stop themselves and are unable to restart,
> all with the same error. These OSDs are all on different hosts. All are
> running 14.2.1
>
> I did try the following two commands
> - ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-80 list > keys
>   ## This failed with the same error below
> - ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-80 fsck
>  ## After a couple of hours returned...
> 2019-07-03 18:30:02.095 7fe7c1c1ef00 -1
> bluestore(/var/lib/ceph/osd/ceph-80) fsck warning: legacy statfs record
> found, suggest to run store repair to get consistent statistic reports
> fsck success
>
>
> ## Error when trying to start one of the OSDs
>-12> 2019-07-03 18:36:57.450 7f5e42366700 -1 *** Caught signal
> (Aborted) **
>  in thread 7f5e42366700 thread_name:rocksdb:low0
>
>  ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus
> (stable)
>  1: (()+0xf5d0) [0x7f5e50bd75d0]
>  2: (gsignal()+0x37) [0x7f5e4f9ce207]
>  3: (abort()+0x148) [0x7f5e4f9cf8f8]
>  4: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x199) [0x55a7aaee96ab]
>  5: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*,
> char const*, ...)+0) [0x55a7aaee982a]
>  6: (interval_set<unsigned long, std::map<unsigned long, unsigned long,
> std::less<unsigned long>, std::allocator<std::pair<unsigned long const,
> unsigned long> > > >::insert(unsigned long, unsigned long, unsigned long*,
> unsigned long*)+0x3c6) [0x55a7ab212a66]
>  7: (BlueStore::allocate_bluefs_freespace(unsigned long, unsigned long,
> std::vector<bluestore_pextent_t,
> mempool::pool_allocator<(mempool::pool_index_t)4, bluestore_pextent_t>
> >*)+0x74e) [0x55a7ab48253e]
>  8: (BlueFS::_expand_slow_device(unsigned long,
> std::vector<bluestore_pextent_t,
> mempool::pool_allocator<(mempool::pool_index_t)4, bluestore_pextent_t>
> >&)+0x111) [0x55a7ab59e921]
>  9: (BlueFS::_allocate(unsigned char, unsigned long,
> bluefs_fnode_t*)+0x68b) [0x55a7ab59f68b]
>  10: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned
> long)+0xe5) [0x55a7ab59fce5]
>  11: (BlueFS::_flush(BlueFS::FileWriter*, bool)+0x10b) [0x55a7ab5a1b4b]
>  12: (BlueRocksWritableFile::Flush()+0x3d) [0x55a7ab5bf84d]
>  13: (rocksdb::WritableFileWriter::Flush()+0x19e) [0x55a7abbedd0e]
>  14: (rocksdb::WritableFileWriter::Sync(bool)+0x2e) [0x55a7abbedfee]
>  15: (rocksdb::CompactionJob::FinishCompactionOutputFile(rocksdb::Status
> const&, rocksdb::CompactionJob::SubcompactionState*,
> rocksdb::RangeDelAggregator*, CompactionIterationStats*, rocksdb::Slice
> const*)+0xbaa) [0x55a7abc3b73a]
>  16:
> (rocksdb::CompactionJob::ProcessKeyValueCompaction(rocksdb::CompactionJob::SubcompactionState*)+0x7d0)
> [0x55a7abc3f150]
>  17: (rocksdb::CompactionJob::Run()+0x298) [0x55a7abc40618]
>  18: (rocksdb::DBImpl::BackgroundCompaction(bool*, rocksdb::JobContext*,
> rocksdb::LogBuffer*, rocksdb::DBImpl::PrepickedCompaction*)+0xcb7)
> [0x55a7aba7fb67]
>  19:
> (rocksdb::DBImpl::BackgroundCallCompaction(rocksdb::DBImpl::PrepickedCompaction*,
> rocksdb::Env::Priority)+0xd0) [0x55a7aba813c0]
>  20: (rocksdb::DBImpl::BGWorkCompaction(void*)+0x3a) [0x55a7aba8190a]
>  21: (rocksdb::ThreadPoolImpl::Impl::BGThread(unsigned long)+0x264)
> [0x55a7abc8d9c4]
>  22: (rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper(void*)+0x4f)
> [0x55a7abc8db4f]
>  23: (()+0x129dfff) [0x55a7abd1afff]
>  24: (()+0x7dd5) [0x7f5e50bcfdd5]
>  25: (clone()+0x6d) [0x7f5e4fa95ead]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
> to interpret this.
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 3 OSDs stopped and unable to restart

2019-07-08 Thread Igor Fedotov
I should read call stack more carefully... It's not about lacking free 
space - this is rather the bug from this ticket:


http://tracker.ceph.com/issues/40080


You should upgrade to v14.2.2 (once it's available) or temporarily 
switch to stupid allocator as a workaround.



Thanks,

Igor



On 7/8/2019 8:00 PM, Igor Fedotov wrote:


Hi Brett,

looks like BlueStore is unable to allocate additional space for BlueFS 
at main device. It's either lacking free space or it's too fragmented...


Would you share osd log, please?

Also please run "ceph-bluestore-tool --path <path-to-osd!!!> bluefs-bdev-sizes" and share the output.


Thanks,

Igor

On 7/3/2019 9:59 PM, Brett Chancellor wrote:
Hi All! Today I've had 3 OSDs stop themselves and are unable to 
restart, all with the same error. These OSDs are all on different 
hosts. All are running 14.2.1


I did try the following two commands
- ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-80 list > keys
  ## This failed with the same error below
- ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-80 fsck
 ## After a couple of hours returned...
2019-07-03 18:30:02.095 7fe7c1c1ef00 -1 
bluestore(/var/lib/ceph/osd/ceph-80) fsck warning: legacy statfs 
record found, suggest to run store repair to get consistent statistic 
reports

fsck success


## Error when trying to start one of the OSDs
 -12> 2019-07-03 18:36:57.450 7f5e42366700 -1 *** Caught signal 
(Aborted) **

 in thread 7f5e42366700 thread_name:rocksdb:low0

 ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) 
nautilus (stable)

 1: (()+0xf5d0) [0x7f5e50bd75d0]
 2: (gsignal()+0x37) [0x7f5e4f9ce207]
 3: (abort()+0x148) [0x7f5e4f9cf8f8]
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x199) [0x55a7aaee96ab]
 5: (ceph::__ceph_assertf_fail(char const*, char const*, int, char 
const*, char const*, ...)+0) [0x55a7aaee982a]
 6: (interval_set<unsigned long, std::map<unsigned long, unsigned long, 
std::less<unsigned long>, std::allocator<std::pair<unsigned long const, 
unsigned long> > > >::insert(unsigned long, unsigned 
long, unsigned long*, unsigned long*)+0x3c6) [0x55a7ab212a66]
 7: (BlueStore::allocate_bluefs_freespace(unsigned long, unsigned 
long, std::vector<bluestore_pextent_t, 
mempool::pool_allocator<(mempool::pool_index_t)4, 
bluestore_pextent_t> >*)+0x74e) [0x55a7ab48253e]
 8: (BlueFS::_expand_slow_device(unsigned long, 
std::vector<bluestore_pextent_t, 
mempool::pool_allocator<(mempool::pool_index_t)4, 
bluestore_pextent_t> >&)+0x111) [0x55a7ab59e921]
 9: (BlueFS::_allocate(unsigned char, unsigned long, 
bluefs_fnode_t*)+0x68b) [0x55a7ab59f68b]
 10: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, 
unsigned long)+0xe5) [0x55a7ab59fce5]

 11: (BlueFS::_flush(BlueFS::FileWriter*, bool)+0x10b) [0x55a7ab5a1b4b]
 12: (BlueRocksWritableFile::Flush()+0x3d) [0x55a7ab5bf84d]
 13: (rocksdb::WritableFileWriter::Flush()+0x19e) [0x55a7abbedd0e]
 14: (rocksdb::WritableFileWriter::Sync(bool)+0x2e) [0x55a7abbedfee]
 15: 
(rocksdb::CompactionJob::FinishCompactionOutputFile(rocksdb::Status 
const&, rocksdb::CompactionJob::SubcompactionState*, 
rocksdb::RangeDelAggregator*, CompactionIterationStats*, 
rocksdb::Slice const*)+0xbaa) [0x55a7abc3b73a]
 16: 
(rocksdb::CompactionJob::ProcessKeyValueCompaction(rocksdb::CompactionJob::SubcompactionState*)+0x7d0) 
[0x55a7abc3f150]

 17: (rocksdb::CompactionJob::Run()+0x298) [0x55a7abc40618]
 18: (rocksdb::DBImpl::BackgroundCompaction(bool*, 
rocksdb::JobContext*, rocksdb::LogBuffer*, 
rocksdb::DBImpl::PrepickedCompaction*)+0xcb7) [0x55a7aba7fb67]
 19: 
(rocksdb::DBImpl::BackgroundCallCompaction(rocksdb::DBImpl::PrepickedCompaction*, 
rocksdb::Env::Priority)+0xd0) [0x55a7aba813c0]

 20: (rocksdb::DBImpl::BGWorkCompaction(void*)+0x3a) [0x55a7aba8190a]
 21: (rocksdb::ThreadPoolImpl::Impl::BGThread(unsigned long)+0x264) 
[0x55a7abc8d9c4]
 22: (rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper(void*)+0x4f) 
[0x55a7abc8db4f]

 23: (()+0x129dfff) [0x55a7abd1afff]
 24: (()+0x7dd5) [0x7f5e50bcfdd5]
 25: (clone()+0x6d) [0x7f5e4fa95ead]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is 
needed to interpret this.


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 3 OSDs stopped and unable to restart

2019-07-08 Thread Igor Fedotov

Hi Brett,

looks like BlueStore is unable to allocate additional space for BlueFS 
at main device. It's either lacking free space or it's too fragmented...


Would you share osd log, please?

Also please run "ceph-bluestore-tool --path <path-to-osd!!!> bluefs-bdev-sizes" and share the output.


Thanks,

Igor

On 7/3/2019 9:59 PM, Brett Chancellor wrote:
Hi All! Today I've had 3 OSDs stop themselves and are unable to 
restart, all with the same error. These OSDs are all on different 
hosts. All are running 14.2.1


I did try the following two commands
- ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-80 list > keys
  ## This failed with the same error below
- ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-80 fsck
 ## After a couple of hours returned...
2019-07-03 18:30:02.095 7fe7c1c1ef00 -1 
bluestore(/var/lib/ceph/osd/ceph-80) fsck warning: legacy statfs 
record found, suggest to run store repair to get consistent statistic 
reports

fsck success


## Error when trying to start one of the OSDs
 -12> 2019-07-03 18:36:57.450 7f5e42366700 -1 *** Caught signal 
(Aborted) **

 in thread 7f5e42366700 thread_name:rocksdb:low0

 ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) 
nautilus (stable)

 1: (()+0xf5d0) [0x7f5e50bd75d0]
 2: (gsignal()+0x37) [0x7f5e4f9ce207]
 3: (abort()+0x148) [0x7f5e4f9cf8f8]
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x199) [0x55a7aaee96ab]
 5: (ceph::__ceph_assertf_fail(char const*, char const*, int, char 
const*, char const*, ...)+0) [0x55a7aaee982a]
 6: (interval_set<unsigned long, std::map<unsigned long, unsigned long, 
std::less<unsigned long>, std::allocator<std::pair<unsigned long const, 
unsigned long> > > >::insert(unsigned long, unsigned long, 
unsigned long*, unsigned long*)+0x3c6) [0x55a7ab212a66]
 7: (BlueStore::allocate_bluefs_freespace(unsigned long, unsigned 
long, std::vector<bluestore_pextent_t, 
mempool::pool_allocator<(mempool::pool_index_t)4, bluestore_pextent_t> 
>*)+0x74e) [0x55a7ab48253e]
 8: (BlueFS::_expand_slow_device(unsigned long, 
std::vector<bluestore_pextent_t, 
mempool::pool_allocator<(mempool::pool_index_t)4, bluestore_pextent_t> 
>&)+0x111) [0x55a7ab59e921]
 9: (BlueFS::_allocate(unsigned char, unsigned long, 
bluefs_fnode_t*)+0x68b) [0x55a7ab59f68b]
 10: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, 
unsigned long)+0xe5) [0x55a7ab59fce5]

 11: (BlueFS::_flush(BlueFS::FileWriter*, bool)+0x10b) [0x55a7ab5a1b4b]
 12: (BlueRocksWritableFile::Flush()+0x3d) [0x55a7ab5bf84d]
 13: (rocksdb::WritableFileWriter::Flush()+0x19e) [0x55a7abbedd0e]
 14: (rocksdb::WritableFileWriter::Sync(bool)+0x2e) [0x55a7abbedfee]
 15: 
(rocksdb::CompactionJob::FinishCompactionOutputFile(rocksdb::Status 
const&, rocksdb::CompactionJob::SubcompactionState*, 
rocksdb::RangeDelAggregator*, CompactionIterationStats*, 
rocksdb::Slice const*)+0xbaa) [0x55a7abc3b73a]
 16: 
(rocksdb::CompactionJob::ProcessKeyValueCompaction(rocksdb::CompactionJob::SubcompactionState*)+0x7d0) 
[0x55a7abc3f150]

 17: (rocksdb::CompactionJob::Run()+0x298) [0x55a7abc40618]
 18: (rocksdb::DBImpl::BackgroundCompaction(bool*, 
rocksdb::JobContext*, rocksdb::LogBuffer*, 
rocksdb::DBImpl::PrepickedCompaction*)+0xcb7) [0x55a7aba7fb67]
 19: 
(rocksdb::DBImpl::BackgroundCallCompaction(rocksdb::DBImpl::PrepickedCompaction*, 
rocksdb::Env::Priority)+0xd0) [0x55a7aba813c0]

 20: (rocksdb::DBImpl::BGWorkCompaction(void*)+0x3a) [0x55a7aba8190a]
 21: (rocksdb::ThreadPoolImpl::Impl::BGThread(unsigned long)+0x264) 
[0x55a7abc8d9c4]
 22: (rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper(void*)+0x4f) 
[0x55a7abc8db4f]

 23: (()+0x129dfff) [0x55a7abd1afff]
 24: (()+0x7dd5) [0x7f5e50bcfdd5]
 25: (clone()+0x6d) [0x7f5e4fa95ead]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is 
needed to interpret this.


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] radosgw user audit trail

2019-07-08 Thread shubjero
Good day,

We have a sizeable ceph deployment and use object-storage heavily. We
also integrate our object-storage with OpenStack but sometimes we are
required to create S3 keys for some of our users (aws-cli, java apps
that speak s3, etc). I was wondering if it is possible to see an audit
trail of a specific access key. I have noticed that only some
applications disclose their access key in the radosgw logs, whereas
others (like aws-cli) do not. I have also been able to view the audit logs
for a specific user (which is an OpenStack project), but not
specifically a key within that user/openstack project.

Any help would be appreciated!
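
One possible avenue, sketched here (the rgw ops log is off by default, so this
assumes you can change the rgw config and restart it; how completely access
keys show up in it may vary):

# in ceph.conf for the rgw instance, then restart radosgw:
rgw enable ops log = true
# afterwards, inspect the logged operations:
radosgw-admin log list
radosgw-admin log show --object=<object-name-from-list>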

Thank you,

Jared Baker
Ontario Institute for Cancer Research
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 3 OSDs stopped and unable to restart

2019-07-08 Thread Brett Chancellor
Just checking if anybody has seen this? About 15 more OSDs have failed since
then. The cluster can't backfill fast enough, and I fear data loss may be
imminent. I did notice that one of the latest ones to fail has lines
similar to this one right before the crash:

2019-07-08 15:18:56.170 7fc732475700  5
bluestore(/var/lib/ceph/osd/ceph-59) allocate_bluefs_freespace gifting
0x4d18d0~40 to bluefs

Any thoughts?
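
If it helps to watch how much space bluefs has before it asks for another
gift, a sketch (the OSD id is just an example):

ceph daemon osd.59 perf dump bluefs
# db_total_bytes / db_used_bytes / slow_used_bytes show where bluefs space lives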

On Sat, Jul 6, 2019 at 3:06 PM Brett Chancellor 
wrote:

> Has anybody else run into this? It seems to be slowly spreading to other
> OSDs, maybe it gets to a bad pg in the backfill process and kills off
> another OSD (just guessing since the failures are hours apart).  It's kind
> of a pain because I have to continually rebuild these OSDs before the
> cluster runs out of space.
>
> On Wed, Jul 3, 2019 at 2:59 PM Brett Chancellor <
> bchancel...@salesforce.com> wrote:
>
>> Hi All! Today I've had 3 OSDs stop themselves and are unable to restart,
>> all with the same error. These OSDs are all on different hosts. All are
>> running 14.2.1
>>
>> I did try the following two commands
>> - ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-80 list > keys
>>   ## This failed with the same error below
>> - ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-80 fsck
>>  ## After a couple of hours returned...
>> 2019-07-03 18:30:02.095 7fe7c1c1ef00 -1
>> bluestore(/var/lib/ceph/osd/ceph-80) fsck warning: legacy statfs record
>> found, suggest to run store repair to get consistent statistic reports
>> fsck success
>>
>>
>> ## Error when trying to start one of the OSDs
>>-12> 2019-07-03 18:36:57.450 7f5e42366700 -1 *** Caught signal
>> (Aborted) **
>>  in thread 7f5e42366700 thread_name:rocksdb:low0
>>
>>  ceph version 14.2.1 (d555a9489eb35f84f2e1ef49b77e19da9d113972) nautilus
>> (stable)
>>  1: (()+0xf5d0) [0x7f5e50bd75d0]
>>  2: (gsignal()+0x37) [0x7f5e4f9ce207]
>>  3: (abort()+0x148) [0x7f5e4f9cf8f8]
>>  4: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> const*)+0x199) [0x55a7aaee96ab]
>>  5: (ceph::__ceph_assertf_fail(char const*, char const*, int, char
>> const*, char const*, ...)+0) [0x55a7aaee982a]
>>  6: (interval_set<unsigned long, std::map<unsigned long, unsigned long,
>> std::less<unsigned long>, std::allocator<std::pair<unsigned long const,
>> unsigned long> > > >::insert(unsigned long, unsigned long, unsigned long*,
>> unsigned long*)+0x3c6) [0x55a7ab212a66]
>>  7: (BlueStore::allocate_bluefs_freespace(unsigned long, unsigned long,
>> std::vector<bluestore_pextent_t,
>> mempool::pool_allocator<(mempool::pool_index_t)4, bluestore_pextent_t>
>> >*)+0x74e) [0x55a7ab48253e]
>>  8: (BlueFS::_expand_slow_device(unsigned long,
>> std::vector<bluestore_pextent_t,
>> mempool::pool_allocator<(mempool::pool_index_t)4, bluestore_pextent_t>
>> >&)+0x111) [0x55a7ab59e921]
>>  9: (BlueFS::_allocate(unsigned char, unsigned long,
>> bluefs_fnode_t*)+0x68b) [0x55a7ab59f68b]
>>  10: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned
>> long)+0xe5) [0x55a7ab59fce5]
>>  11: (BlueFS::_flush(BlueFS::FileWriter*, bool)+0x10b) [0x55a7ab5a1b4b]
>>  12: (BlueRocksWritableFile::Flush()+0x3d) [0x55a7ab5bf84d]
>>  13: (rocksdb::WritableFileWriter::Flush()+0x19e) [0x55a7abbedd0e]
>>  14: (rocksdb::WritableFileWriter::Sync(bool)+0x2e) [0x55a7abbedfee]
>>  15: (rocksdb::CompactionJob::FinishCompactionOutputFile(rocksdb::Status
>> const&, rocksdb::CompactionJob::SubcompactionState*,
>> rocksdb::RangeDelAggregator*, CompactionIterationStats*, rocksdb::Slice
>> const*)+0xbaa) [0x55a7abc3b73a]
>>  16:
>> (rocksdb::CompactionJob::ProcessKeyValueCompaction(rocksdb::CompactionJob::SubcompactionState*)+0x7d0)
>> [0x55a7abc3f150]
>>  17: (rocksdb::CompactionJob::Run()+0x298) [0x55a7abc40618]
>>  18: (rocksdb::DBImpl::BackgroundCompaction(bool*, rocksdb::JobContext*,
>> rocksdb::LogBuffer*, rocksdb::DBImpl::PrepickedCompaction*)+0xcb7)
>> [0x55a7aba7fb67]
>>  19:
>> (rocksdb::DBImpl::BackgroundCallCompaction(rocksdb::DBImpl::PrepickedCompaction*,
>> rocksdb::Env::Priority)+0xd0) [0x55a7aba813c0]
>>  20: (rocksdb::DBImpl::BGWorkCompaction(void*)+0x3a) [0x55a7aba8190a]
>>  21: (rocksdb::ThreadPoolImpl::Impl::BGThread(unsigned long)+0x264)
>> [0x55a7abc8d9c4]
>>  22: (rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper(void*)+0x4f)
>> [0x55a7abc8db4f]
>>  23: (()+0x129dfff) [0x55a7abd1afff]
>>  24: (()+0x7dd5) [0x7f5e50bcfdd5]
>>  25: (clone()+0x6d) [0x7f5e4fa95ead]
>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
>> to interpret this.
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What's the best practice for Erasure Coding

2019-07-08 Thread Nathan Fish
This is very interesting, thank you. I'm curious, what is the reason
for avoiding k's with large prime factors? If I set k=5, what happens?

On Mon, Jul 8, 2019 at 8:56 AM Lei Liu  wrote:
>
> Hi Frank,
>
> Thanks for sharing valuable experience.
>
> Frank Schilder  wrote on Monday, July 8, 2019 at 4:36 PM:
>>
>> Hi David,
>>
>> I'm running a cluster with bluestore on raw devices (no lvm) and all 
>> journals collocated on the same disk with the data. Disks are spinning 
>> NL-SAS. Our goal was to build storage at lowest cost, therefore all data on 
>> HDD only. I got a few SSDs that I'm using for FS and RBD meta data. All 
>> large pools are EC on spinning disk.
>>
>> I spent at least one month to run detailed benchmarks (rbd bench) depending 
>> on EC profile, object size, write size, etc. Results were varying a lot. My 
>> advice would be to run benchmarks with your hardware. If there was a single 
>> perfect choice, there wouldn't be so many options. For example, my tests 
>> will not be valid when using separate fast disks for WAL and DB.
>>
>> There are some results though that might be valid in general:
>>
>> 1) EC pools have high throughput but low IOP/s compared with replicated pools
>>
>> I see single-thread write speeds of up to 1.2GB (gigabyte) per second, which 
>> is probably the network limit and not the disk limit. IOP/s get better with 
>> more disks, but are way lower than what replicated pools can provide. On a 
>> cephfs with EC data pool, small-file IO will be comparably slow and eat a 
>> lot of resources.
>>
>> 2) I observe massive network traffic amplification on small IO sizes, which 
>> is due to the way EC overwrites are handled. This is one bottleneck for 
>> IOP/s. We have 10G infrastructure and use 2x10G client and 4x10G OSD 
>> network. OSD bandwidth at least 2x client network, better 4x or more.
>>
>> 3) k should only have small prime factors, power of 2 if possible
>>
>> I tested k=5,6,8,10,12. Best results in decreasing order: k=8, k=6. All 
>> other choices were poor. The value of m seems not relevant for performance. 
>> Larger k will require more failure domains (more hardware).
>>
>> 4) object size matters
>>
>> The best throughput (1M write size) I see with object sizes of 4MB or 8MB, 
>> with IOP/s getting somewhat better with smaller object sizes but throughput 
>> dropping fast. I use the default of 4MB in production. Works well for us.
>>
>> 5) jerasure is quite good and seems most flexible
>>
>> jerasure is quite CPU efficient and can handle smaller chunk sizes than 
>> other plugins, which is preferable for IOP/s. However, CPU usage can become 
>> a problem and a plugin optimized for specific values of k and m might help 
>> here. Under usual circumstances I see very low load on all OSD hosts, even 
>> under rebalancing. However, I remember that once I needed to rebuild 
>> something on all OSDs (I don't remember what it was, sorry). In this 
>> situation, CPU load went up to 30-50% (meaning up to half the cores were at 
>> 100%), which is really high considering that each server has only 16 disks 
>> at the moment and is sized to handle up to 100. CPU power could become a 
>> bottleneck for us in the future.
>>
>> These are some general observations and do not replace benchmarks for 
>> specific use cases. I was hunting for a specific performance pattern, which 
>> might not be what you want to optimize for. I would recommend to run 
>> extensive benchmarks if you have to live with a configuration for a long 
>> time - EC profiles cannot be changed.
>>
>> We settled on 8+2 and 6+2 pools with jerasure and object size 4M. We also 
>> use bluestore compression. All meta data pools are on SSD, only very little 
>> SSD space is required. This choice works well for the majority of our use 
>> cases. We can still build small expensive pools to accommodate special 
>> performance requests.
>>
>> Best regards,
>>
>> =
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> 
>> From: ceph-users  on behalf of David 
>> 
>> Sent: 07 July 2019 20:01:18
>> To: ceph-users@lists.ceph.com
>> Subject: [ceph-users]  What's the best practice for Erasure Coding
>>
>> Hi Ceph-Users,
>>
>> I'm working with a  Ceph cluster (about 50TB, 28 OSDs, all Bluestore on lvm).
>> Recently, I'm trying to use the Erasure Code pool.
>> My question is "what's the best practice for using EC pools ?".
>> More specifically, which plugin (jerasure, isa, lrc, shec or  clay) should I 
>> adopt, and how to choose the combinations of (k,m) (e.g. (k=3,m=2), 
>> (k=6,m=3) ).
>>
>> Does anyone share some experience?
>>
>> Thanks for any help.
>>
>> Regards,
>> David
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] rbd - volume multi attach support

2019-07-08 Thread Eddy Castillon
Hi guys,

Hope this link helps you. Until Ocata, the cinder driver does not support
multi-attach on ceph.

https://docs.openstack.org/cinder/latest/reference/support-matrix.html#operation_multi_attach


On Mon, Jul 8, 2019 at 09:13, Jason Dillaman (
jdill...@redhat.com) wrote:

> On Mon, Jul 8, 2019 at 10:07 AM M Ranga Swami Reddy
>  wrote:
> >
> > Thanks Jason.
> > Btw, we use Ceph with OpenStack Cinder and Cinder Release (Q and above)
> supports multi attach. can we use the OpenStack Cinder with Q release with
> Ceph rbd for multi attach functionality?
>
> I can't speak to the OpenStack release since I don't know, but if you
> have this commit [1], it should work.
>
> > Thanks
> > Swami
>
> [1] https://review.opendev.org/#/c/595827/
>
> --
> Jason
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


-- 

Sincerely,

Eddy Castillon
+51 934782232
eddy.castil...@qualifacts.com

Qualifacts, Inc. 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [events] Ceph Day Netherlands July 2nd - CFP ends June 3rd

2019-07-08 Thread Caspar Smit
Mike,

Do you know if the slides from the presentations at Ceph Day Netherlands
will be made available? (and if yes, where to find them?)

Kind regards,
Caspar Smit


On Wed, May 29, 2019 at 16:42, Mike Perez  wrote:

> Hi everyone,
>
> This is the last week to submit for the Ceph Day Netherlands CFP
> ending June 3rd:
>
> https://ceph.com/cephdays/netherlands-2019/
> https://zfrmz.com/E3ouYm0NiPF1b3NLBjJk
>
> --
> Mike Perez (thingee)
>
> On Thu, May 23, 2019 at 10:12 AM Mike Perez  wrote:
> >
> > Hi everyone,
> >
> > We will be having Ceph Day Netherlands July 2nd!
> >
> > https://ceph.com/cephdays/netherlands-2019/
> >
> > The CFP will be ending June 3rd, so there is still time to get your
> > Ceph related content in front of the Ceph community ranging from all
> > levels of expertise:
> >
> > https://zfrmz.com/E3ouYm0NiPF1b3NLBjJk
> >
> > If your company is interested in sponsoring the event, we would be
> > delighted to have you. Please contact me directly for further
> > information.
> >
> > Hosted by the Ceph community (and our friends) in select cities around
> > the world, Ceph Days are full-day events dedicated to fostering our
> > vibrant community.
> >
> > In addition to Ceph experts, community members, and vendors, you’ll
> > hear from production users of Ceph who’ll share what they’ve learned
> > from their deployments.
> >
> > Each Ceph Day ends with a Q&A session and cocktail reception. Join us!
> >
> > --
> > Mike Perez (thingee)
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd - volume multi attach support

2019-07-08 Thread Jason Dillaman
On Mon, Jul 8, 2019 at 10:07 AM M Ranga Swami Reddy
 wrote:
>
> Thanks Jason.
> Btw, we use Ceph with OpenStack Cinder and Cinder Release (Q and above) 
> supports multi attach. can we use the OpenStack Cinder with Q release with 
> Ceph rbd for multi attach functionality?

I can't speak to the OpenStack release since I don't know, but if you
have this commit [1], it should work.

> Thanks
> Swami

[1] https://review.opendev.org/#/c/595827/

-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd - volume multi attach support

2019-07-08 Thread M Ranga Swami Reddy
Thanks Jason.
Btw, we use Ceph with OpenStack Cinder and Cinder Release (Q and above)
supports multi attach. can we use the OpenStack Cinder with Q release with
Ceph rbd for multi attach functionality?

Thanks
Swami
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Erasure Coding performance for IO < stripe_width

2019-07-08 Thread Lars Marowsky-Bree
On 2019-07-08T14:36:31, Maged Mokhtar  wrote:

Hi Maged,

> Maybe not related, but we find with rbd, random 4k write iops start very low
> at first for a new image and then increase over time as we write. If we
> thick provision the image, it does not show this. This happens on random
> small block and not sequential or large. Probably related to initial
> object/chunk creation.

I don't see that this is related, we actually see faster performance for
random writes initially. (Unsurprising - writing to a non-existent
part/object means the OSD has nothing else to read, so no overwrite.)

> Also we use the default stripe width, maybe you try a pool with default
> width and see if it is a factor.

This is the default stripe_width for an EC pool with k=2.
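
(With the default 4 KiB stripe unit, stripe_width = k * stripe_unit = 2 * 4096
= 8192 bytes, matching the pool above.)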


-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Mary Higgins, Sri Rasiah, HRB 21284 (AG 
Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd - volume multi attach support

2019-07-08 Thread Jason Dillaman
On Mon, Jul 8, 2019 at 8:33 AM M Ranga Swami Reddy  wrote:
>
> Hello - Does ceph rbd support multi-attach volumes (with the ceph luminous
> version)?

Yes, you just need to ensure the exclusive-lock and dependent features
are disabled on the image. When creating a new image, you can use the
"--image-shared" optional handle this for you.

> Thanks
> Swami
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Missing Ubuntu Packages on Luminous

2019-07-08 Thread Stolte, Felix
Hi folks,

I want to use the community repository http://download.ceph.com/debian-luminous 
for my luminous cluster instead of the packages provided by ubuntu itself. But 
apparently only the ceph-deploy package is available for bionic (Ubuntu 18.04). 
All packages exist for trusty though. Is this intended behavior?

Regards Felix
IT-Services
Telefon 02461 61-9243
E-Mail: f.sto...@fz-juelich.de
-
-
Forschungszentrum Juelich GmbH
52425 Juelich
Sitz der Gesellschaft: Juelich
Eingetragen im Handelsregister des Amtsgerichts Dueren Nr. HR B 3498
Vorsitzender des Aufsichtsrats: MinDir Dr. Karl Eugen Huthmacher
Geschaeftsfuehrung: Prof. Dr.-Ing. Wolfgang Marquardt (Vorsitzender),
Karsten Beneke (stellv. Vorsitzender), Prof. Dr.-Ing. Harald Bolt,
Prof. Dr. Sebastian M. Schmidt
-
-
 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Erasure Coding performance for IO < stripe_width

2019-07-08 Thread Dan van der Ster
On Mon, Jul 8, 2019 at 1:02 PM Lars Marowsky-Bree  wrote:
>
> On 2019-07-08T12:25:30, Dan van der Ster  wrote:
>
> > Is there a specific bench result you're concerned about?
>
> We're seeing ~5800 IOPS, ~23 MiB/s on 4 KiB IO (stripe_width 8192) on a
> pool that could do 3 GiB/s with 4M blocksize. So, yeah, well, that is
> rather harsh, even for EC.

How does that pool manage with the same client pattern but 3x replication?

The difference between 4kB and 4MB writes could be due to many things.

-- dan


>
> > I would think that small write perf could be kept reasonable thanks to
> > bluestore's deferred writes.
>
> I believe we're being hit by the EC read-modify-write cycle on
> overwrites.
>
> > FWIW, our bench results (all flash cluster) didn't show a massive
> > performance difference between 3 replica and 4+2 EC.
>
> I'm guessing that this was not 4 KiB but a more reasonable blocksize
> that was a multiple of stripe_width?
>
>
> Regards,
> Lars
>
> --
> SUSE Linux GmbH, GF: Felix Imendörffer, Mary Higgins, Sri Rasiah, HRB 21284 
> (AG Nürnberg)
> "Experience is the name everyone gives to their mistakes." -- Oscar Wilde
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Erasure Coding performance for IO < stripe_width

2019-07-08 Thread Maged Mokhtar



On 08/07/2019 13:02, Lars Marowsky-Bree wrote:

On 2019-07-08T12:25:30, Dan van der Ster  wrote:


Is there a specific bench result you're concerned about?

We're seeing ~5800 IOPS, ~23 MiB/s on 4 KiB IO (stripe_width 8192) on a
pool that could do 3 GiB/s with 4M blocksize. So, yeah, well, that is
rather harsh, even for EC.


I would think that small write perf could be kept reasonable thanks to
bluestore's deferred writes.

I believe we're being hit by the EC read-modify-write cycle on
overwrites.


FWIW, our bench results (all flash cluster) didn't show a massive
performance difference between 3 replica and 4+2 EC.

I'm guessing that this was not 4 KiB but a more reasonable blocksize
that was a multiple of stripe_width?


Regards,
 Lars


Hi Lars,

Maybe not related, but we find with rbd, random 4k write iops start very 
low at first for a new image and then increase over time as we write. If 
we thick provision the image, it does not show this. This happens on 
random small block and not sequential or large. Probably related to 
initial object/chunk creation.


Also we use the default stripe width, maybe you try a pool with default 
width and see if it is a factor.



/Maged

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What's the best practice for Erasure Coding

2019-07-08 Thread Lei Liu
Hi Frank,

Thanks for sharing valuable experience.

Frank Schilder  wrote on Monday, July 8, 2019 at 4:36 PM:

> Hi David,
>
> I'm running a cluster with bluestore on raw devices (no lvm) and all
> journals collocated on the same disk with the data. Disks are spinning
> NL-SAS. Our goal was to build storage at lowest cost, therefore all data on
> HDD only. I got a few SSDs that I'm using for FS and RBD meta data. All
> large pools are EC on spinning disk.
>
> I spent at least one month to run detailed benchmarks (rbd bench)
> depending on EC profile, object size, write size, etc. Results were varying
> a lot. My advice would be to run benchmarks with your hardware. If there
> was a single perfect choice, there wouldn't be so many options. For
> example, my tests will not be valid when using separate fast disks for WAL
> and DB.
>
> There are some results though that might be valid in general:
>
> 1) EC pools have high throughput but low IOP/s compared with replicated
> pools
>
> I see single-thread write speeds of up to 1.2GB (gigabyte) per second,
> which is probably the network limit and not the disk limit. IOP/s get
> better with more disks, but are way lower than what replicated pools can
> provide. On a cephfs with EC data pool, small-file IO will be comparably
> slow and eat a lot of resources.
>
> 2) I observe massive network traffic amplification on small IO sizes,
> which is due to the way EC overwrites are handled. This is one bottleneck
> for IOP/s. We have 10G infrastructure and use 2x10G client and 4x10G OSD
> network. OSD bandwidth at least 2x client network, better 4x or more.
>
> 3) k should only have small prime factors, power of 2 if possible
>
> I tested k=5,6,8,10,12. Best results in decreasing order: k=8, k=6. All
> other choices were poor. The value of m seems not relevant for performance.
> Larger k will require more failure domains (more hardware).
>
> 4) object size matters
>
> The best throughput (1M write size) I see with object sizes of 4MB or 8MB,
> with IOP/s getting somewhat better with smaller object sizes but throughput
> dropping fast. I use the default of 4MB in production. Works well for us.
>
> 5) jerasure is quite good and seems most flexible
>
> jerasure is quite CPU efficient and can handle smaller chunk sizes than
> other plugins, which is preferable for IOP/s. However, CPU usage can
> become a problem and a plugin optimized for specific values of k and m
> might help here. Under usual circumstances I see very low load on all OSD
> hosts, even under rebalancing. However, I remember that once I needed to
> rebuild something on all OSDs (I don't remember what it was, sorry). In
> this situation, CPU load went up to 30-50% (meaning up to half the cores
> were at 100%), which is really high considering that each server has only
> 16 disks at the moment and is sized to handle up to 100. CPU power could
> become a bottleneck for us in the future.
>
> These are some general observations and do not replace benchmarks for
> specific use cases. I was hunting for a specific performance pattern, which
> might not be what you want to optimize for. I would recommend to run
> extensive benchmarks if you have to live with a configuration for a long
> time - EC profiles cannot be changed.
>
> We settled on 8+2 and 6+2 pools with jerasure and object size 4M. We also
> use bluestore compression. All meta data pools are on SSD, only very little
> SSD space is required. This choice works well for the majority of our use
> cases. We can still build small expensive pools to accommodate special
> performance requests.
>
> Best regards,
>
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: ceph-users  on behalf of David <
> xiaomajia...@gmail.com>
> Sent: 07 July 2019 20:01:18
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users]  What's the best practice for Erasure Coding
>
> Hi Ceph-Users,
>
> I'm working with a  Ceph cluster (about 50TB, 28 OSDs, all Bluestore on
> lvm).
> Recently, I'm trying to use the Erasure Code pool.
> My question is "what's the best practice for using EC pools ?".
> More specifically, which plugin (jerasure, isa, lrc, shec or  clay) should
> I adopt, and how to choose the combinations of (k,m) (e.g. (k=3,m=2),
> (k=6,m=3) ).
>
> Does anyone share some experience?
>
> Thanks for any help.
>
> Regards,
> David
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] rbd - volume multi attach support

2019-07-08 Thread M Ranga Swami Reddy
Hello - Does ceph rbd support multi-attach volumes (with the ceph luminous
version)?

Thanks
Swami
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph -s shows that usage is 385GB after I delete my pools

2019-07-08 Thread 刘亮

HI:
Ceph -s shows that usage is 385GB after I delete my pools. Do you know
why? Can anyone help me?

Thank you!

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Erasure Coding performance for IO < stripe_width

2019-07-08 Thread Lars Marowsky-Bree
On 2019-07-08T12:25:30, Dan van der Ster  wrote:

> Is there a specific bench result you're concerned about?

We're seeing ~5800 IOPS, ~23 MiB/s on 4 KiB IO (stripe_width 8192) on a
pool that could do 3 GiB/s with 4M blocksize. So, yeah, well, that is
rather harsh, even for EC.

> I would think that small write perf could be kept reasonable thanks to
> bluestore's deferred writes.

I believe we're being hit by the EC read-modify-write cycle on
overwrites.

> FWIW, our bench results (all flash cluster) didn't show a massive
> performance difference between 3 replica and 4+2 EC.

I'm guessing that this was not 4 KiB but a more reasonable blocksize
that was a multiple of stripe_width?


Regards,
Lars

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Mary Higgins, Sri Rasiah, HRB 21284 (AG 
Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Erasure Coding performance for IO < stripe_width

2019-07-08 Thread Dan van der Ster
Hi Lars,

Is there a specific bench result you're concerned about?
I would think that small write perf could be kept reasonable thanks to
bluestore's deferred writes.
FWIW, our bench results (all flash cluster) didn't show a massive
performance difference between 3 replica and 4+2 EC.

I agree about not needing to read the parity during a write though.
Hopefully that's just a typo? (Or maybe there's a fast way to update
EC chunks without communicating across OSDs?)

-- dan



On Mon, Jul 8, 2019 at 10:47 AM Lars Marowsky-Bree  wrote:
>
> Morning all,
>
> since Luminous/Mimic, Ceph supports allow_ec_overwrites. However, this has a
> performance impact that looks even worse than what I'd expect from a
> Read-Modify-Write cycle.
>
> https://ceph.com/community/new-luminous-erasure-coding-rbd-cephfs/ also
> mentions that the small writes would read the previous value from all
> k+m OSDs; shouldn't the k stripes be sufficient (assuming we're not
> currently degraded)?
>
> Is there any suggestion on how to make this go faster, or suggestions on
> which solution one could implement going forward?
>
>
> Regards,
> Lars
>
> --
> SUSE Linux GmbH, GF: Felix Imendörffer, Mary Higgins, Sri Rasiah, HRB 21284 
> (AG Nürnberg)
> "Architects should open possibilities and not determine everything." (Ueli 
> Zbinden)
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ubuntu 19.04

2019-07-08 Thread James Page
On Sun, Jul 7, 2019 at 10:30 PM Kai Stian Olstad 
wrote:

> On 06.07.2019 16:43, Ashley Merrick wrote:
> > Looking at the possibility of upgrading my personal storage cluster from
> > Ubuntu 18.04 -> 19.04 to benefit from a newer version of the Kernel e.t.c
>
> For a newer kernel install HWE[1], at the moment you will get the 18.10
> kernel, but in August it will get the 19.04 kernel.
>
> [1] https://wiki.ubuntu.com/Kernel/LTSEnablementStack


If you require newer kernel versions then this is definitely the Ubuntu
recommended and supported approach.

We do ship Ceph as part of Ubuntu (Mimic was included with 19.04) however
interim Ubuntu releases only get 9 months of updates vs the 5 years that
Ubuntu 18.04 LTS receives.
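
For completeness, a sketch of enabling the HWE stack on 18.04 (per the wiki
page linked above):

sudo apt install --install-recommends linux-generic-hwe-18.04
sudo reboot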
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Erasure Coding performance for IO < stripe_width

2019-07-08 Thread Lars Marowsky-Bree
Morning all,

since Luminous/Mimic, Ceph supports allow_ec_overwrites. However, this has a
performance impact that looks even worse than what I'd expect from a
Read-Modify-Write cycle.

https://ceph.com/community/new-luminous-erasure-coding-rbd-cephfs/ also
mentions that the small writes would read the previous value from all
k+m OSDs; shouldn't the k stripes be sufficient (assuming we're not
currently degraded)?

Is there any suggestion on how to make this go faster, or suggestions on
which solution one could implement going forward?


Regards,
Lars

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Mary Higgins, Sri Rasiah, HRB 21284 (AG 
Nürnberg)
"Architects should open possibilities and not determine everything." (Ueli 
Zbinden)

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What's the best practice for Erasure Coding

2019-07-08 Thread Frank Schilder
Hi David,

I'm running a cluster with bluestore on raw devices (no lvm) and all journals 
collocated on the same disk with the data. Disks are spinning NL-SAS. Our goal 
was to build storage at lowest cost, therefore all data on HDD only. I got a 
few SSDs that I'm using for FS and RBD meta data. All large pools are EC on 
spinning disk.

I spent at least one month to run detailed benchmarks (rbd bench) depending on 
EC profile, object size, write size, etc. Results were varying a lot. My advice 
would be to run benchmarks with your hardware. If there was a single perfect 
choice, there wouldn't be so many options. For example, my tests will not be 
valid when using separate fast disks for WAL and DB.

There are some results though that might be valid in general:

1) EC pools have high throughput but low IOP/s compared with replicated pools

I see single-thread write speeds of up to 1.2GB (gigabyte) per second, which is 
probably the network limit and not the disk limit. IOP/s get better with more 
disks, but are way lower than what replicated pools can provide. On a cephfs 
with EC data pool, small-file IO will be comparably slow and eat a lot of 
resources.

2) I observe massive network traffic amplification on small IO sizes, which is 
due to the way EC overwrites are handled. This is one bottleneck for IOP/s. We 
have 10G infrastructure and use 2x10G client and 4x10G OSD network. OSD 
bandwidth at least 2x client network, better 4x or more.
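
As a rough back-of-envelope illustration (assuming a full-stripe
read-modify-write, the worst case): a 4 KiB client write to an 8+2 pool can
mean reading up to 8 old chunks and writing 10 new ones, so a few KiB of user
data turns into tens of KiB of traffic between OSDs.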

3) k should only have small prime factors, power of 2 if possible

I tested k=5,6,8,10,12. Best results in decreasing order: k=8, k=6. All other 
choices were poor. The value of m seems not relevant for performance. Larger k 
will require more failure domains (more hardware).
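
A rough illustration of why (assuming the default 4 MB objects): with k=8 each
data chunk is 4 MiB / 8 = 512 KiB, a clean power of two that lines up with 4
KiB allocation units; with k=5 a chunk would be 4 MiB / 5 = 838860.8 bytes, so
chunks get padded to an aligned size and stripes no longer divide objects
evenly.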

4) object size matters

The best throughput (1M write size) I see with object sizes of 4MB or 8MB, with 
IOP/s getting somewhat better with smaller object sizes but throughput dropping 
fast. I use the default of 4MB in production. Works well for us.

5) jerasure is quite good and seems most flexible

jerasure is quite CPU efficient and can handle smaller chunk sizes than other 
plugins, which is preferable for IOP/s. However, CPU usage can become a 
problem and a plugin optimized for specific values of k and m might help here. 
Under usual circumstances I see very low load on all OSD hosts, even under 
rebalancing. However, I remember that once I needed to rebuild something on all 
OSDs (I don't remember what it was, sorry). In this situation, CPU load went up 
to 30-50% (meaning up to half the cores were at 100%), which is really high 
considering that each server has only 16 disks at the moment and is sized to 
handle up to 100. CPU power could become a bottleneck for us in the future.

These are some general observations and do not replace benchmarks for specific 
use cases. I was hunting for a specific performance pattern, which might not be 
what you want to optimize for. I would recommend to run extensive benchmarks if 
you have to live with a configuration for a long time - EC profiles cannot be 
changed.

We settled on 8+2 and 6+2 pools with jerasure and object size 4M. We also use 
bluestore compression. All meta data pools are on SSD, only very little SSD 
space is required. This choice works well for the majority of our use cases. We 
can still build small expensive pools to accommodate special performance 
requests.
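
For concreteness, a sketch of how such a pool is typically created (profile and
pool names and the PG count are hypothetical):

ceph osd erasure-code-profile set ec82 k=8 m=2 plugin=jerasure crush-failure-domain=host
ceph osd pool create ec-data 256 256 erasure ec82
ceph osd pool set ec-data allow_ec_overwrites true
ceph osd pool set ec-data compression_mode aggressive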

Best regards,

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: ceph-users  on behalf of David 

Sent: 07 July 2019 20:01:18
To: ceph-users@lists.ceph.com
Subject: [ceph-users]  What's the best practice for Erasure Coding

Hi Ceph-Users,

I'm working with a  Ceph cluster (about 50TB, 28 OSDs, all Bluestore on lvm).
Recently, I'm trying to use the Erasure Code pool.
My question is "what's the best practice for using EC pools ?".
More specifically, which plugin (jerasure, isa, lrc, shec or  clay) should I 
adopt, and how to choose the combinations of (k,m) (e.g. (k=3,m=2), (k=6,m=3) ).

Does anyone share some experience?

Thanks for any help.

Regards,
David

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What's the best practice for Erasure Coding

2019-07-08 Thread Jake Grimmett
Hi David,

How many nodes in your cluster? k+m has to be smaller than your node count, 
preferably by at least two. 

How important is your data? I.e., do you have a remote mirror or backup? If not,
you may want m=3.

We use 8+2 on one cluster, and 6+2 on another.

Best,

Jake


On 7 July 2019 19:01:18 BST, David  wrote:
>Hi Ceph-Users,
>
> 
>
>I'm working with a  Ceph cluster (about 50TB, 28 OSDs, all Bluestore on
>lvm).
>
>Recently, I'm trying to use the Erasure Code pool.
>
>My question is "what's the best practice for using EC pools ?".
>
>More specifically, which plugin (jerasure, isa, lrc, shec or  clay)
>should I adopt, and how to choose the combinations of (k,m) (e.g.
>(k=3,m=2), (k=6,m=3) ).
>
> 
>
>Does anyone share some experience?
>
> 
>
>Thanks for any help.
>
> 
>
>Regards,
>
>David
>
> 

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com