[ceph-users] pre-split causing slow requests when rebuilding an OSD?

2018-11-26 Thread hnuzhoulin2






Hi guys, I have a 42-node cluster, and I created the pool using expected_num_objects to pre-split the filestore dirs. Today I rebuilt an OSD because of a disk error, and it caused a lot of slow requests. The filestore log looks like this:

2018-11-26 16:49:41.003336 7f2dad075700 10 filestore(/home/ceph/var/lib/osd/ceph-4) create_collection /home/ceph/var/lib/osd/ceph-4/current/388.433_head = 0
2018-11-26 16:49:41.003479 7f2dad075700 10 filestore(/home/ceph/var/lib/osd/ceph-4) create_collection /home/ceph/var/lib/osd/ceph-4/current/388.433_TEMP = 0
2018-11-26 16:49:41.003570 7f2dad075700 10 filestore(/home/ceph/var/lib/osd/ceph-4) _set_replay_guard 33.0.0
2018-11-26 16:49:41.003591 7f2dad876700  5 filestore(/home/ceph/var/lib/osd/ceph-4) _journaled_ahead 0x55e054382300 seq 81 osr(388.2bd 0x55e053ed9280) [Transaction(0x55e06d304680)]
2018-11-26 16:49:41.003603 7f2dad876700  5 filestore(/home/ceph/var/lib/osd/ceph-4) queue_op 0x55e054382300 seq 81 osr(388.2bd 0x55e053ed9280) 1079089 bytes   (queue has 50 ops and 15513428 bytes)
2018-11-26 16:49:41.003608 7f2dad876700 10 filestore(/home/ceph/var/lib/osd/ceph-4)  queueing ondisk 0x55e06cc83f80
2018-11-26 16:49:41.024714 7f2d9d055700  5 filestore(/home/ceph/var/lib/osd/ceph-4) queue_transactions existing 0x55e053a5d1e0 osr(388.f2a 0x55e053ed92e0)
2018-11-26 16:49:41.166512 7f2dac874700 10 filestore oid: #388:c940head# not skipping op, *spos 32.0.1
2018-11-26 16:49:41.166522 7f2dac874700 10 filestore  > header.spos 0.0.0
2018-11-26 16:49:41.170670 7f2dac874700 10 filestore oid: #388:c940head# not skipping op, *spos 32.0.2
2018-11-26 16:49:41.170680 7f2dac874700 10 filestore  > header.spos 0.0.0
2018-11-26 16:49:41.183259 7f2dac874700 10 filestore(/home/ceph/var/lib/osd/ceph-4) _do_op 0x55e05ddb3480 seq 32 r = 0, finisher 0x55e051d122e0 0
2018-11-26 16:49:41.187211 7f2dac874700 10 filestore(/home/ceph/var/lib/osd/ceph-4) _finish_op 0x55e05ddb3480 seq 32 osr(388.293 0x55e053ed84b0)/0x55e053ed84b0 lat 47.804533
2018-11-26 16:49:41.187232 7f2dac874700  5 filestore(/home/ceph/var/lib/osd/ceph-4) _do_op 0x55e052113e60 seq 34 osr(388.2d94 0x55e053ed91c0)/0x55e053ed91c0 start
2018-11-26 16:49:41.187236 7f2dac874700 10 filestore(/home/ceph/var/lib/osd/ceph-4) _do_transaction on 0x55e05e022140
2018-11-26 16:49:41.187239 7f2da4864700  5 filestore(/home/ceph/var/lib/osd/ceph-4) queue_transactions (writeahead) 82 [Transaction(0x55e0559e6d80)]

It looks like it is very slow when creating PG dirs such as /home/ceph/var/lib/osd/ceph-4/current/388.433. At the start of the service, while the OSD's state is not yet up, it works well: no slow requests, and the PG dirs are being created. But once the OSD state is up, slow requests appear while the PG dirs are still being created. When I remove the setting "filestore merge threshold = -10" from ceph.conf, the rebuild works well and the PG dirs are created very fast, but then I see dir splits in the log:

2018-11-26 19:16:56.406276 7f768b189700  1 _created [8,F,8] has 593 objects, starting split.
2018-11-26 19:16:56.977392 7f768b189700  1 _created [8,F,8] split completed.
2018-11-26 19:16:57.032567 7f768b189700  1 _created [8,F,8,6] has 594 objects, starting split.
2018-11-26 19:16:57.814694 7f768b189700  1 _created [8,F,8,6] split completed.




So, how can I make all the PG dirs get created before the OSD state becomes up? Or is there another solution? Thanks.
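For context, pre-splitting is driven by the expected_num_objects argument given at pool creation time together with the filestore split/merge settings; a rough sketch of how these are usually combined (pool name, PG counts, rule name and object count below are placeholders, not values from this cluster):

    # ceph.conf on the OSDs: a negative merge threshold disables directory
    # merging, so pre-split directories are kept
    [osd]
    filestore merge threshold = -10
    filestore split multiple = 2

    # create the pool with its collections pre-split for the expected object count
    ceph osd pool create mypool 4096 4096 replicated replicated_rule 1000000000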











___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Poor ceph cluster performance

2018-11-26 Thread Stefan Kooman
Quoting Cody (codeology@gmail.com):
> The Ceph OSD part of the cluster uses 3 identical servers with the
> following specifications:
> 
> CPU: 2 x E5-2603 @1.8GHz
> RAM: 16GB
> Network: 1G port shared for Ceph public and cluster traffics

This will hamper throughput a lot. 

> Journaling device: 1 x 120GB SSD (SATA3, consumer grade)
> OSD device: 2 x 2TB 7200rpm spindle (SATA3, consumer grade)

OK, let's stop here first: Consumer grade SSD. Percona did a nice
writeup about "fsync" speed on consumer grade SSDs [1]. As I don't know
what drives you use this might or might not be the issue.
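If it helps, a rough way to check a drive's sync-write behaviour (the pattern a filestore journal produces) is a single-threaded O_DSYNC fio run; the device name below is only an example and the test overwrites data on it:

    fio --name=journal-test --filename=/dev/sdX --direct=1 --sync=1 \
        --rw=write --bs=4k --iodepth=1 --numjobs=1 --runtime=60 --time_based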

> 
> This is not beefy enough in any way, but I am running for PoC only,
> with minimum utilization.
> 
> Ceph-mon and ceph-mgr daemons are hosted on the OpenStack Controller
> nodes. Ceph-ansible version is 3.1 and is using Filestore with
> non-colocated scenario (1 SSD for every 2 OSDs). Connection speed
> among Controllers, Computes, and OSD nodes can reach ~900Mbps tested
> using iperf.

Why filestore if I may ask? I guess bluestore with bluestore journal on
SSD and data on SATA should give you better performance. If the SSDs are
suitable for the job at all.

What version of Ceph are you using? Metrics can give you a lot of
insight. Did you take a look at those? For example, the Ceph mgr dashboard?

> 
> I followed the Red Hat Ceph 3 benchmarking procedure [1] and received
> following results:
> 
> Write Test:
> 
> Total time run: 80.313004
> Total writes made:  17
> Write size: 4194304
> Object size:4194304
> Bandwidth (MB/sec): 0.846687
> Stddev Bandwidth:   0.320051
> Max bandwidth (MB/sec): 2
> Min bandwidth (MB/sec): 0
> Average IOPS:   0
> Stddev IOPS:0
> Max IOPS:   0
> Min IOPS:   0
> Average Latency(s): 66.6582
> Stddev Latency(s):  15.5529
> Max latency(s): 80.3122
> Min latency(s): 29.7059
> 
> Sequencial Read Test:
> 
> Total time run:   25.951049
> Total reads made: 17
> Read size:4194304
> Object size:  4194304
> Bandwidth (MB/sec):   2.62032
> Average IOPS: 0
> Stddev IOPS:  0
> Max IOPS: 1
> Min IOPS: 0
> Average Latency(s):   24.4129
> Max latency(s):   25.9492
> Min latency(s):   0.117732
> 
> Random Read Test:
> 
> Total time run:   66.355433
> Total reads made: 46
> Read size:4194304
> Object size:  4194304
> Bandwidth (MB/sec):   2.77295
> Average IOPS: 0
> Stddev IOPS:  3
> Max IOPS: 27
> Min IOPS: 0
> Average Latency(s):   21.4531
> Max latency(s):   66.1885
> Min latency(s):   0.0395266
> 
> Apparently, the results are pathetic...
> 
> As I moved on to test block devices, I got a following error message:
> 
> # rbd map image01 --pool testbench --name client.admin
> rbd: failed to add secret 'client.admin' to kernel

What replication factor are you using?

Make sure you have the client.admin keyring on the node from which you are
issuing this command. If the keyring is present where Ceph expects it to
be, then you can omit the --name client.admin. On a monitor node you can
extract the admin keyring: ceph auth export client.admin. Put the output
of that in /etc/ceph/ceph.client.admin.keyring and this should work.
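Roughly like this (paths as Ceph expects them by default):

    # on a monitor node
    ceph auth export client.admin -o ceph.client.admin.keyring
    # copy that file to /etc/ceph/ceph.client.admin.keyring on the client,
    # after which the --name option can be dropped:
    rbd map image01 --pool testbench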

> Any suggestions on the above error and/or debugging would be greatly
> appreciated!

Gr. Stefan

[1]:
https://www.percona.com/blog/2018/07/18/why-consumer-ssd-reviews-are-useless-for-database-performance-use-case/
> https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html-single/administration_guide/#benchmarking_performance
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

-- 
| BIT BV  http://www.bit.nl/   Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Move Instance between Different Ceph and Openstack Installation

2018-11-26 Thread Danni Setiawan

Hi all,

I need to move an instance from one OpenStack-with-Ceph installation to a
different OpenStack-with-Ceph installation. The instance boots from a volume
and has another volume attached for data: a 200GB boot volume and a 1TB data
volume.


From what I know, I need to download the boot volume from RBD as a raw image,
transfer it to the new OpenStack installation, re-upload it to Glance, and
create a new instance from that image. The problem is that uploading to
Glance is quite time consuming because of the size of the image.
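A sketch of that workflow with the rbd and openstack CLIs (the pool name and volume names are assumptions, not taken from the actual environment):

    # on the source cloud: export the boot volume to a raw file
    rbd export volumes/volume-<boot-volume-id> instance-boot.raw

    # on the destination cloud: upload it to Glance and boot a new instance from it
    openstack image create --disk-format raw --container-format bare \
        --file instance-boot.raw migrated-instance-boot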


Does anyone know a more efficient way of moving instances between
different OpenStack-with-Ceph installations?


Thanks,
Danni

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Journal drive recommendation

2018-11-26 Thread Amit Ghadge
On Tue, 27 Nov 2018, 10:55 Martin Verges,  wrote:

> Hello,
>
> what type of SSD data drives do you plan to use?
>
We plan to use an external SSD data drive.

> In general, I would not recommend to use external journal on ssd OSDs, but
> it is possible to squeeze out a bit more performance depending on your data
> disks.
>

> --
> Martin Verges
> Managing director
>
> Mobile: +49 174 9335695
> E-Mail: martin.ver...@croit.io
> Chat: https://t.me/MartinVerges
>
> croit GmbH, Freseniusstr. 31h, 81247 Munich
> CEO: Martin Verges - VAT-ID: DE310638492
> Com. register: Amtsgericht Munich HRB 231263
>
> Web: https://croit.io
> YouTube: https://goo.gl/PGE1Bx
>
>
> On Tue, 27 Nov 2018 at 02:50, Amit Ghadge 
> wrote:
>
>> Hi all,
>>
>> We have planning to use SSD data drive, so for journal drive, is there
>> any recommendation to use same drive or separate drive?
>>
>> Thanks,
>> Amit
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Journal drive recommendation

2018-11-26 Thread Martin Verges
Hello,

what type of SSD data drives do you plan to use?

In general, I would not recommend using an external journal with SSD OSDs, but
it is possible to squeeze out a bit more performance depending on your data
disks.

--
Martin Verges
Managing director

Mobile: +49 174 9335695
E-Mail: martin.ver...@croit.io
Chat: https://t.me/MartinVerges

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263

Web: https://croit.io
YouTube: https://goo.gl/PGE1Bx


On Tue, 27 Nov 2018 at 02:50, Amit Ghadge 
wrote:

> Hi all,
>
> We have planning to use SSD data drive, so for journal drive, is there any
> recommendation to use same drive or separate drive?
>
> Thanks,
> Amit
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Journal drive recommendation

2018-11-26 Thread Amit Ghadge
Hi all,

We are planning to use SSD data drives; for the journal, is there any
recommendation to use the same drive or a separate drive?

Thanks,
Amit
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Poor ceph cluster performance

2018-11-26 Thread Cody
Hello,

I have a Ceph cluster deployed together with OpenStack using TripleO.
While the Ceph cluster shows a healthy status, its performance is
painfully slow. After eliminating a possibility of network issues, I
have zeroed in on the Ceph cluster itself, but have no experience in
further debugging and tunning.

The Ceph OSD part of the cluster uses 3 identical servers with the
following specifications:

CPU: 2 x E5-2603 @1.8GHz
RAM: 16GB
Network: 1G port shared for Ceph public and cluster traffics
Journaling device: 1 x 120GB SSD (SATA3, consumer grade)
OSD device: 2 x 2TB 7200rpm spindle (SATA3, consumer grade)

This is not beefy enough in any way, but I am running it for PoC only,
with minimal utilization.

Ceph-mon and ceph-mgr daemons are hosted on the OpenStack Controller
nodes. Ceph-ansible version is 3.1 and is using Filestore with
non-colocated scenario (1 SSD for every 2 OSDs). Connection speed
among Controllers, Computes, and OSD nodes can reach ~900Mbps tested
using iperf.

I followed the Red Hat Ceph 3 benchmarking procedure [1] and received the
following results:

Write Test:

Total time run: 80.313004
Total writes made:  17
Write size: 4194304
Object size:4194304
Bandwidth (MB/sec): 0.846687
Stddev Bandwidth:   0.320051
Max bandwidth (MB/sec): 2
Min bandwidth (MB/sec): 0
Average IOPS:   0
Stddev IOPS:0
Max IOPS:   0
Min IOPS:   0
Average Latency(s): 66.6582
Stddev Latency(s):  15.5529
Max latency(s): 80.3122
Min latency(s): 29.7059

Sequential Read Test:

Total time run:   25.951049
Total reads made: 17
Read size:4194304
Object size:  4194304
Bandwidth (MB/sec):   2.62032
Average IOPS: 0
Stddev IOPS:  0
Max IOPS: 1
Min IOPS: 0
Average Latency(s):   24.4129
Max latency(s):   25.9492
Min latency(s):   0.117732

Random Read Test:

Total time run:   66.355433
Total reads made: 46
Read size:4194304
Object size:  4194304
Bandwidth (MB/sec):   2.77295
Average IOPS: 0
Stddev IOPS:  3
Max IOPS: 27
Min IOPS: 0
Average Latency(s):   21.4531
Max latency(s):   66.1885
Min latency(s):   0.0395266

Apparently, the results are pathetic...

As I moved on to test block devices, I got the following error message:

# rbd map image01 --pool testbench --name client.admin
rbd: failed to add secret 'client.admin' to kernel

Any suggestions on the above error and/or debugging would be greatly
appreciated!

Thank you very much to all.

Cody

[1] 
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html-single/administration_guide/#benchmarking_performance
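For reference, the three result blocks above correspond to rados bench runs along these lines (pool name taken from the rbd command above; durations are assumptions):

    rados bench -p testbench 60 write --no-cleanup
    rados bench -p testbench 60 seq
    rados bench -p testbench 60 rand
    rados -p testbench cleanup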
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bug: Deleting images ending with whitespace in name via dashboard

2018-11-26 Thread Lenz Grimmer
Hi Alexander,

On 11/13/18 12:37 PM, Kasper, Alexander wrote:

> As I am not sure how to correctly use tracker.ceph.com, I'll post my
> report here:
> 
> Using the dashboard to delete an RBD image via the GUI throws an error when
> the image name ends with a whitespace character (a user input error led to
> this situation).
> 
> Also, editing this image via the dashboard throws an error.
> 
> Deleting via the CLI with the pool/image name quoted in " " was successful.
> 
> Should the input be filtered?

Thank you for reporting this and sorry for the late reply. It looks as
if you figured out how to submit this via the tracker:
https://tracker.ceph.com/issues/37084

I left some comments there, your feedback would be welcome. Thank you!

Lenz

-- 
SUSE Linux GmbH - Maxfeldstr. 5 - 90409 Nuernberg (Germany)
GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Should ceph build against libcurl4 for Ubuntu 18.04 and later?

2018-11-26 Thread Ken Dreyer
On Thu, Nov 22, 2018 at 11:47 AM Matthew Vernon  wrote:
>
> On 22/11/2018 13:40, Paul Emmerich wrote:
> > We've encountered the same problem on Debian Buster
>
> It looks to me like this could be fixed simply by building the Bionic
> packages in a Bionic chroot (ditto Buster); maybe that could be done in
> future? Given I think the packaging process is being reviewed anyway at
> the moment (hopefully 12.2.10 will be along at some point...)

That's how we're building it currently. We build ceph in pbuilder
chroots that correspond to each distro.

On master, debian/control has Build-Depends: libcurl4-openssl-dev so
I'm not sure why we'd end up with a dependency on libcurl3.

Would you please give me a minimal set of `apt-get` reproduction steps
on Bionic for this issue? Then we can get it into tracker.ceph.com.

- Ken
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What could cause mon_osd_full_ratio to be exceeded?

2018-11-26 Thread Gregory Farnum
On Mon, Nov 26, 2018 at 10:28 AM Vladimir Brik
 wrote:
>
> Hello
>
> I am doing some Ceph testing on a near-full cluster, and I noticed that,
> after I brought down a node, some OSDs' utilization reached
> osd_failsafe_full_ratio (97%). Why didn't it stop at mon_osd_full_ratio
> (90%) if mon_osd_backfillfull_ratio is 90%?

While I believe the very newest Ceph source will do this, it can be
surprisingly difficult to identify the exact size a PG will take up on
disk (thanks to omap/RocksDB data), and so for a long time we pretty
much didn't try — these ratios were checked when starting a backfill,
but we didn't try to predict where they would end up and limit
ourselves based on that.
-Greg
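For reference, the ratios in question can be inspected and adjusted at runtime (the values shown are the usual defaults, given only as an example):

    ceph osd dump | grep -i ratio
    ceph osd set-nearfull-ratio 0.85
    ceph osd set-backfillfull-ratio 0.90
    ceph osd set-full-ratio 0.95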

>
>
> Thanks,
>
> Vlad
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] read performance, separate client CRUSH maps or limit osd read access from each client

2018-11-26 Thread Vlad Kopylov
I see. Thank you Greg.

Ultimately leading to some kind of multi-primary OSD/MON setup, which
will most likely add lookup overheads. Though might be a reasonable
trade off for network distributed setups.
Good feature for major version.

With GlusterFS I solved it, funny as it sounds, by writing a tiny FUSE
fs as an overlay, doing all reads locally and sending writes to the cluster.
That works with GlusterFS because there are real files on each node for
local reads.

I wish there were a way to access the local OSD's data as files so I could
use the same approach.

-vlad
On Mon, Nov 26, 2018 at 8:47 AM Gregory Farnum  wrote:
>
> On Tue, Nov 20, 2018 at 9:50 PM Vlad Kopylov  wrote:
>>
>> I see the point, but not for the read case:
>>   no overhead for just choosing or let Mount option choose read replica.
>>
>> This is simple feature that can be implemented, that will save many
>> people bandwidth in really distributed cases.
>
>
> This is actually much more complicated than it sounds. Allowing reads from 
> the replica OSDs while still routing writes through a different primary OSD 
> introduces a great many consistency issues. We've tried adding very limited 
> support for this read-from-replica scenario in special cases, but have had to 
> roll them all back due to edge cases where they don't work.
>
> I understand why you want it, but it's definitely not a simple feature. :(
> -Greg
>
>>
>>
>> Main issue this surfaces is that RADOS maps ignore clients - they just
>> see cluster. There should be the part of RADOS map unique or possibly
>> unique for each client connection.
>>
>> Lets file feature request?
>>
>> p.s. honestly, I don't see why anyone would use ceph for local network
>> RAID setups, there are other simple solutions out there even in your
>> own RedHat shop.
>> On Tue, Nov 20, 2018 at 8:38 PM Patrick Donnelly  wrote:
>> >
>> > You either need to accept that reads/writes will land on different data 
>> > centers, primary OSD for a given pool is always in the desired data 
>> > center, or some other non-Ceph solution which will have either expensive, 
>> > eventual, or false consistency.
>> >
>> > On Fri, Nov 16, 2018, 10:07 AM Vlad Kopylov > >>
>> >> This is what Jean suggested. I understand it and it works with primary.
>> >> But what I need is for all clients to access same files, not separate 
>> >> sets (like red blue green)
>> >>
>> >> Thanks Konstantin.
>> >>
>> >> On Fri, Nov 16, 2018 at 3:43 AM Konstantin Shalygin  
>> >> wrote:
>> >>>
>> >>> On 11/16/18 11:57 AM, Vlad Kopylov wrote:
>> >>> > Exactly. But write operations should go to all nodes.
>> >>>
>> >>> This can be set via primary affinity [1], when a ceph client reads or
>> >>> writes data, it always contacts the primary OSD in the acting set.
>> >>>
>> >>>
>> >>> If u want to totally segregate IO, you can use device classes:
>> >>>
>> >>> Just create osds with different classes:
>> >>>
>> >>> dc1
>> >>>
>> >>>host1
>> >>>
>> >>>  red osd.0 primary
>> >>>
>> >>>  blue osd.1
>> >>>
>> >>>  green osd.2
>> >>>
>> >>> dc2
>> >>>
>> >>>host2
>> >>>
>> >>>  red osd.3
>> >>>
>> >>>  blue osd.4 primary
>> >>>
>> >>>  green osd.5
>> >>>
>> >>> dc3
>> >>>
>> >>>host3
>> >>>
>> >>>  red osd.6
>> >>>
>> >>>  blue osd.7
>> >>>
>> >>>  green osd.8 primary
>> >>>
>> >>>
>> >>> create 3 crush rules:
>> >>>
>> >>> ceph osd crush rule create-replicated red default host red
>> >>>
>> >>> ceph osd crush rule create-replicated blue default host blue
>> >>>
>> >>> ceph osd crush rule create-replicated green default host green
>> >>>
>> >>>
>> >>> and 3 pools:
>> >>>
>> >>> ceph osd pool create red 64 64 replicated red
>> >>>
>> >>> ceph osd pool create blue 64 64 replicated blue
>> >>>
>> >>> ceph osd pool create blue 64 64 replicated green
>> >>>
>> >>>
>> >>> [1]
>> >>> http://docs.ceph.com/docs/master/rados/operations/crush-map/#primary-affinity'
>> >>>
>> >>>
>> >>>
>> >>> k
>> >>>
>> >> ___
>> >> ceph-users mailing list
>> >> ceph-users@lists.ceph.com
>> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What could cause mon_osd_full_ratio to be exceeded?

2018-11-26 Thread Vladimir Brik

> Why didn't it stop at mon_osd_full_ratio (90%)
Should be 95%

Vlad



On 11/26/18 9:28 AM, Vladimir Brik wrote:

Hello

I am doing some Ceph testing on a near-full cluster, and I noticed that, 
after I brought down a node, some OSDs' utilization reached 
osd_failsafe_full_ratio (97%). Why didn't it stop at mon_osd_full_ratio 
(90%) if mon_osd_backfillfull_ratio is 90%?



Thanks,

Vlad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] What could cause mon_osd_full_ratio to be exceeded?

2018-11-26 Thread Vladimir Brik

Hello

I am doing some Ceph testing on a near-full cluster, and I noticed that, 
after I brought down a node, some OSDs' utilization reached 
osd_failsafe_full_ratio (97%). Why didn't it stop at mon_osd_full_ratio 
(90%) if mon_osd_backfillfull_ratio is 90%?



Thanks,

Vlad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] read performance, separate client CRUSH maps or limit osd read access from each client

2018-11-26 Thread Gregory Farnum
On Tue, Nov 20, 2018 at 9:50 PM Vlad Kopylov  wrote:

> I see the point, but not for the read case:
>   no overhead for just choosing or let Mount option choose read replica.
>
> This is simple feature that can be implemented, that will save many
> people bandwidth in really distributed cases.
>

This is actually much more complicated than it sounds. Allowing reads from
the replica OSDs while still routing writes through a different primary OSD
introduces a great many consistency issues. We've tried adding very limited
support for this read-from-replica scenario in special cases, but have had
to roll them all back due to edge cases where they don't work.

I understand why you want it, but it's definitely not a simple feature. :(
-Greg


>
> Main issue this surfaces is that RADOS maps ignore clients - they just
> see cluster. There should be the part of RADOS map unique or possibly
> unique for each client connection.
>
> Lets file feature request?
>
> p.s. honestly, I don't see why anyone would use ceph for local network
> RAID setups, there are other simple solutions out there even in your
> own RedHat shop.
> On Tue, Nov 20, 2018 at 8:38 PM Patrick Donnelly 
> wrote:
> >
> > You either need to accept that reads/writes will land on different data
> centers, primary OSD for a given pool is always in the desired data center,
> or some other non-Ceph solution which will have either expensive, eventual,
> or false consistency.
> >
> > On Fri, Nov 16, 2018, 10:07 AM Vlad Kopylov  >>
> >> This is what Jean suggested. I understand it and it works with primary.
> >> But what I need is for all clients to access same files, not separate
> sets (like red blue green)
> >>
> >> Thanks Konstantin.
> >>
> >> On Fri, Nov 16, 2018 at 3:43 AM Konstantin Shalygin 
> wrote:
> >>>
> >>> On 11/16/18 11:57 AM, Vlad Kopylov wrote:
> >>> > Exactly. But write operations should go to all nodes.
> >>>
> >>> This can be set via primary affinity [1], when a ceph client reads or
> >>> writes data, it always contacts the primary OSD in the acting set.
> >>>
> >>>
> >>> If u want to totally segregate IO, you can use device classes:
> >>>
> >>> Just create osds with different classes:
> >>>
> >>> dc1
> >>>
> >>>host1
> >>>
> >>>  red osd.0 primary
> >>>
> >>>  blue osd.1
> >>>
> >>>  green osd.2
> >>>
> >>> dc2
> >>>
> >>>host2
> >>>
> >>>  red osd.3
> >>>
> >>>  blue osd.4 primary
> >>>
> >>>  green osd.5
> >>>
> >>> dc3
> >>>
> >>>host3
> >>>
> >>>  red osd.6
> >>>
> >>>  blue osd.7
> >>>
> >>>  green osd.8 primary
> >>>
> >>>
> >>> create 3 crush rules:
> >>>
> >>> ceph osd crush rule create-replicated red default host red
> >>>
> >>> ceph osd crush rule create-replicated blue default host blue
> >>>
> >>> ceph osd crush rule create-replicated green default host green
> >>>
> >>>
> >>> and 3 pools:
> >>>
> >>> ceph osd pool create red 64 64 replicated red
> >>>
> >>> ceph osd pool create blue 64 64 replicated blue
> >>>
> >>> ceph osd pool create blue 64 64 replicated green
> >>>
> >>>
> >>> [1]
> >>>
> http://docs.ceph.com/docs/master/rados/operations/crush-map/#primary-affinity
> '
> >>>
> >>>
> >>>
> >>> k
> >>>
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] will crush rule be used during object relocation in OSD failure ?

2018-11-26 Thread Gregory Farnum
On Fri, Nov 23, 2018 at 11:01 AM ST Wong (ITSC)  wrote:

> Hi all,
>
>
> We've 8 osd hosts, 4 in room 1 and 4 in room2.
>
> A pool with size = 3 using following crush map is created, to cater for
> room failure.
>
>
> rule multiroom {
> id 0
> type replicated
> min_size 2
> max_size 4
> step take default
> step choose firstn 2 type room
> step chooseleaf firstn 2 type host
> step emit
> }
>
>
>
> We're expecting:
>
> 1.for each object, there are always 2 replicas in one room and 1 replica
> in other room making size=3.  But we can't control which room has 1 or 2
> replicas.
>

Right.


>
> 2.in case an osd host fails, ceph will assign remaining osds to the same
> PG to hold replicas on the failed osd host.  Selection is based on crush
> rule of the pool, thus maintaining the same failure domain - won't make all
> replicas in the same room.
>

Yes, if a host fails the copies it held will be replaced by new copies in
the same room.


>
> 3.in case of entire room with 1 replica fails, the pool will remain
> degraded but won't do any replica relocation.
>

Right.


>
> 4. in case of entire room with 2 replicas fails, ceph will make use of
> osds in the surviving room and making 2 replicas.  Pool will not be
> writeable before all objects are made 2 copies (unless we make pool
> size=4?).  Then when recovery is complete, pool will remain in degraded
> state until the failed room recover.
>

Hmm, I'm actually not sure if this will work out — because CRUSH is
hierarchical, it will keep trying to select hosts from the dead room and
will fill out the location vector's first two spots with -1. It could be
that Ceph will skip all those "nonexistent" entries and just pick the two
copies from slots 3 and 4, but it might not. You should test this carefully
and report back!
-Greg
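One way to simulate this without touching live data is crushtool; a sketch (the rule id and the OSD ids of the "dead" room are assumptions):

    ceph osd getcrushmap -o crushmap.bin
    crushtool -i crushmap.bin --test --rule 0 --num-rep 3 --show-mappings

    # emulate a failed room by zero-weighting all of its OSDs
    crushtool -i crushmap.bin --test --rule 0 --num-rep 3 --show-mappings \
        --weight 0 0 --weight 1 0 --weight 2 0 --weight 3 0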

>
> Is our understanding correct?  Thanks a lot.
> Will do some simulation later to verify.
>
> Regards,
> /stwong
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Monitor disks for SSD only cluster

2018-11-26 Thread Wido den Hollander



On 11/26/18 2:21 PM, Gregory Farnum wrote:
> As the monitors limit their transaction rates, I would tend for the
> higher-durability drives. I don't think any monitor throughput issues
> have been reported on clusters with SSDs for storage.

I can confirm that. Just make sure you have proper Datacenter Grade
SSDs. Don't go crazy with buying the most expensive ones, but stay away
from consumer grade SSDs.

Just make sure you have at least 100GB of free space should the MON
databases grow to a large size.

Wido

> -Greg
> 
> On Mon, Nov 26, 2018 at 5:47 AM Valmar Kuristik  > wrote:
> 
> Hello,
> 
> Can anyone say how important is to have fast storage on monitors for a
> all ssd deployment? We are planning on throwing SSDs into the monitors
> as well, but are at a loss about if to go for more durability or speed.
> Higher durability drives tend to be a lot slower for the 240GB size
> we'd
> need on the monitors, while lower durability would net considerably
> more
> write speed.
> 
> Any insight into this ?
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Monitor disks for SSD only cluster

2018-11-26 Thread Gregory Farnum
As the monitors limit their transaction rates, I would tend for the
higher-durability drives. I don't think any monitor throughput issues have
been reported on clusters with SSDs for storage.
-Greg

On Mon, Nov 26, 2018 at 5:47 AM Valmar Kuristik  wrote:

> Hello,
>
> Can anyone say how important is to have fast storage on monitors for a
> all ssd deployment? We are planning on throwing SSDs into the monitors
> as well, but are at a loss about if to go for more durability or speed.
> Higher durability drives tend to be a lot slower for the 240GB size we'd
> need on the monitors, while lower durability would net considerably more
> write speed.
>
> Any insight into this ?
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] No recovery when "norebalance" flag set

2018-11-26 Thread Gregory Farnum
On Sun, Nov 25, 2018 at 2:41 PM Stefan Kooman  wrote:

> Hi list,
>
> During cluster expansion (adding extra disks to existing hosts) some
> OSDs failed (FAILED assert(0 == "unexpected error", _txc_add_transaction
> error (39) Directory not empty not handled on operation 21 (op 1,
> counting from 0), full details: https://8n1.org/14078/c534). We had
> "norebalance", "nobackfill", and "norecover" flags set. After we unset
> nobackfill and norecover (to let Ceph fix the degraded PGs) it would
> recover all but 12 objects (2 PGs). We queried the PGs and the OSDs that
> were supposed to have a copy of them, and they were already "probed".  A
> day later (~24 hours) it would still not have recovered the degraded
> objects.  After we unset the "norebalance" flag it would start
> rebalancing, backfilling and recovering. The 12 degraded objects were
> recovered.
>
> Is this expected behaviour? I would expect Ceph to always try to fix
> degraded things first and foremost. Even "pg force-recover" and "pg
> force-backfill" could not force recovery.
>

I haven't dug into how the norebalance flag works, but I think this is
expected — it presumably prevents OSDs from creating new copies of PGs,
which is what needed to happen here.
-Greg
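For reference, the flags discussed in this thread are set and cleared like this (the unset order is the one described above):

    ceph osd set norebalance
    ceph osd set nobackfill
    ceph osd set norecover
    # ... add the disks ...
    ceph osd unset nobackfill
    ceph osd unset norecover
    ceph osd unset norebalance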


>
> Gr. Stefan
>
>
>
>
> --
> | BIT BV  http://www.bit.nl/   Kamer van Koophandel 09090351
> | GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Degraded objects after: ceph osd in $osd

2018-11-26 Thread Gregory Farnum
On Mon, Nov 26, 2018 at 3:30 AM Janne Johansson  wrote:

> On Sun, 25 Nov 2018 at 22:10, Stefan Kooman  wrote:
> >
> > Hi List,
> >
> > Another interesting and unexpected thing we observed during cluster
> > expansion is the following. After we added  extra disks to the cluster,
> > while "norebalance" flag was set, we put the new OSDs "IN". As soon as
> > we did that a couple of hundered objects would become degraded. During
> > that time no OSD crashed or restarted. Every "ceph osd crush add $osd
> > weight host=$storage-node" would cause extra degraded objects.
> >
> > I don't expect objects to become degraded when extra OSDs are added.
> > Misplaced, yes. Degraded, no
> >
> > Someone got an explantion for this?
> >
>
> Yes, when you add a drive (or 10), some PGs decide they should have one or
> more
> replicas on the new drives, a new empty PG is created there, and
> _then_ that replica
> will make that PG get into the "degraded" mode, meaning if it had 3
> fine active+clean
> replicas before, it now has 2 active+clean and one needing backfill to
> get into shape.
>
> It is a slight mistake in reporting it in the same way as an error,
> even if it looks to the
> cluster just as if it was in error and needs fixing. This gives the
> new ceph admins a
> sense of urgency or danger whereas it should be perfectly normal to add
> space to
> a cluster. Also, it could have chosen to add a fourth PG in a repl=3
> PG and fill from
> the one going out into the new empty PG and somehow keep itself with 3
> working
> replicas, but ceph chooses to first discard one replica, then backfill
> into the empty
> one, leading to this kind of "error" report.
>

See, that's the thing: Ceph is designed *not* to reduce data reliability
this way; it shouldn't do that; and so far as I've been able to establish
so far it doesn't actually do that. Which makes these degraded object
reports a bit perplexing.

What we have worked out is that sometimes objects can be degraded because
the log-based recovery takes a while after the primary juggles around PG
set membership, and I suspect that's what is turning up here. The exact
cause still eludes me a bit, but I assume it's a consequence of the
backfill and recovery throttling we've added over the years.
If a whole PG was missing then you'd expect to see very large degraded
object counts (as opposed to the 2 that Marco reported).

-Greg


>
> --
> May the most significant bit of your life be positive.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Degraded objects after: ceph osd in $osd

2018-11-26 Thread Marco Gaiarin
Mandi! Janne Johansson wrote:

> It is a slight mistake in reporting it in the same way as an error, even if 
> it looks to the
> cluster just as if it was in error and needs fixing.

I think I've hit a similar situation, and I also feel that
something has to be 'fixed'. I'm looking for an explanation...

I'm adding a node (blackpanther, 4 OSDs, done) and removing a
node (vedovanera[1], 4 OSDs, to be done).

I've added the new node and slowly added its 4 new OSDs, but in the meantime
an OSD (not one of the new ones, and not on the node to be removed) died. My
situation now is:

 root@blackpanther:~# ceph osd df tree
 ID WEIGHT   REWEIGHT SIZE   USE   AVAIL  %USE  VAR  TYPE NAME   
 -1 21.41985-  5586G 2511G  3074G 00 root default
 -2  5.45996-  5586G 2371G  3214G 42.45 0.93 host capitanamerica 
  0  1.81999  1.0  1862G  739G  1122G 39.70 0.87 osd.0   
  1  1.81999  1.0  1862G  856G  1005G 46.00 1.00 osd.1   
 10  0.90999  1.0   931G  381G   549G 40.95 0.89 osd.10  
 11  0.90999  1.0   931G  394G   536G 42.35 0.92 osd.11  
 -3  5.03996-  5586G 2615G  2970G 46.82 1.02 host vedovanera 
  2  1.3  1.0  1862G  684G  1177G 36.78 0.80 osd.2   
  3  1.81999  1.0  1862G 1081G   780G 58.08 1.27 osd.3   
  4  0.90999  1.0   931G  412G   518G 44.34 0.97 osd.4   
  5  0.90999  1.0   931G  436G   494G 46.86 1.02 osd.5   
 -4  5.45996-   931G  583G   347G 00 host deadpool   
  6  1.81999  1.0  1862G  898G   963G 48.26 1.05 osd.6   
  7  1.81999  1.0  1862G  839G  1022G 45.07 0.98 osd.7   
  8  0.909990  0 0  0 00 osd.8   
  9  0.90999  1.0   931G  583G   347G 62.64 1.37 osd.9   
 -5  5.45996-  5586G 2511G  3074G 44.96 0.98 host blackpanther   
 12  1.81999  1.0  1862G  828G  1033G 44.51 0.97 osd.12  
 13  1.81999  1.0  1862G  753G  1108G 40.47 0.88 osd.13  
 14  0.90999  1.0   931G  382G   548G 41.11 0.90 osd.14  
 15  0.90999  1.0   931G  546G   384G 58.66 1.28 osd.15  
TOTAL 21413G 9819G 11594G 45.85  
 MIN/MAX VAR: 0/1.37  STDDEV: 7.37

Perfectly healthy. But I've tried to slowly drain an OSD on
'vedovanera', and so I've tried:

ceph osd crush reweight osd.2 

As you can see, I've arrived at weight 1.4 (from 1.81999), but if I go
lower than that I get:

   cluster 8794c124-c2ec-4e81-8631-742992159bd6
 health HEALTH_WARN
6 pgs backfill
1 pgs backfilling
7 pgs stuck unclean
recovery 2/2556513 objects degraded (0.000%)
recovery 7721/2556513 objects misplaced (0.302%)
 monmap e6: 6 mons at 
{0=10.27.251.7:6789/0,1=10.27.251.8:6789/0,2=10.27.251.11:6789/0,3=10.27.251.12:6789/0,4=10.27.251.9:6789/0,blackpanther=10.27.251.2:6789/0}
election epoch 2780, quorum 0,1,2,3,4,5 blackpanther,0,1,4,2,3
 osdmap e9302: 16 osds: 15 up, 15 in; 7 remapped pgs
  pgmap v54971897: 768 pgs, 3 pools, 3300 GB data, 830 kobjects
9911 GB used, 11502 GB / 21413 GB avail
2/2556513 objects degraded (0.000%)
7721/2556513 objects misplaced (0.302%)
 761 active+clean
   6 active+remapped+wait_backfill
   1 active+remapped+backfilling
  client io 9725 kB/s rd, 772 kB/s wr, 153 op/s

i.e., 2 objects 'degraded'. This really puzzles me.

Why?! Thanks.


[1] some Marvel Comics heros got translated in Italian, so 'vedovanera'
  is 'black widow' and 'capitanamerica' clearly 'Captain America'.

-- 
dott. Marco Gaiarin GNUPG Key ID: 240A3D66
  Associazione ``La Nostra Famiglia''  http://www.lanostrafamiglia.it/
  Polo FVG   -   Via della Bontà, 7 - 33078   -   San Vito al Tagliamento (PN)
  marco.gaiarin(at)lanostrafamiglia.it   t +39-0434-842711   f +39-0434-842797

Dona il 5 PER MILLE a LA NOSTRA FAMIGLIA!
  http://www.lanostrafamiglia.it/index.php/it/sostienici/5x1000
(cf 00307430132, categoria ONLUS oppure RICERCA SANITARIA)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] No recovery when "norebalance" flag set

2018-11-26 Thread Stefan Kooman
Quoting Dan van der Ster (d...@vanderster.com):
> Haven't seen that exact issue.
> 
> One thing to note though is that if osd_max_backfills is set to 1,
> then it can happen that PGs get into backfill state, taking that
> single reservation on a given OSD, and therefore the recovery_wait PGs
> can't get a slot.
> I suppose that backfill prioritization is supposed to prevent this,
> but in my experience luminous v12.2.8 doesn't always get it right.

That's also our experience. Even if the degraded PGs in backfill /
recovery state are given a higher priority (forced) ... normal
backfilling still takes place first.

> So next time I'd try injecting osd_max_backfills = 2 or 3 to kickstart
> the recovering PGs.

It was still on "1" indeed. We tend to crank that (and max recovery) up while
keeping an eye on max read and write apply latency. In our setup we can
do 16 backfills concurrently, and/or 2 recoveries / 4 backfills. Recovery
speeds are ~4-5 GB/s ... pushing it beyond that tends to crash OSDs.

We'll try your suggestion next time.

Thanks,

Stefan

-- 
| BIT BV  http://www.bit.nl/   Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Disable intra-host replication?

2018-11-26 Thread Janne Johansson
On Mon, 26 Nov 2018 at 12:11, Marco Gaiarin  wrote:
> Mandi! Janne Johansson wrote:
>
> > The default crush rules with replication=3 would only place PGs on
> > separate hosts,
> > so in that case it would go into degraded mode if a node goes away,
> > and not place
> > replicas on different disks on the remaining hosts.
>
> 'hosts' mean 'hosts with OSDs', right?
> Because my cluster have 5 hosts, 2 are only MONs.

Yes, only hosts with OSDs.

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] No recovery when "norebalance" flag set

2018-11-26 Thread Dan van der Ster
Haven't seen that exact issue.

One thing to note though is that if osd_max_backfills is set to 1,
then it can happen that PGs get into backfill state, taking that
single reservation on a given OSD, and therefore the recovery_wait PGs
can't get a slot.
I suppose that backfill prioritization is supposed to prevent this,
but in my experience luminous v12.2.8 doesn't always get it right.

So next time I'd try injecting osd_max_backfills = 2 or 3 to kickstart
the recovering PGs.
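Something along these lines (the value is only an example; dial it back down afterwards):

    ceph tell osd.* injectargs '--osd-max-backfills 3'
    # and once the degraded PGs have recovered:
    ceph tell osd.* injectargs '--osd-max-backfills 1'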

-- dan


On Sun, Nov 25, 2018 at 8:41 PM Stefan Kooman  wrote:
>
> Hi list,
>
> During cluster expansion (adding extra disks to existing hosts) some
> OSDs failed (FAILED assert(0 == "unexpected error", _txc_add_transaction
> error (39) Directory not empty not handled on operation 21 (op 1,
> counting from 0), full details: https://8n1.org/14078/c534). We had
> "norebalance", "nobackfill", and "norecover" flags set. After we unset
> nobackfill and norecover (to let Ceph fix the degraded PGs) it would
> recover all but 12 objects (2 PGs). We queried the PGs and the OSDs that
> were supposed to have a copy of them, and they were already "probed".  A
> day later (~24 hours) it would still not have recovered the degraded
> objects.  After we unset the "norebalance" flag it would start
> rebalancing, backfilling and recovering. The 12 degraded objects were
> recovered.
>
> Is this expected behaviour? I would expect Ceph to always try to fix
> degraded things first and foremost. Even "pg force-recover" and "pg
> force-backfill" could not force recovery.
>
> Gr. Stefan
>
>
>
>
> --
> | BIT BV  http://www.bit.nl/   Kamer van Koophandel 09090351
> | GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Disable intra-host replication?

2018-11-26 Thread Marco Gaiarin
Mandi! Janne Johansson wrote:

> The default crush rules with replication=3 would only place PGs on
> separate hosts,
> so in that case it would go into degraded mode if a node goes away,
> and not place
> replicas on different disks on the remaining hosts.

'hosts' means 'hosts with OSDs', right?

Because my cluster has 5 hosts, 2 of which are only MONs.


Thanks.

-- 
dott. Marco Gaiarin GNUPG Key ID: 240A3D66
  Associazione ``La Nostra Famiglia''  http://www.lanostrafamiglia.it/
  Polo FVG   -   Via della Bontà, 7 - 33078   -   San Vito al Tagliamento (PN)
  marco.gaiarin(at)lanostrafamiglia.it   t +39-0434-842711   f +39-0434-842797

Dona il 5 PER MILLE a LA NOSTRA FAMIGLIA!
  http://www.lanostrafamiglia.it/index.php/it/sostienici/5x1000
(cf 00307430132, categoria ONLUS oppure RICERCA SANITARIA)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Low traffic Ceph cluster with consumer SSD.

2018-11-26 Thread Eneko Lacunza

Hi,

On 25/11/18 at 18:23, Виталий Филиппов wrote:
Ok... That's better than the previous thread about file downloads, where the 
topic starter suffered from a normal, only-metadata-journaled fs... 
Thanks for the link, it would be interesting to repeat similar tests. 
Although I suspect it shouldn't be that bad... at least not all 
desktop SSDs are that broken - for example 
https://engineering.nordeus.com/power-failure-testing-with-ssds/ says 
the Samsung 840 Pro is OK.


Except that Ceph performance with that SSD model is very, very bad. We had 
one of those repurposed for Ceph and had to rush out and buy an Intel 
enterprise SSD drive to replace it.


Don't even try :)

Cheers
Eneko

--
Zuzendari Teknikoa / Director Técnico
Binovo IT Human Project, S.L.
Telf. 943569206
Astigarraga bidea 2, 2º izq. oficina 11; 20180 Oiartzun (Gipuzkoa)
www.binovo.es

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Monitor disks for SSD only cluster

2018-11-26 Thread Valmar Kuristik

Hello,

Can anyone say how important it is to have fast storage on the monitors for 
an all-SSD deployment? We are planning on putting SSDs into the monitors 
as well, but are at a loss about whether to go for more durability or more 
speed. Higher-durability drives tend to be a lot slower at the 240GB size 
we'd need on the monitors, while lower-durability drives would net 
considerably more write speed.


Any insight into this ?

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Sizing for bluestore db and wal

2018-11-26 Thread Janne Johansson
On Mon, 26 Nov 2018 at 10:10, Felix Stolte  wrote:
>
> Hi folks,
>
> i upgraded our ceph cluster from jewel to luminous and want to migrate
> from filestore to bluestore. Currently we use one SSD as journal for
> thre 8TB Sata Drives with a journal partition size of 40GB. If my
> understanding of the bluestore documentation is correct, i can use a wal
> partition for the writeahead log (to decrease write latency, similar to
> filestore) and a db partition for metadata (decreasing write AND read
> latency/throughput). Now I have two questions:
>
> a) Do I really need an WAL partition if both wal and db are on the same SSD?

I think the answer is no here: if you point the DB to an SSD, bluestore will
use it for the WAL as well.

> b) If so, what would the ratio look like? 99% db, 1% wal?

...which means you can just let Ceph handle this itself.
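For example, something along these lines should be enough when creating the OSDs (device names are placeholders); with only --block.db given, the WAL should end up on the same device as the DB:

    ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/sda1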

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Sizing for bluestore db and wal

2018-11-26 Thread Felix Stolte

Hi folks,

I upgraded our Ceph cluster from Jewel to Luminous and want to migrate 
from Filestore to Bluestore. Currently we use one SSD as the journal for 
three 8TB SATA drives, with a journal partition size of 40GB. If my 
understanding of the Bluestore documentation is correct, I can use a WAL 
partition for the write-ahead log (to decrease write latency, similar to 
Filestore) and a DB partition for metadata (decreasing write AND read 
latency/throughput). Now I have two questions:


a) Do I really need a WAL partition if both the WAL and DB are on the same SSD?

b) If so, what would the ratio look like? 99% db, 1% wal?


Best regards, Felix

--
Forschungszentrum Jülich GmbH
52425 Jülich
Sitz der Gesellschaft: Jülich
Eingetragen im Handelsregister des Amtsgerichts Düren Nr. HR B 3498
Vorsitzender des Aufsichtsrats: MinDir. Dr. Karl Eugen Huthmacher
Geschäftsführung: Prof. Dr.-Ing. Wolfgang Marquardt (Vorsitzender),
Karsten Beneke (stellv. Vorsitzender), Prof. Dr.-Ing. Harald Bolt,
Prof. Dr. Sebastian M. Schmidt




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Degraded objects after: ceph osd in $osd

2018-11-26 Thread Janne Johansson
On Mon, 26 Nov 2018 at 09:39, Stefan Kooman  wrote:

> > It is a slight mistake in reporting it in the same way as an error,
> > even if it looks to the
> > cluster just as if it was in error and needs fixing. This gives the
> > new ceph admins a
> > sense of urgency or danger whereas it should be perfectly normal to add 
> > space to
> > a cluster. Also, it could have chosen to add a fourth PG in a repl=3
> > PG and fill from
> > the one going out into the new empty PG and somehow keep itself with 3 
> > working
> > replicas, but ceph chooses to first discard one replica, then backfill
> > into the empty
> > one, leading to this kind of "error" report.
>
> Thanks for the explanation. I agree with you that it would be more safe to
> first backfill to the new PG instead of just assuming the new OSD will
> be fine and discarding a perfectly healthy PG. We do have max_size 3 in
> the CRUSH ruleset ... I wonder if Ceph would behave differently if we
> would have max_size 4 ... to actually allow a fourth copy in the first
> place ...

I don't think the replication number is important. It's more of a choice which
PERHAPS is meant to allow you to move PGs to a new drive when the cluster is
near full: it will clear out space a lot faster if you just kill off one
unneeded replica and start writing to the new drive, whereas keeping all old
replicas until the data is 100% OK on the new replica would make new space not
appear until a large amount of data has moved, which for large drives and
large PGs might take a very long time.

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Degraded objects after: ceph osd in $osd

2018-11-26 Thread Stefan Kooman
Quoting Janne Johansson (icepic...@gmail.com):
> Yes, when you add a drive (or 10), some PGs decide they should have one or 
> more
> replicas on the new drives, a new empty PG is created there, and
> _then_ that replica
> will make that PG get into the "degraded" mode, meaning if it had 3
> fine active+clean
> replicas before, it now has 2 active+clean and one needing backfill to
> get into shape.
> 
> It is a slight mistake in reporting it in the same way as an error,
> even if it looks to the
> cluster just as if it was in error and needs fixing. This gives the
> new ceph admins a
> sense of urgency or danger whereas it should be perfectly normal to add space 
> to
> a cluster. Also, it could have chosen to add a fourth PG in a repl=3
> PG and fill from
> the one going out into the new empty PG and somehow keep itself with 3 working
> replicas, but ceph chooses to first discard one replica, then backfill
> into the empty
> one, leading to this kind of "error" report.

Thanks for the explanation. I agree with you that it would be safer to
first backfill to the new PG instead of just assuming the new OSD will
be fine and discarding a perfectly healthy PG. We do have max_size 3 in
the CRUSH ruleset ... I wonder whether Ceph would behave differently if we
had max_size 4 ... to actually allow a fourth copy in the first
place ...

Gr. Stefan

-- 
| BIT BV  http://www.bit.nl/   Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Degraded objects after: ceph osd in $osd

2018-11-26 Thread Janne Johansson
On Sun, 25 Nov 2018 at 22:10, Stefan Kooman  wrote:
>
> Hi List,
>
> Another interesting and unexpected thing we observed during cluster
> expansion is the following. After we added  extra disks to the cluster,
> while "norebalance" flag was set, we put the new OSDs "IN". As soon as
> we did that a couple of hundered objects would become degraded. During
> that time no OSD crashed or restarted. Every "ceph osd crush add $osd
> weight host=$storage-node" would cause extra degraded objects.
>
> I don't expect objects to become degraded when extra OSDs are added.
> Misplaced, yes. Degraded, no
>
> Someone got an explantion for this?
>

Yes, when you add a drive (or 10), some PGs decide they should have one or more
replicas on the new drives, a new empty PG is created there, and
_then_ that replica
will make that PG get into the "degraded" mode, meaning if it had 3
fine active+clean
replicas before, it now has 2 active+clean and one needing backfill to
get into shape.

It is a slight mistake in reporting it in the same way as an error,
even if it looks to the
cluster just as if it was in error and needs fixing. This gives the
new ceph admins a
sense of urgency or danger whereas it should be perfectly normal to add space to
a cluster. Also, it could have chosen to add a fourth replica to a repl=3
PG and fill from
the one going out into the new empty PG and somehow keep itself with 3 working
replicas, but ceph chooses to first discard one replica, then backfill
into the empty
one, leading to this kind of "error" report.

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com