Re: [ceph-users] read performance, separate client CRUSH maps or limit osd read access from each client

2018-11-26 Thread Vlad Kopylov
I see. Thank you Greg.

Ultimately this leads to some kind of multi-primary OSD/MON setup, which
would most likely add lookup overhead, though that might be a reasonable
trade-off for network-distributed setups.
It would be a good feature for a major version.

With GlusterFS I solved it, funny as it sounds, by writing a tiny FUSE
fs as an overlay, serving all reads locally and sending writes to the
cluster. That works because with GlusterFS there are real files on each
node for local reads.

I wish there were a way to get file access to the local OSD so I could use the same approach.

-vlad
On Mon, Nov 26, 2018 at 8:47 AM Gregory Farnum  wrote:
>
> On Tue, Nov 20, 2018 at 9:50 PM Vlad Kopylov  wrote:
>>
>> I see the point, but not for the read case:
>>   there is no overhead in just choosing a read replica, or letting a
>> mount option choose it.
>>
>> This is a simple feature that could be implemented, and it would save
>> many people bandwidth in really distributed cases.
>
>
> This is actually much more complicated than it sounds. Allowing reads from 
> the replica OSDs while still routing writes through a different primary OSD 
> introduces a great many consistency issues. We've tried adding very limited 
> support for this read-from-replica scenario in special cases, but have had to 
> roll them all back due to edge cases where they don't work.
>
> I understand why you want it, but it's definitely not a simple feature. :(
> -Greg
>
>>
>>
>> The main issue this surfaces is that RADOS maps ignore clients - they just
>> see the cluster. There should be a part of the RADOS map that is unique, or
>> can be unique, for each client connection.
>>
>> Let's file a feature request?
>>
>> p.s. Honestly, I don't see why anyone would use Ceph for local-network
>> RAID setups; there are other, simpler solutions out there, even in your
>> own Red Hat shop.
>> On Tue, Nov 20, 2018 at 8:38 PM Patrick Donnelly  wrote:
>> >
>> > You either need to accept that reads/writes will land on different data 
>> > centers, ensure that the primary OSD for a given pool is always in the 
>> > desired data center, or use some other non-Ceph solution, which will have 
>> > either expensive, eventual, or false consistency.
>> >
>> > On Fri, Nov 16, 2018, 10:07 AM Vlad Kopylov wrote:
>> >> This is what Jean suggested. I understand it and it works with primary.
>> >> But what I need is for all clients to access the same files, not separate 
>> >> sets (like red blue green)
>> >>
>> >> Thanks Konstantin.
>> >>
>> >> On Fri, Nov 16, 2018 at 3:43 AM Konstantin Shalygin  
>> >> wrote:
>> >>>
>> >>> On 11/16/18 11:57 AM, Vlad Kopylov wrote:
>> >>> > Exactly. But write operations should go to all nodes.
>> >>>
>> >>> This can be set via primary affinity [1]: when a Ceph client reads or
>> >>> writes data, it always contacts the primary OSD in the acting set.
>> >>>
>> >>>
>> >>> If you want to totally segregate IO, you can use device classes:
>> >>>
>> >>> Just create osds with different classes:
>> >>>
>> >>> dc1
>> >>>
>> >>>host1
>> >>>
>> >>>  red osd.0 primary
>> >>>
>> >>>  blue osd.1
>> >>>
>> >>>  green osd.2
>> >>>
>> >>> dc2
>> >>>
>> >>>host2
>> >>>
>> >>>  red osd.3
>> >>>
>> >>>  blue osd.4 primary
>> >>>
>> >>>  green osd.5
>> >>>
>> >>> dc3
>> >>>
>> >>>host3
>> >>>
>> >>>  red osd.6
>> >>>
>> >>>  blue osd.7
>> >>>
>> >>>  green osd.8 primary
>> >>>
>> >>>
>> >>> create 3 crush rules:
>> >>>
>> >>> ceph osd crush rule create-replicated red default host red
>> >>>
>> >>> ceph osd crush rule create-replicated blue default host blue
>> >>>
>> >>> ceph osd crush rule create-replicated green default host green
>> >>>
>> >>>
>> >>> and 3 pools:
>> >>>
>> >>> ceph osd pool create red 64 64 replicated red
>> >>>
>> >>> ceph osd pool create blue 64 64 replicated blue
>> >>>
>> >>> ceph osd pool create green 64 64 replicated green
>> >>>
>> >>>
>> >>> [1]
>> >>> http://docs.ceph.com/docs/master/rados/operations/crush-map/#primary-affinity
>> >>>
>> >>>
>> >>>
>> >>> k
>> >>>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] read performance, separate client CRUSH maps or limit osd read access from each client

2018-11-26 Thread Gregory Farnum
On Tue, Nov 20, 2018 at 9:50 PM Vlad Kopylov  wrote:

> I see the point, but not for the read case:
>   there is no overhead in just choosing a read replica, or letting a
> mount option choose it.
>
> This is a simple feature that could be implemented, and it would save
> many people bandwidth in really distributed cases.
>

This is actually much more complicated than it sounds. Allowing reads from
the replica OSDs while still routing writes through a different primary OSD
introduces a great many consistency issues. We've tried adding very limited
support for this read-from-replica scenario in special cases, but have had
to roll them all back due to edge cases where they don't work.

I understand why you want it, but it's definitely not a simple feature. :(
-Greg


>
> The main issue this surfaces is that RADOS maps ignore clients - they just
> see the cluster. There should be a part of the RADOS map that is unique, or
> can be unique, for each client connection.
>
> Let's file a feature request?
>
> p.s. Honestly, I don't see why anyone would use Ceph for local-network
> RAID setups; there are other, simpler solutions out there, even in your
> own Red Hat shop.
> On Tue, Nov 20, 2018 at 8:38 PM Patrick Donnelly 
> wrote:
> >
> > You either need to accept that reads/writes will land on different data
> centers, ensure that the primary OSD for a given pool is always in the
> desired data center, or use some other non-Ceph solution, which will have
> either expensive, eventual, or false consistency.
> >
> > On Fri, Nov 16, 2018, 10:07 AM Vlad Kopylov wrote:
> >> This is what Jean suggested. I understand it and it works with primary.
> >> But what I need is for all clients to access the same files, not separate
> sets (like red blue green)
> >>
> >> Thanks Konstantin.
> >>
> >> On Fri, Nov 16, 2018 at 3:43 AM Konstantin Shalygin 
> wrote:
> >>>
> >>> On 11/16/18 11:57 AM, Vlad Kopylov wrote:
> >>> > Exactly. But write operations should go to all nodes.
> >>>
> >>> This can be set via primary affinity [1]: when a Ceph client reads or
> >>> writes data, it always contacts the primary OSD in the acting set.
> >>>
> >>>
> >>> If you want to totally segregate IO, you can use device classes:
> >>>
> >>> Just create osds with different classes:
> >>>
> >>> dc1
> >>>
> >>>host1
> >>>
> >>>  red osd.0 primary
> >>>
> >>>  blue osd.1
> >>>
> >>>  green osd.2
> >>>
> >>> dc2
> >>>
> >>>host2
> >>>
> >>>  red osd.3
> >>>
> >>>  blue osd.4 primary
> >>>
> >>>  green osd.5
> >>>
> >>> dc3
> >>>
> >>>host3
> >>>
> >>>  red osd.6
> >>>
> >>>  blue osd.7
> >>>
> >>>  green osd.8 primary
> >>>
> >>>
> >>> create 3 crush rules:
> >>>
> >>> ceph osd crush rule create-replicated red default host red
> >>>
> >>> ceph osd crush rule create-replicated blue default host blue
> >>>
> >>> ceph osd crush rule create-replicated green default host green
> >>>
> >>>
> >>> and 3 pools:
> >>>
> >>> ceph osd pool create red 64 64 replicated red
> >>>
> >>> ceph osd pool create blue 64 64 replicated blue
> >>>
> >>> ceph osd pool create green 64 64 replicated green
> >>>
> >>>
> >>> [1]
> >>>
> http://docs.ceph.com/docs/master/rados/operations/crush-map/#primary-affinity
> >>>
> >>>
> >>>
> >>> k
> >>>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] read performance, separate client CRUSH maps or limit osd read access from each client

2018-11-20 Thread Vlad Kopylov
I see the point, but not for the read case:
  there is no overhead in just choosing a read replica, or letting a
mount option choose it.

This is a simple feature that could be implemented, and it would save
many people bandwidth in really distributed cases.

The main issue this surfaces is that RADOS maps ignore clients - they just
see the cluster. There should be a part of the RADOS map that is unique, or
can be unique, for each client connection.

Let's file a feature request?

p.s. Honestly, I don't see why anyone would use Ceph for local-network
RAID setups; there are other, simpler solutions out there, even in your
own Red Hat shop.
On Tue, Nov 20, 2018 at 8:38 PM Patrick Donnelly  wrote:
>
> You either need to accept that reads/writes will land on different data 
> centers, ensure that the primary OSD for a given pool is always in the 
> desired data center, or use some other non-Ceph solution, which will have 
> either expensive, eventual, or false consistency.
>
> On Fri, Nov 16, 2018, 10:07 AM Vlad Kopylov wrote:
>> This is what Jean suggested. I understand it and it works with primary.
>> But what I need is for all clients to access the same files, not separate sets 
>> (like red blue green)
>>
>> Thanks Konstantin.
>>
>> On Fri, Nov 16, 2018 at 3:43 AM Konstantin Shalygin  wrote:
>>>
>>> On 11/16/18 11:57 AM, Vlad Kopylov wrote:
>>> > Exactly. But write operations should go to all nodes.
>>>
>>> This can be set via primary affinity [1]: when a Ceph client reads or
>>> writes data, it always contacts the primary OSD in the acting set.
>>>
>>>
>>> If you want to totally segregate IO, you can use device classes:
>>>
>>> Just create osds with different classes:
>>>
>>> dc1
>>>
>>>host1
>>>
>>>  red osd.0 primary
>>>
>>>  blue osd.1
>>>
>>>  green osd.2
>>>
>>> dc2
>>>
>>>host2
>>>
>>>  red osd.3
>>>
>>>  blue osd.4 primary
>>>
>>>  green osd.5
>>>
>>> dc3
>>>
>>>host3
>>>
>>>  red osd.6
>>>
>>>  blue osd.7
>>>
>>>  green osd.8 primary
>>>
>>>
>>> create 3 crush rules:
>>>
>>> ceph osd crush rule create-replicated red default host red
>>>
>>> ceph osd crush rule create-replicated blue default host blue
>>>
>>> ceph osd crush rule create-replicated green default host green
>>>
>>>
>>> and 3 pools:
>>>
>>> ceph osd pool create red 64 64 replicated red
>>>
>>> ceph osd pool create blue 64 64 replicated blue
>>>
>>> ceph osd pool create green 64 64 replicated green
>>>
>>>
>>> [1]
>>> http://docs.ceph.com/docs/master/rados/operations/crush-map/#primary-affinity
>>>
>>>
>>>
>>> k
>>>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] read performance, separate client CRUSH maps or limit osd read access from each client

2018-11-20 Thread Patrick Donnelly
You either need to accept that reads/writes will land on different data
centers, ensure that the primary OSD for a given pool is always in the
desired data center, or use some other non-Ceph solution, which will have
either expensive, eventual, or false consistency.

On Fri, Nov 16, 2018, 10:07 AM Vlad Kopylov wrote:

> This is what Jean suggested. I understand it and it works with primary.
> *But what I need is for all clients to access the same files, not separate
> sets (like red blue green)*
>
> Thanks Konstantin.
>
> On Fri, Nov 16, 2018 at 3:43 AM Konstantin Shalygin 
> wrote:
>
>> On 11/16/18 11:57 AM, Vlad Kopylov wrote:
>> > Exactly. But write operations should go to all nodes.
>>
>> This can be set via primary affinity [1]: when a Ceph client reads or
>> writes data, it always contacts the primary OSD in the acting set.
>>
>>
>> If you want to totally segregate IO, you can use device classes:
>>
>> Just create osds with different classes:
>>
>> dc1
>>
>>host1
>>
>>  red osd.0 primary
>>
>>  blue osd.1
>>
>>  green osd.2
>>
>> dc2
>>
>>host2
>>
>>  red osd.3
>>
>>  blue osd.4 primary
>>
>>  green osd.5
>>
>> dc3
>>
>>host3
>>
>>  red osd.6
>>
>>  blue osd.7
>>
>>  green osd.8 primary
>>
>>
>> create 3 crush rules:
>>
>> ceph osd crush rule create-replicated red default host red
>>
>> ceph osd crush rule create-replicated blue default host blue
>>
>> ceph osd crush rule create-replicated green default host green
>>
>>
>> and 3 pools:
>>
>> ceph osd pool create red 64 64 replicated red
>>
>> ceph osd pool create blue 64 64 replicated blue
>>
>> ceph osd pool create green 64 64 replicated green
>>
>>
>> [1]
>>
>> http://docs.ceph.com/docs/master/rados/operations/crush-map/#primary-affinity
>>
>>
>>
>> k
>>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] read performance, separate client CRUSH maps or limit osd read access from each client

2018-11-19 Thread Vlad Kopylov
Yes, using GlusterFS now.
But Ceph has the best write replication, which I am struggling to get the
Gluster guys to implement.

If this read-replica selection issue could be fixed, Ceph could be a good
cloud fs, not just a local-network RAID.

On Mon, Nov 19, 2018 at 2:54 AM Konstantin Shalygin  wrote:

> On 11/17/18 1:07 AM, Vlad Kopylov wrote:
>
> This is what Jean suggested. I understand it and it works with primary.
> *But what I need is for all clients to access the same files, not separate
> sets (like red blue green)*
>
> You should look at other solutions, like GlusterFS. Ceph is overkill for
> this case IMHO.
>
>
>
> k
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] read performance, separate client CRUSH maps or limit osd read access from each client

2018-11-18 Thread Konstantin Shalygin

On 11/17/18 1:07 AM, Vlad Kopylov wrote:

This is what Jean suggested. I understand it and it works with primary.
*But what I need is for all clients to access the same files, not separate 
sets (like red blue green)*


You should look at other solutions, like GlusterFS. Ceph is overkill for 
this case IMHO.




k


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] read performance, separate client CRUSH maps or limit osd read access from each client

2018-11-16 Thread Vlad Kopylov
This is what Jean suggested. I understand it and it works with primary.
*But what I need is for all clients to access the same files, not separate sets
(like red blue green)*

Thanks Konstantin.

On Fri, Nov 16, 2018 at 3:43 AM Konstantin Shalygin  wrote:

> On 11/16/18 11:57 AM, Vlad Kopylov wrote:
> > Exactly. But write operations should go to all nodes.
>
> This can be set via primary affinity [1]: when a Ceph client reads or
> writes data, it always contacts the primary OSD in the acting set.
>
>
> If you want to totally segregate IO, you can use device classes:
>
> Just create osds with different classes:
>
> dc1
>
>host1
>
>  red osd.0 primary
>
>  blue osd.1
>
>  green osd.2
>
> dc2
>
>host2
>
>  red osd.3
>
>  blue osd.4 primary
>
>  green osd.5
>
> dc3
>
>host3
>
>  red osd.6
>
>  blue osd.7
>
>  green osd.8 primary
>
>
> create 3 crush rules:
>
> ceph osd crush rule create-replicated red default host red
>
> ceph osd crush rule create-replicated blue default host blue
>
> ceph osd crush rule create-replicated green default host green
>
>
> and 3 pools:
>
> ceph osd pool create red 64 64 replicated red
>
> ceph osd pool create blue 64 64 replicated blue
>
> ceph osd pool create green 64 64 replicated green
>
>
> [1]
>
> http://docs.ceph.com/docs/master/rados/operations/crush-map/#primary-affinity
>
>
>
> k
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] read performance, separate client CRUSH maps or limit osd read access from each client

2018-11-16 Thread Konstantin Shalygin

On 11/16/18 11:57 AM, Vlad Kopylov wrote:

Exactly. But write operations should go to all nodes.


This can be set via primary affinity [1]: when a Ceph client reads or 
writes data, it always contacts the primary OSD in the acting set.



If you want to totally segregate IO, you can use device classes:

Just create osds with different classes:

dc1

  host1

    red osd.0 primary

    blue osd.1

    green osd.2

dc2

  host2

    red osd.3

    blue osd.4 primary

    green osd.5

dc3

  host3

    red osd.6

    blue osd.7

    green osd.8 primary


create 3 crush rules:

ceph osd crush rule create-replicated red default host red

ceph osd crush rule create-replicated blue default host blue

ceph osd crush rule create-replicated green default host green


and 3 pools:

ceph osd pool create red 64 64 replicated red

ceph osd pool create blue 64 64 replicated blue

ceph osd pool create green 64 64 replicated green
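
A quick way to sanity-check the placement afterwards (a sketch;
"testobj" is just a hypothetical object name):

ceph osd map red testobj

The acting set it prints should lead with the osd you expect to be
primary for that pool.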


[1] 
http://docs.ceph.com/docs/master/rados/operations/crush-map/#primary-affinity




k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] read performance, separate client CRUSH maps or limit osd read access from each client

2018-11-15 Thread Vlad Kopylov
Exactly. But write operations should go to all nodes.

v

On Wed, Nov 14, 2018 at 9:52 PM Konstantin Shalygin  wrote:

> On 11/15/18 9:31 AM, Vlad Kopylov wrote:
> > Thanks Konstantin, I already tried accessing it in different ways and
> > the best I got was bulk-renamed files and other non-presentable data.
> >
> > Maybe to solve this I can create overlapping osd pools?
> > Like one pool that includes all 3 osds for replication, and 3 more that
> > include one osd at each site with the same blocks?
> >
>
> As far as I understand, you need something like this:
>
>
> vm1 io -> building1 osds only
>
> vm2 io -> building2 osds only
>
> vm3 io -> building3 osds only
>
>
> Right?
>
>
>
> k
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] read performance, separate client CRUSH maps or limit osd read access from each client

2018-11-14 Thread Konstantin Shalygin

On 11/15/18 9:31 AM, Vlad Kopylov wrote:
Thanks Konstantin, I already tried accessing it in different ways and 
the best I got was bulk-renamed files and other non-presentable data.


Maybe to solve this I can create overlapping osd pools?
Like one pool that includes all 3 osds for replication, and 3 more that 
include one osd at each site with the same blocks?




As far as I understand, you need something like this:


vm1 io -> building1 osds only

vm2 io -> building2 osds only

vm3 io -> building3 osds only


Right?



k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] read performance, separate client CRUSH maps or limit osd read access from each client

2018-11-14 Thread Vlad Kopylov
Thanks Konstantin, I already tried accessing it in different ways and the
best I got was bulk-renamed files and other non-presentable data.

Maybe to solve this I can create overlapping osd pools?
Like one pool that includes all 3 osds for replication, and 3 more that
include one osd at each site with the same blocks?

v

On Wed, Nov 14, 2018 at 12:11 AM Konstantin Shalygin  wrote:

> Or is it possible to mount one OSD directly for read file access?
>
> In Ceph it is impossible to do I/O directly to an OSD, only to a PG.
>
>
>
> k
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] read performance, separate client CRUSH maps or limit osd read access from each client

2018-11-13 Thread Konstantin Shalygin

Or is it possible to mount one OSD directly for read file access?


In Ceph it is impossible to do I/O directly to an OSD, only to a PG.
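
You can inspect the object -> PG -> OSD mapping, just not address an OSD
directly; a sketch (pool and object names are placeholders):

ceph osd map <pool> <objectname>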



k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] read performance, separate client CRUSH maps or limit osd read access from each client

2018-11-13 Thread Vlad Kopylov
Each of the 3 clients from different buildings is picking the same
primary-affinity OSD, and everything is slow on at least two of them.
Instead of just reading from their local OSD, they read mostly from the
primary-affinity OSD.

*What I need is something like primary-affinity for each client connection*

ID  CLASS WEIGHT  TYPE NAME STATUS REWEIGHT PRI-AFF
 -1   0.08189 root default
 -3   0.02730 host vm1
  0   hdd 0.02730 osd.0 up  1.0 1.0
-10   0.02730 host vm2
  1   hdd 0.02730 osd.1 up  1.0 0.5
 -5   0.02730 host vm3
  2   hdd 0.02730 osd.2 up  1.0 0.5
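
For reference, PRI-AFF values like the above would be set with something
along these lines:

ceph osd primary-affinity osd.1 0.5
ceph osd primary-affinity osd.2 0.5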

v

On Tue, Nov 13, 2018 at 4:25 PM Jean-Charles Lopez 
wrote:

> Hi Vlad,
>
> No need for a specific CRUSH map configuration. I’d suggest you use the
> primary-affinity setting on the OSD so that only the OSDs that are close to
> your read point are selected as primary.
>
> See https://ceph.com/geen-categorie/ceph-primary-affinity/ for information
>
> Just set the primary affinity of all the OSDs in building 2 to 0.
>
> Only the OSDs in building 1 should then be used as primary OSDs.
>
> BR
> JC
>
> On Nov 13, 2018, at 12:19, Vlad Kopylov  wrote:
>
> Or is it possible to mount one OSD directly for read file access?
>
> v
>
> On Sun, Nov 11, 2018 at 1:47 PM Vlad Kopylov  wrote:
>
>> Maybe it is possible if done via gateway-nfs export?
>> Do the gateway settings allow read osd selection?
>>
>> v
>>
>> On Sun, Nov 11, 2018 at 1:01 AM Martin Verges 
>> wrote:
>>
>>> Hello Vlad,
>>>
>>> If you want to read from the same data, then it is not possible (as far
>>> as I know).
>>>
>>> --
>>> Martin Verges
>>> Managing director
>>>
>>> Mobile: +49 174 9335695
>>> E-Mail: martin.ver...@croit.io
>>> Chat: https://t.me/MartinVerges
>>>
>>> croit GmbH, Freseniusstr. 31h, 81247 Munich
>>> CEO: Martin Verges - VAT-ID: DE310638492
>>> Com. register: Amtsgericht Munich HRB 231263
>>>
>>> Web: https://croit.io
>>> YouTube: https://goo.gl/PGE1Bx
>>>
>>> On Sat, Nov 10, 2018, 03:47 Vlad Kopylov wrote:
>>>
 Maybe I missed something but FS is explicitly selecting pools to put
 files and metadata in, like I did below.
 So if I create new pools - data in them will be different. If I apply
 the rule dc1_primary to the cfs_data pool, and a client from dc3 connects to fs
 t01 - it will start using dc1 hosts


 ceph osd pool create cfs_data 100
 ceph osd pool create cfs_meta 100
 ceph fs new t01 cfs_data cfs_meta
 sudo mount -t ceph ceph1:6789:/ /mnt/t01 -o
 name=admin,secretfile=/home/mciadmin/admin.secret

 rule dc1_primary {
 id 1
 type replicated
 min_size 1
 max_size 10
 step take dc1
 step chooseleaf firstn 1 type host
 step emit
 step take dc2
 step chooseleaf firstn -2 type host
 step emit
 step take dc3
 step chooseleaf firstn -2 type host
 step emit
 }

 On Fri, Nov 9, 2018 at 9:32 PM Vlad Kopylov  wrote:

 Just to confirm - it will still populate 3 copies in each datacenter?
 I thought this map was to select where to write to; I guess it does write
 replication on the back end.

 I thought pools are completely separate and clients would not see each
 other's data?
>
> Thank you Martin!
>
>
>
>
> On Fri, Nov 9, 2018 at 2:10 PM Martin Verges 
> wrote:
>
>> Hello Vlad,
>>
>> you can generate something like this:
>>
>> rule dc1_primary_dc2_secondary {
>> id 1
>> type replicated
>> min_size 1
>> max_size 10
>> step take dc1
>> step chooseleaf firstn 1 type host
>> step emit
>> step take dc2
>> step chooseleaf firstn 1 type host
>> step emit
>> step take dc3
>> step chooseleaf firstn -2 type host
>> step emit
>> }
>>
>> rule dc2_primary_dc1_secondary {
>> id 2
>> type replicated
>> min_size 1
>> max_size 10
>> step take dc1
>> step chooseleaf firstn 1 type host
>> step emit
>> step take dc2
>> step chooseleaf firstn 1 type host
>> step emit
>> step take dc3
>> step chooseleaf firstn -2 type host
>> step emit
>> }
>>
>> After you added such crush rules, you can configure the pools:
>>
>> ~ $ ceph osd pool set <pool> crush_ruleset 1
>> ~ $ ceph osd pool set <pool> crush_ruleset 2
>>
>> Now you place your workload from dc1 to the dc1 pool, and workload
>> from dc2 to the dc2 pool. You could also use HDD with SSD journal (if
>> your workload isn't that write intensive) and save some money in dc3
>> as your client would always read from an SSD and write to Hybrid.

Re: [ceph-users] read performance, separate client CRUSH maps or limit osd read access from each client

2018-11-13 Thread Jean-Charles Lopez
Hi Vlad,

No need for a specific CRUSH map configuration. I’d suggest you use the 
primary-affinity setting on the OSD so that only the OSDs that are close to 
your read point are selected as primary.

See https://ceph.com/geen-categorie/ceph-primary-affinity/ for information

Just set the primary affinity of all the OSDs in building 2 to 0.

Only the OSDs in building 1 should then be used as primary OSDs.
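
For example (a sketch, assuming building 2 holds osd.1 and osd.2):

ceph osd primary-affinity osd.1 0
ceph osd primary-affinity osd.2 0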

BR
JC

> On Nov 13, 2018, at 12:19, Vlad Kopylov  wrote:
> 
> Or is it possible to mount one OSD directly for read file access?
> 
> v
> 
> On Sun, Nov 11, 2018 at 1:47 PM Vlad Kopylov wrote:
> Maybe it is possible if done via gateway-nfs export?
> Do the gateway settings allow read osd selection?
> 
> v
> 
> On Sun, Nov 11, 2018 at 1:01 AM Martin Verges wrote:
> Hello Vlad,
> 
> If you want to read from the same data, then it is not possible (as far as I 
> know).
> 
> --
> Martin Verges
> Managing director
> 
> Mobile: +49 174 9335695
> E-Mail: martin.ver...@croit.io 
> Chat: https://t.me/MartinVerges 
> 
> croit GmbH, Freseniusstr. 31h, 81247 Munich
> CEO: Martin Verges - VAT-ID: DE310638492
> Com. register: Amtsgericht Munich HRB 231263
> 
> Web: https://croit.io 
> YouTube: https://goo.gl/PGE1Bx 
> On Sat, Nov 10, 2018, 03:47 Vlad Kopylov wrote:
> Maybe I missed something but FS is explicitly selecting pools to put files 
> and metadata in, like I did below.
> So if I create new pools - data in them will be different. If I apply the 
> rule dc1_primary to the cfs_data pool, and a client from dc3 connects to fs t01 - 
> it will start using dc1 hosts
> 
> 
> ceph osd pool create cfs_data 100
> ceph osd pool create cfs_meta 100
> ceph fs new t01 cfs_data cfs_meta
> sudo mount -t ceph ceph1:6789:/ /mnt/t01 -o 
> name=admin,secretfile=/home/mciadmin/admin.secret
> 
> rule dc1_primary {
> id 1
> type replicated
> min_size 1
> max_size 10
> step take dc1
> step chooseleaf firstn 1 type host
> step emit
> step take dc2
> step chooseleaf firstn -2 type host
> step emit
> step take dc3
> step chooseleaf firstn -2 type host
> step emit
> }
> 
> On Fri, Nov 9, 2018 at 9:32 PM Vlad Kopylov wrote:
> Just to confirm - it will still populate 3 copies in each datacenter?
> I thought this map was to select where to write to; I guess it does write 
> replication on the back end.
> 
> I thought pools are completely separate and clients would not see each other's 
> data?
> 
> Thank you Martin!
> 
> 
> 
> 
> On Fri, Nov 9, 2018 at 2:10 PM Martin Verges wrote:
> Hello Vlad,
> 
> you can generate something like this:
> 
> rule dc1_primary_dc2_secondary {
> id 1
> type replicated
> min_size 1
> max_size 10
> step take dc1
> step chooseleaf firstn 1 type host
> step emit
> step take dc2
> step chooseleaf firstn 1 type host
> step emit
> step take dc3
> step chooseleaf firstn -2 type host
> step emit
> }
> 
> rule dc2_primary_dc1_secondary {
> id 2
> type replicated
> min_size 1
> max_size 10
> step take dc1
> step chooseleaf firstn 1 type host
> step emit
> step take dc2
> step chooseleaf firstn 1 type host
> step emit
> step take dc3
> step chooseleaf firstn -2 type host
> step emit
> }
> 
> After you added such crush rules, you can configure the pools:
> 
> ~ $ ceph osd pool set <pool> crush_ruleset 1
> ~ $ ceph osd pool set <pool> crush_ruleset 2
> 
> Now you place your workload from dc1 to the dc1 pool, and workload
> from dc2 to the dc2 pool. You could also use HDD with SSD journal (if
> your workload isn't that write intensive) and save some money in dc3
> as your client would always read from an SSD and write to Hybrid.
> 
> Btw. all this could be done with a few simple clicks through our web
> frontend. Even if you want to export it via CephFS / NFS / .. it is
> possible to set it on a per folder level. Feel free to take a look at
> https://www.youtube.com/watch?v=V33f7ipw9d4 to see how easy it could
> be.
> 
> --
> Martin Verges
> Managing director
> 
> Mobile: +49 174 9335695
> E-Mail: martin.ver...@croit.io 
> Chat: https://t.me/MartinVerges 
> 
> croit GmbH, Freseniusstr. 31h, 81247 Munich
> CEO: Martin Verges - VAT-ID: DE310638492
> Com. register: Amtsgericht Munich HRB 231263
> 
> Web: https://croit.io 
> YouTube: https://goo.gl/PGE1Bx 
> 
> 
> 2018-11-09 17:35 GMT+01:00 Vlad Kopylov :
> > 

Re: [ceph-users] read performance, separate client CRUSH maps or limit osd read access from each client

2018-11-13 Thread Vlad Kopylov
Or is it possible to mount one OSD directly for read file access?

v

On Sun, Nov 11, 2018 at 1:47 PM Vlad Kopylov  wrote:

> Maybe it is possible if done via gateway-nfs export?
> Do the gateway settings allow read osd selection?
>
> v
>
> On Sun, Nov 11, 2018 at 1:01 AM Martin Verges 
> wrote:
>
>> Hello Vlad,
>>
>> If you want to read from the same data, then it is not possible (as far
>> as I know).
>>
>> --
>> Martin Verges
>> Managing director
>>
>> Mobile: +49 174 9335695
>> E-Mail: martin.ver...@croit.io
>> Chat: https://t.me/MartinVerges
>>
>> croit GmbH, Freseniusstr. 31h, 81247 Munich
>> CEO: Martin Verges - VAT-ID: DE310638492
>> Com. register: Amtsgericht Munich HRB 231263
>>
>> Web: https://croit.io
>> YouTube: https://goo.gl/PGE1Bx
>>
>> On Sat, Nov 10, 2018, 03:47 Vlad Kopylov wrote:
>>
>>> Maybe I missed something but FS is explicitly selecting pools to put
>>> files and metadata in, like I did below.
>>> So if I create new pools - data in them will be different. If I apply
>>> the rule dc1_primary to the cfs_data pool, and a client from dc3 connects to fs
>>> t01 - it will start using dc1 hosts
>>>
>>>
>>> ceph osd pool create cfs_data 100
>>> ceph osd pool create cfs_meta 100
>>> ceph fs new t01 cfs_data cfs_meta
>>> sudo mount -t ceph ceph1:6789:/ /mnt/t01 -o
>>> name=admin,secretfile=/home/mciadmin/admin.secret
>>>
>>> rule dc1_primary {
>>> id 1
>>> type replicated
>>> min_size 1
>>> max_size 10
>>> step take dc1
>>> step chooseleaf firstn 1 type host
>>> step emit
>>> step take dc2
>>> step chooseleaf firstn -2 type host
>>> step emit
>>> step take dc3
>>> step chooseleaf firstn -2 type host
>>> step emit
>>> }
>>>
>>> On Fri, Nov 9, 2018 at 9:32 PM Vlad Kopylov  wrote:
>>>
 Just to confirm - it will still populate 3 copies in each datacenter?
 I thought this map was to select where to write to; I guess it does write
 replication on the back end.

 I thought pools are completely separate and clients would not see each
 other's data?

 Thank you Martin!




 On Fri, Nov 9, 2018 at 2:10 PM Martin Verges 
 wrote:

> Hello Vlad,
>
> you can generate something like this:
>
> rule dc1_primary_dc2_secondary {
> id 1
> type replicated
> min_size 1
> max_size 10
> step take dc1
> step chooseleaf firstn 1 type host
> step emit
> step take dc2
> step chooseleaf firstn 1 type host
> step emit
> step take dc3
> step chooseleaf firstn -2 type host
> step emit
> }
>
> rule dc2_primary_dc1_secondary {
> id 2
> type replicated
> min_size 1
> max_size 10
> step take dc1
> step chooseleaf firstn 1 type host
> step emit
> step take dc2
> step chooseleaf firstn 1 type host
> step emit
> step take dc3
> step chooseleaf firstn -2 type host
> step emit
> }
>
> After you added such crush rules, you can configure the pools:
>
> ~ $ ceph osd pool set <pool> crush_ruleset 1
> ~ $ ceph osd pool set <pool> crush_ruleset 2
>
> Now you place your workload from dc1 to the dc1 pool, and workload
> from dc2 to the dc2 pool. You could also use HDD with SSD journal (if
> your workload isn't that write intensive) and save some money in dc3
> as your client would always read from an SSD and write to Hybrid.
>
> Btw. all this could be done with a few simple clicks through our web
> frontend. Even if you want to export it via CephFS / NFS / .. it is
> possible to set it on a per folder level. Feel free to take a look at
> https://www.youtube.com/watch?v=V33f7ipw9d4 to see how easy it could
> be.
>
> --
> Martin Verges
> Managing director
>
> Mobile: +49 174 9335695
> E-Mail: martin.ver...@croit.io
> Chat: https://t.me/MartinVerges
>
> croit GmbH, Freseniusstr. 31h, 81247 Munich
> CEO: Martin Verges - VAT-ID: DE310638492
> Com. register: Amtsgericht Munich HRB 231263
>
> Web: https://croit.io
> YouTube: https://goo.gl/PGE1Bx
>
>
> 2018-11-09 17:35 GMT+01:00 Vlad Kopylov :
> > Please disregard pg status, one of test vms was down for some time
> it is
> > healing.
> > Question only how to make it read from proper datacenter
> >
> > If you have an example.
> >
> > Thanks
> >
> >
> > On Fri, Nov 9, 2018 at 11:28 AM Vlad Kopylov 
> wrote:
> >>
> >> Martin, thank you for the tip.
> >> Googling ceph crush rule examples doesn't give much on rules, just
> >> static placement of buckets.
> >> This all seems to be about placing data, not about giving a client in
> >> a specific datacenter the proper read osd.

Re: [ceph-users] read performance, separate client CRUSH maps or limit osd read access from each client

2018-11-11 Thread Vlad Kopylov
Maybe it is possible if done via gateway-nfs export?
Do the gateway settings allow read osd selection?

v

On Sun, Nov 11, 2018 at 1:01 AM Martin Verges 
wrote:

> Hello Vlad,
>
> If you want to read from the same data, then it is not possible (as far as I
> know).
>
> --
> Martin Verges
> Managing director
>
> Mobile: +49 174 9335695
> E-Mail: martin.ver...@croit.io
> Chat: https://t.me/MartinVerges
>
> croit GmbH, Freseniusstr. 31h, 81247 Munich
> CEO: Martin Verges - VAT-ID: DE310638492
> Com. register: Amtsgericht Munich HRB 231263
>
> Web: https://croit.io
> YouTube: https://goo.gl/PGE1Bx
>
> On Sat, Nov 10, 2018, 03:47 Vlad Kopylov wrote:
>
>> Maybe I missed something but FS is explicitly selecting pools to put
>> files and metadata in, like I did below.
>> So if I create new pools - data in them will be different. If I apply the
>> rule dc1_primary to the cfs_data pool, and a client from dc3 connects to fs t01 -
>> it will start using dc1 hosts
>>
>>
>> ceph osd pool create cfs_data 100
>> ceph osd pool create cfs_meta 100
>> ceph fs new t01 cfs_data cfs_meta
>> sudo mount -t ceph ceph1:6789:/ /mnt/t01 -o
>> name=admin,secretfile=/home/mciadmin/admin.secret
>>
>> rule dc1_primary {
>> id 1
>> type replicated
>> min_size 1
>> max_size 10
>> step take dc1
>> step chooseleaf firstn 1 type host
>> step emit
>> step take dc2
>> step chooseleaf firstn -2 type host
>> step emit
>> step take dc3
>> step chooseleaf firstn -2 type host
>> step emit
>> }
>>
>> On Fri, Nov 9, 2018 at 9:32 PM Vlad Kopylov  wrote:
>>
>>> Just to confirm - it will still populate 3 copies in each datacenter?
>>> I thought this map was to select where to write to; I guess it does write
>>> replication on the back end.
>>>
>>> I thought pools are completely separate and clients would not see each
>>> other's data?
>>>
>>> Thank you Martin!
>>>
>>>
>>>
>>>
>>> On Fri, Nov 9, 2018 at 2:10 PM Martin Verges 
>>> wrote:
>>>
 Hello Vlad,

 you can generate something like this:

 rule dc1_primary_dc2_secondary {
 id 1
 type replicated
 min_size 1
 max_size 10
 step take dc1
 step chooseleaf firstn 1 type host
 step emit
 step take dc2
 step chooseleaf firstn 1 type host
 step emit
 step take dc3
 step chooseleaf firstn -2 type host
 step emit
 }

 rule dc2_primary_dc1_secondary {
 id 2
 type replicated
 min_size 1
 max_size 10
 step take dc1
 step chooseleaf firstn 1 type host
 step emit
 step take dc2
 step chooseleaf firstn 1 type host
 step emit
 step take dc3
 step chooseleaf firstn -2 type host
 step emit
 }

 After you added such crush rules, you can configure the pools:

 ~ $ ceph osd pool set <pool> crush_ruleset 1
 ~ $ ceph osd pool set <pool> crush_ruleset 2

 Now you place your workload from dc1 to the dc1 pool, and workload
 from dc2 to the dc2 pool. You could also use HDD with SSD journal (if
 your workload isn't that write intensive) and save some money in dc3
 as your client would always read from an SSD and write to Hybrid.

 Btw. all this could be done with a few simple clicks through our web
 frontend. Even if you want to export it via CephFS / NFS / .. it is
 possible to set it on a per folder level. Feel free to take a look at
 https://www.youtube.com/watch?v=V33f7ipw9d4 to see how easy it could
 be.

 --
 Martin Verges
 Managing director

 Mobile: +49 174 9335695
 E-Mail: martin.ver...@croit.io
 Chat: https://t.me/MartinVerges

 croit GmbH, Freseniusstr. 31h, 81247 Munich
 CEO: Martin Verges - VAT-ID: DE310638492
 Com. register: Amtsgericht Munich HRB 231263

 Web: https://croit.io
 YouTube: https://goo.gl/PGE1Bx


 2018-11-09 17:35 GMT+01:00 Vlad Kopylov :
 > Please disregard pg status, one of test vms was down for some time it
 is
 > healing.
 > Question only how to make it read from proper datacenter
 >
 > If you have an example.
 >
 > Thanks
 >
 >
 > On Fri, Nov 9, 2018 at 11:28 AM Vlad Kopylov 
 wrote:
 >>
 >> Martin, thank you for the tip.
 >> Googling ceph crush rule examples doesn't give much on rules, just
 >> static placement of buckets.
 >> This all seems to be about placing data, not about giving a client in
 >> a specific datacenter the proper read osd.
 >>
 >> Maybe something is wrong with the placement groups?
 >>
 >> I added datacenter dc1 dc2 dc3
 >> Current replicated_rule is
 >>
 >> rule replicated_rule {
 >> id 0
 >> type replicated

Re: [ceph-users] read performance, separate client CRUSH maps or limit osd read access from each client

2018-11-09 Thread Martin Verges
Hello Vlad,

you can generate something like this:

rule dc1_primary_dc2_secondary {
id 1
type replicated
min_size 1
max_size 10
step take dc1
step chooseleaf firstn 1 type host
step emit
step take dc2
step chooseleaf firstn 1 type host
step emit
step take dc3
step chooseleaf firstn -2 type host
step emit
}

rule dc2_primary_dc1_secondary {
id 2
type replicated
min_size 1
max_size 10
step take dc1
step chooseleaf firstn 1 type host
step emit
step take dc2
step chooseleaf firstn 1 type host
step emit
step take dc3
step chooseleaf firstn -2 type host
step emit
}

After you added such crush rules, you can configure the pools:

~ $ ceph osd pool set <pool> crush_ruleset 1
~ $ ceph osd pool set <pool> crush_ruleset 2
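
On Luminous and newer the option is called crush_rule and takes the rule
name instead of the numeric id; a sketch of the equivalent:

~ $ ceph osd pool set <pool> crush_rule dc1_primary_dc2_secondary
~ $ ceph osd pool set <pool> crush_rule dc2_primary_dc1_secondary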

Now you place your workload from dc1 to the dc1 pool, and workload
from dc2 to the dc2 pool. You could also use HDD with SSD journal (if
your workload isn't that write intensive) and save some money in dc3
as your client would always read from an SSD and write to Hybrid.

Btw. all this could be done with a few simple clicks through our web
frontend. Even if you want to export it via CephFS / NFS / .. it is
possible to set it on a per folder level. Feel free to take a look at
https://www.youtube.com/watch?v=V33f7ipw9d4 to see how easy it could
be.
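
With plain CephFS the per-folder part can be done with file layouts; a
sketch, assuming an extra data pool named dc1pool and an already mounted
fs (both names are placeholders):

~ $ ceph fs add_data_pool <fsname> dc1pool
~ $ setfattr -n ceph.dir.layout.pool -v dc1pool /mnt/cephfs/dc1data

New files created under that directory then go to dc1pool.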

--
Martin Verges
Managing director

Mobile: +49 174 9335695
E-Mail: martin.ver...@croit.io
Chat: https://t.me/MartinVerges

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263

Web: https://croit.io
YouTube: https://goo.gl/PGE1Bx


2018-11-09 17:35 GMT+01:00 Vlad Kopylov :
> Please disregard pg status, one of test vms was down for some time it is
> healing.
> Question only how to make it read from proper datacenter
>
> If you have an example.
>
> Thanks
>
>
> On Fri, Nov 9, 2018 at 11:28 AM Vlad Kopylov  wrote:
>>
>> Martin, thank you for the tip.
>> Googling ceph crush rule examples doesn't give much on rules, just static
>> placement of buckets.
>> This all seems to be about placing data, not about giving a client in a
>> specific datacenter the proper read osd.
>>
>> Maybe something is wrong with the placement groups?
>>
>> I added datacenter dc1 dc2 dc3
>> Current replicated_rule is
>>
>> rule replicated_rule {
>> id 0
>> type replicated
>> min_size 1
>> max_size 10
>> step take default
>> step chooseleaf firstn 0 type host
>> step emit
>> }
>>
>> # buckets
>> host ceph1 {
>> id -3 # do not change unnecessarily
>> id -2 class ssd # do not change unnecessarily
>> # weight 1.000
>> alg straw2
>> hash 0 # rjenkins1
>> item osd.0 weight 1.000
>> }
>> datacenter dc1 {
>> id -9 # do not change unnecessarily
>> id -4 class ssd # do not change unnecessarily
>> # weight 1.000
>> alg straw2
>> hash 0 # rjenkins1
>> item ceph1 weight 1.000
>> }
>> host ceph2 {
>> id -5 # do not change unnecessarily
>> id -6 class ssd # do not change unnecessarily
>> # weight 1.000
>> alg straw2
>> hash 0 # rjenkins1
>> item osd.1 weight 1.000
>> }
>> datacenter dc2 {
>> id -10 # do not change unnecessarily
>> id -8 class ssd # do not change unnecessarily
>> # weight 1.000
>> alg straw2
>> hash 0 # rjenkins1
>> item ceph2 weight 1.000
>> }
>> host ceph3 {
>> id -7 # do not change unnecessarily
>> id -12 class ssd # do not change unnecessarily
>> # weight 1.000
>> alg straw2
>> hash 0 # rjenkins1
>> item osd.2 weight 1.000
>> }
>> datacenter dc3 {
>> id -11 # do not change unnecessarily
>> id -13 class ssd # do not change unnecessarily
>> # weight 1.000
>> alg straw2
>> hash 0 # rjenkins1
>> item ceph3 weight 1.000
>> }
>> root default {
>> id -1 # do not change unnecessarily
>> id -14 class ssd # do not change unnecessarily
>> # weight 3.000
>> alg straw2
>> hash 0 # rjenkins1
>> item dc1 weight 1.000
>> item dc2 weight 1.000
>> item dc3 weight 1.000
>> }
>>
>>
>> #ceph pg dump
>> dumped all
>> version 29433
>> stamp 2018-11-09 11:23:44.510872
>> last_osdmap_epoch 0
>> last_pg_scan 0
>> PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTESLOG
>> DISK_LOG STATE  STATE_STAMPVERSION
>> REPORTED UP  UP_PRIMARY ACTING  ACTING_PRIMARY LAST_SCRUB SCRUB_STAMP
>> LAST_DEEP_SCRUB DEEP_SCRUB_STAMP   SNAPTRIMQ_LEN
>> 1.5f  0  00 0   00
>> 00   active+clean 2018-11-09 04:35:32.320607  0'0
>> 544:1317 [0,2,1]  0 [0,2,1]  00'0 2018-11-09
>> 04:35:32.320561 0'0 2018-11-04 11:55:54.756115 0
>> 2.5c143  0  143 0   0 19490267
>> 461  461 active+undersized+degraded 2018-11-08 19:02:03.873218  508'461
>> 544:2100   [2,1]  2   [2,1]  2290'380 

Re: [ceph-users] read performance, separate client CRUSH maps or limit osd read access from each client

2018-11-09 Thread Vlad Kopylov
Please disregard the pg status - one of the test vms was down for some time
and it is healing.
The question is only how to make it read from the proper datacenter.

If you have an example.

Thanks


On Fri, Nov 9, 2018 at 11:28 AM Vlad Kopylov  wrote:

> Martin, thank you for the tip.
> Googling ceph crush rule examples doesn't give much on rules, just static
> placement of buckets.
> This all seems to be about placing data, not about giving a client in a
> specific datacenter the proper read osd.
>
> Maybe something is wrong with the placement groups?
>
> I added datacenter dc1 dc2 dc3
> Current replicated_rule is
>
> rule replicated_rule {
> id 0
>   type replicated
> min_size 1
> max_size 10
> step take default
> step chooseleaf firstn 0 type host
> step emit
> }
>
> # buckets
> host ceph1 {
>   id -3   # do not change unnecessarily
>   id -2 class ssd # do not change unnecessarily
>   # weight 1.000
>   alg straw2
>   hash 0  # rjenkins1
>   item osd.0 weight 1.000
> }
> datacenter dc1 {
>   id -9   # do not change unnecessarily
>   id -4 class ssd # do not change unnecessarily
>   # weight 1.000
>   alg straw2
>   hash 0  # rjenkins1
>   item ceph1 weight 1.000
> }
> host ceph2 {
>   id -5   # do not change unnecessarily
>   id -6 class ssd # do not change unnecessarily
>   # weight 1.000
>   alg straw2
>   hash 0  # rjenkins1
>   item osd.1 weight 1.000
> }
> datacenter dc2 {
>   id -10  # do not change unnecessarily
>   id -8 class ssd # do not change unnecessarily
>   # weight 1.000
>   alg straw2
>   hash 0  # rjenkins1
>   item ceph2 weight 1.000
> }
> host ceph3 {
>   id -7   # do not change unnecessarily
>   id -12 class ssd# do not change unnecessarily
>   # weight 1.000
>   alg straw2
>   hash 0  # rjenkins1
>   item osd.2 weight 1.000
> }
> datacenter dc3 {
>   id -11  # do not change unnecessarily
>   id -13 class ssd# do not change unnecessarily
>   # weight 1.000
>   alg straw2
>   hash 0  # rjenkins1
>   item ceph3 weight 1.000
> }
> root default {
>   id -1   # do not change unnecessarily
>   id -14 class ssd# do not change unnecessarily
>   # weight 3.000
>   alg straw2
>   hash 0  # rjenkins1
>   item dc1 weight 1.000
>   item dc2 weight 1.000
>   item dc3 weight 1.000
> }
>
>
> #ceph pg dump
> dumped all
> version 29433
> stamp 2018-11-09 11:23:44.510872
> last_osdmap_epoch 0
> last_pg_scan 0
> PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTESLOG  
> DISK_LOG STATE  STATE_STAMPVERSION  
> REPORTED UP  UP_PRIMARY ACTING  ACTING_PRIMARY LAST_SCRUB SCRUB_STAMP 
>LAST_DEEP_SCRUB DEEP_SCRUB_STAMP   SNAPTRIMQ_LEN
> 1.5f  0  00 0   000   
>  0   active+clean 2018-11-09 04:35:32.320607  0'0 
> 544:1317 [0,2,1]  0 [0,2,1]  00'0 2018-11-09 
> 04:35:32.320561 0'0 2018-11-04 11:55:54.756115 0
> 2.5c143  0  143 0   0 19490267  461   
>461 active+undersized+degraded 2018-11-08 19:02:03.873218  508'461 
> 544:2100   [2,1]  2   [2,1]  2290'380 2018-11-07 
> 18:58:43.043719  64'120 2018-11-05 14:21:49.256324 0
> .
> sum 15239 0 2053 2659 0 2157615019 58286 58286
> OSD_STAT USEDAVAIL  TOTAL  HB_PEERS PG_SUM PRIMARY_PG_SUM
> 23.7 GiB 28 GiB 32 GiB[0,1]200 73
> 13.7 GiB 28 GiB 32 GiB[0,2]200 58
> 03.7 GiB 28 GiB 32 GiB[1,2]173 69
> sum   11 GiB 85 GiB 96 GiB
>
> #ceph pg map 2.5c
> osdmap e545 pg 2.5c (2.5c) -> up [2,1] acting [2,1]
>
> #pg map 1.5f
> osdmap e547 pg 1.5f (1.5f) -> up [0,2,1] acting [0,2,1]
>
>
> On Fri, Nov 9, 2018 at 2:21 AM Martin Verges 
> wrote:
>
>> Hello Vlad,
>>
>> Ceph clients connect to the primary OSD of each PG. If you create a
>> crush rule for building1 and one for building2 that takes an OSD from
>> the same building as the first one, your reads to the pool will always
>> be in the same building (if the cluster is healthy) and only write
>> requests get replicated to the other building.
>>
>> --
>> Martin Verges
>> Managing director
>>
>> Mobile: +49 174 9335695
>> E-Mail: martin.ver...@croit.io
>> Chat: https://t.me/MartinVerges
>>
>> croit GmbH, Freseniusstr. 31h, 81247 Munich
>> CEO: Martin Verges - VAT-ID: DE310638492
>> Com. register: Amtsgericht Munich HRB 231263
>>
>> Web: https://croit.io
>> YouTube: https://goo.gl/PGE1Bx
>>
>>
>> 2018-11-09 4:54 GMT+01:00 Vlad Kopylov :
>> > I am trying to test replicated ceph with servers in different
>> buildings, and
>> > I have a read problem.

Re: [ceph-users] read performance, separate client CRUSH maps or limit osd read access from each client

2018-11-09 Thread Vlad Kopylov
Martin, thank you for the tip.
Googling ceph crush rule examples doesn't give much on rules, just static
placement of buckets.
This all seems to be about placing data, not about giving a client in a
specific datacenter the proper read osd.

Maybe something is wrong with the placement groups?

I added datacenter dc1 dc2 dc3
Current replicated_rule is

rule replicated_rule {
id 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}

# buckets
host ceph1 {
id -3   # do not change unnecessarily
id -2 class ssd # do not change unnecessarily
# weight 1.000
alg straw2
hash 0  # rjenkins1
item osd.0 weight 1.000
}
datacenter dc1 {
id -9   # do not change unnecessarily
id -4 class ssd # do not change unnecessarily
# weight 1.000
alg straw2
hash 0  # rjenkins1
item ceph1 weight 1.000
}
host ceph2 {
id -5   # do not change unnecessarily
id -6 class ssd # do not change unnecessarily
# weight 1.000
alg straw2
hash 0  # rjenkins1
item osd.1 weight 1.000
}
datacenter dc2 {
id -10  # do not change unnecessarily
id -8 class ssd # do not change unnecessarily
# weight 1.000
alg straw2
hash 0  # rjenkins1
item ceph2 weight 1.000
}
host ceph3 {
id -7   # do not change unnecessarily
id -12 class ssd# do not change unnecessarily
# weight 1.000
alg straw2
hash 0  # rjenkins1
item osd.2 weight 1.000
}
datacenter dc3 {
id -11  # do not change unnecessarily
id -13 class ssd# do not change unnecessarily
# weight 1.000
alg straw2
hash 0  # rjenkins1
item ceph3 weight 1.000
}
root default {
id -1   # do not change unnecessarily
id -14 class ssd# do not change unnecessarily
# weight 3.000
alg straw2
hash 0  # rjenkins1
item dc1 weight 1.000
item dc2 weight 1.000
item dc3 weight 1.000
}
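
For reference, the standard decompile/edit/recompile cycle to get a map
like the above in and out of the cluster:

#ceph osd getcrushmap -o crush.bin
#crushtool -d crush.bin -o crush.txt
... edit crush.txt ...
#crushtool -c crush.txt -o crush.new
#ceph osd setcrushmap -i crush.new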


#ceph pg dump
dumped all
version 29433
stamp 2018-11-09 11:23:44.510872
last_osdmap_epoch 0
last_pg_scan 0
PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES
LOG  DISK_LOG STATE  STATE_STAMP
VERSION  REPORTED UP  UP_PRIMARY ACTING  ACTING_PRIMARY LAST_SCRUB
SCRUB_STAMPLAST_DEEP_SCRUB DEEP_SCRUB_STAMP
SNAPTRIMQ_LEN
1.5f  0  00 0   00
   00   active+clean 2018-11-09 04:35:32.320607
  0'0 544:1317 [0,2,1]  0 [0,2,1]  00'0
2018-11-09 04:35:32.320561 0'0 2018-11-04 11:55:54.756115
   0
2.5c143  0  143 0   0 19490267
 461  461 active+undersized+degraded 2018-11-08 19:02:03.873218
508'461 544:2100   [2,1]  2   [2,1]  2290'380
2018-11-07 18:58:43.043719  64'120 2018-11-05 14:21:49.256324
   0
.
sum 15239 0 2053 2659 0 2157615019 58286 58286
OSD_STAT USEDAVAIL  TOTAL  HB_PEERS PG_SUM PRIMARY_PG_SUM
23.7 GiB 28 GiB 32 GiB[0,1]200 73
13.7 GiB 28 GiB 32 GiB[0,2]200 58
03.7 GiB 28 GiB 32 GiB[1,2]173 69
sum   11 GiB 85 GiB 96 GiB

#ceph pg map 2.5c
osdmap e545 pg 2.5c (2.5c) -> up [2,1] acting [2,1]

#pg map 1.5f
osdmap e547 pg 1.5f (1.5f) -> up [0,2,1] acting [0,2,1]


On Fri, Nov 9, 2018 at 2:21 AM Martin Verges  wrote:

> Hello Vlad,
>
> Ceph clients connect to the primary OSD of each PG. If you create a
> crush rule for building1 and one for building2 that takes a OSD from
> the same building as the first one, your reads to the pool will always
> be on the same building (if the cluster is healthy) and only write
> request get replicated to the other building.
>
> --
> Martin Verges
> Managing director
>
> Mobile: +49 174 9335695
> E-Mail: martin.ver...@croit.io
> Chat: https://t.me/MartinVerges
>
> croit GmbH, Freseniusstr. 31h, 81247 Munich
> CEO: Martin Verges - VAT-ID: DE310638492
> Com. register: Amtsgericht Munich HRB 231263
>
> Web: https://croit.io
> YouTube: https://goo.gl/PGE1Bx
>
>
> 2018-11-09 4:54 GMT+01:00 Vlad Kopylov :
> > I am trying to test replicated ceph with servers in different buildings,
> and
> > I have a read problem.
> > Reads from one building go to osd in another building and vice versa,
> making
> > reads slower than writes! Making reads as slow as the slowest node.
> >
> > Is there a way to
> > - disable parallel read (so it reads only from the same osd node where
> mon
> > is);
> > - or give each client read restriction per osd?
> > - or maybe strictly specify read osd on mount;
> > - or have node read delay cap (for example if node time out is larger
> > than 2 ms then do not use such node for read as other replicas are
> > available).

Re: [ceph-users] read performance, separate client CRUSH maps or limit osd read access from each client

2018-11-08 Thread Martin Verges
Hello Vlad,

Ceph clients connect to the primary OSD of each PG. If you create a
crush rule for building1 and one for building2 that takes an OSD from
the same building as the first one, your reads to the pool will always
be in the same building (if the cluster is healthy) and only write
requests get replicated to the other building.
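
A minimal sketch of such a rule, assuming CRUSH buckets named building1
and building2 already exist in the map:

rule building1_primary {
id 1
type replicated
min_size 1
max_size 10
step take building1
step chooseleaf firstn 1 type host
step emit
step take building2
step chooseleaf firstn -1 type host
step emit
}

The "firstn 1" step makes a building1 OSD the primary; "firstn -1" fills
the remaining replicas from building2.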

--
Martin Verges
Managing director

Mobile: +49 174 9335695
E-Mail: martin.ver...@croit.io
Chat: https://t.me/MartinVerges

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263

Web: https://croit.io
YouTube: https://goo.gl/PGE1Bx


2018-11-09 4:54 GMT+01:00 Vlad Kopylov :
> I am trying to test replicated ceph with servers in different buildings, and
> I have a read problem.
> Reads from one building go to an osd in another building and vice versa, making
> reads slower than writes! Making reads as slow as the slowest node.
>
> Is there a way to
> - disable parallel read (so it reads only from the same osd node where mon
> is);
> - or give each client read restriction per osd?
> - or maybe strictly specify read osd on mount;
> - or have node read delay cap (for example if node time out is larger than 2
> ms then do not use such node for read as other replicas are available).
> - or the ability to place Clients on the Crush map - so it understands that
> an osd in, for example, the same data-center as the client has preference, and
> pulls data from it/them.
>
> Mounting with kernel client latest mimic.
>
> Thank you!
>
> Vlad
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com