Re: [ceph-users] CEPH Cluster Usage Discrepancy

2018-10-20 Thread Serkan Çoban
You have ~24.5M objects, not 2.45M.
Each object will eat at least 64KB of storage (the BlueStore minimum allocation size on HDDs), so 24.5M objects use roughly 1.5TB.
Add 3x replication to that and you get about 4.5TB.
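Roughly, assuming the default bluestore_min_alloc_size_hdd of 64KiB and the object count from your ceph df output:

# back-of-the-envelope only; 65536 = 64KiB min alloc, 3 = replica count
awk 'BEGIN { printf "%.2f TiB\n", 24550943 * 65536 * 3 / 2^40 }'
# prints ~4.39 TiB, the same ballpark as the 4.65TiB RAW USED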

On Sat, Oct 20, 2018 at 11:47 PM Waterbly, Dan  wrote:
>
> Hi Jakub,
>
> No, my setup seems to be the same as yours. Our system is mainly for 
> archiving loads of data. This data has to be stored forever and allow reads, 
> albeit seldom considering the number of objects we will store vs the number 
> of objects that ever will be requested.
>
> It just really seems odd that the metadata surrounding the 25M objects is so 
> high.
>
> We have 144 osds on 9 storage nodes. Perhaps it makes perfect sense but I’d 
> like to know why we are seeing what we are and how it all adds up.
>
> Thanks!
> Dan
>
> Get Outlook for iOS
>
>
>
> On Sat, Oct 20, 2018 at 12:36 PM -0700, "Jakub Jaszewski" 
>  wrote:
>
>> Hi Dan,
>>
>> Did you configure block.wal/block.db as separate devices/partition 
>> (osd_scenario: non-collocated or lvm for clusters installed using 
>> ceph-ansbile playbooks )?
>>
>> I run Ceph version 13.2.1 with non-collocated data.db and have the same 
>> situation - the sum of block.db partitions' size is displayed as RAW USED in 
>> ceph df.
>> Perhaps it is not the case for collocated block.db/wal.
>>
>> Jakub
>>
>> On Sat, Oct 20, 2018 at 8:34 PM Waterbly, Dan  
>> wrote:
>>>
>>> I get that, but isn’t 4TiB to track 2.45M objects excessive? These numbers 
>>> seem very high to me.
>>>
>>> Get Outlook for iOS
>>>
>>>
>>>
>>> On Sat, Oct 20, 2018 at 10:27 AM -0700, "Serkan Çoban" 
>>>  wrote:
>>>
 4.65TiB includes size of wal and db partitions too.
 On Sat, Oct 20, 2018 at 7:45 PM Waterbly, Dan  wrote:
 >
 > Hello,
 >
 >
 >
 > I have inserted 2.45M 1,000 byte objects into my cluster (radosgw, 3x 
 > replication).
 >
 >
 >
 > I am confused by the usage ceph df is reporting and am hoping someone 
 > can shed some light on this. Here is what I see when I run ceph df
 >
 >
 >
 > GLOBAL:
 >     SIZE     AVAIL    RAW USED  %RAW USED
 >     1.02PiB  1.02PiB  4.65TiB   0.44
 >
 > POOLS:
 >     NAME                 ID  USED     %USED  MAX AVAIL  OBJECTS
 >     .rgw.root            1   3.30KiB  0      330TiB     17
 >     .rgw.buckets.data    2   22.9GiB  0      330TiB     24550943
 >     default.rgw.control  3   0B       0      330TiB     8
 >     default.rgw.meta     4   373B     0      330TiB     3
 >     default.rgw.log      5   0B       0      330TiB     0
 >     .rgw.control         6   0B       0      330TiB     8
 >     .rgw.meta            7   2.18KiB  0      330TiB     12
 >     .rgw.log             8   0B       0      330TiB     194
 >     .rgw.buckets.index   9   0B       0      330TiB     2560
 >
 >
 >
 > Why does my bucket pool report usage of 22.9GiB but my cluster as a 
 > whole is reporting 4.65TiB? There is nothing else on this cluster as it 
 > was just installed and configured.
 >
 >
 >
 > Thank you for your help with this.
 >
 >
 >
 > -Dan
 >
 >
 >
 > Dan Waterbly | Senior Application Developer | 509.235.7500 x225 | 
 > dan.water...@sos.wa.gov
 >
 > WASHINGTON STATE ARCHIVES | DIGITAL ARCHIVES
 >
 >
 >
 > ___
 > ceph-users mailing list
 > ceph-users@lists.ceph.com
 > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Verifying the location of the wal

2018-10-20 Thread Serkan Çoban
ceph-bluestore-tool can show you the disk labels.
ceph-bluestore-tool show-label --dev /dev/sda1
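For example (the OSD id and paths below are placeholders, adjust to your deployment): the block.wal symlink only exists when the WAL has its own device, and show-label prints a "description" for each device ("main", "bluefs db", "bluefs wal") if I remember the output right:

ls -l /var/lib/ceph/osd/ceph-0/block.wal      # absent => the WAL lives with the DB (or main) device
ceph-bluestore-tool show-label --path /var/lib/ceph/osd/ceph-0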
On Sun, Oct 21, 2018 at 1:29 AM Robert Stanford  wrote:
>
>
>  An email from this list stated that the wal would be created in the same 
> place as the db, if the db were specified when running ceph-volume lvm 
> create, and the db were specified on that command line.  I followed those 
> instructions and like the other person writing to this list today, I was 
> surprised to find that my cluster usage was higher than the total of pools 
> (higher by an amount the same as all my wal sizes on each node combined).  
> This leads me to think my wal actually is on the data disk and not the ssd I 
> specified the db should go to.
>
>  How can I verify which disk the wal is on, from the command line?  I've 
> searched the net and not come up with anything.
>
>  Thanks and regards
>  R
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Verifying the location of the wal

2018-10-20 Thread Robert Stanford
 An email from this list stated that the wal would be created in the same
place as the db if the db were specified on the ceph-volume lvm create
command line.  I followed those instructions and, like the other person
writing to this list today, I was surprised to find that my cluster usage was
higher than the total of the pools (higher by an amount equal to all my wal
sizes on each node combined).
This leads me to think my wal actually is on the data disk and not the ssd
I specified the db should go to.

 How can I verify which disk the wal is on, from the command line?  I've
searched the net and not come up with anything.

 Thanks and regards
 R
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CEPH Cluster Usage Discrepancy

2018-10-20 Thread Waterbly, Dan
Hi Jakub,

No, my setup seems to be the same as yours. Our system is mainly for archiving
loads of data. This data has to be stored forever and must allow reads, albeit
seldom, considering the number of objects we will store vs. the number that
will ever be requested.

It just really seems odd that the metadata surrounding the 25M objects is so 
high.

We have 144 osds on 9 storage nodes. Perhaps it makes perfect sense but I’d 
like to know why we are seeing what we are and how it all adds up.

Thanks!
Dan

Get Outlook for iOS



On Sat, Oct 20, 2018 at 12:36 PM -0700, "Jakub Jaszewski" 
<jaszewski.ja...@gmail.com> wrote:

Hi Dan,

Did you configure block.wal/block.db as separate devices/partition 
(osd_scenario: non-collocated or lvm for clusters installed using ceph-ansbile 
playbooks )?

I run Ceph version 13.2.1 with non-collocated data.db and have the same 
situation - the sum of block.db partitions' size is displayed as RAW USED in 
ceph df.
Perhaps it is not the case for collocated block.db/wal.

Jakub

On Sat, Oct 20, 2018 at 8:34 PM Waterbly, Dan 
<dan.water...@sos.wa.gov> wrote:
I get that, but isn’t 4TiB to track 2.45M objects excessive? These numbers seem 
very high to me.

Get Outlook for iOS



On Sat, Oct 20, 2018 at 10:27 AM -0700, "Serkan Çoban" 
<cobanser...@gmail.com> wrote:


4.65TiB includes size of wal and db partitions too.
On Sat, Oct 20, 2018 at 7:45 PM Waterbly, Dan  wrote:
>
> Hello,
>
>
>
> I have inserted 2.45M 1,000 byte objects into my cluster (radosgw, 3x 
> replication).
>
>
>
> I am confused by the usage ceph df is reporting and am hoping someone can 
> shed some light on this. Here is what I see when I run ceph df
>
>
>
> GLOBAL:
>     SIZE     AVAIL    RAW USED  %RAW USED
>     1.02PiB  1.02PiB  4.65TiB   0.44
>
> POOLS:
>     NAME                 ID  USED     %USED  MAX AVAIL  OBJECTS
>     .rgw.root            1   3.30KiB  0      330TiB     17
>     .rgw.buckets.data    2   22.9GiB  0      330TiB     24550943
>     default.rgw.control  3   0B       0      330TiB     8
>     default.rgw.meta     4   373B     0      330TiB     3
>     default.rgw.log      5   0B       0      330TiB     0
>     .rgw.control         6   0B       0      330TiB     8
>     .rgw.meta            7   2.18KiB  0      330TiB     12
>     .rgw.log             8   0B       0      330TiB     194
>     .rgw.buckets.index   9   0B       0      330TiB     2560
>
>
>
> Why does my bucket pool report usage of 22.9GiB but my cluster as a whole is 
> reporting 4.65TiB? There is nothing else on this cluster as it was just 
> installed and configured.
>
>
>
> Thank you for your help with this.
>
>
>
> -Dan
>
>
>
> Dan Waterbly | Senior Application Developer | 509.235.7500 x225 | 
> dan.water...@sos.wa.gov
>
> WASHINGTON STATE ARCHIVES | DIGITAL ARCHIVES
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] slow_used_bytes - SlowDB being used despite lots of space free in BlockDB on SSD?

2018-10-20 Thread Nick Fisk
> >> On 10/18/2018 7:49 PM, Nick Fisk wrote:
> >>> Hi,
> >>>
> >>> Ceph Version = 12.2.8
> >>> 8TB spinner with 20G SSD partition
> >>>
> >>> Perf dump shows the following:
> >>>
> >>> "bluefs": {
> >>>   "gift_bytes": 0,
> >>>   "reclaim_bytes": 0,
> >>>   "db_total_bytes": 21472731136,
> >>>   "db_used_bytes": 3467640832,
> >>>   "wal_total_bytes": 0,
> >>>   "wal_used_bytes": 0,
> >>>   "slow_total_bytes": 320063143936,
> >>>   "slow_used_bytes": 4546625536,
> >>>   "num_files": 124,
> >>>   "log_bytes": 11833344,
> >>>   "log_compactions": 4,
> >>>   "logged_bytes": 316227584,
> >>>   "files_written_wal": 2,
> >>>   "files_written_sst": 4375,
> >>>   "bytes_written_wal": 204427489105,
> >>>   "bytes_written_sst": 248223463173
> >>>
> >>> Am I reading that correctly, about 3.4GB used out of 20GB on the SSD, yet 
> >>> 4.5GB of DB is stored on the spinning disk?
> >> Correct. Most probably the rationale for this is the layered scheme
> >> RocksDB uses to keep its sst. For each level It has a maximum
> >> threshold (determined by level no, some base value and corresponding
> >> multiplier - see max_bytes_for_level_base &
> >> max_bytes_for_level_multiplier at
> >> https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide)
> >> If the next level (at its max size) doesn't fit into the space available
> >> on the DB volume, it's spilled over entirely to the slow device.
> >> IIRC level_base is about 250MB and the multiplier is 10, so the third level
> >> needs 25GB and hence doesn't fit into your DB volume.
> >>
> >> In fact, a DB volume of 20GB is VERY small for an 8TB OSD - just 0.25% of the
> >> slow one. AFAIR the current recommendation is about 4%.
> >>
> > Thanks Igor, these nodes were designed back in the filestore days when
> > small 10DWPD SSDs were all the rage. I might be able to
> shrink the OS/swap partition and get each DB partition up to 25/26GB; they
> are not going to get any bigger than that, as that's the
> NVMe completely filled. But I'm then going to have to effectively wipe all the
> disks I've done so far and re-backfill. ☹ Are there any
> tunables to change this behaviour post OSD deployment to move data back onto
> SSD?
> None I'm aware of.
> 
> However, I've just completed development of an offline BlueFS volume migration
> feature within ceph-bluestore-tool. It allows DB/WAL
> volume allocation and resizing, as well as moving BlueFS data between volumes
> (with some limitations unrelated to your case). Hence
> one doesn't need slow backfilling to adjust the BlueFS volume configuration.
> Here is the PR (Nautilus only for now):
> https://github.com/ceph/ceph/pull/23103

That sounds awesome. I might leave the current OSDs as they are and
look to "fix" them when Nautilus comes out.

> 
> >
> > On a related note, does frequently accessed data move into the SSD, or is 
> > the overspill a one way ticket? I would assume writes
> would cause data in rocksdb to be written back into L0 and work its way down, 
> but I'm not sure about reads?
> AFAIK reads don't trigger any data layout changes.
> 



> >
> > So I think the lesson from this is that, whatever DB usage you think you may
> > end up with, always make sure your SSD
> partition is bigger than 26GB (L0+L1)?
> In fact that's
> L0+L1 (2x250MB), L2 (2500MB), L3 (25000MB), which is about 28GB.

Well, I upgraded a new node and, after shrinking the OS, I managed to assign 29GB
for the DBs. It has just finished backfilling and, disappointingly, it looks like
the DB has spilled over onto the disks ☹ So the magic minimum number is going
to be somewhere between 30GB and 40GB. I might be able to squeeze 30G
partitions out if I go for a tiny OS disk and no swap. Will try that on the
next one. Hoping that 30G does it.
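
For what it's worth, a back-of-the-envelope sketch using the level sizes quoted above (base ~250MB, multiplier 10, plus roughly 500MB allowed for the WAL and L0 - rough assumptions, not exact RocksDB behaviour):

cum=500; lvl=250
for n in 1 2 3 4; do
  cum=$((cum + lvl))                       # cumulative MiB needed up to level Ln
  echo "need ~${cum} MiB ($((cum / 1024)) GiB) of DB to keep L1..L${n} on flash"
  lvl=$((lvl * 10))
done

That prints roughly 0.7GiB, 3GiB, 28GiB and 272GiB, so a 29-30GB partition sits right on the L3 boundary and any compaction headroom could tip it over onto the slow device.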



> 
> One more observation from my side - RocksDB might additionally use up to 100% 
> of the level maximum size during compaction -
> hence it might make sense to have up to 25GB of additional spare space. 
> Surely this spare space wouldn't be fully used most of the
> time. And actually I don't have any instructions or clear knowledge base for 
> this aspect. Just some warning.
> To track such an  excess I used additional perf counters, see commit
> 2763c4de41ea55a97ed7400f54a2b2d841894bf5 in
> https://github.com/ceph/ceph/pull/23208
> Perhaps makes sense to have a separare PR for this stuff and even backport 
> it...

I think I'm starting to capture some of that data, as I'm graphing all the "perf
dump" values into graphite. The nodes with the 40GB DB partitions (all DB data
on SSD) currently have about 10GiB in the DB. During compactions the highest it
has peaked over the last few days is about 14GiB. On the nodes with the 20GB
partitions, the SSD DB sits at about 2.5GiB and peaks to just under 5GiB; the
slow sits at 4.3GiB and peaks to about 6GiB.
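
(For anyone wanting to spot-check a single OSD without graphite, something like this works on the OSD host - the osd id is a placeholder and jq is assumed to be installed:)

ceph daemon osd.12 perf dump | jq '.bluefs | {db_total_bytes, db_used_bytes, slow_total_bytes, slow_used_bytes}'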



> 
> 
> >
> >>> Am I also understanding 

Re: [ceph-users] ceph df space usage confusion - balancing needed?

2018-10-20 Thread Oliver Freyermuth
Ok, I'll try out the balancer at the end of the upcoming week then (after we've
fixed a HW issue with one of our mons
and the cooling system).

Until then, any further advice is welcome, including whether upmap is recommended
over crush-compat (all clients are Luminous) ;-).
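
For reference, my understanding of the minimal upmap setup is something like the following (commands from the Luminous balancer module; I'd double-check the docs and watch the first iterations before leaving it on):

ceph osd set-require-min-compat-client luminous   # upmap needs all clients >= luminous
ceph mgr module enable balancer                   # may already be enabled
ceph balancer mode upmap
ceph balancer on
ceph balancer status
ceph balancer eval                                # distribution score, lower is better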

Cheers,
Oliver

Am 20.10.18 um 21:26 schrieb Janne Johansson:
> Ok, can't say "why" then, I'd reweigh them somewhat to even it out,
> 1.22 -vs- 0.74 in variance is a lot, so either a balancer plugin for
> the MGRs, a script or just a few manual tweaks might be in order.
> 
> Den lör 20 okt. 2018 kl 21:02 skrev Oliver Freyermuth
> :
>>
>> All OSDs are of the very same size. One OSD host has slightly more disks (33 
>> instead of 31), though.
>> So also that that can't explain the hefty difference.
>>
>> I attach the output of "ceph osd tree" and "ceph osd df".
>>
>> The crush rule for the ceph_data pool is:
>> rule cephfs_data {
>> id 2
>> type erasure
>> min_size 3
>> max_size 6
>> step set_chooseleaf_tries 5
>> step set_choose_tries 100
>> step take default class hdd
>> step chooseleaf indep 0 type host
>> step emit
>> }
>> So that only considers the hdd device class. EC is done with k=4 m=2.
>>
>> So I don't see any imbalance on the hardware level, but only a somewhat 
>> uneven distribution of PGs.
>> Am I missing something, or is this really just a case for the ceph balancer 
>> plugin?
>> I'm just a bit astonished this effect is so huge.
>> Maybe our 4096 PGs for the ceph_data pool are not enough to get an even 
>> distribution without balancing?
>> But it yields about 100 PGs per OSD, as you can see...
>>
>> --
>> # ceph osd tree
>> ID  CLASS WEIGHTTYPE NAME   STATUS REWEIGHT PRI-AFF
>>  -1   826.26428 root default
>>  -3 0.43700 host mon001
>>   0   ssd   0.21799 osd.0   up  1.0 1.0
>>   1   ssd   0.21799 osd.1   up  1.0 1.0
>>  -5 0.43700 host mon002
>>   2   ssd   0.21799 osd.2   up  1.0 1.0
>>   3   ssd   0.21799 osd.3   up  1.0 1.0
>> -31 1.81898 host mon003
>> 230   ssd   0.90999 osd.230 up  1.0 1.0
>> 231   ssd   0.90999 osd.231 up  1.0 1.0
>> -10   116.64600 host osd001
>>   4   hdd   3.64499 osd.4   up  1.0 1.0
>>   5   hdd   3.64499 osd.5   up  1.0 1.0
>>   6   hdd   3.64499 osd.6   up  1.0 1.0
>>   7   hdd   3.64499 osd.7   up  1.0 1.0
>>   8   hdd   3.64499 osd.8   up  1.0 1.0
>>   9   hdd   3.64499 osd.9   up  1.0 1.0
>>  10   hdd   3.64499 osd.10  up  1.0 1.0
>>  11   hdd   3.64499 osd.11  up  1.0 1.0
>>  12   hdd   3.64499 osd.12  up  1.0 1.0
>>  13   hdd   3.64499 osd.13  up  1.0 1.0
>>  14   hdd   3.64499 osd.14  up  1.0 1.0
>>  15   hdd   3.64499 osd.15  up  1.0 1.0
>>  16   hdd   3.64499 osd.16  up  1.0 1.0
>>  17   hdd   3.64499 osd.17  up  1.0 1.0
>>  18   hdd   3.64499 osd.18  up  1.0 1.0
>>  19   hdd   3.64499 osd.19  up  1.0 1.0
>>  20   hdd   3.64499 osd.20  up  1.0 1.0
>>  21   hdd   3.64499 osd.21  up  1.0 1.0
>>  22   hdd   3.64499 osd.22  up  1.0 1.0
>>  23   hdd   3.64499 osd.23  up  1.0 1.0
>>  24   hdd   3.64499 osd.24  up  1.0 1.0
>>  25   hdd   3.64499 osd.25  up  1.0 1.0
>>  26   hdd   3.64499 osd.26  up  1.0 1.0
>>  27   hdd   3.64499 osd.27  up  1.0 1.0
>>  28   hdd   3.64499 osd.28  up  1.0 1.0
>>  29   hdd   3.64499 osd.29  up  1.0 1.0
>>  30   hdd   3.64499 osd.30  up  1.0 1.0
>>  31   hdd   3.64499 osd.31  up  1.0 1.0
>>  32   hdd   3.64499 osd.32  up  1.0 1.0
>>  33   hdd   3.64499 osd.33  up  1.0 1.0
>>  34   hdd   3.64499 osd.34  up  1.0 1.0
>>  35   hdd   3.64499 osd.35  up  1.0 1.0
>> -13   116.64600 host osd002
>>  36   hdd   3.64499 osd.36  up  1.0 1.0
>>  37   hdd   3.64499 osd.37  up  1.0 1.0
>>  38   hdd   3.64499 osd.38  up  1.0 1.0
>>  39   hdd   3.64499 osd.39  up  1.0 1.0
>>  40   hdd   3.64499 osd.40  up  1.0 1.0
>>  41   hdd   3.64499 osd.41  up  1.0 1.0
>>  42   hdd   3.64499 osd.42  up  1.0 1.0
>>  43   hdd   3.64499 osd.43  up  1.0 1.0
>>  44   hdd   3.64499 osd.44  up  

Re: [ceph-users] CEPH Cluster Usage Discrepancy

2018-10-20 Thread Jakub Jaszewski
Hi Dan,

Did you configure block.wal/block.db as separate devices/partitions
(osd_scenario: non-collocated, or lvm for clusters installed using
ceph-ansible playbooks)?

I run Ceph version 13.2.1 with a non-collocated block.db and have the same
situation - the sum of the block.db partitions' sizes is displayed as RAW USED
in ceph df.
Perhaps it is not the case for collocated block.db/wal.

Jakub

On Sat, Oct 20, 2018 at 8:34 PM Waterbly, Dan 
wrote:

> I get that, but isn’t 4TiB to track 2.45M objects excessive? These numbers
> seem very high to me.
>
> Get Outlook for iOS 
>
>
>
> On Sat, Oct 20, 2018 at 10:27 AM -0700, "Serkan Çoban" <
> cobanser...@gmail.com> wrote:
>
> 4.65TiB includes size of wal and db partitions too.
>> On Sat, Oct 20, 2018 at 7:45 PM Waterbly, Dan  wrote:
>> >
>> > Hello,
>> >
>> >
>> >
>> > I have inserted 2.45M 1,000 byte objects into my cluster (radosgw, 3x 
>> > replication).
>> >
>> >
>> >
>> > I am confused by the usage ceph df is reporting and am hoping someone can 
>> > shed some light on this. Here is what I see when I run ceph df
>> >
>> >
>> >
>> > GLOBAL:
>> >     SIZE     AVAIL    RAW USED  %RAW USED
>> >     1.02PiB  1.02PiB  4.65TiB   0.44
>> >
>> > POOLS:
>> >     NAME                 ID  USED     %USED  MAX AVAIL  OBJECTS
>> >     .rgw.root            1   3.30KiB  0      330TiB     17
>> >     .rgw.buckets.data    2   22.9GiB  0      330TiB     24550943
>> >     default.rgw.control  3   0B       0      330TiB     8
>> >     default.rgw.meta     4   373B     0      330TiB     3
>> >     default.rgw.log      5   0B       0      330TiB     0
>> >     .rgw.control         6   0B       0      330TiB     8
>> >     .rgw.meta            7   2.18KiB  0      330TiB     12
>> >     .rgw.log             8   0B       0      330TiB     194
>> >     .rgw.buckets.index   9   0B       0      330TiB     2560
>> >
>> >
>> >
>> > Why does my bucket pool report usage of 22.9GiB but my cluster as a whole 
>> > is reporting 4.65TiB? There is nothing else on this cluster as it was just 
>> > installed and configured.
>> >
>> >
>> >
>> > Thank you for your help with this.
>> >
>> >
>> >
>> > -Dan
>> >
>> >
>> >
>> > Dan Waterbly | Senior Application Developer | 509.235.7500 x225 | 
>> > dan.water...@sos.wa.gov
>> >
>> > WASHINGTON STATE ARCHIVES | DIGITAL ARCHIVES
>> >
>> >
>> >
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] A basic question on failure domain

2018-10-20 Thread Cody
That was clearly explained. Thank you so much!

Best regards,
Cody
On Sat, Oct 20, 2018 at 1:02 PM Maged Mokhtar  wrote:
>
>
>
> On 20/10/18 05:28, Cody wrote:
> > Hi folks,
> >
> > I have a rookie question. Does the number of the buckets chosen as the
> > failure domain must be equal or greater than the number of replica (or
> > k+m for erasure coding)?
> >
> > E.g., for an erasure code profile where k=4, m=2, failure domain=rack,
> > does it only work when there are 6 or more racks in the CRUSH
> > hierarchy? Or would it continue to iterate down the tree and
> > eventually would work as long as there are 6 or more OSDs?
> >
> > Thank you very much.
> >
> > Best regards,
> > Cody
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> The rule associated with the ec profile you mentioned, will indeed try
> to select 6 rack buckets then get an osd leaf from each. If you only had
> 5 racks for example, it will return only 5 osds per PG, the pool will
> function but in degraded state (if pool min_size was 5). This rule will
> not return more that 1 osd per rack, if it did it will not achieving the
> failure domain you gave.
> You can write a custom rule that uses 2 racks and select 3 hosts from
> each, and associate this with the k4 m2 pool, crush will not mind..it
> will do whatever you tell it, but if 1 rack fails your pool goes down,
> so would not be achieving  a failure domain at rack level unless you do
> have 6 or more racks.
>
> Maged
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph df space usage confusion - balancing needed?

2018-10-20 Thread Oliver Freyermuth
All OSDs are of the very same size. One OSD host has slightly more disks (33
instead of 31), though.
So that also can't explain the hefty difference.

I attach the output of "ceph osd tree" and "ceph osd df". 

The crush rule for the ceph_data pool is:
rule cephfs_data {
id 2
type erasure
min_size 3
max_size 6
step set_chooseleaf_tries 5
step set_choose_tries 100
step take default class hdd
step chooseleaf indep 0 type host
step emit
}
So that only considers the hdd device class. EC is done with k=4 m=2. 

So I don't see any imbalance on the hardware level, but only a somewhat uneven 
distribution of PGs. 
Am I missing something, or is this really just a case for the ceph balancer 
plugin? 
I'm just a bit astonished this effect is so huge. 
Maybe our 4096 PGs for the ceph_data pool are not enough to get an even 
distribution without balancing? 
But it yields about 100 PGs per OSD, as you can see... 

--
# ceph osd tree
ID  CLASS WEIGHTTYPE NAME   STATUS REWEIGHT PRI-AFF 
 -1   826.26428 root default
 -3 0.43700 host mon001 
  0   ssd   0.21799 osd.0   up  1.0 1.0 
  1   ssd   0.21799 osd.1   up  1.0 1.0 
 -5 0.43700 host mon002 
  2   ssd   0.21799 osd.2   up  1.0 1.0 
  3   ssd   0.21799 osd.3   up  1.0 1.0 
-31 1.81898 host mon003 
230   ssd   0.90999 osd.230 up  1.0 1.0 
231   ssd   0.90999 osd.231 up  1.0 1.0 
-10   116.64600 host osd001 
  4   hdd   3.64499 osd.4   up  1.0 1.0 
  5   hdd   3.64499 osd.5   up  1.0 1.0 
  6   hdd   3.64499 osd.6   up  1.0 1.0 
  7   hdd   3.64499 osd.7   up  1.0 1.0 
  8   hdd   3.64499 osd.8   up  1.0 1.0 
  9   hdd   3.64499 osd.9   up  1.0 1.0 
 10   hdd   3.64499 osd.10  up  1.0 1.0 
 11   hdd   3.64499 osd.11  up  1.0 1.0 
 12   hdd   3.64499 osd.12  up  1.0 1.0 
 13   hdd   3.64499 osd.13  up  1.0 1.0 
 14   hdd   3.64499 osd.14  up  1.0 1.0 
 15   hdd   3.64499 osd.15  up  1.0 1.0 
 16   hdd   3.64499 osd.16  up  1.0 1.0 
 17   hdd   3.64499 osd.17  up  1.0 1.0 
 18   hdd   3.64499 osd.18  up  1.0 1.0 
 19   hdd   3.64499 osd.19  up  1.0 1.0 
 20   hdd   3.64499 osd.20  up  1.0 1.0 
 21   hdd   3.64499 osd.21  up  1.0 1.0 
 22   hdd   3.64499 osd.22  up  1.0 1.0 
 23   hdd   3.64499 osd.23  up  1.0 1.0 
 24   hdd   3.64499 osd.24  up  1.0 1.0 
 25   hdd   3.64499 osd.25  up  1.0 1.0 
 26   hdd   3.64499 osd.26  up  1.0 1.0 
 27   hdd   3.64499 osd.27  up  1.0 1.0 
 28   hdd   3.64499 osd.28  up  1.0 1.0 
 29   hdd   3.64499 osd.29  up  1.0 1.0 
 30   hdd   3.64499 osd.30  up  1.0 1.0 
 31   hdd   3.64499 osd.31  up  1.0 1.0 
 32   hdd   3.64499 osd.32  up  1.0 1.0 
 33   hdd   3.64499 osd.33  up  1.0 1.0 
 34   hdd   3.64499 osd.34  up  1.0 1.0 
 35   hdd   3.64499 osd.35  up  1.0 1.0 
-13   116.64600 host osd002 
 36   hdd   3.64499 osd.36  up  1.0 1.0 
 37   hdd   3.64499 osd.37  up  1.0 1.0 
 38   hdd   3.64499 osd.38  up  1.0 1.0 
 39   hdd   3.64499 osd.39  up  1.0 1.0 
 40   hdd   3.64499 osd.40  up  1.0 1.0 
 41   hdd   3.64499 osd.41  up  1.0 1.0 
 42   hdd   3.64499 osd.42  up  1.0 1.0 
 43   hdd   3.64499 osd.43  up  1.0 1.0 
 44   hdd   3.64499 osd.44  up  1.0 1.0 
 45   hdd   3.64499 osd.45  up  1.0 1.0 
 46   hdd   3.64499 osd.46  up  1.0 1.0 
 47   hdd   3.64499 osd.47  up  1.0 1.0 
 48   hdd   3.64499 osd.48  up  1.0 1.0 
 49   hdd   3.64499 osd.49  up  1.0 1.0 
 50   hdd   3.64499 osd.50  up  1.0 1.0 
 51   hdd   3.64499 osd.51  up  1.0 1.0 
 52   hdd   3.64499 osd.52  up  1.0 1.0 
 53   hdd   3.64499 osd.53  up  1.0 1.0 
 54   hdd   3.64499 osd.54  up  1.0 1.0 
 55   hdd   3.64499   

Re: [ceph-users] ceph df space usage confusion - balancing needed?

2018-10-20 Thread Janne Johansson
Yes, if you have uneven sizes I guess you could end up in a situation
where you have
lots of 1TB OSDs and a number of 2TB OSDs, but pool replication forces
the pool to have one
PG replica on a 1TB OSD; then it would be possible to state "this
pool can't write more than X G",
but when it is full there would be free space left on some of the
2TB OSDs which the pool
can't utilize. Probably the same for uneven OSD hosts, if you have those.

Den lör 20 okt. 2018 kl 20:28 skrev Oliver Freyermuth
:
>
> Dear Janne,
>
> yes, of course. But since we only have two pools here, this can not explain 
> the difference.
> The metadata is replicated (3 copies) across ssd drives, and we have < 3 TB 
> of total raw storage for that.
> So looking at the raw space usage, we can ignore that.
>
> All the rest is used for the ceph_data pool. So the ceph_data pool, in terms 
> of raw storage, is about 50 % used.
>
> But in terms of storage shown for that pool, it's almost 63 % %USED.
> So I guess this can purely be from bad balancing, correct?
>
> Cheers,
> Oliver
>
> Am 20.10.18 um 19:49 schrieb Janne Johansson:
> > Do mind that drives may have more than one pool on them, so RAW space
> > is what it says, how much free space there is. Then the avail and
> > %USED on per-pool stats will take replication into account, it can
> > tell how much data you may write into that particular pool, given that
> > pools replication or EC settings.
> >
> > Den lör 20 okt. 2018 kl 19:09 skrev Oliver Freyermuth
> > :
> >>
> >> Dear Cephalopodians,
> >>
> >> as many others, I'm also a bit confused by "ceph df" output
> >> in a pretty straightforward configuration.
> >>
> >> We have a CephFS (12.2.7) running, with 4+2 EC profile.
> >>
> >> I get:
> >> 
 > >> # ceph df
 > >> GLOBAL:
 > >>     SIZE  AVAIL  RAW USED  %RAW USED
 > >>     824T  410T   414T      50.26
 > >> POOLS:
 > >>     NAME             ID  USED  %USED  MAX AVAIL  OBJECTS
 > >>     cephfs_metadata  1   452M  0.05   860G       365774
 > >>     cephfs_data      2   275T  62.68  164T       75056403
> >> 
> >>
> >> So about 50 % of raw space are used, but already ~63 % of filesystem space 
> >> are used.
> >> Is this purely from imperfect balancing?
> >> In "ceph osd df", I do indeed see OSD usages spreading from 65.02 % usage 
> >> down to 37.12 %.
> >>
> >> We did not yet use the balancer plugin.
> >> We don't have any pre-luminous clients.
> >> In that setup, I take it that "upmap" mode would be recommended - correct?
> >> Any "gotchas" using that on luminous?
> >>
> >> Cheers,
> >> Oliver
> >>
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >
> >
>
>


-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph df space usage confusion - balancing needed?

2018-10-20 Thread Oliver Freyermuth
Dear Janne,

yes, of course. But since we only have two pools here, this can not explain the 
difference. 
The metadata is replicated (3 copies) across ssd drives, and we have < 3 TB of 
total raw storage for that. 
So looking at the raw space usage, we can ignore that. 

All the rest is used for the ceph_data pool. So the ceph_data pool, in terms of 
raw storage, is about 50 % used. 

But in terms of storage shown for that pool, it's almost 63 % %USED. 
So I guess this can purely be from bad balancing, correct? 

Cheers,
Oliver

Am 20.10.18 um 19:49 schrieb Janne Johansson:
> Do mind that drives may have more than one pool on them, so RAW space
> is what it says, how much free space there is. Then the avail and
> %USED on per-pool stats will take replication into account, it can
> tell how much data you may write into that particular pool, given that
> pools replication or EC settings.
> 
> Den lör 20 okt. 2018 kl 19:09 skrev Oliver Freyermuth
> :
>>
>> Dear Cephalopodians,
>>
>> as many others, I'm also a bit confused by "ceph df" output
>> in a pretty straightforward configuration.
>>
>> We have a CephFS (12.2.7) running, with 4+2 EC profile.
>>
>> I get:
>> 
>> # ceph df
>> GLOBAL:
>>     SIZE  AVAIL  RAW USED  %RAW USED
>>     824T  410T   414T      50.26
>> POOLS:
>>     NAME             ID  USED  %USED  MAX AVAIL  OBJECTS
>>     cephfs_metadata  1   452M  0.05   860G       365774
>>     cephfs_data      2   275T  62.68  164T       75056403
>> 
>>
>> So about 50 % of raw space are used, but already ~63 % of filesystem space 
>> are used.
>> Is this purely from imperfect balancing?
>> In "ceph osd df", I do indeed see OSD usages spreading from 65.02 % usage 
>> down to 37.12 %.
>>
>> We did not yet use the balancer plugin.
>> We don't have any pre-luminous clients.
>> In that setup, I take it that "upmap" mode would be recommended - correct?
>> Any "gotchas" using that on luminous?
>>
>> Cheers,
>> Oliver
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> 




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CEPH Cluster Usage Discrepancy

2018-10-20 Thread Waterbly, Dan
I get that, but isn’t 4TiB to track 2.45M objects excessive? These numbers seem 
very high to me.

Get Outlook for iOS



On Sat, Oct 20, 2018 at 10:27 AM -0700, "Serkan Çoban" 
<cobanser...@gmail.com> wrote:


4.65TiB includes size of wal and db partitions too.
On Sat, Oct 20, 2018 at 7:45 PM Waterbly, Dan  wrote:
>
> Hello,
>
>
>
> I have inserted 2.45M 1,000 byte objects into my cluster (radosgw, 3x 
> replication).
>
>
>
> I am confused by the usage ceph df is reporting and am hoping someone can 
> shed some light on this. Here is what I see when I run ceph df
>
>
>
> GLOBAL:
>     SIZE     AVAIL    RAW USED  %RAW USED
>     1.02PiB  1.02PiB  4.65TiB   0.44
>
> POOLS:
>     NAME                 ID  USED     %USED  MAX AVAIL  OBJECTS
>     .rgw.root            1   3.30KiB  0      330TiB     17
>     .rgw.buckets.data    2   22.9GiB  0      330TiB     24550943
>     default.rgw.control  3   0B       0      330TiB     8
>     default.rgw.meta     4   373B     0      330TiB     3
>     default.rgw.log      5   0B       0      330TiB     0
>     .rgw.control         6   0B       0      330TiB     8
>     .rgw.meta            7   2.18KiB  0      330TiB     12
>     .rgw.log             8   0B       0      330TiB     194
>     .rgw.buckets.index   9   0B       0      330TiB     2560
>
>
>
> Why does my bucket pool report usage of 22.9GiB but my cluster as a whole is 
> reporting 4.65TiB? There is nothing else on this cluster as it was just 
> installed and configured.
>
>
>
> Thank you for your help with this.
>
>
>
> -Dan
>
>
>
> Dan Waterbly | Senior Application Developer | 509.235.7500 x225 | 
> dan.water...@sos.wa.gov
>
> WASHINGTON STATE ARCHIVES | DIGITAL ARCHIVES
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph df space usage confusion - balancing needed?

2018-10-20 Thread Janne Johansson
Do mind that drives may have more than one pool on them, so RAW space
is what it says: how much free space there is. The avail and
%USED in the per-pool stats then take replication into account; they
tell you how much data you may write into that particular pool, given that
pool's replication or EC settings.

Den lör 20 okt. 2018 kl 19:09 skrev Oliver Freyermuth
:
>
> Dear Cephalopodians,
>
> as many others, I'm also a bit confused by "ceph df" output
> in a pretty straightforward configuration.
>
> We have a CephFS (12.2.7) running, with 4+2 EC profile.
>
> I get:
> 
> # ceph df
> GLOBAL:
>     SIZE  AVAIL  RAW USED  %RAW USED
>     824T  410T   414T      50.26
> POOLS:
>     NAME             ID  USED  %USED  MAX AVAIL  OBJECTS
>     cephfs_metadata  1   452M  0.05   860G       365774
>     cephfs_data      2   275T  62.68  164T       75056403
> 
>
> So about 50 % of raw space are used, but already ~63 % of filesystem space 
> are used.
> Is this purely from imperfect balancing?
> In "ceph osd df", I do indeed see OSD usages spreading from 65.02 % usage 
> down to 37.12 %.
>
> We did not yet use the balancer plugin.
> We don't have any pre-luminous clients.
> In that setup, I take it that "upmap" mode would be recommended - correct?
> Any "gotchas" using that on luminous?
>
> Cheers,
> Oliver
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CEPH Cluster Usage Discrepancy

2018-10-20 Thread Serkan Çoban
4.65TiB includes size of wal and db partitions too.
On Sat, Oct 20, 2018 at 7:45 PM Waterbly, Dan  wrote:
>
> Hello,
>
>
>
> I have inserted 2.45M 1,000 byte objects into my cluster (radosgw, 3x 
> replication).
>
>
>
> I am confused by the usage ceph df is reporting and am hoping someone can 
> shed some light on this. Here is what I see when I run ceph df
>
>
>
> GLOBAL:
>     SIZE     AVAIL    RAW USED  %RAW USED
>     1.02PiB  1.02PiB  4.65TiB   0.44
>
> POOLS:
>     NAME                 ID  USED     %USED  MAX AVAIL  OBJECTS
>     .rgw.root            1   3.30KiB  0      330TiB     17
>     .rgw.buckets.data    2   22.9GiB  0      330TiB     24550943
>     default.rgw.control  3   0B       0      330TiB     8
>     default.rgw.meta     4   373B     0      330TiB     3
>     default.rgw.log      5   0B       0      330TiB     0
>     .rgw.control         6   0B       0      330TiB     8
>     .rgw.meta            7   2.18KiB  0      330TiB     12
>     .rgw.log             8   0B       0      330TiB     194
>     .rgw.buckets.index   9   0B       0      330TiB     2560
>
>
>
> Why does my bucket pool report usage of 22.9GiB but my cluster as a whole is 
> reporting 4.65TiB? There is nothing else on this cluster as it was just 
> installed and configured.
>
>
>
> Thank you for your help with this.
>
>
>
> -Dan
>
>
>
> Dan Waterbly | Senior Application Developer | 509.235.7500 x225 | 
> dan.water...@sos.wa.gov
>
> WASHINGTON STATE ARCHIVES | DIGITAL ARCHIVES
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph df space usage confusion - balancing needed?

2018-10-20 Thread Oliver Freyermuth
Dear Cephalopodians,

as many others, I'm also a bit confused by "ceph df" output
in a pretty straightforward configuration. 

We have a CephFS (12.2.7) running, with 4+2 EC profile. 

I get:

# ceph df
GLOBAL:
    SIZE  AVAIL  RAW USED  %RAW USED
    824T  410T   414T      50.26
POOLS:
    NAME             ID  USED  %USED  MAX AVAIL  OBJECTS
    cephfs_metadata  1   452M  0.05   860G       365774
    cephfs_data      2   275T  62.68  164T       75056403


So about 50 % of raw space are used, but already ~63 % of filesystem space are 
used. 
Is this purely from imperfect balancing? 
In "ceph osd df", I do indeed see OSD usages spreading from 65.02 % usage down 
to 37.12 %. 

We did not yet use the balancer plugin. 
We don't have any pre-luminous clients. 
In that setup, I take it that "upmap" mode would be recommended - correct? 
Any "gotchas" using that on luminous? 

Cheers,
Oliver



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Drive for Wal and Db

2018-10-20 Thread Maged Mokhtar



On 20/10/18 18:57, Robert Stanford wrote:


 Our OSDs are BlueStore and are on regular hard drives. Each OSD has a 
partition on an SSD for its DB.  Wal is on the regular hard drives.  
Should I move the wal to share the SSD with the DB?


 Regards
R


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


You should put the wal on the faster device; wal and db can share the same
SSD partition.
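
For example, something along these lines should do it (device names are only placeholders; when no --block.wal is given, BlueStore keeps the WAL on the fastest device it has, i.e. with the DB):

ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1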


Maged
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] A basic question on failure domain

2018-10-20 Thread Maged Mokhtar



On 20/10/18 05:28, Cody wrote:

Hi folks,

I have a rookie question. Does the number of the buckets chosen as the
failure domain must be equal or greater than the number of replica (or
k+m for erasure coding)?

E.g., for an erasure code profile where k=4, m=2, failure domain=rack,
does it only work when there are 6 or more racks in the CRUSH
hierarchy? Or would it continue to iterate down the tree and
eventually would work as long as there are 6 or more OSDs?

Thank you very much.

Best regards,
Cody
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


The rule associated with the EC profile you mentioned will indeed try
to select 6 rack buckets and then get an OSD leaf from each. If you only had
5 racks, for example, it would return only 5 OSDs per PG; the pool would
function but in a degraded state (if the pool min_size was 5). This rule will
not return more than 1 OSD per rack; if it did, it would not be achieving the
failure domain you gave.
You can write a custom rule that uses 2 racks and selects 3 hosts from
each, and associate it with the k=4, m=2 pool. CRUSH will not mind; it
will do whatever you tell it. But if 1 rack fails your pool goes down,
so you would not be achieving a failure domain at rack level unless you do
have 6 or more racks.
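
For illustration only, such a 2-rack rule could look like the following; the rule name/id and file names are made up, and I would test with crushtool before injecting anything:

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
cat >> crushmap.txt <<'EOF'
# pick 2 racks, then 3 hosts in each -> 6 OSDs for a k=4,m=2 pool;
# losing a whole rack loses 3 chunks, so this does NOT survive a rack failure
rule ec_2rack_3host {
        id 3
        type erasure
        min_size 3
        max_size 6
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take default class hdd
        step choose indep 2 type rack
        step chooseleaf indep 3 type host
        step emit
}
EOF
crushtool -c crushmap.txt -o crushmap.new
crushtool -i crushmap.new --test --rule 3 --num-rep 6 --show-mappings | head
ceph osd setcrushmap -i crushmap.new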


Maged
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Drive for Wal and Db

2018-10-20 Thread Robert Stanford
 Our OSDs are BlueStore and are on regular hard drives.  Each OSD has a
partition on an SSD for its DB.  Wal is on the regular hard drives.  Should
I move the wal to share the SSD with the DB?

 Regards
R
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CEPH Cluster Usage Discrepancy

2018-10-20 Thread Waterbly, Dan
Hello,

I have inserted 2.45M 1,000 byte objects into my cluster (radosgw, 3x 
replication).

I am confused by the usage ceph df is reporting and am hoping someone can shed 
some light on this. Here is what I see when I run ceph df

GLOBAL:
    SIZE     AVAIL    RAW USED  %RAW USED
    1.02PiB  1.02PiB  4.65TiB   0.44
POOLS:
    NAME                 ID  USED     %USED  MAX AVAIL  OBJECTS
    .rgw.root            1   3.30KiB  0      330TiB     17
    .rgw.buckets.data    2   22.9GiB  0      330TiB     24550943
    default.rgw.control  3   0B       0      330TiB     8
    default.rgw.meta     4   373B     0      330TiB     3
    default.rgw.log      5   0B       0      330TiB     0
    .rgw.control         6   0B       0      330TiB     8
    .rgw.meta            7   2.18KiB  0      330TiB     12
    .rgw.log             8   0B       0      330TiB     194
    .rgw.buckets.index   9   0B       0      330TiB     2560

Why does my bucket pool report usage of 22.9GiB but my cluster as a whole is 
reporting 4.65TiB? There is nothing else on this cluster as it was just 
installed and configured.

Thank you for your help with this.

-Dan

Dan Waterbly | Senior Application Developer | 509.235.7500 x225 | 
dan.water...@sos.wa.gov
WASHINGTON STATE ARCHIVES | DIGITAL ARCHIVES

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com