Re: [ceph-users] Huge amount of cephfs metadata writes while only reading data (rsync from storage, to single disk)

2018-03-20 Thread Yan, Zheng
On Mon, Mar 19, 2018 at 11:45 PM, Nicolas Huillard
 wrote:
> Le lundi 19 mars 2018 à 15:30 +0300, Sergey Malinin a écrit :
>> Default for mds_log_events_per_segment is 1024, in my set up I ended
>> up with 8192.
>> I calculated that value like IOPS / log segments * 5 seconds (afaik
>> MDS performs journal maintenance once in 5 seconds by default).
>
> I tried 4096 from the initial 1024, then 8192 at the time of your
> answer, then 16384, with not much improvements...
>
> Then I tried to reduce the number of MDS, from 4 to 1, which definitely
> works (sorry if my initial mail didn't make it very clear that I was
> using many MDSs, even though it mentioned mds.2).
> I now have low rate of metadata write (40-50kBps), and the inter-DC
> link load reflects the size and direction of the actual data.
>
> I'll now try to reduce mds_log_events_per_segment back to its original
> value (1024), because performance is not optimal, and stutters a bit
> too much.
>
> Thanks for your advice!
>

This seems like a load balancer bug. Improving the load balancer is at the
top of our todo list.

Regards
Yan, Zheng

> --
> Nicolas Huillard
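
For anyone wanting to try the same tuning, here is a minimal sketch of the commands involved, assuming a Luminous-era cluster with a filesystem named "cephfs"; the filesystem name, MDS name, and the injectargs approach are assumptions, not taken from the thread:

# Persist the journal segment setting on the MDS hosts (ceph.conf, [mds] section):
#   mds_log_events_per_segment = 1024
# or inject it at runtime until the next restart (<name> is a placeholder):
ceph tell mds.<name> injectargs '--mds_log_events_per_segment 1024'

# Reduce the number of active MDS ranks from 4 to 1, as described above
# (Luminous-era syntax; deactivate ranks highest-first):
ceph fs set cephfs max_mds 1
ceph mds deactivate cephfs:3    # then cephfs:2 and cephfs:1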


Re: [ceph-users] XFS Metadata corruption while activating OSD

2018-03-20 Thread 赵赵贺东
I’m sorry for my late reply.
Thank you for your reply.
Yes, this error only occurs when the backend is XFS.
Ext4 will not trigger the error.



> On 12 Mar 2018, at 6:31 PM, Peter Woodman wrote:
> 
> from what i've heard, xfs has problems on arm. use btrfs, or (i
> believe?) ext4+bluestore will work.
> 
> On Sun, Mar 11, 2018 at 9:49 PM, Christian Wuerdig
>  wrote:
>> Hm, so you're running OSD nodes with 2GB of RAM and 2x10TB = 20TB of
>> storage? Literally everything posted on this list in relation to HW
>> requirements and related problems will tell you that this simply isn't going
>> to work. The slightest hint of a problem will simply kill the OSD nodes with
>> OOM. Have you tried with smaller disks - like 1TB models (or even smaller
>> like 256GB SSDs) and see if the same problem persists?
>> 
>> 
>> On Tue, 6 Mar 2018 at 10:51, 赵赵贺东  wrote:
>>> 
>>> Hello ceph-users,
>>> 
>>> It is a really, really, REALLY tough problem for our team.
>>> We have investigated the problem for a long time and tried many things, but
>>> we can’t solve it; even the root cause of the problem is still
>>> unclear to us!
>>> So any solution/suggestion/opinion whatsoever will be highly,
>>> highly appreciated!!!
>>> 
>>> Problem Summary:
>>> When we activate an OSD, there is metadata corruption on the
>>> activating disk; the probability is 100%!
>>> 
>>> Admin Nodes node:
>>> Platform: X86
>>> OS: Ubuntu 16.04
>>> Kernel: 4.12.0
>>> Ceph: Luminous 12.2.2
>>> 
>>> OSD nodes:
>>> Platform: armv7
>>> OS:   Ubuntu 14.04
>>> Kernel:   4.4.39
>>> Ceph: Luminous 12.2.2
>>> Disk: 10T+10T
>>> Memory: 2GB
>>> 
>>> Deploy log:
>>> 
>>> 
>>> dmesg log: (Sorry, the arms001-01 dmesg log has been lost, but the error
>>> messages about metadata corruption on arms003-10 are the same as on
>>> arms001-01)
>>> Mar  5 11:08:49 arms003-10 kernel: [  252.534232] XFS (sda1): Unmount and
>>> run xfs_repair
>>> Mar  5 11:08:49 arms003-10 kernel: [  252.539100] XFS (sda1): First 64
>>> bytes of corrupted metadata buffer:
>>> Mar  5 11:08:49 arms003-10 kernel: [  252.545504] eb82f000: 58 46 53 42 00
>>> 00 10 00 00 00 00 00 91 73 fe fb  XFSB.s..
>>> Mar  5 11:08:49 arms003-10 kernel: [  252.553569] eb82f010: 00 00 00 00 00
>>> 00 00 00 00 00 00 00 00 00 00 00  
>>> Mar  5 11:08:49 arms003-10 kernel: [  252.561624] eb82f020: fc 4e e3 89 50
>>> 8f 42 aa be bc 07 0c 6e fa 83 2f  .N..P.B.n../
>>> Mar  5 11:08:49 arms003-10 kernel: [  252.569706] eb82f030: 00 00 00 00 80
>>> 00 00 07 ff ff ff ff ff ff ff ff  
>>> Mar  5 11:08:49 arms003-10 kernel: [  252.58] XFS (sda1): metadata I/O
>>> error: block 0x48b9ff80 ("xfs_trans_read_buf_map") error 117 numblks 8
>>> Mar  5 11:08:49 arms003-10 kernel: [  252.602944] XFS (sda1): Metadata
>>> corruption detected at xfs_dir3_data_read_verify+0x58/0xd0, xfs_dir3_data
>>> block 0x48b9ff80
>>> Mar  5 11:08:49 arms003-10 kernel: [  252.614170] XFS (sda1): Unmount and
>>> run xfs_repair
>>> Mar  5 11:08:49 arms003-10 kernel: [  252.619030] XFS (sda1): First 64
>>> bytes of corrupted metadata buffer:
>>> Mar  5 11:08:49 arms003-10 kernel: [  252.625403] eb901000: 58 46 53 42 00
>>> 00 10 00 00 00 00 00 91 73 fe fb  XFSB.s..
>>> Mar  5 11:08:49 arms003-10 kernel: [  252.633441] eb901010: 00 00 00 00 00
>>> 00 00 00 00 00 00 00 00 00 00 00  
>>> Mar  5 11:08:49 arms003-10 kernel: [  252.641474] eb901020: fc 4e e3 89 50
>>> 8f 42 aa be bc 07 0c 6e fa 83 2f  .N..P.B.n../
>>> Mar  5 11:08:49 arms003-10 kernel: [  252.649519] eb901030: 00 00 00 00 80
>>> 00 00 07 ff ff ff ff ff ff ff ff  
>>> Mar  5 11:08:49 arms003-10 kernel: [  252.657554] XFS (sda1): metadata I/O
>>> error: block 0x48b9ff80 ("xfs_trans_read_buf_map") error 117 numblks 8
>>> Mar  5 11:08:49 arms003-10 kernel: [  252.675056] XFS (sda1): Metadata
>>> corruption detected at xfs_dir3_data_read_verify+0x58/0xd0, xfs_dir3_data
>>> block 0x48b9ff80
>>> Mar  5 11:08:49 arms003-10 kernel: [  252.686228] XFS (sda1): Unmount and
>>> run xfs_repair
>>> Mar  5 11:08:49 arms003-10 kernel: [  252.691054] XFS (sda1): First 64
>>> bytes of corrupted metadata buffer:
>>> Mar  5 11:08:49 arms003-10 kernel: [  252.697425] eb901000: 58 46 53 42 00
>>> 00 10 00 00 00 00 00 91 73 fe fb  XFSB.s..
>>> Mar  5 11:08:49 arms003-10 kernel: [  252.705459] eb901010: 00 00 00 00 00
>>> 00 00 00 00 00 00 00 00 00 00 00  
>>> Mar  5 11:08:49 arms003-10 kernel: [  252.713489] eb901020: fc 4e e3 89 50
>>> 8f 42 aa be bc 07 0c 6e fa 83 2f  .N..P.B.n../
>>> Mar  5 11:08:49 arms003-10 kernel: [  252.721520] eb901030: 00 00 00 00 80
>>> 00 00 07 ff ff ff ff ff ff ff ff  
>>> Mar  5 11:08:49 arms003-10 kernel: [  252.729558] XFS (sda1): metadata I/O
>>> error: block 0x48b9ff80 ("xfs_trans_read_buf_map") error 117 numblks 8
>>> Mar  5 11:08:49 arms003-10 kernel: [  252.741953] XFS 

Re: [ceph-users] XFS Metadata corruption while activating OSD

2018-03-20 Thread 赵赵贺东


> On 12 Mar 2018, at 9:49 AM, Christian Wuerdig wrote:
> 
> Hm, so you're running OSD nodes with 2GB of RAM and 2x10TB = 20TB of storage? 
> Literally everything posted on this list in relation to HW requirements and 
> related problems will tell you that this simply isn't going to work. The 
> slightest hint of a problem will simply kill the OSD nodes with OOM. Have you 
> tried with smaller disks - like 1TB models (or even smaller like 256GB SSDs) 
> and see if the same problem persists?

Thank you for your reply, and I am sorry for my late one.
You are right: when the backend is bluestore, there was OOM from time to time.
We will now upgrade our HW to see whether we can avoid the OOM.
Besides, after we upgraded the kernel from 4.4.39 to 4.4.120, the XFS error
while activating OSDs seems to be fixed.
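
For nodes this memory-constrained, one commonly suggested mitigation is capping the bluestore cache per OSD. A hedged ceph.conf sketch follows; the option name is a Luminous-era default, but the value is only illustrative and not from this thread:

# [osd] section on the 2GB ARM nodes -- value is an example only
#   bluestore_cache_size_hdd = 268435456    # 256 MB instead of the 1 GB default
# After restarting the OSDs one at a time, memory use can be checked with:
ceph daemon osd.0 dump_mempools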

> 
> 
> On Tue, 6 Mar 2018 at 10:51, 赵赵贺东  > wrote:
> Hello ceph-users,
> 
> It is a really, really, REALLY tough problem for our team.
> We have investigated the problem for a long time and tried many things, but
> we can’t solve it; even the root cause of the problem is still
> unclear to us!
> So any solution/suggestion/opinion whatsoever will be highly,
> highly appreciated!!!
> 
> Problem Summary:
> When we activate an OSD, there is metadata corruption on the activating
> disk; the probability is 100%!
> 
> Admin Nodes node:
> Platform: X86
> OS:   Ubuntu 16.04
> Kernel:   4.12.0
> Ceph: Luminous 12.2.2
> 
> OSD nodes:
> Platform: armv7
> OS:   Ubuntu 14.04
> Kernel:   4.4.39
> Ceph: Luminous 12.2.2
> Disk: 10T+10T
> Memory:   2GB
> 
> Deploy log:
> 
> 
> dmesg log: (Sorry, the arms001-01 dmesg log has been lost, but the error
> messages about metadata corruption on arms003-10 are the same as on arms001-01)
> Mar  5 11:08:49 arms003-10 kernel: [  252.534232] XFS (sda1): Unmount and run 
> xfs_repair
> Mar  5 11:08:49 arms003-10 kernel: [  252.539100] XFS (sda1): First 64 bytes 
> of corrupted metadata buffer:
> Mar  5 11:08:49 arms003-10 kernel: [  252.545504] eb82f000: 58 46 53 42 00 00 
> 10 00 00 00 00 00 91 73 fe fb  XFSB.s..
> Mar  5 11:08:49 arms003-10 kernel: [  252.553569] eb82f010: 00 00 00 00 00 00 
> 00 00 00 00 00 00 00 00 00 00  
> Mar  5 11:08:49 arms003-10 kernel: [  252.561624] eb82f020: fc 4e e3 89 50 8f 
> 42 aa be bc 07 0c 6e fa 83 2f  .N..P.B.n../
> Mar  5 11:08:49 arms003-10 kernel: [  252.569706] eb82f030: 00 00 00 00 80 00 
> 00 07 ff ff ff ff ff ff ff ff  
> Mar  5 11:08:49 arms003-10 kernel: [  252.58] XFS (sda1): metadata I/O 
> error: block 0x48b9ff80 ("xfs_trans_read_buf_map") error 117 numblks 8
> Mar  5 11:08:49 arms003-10 kernel: [  252.602944] XFS (sda1): Metadata 
> corruption detected at xfs_dir3_data_read_verify+0x58/0xd0, xfs_dir3_data 
> block 0x48b9ff80
> Mar  5 11:08:49 arms003-10 kernel: [  252.614170] XFS (sda1): Unmount and run 
> xfs_repair
> Mar  5 11:08:49 arms003-10 kernel: [  252.619030] XFS (sda1): First 64 bytes 
> of corrupted metadata buffer:
> Mar  5 11:08:49 arms003-10 kernel: [  252.625403] eb901000: 58 46 53 42 00 00 
> 10 00 00 00 00 00 91 73 fe fb  XFSB.s..
> Mar  5 11:08:49 arms003-10 kernel: [  252.633441] eb901010: 00 00 00 00 00 00 
> 00 00 00 00 00 00 00 00 00 00  
> Mar  5 11:08:49 arms003-10 kernel: [  252.641474] eb901020: fc 4e e3 89 50 8f 
> 42 aa be bc 07 0c 6e fa 83 2f  .N..P.B.n../
> Mar  5 11:08:49 arms003-10 kernel: [  252.649519] eb901030: 00 00 00 00 80 00 
> 00 07 ff ff ff ff ff ff ff ff  
> Mar  5 11:08:49 arms003-10 kernel: [  252.657554] XFS (sda1): metadata I/O 
> error: block 0x48b9ff80 ("xfs_trans_read_buf_map") error 117 numblks 8
> Mar  5 11:08:49 arms003-10 kernel: [  252.675056] XFS (sda1): Metadata 
> corruption detected at xfs_dir3_data_read_verify+0x58/0xd0, xfs_dir3_data 
> block 0x48b9ff80
> Mar  5 11:08:49 arms003-10 kernel: [  252.686228] XFS (sda1): Unmount and run 
> xfs_repair
> Mar  5 11:08:49 arms003-10 kernel: [  252.691054] XFS (sda1): First 64 bytes 
> of corrupted metadata buffer:
> Mar  5 11:08:49 arms003-10 kernel: [  252.697425] eb901000: 58 46 53 42 00 00 
> 10 00 00 00 00 00 91 73 fe fb  XFSB.s..
> Mar  5 11:08:49 arms003-10 kernel: [  252.705459] eb901010: 00 00 00 00 00 00 
> 00 00 00 00 00 00 00 00 00 00  
> Mar  5 11:08:49 arms003-10 kernel: [  252.713489] eb901020: fc 4e e3 89 50 8f 
> 42 aa be bc 07 0c 6e fa 83 2f  .N..P.B.n../
> Mar  5 11:08:49 arms003-10 kernel: [  252.721520] eb901030: 00 00 00 00 80 00 
> 00 07 ff ff ff ff ff ff ff ff  
> Mar  5 11:08:49 arms003-10 kernel: [  252.729558] XFS (sda1): metadata I/O 
> error: block 0x48b9ff80 ("xfs_trans_read_buf_map") error 117 numblks 8
> Mar  5 11:08:49 arms003-10 kernel: [  252.741953] XFS (sda1): Metadata 
> corruption detected at xfs_dir3_data_read_verify+0x58/0xd0, xfs_dir3_data 
> block 

Re: [ceph-users] Object lifecycle and indexless buckets

2018-03-20 Thread Casey Bodley



On 03/20/2018 01:33 PM, Robert Stanford wrote:


 Hello,

 Does object expiration work on indexless (blind) buckets?

 Thank you




No. Lifecycle processing needs to list the bucket's objects, so objects in
indexless buckets would not expire.
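
A hedged way to double-check this on a running cluster, assuming radosgw-admin access (the subcommands below exist in Luminous-era releases, but the exact output fields may vary):

# Buckets with a lifecycle configuration queued for processing
radosgw-admin lc list

# A bucket's stats include its index type (blind buckets are not "Normal")
radosgw-admin bucket stats --bucket=mybucket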



Re: [ceph-users] Backfilling on Luminous

2018-03-20 Thread David Turner
@Pavan, I did not know about 'filestore split rand factor'.  That looks
like it was added in Jewel and I must have missed it.  To disable it, would
I just set it to 0 and restart all of the OSDs?  That isn't an option at
the moment, but restarting the OSDs after this backfilling is done is
definitely doable.
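
For reference, a hedged sketch of checking and disabling that option; setting it to 0 is assumed here to require a ceph.conf entry plus an OSD restart, since it only influences how future splits are scheduled:

# Current value on a running OSD, via the admin socket:
ceph daemon osd.0 config get filestore_split_rand_factor

# To disable, add to the [osd] section of ceph.conf and restart the OSDs:
#   filestore_split_rand_factor = 0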

On Mon, Mar 19, 2018 at 5:28 PM Pavan Rallabhandi <
prallabha...@walmartlabs.com> wrote:

> David,
>
>
>
> Pretty sure you must be aware of the filestore random split on existing
> OSD PGs, `filestore split rand factor`; maybe you could try that too.
>
>
>
> Thanks,
>
> -Pavan.
>
>
>
> *From: *ceph-users  on behalf of David
> Turner 
> *Date: *Monday, March 19, 2018 at 1:36 PM
> *To: *Caspar Smit 
> *Cc: *ceph-users 
> *Subject: *EXT: Re: [ceph-users] Backfilling on Luminous
>
>
>
> Sorry for being away. I set all of my backfilling to VERY slow settings
> over the weekend and things have been stable, but incredibly slow (1%
> recovery from 3% misplaced to 2% all weekend).  I'm back on it now and well
> rested.
>
>
>
> @Caspar, SWAP isn't being used on these nodes and all of the affected OSDs
> have been filestore.
>
>
>
> @Dan, I think you hit the nail on the head.  I didn't know that logging
> was added for subfolder splitting in Luminous!!! That's AMAZING. We are
> seeing consistent subfolder splitting all across the cluster.  The majority
> of the crashed OSDs have a split started before the crash and then
> commenting about it in the crash dump.  Looks like I just need to write a
> daemon to watch for splitting to start and throttle recovery until it's
> done.
>
>
>
> I had injected the following timeout settings, but it didn't seem to
> affect anything.  I may need to have placed them in ceph.conf and let them
> pick up the new settings as the OSDs crashed, but I didn't really want
> different settings on some OSDs in the cluster.
>
>
>
> osd_op_thread_suicide_timeout=1200 (from 180)
>
> osd-recovery-thread-timeout=300  (from 30)
>
>
>
> My game plan for now is to watch for splitting in the log, increase
> recovery sleep, decrease osd_recovery_max_active, and watch for splitting
> to finish before setting them back to more aggressive settings.  After this
> cluster is done backfilling I'm going to do my best to reproduce this
> scenario in a test environment and open a ticket to hopefully fix why this
> is happening so detrimentally.
>
>
>
>
>
> On Fri, Mar 16, 2018 at 4:00 AM Caspar Smit 
> wrote:
>
> Hi David,
>
>
>
> What about memory usage?
>
>
>
> 1] 23 OSD nodes: 15x 10TB Seagate Ironwolf filestore with journals on
> Intel DC P3700, 70% full cluster, Dual Socket E5-2620 v4 @ 2.10GHz, 128GB
> RAM.
>
>
>
> If you upgrade to bluestore, memory usage will likely increase. 15x10TB ~~
> 150GB RAM needed, especially in recovery/backfilling scenarios like these.
>
>
>
> Kind regards,
>
> Caspar
>
>
>
>
>
> 2018-03-15 21:53 GMT+01:00 Dan van der Ster :
>
> Did you use perf top or iotop to try to identify where the osd is stuck?
>
> Did you try increasing the op thread suicide timeout from 180s?
>
>
>
> Splitting should log at the beginning and end of an op, so it should be
> clear if it's taking longer than the timeout.
>
>
>
> .. Dan
>
>
>
>
>
>
>
> On Mar 15, 2018 9:23 PM, "David Turner"  wrote:
>
> I am aware of the filestore splitting happening.  I manually split all of
> the subfolders a couple weeks ago on this cluster, but every time we have
> backfilling the newly moved PGs have a chance to split before the
> backfilling is done.  When that has happened in the past it causes some
> blocked requests and will flap OSDs if we don't increase the
> osd_heartbeat_grace, but it has never consistently killed the OSDs during
> the task.  Maybe that's new in Luminous due to some of the priority and
> timeout settings.
>
>
>
> This problem in general seems unrelated to the subfolder splitting,
> though, since it started to happen very quickly into the backfilling
> process.  Definitely before many of the recently moved PGs would have
> reached that point.  I've also confirmed that the OSDs that are dying are
> not just stuck on a process (like it looks like with filestore splitting),
> but actually segfaulting and restarting.
>
>
>
> On Thu, Mar 15, 2018 at 4:08 PM Dan van der Ster 
> wrote:
>
> Hi,
>
>
>
> Do you see any split or merge messages in the osd logs?
>
> I recall some surprise filestore splitting on a few osds after the
> luminous upgrade.
>
>
>
> .. Dan
>
>
>
>
>
> On Mar 15, 2018 6:04 PM, "David Turner"  wrote:
>
> I upgraded a [1] cluster from Jewel 10.2.7 to Luminous 12.2.2 and last
> week I added 2 nodes to the cluster.  The backfilling has been ATROCIOUS.
> I have OSDs consistently [2] segfaulting during recovery.  There's no
> pattern of which OSDs are 
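
As a rough illustration of the watch-for-splitting-and-throttle plan quoted above, a minimal shell sketch; the log path and the exact log text emitted for subfolder splits are assumptions, so adjust them to what your Luminous OSDs actually print:

# Throttle recovery whenever a filestore subfolder split shows up in the OSD logs
tail -Fn0 /var/log/ceph/ceph-osd.*.log | while read -r line; do
    case "$line" in
        *splitting*)
            ceph tell osd.* injectargs '--osd_recovery_sleep 0.5 --osd_recovery_max_active 1'
            ;;
    esac
done
# ...and relax the settings again manually once the splits have finished.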

Re: [ceph-users] Crush Bucket move crashes mons

2018-03-20 Thread warren.jeffs
Hi Paul,

Many thanks for the replies. I actually did (1) and it worked perfectly; I was
also able to reproduce this on a test monitor.

I have updated the bug with all of this info, so hopefully no one hits this
again.

Many thanks.

Warren

From: Paul Emmerich 
Sent: 20 March 2018 17:21
To: Jeffs, Warren (STFC,RAL,ISIS) 
Cc: ceph-us...@ceph.com
Subject: Re: [ceph-users] Crush Bucket move crashes mons

Hi,
I made the changes directly to the crush map, i.e.,

(1) deleted all the weight_set blocks and then moved the bucket via the CLI,
or
(2) moved the buckets in the crush map and added a new entry to the weight set.


Paul
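
For readers hitting the same crash, a hedged sketch of option (1): decompile the crush map, strip the choose_args/weight_set section by hand, recompile, and re-inject it. Back up the map first; this edits the live crush map of a production cluster.

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
cp crushmap.txt crushmap.txt.bak
# manually remove the "choose_args" block (it contains the weight_set entries)
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new

# The bucket move via the CLI should then no longer crash the mons:
ceph osd crush move rack04 room=R80-Upper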

2018-03-16 21:00 GMT+01:00 
>:
Hi Paul,

Many thanks for the super quick replies and analysis on this.

Is it a case of removing the weights from the new hosts and their OSDs, then
moving them, and reweighting them correctly afterwards?

I already have a bug open; I will get this email chain added to it.

Warren

From: Paul Emmerich [paul.emmer...@croit.io]
Sent: 16 March 2018 16:48
To: Jeffs, Warren (STFC,RAL,ISIS)
Cc: ceph-us...@ceph.com
Subject: Re: [ceph-users] Crush Bucket move crashes mons

Hi,

looks like it fails to adjust the number of weight set entries when moving the 
entries. The good news is that this is 100% reproducible with your crush map:
you should open a bug at http://tracker.ceph.com/ to get this fixed.

Deleting the weight set fixes the problem. Moving the item manually with manual 
adjustment of the weight set also works in my quick test.

Paul


2018-03-16 16:03 GMT+01:00 
>>:
Hi Paul

Many thanks for the reply.

The command is: crush move rack04  room=R80-Upper

Crush map is here: https://pastebin.com/CX7GKtBy

I’ve done some more testing, and the following all work:

• Moving machines between the racks under the default root.

• Renaming racks/hosts under the default root

• Renaming the default root

• Creating a new root

• Adding rack05 and rack04 + hosts nina408 and nina508 into the new root

But when trying to move  anything into the default root it fails.

I have tried moving the following into default root:

• Nina408 – with hosts in and without

• Nina508 – with hosts in and without

• Rack04

• Rack05

• Rack03 – which I created with nothing in it to try and move.


Since my first email I have got the cluster to HEALTH_OK by reweighting/remapping
drives, so everything cluster-wise appears to be functioning fine.

I have not tried manually editing the crush map and reimporting it, given the risk
that it makes the cluster fall over, as this is currently in production. With
the CLI I can at least cancel the command and the monitor comes back up fine.

Many thanks.

Warren


From: Paul Emmerich 
[mailto:paul.emmer...@croit.io>]
Sent: 16 March 2018 13:54
To: Jeffs, Warren (STFC,RAL,ISIS) 
>>
Cc: 
ceph-us...@ceph.com>
Subject: Re: [ceph-users] Crush Bucket move crashes mons

Hi,

the error looks like there might be something wrong with the device classes 
(which are managed via separate trees with magic names behind the scenes).

Can you post your crush map and the command that you are trying to run?
Paul

2018-03-15 16:27 GMT+01:00 
>>:
Hi All,

Having some interesting challenges.

I am trying to move 2 new nodes + 2 new racks into my default root; I have
added them to the cluster outside of root=default.

They are all in and up – happy, it seems. The new nodes have all 12 OSDs in them
and they are all 'up'.

So when trying to move them into the correct room bucket under the default
root, it fails.

This is the error log at the time: https://pastebin.com/mHfkEp3X

I can create another host in the crush map and move it in and out of rack buckets
– all while being outside of the default root. Trying to move an empty rack
bucket into the default root fails too.

The whole cluster is on 12.2.4. I do have 2 backfillfull OSDs, which is the
reason for needing these disks in the cluster ASAP.

Any thoughts?

Cheers

Warren Jeffs

ISIS Infrastructure Services
STFC Rutherford Appleton Laboratory
e-mail:  
warren.je...@stfc.ac.uk>



Re: [ceph-users] How to increase the size of requests written to a ceph image

2018-03-20 Thread Russell Glaue
I wanted to report an update.

We added more Ceph storage nodes so we could take the problem OSDs out, and
speeds are faster.

I found a way to monitor OSD latency in ceph, using "ceph pg dump osds"
The commit latency is always "0" for us.
  fs_perf_stat/commit_latency_ms
But the apply latency shows us the slow OSDs.
  fs_perf_stat/apply_latency_ms

The latest ceph has a prometheus plugin (
http://docs.ceph.com/docs/master/mgr/prometheus/), so this information can
be stored and monitored (e.g. with Grafana). Then, over time, we can see
which OSDs are the problem. (so I don't have to deal with atop, nor run
lots of benchmark tests)  (Use this for older ceph versions:
https://github.com/digitalocean/ceph_exporter)

It turns out we had about 5 problem SSD drives in the slowest Ceph node,
and about 2 in the second slowest. All the other OSDs in those two machines
(the Crucial drives I reported earlier) are running below a max of 0.02
milliseconds - so I just had a few bad drives. In the newest Ceph nodes we
added, we purchased the Kingston drives, and their latency is below a max of
0.001 milliseconds - none are bad drives. I now see up to 28MBps write
speeds, and 260MBps read speeds.

-
# ceph pg dump osds -f json-pretty
dumped osds in format json-pretty

[
{
"osd": 8,
"kb": 1952015104,
"kb_used": 1331273140,
"kb_avail": 620741964,
"hb_in": [
0,
1,
2,
3,
5,
6,
11,
12,
13,
16,
17,
18,
19,
20,
21
],
"hb_out": [],
"snap_trim_queue_len": 0,
"num_snap_trimming": 0,
"op_queue_age_hist": {
"histogram": [],
"upper_bound": 1
},
"fs_perf_stat": {
"commit_latency_ms": 0,
"apply_latency_ms": 49
}
},
...
-
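
A hedged one-liner to pull the per-OSD apply latency out of that dump; it assumes jq is installed and that the JSON layout matches the sample above (field paths may differ between releases):

# Sort OSDs by apply latency, worst first -- a quick way to spot slow drives
ceph pg dump osds -f json 2>/dev/null \
  | jq -r '.[] | "osd.\(.osd) \(.fs_perf_stat.apply_latency_ms) ms"' \
  | sort -k2 -rn | head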




On Fri, Dec 8, 2017 at 9:20 AM, Russell Glaue  wrote:

> Here are some random samples I recorded in the past 30 minutes.
>
>  11 K blocks   10542 kB/s   909 op/s
>  12 K blocks   15397 kB/s  1247 op/s
>  26 K blocks   34306 kB/s  1307 op/s
>  33 K blocks   48509 kB/s  1465 op/s
>  59 K blocks   59333 kB/s   999 op/s
> 172 K blocks  101939 kB/s   590 op/s
> 104 K blocks   82605 kB/s   788 op/s
> 128 K blocks   77454 kB/s   601 op/s
> 136 K blocks   47526 kB/s   348 op/s
>
>
>
> On Fri, Dec 8, 2017 at 2:04 AM, Maged Mokhtar 
> wrote:
>
>> 4M block sizes you will only need 22.5 iops
>>
>> On 2017-12-08 09:59, Maged Mokhtar wrote:
>>
>> Hi Russell,
>>
>> It is probably due to the difference in block sizes used in the test vs
>> your cluster load. You have a latency problem which is limiting your max
>> write iops to around 2.5K. For large block sizes you do not need that many
>> iops, for example if you write in 4M block sizes you will only need 12.5
>> iops to reach your bandwidth of 90 MB/s, in which case your latency problem
>> will not affect your bandwidth. The reason I had suggested you run the
>> original test in 4k size was because this was the original problem subject
>> of this thread, the gunzip test and the small block sizes you were getting
>> with iostat.
>>
>> If you want to know a "rough" ballpark on what block sizes you currently
>> see on your cluster, get the total bandwidth and iops as reported by ceph (
>> ceph status should give you this ) and divide the first by the second.
>>
>> I still think you have a significant latency/iops issue: an all-SSD cluster
>> with 36 OSDs should give much higher than 2.5K iops
>>
>> Maged
>>
>>
>> On 2017-12-07 23:57, Russell Glaue wrote:
>>
>> I want to provide an update to my interesting situation.
>> (New storage nodes were purchased and are going into the cluster soon)
>>
>> I have been monitoring the ceph storage nodes with atop and read/write
>> through put with ceph-dash for the last month.
>> I am regularly seeing 80-90MB/s of write throughput (140MB/s read) on the
>> ceph cluster. At these moments, the problem ceph node I have been speaking
>> of shows 101% disk busy on the same 3 to 4 (of the 9) OSDs. So I am getting
>> the throughput that I want with on the cluster, despite the OSDs in
>> question.
>>
>> However, when I run the bench tests described in this thread, I do not
>> see the write throughput go above 5MB/s.
>> When I take the problem node out, and run the bench tests, I see the
>> throughput double, but not over 10MB/s.
>>
>> Why is the ceph cluster getting up to 90MB/s write in the wild, but not
>> when running the bench tests ?
>>
>> -RG
>>
>>
>>
>>
>> On Fri, Oct 27, 2017 at 4:21 PM, Russell Glaue  wrote:
>>
>>> Yes, several have recommended the fio test now.
>>> I cannot perform a fio test at this time. Because the post referred to
>>> directs us to write the fio test data directly to the disk device, e.g.
>>> /dev/sdj. I'd have to take an OSD completely out in order to 
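
As an aside on the rough block-size estimate quoted above (total bandwidth divided by total IOPS), a minimal worked example using one of the samples from earlier in the thread; the numbers are illustrative only:

# 10542 kB/s at 909 op/s works out to roughly an 11.6 kB average request size
awk 'BEGIN { printf "%.1f kB average block size\n", 10542 / 909 }'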

[ceph-users] Lost space or expected?

2018-03-20 Thread Caspar Smit
Hi all,

Here's the output of 'rados df' for one of our clusters (Luminous 12.2.2):

POOL_NAME  USED  OBJECTS  CLONES  COPIES  MISSING_ON_PRIMARY  UNFOUND  DEGRADED  RD_OPS  RD  WR_OPS  WR
ec_pool 75563G 19450232 0 116701392 0 0 0 385351922 27322G 800335856 294T
rbd 42969M 10881 0 32643 0 0 0 615060980 14767G 970301192 207T
rbdssd 252G 65446 0 196338 0 0 0 29392480 1581G 211205402 2601G

total_objects 19526559
total_used 148T
total_avail 111T
total_space 259T


ec_pool (k=4, m=2)
rbd (size = 3/2)
rbdssd (size = 3/2)

If i calculate the space i should be using:

ec_pool = 75 TB x 1.5 = 112.5 TB  (4+2 is storage times 1.5 right?)
rbd = 42 GB x 3 = 126 GB
rbdssd = 252 GB x 3 = 756 GB

Let's say 114TB in total.

Why is there 148TB used space? (That's a 30TB difference)
Is this expected behaviour? A bug? (if so, how can i reclaim this space?)

kind regards,
Caspar
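
For comparison, a quick recomputation of the expected raw usage from the figures above; it assumes the rados df values are GiB/MiB and only covers data replication, not filestore journals or other per-OSD overhead:

awk 'BEGIN {
    ec     = 75563 / 1024 * 1.5        # k=4,m=2: raw = stored * (k+m)/k = 1.5x
    rbd    = 42969 / 1024 / 1024 * 3   # size=3 replication
    rbdssd = 252 / 1024 * 3
    printf "expected raw usage: %.1f TB\n", ec + rbd + rbdssd   # ~111.5 TB
}'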


[ceph-users] master osd crash during scrub pg or scrub pg manually

2018-03-20 Thread 解决
Good evening everyone.
My Ceph is cross-compiled and runs on an armv7l 32-bit development board. The Ceph
version is 10.2.3 and the compiler version is 6.3.0.
After I placed an object in the RADOS cluster, I scrubbed the object manually.
At this time, the primary OSD crashed.
Here is the OSD log:


 ceph version  ()
 1: (()+0x7a7de8) [0x7fd1dde8]
 2: (__default_sa_restorer()+0) [0xb68db3c0]
 3: (()+0x24309c) [0x7f7b909c]
 4: (std::_Rb_tree_iterator 
std::_Rb_tree, 
hobject_t::BitwiseComparator, std::allocator >::_M_emplace_hint_unique, std::tuple<> 
>(std::_Rb_tree_const_iterator, 
std::piecewise_construct_t const&, std::tuple&&, 
std::tuple<>&&)+0x48) [0x7f87eed8]
 5: (ScrubMap::decode(ceph::buffer::list::iterator&, long long)+0x2b8) 
[0x7fa31498]
 6: (PG::sub_op_scrub_map(std::shared_ptr)+0x1e8) [0x7f862db8]
 7: (ReplicatedPG::do_sub_op(std::shared_ptr)+0x274) [0x7f8acb78]
 8: (ReplicatedPG::do_request(std::shared_ptr&, 
ThreadPool::TPHandle&)+0x518) [0x7f8d201c]
 9: (OSD::dequeue_op(boost::intrusive_ptr, std::shared_ptr, 
ThreadPool::TPHandle&)+0x3c4) [0x7f783e6c]
 10: (PGQueueable::RunVis::operator()(std::shared_ptr&)+0x68) 
[0x7f78412c]
 11: (OSD::ShardedOpWQ::_process(unsigned int, 
ceph::heartbeat_handle_d*)+0x5d4) [0x7f79c664]
 12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x764) 
[0x7fe01da8]
 13: (()+0x88ea18) [0x7fe04a18]
 NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
interpret this.


 0> 2018-03-16 11:26:39.186442 95fe5a30  2 -- 172.16.10.31:6800/6528 >> 
172.16.10.35:6789/0 pipe(0x86236000 sd=23 :41154 s=2 pgs=174 cs=1 l=1 
c=0x8631b7c0).reader got KEEPALIVE_ACK
--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 newstore
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   1/ 5 kinetic
   1/ 5 fuse
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent 1
  max_new 1000
  log_file /var/log/ceph/ceph-osd.1.log




I also debugged with gdb.Here is the gdb debugging information:


[Thread 0x95636a30 (LWP 7835) exited]


Thread 50 "tp_osd_tp" received signal SIGSEGV, Segmentation fault.
0x7f85c09c in std::__cxx11::basic_string::_Alloc_hider::_Alloc_hider (__a=..., __dat=, this=)
at /usr/include/c++/6.3.0/bits/basic_string.h:110
110/usr/include/c++/6.3.0/bits/basic_string.h: No such file or directory.
(gdb) where
#0  0x7f85c09c in std::__cxx11::basic_string::_Alloc_hider::_Alloc_hider (__a=..., __dat=, this=)
at /usr/include/c++/6.3.0/bits/basic_string.h:110
#1  std::__cxx11::basic_string::basic_string (__str=..., this=) at 
/usr/include/c++/6.3.0/bits/basic_string.h:399
#2  object_t::object_t (this=) at 
/usr/src/debug/ceph-src/10.2.3-r0/git/src/include/object.h:32
#3  hobject_t::hobject_t (this=0x859a1850, rhs=...) at 
/usr/src/debug/ceph-src/10.2.3-r0/git/src/common/hobject.h:97
#4  0x7f921ed8 in std::pair::pair(std::tuple&, std::tuple<>&, 
std::_Index_tuple<0u>, std::_Index_tuple<>) (
__tuple2=..., __tuple1=..., this=0x859a1850) at 
/usr/include/c++/6.3.0/tuple:1586
#5  std::pair::pair(std::piecewise_construct_t, std::tuple, std::tuple<>) 
(__second=..., __first=...,
this=0x859a1850) at /usr/include/c++/6.3.0/tuple:1575
#6  __gnu_cxx::new_allocator >::construct 
>(std::pair*, std::piecewise_construct_t 
const&, std::tuple&&, std::tuple<>&&) (
this=, __p=0x859a1850) at 
/usr/include/c++/6.3.0/ext/new_allocator.h:120
#7  std::allocator_traits > >::construct >(std::allocator >&, std::pair*, 
std::piecewise_construct_t const&, std::tuple&&, 
std::tuple<>&&) (__a=..., __p=) at 
/usr/include/c++/6.3.0/bits/alloc_traits.h:455
#8  std::_Rb_tree, 
hobject_t::BitwiseComparator, std::allocator >::_M_construct_node, std::tuple<> 
>(std::_Rb_tree_node*, 
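
For anyone trying to reproduce this, a hedged sketch of triggering the manual scrub on the PG that holds the test object (the pool, object, and PG id below are placeholders):

ceph osd map <pool> <object>     # shows the PG id and the acting OSDs
ceph pg scrub <pgid>             # or: ceph pg deep-scrub <pgid>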

Re: [ceph-users] Cephfs and number of clients

2018-03-20 Thread Patrick Donnelly
On Tue, Mar 20, 2018 at 3:27 AM, James Poole  wrote:
> I have a query regarding cephfs and the preferred number of clients. We are
> currently using luminous cephfs to support storage for a number of web
> servers. We have one file system split into folders, example:
>
> /vol1
> /vol2
> /vol3
> /vol4
>
> At the moment the root of the cephfs filesystem is mounted to each web
> server. The query is would there be a benefit to having separate mount
> points for each folder like above?

Performance benefit? No. Data isolation benefit? Sure.

-- 
Patrick Donnelly
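
For the data-isolation case, a minimal sketch of giving each folder its own restricted client and mounting only that folder; the filesystem name "cephfs", client names, monitor address, and paths are assumptions:

# Create a client that can only access /vol1 (Luminous adds "ceph fs authorize")
ceph fs authorize cephfs client.vol1 /vol1 rw > /etc/ceph/ceph.client.vol1.keyring

# Kernel mount of just that folder with the restricted client
# (the secret file contains only the key= value from the keyring above)
mount -t ceph mon1:6789:/vol1 /mnt/vol1 -o name=vol1,secretfile=/etc/ceph/vol1.secret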


[ceph-users] Cephfs and number of clients

2018-03-20 Thread James Poole

I have a query regarding cephfs and the preferred number of clients. We are
currently using luminous cephfs to support storage for a number of web
servers. We have one file system split into folders, example:

/vol1
/vol2
/vol3
/vol4

At the moment the root of the cephfs filesystem is mounted to each web
server. The query is: would there be a benefit to having separate mount
points for each folder like above? Further reading would indicate
that each mount point would come with its own client and, as a result,
its own capabilities, and hence benefit from more threads?

Many thanks
James


Re: [ceph-users] Reducing pg_num for a pool

2018-03-20 Thread Ovidiu Poncea

That is great news!

Thanks,
Ovidiu

On 03/19/2018 10:44 AM, Gregory Farnum wrote:

Maybe (likely?) in Mimic. Certainly the next release.

Some code has been written but the reason we haven’t done this before 
is the number of edge cases involved, and it’s not clear how long 
rounding those off will take.

-Greg
On Fri, Mar 16, 2018 at 2:38 PM Ovidiu Poncea 
> wrote:


Hi All,

Is there any news on when/if support for decreasing pg_num will be
available?

Thank you,
Ovidiu





Re: [ceph-users] wrong stretch package dependencies (was Luminous v12.2.3 released)

2018-03-20 Thread Micha Krause

Hi,


agreed. but the packages built for stretch do depend on the library

I had a wrong Debian version in my sources.list :-(

Thanks for looking into it.


Micha Krause