Re: [ceph-users] lstat() hangs on single file

2016-02-13 Thread Gregory Farnum
On Sat, Feb 13, 2016 at 8:14 AM, Blade Doyle  wrote:
> Greg,  That's very useful info.  I had not queried the admin sockets before
> today, so I am learning new things!
>
> on the x86_64: mds, mon, and osd, and rbd + cephfs client
> ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
>
> On the arm7 nodes: mon, osd, and rbd + cephfs clients
> ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)
>
> Yes, all reported ops are on the same inode.
>
> I power cycled the entire cluster last night, including mds and clients.
> That *did* clear up the read locks and I was able to access the files again.
>
> However, the stuck read locks have returned ;(   So now I have the same
> issue again.
>
> It seems likely that the "plex" media player is the client that holds the
> initial hanging read lock.  It could be something specific that app is
> doing, or could just be a coincidence.
>
> It would be reasonably easy to update ceph version on the x86_64 (mds
> server, a mon, and the client running plex).  I'll work on that if you think
> it could solve my issues?

It might. If it doesn't I'd be interested in whatever logs and
debugging information you can share with upstream to try and identify
what's going on. :)
-Greg

>
> Thanks again!
> Blade.
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Reducing the impact of OSD restarts (noout ain't uptosnuff)

2016-02-13 Thread Christian Balzer
On Sat, 13 Feb 2016 20:51:19 -0700 Tom Christensen wrote:

> > Next this :
> > ---
> > 2016-02-12 01:35:33.915981 7f75be4d57c0  0 osd.2 1788 load_pgs
> > 2016-02-12 01:36:32.989709 7f75be4d57c0  0 osd.2 1788 load_pgs opened 564 pgs
> > ---
> > Another minute to load the PGs.
> > Same OSD reboot as above : 8 seconds for this.
> 
> Do you really have 564 pgs on a single OSD?  

Yes, and the reason is simple: more than a year ago this cluster should
have grown to 8 OSDs (halving that per-OSD count), and by now to 18 OSDs,
which would be a perfect fit for the 1024 PGs in the rbd pool.
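As a back-of-the-envelope sketch (my numbers here assume replication size 3
for the rbd pool, which the thread doesn't state explicitly): each PG with
size 3 occupies three OSDs, so the average PG count per OSD is
pg_num * size / num_osds:

```shell
# Average PG copies per OSD = pg_num * size / num_osds.
# 1024 PGs in the rbd pool, assumed replication size 3:
echo "4 OSDs:  $((1024 * 3 / 4)) PGs/OSD"
echo "18 OSDs: $((1024 * 3 / 18)) PGs/OSD"
```

At 18 OSDs this lands close to the commonly cited ~100-200 PGs per OSD;
at 4 OSDs it is several times over it.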

>I've never had anything like
> decent performance on an OSD with greater than about 150pgs.  In our
> production clusters we aim for 25-30 primary pgs per osd, 75-90pgs/osd
> total (with size set to 3).  When we initially deployed our large cluster
> with 150-200pgs/osd (total, 50-70 primary pgs/osd, again size 3) we had
> no end of trouble getting pgs to peer.  The OSDs ate RAM like nobody's
> business, took forever to do anything, and in general caused problems.

The cluster performs admirably for the stress it is under; the number of
PGs per OSD never really was an issue when it came to CPU/RAM/network.
For example the restart increased the OSD process size from 1.3 to 2.8GB,
but that left 24GB still "free".
The main reason to have more OSDs (and thus a lower PG count per OSD) is
to have more IOPS from the underlying storage.

> If you're running 564 pgs/osd in this 4 OSD cluster, I'd look at that
> first as the potential culprit.  That is a lot of threads inside the OSD
> process that all need to get CPU/network/disk time in order to peer as
> they come up.  Especially on firefly I would point to this.  We've moved
> to Hammer and that did improve a number of our performance bottlenecks,
> though we've also grown our cluster without adding pgs, so we are now
> down in the 25-30 primary pgs/osd range, and restarting osds, or whole
> nodes (24-32 OSDs for us) no longer causes us pain.  

At that PG count, how good (bad really) is your data balancing out?

> In the past
> restarting a node could cause 5-10 minutes of peering and pain/slow
> requests/unhappiness of various sorts (RAM exhaustion, OOM Killer,
> Flapping OSDs).  

Nodes with that high a number of OSDs I can indeed see causing pain.

> This all improved greatly once we got our pg/osd count
> under 100 even before we upgraded to hammer.
> 

Interesting point, but in my case all the slowness can be attributed to
disk I/O of the respective backing storage, which should be fast enough if
ALL it had to do were to read things in.
I'll see if Hammer behaves better, but I doubt it (especially for the
first time when it upgrades stuff on the disk).

Ultimately, however, I didn't ask how to speed up OSD restarts (I have
plenty of knowledge/ideas on how to do that); I asked about mitigating the
impact of OSD restarts when they are going to be slow, for whatever reason.

Regards,

Christian
> 
> 
> 
> 
> On Sat, Feb 13, 2016 at 11:08 AM, Lionel Bouton <
> lionel-subscript...@bouton.name> wrote:
> 
> > Hi,
> >
> > Le 13/02/2016 15:52, Christian Balzer a écrit :
> > > [..]
> > >
> > > Hum that's surprisingly long. How much data (size and nb of files) do
> > > you have on this OSD, which FS do you use, what are the mount
> > > options, what is the hardware and the kind of access ?
> > >
> > > I already mentioned the HW, Areca RAID controller with 2GB HW cache
> > > and a 7 disk RAID6 per OSD.
> > > Nothing aside from noatime for mount options and EXT4.
> >
> > Thanks for the reminder. That said 7-disk RAID6 and EXT4 is new to me
> > and may not be innocent.
> >
> > >
> > > 2.6TB per OSD and with 1.4 million objects in the cluster a little
> > > more than 700k files per OSD.
> >
> > That's nearly 3x more than my example OSD but it doesn't explain the
> > more than 10x difference in startup time (especially considering BTRFS
> > OSDs are slow to startup and my example was with dropped caches unlike
> > your case). Your average file size is similar so it's not that either.
> > Unless you have a more general, system-wide performance problem which
> > impacts everything including the OSD init, there's 3 main components
> > involved here :
> > - Ceph OSD init code,
> > - ext4 filesystem,
> > - HW RAID6 block device.
> >
> > So either :
> > - OSD init code doesn't scale past ~500k objects per OSD.
> > - your ext4 filesystem is slow for the kind of access used during init
> > (inherently or due to fragmentation, you might want to use filefrag on
> > a random sample on PG directories, omap and meta),
> > - your RAID6 array is slow for the kind of access used during init.
> > - any combination of the above.
> >
> > I believe it's possible but doubtful that the OSD code wouldn't scale
> > at this level (this does not feel like an abnormally high number of
> > objects to me). Ceph devs will know better.
> > ext4 could be a problem as it's not the most common choice for OSDs
> > (from what I read here XFS is usually preferred over it).

Re: [ceph-users] Reducing the impact of OSD restarts (noout ain't uptosnuff)

2016-02-13 Thread Christian Balzer

Hello,

I was about to write something very much along these lines, thanks for
beating me to it. ^o^

On Sat, 13 Feb 2016 21:50:17 -0700 Robert LeBlanc wrote:

> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
> 
> I'm still going to see if I can get Ceph clients to hardly notice that
> an OSD comes back in. Our set up is EXT4 and our SSDs have the hardest
> time with the longest recovery impact. It should be painless no matter
> how slow the drives/CPU/etc are. If it means waiting to service client
> I/O until all the peering, and stuff (not including
> backfilling/recovery because that can be done in the background
> without much impact already) is completed before sending the client
> I/O to the OSD, then that is what I'm going to target. That way if it
> takes 5 minutes for the OSD to get its bearings because it is swapping
> due to low memory or whatever, the clients happily ignore the OSD
> until it says it is ready and don't have all the client I/O fighting
> to get a piece of scarce resources.
> 
Spot on. 
The recommendation in the Ceph documentation is noout; the logic everybody
assumes is that no I/O goes to the OSD until it is actually ready to serve
it, but reality clearly disproves that.
Once the restart takes longer than a few seconds, for whatever reason, it
becomes very visible.
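For reference, the noout procedure under discussion looks roughly like this
(a sketch; init commands and OSD ids are deployment-specific, and as noted
above the flag only suppresses re-balancing, it does not stop clients from
sending I/O to the restarting OSD):

```shell
# Tell the monitors not to mark down OSDs "out" (which would trigger
# backfill) during a planned restart.
ceph osd set noout

# Restart the OSD (service name/id depend on the deployment).
service ceph restart osd.2

# Watch until all PGs are active+clean again, then clear the flag.
ceph -s
ceph osd unset noout
```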

> I appreciate all the suggestions that have been mentioned and believe
> that there is a fundamental issue here that causes a problem when you
> run your hardware into the red zone (like we have to do out of
> necessity). You may be happy with how things are set-up in your
> environment, but I'm not ready to give up on it and I think we can
> make it better. That way it "Just Works" (TM) with more hardware and
> configurations and doesn't need tons of effort to get it tuned just
> right. Oh, and be careful not to touch it, the balance of the force
> might get thrown off and the whole thing will tank. 

This is exactly what happened in my case and we've seen evidence for in
this ML plenty of times.
Like with nearly all things I/O, there is a tipping point: up to it
everything is fine, and then it isn't, often catastrophically so.

> That does not make
> me feel confident. Ceph is so resilient in so many ways already, why
> should this be an Achilles heel for some?

Well said indeed.

Christian
> 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> 
> 
> On Sat, Feb 13, 2016 at 8:51 PM, Tom Christensen wrote:
> >> Next this :
> >> ---
> >> 2016-02-12 01:35:33.915981 7f75be4d57c0  0 osd.2 1788 load_pgs
> >> 2016-02-12 01:36:32.989709 7f75be4d57c0  0 osd.2 1788 load_pgs opened 564 pgs
> >> ---
> >> Another minute to load the PGs.
> >> Same OSD reboot as above : 8 seconds for this.
> >
> > Do you really have 564 pgs on a single OSD?  I've never had anything
> > like decent performance on an OSD with greater than about 150pgs.  In
> > our production clusters we aim for 25-30 primary pgs per osd,
> > 75-90pgs/osd total (with size set to 3).  When we initially deployed
> > our large cluster with 150-200pgs/osd (total, 50-70 primary pgs/osd,
> > again size 3) we had no end of trouble getting pgs to peer.  The OSDs
> > ate RAM like nobody's business, took forever to do anything, and in
> > general caused problems.  If you're running 564 pgs/osd in this 4 OSD
> > cluster, I'd look at that first as the potential culprit.  That is a
> > lot of threads inside the OSD process that all need to get
> > CPU/network/disk time in order to peer as they come up.  Especially on
> > firefly I would point to this.  We've moved to Hammer and that did
> > improve a number of our performance bottlenecks, though we've also
> > grown our cluster without adding pgs, so we are now down in the 25-30
> > primary pgs/osd range, and restarting osds, or whole nodes (24-32 OSDs
> > for us) no longer causes us pain.  In the past restarting a node could
> > cause 5-10 minutes of peering and pain/slow requests/unhappiness of
> > various sorts (RAM exhaustion, OOM Killer, Flapping OSDs).  This all
> > 

Re: [ceph-users] Reducing the impact of OSD restarts (noout ain't uptosnuff)

2016-02-13 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

I'm still going to see if I can get Ceph clients to hardly notice that
an OSD comes back in. Our setup is EXT4 and our SSDs have the hardest
time with the longest recovery impact. It should be painless no matter
how slow the drives/CPU/etc are. If it means waiting to service client
I/O until all the peering, and stuff (not including
backfilling/recovery because that can be done in the background
without much impact already) is completed before sending the client
I/O to the OSD, then that is what I'm going to target. That way if it
takes 5 minutes for the OSD to get its bearings because it is swapping
due to low memory or whatever, the clients happily ignore the OSD
until it says it is ready and don't have all the client I/O fighting
to get a piece of scarce resources.

I appreciate all the suggestions that have been mentioned and believe
that there is a fundamental issue here that causes a problem when you
run your hardware into the red zone (like we have to do out of
necessity). You may be happy with how things are set-up in your
environment, but I'm not ready to give up on it and I think we can
make it better. That way it "Just Works" (TM) with more hardware and
configurations and doesn't need tons of effort to get it tuned just
right. Oh, and be careful not to touch it, the balance of the force
might get thrown off and the whole thing will tank. That does not make
me feel confident. Ceph is so resilient in so many ways already, why
should this be an Achilles heel for some?
-BEGIN PGP SIGNATURE-
Version: Mailvelope v1.3.4
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJWwAeGCRDmVDuy+mK58QAAG6MP/j+JN2z1qLK2KwlQOr/w
dam1U6t1WCzwN1XBpvYvbvJKKMcRHcwKmauuzTLYeEG8FjhgnOcvHaSRoHd8
NURWINnGQrdTbxiMRGDbwC6iWfJypWMDN5d1vibo9aXC8ib7W6l9R21f+Koa
CsgyZV32kSwEs36teeM4JZrZBTlYQ4qRTOsMUDIfE1JFtBaeDjEwyI6gajdB
XsQo3mnqhe4LQC7x9oem/MpKEHp1Y/LO8tyf4jj72ZUp+qmJy2F3+oUPnCdU
P4h3uC0GZUd6l43p5cKW1w/h1mfEwR/9ppsIyufTghqlWFlE6dziaQdlas88
IuDpGwCJfyJhiH18VxbtRpZQpNorJ27uxNjPPDcWNoUFHR8+daTCu+8NU6vT
8xiZhBWpLiH/tShUtR6ZQnumwKgbwc+VOfHj+GSTY/DIfat/zaPxtKYsCHWz
LNE6fkzd4st2Aw7UVPSSUKrH/87RhIEnlipptZsh5SQNFUrl1G5ztNBTj7Xl
tyb+HD1Ge3u2mgS/ycnRGQECyXyUMvPXwITDqHLhN3wF7D/A3616v3Pg2H+v
R/dU8Wq31wA+A0LRuViMJy2PJMgEBoux+zhBsJFun4TPdXkpC15QODhpquMs
/0ofBwHG+FaWmmwVSQ0A0jMGGodfXTAgP4r/tL58JjGTgi1xtQu9L74u5KPD
yHbZ
=rnWI
-END PGP SIGNATURE-

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Sat, Feb 13, 2016 at 8:51 PM, Tom Christensen  wrote:
>> Next this :
>> ---
>> 2016-02-12 01:35:33.915981 7f75be4d57c0  0 osd.2 1788 load_pgs
>> 2016-02-12 01:36:32.989709 7f75be4d57c0  0 osd.2 1788 load_pgs opened 564 pgs
>> ---
>> Another minute to load the PGs.
>> Same OSD reboot as above : 8 seconds for this.
>
> Do you really have 564 pgs on a single OSD?  I've never had anything like
> decent performance on an OSD with greater than about 150pgs.  In our
> production clusters we aim for 25-30 primary pgs per osd, 75-90pgs/osd total
> (with size set to 3).  When we initially deployed our large cluster with
> 150-200pgs/osd (total, 50-70 primary pgs/osd, again size 3) we had no end of
> trouble getting pgs to peer.  The OSDs ate RAM like nobody's business, took
> forever to do anything, and in general caused problems.  If you're running
> 564 pgs/osd in this 4 OSD cluster, I'd look at that first as the potential
> culprit.  That is a lot of threads inside the OSD process that all need to
> get CPU/network/disk time in order to peer as they come up.  Especially on
> firefly I would point to this.  We've moved to Hammer and that did improve a
> number of our performance bottlenecks, though we've also grown our cluster
> without adding pgs, so we are now down in the 25-30 primary pgs/osd range,
> and restarting osds, or whole nodes (24-32 OSDs for us) no longer causes us
> pain.  In the past restarting a node could cause 5-10 minutes of peering and
> pain/slow requests/unhappiness of various sorts (RAM exhaustion, OOM Killer,
> Flapping OSDs).  This all improved greatly once we got our pg/osd count
> under 100 even before we upgraded to hammer.
>
>
>
>
>
> On Sat, Feb 13, 2016 at 11:08 AM, Lionel Bouton
>  wrote:
>>
>> Hi,
>>
>> Le 13/02/2016 15:52, Christian Balzer a écrit :
>> > [..]
>> >
>> > Hum that's surprisingly long. How much data (size and nb of files) do
>> > you have on this OSD, which FS do you use, what are the mount options,
>> > what is the hardware and the kind of access ?
>> >
>> > I already mentioned the HW, Areca RAID controller with 2GB HW cache and
>> > a
>> > 7 disk RAID6 per OSD.
>> > Nothing aside from noatime for mount options and EXT4.
>>
>> Thanks for the reminder. That said 7-disk RAID6 and EXT4 is new to me
>> and may not be innocent.
>>
>> >
>> > 2.6TB per OSD and with 1.4 million objects in the cluster a little more
>> > than 700k files per OSD.
>>
>> That's nearly 3x more than my example OSD but it doesn't explain the
>> more than 10x difference in startup time.

Re: [ceph-users] Reducing the impact of OSD restarts (noout ain't uptosnuff)

2016-02-13 Thread Tom Christensen
> Next this :
> ---
> 2016-02-12 01:35:33.915981 7f75be4d57c0  0 osd.2 1788 load_pgs
> 2016-02-12 01:36:32.989709 7f75be4d57c0  0 osd.2 1788 load_pgs opened 564 pgs
> ---
> Another minute to load the PGs.
> Same OSD reboot as above : 8 seconds for this.

Do you really have 564 pgs on a single OSD?  I've never had anything like
decent performance on an OSD with greater than about 150pgs.  In our
production clusters we aim for 25-30 primary pgs per osd, 75-90pgs/osd
total (with size set to 3).  When we initially deployed our large cluster
with 150-200pgs/osd (total, 50-70 primary pgs/osd, again size 3) we had no
end of trouble getting pgs to peer.  The OSDs ate RAM like nobody's
business, took forever to do anything, and in general caused problems.  If
you're running 564 pgs/osd in this 4 OSD cluster, I'd look at that first as
the potential culprit.  That is a lot of threads inside the OSD process
that all need to get CPU/network/disk time in order to peer as they come
up.  Especially on firefly I would point to this.  We've moved to Hammer
and that did improve a number of our performance bottlenecks, though we've
also grown our cluster without adding pgs, so we are now down in the 25-30
primary pgs/osd range, and restarting osds, or whole nodes (24-32 OSDs for
us) no longer causes us pain.  In the past restarting a node could cause
5-10 minutes of peering and pain/slow requests/unhappiness of various sorts
(RAM exhaustion, OOM Killer, Flapping OSDs).  This all improved greatly
once we got our pg/osd count under 100 even before we upgraded to hammer.





On Sat, Feb 13, 2016 at 11:08 AM, Lionel Bouton <
lionel-subscript...@bouton.name> wrote:

> Hi,
>
> Le 13/02/2016 15:52, Christian Balzer a écrit :
> > [..]
> >
> > Hum that's surprisingly long. How much data (size and nb of files) do
> > you have on this OSD, which FS do you use, what are the mount options,
> > what is the hardware and the kind of access ?
> >
> > I already mentioned the HW, Areca RAID controller with 2GB HW cache and a
> > 7 disk RAID6 per OSD.
> > Nothing aside from noatime for mount options and EXT4.
>
> Thanks for the reminder. That said 7-disk RAID6 and EXT4 is new to me
> and may not be innocent.
>
> >
> > 2.6TB per OSD and with 1.4 million objects in the cluster a little more
> > than 700k files per OSD.
>
> That's nearly 3x more than my example OSD but it doesn't explain the
> more than 10x difference in startup time (especially considering BTRFS
> OSDs are slow to startup and my example was with dropped caches unlike
> your case). Your average file size is similar so it's not that either.
> Unless you have a more general, system-wide performance problem which
> impacts everything including the OSD init, there's 3 main components
> involved here :
> - Ceph OSD init code,
> - ext4 filesystem,
> - HW RAID6 block device.
>
> So either :
> - OSD init code doesn't scale past ~500k objects per OSD.
> - your ext4 filesystem is slow for the kind of access used during init
> (inherently or due to fragmentation, you might want to use filefrag on a
> random sample on PG directories, omap and meta),
> - your RAID6 array is slow for the kind of access used during init.
> - any combination of the above.
>
> I believe it's possible but doubtful that the OSD code wouldn't scale at
> this level (this does not feel like an abnormally high number of objects
> to me). Ceph devs will know better.
> ext4 could be a problem as it's not the most common choice for OSDs
> (from what I read here XFS is usually preferred over it) and it forces
> Ceph to use omap to store data which would be stored in extended
> attributes otherwise (which probably isn't without performance problems).
> RAID5/6 on HW might have performance problems. The usual ones happen on
> writes and OSD init is probably read-intensive (or maybe not, you should
> check the kind of access happening during the OSD init to avoid any
> surprise) but with HW cards it's difficult to know for sure the
> performance limitations they introduce (the only sure way is testing the
> actual access patterns).
>
> So I would probably try to reproduce the problem replacing one OSDs
> based on RAID6 arrays with as many OSDs as you have devices in the arrays.
> Then if it solves the problem and you didn't already do it you might
> want to explore Areca tuning, specifically with RAID6 if you must have it.
>
>
> >
> > And kindly take note that my test cluster has less than 120k objects and
> > thus 15k files per OSD and I still was able to reproduce this behaviour
> (in
> > spirit at least).
>
> I assume the test cluster uses ext4 and RAID6 arrays too: it would be a
> perfect testing environment for defragmentation/switch to XFS/switch to
> single drive OSDs then.
>
> >
> >> The only time I saw OSDs take several minutes to reach the point where
> >> they fully rejoin is with BTRFS with default options/config.
> >>
> > There isn't a pole long enough I would touch BTRFS with for production,
> especially in conjunction with Ceph.

Re: [ceph-users] ceph 9.2.0 SAMSUNG ssd performance issue?

2016-02-13 Thread Huan Zhang
Thanks Nick,
it seems Ceph has a big performance gap on all-SSD setups; software
latency can be a bottleneck.

https://ceph.com/planet/the-ceph-and-tcmalloc-performance-story/
http://www.flashmemorysummit.com/English/Collaterals/Proceedings/2015/20150813_S303E_Zhang.pdf
http://events.linuxfoundation.org/sites/events/files/slides/optimizing_ceph_flash.pdf

Build with jemalloc and try again...



2016-02-12 20:57 GMT+08:00 Nick Fisk :

> I will do my best to answer, but some of the questions are starting to
> stretch the limit of my knowledge
>
> > -Original Message-
> > From: Huan Zhang [mailto:huan.zhang...@gmail.com]
> > Sent: 12 February 2016 12:15
> > To: Nick Fisk 
> > Cc: Irek Fasikhov ; ceph-users  > us...@ceph.com>
> > Subject: Re: [ceph-users] ceph 9.2.0 SAMSUNG ssd performance issue?
> >
> > My enviroment:
> > 32 cores Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz
> > 10GiB NICS
> > 4 osds/host
> >
> > My client is database(mysql) direct/sync write per transaction, a little
> bit
> > sensitive to io latency(sync/direct).
>
> Ok, yes, write latency is important here if your DB's will be doing lots
> of small inserts/updates
>
> > I used sata disk for osd backends, get  ~100 iops/4k/1 iodepth, ~10ms io
> > latency , similar to one sata disk iops (fio direct=1 sync=1 bs=4k).
> >
> > To improve the mysql write performance, use ssd to instead, since ssd
> > latency is over 100 times to sata,
> > But the result is sad to me.
>
> Yes, there is an inherent performance cap in software defined storage,
> mainly due to the fact you are swapping a SAS cable for networking+code.
> You will never get raw SSD performance for low queue depth because of this.
> Although I hope that at some point in the future Ceph should be able to hit
> about 1000iops with replication.
>
> >
> > There are two things still strange to me.
> > 1. fio on the journal partition shows ~77us latency; why is
> > filestore->journal_latency ~1.1ms?
>
> This is most likely due to Ceph not just doing a straight single write.
> There is also other processing likely happening as well. I'm sure someone a
> bit more knowledgeable, could probably elaborate a bit more.
>
> > fio --filename=/dev/sda2 --direct=1 --sync=1 --rw=write --bs=4k \
> >   --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting \
> >   --name=journal-test
> >
> > lat (usec): min=43, max=1503, avg=77.75, stdev=17.42
> >
> > 2. A 1.1ms journal_latency is far better than the SATA disks (5-10ms) I
> > used before; why is the Ceph end-to-end latency not improved (ssd ~7ms,
> > sata ~10ms)?
>
> The journal write is just a small part of the write process. Ie check
> crush map, send replica request...and lots more
>
> > 2ms seems to make sense to me. Is there a way to calculate the total
> > latency, like journal_latency + ... = total latency?
> >
>
> Possibly, but I couldn't even attempt answer this. If you find out, please
> let me know as I would also find this very useful :-)
>
> One thing you can do is turn the debug logging right up and then in the
> logs you can see the steps that each IO takes and how long it took.
>
> Which brings me on to my next point, turn all logging down to 0/0 (
> http://docs.ceph.com/docs/master/rados/troubleshooting/log-and-debug/) .
> At 4k IO's the overhead of logging is significant.
>
> Other things to try are setting the kernel parameter idle=poll, at the
> risk of increased power usage and seeing if you can stop your CPU's going
> into power saving states.
>
> If anybody else has any other good ideas, please step in.
>
> Nick
>
>
> >
> > 2016-02-12 19:28 GMT+08:00 Nick Fisk :
> > Write latency of 1.1ms is ok, but not brilliant. What IO size are you
> testing
> > with?
> >
> > Don't forget if you have a journal latency of 1.1ms, excluding all other
> latency
> > introduced by networking, replication and processing in the OSD code, you
> > won't get more than about 900 iops. All the things I mention all add
> latency
> > and so you often see 2-3ms of latency for a replicated write. This in
> turn will
> > limit you to 300-500 iops for directio writes.
> >
> > The fact you are seeing around 200 could be about right depending on IO
> > size, CPU speed and network speed.
> >
> > Also what is your end use/requirement? This may or may not matter.
> >
> > Nick
> >
> > > -Original Message-
> > > From: Huan Zhang [mailto:huan.zhang...@gmail.com]
> > > Sent: 12 February 2016 11:00
> > > To: Nick Fisk 
> > > Cc: Irek Fasikhov ; ceph-users  > > us...@ceph.com>
> > > Subject: Re: [ceph-users] ceph 9.2.0 SAMSUNG ssd performance issue?
> > >
> > > thanks nick,
> > > filestore->journal_latency: ~1.1ms
> > > (214.0 / 180611 = 0.0011848669239415096)
> > >
> > > seems ssd write is ok, any other idea is highly appreciated!
> > >
> > >  "filestore": {
> > > "journal_queue_max_ops": 300,
> > > "journal_queue_ops": 0,
> > > "journal_ops": 180611,
> > > "journal_queue_max_bytes": 33554432,
> > > "journal_queue_bytes": 0,
> > > "journal_bytes": 3
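The arithmetic in this exchange can be reproduced directly from the perf
counters (journal_latency is reported as a sum/avgcount pair in the
`perf dump` output; the "won't get more than about 900 iops" figure is just
the reciprocal of the per-op latency for serial, iodepth=1 writes):

```shell
# journal_latency from the perf dump above: sum of 214.0 s over 180611 ops.
awk 'BEGIN{printf "journal latency: %.4f ms\n", 214.0 / 180611 * 1000}'

# An OSD spending ~1.1 ms per journal write can, serially (iodepth=1),
# sustain at most 1/latency operations per second:
awk 'BEGIN{printf "max serial iops: %.0f\n", 1 / 0.0011}'
```

Each additional millisecond of replication/network/processing latency
lowers that ceiling further, which is why 2-3ms of total write latency
caps directio writes at roughly 300-500 iops as Nick describes.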

Re: [ceph-users] Reducing the impact of OSD restarts (noout ain't uptosnuff)

2016-02-13 Thread Lionel Bouton
Hi,

Le 13/02/2016 15:52, Christian Balzer a écrit :
> [..]
>
> Hum that's surprisingly long. How much data (size and nb of files) do
> you have on this OSD, which FS do you use, what are the mount options,
> what is the hardware and the kind of access ?
>
> I already mentioned the HW, Areca RAID controller with 2GB HW cache and a
> 7 disk RAID6 per OSD. 
> Nothing aside from noatime for mount options and EXT4.

Thanks for the reminder. That said 7-disk RAID6 and EXT4 is new to me
and may not be innocent.

>  
> 2.6TB per OSD and with 1.4 million objects in the cluster a little more
> than 700k files per OSD.

That's nearly 3x more than my example OSD but it doesn't explain the
more than 10x difference in startup time (especially considering BTRFS
OSDs are slow to startup and my example was with dropped caches unlike
your case). Your average file size is similar so it's not that either.
Unless you have a more general, system-wide performance problem which
impacts everything including the OSD init, there's 3 main components
involved here :
- Ceph OSD init code,
- ext4 filesystem,
- HW RAID6 block device.

So either :
- OSD init code doesn't scale past ~500k objects per OSD.
- your ext4 filesystem is slow for the kind of access used during init
(inherently or due to fragmentation, you might want to use filefrag on a
random sample on PG directories, omap and meta),
- your RAID6 array is slow for the kind of access used during init.
- any combination of the above.
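The fragmentation check suggested above could be sketched like this (a
hypothetical example; /var/lib/ceph/osd/ceph-2 is the default data
directory layout, adjust the OSD id and sample size to taste):

```shell
# Sample 200 random files from the OSD's PG directories and count how
# many are split into more than one extent (i.e. fragmented).
find /var/lib/ceph/osd/ceph-2/current -type f | shuf -n 200 | \
  xargs filefrag | \
  awk -F: '$2+0 > 1 {frag++} END {print frag+0, "of", NR, "files fragmented"}'
```

The same check can be pointed at the omap and meta directories mentioned
above; a high fragmented fraction would implicate the ext4 layout rather
than the OSD init code or the RAID6 array.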

I believe it's possible but doubtful that the OSD code wouldn't scale at
this level (this does not feel like an abnormally high number of objects
to me). Ceph devs will know better.
ext4 could be a problem as it's not the most common choice for OSDs
(from what I read here XFS is usually preferred over it) and it forces
Ceph to use omap to store data which would be stored in extended
attributes otherwise (which probably isn't without performance problems).
RAID5/6 on HW might have performance problems. The usual ones happen on
writes and OSD init is probably read-intensive (or maybe not, you should
check the kind of access happening during the OSD init to avoid any
surprise) but with HW cards it's difficult to know for sure the
performance limitations they introduce (the only sure way is testing the
actual access patterns).

So I would probably try to reproduce the problem replacing one OSDs
based on RAID6 arrays with as many OSDs as you have devices in the arrays.
Then if it solves the problem and you didn't already do it you might
want to explore Areca tuning, specifically with RAID6 if you must have it.


>
> And kindly take note that my test cluster has less than 120k objects and
> thus 15k files per OSD and I still was able to reproduce this behaviour (in
> spirit at least).

I assume the test cluster uses ext4 and RAID6 arrays too: it would be a
perfect testing environment for defragmentation/switch to XFS/switch to
single drive OSDs then.

>
>> The only time I saw OSDs take several minutes to reach the point where
>> they fully rejoin is with BTRFS with default options/config.
>>
> There isn't a pole long enough I would touch BTRFS with for production,
> especially in conjunction with Ceph.

That's a matter of experience and environment but I can understand: we
invested more than a week of testing/development to reach a point where
BTRFS was performing better than XFS in our use case. Not everyone can
dedicate as much time just to select a filesystem and support it. There
might be use cases where it's not even possible to use it (I'm not sure
how it would perform if you only did small objects storage for example).

BTRFS has been invaluable though : it detected and helped fix corruption
generated by faulty Raid controllers (by forcing Ceph to use other
replicas when repairing). I wouldn't let precious data live on anything
other than checksumming filesystems now (the probabilities of
undetectable disk corruption are too high for our use case now). We have
30 BTRFS OSDs in production (and many BTRFS filesystems on other
systems) and we've never had any problem with them. These filesystems
even survived several bad datacenter equipment failures (faulty backup
generator control system and UPS blowing up during periodic testing).
That said I'm susbcribed to linux-btrfs, was one of the SATA controller
driver maintainers long ago so I know my way around kernel code, I hand
pick the kernel versions going to production and we have custom tools
and maintenance procedures for the BTRFS OSDs. So I've means and
experience which make this choice comfortable for me and my team: I
wouldn't blindly advise BTRFS to anyone else (not yet).

Anyway, it's possible ext4 is a problem, but it seems to me less likely
than the HW RAID6. In my experience RAID controllers with cache aren't
really worth it with Ceph. Most of the time they perform well because of
BBWC/FBWC but when you get into a situation where you must
repair/backfill because you lost an OSD or added a new

Re: [ceph-users] lstat() hangs on single file

2016-02-13 Thread Blade Doyle
Greg, that's very useful info.  I had not queried the admin sockets before
today, so I am learning new things!

on the x86_64: mds, mon, and osd, and rbd + cephfs client
ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)

On the arm7 nodes: mon, osd, and rbd + cephfs clients
ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)

Yes, all reported ops are on the same inode.
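To double-check that kind of correlation mechanically, the JSON from the admin socket can be grouped by inode with a few lines of scripting. A rough sketch (the exact `dump_ops_in_flight` JSON shape varies between releases, and the inode regex here is an assumption to adapt to your own op descriptions):

```python
import json
import re
from collections import Counter

def ops_by_inode(dump_json):
    """Count in-flight ops per inode number found in each op description.

    Assumes a hammer-era shape {"ops": [{"description": "..."}]} and that
    descriptions embed an inode like "#10000000345"; adjust the regex for
    the descriptions your daemons actually emit.
    """
    counts = Counter()
    for op in json.loads(dump_json).get("ops", []):
        m = re.search(r"#(0x[0-9a-f]+|\d+)", op.get("description", ""))
        if m:
            counts[m.group(1)] += 1
    return counts

# Fabricated sample data, just to show the grouping:
sample = json.dumps({"ops": [
    {"description": "client_request(client.4135:12 getattr #10000000345)"},
    {"description": "client_request(client.4135:13 lookup #10000000345)"},
    {"description": "client_request(client.4200:7 getattr #10000000999)"},
]})
print(ops_by_inode(sample))  # two of the three ops pile up on 10000000345
```

If every stuck op lands on one inode, that points at a single client holding a cap/lock rather than a general MDS problem.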

I power cycled the entire cluster last night, including mds and clients.
That *did* clear up the read locks and I was able to access the files again.

However, the stuck read locks have returned ;(   So now I have the same
issue again.

It seems likely that the "plex" media player is the client that holds the
initial hanging read lock.  It could be something specific that app is
doing, or could just be a coincidence.

It would be reasonably easy to update ceph version on the x86_64 (mds
server, a mon, and the client running plex).  I'll work on that if you
think it could solve my issues?

Thanks again!
Blade.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Reducing the impact of OSD restarts (noout ain't uptosnuff)

2016-02-13 Thread Christian Balzer

Hello,

On Sat, 13 Feb 2016 11:14:23 +0100 Lionel Bouton wrote:

> Le 13/02/2016 06:31, Christian Balzer a écrit :
> > [...]
> > ---
> > So from shutdown to startup about 2 seconds, not that bad.
> >
> > However here is where the cookie crumbles massively:
> > ---
> > 2016-02-12 01:33:50.263152 7f75be4d57c0  0 filestore(/var/lib/ceph/osd/ceph-2) limited size xattrs
> > 2016-02-12 01:35:31.809897 7f75be4d57c0  0 filestore(/var/lib/ceph/osd/ceph-2) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
> > ---
> > Nearly 2 minutes to mount things, it probably had to go to disk quite a
> > bit, as not everything was in the various slab caches. And yes, there is
> > 32GB of RAM, most of it pagecache and vfs_cache_pressure is set to 1.
> > During that time, silence of the lambs when it came to ops.
>
>
> Hum that's surprisingly long. How much data (size and nb of files) do
> you have on this OSD, which FS do you use, what are the mount options,
> what is the hardware and the kind of access ?
> 
I already mentioned the HW: an Areca RAID controller with 2GB of HW cache
and a 7-disk RAID6 per OSD.
Nothing aside from noatime for mount options, and EXT4.
 
2.6TB per OSD and with 1.4 million objects in the cluster a little more
than 700k files per OSD.

And kindly take note that my test cluster has less than 120k objects and
thus 15k files per OSD and I still was able to reproduce this behaviour (in
spirit at least).
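Timing those startup phases by hand gets tedious; they can be pulled straight out of an OSD log with a short script. A rough sketch (the marker substrings are assumptions matched to the hammer-era log excerpts quoted in this thread, so adjust them for your release):

```python
from datetime import datetime

# Phase markers as they appear in the OSD log excerpts quoted in this
# thread; the exact substrings are assumptions, adjust for your release.
MARKERS = ["limited size xattrs", "mount: enabling", "load_pgs opened"]

def phase_times(log_lines):
    """Return (timestamp, line) pairs for the marker lines, in order."""
    hits = []
    for line in log_lines:
        if any(m in line for m in MARKERS):
            # e.g. "2016-02-12 01:35:31.809897 7f75... 0 filestore(...) ..."
            ts = datetime.strptime(" ".join(line.split()[:2]),
                                   "%Y-%m-%d %H:%M:%S.%f")
            hits.append((ts, line))
    return hits

def durations(hits):
    """Seconds elapsed between consecutive marker lines."""
    return [(b[0] - a[0]).total_seconds() for a, b in zip(hits, hits[1:])]

log = [
    "2016-02-12 01:33:50.263152 7f75be4d57c0  0 filestore(/var/lib/ceph/osd/ceph-2) limited size xattrs",
    "2016-02-12 01:35:31.809897 7f75be4d57c0  0 filestore(/var/lib/ceph/osd/ceph-2) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled",
    "2016-02-12 01:36:32.989709 7f75be4d57c0  0 osd.2 1788 load_pgs opened 564 pgs",
]
print(durations(phase_times(log)))  # ~101.5s to mount, ~61.2s to load PGs
```

Running this across several restarts makes it easy to see which phase regresses as the file count per OSD grows.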

> The only time I saw OSDs take several minutes to reach the point where
> they fully rejoin is with BTRFS with default options/config.
>
There isn't a pole long enough I would touch BTRFS with for production,
especially in conjunction with Ceph.
 
> For reference our last OSD restart only took 6 seconds to complete this
> step. We only have RBD storage, so this OSD with 1TB of data has ~25
> 4M files. It was created ~ 1 year ago and this is after a complete OS
> umount/mount cycle which drops the cache (from experience Ceph mount
> messages doesn't actually imply that the FS was not mounted).
>
The "mount" in the ceph logs clearly is not a FS/OS level mount.
This OSD was up for about 2 years.

My other, more "conventional" production cluster has 400GB and 100k files
per OSD and is very fast to restart as well.
Alas, it is also nowhere near as busy as this cluster, by roughly 2 orders
of magnitude.

> > Next this:
> > ---
> > 2016-02-12 01:35:33.915981 7f75be4d57c0  0 osd.2 1788 load_pgs
> > 2016-02-12 01:36:32.989709 7f75be4d57c0  0 osd.2 1788 load_pgs opened 564 pgs
> > ---
> > Another minute to load the PGs.
>
> Same OSD reboot as above: 8 seconds for this.
> 
> This would be way faster if we didn't start with an umounted OSD.
> 
Again, it was never unmounted from a FS/OS perspective.

Regards,

Christian

> This OSD is still BTRFS but we don't use autodefrag anymore (we replaced
> it with our own defragmentation scheduler) and disabled BTRFS snapshots
> in Ceph to reach this point. Last time I checked an OSD startup was
> still faster with XFS.
> 
> So do you use BTRFS in the default configuration or have a very high
> number of files on this OSD ?
> 
> Lionel


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/


Re: [ceph-users] Reducing the impact of OSD restarts (noout ain't uptosnuff)

2016-02-13 Thread Lionel Bouton
Le 13/02/2016 06:31, Christian Balzer a écrit :
> [...]
> ---
> So from shutdown to startup about 2 seconds, not that bad.
>
> However here is where the cookie crumbles massively:
> ---
> 2016-02-12 01:33:50.263152 7f75be4d57c0  0 filestore(/var/lib/ceph/osd/ceph-2) limited size xattrs
> 2016-02-12 01:35:31.809897 7f75be4d57c0  0 filestore(/var/lib/ceph/osd/ceph-2) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
> ---
> Nearly 2 minutes to mount things, it probably had to go to disk quite a
> bit, as not everything was in the various slab caches. And yes, there is
> 32GB of RAM, most of it pagecache and vfs_cache_pressure is set to 1.
> During that time, silence of the lambs when it came to ops.
Hum that's surprisingly long. How much data (size and nb of files) do
you have on this OSD, which FS do you use, what are the mount options,
what is the hardware and the kind of access ?

The only time I saw OSDs take several minutes to reach the point where
they fully rejoin is with BTRFS with default options/config.

For reference our last OSD restart only took 6 seconds to complete this
step. We only have RBD storage, so this OSD with 1TB of data has ~25
4M files. It was created ~ 1 year ago and this is after a complete OS
umount/mount cycle which drops the cache (from experience Ceph mount
messages doesn't actually imply that the FS was not mounted).

> Next this:
> ---
> 2016-02-12 01:35:33.915981 7f75be4d57c0  0 osd.2 1788 load_pgs
> 2016-02-12 01:36:32.989709 7f75be4d57c0  0 osd.2 1788 load_pgs opened 564 pgs
> ---
> Another minute to load the PGs.
Same OSD reboot as above: 8 seconds for this.

This would be way faster if we didn't start with an umounted OSD.

This OSD is still BTRFS but we don't use autodefrag anymore (we replaced
it with our own defragmentation scheduler) and disabled BTRFS snapshots
in Ceph to reach this point. Last time I checked an OSD startup was
still faster with XFS.
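The core of such a scheduler can be quite small. A hypothetical sketch (names and thresholds invented for illustration, not our actual tool): rank files by their extent count, as filefrag/FIEMAP would report it, skip anything below a fragmentation threshold, and only hand back a small batch per pass so the defragmentation I/O stays throttled:

```python
import heapq

def pick_defrag_targets(fragmentation, threshold=64, batch=2):
    """Select the most fragmented files to defragment next.

    `fragmentation` maps path -> extent count (as filefrag -v would
    report); files at or below `threshold` extents are left alone, and
    at most `batch` files are returned per scheduling pass so the
    defragmentation load stays throttled. All names and values here are
    illustrative, not the actual tool discussed in this thread.
    """
    candidates = [(count, path) for path, count in fragmentation.items()
                  if count > threshold]
    # nlargest returns the worst offenders first
    return [path for count, path in heapq.nlargest(batch, candidates)]

frag = {
    "/var/lib/ceph/osd/ceph-2/current/0.1_head/obj1": 512,
    "/var/lib/ceph/osd/ceph-2/current/0.2_head/obj2": 8,
    "/var/lib/ceph/osd/ceph-2/current/0.3_head/obj3": 130,
    "/var/lib/ceph/osd/ceph-2/current/0.4_head/obj4": 70,
}
print(pick_defrag_targets(frag))  # the two worst offenders: obj1 then obj3
```

A real scheduler would also rate-limit the actual `btrfs filesystem defragment` calls and re-measure after each pass, but the selection logic above is the interesting part.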

So do you use BTRFS in the default configuration or have a very high
number of files on this OSD ?

Lionel