Re: [ceph-users] Major ceph disaster

2019-05-13 Thread Lionel Bouton
On 13/05/2019 at 16:20, Kevin Flöh wrote:
> Dear ceph experts,
>
> [...] We have 4 nodes with 24 osds each and use 3+1 erasure coding. [...]
> Here is what happened: One osd daemon could not be started and
> therefore we decided to mark the osd as lost and set it up from
> scratch. Ceph started recovering and then we lost another osd with the
> same behavior. We did the same as for the first osd.

With 3+1 you only allow a single OSD failure per pg at a given time. You
have 4096 pgs and 96 osds; having 2 OSDs fail at the same time on 2
separate servers (assuming standard crush rules) is a death sentence for
the data on any pg using both of those OSDs (the ones not fully
recovered before the second failure).

Depending on the data stored (CephFS ?) you can probably recover most of
it, but some of it is irretrievably lost.

If you can recover the data from the failed OSDs as it was at the time
they failed, you might be able to recover some of your lost data (with
the help of the Ceph devs); if not, there's nothing to be done.

In the latter case I'd add a new server, use at least 3+2 for a fresh
pool instead of 3+1 and begin moving the data to it.
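
For reference, creating a fresh 3+2 pool looks roughly like this (a
minimal sketch with Luminous-era syntax; the profile name, pool name and
pg count are placeholders to adapt to your cluster):

  ceph osd erasure-code-profile set ec-3-2 k=3 m=2 crush-failure-domain=host
  ceph osd pool create ec32pool 1024 1024 erasure ec-3-2

With m=2 and a host failure domain such a pool can lose any two OSDs, or
a whole host, without losing data.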

The 12.2 + 13.2 mix is a potential problem in addition to the one above
but it's a different one.

Best regards,

Lionel


Re: [ceph-users] KVM+Ceph: Live migration of I/O-heavy VM

2018-12-11 Thread Lionel Bouton
On 11/12/2018 at 15:51, Konstantin Shalygin wrote:
>
>> Currently I plan a migration of a large VM (MS Exchange, 300 Mailboxes
>> and 900GB DB) from qcow2 on ext4 (RAID1) to an all-flash Ceph luminous
>> cluster (which already holds lot's of images).
>> The server has access to both local and cluster-storage, I only need
>> to live migrate the storage, not machine.
>>
>> I have never used live migration as it can cause more issues and the
>> VMs that are already migrated, had planned downtime.
>> Taking the VM offline and convert/import using qemu-img would take
>> some hours but I would like to still serve clients, even if it is
>> slower.
>>
>> The VM is I/O-heavy in terms of the old storage (LSI/Adaptec with
>> BBU). There are two HDDs bound as RAID1 which are constantly under 30%
>> - 60% load (this goes up to 100% during reboot, updates or login
>> prime-time).
>>
>> What happens when either the local compute node or the ceph cluster
>> fails (degraded)? Or network is unavailable?
>> Are all writes performed to both locations? Is this fail-safe? Or does
>> the VM crash in worst case, which can lead to dirty shutdown for MS-EX
>> DBs?
>>
>> The node currently has 4GB free RAM and 29GB listed as cache /
>> available. These numbers need caution because we have "tuned" enabled
>> which causes de-duplication of RAM and this host runs about 10 Windows
>> VMs.
>> During reboots or updates, RAM can get full again.
>>
>> Maybe I am too cautious about live-storage-migration, maybe I am not.
>>
>> What are your experiences or advices?
>>
>> Thank you very much!
>
> I read your message two times and still can't figure out what your
> question is.
>
> You need to move your block image from some storage to Ceph? No, you
> can't do this without downtime because of fs consistency.
>
> You can easily migrate your filesystem via rsync for example, with a
> small downtime to reboot the VM.
>

I believe OP is trying to use the storage migration feature of QEMU.
I've never tried it and I wouldn't recommend it (probably not very
tested and there is a large window for failure).

One tactic that can be used, assuming OP is using LVM in the VM for
storage, is to add a Ceph-backed volume to the VM (which probably needs a
reboot), add the corresponding virtual disk to the VM's volume group and
then migrate all data from the logical volume(s) to the new disk. LVM
uses mirroring internally during the transfer, so you get robustness by
using it. It can be slow (especially with old kernels) but at least it is
safe. I did a DRBD to Ceph migration with this process 5 years ago.
When all logical volumes have been moved to the new disk you can remove
the old disk from the volume group.

Assuming everything is on LVM including the root filesystem, only the
boot partition will have to be moved outside of LVM.
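
For what it's worth, the LVM side of that tactic looks roughly like this
inside the VM (a sketch only; /dev/vdb as the new Ceph-backed disk,
/dev/vda2 as the old physical volume and vg0 as the volume group are
assumptions to adapt):

  pvcreate /dev/vdb             # prepare the new Ceph-backed disk as a PV
  vgextend vg0 /dev/vdb         # add it to the existing volume group
  pvmove /dev/vda2 /dev/vdb     # move every extent off the old PV (mirrored while it runs)
  vgreduce vg0 /dev/vda2        # finally drop the old PV from the volume group

pvmove can be interrupted and resumed, which helps when the move takes
hours on a busy VM.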

Best regards,

Lionel


Re: [ceph-users] Many concurrent drive failures - How do I activate pgs?

2018-05-31 Thread Lionel Bouton
On 31/05/2018 14:41, Simon Ironside wrote:
> On 24/05/18 19:21, Lionel Bouton wrote:
>
>> Unfortunately I just learned that Supermicro found an incompatibility
>> between this motherboard and SM863a SSDs (I don't have more information
>> yet) and they proposed S4600 as an alternative. I immediately remembered
>> that there were problems and asked for a delay/more information and dug
>> out this old thread.
>
> In case it helps you, I'm about to go down the same Supermicro EPYC
> and SM863a path as you. I asked about the incompatibility you
> mentioned and they knew what I was referring to. The incompatibility
> is between the on-board SATA controller and the SM863a and has
> apparently already been fixed.

That's good news.

> Even if not fixed, the incompatibility wouldn't be present if you're
> using a RAID controller instead of the on board SATA (which I intend
> to - don't know if you were?).

I wasn't : we plan to use the 14 on-board SATA connectors. As long as we
can, we use standard SATA/AHCI controllers as they cause fewer headaches
than RAID controllers, even in HBA mode.

Thanks a lot for this information, I've forwarded it to our Supermicro
reseller.

Best regards,

Lionel


Re: [ceph-users] Many concurrent drive failures - How do I activate pgs?

2018-05-24 Thread Lionel Bouton
Hi,

On 22/02/2018 23:32, Mike Lovell wrote:
> hrm. intel has, until a year ago, been very good with ssds. the
> description of your experience definitely doesn't inspire confidence.
> intel also dropping the entire s3xxx and p3xxx series last year before
> having a viable replacement has been driving me nuts.
>
> i don't know that i have the luxury of being able to return all of the
> ones i have or just buying replacements. i'm going to need to at least
> try them in production. it'll probably happen with the s4600 limited
> to a particular fault domain. these are also going to be filestore
> osds so maybe that will result in a different behavior. i'll try to
> post updates as i have them.

Sorry for digging so deep into the archives. I might be in a situation
where I could get S4600s (with filestore initially, but I would very much
like them to support Bluestore without bursting into flames).

To expand a Ceph cluster and test EPYC in our context we have ordered a
server based on a Supermicro EPYC motherboard and SM863a SSDs. For
reference :
https://www.supermicro.nl/Aplus/motherboard/EPYC7000/H11DSU-iN.cfm

Unfortunately I just learned that Supermicro found an incompatibility
between this motherboard and SM863a SSDs (I don't have more information
yet) and they proposed S4600 as an alternative. I immediately remembered
that there were problems and asked for a delay/more information and dug
out this old thread.

Has anyone successfully used Ceph with S4600 ? If so could you share if
you used filestore or bluestore, which firmware was used and
approximately how much data was written on the most used SSDs ?

Best regards,

Lionel



Re: [ceph-users] HW Raid vs. Multiple OSD

2017-11-13 Thread Lionel Bouton
On 13/11/2017 at 15:47, Oscar Segarra wrote:
> Thanks Mark, Peter, 
>
> For clarification, the configuration with RAID5 is having many servers
> (2 or more) with RAID5 and CEPH on top of it. Ceph will replicate data
> between servers. Of course, each server will have just one OSD daemon
> managing a big disk.
>
> It looks functionally is the same using RAID5 +  1 Ceph daemon as 8
> CEPH daemons.

Functionally it's the same but RAID5 will kill your write performance.

For example if you start with 3 OSD hosts and a pool size of 3, due to
RAID5 each and every write on your Ceph cluster will imply, on each
server, a read of every disk but one, followed by a write to *all* the
disks of the cluster.

If you use one OSD per disk you'll have a read on one disk only and a
write on 3 disks only : you'll get approximately 8 times the IOPS for
writes (with 8 disks per server). Clever RAID5 logic can minimize this
for some I/O patterns but it is a bet and will never be as good as what
you'll get with one OSD per disk.
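
Rough, illustrative numbers (assuming ~100 small-write IOPS per 7200rpm
disk, 8 disks per host, 3 hosts, size=3 and ignoring journal overhead):
with one OSD per disk a client write costs one disk write per replica, so
the 24 disks can absorb on the order of 24 x 100 / 3 = 800 client write
IOPS. With one RAID5 OSD per host, every small client write becomes a
read-modify-write involving most of the array on each of the 3 hosts, so
each array behaves roughly like a single disk for small writes and the
cluster tops out around 100 client write IOPS, hence the ~8x factor above.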

Best regards,

Lionel


Re: [ceph-users] dropping filestore+btrfs testing for luminous

2017-07-04 Thread Lionel Bouton
On 04/07/2017 at 19:00, Jack wrote:
> You may just upgrade to Luminous, then replace filestore by bluestore

You don't just "replace" filestore by bluestore on a production cluster :
you transition over several weeks/months from the first to the second.
The two must be rock stable and have predictable performance
characteristics to do that.
We took more than 6 months with Firefly to migrate from XFS to Btrfs and
studied/tuned the cluster along the way. Simply replacing one store with
another without any experience of the real-world behavior of the new one
is just playing with fire (and a huge heap of customer data).
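
To make "transition" concrete, replacing a single filestore OSD with a
bluestore one typically goes something like this (a rough sketch assuming
a Luminous-era toolchain; osd.12 and /dev/sdX are placeholders), repeated
OSD by OSD over those weeks/months:

  ceph osd out 12                              # drain osd.12 and wait for HEALTH_OK
  systemctl stop ceph-osd@12
  ceph osd purge 12 --yes-i-really-mean-it     # remove it from crush/auth/osdmap
  ceph-volume lvm zap /dev/sdX                 # wipe the old filestore device
  ceph-volume lvm create --bluestore --data /dev/sdX
  # wait for backfill to finish before touching the next OSD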

Best regards,

Lionel


Re: [ceph-users] dropping filestore+btrfs testing for luminous

2017-07-04 Thread Lionel Bouton
On 30/06/2017 at 18:48, Sage Weil wrote:
> On Fri, 30 Jun 2017, Lenz Grimmer wrote:
>> Hi Sage,
>>
>> On 06/30/2017 05:21 AM, Sage Weil wrote:
>>
>>> The easiest thing is to
>>>
>>> 1/ Stop testing filestore+btrfs for luminous onward.  We've recommended 
>>> against btrfs for a long time and are moving toward bluestore anyway.
>> Searching the documentation for "btrfs" does not really give a user any
>> clue that the use of Btrfs is discouraged.
>>
>> Where exactly has this been recommended?
>>
>> The documentation currently states:
>>
>> http://docs.ceph.com/docs/master/rados/configuration/ceph-conf/?highlight=btrfs#osds
>>
>> "We recommend using the xfs file system or the btrfs file system when
>> running mkfs."
>>
>> http://docs.ceph.com/docs/master/rados/configuration/filesystem-recommendations/?highlight=btrfs#filesystems
>>
>> "btrfs is still supported and has a comparatively compelling set of
>> features, but be mindful of its stability and support status in your
>> Linux distribution."
>>
>> http://docs.ceph.com/docs/master/start/os-recommendations/?highlight=btrfs#ceph-dependencies
>>
>> "If you use the btrfs file system with Ceph, we recommend using a recent
>> Linux kernel (3.14 or later)."
>>
>> As an end user, none of these statements would really sound as
>> recommendations *against* using Btrfs to me.
>>
>> I'm therefore concerned about just disabling the tests related to
>> filestore on Btrfs while still including and shipping it. This has
>> potential to introduce regressions that won't get caught and fixed.
> Ah, crap.  This is what happens when devs don't read their own 
> documentation.  I recommend against btrfs every time it ever comes up, the 
> downstream distributions all support only xfs, but yes, it looks like the 
> docs never got updated... despite the xfs focus being 5ish years old now.
>
> I'll submit a PR to clean this up, but
>  
>>> 2/ Leave btrfs in the mix for jewel, and manually tolerate and filter out 
>>> the occasional ENOSPC errors we see.  (They make the test runs noisy but 
>>> are pretty easy to identify.)
>>>
>>> If we don't stop testing filestore on btrfs now, I'm not sure when we 
>>> would ever be able to stop, and that's pretty clearly not sustainable.
>>> Does that seem reasonable?  (Pretty please?)
>> If you want to get rid of filestore on Btrfs, start a proper deprecation
>> process and inform users that support for it it's going to be removed in
>> the near future. The documentation must be updated accordingly and it
>> must be clearly emphasized in the release notes.
>>
>> Simply disabling the tests while keeping the code in the distribution is
>> setting up users who happen to be using Btrfs for failure.
> I don't think we can wait *another* cycle (year) to stop testing this.
>
> We can, however,
>
>  - prominently feature this in the luminous release notes, and
>  - require the 'enable experimental unrecoverable data corrupting features =
> btrfs' in order to use it, so that users are explicitly opting in to 
> luminous+btrfs territory.
>
> The only good(ish) news is that we aren't touching FileStore if we can 
> help it, so it's less likely to regress than other things.  And we'll 
> continue testing filestore+btrfs on jewel for some time.
>
> Is that good enough?

Not sure how we will handle the transition. Is bluestore considered
stable in Jewel ? If so, our current clusters (recently migrated from
Firefly to Hammer) will have support for both BTRFS+Filestore and
Bluestore when the next upgrade takes place. If Bluestore is only
considered stable on Luminous I don't see how we can manage the
transition easily. The only path I see is to :
- migrate to XFS+filestore with Jewel (which will not only take time but
will be a regression for us : this will cause performance and sizing
problems on at least one of our clusters and we will lose the silent
corruption detection from BTRFS)
- then upgrade to Luminous and migrate again to Bluestore.
I was not expecting the transition from Btrfs+Filestore to Bluestore to
be this convoluted (we were planning to add Bluestore OSDs one at a time
and study the performance/stability for months before migrating the
whole clusters). Is there any way to restrict your BTRFS tests to at
least a given stable configuration (BTRFS is known to have problems with
the high rate of snapshot deletion Ceph generates by default for example
and we use 'filestore btrfs snap = false') ?

Best regards,

Lionel


Re: [ceph-users] slow requests and short OSD failures in small cluster

2017-04-18 Thread Lionel Bouton
On 18/04/2017 at 11:24, Jogi Hofmüller wrote:
> Hi,
>
> thanks for all you comments so far.
>
> On Thursday, 13.04.2017 at 16:53 +0200, Lionel Bouton wrote:
>> Hi,
>>
>> Le 13/04/2017 à 10:51, Peter Maloney a écrit :
>>> Ceph snapshots really slow things down.
> I can confirm that now :(
>
>> We use rbd snapshots on Firefly (and Hammer now) and I didn't see any
>> measurable impact on performance... until we tried to remove them. We
>> usually have at least one snapshot per VM image, often 3 or 4.
> This might have been true for hammer and older versions of ceph. From
> what I can tell now, every snapshot taken reduces performance of the
> entire cluster :(

The version isn't the only difference here. We use BTRFS with a custom
defragmentation process for the filestores, which is highly uncommon
among Ceph users. As I said, Ceph has support for BTRFS CoW, so part of
the snapshot handling process is actually handled by BTRFS.

Lionel


Re: [ceph-users] slow requests and short OSD failures in small cluster

2017-04-13 Thread Lionel Bouton
On 13/04/2017 at 17:47, mj wrote:
> Hi,
>
> On 04/13/2017 04:53 PM, Lionel Bouton wrote:
>> We use rbd snapshots on Firefly (and Hammer now) and I didn't see any
>> measurable impact on performance... until we tried to remove them.
>
> What exactly do you mean with that?

Just what I said : having snapshots doesn't impact performance, only
removing them (obviously until Ceph is finished cleaning up).

Lionel


Re: [ceph-users] slow requests and short OSD failures in small cluster

2017-04-13 Thread Lionel Bouton
Hi,

On 13/04/2017 at 10:51, Peter Maloney wrote:
> [...]
> Also more things to consider...
>
> Ceph snapshots really slow things down.

We use rbd snapshots on Firefly (and Hammer now) and I didn't see any
measurable impact on performance... until we tried to remove them. We
usually have at least one snapshot per VM image, often 3 or 4.
Note that we use BTRFS filestores where IIRC the CoW is handled by the
filesystem so it might be faster compared to the default/recommended XFS
filestores.

>  They aren't efficient like on
> zfs and btrfs. Having one might take away some % performance, and having
> 2 snaps takes potentially double, etc. until it is crawling. And it's
> not just the CoW... even just rbd snap rm, rbd diff, etc. starts to take
> many times longer. See http://tracker.ceph.com/issues/10823 for
> explanation of CoW. My goal is just to keep max 1 long term snapshot.[...]

In my experience with BTRFS filestores, snap rm impact is proportional
to the amount of data specific to the snapshot being removed (ie: not
present in any other snapshot) but completely unrelated to the number of
existing snapshots. For example the first snapshot removed can be handled
very fast, and it can be the last one removed that takes the most time
and impacts performance the most.

Best regards,

Lionel


Re: [ceph-users] Analysing ceph performance with SSD journal, 10gbe NIC and 2 replicas -Hammer release

2017-01-10 Thread Lionel Bouton
Hi,

On 10/01/2017 at 19:32, Brian Andrus wrote:
> [...]
>
>
> I think the main point I'm trying to address is - as long as the
> backing OSD isn't egregiously handling large amounts of writes and it
> has a good journal in front of it (that properly handles O_DSYNC [not
> D_SYNC as Sebastien's article states]), it is unlikely inconsistencies
> will occur upon a crash and subsequent restart.

I don't see how you can guess if it is "unlikely". If you need SSDs you
are probably handling relatively large amounts of accesses (so large
amounts of writes aren't unlikely) or you would have used cheap 7200rpm
or even slower drives.

Remember that in the default configuration, if you have any 3 OSDs
failing at the same time, you have a chance of losing data. For <30 OSDs
and size=3 this is highly probable as there are only a few thousand
combinations of 3 OSDs possible (and you typically have a thousand or
two pgs picking OSDs in a more or less random pattern).
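
To put rough numbers on it (illustrative only): with 30 OSDs there are
only C(30,3) = 4060 possible sets of 3 OSDs, and with ~2000 pgs each
mapped to a quasi-random set of 3, the expected number of pgs living
exactly on 3 given OSDs is about 2000/4060 — roughly a coin flip that
some pg loses all its copies when those 3 OSDs die together (ignoring
CRUSH placement constraints, which change the exact figure but not the
order of magnitude).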

With SSDs not handling write barriers properly I wouldn't bet on
recovering the filesystems of all OSDs properly given a cluster-wide
power loss shutting down all the SSDs at the same time... In fact as the
hardware will lie about the stored data, the filesystem might not even
detect the crash properly and might apply its own journal on outdated
data leading to unexpected results.
So losing data is a possibility and testing for it is almost impossible
(you'll have to reproduce all the different access patterns your Ceph
cluster could experience at the time of a power loss and trigger the
power losses in each case).

>
> Therefore - while not ideal to rely on journals to maintain consistency,

Ceph journals aren't designed to maintain the filestore's consistency.
They *might* restrict the access patterns to the filesystems in such a
way that running fsck on them after a "let's throw away committed data"
crash has better chances of restoring enough data, but if that's the
case it's only a happy coincidence (and you will have to run these
fscks *manually* as the filesystem can't detect inconsistencies by itself).

> that is what they are there for.

No. They are there for Ceph's internal consistency, not the consistency
of the filesystem backing the filestore. Ceph relies on both journals and
filesystems able to maintain internal consistency and supporting syncfs;
if the journal or the filesystem fails, the OSD is damaged. If 3 OSDs are
damaged at the same time on a size=3 pool you enter "probable data loss"
territory.

> There is a situation where "consumer-grade" SSDs could be used as
> OSDs. While not ideal, it can and has been done before, and may be
> preferable to tossing out $500k of SSDs (Seen it firsthand!)

For these I'd like to know :
- which SSD models were used ?
- how long did the SSDs survive (some consumer SSDs not only lie to the
system about write completions but they usually don't handle large
amounts of write nearly as well as DC models) ?
- how many cluster-wide power losses did the cluster survive ?
- what were the access patterns on the cluster during the power losses ?

If, for a model not guaranteed for sync writes, there haven't been dozens
of power losses on clusters under large loads without any problem
detected in the following week (think deep-scrub), using them is playing
Russian roulette with your data.

AFAIK there have only been reports of data losses and/or heavy
maintenance later when people tried to use consumer SSDs (admittedly
mainly for journals). I've yet to spot long-running robust clusters
built with consumer SSDs.

Lionel


Re: [ceph-users] Analysing ceph performance with SSD journal, 10gbe NIC and 2 replicas -Hammer release

2017-01-07 Thread Lionel Bouton
On 07/01/2017 at 14:11, kevin parrikar wrote:
> Thanks for your valuable input.
> We were using these SSDs in our NAS box (Synology) and they were giving
> 13k iops for our fileserver in raid1. We had a few spare disks which we
> added to our ceph nodes hoping that they would give good performance,
> same as that of the NAS box. (I am not comparing NAS with ceph, just
> giving the reason why we decided to use these SSDs.)
>
> We don't have S3520 or S3610 at the moment but can order one of these
> to see how it performs in ceph. We have 4xS3500 80Gb handy.
> If I create a 2 node cluster with 2xS3500 each and with a replica of
> 2, do you think it can deliver 24MB/s of 4k writes?

Probably not. See
http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
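
IIRC the test from that article boils down to sequential 4k writes with
O_DSYNC semantics on the raw device, something like the following
(destructive for any data on the device; /dev/sdX is a placeholder):

  fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test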

According to the page above the DC S3500 reaches 39MB/s. Its capacity
isn't specified; yours are 80GB only, which is the lowest capacity I'm
aware of, and for all the DC models I know of the speed goes down with
the capacity, so you will probably get less than that.
If you put both data and journal on the same device you cut your
bandwidth in half : this would give you an average of <20MB/s per OSD
(with occasional peaks above that if you don't have a sustained 20MB/s).
With 4 OSDs and size=2, your total write bandwidth is <40MB/s. For a
single stream of data you will only get <20MB/s though (you won't
benefit from parallel writes to the 4 OSDs and will only write to 2 at a
time).

Note that by comparison the 250GB 840 EVO only reaches 1.9MB/s.

But even if you reach the 40MB/s, these models are not designed for
heavy writes: you will probably kill them long before their warranty
expires (IIRC they are rated for ~24GB of writes per day over the
warranty period). In your configuration you only have to write 24GB per
day to the cluster (as you have 4 of them, write both to data and
journal, and use size=2) to be in that situation (this is an average of
only 0.28 MB/s compared to your 24 MB/s target).

> We bought S3500 because last time when we tried ceph, people were
> suggesting this model :) :)

The 3500 series might be enough with the higher capacities in some rare
cases but the 80GB model is almost useless.

You have to do the math considering :
- how much you will write to the cluster (guess high if you have to guess),
- if you will use the SSD for both journals and data (which means
writing twice on them),
- your replication level (which means you will write multiple times the
same data),
- when you expect to replace the hardware,
- the amount of writes per day they support under warranty (if the
manufacturer doesn't present this number prominently they probably are
trying to sell you a fast car headed for a brick wall)

If your hardware can't handle the amount of writes you expect to put on
it then you are screwed. There have been reports of new Ceph users not
aware of this who used cheap SSDs that failed in a matter of months, all
at the same time. You definitely don't want to be in their position.
In fact, as problems happen (hardware failure leading to cluster storage
rebalancing for example), you should probably get a system able to handle
10x the amount of writes you expect it to handle, then monitor the
SSD SMART attributes to be alerted long before they die and replace them
before problems happen. You definitely want a controller allowing access
to this information. If you can't get one, you will have to monitor the
writes and guess this value, which is risky as write amplification inside
SSDs is not easy to guess...
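
For Intel DC drives the relevant wear counters can usually be read with
smartctl (a sketch; attribute names vary between vendors and firmware
versions):

  smartctl -A /dev/sdX | egrep -i 'Media_Wearout_Indicator|Total_LBAs_Written|Wear_Leveling'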

Lionel


Re: [ceph-users] how possible is that ceph cluster crash

2016-11-19 Thread Lionel Bouton
On 19/11/2016 at 00:52, Brian :: wrote:
> This is like your mother telling you not to cross the road when you were 4
> years of age but not telling you it was because you could be flattened
> by a car :)
>
> Can you expand on your answer? If you are in a DC with AB power,
> redundant UPS, dual feed from the electric company, onsite generators,
> dual PSU servers, is it still a bad idea?

Yes it is.

In such a datacenter, where we have a Ceph cluster, there was a complete
shutdown because of a design error : the probes used by the system
responsible for starting and stopping the generators were installed
upstream of the breakers on the feeds. After a blackout where the
generators kicked in, the breakers opened due to a surge when power was
restored. The generators were stopped because power was restored, and
the UPS systems failed 3 minutes later. Closing the breakers couldn't be
done in time (you don't approach them without being heavily protected;
putting on the protective suit takes more time than simply closing the
breaker).

There's no such thing as uninterruptible power supply.

Best regards,

Lionel


Re: [ceph-users] ceph OSD with 95% full

2016-07-19 Thread Lionel Bouton
Hi,

On 19/07/2016 13:06, Wido den Hollander wrote:
>> On 19 July 2016 at 12:37, M Ranga Swami Reddy wrote:
>>
>>
>> Thanks for the correction... so even if one OSD reaches 95% full, the
>> total ceph cluster IO (R/W) will be blocked... Ideally read IO should
>> work...
> That should be a config option, since reading while writes still block is
> also a danger. Multiple clients could read the same object, perform an
> in-memory change and their writes will block.
>
> Now, which client will 'win' after the full flag has been removed?
>
> That could lead to data corruption.

If it did, the clients would be broken, as normal usage (without writes
being blocked) doesn't prevent multiple clients from reading the same
data and trying to write at the same time. So if multiple writes (I
suppose on the same data blocks) can be waiting, the order in which
they are performed *must not* matter in your system. The alternative is
to prevent simultaneous write accesses from multiple clients (this is
how non-cluster filesystems must be configured on top of Ceph/RBD: they
must even be prevented from read-only access to an already mounted fs).

>
> Just make sure you have proper monitoring on your Ceph cluster. At nearfull 
> it goes into WARN and you should act on that.


+1 : monitoring is not an option.
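
For reference the thresholds live in the monitor configuration and the
warning is easy to catch; a minimal sketch (the values shown are the
usual defaults, adjust to your cluster):

  # ceph.conf, [global] section
  mon osd nearfull ratio = 0.85
  mon osd full ratio = 0.95

  ceph health detail    # lists the near-full / full OSDs once the flag is raised
  ceph osd df           # per-OSD utilization, to spot the OSDs to rebalance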

Lionel


Re: [ceph-users] Fwd: Ceph OSD suicide himself

2016-07-12 Thread Lionel Bouton
Hi,

On 12/07/2016 02:51, Brad Hubbard wrote:
>  [...]
>>>> This is probably a fragmentation problem : typical rbd access patterns
>>>> cause heavy BTRFS fragmentation.
>>> To the extent that operations take over 120 seconds to complete? Really?
>> Yes, really. I had these too. By default Ceph/RBD uses BTRFS in a very
>> aggressive way, rewriting data all over the place and creating/deleting
>> snapshots every filestore sync interval (5 seconds max by default IIRC).
>>
>> As I said there are 3 main causes of performance degradation :
>> - the snapshots,
>> - the journal in a standard copy-on-write file (move it out of the FS or
>> use NoCow),
>> - the weak auto defragmentation of BTRFS (autodefrag mount option).
>>
>> Each one of them is enough to impact or even destroy performance in the
>> long run. The 3 combined make BTRFS unusable by default. This is why
>> BTRFS is not recommended : if you want to use it you have to be prepared
>> for some (heavy) tuning. The first 2 points are easy to address, for the
>> last (which begins to be noticeable when you accumulate rewrites on your
>> data) I'm not aware of any other tool than the one we developed and
>> published on github (link provided in previous mail).
>>
>> Another thing : you better have a recent 4.1.x or 4.4.x kernel on your
>> OSDs if you use BTRFS. We've used it since 3.19.x but I wouldn't advise
>> it now and would recommend 4.4.x if it's possible for you and 4.1.x
>> otherwise.
> Thanks for the information. I wasn't aware things were that bad with BTRFS as
> I haven't had much to do with it up to this point.

Bad is relative. BTRFS was very time consuming to set up (mainly because
of the defragmentation scheduler development but finding sources of
inefficiency was no picnic either), but once used properly it has 3
unique advantages :
- data checksums : this forces Ceph to use one good replica by refusing
to hand over corrupted data and makes it far easier to handle silent
data corruption (and some of our RAID controllers, probably damaged by
electrical surges, had this nasty habit of flipping bits so it really
was a big time/data saver here),
- compression : you get more space for free,
- speed : we get better latencies than XFS with it.

Until bluestore is production ready (it should address these points even
better than BTRFS does), if I don't find a use case where BTRFS falls on
its face there's no way I'd use anything but BTRFS with Ceph.

Best regards,

Lionel


Re: [ceph-users] Fwd: Ceph OSD suicide himself

2016-07-11 Thread Lionel Bouton
On 11/07/2016 11:56, Brad Hubbard wrote:
> On Mon, Jul 11, 2016 at 7:18 PM, Lionel Bouton
> <lionel-subscript...@bouton.name> wrote:
>> On 11/07/2016 04:48, 한승진 wrote:
>>> Hi cephers.
>>>
>>> I need your help for some issues.
>>>
>>> The ceph cluster version is Jewel(10.2.1), and the filesytem is btrfs.
>>>
>>> I run 1 Mon and 48 OSD in 4 Nodes(each node has 12 OSDs).
>>>
>>> I've experienced one of OSDs was killed himself.
>>>
>>> Always it issued suicide timeout message.
>> This is probably a fragmentation problem : typical rbd access patterns
>> cause heavy BTRFS fragmentation.
> To the extent that operations take over 120 seconds to complete? Really?

Yes, really. I had these too. By default Ceph/RBD uses BTRFS in a very
aggressive way, rewriting data all over the place and creating/deleting
snapshots every filestore sync interval (5 seconds max by default IIRC).

As I said there are 3 main causes of performance degradation :
- the snapshots,
- the journal in a standard copy-on-write file (move it out of the FS or
use NoCow),
- the weak auto defragmentation of BTRFS (autodefrag mount option).

Each one of them is enough to impact or even destroy performance in the
long run. The 3 combined make BTRFS unusable by default. This is why
BTRFS is not recommended : if you want to use it you have to be prepared
for some (heavy) tuning. The first 2 points are easy to address, for the
last (which begins to be noticeable when you accumulate rewrites on your
data) I'm not aware of any other tool than the one we developed and
published on github (link provided in previous mail).

Another thing : you better have a recent 4.1.x or 4.4.x kernel on your
OSDs if you use BTRFS. We've used it since 3.19.x but I wouldn't advise
it now and would recommend 4.4.x if it's possible for you and 4.1.x
otherwise.

Best regards,

Lionel


Re: [ceph-users] Fwd: Ceph OSD suicide himself

2016-07-11 Thread Lionel Bouton
On 11/07/2016 04:48, 한승진 wrote:
> Hi cephers.
>
> I need your help for some issues.
>
> The ceph cluster version is Jewel(10.2.1), and the filesytem is btrfs.
>
> I run 1 Mon and 48 OSD in 4 Nodes(each node has 12 OSDs).
>
> I've experienced one of OSDs was killed himself.
>
> Always it issued suicide timeout message.

This is probably a fragmentation problem : typical rbd access patterns
cause heavy BTRFS fragmentation.

If you already use the autodefrag mount option, you can try this which
performs much better for us :
https://github.com/jtek/ceph-utils/blob/master/btrfs-defrag-scheduler.rb

Note that it can take some time to fully defragment the filesystems but
it shouldn't put more stress than autodefrag while doing so.

If you don't already use it, set :
filestore btrfs snap = false
in ceph.conf and restart your OSDs.

Finally, if you use journals on the filesystem and not on dedicated
partitions, you'll have to recreate them with the NoCow attribute
(otherwise there's no way to defragment the journals that doesn't kill
performance).
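
A rough sketch of that re-creation for one OSD (assuming the default
journal file inside the OSD data directory; adapt the id/paths and keep
the old journal around until the OSD has restarted cleanly):

  systemctl stop ceph-osd@2                     # or however your distro stops osd.2
  cd /var/lib/ceph/osd/ceph-2
  touch journal.new && chattr +C journal.new    # NoCow must be set while the file is still empty
  dd if=journal of=journal.new bs=1M oflag=direct
  mv journal journal.old && mv journal.new journal
  systemctl start ceph-osd@2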

Best regards,

Lionel


Re: [ceph-users] pg scrub and auto repair in hammer

2016-06-29 Thread Lionel Bouton
Hi,

On 29/06/2016 18:33, Stefan Priebe - Profihost AG wrote:
>> On 28.06.2016 at 09:43, Lionel Bouton
>> <lionel-subscript...@bouton.name> wrote:
>>
>> Hi,
>>
>> On 28/06/2016 08:34, Stefan Priebe - Profihost AG wrote:
>>> [...]
>>> Yes but at least BTRFS is still not working for ceph due to
>>> fragmentation. I've even tested a 4.6 kernel a few weeks ago. But it
>>> doubles its I/O after a few days.
>> BTRFS autodefrag is not working over the long term. That said BTRFS
>> itself is working far better than XFS on our cluster (noticeably better
>> latencies). As not having checksums wasn't an option we coded and are
>> using this:
>>
>> https://github.com/jtek/ceph-utils/blob/master/btrfs-defrag-scheduler.rb
>>
>> This actually saved us from 2 faulty disk controllers which were
>> infrequently corrupting data in our cluster.
>>
>> Mandatory too for performance :
>> filestore btrfs snap = false
> This sounds interesting. For how long you use this method?

More than a year now. Since the beginning almost two years ago we always
had at least one or two BTRFS OSDs to test and compare to the XFS ones.
At the very beginning we had to recycle them regularly because their
performance degraded over time. This was not a problem as Ceph makes it
easy to move data around safely.
We only switched after both finding out that "filestore btrfs snap =
false" was mandatory (when true it creates large write spikes every
filestore sync interval) and that a custom defragmentation process was
needed to maintain performance over the long run.

>  What kind of workload do you have?

A dozen VMs using rbd through KVM built-in support. There are different
kinds of access patterns : a large PostgreSQL instance (75+GB on disk,
300+ tx/s with peaks of ~2000 with a mean of 50+ IO/s and peaks to 1000,
mostly writes), a small MySQL instance (hard to say : was very large but
we moved most of its content to PostgreSQL which left only a small
database for a proprietary tool and large ibdata* files with mostly
holes), a very large NFS server (~10 TB), lots of Ruby on Rails
applications and background workers.

On the whole storage system Ceph reports an average of 170 op/s with
peaks that can reach 3000.

>  How did you measure the performance and latency?

Every useful metric we can get is fed to a Zabbix server. Latency is
measured both by the kernel on each disk, with the average time a request
stays in the queue (accumulated wait time / number of IOs over a given
period : you can find these values in /sys/block//stat), and at Ceph
level by monitoring the apply latency (we now have journals on SSD so
our commit latency is mostly limited by the available CPU).
The most interesting metric is the apply latency; block device latency
is useful to monitor to see how much the device itself is pushed and how
well reads perform (the apply latency only gives us the write side of the
story).
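
As an illustration, the kernel-side average can be derived from that
file: fields 4 and 8 are the accumulated read/write wait times in ms,
fields 1 and 5 the completed requests. A rough one-liner (sda is a
placeholder, and the counters are cumulative so you normally diff two
samples):

  awk '{ if ($1 + $5 > 0) printf "%.2f ms/request\n", ($4 + $8) / ($1 + $5) }' /sys/block/sda/stat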

The behavior during backfills confirmed the latency benefits too : BTRFS
OSDs were less frequently involved in slow requests than the XFS ones.

>  What kernel do you use with btrfs?

4.4.6 currently (we just finished migrating all servers last week-end).
But the switch from XFS to BTRFS occurred with late 3.9 kernels IIRC.

I don't have measurements for this but when we switched from 4.1.15-r1
("-r1" is for Gentoo patches) to 4.4.6 we saw faster OSD startups
(including the initial filesystem mount). The only drawback with BTRFS
(if you don't count having to develop and run a custom defragmentation
scheduler) was the OSD startup times vs XFS. It was very slow when
starting from an unmounted filesystem at least until 4.1.x. This was not
really a problem as we don't restart OSDs often.

Best regards,

Lionel


Re: [ceph-users] Another cluster completely hang

2016-06-29 Thread Lionel Bouton
Hi,

On 29/06/2016 12:00, Mario Giammarco wrote:
> Now the problem is that ceph has put out two disks because scrub  has
> failed (I think it is not a disk fault but due to mark-complete)

There is something odd going on. I've only seen deep-scrub failing (ie
detecting one inconsistency and marking the pg so), so I'm not sure what
happens in the case of a "simple" scrub failure, but what should not
happen is the whole OSD going down on a scrub or deep-scrub failure, which
you seem to imply did happen.
Do you have logs for these two failures giving a hint at what happened
(probably /var/log/ceph/ceph-osd..log) ? Any kernel log pointing to
hardware failure(s) around the time these events happened ?

Another point : you said that you had one disk "broken". Usually ceph
handles this case in the following manner :
- the OSD detects the problem and commits suicide (unless it's configured
to ignore IO errors, which is not the default),
- your cluster is then in degraded state with one OSD down/in,
- after a timeout (several minutes), Ceph decides that the OSD won't
come up again soon and marks the OSD "out" (so one OSD down/out),
- as the OSD is out, crush adapts pg positions based on the remaining
available OSDs and brings all degraded pgs back to a clean state by
creating missing replicas while moving pgs around. You see a lot of IO,
and many pgs in wait_backfill/backfilling states at this point,
- when all is done the cluster is back to HEALTH_OK

When your disk was broken and you waited 24 hours how far along this
process was your cluster ?
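
The output of the usual status commands at that point (or now) would help:

  ceph -s               # overall health, degraded/misplaced counts
  ceph health detail    # which pgs are incomplete/degraded and why
  ceph osd tree         # which OSDs are down and/or out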

Best regards,

Lionel


Re: [ceph-users] pg scrub and auto repair in hammer

2016-06-28 Thread Lionel Bouton
Hi,

On 28/06/2016 08:34, Stefan Priebe - Profihost AG wrote:
> [...]
> Yes but at least BTRFS is still not working for ceph due to
> fragmentation. I've even tested a 4.6 kernel a few weeks ago. But it
> doubles its I/O after a few days.

BTRFS autodefrag is not working over the long term. That said BTRFS
itself is working far better than XFS on our cluster (noticeably better
latencies). As not having checksums wasn't an option we coded and are
using this:

https://github.com/jtek/ceph-utils/blob/master/btrfs-defrag-scheduler.rb

This actually saved us from 2 faulty disk controllers which were
infrequently corrupting data in our cluster.

Mandatory too for performance :
filestore btrfs snap = false

Lionel


Re: [ceph-users] Pinpointing performance bottleneck / would SSD journals help?

2016-06-27 Thread Lionel Bouton
On 27/06/2016 17:42, Daniel Schneller wrote:
> Hi!
>
> We are currently trying to pinpoint a bottleneck and are somewhat stuck.
>
> First things first, this is the hardware setup:
>
> 4x DELL PowerEdge R510, 12x4TB OSD HDDs, journal colocated on HDD
>   96GB RAM, 2x6 Cores + HT
> 2x1GbE bonded interfaces for Cluster Network
> 2x1GbE bonded interfaces for Public Network
> Ceph Hammer on Ubuntu 14.04
>
> 6 OpenStack Compute Nodes with all-RBD VMs (no ephemeral storage).
>
> The VMs run a variety of stuff, most notable MongoDB, Elasticsearch
> and our custom software which uses both the VM's virtual disks as
> well the Rados Gateway for Object Storage.
>
> Recently, under certain more write-intensive conditions, we see reads and
> overall system performance starting to suffer as well.
>
> Here is an iostat -x 3 sample for one of the VMs hosting MongoDB.
> Notice the "await" times (vda is the root, vdb is the data volume).
>
>
> Linux 3.13.0-35-generic (node02)  06/24/2016  _x86_64_  (16 CPU)
>
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            1.55    0.00    0.44    0.42    0.00   97.59
>
> Device:  rrqm/s  wrqm/s    r/s    w/s   rkB/s   wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> vda        0.00    0.91   0.09   1.01    2.55    9.59    22.12     0.01  266.90 2120.51   98.59   4.76   0.52
> vdb        0.00    1.53  18.39  40.79  405.98  483.92    30.07     0.30    5.68    5.42    5.80   3.96  23.43
>
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            5.05    0.00    2.08    3.16    0.00   89.71
>
> Device:  rrqm/s  wrqm/s    r/s    w/s   rkB/s   wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> vda        0.00    0.00   0.00   0.00    0.00    0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> vdb        0.00    7.00  23.00  29.00  368.00  500.00    33.38     1.91  446.00  422.26  464.83  19.08  99.20
>
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>            4.43    0.00    1.73    4.94    0.00   88.90
>
> Device:  rrqm/s  wrqm/s    r/s    w/s   rkB/s   wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> vda        0.00    0.00   0.00   0.00    0.00    0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> vdb        0.00   13.00  45.00  83.00  712.00 1041.00    27.39     2.54 1383.25  272.18 1985.64   7.50  96.00
>
>
> If we read this right, the average time spent waiting for read or write
> requests to be serviced can be multi-second. This would go in line with
> MongoDB's slow log, where we see fully indexed queries, returning a
> single result, taking over a second, where they would normally be
> finished
> quasi instantly.
>
> So far we have looked at these metrics (using StackExchange's Bosun
> from https://bosun.org). Most values are collected every 15 seconds.
>
> * Network Link saturation.
>  All links / bonds are well below any relevant load (around 35MB/s or
>  less)

Are you sure ? On each server you have 12 OSDs with a theoretical
bandwidth of at least half of 100MB/s each (the minimum bandwidth of any
reasonable HDD, halved because of the journal on the same device),
which means your total disk bandwidth per server is around 600MB/s.
Bonded links don't aggregate perfectly (depending on the mode, one
client will either always use the same link or have its traffic
imperfectly balanced between the 2), so your usable network
bandwidth is probably nearer to 1Gbps (~120MB/s).

What could happen is that the 35MB/s is an average over a large period
(several seconds); it's probably peaking at 120MB/s during short bursts.
I wouldn't use less than 10Gbps for both the cluster and public networks
in your case.

You didn't say how many VMs are running : the rkB/s and wkB/s seem very
low (note that for write intensive tasks your VM is reading quite a
bit...) but if you have 10 VMs or more battling for read and write
access this way it wouldn't be unexpected. As soon as latency rises for
one reason or another (here it would be network latency) you can expect
the total throughput of random accesses to plummet.

If your cluster isn't already backfilling or deep scrubbing you can
expect it to crumble on itself when it does (and it will have to perform
these at some point)...

Best regards,

Lionel


Re: [ceph-users] Deprecating ext4 support

2016-04-11 Thread Lionel Bouton
On 12/04/2016 01:40, Lindsay Mathieson wrote:
> On 12/04/2016 9:09 AM, Lionel Bouton wrote:
>> * If the journal is not on a separate partition (SSD), it should
>> definitely be re-created NoCoW to avoid unnecessary fragmentation. From
>> memory : stop OSD, touch journal.new, chattr +C journal.new, dd
>> if=journal of=journal.new (your dd options here for best perf/least
>> amount of cache eviction), rm journal, mv journal.new journal, start OSD
>> again.
>
> Flush the journal after stopping the OSD !
>

No need to: dd makes an exact duplicate.

Lionel


Re: [ceph-users] Deprecating ext4 support

2016-04-11 Thread Lionel Bouton
Hi,

On 11/04/2016 23:57, Mark Nelson wrote:
> [...]
> To add to this on the performance side, we stopped doing regular
> performance testing on ext4 (and btrfs) sometime back around when ICE
> was released to focus specifically on filestore behavior on xfs. 
> There were some cases at the time where ext4 was faster than xfs, but
> not consistently so.  btrfs is often quite fast on fresh fs, but
> degrades quickly due to fragmentation induced by cow with
> small-writes-to-large-object workloads (IE RBD small writes).  If
> btrfs auto-defrag is now safe to use in production it might be worth
> looking at again, but probably not ext4.

For BTRFS, autodefrag is probably not performance-safe (yet), at least
with RBD access patterns. At least it wasn't in 4.1.9 when we tested it
last time (the performance degraded slowly but surely over several weeks
from an initially good performing filesystem to the point where we
measured a 100% increase in average latencies and large spikes and
stopped the experiment). I didn't see any patches on linux-btrfs since
then (it might have benefited from other modifications, but the
autodefrag algorithm wasn't reworked itself AFAIK).
That's not an inherent problem of BTRFS but of the autodefrag
implementation though. Deactivating autodefrag and reimplementing a
basic, cautious defragmentation scheduler gave us noticeably better
latencies with BTRFS vs XFS (~30% better) on the same hardware and
workload long term (as in almost a year and countless full-disk rewrites
on the same filesystems due to both normal writes and rebalancing with 3
to 4 months of XFS and BTRFS OSDs coexisting for comparison purposes).
I'll certainly remount a subset of our OSDs with autodefrag, as I did
with 4.1.9, when we deploy 4.4.x or a later LTS kernel, so I might have
more up-to-date information in the coming months. I don't plan to
compare BTRFS to XFS anymore though : XFS only saves us from running our
defragmentation scheduler; BTRFS is far more suited to our workload and
we've seen constant improvements in behavior along the (arguably bumpy
until late 3.19 versions) 3.16.x to 4.1.x road.

Other things:

* If the journal is not on a separate partition (SSD), it should
definitely be re-created NoCoW to avoid unnecessary fragmentation. From
memory : stop OSD, touch journal.new, chattr +C journal.new, dd
if=journal of=journal.new (your dd options here for best perf/least
amount of cache eviction), rm journal, mv journal.new journal, start OSD
again.
* filestore btrfs snap = false
  is mandatory if you want consistent performance (at least on HDDs). It
may not be felt with almost empty OSDs but performance hiccups appear if
any non trivial amount of data is added to the filesystems.
  IIRC, after debugging surprisingly the snapshot creation didn't seem
to be the actual cause of the performance problems but the snapshot
deletion... It's so bad that the default should probably be false and
not true.

Lionel


Re: [ceph-users] ZFS or BTRFS for performance?

2016-03-20 Thread Lionel Bouton
Hi,

On 20/03/2016 15:23, Francois Lafont wrote:
> Hello,
>
> On 20/03/2016 04:47, Christian Balzer wrote:
>
>> That's not protection, that's an "uh-oh, something is wrong, you better
>> check it out" notification, after which you get to spend a lot of time
>> figuring out which is the good replica 
> In fact, I have never been confronted to this case so far and I have a
> couple of questions.
>
> 1. When it happens (ie a deep scrub fails), is it mentioned in the output
> of the "ceph status" command and, in this case, can you confirm to me
> that the health of the cluster in the output is different of "HEALTH_OK"?

Yes. This is obviously a threat to your data so the cluster isn't
HEALTH_OK (HEALTH_WARN IIRC).

>
> 2. For instance, if it happens with the PG id == 19.10 and if I have 3 OSDs
> for this PG (because my pool has replica size == 3). I suppose that the
> concerned OSDs are OSD id == 1, 6 and 12. Can you tell me if this "naive"
> method is valid to solve the problem (and, if not, why)?
>
> a) ssh in the node which hosts osd-1 and I launch this command:
> ~# id=1 && sha1sum /var/lib/ceph/osd/ceph-$id/current/19.10_head/* | 
> sed "s|/ceph-$id/|/ceph-id/|" | sha1sum
> 055b0fd18cee4b158a8d336979de74d25fadc1a3  -
>
> b) ssh in the node which hosts osd-6 and I launch this command:
> ~# id=6 && sha1sum /var/lib/ceph/osd/ceph-$id/current/19.10_head/* | 
> sed "s|/ceph-$id/|/ceph-id/|" | sha1sum
> 055b0fd18cee4b158a8d336979de74d25fadc1a3 -
>
> c) ssh in the node which hosts osd-12 and I launch this command:
> ~# id=12 && sha1sum /var/lib/ceph/osd/ceph-$id/current/19.10_head/* | 
> sed "s|/ceph-$id/|/ceph-id/|" | sha1sum
> 3f786850e387550fdab836ed7e6dc881de23001b -

You may get 3 different hashes because of concurrent writes on the PG.
So you may have to restart your commands and probably try to launch them
at the same time on all nodes to avoid this problem. If you have
constant heavy writes on all your PGs this will probably never give a
useful result.

>
> I notice that the result is different for osd-12 so it's the "bad" osd.
> So, in the node which hosts osd-12, I launch this command:
>
> id=12 && rm /var/lib/ceph/osd/ceph-$id/current/19.10_head/*

You should stop the OSD, flush its journal and then do this before
restarting the OSD.
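
A minimal sketch of that step for osd.12 (run on the node hosting it; the
exact service command depends on your init system):

  systemctl stop ceph-osd@12       # or: service ceph stop osd.12
  ceph-osd -i 12 --flush-journal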

> And now I can launch safely this command:
>
> ceph pg repair 19.10
>
> Is there a problem with this "naive" method?

It is probably overkill (and may not work, see above). Usually you can
find out the exact file (see the link in my previous post) in this
directory which differs and should be deleted. I believe that if the
offending file isn't on the primary you can directly launch the repair
command.

Lionel


Re: [ceph-users] ZFS or BTRFS for performance?

2016-03-19 Thread Lionel Bouton
On 19/03/2016 18:38, Heath Albritton wrote:
> If you google "ceph bluestore" you'll be able to find a couple slide
> decks on the topic.  One of them by Sage is easy to follow without the
> benefit of the presentation.  There's also the " Redhat Ceph Storage
> Roadmap 2016" deck.
>
> In any case, bluestore is not intended to address bitrot.  Given that
> ceph is a distributed file system, many of the posix file system
> features are not required for the underlying block storage device.
>  Bluestore is intended to address this and reduce the disk IO required
> to store user data.
>
> Ceph protects against bitrot at a much higher level by validating the
> checksum of the entire placement group during a deep scrub.

My impression is that the only protection against bitrot is provided by
the underlying filesystem which means that you don't get any if you use
XFS or EXT4.

I can't trust Ceph on this alone until its bitrot protection (if any) is
clearly documented. The situation is far from clear right now. The
documentation states that deep scrubs use checksums to validate
data, but this is not good enough, at least because we don't know what
these checksums are supposed to cover (see below for another reason).
There is even this howto by Sebastien Han about repairing a PG :
http://www.sebastien-han.fr/blog/2015/04/27/ceph-manually-repair-object/
which clearly concludes that with only 2 replicas you can't reliably
find out which object is corrupted with Ceph alone. If Ceph really
stored checksums to verify all the objects it stores we could manually
check which replica is valid.

Even if deep scrubs did use checksums to verify data this would not be
enough to protect against bitrot: there is a window between a corruption
event and a deep scrub during which the data on a primary can be returned to a
client. BTRFS solves this problem by returning an IO error for any data
read that doesn't match its checksum (or automatically rebuilds it if
the allocation group is using RAID1/10/5/6). I've never seen this kind
of behavior documented for Ceph.

Lionel


Re: [ceph-users] ZFS or BTRFS for performance?

2016-03-18 Thread Lionel Bouton
Hi,

On 18/03/2016 20:58, Mark Nelson wrote:
> FWIW, from purely a performance perspective Ceph usually looks pretty
> fantastic on a fresh BTRFS filesystem.  In fact it will probably
> continue to look great until you do small random writes to large
> objects (like say to blocks in an RBD volume).  Then COW starts
> fragmenting the objects into oblivion.  I've seen sequential read
> performance drop by 300% after 5 minutes of 4K random writes to the
> same RBD blocks.
>
> Autodefrag might help.

With 3.19 it wasn't enough for our workload and we had to develop our
own defragmentation scheduler, see https://github.com/jtek/ceph-utils.
We tried autodefrag again with a 4.0.5 kernel but it wasn't good enough
yet (and based on my reading of the linux-btrfs list I don't think there
is any work being done on it currently).

>   A long time ago I recall Josef told me it was dangerous to use (I
> think it could run the node out of memory and corrupt the FS), but it
> may be that it's safer now.

No problem here (as long as we use our defragmentation scheduler,
otherwise the performance degrades over time/amount of rewrites).

>   In any event we don't really do a lot of testing with BTRFS these
> days as bluestore is indeed the next gen OSD backend.

Will bluestore provide the same protection against bitrot as BTRFS?
Ie: with BTRFS the deep-scrubs detect inconsistencies *and* the OSD(s)
with invalid data get IO errors when trying to read corrupted data, and
as such can't be used as the source for repairs even if they are primary
OSD(s). So with BTRFS you get a pretty good overall protection against
bitrot in Ceph (it allowed us to automate the repair process in the most
common cases). With XFS, IIRC, unless you override the default behavior
the primary OSD is always the source for repairs (even if all the
secondaries agree on another version of the data).

Best regards,

Lionel


Re: [ceph-users] Help: pool not responding

2016-02-29 Thread Lionel Bouton
On 29/02/2016 22:50, Shinobu Kinjo wrote:
>> the fact that they are optimized for benchmarks and certainly not
>> Ceph OSD usage patterns (with or without internal journal).
> Are you assuming that SSHD is causing the issue?
> If you could elaborate on this more, it would be helpful.

Probably not (unless they reveal themselves extremely unreliable with
Ceph OSD usage patterns which would be surprising to me).

For incomplete PG the documentation seems good enough for what should be
done :
http://docs.ceph.com/docs/master/rados/operations/pg-states/

The relevant text:

/Incomplete/
Ceph detects that a placement group is missing information about
writes that may have occurred, or does not have any healthy copies.
If you see this state, try to start any failed OSDs that may contain
the needed information or temporarily adjust min_size to allow recovery.

We don't have the full history, but the most probable cause of these
incomplete PGs is that min_size is set to 2 or 3 and at some point the 4
incomplete PGs didn't have as many replicas as the min_size value. So if
setting min_size to 2 isn't enough, setting it to 1 should unfreeze them.
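For reference, something along these lines is what I mean by "temporarily
adjust min_size" (a minimal sketch, assuming a pool named rbd, a working
ceph CLI and that its JSON output exposes a min_size field; the pool name
and the manual wait are placeholders to adapt before use):

#!/usr/bin/env python3
# Minimal sketch: temporarily lower min_size on a pool so incomplete PGs can
# recover, then restore the previous value. Pool name and the manual wait are
# placeholders, adapt them to your cluster.
import json
import subprocess

POOL = "rbd"  # placeholder pool name

def ceph(*args):
    return subprocess.check_output(("ceph",) + args).decode()

# Remember the current min_size so it can be restored afterwards.
current = json.loads(ceph("osd", "pool", "get", POOL, "min_size",
                          "--format", "json"))["min_size"]
print("current min_size:", current)

# Allow recovery/writes with a single healthy replica.
ceph("osd", "pool", "set", POOL, "min_size", "1")

input("Wait until the incomplete PGs are active+clean, then press Enter...")

# Put the original value back.
ceph("osd", "pool", "set", POOL, "min_size", str(current))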

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to properly deal with NEAR FULL OSD

2016-02-19 Thread Lionel Bouton
Le 19/02/2016 17:17, Don Laursen a écrit :
>
> Thanks. To summarize
>
> Your data, images+volumes = 27.15% space used
>
> Raw used = 81.71% used
>
>  
>
> This is a big difference that I can’t account for? Can anyone? So is
> your cluster actually full?
>

I believe this is the pool size being accounted for and it is harmless:
3 x 27.15 = 81.45 which is awfully close to 81.71.
We have the same behavior on our Ceph cluster.
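To make the arithmetic explicit (a tiny sketch using the numbers quoted above):

# Sanity check of the "raw used" vs "data used" gap: with 3 replicas the raw
# usage should be roughly 3x the logical data usage. Numbers are the ones
# quoted above.
data_used_pct = 27.15   # images + volumes
replication = 3         # pool size
raw_used_pct = 81.71    # raw used as reported by ceph

expected = data_used_pct * replication
print(f"expected raw used: {expected:.2f}%")                       # 81.45%
print(f"reported raw used: {raw_used_pct:.2f}%")
print(f"unexplained gap  : {raw_used_pct - expected:.2f} points")  # journals, fs overhead, ...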

>  
>
> I had the same problem with my small cluster. Raw used was about 85%
> and actual data, with replication, was about 30%. My OSDs were also
> BRTFS. BRTFS was causing its own problems. I fixed my problem by
> removing each OSD one at a time and re-adding as the default XFS
> filesystem. Doing so brought the percentages used to be about the same
> and it’s good now.
>

That's odd : AFAIK we had the same behaviour with XFS before migrating
to BTRFS.

Best regards,

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Reducing the impact of OSD restarts (noout ain't uptosnuff)

2016-02-13 Thread Lionel Bouton
Le 13/02/2016 06:31, Christian Balzer a écrit :
> [...]
> ---
> So from shutdown to startup about 2 seconds, not that bad.
> However here is where the cookie crumbles massively:
> ---
> 2016-02-12 01:33:50.263152 7f75be4d57c0  0 filestore(/var/lib/ceph/osd/ceph-2) limited size xattrs
> 2016-02-12 01:35:31.809897 7f75be4d57c0  0 filestore(/var/lib/ceph/osd/ceph-2) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
> ---
> Nearly 2 minutes to mount things, it probably had to go to disk quite a
> bit, as not everything was in the various slab caches. And yes, there is
> 32GB of RAM, most of it pagecache and vfs_cache_pressure is set to 1.
> During that time, silence of the lambs when it came to ops.
Hum that's surprisingly long. How much data (size and nb of files) do
you have on this OSD, which FS do you use, what are the mount options,
what is the hardware and the kind of access ?

The only time I saw OSDs take several minutes to reach the point where
they fully rejoin is with BTRFS with default options/config.

For reference our last OSD restart only took 6 seconds to complete this
step. We only have RBD storage, so this OSD with 1TB of data has roughly
250k 4MB files. It was created ~1 year ago and this is after a complete OS
umount/mount cycle which drops the cache (from experience Ceph mount
messages don't actually imply that the FS was not mounted).

> Next this :
> ---
> 2016-02-12 01:35:33.915981 7f75be4d57c0  0 osd.2 1788 load_pgs
> 2016-02-12 01:36:32.989709 7f75be4d57c0  0 osd.2 1788 load_pgs opened 564 pgs
> ---
> Another minute to load the PGs.
Same OSD reboot as above : 8 seconds for this.

This would be way faster if we didn't start with an umounted OSD.

This OSD is still BTRFS but we don't use autodefrag anymore (we replaced
it with our own defragmentation scheduler) and disabled BTRFS snapshots
in Ceph to reach this point. Last time I checked an OSD startup was
still faster with XFS.

So do you use BTRFS in the default configuration or have a very high
number of files on this OSD ?

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Reducing the impact of OSD restarts (noout ain't uptosnuff)

2016-02-13 Thread Lionel Bouton
Hi,

Le 13/02/2016 15:52, Christian Balzer a écrit :
> [..]
>
> Hum that's surprisingly long. How much data (size and nb of files) do
> you have on this OSD, which FS do you use, what are the mount options,
> what is the hardware and the kind of access ?
>
> I already mentioned the HW, Areca RAID controller with 2GB HW cache and a
> 7 disk RAID6 per OSD. 
> Nothing aside from noatime for mount options and EXT4.

Thanks for the reminder. That said 7-disk RAID6 and EXT4 is new to me
and may not be innocent.

>  
> 2.6TB per OSD and with 1.4 million objects in the cluster a little more
> than 700k files per OSD.

That's nearly 3x more than my example OSD but it doesn't explain the
more than 10x difference in startup time (especially considering BTRFS
OSDs are slow to startup and my example was with dropped caches unlike
your case). Your average file size is similar so it's not that either.
Unless you have a more general, system-wide performance problem which
impacts everything including the OSD init, there's 3 main components
involved here :
- Ceph OSD init code,
- ext4 filesystem,
- HW RAID6 block device.

So either:
- the OSD init code doesn't scale past ~500k objects per OSD,
- your ext4 filesystem is slow for the kind of access used during init
(inherently or due to fragmentation; you might want to run filefrag on a
random sample of files under the PG directories, omap and meta, see the
sampling sketch after this list),
- your RAID6 array is slow for the kind of access used during init,
- or any combination of the above.
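Here is roughly what I mean by sampling with filefrag (a quick sketch, not
a polished tool: it shells out to filefrag from e2fsprogs, and the OSD path
is only an example):

#!/usr/bin/env python3
# Rough fragmentation sampling: pick a random sample of files under an OSD
# data directory and report the average number of extents per file as seen
# by filefrag (e2fsprogs). The path below is only an example.
import os
import random
import re
import subprocess

ROOT = "/var/lib/ceph/osd/ceph-2/current"  # example path, adapt to your OSD
SAMPLE = 200

files = []
for dirpath, _, names in os.walk(ROOT):
    files.extend(os.path.join(dirpath, n) for n in names)

sample = random.sample(files, min(SAMPLE, len(files)))
extents = []
for path in sample:
    out = subprocess.run(["filefrag", path], capture_output=True, text=True).stdout
    m = re.search(r"(\d+) extents? found", out)
    if m:
        extents.append(int(m.group(1)))

if extents:
    print(f"sampled {len(extents)} files, "
          f"average {sum(extents)/len(extents):.1f} extents, max {max(extents)}")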

I believe it's possible but doubtful that the OSD code wouldn't scale at
this level (this does not feel like an abnormally high number of objects
to me). Ceph devs will know better.
ext4 could be a problem as it's not the most common choice for OSDs
(from what I read here XFS is usually preferred over it), and it forces
Ceph to use omap to store data which would otherwise live in extended
attributes (which probably isn't free performance-wise).
RAID5/6 on HW might have performance problems. The usual ones happen on
writes and OSD init is probably read-intensive (or maybe not, you should
check the kind of access happening during the OSD init to avoid any
surprise) but with HW cards it's difficult to know for sure the
performance limitations they introduce (the only sure way is testing the
actual access patterns).

So I would probably try to reproduce the problem replacing one OSDs
based on RAID6 arrays with as many OSDs as you have devices in the arrays.
Then if it solves the problem and you didn't already do it you might
want to explore Areca tuning, specifically with RAID6 if you must have it.


>
> And kindly take note that my test cluster has less than 120k objects and
> thus 15k files per OSD and I still was able to reproduce this behaviour (in
> spirit at least).

I assume the test cluster uses ext4 and RAID6 arrays too: it would be a
perfect testing environment for defragmentation/switch to XFS/switch to
single drive OSDs then.

>
>> The only time I saw OSDs take several minutes to reach the point where
>> they fully rejoin is with BTRFS with default options/config.
>>
> There isn't a pole long enough I would touch BTRFS with for production,
> especially in conjunction with Ceph.

That's a matter of experience and environment but I can understand: we
invested more than a week of testing/development to reach a point where
BTRFS was performing better than XFS in our use case. Not everyone can
dedicate as much time just to select a filesystem and support it. There
might be use cases where it's not even possible to use it (I'm not sure
how it would perform if you only did small objects storage for example).

BTRFS has been invaluable though : it detected and helped fix corruption
generated by faulty Raid controllers (by forcing Ceph to use other
replicas when repairing). I wouldn't let precious data live on anything
other than checksumming filesystems now (the probabilities of
undetectable disk corruption are too high for our use case now). We have
30 BTRFS OSDs in production (and many BTRFS filesystems on other
systems) and we've never had any problem with them. These filesystems
even survived several bad datacenter equipment failures (faulty backup
generator control system and UPS blowing up during periodic testing).
That said I'm subscribed to linux-btrfs, was one of the SATA controller
driver maintainers long ago so I know my way around kernel code, I hand
pick the kernel versions going to production and we have custom tools
and maintenance procedures for the BTRFS OSDs. So I've means and
experience which make this choice comfortable for me and my team: I
wouldn't blindly advise BTRFS to anyone else (not yet).

Anyway it's possible ext4 is a problem but it seems to me less likely
than the HW RAID6. In my experience RAID controllers with cache aren't
really worth it with Ceph. Most of the time they perform well because of
BBWC/FBWC but when you get into a situation where you must
repair/backfill because you lost an OSD or added a 

Re: [ceph-users] Increasing time to save RGW objects

2016-02-09 Thread Lionel Bouton
Le 09/02/2016 20:07, Kris Jurka a écrit :
>
>
> On 2/9/2016 10:11 AM, Lionel Bouton wrote:
>
>> Actually if I understand correctly how PG splitting works the next spike
>> should be N times smaller and spread over N times the period (where
>> N is the number of subdirectories created during each split which
>> seems to be 15 according to OSDs' directory layout).
>>
>
> I would expect that splitting one directory would take the same amount
> of time as it did this time, it's just that now there will be N times
> as many directories to split because of the previous splits.  So the
> duration of the spike would be quite a bit longer.

Oops I missed this bit, I believe you are right: the spike duration
should be ~16x longer but the slowdown roughly the same over this new
period :-(

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Increasing time to save RGW objects

2016-02-09 Thread Lionel Bouton
Le 09/02/2016 20:18, Lionel Bouton a écrit :
> Le 09/02/2016 20:07, Kris Jurka a écrit :
>>
>> On 2/9/2016 10:11 AM, Lionel Bouton wrote:
>>
>>> Actually if I understand correctly how PG splitting works the next spike
>>> should be N times smaller and spread over N times the period (where
>>> N is the number of subdirectories created during each split which
>>> seems to be 15 according to OSDs' directory layout).
>>>
>> I would expect that splitting one directory would take the same amount
>> of time as it did this time, it's just that now there will be N times
>> as many directories to split because of the previous splits.  So the
>> duration of the spike would be quite a bit longer.
> Oops I missed this bit, I believe you are right: the spike duration
> should be ~16x longer but the slowdown roughly the same over this new
> period :-(

As I don't see any way around this, I'm thinking out of the box.

As splitting is costly for you, you might want to try to avoid it (or at
least limit it to the first occurrence if your use case can handle such
a slowdown).
You can test increasing the PG number of your pool before reaching the
point where the split starts.
This would generate data movement, but it might (or might not) slow down
your access less than what you see when splitting occurs (I'm not sure
about the exact constraints, but basically Ceph forces you to increase
the number of placement groups by small increments, which should limit the
performance impact).

Another way to do this with no movement and slowdown is to add pools
(which basically create new placement groups without rebalancing data)
but this means modifying your application so that new objects are stored
on the new pool (which may or may not be possible depending on your
actual access patterns).

There are limits to these 2 suggestions: increasing the number of
placement groups has costs, so you might want to check with the devs how
high you can go and whether it fits your constraints.
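For what it's worth, here is the usual rule of thumb I would start from
when picking pg_num (roughly 100 PGs per OSD divided by the replica count,
rounded up to a power of two); treat the target of ~100 as an assumption
to validate, not a hard rule:

# Rule-of-thumb pg_num estimate: ~100 PGs per OSD divided by the pool size,
# rounded up to the next power of two. The 100 target is an assumption to
# validate for your own cluster, not a hard limit.
def suggested_pg_num(num_osds, pool_size, pgs_per_osd=100):
    raw = num_osds * pgs_per_osd / pool_size
    pg_num = 1
    while pg_num < raw:
        pg_num *= 2
    return pg_num

print(suggested_pg_num(num_osds=40, pool_size=3))  # 40 OSDs, 3 replicas -> 2048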

Lionel.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Increasing time to save RGW objects

2016-02-09 Thread Lionel Bouton
Hi,

Le 09/02/2016 17:07, Kris Jurka a écrit :
>
>
> On 2/8/2016 9:16 AM, Gregory Farnum wrote:
>> On Mon, Feb 8, 2016 at 8:49 AM, Kris Jurka  wrote:
>>>
>>> I've been testing the performance of ceph by storing objects through
>>> RGW.
>>> This is on Debian with Hammer using 40 magnetic OSDs, 5 mons, and 4 RGW
>>> instances.  Initially the storage time was holding reasonably
>>> steady, but it
>>> has started to rise recently as shown in the attached chart.
>>>
>>
>> It's probably a combination of your bucket indices getting larger and
>> your PGs getting split into subfolders on the OSDs. If you keep
>> running tests and things get slower it's the first; if they speed
>> partway back up again it's the latter.
>
> Indeed, after running for another day, performance has leveled back
> out, as attached.  So tuning something like filestore_split_multiple
> would have moved around the time of this performance spike, but is
> there a way to eliminate it?  Some way of saying, start with N levels
> of directory structure because I'm going to have a ton of objects?  If
> this test continues, it's just going to hit another, worse spike later
> when it needs to split again.

Actually if I understand correctly how PG splitting works the next spike
should be N times smaller and spread over N times the period (where N
is the number of subdirectories created during each split, which
seems to be 15 according to OSDs' directory layout).

That said, the problem that could happen is that by the time you reach
the next split you might have reached N times the object creation
speed you have currently and get the very same spike.
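To make the timing of the next spikes a bit more concrete, here is a rough
sketch of the split arithmetic as I understand it (the formula and the
default values are assumptions about the filestore behaviour, please
double-check them against your configuration):

# Rough sketch of when filestore directory splits occur, assuming a PG
# subdirectory splits into 16 children once it holds more than
# filestore_split_multiple * abs(filestore_merge_threshold) * 16 files.
# Formula and default values below are assumptions to double-check.
filestore_split_multiple = 2
filestore_merge_threshold = 10
fanout = 16  # subdirectories created per split

files_per_dir_limit = filestore_split_multiple * abs(filestore_merge_threshold) * fanout  # 320

for level in range(1, 4):
    objects_per_pg = files_per_dir_limit * fanout ** (level - 1)
    print(f"split level {level}: reached around {objects_per_pg:,} objects per PG")
# -> 320, 5,120, 81,920: each split level is ~16x further away in object count.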

Best regards,

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Increasing time to save RGW objects

2016-02-09 Thread Lionel Bouton
Le 09/02/2016 19:11, Lionel Bouton a écrit :
> Actually if I understand correctly how PG splitting works the next spike
> should be N times smaller and spread over N times the period (where
> N is the number of subdirectories created during each split which
> seems to be 15

typo : 16
>  according to OSDs' directory layout).
>

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] K is for Kraken

2016-02-08 Thread Lionel Bouton
Le 08/02/2016 20:09, Robert LeBlanc a écrit :
> Too bad K isn't an LTS. It was be fun to release the Kraken many times.

Kraken is an awesome release name !
How I will miss being able to say/write to our clients that we just
released the Kraken on their infra :-/

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD Journal

2016-01-29 Thread Lionel Bouton
Le 29/01/2016 01:12, Jan Schermer a écrit :
> [...]
>> Second I'm not familiar with Ceph internals but OSDs must make sure that 
>> their PGs are synced so I was under the impression that the OSD content for 
>> a PG on the filesystem should always be guaranteed to be on all the other 
>> active OSDs *or* their journals (so you wouldn't apply journal content 
>> unless the other journals have already committed the same content). If you 
>> remove the journals there's no intermediate on-disk "buffer" that can be 
>> used to guarantee such a thing: one OSD will always have data that won't be 
>> guaranteed to be on disk on the others. As I understand this you could say 
>> that this is some form of 2-phase commit.
> You can simply commit the data (to the filestore), and it would be in fact 
> faster.
> Client gets the write acknowledged when all the OSDs have the data - that 
> doesn't change in this scenario. If one OSD gets ahead of the others and 
> commits something the other OSDs do not before the whole cluster goes down 
> then it doesn't hurt anything - you didn't acknowledge so the client has to 
> replay if it cares, _NOT_ the OSDs.
> The problem still exists, just gets shifted elsewhere. But the client (guest 
> filesystem) already handles this.

Hum, if one OSD gets ahead of the others there must be a way for the
OSDs to resynchronize themselves. I assume that on resync for each PG
OSDs probably compare something very much like a tx_id.

What I was expecting is that in the case of a small backlog the journal
- containing the last modifications by design - was used during recovery
to fetch all the recent transaction contents. It seemed efficient to me:
especially on rotating media fetching data from the journal would avoid
long seeks. The first alternative I can think of is maintaining a
separate log of the recently modified objects in the filestore without
the actual content of the modification. Then you can fetch the objects
from the filestore as needed but this probably seeks all over the place.
In the case of multiple PGs lagging behind on other OSDs, reading the
local journal would be even better as you have even more chances of
ordering reads to avoid seeks on the journal and much more seeks would
happen on the filestore.

But if I understand correctly, there is indeed a log of the recent
modifications in the filestore which is used when a PG is recovering
because another OSD is lagging behind (not when Ceph reports a full
backfill where I suppose all objects' versions of a PG are compared).

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD Journal

2016-01-29 Thread Lionel Bouton
Le 29/01/2016 16:25, Jan Schermer a écrit :
>
> [...]
>
>
> But if I understand correctly, there is indeed a log of the recent
> modifications in the filestore which is used when a PG is recovering
> because another OSD is lagging behind (not when Ceph reports a full
> backfill where I suppose all objects' versions of a PG are compared).
>
> That list of transactions becomes useful only when OSD crashes and comes back 
> up - it needs to catch up somehow and this is one of the options. But do you 
> really need the "content" of those transactions which is what the journal 
> does?
> If you have no such list then you need to either rely on things like mtime of 
> the object, or simply compare the hash of the objects (scrub).

This didn't seem robust enough to me but I think I had forgotten about
the monitors' role in maintaining coherency.

Let's say you use a pool with size=3 and min_size=2. You begin with a PG
with 3 active OSDs then you lose a first OSD for this PG and only two
active OSDs remain: the clients still happily read and write to this PG
and the downed OSD is now lagging behind.
Then one of the remaining active OSDs disappears. Client I/O blocks
because of min_size. Now the first downed (lagging) OSD comes back. At
this point Ceph has everything it needs to recover (enough OSDs to reach
min_size and all the data reported committed to disk to the client in
the surviving OSD) but must decide which OSD actually has this valid
data between the two.

At this point I was under the impression that OSDs could determine this
for themselves without any outside intervention. But reflecting on this
situation I don't see how they could handle all cases by themselves (for
example an active primary should be able to determine by itself that it
must send the last modifications to any other OSD but it wouldn't work
if all OSD go down for a PG : when coming back all could be the last
primary from their point of view with no robust way to decide which is
right without the monitors being involved).
The monitors maintain the status of each OSDs for each PG if I'm not
mistaken so I suppose the monitors knowledge of the situation will be
used to determine which OSDs have the good data (the last min_size OSDs
up for each PG) and trigger the others to resync before the PG reaches
active+clean.

That said this doesn't address the other point: when the resync happens,
using the journal content of the primary could theoretically be faster
if the filestores are on spinning disks. I realize that recent writes in
the filestore might be in the kernel's cache (which would avoid the
costly seeks) and that using the journal instead would probably mean
that the OSDs maintain an in-memory index of all the IO transactions
still stored in the journal to be efficient so it isn't such a clear win.

Thanks a lot for the explanations.

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD Journal

2016-01-28 Thread Lionel Bouton
Le 28/01/2016 22:32, Jan Schermer a écrit :
> P.S. I feel very strongly that this whole concept is broken
> fundamentaly. We already have a journal for the filesystem which is
> time proven, well behaved and above all fast. Instead there's this
> reinvented wheel which supposedly does it better in userspace while
> not really avoiding the filesystem journal either. It would maybe make
> sense if OSD was storing the data on a block device directly, avoiding
> the filesystem altogether. But it would still do the same bloody thing
> and (no disrespect) ext4 does this better than Ceph ever will.
>

Hum I've seen this discussed previously but I'm not sure the fs journal
could be used as a Ceph journal.

First BTRFS doesn't have a journal per se, so you would not be able to
use an external xfs or ext4 journal (with a data=journal setup) to
make write bursts/random writes fast. And I won't go back to XFS or test
ext4... I've detected too much silent corruption by hardware with BTRFS
to trust our data to any filesystem not using CRC on reads (and in our
particular case the compression and speed are additional bonuses).

Second I'm not familiar with Ceph internals but OSDs must make sure that
their PGs are synced so I was under the impression that the OSD content
for a PG on the filesystem should always be guaranteed to be on all the
other active OSDs *or* their journals (so you wouldn't apply journal
content unless the other journals have already committed the same
content). If you remove the journals there's no intermediate on-disk
"buffer" that can be used to guarantee such a thing: one OSD will always
have data that won't be guaranteed to be on disk on the others. As I
understand this you could say that this is some form of 2-phase commit.

I may be mistaken: there are structures in the filestore that *may* take
on this role but I'm not sure what their exact use is : the
_TEMP dirs, the omap and meta dirs. My guess is that they serve
other purposes: it would make sense to use the journals for this because
the data is already there and the commit/apply coherency barriers seem
both trivial and efficient to use.

That's not to say that the journals are the only way to maintain the
needed coherency, just that they might be used to do so because once
they are here, this is a trivial extension of their use.

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Repository with some internal utils

2016-01-19 Thread Lionel Bouton
Hi,

someone asked me if he could get access to the BTRFS defragmenter we
used for our Ceph OSDs. I took a few minutes to put together a small
github repository with :
- the defragmenter I've been asked about (tested on 7200 rpm drives and
designed to put low IO load on them),
- the scrub scheduler we use to avoid load spikes on Firefly,
- some basic documentation (this is still rough around the edges so you
better like to read Ruby code if you want to peak at most of the logic,
tune or hack these).

Here it is: https://github.com/jtek/ceph-utils

This is running in production for several months now and I didn't touch
the code or the numerous internal tunables these scripts have for
several weeks so it probably won't destroy your clusters. These scripts
come without warranties though.

Best regards,

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cache tier and rbd volumes/SSD primary, HDD replica crush rule!

2016-01-12 Thread Lionel Bouton
Le 12/01/2016 18:27, Mihai Gheorghe a écrit :
> One more question. Seeing that cache tier holds data on it untill it
> reaches % ratio, i suppose i must set replication to 2 or higher on
> the cache pool to not lose hot data not writen to the cold storage in
> case of a drive failure, right? 
>
> Also, will there be any perfomance penalty if i set the osd journal on
> the same SSD as the OSD. I now have one SSD specially for journaling
> the SSD OSDs. I know that in the case of mechanical drive this is a
> problem!

With traditional 7200rpm SATA HDD OSDs, one DC SSD for 4 to 6 OSDs is
usually advised because it will have both the bandwidth and the IOPS
needed to absorb the writes the HDDs themselves can handle. With SSD
based OSDs I would advise against separating journals from filestore
because :

- if you don't hit Ceph bottlenecks, it might be difficult to find a
combination of journal and filestore SSD models that lets one journal
SSD handle several filestores efficiently (in performance, cost and
endurance), so you could end up with one journal SSD per filestore SSD to
get the best behaviour. At that point you would simply be wasting space
and reliability by underusing the journal SSDs: the theoretical IOPS
limit would be the same as using all SSDs with both a filestore and its
journal on each, which provides nearly twice the space and, on hardware
failure, doesn't render one SSD useless in addition to the one failing.
- anyway, Ceph itself is currently the bottleneck most of the time with
SSD-based pools, so you probably won't be able to saturate your
filestore SSDs; dedicating SSDs to journals may not help individual OSD
performance but it does reduce the total number of OSDs, and you probably
want as many OSDs as possible to get the highest IOPS,
- performance would be less predictable: depending on the workload you
could alternately hit bottlenecks on the journal SSDs or on the filestore
SSDs (a rough back-of-the-envelope comparison follows below).
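To illustrate the first point, a back-of-the-envelope sketch (all the
device numbers below are made-up assumptions, replace them with
measurements from your own SSDs):

# Back-of-the-envelope comparison of two all-SSD layouts with made-up numbers:
#  A) every SSD hosts a filestore plus its own journal (each client write hits
#     the same SSD twice: once for the journal, once for the filestore),
#  B) the same SSDs split 1:1 into dedicated journal and filestore devices.
ssd_write_mbps = 450   # assumed sustained write throughput per SSD
n_ssds = 6

colocated_mbps = n_ssds * ssd_write_mbps / 2        # 1350 MB/s, 6 filestores
dedicated_mbps = (n_ssds // 2) * ssd_write_mbps     # 1350 MB/s, 3 filestores

print(f"A) co-located journals: ~{colocated_mbps:.0f} MB/s of client writes, {n_ssds} filestores")
print(f"B) dedicated journals : ~{dedicated_mbps:.0f} MB/s of client writes, {n_ssds // 2} filestores")
# Same theoretical write ceiling, but B) halves the number of OSDs and the usable space.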

Best regards,

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Intel S3710 400GB and Samsung PM863 480GB fio results

2015-12-23 Thread Lionel Bouton
Le 23/12/2015 16:18, Mart van Santen a écrit :
> Hi all,
>
>
> On 12/22/2015 01:55 PM, Wido den Hollander wrote:
>> On 22-12-15 13:43, Andrei Mikhailovsky wrote:
>>> Hello guys,
>>>
>>> Was wondering if anyone has done testing on Samsung PM863 120 GB version to 
>>> see how it performs? IMHO the 480GB version seems like a waste for the 
>>> journal as you only need to have a small disk size to fit 3-4 osd journals. 
>>> Unless you get a far greater durability.
>>>
>> In that case I would look at the SM836 from Samsung. They are sold as
>> write-intensive SSDs.
>>
>> Wido
>>
>
> Today I received a small batch of SM863 (1.9TBs) disks. So maybe these
> testresults are helpfull for making a decision
> This is on an IBM X3550M4 with a MegaRaid SAS card (so not in jbod
> mode). Unfortunally I have no suitable JBOD card available at my test
> server so I'm stuck with the "RAID" layer in the HBA
>
>
>
> disabled drive cache, disabled controller cache
> ---
>
>
> 1 job
> ---
> Run status group 0 (all jobs):
>   WRITE: io=906536KB, aggrb=15108KB/s, minb=15108KB/s, maxb=15108KB/s,
> mint=60001msec, maxt=60001msec
>
> Disk stats (read/write):
>   sdd: ios=91/452978, merge=0/0, ticks=12/39032, in_queue=39016, util=65.04%

Either the MegaRaid SAS card is the bottleneck or SM863 1.9TB are 8x
slower than PM863 480GB on this particular test which is a bit
surprising: it would make the SM863 one of the slowest (or even the
slowest) DC SSD usable as Ceph journals.
Do you have any other SSD (if possible one of the models or one similar
to the ones listed on
http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
which give more than 15MB/s with one job) connected to the same card
model you could test for comparison?

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Intel S3710 400GB and Samsung PM863 480GB fio results

2015-12-23 Thread Lionel Bouton
Le 23/12/2015 18:37, Mart van Santen a écrit :
> So, maybe you are right and is the HBA the bottleneck (LSI Logic /
> Symbios Logic MegaRAID SAS 2108). Under all cirumstances, I do not get
> close to the numbers of the PM863 quoted by Sebastien. But his site
> does not state what kind of HBA he is using..

In fact I was the one doing those tests and I added the relevant
information in the comments (Disqus user Gyver): the PM863 tested is
connected to the Intel C612 chipset SATA ports (configured as AHCI) of a
dual Xeon E5v3 board, so this is a purely SATA configuration.

Best regards,

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Intel S3710 400GB and Samsung PM863 480GB fio results

2015-12-22 Thread Lionel Bouton
Le 22/12/2015 13:43, Andrei Mikhailovsky a écrit :
> Hello guys,
>
> Was wondering if anyone has done testing on Samsung PM863 120 GB version to 
> see how it performs? IMHO the 480GB version seems like a waste for the 
> journal as you only need to have a small disk size to fit 3-4 osd journals. 
> Unless you get a far greater durability.

The problem is endurance. If we use the 480GB model for 3 OSDs each on the
cluster we might build, we expect about 3 years (with some margin for error,
but not including any write amplification at the SSD level) before the SSDs
fail.
In our context a 120GB model might not even last a year (endurance is
1/4th of the 480GB model). This is why SM863 models will probably be
more suitable if you have access to them: you can use smaller ones which
cost less and get more endurance (you'll have to check the performance
though, usually smaller models have lower IOPS and bandwidth).

> I am planning to replace my current journal ssds over the next month or so 
> and would like to find out if there is an a good alternative to the Intel's 
> 3700/3500 series. 

3700 are a safe bet (the 100GB model is rated for ~1.8PBW). 3500 models
probably don't have enough endurance for many Ceph clusters to be cost
effective. The 120GB model is only rated for 70TBW and you have to
consider both client writes and rebalance events.
I'm uneasy with SSDs expected to fail within the life of the system they
are in: you can have a cascade effect where an SSD failure brings down
several OSDs triggering a rebalance which might make SSDs installed at
the same time fail too. In this case in the best scenario you will reach
your min_size (>=2) and block any writes which would prevent more SSD
failures until you move journals to fresh SSDs. If min_size = 1 you
might actually lose data.

If you expect to replace your current journal SSDs if I were you I would
make a staggered deployment over several months/a year to avoid them
failing at the same time in case of an unforeseen problem. In addition
this would allow to evaluate the performance and behavior of a new SSD
model with your hardware (there have been reports of performance
problems with some combinations of RAID controllers and SSD
models/firmware versions) without impacting your cluster's overall
performance too much.

When using SSDs for journals you have to monitor both :
* the SSD wear leveling or something equivalent (SMART data may not be
available if you use a RAID controller but usually you can get the total
amount data written) of each SSD,
* the client writes on the whole cluster.
And check periodically what the expected lifespan left there is for each
of your SSD based on their current state, average write speed, estimated
write amplification (both due to pool's size parameter and the SSD
model's inherent write amplification) and the amount of data moved by
rebalance events you expect to happen.
Ideally you should make this computation before choosing the SSD models,
but several variables are not always easy to predict and probably will
change during the life of your cluster.
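The periodic check itself can stay very simple, something like the sketch
below (where you read the total host writes from depends on the SSD and
controller, and the numbers are placeholders):

# Minimal lifespan estimate for a journal SSD from its write counters.
# The input numbers are placeholders: feed in the total host writes reported
# by SMART (or your RAID tool), the drive age and its rated endurance.
def remaining_days(total_written_tb, age_days, rated_tbw):
    daily_tb = total_written_tb / age_days          # average write rate so far
    left_tb = rated_tbw - total_written_tb          # endurance budget left
    return left_tb / daily_tb if daily_tb > 0 else float("inf")

# Example: 60 TB written in 300 days on a drive rated for 140 TBW.
print(f"~{remaining_days(total_written_tb=60, age_days=300, rated_tbw=140):.0f} days left")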

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD only pool without journal

2015-12-17 Thread Lionel Bouton
Hi,

Le 17/12/2015 16:47, Misa a écrit :
> Hello everyone,
>
> does it make sense to create SSD only pool from OSDs without journal?

No, because AFAIK you can't have OSDs without journals yet.
IIRC there is work done for alternate stores where you wouldn't need
journals anymore but it's not yet production ready.

Best regards,

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Global, Synchronous Blocked Requests

2015-11-28 Thread Lionel Bouton
Hi,

Le 28/11/2015 04:24, Brian Felton a écrit :
> Greetings Ceph Community,
>
> We are running a Hammer cluster (0.94.3-1) in production that recently
> experienced asymptotic performance degradation.  We've been migrating
> data from an older non-Ceph cluster at a fairly steady pace for the
> past eight weeks (about 5TB a week).  Overnight, the ingress rate
> dropped by 95%.  Upon investigation, we found we were receiving
> hundreds of thousands of 'slow request' warnings. 
> [...] Each storage server contains 72 6TB SATA drives for Ceph (648
> OSDs, ~3.5PB in total).  Each disk is set up as its own ZFS zpool. 
> Each OSD has a 10GB journal, located within the disk's zpool.

This behavior is similar to what you get with a default BTRFS setup :
performance is good initially and gets worse after some time. As BTRFS
and ZFS are both CoW filesystems, the causes might be the same. In our
case, we had two problems with BTRFS :
- snapshot removal is costly, we use filestore btrfs snap = false,
- fragmentation gets really bad over time even with autodefrag:
  . we created the journals NoCoW to avoid them becoming fragmented (see
the sketch below) and later moved them to SSD,
  . we developed our own defragmentation scheduler.
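For reference, creating a NoCoW journal file looks roughly like this (a
sketch: chattr +C only takes effect on an empty file, hence the order of
operations; the path and size are examples):

#!/usr/bin/env python3
# Minimal sketch: create a BTRFS journal file with the NoCoW attribute set.
# chattr +C must be applied while the file is still empty, so the attribute is
# set right after creation and before preallocation. Path and size are examples.
import os
import subprocess

path = "/var/lib/ceph/osd/ceph-0/journal"   # example path
size = 10 * 1024 ** 3                        # 10 GB journal, example size

fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
try:
    subprocess.check_call(["chattr", "+C", path])  # disable copy-on-write
    os.posix_fallocate(fd, 0, size)                # preallocate contiguously
finally:
    os.close(fd)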

Fragmentation was ultimately the biggest cause of performance problem
for us (snapshots only caused manageable spikes of writes).

If you can, I'd advise to do what we initially did : use a mix of
XFS-based OSD (probably the most used case with Ceph) and ZFS-based OSD.
You'll be able to find out if ZFS is slower than XFS in your case by
checking which OSDs are involved in slow requests (you should probably
monitor your commit and apply latencies too).

Best regards,

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Scrubbing question

2015-11-26 Thread Lionel Bouton
Le 26/11/2015 15:53, Tomasz Kuzemko a écrit :
> ECC will not be able to recover the data, but it will always be able to
> detect that data is corrupted.

No. That's a theoretical impossibility, as the detection is done by some
kind of hash over the memory content, which brings the possibility of
hash collisions. For cryptographic hashes collisions are by definition
nearly impossible to trigger, but obviously memory controllers can't use
cryptographic hashes to protect the memory content: the verification
would be prohibitive (both in hardware cost and in latency). Most ECC
implementations use Hamming codes, which correct all single-bit errors
and detect all 2-bit errors but can have false negatives for errors of 3
or more bits. There's even speculation that modern hardware makes this
more likely: individual chips no longer use 1-bit buses, so a defective
chip can corrupt several bits of a returned word instead of just one.

>  AFAIK under Linux this results in
> immediate halt of system, so it would not be able to report bad checksum
> data during deep-scrub.

It can, it's just less likely.

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CEPH over SW-RAID

2015-11-23 Thread Lionel Bouton
Le 23/11/2015 19:58, Jose Tavares a écrit :
>
>
> On Mon, Nov 23, 2015 at 4:15 PM, Lionel Bouton
> <lionel-subscript...@bouton.name> wrote:
>
> Hi,
>
> Le 23/11/2015 18:37, Jose Tavares a écrit :
> > Yes, but with SW-RAID, when we have a block that was read and does not 
> match its checksum, the device falls out of the array
>
> I don't think so. Under normal circumstances a device only falls
> out of a md array if it doesn't answer IO queries after a timeout
> (md arrays only read from the smallest subset of devices needed to
> get the data, they don't verify redundancy on the fly for
> performance reasons). This may not be the case when you explicitly
> ask an array to perform a check though (I don't have any
> first-hand check failure coming to mind).
>
> >, and the data is read again from the other devices in the array. The 
> problem is that in SW-RAID1 we don't have the badblocks isolated. The disks 
> can be sincronized again as the write operation is not tested. The problem 
> (device falling out of the array) will happen again if we try to read any 
> other data written over the bad block. With consumer-level SATA drives 
> badblocks are handled internally
> nowadays : the drives remap bad sectors to a reserve by trying to
> copy their content there (this might fail and md might not have
> the opportunity to correct the error: it doesn't use checksums so
> it can't tell which drive has unaltered data, only which one
> doesn't answer IO queries).
>
>
>
> hmm, suppose the drive is unable to remap bad blocks internally, when
> you write data to the drive, it will also write in hardware the data
> checksum.

One weak data checksum which is not available to the kernel, yes.
Filesystems and applications on top of them may use stronger checksums
and handle read problems that the drives can't detect themselves.

> When you read the data, it will compare to this checksum that was
> written previously. If it fails, the drive will reset and the SW-RAID
> will drop the drive. This is how sata drives work..

If it fails, AFAIK from past experience it doesn't reset by itself: the
kernel driver in charge of the device will receive an IO error and will
retry the IO several times. One of those later attempts might succeed
(errors aren't always repeatable) and eventually, after a timeout, it will
try to reset the interface with the drive and the drive itself (the
kernel doesn't know where the problem is, only that it didn't get the
result it was expecting).
While this happens I believe the filesystem/md/lvm/... stack can receive
an IO error (the timeout at their level might not be the same as the
timeout at the device level). So some errors can be masked from md and
some can percolate through. In the latter case, yes, the md array will
drop the device.

>  
>
>
>
> > My new question regarding Ceph is if it isolates this bad sectors where 
> it found bad data when scrubbing? or there will be always a replica of 
> something over a known bad block..?
> Ceph OSDs don't know about bad sectors, they delegate IO to the
> filesystems beneath them. Some filesystems can recover from
> corrupted data from one drive (ZFS or BTRFS when using redundancy
> at their level) and the same filesystems will refuse to give Ceph
> OSD data when they detect corruption on non redundant filesystems,
> Ceph detects this (usually during scrubs) and then manual Ceph
> repair will rewrite data over the corrupted data (at this time if
> the underlying drive detected a bad sector it will not reuse it).
>
>
> Just forget about the hardware bad block remapped list. It got filled
> as soon as we start to use the drive .. :)

Then you can move this drive to the trash pile/ask for a replacement. It
is basically unusable.

>  
>
>
> > > I also saw that Ceph use same metrics when capturing data from
> disks. When the disk is resetting or have problems, its metrics
> are going to be bad and the cluster will rank bad this osd. But I
> didn't saw any way of sending alerts or anything like that.
> SW-RAID has its mdadm monitor that alerts when things go bad.
> Should I have to be looking for ceph logs all the time to see when
> things go bad?
> I'm not aware of any osd "ranking".
>
> Lionel
>
>
>
> Does "weight" means the same?

There are 2 weights I'm aware of, the crush weight for an OSD and the
temporary OSD weight. The first is the basic weight used by crush to
choose how to split your data (an OSD with a weight of 2 is expected to
get roughly twice the amount of data of an OSD with a weight of 1 on a
normal 

Re: [ceph-users] CEPH over SW-RAID

2015-11-23 Thread Lionel Bouton
Hi,

Le 23/11/2015 18:37, Jose Tavares a écrit :
> Yes, but with SW-RAID, when we have a block that was read and does not match 
> its checksum, the device falls out of the array

I don't think so. Under normal circumstances a device only falls out of
a md array if it doesn't answer IO queries after a timeout (md arrays
only read from the smallest subset of devices needed to get the data,
they don't verify redundancy on the fly for performance reasons). This
may not be the case when you explicitly ask an array to perform a check
though (I don't have any first-hand check failure coming to mind).

>, and the data is read again from the other devices in the array. The problem
> is that in SW-RAID1 we don't have the badblocks isolated. The disks can be
> sincronized again as the write operation is not tested. The problem (device
> falling out of the array) will happen again if we try to read any other data
> written over the bad block.

With consumer-level SATA drives badblocks are handled internally
nowadays: the drives remap bad sectors to a reserve by trying to copy
their content there (this might fail and md might not have the
opportunity to correct the error: it doesn't use checksums so it can't
tell which drive has unaltered data, only which one doesn't answer IO
queries).

> My new question regarding Ceph is if it isolates this bad sectors where it 
> found bad data when scrubbing? or there will be always a replica of something 
> over a known bad block..?
Ceph OSDs don't know about bad sectors, they delegate IO to the
filesystems beneath them. Some filesystems can recover from corrupted
data from one drive (ZFS or BTRFS when using redundancy at their level)
and the same filesystems will refuse to give Ceph OSD data when they
detect corruption on non redundant filesystems, Ceph detects this
(usually during scrubs) and then manual Ceph repair will rewrite data
over the corrupted data (at this time if the underlying drive detected a
bad sector it will not reuse it).

> > I also saw that Ceph use same metrics when capturing data from disks.
> > When the disk is resetting or have problems, its metrics are going to be
> > bad and the cluster will rank bad this osd. But I didn't saw any way of
> > sending alerts or anything like that. SW-RAID has its mdadm monitor that
> > alerts when things go bad. Should I have to be looking for ceph logs all
> > the time to see when things go bad?
I'm not aware of any osd "ranking".

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CEPH over SW-RAID

2015-11-23 Thread Lionel Bouton
Le 23/11/2015 21:01, Jose Tavares a écrit :
>
>
>
> > My new question regarding Ceph is if it isolates this bad sectors where 
> it found bad data when scrubbing? or there will be always a replica of 
> something over a known bad block..?
> Ceph OSDs don't know about bad sectors, they delegate IO to the
> filesystems beneath them. Some filesystems can recover from
> corrupted data from one drive (ZFS or BTRFS when using redundancy
> at their level) and the same filesystems will refuse to give Ceph
> OSD data when they detect corruption on non redundant filesystems,
> Ceph detects this (usually during scrubs) and then manual Ceph
> repair will rewrite data over the corrupted data (at this time if
> the underlying drive detected a bad sector it will not reuse it).
>
>
> Just forget about the hardware bad block remapped list. It got filled
> as soon as we start to use the drive .. :)
>
>
> Then you can move this drive to the trash pile/ask for a
> replacement. It is basically unusable.
>
>
> Why?
> 1 (or more) out of 8 drives I see have the remap list full ...
> If you isolate the rest using software you can continue to use the
> drive .. There are no performance issues, etc ..
>

Ceph currently uses filesystems to store its data. As there is no
supported filesystem/software layer handling badblocks dynamically, you
*will* have some OSD filesystems being remounted read-only and OSD
failures as soon as you hit one sector misbehaving (if they already
emptied the reserve you are almost guaranteed to get new defective
sectors later, see below). If your bad drives are distributed over your
whole cluster, you will have far more chances of simultaneous failures
and degraded or inactive pgs (which will freeze any IO to them). You
will then have to manually put these OSDs back online to recover
(unfreeze IO). If you don't succeed because the drives failed to the
point that you can't recover the OSD content, you will simply lose data.

From what I can read here, the main filesystems for Ceph are XFS, Btrfs
and Ext4 with some people using ZFS. Of those 4, only ext4 has support
for manually setting badblocks on an umounted filesystem. If you don't
have the precise offset for each of them, you'll have to scan the whole
device (e2fsck -c) before e2fsck can *try* to put your filesystem in
shape after any bad block is detected. You will have to be very careful
to remove any file using a bad block to avoid corrupting data before
restarting the OSD (hopefully e2fsck should move them for you to
lost+found). You might not be able to restart the OSD depending on the
actual files missing.

Finally at least XFS and Btrfs don't have any support for bad blocks
AFAIK. So you simply can't use your drives with these 2 filesystems
without the filesystems failing and fsck not working. MD raid won't help
you either as it has zero support for badblocks.

The fact that badblocks support is almost non-existent is simple to
understand from past history. Only old filesystems that were used when
drives didn't have internal reserves to handle badblocks transparently
and bad blocks were a normal occurence still have support for keeping
tabs on bad sectors (ext4 got it from ext2, vfat/fat32 has it too, ...).
Today a disk drive which starts to report bad sectors on reads has
emptied its reserve so it has a large history of bad sectors already. It
isn't failing one sector, it's in the process of failing thousands of
them, so there's no reason to expect it to behave correctly anymore :
all the application layers above (md, lvm, filesystems, ...) just don't
try to fight a battle that can't be won and would add complexity and
diminish performance in the case of a normal working drive.

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CEPH over SW-RAID

2015-11-23 Thread Lionel Bouton
Le 23/11/2015 18:17, Jan Schermer a écrit :
> SW-RAID doesn't help with bit-rot if that's what you're afraid of.
> If you are afraid bit-rot you need to use a fully checksumming filesystem 
> like ZFS.
> Ceph doesn't help there either when using replicas - not sure how strong 
> error detection+correction is in EC-type pools.
>
> The only thing I can suggest (apart from using ZFS) is getting drives that 
> have a higher BER rating so bit-rot isn't as likely to occur.

We use BTRFS for this reason (we had to circumvent some performance
problems with it to do so).

Drives aren't the only possible source of bit-rot unfortunately. We had
HP RAID controllers in Gen8 servers slowly but consistently corrupting
data after a datacenter-wide power failure which damaged some equipment.
For reference the power failure itself was correctly handled but when
the electrical facility resolved the problem it sent a power surge which
destroyed parts of the power management system and apparently some Raid
controllers didn't like the surge either: BTRFS checksum errors started
to pop with an abnormal rate (we have enough data writes to
statistically get one or two of them each year with consumer-level SATA
drives but it rose to one per day).

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CEPH over SW-RAID

2015-11-23 Thread Lionel Bouton
Le 23/11/2015 21:58, Jose Tavares a écrit :
>
> AFAIK, people are complaining about lots os bad blocks in the new big
> disks. The hardware list seems to be small and unable to replace
> theses blocks.

Note that if by big disks you mean SMR-based disks, they can exhibit
what looks like bad blocks but are actually consequences of a firmware
and/or kernel driver bug (I understand recent Linux kernels are
SMR-aware in some way as the bug is not reproducible on all kernel
versions, typically not on older versions). These drives should not be
used like "normal" drives anyway (by design they only work well when no
rewriting of existing sectors happen), so most Ceph clusters should
avoid them.

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] O_DIRECT on deep-scrub read

2015-10-08 Thread Lionel Bouton
Le 07/10/2015 13:44, Paweł Sadowski a écrit :
> Hi,
>
> Can anyone tell if deep scrub is done using O_DIRECT flag or not? I'm
> not able to verify that in source code.
>
> If not would it be possible to add such feature (maybe config option) to
> help keeping Linux page cache in better shape?

Note : this would probably be even more useful with backfills when
inserting/replacing OSDs because they focus most of the IOs on these
OSDs (I recently posted that we got far better performance when
rebuilding OSDs if we selectively disabled the RAID card cache for them
for example).

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Simultaneous CEPH OSD crashes

2015-10-03 Thread Lionel Bouton
Hi,

Le 29/09/2015 19:06, Samuel Just a écrit :
> It's an EIO.  The osd got an EIO from the underlying fs.  That's what
> causes those asserts.  You probably want to redirect to the relevant
> fs maling list.

Thanks.

I didn't get any answer on this from BTRFS developers yet. The problem
seems hard to reproduce though (we still have the same configuration in
production without any new crash and we only had a total of 3 OSD crashes).

I'll just say for reference that BTRFS with kernel 3.18.9 looks
suspicious to me (from the events that happened to us on a mixed
3.18.9/4.0.5 cluster, statistically there's about an 80% chance that
there's a BTRFS bug in 3.18.9 solved in 4.0.5).

Best regards,

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Predict performance

2015-10-02 Thread Lionel Bouton
Hi,

Le 02/10/2015 18:15, Christian Balzer a écrit :
> Hello,
> On Fri, 2 Oct 2015 15:31:11 +0200 Javier C.A. wrote:
>
> Firstly, this has been discussed countless times here.
> For one of the latest recurrences, check the archive for:
>
> "calculating maximum number of disk and node failure that can
> be handled by cluster with out data loss"
>
>
>> A classic raid5 system takes a looong time to rebullid  the raid, so i
>> would say NO, but how long does it take for ceph to rebullid the
>> placement group?
>>
> A placement group resides on an OSD. 
> Until the LAST PG on a failed OSD has been recovered, you are prone to
> data loss.
> And a single lost PG might affect ALL your images...

True.

>
> So while your OSDs are mostly empty, recovery will be faster than a RAID5.
>
> Once it gets fuller AND you realize that rebuilding OSDs SEVERELY impacts
> your cluster performance (at least in your smallish example) you are
> likely to tune down the recovery and backfill parameters to a level where
> it takes LONGER than a typical RAID controller recovery.

No, it doesn't. At least it shouldn't: in a RAID5 array, you need to
read all blocks from all the other devices to rebuild the data on your
replacement device.
To rebuild an OSD, you only have to read the amount of data you will
store on the replacement device, which is roughly (N-1) times fewer reads
(for an N-device RAID5 array) and as many writes as what would happen
with RAID5. This is more easily compared to what would happen with a
RAID10 array.

But if you care about redundancy more than minimizing the total amount
of IO pressure linked to balancing the cluster you won't rebuild the OSD
but let the failed one go out and data be reorganized in addition to the
missing replica reconstruction. In this case you will distribute *both*
the reads and writes on all devices.
PGs will be moved around which will add some read/write load on the
cluster (this is why this will put more IO pressure overall). One of the
jobs of the CRUSH algorithm is to minimize the amount of such movements.
That said even if there are additional movements they don't help with
redundancy, the only process important for redundancy is the replica
being rebuilt for each pg in degraded state which should be far faster
than what RAID5 allows (if Ceph prioritizes backfills and recoveries
moving pgs from degraded to clean, which I suppose it does but can't
find a reference for, then replace "should be far faster" by "is far
faster").

Best regards,

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Issue with journal on another drive

2015-09-29 Thread Lionel Bouton
Hi,

Le 29/09/2015 13:32, Jiri Kanicky a écrit :
> Hi Lionel.
>
> Thank you for your reply. In this case I am considering to create
> separate partitions for each disk on the SSD drive. Would be good to
> know what is the performance difference, because creating partitions
> is kind of waste of space.

The difference is hard to guess: filesystems need more CPU power than
raw block devices for example, so if you don't have much CPU power this
can make a significant difference. Filesystems might put more load on
your storage too (for example ext3/4 with data=journal will at least
double the disk writes). So there's a lot to consider and nothing will
be faster for journals than a raw partition. LVM logical volumes come a
close second behind because usually (if you simply use LVM to create
your logical volumes and don't try to use anything else like snapshots)
they don't change access patterns and almost don't need any CPU power.

>
> One more question, is it a good idea to move journal for 3 OSDs to a
> single SSD considering if SSD fails the whole node with 3 HDDs will be
> down?

If your SSDs are working well with Ceph and aren't cheap models dying
under heavy writes, yes. I use one 200GB DC S3710 SSD for 6 7200rpm SATA
OSDs (using 60GB of it for the 6 journals) and it works very well (they
were a huge performance boost compared to our previous use of internal
journals).
Some SSDs are slower than HDDs for Ceph journals though (there has been
a lot of discussions on this subject on this mailing list).
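For sizing them, the rule of thumb I would start from is journal size ~
2 x expected throughput x filestore max sync interval; take the formula
and the numbers below as assumptions to verify against the documentation
for your release:

# Rough journal sizing rule of thumb:
#   journal size ~= 2 * expected_throughput * filestore_max_sync_interval
# Take the formula and the numbers below as assumptions to verify for your
# own release and hardware.
def journal_size_gb(throughput_mb_s, filestore_max_sync_interval_s=5):
    return 2 * throughput_mb_s * filestore_max_sync_interval_s / 1024

# One 7200rpm SATA OSD, ~100 MB/s of sustained writes:
print(f"{journal_size_gb(100):.1f} GB per journal")   # ~1 GB, so 10 GB leaves ample margin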

> Thinking of it, leaving journal on each OSD might be safer, because
> journal on one disk does not affect other disks (OSDs). Or do you
> think that having the journal on SSD is better trade off?

You will put significantly more stress on your HDD leaving journal on
them and good SSDs are far more robust than HDDs so if you pick Intel DC
or equivalent SSD for journal your infrastructure might even be more
robust than one using internal journals (HDDs are dropping like flies
when you have hundreds of them). There are other components able to take
down all your OSDs : the disk controller, the CPU, the memory, the power
supply, ... So adding one robust SSD shouldn't change the overall
availabilty much (you must check their wear level and choose the models
according to the amount of writes you want them to support over their
lifetime though).

The main reason for journals on SSD is performance anyway. If your setup
is already fast enough without them, I wouldn't try to add SSDs.
Otherwise, if you can't reach the level of performance needed by adding
the OSDs already needed for your storage capacity objectives, go SSD.

Best regards,

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Simultaneous CEPH OSD crashes

2015-09-29 Thread Lionel Bouton
Le 27/09/2015 10:25, Lionel Bouton a écrit :
> Le 27/09/2015 09:15, Lionel Bouton a écrit :
>> Hi,
>>
>> we just had a quasi simultaneous crash on two different OSD which
>> blocked our VMs (min_size = 2, size = 3) on Firefly 0.80.9.
>>
>> the first OSD to go down had this error :
>>
>> 2015-09-27 06:30:33.257133 7f7ac7fef700 -1 os/FileStore.cc: In function
>> 'virtual int FileStore::read(coll_t, const ghobject_t&, uint64_t,
>> size_t, ceph::bufferlist&, bool)' thread 7f7ac7fef700 time 2015-09-27
>> 06:30:33.145251
>> os/FileStore.cc: 2641: FAILED assert(allow_eio || !m_filestore_fail_eio
>> || got != -5)
>>
>> the second OSD crash was similar :
>>
>> 2015-09-27 06:30:57.373841 7f05d92cf700 -1 os/FileStore.cc: In function
>> 'virtual int FileStore::read(coll_t, const ghobject_t&, uint64_t,
>> size_t, ceph::bufferlist&, bool)' thread 7f05d92cf700 time 2015-09-27
>> 06:30:57.260978
>> os/FileStore.cc: 2641: FAILED assert(allow_eio || !m_filestore_fail_eio
>> || got != -5)
>>
>> I'm familiar with this error : it happened already with a BTRFS read
>> error (invalid csum) and I could correct it after flush-journal/deleting
>> the corrupted file/starting OSD/pg repair.
>> This time though there isn't any kernel log indicating an invalid csum.
>> The kernel is different though : we use 3.18.9 on these two servers and
>> the others had 4.0.5 so maybe BTRFS doesn't log invalid checksum errors
>> with this version. I've launched btrfs scrub on the 2 filesystems just
>> in case (still waiting for completion).
>>
>> The first attempt to restart these OSDs failed: one OSD died 19 seconds
>> after start, the other 21 seconds. Seeing that, I temporarily brought
>> down the min_size to 1 which allowed the 9 incomplete PG to recover. I
>> verified this by bringing min_size again to 2 and then restarted the 2
>> OSDs. They didn't crash yet.
>>
>> For reference the assert failures were still the same when the OSD died
>> shortly after start :
>> 2015-09-27 08:20:19.332835 7f4467bd0700 -1 os/FileStore.cc: In function
>> 'virtual int FileStore::read(coll_t, const ghobject_t&, uint64_t,
>> size_t, ceph::bufferlist&, bool)' thread 7f4467bd0700 time 2015-09-27
>> 08:20:19.325126
>> os/FileStore.cc: 2641: FAILED assert(allow_eio || !m_filestore_fail_eio
>> || got != -5)
>>
>> 2015-09-27 08:20:50.626344 7f97f2d95700 -1 os/FileStore.cc: In function
>> 'virtual int FileStore::read(coll_t, const ghobject_t&, uint64_t,
>> size_t, ceph::bufferlist&, bool)' thread 7f97f2d95700 time 2015-09-27
>> 08:20:50.605234
>> os/FileStore.cc: 2641: FAILED assert(allow_eio || !m_filestore_fail_eio
>> || got != -5)
>>
>> Note that at 2015-09-27 06:30:11 a deep-scrub started on a PG involving
>> one (and only one) of these 2 OSD. As we evenly space deep-scrubs (with
>> currently a 10 minute interval), this might be relevant (or just a
>> coincidence).
>>
>> I made copies of the ceph osd logs (including the stack trace and the
>> recent events) if needed.
>>
>> Can anyone shed some light on why these OSDs died?
> I just had a thought. Could launching a defragmentation on a file in a
> BTRFS OSD filestore trigger this problem?

That's not it : we had another crash a couple of hours ago on one of the
two servers involved in the first crashes and there was no concurrent
defragmentation going on.

2015-09-29 14:18:53.479881 7f8d78ff9700 -1 os/FileStore.cc: In function
'virtual int FileStore::read(coll_t, const ghobject_t&, uint64_t,
size_t, ceph::bufferlist&, bool)' thread 7f8d78ff9700 time 2015-09-29
14:18:53.425790
os/FileStore.cc: 2641: FAILED assert(allow_eio || !m_filestore_fail_eio
|| got != -5)

 ceph version 0.80.9 (b5a67f0e1d15385bc0d60a6da6e7fc810bde6047)
 1: (FileStore::read(coll_t, ghobject_t const&, unsigned long, unsigned
long, ceph::buffer::list&, bool)+0x96a) [0x8917ea]
 2: (ReplicatedBackend::objects_read_sync(hobject_t const&, unsigned
long, unsigned long, ceph::buffer::list*)+0x81) [0x90ecc1]
 3: (ReplicatedPG::do_osd_ops(ReplicatedPG::OpContext*,
std::vector<OSDOp, std::allocator >&)+0x6a81) [0x801091]
 4: (ReplicatedPG::prepare_transaction(ReplicatedPG::OpContext*)+0x63)
[0x809f23]
 5: (ReplicatedPG::execute_ctx(ReplicatedPG::OpContext*)+0xb6f) [0x80adbf]
 6: (ReplicatedPG::do_op(std::tr1::shared_ptr)+0x2ced) [0x815f4d]
 7: (ReplicatedPG::do_request(std::tr1::shared_ptr,
ThreadPool::TPHandle&)+0x70c) [0x7b047c]
 8: (OSD::dequeue_op(boost::intrusive_ptr,
std::tr1::shared_ptr, ThreadPool::TPHandle&)+0x34a) [0x60c74a]
 9: (OSD::OpWQ::_process(boost::intrusive_ptr,
ThreadPool::TPHandle&

Re: [ceph-users] Simultaneous CEPH OSD crashes

2015-09-27 Thread Lionel Bouton
Le 27/09/2015 09:15, Lionel Bouton a écrit :
> Hi,
>
> we just had a quasi simultaneous crash on two different OSD which
> blocked our VMs (min_size = 2, size = 3) on Firefly 0.80.9.
>
> the first OSD to go down had this error :
>
> 2015-09-27 06:30:33.257133 7f7ac7fef700 -1 os/FileStore.cc: In function
> 'virtual int FileStore::read(coll_t, const ghobject_t&, uint64_t,
> size_t, ceph::bufferlist&, bool)' thread 7f7ac7fef700 time 2015-09-27
> 06:30:33.145251
> os/FileStore.cc: 2641: FAILED assert(allow_eio || !m_filestore_fail_eio
> || got != -5)
>
> the second OSD crash was similar :
>
> 2015-09-27 06:30:57.373841 7f05d92cf700 -1 os/FileStore.cc: In function
> 'virtual int FileStore::read(coll_t, const ghobject_t&, uint64_t,
> size_t, ceph::bufferlist&, bool)' thread 7f05d92cf700 time 2015-09-27
> 06:30:57.260978
> os/FileStore.cc: 2641: FAILED assert(allow_eio || !m_filestore_fail_eio
> || got != -5)
>
> I'm familiar with this error : it happened already with a BTRFS read
> error (invalid csum) and I could correct it after flush-journal/deleting
> the corrupted file/starting OSD/pg repair.
> This time though there isn't any kernel log indicating an invalid csum.
> The kernel is different though : we use 3.18.9 on these two servers and
> the others had 4.0.5 so maybe BTRFS doesn't log invalid checksum errors
> with this version. I've launched btrfs scrub on the 2 filesystems just
> in case (still waiting for completion).
>
> The first attempt to restart these OSDs failed: one OSD died 19 seconds
> after start, the other 21 seconds. Seeing that, I temporarily brought
> down the min_size to 1 which allowed the 9 incomplete PG to recover. I
> verified this by bringing min_size again to 2 and then restarted the 2
> OSDs. They didn't crash yet.
>
> For reference the assert failures were still the same when the OSD died
> shortly after start :
> 2015-09-27 08:20:19.332835 7f4467bd0700 -1 os/FileStore.cc: In function
> 'virtual int FileStore::read(coll_t, const ghobject_t&, uint64_t,
> size_t, ceph::bufferlist&, bool)' thread 7f4467bd0700 time 2015-09-27
> 08:20:19.325126
> os/FileStore.cc: 2641: FAILED assert(allow_eio || !m_filestore_fail_eio
> || got != -5)
>
> 2015-09-27 08:20:50.626344 7f97f2d95700 -1 os/FileStore.cc: In function
> 'virtual int FileStore::read(coll_t, const ghobject_t&, uint64_t,
> size_t, ceph::bufferlist&, bool)' thread 7f97f2d95700 time 2015-09-27
> 08:20:50.605234
> os/FileStore.cc: 2641: FAILED assert(allow_eio || !m_filestore_fail_eio
> || got != -5)
>
> Note that at 2015-09-27 06:30:11 a deep-scrub started on a PG involving
> one (and only one) of these 2 OSD. As we evenly space deep-scrubs (with
> currently a 10 minute interval), this might be relevant (or just a
> coincidence).
>
> I made copies of the ceph osd logs (including the stack trace and the
> recent events) if needed.
>
> Can anyone shed some light on why these OSDs died?

I just had a thought. Could launching a defragmentation on a file in a
BTRFS OSD filestore trigger this problem?
We have a process doing just that. It waits until there's no recent
access to queue files for defragmentation but there's no guarantee that
it will not defragment a file an OSD is about to use.
This might explain the nearly simultaneous crash, as the
defragmentation is triggered by write access patterns which should be
roughly the same on all 3 OSDs hosting a copy of the file. The
defragmentations don't run at the exact same time because they are
queued, which could explain why we got 2 crashes instead of 3.

I'll probably ask on linux-btrfs, but knowing the possible conditions
leading to this assert failure would help pinpoint the problem, so if
someone knows this code well enough without knowing how BTRFS behaves
while defragmenting, I'll bridge the gap.

I just activated autodefrag on one of the two affected servers for all
its BTRFS filesystems and disabled our own defragmentation process.
With recent tunings we might not need our own defragmentation scheduler
anymore and we can afford to lose some performance while investigating this.
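
(For reference, activating it is just a mount option; a minimal example
for one OSD filesystem, to be made persistent in fstab:)

mount -o remount,autodefrag /var/lib/ceph/osd/ceph-<n>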

Best regards,

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Simultaneous CEPH OSD crashes

2015-09-27 Thread Lionel Bouton
Hi,

we just had a quasi simultaneous crash on two different OSD which
blocked our VMs (min_size = 2, size = 3) on Firefly 0.80.9.

the first OSD to go down had this error :

2015-09-27 06:30:33.257133 7f7ac7fef700 -1 os/FileStore.cc: In function
'virtual int FileStore::read(coll_t, const ghobject_t&, uint64_t,
size_t, ceph::bufferlist&, bool)' thread 7f7ac7fef700 time 2015-09-27
06:30:33.145251
os/FileStore.cc: 2641: FAILED assert(allow_eio || !m_filestore_fail_eio
|| got != -5)

the second OSD crash was similar :

2015-09-27 06:30:57.373841 7f05d92cf700 -1 os/FileStore.cc: In function
'virtual int FileStore::read(coll_t, const ghobject_t&, uint64_t,
size_t, ceph::bufferlist&, bool)' thread 7f05d92cf700 time 2015-09-27
06:30:57.260978
os/FileStore.cc: 2641: FAILED assert(allow_eio || !m_filestore_fail_eio
|| got != -5)

I'm familiar with this error : it happened already with a BTRFS read
error (invalid csum) and I could correct it after flush-journal/deleting
the corrupted file/starting OSD/pg repair.
This time though there isn't any kernel log indicating an invalid csum.
The kernel is different though : we use 3.18.9 on these two servers and
the others had 4.0.5 so maybe BTRFS doesn't log invalid checksum errors
with this version. I've launched btrfs scrub on the 2 filesystems just
in case (still waiting for completion).
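
(For reference, a hedged sketch of the repair sequence mentioned above
for such a csum error; <n> and <pg-id> are placeholders and the service
commands depend on the distribution:)

/etc/init.d/ceph stop osd.<n>      # or your init system's equivalent
ceph-osd -i <n> --flush-journal
# delete the file reported as corrupted (invalid csum) in the kernel logs
/etc/init.d/ceph start osd.<n>
ceph pg repair <pg-id>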

The first attempt to restart these OSDs failed: one OSD died 19 seconds
after start, the other 21 seconds. Seeing that, I temporarily brought
down the min_size to 1 which allowed the 9 incomplete PG to recover. I
verified this by bringing min_size again to 2 and then restarted the 2
OSDs. They didn't crash yet.

For reference the assert failures were still the same when the OSD died
shortly after start :
2015-09-27 08:20:19.332835 7f4467bd0700 -1 os/FileStore.cc: In function
'virtual int FileStore::read(coll_t, const ghobject_t&, uint64_t,
size_t, ceph::bufferlist&, bool)' thread 7f4467bd0700 time 2015-09-27
08:20:19.325126
os/FileStore.cc: 2641: FAILED assert(allow_eio || !m_filestore_fail_eio
|| got != -5)

2015-09-27 08:20:50.626344 7f97f2d95700 -1 os/FileStore.cc: In function
'virtual int FileStore::read(coll_t, const ghobject_t&, uint64_t,
size_t, ceph::bufferlist&, bool)' thread 7f97f2d95700 time 2015-09-27
08:20:50.605234
os/FileStore.cc: 2641: FAILED assert(allow_eio || !m_filestore_fail_eio
|| got != -5)

Note that at 2015-09-27 06:30:11 a deep-scrub started on a PG involving
one (and only one) of these 2 OSD. As we evenly space deep-scrubs (with
currently a 10 minute interval), this might be relevant (or just a
coincidence).

I made copies of the ceph osd logs (including the stack trace and the
recent events) if needed.

Can anyone shed some light on why these OSDs died?

Best regards,

Lionel Bouton
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] question on reusing OSD

2015-09-15 Thread Lionel Bouton
Le 16/09/2015 01:21, John-Paul Robinson a écrit :
> Hi,
>
> I'm working to correct a partitioning error from when our cluster was
> first installed (ceph 0.56.4, ubuntu 12.04).  This left us with 2TB
> partitions for our OSDs, instead of the 2.8TB actually available on
> disk, a 29% space hit.  (The error was due to a gdisk bug that
> mis-computed the end of the disk during the ceph-disk-prepare and placed
> the journal at the 2TB mark instead of the true end of the disk at
> 2.8TB. I've updated gdisk to a newer release that works correctly.)
>
> I'd like to fix this problem by taking my existing 2TB OSDs offline one
> at a time, repartitioning them and then bringing them back into the
> cluster.  Unfortunately I can't just grow the partitions, so the
> repartition will be destructive.

Hum, why should it be? If the journal is at the 2TB mark, you should be
able to:
- stop the OSD,
- flush the journal (ceph-osd -i <n> --flush-journal),
- unmount the data filesystem (might be superfluous but the kernel seems
to cache the partition layout when a partition is active),
- remove the journal partition,
- extend the data partition,
- place the journal partition at the end of the drive (in fact you
probably want to write a precomputed partition layout in one go).
- mount the data filesystem, resize it online,
- ceph-osd -i <n> --mkjournal (assuming your setup can find the
journal partition again automatically without reconfiguration)
- start the OSD

If you script this you should not have to use noout: the OSD should come
back in a matter of seconds and the impact on the storage network minimal.
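
(A hedged sketch of that sequence for one OSD, following the steps above
with the journal moved to the end of the disk; OSD <n>, /dev/sdX and the
partition numbers are placeholders, and the new data partition must keep
the exact same start sector as the old one:)

service ceph stop osd.<n>                 # or your init system's equivalent
ceph-osd -i <n> --flush-journal
umount /var/lib/ceph/osd/ceph-<n>
# rewrite the partition table in one go: extend the data partition and
# put a 10G journal partition at the end of the disk
sgdisk -d 2 -d 1 -n 1:<data-start-sector>:-10G -n 2:0:0 /dev/sdX
partprobe /dev/sdX
mount /var/lib/ceph/osd/ceph-<n>          # assuming an fstab entry exists
xfs_growfs /var/lib/ceph/osd/ceph-<n>     # or resize2fs for ext4
ceph-osd -i <n> --mkjournal
service ceph start osd.<n>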

Note that the start of the disk is where you get the best sequential
reads/writes. Given that most data accesses are random and all journal
accesses are sequential I put the journal at the start of the disk when
data and journal are sharing the same platters.

Best regards,

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Hammer reduce recovery impact

2015-09-10 Thread Lionel Bouton
Le 10/09/2015 22:56, Robert LeBlanc a écrit :
> We are trying to add some additional OSDs to our cluster, but the
> impact of the backfilling has been very disruptive to client I/O and
> we have been trying to figure out how to reduce the impact. We have
> seen some client I/O blocked for more than 60 seconds. There has been
> CPU and RAM head room on the OSD nodes, network has been fine, disks
> have been busy, but not terrible.

It seems you've already exhausted most of the ways I know. When
confronted with this situation, I used a simple script to throttle
backfills (freezing them, then re-enabling them). This helped our VMs at
the time, but you must be prepared for very long migrations and some
experimentation with different schedulings. You simply pass it the
number of seconds backfills are allowed to proceed, then the number of
seconds during which they pause.

Here's the script, which should be self-explanatory:
http://pastebin.com/sy7h1VEy
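
In case the pastebin link goes away, here is a minimal sketch of the
idea (assuming that toggling osd_max_backfills between 1 and 0
cluster-wide is all the script does):

#!/bin/bash
# usage: ./throttler <run_seconds> <pause_seconds>
# remember to reset osd-max-backfills to its usual value when done
RUN=$1
PAUSE=$2
while true; do
    ceph tell 'osd.*' injectargs '--osd-max-backfills 1'
    sleep "$RUN"
    ceph tell 'osd.*' injectargs '--osd-max-backfills 0'
    sleep "$PAUSE"
done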

something like :

./throttler 10 120

limited the impact on our VMs (the idea being that during the 10s the
backfill won't be able to trigger filestore syncs and the 120s pause
will allow the filestore syncs to remove "dirty" data from the journals
without interfering too much with concurrent writes).
I believe you must have a high filestore sync value to hope to benefit
from this (we use 30s).
At the very least the long pause will eventually allow VMs to move data
to disk regularly instead of being nearly frozen.

Note that your pgs are more than 10G each; if the OSDs can't stop a
backfill before finishing the transfer of the current pg this won't help
(I assume backfills go through the journals, which probably won't be
able to act as write-back caches anymore as one PG will be enough to
fill them up).

Best regards,

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Hammer reduce recovery impact

2015-09-10 Thread Lionel Bouton
Le 11/09/2015 00:20, Robert LeBlanc a écrit :
> I don't think the script will help our situation as it is just setting
> osd_max_backfill from 1 to 0. It looks like that change doesn't go
> into effect until after it finishes the PG.

That was what I was afraid of. Note that it should help a little anyway
(if not, that's worrying: setting backfills to 0 completely should solve
your clients' IO problems in a matter of minutes).
You may have better results by allowing backfills on only a few of your
OSDs at a time. For example, deep-scrubs were a problem on our
installation when at times there were several going on. We implemented a
scheduler that enforces limits on simultaneous deep-scrubs and these
problems are gone.
That's a last resort and rough around the edges but if every other means
of reducing the impact on your clients has failed, that's the best you
can hope for.
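
(A hedged sketch of the "few OSDs at a time" idea, assuming
osd_max_backfills is the knob used and the OSD ids are picked by hand:)

ceph tell 'osd.*' injectargs '--osd-max-backfills 0'   # freeze backfills everywhere
for id in 3 17 42; do                                  # example OSD ids
    ceph tell osd.$id injectargs '--osd-max-backfills 1'
done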

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Hammer reduce recovery impact

2015-09-10 Thread Lionel Bouton
Le 11/09/2015 01:24, Lincoln Bryant a écrit :
> On 9/10/2015 5:39 PM, Lionel Bouton wrote:
>> For example deep-scrubs were a problem on our installation when at
>> times there were several going on. We implemented a scheduler that
>> enforces limits on simultaneous deep-scrubs and these problems are gone.
>
> Hi Lionel,
>
> Out of curiosity, how many was "several" in your case?

I had to issue ceph osd set nodeep-scrub several times with 3 or 4
concurrent deep-scrubs to avoid processes blocked in D state on VMs and
I could see the VM loads start rising with only 2. At the time I had
only 3 or 4 servers with 18 or 24 OSDs on Firefly. Obviously the more
servers and OSDs you have the more simultaneous deep scrubs you can handle.

One PG is ~5GB on our installation and it was probably ~4GB at the
time. As deep scrubs must read data on all replicas, with size=3 having
3 or 4 of them running concurrently on only 3 or 4 servers means reading
anywhere between 10 and 20GB from the disks on each server (and I don't
think the OSDs try to bypass the kernel cache).

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] backfilling on a single OSD and caching controllers

2015-09-09 Thread Lionel Bouton
Hi,

just a tip I just validated on our hardware. I'm currently converting an
OSD from xfs with journal on same platter to btrfs with journal on SSD.
To avoid any unwanted movement, I reused the same OSD number, weight and
placement : so Ceph is simply backfilling all PGs previously stored on
the old version of this OSD.

The problem is that all the other OSDs on the same server (which has a
total of 6) suffer greatly (>10x jump in apply latencies). I
half-expected this: the RAID card has 2GB of battery-backed RAM from
which ~1.6-1.7 GB is used as write cache. Obviously if you write the
entire content of an OSD through this cache (~500GB currently) it will
not be useful: the first GBs will be put in cache but the OSD will
overflow the cache (writing faster than what the HDD can handle) which
will then become useless for the backfilling.
Worse, once the cache is full writes to the other HDDs will compete for
access to the cache with the backfilling OSD instead of getting the full
benefit of a BBWC.

I already took the precaution of excluding the SSDs from the
controller's cache (which already divides the cache pressure by 2
because the writes to journals are not using it). But right now I just
disabled the cache for the HDD behind the OSD on which backfilling is
happening and I saw an immediate performance gain: apply latencies for
the other OSDs on the same server jumped back from >100ms to <10ms.
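
(For reference, the per-OSD commit/apply latencies are easy to watch
while experimenting with the cache settings:)

watch -n 5 'ceph osd perf'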

AFAIK the Ceph OSD code doesn't bypass the kernel cache when
backfilling; if that's really the case, it might be a good idea to do so
(or at least to make it configurable): the probability that the data
written during backfilling is reused should be lower than for normal
accesses.

On an HP Smart Storage Array:

hpacucli> ctrl slot=<slot> ld <id> modify caching=disable

when the backfilling stops:

hpacucli> ctrl slot=<slot> ld <id> modify caching=enable

This is not usable when there are large scale rebalancing (where nearly
all OSDs are hit by pg movements) but in this particular case this helps
a *lot*.

Best regards,

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Corruption of file systems on RBD images

2015-09-02 Thread Lionel Bouton
Le 02/09/2015 18:16, Mathieu GAUTHIER-LAFAYE a écrit :
> Hi Lionel,
>
> - Original Message -
>> From: "Lionel Bouton" <lionel+c...@bouton.name>
>> To: "Mathieu GAUTHIER-LAFAYE" <mathieu.gauthier-laf...@lapth.cnrs.fr>, 
>> ceph-us...@ceph.com
>> Sent: Wednesday, 2 September, 2015 4:40:26 PM
>> Subject: Re: [ceph-users] Corruption of file systems on RBD images
>>
>> Hi Mathieu,
>>
>> Le 02/09/2015 14:10, Mathieu GAUTHIER-LAFAYE a écrit :
>>> Hi All,
>>>
>>> We have some troubles regularly with virtual machines using RBD storage.
>>> When we restart some virtual machines, they starts to do some filesystem
>>> checks. Sometime it can rescue it, sometime the virtual machine die (Linux
>>> or Windows).
>> What is the cause of death as reported by the VM? FS inconsistency?
>> Block device access timeout? ...
> The VM starts normally without any error message but when the OS starts it 
> detects some inconsistencies on the filesystem.
>
> It try to repair it (fsck.ext4 or chkdsk.exe)... Few times, the repair was 
> successful and we didn't notice any corruption on the VM but we not checked 
> all the filesystem. The solution is often to reinstall the VM.

Hum. Ceph is pretty good at keeping your data safe (with a small caveat,
see below) so you might have some other problem causing data corruption.
The first thing coming to mind is that the VM might run on faulty
hardware (corrupting data in memory before being written to disk).

> [...]
> We have not detected any performance issues due to scrubbing. My doubt was 
> when it check for data integrity of a pg on two replicas. Can it take a wrong 
> decision and replace the good data with the bad one ? I have got probably 
> wrong on how works the scrubbing. Data is safe even if we have only two 
> replicas ?

I'm not 100% sure. With non-checksumming filesystems, if the primary OSD
for a PG is corrupted I believe you are out of luck: AFAIK Ceph doesn't
have internal checksums that would allow it to detect corruption when
reading back data, and it will give you back what the OSD disk has even
if it's corrupted. When repairing a pg (after detecting inconsistencies
during a deep scrub) it seems it doesn't try to find the "right" value
by vote (i.e.: with size=3, you could use the data on the 2 "secondary"
OSDs when they match each other but don't match the primary, to correct
corruption on the primary) but overwrites the secondary OSDs with the
data from the primary OSD (which obviously would transmit any corruption
from the primary to the secondary OSDs).

Then there's a subtlety: with BTRFS and disk corruption, the underlying
filesystem will return a system error when reading from the primary OSD
(because all reads are checked against internal checksums) and I believe
Ceph will then switch the read to a secondary OSD to give back valid
data to the rbd client. I'm not sure how repairing works in this case: I
suspect the data is overwritten by data from the first OSD where a read
doesn't fail, which would correct the situation without any room for an
incorrect choice, but the documentation and posts on this subject were
not explicit about it.
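
(For reference, inconsistencies found by deep scrubs show up in the
health output and repair is done per pg, with the caveat above that
repair copies from the primary; a minimal example, <pg-id> being a
placeholder:)

ceph health detail | grep inconsistent
ceph pg repair <pg-id>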

If I'm right (please wait confirmation about Ceph behaviour with Btrfs
from developpers), Ceph shouldn't be able to corrupt data from your VM
and corruption should happen before it is stored.
That said there's a theoretical small window where corruption could
occur outside the system running the VM: in the primary OSD contacted by
this system if the data to be written is corrupted after being received
and before being transmitted to secondary OSDs Ceph itself could corrupt
data (due to flaky hardware on some OSDs). This could be protected
against by computing a checksum of the data on the rbd client and
checking it on all OSD before writing to disk but I don't know the
internals/protocols so I don't know if it's done and this window closed.

>>> We use BTRFS for OSD with a kernel 3.10. This was not strongly discouraged
>>> when we start the deployment of CEPH last year. Now, it seems that the
>>> kernel version should be 3.14 or later for this kind of setup.
>> See https://btrfs.wiki.kernel.org/index.php/Gotchas for various reasons
>> to upgrade.
>>
>> We have a good deal of experience with Btrfs in production now. We had
>> to disable snapshots, make the journal NoCOW, disable autodefrag and
>> develop our own background defragmenter (which converts to zlib at the
>> same time it defragments for additional space savings). We currently use
>> kernel version 4.0.5 (we don't use any RAID level so we don't need 4.0.6
>> to get a fix for an online RAID level conversion bug) and I wouldn't use
>> anything less than 3.19.5. The results are pretty good, but Btrfs is
>> definitely not an out-of-the-box solution for Ceph.

Re: [ceph-users] Corruption of file systems on RBD images

2015-09-02 Thread Lionel Bouton
Hi Mathieu,

Le 02/09/2015 14:10, Mathieu GAUTHIER-LAFAYE a écrit :
> Hi All,
>
> We have some troubles regularly with virtual machines using RBD storage. When 
> we restart some virtual machines, they starts to do some filesystem checks. 
> Sometime it can rescue it, sometime the virtual machine die (Linux or 
> Windows).

What is the cause of death as reported by the VM? FS inconsistency?
Block device access timeout? ...

>
> We have move from Firefly to Hammer the last month. I don't know if the 
> problem is in Ceph and is still there or if we continue to see symptom of a 
> Firefly bug.
>
> We have two rooms in two separate building, so we set the replica size to 2. 
> I'm in doubt if it can cause this kind of problems when scrubbing operations. 
> I guess the recommended replica size is at less 3.

Scrubbing is pretty harmless, deep scrubbing is another matter.
Simultaneous deep scrubs on the same OSD are a performance killer. It
seems the latest Ceph versions provide some way of limiting their impact
on performance (scrubs are done per pg, so 2 simultaneous scrubs can and
often do involve the same OSD, and I think there's a limit on scrubs per
OSD now). AFAIK Firefly doesn't have this (and it surely didn't when we
were confronted with the problem) so we developed our own deep scrub
scheduler to avoid involving the same OSD twice (in fact our scheduler
tries to interleave scrubs so that each OSD has as much inactivity after
a deep scrub as possible before the next). This helps a lot.
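
(For reference, a few of the knobs involved, as a hedged ceph.conf
example; availability and defaults vary between releases:)

[osd]
osd max scrubs = 1                  # concurrent scrubs per OSD
osd deep scrub interval = 2592000   # spread deep scrubs over ~30 days
osd scrub sleep = 0.1               # pause between scrub chunks (newer releases)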

>
> We use BTRFS for OSD with a kernel 3.10. This was not strongly discouraged 
> when we start the deployment of CEPH last year. Now, it seems that the kernel 
> version should be 3.14 or later for this kind of setup.

See https://btrfs.wiki.kernel.org/index.php/Gotchas for various reasons
to upgrade.

We have a good deal of experience with Btrfs in production now. We had
to disable snapshots, make the journal NoCOW, disable autodefrag and
develop our own background defragmenter (which converts to zlib at the
same time it defragments for additional space savings). We currently use
kernel version 4.0.5 (we don't use any RAID level so we don't need 4.0.6
to get a fix for an online RAID level conversion bug) and I wouldn't use
anything less than 3.19.5. The results are pretty good, but Btrfs is
definitely not an out-of-the-box solution for Ceph.


>
> Does some people already have got similar problems ? Do you think, it's 
> related to our BTRFS setup. Is it the replica size of the pool ?

It mainly depends on the answer to the first question above (is it a
corruption or a freezing problem?).

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph for multi-site operation

2015-08-24 Thread Lionel Bouton
Le 24/08/2015 15:11, Julien Escario a écrit :
 Hello,
 First, let me advise I'm really a noob with Cephsince I have only read some
 documentation.

 I'm now trying to deploy a Ceph cluster for testing purposes. The cluster is
 based on 3 (more if necessary) hypervisors running proxmox 3.4.

 Before going further, I have an essential question: is Ceph usable in
 a case of multi-site storage?

It depends on what you really need it to do (access patterns and
behaviour when a link goes down).


 Long story :
 My goal is to run hypervisors on 2 datacenters separated by 4ms latency.

Note : unless you are studying Ceph behaviour in this case this goal is
in fact a method to reach a goal. If you describe the actual goal you
might get different suggestions.

 Bandwidth is 1Gbps actually but will be upgraded in a near future.

 So is it possible to run a an active/active Ceph cluster to get a shared 
 storage
 between the two sites.

It is but it probably won't behave correctly in your case. The latency
and the bandwidth will hurt a lot. Any application requiring that data
is confirmed stored on disk will be hit by the 4ms latency and 1Gbps
will have to be shared between inter-site replication traffic and
regular VM disk accesses. Your storage will most probably behave like a
very slow single hard drive shared between all your VMs.
Some workloads might work correctly (if you don't have any significant
writes and most of your data will fit in caches for example).

When the link between your 2 datacenters is severed, in the worst case
(no quorum reachable, or a crushmap that won't allow each pg to reach
min_size with only one datacenter) everything will freeze; in the best
case (giving priority to a single datacenter by running more monitors on
it and using a crushmap storing at least min_size replicas on it), when
the link goes down everything will keep running on this datacenter.

You can get around part of the performance problems by going with 3-way
replication: 2 replicas on your primary datacenter and 1 on the
secondary, where all OSDs are configured with primary affinity 0. All
reads will be served from the primary datacenter and only writes would
go to the secondary. You'll have to run all your VMs on the primary
datacenter and set up your monitors such that the elected master will be
in the primary datacenter (I believe it is chosen by the first name
according to alphabetical order). You'll have a copy of your data on the
secondary datacenter in case of a disaster on the primary, but
recovering will be hard (you'll have to reach a quorum of monitors in
the secondary datacenter and I'm not sure how to proceed if you only
have one out of 3 for example).
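
(A hedged sketch of a crushmap rule for such a 2+1 placement, assuming
roots named dc-primary and dc-secondary; check it with crushtool before
injecting it, and combine it with "ceph osd primary-affinity <osd-id> 0"
on every OSD of the secondary datacenter:)

rule primary_dc_first {
        ruleset 1
        type replicated
        min_size 2
        max_size 3
        step take dc-primary
        step chooseleaf firstn 2 type host
        step emit
        step take dc-secondary
        step chooseleaf firstn -2 type host
        step emit
}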


  Of course, I'll have to be sure that no machine is
 running at the same time on both sites.

With your bandwidth and latency, without knowing more about your
workloads, it's probable that running VMs on both sites will get you
very slow IOs. Multi-datacenter replication for simple object storage
using RGW seems to work, but RBD volume accesses are usually more
demanding.

  Hypervisor will be in charge of this.

 Is there a mean to ask Ceph to keep at least one copy (or two) in each site 
 and
 ask it to make all blocs reads from the nearest location ?
 I'm aware that writes would have to be replicated and there's only a 
 synchronous
 mode for this.

 I've read many documentation and use cases about Ceph and it seems some are
 saying it could be used in such replication and others are not. Need of 
 erasure
 coding isn't clear too.

Don't use erasure coding for RBD volumes. You'll need a caching tier and
it seems tricky to get right and might not be fully tested (I've seen a
snapshot bug discussed here last week).

Best regards,

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] EXT4 for Production and Journal Question?

2015-08-24 Thread Lionel Bouton
Le 24/08/2015 19:34, Robert LeBlanc a écrit :
 Building off a discussion earlier this month [1], how supported is
 EXT4 for OSDs? It seems that some people are getting good results with
 it and I'll be testing it in our environment.

 The other question is if the EXT4 journal is even necessary if you are
 using Ceph SSD journals. My thoughts are thus: Incoming I/O is written
 to the SSD journal. The journal then flushes to the EXT4 partition.
 Only after the write is completed (I understand that this is a direct
 sync write) does Ceph free the SSD journal entry.

 Doesn't this provide the same reliability as the EXT4 journal? If an
 OSD crashed in the middle of the write with no EXT4 journal, the file
 system would be repaired and then Ceph would rewrite the last
 transaction that didn't complete? I'm sure I'm missing something
 here...

I didn't try this configuration but what you're missing is probably:
- the file system recovery time when there's no journal available.
e2fsck on large filesystems can be long and may need user interaction.
You don't want that if you just had a cluster-wide (or even partial, but
involving tens of disks some of which might be needed to reach min_size)
power failure.
- the less tested behaviour: I'm not sure there's even a guarantee from
ext4 without a journal that e2fsck can recover properly after a crash
(i.e.: with data consistent with the Ceph journal).

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] btrfs w/ centos 7.1

2015-08-07 Thread Lionel Bouton
Le 07/08/2015 22:05, Ben Hines a écrit :
 Howdy,

 The Ceph docs still say btrfs is 'experimental' in one section, but
 say it's the long term ideal for ceph in the later section. Is this
 still accurate with Hammer? Is it mature enough on centos 7.1 for
 production use?

 (kernel is  3.10.0-229.7.2.el7.x86_64 )

Difficult to say with distribution kernels: they may or may not have
patched them to fix some Btrfs issues (3.10.0 is more than 2 years old).
I wouldn't trust them myself.

We are converting our OSDs to Btrfs but we use recent kernel versions
(4.0.5 currently). We disabled Btrfs snapshots in ceph.conf (they are
too costly), created the journals NOCOW (we will move them to SSDs
eventually) and developed our own defragmentation scheduler (Btrfs' own
autodefrag didn't perform well with Ceph when we started, and we use the
btrfs defragmentation process to recompress data with zlib instead of
lzo, as we mount the OSD filesystems with compress=lzo for lower-latency
OSD writes).
In the above conditions, it is faster than XFS (~30% lower apply
latencies according to ceph osd perf), detects otherwise silent data
corruption (it caught some already) and provides ~10% additional storage
space thanks to lzo/zlib compression (most of our data consists of
already compressed files stored on RBD, actual gains obviously depend on
the data).
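
(A hedged sketch of that kind of setup; the device, OSD id and journal
path are placeholders:)

# ceph.conf
[osd]
filestore btrfs snap = false

# fstab entry for the OSD filesystem
/dev/sdX1  /var/lib/ceph/osd/ceph-<n>  btrfs  noatime,compress=lzo  0 0

# mark the journal NOCOW (must be done while the file is still empty)
chattr +C /var/lib/ceph/osd/ceph-<n>/journal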

Best regards,

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS vs RBD

2015-07-22 Thread Lionel Bouton
Le 22/07/2015 21:17, Lincoln Bryant a écrit :
 Hi Hadi,

 AFAIK, you can’t safely mount RBD as R/W on multiple machines. You
 could re-export the RBD as NFS, but that’ll introduce a bottleneck and
 probably tank your performance gains over CephFS.

 For what it’s worth, some of our RBDs are mapped to multiple machines,
 mounted read-write on one and read-only on the others. We haven’t seen
 any strange effects from that, but I seem to recall it being ill advised.

Yes it is, for several reasons. Here are two off the top of my head.

Some (many/most/all?) filesystems update on-disk data when they are
mounted even if the mount is read-only. If you map your RBD devices
read-only before mounting the filesystem itself read-only you should be
safe from corruption occurring at mount time though.
The system with read-write access will keep its in-memory data in sync
with the on-disk data. The others, with read-only access, will not, as
they won't be aware of the writes done. This means they will eventually
see incoherent data and will generate fs access errors of varying
severity, from a benign read error to a potential full kernel crash,
with whole-filesystem freezes in-between.

Don't do that unless you:
- carefully set up your rbd mappings read-only everywhere but the system
doing the writes,
- can withstand a (simultaneous) system crash on all the systems
mounting the rbd mappings read-only.
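
If you do go that route anyway, here is a minimal sketch of a read-only
mapping (assuming an XFS filesystem on the image; the pool and image
names are placeholders):

rbd map --read-only <pool>/<image>
mount -o ro,norecovery,nouuid /dev/rbd/<pool>/<image> /mnt/readonly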

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how to recover from: 1 pgs down; 10 pgs incomplete; 10 pgs stuck inactive; 10 pgs stuck unclean

2015-07-15 Thread Lionel Bouton
Le 15/07/2015 10:55, Jelle de Jong a écrit :
 On 13/07/15 15:40, Jelle de Jong wrote:
 I was testing a ceph cluster with osd_pool_default_size = 2 and while
 rebuilding the OSD on one ceph node a disk in an other node started
 getting read errors and ceph kept taking the OSD down, and instead of me
 executing ceph osd set nodown while the other node was rebuilding I kept
 restarting the OSD for a while and ceph took the OSD in for a few
 minutes and then taking it back down.

 I then removed the bad OSD from the cluster and later added it back in
 with nodown flag set and a weight of zero, moving all the data away.
 Then removed the OSD again and added a new OSD with a new hard drive.

 However I ended up with the following cluster status and I can't seem to
 find how to get the cluster healthy again. I'm doing this as tests
 before taking this ceph configuration in further production.

 http://paste.debian.net/plain/281922

 If I lost data, my bad, but how could I figure out in what pool the data
 was lost and in what rbd volume (so what kvm guest lost data).
 Anybody that can help?

 Can I somehow reweight some OSD to resolve the problems? Or should I
 rebuild the whole cluster and loose all data?

If your min_size is 2, try setting it to 1 and restart each of your
OSDs. If ceph -s doesn't show any progress repairing your data, you'll
have to either get developers to help salvage what can be salvaged from
your disks or rebuild the cluster with size=3 and restore your data.
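
(A minimal example, <pool> being a placeholder; don't forget to set
min_size back once the cluster is healthy again:)

ceph osd pool set <pool> min_size 1
ceph -s                               # watch the recovery progress
ceph osd pool set <pool> min_size 2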

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Issue with journal on another drive

2015-07-13 Thread Lionel Bouton
On 07/14/15 00:08, Rimma Iontel wrote:
 Hi all,

 [...]
 Is there something that needed to be done to journal partition to
 enable sharing between multiple OSDs?  Or is there something else
 that's causing the isssue?


IIRC you can't share a volume between multiple OSDs. What you could do
if splitting this partition isn't possible is create an LVM volume group
with it as a single physical volume (change the partition type to lvm,
pvcreate /dev/sda6, vgcreate journal_vg /dev/sda6). Then you can create
a logical volume in it for each of your OSDs (lvcreate -n
osd<n>_journal -L <one_third_of_available_space> journal_vg) and use
them (/dev/journal_vg/osd<n>_journal) in your configuration.
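
(The same thing as a runnable sketch for 3 OSDs; the 10G journal size is
only an example:)

pvcreate /dev/sda6
vgcreate journal_vg /dev/sda6
for n in 0 1 2; do lvcreate -n osd${n}_journal -L 10G journal_vg; done
# then point each OSD at its volume, e.g. in ceph.conf:
# [osd.0]
# osd journal = /dev/journal_vg/osd0_journal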

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Real world benefit from SSD Journals for a more read than write cluster

2015-07-12 Thread Lionel Bouton
On 07/12/15 05:55, Alex Gorbachev wrote:
 FWIW. Based on the excellent research by Mark Nelson
 (http://ceph.com/community/ceph-performance-part-2-write-throughput-without-ssd-journals/)
 we have dropped SSD journals altogether, and instead went for the
 battery protected controller writeback cache.

Note that this has limitations (and the research is nearly 2 years old):
- the controller writeback caches are relatively small (often less than
4GB, 2GB is common on the controller, a small portion is not usable, and
10% of the rest is often used for readahead/read cache) and this is
shared by all of your drives. If your workload is not made of write
spikes but of nearly constant writes, this won't help as you will be
limited on each OSD by roughly half of the disk IOPS. With journals on
SSDs, when you hit their limit (which is ~5GB of buffer for 10GB
journals, and not 2GB divided by the number of OSDs per controller), the
limit is the raw disk IOPS.
- you *must* make sure the controller is configured to switch to
write-through when the battery/capacitor fails (or a power failure on
hardware from the same generation could make you lose all of the OSDs
connected to them in a single event which means data loss),
- you should monitor the battery/capacitor status to trigger maintenance
(and your cluster will slow down while the battery/capacitor is waiting
for a replacement, you might want to down the associated OSDs depending
on your cluster configuration). We mostly eliminated this problem by
replacing the whole chassis of the servers we lease for new generations
every 2 or 3 years: if you time the hardware replacement to match a
fresh chassis generation this means fresh capacitors and they shouldn't
fail you (ours are rated for 3 years).

We just ordered Intel S3710 SSDs even though we have battery/capacitor
backed caches on the controllers: the latencies have started to rise
nevertheless when there are long periods of write intensive activity.
I'm currently pondering if we should bypass the write-cache for the
SSDs. The cache is obviously less effective on them and might be more
useful overall if it is dedicated to the rotating disks. Does anyone
have test results with cache active/inactive on SSD journals with HP
Smart Array p420 or p840 controllers?

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to prefer faster disks in same pool

2015-07-10 Thread Lionel Bouton
On 07/10/15 02:13, Christoph Adomeit wrote:
 Hi Guys,

 I have a ceph pool that is mixed with 10k rpm disks and 7.2 k rpm disks.

 There are 85 osds and 10 of them are 10k
 Size is not an issue, the pool is filled only 20%

 I want to somehow prefer the 10 k rpm disks so that they get more i/o

 What is the most intelligent way to prefer the faster disks?
 Just give them another weight or are there other methods ?

If you cluster is read intensive you can use primary affinity to
redirect reads to your 10k drives. Add

mon osd allow primary affinity = true

in your ceph.conf, restart your monitors and for each OSD on 7.2k use :

ceph osd primary-affinity 7.2k_id 0

For every pg with at least one 10k OSD, this will make one of the 10k
drive OSD primary and will perform reads on it.

But with only 10 OSDs being 10k and 75 OSDs being 7.2k, I'm not sure
what will happen: most pgs clearly will be only on 7.2k OSDs so you may
not gain much.

It's worth a try if you don't want to reorganize your storage though,
and it's by far the least time-consuming option if you want to revert
your changes later.

Another way with better predictability would be to define a 10k root and
use a custom rule for your pool which would take the primary from this
new root and switch to the default root for the next OSDs, but you don't
have enough of them to keep the data balanced (for a size=3 pool, you'd
need 1/3 of 10k OSD and 2/3 of 7.2k OSD). This would create a bottleneck
on your 10k drives.

I fear there's no gain in creating a separate 10k pool: you don't have
enough drives to get as much performance from the new 10k pool as you
can from the resulting 7.2k-only pool. Maybe with some specific data
access pattern this could work but I'm not sure what those would be (you
might get more useful suggestions if you describe how the current pool
is used).

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] FW: Ceph data locality

2015-07-07 Thread Lionel Bouton
On 07/07/15 18:20, Dmitry Meytin wrote:
 Exactly because of that issue I've reduced the number of Ceph replications to 
 2 and the number of HDFS copies is also 2 (so we're talking about 4 copies).
 I want (but didn't tried yet) to change Ceph replication to 1 and change HDFS 
 back to 3.

You are stacking a distributed storage network on top of another, no
wonder you find the performance below your expectations.

You could (should?) use CephFS instead of HDFS on RBD-backed VMs (as
the latter is clearly redundant and inefficient). Note that if you try
to use size=1 for your RBD pool instead (which will probably be slower
than using Hadoop with CephFS) and lose only one disk, you will probably
freeze most or all of your VMs (as their disks will be split across all
physical disks of your Ceph cluster) and certainly corrupt all of their
filesystems.

See http://ceph.com/docs/master/cephfs/hadoop/

If this doesn't work for you I'll suggest separating the VMs system
disks from the Hadoop storage and run Hadoop storage nodes on bare
metal. VMs could either be backed by local disks or RBD if you need to
but in any case they should avoid creating any large IO spikes which
could disturb the Hadoop storage nodes.

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] FW: Ceph data locality

2015-07-07 Thread Lionel Bouton
On 07/07/15 17:41, Dmitry Meytin wrote:
 Hi Lionel,
 Thanks for the answer.
 The missing info:
 1) Ceph 0.80.9 Firefly
 2) map-reduce makes sequential reads of blocks of 64MB (or 128 MB)
 3) HDFS which is running on top of Ceph is replicating data for 3 times 
 between VMs which could be located on the same physical host or different 
 hosts

HDFS on top of Ceph? How does it work exactly? If you run VMs backed by
RBD which are then used in Hadoop to build HDFS, this means that HDFS
makes 3 copies, and with the default Ceph pool size=3 this would make 9
copies of the same data. If I understand this right, it is very
inefficient.

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] FW: Ceph data locality

2015-07-07 Thread Lionel Bouton
Hi Dmitry,

On 07/07/15 14:42, Dmitry Meytin wrote:
 Hi Christian,
 Thanks for the thorough explanation.
 My case is Elastic Map Reduce on top of OpenStack with Ceph backend for 
 everything (block, object, images).
 With default configuration, performance is 300% worse than bare metal.
 I did a few changes:
 1) replication settings 2
 2) read ahead size 2048Kb 
 3) Max sync intervals 10s
 4) Large queue and large bytes
 5) OSD OP threads 20
 6) FileStore Flusher off
 7) Sync Flush On
 8) Object size 64 Mb

 And still the performance is poor when comparing to bare-metal.

Describing how you test performance on bare metal would help identify
whether this is expected behavior or a configuration problem. If you try
to compare sequential access to individual local disks with Ceph, it's
an apples-to-oranges comparison (for example Ceph RBD isn't optimized
for this by default and I'm not sure how far striping/order/readahead
tuning can get you). If you compare random access to 3-way RAID1 devices
with random access to RBD devices on pools with size=3, then it becomes
more relevant.

I didn't see any description of the hardware and network used for Ceph
which might help identify a bottleneck. The Ceph version is missing too.

When you test Ceph performance, is ceph -s reporting HEALTH_OK (if not,
this would have a performance impact)? Is there any deep-scrubbing going
on (this will limit your IO bandwidth, especially if several happen at
the same time)?
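
(A quick way to check both points; the second command is a hedged
example and only gives a rough count of pgs currently scrubbing or
deep-scrubbing:)

ceph -s
ceph pg dump pgs_brief 2>/dev/null | grep -c scrubbing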

 The profiling shows the huge network demand (I'm running terasort) during the 
 map phase.

It's expected with Ceph. Your network should have the capacity for your
IO targets. Note that if your data is easy to restore you can get better
write performance with size=1 or size=2 depending on the trade-off you
want between durability and performance.

 I want to avoid shared-disk behavior of Ceph and I would like VM to read data 
 from the local volume as much as applicable.
 Am I wrong with mu assumptions?

Yes: Ceph is a distributed storage network, there's no provision for
local storage. Note that 10Gbit networks (especially dual 10Gbit) and
some tuning should in theory give you plenty of read performance with
Ceph (far more than any local disk could provide, except NVMe storage or
similar tech). You may be limited by latencies and the read or write
patterns of your clients though. Ceph's total bandwidth is usually
reached when you have heavy concurrent accesses.

Note that if you use map reduce with a Ceph cluster you should probably
write any intermediate results to local storage instead of Ceph as it
doesn't bring any real advantage for them (the only data that you should
store on Ceph is what you want to keep after the map reduce so probably
the initial input and the final output if it is meant to be stored).

Best regards,

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] any recommendation of using EnhanceIO?

2015-07-02 Thread Lionel Bouton
On 07/02/15 13:49, German Anders wrote:
 output from iostat:

 CEPHOSD01:

 Device: rrqm/s   wrqm/s r/s w/srMB/swMB/s
 avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
 sdc(ceph-0)   0.00 0.001.00  389.00 0.0035.98  
 188.9660.32  120.12   16.00  120.39   1.26  49.20
 sdd(ceph-1)   0.00 0.000.000.00 0.00 0.00
 0.00 0.000.000.000.00   0.00   0.00
 sdf(ceph-2)   0.00 1.006.00  521.00 0.0260.72  
 236.05   143.10  309.75  484.00  307.74   1.90 100.00
 sdg(ceph-3)   0.00 0.00   11.00  535.00 0.0442.41  
 159.22   139.25  279.72  394.18  277.37   1.83 100.00
 sdi(ceph-4)   0.00 1.004.00  560.00 0.0254.87  
 199.32   125.96  187.07  562.00  184.39   1.65  93.20
 sdj(ceph-5)   0.00 0.000.00  566.00 0.0061.41  
 222.19   109.13  169.620.00  169.62   1.53  86.40
 sdl(ceph-6)   0.00 0.008.000.00 0.09 0.00   
 23.00 0.12   12.00   12.000.00   2.50   2.00
 sdm(ceph-7)   0.00 0.002.00  481.00 0.0144.59  
 189.12   116.64  241.41  268.00  241.30   2.05  99.20
 sdn(ceph-8)   0.00 0.001.000.00 0.00 0.00
 8.00 0.018.008.000.00   8.00   0.80
 fioa  0.00 0.000.00 1016.00 0.0019.09   
 38.47 0.000.060.000.06   0.00   0.00

 Device: rrqm/s   wrqm/s r/s w/srMB/swMB/s
 avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
 sdc(ceph-0)   0.00 1.00   10.00  278.00 0.0426.07  
 185.6960.82  257.97  309.60  256.12   2.83  81.60
 sdd(ceph-1)   0.00 0.002.000.00 0.02 0.00   
 20.00 0.02   10.00   10.000.00  10.00   2.00
 sdf(ceph-2)   0.00 1.006.00  579.00 0.0254.16  
 189.68   142.78  246.55  328.67  245.70   1.71 100.00
 sdg(ceph-3)   0.00 0.00   10.00   75.00 0.05 5.32  
 129.41 4.94  185.08   11.20  208.27   4.05  34.40
 sdi(ceph-4)   0.00 0.00   19.00  147.00 0.0912.61  
 156.6317.88  230.89  114.32  245.96   3.37  56.00
 sdj(ceph-5)   0.00 1.002.00  629.00 0.0143.66  
 141.72   143.00  223.35  426.00  222.71   1.58 100.00
 sdl(ceph-6)   0.00 0.00   10.000.00 0.04 0.00
 8.00 0.16   18.40   18.400.00   5.60   5.60
 sdm(ceph-7)   0.00 0.00   11.004.00 0.05 0.01
 8.00 0.48   35.20   25.82   61.00  14.13  21.20
 sdn(ceph-8)   0.00 0.009.000.00 0.07 0.00   
 15.11 0.078.008.000.00   4.89   4.40
 fioa  0.00 0.000.00 6415.00 0.00   125.81   
 40.16 0.000.140.000.14   0.00   0.00

 CEPHOSD02:

 Device: rrqm/s   wrqm/s r/s w/srMB/swMB/s
 avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
 sdc1(ceph-9)  0.00 0.00   13.000.00 0.11 0.00   
 16.62 0.17   13.23   13.230.00   4.92   6.40
 sdd1(ceph-10) 0.00 0.00   15.000.00 0.13 0.00   
 18.13 0.26   17.33   17.330.00   1.87   2.80
 sdf1(ceph-11) 0.00 0.00   22.00  650.00 0.1151.75  
 158.04   143.27  212.07  308.55  208.81   1.49 100.00
 sdg1(ceph-12) 0.00 0.00   12.00  282.00 0.0554.60  
 380.6813.16  120.52  352.00  110.67   2.91  85.60
 sdi1(ceph-13) 0.00 0.001.000.00 0.00 0.00
 8.00 0.018.008.000.00   8.00   0.80
 sdj1(ceph-14) 0.00 0.00   20.000.00 0.08 0.00
 8.00 0.26   12.80   12.800.00   3.60   7.20
 sdl1(ceph-15) 0.00 0.000.000.00 0.00 0.00
 0.00 0.000.000.000.00   0.00   0.00
 sdm1(ceph-16) 0.00 0.00   20.00  424.00 0.1132.20  
 149.0589.69  235.30  243.00  234.93   2.14  95.20
 sdn1(ceph-17) 0.00 0.005.00  411.00 0.0245.47  
 223.9498.32  182.28 1057.60  171.63   2.40 100.00

 Device: rrqm/s   wrqm/s r/s w/srMB/swMB/s
 avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
 sdc1(ceph-9)  0.00 0.00   26.00  383.00 0.1134.32  
 172.4486.92  258.64  297.08  256.03   2.29  93.60
 sdd1(ceph-10) 0.00 0.008.00   31.00 0.09 1.86  
 101.95 0.84  178.15   94.00  199.87   6.46  25.20
 sdf1(ceph-11) 0.00 1.005.00  409.00 0.0548.34  
 239.3490.94  219.43  383.20  217.43   2.34  96.80
 sdg1(ceph-12) 0.00 0.000.00  238.00 0.00 1.64   
 14.1258.34  143.600.00  143.60   1.83  43.60
 sdi1(ceph-13) 0.00 0.00   11.000.00 0.05 0.00   
 10.18 0.16   14.18   14.180.00   5.09   5.60
 sdj1(ceph-14) 0.00 0.001.000.00 0.00 0.00
 8.00 0.02   16.00   16.000.00  16.00   1.60
 sdl1(ceph-15) 0.00 0.001.000.00 0.03 0.00   
 

Re: [ceph-users] Unexpected issues with simulated 'rack' outage

2015-06-24 Thread Lionel Bouton
On 06/24/15 14:44, Romero Junior wrote:

 Hi,

  

 We are setting up a test environment using Ceph as the main storage
 solution for my QEMU-KVM virtualization platform, and everything works
 fine except for the following:

  

 When I simulate a failure by powering off the switches on one of our
 three racks my virtual machines get into a weird state, the
 illustration might help you to fully understand what is going on:
 http://i.imgur.com/clBApzK.jpg

  

 The PGs are distributed based on racks, there are not default crush rules.


What is ceph -s telling while you are in this state ?

16000 pgs might be a problem: when your rack goes down, if your crushmap
rules distribute pgs based on rack, with size = 2 approximately 2/3 of
your pgs should be in a degraded state. This means that ~10666 pgs will
have to copy data to get back to a active+clean state. Your 2 other
racks will then be really busy. You can probably tune the recovery
processes to avoid too much interference with your normal VM I/Os.
You didn't say where the monitors are placed (and there are only 2 on
your illustration, which means any one of them becoming unreachable will
bring down your cluster).
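
(Regarding the recovery tuning mentioned above, a hedged example of the
usual ceph.conf knobs; lower values mean less impact on client I/O but a
slower recovery:)

[osd]
osd max backfills = 1
osd recovery max active = 1
osd recovery op priority = 1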

That said, I'm not sure that having a failure domain at the rack level
when you only have 3 racks is a good idea. What you end up with when a
switch fails is a reconfiguration of two thirds of your cluster, which
is not desirable in any case.
in more racks (4 racks : only 1/2 of your data will be affected, 5 racks
only 2/5, ...) or make the switches redundant (each server with OSD
connected to 2 switches, ...).

Note that with 33 servers per rack, 3 OSDs per server and 3 racks you
have approximately 300 disks. With so many disks, size=2 is probably too
low to get a negligible probability of losing data (even if the failure
case is 2 amongst 100 and not 300). With only ~20 disks we already came
close to 2 simultaneous failures once (admittedly it was the combination
of hardware and human error in the earlier days of our cluster). We
currently have one failed disk and one showing signs (erratic
performance) of hardware problems, in a span of a few weeks.

Best regards,

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Unexpected disk write activity with btrfs OSDs

2015-06-22 Thread Lionel Bouton
On 06/22/15 17:21, Erik Logtenberg wrote:
 I have the journals on a separate disk too. How do you disable the
 snapshotting on the OSD?
http://ceph.com/docs/master/rados/configuration/filestore-config-ref/ :

filestore btrfs snap = false
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Unexpected disk write activity with btrfs OSDs

2015-06-22 Thread Lionel Bouton
On 06/19/15 13:23, Erik Logtenberg wrote:
 I believe this may be the same issue I reported some time ago, which is
 as of yet unsolved.

 https://www.mail-archive.com/ceph-users@lists.ceph.com/msg19770.html

 I used strace to figure out that the OSD's were doing an incredible
 amount of getxattr, setxattr and removexattr calls, for no apparent
 reason. Do you see the same write pattern?

 My OSD's are also btrfs-backed.

Thanks for the heads-up.

Did you witness this with no activity at all?
From your report, this was happening during CephFS reads and we don't
use CephFS, only RBD volumes.

The amount of written data in our case is fairly consistent too.
I'll try to launch a strace but I'm not sure if I will have the time
before we add SSDs to our current HDD-only setup.

If I can strace btrfs OSD without SSD journals I'll report here.
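
If anyone wants to reproduce the check quickly, attaching strace to one
OSD and counting these syscalls should be enough (pid to be filled in;
interrupt with Ctrl-C to get the per-syscall counts):

strace -f -c -e trace=getxattr,setxattr,removexattr -p <pid of one ceph-osd>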

Lionel


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Unexpected disk write activity with btrfs OSDs

2015-06-22 Thread Lionel Bouton
On 06/22/15 11:27, Jan Schermer wrote:
 I don’t run Ceph on btrfs, but isn’t this related to the btrfs
 snapshotting feature ceph uses to ensure a consistent journal?

It's possible: if I understand the code correctly, the btrfs filestore
backend creates a snapshot when syncing the journal. I'm a little
surprised that btrfs would need approximately 120MB written to disk to
perform a snapshot of a subvolume with ~160k files (plus the removal of
the oldest one, as the OSD maintains 2 active snapshots), but snapshots
aren't guaranteed to be dirt cheap and probably weren't optimised for
this frequency. I'm surprised because I was under the impression that a
snapshot on btrfs was only a matter of keeping a reference to the root
of the filesystem btree, which (at least in theory) seems cheap. In
fact, while writing this I realise it might very well be the release of
a previous snapshot, with its associated cleanups, which is costly, not
the snapshot creation.

We are about to add Intel DC SSDs for journals and I believe Krzysztof
is right: we should be able to disable the snapshots safely then. The
main reasons for us to use btrfs are compression and crc at the fs
level. Performance could be another one: we consistently get better
latencies vs xfs in our configuration. So I'm not particularly bothered
by this; it may just be something useful to document (and at least leave
a trace here for others to find): btrfs with the default filestore max
sync interval (5 seconds) may have serious performance problems in most
configurations.

I'm not sure if I will have the time to trace the OSD processes to check
if I witness what Erik saw with CephFS (lots of xattr activity including
setxattr and removexattr): I'm not using CephFS and his findings didn't
specify whether he was using btrfs- or xfs-backed OSDs (we only see this
behaviour on btrfs).

Best regards,

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: Re: Unexpected disk write activity with btrfs OSDs

2015-06-19 Thread Lionel Bouton
On 06/19/15 13:42, Burkhard Linke wrote:

 Forget the reply to the list...

  Forwarded Message 
 Subject:  Re: [ceph-users] Unexpected disk write activity with btrfs OSDs
 Date: Fri, 19 Jun 2015 09:06:33 +0200
 From: Burkhard Linke burkhard.li...@computational.bio.uni-giessen.de
 To:   Lionel Bouton lionel+c...@bouton.name



 Hi,

 On 06/18/2015 11:28 PM, Lionel Bouton wrote:
  Hi,
 *snipsnap*

  - Disks with btrfs OSD have a spike of activity every 30s (2 intervals
  of 10s with nearly 0 activity, one interval with a total amount of
  writes of ~120MB). The averages are : 4MB/s, 100 IO/s.

 Just a guess:

 btrfs has a commit interval which defaults to 30 seconds.

 You can verify this by changing the interval with the commit=XYZ mount 
 option.

I know and I tested commit intervals of 60 and 120 seconds without any
change. As this is directly linked to filestore max sync interval I
didn't report this test result.
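
(For reference this kind of test is just a remount, the path being an
example: mount -o remount,commit=60 /var/lib/ceph/osd/ceph-NN)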

Best regards,

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Unexpected disk write activity with btrfs OSDs

2015-06-18 Thread Lionel Bouton
Hi,

I've just noticed an odd behaviour with the btrfs OSDs. We monitor the
amount of disk writes on each device with a granularity of 10s (every
10s the monitoring system collects the total number of sectors written
and write IOs performed since boot and computes both the B/s and IO/s).
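
For those wondering where the raw numbers come from: /proc/diskstats. A
minimal bash equivalent of one 10s sample would be something like the
following (the device name is an example; fields 8 and 10 are writes
completed and sectors written):

dev=sdb    # example device name
read w1 s1 <<< "$(awk -v d="$dev" '$3 == d { print $8, $10 }' /proc/diskstats)"
sleep 10
read w2 s2 <<< "$(awk -v d="$dev" '$3 == d { print $8, $10 }' /proc/diskstats)"
echo "write IO/s: $(( (w2 - w1) / 10 ))  write B/s: $(( (s2 - s1) * 512 / 10 ))"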

With only residual write activity on our storage network (~450kB/s total
for the whole Ceph cluster, which amounts to a theoretical ~120kB/s on
each OSD once replication, double writes due to journal and number of
OSD are factored in) :
- Disks with btrfs OSD have a spike of activity every 30s (2 intervals
of 10s with nearly 0 activity, one interval with a total amount of
writes of ~120MB). The averages are : 4MB/s, 100 IO/s.
- Disks with xfs OSD (with journal on a separate partition but same
disk) don't have these spikes of activity and the averages are far lower
: 160kB/s and 5 IO/s. This is not far off what is expected from the
whole cluster write activity.

There's a setting of 30s on our platform :
filestore max sync interval

I changed it to 60s with
ceph tell osd.* injectargs '--filestore-max-sync-interval 60'
and the amount of writes was lowered to ~2.5MB/s.

I changed it to 5s (the default) with
ceph tell osd.* injectargs '--filestore-max-sync-interval 5'
the amount of writes to the device rose to an average of 10MB/s (and
given our sampling interval of 10s appeared constant).

During these tests the activity on disks hosting XFS OSDs didn't change
much.

So it seems filestore syncs generate far more activity on btrfs OSDs
compared to XFS OSDs (journal activity included for both).

Note that autodefrag is disabled on our btrfs OSDs. We use our own
scheduler which in the case of our OSD limits the amount of defragmented
data to ~10MB per minute in the worst case and usually (during low write
activity which was the case here) triggers a single file defragmentation
every 2 minutes (which amounts to a 4MB write as we only host RBDs with
the default order value). So defragmentation shouldn't be an issue here.

This doesn't seem to generate too much stress when filestore max sync
interval is 30s (our btrfs OSDs are faster than xfs OSDs with the same
amount of data according to apply latencies) but at 5s the btrfs OSDs
are far slower than our xfs OSDs with 10x the average apply latency (we
didn't let this continue more than 10 minutes as it began to make some
VMs wait for IOs too much).

Does anyone know if this is normal and why it is happening?

Best regards,

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Unexpected disk write activity with btrfs OSDs

2015-06-18 Thread Lionel Bouton
I just realized I forgot to add proper context:

this is with Firefly 0.80.9 and the btrfs OSDs are running on kernel
4.0.5 (this was happening with previous kernel versions according to our
monitoring history); the xfs OSDs run on 4.0.5 or 3.18.9. There are 23
OSDs total and 2 of them are using btrfs.

On 06/18/15 23:28, Lionel Bouton wrote:
 Hi,

 I've just noticed an odd behaviour with the btrfs OSDs. We monitor the
 amount of disk writes on each device with a granularity of 10s (every
 10s the monitoring system collects the total number of sectors written
 and write IOs performed since boot and computes both the B/s and IO/s).

 With only residual write activity on our storage network (~450kB/s total
 for the whole Ceph cluster, which amounts to a theoretical ~120kB/s on
 each OSD once replication, double writes due to journal and number of
 OSD are factored in) :
 - Disks with btrfs OSD have a spike of activity every 30s (2 intervals
 of 10s with nearly 0 activity, one interval with a total amount of
 writes of ~120MB). The averages are : 4MB/s, 100 IO/s.
 - Disks with xfs OSD (with journal on a separate partition but same
 disk) don't have these spikes of activity and the averages are far lower
 : 160kB/s and 5 IO/s. This is not far off what is expected from the
 whole cluster write activity.

 There's a setting of 30s on our platform :
 filestore max sync interval

 I changed it to 60s with
 ceph tell osd.* injectargs '--filestore-max-sync-interval 60'
 and the amount of writes was lowered to ~2.5MB/s.

 I changed it to 5s (the default) with
 ceph tell osd.* injectargs '--filestore-max-sync-interval 5'
 the amount of writes to the device rose to an average of 10MB/s (and
 given our sampling interval of 10s appeared constant).

 During these tests the activity on disks hosting XFS OSDs didn't change
 much.

 So it seems filestore syncs generate far more activity on btrfs OSDs
 compared to XFS OSDs (journal activity included for both).

 Note that autodefrag is disabled on our btrfs OSDs. We use our own
 scheduler which in the case of our OSD limits the amount of defragmented
 data to ~10MB per minute in the worst case and usually (during low write
 activity which was the case here) triggers a single file defragmentation
 every 2 minutes (which amounts to a 4MB write as we only host RBDs with
 the default order value). So defragmentation shouldn't be an issue here.

 This doesn't seem to generate too much stress when filestore max sync
 interval is 30s (our btrfs OSDs are faster than xfs OSDs with the same
 amount of data according to apply latencies) but at 5s the btrfs OSDs
 are far slower than our xfs OSDs with 10x the average apply latency (we
 didn't let this continue more than 10 minutes as it began to make some
 VMs wait for IOs too much).

 Does anyone know if this is normal and why it is happening?

 Best regards,

 Lionel
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Is Ceph right for me?

2015-06-11 Thread Lionel Bouton
On 05/20/15 23:34, Trevor Robinson - Key4ce wrote:

 Hello,

  

 Could somebody please advise me if Ceph is suitable for our use?

  

 We are looking for a file system which is able to work over different
 locations which are connected by VPN. If one locations was to go
 offline then the filesystem will stay online at both sites and then
 once connection is regained the latest file version will take priority.


CephFS won't work well (or at all when the connections are lost). The
only part of Ceph which would work is RGW replication, but you don't get
a filesystem with it and I'm under the impression that multi-master
replication might be tricky (to be confirmed).

Coda's goals seem to match your needs. I'm not sure if it's still
actively developed (there is a client distributed with the Linux kernel
though).
http://www.coda.cs.cmu.edu/

Last time I tried it (several years ago) it worked well enough for me.

  

 The main use will be for website files so the changes are most likely
 to be any uploaded files and cache files as a lot of the data will be
 stored in a SQL database which is already replicated.


If your setup is not too complex, you might simply handle this with
rsync or unison.
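
For example, something as simple as a cron job on the site holding the
master copy (paths and host are of course just examples):

rsync -az --delete /var/www/ othersite:/var/www/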

Best regards,

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Discuss: New default recovery config settings

2015-06-01 Thread Lionel Bouton
On 06/01/15 09:43, Jan Schermer wrote:
 We had to disable deep scrub or the cluster would be unusable - we need to 
 turn it back on sooner or later, though.
 With minimal scrubbing and recovery settings, everything is mostly good. 
 Turned out many issues we had were due to too few PGs - once we increased 
 them from 4K to 16K everything sped up nicely (because the chunks are 
 smaller), but during heavy activity we are still getting some “slow IOs”.
 I believe there is an ionice knob in newer versions (we still run Dumpling), 
 and that should do the trick no matter how much additional “load” is put on 
 the OSDs.
 Everybody’s bottleneck will be different - we run all flash so disk IO is not 
 a problem but an OSD daemon is - no ionice setting will help with that, it 
 just needs to be faster ;-)

If you are interested I'm currently testing a ruby script which
schedules the deep scrubs one at a time, trying to simultaneously make
them fit in a given time window, avoid successive scrubs on the same OSD
and space the deep scrubs according to the amount of data scrubbed. I
use it because Ceph by itself can't prevent multiple scrubs from
happening simultaneously on the network, and they can severely impact
our VM performance. I can clean it up and post it on Github.
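
For those who want to experiment before that: deep scrubs can be
triggered manually on a given pg, which is all such a scheduler really
needs (the pg id below is just an example):

ceph pg deep-scrub 3.1a2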

Best regards,

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Performance and CPU load on HP servers running ceph (DL380 G6, should apply to others too)

2015-05-26 Thread Lionel Bouton
On 05/26/15 10:06, Jan Schermer wrote:
 Turbo Boost will not hurt performance. Unless you have 100% load on
 all cores it will actually improve performance (vastly, in terms of
 bursty workloads).
 The issue you have could be related to CPU cores going to sleep mode.

Another possibility is that the system is overheating when Turbo Boost
is enabled. In this case it protects itself by throttling back the core
frequencies to a very low value (it may use other means too, like
lowering the system bus frequencies, halting the cores periodically,
...). This would explain the high loads.
If the system switches back and forth between normal loads and huge
loads and you can link that to CPU package temperature (and/or very low
CPU core frequencies), this is probably the cause. If the ambient
temperature isn't a problem (below 25°C any system should be fine and
most can tolerate 30°C or more) then you have an internal cooling problem.
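
If someone wants to check for this, watching the core frequencies and
temperatures while the load is high should be enough; assuming
lm_sensors is installed, something like:

watch -n 1 "grep 'cpu MHz' /proc/cpuinfo | sort | uniq -c"
sensors | grep -i core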

Best regards,

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Btrfs defragmentation

2015-05-12 Thread Lionel Bouton
On 05/06/15 20:28, Lionel Bouton wrote:
 Hi,

 On 05/06/15 20:07, Timofey Titovets wrote:
 2015-05-06 20:51 GMT+03:00 Lionel Bouton lionel+c...@bouton.name:
 Is there something that would explain why initially Btrfs creates the
 4MB files with 128k extents (32 extents / file) ? Is it a bad thing for
 performance ?
 This kind of behaviour is a reason why i ask you about compression.
 You can use filefrag to locate heavily fragmented files (may not work
 correctly with compression).
 https://btrfs.wiki.kernel.org/index.php/Gotchas

 Filefrag shows each compressed chunk as a separate extent, but they can
 be located linearly. This is a problem in filefrag =\
 Hum, I see. This could explain why we rarely see the number of extents
 go down. When data is replaced with incompressible data Btrfs must
 deactivate compression and be able to reduce the number of extents.

 This should not have much impact on the defragmentation process and
 performance: we check for extents being written sequentially next to
 each other and don't count this as a cost for file access. This is why
 these files aren't defragmented even if we ask for it and our tool
 reports a low overhead for them.

Here's more information, especially about compression.

1/ filefrag behaviour.

I use our tool to trace the fragmentation evolution after launching
btrfs fi defrag on each file (it calls filefrag -v asynchronously every
5 seconds until the defragmentation seems done).
filefrag output doesn't understand compression and doesn't seem to have
access to the latest on-disk layout.

- for compression, you quite often get a reported layout where an extent
begins in the middle of the previous one. So I assume the physical
offset of the extent start is good but the end is computed from the
extent's decompressed length (it's always 32 x 4096-byte blocks, i.e.
128kB, which matches the compression block size). We had to compensate
for that because we erroneously considered this case as needing a seek
although it doesn't. This means you can't trust the number of extents
reported by filefrag -v (it is supposed to merge consecutive extents
when run with -v).
- for access to the layout, I assume Btrfs reports what is committed to
disk. I base this assumption on the fact that for all defragmented
files, filefrag -v output becomes stable in at most 30 seconds after the
btrfs fi defrag command returns (30 seconds is the default commit
interval for Btrfs).

There's something odd going on with the 'shared' flag reported by
filefrag too: I assumed this was linked to clone_range or snapshots and
most of the time it seems so but on other (non-OSD) filesystems I found
files with this flag on extents and I couldn't find any explanation for it.

2/ compression influence on fragmentation

Even after compensating for filefrag -v errors, Btrfs clearly has more
difficulties defragmenting compressed files. At least our model for
computing the cost associated with a particular layout reports fewer
gains when defragmenting a compressed file. In our configuration and
according to our model of disk latencies we seem to hit a limit where
file reads cost ~2.75x what they would if the files were in an ideal,
sequential layout. If we try to go lower, the majority of the files don't
benefit at all from defragmentation (the resulting layout isn't better
than the initial one).
Note that this doesn't account for NCQ/TCQ : we suppose the read is
isolated. So in practice reading from multiple threads should be less
costly and the OSD might not suffer much from this.
In fact 2 out of the 3 BTRFS OSDs have lower latencies than most of the
rest of the cluster even with our tool slowly checking files and
triggering defragmentations in the background.

3/ History/heavy backfilling seems to have a large influence on performance

As I said 2 out of our 3 BTRFS OSDs have a very good behavior.
Unfortunately the third doesn't. This is the OSD where our tool was
deactivated during most of the initial backfilling process. It doesn't
have the most data, the most writes or the most reads of the group but
it had by far the worst latencies these last two days. I even checked
the disk for hardware problems and couldn't find any.
I don't have a clear explanation for the performance difference. Maybe
the 2.75x overhead target isn't low enough and this OSD has more
fragmented files than the others below this target (we don't compute
the average fragmentation yet). This would mean that we can expect the
performance of the 2 others to slowly degrade over time (so the test
isn't conclusive yet).

I've decided to remount this particular OSD without compression and let
our tool slowly bring down the maximum overhead to 1.5x (which should be
doable as without compression files are more easily defragmented) while
using primary-affinity = 0. I'll revert to primary-affinity 1 when the
defragmentation is done and see how the OSD/disk behave.

4/ autodefrag doesn't like Ceph OSDs

According to our previous experience by now all our Btrfs

Re: [ceph-users] Btrfs defragmentation

2015-05-07 Thread Lionel Bouton
On 05/06/15 19:51, Lionel Bouton wrote:

 During normal operation Btrfs OSD volumes continue to behave in the same
 way XFS ones do on the same system (sometimes faster/sometimes slower).
 What is really slow though is the OSD process startup. I've yet to make
 serious tests (unmounting the filesystems to clear caches), but I've
 already seen 3 minutes of delay reading the pgs. Example :

 2015-05-05 16:01:24.854504 7f57c518b780  0 osd.17 22428 load_pgs
 2015-05-05 16:01:24.936111 7f57ae7fc700  0
 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-17) destroy_checkpoint:
 ioctl SNAP_DESTROY got (2) No such file or directory
 2015-05-05 16:01:24.936137 7f57ae7fc700 -1
 filestore(/var/lib/ceph/osd/ceph-17) unable to destroy snap
 'snap_1671188' got (2) No such file or directory
 2015-05-05 16:01:24.991629 7f57ae7fc700  0
 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-17) destroy_checkpoint:
 ioctl SNAP_DESTROY got (2) No such file or directory
 2015-05-05 16:01:24.991654 7f57ae7fc700 -1
 filestore(/var/lib/ceph/osd/ceph-17) unable to destroy snap
 'snap_1671189' got (2) No such file or directory
 2015-05-05 16:04:25.413110 7f57c518b780  0 osd.17 22428 load_pgs opened
 160 pgs

 The filesystem might not have reached its balance between fragmentation
 and defragmentation rate at this time (so this may change) but this mirrors
 our initial experience with Btrfs where this was the first symptom of
 bad performance.

We've seen progress on this front. Unfortunately for us we had 2 power
outages and they seem to have damaged the disk controller of the system
we are testing Btrfs on: we just had a system crash.
On the positive side this gives us an update on the OSD boot time.

With a freshly booted system without anything in cache :
- the first Btrfs OSD we installed loaded the pgs in ~1mn30s which is
half of the previous time,
- the second Btrfs OSD where defragmentation was disabled for some time
and was considered more fragmented by our tool took nearly 10 minutes to
load its pgs (and even spent 1mn before starting to load them).
- the third Btrfs OSD which was always defragmented took 4mn30 seconds
to load its pgs (it was considered more fragmented than the first and
less than the second).

My current assumption is that the defragmentation process we use can't
keep up with large spikes of writes (at least when originally populating
the OSD with data through backfills) but can then at least partially
repair the damage they cause to performance (it's still slower to boot
than the 3 XFS OSDs on the same system, where loading pgs took 6-9
seconds). In the current setup the defragmentation proceeds very slowly
because I set it up to generate very little load on the filesystems it
processes: there may be room to improve.

Best regards,

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Btrfs defragmentation

2015-05-07 Thread Lionel Bouton
Hi,

On 05/07/15 12:30, Burkhard Linke wrote:
 [...]
 Part of the OSD boot up process is also the handling of existing
 snapshots and journal replay. I've also had several btrfs based OSDs
 that took up to 20-30 minutes to start, especially after a crash.
 During journal replay the OSD daemon creates a number of new snapshot
 for its operations (newly created snap_XYZ directories that vanish
 after a short time). This snapshotting probably also adds overhead to
 the OSD startup time.
 I have disabled snapshots in my setup now, since the stock ubuntu
 trusty kernel had some stability problems with btrfs.

 I also had to establish cron jobs for rebalancing the btrfs
 partitions. It compacts the extents and may reduce the total amount of
 space taken.

I'm not sure what you mean by compacting extents. I'm sure balance
doesn't defragment or compress files. It moves extents and before 3.14
according to the Btrfs wiki it was used to reclaim allocated but unused
space.
This shouldn't affect performance and with modern kernels may not be
needed to reclaim unused space anymore.

 Unfortunately this procedure is not a default in most distribution (it
 definitely should be!). The problems associated with unbalanced
 extents should have been solved in kernel 3.18, but I didn't had the
 time to check it yet.

I don't have any btrfs filesystem running on 3.17 or earlier version
anymore (with a notable exception, see below) so I can't comment. I have
old btrfs filesystems that were created on 3.14 and are now on 3.18.x or
3.19.x (by the way avoid 3.18.9 to 3.19.4 if you can have any sort of
power failure, there's a possibility of a mount deadlock which requires
btrfs-zero-log to solve...). btrfs fi usage doesn't show anything
suspicious on these old fs.
I have a Jolla Phone which comes with a btrfs filesystem and uses an old
heavily patched 3.4 kernel. It hasn't had any problems yet but I don't
stuff it with data (I've seen discussions about triggering a balance
before a SailfishOS upgrade).
I assume that you shouldn't have any problem with filesystems that
aren't heavily used which should be the case with Ceph OSD (for example
our current alert level is at 75% space usage).


 As a side note: I had several OSDs with dangling snapshots (more than
 the two usually handled by the OSD). They are probably due to crashed
 OSD daemons. You have to remove them manually, otherwise they start to
 consume disk space.

Thanks a lot, I didn't think it could happen. I'll configure an alert
for this case.
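
For reference, spotting and cleaning them should be something like the
following (with the OSD stopped, after making sure the snap_* subvolume
really isn't one of the two the OSD maintains; the path is an example):

btrfs subvolume list /var/lib/ceph/osd/ceph-NN | grep snap_
btrfs subvolume delete /var/lib/ceph/osd/ceph-NN/snap_XYZ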

Best regards,

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Btrfs defragmentation

2015-05-06 Thread Lionel Bouton
Hi,

On 05/06/15 20:07, Timofey Titovets wrote:
 2015-05-06 20:51 GMT+03:00 Lionel Bouton lionel+c...@bouton.name:
 Is there something that would explain why initially Btrfs creates the
 4MB files with 128k extents (32 extents / file) ? Is it a bad thing for
 performance ?
 This kind of behaviour is a reason why i ask you about compression.
 You can use filefrag to locate heavily fragmented files (may not work
 correctly with compression).
 https://btrfs.wiki.kernel.org/index.php/Gotchas

 Filefrag shows each compressed chunk as a separate extent, but they can
 be located linearly. This is a problem in filefrag =\

Hum, I see. This could explain why we rarely see the number of extents
go down. When data is replaced with incompressible data Btrfs must
deactivate compression and be able to reduce the number of extents.

This should not have much impact on the defragmentation process and
performance: we check for extents being written sequentially next to
each other and don't count this as a cost for file access. This is why
these files aren't defragmented even if we ask for it and our tool
reports a low overhead for them.

Best regards,

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Btrfs defragmentation

2015-05-06 Thread Lionel Bouton
On 05/05/15 02:24, Lionel Bouton wrote:
 On 05/04/15 01:34, Sage Weil wrote:
 On Mon, 4 May 2015, Lionel Bouton wrote:
 Hi,

 we began testing one Btrfs OSD volume last week and for this first test
 we disabled autodefrag and began to launch manual btrfs fi defrag.
 [...]
 Cool.. let us know how things look after it ages!
 [...]


 It worked for the past day. Before the algorithm change the Btrfs OSD
 disk was the slowest on the system compared to the three XFS ones by a
 large margin. This was confirmed both by iostat %util (often at 90-100%)
 and monitoring the disk average read/write latencies over time which
 often spiked one order of magnitude above the other disks (as high as 3
 seconds). Now the Btrfs OSD disk is at least comparable to the other
 disks if not a bit faster (comparing latencies).

 This is still too early to tell, but very encouraging.

Still going well, I added two new OSDs which are behaving correctly too.

The first of the two has finished catching up. There's a big difference
in the number of extents on XFS and on Btrfs. I've seen files backing
rbd (4MB files with rbd in their names) often have only 1 or 2 extents
on XFS.
On Btrfs they seem to start at 32 extents when they are created and
Btrfs doesn't seem to mind (ie: calling btrfs fi defrag file doesn't
reduce the number of extents, at least not in the following 30s where it
should go down). The extents aren't far from each other on disk though,
at least initially.

When my simple algorithm computes the fragmentation cost (the expected
overhead of reading a file vs its optimized version), it seems that just
after finishing catching up (between 3 hours and 1 day depending on the
cluster load and settings), the content is already heavily fragmented
(files are expected to take more than 6x the read time that optimized
versions would). Then my defragmentation scheduler manages to
bring down the maximum fragmentation cost (according to its own
definition) by a factor of 0.66 (the very first OSD volume is currently
sitting at a ~4x cost and occasionally reaches the 3.25-3.5 range).

Is there something that would explain why initially Btrfs creates the
4MB files with 128k extents (32 extents / file) ? Is it a bad thing for
performance ?

During normal operation Btrfs OSD volumes continue to behave in the same
way XFS ones do on the same system (sometimes faster/sometimes slower).
What is really slow though is the OSD process startup. I've yet to make
serious tests (unmounting the filesystems to clear caches), but I've
already seen 3 minutes of delay reading the pgs. Example :

2015-05-05 16:01:24.854504 7f57c518b780  0 osd.17 22428 load_pgs
2015-05-05 16:01:24.936111 7f57ae7fc700  0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-17) destroy_checkpoint:
ioctl SNAP_DESTROY got (2) No such file or directory
2015-05-05 16:01:24.936137 7f57ae7fc700 -1
filestore(/var/lib/ceph/osd/ceph-17) unable to destroy snap
'snap_1671188' got (2) No such file or directory
2015-05-05 16:01:24.991629 7f57ae7fc700  0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-17) destroy_checkpoint:
ioctl SNAP_DESTROY got (2) No such file or directory
2015-05-05 16:01:24.991654 7f57ae7fc700 -1
filestore(/var/lib/ceph/osd/ceph-17) unable to destroy snap
'snap_1671189' got (2) No such file or directory
2015-05-05 16:04:25.413110 7f57c518b780  0 osd.17 22428 load_pgs opened
160 pgs

The filesystem might not have reached its balance between fragmentation
and defragmentation rate at this time (so this may change) but this mirrors
our initial experience with Btrfs where this was the first symptom of
bad performance.

Best regards,

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Btrfs defragmentation

2015-05-05 Thread Lionel Bouton
On 05/05/15 06:30, Timofey Titovets wrote:
 Hi list,
 Excuse me, what I'm saying is off topic

 @Lionel, if you use btrfs, did you already try to use btrfs compression for 
 OSD?
 If yes, сan you share the your experience?

Btrfs compresses by default using zlib. We force lzo compression instead
by using compress=lzo in fstab. Behaviour obviously depends on the kind
of data stored, but in our case, when we had more Btrfs OSDs to compare
with XFS ones, we used between 10 and 15% less disk space on average (on
this Ceph instance most files are in an already compressed format in the
RBD volumes). Although it looks like that is the case again, with only
one Btrfs OSD out of 24 I can't confirm this right now.
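
For reference the corresponding fstab entry looks something like this
(device and mount point are examples):

/dev/sdX1  /var/lib/ceph/osd/ceph-NN  btrfs  rw,noatime,compress=lzo  0 0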

Best regards,

Lionel.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Btrfs defragmentation

2015-05-04 Thread Lionel Bouton
On 05/04/15 01:34, Sage Weil wrote:
 On Mon, 4 May 2015, Lionel Bouton wrote:
 Hi,

 we began testing one Btrfs OSD volume last week and for this first test
 we disabled autodefrag and began to launch manual btrfs fi defrag.
 [...]
 Cool.. let us know how things look after it ages!

We had the first signs of Btrfs aging yesterday morning. Latencies
went up noticeably. The journal was at ~3000 extents back from a maximum
of ~13000 the day before. To verify my assumption that journal
fragmentation was not the cause of latencies, I defragmented it. It took
more than 7 minutes (10GB journal), left it at ~2300 extents (probably
because it was heavily used during the defragmentation) and the high
latencies weren't solved at all.

The initial algorithm selected files to defragment based solely on the
number of extents (files with more extents were processed first). This
was a simple approach to the problem that I hoped would be enough; it
wasn't, so I had to make it more clever.

filefrag -v conveniently outputs each fragment's relative position on
the device and the total file size. So I changed the algorithm so that
it still uses the result of a periodic find | xargs filefrag call (which
is relatively cheap and ends up fitting in a 100MB Ruby process) but
models the fragmentation cost better.

The new one computes the total cost of reading every file, counting an
initial seek, the total time based on sequential read speed and the time
associated with each seek from one extent to the next (which can be 0
when Btrfs managed to put an extent just after another, or very small if
it is not far from the first on the same HDD track). This total cost is
compared with the ideal defragmented case to know what the speedup could
be after defragmentation. Finally the result is normalized by dividing
it with the total size of each file. The normalization is done because
in the case of RBD (and probably most other uses) what is interesting is
how long a 128kB or 1MB read would take whatever the file and the offset
in the file, not how long a whole file read would take (there's an
assumption that each file has the same probability of being read, which
might need to be revisited). There are approximations in the cost
computation and it's HDD centric but it's not very far from reality.
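
For illustration only, a stripped-down shell/awk version of the idea
(ignoring the compression quirks described above, with made-up seek time
and throughput figures, and $f being the file to inspect):

filefrag -v "$f" | awk -v seek_ms=8 -v mb_per_s=120 '
  /^ *[0-9]+:/ {                       # extent lines only
    gsub(/\.\./, " "); gsub(/:/, " ")
    start = $4; end = $5; len = $6     # physical start/end, length (4kB blocks)
    blocks += len
    if (n++ && start != prev_end + 1) seeks++
    prev_end = end
  }
  END {
    mb    = blocks * 4096 / 1048576
    cost  = (1 + seeks) * seek_ms / 1000 + mb / mb_per_s
    ideal = seek_ms / 1000 + mb / mb_per_s
    printf "%d extents, %d seeks, estimated overhead %.2fx\n", n, seeks, cost / ideal
  }'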

The idea was that it would find the files where fragmentation is the
most painful faster, instead of wasting time on less interesting
files. This would make the defragmentation more efficient even if it
didn't process as many files (the less defragmentation takes place the
less load we add).

It worked for the past day. Before the algorithm change the Btrfs OSD
disk was the slowest on the system compared to the three XFS ones by a
large margin. This was confirmed both by iostat %util (often at 90-100%)
and monitoring the disk average read/write latencies over time which
often spiked one order of magnitude above the other disks (as high as 3
seconds). Now the Btrfs OSD disk is at least comparable to the other
disks if not a bit faster (comparing latencies).

This is still too early to tell, but very encouraging.

Best regards,

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Btrfs defragmentation

2015-05-03 Thread Lionel Bouton
Hi,

we began testing one Btrfs OSD volume last week and for this first test
we disabled autodefrag and began to launch manual btrfs fi defrag.

During the tests, I monitored the number of extents of the journal
(10GB) and it went through the roof (it currently sits at 8000+ extents
for example).
I was tempted to defragment it but after thinking a bit about it I think
it might not be a good idea.
With Btrfs, by default the data written to the journal on disk isn't
copied to its final destination. Ceph is using a clone_range feature to
reference the same data instead of copying it.
So if you defragment both the journal and the final destination, you
are moving the data around trying to get both references down to a
single-extent goal, but most of the time you can't satisfy both of them
at the same time (unless the destination is a whole file instead of a
fragment of one).

I assume the journal probably doesn't benefit at all from
defragmentation: it's overwritten constantly and as Btrfs uses CoW, the
previous extents won't be reused at all and new ones will be created for
the new data instead of overwriting the old in place. The final
destination files are reused (reread) and benefit from defragmentation.

Under these assumptions we excluded the journal file from
defragmentation, in fact we only defragment the current directory
(snapshot directories are probably only read from in rare cases and are
ephemeral so optimizing them is not interesting).

The filesystem is only one week old so we will have to wait a bit to see
if this strategy is better than the one used when mounting with
autodefrag (I couldn't find much about it but last year we had
unmanageable latencies).
We have a small Ruby script which triggers defragmentation based on the
number of extents and by default limits the rate of calls to btrfs fi
defrag to a negligible level to avoid trashing the filesystem. If
someone is interested I can attach it or push it on Github after a bit
of cleanup.
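
Not the Ruby script itself, but its gist in shell form would be
something like this (threshold, sleep and file pattern are arbitrary
examples):

find /var/lib/ceph/osd/ceph-NN/current -name '*rbd*' -type f |
while IFS= read -r f; do
    extents=$(filefrag "$f" | awk '{ print $(NF-2) }')
    if [ "$extents" -gt 64 ]; then
        btrfs filesystem defragment "$f"
        sleep 30    # crude rate limiting to keep the extra load negligible
    fi
done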

Best regards,

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Btrfs defragmentation

2015-05-03 Thread Lionel Bouton
On 05/04/15 01:34, Sage Weil wrote:
 On Mon, 4 May 2015, Lionel Bouton wrote:
 Hi, we began testing one Btrfs OSD volume last week and for this
 first test we disabled autodefrag and began to launch manual btrfs fi
 defrag. During the tests, I monitored the number of extents of the
 journal (10GB) and it went through the roof (it currently sits at
 8000+ extents for example). I was tempted to defragment it but after
 thinking a bit about it I think it might not be a good idea. With
 Btrfs, by default the data written to the journal on disk isn't
 copied to its final destination. Ceph is using a clone_range feature
 to reference the same data instead of copying it. 
 We've discussed this possibility but have never implemented it. The
 data is written twice: once to the journal and once to the object file.

That's odd. Here's an extract of filefrag output:

Filesystem type is: 9123683e
File size of /var/lib/ceph/osd/ceph-17/journal is 1048576 (256 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..       0:  155073097.. 155073097:      1:
   1:        1..    1254:  155068587.. 155069840:   1254:  155073098: shared
   2:     1255..    2296:  155071149.. 155072190:   1042:  155069841: shared
   3:     2297..    2344:  148124256.. 148124303:     48:  155072191: shared
   4:     2345..    4396:  148129654.. 148131705:   2052:  148124304: shared
   5:     4397..    6446:  148137117.. 148139166:   2050:  148131706: shared
   6:     6447..    6451:  150414237.. 150414241:      5:  148139167: shared
   7:     6452..   10552:  150432040.. 150436140:   4101:  150414242: shared
   8:    10553..   12603:  150477824.. 150479874:   2051:  150436141: shared

Almost all extents of the journal are shared with another file (on one
occasion I've found 3 consecutive extents without the shared flag). I've
thought that they could be shared with a copy in a snapshot, but the
snapshots are of the current subvolume.

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Experience going through rebalancing with active VMs / questions

2015-05-02 Thread Lionel Bouton
Hi,

we are currently running the latest firefly (0.80.9) and we have
difficulties maintaining good throughput when Ceph is
backfilling/recovering and/or deep-scrubbing after an outage. This got
to the point where, when the VMs using rbd start misbehaving (load
rising, some simple SQL update queries taking several seconds), I use a
script looping through tunable periods of max_backfills/max_recoveries
= 1/0.

We recently had power outages and couldn't restart all the OSDs (one
server needed special care) so as we only have 4 servers with 6 OSDs
each, there was a fair amount of rebalancing.

What seems to work with our current load is the following :
1/ disable deep-scrub and scrub (deactivating scrub might not be needed
: it doesn't seem to have much of an impact on performance),
2/ activate the max_backfills/recoveries = 1/0 loop with 30 seconds for
each,
3/ wait for the rebalancing to finish, activate scrub,
4/ activate the (un)set nodeep_scrub loop with 30 seconds unset, 120
seconds set,
5/ wait for deep-scrubs to catch up (ie: none active during several
consecutive 30 seconds unset periods),
6/ revert to normal configuration.

This can take about a day for us (we have about 800GB per OSD when in
the degraded 3 servers configuration).
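
For reference, the nodeep_scrub part of the loop (step 4 above) is
nothing fancier than:

while true; do
    ceph osd unset nodeep-scrub    # let queued deep-scrubs run for 30 seconds
    sleep 30
    ceph osd set nodeep-scrub      # then hold them back for 120 seconds
    sleep 120
done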

I have two ideas/questions :

1/ Deep scrub scheduling

Deep scrubs can happen in bursts with the current algorithm, which
really harms performance even with CFQ and lower priorities. We have a
total of 1216 pgs (1024 for our rbd pool) and an osd deep scrub interval
of 2 weeks. This means that on average a deep scrub could happen about
every
16 minutes globally.
When recovering from an outage the current algorithm wants to catch up
and even if only one scrub per OSD can happen at a time, VM disk
accesses involve many OSDs, so having multiple deep-scrubs on the whole
cluster seems to hurt a bit more than when only one happens at a time.
So a smoother distribution when catching up could help a lot (at least
our experience seems to point in this direction). I'm even considering
scheduling scrubs ahead of time, setting the interval to 2 weeks in
ceph.conf, but distributing them at a rate that targets a completion in
a week. Does this make any sense ? Is there any development in this
direction already (Feature request #7288 didn't seem to go this far and
#7540 had no activity) ?

2/ Bigger journals

There's not much said about how to tune journal size and filestore max
sync interval. I'm not sure what the drawbacks of using much larger
journals and max sync interval are.
Obviously a sync would be more costly, but if it takes less time to
complete than the journal takes to fill up even while there are
deep-scrub or backfills, I'm not sure how this would hurt performance.

In our case we have a bit less than 5GB of data per pg (for the rbd
pool) and use 5GB journals (on the same disk than the OSD in a separate
partition at the beginning of the disk).
I'm wondering if we could get a lower impact of deep-scrubs if we could
buffer more activity in the journal. If we could lower the rate at which
each OSD are doing deep-scrubs (in the way I'm thinking of scheduling
them in the previous point) I'm wondering if it could give time to an
OSD to catch up doing filestore syncs between them and avoid contention
between deep-scrubs/journal writes/filestore sync happening all at the
same time. I assume deep scrubs and journal writes are mostly sequential
so in our configuration we can assume ~half of the available disk
throughput is available for each of them. So if we want to avoid
filestore syncs during deep-scrubs, it seems to me we should have a
journal at least twice the size of our largest pgs and tune the
filestore max sync interval to at least the expected duration of a deep
scrub. What is worrying me is that in our current configuration this
would mean at least twice our journal size (10GB instead of 5GB) and
given half of a ~120MB/s throughput a max interval of ~90 seconds (we
use 30 seconds currently). This is far from the default values (and as
we use about 50% of the storage capacity and have a good pg/OSD ratio we
may target twice these values to support pgs twice as large as our
current ones).
Does that make any sense ?
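
For illustration, the kind of change I have in mind would look like this
in ceph.conf (values are the hypothetical ones discussed above, and if
I'm not mistaken osd journal size only applies to newly created
journals, so existing ones would have to be recreated):

[osd]
osd journal size = 10240             # MB, i.e. the 10GB discussed above
filestore max sync interval = 90     # seconds
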
I'm not sure how backfills and recoveries work: I couldn't find a way to
let the OSD wait a bit between each batch to give a chance to the
filestore sync to go through. If this idea makes sense for deep-scrubs I
assume it might work for backfills/recoveries to smooth I/Os too (if
they can be configured to pause a bit between batches).

Best regards,

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cost- and Powerefficient OSD-Nodes

2015-04-29 Thread Lionel Bouton
Hi Dominik,

On 04/29/15 19:06, Dominik Hannen wrote:
 I had planned to use at maximum 80GB of the available 250GB.
 1 x 16GB OS
 4 x 8, 12 or 16GB partitions for osd-journals.

 For a total SSD Usage of 19.2%, 25.6% or 32%
 and over-provisioning of 80.8%, 74.3% or 68%.

 I am relatively certain that those SSDs would last ages with THAT
 much over-provisioning.


SSD lifespan is mostly linked to the total amount of data written as
they make efforts to distribute the writes evenly on all cells. In your
case this has nothing to do with over-provisioning and everything to do
with the amount of data you will write to the OSDs backed by these journals.

Over-provisioning can only be useful if :
- your SSD is bad at distributing writes with the space it already has
with its own over-provisioning and gets better when you add your own,
- you never touched data after the offset where your last partition ends
(or used TRIM manually to tell the SSD it's OK to use the cells they are
currently on as the SSD sees fit).

I see people assuming it's enough to not write on the whole LBA space to
automatically get better lifespan through better data redistribution by
the SSD. It can only work if the SSD is absolutely sure that the rest of
the LBA space is unused. This can only happen if this space has never
been written to or TRIM has been used to tell it to the SSD. If you had
a filesystem full of data at the end of the SSD and removed it without
trimming the space, it's as if it's still there from the point of view
of your SSD. If you tested the drive with some kind of benchmark, same
problem.
If you didn't call TRIM yourself on the whole drive, you can't assume
that over-provisioning will work. It depends on how the drive has been
initialized by the manufacturer and AFAIK you don't have access to this
information (I wouldn't be surprised if someone working in marketing at
a manufacturer thought it a good idea to write a default partition
scheme with NTFS filesystems without asking techs to validate the idea,
and that an intern used some tool overwriting the whole drive to do it).
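
(If you want to start from a known state, trimming the whole device
before partitioning should do it, obviously discarding everything on it:
blkdiscard /dev/sdX)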

Even with these precautions I've never seen comparative studies for
lifespans of SSD with various levels of manual over-provisioning so I'm
not even sure that this technique was ever successful (it can't really
do much harm as you don't want your journals to be too big anyway).

I would be *very* surprised if the drive would get any performance or
life expectancy benefit when used at 19.2% instead of 32% or even 75%...

Best regards,

Lionel Bouton
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] long blocking with writes on rbds

2015-04-22 Thread Lionel Bouton
On 04/22/15 19:50, Lionel Bouton wrote:
 On 04/22/15 17:57, Jeff Epstein wrote:


 On 04/10/2015 10:10 AM, Lionel Bouton wrote:
 On 04/10/15 15:41, Jeff Epstein wrote:
 [...]
 This seems highly unlikely. We get very good performance without
 ceph. Requisitioning and manupulating block devices through LVM
 happens instantaneously. We expect that ceph will be a bit slower
 by its distributed nature, but we've seen operations block for up
 to an hour, which is clearly behind the pale. Furthermore, as the
 performance measure I posted show, read/write speed is not the
 bottleneck: ceph is simply/waiting/.

 So, does anyone else have any ideas why mkfs (and other operations)
 takes so long?


 As your use case is pretty unique and clearly not something Ceph was
 optimized for, if I were you I'd switch to a single pool with the
 appropriate number of pgs based on your pool size (replication) and
 the number of OSD you use (you should target 100 pgs/OSD to be in
 what seems the sweet spot) and create/delete rbd instead of the
 whole pool. You would be in known territory and any remaining
 performance problem would be easier to debug.

 I agree that this is a good suggestion. It took me a little while,
 but I've changed the configuration so that we now have only one pool,
 containing many rbds, and now all data is spread across all six of
 our OSD nodes. However, the performance has not perceptibly improved.
 We still have the occasional long (10 minutes) wait periods during
 write operations, and the bottleneck still seems to be ceph, rather
 than the hardware: the blocking process (most usually, but not
 always, mkfs) is stuck in a wait state (D in ps) but no I/O is
 actually being performed, so one can surmise that the physical
 limitations of the disk medium are not the bottleneck. This is
 similar to what is being reported in the thread titled 100% IO Wait
 with CEPH RBD and RSYNC.

 Do you have some idea how I can diagnose this problem?

 I'd look at the ceph -s output while you have these stuck processes to
 see if there's any unusual activity (scrub/deep
 scrub/recovery/backfills/...). Is it correlated in any way with rbd
 removal (ie: write blocking doesn't appear unless you removed at least
 one rbd, say within one hour before the write performance problems)?

I'm not familiar with Amazon VMs. If you map the rbds using the kernel
driver to local block devices do you have control over the kernel you
run (I've seen reports of various problems with older kernels and you
probably want the latest possible) ?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] long blocking with writes on rbds

2015-04-22 Thread Lionel Bouton
On 04/22/15 17:57, Jeff Epstein wrote:


 On 04/10/2015 10:10 AM, Lionel Bouton wrote:
 On 04/10/15 15:41, Jeff Epstein wrote:
 [...]
 This seems highly unlikely. We get very good performance without
 ceph. Requisitioning and manupulating block devices through LVM
 happens instantaneously. We expect that ceph will be a bit slower by
 its distributed nature, but we've seen operations block for up to an
 hour, which is clearly behind the pale. Furthermore, as the
 performance measure I posted show, read/write speed is not the
 bottleneck: ceph is simply/waiting/.

 So, does anyone else have any ideas why mkfs (and other operations)
 takes so long?


 As your use case is pretty unique and clearly not something Ceph was
 optimized for, if I were you I'd switch to a single pool with the
 appropriate number of pgs based on your pool size (replication) and
 the number of OSD you use (you should target 100 pgs/OSD to be in
 what seems the sweet spot) and create/delete rbd instead of the whole
 pool. You would be in known territory and any remaining performance
 problem would be easier to debug.

 I agree that this is a good suggestion. It took me a little while, but
 I've changed the configuration so that we now have only one pool,
 containing many rbds, and now all data is spread across all six of our
 OSD nodes. However, the performance has not perceptibly improved. We
 still have the occasional long (10 minutes) wait periods during write
 operations, and the bottleneck still seems to be ceph, rather than the
 hardware: the blocking process (most usually, but not always, mkfs) is
 stuck in a wait state (D in ps) but no I/O is actually being
 performed, so one can surmise that the physical limitations of the
 disk medium are not the bottleneck. This is similar to what is being
 reported in the thread titled 100% IO Wait with CEPH RBD and RSYNC.

 Do you have some idea how I can diagnose this problem?

I'd look at the ceph -s output while you have these stuck processes to
see if there's any unusual activity (scrub/deep
scrub/recovery/backfills/...). Is it correlated in any way with rbd
removal (ie: write blocking doesn't appear unless you removed at least
one rbd, say within one hour before the write performance problems)?
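
For example (commands to adapt, the grep is just to spot the usual
suspects):

watch -n 5 'ceph -s'
ceph health detail | grep -iE 'blocked|stuck|slow'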

Best regards,

Lionel Bouton
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

