Re: [ceph-users] AutoScale PG Questions - EC Pool

2019-09-11 Thread Lars Marowsky-Bree
On 2019-09-10T13:36:53, Konstantin Shalygin  wrote:

> > So I am correct in 2048 being a very high number and should go for
> > either 256 or 512 like you said for a cluster of my size with the EC
> > Pool of 8+2?
> Indeed. I suggest staying at 256.

Might as well go to 512, but the 2^n is really important.
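
For reference, here's a toy sketch of the usual rule of thumb (roughly 100
PGs per OSD, divided by the pool's replica count or k+m, rounded to a power
of two - the 100-per-OSD target is only the commonly cited default, so treat
the result as a starting point, not gospel):

    import math

    def suggest_pg_num(num_osds, pool_size, target_pgs_per_osd=100):
        # Rule of thumb: ~target PGs per OSD, shared across the pool's
        # replicas/chunks, rounded to the nearest power of two.
        raw = num_osds * target_pgs_per_osd / pool_size
        return 2 ** round(math.log2(raw))

    # Example: an 8+2 EC pool (pool_size = k + m = 10).
    print(suggest_pg_num(num_osds=20, pool_size=10))   # -> 256
    print(suggest_pg_num(num_osds=40, pool_size=10))   # -> 512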

Also, please open a tracker issue with your cluster details so the
autoscaler can be fixed to make a more sensible recommendation.


Thanks!,
Lars

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Mary Higgins, Sri Rasiah, HRB 21284 (AG 
Nürnberg)
"Architects should open possibilities and not determine everything." (Ueli 
Zbinden)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] even number of monitors

2019-08-05 Thread Lars Marowsky-Bree
On 2019-08-05T07:27:39, Alfredo Daniel Rezinovsky  wrote:

There's no massive problem with even MON counts.

As you note, n+2 doesn't really provide added fault tolerance compared
to n+1, so there's no win either. That's fairly obvious.

Somewhat less obvious: since the failure of any additional MON will now
lose quorum, and there are now, say, 3 remaining MONs instead of just 2,
there's a slightly higher chance that that case will trigger.

If the reason you're doing this is that you, say, want to standardize on
having one MON in each of your racks, and you happen to have 4 racks,
this is likely worth the trade-off.

And you can always manually lower the MON count to recover service even
then - from the durability perspective, you have one more copy of the
MON database after all.

Probability is fun ;-)
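
To put toy numbers on it (assuming independent MON failures with the same
probability p - purely illustrative, real failures are rarely independent):

    from math import comb

    def p_quorum_lost(n, p):
        # Probability that at least ceil(n/2) of n MONs are down at once,
        # i.e. that the majority needed for quorum is gone.
        k_min = (n + 1) // 2
        return sum(comb(n, k) * p**k * (1 - p)**(n - k)
                   for k in range(k_min, n + 1))

    p = 0.001  # each MON down 0.1% of the time
    print(f"3 MONs: {p_quorum_lost(3, p):.1e}")   # ~3.0e-06
    print(f"4 MONs: {p_quorum_lost(4, p):.1e}")   # ~6.0e-06 - more pairs that can fail together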


Regards,
Lars

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Mary Higgins, Sri Rasiah, HRB 21284 (AG 
Nürnberg)
"Architects should open possibilities and not determine everything." (Ueli 
Zbinden)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS snapshot for backup & disaster recovery

2019-08-05 Thread Lars Marowsky-Bree
On 2019-08-04T13:27:00, Eitan Mosenkis  wrote:

> I'm running a single-host Ceph cluster for CephFS and I'd like to keep
> backups in Amazon S3 for disaster recovery. Is there a simple way to
> extract a CephFS snapshot as a single file and/or to create a file that
> represents the incremental difference between two snapshots?

You could use rclone to sync your CephFS to S3.

rsync can build the latter - see its --write-batch and --only-write-batch
options.

(Unfortunately, there's no CephFS support yet for building the diff
between two snapshots quickly; rsync will do a full filesystem scan, even
though it obviously takes advantage of mtime/ctime.)
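
A minimal sketch of what such a scan looks like (the .snap paths and names
below are just placeholders for your own mount point and snapshots):

    import os

    def changed_files(old_snap, new_snap):
        # Walk new_snap and yield paths that are new, or whose mtime is newer
        # than the counterpart in old_snap. Full scan - no CephFS shortcuts.
        for root, _dirs, files in os.walk(new_snap):
            for name in files:
                new_path = os.path.join(root, name)
                rel = os.path.relpath(new_path, new_snap)
                old_path = os.path.join(old_snap, rel)
                if (not os.path.exists(old_path)
                        or os.path.getmtime(new_path) > os.path.getmtime(old_path)):
                    yield rel

    for path in changed_files("/mnt/cephfs/.snap/daily_1", "/mnt/cephfs/.snap/daily_2"):
        print(path)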

You could also use something like duplicity to run a regular incremental
backup.



-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Mary Higgins, Sri Rasiah, HRB 21284 (AG 
Nürnberg)
"Architects should open possibilities and not determine everything." (Ueli 
Zbinden)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Adventures with large RGW buckets [EXT]

2019-08-02 Thread Lars Marowsky-Bree
On 2019-08-01T15:20:19, Matthew Vernon  wrote:

> One you don't mention is that multipart uploads break during resharding - so
> if our users are filling up a bucket with many writers uploading multipart
> objects, some of these will fail (rather than blocking) when the bucket is
> resharded.

Is that on the tracker? I couldn't find it. If you can reproduce, would
you add that please?

(I found https://tracker.ceph.com/issues/22368 and
https://tracker.ceph.com/issues/38486, which may be related.)


Thanks,
Lars

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Mary Higgins, Sri Rasiah, HRB 21284 (AG 
Nürnberg)
"Architects should open possibilities and not determine everything." (Ueli 
Zbinden)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] which tool to use for benchmarking rgw s3, yscb or cosbench

2019-07-22 Thread Lars Marowsky-Bree
On 2019-07-21T23:51:41, Wei Zhao  wrote:

> Hi:
>   I found cosbench is a very convenient tool for benchmarking rgw. But
> when I read papers, I found the YCSB tool,
> https://github.com/brianfrankcooper/YCSB/tree/master/s3 . It seems
> that this is used to test cloud services, and seems like the right tool for
> our service. Has anyone tried this tool? How does it compare to
> cosbench?

Depending on what you want to test/benchmark, there's also a (somewhat
simple) S3/Swift/DAV backend in the fio tool.

While it only implements fairly straightforward GET/PUT/DELETE
operations for IO, it gives you a lot of control over the benchmark
parameters via the fio tool itself, and has very low overhead.

Another benefit is that it allows you to more or less directly compare
results across all protocols, since fio supports all the ways of
accessing Ceph (file, block, object, librados, kRBD, iSCSI, librbd
...).


Regards,
Lars

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Mary Higgins, Sri Rasiah, HRB 21284 (AG 
Nürnberg)
"Architects should open possibilities and not determine everything." (Ueli 
Zbinden)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Allocation recommendations for separate blocks.db and WAL

2019-07-18 Thread Lars Marowsky-Bree
On 2019-07-17T11:56:25, Robert LeBlanc  wrote:

> So, I see the recommendation for 4% of OSD space for blocks.db/WAL and the
> corresponding discussion regarding the 3/30/300GB vs 6/60/600GB allocation.
> 
> How does this change when the WAL is separate from blocks.db?
> 
> Reading [0] it seems that 6/60/600 is not correct. It seems that to compact
> a 300GB DB, you take values from the layer above (which is only 10% of
> the lower layer, and only the percentage that exceeds the trigger point
> will be merged down) and merge them in, so in the worst case you would
> need 333GB (300+30+3) plus some headroom.

I think the doubling of values is mainly used to leave sufficient
headroom for all possible overhead.

The most common choice we see here is the 60/64 GB scenario. (Computer
folks tend to think in powers of two. ;-)

It's not cost-effective to haggle too much; at any given 1:n ratio, the
60 GB * n on the shared device is not the significant cost factor. Going
too low, however, would likely be rather annoying in the future, so why
not play it safe?

The 4% general advice seems incomplete; if anything, one should possibly
then round up to the next sensible value. But this heavily depends on
the workload - if the cluster only hosts RBDs, you'll see much less
metadata, for example. Unfortunately, we don't seem to have
significantly better recommendations yet.
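
If it helps, this is the kind of toy arithmetic behind those numbers (the 4%
figure and the 30/60/300/600 ladder are just the rules of thumb from this
thread, not a hard recommendation):

    def db_size_suggestion(osd_size_gb, pct=0.04, ladder=(30, 60, 300, 600)):
        # Naive blocks.db sizing: pct of the OSD capacity, rounded up to the
        # next "sensible" value from the ladder discussed above.
        raw = osd_size_gb * pct
        for step in ladder:
            if raw <= step:
                return step
        return raw  # bigger than the ladder - size it explicitly

    print(db_size_suggestion(12000))   # 12 TB OSD: 4% = 480 GB -> 600 GB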


Regards,
Lars

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Mary Higgins, Sri Rasiah, HRB 21284 (AG 
Nürnberg)
"Architects should open possibilities and not determine everything." (Ueli 
Zbinden)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] New best practices for osds???

2019-07-17 Thread Lars Marowsky-Bree
On 2019-07-17T08:27:46, John Petrini  wrote:

The main problem we've observed is that not all HBAs can just
efficiently and easily pass through disks 1:1. Some of those from a
more traditional server background insist on having some form of
mapping via RAID.

In that case it depends on whether 1 disk unit RAID0s or "JBOD"
passthrough is more efficient, that's also hardware specific.

As a general best practice, that'd be to just give Ceph OSDs the disks
as they are.



Regards,
Lars

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Mary Higgins, Sri Rasiah, HRB 21284 (AG 
Nürnberg)
"Architects should open possibilities and not determine everything." (Ueli 
Zbinden)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Returning to the performance in a small cluster topic

2019-07-16 Thread Lars Marowsky-Bree
On 2019-07-16T19:27:32, " Drobyshevskiy, Vladimir "  wrote:

> This is a replicated pool with size = 3.

> Is it possible to check this anyhow? ping shows average latency ~0.147 ms
> which is pretty high for IB but might be reasonable (IPoIB).

This means that, barring any other overhead, you're limited to a maximum
of ~2200 IOPS per single thread.

(Data needs to go to the primary, then to each of the two replicas, and
back; that is about 3 hops at least. This assumes zero time needed to
transmit the data, for CPU/storage device processing, etc.)
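
Put differently, a single synchronous writer is bounded by 1 / (hops x
latency); a quick sanity check with the numbers above:

    def iops_ceiling(latency_ms, hops=3):
        # Upper bound on single-threaded synchronous IOPS when each write
        # traverses `hops` network legs, ignoring CPU and storage time.
        return 1000.0 / (hops * latency_ms)

    print(round(iops_ceiling(0.147)))   # ~2268, matching the ~2200 figure above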

And with so few devices, you're quickly going to get
serialized by the few primary OSDs.



-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Mary Higgins, Sri Rasiah, HRB 21284 (AG 
Nürnberg)
"Architects should open possibilities and not determine everything." (Ueli 
Zbinden)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Returning to the performance in a small cluster topic

2019-07-16 Thread Lars Marowsky-Bree
On 2019-07-15T19:08:26, " Drobyshevskiy, Vladimir "  wrote:

Hi Vladimir,

is this a replicated or EC pool?

>   The cluster itself:
>   nautilus
>   6 nodes, 7 SSD with 2 OSDs per SSD (14 OSDs in overall).

You mean 14 OSDs per node, right?

>   Each node: 2x Intel Xeon E5-2665 v1 (governor = performance, powersaving
> disabled), 64GB RAM, Samsung SM863 1.92TB SSD, QDR Infiniband.

I assume that's the cluster backend. How are the clients connected?

>   I've tried to make an RAID0 with mdraid and 2 virtual drives but haven't
> noticed any difference.

Your problem isn't bandwidth - it's the commit latency for the small IO.
In your environment, that's primarily going to be governed by network
(and possibly ceph-osd CPU) latency. That doesn't show up as high
utilization anywhere, because it's mainly waiting.

Most networking is terrifyingly slow compared to the latency of a local
flash storage device. And with Ceph, you've got to add at least two
roundtrips to every IO (client - primary OSD, primary OSD - replicas,
probably more, and if you use EC with ec_overwrites, definitely more
roundtrips).

Regards,
Lars

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Mary Higgins, Sri Rasiah, HRB 21284 (AG 
Nürnberg)
"Architects should open possibilities and not determine everything." (Ueli 
Zbinden)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What's the best practice for Erasure Coding

2019-07-11 Thread Lars Marowsky-Bree
On 2019-07-11T09:46:47, Frank Schilder  wrote:

> Striping with stripe units other than 1 is something I also tested. I found 
> that with EC pools non-trivial striping should be avoided. Firstly, EC is 
> already a striped format and, secondly, striping on top of that with 
> stripe_unit>1 will make every write an ec_overwrite, because now shards are 
> rarely if ever written as a whole.

That's why I said that rbd's stripe_unit should match the EC pool's
stripe_width, or be a 2^n multiple of it. (Not sure what stripe_count
should be set to; probably also a small power of two.)

> The native striping in EC pools comes from k, data is striped over k disks. 
> The higher k the more throughput at the expense of cpu and network.

Increasing k also increases stripe_width though; this leads to more IO
suffering from the ec_overwrite penalty.
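
To make that concrete (assuming the default 4 KiB EC stripe unit, so
stripe_width = k * 4096 - any write that isn't a multiple of it pays the
read-modify-write penalty):

    CHUNK = 4096  # default per-chunk EC stripe unit, in bytes

    def stripe_width(k, chunk=CHUNK):
        return k * chunk

    for k in (2, 4, 6, 8):
        sw = stripe_width(k)
        # A 64 KiB client write lines up for k=2/4/8, but not for k=6.
        print(f"k={k}: stripe_width={sw}, 64KiB aligned: {65536 % sw == 0}")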

> In my long list, this should actually be point
> 
> 6) Use stripe_unit=1 (default).

You mean stripe-count?

> To get back to your question, this is another argument for k=power-of-two. 
> Object sizes in ceph are always powers of 2 and stripe sizes contain k as a 
> factor. Hence, any prime factor other than 2 in k will imply a mismatch. How 
> badly a mismatch affects performance should be tested.

Yes, of course. Depending on the IO pattern, this means more IO will be
misaligned or have non-stripe_width portions. (Most IO patterns, if they
strive for alignment, aim for a power of two alignment, obviously.)

> Results with non-trivial striping (stripe_size>1) were so poor, I did not 
> even include them in my report.

stripe_size?

> We use the 8+2 pool for ceph fs, where throughput is important. The 6+2 pool 
> is used for VMs (RBD images), where IOP/s are more important. It also offers 
> a higher redundancy level. Its an acceptable compromise for us.

Especially with RBDs, I'm surprised that k=6 works well for you. Block
device IO is most commonly aligned on power-of-two boundaries.


Regards,
Lars

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Mary Higgins, Sri Rasiah, HRB 21284 (AG 
Nürnberg)
"Architects should open possibilities and not determine everything." (Ueli 
Zbinden)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] writable snapshots in cephfs? GDPR/DSGVO

2019-07-11 Thread Lars Marowsky-Bree
On 2019-07-10T09:59:08, Lars Täuber   wrote:

> Hi everbody!
> 
> Is it possible to make snapshots in cephfs writable?
> We need to remove files because of this General Data Protection Regulation 
> also from snapshots.

Removing data from existing WORM storage is tricky, and snapshots are a
specific form of it. If you want to avoid copying and altering all
existing records - which might clash with requirements from other areas
that the data must be immutable, though I guess you could store the
checksums externally somewhere - this is difficult.

I think what you'd need is an additional layer - say, one holding the
decryption keys for the tenant/user (or whatever granularity you want to
be able to remove data at) - that you can still modify.

Once the keys have been successfully and permanently wiped, the old data
is effectively permanently deleted (from all media; whether Ceph snaps
or tape or other immutable storage).

You may have a record that you *had* the data.

Now, of course, you've got to manage keys, but that's significantly less
data to massage.
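
A tiny sketch of the idea, using the Python 'cryptography' package purely
for illustration (nothing CephFS-specific here; the tenant names are made
up): the key store lives on mutable storage outside the snapshots, and
"deleting" a tenant is just destroying that tenant's key.

    from cryptography.fernet import Fernet

    # Key store on mutable storage, outside the snapshot/WORM data.
    keys = {"tenant-a": Fernet.generate_key(), "tenant-b": Fernet.generate_key()}

    def encrypt_for(tenant, payload: bytes) -> bytes:
        return Fernet(keys[tenant]).encrypt(payload)

    def erase_tenant(tenant):
        # Wiping the key renders every ciphertext for this tenant unreadable,
        # including copies sitting in old snapshots or on tape.
        del keys[tenant]

    blob = encrypt_for("tenant-a", b"personal data")
    erase_tenant("tenant-a")
    # blob may still exist in snapshots, but can no longer be decrypted.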

Not a lawyer, either.

Good luck.


Regards,
Lars

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Mary Higgins, Sri Rasiah, HRB 21284 (AG 
Nürnberg)
"Architects should open possibilities and not determine everything." (Ueli 
Zbinden)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] What's the best practice for Erasure Coding

2019-07-11 Thread Lars Marowsky-Bree
On 2019-07-09T07:27:28, Frank Schilder  wrote:

> Small addition:
> 
> This result holds for rbd bench. It seems to imply good performance for 
> large-file IO on cephfs, since cephfs will split large files into many 
> objects of size object_size. Small-file IO is a different story.
> 
> The formula should be N*alloc_size=object_size/k, where N is some integer;
> that is, object_size/k should be an integer multiple of alloc_size.

If using rbd striping, I'd also assume that making rbd's stripe_unit be
equal to, or at least a multiple of, the stripe_width of the EC pool is
sensible.

(Similar for CephFS's layout.)

Does this hold in your environment?


-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Mary Higgins, Sri Rasiah, HRB 21284 (AG 
Nürnberg)
"Architects should open possibilities and not determine everything." (Ueli 
Zbinden)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph features and linux kernel version for upmap

2019-07-10 Thread Lars Marowsky-Bree
On 2019-07-09T16:32:22, Mattia Belluco  wrote:

> That of course in my case fails with:
> Error EPERM: cannot set require_min_compat_client to luminous: 29
> connected client(s) look like jewel (missing 0x800); 1
> connected client(s) look like jewel (missing 0x800); add
> --yes-i-really-mean-it to do it anyway

That's because instead of looking at the actual feature flags needed and
advertised by the clients, it tries to translate those back to a Ceph
version (which doesn't work, because it's not the exact same codestream)
and then compares that. That is so harebrained it probably deserves
a tracker bug ;-)



Regards,
Lars

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Mary Higgins, Sri Rasiah, HRB 21284 (AG 
Nürnberg)
"Architects should open possibilities and not determine everything." (Ueli 
Zbinden)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Erasure Coding performance for IO < stripe_width

2019-07-08 Thread Lars Marowsky-Bree
On 2019-07-08T19:37:13, Paul Emmerich  wrote:

> object_map can be a bottleneck for the first write in fresh images

We're working with CephFS here.


-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Mary Higgins, Sri Rasiah, HRB 21284 (AG 
Nürnberg)
"Architects should open possibilities and not determine everything." (Ueli 
Zbinden)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Erasure Coding performance for IO < stripe_width

2019-07-08 Thread Lars Marowsky-Bree
On 2019-07-08T14:36:31, Maged Mokhtar  wrote:

Hi Maged,

> Maybe not related, but we find with rbd, random 4k write iops start very low
> at first for a new image and then increase over time as we write. If we
> thick provision the image, it does not show this. This happens on random
> small blocks and not sequential or large ones. Probably related to initial
> object/chunk creation.

I don't see that this is related; we actually see faster performance for
random writes initially. (Unsurprising - writing to a non-existent
part/object means the OSD has nothing else to read, so no overwrite.)

> Also we use the default stripe width, maybe you try a pool with default
> width and see if it is a factor.

This is the default stripe_width for an EC pool with k=2.


-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Mary Higgins, Sri Rasiah, HRB 21284 (AG 
Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Erasure Coding performance for IO < stripe_width

2019-07-08 Thread Lars Marowsky-Bree
On 2019-07-08T12:25:30, Dan van der Ster  wrote:

> Is there a specific bench result you're concerned about?

We're seeing ~5800 IOPS, ~23 MiB/s on 4 KiB IO (stripe_width 8192) on a
pool that could do 3 GiB/s with 4M blocksize. So, yeah, well, that is
rather harsh, even for EC.

> I would think that small write perf could be kept reasonable thanks to
> bluestore's deferred writes.

I believe we're being hit by the EC read-modify-write cycle on
overwrites.
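
A back-of-the-envelope model of that cycle, with k=2, m=2 picked purely for
illustration (simplified: a sub-stripe overwrite reads the old data chunks,
recomputes parity, and rewrites the touched data chunk plus all parity):

    def rmw_overhead(write_bytes, k, m, chunk=4096):
        # Very rough read/write bytes for one sub-stripe EC overwrite.
        stripe_width = k * chunk
        if write_bytes % stripe_width == 0:
            return {"read": 0, "write": write_bytes * (k + m) // k}
        return {"read": k * chunk, "write": (1 + m) * chunk}

    # A 4 KiB client write on a k=2, m=2 pool (stripe_width 8192):
    print(rmw_overhead(4096, k=2, m=2))   # ~8 KiB read + 12 KiB written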

> FWIW, our bench results (all flash cluster) didn't show a massive
> performance difference between 3 replica and 4+2 EC.

I'm guessing that this was not 4 KiB but a more reasonable blocksize
that was a multiple of stripe_width?


Regards,
Lars

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Mary Higgins, Sri Rasiah, HRB 21284 (AG 
Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Erasure Coding performance for IO < stripe_width

2019-07-08 Thread Lars Marowsky-Bree
Morning all,

since Luminous/Mimic, Ceph supports allow_ec_overwrites. However, this has a
performance impact that looks even worse than what I'd expect from a
Read-Modify-Write cycle.

https://ceph.com/community/new-luminous-erasure-coding-rbd-cephfs/ also
mentions that the small writes would read the previous value from all
k+m OSDs; shouldn't the k stripes be sufficient (assuming we're not
currently degraded)?

Is there any suggestion on how to make this go faster, or suggestions on
which solution one could implement going forward?


Regards,
Lars

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Mary Higgins, Sri Rasiah, HRB 21284 (AG 
Nürnberg)
"Architects should open possibilities and not determine everything." (Ueli 
Zbinden)

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Changing the release cadence

2019-06-26 Thread Lars Marowsky-Bree
On 2019-06-26T14:45:31, Sage Weil  wrote:

Hi Sage,

I think that makes sense. I'd have preferred the Oct/Nov target, but
that'd have made Octopus quite short.

Unsure whether freezing in December with a release in March is too long
though. But given how much people scramble, setting that as a goal
probably will help with stabilization.

I'm also hoping that one day we can move towards a more agile,
continuous integration model (like the Linux kernel does) instead of
massive yearly forklifts. But hey, that's just me ;-)



Regards,
Lars

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Mary Higgins, Sri Rasiah, HRB 21284 (AG 
Nürnberg)
"Architects should open possibilities and not determine everything." (Ueli 
Zbinden)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] New Ceph community manager: Mike Perez

2018-08-29 Thread Lars Marowsky-Bree
On 2018-08-29T01:13:24, Sage Weil  wrote:

Most excellent! Welcome, Mike!

I look forward to working with you.


Regards,
Lars

-- 
Architect SDS, Distinguished Engineer
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 
(AG Nürnberg)
"Architects should open possibilities and not determine everything." (Ueli 
Zbinden)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Aligning RBD stripe size with EC chunk size?

2018-06-14 Thread Lars Marowsky-Bree
Hi all,

so, I'm wondering right now (with some urgency, ahem) how to make RBD on
EC pools faster without resorting to cache tiering.

In a replicated pool, we had some success with RBD striping.

I wonder if it would be possible to align RBD stripe-unit with the EC
chunk size ...?

Is that worth pursuing at all? How would one best go about this?

(I also wonder if the
http://docs.ceph.com/docs/master/rados/operations/erasure-code-profile/
description of stripe_unit / stripe_width is wrong - shouldn't that be
the number of data+coding chunks, instead of just data chunks? They
need to go somewhere, right?)

Align k+m and RBD's object_size / (stripe-unit * stripe_count)?
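
For what it's worth, this is roughly the arithmetic I have in mind (purely a
sketch; the 4096-byte EC chunk is the default, everything else is just an
example):

    def check_alignment(stripe_unit, stripe_count, object_size, ec_k, ec_chunk=4096):
        # Check the alignment relations discussed above (illustrative only).
        ec_stripe_width = ec_k * ec_chunk
        return {
            "stripe_unit multiple of EC stripe_width":
                stripe_unit % ec_stripe_width == 0,
            "object_size multiple of stripe_unit * stripe_count":
                object_size % (stripe_unit * stripe_count) == 0,
        }

    # Example: 4 MiB objects, 64 KiB stripe-unit, stripe-count 4, 4+2 EC pool.
    print(check_alignment(64 * 1024, 4, 4 * 1024 * 1024, ec_k=4))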


Regards,
Lars

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 
(AG Nürnberg)
"Architects should open possibilities and not determine everything." (Ueli 
Zbinden)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Updating standby mds from 12.2.2 to 12.2.4 caused up:active 12.2.2 mds's to suicide

2018-03-14 Thread Lars Marowsky-Bree
On 2018-03-14T06:57:08, Patrick Donnelly  wrote:

> Yes. But the real outcome is not "no MDS [is] active" but "some or all
> metadata I/O will pause" -- and there is no avoiding that. During an
> MDS upgrade, a standby must take over the MDS being shutdown (and
> upgraded).  During takeover, metadata I/O will briefly pause as the
> rank is unavailable. (Specifically, no other rank can obtain locks or
> communicate with the "failed" rank; so metadata I/O will necessarily
> pause until a standby takes over.) Single active vs. multiple active
> upgrade makes little difference in this outcome.

Fair, except that there's no standby MDS at this time in case the update
goes wrong.

> > Is another approach theoretically feasible? Have the updated MDS only go
> > into the incompatible mode once there's a quorum of new ones available,
> > or something?
> I believe so, yes. That option wasn't explored for this patch because
> it was just disambiguating the compatibility flags and the full
> side-effects weren't realized.

Would such a patch be accepted if we ended up pursuing this? Any
suggestions on how to best go about this?

Anything that requires magic sauce on updates beyond the normal "MONs
first, rolling through" makes me twitchy and tends to end with at least
a few customers getting it not quite right ;-)


Regards,
Lars

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 
(AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph iSCSI is a prank?

2018-03-14 Thread Lars Marowsky-Bree
On 2018-03-02T15:24:29, Joshua Chen  wrote:

> Dear all,
>   I wonder how we could support VM systems with ceph storage (block
> device)? my colleagues are waiting for my answer for vmware (vSphere 5) and
> I myself use oVirt (RHEV). the default protocol is iSCSI.

Lean on VMWare to stop being difficult about a native RBD driver. I can
guarantee you all the Ceph vendors are drooling at the bits to get that
done, but ... the VMware licensing terms for their header files, SDKs,
etc aren't exactly Open Source friendly.

iSCSI - yes, it works, but it's a work-around that introduces
significant performance penalties and architectural complexities.

If you are a VMWare customer, let them know you're considering moving
off to OpenStack et al to get Ceph supported better.

Imagine Linus scolding NVidia. Maybe eventually it'll help ;-)



Regards,
Lars

-- 
Architect SDS, Distinguished Engineer
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 
(AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Updating standby mds from 12.2.2 to 12.2.4 caused up:active 12.2.2 mds's to suicide

2018-03-14 Thread Lars Marowsky-Bree
On 2018-02-28T02:38:34, Patrick Donnelly  wrote:

> I think it will be necessary to reduce the actives to 1 (max_mds -> 1;
> deactivate other ranks), shutdown standbys, upgrade the single active,
> then upgrade/start the standbys.
> 
> Unfortunately this didn't get flagged in upgrade testing. Thanks for
> the report Dan.

This means that - when the single active is being updated - there's a
time when there's no MDS active, right?

Is another approach theoretically feasible? Have the updated MDS only go
into the incompatible mode once there's a quorum of new ones available,
or something?

(From the point of view of a distributed system, this is double plus
ungood.)



Regards,
Lars

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 
(AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD device as SBD device for pacemaker cluster

2018-02-06 Thread Lars Marowsky-Bree
On 2018-02-06T13:00:59, Kai Wagner  wrote:

> I had the idea to use a RBD device as the SBD device for a pacemaker
> cluster. So I don't have to fiddle with multipathing and all that stuff.
> Have someone already tested this somewhere and can tell how the cluster
> reacts on this?

SBD should work on top of RBD; any shared block device will do. I'd
recommend slightly higher timeouts than normal; check how/if the Ceph
cluster blocks IO during recovery.


Regards,
Lars

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 
(AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] who is using nfs-ganesha and cephfs?

2017-11-09 Thread Lars Marowsky-Bree
On 2017-11-08T21:41:41, Sage Weil  wrote:

> Who is running nfs-ganesha's FSAL to export CephFS?  What has your 
> experience been?
> 
> (We are working on building proper testing and support for this into 
> Mimic, but the ganesha FSAL has been around for years.)

We use it currently, and it works, but let's not discuss the performance
;-)

How else do you want to build this into Mimic?

Regards,
Lars

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 
(AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Ceph-maintainers] Ceph release cadence

2017-09-07 Thread Lars Marowsky-Bree
On 2017-09-06T15:23:34, Sage Weil  wrote:

Hi Sage,

thanks for kicking off this discussion - after the L experience, it was
on my hot list to talk about too.

I do agree that we need predictable releases more than feature-rich
releases. Distributors like to plan, but that's not a reason. However,
we like to plan because *users* like to plan their schedules and
upgrades, and I think that matters more.

> - Not a lot of people seem to run the "odd" releases (e.g., infernalis, 
> kraken).  This limits the value of actually making them.  It also means 
> that those who *do* run them are running riskier code (fewer users -> more 
> bugs).

Yes. Odd releases never really make it to user systems. They're on the
previous LTS release. In the devel releases, the code is often too
unstable, and developers seem to cram everything in. Basically, the odd
releases are long periods working up to the next stable release.

(And they get all the cool names, which I find personally sad. I want my
users to run Infernalis, Kraken, and Mimic. ;-)

> - The more recent requirement that upgrading clusters must make a stop at 
> each LTS (e.g., hammer -> luminous not supported, must go hammer -> jewel 
> -> lumninous) has been hugely helpful on the development side by reducing 
> the amount of cross-version compatibility code to maintain and reducing 
> the number of upgrade combinations to test.

On this, I feel that it might make more sense to phrase this so that
such cross version compatibility is not tied to major releases (which
doesn't really help them plan lifecycles if those releases aren't
reliable), but to time periods.

> - When we try to do a time-based "train" release cadence, there always 
> seems to be some "must-have" thing that delays the release a bit.  This 
> doesn't happen as much with the odd releases, but it definitely happens 
> with the LTS releases.  When the next LTS is a year away, it is hard to 
> suck it up and wait that long.

Yes, I can see that. This is clearly something we'd want to avoid.

> A couple of options:
> 
> * Keep even/odd pattern, and continue being flexible with release dates

I admit I'm not a fan of this one.

> * Drop the odd releases but change nothing else (i.e., 12-month release 
> cadence)
>   + eliminate the confusing odd releases with dubious value

A 12-month period is too long for regular users. Admittedly, I suspect for RH
and SUSE with RHCS or SES respectively, this doesn't matter much - but it's
not good for the community as a whole. Also, this means not enough
community / end-user testing will happen for 11 out of those 12 months,
implying such long cycles make it hard to release n+1.0 with high
quality.

I've been doing software development for almost two decades, and no user
really touches betas before one calls it an RC, and even then ...

> * Drop the odd releases, and aim for a ~9 month cadence. This splits the 
> difference between the current even/odd pattern we've been doing.

It's a step up, but the period is still both too long, and unaligned.
This makes lifecycle management for everyone annoying.

> * Drop the odd releases, but relax the "must upgrade through every LTS" to 
> allow upgrades across 2 versions (e.g., luminous -> mimic or luminous -> 
> nautilus).  Shorten release cycle (~6-9 months).
> 
>   + more flexibility for users
>   + downstreams have greater choice in adopting an upstrema release
>   - more LTS branches to maintain
>   - more upgrade paths to consider

From the list of options you provide, I like this one the best; the ~6
month release cycle means there should be one about once per year as
well, which makes cycling easier to plan.

> Other options we should consider?  Other thoughts?

With about 20-odd years in software development, I've become a big
believer in schedule-driven releases. If it's feature-based, you never
know when they'll get done.

If the schedule intervals are too long though, the urge to press too
much in (so as not to miss the next merge window) is just too high,
meaning the train gets derailed. (Which cascades into the future,
because the next time the pressure will be even higher based on the
previous experience.) This requires strictness.

We've had a few Linux kernel releases that were effectively feature
driven and never quite made it. 1.3.x? 1.5.x? My memory is bad, but they
were a disaster that eventually led Linus to evolve to the current
model.

That serves them really well, and I believe it might be worth
considering for us.

I'd try to move away from the major milestones. Features get integrated
into the next schedule-driven release when they're deemed ready and stable;
when they're not, not a big deal, the next one is coming up "soonish".

(This effectively decouples feature development slightly from the
release schedule.)

We could even go for "a release every 3 months, sharp", merge window for
the first month, stabilization the second, release clean up the third,
ship.

Interoperability hacks for the 

Re: [ceph-users] upgrade procedure to Luminous

2017-07-17 Thread Lars Marowsky-Bree
On 2017-07-14T15:18:54, Sage Weil  wrote:

> Yes, but how many of those clusters can only upgrade by updating the 
> packages and rebooting?  Our documented procedures have always recommended 
> upgrading the packages, then restarting either mons or osds first and to 
> my recollection nobody has complained.  TBH my first encounter with the 
> "reboot on upgrade" procedure in the Linux world was with Fedora (which I 
> just recently switched to for my desktop)--and FWIW it felt very 
> anachronistic.

Admittedly, it is. This is my main reason for hoping for containers.

My main issue is not that they must be rebooted. In most cases, ceph-mon
can be restarted. My fear is that they *might* be rebooted by a failure
during that time, and it'd have been my expectation that normal
operation does not expose Ceph to such degraded scenarios. Ceph is,
after all, supposedly at least tolerant of one fault at a time.

And I'd obviously have considered upgrades a normal operation, not a
critical phase.

If one considers upgrades an operation that degrades redundancy, sure,
the current behaviour is in line.

> won't see something we haven't.  It also means, in this case, that we can 
> rip out out a ton of legacy code in luminous without having to keep 
> compatibility workarounds in place for another whole LTS cycle (a year!).  

Seriously, welcome to the world of enterprise software and customer
expectations ;-) 1 year! I wish! ;-)

> True, but this is rare, and even so the worst that can happen in this 
> case is the OSDs don't come up until the other mons are upgrade.  If the 
> admin plans to upgrade the mons in succession without lingering with 
> mixed-versions mon the worst-case downtime window is very small--and only 
> kicks in if *more than one* of the mon nodes fails (taking out OSDs in 
> more than one failure domain).

This is an interesting design philosophy in a fault tolerant distributed
system.

> > And customers don't always upgrade all nodes at once in a short period
> > (the benefit of a supposed rolling upgrade cycle), increasing the risk.
> I think they should plan to do this for the mons.  We can make a note 
> stating as much in the upgrade procedure docs?

Yes, we'll have to orchestrate this accordingly.

Upgrade all MONs; restart all MONs (while warning users that this is a
critical time period); start rebooting for the kernel/glibc updates.

> Anyway, does that make sense?  Yes, it means that you can't just reboot in 
> succession if your mons are mixed with OSDs.  But this time adding that 
> restriction let us do the SnapSet and snapdir conversion in a single 
> release, which is a *huge* win and will let us rip out a bunch of ugly OSD 
> code.  We might not have a need for it next time around (and can try to 
> avoid it), but I'm guessing something will come up and it will again be a 
> hard call to make balancing between sloppy/easy upgrades vs simpler 
> code...

The next major transition probably will be from non-containerized L to
fully-containerized N(autilus?). That'll be a fascinating can of worms
anyway. But that would *really* benefit if nodes could be more easily
redeployed and not just restarting daemon processes.

Thanks, at least now we know this is intentional. That was helpful, at
least!


-- 
Architect SDS
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 
(AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] upgrade procedure to Luminous

2017-07-14 Thread Lars Marowsky-Bree
On 2017-07-14T10:34:35, Mike Lowe  wrote:

> Having run ceph clusters in production for the past six years and upgrading 
> from every stable release starting with argonaut to the next, I can honestly 
> say being careful about order of operations has not been a problem.

This requirement did not exist as a mandatory one for previous releases.

The problem is not the sunshine-all-is-good path. It's about what to do
in case of failures during the upgrade process.



-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 
(AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] upgrade procedure to Luminous

2017-07-14 Thread Lars Marowsky-Bree
On 2017-07-14T14:12:08, Sage Weil  wrote:

> > Any thoughts on how to mitigate this, or on whether I got this all wrong and
> > am missing a crucial detail that blows this wall of text away, please let me
> > know.
> I don't know; the requirement that mons be upgraded before OSDs doesn't 
> seem that unreasonable to me.  That might be slightly more painful in a 
> hyperconverged scenario (osds and mons on the same host), but it should 
> just require some admin TLC (restart mon daemons instead of 
> rebooting).

I think it's quite unreasonable, to be quite honest. Collocated MONs
with OSDs is very typical for smaller cluster environments.

> Is there something in some distros that *requires* a reboot in order to 
> upgrade packages?

Not necessarily.

*But* once we've upgraded the packages, a failure or reboot might
trigger this.

And customers don't always upgrade all nodes at once in a short period
(the benefit of a supposed rolling upgrade cycle), increasing the risk.

I wish we'd already be fully containerized so indeed the MONs were truly
independent of everything else going on on the cluster, but ...

> Also, this only seems like it will affect users that are getting their 
> ceph packages from the distro itself and not from a ceph.com channel or a 
> special subscription/product channel (this is how the RHEL stuff works, I 
> think).

Even there, upgrading only the MON daemons and not the OSDs is tricky?




-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 
(AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] dropping filestore+btrfs testing for luminous

2017-07-05 Thread Lars Marowsky-Bree
On 2017-06-30T16:48:04, Sage Weil  wrote:

> > Simply disabling the tests while keeping the code in the distribution is
> > setting up users who happen to be using Btrfs for failure.
> 
> I don't think we can wait *another* cycle (year) to stop testing this.
> 
> We can, however,
> 
>  - prominently feature this in the luminous release notes, and
>  - require the 'enable experimental unrecoverable data corrupting features =
> btrfs' in order to use it, so that users are explicitly opting in to 
> luminous+btrfs territory.
> 
> The only good(ish) news is that we aren't touching FileStore if we can 
> help it, so it less likely to regress than other things.  And we'll 
> continue testing filestore+btrfs on jewel for some time.

That makes sense. Though btrfs is something users really shouldn't run
unless they get a heavily debugged and supported version from somewhere.

I'd also not mind just plain dropping it completely, since I don't
believe any of our users run it; they're all on XFS and will upconvert
to BlueStore.

That might be a good reason to keep it, though: upgrading folks should be
able to bring their OSDs on btrfs up (if they still have any) and go directly
to BlueStore, without having to first go via XFS.




Regards,
Lars

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 
(AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v12.1.0 Luminous RC released

2017-06-29 Thread Lars Marowsky-Bree
On 2017-06-26T11:28:36, Ashley Merrick  wrote:

> With the EC Overwrite support, if currently running behind a cache tier in 
> Jewel will the overwrite still be of benefit through the cache tier and 
> remove the need to promote the full block to make any edits?
> 
> Or we better totally removing the cache tier once fully upgraded?

I think cache tiering still promotes the whole object.

You will no longer *need* the cache tier from a functional perspective,
but it might still provide a significant performance advantage. 

Is that worth the additional capacity needs and complexity? (Plus the
promote/demote traffic, etc.)

This all will depend on your workload.


Regards,
Lars

-- 
Architect SDS
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 
(AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] risk mitigation in 2 replica clusters

2017-06-29 Thread Lars Marowsky-Bree
On 2017-06-22T00:51:38, Blair Bethwaite  wrote:

> I'm doing some work to evaluate the risks involved in running 2r storage
> pools. On the face of it my naive disk failure calculations give me 4-5
> nines for a 2r pool of 100 OSDs (no copyset awareness, i.e., secondary disk
> failure based purely on chance of any 1 of the remaining 99 OSDs failing
> within recovery time). 5 nines is just fine for our purposes, but of course
> multiple disk failures are only part of the story.

You are confounding availability with data durability, too.

"Traditional" multi-node replicated storage solutions can get away with
only two nodes to mirrot the data inbetween because they typically have
an additional RAID5/6 at the local node level. (Which also helps with
recovery impact of a single device failure.) Ceph typically doesn't.

(That's why rbd-mirror between two Ceph clusters can be OK too.)

A disk failing while a node is down, or being rebooted, ...
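
For what it's worth, the kind of naive estimate being discussed looks like
this (toy model: independent failures, a fixed recovery window, no copyset
awareness - and it's extremely sensitive to the AFR and recovery-time
assumptions):

    def naive_2r_loss_rate(n_osds=100, afr=0.01, recovery_hours=1.0):
        # Expected data-loss events per year: a disk fails, and one of the
        # other n-1 disks also fails before recovery completes.
        p_window = afr * recovery_hours / (365 * 24)
        p_second = 1 - (1 - p_window) ** (n_osds - 1)
        return n_osds * afr * p_second

    p = naive_2r_loss_rate()
    print(f"~{p:.1e} loss events/year")   # ~1e-4 with these optimistic inputs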


> thereof, e.g., something that would enable the functional equivalent of:
> "this OSD/node is going to go offline so please create a 3rd replica in
> every PG it is participating in before we shutdown that/those OSD/s"...?

You can evacuate the node by setting its weight to 0. It's a very
expensive operation, just like your proposal would be.


Regards,
Lars

-- 
Architect SDS
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 
(AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous: ETA on LTS production release?

2017-06-19 Thread Lars Marowsky-Bree
On 2017-06-16T20:09:04, Gregory Farnum  wrote:

> There's a lot going into this release, and things keep popping up. I
> suspect it'll be another month or two, but I doubt anybody is capable of
> giving a more precise date. :/ The downside of giving up on train
> releases...

What's still outstanding, though? When will features be frozen, so that
stabilization can happen?

(I wish we hadn't moved away from schedule-driven releases and stuck to
what the kernel does ;-)



Regards,
Lars

-- 
Architect SDS, Distinguished Engineer
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 
(AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Feedback wanted: health warning when standby MDS dies?

2016-10-18 Thread Lars Marowsky-Bree
On 2016-10-18T12:46:48, John Spray  wrote:

> I've suggested a solution here:
> http://tracker.ceph.com/issues/17604
> 
> This is probably going to be a bit of a subjective thing in terms of
> whether people find it useful or find it to be annoying noise, so I'd
> be interested in feedback from people currently running cephfs.

I'm not adding much new value here, but, yes.

Being warned that we're in a degraded mode where the next failure might
cause an outage would be very useful. Maybe that's even worth flagging
as a separate health level.


-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 
(AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] new Open Source Ceph based iSCSI SAN project

2016-10-18 Thread Lars Marowsky-Bree
On 2016-10-18T11:36:42, Christian Balzer  wrote:

> > Cache tiering in Ceph works for this use case. I assume you mean in
> > your UI?
> May well be, but Oliver suggested that cache-tiering is not supported with
> Hammer (0.94.x), which it most certainly is.

Right, we've got some success stories on Hammer already with it. Though
I much prefer Jewel too. ;-)

> > Though we all are waiting for Luminous to do away with the need for
> > cache tiering to do rbd to ec pools ...
> Well, there's the EC band-aid use case for cache-tiers, but they can be
> very helpful otherwise, depending on the size of working set/cache-pool,
> configuration of the cache-pool (write-back vs. read-forward) and
> specific use case.

Right. The main problem in my experience with cache tiering is that one
needs to know the answers to a bunch of questions most customers don't
have answers to ;-) - e.g., configuring the working set, flush/eviction
policies, etc. It's very helpful in some use cases, but mainly we're
seeing it used in the "band aid" mode. And I'd really love to get rid of
those.


Regards,
Lars

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 
(AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Does anyone know why cephfs do not support EC pool?

2016-10-18 Thread Lars Marowsky-Bree
On 2016-10-18T00:06:57, Erick Perez - Quadrian Enterprises wrote:

> Is EC on the roadmap for Ceph? Can't seem to find it. My question is because
> "all others" (Nutanix, Hypergrid) do EC storage for VMs as the default way
> of storage. It seems EC in ceph (as of Sept 2016) is considered by many
> "experimental" unless it is used for cold data.

Right now (Jewel) you have to "cheat" and put a cache tier in front of
an EC pool if you want to use it for block or direct data access (like
CephFS). That actually works quite well, and you can then also use
much faster storage media in front of your dense EC pool.

The downside is you have to understand how to set some of the cache
tiering tunables and be able to size it appropriately, and understand
the impact of cache tiering on your cluster performance (e.g., the
overhead of the additional traffic between the tiers). Yet, it works
well.

Partial overwrites for EC pools directly is on the roadmap for
Kraken/Luminous, probably leveraging some of BlueStore's new features.
If I understood Sage correctly, he considers that a release criterion for
Luminous.

That'd allow us to directly access EC pools w/o the need for cache
tiering.



Regards,
Lars

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 
(AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] new Open Source Ceph based iSCSI SAN project

2016-10-17 Thread Lars Marowsky-Bree
On 2016-10-17T15:31:31, Maged Mokhtar  wrote:

> This is our first beta version, we do not support cache tiering. We
> definitely intend to support it.

Cache tiering in Ceph works for this use case. I assume you mean in
your UI?

Though we all are waiting for Luminous to do away with the need for
cache tiering to do rbd to ec pools ...


Regards,
Lars

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 
(AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] new Open Source Ceph based iSCSI SAN project

2016-10-17 Thread Lars Marowsky-Bree
On 2016-10-17T13:37:29, Maged Mokhtar  wrote:

Hi Maged,

glad to see our patches caught your attention. You're aware that they
are being upstreamed by David Disseldorp and Mike Christie, right? You
don't have to uplift patches from our backported SLES kernel ;-)

Also, curious why you based this on Hammer; SUSE Enterprise Storage at
this point is based on Jewel. Did you experience any problems with the
older release? The newer one has important fixes.

Is this supposed to be a separate product/project forever? I mean, there
are several management frontends for Ceph at this stage gaining the
iSCSI functionality.

And, lastly, if all I wanted to build was an iSCSI target and not expose
the rest of Ceph's functionality, I'd probably build it around drbd9.

But glad to see the iSCSI frontend is gaining more traction. We have
many customers in the field deploying it successfully with our support
package.

OK, not quite lastly - could you be convinced to make the source code
available in a bit more convenient form? I doubt that's the preferred
form of distribution for development ;-) A GitHub repo maybe?


Regards,
Lars

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 
(AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] suse_enterprise_storage3_rbd_LIO_vmware_performance_bad

2016-07-04 Thread Lars Marowsky-Bree
On 2016-07-01T19:11:34, Nick Fisk  wrote:

> To summarise,
> 
> LIO is just not working very well at the moment because of the ABORT Tasks 
> problem, this will hopefully be fixed at some point. I'm not sure if SUSE 
> works around this, but see below for other pain points with RBD + ESXi + iSCSI

Yes, the SUSE kernel has recent backports that fix these bugs. And
there's obviously on-going work to improve the performance and code.

That's not to say that I'd advocate iSCSI as a primary access mechanism
for Ceph. But the need to interface from non-Linux systems to a Ceph
cluster is unfortunately very real.

> With 1GB networking I think you will struggle to get your write latency much 
> below 10-15ms, but from your example ~30ms is still a bit high. I wonder if 
> the default queue depths on your iSCSI target are too low as well?

Thanks for all the insights on the performance issues. You're really
quite spot on.

The main concern here obviously is that the same 2x1GbE network is
carrying the client/ESX traffic, the iSCSI-target-to-OSD traffic, and the
OSD backend traffic all at once. That is not advisable.


Regards,
Lars

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 
(AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] suse_enterprise_storage3_rbd_LIO_vmware_performance_bad

2016-07-04 Thread Lars Marowsky-Bree
On 2016-07-01T17:18:19, Christian Balzer  wrote:

> First off, it's somewhat funny that you're testing the repackaged SUSE
> Ceph, but asking for help here (with Ceph being owned by Red Hat).

*cough* Ceph is not owned by RH. RH acquired the InkTank team and the
various trademarks, that's true (and, admittedly, I'm a bit envious
about that ;-), but Ceph itself is an Open Source project that is not
owned by a single company.

You may want to check out the growing contributions from other
companies and the active involvement by them in the Ceph community ;-)


Regards,
Lars

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 
(AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] iSCSI over RDB is a good idea ?

2015-11-05 Thread Lars Marowsky-Bree
On 2015-11-04T14:30:56, Hugo Slabbert  wrote:

> Sure. My post was not intended to say that iSCSI over RBD is *slow*, just 
> that it scales differently than native RBD client access.
> 
> If I have 10 OSD hosts with a 10G link each facing clients, provided the OSDs 
> can saturate the 10G links, I have 100G of aggregate nominal throughput under 
> ideal conditions. If I put an iSCSI target (or an active/passive pair of 
> targets) in front of that to connect iSCSI initiators to RBD devices, my 
> aggregate nominal throughput for iSCSI clients under ideal conditions is 10G.

It's worth noting that you can use multiple iSCSI target gateways using
MPIO, which allows you to scale the performance and availability
horizontally.

This doesn't help with the additional network/gateway hop, but the
bandwidth limitation is not the issue.

And that works today.


Regards,
Lars

-- 
Architect Storage/HA
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 
(AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] External XFS Filesystem Journal on OSD

2015-07-10 Thread Lars Marowsky-Bree
On 2015-07-10T15:20:23, Jacek Jarosiewicz jjarosiew...@supermedia.pl wrote:

> We have tried both - you can see a performance gain, but we finally went
> toward the ceph cache tier. It's much more flexible and gives similar gains in
> terms of performance.
>
> Downside to bcache is that you can't use it on a drive that already has data
> - only new, clean partitions can be added - and (although I've read that
> bcache is quite resilient) you cannot access the raw filesystem once bcache is
> added to your partition (data is only accessible through bcache, so
> potentially if bcache goes corrupt, your data goes corrupt).
>
> Downside to flashcache is that you can only combine a partition on ssd with
> another partition on a spinning drive, so you have to think ahead when
> planning your disk layout, i.e.: if you partition your ssd with `n'
> partitions so that it can cache your `n' spinning drives, and then you want
> to add another spinning drive you either had to have left some space on the
> original ssd, or you have to add a new one. And if you have left some space
> - it's been just sitting there waiting for a new spinning drive.
>
> With cache tier you can have your cake and eat it too :) - add/remove ssd's
> on demand, and add/remove spinning drives as you wish - just tune the pool
> sizes after you change your drive layout.

Great feedback, too.

So the point about bcache is very valid. But then, a cache layer does
require a lot more tuning, has many more moving parts, requires more
memory, and needs a more complex Ceph setup.

(I was specifically wondering whether bcache could help in front of SMR
drives, actually.)

But it's really useful to know you're seeing similar speed-ups with the
cache tiering.


Regards,
Lars

-- 
Architect Storage/HA
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Dilip Upmanyu, Graham 
Norton, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] External XFS Filesystem Journal on OSD

2015-06-04 Thread Lars Marowsky-Bree
On 2015-06-04T12:42:42, David Burley da...@slashdotmedia.com wrote:

> Are there any safety/consistency or other reasons we wouldn't want to try
> using an external XFS log device for our OSDs? I realize if that device
> fails the filesystem is pretty much lost, but beyond that?

I think with the XFS journal on the same SSD as ceph's OSD journal, that
could be a quite interesting setup. Please share performance numbers!

I've been meaning to benchmark bcache in front of the OSD backend,
especially for SMRs, but haven't gotten around to it yet.


Regards,
Lars

-- 
Architect Storage/HA
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Dilip Upmanyu, Graham 
Norton, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SLES Packages

2015-06-02 Thread Lars Marowsky-Bree
On 2015-06-01T15:41:05, Steffen Weißgerber weissgerb...@ksnb.de wrote:

> Hi,
>
> I'm searching for current packages for SLES11 SP3.
>
> Via the SMT update server it seems that there's only version 0.80.8 available.
> Are there other package sources available (at least for Giant)?

Hi Steffen,

we have only released the client side enablement for SLES 11 SP3. There
is no Ceph server side code available for this platform (at least not
from SUSE).

Our server-side offering is based on SLES 12 (SUSE Enterprise Storage).
Currently it is based on firefly 0.80.9, though the next upgrade is always
in the works ;-) (Probably going directly to 0.80.11.)

Only our next product release will be based on Hammer++.

A more community-oriented version, including more recent packages, is
available for openSUSE (via build.opensuse.org).

> What I want to do is mount ceph via rbd map natively instead of mounting nfs
> from another host on which I have current packages available.

That should be possible with the SLES 11 SP3 packages that you have
access to. The rbd client code is included there.


Regards,
Lars

-- 
Architect Storage/HA
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Dilip Upmanyu, Graham 
Norton, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph RBD and Cephfuse

2015-06-02 Thread Lars Marowsky-Bree
On 2015-06-02T15:40:54, gjprabu gjpr...@zohocorp.com wrote:

> Hi Team,
>
> We are newly using ceph with two OSDs and two clients. Our requirement
> is that when we write data through one client, it should be visible on the
> other client as well. The storage is mounted using rbd because we run git
> clone with a large number of small files and it is fast when using the rbd
> mount, but the data does not sync between the two clients.

What file system are you using on top of RBD for this purpose? To
achieve this goal, you'd need to use a cluster-aware file system (with
all the complexity that entails) like OCFS2 or GFS2.

You cannot mount something like XFS/btrfs/ext4 multiple times; that
will, in fact, corrupt your data and likely crash the clients'
kernels.


Regards,
Lars

-- 
Architect Storage/HA
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Dilip Upmanyu, Graham 
Norton, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph RBD and Cephfuse

2015-06-02 Thread Lars Marowsky-Bree
On 2015-06-02T17:23:58, gjprabu gjpr...@zohocorp.com wrote:

> Hi Lars,
>
> We installed CentOS on the client machines with kernel version 3.10,
> which has the rbd modules. We then installed ocfs2-tools and formatted, but
> the mount threw an error. Please check below.

You need to configure the ocfs2 cluster properly as well. You can use
either o2cb (which I'm not familiar with anymore), or the
pacemaker-integrated version:
https://www.suse.com/documentation/sle_ha/book_sleha/data/sec_ha_ocfs2_create_service.html
(should pretty much apply to CentOS as well).

From this point on, rbd is really just a shared block device, and you
may have better success with the us...@clusterlabs.org mailing
list if you wish to pursue this route.


Regards,
Lars

-- 
Architect Storage/HA
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Dilip Upmanyu, Graham 
Norton, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com