Re: [ceph-users] [External Email] RE: Beginner questions

2020-01-16 Thread DHilsbos
Paul;

So is the 3/30/300GB a limit of RocksDB, or of Bluestore?

The percentages you list, are they used DB / used data?  If so... Where do you 
get the used DB data from?

Thank you,

Dominic L. Hilsbos, MBA 
Director – Information Technology 
Perform Air International Inc.
dhils...@performair.com 
www.PerformAir.com


From: Paul Emmerich [mailto:paul.emmer...@croit.io] 
Sent: Thursday, January 16, 2020 3:23 PM
To: Bastiaan Visser
Cc: Dominic Hilsbos; Ceph Users
Subject: Re: [ceph-users] [External Email] RE: Beginner questions

Discussing DB size requirements without knowing the exact cluster requirements 
doesn't work.

Here are some real-world examples:

cluster1: CephFS, mostly large files, replicated x3
0.2% used for metadata

cluster2: radosgw, mix between replicated and erasure, mixed file sizes (lots 
of tiny files, though)
1.3% used for metadata

The 4%-10% quoted in the docs are *not based on any actual usage data*, they 
are just an absolute worst case estimate.


A 30 GB DB partition for a 12 TiB disk is 0.25% if the disk is completely full 
(which it won't be), and that is sufficient for many use cases.
I think cluster2 with 1.3% is one of the highest metadata usages that I've seen 
on an actual production cluster.
I can think of a setup that probably has more but I haven't ever explicitly 
checked it.
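
For what it's worth, here is that arithmetic as a small Python sketch (the
metadata fractions are just the two example clusters above, not a general rule):

# Back-of-the-envelope DB sizing for a single OSD.  The metadata fractions
# are the two example clusters above, not a general rule.

TB, GB = 10**12, 10**9

disk_size = 12 * TB          # one "12 TB" spinner (decimal terabytes)
db_partition = 30 * GB

print(f"DB partition vs. a completely full disk: {db_partition / disk_size:.2%}")   # 0.25%

for name, fraction in [("cluster1 (CephFS)", 0.002), ("cluster2 (radosgw)", 0.013)]:
    needed = disk_size * fraction
    print(f"{name}: ~{needed / GB:.0f} GB of DB if the disk were full")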

The restriction to 3/30/300 is temporary and might be fixed in a future 
release, so I'd just partition that disk into X DB devices.

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Thu, Jan 16, 2020 at 10:28 PM Bastiaan Visser  wrote:
Dave made a good point: WAL + DB might end up a little over 60 GB, so I would 
probably go with ~70 GB partitions/LVs per OSD in your case (if the NVMe drive 
is smart enough to spread the writes over all available capacity, most recent 
NVMes are). I have not yet seen a WAL larger than, or even close to, a 
gigabyte.

We don't even think about EC pools on clusters with less than 6 nodes 
(spindles; full SSD is another story).
EC pools need more processing resources. We usually settle with 1 GB per TB of 
storage on replicated-only clusters, but when EC pools are involved, we add at 
least 50% to that. Also make sure your processors are up for it.

Do not base your calculations on a healthy cluster -> build to fail.
How long are you willing to be in a degraded state on node failure? Especially 
when using many large spindles, recovery time might be way longer than you 
think. 12 * 12 TB is 144 TB of storage; on a 4+2 EC pool you might end up with 
over 200 TB of traffic, and on a 10 Gbit network that's roughly two and a half 
days to recover. That is IF your processors are not a bottleneck due to EC 
parity calculations and all capacity is available for recovery (which is 
usually not the case; there is still production traffic that will eat up 
resources).

On Thu, 16 Jan 2020 at 21:30,  wrote:
Dave;

I don't like reading inline responses, so...

I have zero experience with EC pools, so I won't pretend to give advice in that 
area.

I would think that small NVMe for DB would be better than nothing, but I don't 
know.

Once I got the hang of building clusters, it was relatively easy to wipe a 
cluster out and rebuild it.  Perhaps you could take some time, and benchmark 
different configurations?

Thank you,

Dominic L. Hilsbos, MBA 
Director – Information Technology 
Perform Air International Inc.
dhils...@performair.com 
www.PerformAir.com


-Original Message-
From: Dave Hall [mailto:kdh...@binghamton.edu] 
Sent: Thursday, January 16, 2020 1:04 PM
To: Dominic Hilsbos; ceph-users@lists.ceph.com
Subject: Re: [External Email] RE: [ceph-users] Beginner questions

Dominic,

We ended up with a 1.6TB PCIe NVMe in each node.  For 8 drives this 
worked out to a DB size of something like 163GB per OSD. Allowing for 
expansion to 12 drives brings it down to 124GB. So maybe just put the 
WALs on NVMe and leave the DBs on the platters?

Understood that we will want to move to more nodes rather than more 
drives per node, but our funding is grant and donation based, so we may 
end up adding drives in the short term.  The long term plan is to get to 
separate MON/MGR/MDS nodes and 10s of OSD nodes.

Due to our current low node count, we are considering erasure-coded PGs 
rather than replicated in order to maximize usable space.  Any 
guidelines or suggestions on this?

Also, sorry for not replying inline.  I haven't done this much in a 
while - I'll figure it out.

Thanks.

-Dave

On 1/16/2020 2:48 PM, dhils...@performair.com wrote:
> Dave;
>
> I'd like to expand on this answer, briefly...
>
> The information in the docs is wrong.  There have been many discussions about 
> changing it, but no good alternative has been suggested, thus it hasn't been 
> changed.
>
> The 3rd party project that Ceph's BlueStore uses for its database (Roc

Re: [ceph-users] [External Email] RE: Beginner questions

2020-01-16 Thread Paul Emmerich
Discussing DB size requirements without knowing the exact cluster
requirements doesn't work.

Here are some real-world examples:

cluster1: CephFS, mostly large files, replicated x3
0.2% used for metadata

cluster2: radosgw, mix between replicated and erasure, mixed file sizes
(lots of tiny files, though)
1.3% used for metadata

The 4%-10% quoted in the docs are *not based on any actual usage data*,
they are just an absolute worst case estimate.


A 30 GB DB partition for a 12 TiB disk is 0.25% if the disk is completely
full (which it won't be), and that is sufficient for many use cases.
I think cluster2 with 1.3% is one of the highest metadata usages that I've
seen on an actual production cluster.
I can think of a setup that probably has more but I haven't ever explicitly
checked it.
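
If someone wants to pull these numbers from their own cluster, a minimal sketch
of one way to do it (assuming the BlueFS perf counters and the `ceph osd df`
JSON fields are named as below; exact names may vary by release):

# Sketch: metadata (DB) usage vs. stored data for one OSD.  Assumes the OSD's
# admin socket is on this host and that the JSON fields are named as below
# (bluefs db_used_bytes, osd df kb_used); names can differ between releases.
import json
import subprocess

def osd_db_vs_data(osd_id: int) -> None:
    perf = json.loads(subprocess.check_output(
        ["ceph", "daemon", f"osd.{osd_id}", "perf", "dump"]))
    db_used = perf["bluefs"]["db_used_bytes"]

    osd_df = json.loads(subprocess.check_output(["ceph", "osd", "df", "-f", "json"]))
    node = next(n for n in osd_df["nodes"] if n["id"] == osd_id)
    data_used = node["kb_used"] * 1024

    print(f"osd.{osd_id}: DB {db_used / 1e9:.1f} GB, data {data_used / 1e9:.1f} GB, "
          f"ratio {db_used / data_used:.2%}")

osd_db_vs_data(0)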

The restriction to 3/30/300 is temporary and might be fixed in a future
release, so I'd just partition that disk into X DB devices.

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Thu, Jan 16, 2020 at 10:28 PM Bastiaan Visser  wrote:

> Dave made a good point: WAL + DB might end up a little over 60 GB, so I would
> probably go with ~70 GB partitions/LVs per OSD in your case (if the NVMe
> drive is smart enough to spread the writes over all available capacity, most
> recent NVMes are). I have not yet seen a WAL larger than, or even close to, a
> gigabyte.
>
> We don't even think about EC pools on clusters with less than 6 nodes
> (spindles; full SSD is another story).
> EC pools need more processing resources. We usually settle with 1 GB per TB
> of storage on replicated-only clusters, but when EC pools are involved, we
> add at least 50% to that. Also make sure your processors are up for it.
>
> Do not base your calculations on a healthy cluster -> build to fail.
> How long are you willing to be in a degraded state on node failure?
> Especially when using many large spindles, recovery time might be way longer
> than you think. 12 * 12 TB is 144 TB of storage; on a 4+2 EC pool you might
> end up with over 200 TB of traffic, and on a 10 Gbit network that's roughly
> two and a half days to recover. That is IF your processors are not a
> bottleneck due to EC parity calculations and all capacity is available for
> recovery (which is usually not the case; there is still production traffic
> that will eat up resources).
>
> On Thu, 16 Jan 2020 at 21:30,  wrote:
>
>> Dave;
>>
>> I don't like reading inline responses, so...
>>
>> I have zero experience with EC pools, so I won't pretend to give advice
>> in that area.
>>
>> I would think that small NVMe for DB would be better than nothing, but I
>> don't know.
>>
>> Once I got the hang of building clusters, it was relatively easy to wipe
>> a cluster out and rebuild it.  Perhaps you could take some time, and
>> benchmark different configurations?
>>
>> Thank you,
>>
>> Dominic L. Hilsbos, MBA
>> Director – Information Technology
>> Perform Air International Inc.
>> dhils...@performair.com
>> www.PerformAir.com
>>
>>
>> -Original Message-
>> From: Dave Hall [mailto:kdh...@binghamton.edu]
>> Sent: Thursday, January 16, 2020 1:04 PM
>> To: Dominic Hilsbos; ceph-users@lists.ceph.com
>> Subject: Re: [External Email] RE: [ceph-users] Beginner questions
>>
>> Dominic,
>>
>> We ended up with a 1.6TB PCIe NVMe in each node.  For 8 drives this
>> worked out to a DB size of something like 163GB per OSD. Allowing for
>> expansion to 12 drives brings it down to 124GB. So maybe just put the
>> WALs on NVMe and leave the DBs on the platters?
>>
>> Understood that we will want to move to more nodes rather than more
>> drives per node, but our funding is grant and donation based, so we may
>> end up adding drives in the short term.  The long term plan is to get to
>> separate MON/MGR/MDS nodes and 10s of OSD nodes.
>>
>> Due to our current low node count, we are considering erasure-coded PGs
>> rather than replicated in order to maximize usable space.  Any
>> guidelines or suggestions on this?
>>
>> Also, sorry for not replying inline.  I haven't done this much in a
>> while - I'll figure it out.
>>
>> Thanks.
>>
>> -Dave
>>
>> On 1/16/2020 2:48 PM, dhils...@performair.com wrote:
>> > Dave;
>> >
>> > I'd like to expand on this answer, briefly...
>> >
>> > The information in the docs is wrong.  There have been many discussions
>> about changing it, but no good alternative has been suggested, thus it
>> hasn't been changed.
>> >
>> > The 3rd party project that Ceph's BlueStore uses for its database
>> (RocksDB), apparently only uses DB sizes of 3GB, 30GB, and 300GB.  As Dave
>> mentions below, when RocksDB executes a compact operation, it creates a new
>> blob of the same target size, and writes the compacted data into it.  This
>> doubles the necessary space.  In addition, BlueStore places its Write Ahead
>> Log (WAL) into the fastest storage that is available to the OSD daemon, i.e.
>> NVMe 

Re: [ceph-users] [External Email] RE: Beginner questions

2020-01-16 Thread Bastiaan Visser
Dave made a good point: WAL + DB might end up a little over 60 GB, so I would
probably go with ~70 GB partitions/LVs per OSD in your case (if the NVMe
drive is smart enough to spread the writes over all available capacity, most
recent NVMes are). I have not yet seen a WAL larger than, or even close to, a
gigabyte.

We don't even think about EC pools on clusters with less than 6 nodes
(spindles; full SSD is another story).
EC pools need more processing resources. We usually settle with 1 GB per TB
of storage on replicated-only clusters, but when EC pools are involved, we
add at least 50% to that. Also make sure your processors are up for it.

Do not base your calculations on a healthy cluster -> build to fail.
How long are you willing to be in a degraded state on node failure?
Especially when using many large spindles, recovery time might be way longer
than you think. 12 * 12 TB is 144 TB of storage; on a 4+2 EC pool you might
end up with over 200 TB of traffic, and on a 10 Gbit network that's roughly
two and a half days to recover. That is IF your processors are not a
bottleneck due to EC parity calculations and all capacity is available for
recovery (which is usually not the case; there is still production traffic
that will eat up resources).
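
Spelling that estimate out as a small Python sketch (the EC traffic multiplier
and the effective network utilisation are assumptions picked to land near the
rough figures above):

# Rough recovery-time estimate for losing one node of 12 x 12 TB OSDs behind
# a 4+2 EC pool.  The traffic multiplier and the effective network
# utilisation are assumptions, not measured values.

TB = 10**12
raw_per_node = 12 * 12 * TB              # 144 TB of raw capacity per node

traffic_factor = 1.5                     # assumed EC rebuild overhead (reads + rewrites)
recovery_traffic = traffic_factor * raw_per_node   # > 200 TB

link_bps = 10e9                          # 10 Gbit/s network
effective = 0.75                         # assumed usable fraction of the link

seconds = recovery_traffic * 8 / (link_bps * effective)
print(f"~{recovery_traffic / TB:.0f} TB of traffic, ~{seconds / 86400:.1f} days to recover")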

On Thu, 16 Jan 2020 at 21:30,  wrote:

> Dave;
>
> I don't like reading inline responses, so...
>
> I have zero experience with EC pools, so I won't pretend to give advice in
> that area.
>
> I would think that small NVMe for DB would be better than nothing, but I
> don't know.
>
> Once I got the hang of building clusters, it was relatively easy to wipe a
> cluster out and rebuild it.  Perhaps you could take some time, and
> benchmark different configurations?
>
> Thank you,
>
> Dominic L. Hilsbos, MBA
> Director – Information Technology
> Perform Air International Inc.
> dhils...@performair.com
> www.PerformAir.com
>
>
> -Original Message-
> From: Dave Hall [mailto:kdh...@binghamton.edu]
> Sent: Thursday, January 16, 2020 1:04 PM
> To: Dominic Hilsbos; ceph-users@lists.ceph.com
> Subject: Re: [External Email] RE: [ceph-users] Beginner questions
>
> Dominic,
>
> We ended up with a 1.6TB PCIe NVMe in each node.  For 8 drives this
> worked out to a DB size of something like 163GB per OSD. Allowing for
> expansion to 12 drives brings it down to 124GB. So maybe just put the
> WALs on NVMe and leave the DBs on the platters?
>
> Understood that we will want to move to more nodes rather than more
> drives per node, but our funding is grant and donation based, so we may
> end up adding drives in the short term.  The long term plan is to get to
> separate MON/MGR/MDS nodes and 10s of OSD nodes.
>
> Due to our current low node count, we are considering erasure-coded PGs
> rather than replicated in order to maximize usable space.  Any
> guidelines or suggestions on this?
>
> Also, sorry for not replying inline.  I haven't done this much in a
> while - I'll figure it out.
>
> Thanks.
>
> -Dave
>
> On 1/16/2020 2:48 PM, dhils...@performair.com wrote:
> > Dave;
> >
> > I'd like to expand on this answer, briefly...
> >
> > The information in the docs is wrong.  There have been many discussions
> about changing it, but no good alternative has been suggested, thus it
> hasn't been changed.
> >
> > The 3rd party project that Ceph's BlueStore uses for its database
> (RocksDB), apparently only uses DB sizes of 3GB, 30GB, and 300GB.  As Dave
> mentions below, when RocksDB executes a compact operation, it creates a new
> blob of the same target size, and writes the compacted data into it.  This
> doubles the necessary space.  In addition, BlueStore places its Write Ahead
> Log (WAL) into the fastest storage that is available to the OSD daemon, i.e.
> NVMe if available.  Since this is done before the first compaction is
> requested, the WAL can force compaction onto slower storage.
> >
> > Thus, the numbers I've had floating around in my head for our next
> cluster are: 7GB, 66GB, and 630GB.  From all the discussion I've seen
> around RocksDB, those seem like good, common sense targets.  Pick the
> largest one that works for your setup.
> >
> > All that said... You would really want to pair a 600GB+ NVMe with 12TB
> drives, otherwise your DB is almost guaranteed to overflow onto the
> spinning drive, and affect performance.
> >
> > I became aware of most of this after we planned our clusters, so I
> haven't tried it, YMMV.
> >
> > One final note: more hosts and more spindles usually translate into
> better cluster-wide performance.  I can't predict how the relatively low
> client counts you're suggesting would impact that.
> >
> > Thank you,
> >
> > Dominic L. Hilsbos, MBA
> > Director – Information Technology
> > Perform Air International Inc.
> > dhils...@performair.com
> > www.PerformAir.com
> >
> >
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> Of Bastiaan Visser
> > Sent: Thursday, January 16, 2020 10:55 AM
> > To: Dave Hall
> > Cc: ceph-users@lists.ceph.com
> > 

Re: [ceph-users] [External Email] RE: Beginner questions

2020-01-16 Thread DHilsbos
Dave;

I don't like reading inline responses, so...

I have zero experience with EC pools, so I won't pretend to give advice in that 
area.

I would think that small NVMe for DB would be better than nothing, but I don't 
know.

Once I got the hang of building clusters, it was relatively easy to wipe a 
cluster out and rebuild it.  Perhaps you could take some time, and benchmark 
different configurations?

Thank you,

Dominic L. Hilsbos, MBA 
Director – Information Technology 
Perform Air International Inc.
dhils...@performair.com 
www.PerformAir.com


-Original Message-
From: Dave Hall [mailto:kdh...@binghamton.edu] 
Sent: Thursday, January 16, 2020 1:04 PM
To: Dominic Hilsbos; ceph-users@lists.ceph.com
Subject: Re: [External Email] RE: [ceph-users] Beginner questions

Dominic,

We ended up with a 1.6TB PCIe NVMe in each node.  For 8 drives this 
worked out to a DB size of something like 163GB per OSD. Allowing for 
expansion to 12 drives brings it down to 124GB. So maybe just put the 
WALs on NVMe and leave the DBs on the platters?

Understood that we will want to move to more nodes rather than more 
drives per node, but our funding is grant and donation based, so we may 
end up adding drives in the short term.  The long term plan is to get to 
separate MON/MGR/MDS nodes and 10s of OSD nodes.

Due to our current low node count, we are considering erasure-coded PGs 
rather than replicated in order to maximize usable space.  Any 
guidelines or suggestions on this?

Also, sorry for not replying inline.  I haven't done this much in a 
while - I'll figure it out.

Thanks.

-Dave

On 1/16/2020 2:48 PM, dhils...@performair.com wrote:
> Dave;
>
> I'd like to expand on this answer, briefly...
>
> The information in the docs is wrong.  There have been many discussions about 
> changing it, but no good alternative has been suggested, thus it hasn't been 
> changed.
>
> The 3rd party project that Ceph's BlueStore uses for its database (RocksDB), 
> apparently only uses DB sizes of 3GB, 30GB, and 300GB.  As Dave mentions 
> below, when RocksDB executes a compact operation, it creates a new blob of 
> the same target size, and writes the compacted data into it.  This doubles 
> the necessary space.  In addition, BlueStore places its Write Ahead Log (WAL) 
> into the fastest storage that is available to the OSD daemon, i.e. NVMe if
> available.  Since this is done before the first compaction is requested, the 
> WAL can force compaction onto slower storage.
>
> Thus, the numbers I've had floating around in my head for our next cluster 
> are: 7GB, 66GB, and 630GB.  From all the discussion I've seen around RocksDB, 
> those seem like good, common sense targets.  Pick the largest one that works 
> for your setup.
>
> All that said... You would really want to pair a 600GB+ NVMe with 12TB 
> drives, otherwise your DB is almost guaranteed to overflow onto the spinning 
> drive, and affect performance.
>
> I became aware of most of this after we planned our clusters, so I haven't 
> tried it, YMMV.
>
> > One final note: more hosts and more spindles usually translate into better
> cluster-wide performance.  I can't predict how the relatively low client
> counts you're suggesting would impact that.
>
> Thank you,
>
> Dominic L. Hilsbos, MBA
> Director – Information Technology
> Perform Air International Inc.
> dhils...@performair.com
> www.PerformAir.com
>
>
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> Bastiaan Visser
> Sent: Thursday, January 16, 2020 10:55 AM
> To: Dave Hall
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Beginner questions
>
> I would definitely go for Nautilus. There are quite some optimizations that
> went in after Mimic.
>
> Bluestore DB size usually ends up at either 30 or 60 GB.
> 30 GB is one of the sweet spots during normal operation. But during 
> compaction, ceph writes the new data before removing the old, hence the 60GB.
> The next sweet spot is 300/600 GB; any size between 60 and 300 GB will never be fully used (the space above 60 GB just sits idle).
>
> DB Usage is also dependent on ceph usage, object storage is known to use a 
> lot more db space than rbd images for example.
>
> On Thu, 16 Jan 2020 at 17:46, Dave Hall  wrote:
> Hello all.
> Sorry for the beginner questions...
> I am in the process of setting up a small (3 nodes, 288TB) Ceph cluster to 
> store some research data.  It is expected that this cluster will grow 
> significantly in the next year, possibly to multiple petabytes and 10s of 
> nodes.  At this time I'm expecting a relatively small number of clients, with 
> only one or two actively writing collected data - albeit at a high volume per 
> day.
> Currently I'm deploying on Debian 9 via ceph-ansible.
> Before I put this cluster into production I have a couple questions based on 
> my experience to date:
> Luminous, Mimic, or Nautilus?  I need stability for this deployment, so I am 
> sticking with Debian 9 since Debian 10 is fairly new, and I have been 
> hesitant to go 

Re: [ceph-users] [External Email] RE: Beginner questions

2020-01-16 Thread Dave Hall

Dominic,

We ended up with a 1.6TB PCIe NVMe in each node.  For 8 drives this 
worked out to a DB size of something like 163GB per OSD. Allowing for 
expansion to 12 drives brings it down to 124GB. So maybe just put the 
WALs on NVMe and leave the DBs on the platters?


Understood that we will want to move to more nodes rather than more 
drives per node, but our funding is grant and donation based, so we may 
end up adding drives in the short term.  The long term plan is to get to 
separate MON/MGR/MDS nodes and 10s of OSD nodes.


Due to our current low node count, we are considering erasure-coded PGs 
rather than replicated in order to maximize usable space.  Any 
guidelines or suggestions on this?
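
For context on the capacity trade-off being weighed here, a quick sketch of the
usable-space fractions (plain arithmetic; 288 TB is the raw capacity mentioned
earlier in the thread):

# Usable fraction of raw capacity: replication vs. erasure coding (plain
# arithmetic; 288 TB is the raw capacity of the 3-node cluster in this thread).

def replicated_efficiency(size: int) -> float:
    return 1 / size

def ec_efficiency(k: int, m: int) -> float:
    return k / (k + m)

raw_tb = 3 * 8 * 12   # 3 nodes x 8 drives x 12 TB = 288 TB raw

for label, eff in [("3x replication", replicated_efficiency(3)),
                   ("EC 4+2", ec_efficiency(4, 2))]:
    print(f"{label}: {eff:.0%} usable -> ~{raw_tb * eff:.0f} TB of {raw_tb} TB raw")

# Caveat: with host as the failure domain, a 4+2 pool needs at least 6 hosts,
# which is part of the "not below 6 nodes" advice elsewhere in this thread.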


Also, sorry for not replying inline.  I haven't done this much in a 
while - I'll figure it out.


Thanks.

-Dave

On 1/16/2020 2:48 PM, dhils...@performair.com wrote:

Dave;

I'd like to expand on this answer, briefly...

The information in the docs is wrong.  There have been many discussions about 
changing it, but no good alternative has been suggested, thus it hasn't been 
changed.

The 3rd party project that Ceph's BlueStore uses for its database (RocksDB), 
apparently only uses DB sizes of 3GB, 30GB, and 300GB.  As Dave mentions below, 
when RocksDB executes a compact operation, it creates a new blob of the same 
target size, and writes the compacted data into it.  This doubles the necessary 
space.  In addition, BlueStore places its Write Ahead Log (WAL) into the 
fastest storage that is available to the OSD daemon, i.e. NVMe if available.  
Since this is done before the first compaction is requested, the WAL can force 
compaction onto slower storage.

Thus, the numbers I've had floating around in my head for our next cluster are: 
7GB, 66GB, and 630GB.  From all the discussion I've seen around RocksDB, those 
seem like good, common sense targets.  Pick the largest one that works for your 
setup.
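
As a sketch of how figures like 7/66/630 GB can be derived from the 3/30/300 GB
levels (the headroom allowances below are back-of-the-envelope guesses chosen
to reproduce those numbers, not necessarily the exact arithmetic behind them):

# Sketch: turning the 3 / 30 / 300 GB RocksDB level sizes into DB partition
# targets.  Compaction can briefly need ~2x a level (old + new data live at
# the same time), plus some room for the WAL and the smaller levels below.
# The headroom values are illustrative guesses that happen to reproduce the
# 7 / 66 / 630 GB figures mentioned above.

levels_gb = [3, 30, 300]
headroom_gb = {3: 1, 30: 6, 300: 30}   # assumed WAL + lower-level allowance

for level in levels_gb:
    target = 2 * level + headroom_gb[level]
    print(f"{level} GB level -> ~{target} GB DB partition")   # 7, 66, 630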

All that said... You would really want to pair a 600GB+ NVMe with 12TB drives, 
otherwise your DB is almost guaranteed to overflow onto the spinning drive, and 
affect performance.

I became aware of most of this after we planned our clusters, so I haven't 
tried it, YMMV.

One final note: more hosts and more spindles usually translate into better 
cluster-wide performance.  I can't predict how the relatively low client 
counts you're suggesting would impact that.

Thank you,

Dominic L. Hilsbos, MBA
Director – Information Technology
Perform Air International Inc.
dhils...@performair.com
www.PerformAir.com


From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Bastiaan Visser
Sent: Thursday, January 16, 2020 10:55 AM
To: Dave Hall
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Beginner questions

I would definitely go for Nautilus. There are quite some optimizations that 
went in after Mimic.

Bluestore DB size usually ends up at either 30 or 60 GB.
30 GB is one of the sweet spots during normal operation. But during compaction, 
ceph writes the new data before removing the old, hence the 60GB.
The next sweet spot is 300/600 GB; any size between 60 and 300 GB will never be fully used (the space above 60 GB just sits idle).

DB Usage is also dependent on ceph usage, object storage is known to use a lot 
more db space than rbd images for example.
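
Putting those sweet spots together with a shared NVMe, a minimal sketch of the
sizing decision (the drive size and OSD counts are the ones discussed in this
thread; the snap-down rule just encodes the argument above):

# Sketch: split one shared NVMe across the OSDs in a node and snap the
# per-OSD DB partition down to the nearest useful size, per the sweet-spot
# argument above (space between the steps goes unused).

TB, GB = 10**12, 10**9
SWEET_SPOTS_GB = [30, 60, 300, 600]

def db_partition_gb(nvme_bytes: float, osds: int) -> int:
    per_osd_gb = nvme_bytes / osds / GB
    usable = [s for s in SWEET_SPOTS_GB if s <= per_osd_gb]
    return max(usable) if usable else 0

nvme = 1.6 * TB                      # the PCIe NVMe mentioned in this thread
for osds in (8, 12):
    raw_share = nvme / osds / GB     # decimal GB, before any reserved space
    print(f"{osds} OSDs: ~{raw_share:.0f} GB each -> use a "
          f"{db_partition_gb(nvme, osds)} GB DB partition")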

On Thu, 16 Jan 2020 at 17:46, Dave Hall  wrote:
Hello all.
Sorry for the beginner questions...
I am in the process of setting up a small (3 nodes, 288TB) Ceph cluster to 
store some research data.  It is expected that this cluster will grow 
significantly in the next year, possibly to multiple petabytes and 10s of 
nodes.  At this time I'm expecting a relatively small number of clients, with 
only one or two actively writing collected data - albeit at a high volume per 
day.
Currently I'm deploying on Debian 9 via ceph-ansible.
Before I put this cluster into production I have a couple questions based on my 
experience to date:
Luminous, Mimic, or Nautilus?  I need stability for this deployment, so I am 
sticking with Debian 9 since Debian 10 is fairly new, and I have been hesitant 
to go with Nautilus.  Yet Mimic seems to have had a hard road on Debian but for 
the efforts at Croit.
• Statements on the Releases page are now making more sense to me, but I would 
like to confirm that Nautilus is the right choice at this time?
Bluestore DB size:  My nodes currently have 8 x 12TB drives (plus 4 empty bays) 
and a PCIe NVMe drive.  If I understand the suggested calculation correctly, 
the DB size for a 12 TB Bluestore OSD would be 480GB.  If my NVMe isn't big 
enough to provide this size, should I skip provisioning the DBs on the NVMe, or 
should I give each OSD 1/12th of what I have available?  Also, should I try to 
shift budget a bit to get more NVMe as soon as I can, and redo the OSDs when 
sufficient NVMe is available?
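
For reference, the calculation in question works out as follows (a small
sketch; as noted elsewhere in the thread, the 4% figure is a worst-case number
from the docs):

# Sketch: the docs' 4% rule-of-thumb vs. splitting the node's NVMe 12 ways.
TB, GB = 10**12, 10**9

suggested_db = 0.04 * 12 * TB        # 4% of a 12 TB OSD
per_osd_share = 1.6 * TB / 12        # "1/12th of what I have available"

print(f"Suggested DB per OSD: {suggested_db / GB:.0f} GB")   # 480 GB
print(f"NVMe split 12 ways:   {per_osd_share / GB:.0f} GB")  # ~133 GB
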
Thanks.
-Dave
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [External Email] Re: Beginner questions

2020-01-16 Thread Dave Hall

Paul, Bastiaan,

Thank you for your responses and for alleviating my concerns about 
Nautilus.  The good news is that I can still easily move up to Debian 
10.  BTW, I assume that this is still with the 4.19 kernel?


Also, I'd like to inject additional customizations into my Debian 
configs via ceph-ansible - certain sysctls, ntp servers, and some 
additional packages.  Is anybody doing that, and could you share any 
hints on where to configure it?


Thanks.

-Dave

Dave Hall
Binghamton University
kdh...@binghamton.edu
607-760-2328 (Cell)
607-777-4641 (Office)


On 1/16/2020 2:30 PM, Paul Emmerich wrote:
Don't use Mimic; support for it is far worse than for Nautilus or 
Luminous. I think we were the only company who built a product around 
Mimic; both Redhat and Suse enterprise storage were Luminous and then 
Nautilus, skipping Mimic entirely.


We only offered Mimic as a default for a limited time and immediately 
moved to Nautilus as it became available and Nautilus + Debian 10 has 
been great for us.
Mimic and Debian 9 was... well, hacked together, due to the gcc 
backport issues. That's not to say that it doesn't work, in fact Mimic 
(> 13.2.2) and Debian 9 worked perfectly fine for us.


Our Debian 10 and Nautilus packages are just so much better and more 
stable than Debian 9 + Mimic because we don't need to do weird things 
with Debian.
Check the mailing list for old posts around the Mimic release by me to 
see how we did that build. It's not pretty, but it was the only way to 
use Ceph >= Mimic on Debian 9.

All that mess has been eliminated with Debian 10.

Paul

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io 
Tel: +49 89 1896585 90


On Thu, Jan 16, 2020 at 6:55 PM Bastiaan Visser wrote:


I would definitely go for Nautilus. There are quite some
optimizations that went in after Mimic.

Bluestore DB size usually ends up at either 30 or 60 GB.
30 GB is one of the sweet spots during normal operation. But
during compaction, ceph writes the new data before removing the
old, hence the 60GB.
The next sweet spot is 300/600 GB; any size between 60 and 300 GB
will never be fully used (the space above 60 GB just sits idle).

DB Usage is also dependent on ceph usage, object storage is known
to use a lot more db space than rbd images for example.

On Thu, 16 Jan 2020 at 17:46, Dave Hall <kdh...@binghamton.edu> wrote:

Hello all.

Sorry for the beginner questions...

I am in the process of setting up a small (3 nodes, 288TB)
Ceph cluster to store some research data.  It is expected that
this cluster will grow significantly in the next year,
possibly to multiple petabytes and 10s of nodes.  At this time
I'm expecting a relatively small number of clients, with only
one or two actively writing collected data - albeit at a high
volume per day.

Currently I'm deploying on Debian 9 via ceph-ansible.

Before I put this cluster into production I have a couple
questions based on my experience to date:

Luminous, Mimic, or Nautilus?  I need stability for this
deployment, so I am sticking with Debian 9 since Debian 10 is
fairly new, and I have been hesitant to go with Nautilus.  Yet
Mimic seems to have had a hard road on Debian but for the
efforts at Croit.

  * Statements on the Releases page are now making more sense
to me, but I would like to confirm that Nautilus is the
right choice at this time?

Bluestore DB size:  My nodes currently have 8 x 12TB drives
(plus 4 empty bays) and a PCIe NVMe drive.  If I understand
the suggested calculation correctly, the DB size for a 12 TB
Bluestore OSD would be 480GB.  If my NVMe isn't big enough to
provide this size, should I skip provisioning the DBs on the
NVMe, or should I give each OSD 1/12th of what I have
available?  Also, should I try to shift budget a bit to get
more NVMe as soon as I can, and redo the OSDs when sufficient
NVMe is available?

Thanks.

-Dave

___
ceph-users mailing list
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com