Re: [ceph-users] How do you deal with "clock skew detected"?

2019-05-15 Thread Richard Hesketh
Another option would be adding a boot time script which uses ntpdate (or
something) to force an immediate sync with your timeservers before ntpd
starts - this is actually suggested in ntpdate's man page!
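
For example, a minimal oneshot unit along these lines could do it (unit name, paths and the timeserver are illustrative, not a tested recipe):

# /etc/systemd/system/clock-step.service (hypothetical)
[Unit]
Description=Step the clock with ntpdate before ntpd starts
After=network-online.target
Wants=network-online.target
Before=ntp.service      # or ntpd.service, depending on the distro

[Service]
Type=oneshot
# -b forces the clock to be stepped rather than slewed
ExecStart=/usr/sbin/ntpdate -b ntp.example.org

[Install]
WantedBy=multi-user.target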

Rich

On 15/05/2019 13:00, Marco Stuurman wrote:
> Hi Yenya,
> 
> You could try to synchronize the system clock to the hardware clock
> before rebooting. Also try chrony, it catches up very fast.
> 
> 
> Kind regards,
> 
> Marco Stuurman
> 
> 
> On Wed, 15 May 2019 at 13:48, Jan Kasprzak wrote:
> 
>         Hello, Ceph users,
> 
> how do you deal with the "clock skew detected" HEALTH_WARN message?
> 
> I think the internal RTC in most x86 servers has only 1-second resolution,
> but the Ceph skew limit is much smaller than that. So every time I reboot
> one of my mons (for a kernel upgrade or something), I have to wait several
> minutes for the system clock to synchronize over NTP, even though ntpd
> was running before the reboot and was started again during system boot.
> 
> Thanks,
> 
> -Yenya





Re: [ceph-users] Should ceph build against libcurl4 for Ubuntu 18.04 and later?

2018-11-22 Thread Richard Hesketh
Bionic's mimic packages do seem to depend on libcurl4 already, for what
that's worth:

root@vm-gw-1:/# apt-cache depends ceph-common
ceph-common
...
  Depends: libcurl4


On 22/11/2018 12:40, Matthew Vernon wrote:
> Hi,
> 
> The ceph.com ceph luminous packages for Ubuntu Bionic still depend on
> libcurl3 (specifically ceph-common, radosgw, librgw2 all depend on
> libcurl3 (>= 7.28.0)).
> 
> This means that anything that depends on libcurl4 (which is the default
> libcurl in bionic) isn't co-installable with ceph. That includes the
> "curl" binary itself, which we've been using in a number of our scripts
> / tests / etc. I would expect this to make ceph-test uninstallable on
> Bionic also...
> 
> ...so shouldn't ceph packages for Bionic and later releases be compiled
> against libcurl4 (and thus Depend upon it)? The same will apply to the
> next Debian release, I expect.
> 
> The curl authors claim the API doesn't have any incompatible changes.
> 
> Regards,
> 
> Matthew
> [the two packages libcurl3 and libcurl4 are not co-installable because
> libcurl3 includes a libcurl.so.4 for historical reasons :-( ]
> 
> 






Re: [ceph-users] WAL/DB size

2018-09-07 Thread Richard Hesketh
It can get confusing.

There will always be a WAL, and there will always be a metadata DB, for
a bluestore OSD. However, if a separate device is not specified for the
WAL, it is kept in the same device/partition as the DB; in the same way,
if a separate device is not specified for the DB, it is kept on the same
device as the actual data (an "all-in-one" OSD). Unless you have a
separate, even faster device for the WAL to go on, you shouldn't specify
it separately from the DB; just make one partition on your SSD per OSD,
and make them as large as will fit together on the SSD.
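
For instance, something along these lines (device names and the 60G size are only examples) gives each OSD its own DB partition, with the WAL colocated automatically:

sgdisk --new=1:0:+60G /dev/sde    # one partition per OSD, as large as will fit
sgdisk --new=2:0:+60G /dev/sde    # ...and so on for each OSD sharing this SSD
ceph-volume lvm create --bluestore --data /dev/sda --block.db /dev/sde1
ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/sde2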

Also, just to be clear, the WAL is not exactly a journal in the same way
that Filestore required a journal. Because Bluestore can provide write
atomicity without requiring a separate journal, data is *usually*
written directly to the long-term storage; writes are only journalled in
the WAL, to be flushed/synced later, if they're below a certain size (IIRC
32KB by default), to avoid the latency of excessive seeking on HDDs.

Rich

On 07/09/18 14:23, Muhammad Junaid wrote:
> Thanks again, but sorry again too. I couldn't understand the following.
> 
> 1. As per the docs, block.db is used only for bluestore metadata (file system 
> metadata info etc). It has nothing to do with actual data (for journaling) 
> which will ultimately be written to the slower disks. 
> 2. How will actual journaling work if there is no WAL (as you
> suggested)?
> 
> Regards.
> 
> On Fri, Sep 7, 2018 at 6:09 PM Alfredo Deza  > wrote:
> 
> On Fri, Sep 7, 2018 at 9:02 AM, Muhammad Junaid
> mailto:junaid.fsd...@gmail.com>> wrote:
> > Thanks Alfredo. Just to clear that My configuration has 5 OSD's
> (7200 rpm
> > SAS HDDS) which are slower than the 200G SSD. Thats why I asked
> for a 10G
> > WAL partition for each OSD on the SSD.
> >
> > Are you asking us to do 40GB  * 5 partitions on SSD just for block.db?
> 
> Yes.
> 
> You don't need a separate WAL defined. It only makes sense when you
> have something *faster* than where block.db will live.
> 
> In your case 'data' will go in the slower spinning devices, 'block.db'
> will go in the SSD, and there is no need for WAL. You would only
> benefit
> from WAL if you had another device, like an NVMe, where 2GB partitions
> (or LVs) could be created for block.wal
> 
> 
> >
> > On Fri, Sep 7, 2018 at 5:36 PM Alfredo Deza  > wrote:
> >>
> >> On Fri, Sep 7, 2018 at 8:27 AM, Muhammad Junaid
> mailto:junaid.fsd...@gmail.com>>
> >> wrote:
> >> > Hi there
> >> >
> >> > Asking the questions as a newbie. May be asked a number of
> times before
> >> > by
> >> > many but sorry, it is not clear yet to me.
> >> >
> >> > 1. The WAL device is just like journaling device used before
> bluestore.
> >> > And
> >> > CEPH confirms Write to client after writing to it (Before
> actual write
> >> > to
> >> > primary device)?
> >> >
> >> > 2. If we have lets say 5 OSD's (4 TB SAS) and 1 200GB SSD.
> Should we
> >> > partition SSD in 10 partitions? Should/Can we set WAL Partition Size
> >> > against
> >> > each OSD as 10GB? Or what min/max we should set for WAL
> Partition? And
> >> > can
> >> > we set remaining 150GB as (30GB * 5) for 5 db partitions for
> all OSD's?
> >>
> >> A WAL partition would only help if you have a device faster than the
> >> SSD where the block.db would go.
> >>
> >> We recently updated our sizing recommendations for block.db: at least
> >> 4% of the size of block (also referenced as the data device):
> >>
> >>
> >>
> 
> http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#sizing
> >>
> >> In your case, what you want is to create 5 logical volumes from your
> >> 200GB at 40GB each, without a need for a WAL device.
> >>
> >>
> >> >
> >> > Thanks in advance. Regards.
> >> >
> >> > Muhammad Junaid
> >> >
> >> > ___
> >> > ceph-users mailing list
> >> > ceph-users@lists.ceph.com 
> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> >
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 






Re: [ceph-users] BlueStore performance: SSD vs on the same spinning disk

2018-08-07 Thread Richard Hesketh
On 07/08/18 17:10, Robert Stanford wrote:
> 
>  I was surprised to see an email on this list a couple of days ago,
> which said that write performance would actually fall with BlueStore.  I
> thought the reason BlueStore existed was to increase performance. 
> Nevertheless, it seems like filestore is going away and everyone should
> upgrade.
> 
>  My question is: I have SSDs for filestore journals, for spinning OSDs. 
> When upgrading to BlueStore, am I better off using the SSDs for wal/db,
> or am I better off keeping everything (data, wal, db) on the spinning
> disks (from a performance perspective)?
> 
>  Thanks
>   R

Your performance will always be better if you put journals/WAL/DB on
faster storage. An all-in-one HDD Bluestore OSD is not more performant
than a properly set up SSD/HDD split OSD. You do, however, need to take
care to make sure that you appropriately size the partitions on the SSD
that you designate as the DB devices - if you leave it naively to the
default settings and trust the tool to do it, you'll probably end up
with tiny partitions that are no use (unless that's changed since the
last time I looked).
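
As a sketch of working around that (the value is only an example), you can pin the DB partition size in ceph.conf before running ceph-disk, so the tool doesn't fall back to its tiny default:

[global]
# ~60 GiB block.db partitions instead of the default sliver
bluestore_block_db_size = 64424509440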

Bluestore's performance is better overall than Filestore's. Write
performance is potentially worse than filestore largely in the specific
case of a filestore OSD with an SSD journal (compared to a bluestore OSD
with an SSD DB) subject to bursty, high bandwidth writes - as *all*
writes to filestore hit the journal first and ack once committed to
journal, that's potentially faster than bluestore which will usually
write direct to main storage on the data HDD and so acks slower. If
writing was constant, you would still hit a bottleneck when the
filestore journal has to be flushed to HDD, at which point performance
on bluestore wins out again because it writes to the long-term storage
more efficiently than filestore does.





Re: [ceph-users] Best way to replace OSD

2018-08-06 Thread Richard Hesketh
I would have thought that with the write endurance on modern SSDs,
additional write wear from the occasional rebalance would honestly be
negligible? If you're hitting them hard enough that you're actually
worried about your write endurance, a rebalance or two is peanuts
compared to your normal I/O. If you're not, then there's more than
enough write endurance in an SSD to handle daily rebalances for years.

On 06/08/18 17:05, Reed Dier wrote:
> This has been my modus operandi when replacing drives.
> 
> Only having ~50 OSD’s for each drive type/pool, rebalancing can be a lengthy 
> process, and in the case of SSD’s, shuffling data adds unnecessary write wear 
> to the disks.
> 
> When migrating from filestore to bluestore, I would actually forklift an 
> entire failure domain using the below script, and the noout, norebalance, 
> norecover flags.
> 
> This would keep crush from pushing data around until I had all of the drives 
> replaced, and would then keep crush from trying to recover until I was ready.
> 
>> # $1 = OSD id (the numeric id of osd.$ID)
>> # $2 = data device, e.g. /dev/sdx
>> # $3 = NVMe DB partition, e.g. /dev/nvmeXnXpX
>>
>> sudo systemctl stop ceph-osd@$1.service
>> sudo ceph-osd -i $1 --flush-journal
>> sudo umount /var/lib/ceph/osd/ceph-$1
>> sudo ceph-volume lvm zap /dev/$2
>> ceph osd crush remove osd.$1
>> ceph auth del osd.$1
>> ceph osd rm osd.$1
>> sudo ceph-volume lvm create --bluestore --data /dev/$2 --block.db /dev/$3
> 
> For a single drive, this would stop it, remove it from crush, make a new one 
> (and let it retake the old/existing osd.id), and then after I unset the 
> norebalance/norecover flags, then it backfills from the other copies to the 
> replaced drive, and doesn’t move data around.
> That script is specific for filestore to bluestore somewhat, as the 
> flush-journal command is no longer used in bluestore.
> 
> Hope thats helpful.
> 
> Reed
> 
>> On Aug 6, 2018, at 9:30 AM, Richard Hesketh  
>> wrote:
>>
>> Waiting for rebalancing is considered the safest way, since it ensures
>> you retain your normal full number of replicas at all times. If you take
>> the disk out before rebalancing is complete, you will be causing some
>> PGs to lose a replica. That is a risk to your data redundancy, but it
>> might be an acceptable one if you prefer to just get the disk replaced
>> quickly.
>>
>> Personally, if running at 3+ replicas, briefly losing one isn't the end
>> of the world; you'd still need two more simultaneous disk failures to
>> actually lose data, though one failure would cause inactive PGs (because
>> you are running with min_size >= 2, right?). If running pools with only
>> two replicas at size = 2 I absolutely would not remove a disk without
>> waiting for rebalancing unless that disk was actively failing so badly
>> that it was making rebalancing impossible.
>>
>> Rich
>>
>> On 06/08/18 15:20, Josef Zelenka wrote:
>>> Hi, our procedure is usually (assuming the cluster was OK before the
>>> failure, with 2 replicas as the crush rule):
>>>
>>> 1.Stop the OSD process(to keep it from coming up and down and putting
>>> load on the cluster)
>>>
>>> 2. Wait for the "Reweight" to come to 0(happens after 5 min i think -
>>> can be set manually but i let it happen by itself)
>>>
>>> 3. remove the osd from cluster(ceph auth del, ceph osd crush remove,
>>> ceph osd rm)
>>>
>>> 4. note down the journal partitions if needed
>>>
>>> 5. umount drive, replace the disk with new one
>>>
>>> 6. ensure permissions are set to ceph:ceph in /dev
>>>
>>> 7. mklabel gpt on the new drive
>>>
>>> 8. create the new osd with ceph-disk prepare(automatically adds it to
>>> the crushmap)
>>>
>>>
>>> your procedure sounds reasonable to me, as far as i'm concerned you
>>> shouldn't have to wait for rebalancing after you remove the osd. all
>>> this might not be 100% per ceph books but it works for us :)
>>>
>>> Josef
>>>
>>>
>>> On 06/08/18 16:15, Iztok Gregori wrote:
>>>> Hi Everyone,
>>>>
>>>> Which is the best way to replace a failing (SMART Health Status:
>>>> HARDWARE IMPENDING FAILURE) OSD hard disk?
>>>>
>>>> Normally I will:
>>>>
>>>> 1. set the OSD as out
>>>> 2. wait for rebalancing
>>>> 3. stop the OSD on the osd-server (unmount if needed)
>>>> 4. purge the OSD from CEPH
>>>> 5. physically replace the disk with the new one

Re: [ceph-users] Best way to replace OSD

2018-08-06 Thread Richard Hesketh
Waiting for rebalancing is considered the safest way, since it ensures
you retain your normal full number of replicas at all times. If you take
the disk out before rebalancing is complete, you will be causing some
PGs to lose a replica. That is a risk to your data redundancy, but it
might be an acceptable one if you prefer to just get the disk replaced
quickly.

Personally, if running at 3+ replicas, briefly losing one isn't the end
of the world; you'd still need two more simultaneous disk failures to
actually lose data, though one failure would cause inactive PGs (because
you are running with min_size >= 2, right?). If running pools with only
two replicas at size = 2 I absolutely would not remove a disk without
waiting for rebalancing unless that disk was actively failing so badly
that it was making rebalancing impossible.

Rich

On 06/08/18 15:20, Josef Zelenka wrote:
> Hi, our procedure is usually (assuming the cluster was OK before the
> failure, with 2 replicas as the crush rule):
> 
> 1.Stop the OSD process(to keep it from coming up and down and putting
> load on the cluster)
> 
> 2. Wait for the "Reweight" to come to 0(happens after 5 min i think -
> can be set manually but i let it happen by itself)
> 
> 3. remove the osd from cluster(ceph auth del, ceph osd crush remove,
> ceph osd rm)
> 
> 4. note down the journal partitions if needed
> 
> 5. umount drive, replace the disk with new one
> 
> 6. ensure permissions are set to ceph:ceph in /dev
> 
> 7. mklabel gpt on the new drive
> 
> 8. create the new osd with ceph-disk prepare(automatically adds it to
> the crushmap)
> 
> 
> your procedure sounds reasonable to me, as far as i'm concerned you
> shouldn't have to wait for rebalancing after you remove the osd. all
> this might not be 100% per ceph books but it works for us :)
> 
> Josef
> 
> 
> On 06/08/18 16:15, Iztok Gregori wrote:
>> Hi Everyone,
>>
>> Which is the best way to replace a failing (SMART Health Status:
>> HARDWARE IMPENDING FAILURE) OSD hard disk?
>>
>> Normally I will:
>>
>> 1. set the OSD as out
>> 2. wait for rebalancing
>> 3. stop the OSD on the osd-server (unmount if needed)
>> 4. purge the OSD from CEPH
>> 5. physically replace the disk with the new one
>> 6. with ceph-deploy:
>> 6a   zap the new disk (just in case)
>> 6b   create the new OSD
>> 7. add the new osd to the crush map.
>> 8. wait for rebalancing.
>>
>> My questions are:
>>
>> - Is my procedure reasonable?
>> - What if I skip the #2 and instead to wait for rebalancing I directly
>> purge the OSD?
>> - Is better to reweight the OSD before take it out?
>>
>> I'm running a Luminous (12.2.2) cluster with 332 OSDs, failure domain
>> is host.
>>
>> Thanks,
>> Iztok
>>
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com






Re: [ceph-users] Bluestore : Where is my WAL device ?

2018-06-05 Thread Richard Hesketh
On 05/06/18 14:49, rafael.diazmau...@univ-rennes1.fr wrote:
> Hello,
> 
> I run proxmox 5.2 with ceph 12.2 (bluestore).
> 
> I've created an OSD on a Hard Drive (/dev/sda) and tried to put both WAL and 
> Journal on a SSD part (/dev/sde1) like this :
> pveceph createosd /dev/sda --wal_dev /dev/sde1 --journal_dev /dev/sde1
> 
> It automatically creates 2 partitions on the hard drive /dev/sda (the first one, 
> /dev/sda1, called Ceph data and a 2nd one, /dev/sda2, called Ceph block).
> 
> My problem is when I browse the OSD (/var/lib/ceph/osd/ceph-0/) I can't find 
> the symlink to the WAL device.
> -I see a symlink to the block.db (journal on /dev/sde1 as expected)
> -and another symlink to block that points to Hard Drive /dev/sda2 
> (by-partuuid)
> -but no block.wal at all...
> 
> Can you explain how to verify whether the WAL is really on /dev/sde1?
> And if not, how to put the WAL and the journal on the same SSD partition?

If you're trying to put the metadata DB and the WAL on the same device, don't 
bother specifying them separately. Only specify the DB device; the WAL will 
automatically colocate with the rest of the metadata if it's not otherwise 
specified. When they are located together, you won't see a separate WAL symlink 
in the OSD folder; it sounds like since you specified both WAL and DB should be 
on the same partition, the tool was smart enough to not mess it up.
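
If you want to double-check where things ended up (osd.0 is used as an example here), the OSD directory and the bluestore labels both show it:

ls -l /var/lib/ceph/osd/ceph-0/        # block and block.db symlinks; no block.wal means the WAL lives in the DB
ceph-bluestore-tool show-label --path /var/lib/ceph/osd/ceph-0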

Rich





Re: [ceph-users] Ceph Jewel and Ubuntu 16.04

2018-04-17 Thread Richard Hesketh
Yes, that's what I did - as long as you don't touch the OSD data/partitions, 
they should come up again just fine once you reinstall ceph. I expect that once the 
mandatory switch to ceph-volume finally happens this process might get a little 
more complicated, but for jewel you're still using ceph-disk, so it's just udev 
rules that identify the OSDs and bring them up at start.

Rich

On 17/04/18 15:50, Shain Miley wrote:
> Rich,
> 
> Thank you for the information.  Are you suggesting that I do a clean install 
> of the OS partition...reinstall the Ceph packages, etc...and then Ceph should 
> be able to find the OSDs and all the original data.
> 
> I did do something similar one time when we had a MBR failure on a single 
> node and it would no longer boot up..however that was quite some time ago.
> 
> As far as I remember Ceph found all the OSD data just fine and everything did 
> startup in a good state post OS reinstall. 
> 
> Thanks again for your help on this issue.
> 
> Shain
> 
> 
> On 04/17/2018 06:00 AM, Richard Hesketh wrote:
>> On 16/04/18 18:32, Shain Miley wrote:
>>> Hello,
>>>
>>> We are currently running Ceph Jewel (10.2.10) on Ubuntu 14.04 in 
>>> production.  We have been running into a kernel panic bug off and on for a 
>>> while and I am starting to look into upgrading as a possible solution.  We 
>>> are currently running version 4.4.0-31-generic kernel on these servers and 
>>> have run across this same issue on multiple nodes over the course of the 
>>> last 12 months.
>>>
>>> My 3 questions are as follows:
>>>
>>> 1)Are there any known issues upgrading a Jewel cluster to 16.04?
>>>
>>> 2)At this point is Luminous stable enough to consider upgrading as well?
>>>
>>> 3)If I did decide to upgrade both Ubuntu and Ceph..in which order should 
>>> the upgrade occur?
>>>
>>> Thanks in advance for any insight that you are able to provide me on this 
>>> issue!
>>>
>>> Shain
>> When Luminous came out I did a multipart upgrade where I took my cluster 
>> from 14.04 to 16.04 (HWE kernel line), updated from jewel to luminous and 
>> migrated OSDs from filestore to bluestore (in that order). I had absolutely 
>> no issues throughout the process. The only thing I would suggest is that you 
>> reinstall your nodes to 16.04 rather than release-upgrading - previous 
>> experience with trying to release-upgrade on other hosts was sometimes 
>> painful, rebuilding was easier.
>>
>> Rich





Re: [ceph-users] Ceph Jewel and Ubuntu 16.04

2018-04-17 Thread Richard Hesketh
On 16/04/18 18:32, Shain Miley wrote:
> Hello,
> 
> We are currently running Ceph Jewel (10.2.10) on Ubuntu 14.04 in production.  
> We have been running into a kernel panic bug off and on for a while and I am 
> starting to look into upgrading as a possible solution.  We are currently 
> running version 4.4.0-31-generic kernel on these servers and have run across 
> this same issue on multiple nodes over the course of the last 12 months.
> 
> My 3 questions are as follows:
> 
> 1)Are there any known issues upgrading a Jewel cluster to 16.04?
> 
> 2)At this point is Luminous stable enough to consider upgrading as well?
> 
> 3)If I did decide to upgrade both Ubuntu and Ceph..in which order should the 
> upgrade occur?
> 
> Thanks in advance for any insight that you are able to provide me on this 
> issue!
> 
> Shain

When Luminous came out I did a multipart upgrade where I took my cluster from 
14.04 to 16.04 (HWE kernel line), updated from jewel to luminous and migrated 
OSDs from filestore to bluestore (in that order). I had absolutely no issues 
throughout the process. The only thing I would suggest is that you reinstall 
your nodes to 16.04 rather than release-upgrading - previous experience with 
trying to release-upgrade on other hosts was sometimes painful, rebuilding was 
easier.

Rich





Re: [ceph-users] Fwd: Separate --block.wal --block.db bluestore not working as expected.

2018-04-10 Thread Richard Hesketh
No, you shouldn't invoke it that way; just don't specify a WAL device 
at all if you want it to be stored with the DB - if not otherwise specified, the 
WAL is automatically stored with the other metadata on the DB device. You 
should do something like:

ceph-volume lvm prepare --bluestore --data /dev/sdc --block.db /dev/sda1

Rich

On 09/04/18 09:19, Hervé Ballans wrote:
> Hi,
> 
> Just a little question regarding this operation :
> 
> [root@osdhost osd]# ceph-volume lvm prepare --bluestore --data /dev/sdc 
> --block.wal /dev/sda2 --block.db /dev/sda1
> 
> On a previous post, I understood that if both wal and db are stored on the 
> same separate device, then we could use a single partition for both...which 
> means we could do :
> 
> # ceph-volume lvm prepare --bluestore --data /dev/sdc --block.wal /dev/sda1 
> --block.db /dev/sda1
> 
> and so on with another unique wal/db partition for each other OSD...
> 
> Did I get that correctly ?
> 
> Thanks,
> 
> Hervé
> 
> 
> Le 07/04/2018 à 17:59, Gary Verhulp a écrit :
>>
>> I’m trying to create bluestore osds with separate --block.wal --block.db 
>> devices on a write intensive SSD
>>
>> I’ve split the SSD (/dev/sda) into two partditions sda1 and sda2 for db and 
>> wal
>>
>>  
>>
>> It seems to me the osd uuid is getting changed and I'm only able to start the 
>> last OSD
>>
>> Do I need to create a new partition or logical volume on the SSD for each 
>> OSD?
>>
>> I’m sure this is a simple fail in my understanding of how it is supposed to 
>> be provisioned.
>>
>> Any advice would be appreciated.
>>
>> Thanks,
>>
>> Gary





Re: [ceph-users] split brain case

2018-03-29 Thread Richard Hesketh
On 29/03/18 09:25, ST Wong (ITSC) wrote:
> Hi all,
> 
> We put 8 (4+4) OSD and 5 (2+3) MON servers in server rooms in 2 buildings for 
> redundancy.  The buildings are connected through direct connection.
> 
> While servers in each building have alternate uplinks, what will happen in 
> case the link between the buildings is broken (application servers in each 
> server room will continue to write to OSDs in the same room)?
> 
> Thanks a lot.

The 3 mons in your second building will be able to remain quorate (as 3 is a 
majority of 5) and keep running the cluster. The other 2 mons will refuse to do 
anything since they can't find enough other monitors to form quorum. For PGs 
that have enough replicas in the 3-mon building to be above min_size, they will 
continue to serve I/O; however, PGs with less than min_size copies available 
will block I/O until you either bring the link back, or the missing OSDs are 
manually/automatically marked out and enough time passes for them to recover up 
to enough replicas on the working side. As far as anything in the 2-mon 
building is concerned ceph will be entirely nonfunctional. Recovery would 
propagate any changes made on the working side when the link comes back up.

Ceph is designed to avoid split brain scenarios to protect data consistency, 
but the consequence is that if your cluster does get partitioned, a lot of it 
may stop working. You can design crush rules to help mitigate impact in the 
working part (for instance making sure that every PG places enough copies of 
itself on the 3-mon side that it will be able to continue serving I/O if the 
other building is lost) but you will never have a situation where the cluster 
is split into two and both sides continue operating and then join back up.
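
As a very rough sketch of that idea (bucket names are hypothetical, and any real rule should be checked with crushtool before use), a replicated rule for a size=4, min_size=2 pool could pin two copies to each building, so the 3-mon side always keeps enough replicas to stay active:

rule two_per_building {
    id 1
    type replicated
    min_size 2
    max_size 4
    step take buildingA        # the building with 3 mons
    step chooseleaf firstn 2 type host
    step emit
    step take buildingB
    step chooseleaf firstn -2 type host
    step emit
}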

Rich





Re: [ceph-users] Bluestore bluestore_prefer_deferred_size and WAL size

2018-03-09 Thread Richard Hesketh
I am also curious about this, in light of the reported performance regression 
switching from Filestore to Bluestore (when using SSDs for journalling/metadata 
db). I didn't get any responses when I asked, though. The major consideration 
that seems obvious is that this potentially hugely increases the required size 
for the WAL, but I'm not sure if that has any implications beyond simply 
needing a larger WAL/DB device or if there's other config changes that you'd 
need to do.
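
For reference, the knobs in question look like this in ceph.conf (values are purely illustrative, not recommendations):

[osd]
# defer writes up to 4 MiB instead of the 32 KiB HDD default
bluestore_prefer_deferred_size_hdd = 4194304
# and presumably a correspondingly larger WAL, e.g. 2 GiB
bluestore_block_wal_size = 2147483648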

Rich

On 09/03/18 09:35, Budai Laszlo wrote:
> Dear all,
> 
> I am wondering whether it helps to increase the 
> bluestore_prefer_deferred_size to 4MB so the RBD chunks are first written to 
> the WAL, and only later to the spinning disks.
> Any opinions/experiences about this?
> 
> Kind regards,
> Laszlo
> 
> On 08.03.2018 18:15, Budai Laszlo wrote:
>> Dear all,
>>
>> I'm reading about the bluestore_prefer_deferred_size parameter for 
>> Bluestore. Are there any hints about its size when using a dedicated SSD for 
>> bock.wal and block.db ?
>>
>> Thank you in advance!
>>
>> Laszlo





Re: [ceph-users] RFC Bluestore-Cluster of SAMSUNG PM863a

2018-02-02 Thread Richard Hesketh
On 02/02/18 08:33, Kevin Olbrich wrote:
> Hi!
> 
> I am planning a new Flash-based cluster. In the past we used SAMSUNG PM863a 
> 480G as journal drives in our HDD cluster.
> After a lot of tests with luminous and bluestore on HDD clusters, we plan to 
> re-deploy our whole RBD pool (OpenNebula cloud) using these disks.
> 
> As far as I understand, it would be best to skip journaling / WAL and just 
> deploy every OSD 1-by-1. This would have the following pro's (correct me, if 
> I am wrong):
> - maximum performance as the journal is spread accross all devices
> - a lost drive does not affect any other drive
> 
> Currently we are on CentOS 7 with elrepo 4.4.x-kernel. We plan to migrate to 
> Ubuntu 16.04.3 with HWE (kernel 4.10).
> Clients will be Fedora 27 + OpenNebula.
> 
> Any comments?
> 
> Thank you.
> 
> Kind regards,
> Kevin

There is only a real advantage to separating the DB/WAL from the main data if 
they're going to be hosted on a device which is appreciably faster than the 
main storage. Since you're going all SSD, it makes sense to deploy each OSD 
all-in-one; as you say, you don't bottleneck on any one disk, and it also 
offers you more maintenance flexibility as you will be able to easily move OSDs 
between hosts if required. If you wanted to start pushing performance more, 
you'd be looking at putting NVMe disks in your hosts for DB/WAL.

FYI, the 16.04 HWE kernel has currently rolled on over to 4.13.

Rich





[ceph-users] WAL size constraints, bluestore_prefer_deferred_size

2018-01-08 Thread Richard Hesketh
I recently came across the bluestore_prefer_deferred_size family of config 
options, for controlling the upper size threshold on deferred writes. Given a 
number of users suggesting that write performance in filestore is better than 
write performance in bluestore - because filestore writing to an SSD journal 
has much better write latency than bluestore does, writing into the main 
storage - I was wondering if just bumping the deferred write threshold up to 
some arbitrarily large value and therefore allowing most writes to be deferred 
would close the performance gap in these cases?

This raises the secondary question of whether or not the WAL size is limited in 
such a way that this would not work or would require other config changes to 
support. I'm not sure if the WAL size is limited to a certain maximum or it 
will just grow as required like the DB does.

Rich





Re: [ceph-users] Different Ceph versions on OSD/MONs and Clients?

2018-01-05 Thread Richard Hesketh
Whoops meant to reply to-list

 Forwarded Message 
Subject: Re: [ceph-users] Different Ceph versions on OSD/MONs and Clients?
Date: Fri, 5 Jan 2018 15:10:56 +
From: Richard Hesketh <richard.hesk...@rd.bbc.co.uk>
To: Götz Reinicke <goetz.reini...@filmakademie.de>

On 05/01/18 15:03, Götz Reinicke wrote:
> Hi,
> 
> our OSDs and MONs run on jewel and CentOS 7. Now I was wondering whether an older 
> fileserver with RHEL 6, for which I just found hammer RPMs on the official 
> ceph site, can create and use RBDs on the cluster?
> 
> I think there might be problems with some kernel versions/ceph features. 
> 
> If it is not possible in the combination I have, is it nevertheless possible 
> to run different versions?
> 
>   Thanks for hints & suggestions . Regards Götz

It is possible for older clients to connect to newer ceph clusters, ceph is 
generally good about maintaining backwards compatibility on upgrades, with the 
caveat that if you enable newer features (via the cluster tunables parameters) 
than the clients can support, they will no longer work (you'll get a version 
mismatch complaint). Hammer clients on a Jewel cluster should be fine so long 
as your cluster tunables are at Hammer profile or earlier.

The command "ceph osd crush tunables {PROFILE}" will change your cluster 
tunables if necessary, but will also result in data movement (as per 
http://docs.ceph.com/docs/master/rados/operations/crush-map/ ), and may have a 
performance impact if you have to disable newer features.
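
For example, something like this will show where you stand before changing anything:

ceph osd crush show-tunables        # current values and the minimum required client version
ceph osd crush tunables hammer      # only if needed - expect data movement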

Rich





Re: [ceph-users] question on rbd resize

2018-01-03 Thread Richard Hesketh
No, most filesystems can be expanded pretty trivially (shrinking is a more 
complex operation but usually also doable). Assuming the likely case of an 
ext2/3/4 filesystem, the command "resize2fs /dev/rbd0" should resize the FS to 
cover the available space in the block device.
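
For example (image, pool and device names are placeholders, and this assumes the filesystem sits directly on the rbd rather than inside a partition table or LVM):

rbd resize --size 200G mypool/myimage
rbd showmapped            # find which /dev/rbdX the image is mapped to
resize2fs /dev/rbd0       # grow the filesystem to fill the enlarged device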

Rich

On 03/01/18 13:12, 13605702...@163.com wrote:
> hi Jason
> 
> the data won't be lost if i resize the filesystem in the image? 
> 
> thanks





Re: [ceph-users] in the same ceph cluster, why the object in the same osd some are 8M and some are 4M?

2018-01-02 Thread Richard Hesketh
On 02/01/18 02:36, linghucongsong wrote:
> Hi, all!
> 
> I just use ceph rbd for openstack.
> 
> my ceph version is 10.2.7.
> 
> I find it surprising that, among the objects saved on the osd, in some pgs the 
> objects are 8M, and in some pgs the objects are 4M. Can someone tell me why?  
> Thanks!
> 
> root@node04:/var/lib/ceph/osd/ceph-3/current/1.6e_head/DIR_E/DIR_6# ll -h
> -rw-r--r-- 1 ceph ceph 8.0M Dec 14 14:36 
> rbd\udata.0f5c1a238e1f29.012a__head_6967626E__1
> 
> root@node04:/var/lib/ceph/osd/ceph-3/current/3.13_head/DIR_3/DIR_1/DIR_3/DIR_6#
>  ll -h
> -rw-r--r--  1 ceph ceph 4.0M Oct 24 17:39 
> rbd\udata.106f835ba64e8d.04dc__head_5B646313__3
By default, rbds are striped across 4M objects, but that is a configurable 
value - you can make it larger or smaller if you like. I note that the PGs you 
are looking at are from different pools (1.xx vs 3.xx) - so I'm guessing you 
have multiple storage pools configured in your openstack cluster. Is it 
possible that for the larger ones, the rbd_store_chunk_size parameter is being 
overridden?
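
You can check per image with rbd info (pool/image names here are placeholders):

rbd info volumes/volume-1234 | grep order
# "order 22" means 2^22 = 4M objects; "order 23" would give the 8M objects you're seeing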

Rich





Re: [ceph-users] Slow backfilling with bluestore, ssd and metadata pools

2017-12-21 Thread Richard Hesketh
On 21/12/17 10:28, Burkhard Linke wrote:
> OSD config section from ceph.conf:
> 
> [osd]
> osd_scrub_sleep = 0.05
> osd_journal_size = 10240
> osd_scrub_chunk_min = 1
> osd_scrub_chunk_max = 1
> max_pg_per_osd_hard_ratio = 4.0
> osd_max_pg_per_osd_hard_ratio = 4.0
> bluestore_cache_size_hdd = 5368709120
> mon_max_pg_per_osd = 400

Consider also playing with the following OSD parameters:

osd_recovery_max_active
osd_recovery_sleep
osd_recovery_sleep_hdd
osd_recovery_sleep_hybrid
osd_recovery_sleep_ssd

In my anecdotal experience, the forced wait between requests (controlled by the 
recovery_sleep parameters) was causing significant slowdown in recovery speed 
in my cluster, though even at the default values it wasn't making things go 
nearly as slowly as your cluster - it sounds like something else is probably 
wrong.
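
If you want to experiment, those values can be injected at runtime, e.g. (values purely illustrative, not recommendations):

ceph tell osd.* injectargs '--osd-recovery-sleep-hdd 0 --osd-recovery-max-active 5'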

Rich





Re: [ceph-users] Proper way of removing osds

2017-12-21 Thread Richard Hesketh
On 21/12/17 10:21, Konstantin Shalygin wrote:
>> Is this the correct way to removes OSDs, or am I doing something wrong ?
> The generic way for maintenance (e.g. a disk replacement) is to rebalance by changing the osd 
> weight:
> 
> 
> ceph osd crush reweight osdid 0
> 
> the cluster migrates data "from this osd"
> 
> 
> When HEALTH_OK you can safe remove this OSD:
> 
> ceph osd out osd_id
> systemctl stop ceph-osd@osd_id
> ceph osd crush remove osd_id
> ceph auth del osd_id
> ceph osd rm osd_id
> 
> 
> 
> k

Basically this - when you mark an OSD "out" it stops receiving data and PGs will 
be remapped, but it is still part of the crushmap and influences the weights of 
buckets, so when you do the final purge your weights will shift and another 
rebalance occurs. Weighting the OSD to 0 first will ensure you don't incur any 
extra data movement when you finally purge it.

Rich





Re: [ceph-users] Running Jewel and Luminous mixed for a longer period

2017-12-06 Thread Richard Hesketh
On 06/12/17 09:17, Caspar Smit wrote:
> 
> 2017-12-05 18:39 GMT+01:00 Richard Hesketh <richard.hesk...@rd.bbc.co.uk 
> <mailto:richard.hesk...@rd.bbc.co.uk>>:
> 
> On 05/12/17 17:10, Graham Allan wrote:
> > On 12/05/2017 07:20 AM, Wido den Hollander wrote:
> >> Hi,
> >>
> >> I haven't tried this before but I expect it to work, but I wanted to
> >> check before proceeding.
> >>
> >> I have a Ceph cluster which is running with manually formatted
> >> FileStore XFS disks, Jewel, sysvinit and Ubuntu 14.04.
> >>
> >> I would like to upgrade this system to Luminous, but since I have to
> >> re-install all servers and re-format all disks I'd like to move it to
> >> BlueStore at the same time.
> >
> > You don't *have* to update the OS in order to update to Luminous, do 
> you? Luminous is still supported on Ubuntu 14.04 AFAIK.
> >
> > Though obviously I understand your desire to upgrade; I only ask 
> because I am in the same position (Ubuntu 14.04, xfs, sysvinit), though 
> happily with a smaller cluster. Personally I was planning to upgrade ours 
> entirely to Luminous while still on Ubuntu 14.04, before later going through 
> the same process of decommissioning one machine at a time to reinstall with 
> CentOS 7 and Bluestore. I too don't see any reason the mixed Jewel/Luminous 
> cluster wouldn't work, but still felt less comfortable with extending the 
> upgrade duration.
> >
> > Graham
> 
> Yes, you can run luminous on Trusty; one of my clusters is currently 
> Luminous/Bluestore/Trusty as I've not had time to sort out doing OS upgrades 
> on it. I second the suggestion that it would be better to do the luminous 
> upgrade first, retaining existing filestore OSDs, and then do the OS 
> upgrade/OSD recreation on each node in sequence. I don't think there should 
> realistically be any problems with running a mixed cluster for a while but 
> doing the jewel->luminous upgrade on the existing installs first shouldn't be 
> significant extra effort/time as you're already predicting at least two 
> months to upgrade everything, and it does minimise the amount of change at 
> any one time in case things do start going horribly wrong.
> 
> Also, at 48 nodes, I would've thought you could get away with cycling 
> more than one of them at once. Assuming they're homogenous taking out even 4 
> at a time should only raise utilisation on the rest of the cluster to a 
> little over 65%, which still seems safe to me, and you'd waste way less time 
> waiting for recovery. (I recognise that depending on the nature of your 
> employment situation this may not actually be desirable...)
> 
>  
> Assuming size=3 and min_size=2 and failure-domain=host:
> 
> I always thought that bringing down more than 1 host causes data 
> inaccessibility right away, because there is a chance that a pg will have osd's on 
> both of those 2 hosts. Only if the failure-domain is higher than host (rack 
> or something) can you safely bring more than 1 host down (in the same failure 
> domain of course).
> 
> Am i right? 
> 
> Kind regards,
> Caspar

Oh, yeah, if you just bring them down immediately without rebalancing first, 
you'll have problems. But the intention is that rather than just killing the 
nodes, you first weight them to 0 and then wait for the cluster to rebalance 
the data off them so they are empty and harmless when you do shut them down. 
You minimise time spent waiting and overall data movement if you do this sort 
of replacement in larger batches. Others have correctly pointed out though that 
the larger the change you make at any one time, the more likely something might 
go wrong overall... I suspect a good rule of thumb is that you should try to 
add/replace/remove nodes/OSDs in batches of as many you can get away with at 
once without stretching outside the failure domain.

Rich





Re: [ceph-users] Luminous v12.2.2 released

2017-12-05 Thread Richard Hesketh
You are safe to upgrade packages just by doing an apt-get update; apt-get 
upgrade, and you will then want to restart your ceph daemons to bring them to 
the new version. You should of course stagger your restarts of each type to 
ensure your mons remain quorate (don't restart more than half at once, 
ideally one at a time), and your OSDs to keep at least min_size for your pools 
- if you have kept the default failure domain of host for your pools, 
restarting all the OSDs on one node and waiting for them to come back up before 
moving on to the next should be fine. Personally I tend to just reboot the 
entire node and wait for it to come back when I'm doing upgrades, as there are 
usually also new kernels waiting to go live by the time I get around to it.
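
Roughly, per node, something like this (a sketch rather than a full procedure - check cluster health between nodes):

apt-get update && apt-get upgrade
systemctl restart ceph-mon.target    # on monitor nodes, one node at a time
systemctl restart ceph-osd.target    # on OSD nodes, wait for HEALTH_OK before the next node
ceph -s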

This is a minor version upgrade so you shouldn't need to restart daemon types 
in any particular order - I think that's only a concern when you're doing major 
version upgrades.

Rich

On 05/12/17 17:32, Rudi Ahlers wrote:
> Hi, 
> 
> Can you please tell me how to upgrade these? Would a simple apt-get update be 
> sufficient, or is there a better / safer way?
> 
> On Tue, Dec 5, 2017 at 5:41 PM, Sean Redmond  > wrote:
> 
> Hi Florent,
> 
> I have always done mons ,osds, rgw, mds, clients
> 
> Packages that don't auto restart services on update IMO is a good thing.
> 
> Thanks
> 
> On Tue, Dec 5, 2017 at 3:26 PM, Florent B  > wrote:
> 
> On Debian systems, upgrading packages does not restart services !
> 
> 
> On 05/12/2017 16:22, Oscar Segarra wrote:
>> I have executed:
>>
>> yum upgrade -y ceph 
>>
>> On each node and everything has worked fine...
>>
>> 2017-12-05 16:19 GMT+01:00 Florent B > >:
>>
>> Upgrade procedure is OSD or MON first ?
>>
>> There was a change on Luminous upgrade about it.





Re: [ceph-users] Running Jewel and Luminous mixed for a longer period

2017-12-05 Thread Richard Hesketh
On 05/12/17 17:10, Graham Allan wrote:
> On 12/05/2017 07:20 AM, Wido den Hollander wrote:
>> Hi,
>>
>> I haven't tried this before but I expect it to work, but I wanted to
>> check before proceeding.
>>
>> I have a Ceph cluster which is running with manually formatted
>> FileStore XFS disks, Jewel, sysvinit and Ubuntu 14.04.
>>
>> I would like to upgrade this system to Luminous, but since I have to
>> re-install all servers and re-format all disks I'd like to move it to
>> BlueStore at the same time.
> 
> You don't *have* to update the OS in order to update to Luminous, do you? 
> Luminous is still supported on Ubuntu 14.04 AFAIK.
> 
> Though obviously I understand your desire to upgrade; I only ask because I am 
> in the same position (Ubuntu 14.04, xfs, sysvinit), though happily with a 
> smaller cluster. Personally I was planning to upgrade ours entirely to 
> Luminous while still on Ubuntu 14.04, before later going through the same 
> process of decommissioning one machine at a time to reinstall with CentOS 7 
> and Bluestore. I too don't see any reason the mixed Jewel/Luminous cluster 
> wouldn't work, but still felt less comfortable with extending the upgrade 
> duration.
> 
> Graham

Yes, you can run luminous on Trusty; one of my clusters is currently 
Luminous/Bluestore/Trusty as I've not had time to sort out doing OS upgrades on 
it. I second the suggestion that it would be better to do the luminous upgrade 
first, retaining existing filestore OSDs, and then do the OS upgrade/OSD 
recreation on each node in sequence. I don't think there should realistically 
be any problems with running a mixed cluster for a while but doing the 
jewel->luminous upgrade on the existing installs first shouldn't be significant 
extra effort/time as you're already predicting at least two months to upgrade 
everything, and it does minimise the amount of change at any one time in case 
things do start going horribly wrong.

Also, at 48 nodes, I would've thought you could get away with cycling more than 
one of them at once. Assuming they're homogenous taking out even 4 at a time 
should only raise utilisation on the rest of the cluster to a little over 65%, 
which still seems safe to me, and you'd waste way less time waiting for 
recovery. (I recognise that depending on the nature of your employment 
situation this may not actually be desirable...)

Rich





Re: [ceph-users] Adding multiple OSD

2017-12-05 Thread Richard Hesketh
On 05/12/17 09:20, Ronny Aasen wrote:
> On 05. des. 2017 00:14, Karun Josy wrote:
>> Thank you for detailed explanation!
>>
>> Got one another doubt,
>>
>> This is the total space available in the cluster :
>>
>> TOTAL : 23490G
>> Use  : 10170G
>> Avail : 13320G
>>
>>
>> But ecpool shows max avail as just 3 TB. What am I missing ?
>>
>> Karun Josy
> 
> without knowing details of your cluster, this is just assumption guessing, 
> but...
> 
> perhaps one of your hosts has less free space than the others; replicated 
> can pick 3 of the hosts that have plenty of space, but erasure perhaps 
> requires more hosts, so the host with the least space is the limiting factor.
> 
> check
> ceph osd df tree
> 
> to see how it looks.
> 
> 
> kinds regards
> Ronny Aasen

From previous emails the erasure code profile is k=5,m=3, with a host failure 
domain, so the EC pool does use all eight hosts for every object. I agree it's 
very likely that the problem is that your hosts currently have heterogeneous 
capacity and the maximum data in the EC pool will be limited by the size of the 
smallest host.

Also remember that with this profile, you have a 3/5 overhead on your data, so 
1GB of real data stored in the pool translates to 1.6GB of raw data on disk. 
The pool usage and max available stats are given in terms of real data, but the 
cluster TOTAL usage/availability is expressed in terms of the raw space (since 
real usable data will vary depending on pool settings). If you check, you will 
probably find that your lowest-capacity host has near 6TB of space free, which 
would let you store a little over 3.5TB of real data in your EC pool.

Rich





Re: [ceph-users] Sharing Bluestore WAL

2017-11-24 Thread Richard Hesketh
On 23/11/17 17:19, meike.talb...@women-at-work.org wrote:
> Hello,
> 
> in our present Ceph cluster we used to have 12 HDD OSDs per host.
> All OSDs shared a common SSD for journaling.
> The SSD was used as root device and the 12 journals were files in the 
> /usr/share directory, like this:
> 
> OSD 1 - data /dev/sda - journal /usr/share/sda
> OSD 2 - data /dev/sdb - journal /usr/share/sdb
> ...
> 
> We now want to migrate to Bluestore and continue to use this approach.
> I tried to use "ceph-deploy osd prepare test04:sdc --bluestore --block-db 
> /var/local/sdc-block --block-wal /var/local/sdc-wal" to setup an OSD which 
> essentially works.
> 
> However, I'm wondering whether this is correct at all.
> And how can I make sure that the sdc-block and sdc-wal files do not fill up the SSD 
> disk?
> Is there any option to limit the file size, and what is the recommended value 
> for such an option?
> 
> Thank you
> 
> Meike

The maximum size of the WAL is dependent on cluster configuration values, but 
it will always be relatively small. There is no maximum DB size or, as it 
stands, good estimates for how large a DB may realistically grow. The expected 
behaviour is that if the DB outgrows its device it will spill over onto the 
data device. I don't believe there is any option that would let you effectively 
limit the size of files if you're using flat files to back your devices.

Using files for your DB/WAL is not recommended practice - you have the space 
problems that you mention and you'll also be suffering a performance hit by 
sticking a filesystem layer in the middle of things. Realistically, you should 
partition your SSD and provide entire partitions as the devices on which to 
store your OSD DBs. There is no point in specifying the WAL as a separate 
device unless you're doing something advanced; it will be stored alongside the 
DB on the DB device if not otherwise specified, and since you're putting them 
on the same device anyway you get no advantage to splitting them. With 
everything partitioned off correctly, you don't have to worry about Ceph data 
encroaching on your root FS space.
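
As a sketch (sizes and device names here are examples only), that would look something like one SSD partition per OSD, passed to the same ceph-deploy invocation you're already using but pointing at the partition rather than a file:

sgdisk --new=1:0:+30G /dev/sdm      # repeat per OSD, sized so the partitions use the whole SSD
ceph-deploy osd prepare test04:sdc --bluestore --block-db /dev/sdm1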

I would also worry that unless that one SSD is very large, 12 HDDs : 1 SSD 
could be overdoing it. Filestore journals sustained a lot of writing but didn't 
need to be very large, comparatively; Bluestore database w/ WAL is a lot 
lighter on the I/O but does need considerably more standing space since it's 
actually permanently storing metadata rather than just write journalling. If 
it's the case that you've only got a few GB of space you can spare for each DB, 
you're probably going to overgrow that very quickly and you won't see much 
actual benefit from using the SSD.

Rich





Re: [ceph-users] Journal / WAL drive size?

2017-11-23 Thread Richard Hesketh
On 23/11/17 16:13, Rudi Ahlers wrote:
> Hi Caspar, 
> 
> Thanx. I don't see any mention that it's a bad idea to have the WAL and DB on 
> the same SSD, but I guess it could improve performance?

It's not that it's a bad idea to put WAL and DB on the same device - it's that 
if not otherwise specified the WAL is automatically included in the same 
partition with the metadata DB, so there is no point to making them go in 
different partitions on the same device unless you are specifically doing 
testing/debugging and it's helpful to split them out so you can watch them 
separately. Just use the --block.db argument without --block.wal mentioned at 
all.

Rich





Re: [ceph-users] Journal / WAL drive size?

2017-11-23 Thread Richard Hesketh
Keep in mind that as yet we don't really have good estimates for how large 
bluestore metadata DBs may become, but it will be somewhat proportional to your 
number of objects. Considering the size of your OSDs, a 15GB block.db partition 
is almost certainly too small. Unless you've got a compelling reason not to you 
should probably partition your SSDs to use all the available space. Personally, 
I manually partition my block.db devices into even partitions and then invoke 
creation with

ceph-disk prepare --bluestore /dev/sdX --block.db /dev/disk/by-partuuid/whateveritis

Invoking by UUID is done because if presented an existing partition rather than 
the root block device it will just symlink precisely to the argument given, so 
using /dev/sdXYZ arguments is dangerous as they may not be consistent across 
hardware changes or even just reboots depending on your system.

Rich

On 23/11/17 09:53, Caspar Smit wrote:
> Rudi,
> 
> First off all do not deploy an OSD specifying the same seperate device for DB 
> and WAL:
> 
> Please read the following why:
> 
> http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/
> 
> 
> That said you have a fairly large amount of SSD size available so i recommend 
> using it as block.db:
> 
> You can specify a fixed size block.db size in ceph.conf using:
> 
> [global]
> bluestore_block_db_size = 16106127360
> 
> The above is a 15GB block.db size
> 
> Now when you deploy an OSD with a separate block.db device the partition will 
> be 15GB.
> 
> The default size is a percentage of the device i believe and not always a 
> usable amount.
> 
> Caspar
> 
> Met vriendelijke groet,
> 
> Caspar Smit
> Systemengineer
> SuperNAS
> Dorsvlegelstraat 13
> 1445 PA Purmerend
> 
> t: (+31) 299 410 414
> e: caspars...@supernas.eu <mailto:caspars...@supernas.eu>
> w: www.supernas.eu <http://www.supernas.eu>
> 
> 2017-11-23 10:27 GMT+01:00 Rudi Ahlers <rudiahl...@gmail.com 
> <mailto:rudiahl...@gmail.com>>:
> 
> Hi, 
> 
> Can someone please explain this to me in layman's terms. How big a WAL 
> drive do I really need?
> 
> I have a 2x 400GB SSD drives used as WAL / DB drive and 4x 8TB HDD's used 
> as OSD's. When I look at the drive partitions the DB / WAL partitions are 
> only 576Mb & 1GB respectively. This feels a bit small. 
> 
> 
> root@virt1:~# lsblk
> NAME               MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
> sda                  8:0    0   7.3T  0 disk
> ├─sda1               8:1    0   100M  0 part /var/lib/ceph/osd/ceph-0
> └─sda2               8:2    0   7.3T  0 part
> sdb                  8:16   0   7.3T  0 disk
> ├─sdb1               8:17   0   100M  0 part /var/lib/ceph/osd/ceph-1
> └─sdb2               8:18   0   7.3T  0 part
> sdc                  8:32   0   7.3T  0 disk
> ├─sdc1               8:33   0   100M  0 part /var/lib/ceph/osd/ceph-2
> └─sdc2               8:34   0   7.3T  0 part
> sdd                  8:48   0   7.3T  0 disk
> ├─sdd1               8:49   0   100M  0 part /var/lib/ceph/osd/ceph-3
> └─sdd2               8:50   0   7.3T  0 part
> sde                  8:64   0 372.6G  0 disk
> ├─sde1               8:65   0     1G  0 part
> ├─sde2               8:66   0   576M  0 part
> ├─sde3               8:67   0     1G  0 part
> └─sde4               8:68   0   576M  0 part
> sdf                  8:80   0 372.6G  0 disk
> ├─sdf1               8:81   0     1G  0 part
> ├─sdf2               8:82   0   576M  0 part
> ├─sdf3               8:83   0     1G  0 part
> └─sdf4               8:84   0   576M  0 part
> sdg                  8:96   0   118G  0 disk
> ├─sdg1               8:97   0     1M  0 part
> ├─sdg2               8:98   0   256M  0 part /boot/efi
> └─sdg3               8:99   0 117.8G  0 part
>   ├─pve-swap       253:0    0     8G  0 lvm  [SWAP]
>   ├─pve-root       253:1    0  29.3G  0 lvm  /
>   ├─pve-data_tmeta 253:2    0    68M  0 lvm
>   │ └─pve-data     253:4    0  65.9G  0 lvm
>   └─pve-data_tdata 253:3    0  65.9G  0 lvm
>     └─pve-data     253:4    0  65.9G  0 lvm
> 
> 
> 
> 
> -- 
> Kind Regards
> Rudi Ahlers
> Website: http://www.rudiahlers.co.za
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> <http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com>
> 
> 
> 
> 

Re: [ceph-users] Separation of public/cluster networks

2017-11-15 Thread Richard Hesketh
On 15/11/17 12:58, Micha Krause wrote:
> Hi,
> 
> I've build a few clusters with separated public/cluster network, but I'm 
> wondering if this is really
> the way to go.
> 
> http://docs.ceph.com/docs/jewel/rados/configuration/network-config-ref
> 
> states 2 reasons:
> 
> 1. There is more traffic in the backend, which could cause latencies in the 
> public network.
> 
>  Is a low latency public network really an advantage, if my cluster network 
> has high latency?
> 
> 2. Security: evil users could cause damage in the cluster net.
> 
>  Couldn't you cause the same kind, or even more damage in the public network?
> 
> 
> On the other hand, if one host loses its cluster network, it will report 
> random OSDs down over the
> remaining public net. (yes I know about the "mon osd min down reporters" 
> workaround)
> 
> 
> Advantages of a single, shared network:
> 
> 1. Hosts with network problems that can't reach other OSDs also can't 
> reach the mon, so our mon server doesn't get conflicting information.
> 
> 2. Given the same network bandwidth overall, OSDs can use a bigger part of 
> the bandwidth for backend traffic.
> 
> 3. KISS principle.
> 
> So if my server has 4 x 10GB/s network should I really split them in 2 x 
> 20GB/s (cluster/public) or am I
> better off using 1 x 40GB/s (shared)?
> 
> Micha Krause

I have two clusters, one running all-public-network and one with separated 
public/cluster networks. The latter is a bit of a pain because it's much more 
fiddly if I have to change anything, and also there is basically no point to it 
being set up this way (it all goes into the same switch so there's no real 
redundancy).

To quote Wido 
(http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-April/017527.html):
> I rarely use public/cluster networks as they don't add anything for most
> systems. 20Gbit of bandwidth per node is more then enough in most cases and
> my opinion is that multiple IPs per machine only add complexity.

Unless you actually have to make your cluster available on a public network 
which you don't control/trust I really don't think there's much point in 
splitting things up; just bond your links together. Even if you still want to 
logically split cluster/public network so they're in different subnets, you can 
just assign multiple IPs to the link or potentially set up VLAN tagging on the 
switch/interfaces if you want your traffic a bit more securely segregated.
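
For illustration, the ceph.conf side of that logical split over a single bonded
link is roughly the following sketch (the subnets are made-up example values):

[global]
    # both subnets ride on the same bond (e.g. bond0 carrying two addresses)
    public network  = 10.0.1.0/24
    cluster network = 10.0.2.0/24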

Rich



signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how to upgrade CEPH journal?

2017-11-09 Thread Richard Hesketh
fdisk -l | grep sde
> > Disk /dev/sde: 372.6 GiB, 400088457216 bytes, 781422768 sectors
> > /dev/sde1   2048 2099199 2097152   1G unknown
> >
> >
> > /dev/sda :
> >  /dev/sda1 ceph data, active, cluster ceph, osd.3, block /dev/sda2,
> > block.db /dev/sde1
> >  /dev/sda2 ceph block, for /dev/sda1
> > /dev/sdb :
> >  /dev/sdb1 ceph data, active, cluster ceph, osd.4, block /dev/sdb2,
> > block.db /dev/sdf1
> >  /dev/sdb2 ceph block, for /dev/sdb1
> > /dev/sdc :
> >  /dev/sdc1 ceph data, active, cluster ceph, osd.5, block /dev/sdc2,
> > block.db /dev/sdf2
> >  /dev/sdc2 ceph block, for /dev/sdc1
> > /dev/sdd :
> >  /dev/sdd1 other, xfs, mounted on /data/brick1
> >  /dev/sdd2 other, xfs, mounted on /data/brick2
> > /dev/sde :
> >  /dev/sde1 ceph block.db, for /dev/sda1
> > /dev/sdf :
> >  /dev/sdf1 ceph block.db, for /dev/sdb1
> >  /dev/sdf2 ceph block.db, for /dev/sdc1
> > /dev/sdg :
> >
> >
> > resizing the partition through fdisk didn't work. What is the 
> correct
> > procedure, please?
> >
> > Kind Regards
> > Rudi Ahlers
> > Website: http://www.rudiahlers.co.za
> 
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> <http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com>
> For Bluestore OSDs you need to set bluestore_block_db_size to get a bigger
> partition for the DB and bluestore_block_wal_size for the WAL.
> 
> ceph-disk prepare --bluestore \
> --block.db /dev/sde --block.wal /dev/sde /dev/sdX
> 
> This gives you in total four partitions on two different disks.
> 
> I think it will be less hassle to remove the OSD and prepare it again.
> 
> --
> Cheers,
> Alwin
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> <http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com>
> 
> 
> 
> 
> -- 
> Kind Regards
> Rudi Ahlers
> Website: http://www.rudiahlers.co.za
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> <http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com>
> 
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


-- 
Richard Hesketh
Systems Engineer, Research Platforms
BBC Research & Development



signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] bluestore - wal,db on faster devices?

2017-11-09 Thread Richard Hesketh
>> >> Bluestore does have a couple of shortcomings vs filestore currently.
>> >> The allocator is not as good as XFS's and can fragment more over time.
>> >> There is no server-side readahead so small sequential read
>> >> performance is very dependent on client-side readahead.  There's
>> >> still a number of optimizations to various things ranging from
>> >> threading and locking in the shardedopwq to pglog and dup_ops that
>> >> potentially could improve performance.
>> >>
>> >> I have a blog post that we've been working on that explores some of
>> >> these things but I'm still waiting on review before I publish it.
>> >>
>> >> Mark
>> >>
>> >> On 11/08/2017 05:53 AM, Wolfgang Lendl wrote:
>> >>> Hello,
>> >>>
>> >>> it's clear to me getting a performance gain from putting the journal
>> >>> on a fast device (ssd,nvme) when using filestore backend.
>> >>> it's not when it comes to bluestore - are there any resources,
>> >>> performance test, etc. out there how a fast wal,db device impacts
>> >>> performance?
>> >>>
>> >>>
>> >>> br
>> >>> wolfgang
>> >>>
>> >> ___
>> >> ceph-users mailing list
>> >> ceph-users@lists.ceph.com
>> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


-- 
Richard Hesketh
Systems Engineer, Research Platforms
BBC Research & Development



signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Small cluster for VMs hosting

2017-11-07 Thread Richard Hesketh
On 07/11/17 13:16, Gandalf Corvotempesta wrote:
> Hi to all
> I've been far from ceph from a couple of years (CephFS was still unstable)
> 
> I would like to test it again, some questions for a production cluster for 
> VMs hosting:
> 
> 1. Is CephFS stable?

Yes, CephFS is stable and safe (though it can have performance issues relating 
to creating/removing files if your layout requires very large numbers of files 
in a single directory)

> 2. Can I spin up a 3 nodes cluster with mons, MDS and osds on the same 
> machine?

Recommended practice is not to co-locate OSDs with other ceph daemons, but 
realistically lots of people do this (me included) and it works fine. Just 
don't overload your nodes. In recent versions (kraken, luminous) there's a new 
ceph-mgr daemon to keep in mind too.

> 3. Hardware suggestions?

Depends quite a lot on your budget and what performance you need. Ceph's 
relatively CPU-heavy as these storage solutions go, so good CPUs are advised; I 
understand that single-threaded performance is probably more important than 
having lots of cores if you're dealing with very very fast OSDs (like on NVMe). 
Default memory requirements are 1GB/HDD OSD and 3GB/SSD OSD when using the 
bluestore backend, but add maybe 50% for overhead due to fragmentation etc. 
plus the resource cost of your other daemons.
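
As a purely hypothetical worked example: a node with 8 HDD OSDs and 2 SSD OSDs
would want roughly (8 x 1GB + 2 x 3GB) x 1.5 = 21GB of RAM for the OSDs alone,
before counting any mon/mgr/MDS daemons and the OS itself.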

> 4. How can I understand the ceph health status output, in details? I've not 
> seen any docs about this

Read up on http://docs.ceph.com/docs/master/rados/operations/monitoring-osd-pg/ 
and http://docs.ceph.com/docs/master/rados/operations/pg-states/ - 
understanding the different states that PGs and OSDs can be in should be enough 
for you to grok ceph status output.

> 5. How can I know if cluster is fully synced or if any background operation 
> (scrubbing, replication, ...) Is running?

"ceph status" ("ceph -s" for short) will give you a point in time report of 
your cluster state including PG states. If things are scrubbing or whatever 
that will be represented in the PG states. "ceph -w" will give you status and then 
a rolling output of status changes/reports if the cluster does anything 
interesting. One of the functions available in the newer ceph-mgr daemon is an 
http dashboard giving you a quick overview of cluster health.
 
> 6. Is 10G Ethernet mandatory? Currently I only have 4 gigabit nic (2 for 
> public traffic, 2 for cluster traffic)

It's not mandatory, but the more bandwidth you can throw at ceph generally the 
happier it is. If you expect relatively lightweight usage I wouldn't worry - 
but if performance was an issue and nodes otherwise healthy, 1G links as 
bottlenecks would be the first thing I checked.

You seem interested in cephfs but you mention you're looking at ceph as a 
backend for VM hosting; is that coincidental, or are you intending to use disk 
images stored as files in cephfs? Using RBDs would be a much more sensible idea 
if so.

-- 
Rich



signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how does recovery work

2017-10-19 Thread Richard Hesketh
On 19/10/17 11:00, Dennis Benndorf wrote:
> Hello @all,
> 
> givin the following config:
> 
>   * ceph.conf:
> 
> ...
> mon osd down out subtree limit = host
> osd_pool_default_size = 3
> osd_pool_default_min_size = 2
> ...
> 
>   * each OSD has its journal on a 30GB partition on a PCIe-Flash-Card
>   * 3 hosts
> 
> What would happen if one host goes down? I mean is there a limit of downtime 
> of this host/osds? How is Ceph detecting the differences between OSDs within 
> a placement group? Is there a binary log(which could run out of space) in the 
> journal/monitor or will it just copy all object within that pgs which had 
> unavailable osds?
> 
> Thanks in advance,
> Dennis

When the OSDs that were offline come back up, the PGs on those OSDs will 
resynchronise with the other replicas. Where there are new objects (or newer 
objects in the case of modifications), the new data will be copied from the 
other OSDs that remained active. There is no binary logging replication 
mechanism as you might be used to from mysql or similar.

Rich



signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-10-16 Thread Richard Hesketh
On 16/10/17 13:45, Wido den Hollander wrote:
>> Op 26 september 2017 om 16:39 schreef Mark Nelson :
>> On 09/26/2017 01:10 AM, Dietmar Rieder wrote:
>>> thanks David,
>>>
>>> that's confirming what I was assuming. To bad that there is no
>>> estimate/method to calculate the db partition size.
>>
>> It's possible that we might be able to get ranges for certain kinds of 
>> scenarios.  Maybe if you do lots of small random writes on RBD, you can 
>> expect a typical metadata size of X per object.  Or maybe if you do lots 
>> of large sequential object writes in RGW, it's more like Y.  I think 
>> it's probably going to be tough to make it accurate for everyone though.
> 
> So I did a quick test. I wrote 75.000 objects to a BlueStore device:
> 
> root@alpha:~# ceph daemon osd.0 perf dump|jq '.bluestore.bluestore_onodes'
> 75085
> root@alpha:~# 
> 
> I then saw the RocksDB database was 450MB in size:
> 
> root@alpha:~# ceph daemon osd.0 perf dump|jq '.bluefs.db_used_bytes'
> 459276288
> root@alpha:~#
> 
> 459276288 / 75085 = 6116
> 
> So about 6kb of RocksDB data per object.
> 
> Let's say I want to store 1M objects in a single OSD I would need ~6GB of DB 
> space.
> 
> Is this a safe assumption? Do you think that 6kb is normal? Low? High?
> 
> There aren't many of these numbers out there for BlueStore right now so I'm 
> trying to gather some numbers.
> 
> Wido

If I check for the same stats on OSDs in my production cluster I see similar 
but variable values:

root@vm-ds-01:~/ceph-conf# for i in {0..9} ; do echo -n "osd.$i db per object: 
" ; expr `ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph 
daemon osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
osd.0 db per object: 7490
osd.1 db per object: 7523
osd.2 db per object: 7378
osd.3 db per object: 7447
osd.4 db per object: 7233
osd.5 db per object: 7393
osd.6 db per object: 7074
osd.7 db per object: 7967
osd.8 db per object: 7253
osd.9 db per object: 7680

root@vm-ds-02:~# for i in {10..19} ; do echo -n "osd.$i db per object: " ; expr 
`ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon 
osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
osd.10 db per object: 5168
osd.11 db per object: 5291
osd.12 db per object: 5476
osd.13 db per object: 4978
osd.14 db per object: 5252
osd.15 db per object: 5461
osd.16 db per object: 5135
osd.17 db per object: 5126
osd.18 db per object: 9336
osd.19 db per object: 4986

root@vm-ds-03:~# for i in {20..29} ; do echo -n "osd.$i db per object: " ; expr 
`ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon 
osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
osd.20 db per object: 5115
osd.21 db per object: 4844
osd.22 db per object: 5063
osd.23 db per object: 5486
osd.24 db per object: 5228
osd.25 db per object: 4966
osd.26 db per object: 5047
osd.27 db per object: 5021
osd.28 db per object: 5321
osd.29 db per object: 5150

root@vm-ds-04:~# for i in {30..39} ; do echo -n "osd.$i db per object: " ; expr 
`ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon 
osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
osd.30 db per object: 6658
osd.31 db per object: 6445
osd.32 db per object: 6259
osd.33 db per object: 6691
osd.34 db per object: 6513
osd.35 db per object: 6628
osd.36 db per object: 6779
osd.37 db per object: 6819
osd.38 db per object: 6677
osd.39 db per object: 6689

root@vm-ds-05:~# for i in {40..49} ; do echo -n "osd.$i db per object: " ; expr 
`ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon 
osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
osd.40 db per object: 5335
osd.41 db per object: 5203
osd.42 db per object: 5552
osd.43 db per object: 5188
osd.44 db per object: 5218
osd.45 db per object: 5157
osd.46 db per object: 4956
osd.47 db per object: 5370
osd.48 db per object: 5117
osd.49 db per object: 5313

I'm not sure why there is so much variance (these nodes are basically identical), and I 
think that the db_used_bytes includes the WAL at least in my case, as I don't 
have a separate WAL device. I'm not sure how big the WAL is relative to 
metadata and hence how much this might be thrown off, but ~6kb/object seems 
like a reasonable value to take for back-of-envelope calculating.

[bonus hilarity]
On my all-in-one-SSD OSDs, because bluestore reports them entirely as db space, 
I get results like:

root@vm-hv-01:~# for i in {60..65} ; do echo -n "osd.$i db per object: " ; expr 
`ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon 
osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
osd.60 db per object: 80273
osd.61 db per object: 68859
osd.62 db per object: 45560
osd.63 db per object: 38209
osd.64 db per object: 48258
osd.65 db per object: 50525

Rich



signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com

Re: [ceph-users] Backup VM (Base image + snapshot)

2017-10-16 Thread Richard Hesketh
On 16/10/17 03:40, Alex Gorbachev wrote:
> On Sat, Oct 14, 2017 at 12:25 PM, Oscar Segarra  
> wrote:
>> Hi,
>>
>> In my VDI environment I have configured the suggested ceph
>> design/arquitecture:
>>
>> http://docs.ceph.com/docs/giant/rbd/rbd-snapshot/
>>
>> Where I have a Base Image + Protected Snapshot + 100 clones (one for each
>> persistent VDI).
>>
>> Now, I'd like to configure a backup script/mechanism to perform backups of
>> each persistent VDI VM to an external (non ceph) device, like NFS or
>> something similar...
>>
>> Then, some questions:
>>
>> 1.- Does anybody have been able to do this kind of backups?
> 
> Yes, we have been using export-diff successfully (note this is off a
> snapshot and not a clone) to back up and restore ceph images to
> non-ceph storage.  You can use merge-diff to create "synthetic fulls"
> and even do some basic replication to another cluster.
> 
> http://ceph.com/geen-categorie/incremental-snapshots-with-rbd/
> 
> http://docs.ceph.com/docs/master/dev/rbd-export/
> 
> http://cephnotes.ksperis.com/blog/2014/08/12/rbd-replication
> 
> --
> Alex Gorbachev
> Storcium
> 
>> 2.- Is it possible to export BaseImage in qcow2 format and snapshots in
>> qcow2 format as well as "linked clones" ?
>> 3.- Is it possible to export the Base Image in raw format, snapshots in raw
>> format as well and, when recover is required, import both images and
>> "relink" them?
>> 4.- What is the suggested solution for this scenario?
>>
>> Thanks a lot everybody!

In my setup I back up complete raw disk images individually to file, because 
then they're easier to manually inspect and grab data off in the event of 
catastrophic cluster failure. I haven't personally bothered trying to preserve 
the layering between master/clone images in backup form; that sounds like a 
bunch of effort and by inspection the amount of space it'd actually save in my 
use case is really minimal.

However I do use export-diff in order to make backups efficient - a rolling 
snapshot on each RBD is used to export the day's diff out of the cluster and 
then the ceph_apply_diff utility from https://gp2x.org/ceph/ is used to apply 
that diff to the raw image file (though I did patch it to work with streaming 
input and eliminate the necessity for a temporary file containing the diff). 
There are a handful of very large RBDs in my cluster for which exporting the 
full disk image takes a prohibitively long time, which made leveraging diffs 
necessary.
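
For illustration, the rolling-snapshot export step boils down to something like
the sketch below (pool/image/snapshot names and the destination are made up):

# take today's snapshot, ship only the changes since yesterday's, drop the old one
rbd snap create rbd/vm-disk@daily-2017-10-16
rbd export-diff --from-snap daily-2017-10-15 rbd/vm-disk@daily-2017-10-16 - \
    | ssh backup-host 'cat > /backup/vm-disk/2017-10-16.diff'
rbd snap rm rbd/vm-disk@daily-2017-10-15

In my case the stream actually goes into ceph_apply_diff against the raw image
file rather than being kept as a flat diff.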

For a while, I was instead just exporting diffs and using merge-diff to munge 
them together into big super-diffs, and the restoration procedure would be to 
apply the merged diff to a freshly made image in the cluster. This worked, but 
it is a more fiddly recovery process; importing complete disk images is easier. 
I don't think it's possible to create two images in the cluster and then link 
them into a layering relationship; you'd have to import the base image, clone 
it, and them import a diff onto that clone if you wanted to recreate the 
original layering.
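
Spelling that restore path out as a rough sketch with stock rbd commands (names
and paths are made up, and exactly which diff applies cleanly depends on how it
was exported):

rbd import /backup/base-image.raw rbd/base-image
rbd snap create rbd/base-image@master
rbd snap protect rbd/base-image@master
rbd clone rbd/base-image@master rbd/restored-vm
rbd import-diff /backup/restored-vm.diff rbd/restored-vm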

Rich



signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MGR Dashboard hostname missing

2017-10-12 Thread Richard Hesketh
On 12/10/17 17:15, Josy wrote:
> Hello,
> 
> After taking down couple of OSDs, the dashboard is not showing the 
> corresponding hostname.

Ceph-mgr is known to have issues with associating services with hostnames 
sometimes, e.g. http://tracker.ceph.com/issues/20887

Fixes look to be incoming.

Rich



signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] "ceph osd status" fails

2017-10-06 Thread Richard Hesketh
When I try to run the command "ceph osd status" on my cluster, I just get an 
error. Luckily, unlike the last issue I had with ceph fs commands, it doesn't 
seem to be crashing any of the daemons.

root@vm-ds-01:/var/log/ceph# ceph osd status
Error EINVAL: Traceback (most recent call last):
  File "/usr/lib/ceph/mgr/status/module.py", line 293, in handle_command
return self.handle_osd_status(cmd)
  File "/usr/lib/ceph/mgr/status/module.py", line 273, in handle_osd_status
stats = osd_stats[osd_id]
KeyError: (78L,)

Example and relevant excerpt from the ceph-mgr log shown at 
https://gist.github.com/rjhesketh/378ec118e42289a2dd0b1dd2462aae92

Is this trying to poll stats for an OSD which doesn't exist and therefore 
breaking?

Rich



signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] "ceph fs" commands hang forever and kill monitors

2017-09-28 Thread Richard Hesketh
On 27/09/17 19:35, John Spray wrote:
> On Wed, Sep 27, 2017 at 1:18 PM, Richard Hesketh
> <richard.hesk...@rd.bbc.co.uk> wrote:
>> On 27/09/17 12:32, John Spray wrote:
>>> On Wed, Sep 27, 2017 at 12:15 PM, Richard Hesketh
>>> <richard.hesk...@rd.bbc.co.uk> wrote:
>>>> As the subject says... any ceph fs administrative command I try to run 
>>>> hangs forever and kills monitors in the background - sometimes they come 
>>>> back, on a couple of occasions I had to manually stop/restart a suffering 
>>>> mon. Trying to load the filesystem tab in the ceph-mgr dashboard dumps an 
>>>> error and can also kill a monitor. However, clients can mount the 
>>>> filesystem and read/write data without issue.
>>>>
>>>> Relevant excerpt from logs on an affected monitor, just trying to run 
>>>> 'ceph fs ls':
>>>>
>>>> 2017-09-26 13:20:50.716087 7fc85fdd9700  0 mon.vm-ds-01@0(leader) e19 
>>>> handle_command mon_command({"prefix": "fs ls"} v 0) v1
>>>> 2017-09-26 13:20:50.727612 7fc85fdd9700  0 log_channel(audit) log [DBG] : 
>>>> from='client.? 10.10.10.1:0/2771553898' entity='client.admin' 
>>>> cmd=[{"prefix": "fs ls"}]: dispatch
>>>> 2017-09-26 13:20:50.950373 7fc85fdd9700 -1 
>>>> /build/ceph-12.2.0/src/osd/OSDMap.h: In function 'const string& 
>>>> OSDMap::get_pool_name(int64_t) const' thread 7fc85fdd9700 time 2017-09-26 
>>>> 13:20:50.727676
>>>> /build/ceph-12.2.0/src/osd/OSDMap.h: 1176: FAILED assert(i != 
>>>> pool_name.end())
>>>>
>>>>  ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous 
>>>> (rc)
>>>>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
>>>> const*)+0x102) [0x55a8ca0bb642]
>>>>  2: (()+0x48165f) [0x55a8c9f4165f]
>>>>  3: 
>>>> (MDSMonitor::preprocess_command(boost::intrusive_ptr)+0x1d18)
>>>>  [0x55a8ca047688]
>>>>  4: 
>>>> (MDSMonitor::preprocess_query(boost::intrusive_ptr)+0x2a8) 
>>>> [0x55a8ca048008]
>>>>  5: (PaxosService::dispatch(boost::intrusive_ptr)+0x700) 
>>>> [0x55a8c9f9d1b0]
>>>>  6: (Monitor::handle_command(boost::intrusive_ptr)+0x1f93) 
>>>> [0x55a8c9e63193]
>>>>  7: (Monitor::dispatch_op(boost::intrusive_ptr)+0xa0e) 
>>>> [0x55a8c9e6a52e]
>>>>  8: (Monitor::_ms_dispatch(Message*)+0x6db) [0x55a8c9e6b57b]
>>>>  9: (Monitor::ms_dispatch(Message*)+0x23) [0x55a8c9e9a053]
>>>>  10: (DispatchQueue::entry()+0xf4a) [0x55a8ca3b5f7a]
>>>>  11: (DispatchQueue::DispatchThread::entry()+0xd) [0x55a8ca16bc1d]
>>>>  12: (()+0x76ba) [0x7fc86b3ac6ba]
>>>>  13: (clone()+0x6d) [0x7fc869bd63dd]
>>>>  NOTE: a copy of the executable, or `objdump -rdS ` is needed 
>>>> to interpret this.
>>>>
>>>> I'm running Luminous. The cluster and FS have been in service since Hammer 
>>>> and have default data/metadata pool names. I discovered the issue after 
>>>> attempting to enable directory sharding.
>>>
>>> Well that's not good...
>>>
>>> The assertion is because your FSMap is referring to a pool that
>>> apparently no longer exists in the OSDMap.  This should be impossible
>>> in current Ceph (we forbid removing pools if they're in use), but
>>> could perhaps have been caused in an earlier version of Ceph when it
>>> was possible to remove a pool even if CephFS was referring to it?
>>>
>>> Alternatively, perhaps something more severe is going on that's
>>> causing your mons to see a wrong/inconsistent view of the world.  Has
>>> the cluster ever been through any traumatic disaster recovery type
>>> activity involving hand-editing any of the cluster maps?  What
>>> intermediate versions has it passed through on the way from Hammer to
>>> Luminous?
>>>
>>> Opened a ticket here: http://tracker.ceph.com/issues/21568
>>>
>>> John
>>
>> I've reviewed my notes (i.e. I've grepped my IRC logs); I actually inherited 
>> this cluster from a colleague who left shortly after I joined, so 
>> unfortunately there is some of its history I cannot fill in.
>>
>> Turns out the cluster actually predates Firefly. Looking at dates my 
>> suspicion is that it went Emperor -> Firefly -> Giant -> Hammer. I inherited 
>> it at Hammer, and took it Hammer -> Infernalis -> Jewel -&

Re: [ceph-users] Different recovery times for OSDs joining and leaving the cluster

2017-09-27 Thread Richard Hesketh
Just to add, assuming other settings are default, IOPS and maximum physical 
write speed are probably not the actual limiting factors in the tests you have 
been doing; ceph by default limits recovery I/O on any given OSD quite a bit in 
order to ensure recovery operations don't adversely impact client I/O too much. 
You can experiment with the osd_max_backfills, osd_recovery_max_active and 
osd_recovery_sleep[_hdd,_ssd,_hybrid] family of settings to tune recovery 
speed. You can probably make it a lot faster, but you will probably still see 
the discrepancy; fundamentally you're still comparing the difference between 30 
workers shuffling around 10% of your data to 2 workers taking on about 10% of 
your data.

Rich

On 27/09/17 14:23, David Turner wrote:
> When you lose 2 osds you have 30 osds accepting the degraded data and 
> performing the backfilling. When the 2 osds are added back in you only have 2 
> osds receiving the majority of the data from the backfilling.  2 osds have a 
> lot less available iops and spindle speed than the other 30 did when they 
> were recovering from the loss causing your bottleneck.
> 
> Adding osds is generally a slower operation than losing them due to this.  
> Even for brand-new nodes increasing your cluster size.
> 
> 
> On Wed, Sep 27, 2017, 8:43 AM Jonas Jaszkowic <jonasjaszkowic.w...@gmail.com 
> <mailto:jonasjaszkowic.w...@gmail.com>> wrote:
> 
> Hello all, 
> 
> I have setup a Ceph cluster consisting of one monitor, 32 OSD hosts (1 
> OSD of size 320GB per host) and 16 clients which are reading
> and writing to the cluster. I have one erasure coded pool (shec plugin) 
> with k=8, m=4, c=3 and pg_num=256. Failure domain is host.
> I am able to reach a HEALTH_OK state and everything is working as 
> expected. The pool was populated with
> 114048 files of different sizes ranging from 1kB to 4GB. Total amount of 
> data in the pool was around 3TB. The capacity of the
> pool was around 10TB.
> 
> I want to evaluate how Ceph is rebalancing data when 
> 
> 1) I take out two OSDs and 
> 2) when I rejoin these two OSDS.
> 
> For scenario 1) I am „killing" two OSDs via *ceph osd out . *Ceph 
> notices this failure and starts to rebalance data until I 
> reach HEALTH_OK again.
> 
> For scenario 2) I am rejoining the previously killed OSDs via *ceph osd 
> in . *Again, Ceph notices this failure and starts to 
> rebalance data until HEALTH_OK state.
> 
> I repeated this whole scenario four times. *What I am noticing is that 
> the rebalancing process in the event of two OSDs joining the*
> *cluster takes more than 3 times longer than in the event of the loss of 
> two OSDs. This was consistent over the four runs.*
> 
> I expected both recovering times to be equally long since at both 
> scenarios the number of degraded objects was around 8% and the
> number of missing objects around 2%. I attached a visualization of the 
> recovery process in terms of degraded and missing objects, 
> first part is the scenario where two OSDs „failed“, second one is the 
> rejoining of these two OSDs. Note how it takes significantly longer
> to recover in the second case.
> 
> Now I want to understand why it takes longer! I appreciate all hints.
> 
> Thanks!
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


-- 
Richard Hesketh
Systems Engineer, Research Platforms
BBC Research & Development



signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] "ceph fs" commands hang forever and kill monitors

2017-09-27 Thread Richard Hesketh
On 27/09/17 12:32, John Spray wrote:
> On Wed, Sep 27, 2017 at 12:15 PM, Richard Hesketh
> <richard.hesk...@rd.bbc.co.uk> wrote:
>> As the subject says... any ceph fs administrative command I try to run hangs 
>> forever and kills monitors in the background - sometimes they come back, on 
>> a couple of occasions I had to manually stop/restart a suffering mon. Trying 
>> to load the filesystem tab in the ceph-mgr dashboard dumps an error and can 
>> also kill a monitor. However, clients can mount the filesystem and 
>> read/write data without issue.
>>
>> Relevant excerpt from logs on an affected monitor, just trying to run 'ceph 
>> fs ls':
>>
>> 2017-09-26 13:20:50.716087 7fc85fdd9700  0 mon.vm-ds-01@0(leader) e19 
>> handle_command mon_command({"prefix": "fs ls"} v 0) v1
>> 2017-09-26 13:20:50.727612 7fc85fdd9700  0 log_channel(audit) log [DBG] : 
>> from='client.? 10.10.10.1:0/2771553898' entity='client.admin' 
>> cmd=[{"prefix": "fs ls"}]: dispatch
>> 2017-09-26 13:20:50.950373 7fc85fdd9700 -1 
>> /build/ceph-12.2.0/src/osd/OSDMap.h: In function 'const string& 
>> OSDMap::get_pool_name(int64_t) const' thread 7fc85fdd9700 time 2017-09-26 
>> 13:20:50.727676
>> /build/ceph-12.2.0/src/osd/OSDMap.h: 1176: FAILED assert(i != 
>> pool_name.end())
>>
>>  ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc)
>>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
>> const*)+0x102) [0x55a8ca0bb642]
>>  2: (()+0x48165f) [0x55a8c9f4165f]
>>  3: 
>> (MDSMonitor::preprocess_command(boost::intrusive_ptr)+0x1d18) 
>> [0x55a8ca047688]
>>  4: (MDSMonitor::preprocess_query(boost::intrusive_ptr)+0x2a8) 
>> [0x55a8ca048008]
>>  5: (PaxosService::dispatch(boost::intrusive_ptr)+0x700) 
>> [0x55a8c9f9d1b0]
>>  6: (Monitor::handle_command(boost::intrusive_ptr)+0x1f93) 
>> [0x55a8c9e63193]
>>  7: (Monitor::dispatch_op(boost::intrusive_ptr)+0xa0e) 
>> [0x55a8c9e6a52e]
>>  8: (Monitor::_ms_dispatch(Message*)+0x6db) [0x55a8c9e6b57b]
>>  9: (Monitor::ms_dispatch(Message*)+0x23) [0x55a8c9e9a053]
>>  10: (DispatchQueue::entry()+0xf4a) [0x55a8ca3b5f7a]
>>  11: (DispatchQueue::DispatchThread::entry()+0xd) [0x55a8ca16bc1d]
>>  12: (()+0x76ba) [0x7fc86b3ac6ba]
>>  13: (clone()+0x6d) [0x7fc869bd63dd]
>>  NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
>> interpret this.
>>
>> I'm running Luminous. The cluster and FS have been in service since Hammer 
>> and have default data/metadata pool names. I discovered the issue after 
>> attempting to enable directory sharding.
> 
> Well that's not good...
> 
> The assertion is because your FSMap is referring to a pool that
> apparently no longer exists in the OSDMap.  This should be impossible
> in current Ceph (we forbid removing pools if they're in use), but
> could perhaps have been caused in an earlier version of Ceph when it
> was possible to remove a pool even if CephFS was referring to it?
> 
> Alternatively, perhaps something more severe is going on that's
> causing your mons to see a wrong/inconsistent view of the world.  Has
> the cluster ever been through any traumatic disaster recovery type
> activity involving hand-editing any of the cluster maps?  What
> intermediate versions has it passed through on the way from Hammer to
> Luminous?
> 
> Opened a ticket here: http://tracker.ceph.com/issues/21568
> 
> John

I've reviewed my notes (i.e. I've grepped my IRC logs); I actually inherited 
this cluster from a colleague who left shortly after I joined, so unfortunately 
there is some of its history I cannot fill in.

Turns out the cluster actually predates Firefly. Looking at dates my suspicion 
is that it went Emperor -> Firefly -> Giant -> Hammer. I inherited it at 
Hammer, and took it Hammer -> Infernalis -> Jewel -> Luminous myself. I know I 
did make sure to do the tmap_upgrade step on cephfs but can't remember if I did 
it at Infernalis or Jewel.

Infernalis was a tricky upgrade; the attempt was aborted once after the first 
set of OSDs didn't come back up after upgrade (had to remove/downgrade and 
readd), and setting sortbitwise as the documentation suggested after a 
successful second attempt caused everything to break and degrade slowly until 
it was unset and recovered. Never had disaster recovery involve mucking around 
with the pools while I was administrating it, but unfortunately I cannot speak 
for the cluster's pre-Hammer history. The only pools I have removed were ones I 
created temporarily for testing crush rules/benchmarking.

I have hand-edited the crush map (extract, decompile, modify, recompile, 
inject) at times because I found it more convenient for creating new crush 
rules than using the CLI tools, but not the OSD map.
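
For reference, the extract/decompile/modify/recompile/inject cycle I mean is the
standard crushtool one (filenames are arbitrary):

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# edit crushmap.txt to add/adjust rules, then:
crushtool -c crushmap.txt -o crushmap.new.bin
ceph osd setcrushmap -i crushmap.new.bin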

Why would the cephfs have been referring to other pools?

Rich



signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] "ceph fs" commands hang forever and kill monitors

2017-09-27 Thread Richard Hesketh
As the subject says... any ceph fs administrative command I try to run hangs 
forever and kills monitors in the background - sometimes they come back, on a 
couple of occasions I had to manually stop/restart a suffering mon. Trying to 
load the filesystem tab in the ceph-mgr dashboard dumps an error and can also 
kill a monitor. However, clients can mount the filesystem and read/write data 
without issue.

Relevant excerpt from logs on an affected monitor, just trying to run 'ceph fs 
ls':

2017-09-26 13:20:50.716087 7fc85fdd9700  0 mon.vm-ds-01@0(leader) e19 
handle_command mon_command({"prefix": "fs ls"} v 0) v1
2017-09-26 13:20:50.727612 7fc85fdd9700  0 log_channel(audit) log [DBG] : 
from='client.? 10.10.10.1:0/2771553898' entity='client.admin' cmd=[{"prefix": 
"fs ls"}]: dispatch
2017-09-26 13:20:50.950373 7fc85fdd9700 -1 /build/ceph-12.2.0/src/osd/OSDMap.h: 
In function 'const string& OSDMap::get_pool_name(int64_t) const' thread 
7fc85fdd9700 time 2017-09-26 13:20:50.727676
/build/ceph-12.2.0/src/osd/OSDMap.h: 1176: FAILED assert(i != pool_name.end())

 ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x102) [0x55a8ca0bb642]
 2: (()+0x48165f) [0x55a8c9f4165f]
 3: (MDSMonitor::preprocess_command(boost::intrusive_ptr)+0x1d18) 
[0x55a8ca047688]
 4: (MDSMonitor::preprocess_query(boost::intrusive_ptr)+0x2a8) 
[0x55a8ca048008]
 5: (PaxosService::dispatch(boost::intrusive_ptr)+0x700) 
[0x55a8c9f9d1b0]
 6: (Monitor::handle_command(boost::intrusive_ptr)+0x1f93) 
[0x55a8c9e63193]
 7: (Monitor::dispatch_op(boost::intrusive_ptr)+0xa0e) 
[0x55a8c9e6a52e]
 8: (Monitor::_ms_dispatch(Message*)+0x6db) [0x55a8c9e6b57b]
 9: (Monitor::ms_dispatch(Message*)+0x23) [0x55a8c9e9a053]
 10: (DispatchQueue::entry()+0xf4a) [0x55a8ca3b5f7a]
 11: (DispatchQueue::DispatchThread::entry()+0xd) [0x55a8ca16bc1d]
 12: (()+0x76ba) [0x7fc86b3ac6ba]
 13: (clone()+0x6d) [0x7fc869bd63dd]
 NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
interpret this.

I'm running Luminous. The cluster and FS have been in service since Hammer and 
have default data/metadata pool names. I discovered the issue after attempting 
to enable directory sharding.

Rich



signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-09-22 Thread Richard Hesketh
I asked the same question a couple of weeks ago. No response I got 
contradicted the documentation but nobody actively confirmed the 
documentation was correct on this subject, either; my end state was that 
I was relatively confident I wasn't making some horrible mistake by 
simply specifying a big DB partition and letting bluestore work itself 
out (in my case, I've just got HDDs and SSDs that were journals under 
filestore), but I could not be sure there wasn't some sort of 
performance tuning I was missing out on by not specifying them separately.


Rich

On 21/09/17 20:37, Benjeman Meekhof wrote:

Some of this thread seems to contradict the documentation and confuses
me.  Is the statement below correct?

"The BlueStore journal will always be placed on the fastest device
available, so using a DB device will provide the same benefit that the
WAL device would while also allowing additional metadata to be stored
there (if it will fix)."

http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#devices

  it seems to be saying that there's no reason to create separate WAL
and DB partitions if they are on the same device.  Specifying one
large DB partition per OSD will cover both uses.

thanks,
Ben

On Thu, Sep 21, 2017 at 12:15 PM, Dietmar Rieder
 wrote:

On 09/21/2017 05:03 PM, Mark Nelson wrote:


On 09/21/2017 03:17 AM, Dietmar Rieder wrote:

On 09/21/2017 09:45 AM, Maged Mokhtar wrote:

On 2017-09-21 07:56, Lazuardi Nasution wrote:


Hi,

I'm still looking for the answer of these questions. Maybe someone can
share their thought on these. Any comment will be helpful too.

Best regards,

On Sat, Sep 16, 2017 at 1:39 AM, Lazuardi Nasution
> wrote:

 Hi,

 1. Is it possible configure use osd_data not as small partition on
 OSD but a folder (ex. on root disk)? If yes, how to do that with
 ceph-disk and any pros/cons of doing that?
 2. Is WAL & DB size calculated based on OSD size or expected
 throughput like on journal device of filestore? If no, what is the
 default value and pro/cons of adjusting that?
 3. Is partition alignment matter on Bluestore, including WAL & DB
 if using separate device for them?

 Best regards,


___
ceph-users mailing list
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



I am also looking for recommendations on wal/db partition sizes. Some
hints:

ceph-disk defaults used in case it does not find
bluestore_block_wal_size or bluestore_block_db_size in config file:

wal =  512MB

db = if bluestore_block_size (data size) is in config file it uses 1/100
of it else it uses 1G.

There is also a presentation by Sage back in March, see page 16:

https://www.slideshare.net/sageweil1/bluestore-a-new-storage-backend-for-ceph-one-year-in


wal: 512 MB

db: "a few" GB

the wal size is probably not debatable, it will be like a journal for
small block sizes which are constrained by iops hence 512 MB is more
than enough. Probably we will see more on the db size in the future.

This is what I understood so far.
I wonder if it makes sense to set the db size as big as possible and
divide entire db device is  by the number of OSDs it will serve.

E.g. 10 OSDs / 1 NVME (800GB)

  (800GB - 10x1GB wal ) / 10 = ~79Gb db size per OSD

Is this smart/stupid?

Personally I'd use 512MB-2GB for the WAL (larger buffers reduce write
amp but mean larger memtables and potentially higher overhead scanning
through memtables).  4x256MB buffers works pretty well, but it means
memory overhead too.  Beyond that, I'd devote the entire rest of the
device to DB partitions.


thanks for your suggestion Mark!

So, just to make sure I understood this right:

You'd  use a separeate 512MB-2GB WAL partition for each OSD and the
entire rest for DB partitions.

In the example case with 10xHDD OSD and 1 NVME it would then be 10 WAL
partitions with each 512MB-2GB and 10 equal sized DB partitions
consuming the rest of the NVME.


Thanks
   Dietmar
--
_
D i e t m a r  R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore "separate" WAL and DB (and WAL/DB size?) [and recovery sleep]

2017-09-14 Thread Richard Hesketh
I do run with osd_max_backfills and osd_recovery_max_active turned up quite a 
bit from the defaults, since I'm trying for as much recovery throughput as 
possible. I would hazard a guess that the impact seen from the sleep settings is 
proportionally much smaller if your other recovery-related parameters are closer 
to the defaults - but it starts to dominate if you remove other bottlenecks on 
recovery I/O.

Rich

On 14/09/17 15:02, Mark Nelson wrote:
> I'm really glad to hear that it wasn't bluestore! :)
> 
> It raises another concern though. We didn't expect to see that much of a 
> slowdown with the current throttle settings.  An order of magnitude slowdown 
> in recovery performance isn't ideal at all.
> 
> I wonder if we could improve things dramatically if we kept track of client 
> IO activity on the OSD and remove the throttle if there's been no client 
> activity for X seconds.  Theoretically more advanced heuristics might cover 
> this, but in the interim it seems to me like this would solve the very 
> specific problem you are seeing while still throttling recovery when IO is 
> happening.
> 
> Mark
> 
> On 09/14/2017 06:19 AM, Richard Hesketh wrote:
>> Yeah, that hit the nail on the head. Significantly reducing/eliminating the 
>> recovery sleep times brings the recovery speed back up to (and beyond!) the 
>> levels I was expecting to see - recovery is almost an order of magnitude 
>> faster now. Thanks for educating me about those changes!
>>
>> Rich
>>
>> On 14/09/17 11:16, Richard Hesketh wrote:
>>> Hi Mark,
>>>
>>> No, I wasn't familiar with that work. I am in fact comparing speed of 
>>> recovery to maintenance work I did while the cluster was in Jewel; I 
>>> haven't manually done anything to sleep settings, only adjusted max 
>>> backfills OSD settings. New options that introduce arbitrary slowdown to 
>>> recovery operations to preserve client performance would explain what I'm 
>>> seeing! I'll have a tinker with adjusting those values (in my particular 
>>> case client load on the cluster is very low and I don't have to honour any 
>>> guarantees about client performance - getting back into HEALTH_OK asap is 
>>> preferable).
>>>
>>> Rich
>>>
>>> On 13/09/17 21:14, Mark Nelson wrote:
>>>> Hi Richard,
>>>>
>>>> Regarding recovery speed, have you looked through any of Neha's results on 
>>>> recovery sleep testing earlier this summer?
>>>>
>>>> https://www.spinics.net/lists/ceph-devel/msg37665.html
>>>>
>>>> She tested bluestore and filestore under a couple of different scenarios.  
>>>> The gist of it is that time to recover changes pretty dramatically 
>>>> depending on the sleep setting.
>>>>
>>>> I don't recall if you said earlier, but are you comparing filestore and 
>>>> bluestore recovery performance on the same version of ceph with the same 
>>>> sleep settings?
>>>>
>>>> Mark
>>>>
>>>> On 09/12/2017 05:24 AM, Richard Hesketh wrote:
>>>>> Thanks for the links. That does seem to largely confirm that I haven't 
>>>>> horribly misunderstood anything and I've not been doing anything 
>>>>> obviously wrong while converting my disks; there's no point specifying 
>>>>> separate WAL/DB partitions if they're going to go on the same device, 
>>>>> throw as much space as you have available at the DB partitions and 
>>>>> they'll use all the space they can, and significantly reduced I/O on the 
>>>>> DB/WAL device compared to Filestore is expected since bluestore's nixed 
>>>>> the write amplification as much as possible.
>>>>>
>>>>> I'm still seeing much reduced recovery speed on my newly Bluestored 
>>>>> cluster, but I guess that's a tuning issue rather than evidence of 
>>>>> catastrophe.
>>>>>
>>>>> Rich
>>>
>>>
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


-- 
Richard Hesketh
Systems Engineer, Research Platforms
BBC Research & Development



signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore "separate" WAL and DB (and WAL/DB size?) [and recovery sleep]

2017-09-14 Thread Richard Hesketh
Yeah, that hit the nail on the head. Significantly reducing/eliminating the 
recovery sleep times brings the recovery speed back up to (and beyond!) the 
levels I was expecting to see - recovery is almost an order of magnitude faster 
now. Thanks for educating me about those changes!

Rich

On 14/09/17 11:16, Richard Hesketh wrote:
> Hi Mark,
> 
> No, I wasn't familiar with that work. I am in fact comparing speed of 
> recovery to maintenance work I did while the cluster was in Jewel; I haven't 
> manually done anything to sleep settings, only adjusted max backfills OSD 
> settings. New options that introduce arbitrary slowdown to recovery 
> operations to preserve client performance would explain what I'm seeing! I'll 
> have a tinker with adjusting those values (in my particular case client load 
> on the cluster is very low and I don't have to honour any guarantees about 
> client performance - getting back into HEALTH_OK asap is preferable).
> 
> Rich
> 
> On 13/09/17 21:14, Mark Nelson wrote:
>> Hi Richard,
>>
>> Regarding recovery speed, have you looked through any of Neha's results on 
>> recovery sleep testing earlier this summer?
>>
>> https://www.spinics.net/lists/ceph-devel/msg37665.html
>>
>> She tested bluestore and filestore under a couple of different scenarios.  
>> The gist of it is that time to recover changes pretty dramatically depending 
>> on the sleep setting.
>>
>> I don't recall if you said earlier, but are you comparing filestore and 
>> bluestore recovery performance on the same version of ceph with the same 
>> sleep settings?
>>
>> Mark
>>
>> On 09/12/2017 05:24 AM, Richard Hesketh wrote:
>>> Thanks for the links. That does seem to largely confirm that I haven't 
>>> horribly misunderstood anything and I've not been doing anything obviously 
>>> wrong while converting my disks; there's no point specifying separate 
>>> WAL/DB partitions if they're going to go on the same device, throw as much 
>>> space as you have available at the DB partitions and they'll use all the 
>>> space they can, and significantly reduced I/O on the DB/WAL device compared 
>>> to Filestore is expected since bluestore's nixed the write amplification as 
>>> much as possible.
>>>
>>> I'm still seeing much reduced recovery speed on my newly Bluestored 
>>> cluster, but I guess that's a tuning issue rather than evidence of 
>>> catastrophe.
>>>
>>> Rich
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


-- 
Richard Hesketh
Systems Engineer, Research Platforms
BBC Research & Development



signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore "separate" WAL and DB (and WAL/DB size?)

2017-09-12 Thread Richard Hesketh
Thanks for the links. That does seem to largely confirm that I haven't 
horribly misunderstood anything and I've not been doing anything obviously 
wrong while converting my disks; there's no point specifying separate WAL/DB 
partitions if they're going to go on the same device, throw as much space as 
you have available at the DB partitions and they'll use all the space they can, 
and significantly reduced I/O on the DB/WAL device compared to Filestore is 
expected since bluestore's nixed the write amplification as much as possible.

I'm still seeing much reduced recovery speed on my newly Bluestored cluster, 
but I guess that's a tuning issue rather than evidence of catastrophe.

Rich

On 12/09/17 00:13, Brad Hubbard wrote:
> Take a look at these which should answer at least some of your questions.
> 
> http://ceph.com/community/new-luminous-bluestore/
> 
> http://ceph.com/planet/understanding-bluestore-cephs-new-storage-backend/
> 
> On Mon, Sep 11, 2017 at 8:45 PM, Richard Hesketh
> <richard.hesk...@rd.bbc.co.uk> wrote:
>> On 08/09/17 11:44, Richard Hesketh wrote:
>>> Hi,
>>>
>>> Reading the ceph-users list I'm obviously seeing a lot of people talking 
>>> about using bluestore now that Luminous has been released. I note that many 
>>> users seem to be under the impression that they need separate block devices 
>>> for the bluestore data block, the DB, and the WAL... even when they are 
>>> going to put the DB and the WAL on the same device!
>>>
>>> As per the docs at 
>>> http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/ 
>>> this is nonsense:
>>>
>>>> If there is only a small amount of fast storage available (e.g., less than 
>>>> a gigabyte), we recommend using it as a WAL device. If there is more, 
>>>> provisioning a DB
>>>> device makes more sense. The BlueStore journal will always be placed on 
>>>> the fastest device available, so using a DB device will provide the same 
>>>> benefit that the WAL
>>>> device would while also allowing additional metadata to be stored there 
>>>> (if it will fix). [sic, I assume that should be "fit"]
>>>
>>> I understand that if you've got three speeds of storage available, there 
>>> may be some sense to dividing these. For instance, if you've got lots of 
>>> HDD, a bit of SSD, and a tiny NVMe available in the same host, data on HDD, 
>>> DB on SSD and WAL on NVMe may be a sensible division of data. That's not 
>>> the case for most of the examples I'm reading; they're talking about 
>>> putting DB and WAL on the same block device, but in different partitions. 
>>> There's even one example of someone suggesting to try partitioning a single 
>>> SSD to put data/DB/WAL all in separate partitions!
>>>
>>> Are the docs wrong and/or I am missing something about optimal bluestore 
>>> setup, or do people simply have the wrong end of the stick? I ask because 
>>> I'm just going through switching all my OSDs over to Bluestore now and I've 
>>> just been reusing the partitions I set up for journals on my SSDs as DB 
>>> devices for Bluestore HDDs without specifying anything to do with the WAL, 
>>> and I'd like to know sooner rather than later if I'm making some sort of 
>>> horrible mistake.
>>>
>>> Rich
>>
>> Having had no explanatory reply so far I'll ask further...
>>
>> I have been continuing to update my OSDs and so far the performance offered 
>> by bluestore has been somewhat underwhelming. Recovery operations after 
>> replacing the Filestore OSDs with Bluestore equivalents have been much 
>> slower than expected, not even half the speed of recovery ops when I was 
>> upgrading Filestore OSDs with larger disks a few months ago. This 
>> contributes to my sense that I am doing something wrong.
>>
>> I've found that if I allow ceph-disk to partition my DB SSDs rather than 
>> reusing the rather large journal partitions I originally created for 
>> Filestore, it is only creating very small 1GB partitions. Attempting to 
>> search for bluestore configuration parameters has pointed me towards 
>> bluestore_block_db_size and bluestore_block_wal_size config settings. 
>> Unfortunately these settings are completely undocumented so I'm not sure 
>> what their functional purpose is. In any event in my running config I seem 
>> to have the following default values:
>>
>> # ceph-conf --show-config | grep bluestore
>> ...
>> bluestore_block_create = true
>> bluestore_block_db_create = false
>

Re: [ceph-users] Bluestore "separate" WAL and DB (and WAL/DB size?)

2017-09-11 Thread Richard Hesketh
On 08/09/17 11:44, Richard Hesketh wrote:
> Hi,
> 
> Reading the ceph-users list I'm obviously seeing a lot of people talking 
> about using bluestore now that Luminous has been released. I note that many 
> users seem to be under the impression that they need separate block devices 
> for the bluestore data block, the DB, and the WAL... even when they are going 
> to put the DB and the WAL on the same device!
> 
> As per the docs at 
> http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/ 
> this is nonsense:
> 
>> If there is only a small amount of fast storage available (e.g., less than a 
>> gigabyte), we recommend using it as a WAL device. If there is more, 
>> provisioning a DB
>> device makes more sense. The BlueStore journal will always be placed on the 
>> fastest device available, so using a DB device will provide the same benefit 
>> that the WAL
>> device would while also allowing additional metadata to be stored there (if 
>> it will fix). [sic, I assume that should be "fit"]
> 
> I understand that if you've got three speeds of storage available, there may 
> be some sense to dividing these. For instance, if you've got lots of HDD, a 
> bit of SSD, and a tiny NVMe available in the same host, data on HDD, DB on 
> SSD and WAL on NVMe may be a sensible division of data. That's not the case 
> for most of the examples I'm reading; they're talking about putting DB and 
> WAL on the same block device, but in different partitions. There's even one 
> example of someone suggesting to try partitioning a single SSD to put 
> data/DB/WAL all in separate partitions!
> 
> Are the docs wrong and/or I am missing something about optimal bluestore 
> setup, or do people simply have the wrong end of the stick? I ask because I'm 
> just going through switching all my OSDs over to Bluestore now and I've just 
> been reusing the partitions I set up for journals on my SSDs as DB devices 
> for Bluestore HDDs without specifying anything to do with the WAL, and I'd 
> like to know sooner rather than later if I'm making some sort of horrible 
> mistake.
> 
> Rich

Having had no explanatory reply so far I'll ask further...

I have been continuing to update my OSDs and so far the performance offered by 
bluestore has been somewhat underwhelming. Recovery operations after replacing 
the Filestore OSDs with Bluestore equivalents have been much slower than 
expected, not even half the speed of recovery ops when I was upgrading 
Filestore OSDs with larger disks a few months ago. This contributes to my sense 
that I am doing something wrong.

I've found that if I allow ceph-disk to partition my DB SSDs rather than 
reusing the rather large journal partitions I originally created for Filestore, 
it is only creating very small 1GB partitions. Attempting to search for 
bluestore configuration parameters has pointed me towards 
bluestore_block_db_size and bluestore_block_wal_size config settings. 
Unfortunately these settings are completely undocumented so I'm not sure what 
their functional purpose is. In any event in my running config I seem to have 
the following default values:

# ceph-conf --show-config | grep bluestore
...
bluestore_block_create = true
bluestore_block_db_create = false
bluestore_block_db_path = 
bluestore_block_db_size = 0
bluestore_block_path = 
bluestore_block_preallocate_file = false
bluestore_block_size = 10737418240
bluestore_block_wal_create = false
bluestore_block_wal_path = 
bluestore_block_wal_size = 100663296
...

I have been creating bluestore osds by:

ceph-disk prepare --bluestore /dev/sdX --block.db /dev/sdY1 --osd-id Z # 
re-using existing partitions for DB
or
ceph-disk prepare --bluestore /dev/sdX --block.db /dev/sdY --osd-id Z # letting 
ceph-disk partition DB, after zapping original partitions

Are these sane values? What does it mean that block_db_size is 0 - is it just 
using the entire block device specified or not actually using it at all? Is the 
WAL actually being placed on the DB block device? And is that 1GB default 
really a sensible size for the DB partition?
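
In case it is useful to anyone else hitting this: my working assumption is that
ceph-disk only consults these settings when it creates the partitions itself, so
something like the following sketch should give bigger DB partitions (sizes are
in bytes and purely example values):

# in ceph.conf before running ceph-disk prepare
[global]
bluestore_block_db_size = 32212254720    # ~30GB DB partition per OSD
bluestore_block_wal_size = 1073741824    # 1GB WAL partition

# then let ceph-disk carve up the DB device itself:
ceph-disk prepare --bluestore /dev/sdX --block.db /dev/sdY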

Rich



signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Bluestore "separate" WAL and DB

2017-09-08 Thread Richard Hesketh
Hi,

Reading the ceph-users list I'm obviously seeing a lot of people talking about 
using bluestore now that Luminous has been released. I note that many users 
seem to be under the impression that they need separate block devices for the 
bluestore data block, the DB, and the WAL... even when they are going to put 
the DB and the WAL on the same device!

As per the docs at 
http://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/ this 
is nonsense:

> If there is only a small amount of fast storage available (e.g., less than a 
> gigabyte), we recommend using it as a WAL device. If there is more, 
> provisioning a DB
> device makes more sense. The BlueStore journal will always be placed on the 
> fastest device available, so using a DB device will provide the same benefit 
> that the WAL
> device would while also allowing additional metadata to be stored there (if 
> it will fix). [sic, I assume that should be "fit"]

I understand that if you've got three speeds of storage available, there may be 
some sense to dividing these. For instance, if you've got lots of HDD, a bit of 
SSD, and a tiny NVMe available in the same host, data on HDD, DB on SSD and WAL 
on NVMe may be a sensible division of data. That's not the case for most of the 
examples I'm reading; they're talking about putting DB and WAL on the same 
block device, but in different partitions. There's even one example of someone 
suggesting to try partitioning a single SSD to put data/DB/WAL all in separate 
partitions!

Are the docs wrong, and/or am I missing something about optimal bluestore setup, 
or do people simply have the wrong end of the stick? I ask because I'm just 
going through switching all my OSDs over to Bluestore now and I've just been 
reusing the partitions I set up for journals on my SSDs as DB devices for 
Bluestore HDDs without specifying anything to do with the WAL, and I'd like to 
know sooner rather than later if I'm making some sort of horrible mistake.

Rich
-- 
Richard Hesketh






Re: [ceph-users] Ceph Maintenance

2017-08-01 Thread Richard Hesketh
On 01/08/17 12:41, Osama Hasebou wrote:
> Hi,
> 
> What would be the best possible and efficient way for big Ceph clusters when 
> maintenance needs to be performed ?
> 
> Lets say that we have 3 copies of data, and one of the servers needs to be 
> maintained, and maintenance might take 1-2 days due to some unprepared issues 
> that come up.
> 
> Setting the node to no-out is a bit of a risk since only 2 copies will be active. 
> So in that case, what would be the proper way to take the node down, rebalance, and 
> then perform maintenance? And how would one bring it back online 
> without rebalancing right away - to check whether it's functioning properly as 
> a server first - and once all looks good, introduce rebalancing again?
> 
> 
> Thank you.
> 
> Regards,
> Ossi

The recommended practice would be to use "ceph osd crush reweight" to set the 
crush weight on the OSDs that will be down to 0. The cluster will then 
rebalance, and once it's HEALTH_OK again, you can take those OSDs offline 
without losing any redundancy (though you will need to ensure you have enough 
spare space in what's left of the cluster that you don't push disk usage too 
high on your other nodes).

When you're ready to bring them online again, make sure that you have 
"osd_crush_update_on_start = false" set in your ceph.conf so they don't 
potentially mess with their weights when they come back. Then they will be up 
but still at crush weight 0 so no data will be assigned to them. When you're 
happy everything's okay, use "ceph osd crush reweight" again to bring them back 
to their original weights. Lots of people like to do that in increments of 0.1 
weight at a time, so the recovery is staggered and doesn't impact your active 
I/O too much.
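
Concretely, the sequence looks something like this (just a sketch - the OSD id and 
weights are made-up examples, use your own):

ceph osd crush reweight osd.12 0        # drain this OSD; repeat for the others on the node
# wait for HEALTH_OK, then stop the daemons and do the maintenance...

# afterwards, with osd_crush_update_on_start = false set, bring it back gradually:
ceph osd crush reweight osd.12 0.5
# ...wait for recovery to settle, then step up again...
ceph osd crush reweight osd.12 1.81999  # the OSD's original weight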

This assumes your crush layout is such that you can still have three replicas 
with one server missing.

Rich





Re: [ceph-users] Client behavior when adding and removing mons

2017-07-31 Thread Richard Hesketh
On 31/07/17 14:05, Edward R Huyer wrote:
> I’m migrating my Ceph cluster to entirely new hardware.  Part of that is 
> replacing the monitors.  My plan is to add new monitors and remove old ones, 
> updating config files on client machines as I go.
> 
> I have clients actively using the cluster.  They are all QEMU/libvirt and 
> kernel clients using RBDs.  Will they continue to function correctly as long 
> as I update the applicable config files with new monitor info?  Or will I 
> need to restart the VMs and remap the RBDs on the physical boxes?
> 
> I **think** the clients keep an updated list of the current active monitors 
> while the clients are running, but I’m not positive.  Hopefully someone knows 
> definitively.
> 
> Thanks in advance!
> 
> -
> 
> Edward Huyer

The monitor list specified in the config file is an "initial set" the 
client/daemons use to establish first contact with the cluster. Once they've 
been able to communicate with a monitor successfully, they thereafter use the 
provided monmap to keep track of monitors and the config file is irrelevant - 
so as long as you keep a quorum of monitors online at all times, you shouldn't 
have any issues with running clients. I did an OS/hardware upgrade recently and 
spun up temporary monitors on normally non-mon nodes to keep the cluster online 
while I was working on the regular hosts, and no restarting of clients was 
required.

Rich





Re: [ceph-users] best practices for expanding hammer cluster

2017-07-19 Thread Richard Hesketh
In my case my cluster is under very little active load and so I have never had 
to be concerned about recovery operations impacting client traffic. In fact, 
I generally tune up from the defaults (increase osd max backfills) to improve 
recovery speed when I'm doing major changes, because there's plenty of spare 
capacity in the cluster; and either way I'm in the fortunate position where I 
can place a higher value on having a HEALTH_OK cluster ASAP than on the client 
I/O being consistent.
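
For reference, the sort of tuning meant here is just bumping the recovery/backfill 
limits - a sketch, and the values are examples rather than recommendations:

ceph tell osd.* injectargs '--osd_max_backfills 4 --osd_recovery_max_active 8'

# to make it persistent, put the same values in the [osd] section of ceph.conf:
#   osd_max_backfills = 4
#   osd_recovery_max_active = 8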

Rich

On 19/07/17 16:27, Laszlo Budai wrote:
> Hi Rich,
> 
> Thank you for your answer. This is good news to hear :)
> Regarding the reconfiguration you've done: if I understand correctly, you 
> have changed it all at once (like download the crush map, edit it - add all 
> the new OSDs, and upload the new map to the cluster). How did you control 
> the impact of the recovery/refilling operation on your clients' data traffic? 
> What setting have you used to avoid slow requests?
> 
> Kind regards,
> Laszlo
> 
> 
> On 19.07.2017 17:40, Richard Hesketh wrote:
>> On 19/07/17 15:14, Laszlo Budai wrote:
>>> Hi David,
>>>
>>> Thank you for that reference about CRUSH. It's a nice one.
>>> There I could read about expanding the cluster, but in one of my cases we 
>>> want to do more: we want to move from host failure domain to chassis 
>>> failure domain. Our concern is: how will ceph behave for those PGs where 
>>> all the three replicas currently are in the same chassis? Because in this 
>>> case according to the new CRUSH map two replicas are in the wrong place.
>>>
>>> Kind regards,
>>> Laszlo
>>
>> Changing crush rules resulting in PGs being remapped works exactly the same 
>> way as changes in crush weights causing remapped data. The PGs will be 
>> remapped in accordance with the new crushmap/rules and then recovery 
>> operations will copy them over to the new OSDs as usual. Even if a PG is 
>> entirely remapped, the OSDs that were originally hosting it will operate as 
>> an acting set and continue to serve I/O and replicate data until copies on 
>> the new OSDs are ready to take over - ceph won't throw an upset because the 
>> acting set doesn't comply with the crush rules. I have done, for instance, a 
>> crush rule change which resulted in an entire pool being entirely remapped - 
>> switching the cephfs metadata pool from an HDD root to an SSD root rule, so 
>> every single PG was moved to a completely different set of OSDs - and it all 
>> continued to work fine while recovery took place.
>>
>> Rich





Re: [ceph-users] best practices for expanding hammer cluster

2017-07-19 Thread Richard Hesketh
On 19/07/17 15:14, Laszlo Budai wrote:
> Hi David,
> 
> Thank you for that reference about CRUSH. It's a nice one.
> There I could read about expanding the cluster, but in one of my cases we 
> want to do more: we want to move from host failure domain to chassis failure 
> domain. Our concern is: how will ceph behave for those PGs where all the 
> three replicas currently are in the same chassis? Because in this case 
> according to the new CRUSH map two replicas are in the wrong place.
> 
> Kind regards,
> Laszlo

Changing crush rules resulting in PGs being remapped works exactly the same way 
as changes in crush weights causing remapped data. The PGs will be remapped in 
accordance with the new crushmap/rules and then recovery operations will copy 
them over to the new OSDs as usual. Even if a PG is entirely remapped, the OSDs 
that were originally hosting it will operate as an acting set and continue to 
serve I/O and replicate data until copies on the new OSDs are ready to take 
over - ceph won't throw an upset because the acting set doesn't comply with the 
crush rules. I have done, for instance, a crush rule change which resulted in 
an entire pool being entirely remapped - switching the cephfs metadata pool 
from an HDD root to an SSD root rule, so every single PG was moved to a 
completely different set of OSDs - and it all continued to work fine while 
recovery took place.
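
For illustration, that kind of rule switch is only a couple of commands - a sketch, 
assuming an 'ssd-root' already exists in the crush map and pre-luminous syntax; the 
names and rule id are examples:

ceph osd crush rule create-simple ssd_rule ssd-root host   # replicated rule choosing hosts under ssd-root
ceph osd crush rule dump ssd_rule                          # note the rule_id it reports
ceph osd pool set cephfs_metadata crush_ruleset 1          # point the pool at that rule id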

Rich





Re: [ceph-users] missing feature 400000000000000 ?

2017-07-14 Thread Richard Hesketh
On 14/07/17 11:03, Ilya Dryomov wrote:
> On Fri, Jul 14, 2017 at 11:29 AM, Riccardo Murri
>  wrote:
>> Hello,
>>
>> I am trying to install a test CephFS "Luminous" system on Ubuntu 16.04.
>>
>> Everything looks fine, but the `mount.ceph` command fails (error 110, 
>> timeout);
>> kernel logs show a number of messages like these before the `mount`
>> prog gives up:
>>
>> libceph: ... feature set mismatch, my 107b84a842aca < server's
>> 40107b84a842aca, missing 400
>>
>> I read in [1] that this is feature
>> CEPH_FEATURE_NEW_OSDOPREPLY_ENCODING which is only supported in
>> kernels 4.5 and up -- whereas Ubuntu 16.04 runs Linux 4.4.
>>
>> Is there some tunable or configuration file entry that I can set,
>> which will make Luminous FS mounting work on the std Ubuntu 16.04
>> Linux kernel?  I.e., is there a way I can avoid upgrading the kernel?
>>
>> [1]: 
>> http://cephnotes.ksperis.com/blog/2014/01/21/feature-set-mismatch-error-on-ceph-kernel-client
> 
> Yes, you should be able to set your CRUSH tunables profile to hammer
> with "ceph osd crush tunables hammer".
> 
> Thanks,
> 
> Ilya

Alternatively, keep in mind you can install ceph-fuse and mount the FS using 
that userland client instead, if you'd prefer the tunables in your cluster to 
be up to date.
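
That's only a couple of steps - a sketch, with the monitor address and mountpoint as 
examples, and assuming the admin keyring is already in /etc/ceph:

apt-get install ceph-fuse
mkdir -p /mnt/cephfs
ceph-fuse -m 192.168.0.1:6789 /mnt/cephfs

# or roughly this in /etc/fstab (exact option syntax varies a little by release):
#   none  /mnt/cephfs  fuse.ceph  ceph.id=admin,_netdev,defaults  0 0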

Rich





Re: [ceph-users] Degraded objects while OSD is being added/filled

2017-07-12 Thread Richard Hesketh
On 11/07/17 20:05, Eino Tuominen wrote:
> Hi Richard,
> 
> Thanks for the explanation, that makes perfect sense. I've missed the 
> difference between ceph osd reweight and ceph osd crush reweight. I have to 
> study that better.
> 
> Is there a way to get ceph to prioritise fixing degraded objects over fixing 
> misplaced ones?

The difference between out/reweight and crush weight caught me out recently as 
well, I was seeing a lot more data movement that I expected during the process 
of replacing old disks until someone explained the difference to me.

As I understand it, ceph already prioritises fixing things which are degraded 
over things which are misplaced - the problem is that this is granular at the 
level of PGs, then objects. It will try and choose to recover PGs with degraded 
objects before PGs which only have misplaced objects, and I think that within a 
specific PG recovery it will try to fix degraded objects before misplaced 
objects. However it will still finish recovering the whole PG, misplaced 
objects and all - it's got no mechanism to put a recovery on hold once it's 
fixed the degraded objects to free capacity for recovering a different PG.

The default recovery limits are really conservative though, you can probably 
increase the rate quite a lot by increasing osd_max_backfills above the default 
limit of 1 - "ceph tell osd.X injectargs '--osd_max_backfills Y'" to update a 
running config, "osd_max_backfills = Y" in the [osd] section of your ceph.conf 
to make it persist.

Rich





Re: [ceph-users] Migrating RGW from FastCGI to Civetweb

2017-07-12 Thread Richard Hesketh
Oh, correcting myself. When HTTP proxying, Apache translates the Host header to 
whatever was specified in the ProxyPass line, so your civetweb server is receiving 
requests with Host headers for localhost! Presumably it works differently for the 
fcgi protocol. Nonetheless, ProxyPreserveHost should solve your problem.

Rich

On 12/07/17 10:40, Richard Hesketh wrote:
> Best guess, apache is munging together everything it picks up using the 
> aliases and translating the host to the ServerName before passing on the 
> request. Try setting ProxyPreserveHost on as per 
> https://httpd.apache.org/docs/2.4/mod/mod_proxy.html#proxypreservehost ?
> 
> Rich
> 
> On 11/07/17 21:47, Roger Brown wrote:
>> Thank you Richard, that mostly worked for me. 
>>
>> But I notice that when I switch it from FastCGI to Civitweb that the 
>> S3-style subdomains (e.g., bucket-name.domain-name.com 
>> <http://bucket-name.domain-name.com>) stops working and I haven't been able 
>> to figure out why on my own.
>>
>> - ceph.conf excerpt:
>> [client.radosgw.gateway]
>> host = nuc1
>> keyring = /etc/ceph/ceph.client.radosgw.keyring
>> log file = /var/log/ceph/client.radosgw.gateway.log
>> rgw dns name = s3.e-prepared.com <http://s3.e-prepared.com>
>> # FASTCGI SETTINGS
>> rgw socket path = ""
>> rgw print continue = false
>> rgw frontends = fastcgi socket_port=9000 socket_host=0.0.0.0
>> # CIVETWEB SETTINGS
>> #rgw frontends = civetweb port=7480
>>
>> - httpd.conf excerpt
>> 
>> ServerName s3.e-prepared.com <http://s3.e-prepared.com>
>> ServerAlias *.s3.e-prepared.com <http://s3.e-prepared.com>
>> ServerAlias s3.amazonaws.com <http://s3.amazonaws.com>
>> ServerAlias *.amazonaws.com <http://amazonaws.com>
>> DocumentRoot /srv/www/html/e-prepared_com/s3
>> ErrorLog /var/log/httpd/rgw_error.log
>> CustomLog /var/log/httpd/rgw_access.log combined
>> # LogLevel debug
>> RewriteEngine On
>> RewriteRule .* - [E=HTTP_AUTHORIZATION:%{HTTP:Authorization},L]
>> SetEnv proxy-nokeepalive 1
>> # FASTCGI SETTINGS
>> ProxyPass / fcgi://localhost:9000/
>> # CIVETWEB SETTINGS
>> #ProxyPass / http://localhost:7480/
>> #ProxyPassReverse / http://localhost:7480/
>> 
>>
>> With the above FastCGI settings, S3-style subdomains work. Eg.
>> [root@nuc1 ~]# curl http://roger-public.s3.e-prepared.com/index.html
>> 
>> 
>>   
>> Hello, World!
>>   
>> 
>>
>> But when I comment out the fastcgi settings, uncomment the civetweb 
>> settings, and restart ceph-radosgw and http (and disable selinux), I get 
>> output like this:
>> [root@nuc1 ~]# curl http://roger-public.s3.e-prepared.com/index.html
>> > encoding="UTF-8"?>NoSuchBucketindex.htmltx3-00596536b0-1465f8-default1465f8-default-default
>>
>> However I can still access the bucket the old-fashioned way (e.g., 
>> domain-name.com/bucket-name <http://domain-name.com/bucket-name>) even with 
>> Civetweb running:
>> [root@nuc1 ~]# curl http://s3.e-prepared.com/roger-public/index.html 
>> 
>> 
>>   
>> Hello, World!
>>   
>> 
>>
>> Thoughts, anyone?
>>
>> Roger
> 
> 
> 
> 


-- 
---
Richard Hesketh
Linux Systems Administrator, Research Platforms
BBC Research & Development





Re: [ceph-users] Migrating RGW from FastCGI to Civetweb

2017-07-12 Thread Richard Hesketh
Best guess, apache is munging together everything it picks up using the aliases 
and translating the host to the ServerName before passing on the request. Try 
setting ProxyPreserveHost on as per 
https://httpd.apache.org/docs/2.4/mod/mod_proxy.html#proxypreservehost ?
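
i.e. in the vhost that proxies to radosgw, something along these lines (a sketch, 
using the civetweb port from your config):

ProxyPreserveHost On
ProxyPass / http://localhost:7480/
ProxyPassReverse / http://localhost:7480/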

Rich

On 11/07/17 21:47, Roger Brown wrote:
> Thank you Richard, that mostly worked for me. 
> 
> But I notice that when I switch it from FastCGI to Civitweb that the S3-style 
> subdomains (e.g., bucket-name.domain-name.com 
> ) stops working and I haven't been able 
> to figure out why on my own.
> 
> - ceph.conf excerpt:
> [client.radosgw.gateway]
> host = nuc1
> keyring = /etc/ceph/ceph.client.radosgw.keyring
> log file = /var/log/ceph/client.radosgw.gateway.log
> rgw dns name = s3.e-prepared.com 
> # FASTCGI SETTINGS
> rgw socket path = ""
> rgw print continue = false
> rgw frontends = fastcgi socket_port=9000 socket_host=0.0.0.0
> # CIVETWEB SETTINGS
> #rgw frontends = civetweb port=7480
> 
> - httpd.conf excerpt
> 
> ServerName s3.e-prepared.com 
> ServerAlias *.s3.e-prepared.com 
> ServerAlias s3.amazonaws.com 
> ServerAlias *.amazonaws.com 
> DocumentRoot /srv/www/html/e-prepared_com/s3
> ErrorLog /var/log/httpd/rgw_error.log
> CustomLog /var/log/httpd/rgw_access.log combined
> # LogLevel debug
> RewriteEngine On
> RewriteRule .* - [E=HTTP_AUTHORIZATION:%{HTTP:Authorization},L]
> SetEnv proxy-nokeepalive 1
> # FASTCGI SETTINGS
> ProxyPass / fcgi://localhost:9000/
> # CIVETWEB SETTINGS
> #ProxyPass / http://localhost:7480/
> #ProxyPassReverse / http://localhost:7480/
> 
> 
> With the above FastCGI settings, S3-style subdomains work. Eg.
> [root@nuc1 ~]# curl http://roger-public.s3.e-prepared.com/index.html
> 
> 
>   
> Hello, World!
>   
> 
> 
> But when I comment out the fastcgi settings, uncomment the civetweb settings, 
> and restart ceph-radosgw and http (and disable selinux), I get output like 
> this:
> [root@nuc1 ~]# curl http://roger-public.s3.e-prepared.com/index.html
>  encoding="UTF-8"?>NoSuchBucketindex.htmltx3-00596536b0-1465f8-default1465f8-default-default
> 
> However I can still access the bucket the old-fashioned way (e.g., 
> domain-name.com/bucket-name ) even with 
> Civetweb running:
> [root@nuc1 ~]# curl http://s3.e-prepared.com/roger-public/index.html 
> 
> 
>   
> Hello, World!
>   
> 
> 
> Thoughts, anyone?
> 
> Roger





Re: [ceph-users] Migrating RGW from FastCGI to Civetweb

2017-07-11 Thread Richard Hesketh
On 11/07/17 17:08, Roger Brown wrote:
> What are some options for migrating from Apache/FastCGI to Civetweb for 
> RadosGW object gateway *without* breaking other websites on the domain?
> 
> I found documention on how to migrate the object gateway to Civetweb 
> (http://docs.ceph.com/docs/luminous/install/install-ceph-gateway/#migrating-from-apache-to-civetweb),
>  but not seeing how not to break the other sites when it says, "Migrating to 
> use Civetweb basically involves removing your Apache installation."
> 
> Example:
> @domain.tld, www.domain.tld : served by Apache
> s3.domain.tld, *.s3.domain.tld: was served by Apache on the same server, but 
> I want to serve it by Civetweb (maybe on different server) because I hear 
> FastCGI support is going away.

I think switching off Apache is assumed because you can't have both 
listening on the same port. If I were you, I would leave Apache listening on 
port 80/443, set up the RGW civetweb to run on some other port, and then 
configure your relevant vhosts in apache to reverse proxy to the civetweb 
install instead so that no client configuration has to change. The civetweb 
config is easy - just "rgw_frontends = civetweb port=8080" or whatever free 
port you want to use - you should be able to find plenty of guides on setting 
up an apache vhost as a reverse proxy if you google it.
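
Roughly, the two pieces look like this - only a sketch, with the port and hostnames as 
examples (and as discussed elsewhere in this thread, ProxyPreserveHost matters if you 
use bucket-subdomain style requests):

# ceph.conf, in your rgw client section:
[client.radosgw.gateway]
rgw_frontends = civetweb port=8080

# apache vhost acting as the reverse proxy:
<VirtualHost *:80>
    ServerName s3.example.com
    ServerAlias *.s3.example.com
    ProxyPreserveHost On
    ProxyPass / http://localhost:8080/
    ProxyPassReverse / http://localhost:8080/
</VirtualHost>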

Rich





Re: [ceph-users] Degraded objects while OSD is being added/filled

2017-07-11 Thread Richard Hesketh
First of all, your disk removal process needs tuning. "ceph osd out" sets the 
disk reweight to 0 but NOT the crush weight; this is why you're seeing 
misplaced objects after removing the osd, because the crush weights have 
changed (even though reweight meant that disk currently held no data). Use 
"ceph osd crush reweight osd.$X 0" to change the OSD's crush weight first, wait 
for everything to rebalance, then take it out and down - you shouldn't see any 
extra data movement or repeering after you remove a disk that way.
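
Spelled out, the removal sequence is roughly (a sketch, using osd.109 from the output 
below as the example id):

ceph osd crush reweight osd.109 0    # drain it first; wait for rebalance and HEALTH_OK
ceph osd out 109
# stop the daemon on its host (systemctl stop ceph-osd@109, or the init script on older setups)
ceph osd crush remove osd.109
ceph auth del osd.109
ceph osd rm 109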

The degraded objects caused when adding/removing disks are probably down to 
writes taking place on PGs which are not fully peered, right? While not yet 
fully peered, the primary can't replicate writes to all secondaries, so once 
they are peered again and agree on state they will recognise that some 
secondaries have out of date objects. Any repeering on an active cluster could 
be expected to cause a relatively small number of degraded objects due to 
writes taking place in that window, right?

Rich

On 11/07/17 14:17, Eino Tuominen wrote:
> Hi all,
> 
> 
> One more example: 
> 
> 
> *osd.109*down out weight 0 up_from 306818 up_thru 397714 down_at 397717 
> last_clean_interval [306031,306809) 130.232.243.80:6814/4733 
> 192.168.70.113:6814/4733 192.168.70.113:6815/4733 130.232.243.80:6815/4733 
> exists cabdfaec-eb39-4e5a-8012-9bade04c5e03
> 
> 
> root@ceph-osd-13:~# ceph status
> 
> cluster 0a9f2d69-5905-4369-81ae-e36e4a791831
> 
>  health HEALTH_OK
> 
>  monmap e3: 3 mons at 
> {0=130.232.243.65:6789/0,1=130.232.243.66:6789/0,2=130.232.243.67:6789/0}
> 
> election epoch 356, quorum 0,1,2 0,1,2
> 
>  osdmap e397837: 260 osds: 259 up, 241 in
> 
> flags require_jewel_osds
> 
>   pgmap v81199361: 25728 pgs, 8 pools, 203 TB data, 89794 kobjects
> 
> 613 TB used, 295 TB / 909 TB avail
> 
>25696 active+clean
> 
>   32 active+clean+scrubbing+deep
> 
>   client io 587 kB/s rd, 1422 kB/s wr, 357 op/s rd, 88 op/s wr
> 
> 
> 
> Then, I remove the osd that has been evacuated:
> 
> root@ceph-osd-13:~# ceph osd crush remove osd.109
> 
> removed item id 109 name 'osd.109' from crush map
> 
> 
> Wait a few seconds to let peering to finish:
> 
> root@ceph-osd-13:~# ceph status
> 
> cluster 0a9f2d69-5905-4369-81ae-e36e4a791831
> 
>  health HEALTH_WARN
> 
> 484 pgs backfill_wait
> 
> 81 pgs backfilling
> 
> 40 pgs degraded
> 
> 40 pgs recovery_wait
> 
> 499 pgs stuck unclean
> 
> recovery 58391/279059773 objects degraded (0.021%)
> 
> recovery 6402109/279059773 objects misplaced (2.294%)
> 
>  monmap e3: 3 mons at 
> {0=130.232.243.65:6789/0,1=130.232.243.66:6789/0,2=130.232.243.67:6789/0}
> 
> election epoch 356, quorum 0,1,2 0,1,2
> 
>  osdmap e397853: 260 osds: 259 up, 241 in; 565 remapped pgs
> 
> flags require_jewel_osds
> 
>   pgmap v81199470: 25728 pgs, 8 pools, 203 TB data, 89794 kobjects
> 
> 613 TB used, 295 TB / 909 TB avail
> 
> 58391/279059773 objects degraded (0.021%)
> 
> 6402109/279059773 objects misplaced (2.294%)
> 
>25100 active+clean
> 
>  484 active+remapped+wait_backfill
> 
>   81 active+remapped+backfilling
> 
>   40 active+recovery_wait+degraded
> 
>   23 active+clean+scrubbing+deep
> 
> recovery io 2117 MB/s, 0 objects/s
> 
>   client io 737 kB/s rd, 6719 kB/s wr, 119 op/s rd, 0 op/s wr
> 
> 
> --
>   Eino Tuominen
> 
> --
> *From:* ceph-users  on behalf of Eino 
> Tuominen 
> *Sent:* Monday, July 10, 2017 14:35
> *To:* Gregory Farnum; Andras Pataki; ceph-users
> *Subject:* Re: [ceph-users] Degraded objects while OSD is being added/filled
>  
> 
> [replying to my post]
> 
> 
> In fact, I did just this:
> 
> 
> 1. On a HEALTH_OK​ 

[ceph-users] Prioritise recovery on specific PGs/OSDs?

2017-06-20 Thread Richard Hesketh
Is there a way, either by individual PG or by OSD, I can prioritise 
backfill/recovery on a set of PGs which are currently particularly important to 
me?

For context, I am replacing disks in a 5-node Jewel cluster, on a node-by-node 
basis - mark out the OSDs on a node, wait for them to clear, replace OSDs, 
bring up and in, mark out the OSDs on the next set, etc. I've done my first 
node, but the significant CRUSH map changes means most of my data is moving. I 
only currently care about the PGs on my next set of OSDs to replace - the other 
remapped PGs I don't care about settling because they're only going to end up 
moving around again after I do the next set of disks. I do want the PGs 
specifically on the OSDs I am about to replace to backfill because I don't want 
to compromise data integrity by downing them while they host active PGs. If I 
could specifically prioritise the backfill on those PGs/OSDs, I could get on 
with replacing disks without worrying about causing degraded PGs.

I'm in a situation right now where there is merely a couple of dozen PGs on the 
disks I want to replace, which are all remapped and waiting to backfill - but 
there are 2200 other PGs also waiting to backfill because they've moved around 
too, and it's extremely frustrating to be sat waiting to see when the ones I 
care about will finally be handled so I can get on with replacing those disks.

Rich





Re: [ceph-users] Reg: PG

2017-05-04 Thread Richard Hesketh
The extra pools are probably the data and metadata pools that are
automatically created for cephfs.

http://ceph.com/pgcalc/ is a useful tool for helping to work out how
many PGs your pools should have.
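
To see exactly which pools are contributing PGs, something like this (a sketch; 'rbd' 
is just an example pool name):

ceph osd lspools
ceph osd pool get rbd pg_num    # repeat per pool
ceph osd df                     # the last column shows PGs per OSD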

Rich

On 04/05/17 15:41, David Turner wrote:
> I'm guessing you have more than just the 1 pool with 128 PGs in your
> cluster (seeing as you have 320 PGs total, I would guess 2 pools with
> 128 PGs and 1 pool with 64 PGs).  The combined total number of PGs for
> all of your pools is 320 and with only 3 OSDs and most likely replica
> size 3... that leaves you with too many (320) PGs per OSD.  This will
> not likely affect your testing, but if you want to fix the problem you
> will need to delete and recreate your pools with a combined lower
> total number of PGs.
>
> The number of PGs is supposed to reflect how much data each pool is
> going to have.  If you have 1 pool that will have 75% of your
> cluster's data, another pool with 20%, and a third pool with 5%...
> then the number of PGs they have should reflect that.  Based on trying
> to have somewhere between 100-200 PGs per osd, and the above
> estimation for data distribution, you should have 128 PGs in the first
> pool, 32 PGs in the second, and 8 PGs in the third.  Each OSD would
> have 168 PGs and each PG will be roughly the same size between each
> pool.  If you were to add more OSDs, then you would need to increase
> those numbers to account for the additional OSDs to maintain the same
> distribution.  The above math is only for 3 OSDs.  If you had 6 OSDs,
> then the goal would be to have somewhere between 200-400 PGs total to
> maintain the same 100-200 PGs per OSD.
>
> On Thu, May 4, 2017 at 10:24 AM psuresh  > wrote:
>
> Hi,
>
> I'm running 3 osd in my test setup.   I have created PG pool with
> 128 as per the ceph documentation.  
> But i'm getting too many PGs warning.   Can anyone clarify? why
> i'm getting this warning.  
>
> Each OSD contain 240GB disk.
>
> cluster 9d325da2-3d87-4b6b-8cca-e52a4b65aa08
>  health HEALTH_WARN
>*too many PGs per OSD (320 > max 300)*
>  monmap e2: 3 mons at
> {dev-ceph-mon1:6789/0,dev-ceph-mon2:6789/0,dev-ceph-mon3:6789/0}
> election epoch 6, quorum 0,1,2
> dev-ceph-mon1,dev-ceph-mon2,dev-ceph-mon3
>   fsmap e40: 1/1/1 up {0=dev-ceph-mds-active=up:active}
>  osdmap e356: 3 osds: 3 up, 3 in
> flags sortbitwise,require_jewel_osds
>   pgmap v32407: 320 pgs, 3 pools, 27456 MB data, 220 kobjects
> 100843 MB used, 735 GB / 833 GB avail
>  320 active+clean
>
> Regards,
> Suresh
>


Re: [ceph-users] SSD Primary Affinity

2017-04-20 Thread Richard Hesketh
On 19/04/17 21:08, Reed Dier wrote:
> Hi Maxime,
> 
> This is a very interesting concept. Instead of the primary affinity being 
> used to choose SSD for primary copy, you set crush rule to first choose an 
> osd in the ‘ssd-root’, then the ‘hdd-root’ for the second set.
> 
> And with 'step chooseleaf first {num}’
>> If {num} > 0 && < pool-num-replicas, choose that many buckets. 
> So 1 chooses that bucket
>> If {num} < 0, it means pool-num-replicas - {num}
> And -1 means it will fill remaining replicas on this bucket.
> 
> This is a very interesting concept, one I had not considered.
> Really appreciate this feedback.
> 
> Thanks,
> 
> Reed
> 
>> On Apr 19, 2017, at 12:15 PM, Maxime Guyot <maxime.gu...@elits.com> wrote:
>>
>> Hi,
>>
>>>> Assuming production level, we would keep a pretty close 1:2 SSD:HDD ratio,
>>> 1:4-5 is common but depends on your needs and the devices in question, ie. 
>>> assuming LFF drives and that you aren’t using crummy journals.
>>
>> You might be speaking about different ratios here. I think that Anthony is 
>> speaking about journal/OSD and Reed speaking about capacity ratio between 
>> and HDD and SSD tier/root. 
>>
>> I have been experimenting with hybrid setups (1 copy on SSD + 2 copies on 
>> HDD), like Richard says you’ll get much better random read performance with 
>> primary OSD on SSD but write performance won’t be amazing since you still 
>> have 2 HDD copies to write before ACK. 
>>
>> I know the doc suggests using primary affinity but since it’s a OSD level 
>> setting it does not play well with other storage tiers so I searched for 
>> other options. From what I have tested, a rule that selects the 
>> first/primary OSD from the ssd-root then the rest of the copies from the 
>> hdd-root works. Though I am not sure it is *guaranteed* that the first OSD 
>> selected will be primary.
>>
>> “rule hybrid {
>>  ruleset 2
>>  type replicated
>>  min_size 1
>>  max_size 10
>>  step take ssd-root
>>  step chooseleaf firstn 1 type host
>>  step emit
>>  step take hdd-root
>>  step chooseleaf firstn -1 type host
>>  step emit
>> }”
>>
>> Cheers,
>> Maxime

FWIW splitting my HDDs and SSDs into two separate roots and using a crush rule 
to first choose a host from the SSD root and take remaining replicas on the HDD 
root was the way I did it, too. By inspection, it did seem that all PGs in the 
pool had an SSD for a primary, so I think this is a reliable way of doing it. 
You would of course end up with an acting primary on one of the slow spinners 
for a brief period if you lost an SSD for whatever reason and it needed to 
rebalance.

The only downside is that if you have your SSD and HDD OSDs on the same 
physical hosts I'm not sure how you set up your failure domains and rules to 
make sure that you don't take an SSD primary and HDD replica on the same host. 
In my case, SSDs and HDDs are on different hosts, so it didn't matter to me.
-- 
Richard Hesketh





Re: [ceph-users] SSD Primary Affinity

2017-04-19 Thread Richard Hesketh
On 18/04/17 22:28, Anthony D'Atri wrote:
> I get digests, so please forgive me if this has been covered already.
> 
>> Assuming production level, we would keep a pretty close 1:2 SSD:HDD ratio,
> 
> 1:4-5 is common but depends on your needs and the devices in question, ie. 
> assuming LFF drives and that you aren’t using crummy journals.
> 
>> First of all, is this even a valid architecture decision? 
> 
> Inktank described it to me back in 2014/2015 so I don’t think it’s ultra 
> outré.  It does sound like a lot of work to maintain, especially when 
> components get replaced or added.
> 
>> it should boost performance levels considerably compared to spinning disks,
> 
> Performance in which sense?  I would expect it to boost read performance but 
> not so much writes.
> 
> I haven’t used cache tiering so can’t comment on the relative merits.  Your 
> local workload may be a factor.
> 
> — aad

As it happens I've got a ceph cluster with a 1:2 SSD to HDD ratio and I did 
some fio testing a while ago with an SSD-primary pool to see how it performed, 
investigating as an alternative to a cache layer. Generally the results were as 
aad predicts - read performance for the pool was considerably better, almost as 
good as a pure SSD pool. Write performance was better but not so significantly 
improved, only going up to maybe 50% faster depending on the exact workload.
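
For anyone wanting to repeat that kind of test, fio's rbd engine is the easy way to do 
it - a sketch only: pool/image names and parameters are examples, and it needs fio 
built with rbd support plus a pre-created test image.

rbd create ssd-primary/fio-test --size 10240
fio --name=randread --ioengine=rbd --clientname=admin --pool=ssd-primary \
    --rbdname=fio-test --rw=randread --bs=4k --iodepth=32 --runtime=60 --time_based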

In the end I went with splitting the HDDs and SSDs into separate pools, and 
just using the SSD pool for VMs/datablocks which needed to be snappier. For 
most of my users it didn't matter that the backing pool was kind of slow, and 
only a few were wanting to do I/O intensive workloads where the speed was 
required, so putting so much of the data on the SSDs would have been something 
of a waste.

-- 
Richard Hesketh





Re: [ceph-users] Hummer upgrade stuck all OSDs down

2017-04-12 Thread Richard Hesketh
On 12/04/17 09:47, Siniša Denić wrote:
> Hi to all, my cluster got stuck after upgrading from hammer 0.94.5 to luminous.
> It seems the osds are somehow stuck at the hammer version despite
>  
> Can I somehow overcome this situation, and what could have happened during the 
> upgrade?
> I performed the upgrade from hammer with ceph-deploy install --release luminous
> 
> Thank you, best regards.
I don't know if you can fix it, but I think you've caused your problem by 
trying to jump directly from Hammer to Luminous. That's not an upgrade path 
that Ceph supports - see Kraken's release notes:

>All clusters must first be upgraded to Jewel 10.2.z before upgrading to Kraken 
>11.2.z (or, eventually, Luminous 12.2.z).
>The sortbitwise flag must be set on the Jewel cluster before upgrading to 
>Kraken. The latest Jewel (10.2.4+) releases issue a health warning if the flag 
>is not set, so this is probably already set. If it is not, Kraken OSDs will 
>refuse to start and will print an error message in their log.

Maybe you can do a Jewel install over it and do the Hammer -> Jewel upgrade to 
unbreak things.
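
i.e. something along these lines - a sketch only, hostnames are examples, and the 
Jewel release notes should be followed properly rather than this:

ceph-deploy install --release jewel mon1 mon2 mon3 osd1 osd2
# restart mons first, then OSDs, and check they all report a jewel version:
ceph tell osd.* version
# once everything is on jewel and healthy:
ceph osd set sortbitwise
ceph osd set require_jewel_osds
# only then start the jewel -> luminous upgrade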





Re: [ceph-users] What's the actual justification for min_size?

2017-03-22 Thread Richard Hesketh
I definitely saw it on a Hammer cluster, though I decided to check my IRC logs 
for more context and found that in my specific cases it was due to PGs going 
incomplete. `ceph health detail` offered the following, for instance:

pg 8.31f is remapped+incomplete, acting [39] (reducing pool one min_size from 2 
may help; search ceph.com/docs for 'incomplete')

And I had to do it on at least a couple of occasions while managing that 
cluster. I don't remember ever having the issue again after going to Infernalis 
and beyond, though. FWIW it was a 60-disk cluster with an above-average failure 
rate because many of my disks were donations from another project and were 
several years old already.
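
For the record, the temporary change itself is just the below, and per the point 
quoted below about setting it back immediately, it should be reverted as soon as the 
affected PGs have peered and recovered ('one' is the pool named in that health output):

ceph osd pool set one min_size 1
# watch 'ceph health detail' until the incomplete PGs recover, then:
ceph osd pool set one min_size 2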

I guess my curiosity is sated - min_size is relevant when you're also considering 
transient faults that may take disks down and bring them back up, since it prevents 
inconsistent state and lost writes in that window. It's not so relevant when you're 
talking about complete disk failures, because if a replica is irretrievably lost all 
you can do is rebuild it anyway, and you're only $size badly-timed disk failures away 
from losing a PG entirely regardless of the setting of min_size.

On 21/03/17 23:14, Anthony D'Atri wrote:
> I’m fairly sure I saw it as recently as Hammer, definitely Firefly. YMMV.
> 
> 
>> On Mar 21, 2017, at 4:09 PM, Gregory Farnum  wrote:
>>
>> You shouldn't need to set min_size to 1 in order to heal any more. That was 
>> the case a long time ago but it's been several major LTS releases now. :)
>> So: just don't ever set min_size to 1.
>> -Greg
>> On Tue, Mar 21, 2017 at 6:04 PM Anthony D'Atri  wrote:
 a min_size of 1 is dangerous though because it means you are 1 hard disk 
 failure away from losing the objects within that placement group entirely. 
 a min_size of 2 is generally considered the minimum you want but many 
 people ignore that advice, some wish they hadn't.
>>>
>>> I admit I am having difficulty following why this is the case
>>
>> I think we have a case of fervently agreeing.
>>
>> Setting min_size on a specific pool to 1 to allow PG’s to heal is absolutely 
>> a normal thing in certain circumstances, but it’s important to
>>
>> 1) Know _exactly_ what you’re doing, to which pool, and why
>> 2) Do it very carefully, changing ‘size’ instead of ‘min_size’ on a busy 
>> pool with a bunch of PG’s and data can be quite the rude awakening.
>> 3) Most importantly, _only_ set it for the minimum time needed, with eyes 
>> watching the healing, and set it back immediately after all affected PG’s 
>> have peered and healed.
>>
>> The danger, which I think is what Wes was getting at, is in leaving it set 
>> to 1 all the time, or forgetting to revert it.  THAT is, as we used to say, 
>> begging to lose.
>>
>> — aad





[ceph-users] What's the actual justification for min_size? (was: Re: I/O hangs with 2 node failure even if one node isn't involved in I/O)

2017-03-21 Thread Richard Hesketh
On 21/03/17 17:48, Wes Dillingham wrote:
> a min_size of 1 is dangerous though because it means you are 1 hard disk 
> failure away from losing the objects within that placement group entirely. a 
> min_size of 2 is generally considered the minimum you want but many people 
> ignore that advice, some wish they hadn't. 

I admit I am having difficulty following why this is the case. From searching 
about I understand that the min_size parameter prevents I/O to a PG which does 
not have the required number of replicas, but the justification confuses me - 
if your min_size is one, and you have a PG which now only exists on one OSD, 
surely you are one OSD failure away from losing that PG entirely regardless of 
whether or not you are doing any I/O to it, as that's the last copy of your 
data? And the OSD itself likely serves many other placement groups which are 
above the min_size, so it is not as if freezing I/O on that PG prevents the 
actual disk from doing any activity which could possibly exacerbate a failure. 
Is the assumption that the other lost OSDs could be coming back with their old 
copy of the PG so any newer writes to the PG would be lost if you're unlucky 
enough that the last remaining OSD went down before the others came back? Which 
is not the same thing as losing the objects in that PG entirely, though 
obviously it's not at all ideal, and is also completely irrelevant if you know 
the other OSDs will not be coming back. I am sure I remember having to reduce 
min_size to 1 temporarily in the past to allow recovery from having two drives 
irrecoverably die at the same time in one of my clusters.

Rich





Re: [ceph-users] Question regarding CRUSH algorithm

2017-02-17 Thread Richard Hesketh
On 16/02/17 20:44, girish kenkere wrote:
> Thanks David,
> 
> Its not quiet what i was looking for. Let me explain my question in more 
> detail -
> 
> This is excerpt from Crush paper, this explains how crush algo running on 
> each client/osd maps pg to an osd during the write operation[lets assume].
> 
> /"Tree buckets are structured as a weighted binary search tree with items at 
> the leaves. Each interior node knows the total weight of its left and right 
> subtrees and is labeled according to a fixed strategy (described below). In 
> order to select an item within a bucket, CRUSH starts at the root of the tree 
> and calculates the hash of the input key x, replica number r, the bucket 
> identifier, and the label at the current tree node (initially the root). The 
> result is compared to the weight ratio of the left and right subtrees to 
> decide which child node to visit next. This process is repeated until a leaf 
> node is reached, at which point the associated item in the bucket is chosen. 
> Only logn hashes and node comparisons are needed to locate an item.:"/
> 
>  My question is along the way the tree structure changes, weights of the 
> nodes change and some nodes even go away. In that case, how are future reads 
> lead to pg to same osd mapping? Its not cached anywhere, same algo runs for 
> every future read - what i am missing is how it picks the same osd[where data 
> resides] every time. With a modified crush map, won't we end up with 
> different leaf node if we apply same algo? 
> 
> Thanks
> Girish
> 
> On Thu, Feb 16, 2017 at 12:05 PM, David Turner  > wrote:
> 
> As a piece to the puzzle, the client always has an up to date osd map 
> (which includes the crush map).  If it's out of date, then it has to get a 
> new one before it can request to read or write to the cluster.  That way the 
> client will never have old information and if you add or remove storage, the 
> client will always have the most up to date map to know where the current 
> copies of the files are.
> 
> This can cause slow downs in your cluster performance if you are updating 
> your osdmap frequently, which can be caused by deleting a lot of snapshots as 
> an example.

> 
> --
> *From:* ceph-users [ceph-users-boun...@lists.ceph.com 
> ] on behalf of girish kenkere 
> [kngen...@gmail.com ]
> *Sent:* Thursday, February 16, 2017 12:43 PM
> *To:* ceph-users@lists.ceph.com 
> *Subject:* [ceph-users] Question regarding CRUSH algorithm
> 
> Hi, I have a question regarding CRUSH algorithm - please let me know how 
> this works. CRUSH paper talks about how given an object we select OSD via two 
> mapping - first one is obj to PG and then PG to OSD. 
> 
> This PG to OSD mapping is something i dont understand. It uses pg#, 
> cluster map, and placement rules. How is it guaranteed to return correct OSD 
> for future reads after the cluster map/placement rules has changed due to 
> nodes coming and out?
> 
> Thanks
> Girish

I think there is confusion over when the CRUSH algorithm is being run. It's my 
understanding that the object->PG mapping is always dynamically computed, and 
that's pretty simple (hash the object ID, take it modulo [num_pgs in pool], 
prepend pool ID, 8.0b's your uncle), but the PG->OSD mapping is only computed 
when new PGs are created or the CRUSH map changes. The result of that 
computation is stored in the cluster map and then locating a particular PG is a 
matter of looking it up in the map, not recalculating its location - PG 
placement is pseudorandom and nondeterministic anyway, so that would never work.
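
Incidentally, you can watch both mappings for any given object with 'ceph osd map' - a 
sketch, where the pool and object names are examples and the exact output format 
varies a little between releases:

ceph osd map rbd some-object
# osdmap e1234 pool 'rbd' (0) object 'some-object' -> pg 0.8f6d5c3a (0.3a)
#   -> up ([4,12,7], p4) acting ([4,12,7], p4)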

So - the client DOES run CRUSH to find the location of an object, but only in 
the sense of working out which PG it's in. It then looks up the PG in the 
cluster map (which includes the osdmap that David