[ceph-users] noout, nodown and blocked requests

2017-03-12 Thread Shain Miley
Hello,
One of the nodes in our 14 node cluster is offline and before I totally commit 
to fully removing the node from the cluster (there is a chance I can get the 
node back in working order in the next few days) I would like to run the 
cluster with that single node out for a few days.

Currently I have the noout and nodown flags set while doing the maintenance 
work.

Some users are complaining about disconnects and other oddities when trying to 
save and access files currently on the cluster.

I am also seeing some blocked requests when viewing the cluster status (at this 
point I see 160 blocked requests spread over 15 to 20 OSDs).

Currently I have a replication level of 3 on this pool and a min_size of 1. 

My question is this…is there a better method to use (other than using noout and 
nodown) in this scenario where I do not want data movement yet…but I do want 
the reads and writes to the cluster to respond as normally as possible for 
the end users?
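
For reference, the flag handling and one way to see which OSDs the blocked 
requests sit on (the grep pattern is just an example):

ceph osd set noout
ceph osd set nodown
ceph health detail | grep -Ei 'blocked|slow'   # lists the affected OSDs and request ages
ceph osd unset nodown    # once the node is back or permanently removed
ceph osd unset noout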

Thanks in advance,

Shain


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Latest Jewel New OSD Creation

2017-03-12 Thread Ashley Merrick
After rolling back to 10.2.5 the issue is gone; it seems there has been a change 
in 10.2.6 which breaks this.

Ashley

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Ashley 
Merrick
Sent: Saturday, 11 March 2017 11:32 AM
To: ceph-us...@ceph.com
Subject: [ceph-users] Latest Jewel New OSD Creation



Hello,

I am trying to add a new OSD to my Ceph cluster. I am running Proxmox, so I 
attempted it as normal via the GUI, however I received an error output 
at the following command:

ceph-disk prepare --zap-disk --fs-type xfs --cluster ceph --cluster-uuid 
51c1b5c5-e510-4ed3-8b09-417214edb3f4 --journal-dev /dev/sdc /dev/sdm1

Output : ceph-disk: Error: journal specified but not allowed by osd backend

This has only been happening since updating to v10.2.6; it looks like ceph-disk 
for some reason thinks the OSD should be a bluestore OSD?
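
A couple of things that might be worth checking while narrowing this down (a 
diagnostic sketch only; nothing here is specific to the Proxmox tooling):

grep -iE 'objectstore|bluestore' /etc/ceph/ceph.conf
# any 'osd objectstore' or experimental bluestore setting that could steer ceph-disk?

ceph-disk -v prepare --zap-disk --fs-type xfs --cluster ceph \
  --cluster-uuid 51c1b5c5-e510-4ed3-8b09-417214edb3f4 --journal-dev /dev/sdc /dev/sdm1
# same command with -v, which should show more about how ceph-disk decides on the OSD backend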

Ashley
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Upgrading 2K OSDs from Hammer to Jewel. Our experience

2017-03-12 Thread Christian Balzer

Hello,

On Sun, 12 Mar 2017 19:54:10 +0100 Florian Haas wrote:

> On Sat, Mar 11, 2017 at 12:21 PM,  wrote:
> > The upgrade of our biggest cluster, nr 4, did not go without
> > problems. Since we were expecting a lot of "failed to encode map
> > e with expected crc" messages, we disabled clog to monitors
> > with 'ceph tell osd.* injectargs -- --clog_to_monitors=false' so our
> > monitors would not choke on those messages. The upgrade of the
> > monitors did go as expected, without any problem, the problems
> > started when we started the upgrade of the OSDs. In the upgrade
> > procedure, we had to change the ownership of the files from root to
> > the user ceph and that process was taking so long on our cluster that
> > completing the upgrade would take more than a week. We decided to
> > keep the permissions as they were for now, so in the upstart init
> > script /etc/init/ceph-osd.conf, we changed '--setuser ceph --setgroup
> > ceph' to  '--setuser root --setgroup root' and fix that OSD by OSD
> > after the upgrade was completely done  
> 
> For others following this thread who still have the hammer→jewel upgrade
> ahead: there is a ceph.conf option you can use here; no need to fiddle
> with the upstart scripts.
> 
> setuser match path = /var/lib/ceph/$type/$cluster-$id
>

Yes, I was thinking about mentioning this, too.
Alas, in my experience with a wonky test cluster this failed for the MDS,
maybe because of an odd name, maybe because nobody ever tested it.
MONs and OSDs were fine.
 
> What this will do is it will check which user owns files in the
> respective directories, and then start your Ceph daemons under the
> appropriate user and group IDs. In other words, if you enable this and
> you upgrade from Hammer to Jewel, and your files are still owned by
> root, your daemons will also continue to run as root:root (as they did in
> hammer). Then, you can stop your OSDs, run the recursive chown, and
> restart the OSDs one-by-one. When they come back up, they will just
> automatically switch to running as ceph:ceph.
> 
Though if you have external journals and didn't use ceph-deploy, you're
boned with the whole ceph:ceph approach.
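
For what it's worth, the usual workaround for hand-rolled external journal 
partitions is to chown the journal block devices as well and make that stick 
across reboots with a udev rule; a rough sketch (device names are placeholders):

chown ceph:ceph $(readlink -f /var/lib/ceph/osd/ceph-*/journal)   # one-off, running system

# /etc/udev/rules.d/90-ceph-journal.rules
KERNEL=="sdc[1-9]", SUBSYSTEM=="block", OWNER="ceph", GROUP="ceph", MODE="0660"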

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Upgrading 2K OSDs from Hammer to Jewel. Our experience

2017-03-12 Thread Christian Balzer

Hello,

On Sun, 12 Mar 2017 19:52:12 +1000 Brad Hubbard wrote:

> On Sun, Mar 12, 2017 at 6:36 AM, Christian Theune  
> wrote:
> > Hi,
> >
> > thanks for that report! Glad to hear a mostly happy report. I’m still on the
> > fence … ;)
> >
> > I have had reports that Qemu (librbd connections) will require
> > updates/restarts before upgrading. What was your experience on that side?
> > Did you upgrade the clients? Did you start using any of the new RBD
> > features, like fast diff?  
> 
> You don't need to restart qemu-kvm instances *before* upgrading but
> you do need to restart or migrate them *after* updating. The updated
> binaries are only loaded into the qemu process address space at
> start-up so to load the newly installed binaries (libraries) you need
> to restart or do a migration to an upgraded host.
> 

Well, the OP wrote about live migration problems, but those were not in the
qemu part of things; they were libvirt/openstack related.

To wit, I did upgrade a test cluster from hammer to Jewel and live
migration under ganeti worked fine.

I've also not seen any problems on other instances that since have not
been restarted, nor would I hope that an upgrade from one stable version
to the next should EVER require such a step (at least immediately). 

Christian

> >
> > What’s your experience with load/performance after the upgrade? Found any
> > new issues that indicate shifted hotspots?
> >
> > Cheers and thanks again,
> > Christian
> >
> > On Mar 11, 2017, at 12:21 PM, cephmailingl...@mosibi.nl wrote:
> >
> > Hello list,
> >
> > A week ago we upgraded our Ceph clusters from Hammer to Jewel and with this
> > email we want to share our experiences.
> >
> >
> > We have four clusters:
> >
> > 1) Test cluster for all the fun things, completely virtual.
> >
> > 2) Test cluster for Openstack: 3 monitors and 9 OSDs, all baremetal
> >
> > 3) Cluster where we store backups: 3 monitors and 153 OSDs. 554 TB storage
> >
> > 4) Main cluster (used for our custom software stack and openstack): 5
> > monitors and 1917 OSDs. 8 PB storage
> >
> >
> > All the clusters are running on Ubuntu 14.04 LTS and we use the Ceph
> > packages from ceph.com. On every cluster we upgraded the monitors first and
> > after that, the OSDs. Our backup cluster is the only cluster that also
> > serves S3 via the RadosGW and that service is upgraded at the same time as
> > the OSDs in that cluster. The upgrade of clusters 1, 2 and 3 went without
> > any problem, just an apt-get upgrade on every component. We did  see the
> > message "failed to encode map e with expected crc", but that
> > message disappeared when all the OSDs were upgraded.
> >
> > The upgrade of our biggest cluster, nr 4, did not go without problems. Since
> > we were expecting a lot of "failed to encode map e with expected
> > crc" messages, we disabled clog to monitors with 'ceph tell osd.* injectargs
> > -- --clog_to_monitors=false' so our monitors would not choke on those
> > messages. The upgrade of the monitors did go as expected, without any
> > problem, the problems started when we started the upgrade of the OSDs. In
> > the upgrade procedure, we had to change the ownership of the files from root
> > to the user ceph and that process was taking so long on our cluster that
> > completing the upgrade would take more than a week. We decided to keep the
> > permissions as they were for now, so in the upstart init script
> > /etc/init/ceph-osd.conf, we changed '--setuser ceph --setgroup ceph' to
> > '--setuser root --setgroup root' and fix that OSD by OSD after the upgrade
> > was completely done
> >
> > On cluster 3 (backup) we could change the permissions in a shorter time with
> > the following procedure:
> >
> > a) apt-get -y install ceph-common
> > b) mount|egrep 'on \/var.*ceph.*osd'|awk '{print $3}'|while read P; do
> > echo chown -R ceph:ceph $P \&;done > t ; bash t ; rm t
> > c) (wait for all the chown's to complete)
> > d) stop ceph-all
> > e) find /var/lib/ceph/ ! -uid 64045 -print0|xargs -0  chown ceph:ceph
> > f) start ceph-all
> >
> > This procedure did not work on our main (4) cluster because the load on the
> > OSDs became 100% in step b and that resulted in blocked I/O on some virtual
> > instances in the Openstack cluster. Also at that time one of our pools got a
> > lot of extra data, those files were stored with root permissions since we
> > had not restarted the Ceph daemons yet; the 'find' in step e found so many
> > files that xargs (the shell) could not handle it (too many arguments). At
> > that time we decided to keep the permissions on root in the upgrade phase.
> >
> > The next and biggest problem we encountered had to do with the CRC errors on
> > the OSD map. On every map update, the OSDs that were not upgraded yet, got
> > that CRC error and asked the monitor for a full OSD map instead of just a
> > delta update. At first we did not understand what exactly happened, we ran
> > the upgrade 

Re: [ceph-users] speed decrease with size

2017-03-12 Thread Christian Balzer

Hello,

On Sun, 12 Mar 2017 19:37:16 -0400 Ben Erridge wrote:

> I am testing attached volume storage on our openstack cluster which uses
> ceph for block storage.
> Our Ceph nodes have large SSDs for their journals, 50+ GB for each OSD. I'm
> thinking some parameter is a little off because with relatively small
> writes I am seeing drastically reduced write speeds.
> 
Large journals are a waste for most people, especially when your backing
storage is HDDs.

> 
> We have 2 nodes with 12 total OSDs, each with a 50 GB SSD journal.
> 
I hope that's not your plan for production; with a replica of 2 you're
looking at pretty much guaranteed data loss over time, unless your OSDs
are actually RAIDs.

5GB journals tend to be overkill already.
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-March/008606.html
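
The sizing rule of thumb from the Ceph docs is roughly:

osd journal size = 2 * (expected throughput * filestore max sync interval)

so with, say, 150 MB/s of sustained writes per OSD and the default 5 s sync
interval you end up at about 1.5 GB; 5 GB already leaves ample headroom.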

If you were to actually look at your OSD nodes during those tests with
something like atop or "iostat -x", you'd likely see that with prolonged
writes you wind up with the speed of what your HDDs can do, i.e. see them
(all or individually) being quite busy.
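
For example, something along these lines on each OSD node while one of those
dd runs is going:

iostat -x 5
# watch %util and await: if the HDD-backed data disks sit near 100% %util while
# the journal SSDs stay mostly idle, you are simply seeing the HDD limit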

Lastly, for nearly everybody in real life situations the
bandwidth/throughput becomes a distant second to latency considerations. 

Christian

> 
>  here is our Ceph config
> 
> [global]
> fsid = 19bc15fd-c0cc-4f35-acd2-292a86fbcf7d
> mon_initial_members = node-5 node-4 node-3
> mon_host = 192.168.0.8 192.168.0.7 192.168.0.13
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
> filestore_xattr_use_omap = true
> log_to_syslog_level = info
> log_to_syslog = True
> osd_pool_default_size = 1
> osd_pool_default_min_size = 1
> osd_pool_default_pg_num = 64
> public_network = 192.168.0.0/24
> log_to_syslog_facility = LOG_LOCAL0
> osd_journal_size = 5
> auth_supported = cephx
> osd_pool_default_pgp_num = 64
> osd_mkfs_type = xfs
> cluster_network = 192.168.1.0/24
> osd_recovery_max_active = 1
> osd_max_backfills = 1
> 
> [client]
> rbd_cache = True
> rbd_cache_writethrough_until_flush = True
> 
> [client.radosgw.gateway]
> rgw_keystone_accepted_roles = _member_, Member, admin, swiftoperator
> keyring = /etc/ceph/keyring.radosgw.gateway
> rgw_socket_path = /tmp/radosgw.sock
> rgw_keystone_revocation_interval = 100
> rgw_keystone_url = 192.168.0.2:35357
> rgw_keystone_admin_token = ZBz37Vlv
> host = node-3
> rgw_dns_name = *.ciminc.com
> rgw_print_continue = True
> rgw_keystone_token_cache_size = 10
> rgw_data = /var/lib/ceph/radosgw
> user = www-data
> 
> This is the degradation I am speaking of..
> 
> 
> dd if=/dev/zero of=/mnt/ext4/output bs=1000k count=1k; rm -f
> /mnt/ext4/output;
> 1024+0 records in
> 1024+0 records out
> 1048576000 bytes (1.0 GB) copied, 0.887431 s, 1.2 GB/s
> 
> dd if=/dev/zero of=/mnt/ext4/output bs=1000k count=2k; rm -f
> /mnt/ext4/output;
> 2048+0 records in
> 2048+0 records out
> 2097152000 bytes (2.1 GB) copied, 3.75782 s, 558 MB/s
> 
>  dd if=/dev/zero of=/mnt/ext4/output bs=1000k count=3k; rm -f
> /mnt/ext4/output;
> 3072+0 records in
> 3072+0 records out
> 3145728000 bytes (3.1 GB) copied, 10.0054 s, 314 MB/s
> 
> dd if=/dev/zero of=/mnt/ext4/output bs=1000k count=5k; rm -f
> /mnt/ext4/output;
> 5120+0 records in
> 5120+0 records out
> 5242880000 bytes (5.2 GB) copied, 24.1971 s, 217 MB/s
> 
> Any suggestions for improving the large write degradation?
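
As an aside on the measurement itself: without a sync or direct flag the shorter
runs above largely measure the guest page cache plus rbd_cache, which is why the
first gigabyte looks so fast. Something like the following (same path as above)
gives numbers that are comparable across sizes:

dd if=/dev/zero of=/mnt/ext4/output bs=1000k count=1k conv=fdatasync
dd if=/dev/zero of=/mnt/ext4/output bs=1000k count=1k oflag=direct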


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd_disk_thread_ioprio_priority help

2017-03-12 Thread Florian Haas
On Sat, Mar 11, 2017 at 4:24 PM, Laszlo Budai  wrote:
>>> Can someone explain the meaning of osd_disk_thread_ioprio_priority. I'm
>>> [...]
>>>
>>> Now I am confused  :(
>>>
>>> Can somebody bring some light here?
>>
>>
>> Only to confuse you some more. If you are running Jewel or above then
>> scrubbing is now done in the main operation thread and so setting this
>> value
>> will have no effect.
>
>
> We are running the hammer version of ceph.

As the documentation
(http://docs.ceph.com/docs/hammer/rados/configuration/osd-config-ref/)
explains, osd_disk_thread_ioprio_priority is ineffective unless you're
using the CFQ I/O scheduler. Most server I/O stack configurations
(including typical Ceph OSD nodes) run with the deadline scheduler
either by default or by best-practice configuration, so this option
has no effect on such systems.

You can check /sys/block/<device>/queue/scheduler to see which
scheduler your devices are currently using.
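
For example (sdb stands in for an OSD data disk):

cat /sys/block/sdb/queue/scheduler
# prints something like: noop [deadline] cfq

echo cfq > /sys/block/sdb/queue/scheduler
# only if you deliberately switch to CFQ do the ioprio settings below do anything
ceph tell osd.* injectargs -- --osd_disk_thread_ioprio_class=idle --osd_disk_thread_ioprio_priority=7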

Cheers,
Florian
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Upgrading 2K OSDs from Hammer to Jewel. Our experience

2017-03-12 Thread Florian Haas
On Sat, Mar 11, 2017 at 12:21 PM,  wrote:
> The upgrade of our biggest cluster, nr 4, did not go without
> problems. Since we were expecting a lot of "failed to encode map
> e with expected crc" messages, we disabled clog to monitors
> with 'ceph tell osd.* injectargs -- --clog_to_monitors=false' so our
> monitors would not choke on those messages. The upgrade of the
> monitors did go as expected, without any problem, the problems
> started when we started the upgrade of the OSDs. In the upgrade
> procedure, we had to change the ownership of the files from root to
> the user ceph and that process was taking so long on our cluster that
> completing the upgrade would take more than a week. We decided to
> keep the permissions as they were for now, so in the upstart init
> script /etc/init/ceph-osd.conf, we changed '--setuser ceph --setgroup
> ceph' to  '--setuser root --setgroup root' and fix that OSD by OSD
> after the upgrade was completely done

For others following this thread who still have the hammer→jewel upgrade
ahead: there is a ceph.conf option you can use here; no need to fiddle
with the upstart scripts.

setuser match path = /var/lib/ceph/$type/$cluster-$id

What this will do is it will check which user owns files in the
respective directories, and then start your Ceph daemons under the
appropriate user and group IDs. In other words, if you enable this and
you upgrade from Hammer to Jewel, and your files are still owned by
root, your daemons will also continue to run as root:root (as they did in
hammer). Then, you can stop your OSDs, run the recursive chown, and
restart the OSDs one-by-one. When they come back up, they will just
automatically switch to running as ceph:ceph.
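
For the Ubuntu 14.04/upstart setups discussed in this thread, a sketch of how
that looks in practice (the [global] placement and osd id 12 are just examples):

# ceph.conf
[global]
setuser match path = /var/lib/ceph/$type/$cluster-$id

# then, one OSD at a time, at your own pace:
stop ceph-osd id=12
chown -R ceph:ceph /var/lib/ceph/osd/ceph-12
start ceph-osd id=12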

Cheers,
Florian



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pgs stuck inactive

2017-03-12 Thread Brad Hubbard
On Sun, Mar 12, 2017 at 7:51 PM, Laszlo Budai  wrote:
> Hello,
>
> I have already done the export with ceph_objectstore_tool. I just have to
> decide which OSDs to keep.
> Can you tell me why the directory structure in the OSDs is different for the
> same PG when checking on different OSDs?
> For instance, in OSD 2 and 63 there are NO subdirectories in the
> 3.367_head, while OSDs 28 and 35 contain
> ./DIR_7/DIR_6/DIR_B/
> ./DIR_7/DIR_6/DIR_3/
>
> When are these subdirectories created?
>
> The files are identical on all the OSDs; only the way they are stored
> is different. It would be enough if you could point me to some documentation
> that explains these, I'll read it. So far, searching for the architecture of
> an OSD, I could not find the gory details about these directories.

https://github.com/ceph/ceph/blob/master/src/os/filestore/HashIndex.h
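
In short, as I read HashIndex: filestore splits a PG collection into those
DIR_<hexdigit> subdirectories once a single directory holds more than roughly
16 * filestore_split_multiple * abs(filestore_merge_threshold) objects, and
merges them back when the count drops below the merge threshold. With the
defaults that is:

filestore split multiple  = 2     # default
filestore merge threshold = 10    # default
# split point ≈ 16 * 2 * 10 = 320 objects in one directory

The split decisions are made locally on each OSD as objects are written, which
is presumably why two OSDs can end up with different layouts for the same PG
even though the objects themselves are identical.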

>
> Kind regards,
> Laszlo
>
>
> On 12.03.2017 02:12, Brad Hubbard wrote:
>>
>> On Sat, Mar 11, 2017 at 7:43 PM, Laszlo Budai 
>> wrote:
>>>
>>> Hello,
>>>
>>> Thank you for your answer.
>>>
>>> indeed the min_size is 1:
>>>
>>> # ceph osd pool get volumes size
>>> size: 3
>>> # ceph osd pool get volumes min_size
>>> min_size: 1
>>> #
>>> I'm gonna try to find the mentioned discussions on the mailing lists, and
>>> read them. If you have a link at hand, that would be nice if you would
>>> send
>>> it to me.
>>
>>
>> This thread is one example, there are lots more.
>>
>>
>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-December/014846.html
>>
>>>
>>> In the attached file you can see the contents of the directory containing
>>> PG
>>> data on the different OSDs (all that have appeared in the pg query).
>>> According to the md5sums the files are identical. What bothers me is the
>>> directory structure (you can see the ls -R in each dir that contains
>>> files).
>>
>>
>> So I mixed up 63 and 68, my list should have read 2, 28, 35 and 63
>> since 68 is listed as empty in the pg query.
>>
>>>
>>> Where can I read about how/why those DIR# subdirectories have appeared?
>>>
>>> Given that the files themselves are identical on the "current" OSDs
>>> belonging to the PG, and as the osd.63 (currently not belonging to the
>>> PG)
>>> has the same files, is it safe to stop the OSD.2, remove the 3.367_head
>>> dir,
>>> and then restart the OSD? (all these with the noout flag set of course)
>>
>>
>> *You* need to decide which is the "good" copy and then follow the
>> instructions in the links I provided to try and recover the pg. Back
>> those known copies on 2, 28, 35 and 63 up with the
>> ceph_objectstore_tool before proceeding. They may well be identical
>> but the peering process still needs to "see" the relevant logs and
>> currently something is stopping it doing so.
>>
>>>
>>> Kind regards,
>>> Laszlo
>>>
>>>
>>> On 11.03.2017 00:32, Brad Hubbard wrote:


 So this is why it happened I guess.

 pool 3 'volumes' replicated size 3 min_size 1

 min_size = 1 is a recipe for disasters like this and there are plenty
 of ML threads about not setting it below 2.

 The past intervals in the pg query show several intervals where a
 single OSD may have gone rw.

 How important is this data?

 I would suggest checking which of these OSDs actually have the data
 for this pg. From the pg query it looks like 2, 35 and 68 and possibly
 28 since it's the primary. Check all OSDs in the pg query output. I
 would then back up all copies and work out which copy, if any, you
 want to keep and then attempt something like the following.

 https://www.mail-archive.com/ceph-users@lists.ceph.com/msg17820.html

 If you want to abandon the pg see


 http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-September/012778.html
 for a possible solution.

 http://ceph.com/community/incomplete-pgs-oh-my/ may also give some
 ideas.


 On Fri, Mar 10, 2017 at 9:44 PM, Laszlo Budai 
 wrote:
>
>
> The OSDs are all there.
>
> $ sudo ceph osd stat
>  osdmap e60609: 72 osds: 72 up, 72 in
>
> and I have attached the result of the ceph osd tree and ceph osd dump
> commands.
> I got some extra info about the network problem. A faulty network device
> has flooded the network eating up all the bandwidth so the OSDs were not
> able to properly communicate with each other. This has lasted for almost 1 day.
>
> Thank you,
> Laszlo
>
>
>
> On 10.03.2017 12:19, Brad Hubbard wrote:
>>
>>
>>
>> To me it looks like someone may have done an "rm" on these OSDs but
>> not removed them from the crushmap. This does not happen
>> automatically.
>>
>> Do these OSDs show up in "ceph osd tree" and "ceph osd dump" ? If so,
>> paste the output.
>>
>> Without knowing 

Re: [ceph-users] Upgrading 2K OSDs from Hammer to Jewel. Our experience

2017-03-12 Thread Brad Hubbard
On Sun, Mar 12, 2017 at 6:36 AM, Christian Theune  wrote:
> Hi,
>
> thanks for that report! Glad to hear a mostly happy report. I’m still on the
> fence … ;)
>
> I have had reports that Qemu (librbd connections) will require
> updates/restarts before upgrading. What was your experience on that side?
> Did you upgrade the clients? Did you start using any of the new RBD
> features, like fast diff?

You don't need to restart qemu-kvm instances *before* upgrading but
you do need to restart or migrate them *after* updating. The updated
binaries are only loaded into the qemu process address space at
start-up so to load the newly installed binaries (libraries) you need
to restart or do a migration to an upgraded host.
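
One way to spot guests that still have the pre-upgrade library mapped after the
package update (a sketch; the library path pattern may differ per distro):

grep -l 'librbd.*(deleted)' /proc/*/maps 2>/dev/null
# lists the /proc/<pid>/maps of processes still using the replaced librbd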

>
> What’s your experience with load/performance after the upgrade? Found any
> new issues that indicate shifted hotspots?
>
> Cheers and thanks again,
> Christian
>
> On Mar 11, 2017, at 12:21 PM, cephmailingl...@mosibi.nl wrote:
>
> Hello list,
>
> A week ago we upgraded our Ceph clusters from Hammer to Jewel and with this
> email we want to share our experiences.
>
>
> We have four clusters:
>
> 1) Test cluster for all the fun things, completely virtual.
>
> 2) Test cluster for Openstack: 3 monitors and 9 OSDs, all baremetal
>
> 3) Cluster where we store backups: 3 monitors and 153 OSDs. 554 TB storage
>
> 4) Main cluster (used for our custom software stack and openstack): 5
> monitors and 1917 OSDs. 8 PB storage
>
>
> All the clusters are running on Ubuntu 14.04 LTS and we use the Ceph
> packages from ceph.com. On every cluster we upgraded the monitors first and
> after that, the OSDs. Our backup cluster is the only cluster that also
> serves S3 via the RadosGW and that service is upgraded at the same time as
> the OSDs in that cluster. The upgrade of clusters 1, 2 and 3 went without
> any problem, just an apt-get upgrade on every component. We did  see the
> message "failed to encode map e with expected crc", but that
> message disappeared when all the OSDs were upgraded.
>
> The upgrade of our biggest cluster, nr 4, did not go without problems. Since
> we were expecting a lot of "failed to encode map e with expected
> crc" messages, we disabled clog to monitors with 'ceph tell osd.* injectargs
> -- --clog_to_monitors=false' so our monitors would not choke on those
> messages. The upgrade of the monitors did go as expected, without any
> problem, the problems started when we started the upgrade of the OSDs. In
> the upgrade procedure, we had to change the ownership of the files from root
> to the user ceph and that process was taking so long on our cluster that
> completing the upgrade would take more than a week. We decided to keep the
> permissions as they were for now, so in the upstart init script
> /etc/init/ceph-osd.conf, we changed '--setuser ceph --setgroup ceph' to
> '--setuser root --setgroup root' and fix that OSD by OSD after the upgrade
> was completely done
>
> On cluster 3 (backup) we could change the permissions in a shorter time with
> the following procedure:
>
> a) apt-get -y install ceph-common
> b) mount|egrep 'on \/var.*ceph.*osd'|awk '{print $3}'|while read P; do
> echo chown -R ceph:ceph $P \&;done > t ; bash t ; rm t
> c) (wait for all the chown's to complete)
> d) stop ceph-all
> e) find /var/lib/ceph/ ! -uid 64045 -print0|xargs -0  chown ceph:ceph
> f) start ceph-all
>
> This procedure did not work on our main (4) cluster because the load on the
> OSDs became 100% in step b and that resulted in blocked I/O on some virtual
> instances in the Openstack cluster. Also at that time one of our pools got a
> lot of extra data, those files were stored with root permissions since we
> had not restarted the Ceph daemons yet; the 'find' in step e found so many
> files that xargs (the shell) could not handle it (too many arguments). At
> that time we decided to keep the permissions on root in the upgrade phase.
>
> The next and biggest problem we encountered had to do with the CRC errors on
> the OSD map. On every map update, the OSDs that were not upgraded yet, got
> that CRC error and asked the monitor for a full OSD map instead of just a
> delta update. At first we did not understand what exactly happened, we ran
> the upgrade per node using a script and in that script we watch the state of
> the cluster and when the cluster is healthy again, we upgrade the next host.
> Every time we started the script (skipping the already upgraded hosts) the
> first host(s) upgraded without issues and then we got blocked I/O on the
> cluster. The blocked I/O went away within a minute or 2 (not measured).
> After investigation we found out that the blocked I/O happened when nodes
> were asking the monitor for a (full) OSD map and that resulted shortly in a
> fully saturated network link on our monitor.
>
> In the next graph the statistics for one of our Ceph monitor is shown. Our
> hosts are equipped with 10 gbit/s NIC's and every time at 

Re: [ceph-users] pgs stuck inactive

2017-03-12 Thread Laszlo Budai

Hello,

I have already done the export with ceph_objectstore_tool. I just have to 
decide which OSDs to keep.
Can you tell me why the directory structure in the OSDs is different for the 
same PG when checking on different OSDs?
For instance, in OSDs 2 and 63 there are NO subdirectories in the 3.367_head, 
while OSDs 28 and 35 contain
./DIR_7/DIR_6/DIR_B/
./DIR_7/DIR_6/DIR_3/

When are these subdirectories created?

The files are identical on all the OSDs; only the way they are stored is 
different. It would be enough if you could point me to some documentation that 
explains these, I'll read it. So far, searching for the architecture of an OSD, 
I could not find the gory details about these directories.

Kind regards,
Laszlo

On 12.03.2017 02:12, Brad Hubbard wrote:

On Sat, Mar 11, 2017 at 7:43 PM, Laszlo Budai  wrote:

Hello,

Thank you for your answer.

indeed the min_size is 1:

# ceph osd pool get volumes size
size: 3
# ceph osd pool get volumes min_size
min_size: 1
#
I'm gonna try to find the mentioned discussions on the mailing lists, and
read them. If you have a link at hand, that would be nice if you would send
it to me.


This thread is one example, there are lots more.

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-December/014846.html



In the attached file you can see the contents of the directory containing PG
data on the different OSDs (all that have appeared in the pg query).
According to the md5sums the files are identical. What bothers me is the
directory structure (you can see the ls -R in each dir that contains files).


So I mixed up 63 and 68, my list should have read 2, 28, 35 and 63
since 68 is listed as empty in the pg query.



Where can I read about how/why those DIR# subdirectories have appeared?

Given that the files themselves are identical on the "current" OSDs
belonging to the PG, and as the osd.63 (currently not belonging to the PG)
has the same files, is it safe to stop the OSD.2, remove the 3.367_head dir,
and then restart the OSD? (all these with the noout flag set of course)


*You* need to decide which is the "good" copy and then follow the
instructions in the links I provided to try and recover the pg. Back
those known copies on 2, 28, 35 and 63 up with the
ceph_objectstore_tool before proceeding. They may well be identical
but the peering process still needs to "see" the relevant logs and
currently something is stopping it doing so.



Kind regards,
Laszlo


On 11.03.2017 00:32, Brad Hubbard wrote:


So this is why it happened I guess.

pool 3 'volumes' replicated size 3 min_size 1

min_size = 1 is a recipe for disasters like this and there are plenty
of ML threads about not setting it below 2.

The past intervals in the pg query show several intervals where a
single OSD may have gone rw.

How important is this data?

I would suggest checking which of these OSDs actually have the data
for this pg. From the pg query it looks like 2, 35 and 68 and possibly
28 since it's the primary. Check all OSDs in the pg query output. I
would then back up all copies and work out which copy, if any, you
want to keep and then attempt something like the following.

https://www.mail-archive.com/ceph-users@lists.ceph.com/msg17820.html

If you want to abandon the pg see

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-September/012778.html
for a possible solution.

http://ceph.com/community/incomplete-pgs-oh-my/ may also give some ideas.


On Fri, Mar 10, 2017 at 9:44 PM, Laszlo Budai 
wrote:


The OSDs are all there.

$ sudo ceph osd stat
 osdmap e60609: 72 osds: 72 up, 72 in

and I have attached the result of the ceph osd tree and ceph osd dump
commands.
I got some extra info about the network problem. A faulty network device has
flooded the network eating up all the bandwidth so the OSDs were not able to
properly communicate with each other. This has lasted for almost 1 day.

Thank you,
Laszlo



On 10.03.2017 12:19, Brad Hubbard wrote:



To me it looks like someone may have done an "rm" on these OSDs but
not removed them from the crushmap. This does not happen
automatically.

Do these OSDs show up in "ceph osd tree" and "ceph osd dump" ? If so,
paste the output.

Without knowing what exactly happened here it may be difficult to work
out how to proceed.

In order to go clean the primary needs to communicate with multiple
OSDs, some of which are marked DNE and seem to be uncontactable.

This seems to be more than a network issue (unless the outage is still
happening).



http://docs.ceph.com/docs/master/rados/operations/pg-states/?highlight=incomplete



On Fri, Mar 10, 2017 at 6:09 PM, Laszlo Budai 
wrote:



Hello,

I was informed that due to a networking issue the ceph cluster network
was
affected. There was a huge packet loss, and network interfaces were
flipping. That's all I got.
This outage has lasted a longer period of time. So I assume that some
OSD
may have been considered dead 

Re: [ceph-users] Upgrading 2K OSDs from Hammer to Jewel. Our experience

2017-03-12 Thread cephmailinglist

On 03/11/2017 09:49 PM, Udo Lembke wrote:

Hi Udo,

Perhaps would an "find /var/lib/ceph/ ! -uid 64045 -exec chown
ceph:ceph" do an better job?!


We did exactly that (and also tried other combinations) and that is a 
workaround for the 'argument list too long' problem, but then it would call 
an exec for every file it finds. All those forks took forever... :)
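
For what it's worth, find can batch the arguments itself, which avoids both the
per-file fork and the argument-list limit (same uid test as above):

find /var/lib/ceph/ ! -uid 64045 -exec chown ceph:ceph {} +
# or keep the -print0 form; xargs splits the list into multiple chown calls on its own:
find /var/lib/ceph/ ! -uid 64045 -print0 | xargs -0 -r chown ceph:ceph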



--
With regards,

Richard Arends.
Snow BV / http://snow.nl

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Upgrading 2K OSDs from Hammer to Jewel. Our experience

2017-03-12 Thread cephmailinglist

On 03/11/2017 09:36 PM, Christian Theune wrote:

Hello,

I have had reports that Qemu (librbd connections) will require 
updates/restarts before upgrading. What was your experience on that 
side? Did you upgrade the clients? Did you start using any of the new 
RBD features, like fast diff?


We have two types of clients: 1) Openstack hosts and components like 
Cinder and 2) clients that use librbd (from Java and C). We combine Ceph 
and Openstack on the same host, meaning that when we upgraded Ceph for 
the OSDs, the libraries for Openstack were updated at the same time. The 
other type of clients were already using the Jewel libraries and 
binaries for some time. We did not change anything on the clients, so 
we are not using the newly introduced features (yet).


What’s your experience with load/performance after the upgrade? Found 
any new issues that indicate shifted hotspots?


We did not see any difference.

--
With regards,

Richard Arends.
Snow BV / http://snow.nl

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com