Re: [ceph-users] After kernel upgrade OSD's on different disk.

2016-11-01 Thread Peter Maloney
On 11/01/16 00:10, jan hugo prins wrote:
> After the kernel upgrade, I also upgraded the cluster to 10.2.3 from
> 10.2.2.
> Let's hope I only hit a bug and that this bug is now fixed, on the other
> hand, I think I also saw the issue with a 10.2.3 node, but I'm not sure.
It's not a bug for disks to change names... you should never expect them
to be static for any Linux system, ceph or not. As Henrik has already
said, this is normal.
> On 10/31/2016 11:41 PM, Henrik Korkuc wrote:
>> this is normal. You should expect that your disks may get reordered
>> after reboot.

>> On 16-10-31 18:32, jan hugo prins wrote:
>>> My idea to fix this is to use the Disk UUID instead of the dev name
>>> (/dev/disk/by-uuid/ instead of /dev/sda) when activating the disk.
>>> But I really don't know if this is possible.
>>>
>>> Could anyone tell me if I can prevent this issue in the future?
This is what I would do... always use something that doesn't change,
such as the filesystem UUID, GPT partlabel, GPT partuuid, etc.

And I wouldn't use udev the way the others suggested... I think it's
much simpler to use a static name than to make udev enforce that the
normally dynamic names stay static.
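For example (a minimal sketch; the device names below are just
illustrations), you can see the stable identifiers udev already maintains
for each partition:

# list the stable names that survive device reordering
ls -l /dev/disk/by-uuid/ /dev/disk/by-partuuid/ /dev/disk/by-partlabel/

# blkid shows the same information per partition (filesystem UUID,
# GPT PARTLABEL and PARTUUID)
blkid /dev/sdc1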


Re: [ceph-users] Question about writing a program that transfer snapshot diffs between ceph clusters

2016-11-01 Thread Peter Maloney
On 11/01/16 06:57, xxhdx1985126 wrote:
> Hi, everyone.
>
> I'm trying to write a program based on the librbd API that transfers
> snapshot diffs between ceph clusters without the need for a temporary
> storage which is required if I use the "rbd export-diff" and "rbd
> import-diff" pair.

You don't need a temp file for this... eg.


ssh node1 rbd export-diff rbd/blah@snap1 | rbd import-diff rbd/blah
ssh node1 rbd export-diff --from-snap snap1 rbd/blah@snap2 | rbd
import-diff rbd/blah



Re: [ceph-users] Question about writing a program that transfer snapshot diffs between ceph clusters

2016-11-01 Thread Peter Maloney
On 11/01/16 10:22, Peter Maloney wrote:
> On 11/01/16 06:57, xxhdx1985126 wrote:
>> Hi, everyone.
>>
>> I'm trying to write a program based on the librbd API that transfers
>> snapshot diffs between ceph clusters without the need for a temporary
>> storage which is required if I use the "rbd export-diff" and "rbd
>> import-diff" pair.
>
> You don't need a temp file for this... eg.
>
>
Oops, forgot the "-" in the commands; corrected:
> ssh node1 rbd export-diff rbd/blah@snap1 - | rbd import-diff - rbd/blah
> ssh node1 rbd export-diff --from-snap snap1 rbd/blah@snap2 - | rbd
> import-diff - rbd/blah
>
>
>


Re: [ceph-users] After kernel upgrade OSD's on different disk.

2016-11-01 Thread jan hugo prins
Below are the block ID's for the OSD drives.
I have one journal disk in this system, and because I'm testing the setup
at the moment, one disk has its journal local and the other 2 OSDs
have their journal on the journal disk (/dev/sdb). There is also one
journal too many, but this is because I took out /dev/sda and put it back
into the cluster with a local journal instead of a journal on the
journal drive.

/dev/sdc1: UUID="f16b94cc-f691-40b8-92a2-06c5263683d6" TYPE="xfs"
PARTLABEL="ceph data" PARTUUID="ea5fd156-f82b-4686-8d62-c86fc430098c"
/dev/sdd1: UUID="f9114559-af27-4a10-96b5-5c1b8bce8fbd" TYPE="xfs"
PARTLABEL="ceph data" PARTUUID="28716fa4-c7ba-4db0-9117-da5ad781b3e5"
/dev/sda1: UUID="05048fb7-c79c-46da-80a1-95aa7be0dd41" TYPE="xfs"
PARTLABEL="ceph data" PARTUUID="10fa40ab-1cfe-4bf6-8f06-967158ab6aa3"
/dev/sdb1: PARTLABEL="ceph journal"
PARTUUID="e270318c-1921-44d6-9bf5-e5832c0c57e4"
/dev/sdb2: PARTLABEL="ceph journal"
PARTUUID="6fff2c84-d28d-4be6-bc53-b80da87701d4"
/dev/sdb3: PARTLABEL="ceph journal"
PARTUUID="ce6e335b-fba3-413f-a657-64c7727f6289"
/dev/sda2: PARTLABEL="ceph journal"
PARTUUID="e80b53aa-324a-4689-a06d-ea3aae79702e"

The 95-ceph-osd.rules file is the same on all systems, so I would think
they are from the Jewel Ceph RPM.

[root@blsceph01-1 ~]# md5sum /lib/udev/rules.d/95-ceph-osd.rules
b4132c970fd72e718fda1865f458210e  /lib/udev/rules.d/95-ceph-osd.rules
[root@blsceph01-2 ~]# md5sum /lib/udev/rules.d/95-ceph-osd.rules
b4132c970fd72e718fda1865f458210e  /lib/udev/rules.d/95-ceph-osd.rules
[root@blsceph01-3 ~]# md5sum /lib/udev/rules.d/95-ceph-osd.rules
b4132c970fd72e718fda1865f458210e  /lib/udev/rules.d/95-ceph-osd.rules

I just rebooted the blsceph01-1 and all the disks came back normally.

I'm still very curious as to what will happen the next time I get a
kernel update, or any other time my systems decide at boot time to
rearrange the disks again.

Jan Hugo



On 11/01/2016 12:15 AM, Henrik Korkuc wrote:
> How are your OSDs setup? It is possible that udev rules didn't
> activate your OSDs if it didn't match rules. Refer to
> /lib/udev/rules.d/95-ceph-osd.rules. Basically your partition types
> must be of correct type for it to work
>
> On 16-10-31 19:10, jan hugo prins wrote:
>> After the kernel upgrade, I also upgraded the cluster to 10.2.3 from
>> 10.2.2.
>> Let's hope I only hit a bug and that this bug is now fixed, on the other
>> hand, I think I also saw the issue with a 10.2.3 node, but I'm not sure.
>>
>> Jan Hugo
>>
>>
>> On 10/31/2016 11:41 PM, Henrik Korkuc wrote:
>>> this is normal. You should expect that your disks may get reordered
>>> after reboot. I am not sure about your setup details, but in 10.2.3
>>> udev should be able to activate your OSDs no matter the naming (there
>>> were some bugs in previous 10.2.x releases)
>>>
>>> On 16-10-31 18:32, jan hugo prins wrote:
 Hello,

 After patching my OSD servers with the latest Centos kernel and
 rebooting the nodes, all OSD drives moved to different positions.

 Before the reboot:

 Systemdisk: /dev/sda
 Journaldisk: /dev/sdb
 OSD disk 1: /dev/sdc
 OSD disk 2: /dev/sdd
 OSD disk 3: /dev/sde

 After the reboot:

 Systemdisk: /dev/sde
 journaldisk: /dev/sdb
 OSD disk 1: /dev/sda
 OSD disk 2: /dev/sdc
 OSD disk 3: /dev/sdd

 The result was that the OSD didn't start at boot-up and I had to
 manually activate them again.
 After rebooting OSD node 1 I checked the state of the Ceph cluster
 before rebooting node number 2. I found that the disks were not online
 and I needed to fix this. In the end I was able to do all the upgrades
 etc, but this was a big surprise to me.

 My idea to fix this is to use the Disk UUID instead of the dev name
 (/dev/disk/by-uuid/ instead of /dev/sda) when activating the
 disk.
 But I really don't know if this is possible.

 Could anyone tell me if I can prevent this issue in the future?


-- 
Met vriendelijke groet / Best regards,

Jan Hugo Prins
Infra and Isilon storage consultant

Better.be B.V.
Auke Vleerstraat 140 E | 7547 AN Enschede | KvK 08097527
T +31 (0) 53 48 00 694 | M +31 (0)6 26 358 951
jpr...@betterbe.com | www.betterbe.com




[ceph-users] Integrating Ceph Jewel and Mitaka

2016-11-01 Thread fridifree
Hi everybody,

I am trying to integrate Ceph (Jewel) with Mitaka and I get an error where
cinder-volume cannot connect to the cluster:

2016-11-01 11:40:51.110 13762 ERROR oslo_service.service
VolumeBackendAPIException: Bad or unexpected response from the storage
volume backend API: Error connecting to ceph cluster.

It works with Hammer and Mitaka.


any suggestions?

Thank you


Re: [ceph-users] Question about writing a program that transfer snapshot diffs between ceph clusters

2016-11-01 Thread Wes Dillingham
You might want to have a look at this:
https://github.com/camptocamp/ceph-rbd-backup/blob/master/ceph-rbd-backup.py

I have a bash implementation of this, but it basically boils down to
wrapping what Peter said: an export-diff to stdout piped to an
import-diff on a different cluster. The "transfer" node is a client of
both clusters and simply iterates over all rbd images, snapshotting
them daily, exporting the diff between today's snap and yesterday's
snap, and layering that diff onto a sister rbd on the remote side.
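Something like this minimal sketch captures that loop (pool name, remote
cluster conf path and snapshot naming are placeholders, and it assumes
yesterday's snapshot and the destination image already exist on both sides):

#!/bin/bash
# daily incremental replication from the local cluster to a remote one
POOL=rbd
TODAY=$(date +%F)
YESTERDAY=$(date -d yesterday +%F)

for IMG in $(rbd -p "$POOL" ls); do
    # snapshot today's state on the source cluster
    rbd snap create "$POOL/$IMG@$TODAY"

    # stream only the delta since yesterday and layer it onto the sister
    # image on the remote cluster (reached via its own ceph.conf/keyring)
    rbd export-diff --from-snap "$YESTERDAY" "$POOL/$IMG@$TODAY" - \
        | rbd --conf /etc/ceph/remote-cluster.conf import-diff - "$POOL/$IMG"
done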


On Tue, Nov 1, 2016 at 5:23 AM, Peter Maloney
 wrote:
> On 11/01/16 10:22, Peter Maloney wrote:
>
> On 11/01/16 06:57, xxhdx1985126 wrote:
>
> Hi, everyone.
>
> I'm trying to write a program based on the librbd API that transfers
> snapshot diffs between ceph clusters without the need for a temporary
> storage which is required if I use the "rbd export-diff" and "rbd
> import-diff" pair.
>
>
> You don't need a temp file for this... eg.
>
>
> oops forgot the "-" in the commands corrected:
>
> ssh node1 rbd export-diff rbd/blah@snap1 - | rbd import-diff - rbd/blah
> ssh node1 rbd export-diff --from-snap snap1 rbd/blah@snap2 - | rbd
> import-diff - rbd/blah
>
>
>



-- 
Respectfully,

Wes Dillingham
wes_dilling...@harvard.edu
Research Computing | Infrastructure Engineer
Harvard University | 38 Oxford Street, Cambridge, Ma 02138 | Room 210


Re: [ceph-users] 10Gbit switch advice for small ceph cluster upgrade

2016-11-01 Thread Alexandre DERUMIER
>>We use Mellanox SX1036 and SX1012, which can function in 10 and 56GbE modes.  
>>It uses QSFP, Twinax or MPO, which terminates with LC fiber connections.  
>>While not dirt cheap, or entry >>level, we like these as being considerably 
>>cheaper than even a decent SDN solution.  We have been able to build MLAG and 
>>leaf and spine solutions pretty easily with these.

We use SX1012 too, pretty happy with them (in production for 2 years now).

We are going to use the new SN2100 (16 ports of 40Gb or 16 ports of 100Gb; with
breakout cables that's 48 ports of 10Gb or 48 ports of 25Gb).

Around 6000€ for the 40Gb model, and 12000€ for the 100Gb one (with mlx-os, but
you can also use Cumulus with these new Mellanox switches :)



- Original message -
From: "Simon Leinen"
To: "Erik McCormick"
Cc: "ceph-users"
Sent: Monday, 31 October 2016 11:28:07
Subject: Re: [ceph-users] 10Gbit switch advice for small ceph cluster upgrade

Erik McCormick writes: 
> We use Edge-Core 5712-54x running Cumulus Linux. Anything off their 
> compatibility list would be good though. The switch is 48 10G sfp+ 
> ports. We just use copper cables with attached sfp. It also had 6 40G 
> ports. The switch cost around $4800 and the cumulus license is about 
> 3k for a perpetual license. 

Similar here, except we use Quanta switches (T5032-LY6). 

SFP+ slots and DAC cables. Actually our switches are 32*40GE, and we 
use "fan-out" DAC cables (QSFP on one side, 4 SFP+ on the other). 

Compared to 10GBaseT (RJ45), DAC cables are thicker, which may 
complicate cable management a little. On the other hand I think DAC 
still needs less power than 10GBaseT. And with the 40G setup, we have 
good port density and a smooth migration path to 40GE. We already use 
40GE for our leaf-spine uplinks. Another advantage for us is that we 
can use a single SKU for both leaf and spine switches. 

The Cumulus licenses are a bit more expensive for those 40GE switches 
(as are the switches themselves), but it's still a good deal for us. 

Maybe these days it makes sense to look at 100GE switches in preference 
to 40GE; 100GE ports can normally be used as 2*50GE, 4*25GE, 1*40GE or 
4*10GE as well, so the upgrade paths seem even nicer. And the prices 
are getting competitive I think. 
-- 
Simon. 


[ceph-users] Hammer Cache Tiering

2016-11-01 Thread Ashley Merrick
Hello,

Currently using a Proxmox & Ceph cluster; they are running on Hammer and I'm
looking to update to Jewel shortly. I know I can do a manual upgrade, however I
would like to keep to what is well tested with Proxmox.

Looking to put an SSD cache tier in front, however I have seen and read that
there have been a few bugs with cache tiering causing corruption. From what I
read they are all fixed in Jewel, however I'm not 100% sure whether they have
been backported to Hammer (even though it is still not EOL for a little while).

Is anyone running cache tiering on Hammer in production with no issues, or is
anyone aware of any bugs/issues that mean I should hold off until I upgrade to
Jewel, or basically any reason to hold off for a month or so and update to
Jewel before enabling a cache tier?

Thanks!
,Ashley


Re: [ceph-users] After kernel upgrade OSD's on different disk.

2016-11-01 Thread David Turner
Peter nailed this on the head.  You shouldn't setup your journals using 
/dev/sdx naming.  You should use /dev/disk/by-partuuid or something similar.  
This way it will not matter what letter your drives are assigned on reboot.  
Your /dev/sdx letter assignments can change on a reboot regardless if you 
changed your kernel or ceph version.
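As a rough sketch of what that looks like for an existing filestore OSD (the
OSD id and PARTUUID below are only placeholders taken from this thread; stop
the OSD before touching the symlink):

# the journal is just a symlink in the OSD's data directory
ls -l /var/lib/ceph/osd/ceph-0/journal

# stop the OSD, repoint the symlink at the stable partuuid path, restart
systemctl stop ceph-osd@0
ln -sf /dev/disk/by-partuuid/e270318c-1921-44d6-9bf5-e5832c0c57e4 \
    /var/lib/ceph/osd/ceph-0/journal
systemctl start ceph-osd@0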



David Turner | Cloud Operations Engineer | StorageCraft Technology Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2760 | Mobile: 385.224.2943







From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Peter Maloney 
[peter.malo...@brockmann-consult.de]
Sent: Tuesday, November 01, 2016 3:18 AM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] After kernel upgrade OSD's on different disk.

On 11/01/16 00:10, jan hugo prins wrote:
> After the kernel upgrade, I also upgraded the cluster to 10.2.3 from
> 10.2.2.
> Let's hope I only hit a bug and that this bug is now fixed, on the other
> hand, I think I also saw the issue with a 10.2.3 node, but I'm not sure.
It's not a bug for disks to change names... you should never expect them
to be static for any Linux system, ceph or not. As Henrik has already
said, this is normal.
> On 10/31/2016 11:41 PM, Henrik Korkuc wrote:
>> this is normal. You should expect that your disks may get reordered
>> after reboot.

>> On 16-10-31 18:32, jan hugo prins wrote:
>>> My idea to fix this is to use the Disk UUID instead of the dev name
>>> (/dev/disk/by-uuid/ instead of /dev/sda) when activating the disk.
>>> But I really don't know if this is possible.
>>>
>>> Could anyone tell me if I can prevent this issue in the future?
This is what I would do... always use something that doesn't change,
such as the filesystem UUID, GPT partlabel, GPT partuuid, etc.

And I wouldn't use udev the way the others suggested... I think it's
much simpler to use a static name than to make udev enforce that the
normally dynamic names stay static.


Re: [ceph-users] I need help building the source code can anyone help?

2016-11-01 Thread Kamble, Nitin A
Building Ceph is a somewhat involved process.

What version are you trying to build?
For building, are you following the README in the code?
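Judging from the CMake errors below (missing src/lua, src/rocksdb and
googletest), it looks like the git submodules were never checked out;
assuming the tree was cloned with git, something like this usually sorts it
out:

cd ~/projects/ceph
git submodule update --init --recursive

# then reconfigure from a clean build directory
rm -rf build && mkdir build && cd build
cmake ..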

- Nitin
On Oct 28, 2016, at 12:16 AM, 刘 畅 
> wrote:

After I successfully ran install-deps.sh, I tried to run cmake and it returned
the following:

ubuntu@i-c9rgl1y5:~/projects/ceph/build$ ls
bin  CMakeCache.txt  CMakeFiles  doc  include  man  src
ubuntu@i-c9rgl1y5:~/projects/ceph/build$ cmake ..
-- /usr/lib/x86_64-linux-gnu/libatomic_ops.a
-- NSS_LIBRARIES: 
/usr/lib/x86_64-linux-gnu/libssl3.so;/usr/lib/x86_64-linux-gnu/libsmime3.so;/usr/lib/x86_64-linux-gnu/libnss3.so;/usr/lib/x86_64-linux-gnu/libnssutil3.so
-- NSS_INCLUDE_DIRS: /usr/include/nss
-- SSL with NSS selected (Libs: 
/usr/lib/x86_64-linux-gnu/libssl3.so;/usr/lib/x86_64-linux-gnu/libsmime3.so;/usr/lib/x86_64-linux-gnu/libnss3.so;/usr/lib/x86_64-linux-gnu/libnssutil3.so)
-- Found PythonLibs: /usr/lib/x86_64-linux-gnu/libpython2.7.so (found suitable 
version "2.7.12", minimum required is "2.7")
-- Boost version: 1.58.0
-- Found the following Boost libraries:
--   python
-- Boost version: 1.58.0
-- Found the following Boost libraries:
--   thread
--   system
--   regex
--   random
--   program_options
--   date_time
--   iostreams
--   chrono
--   atomic
--  we have a modern and working yasm
--  we are x84_64
--  we are not x32
--  yasm can also build the isa-l stuff
-- Found PythonLibs: /usr/lib/x86_64-linux-gnu/libpython2.7.so (found suitable 
version "2.7.12", minimum required is "2")
--  Using EventEpoll for events.
CMake Error at src/CMakeLists.txt:508 (add_subdirectory):
  The source directory

/home/ubuntu/projects/ceph/src/lua

  does not contain a CMakeLists.txt file.


-- Found cython
CMake Error at /usr/share/cmake-3.5/Modules/ExternalProject.cmake:1915 
(message):
  No download info given for 'rocksdb_ext' and its source directory:

   /home/ubuntu/projects/ceph/src/rocksdb

  is not an existing non-empty directory.  Please specify one of:

   * SOURCE_DIR with an existing non-empty directory
   * URL
   * GIT_REPOSITORY
   * HG_REPOSITORY
   * CVS_REPOSITORY and CVS_MODULE
   * SVN_REVISION
   * DOWNLOAD_COMMAND
Call Stack (most recent call first):
  /usr/share/cmake-3.5/Modules/ExternalProject.cmake:2459 
(_ep_add_download_command)
  src/CMakeLists.txt:655 (ExternalProject_Add)


CMake Error at src/CMakeLists.txt:706 (add_subdirectory):
  add_subdirectory given source "googletest/googlemock" which is not an
  existing directory.


-- Configuring incomplete, errors occurred!
See also "/home/ubuntu/projects/ceph/build/CMakeFiles/CMakeOutput.log".
See also "/home/ubuntu/projects/ceph/build/CMakeFiles/CMakeError.log".
ubuntu@i-c9rgl1y5:~/projects/ceph/build$


[ceph-users] pg stuck with unfound objects on non exsisting osd's

2016-11-01 Thread Ronny Aasen

Hello.

I have a cluster stuck with 2 pg's stuck undersized degraded, with 25 
unfound objects.


# ceph health detail
HEALTH_WARN 2 pgs degraded; 2 pgs recovering; 2 pgs stuck degraded; 2 pgs stuck 
unclean; 2 pgs stuck undersized; 2 pgs undersized; recovery 294599/149522370 
objects degraded (0.197%); recovery 640073/149522370 objects misplaced 
(0.428%); recovery 25/46579241 unfound (0.000%); noout flag(s) set
pg 6.d4 is stuck unclean for 8893374.380079, current state 
active+recovering+undersized+degraded+remapped, last acting [62]
pg 6.ab is stuck unclean for 8896787.249470, current state 
active+recovering+undersized+degraded+remapped, last acting [18,12]
pg 6.d4 is stuck undersized for 438122.427341, current state 
active+recovering+undersized+degraded+remapped, last acting [62]
pg 6.ab is stuck undersized for 416947.461950, current state 
active+recovering+undersized+degraded+remapped, last acting [18,12]
pg 6.d4 is stuck degraded for 438122.427402, current state 
active+recovering+undersized+degraded+remapped, last acting [62]
pg 6.ab is stuck degraded for 416947.462010, current state 
active+recovering+undersized+degraded+remapped, last acting [18,12]
pg 6.d4 is active+recovering+undersized+degraded+remapped, acting [62], 25 
unfound
pg 6.ab is active+recovering+undersized+degraded+remapped, acting [18,12]
recovery 294599/149522370 objects degraded (0.197%)
recovery 640073/149522370 objects misplaced (0.428%)
recovery 25/46579241 unfound (0.000%)
noout flag(s) set


I have been following the troubleshooting guide at
http://docs.ceph.com/docs/hammer/rados/troubleshooting/troubleshooting-pg/
but get stuck without a resolution.


Luckily it is not critical data, so I wanted to mark the pg lost so it
could become health-ok.



# ceph pg 6.d4 mark_unfound_lost delete
Error EINVAL: pg has 25 unfound objects but we haven't probed all 
sources, not marking lost


Querying the pg, I see that it would want osd.80 and osd.36:

 {
"osd": "80",
"status": "osd is down"
},

Trying to mark the OSDs lost does not work either, since the OSDs were
removed from the cluster a long time ago.


# ceph osd lost 80 --yes-i-really-mean-it
osd.80 is not down or doesn't exist

# ceph osd lost 36 --yes-i-really-mean-it
osd.36 is not down or doesn't exist


And this is where I am stuck.

I have tried stopping and starting the 3 OSDs, but that did not have any
effect.


Anyone have any advice how to proceed ?

full output at:  http://paste.debian.net/hidden/be03a185/

this is hammer 0.94.9  on debian 8.


kind regards

Ronny Aasen





Re: [ceph-users] pg stuck with unfound objects on non exsisting osd's

2016-11-01 Thread ceph
Hello Ronny,

If it is possible for you, try to reboot all OSD nodes.

I had this issue on my test cluster and it became healthy after rebooting.

Hth
- Mehmet

On 1 November 2016 at 19:55:07 CET, Ronny Aasen wrote:
>Hello.
>
>I have a cluster stuck with 2 pg's stuck undersized degraded, with 25 
>unfound objects.
>
># ceph health detail
>HEALTH_WARN 2 pgs degraded; 2 pgs recovering; 2 pgs stuck degraded; 2
>pgs stuck unclean; 2 pgs stuck undersized; 2 pgs undersized; recovery
>294599/149522370 objects degraded (0.197%); recovery 640073/149522370
>objects misplaced (0.428%); recovery 25/46579241 unfound (0.000%);
>noout flag(s) set
>pg 6.d4 is stuck unclean for 8893374.380079, current state
>active+recovering+undersized+degraded+remapped, last acting [62]
>pg 6.ab is stuck unclean for 8896787.249470, current state
>active+recovering+undersized+degraded+remapped, last acting [18,12]
>pg 6.d4 is stuck undersized for 438122.427341, current state
>active+recovering+undersized+degraded+remapped, last acting [62]
>pg 6.ab is stuck undersized for 416947.461950, current state
>active+recovering+undersized+degraded+remapped, last acting [18,12]
>pg 6.d4 is stuck degraded for 438122.427402, current state
>active+recovering+undersized+degraded+remapped, last acting [62]
>pg 6.ab is stuck degraded for 416947.462010, current state
>active+recovering+undersized+degraded+remapped, last acting [18,12]
>pg 6.d4 is active+recovering+undersized+degraded+remapped, acting [62],
>25 unfound
>pg 6.ab is active+recovering+undersized+degraded+remapped, acting
>[18,12]
>recovery 294599/149522370 objects degraded (0.197%)
>recovery 640073/149522370 objects misplaced (0.428%)
>recovery 25/46579241 unfound (0.000%)
>noout flag(s) set
>
>
>have been following the troubleshooting guide at 
>http://docs.ceph.com/docs/hammer/rados/troubleshooting/troubleshooting-pg/
>
>but gets stuck without a resolution.
>
>luckily it is not critical data. so i wanted to mark the pg lost so it 
>could become health-ok
>
>
># ceph pg 6.d4 mark_unfound_lost delete
>Error EINVAL: pg has 25 unfound objects but we haven't probed all 
>sources, not marking lost
>
>querying the pg i see that it would want osd.80 and osd 36
>
>  {
> "osd": "80",
> "status": "osd is down"
> },
>
>trying to mark the osd's lost does not work either. since the osd's was
>
>removed from the cluster a long time ago.
>
># ceph osd lost 80 --yes-i-really-mean-it
>osd.80 is not down or doesn't exist
>
># ceph osd lost 36 --yes-i-really-mean-it
>osd.36 is not down or doesn't exist
>
>
>and this is where i am stuck.
>
>have tried stopping and starting the 3 osd's but that did not have any 
>effect.
>
>Anyone have any advice how to proceed ?
>
>full output at:  http://paste.debian.net/hidden/be03a185/
>
>this is hammer 0.94.9  on debian 8.
>
>
>kind regards
>
>Ronny Aasen
>
>
>


Re: [ceph-users] Uniquely identifying a Ceph client

2016-11-01 Thread Sage Weil
On Tue, 1 Nov 2016, Travis Rhoden wrote:
> Hello,
> Is there a consistent, reliable way to identify a Ceph client? I'm looking
> for a string/ID (UUID, for example) that can be traced back to a client
> doing RBD maps.
> 
> There are a couple of possibilities out there, but they aren't quite what
> I'm looking for.  When checking "rbd status", for example, the output is the
> following:
> 
> # rbd status travis2
> Watchers:
> watcher=172.21.12.10:0/1492902152 client.4100 cookie=1
> # rbd status travis3
> Watchers:
> watcher=172.21.12.10:0/1492902152 client.4100 cookie=2
> 
> 
> The IP:port/nonce string is an option, and so is the "client.<id>" string,
> but neither of these is actually that helpful because they aren't the same
> strings when an advisory lock is added to the RBD images. For example:

Both are sufficient.  The <id> in client.<id> is the most concise and is 
unique per client instance.

I think the problem you're seeing is actually that qemu is using two 
different librbd/librados instances, one for each mapped device?

> # rbd lock list travis2
> There is 1 exclusive lock on this image.
> Locker      ID     Address
> client.4201 test 172.21.12.100:0/967432549
> # rbd lock list travis3
> There is 1 exclusive lock on this image.
> Locker      ID     Address
> client.4240 test 172.21.12.10:0/2888955091
> 
> Note that neither the nonce nor the client ID match -- so by looking at the
> rbd lock output, you can't match that information against the output from
> "rbd status". I believe this is because the nonce the client identifier is
> reflecting the CephX session between client and cluster, and while this is
> persistent across "rbd map" calls (because the rbd kmod has a shared session
> by default, though that can be changed as well), each call to "rbd lock"
> initiates a new session. Hence a new nonce and client ID.
> 
> That pretty much leaves the IP address. These would seem to be problematic
> as an identifier if the client happened to behind NAT.
> 
> I am trying to be able to definitely determine what client has an RBD mapped
> and locked, but I'm not seeing a way to guarantee that you've uniquely
> identified a client. Am I missing something obvious?
> 
> Perhaps my concern about NAT is overblown -- I've never mounted an RBD from
> a client that is behind NAT, and I'm not sure how common that would be
> (though I think it would work).

It should work, but it's untested.  :)

sage


Re: [ceph-users] Total free space in addition to MAX AVAIL

2016-11-01 Thread Sage Weil
On Tue, 1 Nov 2016, Stillwell, Bryan J wrote:
> I recently learned that 'MAX AVAIL' in the 'ceph df' output doesn't
> represent what I thought it did.  It actually represents the amount of
> data that can be used before the first OSD becomes full, and not the sum
> of all free space across a set of OSDs.  This means that balancing the
> data with 'ceph osd reweight' will actually increase the value of 'MAX
> AVAIL'.
> 
> Knowing this I would like to graph both 'MAX AVAIL' and the total free
> space across two different sets of OSDs so I can get an idea how out of
> balance the cluster is.
> 
> This is where I'm running into trouble.  I have two different types of
> Ceph nodes in my cluster.  One with all HDDs+SSD journals, and the other
> with all SSDs using co-located journals.  There isn't any cache tiering
> going on, so a pool either uses the all-HDD root, or the all-SSD root, but
> not both.
> 
> The only method I can think of to get this information is to walk the
> CRUSH tree to figure out which OSDs are under a given root, and then use
> the output of 'ceph osd df -f json' to sum up the free space of each OSD.
> Is there a better way?

Try

ceph osd df tree -f json-pretty

I think that'll give you all the right fields you need to sum.

I wonder if this is something we should be reporting elsewhere, though?  
Summing up all free space is one thing.  Doing it per CRUSH hierarchy is 
something else.  Maybe the 'ceph osd df tree' output could have a field 
summing freespace for self + children in the json dump only...
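For what it's worth, a minimal sketch of summing the per-root free space from
that JSON with jq (field names are what I recall from the 'ceph osd df tree'
dump, and it assumes the usual root -> host -> osd layout, so treat it as
illustrative rather than definitive):

ceph osd df tree -f json | jq -r '
  .nodes as $n
  | $n[] | select(.type == "root") | . as $root
  | [ $root.children[] as $h
      | $n[] | select(.id == $h)
      | .children[] as $o
      | $n[] | select(.id == $o)
      | .kb_avail ] | add
  | "\($root.name): \(. / 1024 / 1024 | floor) GB free"
'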

sage


[ceph-users] Need help! Ceph backfill_toofull and recovery_wait+degraded

2016-11-01 Thread Marcus Müller
Hi all,

I have a big problem and I really hope someone can help me!

We have been running a ceph cluster for a year now. Version is: 0.94.7 (Hammer)
Here is some info:

Our osd map is:

ID WEIGHT   TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY 
-1 26.67998 root default 
-2  3.64000 host ceph1   
 0  3.64000 osd.0   up  1.0  1.0 
-3  3.5 host ceph2   
 1  3.5 osd.1   up  1.0  1.0 
-4  3.64000 host ceph3   
 2  3.64000 osd.2   up  1.0  1.0 
-5 15.89998 host ceph4   
 3  4.0 osd.3   up  1.0  1.0 
 4  3.5 osd.4   up  1.0  1.0 
 5  3.2 osd.5   up  1.0  1.0 
 6  5.0 osd.6   up  1.0  1.0 

ceph df:

GLOBAL:
SIZE   AVAIL  RAW USED %RAW USED 
40972G 26821G   14151G 34.54 
POOLS:
NAMEID USED  %USED MAX AVAIL OBJECTS 
blocks  7  4490G 10.96 1237G 7037004 
commits 8   473M 0 1237G  802353 
fs  9  9666M  0.02 1237G 7863422 

ceph osd df:

ID WEIGHT  REWEIGHT SIZE   USEAVAIL  %USE  VAR  
 0 3.64000  1.0  3724G  3128G   595G 84.01 2.43 
 1 3.5  1.0  3724G  3237G   487G 86.92 2.52 
 2 3.64000  1.0  3724G  3180G   543G 85.41 2.47 
 3 4.0  1.0  7450G  1616G  5833G 21.70 0.63 
 4 3.5  1.0  7450G  1246G  6203G 16.74 0.48 
 5 3.2  1.0  7450G  1181G  6268G 15.86 0.46 
 6 5.0  1.0  7450G   560G  6889G  7.52 0.22 
  TOTAL 40972G 14151G 26820G 34.54  
MIN/MAX VAR: 0.22/2.52  STDDEV: 36.53


Our current cluster state is: 

 health HEALTH_WARN
63 pgs backfill
8 pgs backfill_toofull
9 pgs backfilling
11 pgs degraded
1 pgs recovering
10 pgs recovery_wait
11 pgs stuck degraded
89 pgs stuck unclean
recovery 8237/52179437 objects degraded (0.016%)
recovery 9620295/52179437 objects misplaced (18.437%)
2 near full osd(s)
noout,noscrub,nodeep-scrub flag(s) set
 monmap e8: 4 mons at 
{ceph1=192.168.10.3:6789/0,ceph2=192.168.10.4:6789/0,ceph3=192.168.10.5:6789/0,ceph4=192.168.60.6:6789/0}
election epoch 400, quorum 0,1,2,3 ceph1,ceph2,ceph3,ceph4
 osdmap e1774: 7 osds: 7 up, 7 in; 84 remapped pgs
flags noout,noscrub,nodeep-scrub
  pgmap v7316159: 320 pgs, 3 pools, 4501 GB data, 15336 kobjects
14152 GB used, 26820 GB / 40972 GB avail
8237/52179437 objects degraded (0.016%)
9620295/52179437 objects misplaced (18.437%)
 231 active+clean
  61 active+remapped+wait_backfill
   9 active+remapped+backfilling
   6 active+recovery_wait+degraded+remapped
   6 active+remapped+backfill_toofull
   4 active+recovery_wait+degraded
   2 active+remapped+wait_backfill+backfill_toofull
   1 active+recovering+degraded
recovery io 11754 kB/s, 35 objects/s
  client io 1748 kB/s rd, 249 kB/s wr, 44 op/s


My main problems are: 

- As you can see from the osd tree, we have three separate hosts with only one
OSD each, and another one with four OSDs. Ceph does not let me move data off
these three single-OSD nodes, which are all near full. I tried to set the
weight of the OSDs in the bigger node higher, but this just does not work. So I
added a new OSD yesterday, which did not make things better, as you can see
now. What do I have to do to get these three nodes empty again and put more
data on the other node with the four HDDs?

- I added the „ceph4“ node later; this resulted in a strange IP change, as you
can see in the mon list. The public network and the cluster network were
swapped or not assigned correctly. See ceph.conf:

[global]
fsid = xxx
mon_initial_members = ceph1
mon_host = 192.168.10.3, 192.168.10.4, 192.168.10.5, 192.168.10.11
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
public_network = 192.168.60.0/24
cluster_network = 192.168.10.0/24
osd pool default size = 3
osd pool default min size = 1
osd pool default pg num = 128
osd pool default pgp num = 128
osd recovery max active = 50
osd recovery threads = 3
mon_pg_warn_max_per_osd = 0

  What can I do in this case (it's no big problem, since the network is 2x 10
GbE and everything works)?

- One other thing: even if I just prepare the OSD, it's automatically added to
the cluster; I cannot activate it. Has someone else already had such

Re: [ceph-users] Need help! Ceph backfill_toofull and recovery_wait+degraded

2016-11-01 Thread Ronny Aasen
If you have the default crushmap and osd pool default size = 3, then
ceph creates 3 copies of each object and stores them on 3 separate nodes.

So the best way to solve your space problems is to try to even out the
space between your hosts, either by adding disks to ceph1, ceph2 and ceph3,
or by adding more nodes.



kind regards
Ronny Aasen




On 01.11.2016 20:14, Marcus Müller wrote:
> Hi all,
>
> i have a big problem and i really hope someone can help me!
>
> We are running a ceph cluster since a year now. Version is: 0.94.7 
(Hammer)

> Here is some info:
>
> Our osd map is:
>
> ID WEIGHT   TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY
> -1 26.67998 root default
> -2  3.64000 host ceph1
>  0  3.64000 osd.0   up  1.0  1.0
> -3  3.5 host ceph2
>  1  3.5 osd.1   up  1.0  1.0
> -4  3.64000 host ceph3
>  2  3.64000 osd.2   up  1.0  1.0
> -5 15.89998 host ceph4
>  3  4.0 osd.3   up  1.0  1.0
>  4  3.5 osd.4   up  1.0  1.0
>  5  3.2 osd.5   up  1.0  1.0
>  6  5.0 osd.6   up  1.0  1.0
>
> ceph df:
>
> GLOBAL:
> SIZE   AVAIL  RAW USED %RAW USED
> 40972G 26821G   14151G 34.54
> POOLS:
> NAMEID USED  %USED MAX AVAIL OBJECTS
> blocks  7  4490G 10.96 1237G 7037004
> commits 8   473M 0 1237G  802353
> fs  9  9666M  0.02 1237G 7863422
>
> ceph osd df:
>
> ID WEIGHT  REWEIGHT SIZE   USEAVAIL  %USE  VAR
>  0 3.64000  1.0  3724G  3128G   595G 84.01 2.43
>  1 3.5  1.0  3724G  3237G   487G 86.92 2.52
>  2 3.64000  1.0  3724G  3180G   543G 85.41 2.47
>  3 4.0  1.0  7450G  1616G  5833G 21.70 0.63
>  4 3.5  1.0  7450G  1246G  6203G 16.74 0.48
>  5 3.2  1.0  7450G  1181G  6268G 15.86 0.46
>  6 5.0  1.0  7450G   560G  6889G  7.52 0.22
>   TOTAL 40972G 14151G 26820G 34.54
> MIN/MAX VAR: 0.22/2.52  STDDEV: 36.53
>
>
> Our current cluster state is:
>
>  health HEALTH_WARN
> 63 pgs backfill
> 8 pgs backfill_toofull
> 9 pgs backfilling
> 11 pgs degraded
> 1 pgs recovering
> 10 pgs recovery_wait
> 11 pgs stuck degraded
> 89 pgs stuck unclean
> recovery 8237/52179437 objects degraded (0.016%)
> recovery 9620295/52179437 objects misplaced (18.437%)
> 2 near full osd(s)
> noout,noscrub,nodeep-scrub flag(s) set
>  monmap e8: 4 mons at 
{ceph1=192.168.10.3:6789/0,ceph2=192.168.10.4:6789/0,ceph3=192.168.10.5:6789/0,ceph4=192.168.60.6:6789/0}

> election epoch 400, quorum 0,1,2,3 ceph1,ceph2,ceph3,ceph4
>  osdmap e1774: 7 osds: 7 up, 7 in; 84 remapped pgs
> flags noout,noscrub,nodeep-scrub
>   pgmap v7316159: 320 pgs, 3 pools, 4501 GB data, 15336 kobjects
> 14152 GB used, 26820 GB / 40972 GB avail
> 8237/52179437 objects degraded (0.016%)
> 9620295/52179437 objects misplaced (18.437%)
>  231 active+clean
>   61 active+remapped+wait_backfill
>9 active+remapped+backfilling
>6 active+recovery_wait+degraded+remapped
>6 active+remapped+backfill_toofull
>4 active+recovery_wait+degraded
>2 active+remapped+wait_backfill+backfill_toofull
>1 active+recovering+degraded
> recovery io 11754 kB/s, 35 objects/s
>   client io 1748 kB/s rd, 249 kB/s wr, 44 op/s
>
>
> My main problems are:
>
> - As you can see from the osd tree, we have three separate hosts with 
only one osd each. Another one has four osds. Ceph allows me not to get 
data back from these three nodes with only one HDD, which are all near 
full. I tried to set the weight of the osds in the bigger node higher 
but this just does not work. So i added a new osd yesterday which made 
things not better, as you can see now. What do i have to do to just 
become these three nodes empty again and put more data on the other node 
with the four HDDs.

>
> - I added the „ceph4“ node later, this resulted in a strange ip 
change as you can see in the mon list. The public network and the 
cluster network were swapped or not assigned right. See ceph.conf

>
> [global]
> fsid = xxx
> mon_initial_members = ceph1
> mon_host = 192.168.10.3, 192.168.10.4, 192.168.10.5, 192.168.10.11
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
> filestore_xattr_use_omap = true
> public_network = 192.168.60.0/24
> cluster_network = 192.168.10.0/24
> osd pool default size = 3
> osd pool default min size = 1
> osd pool default pg num = 128
> osd pool default 

Re: [ceph-users] Total free space in addition to MAX AVAIL

2016-11-01 Thread Stillwell, Bryan J
On 11/1/16, 1:45 PM, "Sage Weil"  wrote:

>On Tue, 1 Nov 2016, Stillwell, Bryan J wrote:
>> I recently learned that 'MAX AVAIL' in the 'ceph df' output doesn't
>> represent what I thought it did.  It actually represents the amount of
>> data that can be used before the first OSD becomes full, and not the sum
>> of all free space across a set of OSDs.  This means that balancing the
>> data with 'ceph osd reweight' will actually increase the value of 'MAX
>> AVAIL'.
>> 
>> Knowing this I would like to graph both 'MAX AVAIL' and the total free
>> space across two different sets of OSDs so I can get an idea how out of
>> balance the cluster is.
>> 
>> This is where I'm running into trouble.  I have two different types of
>> Ceph nodes in my cluster.  One with all HDDs+SSD journals, and the other
>> with all SSDs using co-located journals.  There isn't any cache tiering
>> going on, so a pool either uses the all-HDD root, or the all-SSD root,
>>but
>> not both.
>> 
>> The only method I can think of to get this information is to walk the
>> CRUSH tree to figure out which OSDs are under a given root, and then use
>> the output of 'ceph osd df -f json' to sum up the free space of each
>>OSD.
>> Is there a better way?
>
>Try
>
>   ceph osd df tree -f json-pretty
>
>I think that'll give you all the right fields you need to sum.
>
>I wonder if this is something we should be reporting elsewhere, though?
>Summing up all free space is one thing.  Doing it per CRUSH hierarchy is
>something else.  Maybe the 'ceph osd df tree' output could have a field
>summing freespace for self + children in the json dump only...

That's just what I was looking for!  It also looks like the regular 'ceph
osd df tree' output has this information too:

# ceph osd df tree

ID WEIGHT  REWEIGHT SIZE   USEAVAIL  %USE  VAR  TYPE NAME
-8 0.52199-   521G 61835M   461G 11.57 0.57 root ceph-ssd
-5 0.17400-   173G 20615M   153G 11.58 0.57 host
dev-ceph-ssd-001
 9 0.05800  1.0 59361M  5374M 53987M  9.05 0.45 osd.9
10 0.05800  1.0 59361M  6837M 52524M 11.52 0.57 osd.10
11 0.05800  1.0 59361M  8404M 50957M 14.16 0.70 osd.11
-6 0.17400-   173G 20615M   153G 11.58 0.57 host
dev-ceph-ssd-002
12 0.05800  1.0 59361M  7165M 52196M 12.07 0.60 osd.12
13 0.05800  1.0 59361M  6762M 52599M 11.39 0.56 osd.13
14 0.05800  1.0 59361M  6688M 52673M 11.27 0.56 osd.14
-7 0.17400-   173G 20604M   153G 11.57 0.57 host
dev-ceph-ssd-003
15 0.05800  1.0 59361M  8189M 51172M 13.80 0.68 osd.15
16 0.05800  1.0 59361M  4835M 54526M  8.15 0.40 osd.16
17 0.05800  1.0 59361M  7579M 51782M 12.77 0.63 osd.17
-1 0.57596-   575G   161G   414G 27.97 1.39 root ceph-hdd
-2 0.19199-   191G 49990M   143G 25.44 1.26 host
dev-ceph-hdd-001
 0 0.06400  0.75000 65502M 15785M 49717M 24.10 1.19 osd.0
 1 0.06400  0.64999 65502M 17127M 48375M 26.15 1.30 osd.1
 2 0.06400  0.5 65502M 17077M 48425M 26.07 1.29 osd.2
-3 0.19199-   191G 63885M   129G 32.51 1.61 host
dev-ceph-hdd-002
 3 0.06400  1.0 65502M 28681M 36821M 43.79 2.17 osd.3
 4 0.06400  0.5 65502M 17246M 48256M 26.33 1.30 osd.4
 5 0.06400  0.84999 65502M 17958M 47544M 27.42 1.36 osd.5
-4 0.19199-   191G 51038M   142G 25.97 1.29 host
dev-ceph-hdd-003
 6 0.06400  0.64999 65502M 16617M 48885M 25.37 1.26 osd.6
 7 0.06400  0.7 65502M 16391M 49111M 25.02 1.24 osd.7
 8 0.06400  0.64999 65502M 18029M 47473M 27.52 1.36 osd.8
  TOTAL  1097G   221G   876G 20.18
MIN/MAX VAR: 0.40/2.17  STDDEV: 9.68



As you can tell I set the weights so that osd.3 would make the MAX AVAIL
difference more pronounced.  Also it appears like VAR is calculated on the
whole cluster instead of each root.

Thanks!
Bryan



[ceph-users] Uniquely identifying a Ceph client

2016-11-01 Thread Travis Rhoden
Hello,

Is there a consistent, reliable way to identify a Ceph client? I'm looking
for a string/ID (UUID, for example) that can be traced back to a client
doing RBD maps.

There are a couple of possibilities out there, but they aren't quite what
I'm looking for.  When checking "rbd status", for example, the output is
the following:

# rbd status travis2
Watchers:
watcher=172.21.12.10:0/1492902152 client.4100 cookie=1
# rbd status travis3
Watchers:
watcher=172.21.12.10:0/1492902152 client.4100 cookie=2


The IP:port/nonce string is an option, and so is the "client.<id>" string,
but neither of these is actually that helpful because they aren't the same
strings when an advisory lock is added to the RBD images. For example:

# rbd lock list travis2
There is 1 exclusive lock on this image.
Locker  ID Address
client.4201 test 172.21.12.100:0/967432549
# rbd lock list travis3
There is 1 exclusive lock on this image.
Locker  ID Address
client.4240 test 172.21.12.10:0/2888955091

Note that neither the nonce nor the client ID match -- so by looking at the
rbd lock output, you can't match that information against the output from
"rbd status". I believe this is because the nonce the client identifier is
reflecting the CephX session between client and cluster, and while this is
persistent across "rbd map" calls (because the rbd kmod has a shared
session by default, though that can be changed as well), each call to "rbd
lock" initiates a new session. Hence a new nonce and client ID.

That pretty much leaves the IP address. These would seem to be problematic
as an identifier if the client happened to be behind NAT.

I am trying to be able to definitely determine what client has an RBD
mapped and locked, but I'm not seeing a way to guarantee that you've
uniquely identified a client. Am I missing something obvious?

Perhaps my concern about NAT is overblown -- I've never mounted an RBD from
a client that is behind NAT, and I'm not sure how common that would be
(though I think it would work).

 - Travis


Re: [ceph-users] Need help! Ceph backfill_toofull and recovery_wait+degraded

2016-11-01 Thread David Turner
Your weights are very poorly managed.  If you have a 1TB drive, its weight
should be about 1; if you have an 8TB drive, its weight should be about 8.
You have 4TB drives with a weight of 3.64 (which is good), but the new node you
added with 4x 8TB drives has weights ranging from 3.2-5.  The weights on the
8TB drives are telling the cluster they don't want data, and the 4TB drives are
the recipients of that by being way too full.

Like Ronny said, you also have your nodes unbalanced.  You have 32TB in ceph4 
and 12TB between the other 3 nodes.  The best case for your data to settle 
right now (assuming the default settings of 3 replica size and HOST failure 
domain) is to have 1/3 of your data on ceph4 with 32TB of disks and 2/3 of your 
data split between ceph1, ceph2, & ceph3 with 12TB of disks.  Your cluster 
would have disks too full at about 5-6TB of actual data taking 16TB of raw 
space.

The easiest way to resolve this would probably be to move 2 osds from ceph4 
into 2 of the other hosts and to set the weight on all of the 8TB drives to 
7.45.  You can migrate osds between hosts without removing and adding them back 
in.
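As a hedged sketch of the commands involved (weights and the target host are
placeholders; CRUSH reweights will trigger a lot of data movement, so do them
gradually):

# give each 8TB OSD a CRUSH weight matching its size in TB
ceph osd crush reweight osd.3 7.45
ceph osd crush reweight osd.4 7.45
ceph osd crush reweight osd.5 7.45
ceph osd crush reweight osd.6 7.45

# after physically moving an OSD to another host, update its CRUSH location
# (hypothetical placement shown)
ceph osd crush set osd.5 7.45 root=default host=ceph1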

Can you please confirm what your replication size is and what your failure 
domain is for the cluster?



David Turner | Cloud Operations Engineer | StorageCraft Technology Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2760 | Mobile: 385.224.2943







From: ceph-users [ceph-users-boun...@lists.ceph.com] on behalf of Marcus Müller 
[mueller.mar...@posteo.de]
Sent: Tuesday, November 01, 2016 1:14 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Need help! Ceph backfill_toofull and 
recovery_wait+degraded

Hi all,

i have a big problem and i really hope someone can help me!

We are running a ceph cluster since a year now. Version is: 0.94.7 (Hammer)
Here is some info:

Our osd map is:

ID WEIGHT   TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 26.67998 root default
-2  3.64000 host ceph1
 0  3.64000 osd.0   up  1.0  1.0
-3  3.5 host ceph2
 1  3.5 osd.1   up  1.0  1.0
-4  3.64000 host ceph3
 2  3.64000 osd.2   up  1.0  1.0
-5 15.89998 host ceph4
 3  4.0 osd.3   up  1.0  1.0
 4  3.5 osd.4   up  1.0  1.0
 5  3.2 osd.5   up  1.0  1.0
 6  5.0 osd.6   up  1.0  1.0

ceph df:

GLOBAL:
SIZE   AVAIL  RAW USED %RAW USED
40972G 26821G   14151G 34.54
POOLS:
NAMEID USED  %USED MAX AVAIL OBJECTS
blocks  7  4490G 10.96 1237G 7037004
commits 8   473M 0 1237G  802353
fs  9  9666M  0.02 1237G 7863422

ceph osd df:

ID WEIGHT  REWEIGHT SIZE   USEAVAIL  %USE  VAR
 0 3.64000  1.0  3724G  3128G   595G 84.01 2.43
 1 3.5  1.0  3724G  3237G   487G 86.92 2.52
 2 3.64000  1.0  3724G  3180G   543G 85.41 2.47
 3 4.0  1.0  7450G  1616G  5833G 21.70 0.63
 4 3.5  1.0  7450G  1246G  6203G 16.74 0.48
 5 3.2  1.0  7450G  1181G  6268G 15.86 0.46
 6 5.0  1.0  7450G   560G  6889G  7.52 0.22
  TOTAL 40972G 14151G 26820G 34.54
MIN/MAX VAR: 0.22/2.52  STDDEV: 36.53


Our current cluster state is:

 health HEALTH_WARN
63 pgs backfill
8 pgs backfill_toofull
9 pgs backfilling
11 pgs degraded
1 pgs recovering
10 pgs recovery_wait
11 pgs stuck degraded
89 pgs stuck unclean
recovery 8237/52179437 objects degraded (0.016%)
recovery 9620295/52179437 objects misplaced (18.437%)
2 near full osd(s)
noout,noscrub,nodeep-scrub flag(s) set
 monmap e8: 4 mons at 
{ceph1=192.168.10.3:6789/0,ceph2=192.168.10.4:6789/0,ceph3=192.168.10.5:6789/0,ceph4=192.168.60.6:6789/0}
election epoch 400, quorum 0,1,2,3 ceph1,ceph2,ceph3,ceph4
 osdmap e1774: 7 osds: 7 up, 7 in; 84 remapped pgs
flags noout,noscrub,nodeep-scrub
  pgmap v7316159: 320 pgs, 3 pools, 4501 GB data, 15336 kobjects
14152 GB used, 26820 GB / 40972 GB avail
8237/52179437 objects degraded (0.016%)
9620295/52179437 objects misplaced (18.437%)
 231 active+clean
 

Re: [ceph-users] Need help! Ceph backfill_toofull and recovery_wait+degraded

2016-11-01 Thread Udo Lembke
Hi again,

and change the value with something like this

ceph tell osd.* injectargs '--mon_osd_full_ratio 0.96'

Udo

On 01.11.2016 21:16, Udo Lembke wrote:
> Hi Marcus,
>
> for a fast help you can perhaps increase the mon_osd_full_ratio?
>
> What values do you have?
> Please post the output of (on host ceph1, because osd.0.asok)
>
> ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep
> full_ratio
>
> after that it would be helpfull to use on all hosts 2 OSDs...
>
>
> Udo
>
>



Re: [ceph-users] Uniquely identifying a Ceph client

2016-11-01 Thread Travis Rhoden
On Tue, Nov 1, 2016 at 11:45 AM, Sage Weil  wrote:
> On Tue, 1 Nov 2016, Travis Rhoden wrote:
>> Hello,
>> Is there a consistent, reliable way to identify a Ceph client? I'm looking
>> for a string/ID (UUID, for example) that can be traced back to a client
>> doing RBD maps.
>>
>> There are a couple of possibilities out there, but they aren't quite what
>> I'm looking for.  When checking "rbd status", for example, the output is the
>> following:
>>
>> # rbd status travis2
>> Watchers:
>> watcher=172.21.12.10:0/1492902152 client.4100 cookie=1
>> # rbd status travis3
>> Watchers:
>> watcher=172.21.12.10:0/1492902152 client.4100 cookie=2
>>
>>
>> The IP:port/nonce string is an option, and so is the "client." string,
>> but neither of these is actually that helpful because they don't the same
>> strings when an advisory lock is added to the RBD images. For example:
>
> Both are sufficient.  The  in client. is the most concise and is
> unique per client instance.
>
> I think the problem you're seeing is actually that qemu is using two
> different librbd/librados instances, one for each mapped device?

Not using qemu in this scenario.  Just rbd map && rbd lock.  It's more
that I can't match the output from "rbd lock" against the output from
"rbd status", because they are using different librados instances.
I'm just trying to capture who has an image mapped and locked, and to
those not in the know, it would be a surprise that client.<id1> and
client.<id2> are actually the same host. :)

I understand why it is, I was checking to see if there was another
field or indicator that I should use instead. I think I'm just going
to have to use the IP address, because that's the value that will have
real meaning to people.

Thanks!

>
>> # rbd lock list travis2
>> There is 1 exclusive lock on this image.
>> Locker  ID Address
>> client.4201 test 172.21.12.100:0/967432549
>> # rbd lock list travis3
>> There is 1 exclusive lock on this image.
>> Locker  ID Address
>> client.4240 test 172.21.12.10:0/2888955091
>>
>> Note that neither the nonce nor the client ID match -- so by looking at the
>> rbd lock output, you can't match that information against the output from
>> "rbd status". I believe this is because the nonce the client identifier is
>> reflecting the CephX session between client and cluster, and while this is
>> persistent across "rbd map" calls (because the rbd kmod has a shared session
>> by default, though that can be changed as well), each call to "rbd lock"
>> initiates a new session. Hence a new nonce and client ID.
>>
>> That pretty much leaves the IP address. These would seem to be problematic
>> as an identifier if the client happened to behind NAT.
>>
>> I am trying to be able to definitely determine what client has an RBD mapped
>> and locked, but I'm not seeing a way to guarantee that you've uniquely
>> identified a client. Am I missing something obvious?
>>
>> Perhaps my concern about NAT is overblown -- I've never mounted an RBD from
>> a client that is behind NAT, and I'm not sure how common that would be
>> (though I think it would work).
>
> It should work, but it's untested.  :)
>
> sage


Re: [ceph-users] pg stuck with unfound objects on non exsisting osd's

2016-11-01 Thread Ronny Aasen

Thanks for the suggestion.

Is a rolling reboot sufficient, or must all OSDs be down at the same
time?

One is no problem; the other takes some scheduling.

Ronny Aasen


On 01.11.2016 21:52, c...@elchaka.de wrote:

Hello Ronny,

if it is possible for you, try to Reboot all OSD Nodes.

I had this issue on my test Cluster and it become healthy after rebooting.

Hth
- Mehmet

On 1 November 2016 at 19:55:07 CET, Ronny Aasen wrote:


Hello.

I have a cluster stuck with 2 pg's stuck undersized degraded, with 25
unfound objects.

# ceph health detail
HEALTH_WARN 2 pgs degraded; 2 pgs recovering; 2 pgs stuck degraded; 2 pgs 
stuck unclean; 2 pgs stuck undersized; 2 pgs undersized; recovery 
294599/149522370 objects degraded (0.197%); recovery 640073/149522370 objects 
misplaced (0.428%); recovery 25/46579241 unfound (0.000%); noout flag(s) set
pg 6.d4 is stuck unclean for 8893374.380079, current state 
active+recovering+undersized+degraded+remapped, last acting [62]
pg 6.ab is stuck unclean for 8896787.249470, current state 
active+recovering+undersized+degraded+remapped, last acting [18,12]
pg 6.d4 is stuck undersized for 438122.427341, current state 
active+recovering+undersized+degraded+remapped, last acting [62]
pg 6.ab is stuck undersized for 416947.461950, current state 
active+recovering+undersized+degraded+remapped, last acting [18,12]
pg 6.d4 is stuck degraded for 438122.427402, current state
active+recovering+undersized+degraded+remapped, last acting [62]
pg 6.ab is stuck degraded for 416947.462010, current state
active+recovering+undersized+degraded+remapped, last acting [18,12]
pg 6.d4 is active+recovering+undersized+degraded+remapped, acting [62], 25 unfound
pg 6.ab is active+recovering+undersized+degraded+remapped, acting [18,12]
recovery 294599/149522370 objects degraded (0.197%)
recovery 640073/149522370 objects misplaced (0.428%)
recovery 25/46579241 unfound (0.000%)
noout flag(s) set

have been following the troubleshooting guide at
http://docs.ceph.com/docs/hammer/rados/troubleshooting/troubleshooting-pg/
but gets stuck without a resolution.

luckily it is not critical data. so i wanted to mark the pg lost so it
could become health-ok

# ceph pg 6.d4 mark_unfound_lost delete
Error EINVAL: pg has 25 unfound objects but we haven't probed all
sources, not marking lost

querying the pg i see that it would want osd.80 and osd 36

{
    "osd": "80",
    "status": "osd is down"
},

trying to mark the osd's lost does not work either. since the osd's was
removed from the cluster a long time ago.

# ceph osd lost 80 --yes-i-really-mean-it
osd.80 is not down or doesn't exist

# ceph osd lost 36 --yes-i-really-mean-it
osd.36 is not down or doesn't exist

and this is where i am stuck.

have tried stopping and starting the 3 osd's but that did not have any
effect.

Anyone have any advice how to proceed ?

full output at: http://paste.debian.net/hidden/be03a185/

this is hammer 0.94.9 on debian 8.

kind regards
Ronny Aasen




Re: [ceph-users] Hammer Cache Tiering

2016-11-01 Thread Ashley Merrick
Hello,

Thanks for your reply; when you say the latest version, do you mean .6 and not .5?

The use case is large-scale storage VMs, which may see a burst of heavy writes
while new storage is being loaded onto the environment. Looking to place the
SSD cache tier in front, currently with a replica of 3 and a usable size of
1.5TB.

Looking to run in read-forward mode, so reads will come directly from the OSD
layer, where there is no issue with current read performance; however, any
large writes will first go to the SSDs and then be flushed to the OSDs later,
once the SSD cache hits, for example, 60%.

So the use case is not so much to store hot DB data that will stay there, but
to act as a temporary sponge for short bursts of heavy writes.
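For reference, a rough sketch of the kind of tier setup described above (pool
names, sizes and thresholds are placeholders, not a tested recommendation):

# hypothetical pools: "rbd" as the backing pool, "rbd-cache" as the SSD pool
ceph osd tier add rbd rbd-cache
ceph osd tier cache-mode rbd-cache readforward   # newer releases may ask for --yes-i-really-mean-it
ceph osd tier set-overlay rbd rbd-cache

# hit-set tracking plus size/dirty thresholds so flushing starts around 60%
ceph osd pool set rbd-cache hit_set_type bloom
ceph osd pool set rbd-cache target_max_bytes 1500000000000   # ~1.5 TB usable
ceph osd pool set rbd-cache cache_target_dirty_ratio 0.6
ceph osd pool set rbd-cache cache_target_full_ratio 0.8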

,Ashley

-Original Message-
From: Christian Balzer [mailto:ch...@gol.com] 
Sent: Wednesday, 2 November 2016 11:48 AM
To: ceph-us...@ceph.com
Cc: Ashley Merrick 
Subject: Re: [ceph-users] Hammer Cache Tiering


Hello,

On Tue, 1 Nov 2016 15:07:33 + Ashley Merrick wrote:

> Hello,
> 
> Currently using a Proxmox & CEPH cluster, currently they are running on 
> Hammer looking to update to Jewel shortly, I know I can do a manual upgrade 
> however would like to keep what is tested well with Proxmox.
> 
> Looking to put a SSD Cache tier in front, however have seen and read there 
> has been a few bug's with Cache Tiering causing corruption, from what I read 
> all fixed on Jewel however not 100% if they have been pushed to Hammer (even 
> though is still not EOL for a little while).
>
You will want to read at LEAST the last two threads about "cache tier" in this 
ML, more if you can.

> Is anyone running Cache Tiering on Hammer in production and had no issues, or 
> is anyone aware of any bugs' / issues that means I should hold off till I 
> upgrade to Jewel, or any reason basically to hold off for a month or so to 
> update to Jewel before enabling a cache tier.
> 
The latest Hammer should be fine, 0.94.5 has been working for me a long time, 
0.94.6 is DEFINITELY to be avoided at all costs.

A cache tier is a complex beast. 
Does it fit your needs/use patterns, can you afford to make it large enough to 
actually fit all your hot data in it?

Jewel has more control knobs to help you, so unless you are 100% sure that you 
know what you're doing or have a cache pool in mind that's as large as your 
current used data, waiting for Jewel might be a better proposition.

Of course the lack of any official response to the last relevant thread here 
about the future of cache tiering makes adding/designing a cache tier an 
additional challenge...


Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Uniquely identifying a Ceph client

2016-11-01 Thread Jason Dillaman
> Not using qemu in this scenario.  Just rbd map && rbd lock.  It's more
> that I can't match the output from "rbd lock" against the output from
> "rbd status", because they are using different librados instances.
> I'm just trying to capture who has an image mapped and locked, and to
> those not in the know, it would be a surprise that client. and
> client. are actually the same host. :)

Yeah, the reason is that the lock was acquired by a transient
client (your CLI invocation of 'rbd lock') and not by krbd, so they
really are two different clients.

> I understand why it is, I was checking to see if there was another
> field or indicator that I should use instead. I think I'm just going
> to have to use the IP address, because that's the value that will have
> real meaning to people.

Assuming your mapping script is always adding the advisory lock from the
same host it is mapping the image on, I think the IP address is the
best you will be able to do.
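
For example (the image name below is just a placeholder), the address fields
are the ones worth capturing:

rbd lock list rbd/myimage    # the locker shows up as client.<id> together with its address
rbd status rbd/myimage       # watchers show up as ip:port/nonce

Matching on the IP portion of both outputs is what ties the lock back to the
host that has the image mapped.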

-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Total free space in addition to MAX AVAIL

2016-11-01 Thread Stillwell, Bryan J
I recently learned that 'MAX AVAIL' in the 'ceph df' output doesn't
represent what I thought it did.  It actually represents the amount of
data that can be used before the first OSD becomes full, and not the sum
of all free space across a set of OSDs.  This means that balancing the
data with 'ceph osd reweight' will actually increase the value of 'MAX
AVAIL'.

Knowing this I would like to graph both 'MAX AVAIL' and the total free
space across two different sets of OSDs so I can get an idea how out of
balance the cluster is.

This is where I'm running into trouble.  I have two different types of
Ceph nodes in my cluster.  One with all HDDs+SSD journals, and the other
with all SSDs using co-located journals.  There isn't any cache tiering
going on, so a pool either uses the all-HDD root, or the all-SSD root, but
not both.

The only method I can think of to get this information is to walk the
CRUSH tree to figure out which OSDs are under a given root, and then use
the output of 'ceph osd df -f json' to sum up the free space of each OSD.
Is there a better way?
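
For reference, the brute-force version of that is fairly short if jq is
available; the OSD ids below are placeholders that would come from
'ceph osd tree' for the root in question:

ceph osd df -f json | \
    jq '[.nodes[] | select(.id == 0 or .id == 1 or .id == 2) | .kb_avail] | add'

That prints the summed free space (in KB) for just those OSDs; repeating it
with the other root's ids gives the SSD side.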

Thanks,
Bryan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Need help! Ceph backfill_toofull and recovery_wait+degraded

2016-11-01 Thread Udo Lembke
Hi Marcus,

for a quick fix you can perhaps increase the mon_osd_full_ratio?

What values do you have?
Please post the output of (on host ceph1, because osd.0.asok)

ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep full_ratio
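
If that still shows the defaults, note that the backfill_toofull state is
gated by osd_backfill_full_ratio (0.85 by default) rather than by the full
ratio itself; a temporary bump, to be reverted once the backfill has
finished, would be a sketch like:

ceph tell osd.\* injectargs '--osd_backfill_full_ratio 0.92'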

After that it would be helpful to have 2 OSDs on all hosts...


Udo


On 01.11.2016 20:14, Marcus Müller wrote:
> Hi all,
>
> i have a big problem and i really hope someone can help me!
>
> We are running a ceph cluster since a year now. Version is: 0.94.7
> (Hammer)
> Here is some info:
>
> Our osd map is:
>
> ID WEIGHT   TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY 
> -1 26.67998 root default 
> -2  3.64000 host ceph1   
>  0  3.64000 osd.0   up  1.0  1.0 
> -3  3.5 host ceph2   
>  1  3.5 osd.1   up  1.0  1.0 
> -4  3.64000 host ceph3   
>  2  3.64000 osd.2   up  1.0  1.0 
> -5 15.89998 host ceph4   
>  3  4.0 osd.3   up  1.0  1.0 
>  4  3.5 osd.4   up  1.0  1.0 
>  5  3.2 osd.5   up  1.0  1.0 
>  6  5.0 osd.6   up  1.0  1.0 
>
> ceph df:
>
> GLOBAL:
> SIZE   AVAIL  RAW USED %RAW USED 
> 40972G 26821G   14151G 34.54 
> POOLS:
> NAMEID USED  %USED MAX AVAIL OBJECTS 
> blocks  7  4490G 10.96 1237G 7037004 
> commits 8   473M 0 1237G  802353 
> fs  9  9666M  0.02 1237G 7863422 
>
> ceph osd df:
>
> ID WEIGHT  REWEIGHT SIZE   USEAVAIL  %USE  VAR  
>  0 3.64000  1.0  3724G  3128G   595G 84.01 2.43 
>  1 3.5  1.0  3724G  3237G   487G 86.92 2.52 
>  2 3.64000  1.0  3724G  3180G   543G 85.41 2.47 
>  3 4.0  1.0  7450G  1616G  5833G 21.70 0.63 
>  4 3.5  1.0  7450G  1246G  6203G 16.74 0.48 
>  5 3.2  1.0  7450G  1181G  6268G 15.86 0.46 
>  6 5.0  1.0  7450G   560G  6889G  7.52 0.22 
>   TOTAL 40972G 14151G 26820G 34.54  
> MIN/MAX VAR: 0.22/2.52  STDDEV: 36.53
>
>
> Our current cluster state is: 
>
>  health HEALTH_WARN
> 63 pgs backfill
> 8 pgs backfill_toofull
> 9 pgs backfilling
> 11 pgs degraded
> 1 pgs recovering
> 10 pgs recovery_wait
> 11 pgs stuck degraded
> 89 pgs stuck unclean
> recovery 8237/52179437 objects degraded (0.016%)
> recovery 9620295/52179437 objects misplaced (18.437%)
> 2 near full osd(s)
> noout,noscrub,nodeep-scrub flag(s) set
>  monmap e8: 4 mons at
> {ceph1=192.168.10.3:6789/0,ceph2=192.168.10.4:6789/0,ceph3=192.168.10.5:6789/0,ceph4=192.168.60.6:6789/0}
> election epoch 400, quorum 0,1,2,3 ceph1,ceph2,ceph3,ceph4
>  osdmap e1774: 7 osds: 7 up, 7 in; 84 remapped pgs
> flags noout,noscrub,nodeep-scrub
>   pgmap v7316159: 320 pgs, 3 pools, 4501 GB data, 15336 kobjects
> 14152 GB used, 26820 GB / 40972 GB avail
> 8237/52179437 objects degraded (0.016%)
> 9620295/52179437 objects misplaced (18.437%)
>  231 active+clean
>   61 active+remapped+wait_backfill
>9 active+remapped+backfilling
>6 active+recovery_wait+degraded+remapped
>6 active+remapped+backfill_toofull
>4 active+recovery_wait+degraded
>2 active+remapped+wait_backfill+backfill_toofull
>1 active+recovering+degraded
> recovery io 11754 kB/s, 35 objects/s
>   client io 1748 kB/s rd, 249 kB/s wr, 44 op/s
>
>
> My main problems are: 
>
> - As you can see from the osd tree, we have three separate hosts with
> only one osd each. Another one has four osds. Ceph will not move the data
> off these three nodes with only one HDD each, which are all nearly full.
> I tried to set the weight of the osds in the bigger node higher, but
> that just does not work. So I added a new osd yesterday, which did not
> make things better, as you can see now. What do I have to do to get
> these three nodes emptier again and put more data on the other node
> with the four HDDs?
>
> - I added the „ceph4“ node later, this resulted in a strange ip change
> as you can see in the mon list. The public network and the cluster
> network were swapped or not assigned right. See ceph.conf
>
> [global]
> fsid = xxx
> mon_initial_members = ceph1
> mon_host = 192.168.10.3, 192.168.10.4, 192.168.10.5, 192.168.10.11
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
> 

Re: [ceph-users] Hammer Cache Tiering

2016-11-01 Thread Christian Balzer

Hello,

On Tue, 1 Nov 2016 15:07:33 + Ashley Merrick wrote:

> Hello,
> 
> Currently using a Proxmox & CEPH cluster, currently they are running on 
> Hammer looking to update to Jewel shortly, I know I can do a manual upgrade 
> however would like to keep what is tested well with Proxmox.
> 
> Looking to put a SSD Cache tier in front, however have seen and read there 
> has been a few bug's with Cache Tiering causing corruption, from what I read 
> all fixed on Jewel however not 100% if they have been pushed to Hammer (even 
> though is still not EOL for a little while).
>
You will want to read at LEAST the last two threads about "cache tier" in
this ML, more if you can.

> Is anyone running Cache Tiering on Hammer in production and had no issues, or 
> is anyone aware of any bugs' / issues that means I should hold off till I 
> upgrade to Jewel, or any reason basically to hold off for a month or so to 
> update to Jewel before enabling a cache tier.
> 
The latest Hammer should be fine, 0.94.5 has been working for me a
long time, 0.94.6 is DEFINITELY to be avoided at all costs.

A cache tier is a complex beast. 
Does it fit your needs/use patterns, can you afford to make it large
enough to actually fit all your hot data in it?

Jewel has more control knobs to help you, so unless you are 100% sure that
you know what you're doing or have a cache pool in mind that's as large as
your current used data, waiting for Jewel might be a better proposition.
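
(One of those newer knobs, for example, is cache_target_dirty_high_ratio,
which adds a second, more aggressive flush threshold on top of
cache_target_dirty_ratio; as a sketch, with a placeholder pool name:

ceph osd pool set cache cache_target_dirty_high_ratio 0.75

As far as I know hammer has no equivalent.)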

Of course the lack of any official response to the last relevant thread
here about the future of cache tiering makes adding/designing a cache tier
an additional challenge...


Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [EXTERNAL] Re: pg stuck with unfound objects on non exsisting osd's

2016-11-01 Thread Will . Boege
Start with a rolling restart of just the OSDs one system at a time, checking 
the status after each restart.
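
Something along these lines on each OSD node in turn would do it; this is only
a sketch for hammer with the sysvinit script on debian 8, with the ids taken
from the local OSD data directories:

for id in $(ls /var/lib/ceph/osd | sed 's/^ceph-//'); do
    /etc/init.d/ceph restart osd.$id    # or the systemd equivalent on newer setups
    sleep 30
    ceph -s                             # check that the PGs are back to active before continuing
done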

On Nov 1, 2016, at 6:20 PM, Ronny Aasen wrote:

thanks for the suggestion.

is a rolling reboot sufficient? or must all osd's be down at the same time ?
one is no problem.  the other takes some scheduling..

Ronny Aasen


On 01.11.2016 21:52, c...@elchaka.de wrote:
Hello Ronny,

if it is possible for you, try to Reboot all OSD Nodes.

I had this issue on my test Cluster and it become healthy after rebooting.

Hth
- Mehmet

On 1 November 2016 at 19:55:07 CET, Ronny Aasen wrote:

Hello.

I have a cluster stuck with 2 pg's stuck undersized degraded, with 25
unfound objects.

# ceph health detail
HEALTH_WARN 2 pgs degraded; 2 pgs recovering; 2 pgs stuck degraded; 2 pgs stuck 
unclean; 2 pgs stuck undersized; 2 pgs undersized; recovery 294599/149522370 
objects degraded (0.197%); recovery 640073/149522370 objects misplaced 
(0.428%); recovery 25/46579241 unfound (0.000%); noout flag(s) set
pg 6.d4 is stuck unclean for 8893374.380079, current state 
active+recovering+undersized+degraded+remapped, last acting [62]
pg 6.ab is stuck unclean for 8896787.249470, current state 
active+recovering+undersized+degraded+remapped, last acting [18,12]
pg 6.d4 is stuck undersized for 438122.427341, current state 
active+recovering+undersized+degraded+remapped, last acting [62]
pg 6.ab is stuck undersized for 416947.461950, current state 
active+recovering+undersized+degraded+remapped, last acting [18,12]
pg 6.d4 is stuck degraded for 438122.427402, current state 
active+recovering+undersized+degraded+remapped, last acting [62]
pg 6.ab is stuck degraded for 416947.462010, current state 
active+recovering+undersized+degraded+remapped, last acting [18,12]
pg 6.d4 is active+recovering+undersized+degraded+remapped, acting [62], 25 
unfound
pg 6.ab is active+recovering+undersized+degraded+remapped, acting [18,12]
recovery 294599/149522370 objects degraded (0.197%)
recovery 640073/149522370 objects misplaced (0.428%)
recovery 25/46579241 unfound (0.000%)
noout flag(s) set


have been following the troubleshooting guide at
http://docs.ceph.com/docs/hammer/rados/troubleshooting/troubleshooting-pg/
but gets stuck without a resolution.

luckily it is not critical data. so i wanted to mark the pg lost so it
could become health-ok

# ceph pg 6.d4 mark_unfound_lost delete
Error EINVAL: pg has 25 unfound objects but we haven't probed all
sources, not marking lost

querying the pg i see that it would want osd.80 and osd 36

  {
 "osd": "80",
 "status": "osd is down"
 },

trying to mark the osd's lost does not work either. since the osd's was
removed from the cluster a long time ago.

# ceph osd lost 80 --yes-i-really-mean-it
osd.80 is not down or doesn't exist

# ceph osd lost 36 --yes-i-really-mean-it
osd.36 is not down or doesn't exist


and this is where i am stuck.

have tried stopping and starting the 3 osd's but that did not have any
effect.

Anyone have any advice how to proceed ?

full output at:  http://paste.debian.net/hidden/be03a185/

this is hammer 0.94.9  on debian 8.


kind regards

Ronny Aasen






ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Monitor troubles

2016-11-01 Thread Tracy Reed
I initially setup my ceph cluster on CentOS 7 with just one monitor. The
monitor runs on an osd server (not ideal, will change soon).  I've
tested it quite a lot over the last couple of months and things have
gone well. I knew I needed to add a couple more monitors so I did the
following:

ceph-deploy mon create ceph02

And then the cluster hung. I did some googling and found some things
which said I need to add a public network etc. I did so and restarted
the mons. No luck. I also added them to mon_initial_members and
mon_host. My current ceph.conf looks like this:

[global]
osd pool default size = 2
fsid = e2e43abc-e634-4a04-ae24-0c486a035b6e
mon_initial_members = ceph01,ceph02
mon_host = 10.0.5.2,10.0.5.3
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
# All mons/osds are on 10.0.5.0 but deploy-server is on 10.0.10.0. I
# expect this second subnet is unnecessary to list here but thought it
# couldn't hurt. None of the mons/osds have a 10.0.10.0 interface so
# there can't be confusion, right?
public_network = 10.0.5.0/24,10.0.10.0/24 

[client]
rbd default features = 1


I then discovered and started following:
http://docs.ceph.com/docs/jewel/rados/troubleshooting/troubleshooting-mon/

Are the monitors running? Yes

Are you able to connect to the monitor’s servers? Yes

Does ceph -s run and obtain a reply from the cluster? No

What if ceph -s doesn’t finish? It says try "ceph ping mon.ID"

[ceph-deploy@ceph-deploy my-cluster]$ ceph ping mon.ceph01
Error connecting to cluster: ObjectNotFound

Then it suggests trying the monitor admin socket. This works:

[root@ceph01 ~]# ceph daemon mon.ceph01 mon_status  

   
{
"name": "ceph01",
"rank": 0,
"state": "probing",
"election_epoch": 0,
"quorum": [],
"outside_quorum": [
"ceph01"
],
"extra_probe_peers": [],
"sync_provider": [],
"monmap": {
"epoch": 2,
"fsid": "3e84db5d-3dc8-4104-89e7-da23c103ef50",
"modified": "2016-11-01 19:55:28.083057",
"created": "2016-09-05 01:22:09.228315",
"mons": [
{
"rank": 0,
"name": "ceph01",
"addr": "10.0.5.2:6789\/0"
},
{
"rank": 1,
"name": "ceph02",
"addr": "10.0.5.3:6789\/0"
}
]
}
}


[root@ceph02 ~]# ceph daemon mon.ceph02 mon_status
{
"name": "ceph02",
"rank": 0,
"state": "probing",
"election_epoch": 0,
"quorum": [],
"outside_quorum": [
"ceph02"
],
"extra_probe_peers": [
"10.0.5.2:6789\/0"
],
"sync_provider": [],
"monmap": {
"epoch": 0,
"fsid": "e2e43abc-e634-4a04-ae24-0c486a035b6e",
"modified": "2016-11-01 19:33:06.242314",
"created": "2016-11-01 19:33:06.242314",
"mons": [
{
"rank": 0,
"name": "ceph02",
"addr": "10.0.5.3:6789\/0"
},
{
"rank": 1,
"name": "ceph01",
"addr": "0.0.0.0:0\/1"
}
]
}
}

So they are both in probing state, they each say they are
outside_quorum, and ceph02 shows addr 0.0.0.0 for ceph01. I tried
telling ceph02 the address of ceph01 using "ceph daemon mon.ceph02
add_bootstrap_peer_hint 10.0.5.2" which is why it appears in
extra_probe_peers. It does not seem to have helped. I notice the fsids
are different in the mon_status output. No idea why. The proper cluster
fsid is e2e43abc-e634-4a04-ae24-0c486a035b6e. Could this be what is
messing things up? ceph01 is the original monitor. What's weird, though,
is that the correct fsid appears in the deployment log from weeks ago,
when I first set up the cluster:

[2016-10-05 14:48:51,811][ceph01][INFO  ] Running command: sudo systemctl 
enable ceph.target
[2016-10-05 14:48:51,946][ceph01][INFO  ] Running command: sudo systemctl 
enable ceph-mon@ceph01
[2016-10-05 14:48:52,073][ceph01][INFO  ] Running command: sudo systemctl start 
ceph-mon@ceph01
[2016-10-05 14:48:54,104][ceph01][INFO  ] Running command: sudo ceph 
--cluster=ceph --admin-daemon /var/run/ceph/ceph-mon.ceph01.asok mon_status
[2016-10-05 14:48:54,272][ceph01][DEBUG ] 

[2016-10-05 14:48:54,273][ceph01][DEBUG ] status for monitor: mon.ceph01
[2016-10-05 14:48:54,274][ceph01][DEBUG ] {
[2016-10-05 14:48:54,275][ceph01][DEBUG ]   "election_epoch": 5, 
[2016-10-05 14:48:54,275][ceph01][DEBUG ]   "extra_probe_peers": [], 
[2016-10-05 14:48:54,275][ceph01][DEBUG ]   "monmap": {
[2016-10-05 14:48:54,276][ceph01][DEBUG ] "created": "2016-09-05 
01:22:09.228315", 
[2016-10-05 14:48:54,276][ceph01][DEBUG ] "epoch": 1, 
[2016-10-05 14:48:54,276][ceph01][DEBUG ] "fsid": 
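
A quick way to compare what each mon actually has on disk is to dump its
monmap; as a sketch, stopping and restarting the mon in question around it:

systemctl stop ceph-mon@ceph01
ceph-mon -i ceph01 --extract-monmap /tmp/monmap
monmaptool --print /tmp/monmap    # shows the fsid and mon addresses the daemon is really using
systemctl start ceph-mon@ceph01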

Re: [ceph-users] Monitor troubles

2016-11-01 Thread Tracy Reed
On Tue, Nov 01, 2016 at 09:36:16PM PDT, Tracy Reed spake thusly:
> I initially setup my ceph cluster on CentOS 7 with just one monitor. The
> monitor runs on an osd server (not ideal, will change soon).  I've

Sorry, forgot to add that I'm running the following ceph version from
the ceph repo:

# rpm -qa|grep ceph
libcephfs1-10.2.3-0.el7.x86_64
ceph-release-1-1.el7.noarch
ceph-mds-10.2.3-0.el7.x86_64
ceph-radosgw-10.2.3-0.el7.x86_64
python-cephfs-10.2.3-0.el7.x86_64
ceph-common-10.2.3-0.el7.x86_64
ceph-selinux-10.2.3-0.el7.x86_64
ceph-mon-10.2.3-0.el7.x86_64
ceph-10.2.3-0.el7.x86_64
ceph-base-10.2.3-0.el7.x86_64
ceph-osd-10.2.3-0.el7.x86_64


-- 
Tracy Reed


signature.asc
Description: PGP signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Hammer Cache Tiering

2016-11-01 Thread Christian Wuerdig
On Wed, Nov 2, 2016 at 5:19 PM, Ashley Merrick 
wrote:

> Hello,
>
> Thanks for your reply, when you say latest's version do you .6 and not .5?
>
> The use case is large scale storage VM's, which may have a burst of high
> write's during new storage being loaded onto the environment, looking to
> place the SSD Cache in front currently with a replica of 3 and useable size
> of 1.5TB.
>
> Looking to run in Read-forward Mode, so reads will come direct from the
> OSD layer where there is no issue with current read performance, however
> any large write's will first go to the SSD and then at a later date flushed
> to the OSD's as the SSD cache hits for example 60%.
>
> So the use case is not as such to store hot DB data that will stay there,
> but to act as a temp sponge for high but short writes in bursts.
>

This is precisely what the journals are for. From what I've seen and read
on this list so far I'd say you will be way better off putting your journals
on SSDs in the OSD nodes than trying to set up a cache tier. In general,
using a cache as a write buffer sounds the wrong way round to me; typically
you want a cache for fast read access (i.e. serving very frequently read
data as fast as possible).


>
> ,Ashley
>
> -Original Message-
> From: Christian Balzer [mailto:ch...@gol.com]
> Sent: Wednesday, 2 November 2016 11:48 AM
> To: ceph-us...@ceph.com
> Cc: Ashley Merrick 
> Subject: Re: [ceph-users] Hammer Cache Tiering
>
>
> Hello,
>
> On Tue, 1 Nov 2016 15:07:33 + Ashley Merrick wrote:
>
> > Hello,
> >
> > Currently using a Proxmox & CEPH cluster, currently they are running on
> Hammer looking to update to Jewel shortly, I know I can do a manual upgrade
> however would like to keep what is tested well with Proxmox.
> >
> > Looking to put a SSD Cache tier in front, however have seen and read
> there has been a few bug's with Cache Tiering causing corruption, from what
> I read all fixed on Jewel however not 100% if they have been pushed to
> Hammer (even though is still not EOL for a little while).
> >
> You will want to read at LEAST the last two threads about "cache tier" in
> this ML, more if you can.
>
> > Is anyone running Cache Tiering on Hammer in production and had no
> issues, or is anyone aware of any bugs' / issues that means I should hold
> off till I upgrade to Jewel, or any reason basically to hold off for a
> month or so to update to Jewel before enabling a cache tier.
> >
> The latest Hammer should be fine, 0.94.5 has been working for me a long
> time, 0.94.6 is DEFINITELY to be avoided at all costs.
>
> A cache tier is a complex beast.
> Does it fit your needs/use patterns, can you afford to make it large
> enough to actually fit all your hot data in it?
>
> Jewel has more control knobs to help you, so unless you are 100% sure that
> you know what you're doing or have a cache pool in mind that's as large as
> your current used data, waiting for Jewel might be a better proposition.
>
> Of course the lack of any official response to the last relevant thread
> here about the future of cache tiering makes adding/designing a cache tier
> an additional challenge...
>
>
> Christian
> --
> Christian Balzer        Network/Systems Engineer
> ch...@gol.com   Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Hammer Cache Tiering

2016-11-01 Thread Ashley Merrick
Hello,

I already have journals on SSDs, but a journal is only designed for very short 
write bursts and is not going to help when someone is writing, for example, a 
100 GB backup file. In my eyes an SSD tier set to cache writes only will allow 
that 100 GB write to complete much quicker and with higher IOPS, leaving the 
backend to serve any read requests; any operations on the recently written file 
will then be completed at the cache layer, and the data gets pushed down to the 
OSDs at a later time once it is no longer needed in the cache (colder storage).

It’s something I am looking to test and see if there is decent performance or 
not before I decide to keep or not, it was mainly to check there was no issues 
in the hammer release which could lead to corruption at FS level, which I have 
sent in a few old ML emails.

Thanks,
,Ashley

From: Christian Wuerdig [mailto:christian.wuer...@gmail.com]
Sent: Wednesday, 2 November 2016 12:57 PM
To: Ashley Merrick 
Cc: Christian Balzer ; ceph-us...@ceph.com
Subject: Re: [ceph-users] Hammer Cache Tiering



On Wed, Nov 2, 2016 at 5:19 PM, Ashley Merrick 
> wrote:
Hello,

Thanks for your reply, when you say latest's version do you .6 and not .5?

The use case is large scale storage VM's, which may have a burst of high 
write's during new storage being loaded onto the environment, looking to place 
the SSD Cache in front currently with a replica of 3 and useable size of 1.5TB.

Looking to run in Read-forward Mode, so reads will come direct from the OSD 
layer where there is no issue with current read performance, however any large 
write's will first go to the SSD and then at a later date flushed to the OSD's 
as the SSD cache hits for example 60%.

So the use case is not as such to store hot DB data that will stay there, but 
to act as a temp sponge for high but short writes in bursts.

This is precisely what the journals are for. From what I've seen and read on 
this list so far I'd say you will be way better of putting your journals on 
SSDs in the OSD nodes than to try setting up a cache tier. In general using a 
cache for write buffer to me at least sounds the wrong way round - typically 
you want a cache for fast read access (i.e. serving very frequently read data 
as fast as possible).


,Ashley

-Original Message-
From: Christian Balzer [mailto:ch...@gol.com]
Sent: Wednesday, 2 November 2016 11:48 AM
To: ceph-us...@ceph.com
Cc: Ashley Merrick >
Subject: Re: [ceph-users] Hammer Cache Tiering


Hello,

On Tue, 1 Nov 2016 15:07:33 + Ashley Merrick wrote:

> Hello,
>
> Currently using a Proxmox & CEPH cluster, currently they are running on 
> Hammer looking to update to Jewel shortly, I know I can do a manual upgrade 
> however would like to keep what is tested well with Proxmox.
>
> Looking to put a SSD Cache tier in front, however have seen and read there 
> has been a few bug's with Cache Tiering causing corruption, from what I read 
> all fixed on Jewel however not 100% if they have been pushed to Hammer (even 
> though is still not EOL for a little while).
>
You will want to read at LEAST the last two threads about "cache tier" in this 
ML, more if you can.

> Is anyone running Cache Tiering on Hammer in production and had no issues, or 
> is anyone aware of any bugs' / issues that means I should hold off till I 
> upgrade to Jewel, or any reason basically to hold off for a month or so to 
> update to Jewel before enabling a cache tier.
>
The latest Hammer should be fine, 0.94.5 has been working for me a long time, 
0.94.6 is DEFINITELY to be avoided at all costs.

A cache tier is a complex beast.
Does it fit your needs/use patterns, can you afford to make it large enough to 
actually fit all your hot data in it?

Jewel has more control knobs to help you, so unless you are 100% sure that you 
know what you're doing or have a cache pool in mind that's as large as your 
current used data, waiting for Jewel might be a better proposition.

Of course the lack of any official response to the last relevant thread here 
about the future of cache tiering makes adding/designing a cache tier an 
additional challenge...


Christian
--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com