[ceph-users] (no subject)
Thanks for the information. -Sreenath

- Date: Wed, 25 Mar 2015 04:11:11 +0100 From: Francois Lafont flafdiv...@free.fr To: ceph-users ceph-us...@ceph.com Subject: Re: [ceph-users] PG calculator queries Message-ID: 5512274f.1000...@free.fr Content-Type: text/plain; charset=utf-8

Hi,

Sreenath BH wrote: consider the following values for a pool: Size = 3, OSDs = 400, %Data = 100, Target PGs per OSD = 200 (this is the default). The PG calculator generates the number of PGs for this pool as 32768. Questions: 1. The Ceph documentation recommends around 100 PGs/OSD, whereas the calculator takes 200 as the default value. Are there any changes in the recommended value of PGs/OSD?

Not really, I think. Here http://ceph.com/pgcalc/, we can read: Target PGs per OSD - this value should be populated based on the following guidance: - 100 if the cluster OSD count is not expected to increase in the foreseeable future. - 200 if the cluster OSD count is expected to increase (up to double the size) in the foreseeable future. - 300 if the cluster OSD count is expected to increase between 2x and 3x in the foreseeable future. So it seems cautious to me to recommend 100 in the official documentation, because you can increase pg_num but it's impossible to decrease it. So, if I had to recommend just one value, it would be 100.

2. Under notes it says: Total PG Count below table will be the count of Primary PG copies. However, when calculating total PGs per OSD average, you must include all copies. However, the number of 200 PGs/OSD already seems to include the primary as well as replica PGs on an OSD. Is the note a typo or am I missing something?

To my mind, on the site, the Total PG Count doesn't include all copies. So, for me, there is no typo. Here are 2 basic examples from http://ceph.com/pgcalc/ with just *one* pool.

1. Pool-Name: rbd, Size: 2, OSD#: 10, %Data: 100.00%, Target PGs per OSD: 100, Suggested PG count: 512
2. Pool-Name: rbd, Size: 2, OSD#: 10, %Data: 100.00%, Target PGs per OSD: 200, Suggested PG count: 1024

In the first example, I have: 512/10 = 51.2 but (Size x 512)/10 = 102.4. In the second example, I have: 1024/10 = 102.4 but (Size x 1024)/10 = 204.8.

HTH. -- François Lafont ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
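A minimal sketch of the per-OSD arithmetic walked through above, using the values from the first example (they are illustrations, not recommendations); the point is simply that counting all copies multiplies the per-OSD figure by the pool size.

# PGs per OSD, with and without counting replica copies (example values):
$ pg_num=512; size=2; osds=10
$ awk -v p="$pg_num" -v s="$size" -v o="$osds" 'BEGIN {
      printf "%.1f primary PGs per OSD, %.1f counting all copies\n", p/o, p*s/o
  }'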
[ceph-users] All pools have size=3 but MB data and MB used ratio is 1 to 5
Thanks for the answer. Now the meaning of MB data and MB used is clear, and if all the pools have size=3 I expect a 1-to-3 ratio between the two values. I still can't understand why MB used is so big in my setup. All my pools are size=3, but the ratio of MB data to MB used is 1 to 5 instead of 1 to 3. My first guess was that I wrote a wrong crushmap that was making more than 3 copies.. (is it really possible to make such a mistake?) So I changed my crushmap and put in the default one, which just spreads data across hosts, but I see no change, the ratio is still 1 to 5. I thought maybe my 3 monitors had different views of the pgmap, so I tried to restart the monitors, but this also did not help. What useful information may I share here to troubleshoot this issue further? ceph version 0.87.1 (283c2e7cfa2457799f534744d7d549f83ea1335e) Thank you Saverio

2015-03-25 14:55 GMT+01:00 Gregory Farnum g...@gregs42.com: On Wed, Mar 25, 2015 at 1:24 AM, Saverio Proto ziopr...@gmail.com wrote: Hello there, I started to push data into my ceph cluster. There is something I cannot understand in the output of ceph -w. When I run ceph -w I get this kind of output: 2015-03-25 09:11:36.785909 mon.0 [INF] pgmap v278788: 26056 pgs: 26056 active+clean; 2379 MB data, 19788 MB used, 33497 GB / 33516 GB avail 2379 MB is actually the data I pushed into the cluster, I can see it also in the ceph df output, and the numbers are consistent. What I don't understand is the 19788 MB used. All my pools have size 3, so I expected something like 2379 * 3. Instead this number is very big. I really need to understand how MB used grows because I need to know how many disks to buy.

MB used is the summation of (the programmatic equivalent to) df across all your nodes, whereas MB data is calculated by the OSDs based on data they've written down. Depending on your configuration, MB used can include things like the OSD journals, or even totally unrelated data if the disks are shared with other applications. MB used including the space used by the OSD journals is my first guess about what you're seeing here, in which case you'll notice that it won't grow any faster than MB data does once the journal is fully allocated. -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
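A hedged way to check the journal theory on one node: sum the journal sizes and compare against that node's share of "MB used". This assumes file-based journals under the default OSD data paths; if the journal is a raw partition, query the device with blockdev --getsize64 instead.

# Rough estimate of how much of "MB used" the OSD journals account for on this node:
$ for j in /var/lib/ceph/osd/ceph-*/journal; do du -Lm "$j"; done \
    | awk '{sum += $1} END {print sum " MB consumed by journals on this node"}'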
Re: [ceph-users] Strange osd in PG with new EC-Pool - pgs: 2 active+undersized+degraded
Hi Don, after a lot of trouble due to an unfinished setcrushmap, I was able to remove the new EC pool. I loaded the old crushmap and edited it again. After including a step set_choose_tries 100 in the crushmap, the EC pool creation with ceph osd pool create ec7archiv 1024 1024 erasure 7hostprofile worked without trouble. Due to defective PGs from this test, I removed the cache tier from the old EC pool, which caused the next bit of trouble - but that is another story! Thanks again Udo

Am 25.03.2015 20:37, schrieb Don Doerner: More info please: how did you create your EC pool? It's hard to imagine that you could have specified enough PGs to make it impossible to form PGs out of 84 OSDs (I'm assuming your SSDs are in a separate root) but I have to ask... -don-

-Original Message- From: Udo Lembke [mailto:ulem...@polarzone.de] Sent: 25 March, 2015 08:54 To: Don Doerner; ceph-us...@ceph.com Subject: Re: [ceph-users] Strange osd in PG with new EC-Pool - pgs: 2 active+undersized+degraded Hi Don, thanks for the info! It looks like choose_tries set to 200 does the trick. But the setcrushmap takes a long, long time (alarming, but the clients still have IO)... hope it finishes soon ;-) Udo

Am 25.03.2015 16:00, schrieb Don Doerner: Assuming you've calculated the number of PGs reasonably, see here http://tracker.ceph.com/issues/10350 and here http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/#crush-gives-up-too-soon. I'm guessing these will address your issue. That weird number means that no OSD was found/assigned to the PG. -don-

-- The information contained in this transmission may be confidential. Any disclosure, copying, or further distribution of confidential information is not permitted unless such privilege is explicitly granted in writing by Quantum. Quantum reserves the right to have electronic communications, including email and attachments, sent across its networks filtered through anti virus and spam software programs and retain such messages in order to comply with applicable data security and retention requirements. Quantum is not responsible for the proper and complete transmission of the substance of this communication or for any delay in its receipt. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
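For reference, the usual round trip for hand-editing a crushmap as described above is sketched below. The file names are placeholders; the set_choose_tries line goes inside the erasure-coded rule, ahead of its choose/chooseleaf step.

# Sketch of the crushmap edit cycle:
$ ceph osd getcrushmap -o crushmap.bin
$ crushtool -d crushmap.bin -o crushmap.txt
#   edit crushmap.txt and add, inside the EC rule:  step set_choose_tries 100
$ crushtool -c crushmap.txt -o crushmap-new.bin
$ ceph osd setcrushmap -i crushmap-new.bin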
Re: [ceph-users] more human readable log to track request or using mapreduce for data statistics
On 26/03/2015, at 09.05, 池信泽 xmdx...@gmail.com wrote: Hi, ceph: Currently, the command "ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_historic_ops" may return something like this: { description: osd_op(client.4436.1:11617 rb.0.1153.6b8b4567.0192 [] 2.8eb4757c ondisk+write e92), received_at: 2015-03-25 19:41:47.146145, age: 2.186521, duration: 1.237882, type_data: [ commit sent; apply or cleanup, { client: client.4436, tid: 11617}, [ { time: 2015-03-25 19:41:47.150803, event: event1}, { time: 2015-03-25 19:41:47.150873, event: event2}, { time: 2015-03-25 19:41:47.150895, event: event3}, { time: 2015-03-25 19:41:48.384027, event: event4}]]}

Seems like JSON format. So consider doing your custom conversion with some CLI tool that converts the JSON into the string format you need.

I think this format is not so suitable for grepping logs or for using mapreduce for data statistics. For example, I want to know the average write request latency for each rbd every day. If we could output all the latencies on just one line, it would be very easy to achieve. For example, the output log might be something like this: 2015-03-26 03:30:53.859759 osd=osd.0 pg=2.11 op=(client.4436.1:11617 rb.0.1153.6b8b4567.0192 [] 2.8eb4757c ondisk+write e92) received_at=1427355253 age=2.186521 duration=1.237882 tid=11617 client=client.4436 event1=20ms event2=300ms event3=400ms event4=100ms.

In the above: duration means the time between (reply_to_client_stamp - request_received_stamp); event1 means the time between (event1_stamp - request_received_stamp); ... event4 means the time between (event4_stamp - request_received_stamp). Now, if we output every log line as above, it would be very easy to know the average write request latency for each rbd every day. Or, using grep, it is much easier to find out which stage is the bottleneck. -- Regards, xinze ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com signature.asc Description: Message signed with OpenPGP using GPGMail ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Hammer release data and a Design question
Hi, I'm just starting on a small Ceph implementation and wanted to know the release date for Hammer. Will it coincide with the release of OpenStack?

My config (using 10G and jumbo frames on CentOS 7 / RHEL 7):
3x Mons (VMs): CPU - 2, Memory - 4G, Storage - 20 GB
4x OSDs: CPU - Haswell Xeon, Memory - 8 GB, SATA - 3x 2TB (3 OSDs per node), SSD - 2x 480 GB (journaling and, if possible, tiering)

This is a test environment to see how all the components play together. If all goes well we plan to increase the OSDs to 24 per node and the RAM to 32 GB, with dual-socket Haswell Xeons. The storage will primarily be used to provide Cinder and Swift. Just wanted to know the expert opinion on how to scale: - keep the nodes symmetric, or - just add the new beefy nodes and grow. Thanks in advance ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] How to see the content of an EC Pool after recreate the SSD-Cache tier?
Hi all, due to a very silly approach, I removed the cache tier of a filled EC pool. After recreating the pool and connecting it with the EC pool I don't see any content. How can I see the rbd_data and other objects through the new SSD cache tier? I think that I must recreate the rbd_directory (and fill it with setomapval), but I don't see anything yet!

$ rados ls -p ecarchiv | more rbd_data.2e47de674b0dc51.00390074 rbd_data.2e47de674b0dc51.0020b64f rbd_data.2fbb1952ae8944a.0016184c rbd_data.2cfc7ce74b0dc51.00363527 rbd_data.2cfc7ce74b0dc51.0004c35f rbd_data.2fbb1952ae8944a.0008db43 rbd_data.2cfc7ce74b0dc51.0015895a rbd_data.31229f0238e1f29.000135eb ...

$ rados ls -p ssd-archiv nothing

generation of the cache tier:
$ rados mkpool ssd-archiv
$ ceph osd pool set ssd-archiv crush_ruleset 5
$ ceph osd tier add ecarchiv ssd-archiv
$ ceph osd tier cache-mode ssd-archiv writeback
$ ceph osd pool set ssd-archiv hit_set_type bloom
$ ceph osd pool set ssd-archiv hit_set_count 1
$ ceph osd pool set ssd-archiv hit_set_period 3600
$ ceph osd pool set ssd-archiv target_max_bytes 500

rule ssd { ruleset 5 type replicated min_size 1 max_size 10 step take ssd step choose firstn 0 type osd step emit }

Is there any magic (or which command did I miss?) to see the existing data through the cache tier? regards - and hoping for answers Udo ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
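One command the list above does not show is setting the overlay; without it, client I/O is not redirected through the new cache pool, which on its own could explain why nothing is visible through it. A minimal sketch using the pool names from the message (the other tier settings stay as configured above):

# Tiering steps with the overlay included:
$ ceph osd tier add ecarchiv ssd-archiv
$ ceph osd tier cache-mode ssd-archiv writeback
$ ceph osd tier set-overlay ecarchiv ssd-archiv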
Re: [ceph-users] running Qemu / Hypervisor AND Ceph on the same nodes
On 26-03-15 12:04, Stefan Priebe - Profihost AG wrote: Hi Wido, Am 26.03.2015 um 11:59 schrieb Wido den Hollander: On 26-03-15 11:52, Stefan Priebe - Profihost AG wrote: Hi, in the past I read pretty often that it's not a good idea to run ceph and qemu / the hypervisors on the same nodes. But why is this a bad idea? You save space and can better use the resources you have in the nodes anyway.

Memory pressure during recovery *might* become a problem. If you make sure that you don't allocate more than, let's say, 50% for the guests it could work.

Hmm, sure? I've never seen problems like that. Currently I run each ceph node with 64GB of memory and each hypervisor node with around 512GB to 1TB RAM while having 48 cores.

Yes, it can happen. You have machines with enough memory, but if you overprovision the machines it can happen.

Using cgroups you could also prevent that the OSDs eat up all memory or CPU.

Never seen an OSD doing such crazy things.

Again, it really depends on the available memory and CPU. If you buy big machines for this purpose it probably won't be a problem.

Stefan

So technically it could work, but memory and CPU pressure is something which might give you problems.

Stefan ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Wido den Hollander 42on B.V. Phone: +31 (0)20 700 9902 Skype: contact42on ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] running Qemu / Hypervisor AND Ceph on the same nodes
Hi, in the past I read pretty often that it's not a good idea to run ceph and qemu / the hypervisors on the same nodes. But why is this a bad idea? You save space and can better use the resources you have in the nodes anyway. Stefan ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] running Qemu / Hypervisor AND Ceph on the same nodes
On 26-03-15 11:52, Stefan Priebe - Profihost AG wrote: Hi, in the past I read pretty often that it's not a good idea to run ceph and qemu / the hypervisors on the same nodes. But why is this a bad idea? You save space and can better use the resources you have in the nodes anyway.

Memory pressure during recovery *might* become a problem. If you make sure that you don't allocate more than, let's say, 50% for the guests it could work. Using cgroups you could also prevent that the OSDs eat up all memory or CPU. So technically it could work, but memory and CPU pressure is something which might give you problems.

Stefan ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Wido den Hollander 42on B.V. Phone: +31 (0)20 700 9902 Skype: contact42on ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] running Qemu / Hypervisor AND Ceph on the same nodes
A word of caution: While normally my OSDs use very little CPU, I have occasionally had an issue where the OSDs saturate the CPU (not necessarily during a rebuild). This might be a kernel thing, or a driver thing specific to our hosts, but were this to happen to you, it now impacts your VMs as well potentially. And even during a rebuild, but when things are acting normally, CPU usage goes up by a lot relative to steady-state for periods. On top of this, you would also be sharing other system resources which would be potential abuse vectors -- network for one. I would avoid. On Thu, Mar 26, 2015 at 8:11 AM, Wido den Hollander w...@42on.com wrote: On 26-03-15 12:04, Stefan Priebe - Profihost AG wrote: Hi Wido, Am 26.03.2015 um 11:59 schrieb Wido den Hollander: On 26-03-15 11:52, Stefan Priebe - Profihost AG wrote: Hi, in the past i rwad pretty often that it's not a good idea to run ceph and qemu / the hypervisors on the same nodes. But why is this a bad idea? You save space and can better use the ressources you have in the nodes anyway. Memory pressure during recovery *might* become a problem. If you make sure that you don't allocate more then let's say 50% for the guests it could work. mhm sure? I've never seen problems like that. Currently i ran each ceph node with 64GB of memory and each hypervisor node with around 512GB to 1TB RAM while having 48 cores. Yes, it can happen. You have machines with enough memory, but if you overprovision the machines it can happen. Using cgroups you could also prevent that the OSDs eat up all memory or CPU. Never seen an OSD doing so crazy things. Again, it really depends on the available memory and CPU. If you buy big machines for this purpose it probably won't be a problem. Stefan So technically it could work, but memorey and CPU pressure is something which might give you problems. Stefan ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Wido den Hollander 42on B.V. Phone: +31 (0)20 700 9902 Skype: contact42on ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- David Burley NOC Manager, Sr. Systems Programmer/Analyst Slashdot Media e: da...@slashdotmedia.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] running Qemu / Hypervisor AND Ceph on the same nodes
It's kind of a philosophical question. Technically there's nothing that prevents you from putting ceph and the hypervisor on the same boxes. It's a question of whether or not potential cost savings are worth increased risk of failure and contention. You can minimize those things through various means (cgroups, ristricting NUMA nodes, etc). What is more difficult is isolating disk IO contention (say if you want local SSDs for VMs), memory bus and QPI contention, network contention, etc. If the VMs are working really hard you can restrict them to their own socket, and you can even restrict memory usage to the local socket, but what about remote socket network or disk IO? (you will almost certainly want these things on the ceph socket) I wonder as well about increased risk of hardware failure with the increased load, but I don't have any statistics. I'm guessing if you spent enough time at it you could make it work relatively well, but at least personally I question how beneficial it really is after all of that. If you are going for cost savings, I suspect efficient compute and storage node designs will be nearly as good with much less complexity. Mark On 03/26/2015 07:11 AM, Wido den Hollander wrote: On 26-03-15 12:04, Stefan Priebe - Profihost AG wrote: Hi Wido, Am 26.03.2015 um 11:59 schrieb Wido den Hollander: On 26-03-15 11:52, Stefan Priebe - Profihost AG wrote: Hi, in the past i rwad pretty often that it's not a good idea to run ceph and qemu / the hypervisors on the same nodes. But why is this a bad idea? You save space and can better use the ressources you have in the nodes anyway. Memory pressure during recovery *might* become a problem. If you make sure that you don't allocate more then let's say 50% for the guests it could work. mhm sure? I've never seen problems like that. Currently i ran each ceph node with 64GB of memory and each hypervisor node with around 512GB to 1TB RAM while having 48 cores. Yes, it can happen. You have machines with enough memory, but if you overprovision the machines it can happen. Using cgroups you could also prevent that the OSDs eat up all memory or CPU. Never seen an OSD doing so crazy things. Again, it really depends on the available memory and CPU. If you buy big machines for this purpose it probably won't be a problem. Stefan So technically it could work, but memorey and CPU pressure is something which might give you problems. Stefan ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
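A hedged sketch of the cgroup/NUMA isolation mentioned in this thread (cgroups v1; the core range, memory cap, and paths are illustrative assumptions, not recommendations): pin the ceph-osd processes to one socket and cap their memory so guests on the other socket are not starved.

# Pin ceph-osd to socket 0 and cap its memory (illustrative values):
$ mkdir -p /sys/fs/cgroup/cpuset/ceph-osd /sys/fs/cgroup/memory/ceph-osd
$ echo 0-11 > /sys/fs/cgroup/cpuset/ceph-osd/cpuset.cpus    # cores of socket 0
$ echo 0 > /sys/fs/cgroup/cpuset/ceph-osd/cpuset.mems       # NUMA node 0
$ echo 32G > /sys/fs/cgroup/memory/ceph-osd/memory.limit_in_bytes
$ for pid in $(pgrep ceph-osd); do echo "$pid" > /sys/fs/cgroup/cpuset/ceph-osd/tasks; echo "$pid" > /sys/fs/cgroup/memory/ceph-osd/tasks; done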
Re: [ceph-users] more human readable log to track request or using mapreduce for data statistics
On 26/03/2015, at 12.14, 池信泽 xmdx...@gmail.com wrote: It is not so convenient to do the conversion ourselves, because there are many kinds of log lines in ceph-osd.log and we only need some of them, including the latency. As it is now, it is hard to grep for the lines we want and decode them.

Still, run the output through a pipe that understands JSON and either prints exactly what you need and/or stores the data in whatever repository you want to accumulate statistics in, e.g.: ceph --admin-daemon ... dump_historic_ops | myjsonreaderNformatter.php | grep, awk, sed, cut, posix-1 filter-cmd. Don't expect the ceph developers to alter the ceph code base to match your exact need when you still want to filter the output through grep anyway, IMHO :)

2015-03-26 16:38 GMT+08:00 Steffen W Sørensen ste...@me.com: On 26/03/2015, at 09.05, 池信泽 xmdx...@gmail.com wrote: Hi, ceph: Currently, the command "ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_historic_ops" may return something like this: { description: osd_op(client.4436.1:11617 rb.0.1153.6b8b4567.0192 [] 2.8eb4757c ondisk+write e92), received_at: 2015-03-25 19:41:47.146145, age: 2.186521, duration: 1.237882, type_data: [ commit sent; apply or cleanup, { client: client.4436, tid: 11617}, [ { time: 2015-03-25 19:41:47.150803, event: event1}, { time: 2015-03-25 19:41:47.150873, event: event2}, { time: 2015-03-25 19:41:47.150895, event: event3}, { time: 2015-03-25 19:41:48.384027, event: event4}]]}

Seems like JSON format. So consider doing your custom conversion with some CLI tool that converts the JSON into the string format you need.

I think this format is not so suitable for grepping logs or for using mapreduce for data statistics. For example, I want to know the average write request latency for each rbd every day. If we could output all the latencies on just one line, it would be very easy to achieve. For example, the output log might be something like this: 2015-03-26 03:30:53.859759 osd=osd.0 pg=2.11 op=(client.4436.1:11617 rb.0.1153.6b8b4567.0192 [] 2.8eb4757c ondisk+write e92) received_at=1427355253 age=2.186521 duration=1.237882 tid=11617 client=client.4436 event1=20ms event2=300ms event3=400ms event4=100ms.

In the above: duration means the time between (reply_to_client_stamp - request_received_stamp); event1 means the time between (event1_stamp - request_received_stamp); ... event4 means the time between (event4_stamp - request_received_stamp). Now, if we output every log line as above, it would be very easy to know the average write request latency for each rbd every day. Or, using grep, it is much easier to find out which stage is the bottleneck. -- Regards, xinze ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Regards, xinze ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
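A concrete sketch of the "pipe it through a JSON-aware filter" suggestion above, assuming jq is installed; it flattens each historic op onto one greppable line. The field names follow the sample output quoted in this thread and may need adjusting to what your ceph version actually emits.

# Flatten dump_historic_ops into one line per op for grep/statistics:
$ ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_historic_ops \
    | jq -r '.. | objects | select(has("description")) | "\(.received_at) duration=\(.duration) \(.description)"'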
[ceph-users] Where is the systemd files?
I understand that Giant should have systemd service files, but I don't see them in the CentOS 7 packages. https://github.com/ceph/ceph/tree/giant/systemd [ulhglive-root@mon1 systemd]# rpm -qa | grep --color=always ceph ceph-common-0.93-0.el7.centos.x86_64 python-cephfs-0.93-0.el7.centos.x86_64 libcephfs1-0.93-0.el7.centos.x86_64 ceph-0.93-0.el7.centos.x86_64 ceph-deploy-1.5.22-0.noarch [ulhglive-root@mon1 systemd]# for i in $(rpm -qa | grep ceph); do rpm -ql $i | grep -i --color=always systemd; done [nothing returned] Thanks, Robert LeBlanc ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Calamari Deployment
For that matter, is there a way to build Calamari without going the whole vagrant path at all? Some way of just building it through command-line tools? I would be building it on an Openstack instance, no GUI. Seems silly to have to install an entire virtualbox environment inside something that’s already a VM. From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of JESUS CHAVEZ ARGUELLES Sent: Monday, March 02, 2015 3:00 PM To: ceph-users@lists.ceph.com Subject: [ceph-users] Calamari Deployment Does anybody know how to succesful install Calamari in rhel7 ? I have tried the vagrant thug without sucesss and it seems like a nightmare there is a Kind of Sidur when you do vagrant up where it seems not to find the vm path... Regards Jesus Chavez SYSTEMS ENGINEER-C.SALES jesch...@cisco.commailto:jesch...@cisco.com Phone: +52 55 5267 3146tel:+52%2055%205267%203146 Mobile: +51 1 5538883255tel:+51%201%205538883255 CCIE - 44433 -- CONFIDENTIALITY NOTICE: If you have received this email in error, please immediately notify the sender by e-mail at the address shown. This email transmission may contain confidential information. This information is intended only for the use of the individual(s) or entity to whom it is intended even if addressed incorrectly. Please delete it from your files if you are not the intended recipient. Thank you for your compliance. Copyright (c) 2015 Cigna == ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] How to see the content of an EC Pool after recreate the SSD-Cache tier?
I don't know why you're mucking about manually with the rbd directory; the rbd tool and rados handle cache pools correctly as far as I know. -Greg On Thu, Mar 26, 2015 at 8:56 AM, Udo Lembke ulem...@polarzone.de wrote: Hi Greg, ok! It's looks like, that my problem is more setomapval-related... I must o something like rados -p ssd-archiv setomapval rbd_directory name_vm-409-disk-2 \0x0f\0x00\0x00\0x002cfc7ce74b0dc51 but rados setomapval don't use the hexvalues - instead of this I got rados -p ssd-archiv listomapvals rbd_directory name_vm-409-disk-2 value: (35 bytes) : : 5c 30 78 30 66 5c 30 78 30 30 5c 30 78 30 30 5c : \0x0f\0x00\0x00\ 0010 : 30 78 30 30 32 63 66 63 37 63 65 37 34 62 30 64 : 0x002cfc7ce74b0d 0020 : 63 35 31: c51 hmm, strange. With rados -p ssd-archiv getomapval rbd_directory name_vm-409-disk-2 name_vm-409-disk-2 I got the binary inside the file name_vm-409-disk-2, but reverse do an rados -p ssd-archiv setomapval rbd_directory name_vm-409-disk-2 name_vm-409-disk-2 fill the variable with name_vm-409-disk-2 and not with the content of the file... Are there other tools for the rbd_directory? regards Udo Am 26.03.2015 15:03, schrieb Gregory Farnum: You shouldn't rely on rados ls when working with cache pools. It doesn't behave properly and is a silly operation to run against a pool of any size even when it does. :) More specifically, rados ls is invoking the pgls operation. Normal read/write ops will go query the backing store for objects if they're not in the cache tier. pgls is different — it just tells you what objects are present in the PG on that OSD right now. So any objects which aren't in cache won't show up when listing on the cache pool. -Greg On Thu, Mar 26, 2015 at 3:43 AM, Udo Lembke ulem...@polarzone.de wrote: Hi all, due an very silly approach, I removed the cache tier of an filled EC pool. After recreate the pool and connect with the EC pool I don't see any content. How can I see the rbd_data and other files through the new ssd cache tier? I think, that I must recreate the rbd_directory (and fill with setomapval), but I don't see anything yet! $ rados ls -p ecarchiv | more rbd_data.2e47de674b0dc51.00390074 rbd_data.2e47de674b0dc51.0020b64f rbd_data.2fbb1952ae8944a.0016184c rbd_data.2cfc7ce74b0dc51.00363527 rbd_data.2cfc7ce74b0dc51.0004c35f rbd_data.2fbb1952ae8944a.0008db43 rbd_data.2cfc7ce74b0dc51.0015895a rbd_data.31229f0238e1f29.000135eb ... $ rados ls -p ssd-archiv nothing generation of the cache tier: $ rados mkpool ssd-archiv $ ceph osd pool set ssd-archiv crush_ruleset 5 $ ceph osd tier add ecarchiv ssd-archiv $ ceph osd tier cache-mode ssd-archiv writeback $ ceph osd pool set ssd-archiv hit_set_type bloom $ ceph osd pool set ssd-archiv hit_set_count 1 $ ceph osd pool set ssd-archiv hit_set_period 3600 $ ceph osd pool set ssd-archiv target_max_bytes 500 rule ssd { ruleset 5 type replicated min_size 1 max_size 10 step take ssd step choose firstn 0 type osd step emit } Are there any magic (or which command I missed?) to see the excisting data throug the cache tier? regards - and hoping for answers Udo ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Calamari Deployment
I used this as a guide for building calamari packages w/o using vagrant. Worked great: http://bryanapperson.com/blog/compiling-calamari-ceph-ubuntu-14-04/ On Thu, Mar 26, 2015 at 10:30 AM, Steffen W Sørensen ste...@me.com wrote: On 26/03/2015, at 17.18, LaBarre, James (CTR) A6IT james.laba...@cigna.com wrote: For that matter, is there a way to build Calamari without going the whole vagrant path at all? Some way of just building it through command-line tools? I would be building it on an Openstack instance, no GUI. Seems silly to have to install an entire virtualbox environment inside something that’s already a VM. Agreed... if U wanted to built in on your server farm/cloud stack env. I just built my packages for Debian Wheezy (with CentOS+RHEL rpms as a bonus) on my desktop Mac/OS-X with use of virtualbox and vagrant ( vagrant is an easy disposable built-env:) *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com ceph-users-boun...@lists.ceph.com] *On Behalf Of *JESUS CHAVEZ ARGUELLES *Sent:* Monday, March 02, 2015 3:00 PM *To:* ceph-users@lists.ceph.com *Subject:* [ceph-users] Calamari Deployment Does anybody know how to succesful install Calamari in rhel7 ? I have tried the vagrant thug without sucesss and it seems like a nightmare there is a Kind of Sidur when you do vagrant up where it seems not to find the vm path... Regards *Jesus Chavez* SYSTEMS ENGINEER-C.SALES jesch...@cisco.com Phone: *+52 55 5267 3146 +52%2055%205267%203146* Mobile: *+51 1 5538883255 +51%201%205538883255* CCIE - 44433 -- CONFIDENTIALITY NOTICE: If you have received this email in error, please immediately notify the sender by e-mail at the address shown. This email transmission may contain confidential information. This information is intended only for the use of the individual(s) or entity to whom it is intended even if addressed incorrectly. Please delete it from your files if you are not the intended recipient. Thank you for your compliance. Copyright (c) 2015 Cigna == ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] running Qemu / Hypervisor AND Ceph on the same nodes
We run many clusters in a similar config with shared Hypervisor/OSD/RGW/RBD in production and in staging but we have been looking into moving our storage to it's own cluster so that we can scale independently. We used AWS and scaled up a ton of virtual users using JMeter clustering to test performance and max loads. We found over all of our test with the same config and upstream network traffic, the latency went from 45ms to 2.2s after a 1,000 users. It stayed that way for the duration of the hour long test. The response time was of course higher than latency (as defined by JMeter) and our payload was a 2MB byte range request of video clips. Our use case is also changing from a standpoint that our object storage is becoming very popular within the company so it has to scale differently but we're not there yet. We plan on a new rollout being separated so we can test it before jumping all in but so far the numbers are there. Both options are valid and work. It really depends on the use cases. My 2 cents, Chris On Thu, Mar 26, 2015 at 11:36 AM, Mark Nelson mnel...@redhat.com wrote: I suspect a config like this where you only have 3 OSDs per node would be more manageable than something denser. IE theoretically a single E5-2697v3 is enough to run 36 OSDs in a 4U super micro chassis for a semi-dense converged solution. You could attempt to restrict the OSDs to one socket and then use a second E5-2697v3 for VMs. Maybe after you've got cgroups setup properly and if you've otherwise balanced things it would work out ok. I question though how much you really benefit by doing this rather than running a 36 drive storage server with lower bin CPUs and a 2nd 1U box for VMs (which you don't need as many of because you can dedicate both sockets to VMs). It probably depends quite a bit on how memory, network, and disk intensive the VMs are, but my take is that it's better to error on the side of simplicity rather than making things overly complicated. Every second you are screwing around trying to make the setup work right eats into any savings you might gain by going with the converged setup. Mark On 03/26/2015 10:12 AM, Quentin Hartman wrote: I run a converged openstack / ceph cluster with 14 1U nodes. Each has 1 SSD (os / journals), 3 1TB spinners (1 OSD each), 16 HT cores, 10Gb NICs for ceph network, and 72GB of RAM. I configure openstack to leave 3GB of RAM unused on each node for OSD / OS overhead. All the VMs are backed by ceph volumes and things generally work very well. I would prefer a dedicated storage layer simply because it seems more right, but I can't say that any of the common concerns of using this kind of setup have come up for me. Aside from shaving off that 3GB of RAM, my deployment isn't any more complex than a split stack deployment would be. After running like this for the better part of a year, I would have a hard time honestly making a real business case for the extra hardware a split stack cluster would require. QH On Thu, Mar 26, 2015 at 6:57 AM, Mark Nelson mnel...@redhat.com mailto:mnel...@redhat.com wrote: It's kind of a philosophical question. Technically there's nothing that prevents you from putting ceph and the hypervisor on the same boxes. It's a question of whether or not potential cost savings are worth increased risk of failure and contention. You can minimize those things through various means (cgroups, ristricting NUMA nodes, etc). 
What is more difficult is isolating disk IO contention (say if you want local SSDs for VMs), memory bus and QPI contention, network contention, etc. If the VMs are working really hard you can restrict them to their own socket, and you can even restrict memory usage to the local socket, but what about remote socket network or disk IO? (you will almost certainly want these things on the ceph socket) I wonder as well about increased risk of hardware failure with the increased load, but I don't have any statistics. I'm guessing if you spent enough time at it you could make it work relatively well, but at least personally I question how beneficial it really is after all of that. If you are going for cost savings, I suspect efficient compute and storage node designs will be nearly as good with much less complexity. Mark On 03/26/2015 07:11 AM, Wido den Hollander wrote: On 26-03-15 12:04, Stefan Priebe - Profihost AG wrote: Hi Wido, Am 26.03.2015 um 11:59 schrieb Wido den Hollander: On 26-03-15 11:52, Stefan Priebe - Profihost AG wrote: Hi, in the past i rwad pretty often that it's not a good idea to run ceph and qemu / the hypervisors on the same nodes. But why is this a bad idea? You save space and can better use the
Re: [ceph-users] how do I destroy cephfs? (interested in cephfs + tiering + erasure coding)
On 03/25/2015 05:44 PM, Gregory Farnum wrote: On Wed, Mar 25, 2015 at 10:36 AM, Jake Grimmett j...@mrc-lmb.cam.ac.uk wrote: Dear All, Please forgive this post if it's naive, I'm trying to familiarise myself with cephfs! I'm using Scientific Linux 6.6. with Ceph 0.87.1 My first steps with cephfs using a replicated pool worked OK. Now trying now to test cephfs via a replicated caching tier on top of an erasure pool. I've created an erasure pool, cannot put it under the existing replicated pool. My thoughts were to delete the existing cephfs, and start again, however I cannot delete the existing cephfs: errors are as follows: [root@ceph1 ~]# ceph fs rm cephfs2 Error EINVAL: all MDS daemons must be inactive before removing filesystem I've tried killing the ceph-mds process, but this does not prevent the above error. I've also tried this, which also errors: [root@ceph1 ~]# ceph mds stop 0 Error EBUSY: must decrease max_mds or else MDS will immediately reactivate Right, so did you run ceph mds set_max_mds 0 and then repeating the stop command? :) This also fail... [root@ceph1 ~]# ceph-deploy mds destroy [ceph_deploy.conf][DEBUG ] found configuration file at: /root/.cephdeploy.conf [ceph_deploy.cli][INFO ] Invoked (1.5.21): /usr/bin/ceph-deploy mds destroy [ceph_deploy.mds][ERROR ] subcommand destroy not implemented Am I doing the right thing in trying to wipe the original cephfs config before attempting to use an erasure cold tier? Or can I just redefine the cephfs? Yeah, unfortunately you need to recreate it if you want to try and use an EC pool with cache tiering, because CephFS knows what pools it expects data to belong to. Things are unlikely to behave correctly if you try and stick an EC pool under an existing one. :( Sounds like this is all just testing, which is good because the suitability of EC+cache is very dependent on how much hot data you have, etc...good luck! -Greg many thanks, Jake Grimmett ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com Thanks for your help - much appreciated. The set_max_mds 0 command worked, but only after I rebooted the server, and restarted ceph twice. Before this I still got an mds active error, and so was unable to destroy the cephfs. Possibly I was being impatient, and needed to let mds go inactive? there were ~1 million files on the system. [root@ceph1 ~]# ceph mds set_max_mds 0 max_mds = 0 [root@ceph1 ~]# ceph mds stop 0 telling mds.0 10.1.0.86:6811/3249 to deactivate [root@ceph1 ~]# ceph mds stop 0 Error EEXIST: mds.0 not active (up:stopping) [root@ceph1 ~]# ceph fs rm cephfs2 Error EINVAL: all MDS daemons must be inactive before removing filesystem There shouldn't be any other mds servers running.. [root@ceph1 ~]# ceph mds stop 1 Error EEXIST: mds.1 not active (down:dne) At this point I rebooted the server, did a service ceph restart twice. Shutdown ceph, then restarted ceph before this command worked: [root@ceph1 ~]# ceph fs rm cephfs2 --yes-i-really-mean-it Anyhow, I've now been able to create an erasure coded pool, with a replicated tier which cephfs is running on :) *Lots* of testing to go! Again, many thanks Jake ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
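Condensed, the teardown sequence that eventually worked in this thread looks like the sketch below (a single MDS rank is assumed; how you stop the ceph-mds daemon depends on your init system, and exact flags can vary between versions).

# Tear down an existing CephFS before recreating it on new pools:
$ ceph mds set_max_mds 0
$ ceph mds stop 0            # wait until the rank is no longer up:stopping
$ service ceph stop mds      # stop the ceph-mds daemon(s) on each MDS host
$ ceph fs rm cephfs2 --yes-i-really-mean-it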
Re: [ceph-users] running Qemu / Hypervisor AND Ceph on the same nodes
That one big server sounds great, but it also sounds like a single point of failure. It's also not cheap. I've been able to build this cluster for about $1400 per node, including the 10Gb networking gear, which is less than what I see the _empty case_ you describe going for new. Even used, the lowest I've seen (lacking trays at that price) is what I paid for one of my nodes including CPU and RAM, and drive trays. So, it's been a pretty inexpensive venture considering what we get out of it. I have no per-node fault tolerance, but if one of my nodes dies, I just restart the VMs that were on it somewhere else and wait for ceph to heal. I also benefit from higher aggregate network bandwidth because I have more ports on the wire. And better per-U cpu and RAM density (for the money). *shrug* different strokes. As for difficulty of management, any screwing around I've done has had nothing to do with the converged nature of the setup, aside from discovering and changing the one setting I mentioned. So, for me at least, it's been a pretty well unqualified net win. I can imagine all sorts of scenarios where that wouldn't be, but I think it's probably debatable whether or not those constitute a common case. The higher node count does add some complexity, but that's easily overcome with some simple automation. Again though, that has no bearing on the converged setup, it's just a factor of how much CPU and RAM we need for our use case. I guess what I'm trying to say is that I don't think the answer is as cut and dry as you seem to think. QH On Thu, Mar 26, 2015 at 9:36 AM, Mark Nelson mnel...@redhat.com wrote: I suspect a config like this where you only have 3 OSDs per node would be more manageable than something denser. IE theoretically a single E5-2697v3 is enough to run 36 OSDs in a 4U super micro chassis for a semi-dense converged solution. You could attempt to restrict the OSDs to one socket and then use a second E5-2697v3 for VMs. Maybe after you've got cgroups setup properly and if you've otherwise balanced things it would work out ok. I question though how much you really benefit by doing this rather than running a 36 drive storage server with lower bin CPUs and a 2nd 1U box for VMs (which you don't need as many of because you can dedicate both sockets to VMs). It probably depends quite a bit on how memory, network, and disk intensive the VMs are, but my take is that it's better to error on the side of simplicity rather than making things overly complicated. Every second you are screwing around trying to make the setup work right eats into any savings you might gain by going with the converged setup. Mark On 03/26/2015 10:12 AM, Quentin Hartman wrote: I run a converged openstack / ceph cluster with 14 1U nodes. Each has 1 SSD (os / journals), 3 1TB spinners (1 OSD each), 16 HT cores, 10Gb NICs for ceph network, and 72GB of RAM. I configure openstack to leave 3GB of RAM unused on each node for OSD / OS overhead. All the VMs are backed by ceph volumes and things generally work very well. I would prefer a dedicated storage layer simply because it seems more right, but I can't say that any of the common concerns of using this kind of setup have come up for me. Aside from shaving off that 3GB of RAM, my deployment isn't any more complex than a split stack deployment would be. After running like this for the better part of a year, I would have a hard time honestly making a real business case for the extra hardware a split stack cluster would require. 
QH On Thu, Mar 26, 2015 at 6:57 AM, Mark Nelson mnel...@redhat.com mailto:mnel...@redhat.com wrote: It's kind of a philosophical question. Technically there's nothing that prevents you from putting ceph and the hypervisor on the same boxes. It's a question of whether or not potential cost savings are worth increased risk of failure and contention. You can minimize those things through various means (cgroups, ristricting NUMA nodes, etc). What is more difficult is isolating disk IO contention (say if you want local SSDs for VMs), memory bus and QPI contention, network contention, etc. If the VMs are working really hard you can restrict them to their own socket, and you can even restrict memory usage to the local socket, but what about remote socket network or disk IO? (you will almost certainly want these things on the ceph socket) I wonder as well about increased risk of hardware failure with the increased load, but I don't have any statistics. I'm guessing if you spent enough time at it you could make it work relatively well, but at least personally I question how beneficial it really is after all of that. If you are going for cost savings, I suspect efficient compute and storage node designs will be nearly as good with much less complexity. Mark On 03/26/2015 07:11 AM, Wido den Hollander wrote: On
Re: [ceph-users] Calamari Deployment
The first step is incorrect: echo deb http://ppa.launchpad.net/saltstack/salt/ubuntu lsb_release -sc main | sudo tee /etc/apt/sources.list.d/saltstack.list should be echo deb http://ppa.launchpad.net/saltstack/salt/ubuntu $(lsb_release -sc) main | sudo tee /etc/apt/sources.list.d/saltstack.list Anyway this process fails for me at the ./configure stage for Node: creating ./config.mk Traceback (most recent call last): File tools/gyp_node, line 57, in module run_gyp(gyp_args) File tools/gyp_node, line 18, in run_gyp rc = gyp.main(args) File ./tools/gyp/pylib/gyp/__init__.py, line 526, in main return gyp_main(args) File ./tools/gyp/pylib/gyp/__init__.py, line 502, in gyp_main options.circular_check) File ./tools/gyp/pylib/gyp/__init__.py, line 91, in Load generator = __import__(generator_name, globals(), locals(), generator_name) ImportError: No module named generator.make Lee On Thu, Mar 26, 2015 at 1:14 PM, Quentin Hartman qhart...@direwolfdigital.com wrote: I used this as a guide for building calamari packages w/o using vagrant. Worked great: http://bryanapperson.com/blog/compiling-calamari-ceph-ubuntu-14-04/ On Thu, Mar 26, 2015 at 10:30 AM, Steffen W Sørensen ste...@me.com wrote: On 26/03/2015, at 17.18, LaBarre, James (CTR) A6IT james.laba...@cigna.com wrote: For that matter, is there a way to build Calamari without going the whole vagrant path at all? Some way of just building it through command-line tools? I would be building it on an Openstack instance, no GUI. Seems silly to have to install an entire virtualbox environment inside something that’s already a VM. Agreed... if U wanted to built in on your server farm/cloud stack env. I just built my packages for Debian Wheezy (with CentOS+RHEL rpms as a bonus) on my desktop Mac/OS-X with use of virtualbox and vagrant ( vagrant is an easy disposable built-env:) *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com ceph-users-boun...@lists.ceph.com] *On Behalf Of *JESUS CHAVEZ ARGUELLES *Sent:* Monday, March 02, 2015 3:00 PM *To:* ceph-users@lists.ceph.com *Subject:* [ceph-users] Calamari Deployment Does anybody know how to succesful install Calamari in rhel7 ? I have tried the vagrant thug without sucesss and it seems like a nightmare there is a Kind of Sidur when you do vagrant up where it seems not to find the vm path... Regards *Jesus Chavez* SYSTEMS ENGINEER-C.SALES jesch...@cisco.com Phone: *+52 55 5267 3146 +52%2055%205267%203146* Mobile: *+51 1 5538883255 +51%201%205538883255* CCIE - 44433 -- CONFIDENTIALITY NOTICE: If you have received this email in error, please immediately notify the sender by e-mail at the address shown. This email transmission may contain confidential information. This information is intended only for the use of the individual(s) or entity to whom it is intended even if addressed incorrectly. Please delete it from your files if you are not the intended recipient. Thank you for your compliance. Copyright (c) 2015 Cigna == ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Calamari Deployment
On 26/03/2015, at 17.18, LaBarre, James (CTR) A6IT james.laba...@cigna.com wrote: For that matter, is there a way to build Calamari without going the whole vagrant path at all? Some way of just building it through command-line tools? I would be building it on an Openstack instance, no GUI. Seems silly to have to install an entire virtualbox environment inside something that’s already a VM. Agreed... if U wanted to built in on your server farm/cloud stack env. I just built my packages for Debian Wheezy (with CentOS+RHEL rpms as a bonus) on my desktop Mac/OS-X with use of virtualbox and vagrant ( vagrant is an easy disposable built-env:) From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of JESUS CHAVEZ ARGUELLES Sent: Monday, March 02, 2015 3:00 PM To: ceph-users@lists.ceph.com mailto:ceph-users@lists.ceph.com Subject: [ceph-users] Calamari Deployment Does anybody know how to succesful install Calamari in rhel7 ? I have tried the vagrant thug without sucesss and it seems like a nightmare there is a Kind of Sidur when you do vagrant up where it seems not to find the vm path... Regards Jesus Chavez SYSTEMS ENGINEER-C.SALES jesch...@cisco.com mailto:jesch...@cisco.com Phone: +52 55 5267 3146 tel:+52%2055%205267%203146 Mobile: +51 1 5538883255 tel:+51%201%205538883255 CCIE - 44433 -- CONFIDENTIALITY NOTICE: If you have received this email in error, please immediately notify the sender by e-mail at the address shown. This email transmission may contain confidential information. This information is intended only for the use of the individual(s) or entity to whom it is intended even if addressed incorrectly. Please delete it from your files if you are not the intended recipient. Thank you for your compliance. Copyright (c) 2015 Cigna == ___ ceph-users mailing list ceph-users@lists.ceph.com mailto:ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] How to see the content of an EC Pool after recreate the SSD-Cache tier?
Hi Greg, On 26.03.2015 18:46, Gregory Farnum wrote: I don't know why you're mucking about manually with the rbd directory; the rbd tool and rados handle cache pools correctly as far as I know.

that's because I deleted the cache tier pool, so the objects like rbd_header.2cfc7ce74b0dc51 and rbd_directory are gone. All the vm-disk data is still in the EC pool (rbd_data.2cfc7ce74b0dc51.*). I can't see or recreate the VM disk, because rados setomapval doesn't like binary data and the rbd tool can't (re)create an rbd disk with a given hash (like 2cfc7ce74b0dc51). The only way I see at the moment is to create new rbd disks and copy all blocks with rados get -> file -> rados put. The problem is the time it takes (days to weeks for 3 * 16TB)... Udo

-Greg On Thu, Mar 26, 2015 at 8:56 AM, Udo Lembke ulem...@polarzone.de wrote: Hi Greg, OK! It looks like my problem is more setomapval-related... I must do something like rados -p ssd-archiv setomapval rbd_directory name_vm-409-disk-2 \0x0f\0x00\0x00\0x002cfc7ce74b0dc51 but rados setomapval doesn't take the hex values - instead I got rados -p ssd-archiv listomapvals rbd_directory name_vm-409-disk-2 value: (35 bytes) : : 5c 30 78 30 66 5c 30 78 30 30 5c 30 78 30 30 5c : \0x0f\0x00\0x00\ 0010 : 30 78 30 30 32 63 66 63 37 63 65 37 34 62 30 64 : 0x002cfc7ce74b0d 0020 : 63 35 31: c51

Hmm, strange. With rados -p ssd-archiv getomapval rbd_directory name_vm-409-disk-2 name_vm-409-disk-2 I get the binary value inside the file name_vm-409-disk-2, but in reverse, rados -p ssd-archiv setomapval rbd_directory name_vm-409-disk-2 name_vm-409-disk-2 fills the value with the string name_vm-409-disk-2 and not with the content of the file... Are there other tools for the rbd_directory? regards Udo

Am 26.03.2015 15:03, schrieb Gregory Farnum: You shouldn't rely on rados ls when working with cache pools. It doesn't behave properly and is a silly operation to run against a pool of any size even when it does. :) More specifically, rados ls is invoking the pgls operation. Normal read/write ops will go query the backing store for objects if they're not in the cache tier. pgls is different — it just tells you what objects are present in the PG on that OSD right now. So any objects which aren't in cache won't show up when listing on the cache pool. -Greg

On Thu, Mar 26, 2015 at 3:43 AM, Udo Lembke ulem...@polarzone.de wrote: Hi all, due to a very silly approach, I removed the cache tier of a filled EC pool. After recreating the pool and connecting it with the EC pool I don't see any content. How can I see the rbd_data and other objects through the new SSD cache tier? I think that I must recreate the rbd_directory (and fill it with setomapval), but I don't see anything yet! $ rados ls -p ecarchiv | more rbd_data.2e47de674b0dc51.00390074 rbd_data.2e47de674b0dc51.0020b64f rbd_data.2fbb1952ae8944a.0016184c rbd_data.2cfc7ce74b0dc51.00363527 rbd_data.2cfc7ce74b0dc51.0004c35f rbd_data.2fbb1952ae8944a.0008db43 rbd_data.2cfc7ce74b0dc51.0015895a rbd_data.31229f0238e1f29.000135eb ...
$ rados ls -p ssd-archiv nothing

generation of the cache tier: $ rados mkpool ssd-archiv $ ceph osd pool set ssd-archiv crush_ruleset 5 $ ceph osd tier add ecarchiv ssd-archiv $ ceph osd tier cache-mode ssd-archiv writeback $ ceph osd pool set ssd-archiv hit_set_type bloom $ ceph osd pool set ssd-archiv hit_set_count 1 $ ceph osd pool set ssd-archiv hit_set_period 3600 $ ceph osd pool set ssd-archiv target_max_bytes 500

rule ssd { ruleset 5 type replicated min_size 1 max_size 10 step take ssd step choose firstn 0 type osd step emit }

Is there any magic (or which command did I miss?) to see the existing data through the cache tier? regards - and hoping for answers Udo ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
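On the setomapval problem above: some rados builds read the omap value from standard input when the value argument is omitted; whether the 0.87-era binary does is an assumption that needs checking. If it does, the raw bytes (a 4-byte little-endian length prefix, 0x0f = 15, followed by the image id, matching the hexdump quoted above) can be fed in with printf rather than as a shell-escaped string. Hand-editing rbd_directory is risky, so test this on a scratch pool first.

# Hedged sketch - only works if this rados build accepts the omap value on stdin:
$ printf '\x0f\x00\x00\x002cfc7ce74b0dc51' | rados -p ssd-archiv setomapval rbd_directory name_vm-409-disk-2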
Re: [ceph-users] Migrating objects from one pool to another?
That's a great idea. I know I can setup cinder (the openstack volume manager) as a multi-backend manager and migrate from one backend to the other, each backend linking to different pools of the same ceph cluster. What bugs me though is that I'm pretty sure the image store, glance, wouldn't let me do that. Additionally, since the compute component also has its own ceph pool, I'm pretty sure it won't let me migrate the data through openstack. On 3/26/2015 3:54 PM, Steffen W Sørensen wrote: On 26/03/2015, at 20.38, J-P Methot jpmet...@gtcomm.net wrote: Lately I've been going back to work on one of my first ceph setup and now I see that I have created way too many placement groups for the pools on that setup (about 10 000 too many). I believe this may impact performances negatively, as the performances on this ceph cluster are abysmal. Since it is not possible to reduce the number of PGs in a pool, I was thinking of creating new pools with a smaller number of PGs, moving the data from the old pools to the new pools and then deleting the old pools. I haven't seen any command to copy objects from one pool to another. Would that be possible? I'm using ceph for block storage with openstack, so surely there must be a way to move block devices from a pool to another, right? What I did a one point was going one layer higher in my storage abstraction, and created new Ceph pools and used those for new storage resources/pool in my VM env. (ProxMox) on top of Ceph RBD and then did a live migration of virtual disks there, assume you could do the same in OpenStack. My 0.02$ /Steffen -- == Jean-Philippe Méthot Administrateur système / System administrator GloboTech Communications Phone: 1-514-907-0050 Toll Free: 1-(888)-GTCOMM1 Fax: 1-(514)-907-0750 jpmet...@gtcomm.net http://www.gtcomm.net ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
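For the block-storage side, a hedged sketch of moving images into a freshly created pool with fewer PGs: rbd cp copies one image at a time (snapshots are not preserved), while rados cppool does a flat object-level copy of a whole pool and has its own limitations. The pool names are placeholders, and the instances using the volumes should be stopped first.

# Copy RBD images to a new pool, one at a time:
$ rbd ls old-pool | while read img; do rbd cp old-pool/"$img" new-pool/"$img"; done
# or, a flat whole-pool object copy:
$ rados cppool old-pool new-pool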
Re: [ceph-users] Cascading Failure of OSDs
Since I have been in ceph-land today, it reminded me that I needed to close the loop on this. I was finally able to isolate this problem down to a faulty NIC on the ceph cluster network. It worked, but it was accumulating a huge number of Rx errors. My best guess is some receive buffer cache failed? Anyway, having a NIC go weird like that is totally consistent with all the weird problems I was seeing, the corrupted PGs, and the inability for the cluster to settle down. As a result we've added NIC error rates to our monitoring suite on the cluster so we'll hopefully see this coming if it ever happens again. QH On Sat, Mar 7, 2015 at 11:36 AM, Quentin Hartman qhart...@direwolfdigital.com wrote: So I'm not sure what has changed, but in the last 30 minutes the errors which were all over the place, have finally settled down to this: http://pastebin.com/VuCKwLDp The only thing I can think of is that I also net the noscrub flag in addition to the nodeep-scrub when I first got here, and that finally took. Anyway, they've been stable there for some time now, and I've been able to get a couple VMs to come up and behave reasonably well. At this point I'm prepared to wipe the entire cluster and start over if I have to to get it truly consistent again, since my efforts to zap pg 3.75b haven't borne fruit. However, if anyone has a less nuclear option they'd like to suggest, I'm all ears. I've tried to export/re-import the pg and do a force_create. The import failed, and the force_create just reverted back to being incomplete after creating for a few minutes. QH On Sat, Mar 7, 2015 at 9:29 AM, Quentin Hartman qhart...@direwolfdigital.com wrote: Now that I have a better understanding of what's happening, I threw together a little one-liner to create a report of the errors that the OSDs are seeing. 
Lots of missing / corrupted pg shards: https://gist.github.com/qhartman/174cc567525060cb462e I've experimented with exporting / importing the broken pgs with ceph_objectstore_tool, and while they seem to export correctly, the tool crashes when trying to import: root@node12:/var/lib/ceph/osd# ceph_objectstore_tool --op import --data-path /var/lib/ceph/osd/ceph-19/ --journal-path /var/lib/ceph/osd/ceph-19/journal --file ~/3.75b.export Importing pgid 3.75b Write 2672075b/rbd_data.2bce2ae8944a.1509/head//3 Write 3473075b/rbd_data.1d6172ae8944a.0001636a/head//3 Write f2e4075b/rbd_data.c816f2ae8944a.0208/head//3 Write f215075b/rbd_data.c4a892ae8944a.0b6b/head//3 Write c086075b/rbd_data.42a742ae8944a.02fb/head//3 Write 6f9d075b/rbd_data.1d6172ae8944a.5ac3/head//3 Write dd9f075b/rbd_data.1d6172ae8944a.0001127d/head//3 Write f9f075b/rbd_data.c4a892ae8944a.f056/head//3 Write 4d71175b/rbd_data.c4a892ae8944a.9e51/head//3 Write bcc3175b/rbd_data.2bce2ae8944a.133f/head//3 Write 1356175b/rbd_data.3f862ae8944a.05d6/head//3 Write d327175b/rbd_data.1d6172ae8944a.0001af85/head//3 Write 7388175b/rbd_data.2bce2ae8944a.1353/head//3 Write 8cda175b/rbd_data.c4a892ae8944a.b585/head//3 Write 6b3c175b/rbd_data.c4a892ae8944a.00018e91/head//3 Write d37f175b/rbd_data.1d6172ae8944a.3a90/head//3 Write 4590275b/rbd_data.2bce2ae8944a.1f67/head//3 Write fe51275b/rbd_data.c4a892ae8944a.e917/head//3 Write 3402275b/rbd_data.3f5c2ae8944a.1252/6//3 osd/SnapMapper.cc: In function 'void SnapMapper::add_oid(const hobject_t, const std::setsnapid_t, MapCacher::Transactionstd::basic_stringchar, ceph::buffer::list*)' thread 7fba67ff3900 time 2015-03-07 16:21:57.921820 osd/SnapMapper.cc: 228: FAILED assert(r == -2) ceph version 0.87.1 (283c2e7cfa2457799f534744d7d549f83ea1335e) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xb94fbb] 2: (SnapMapper::add_oid(hobject_t const, std::setsnapid_t, std::lesssnapid_t, std::allocatorsnapid_t const, MapCacher::Transactionstd::string, ceph::buffer::list*)+0x63e) [0x7b719e] 3: (get_attrs(ObjectStore*, coll_t, ghobject_t, ObjectStore::Transaction*, ceph::buffer::list, OSDriver, SnapMapper)+0x67c) [0x661a1c] 4: (get_object(ObjectStore*, coll_t, ceph::buffer::list)+0x3e5) [0x661f85] 5: (do_import(ObjectStore*, OSDSuperblock)+0xd61) [0x665be1] 6: (main()+0x2208) [0x63f178] 7: (__libc_start_main()+0xf5) [0x7fba627b2ec5] 8: ceph_objectstore_tool() [0x659577] NOTE: a copy of the executable, or `objdump -rdS executable` is needed to interpret this. terminate called after throwing an instance of 'ceph::FailedAssertion' *** Caught signal (Aborted) ** in thread 7fba67ff3900 ceph version 0.87.1 (283c2e7cfa2457799f534744d7d549f83ea1335e) 1: ceph_objectstore_tool() [0xab1cea] 2: (()+0x10340) [0x7fba66a95340] 3: (gsignal()+0x39) [0x7fba627c7cc9] 4: (abort()+0x148) [0x7fba627cb0d8] 5:
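Since accumulating NIC Rx errors turned out to be the tell-tale here, a minimal check along these lines can be dropped into a monitoring suite; the interface name and threshold are only examples:
#!/bin/sh
# Warn when rx_errors on the cluster-network NIC grows between runs.
IFACE=eth1
THRESHOLD=100
STATE=/var/tmp/rx_errors.$IFACE
curr=$(cat /sys/class/net/$IFACE/statistics/rx_errors)
prev=$(cat "$STATE" 2>/dev/null || echo 0)
echo "$curr" > "$STATE"
delta=$((curr - prev))
if [ "$delta" -gt "$THRESHOLD" ]; then
    echo "WARNING: $IFACE rx_errors grew by $delta"
    exit 1
fi
echo "OK: $IFACE rx_errors delta $delta"
If the driver supports it, ethtool -S on the interface gives a more detailed per-counter breakdown.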
Re: [ceph-users] how do I destroy cephfs? (interested in cephfs + tiering + erasure coding)
For what it's worth, I don't think being patient was the answer. I was having the same problem a couple of weeks ago, and I waited from before 5pm one day until after 8am the next, and still got the same errors. I ended up adding a new cephfs pool with a newly-created small pool, but was never able to actually remove cephfs altogether. On Thu, Mar 26, 2015 at 12:45 PM, Jake Grimmett j...@mrc-lmb.cam.ac.uk wrote: On 03/25/2015 05:44 PM, Gregory Farnum wrote: On Wed, Mar 25, 2015 at 10:36 AM, Jake Grimmett j...@mrc-lmb.cam.ac.uk wrote: Dear All, Please forgive this post if it's naive, I'm trying to familiarise myself with cephfs! I'm using Scientific Linux 6.6. with Ceph 0.87.1 My first steps with cephfs using a replicated pool worked OK. Now trying now to test cephfs via a replicated caching tier on top of an erasure pool. I've created an erasure pool, cannot put it under the existing replicated pool. My thoughts were to delete the existing cephfs, and start again, however I cannot delete the existing cephfs: errors are as follows: [root@ceph1 ~]# ceph fs rm cephfs2 Error EINVAL: all MDS daemons must be inactive before removing filesystem I've tried killing the ceph-mds process, but this does not prevent the above error. I've also tried this, which also errors: [root@ceph1 ~]# ceph mds stop 0 Error EBUSY: must decrease max_mds or else MDS will immediately reactivate Right, so did you run ceph mds set_max_mds 0 and then repeating the stop command? :) This also fail... [root@ceph1 ~]# ceph-deploy mds destroy [ceph_deploy.conf][DEBUG ] found configuration file at: /root/.cephdeploy.conf [ceph_deploy.cli][INFO ] Invoked (1.5.21): /usr/bin/ceph-deploy mds destroy [ceph_deploy.mds][ERROR ] subcommand destroy not implemented Am I doing the right thing in trying to wipe the original cephfs config before attempting to use an erasure cold tier? Or can I just redefine the cephfs? Yeah, unfortunately you need to recreate it if you want to try and use an EC pool with cache tiering, because CephFS knows what pools it expects data to belong to. Things are unlikely to behave correctly if you try and stick an EC pool under an existing one. :( Sounds like this is all just testing, which is good because the suitability of EC+cache is very dependent on how much hot data you have, etc...good luck! -Greg many thanks, Jake Grimmett ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com Thanks for your help - much appreciated. The set_max_mds 0 command worked, but only after I rebooted the server, and restarted ceph twice. Before this I still got an mds active error, and so was unable to destroy the cephfs. Possibly I was being impatient, and needed to let mds go inactive? there were ~1 million files on the system. [root@ceph1 ~]# ceph mds set_max_mds 0 max_mds = 0 [root@ceph1 ~]# ceph mds stop 0 telling mds.0 10.1.0.86:6811/3249 to deactivate [root@ceph1 ~]# ceph mds stop 0 Error EEXIST: mds.0 not active (up:stopping) [root@ceph1 ~]# ceph fs rm cephfs2 Error EINVAL: all MDS daemons must be inactive before removing filesystem There shouldn't be any other mds servers running.. [root@ceph1 ~]# ceph mds stop 1 Error EEXIST: mds.1 not active (down:dne) At this point I rebooted the server, did a service ceph restart twice. 
Shutdown ceph, then restarted ceph before this command worked: [root@ceph1 ~]# ceph fs rm cephfs2 --yes-i-really-mean-it Anyhow, I've now been able to create an erasure coded pool, with a replicated tier which cephfs is running on :) *Lots* of testing to go! Again, many thanks Jake ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
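To recap the sequence that finally worked in this thread as something repeatable (0.87-era syntax; the filesystem name cephfs2 is the one used above, and as described it may take daemon restarts before the MDS actually goes inactive):
ceph mds set_max_mds 0
ceph mds stop 0                 # may need to be repeated, or preceded by restarts
ceph mds stat                   # wait until no MDS is reported active
ceph fs rm cephfs2 --yes-i-really-mean-it
# then create the EC pool, its replicated cache tier, and a fresh filesystem on top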
Re: [ceph-users] All client writes block when 2 of 3 OSDs down
Has the OSD actually been detected as down yet? You'll also need to set that min size on your existing pools (ceph osd pool pool set min_size 1 or similar) to change their behavior; the config option only takes effect for newly-created pools. (Thus the default.) On Thu, Mar 26, 2015 at 1:29 PM, Lee Revell rlrev...@gmail.com wrote: I added the osd pool default min size = 1 to test the behavior when 2 of 3 OSDs are down, but the behavior is exactly the same as without it: when the 2nd OSD is killed, all client writes start to block and these pipe.(stuff).fault messages begin: 2015-03-26 16:08:50.775848 7fce177fe700 0 monclient: hunting for new mon 2015-03-26 16:08:53.781133 7fce1c2f9700 0 -- 192.168.122.111:0/1011003 192.168.122.131:6789/0 pipe(0x7fce0c01d260 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7fce0c01d4f0).fault 2015-03-26 16:09:00.009092 7fce1c3fa700 0 -- 192.168.122.111:0/1011003 192.168.122.141:6789/0 pipe(0x7fce1802dab0 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7fce1802dd40).fault 2015-03-26 16:09:12.013147 7fce1c2f9700 0 -- 192.168.122.111:0/1011003 192.168.122.131:6789/0 pipe(0x7fce1802e740 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7fce1802e9d0).fault 2015-03-26 16:10:06.013113 7fce1c2f9700 0 -- 192.168.122.111:0/1011003 192.168.122.131:6789/0 pipe(0x7fce1802df80 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7fce1801e600).fault 2015-03-26 16:10:36.013166 7fce1c3fa700 0 -- 192.168.122.111:0/1011003 192.168.122.141:6789/0 pipe(0x7fce1802ebc0 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7fce1802ee50).fault Here is my ceph.conf: [global] fsid = db460aa2-5129-4aaa-8b2e-43eac727124e mon_initial_members = ceph-node-1 mon_host = 192.168.122.121 auth_cluster_required = cephx auth_service_required = cephx auth_client_required = cephx filestore_xattr_use_omap = true osd pool default size = 3 osd pool default min size = 1 public network = 192.168.122.0/24 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
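Spelled out as commands, Greg's two points look roughly like this (the pool name is a placeholder):
ceph osd tree                          # check whether the failed OSD is actually marked down
ceph osd pool set <pool> min_size 1    # repeat for each existing pool
ceph osd pool get <pool> min_size      # verify; the config default only applies to new pools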
Re: [ceph-users] how do I destroy cephfs? (interested in cephfs + tiering + erasure coding)
There have been bugs here in the recent past which have been fixed for hammer, at least...it's possible we didn't backport it for the giant point release. :( But for users going forward that procedure should be good! -Greg On Thu, Mar 26, 2015 at 11:26 AM, Kyle Hutson kylehut...@ksu.edu wrote: For what it's worth, I don't think being patient was the answer. I was having the same problem a couple of weeks ago, and I waited from before 5pm one day until after 8am the next, and still got the same errors. I ended up adding a new cephfs pool with a newly-created small pool, but was never able to actually remove cephfs altogether. On Thu, Mar 26, 2015 at 12:45 PM, Jake Grimmett j...@mrc-lmb.cam.ac.uk wrote: On 03/25/2015 05:44 PM, Gregory Farnum wrote: On Wed, Mar 25, 2015 at 10:36 AM, Jake Grimmett j...@mrc-lmb.cam.ac.uk wrote: Dear All, Please forgive this post if it's naive, I'm trying to familiarise myself with cephfs! I'm using Scientific Linux 6.6. with Ceph 0.87.1 My first steps with cephfs using a replicated pool worked OK. Now trying now to test cephfs via a replicated caching tier on top of an erasure pool. I've created an erasure pool, cannot put it under the existing replicated pool. My thoughts were to delete the existing cephfs, and start again, however I cannot delete the existing cephfs: errors are as follows: [root@ceph1 ~]# ceph fs rm cephfs2 Error EINVAL: all MDS daemons must be inactive before removing filesystem I've tried killing the ceph-mds process, but this does not prevent the above error. I've also tried this, which also errors: [root@ceph1 ~]# ceph mds stop 0 Error EBUSY: must decrease max_mds or else MDS will immediately reactivate Right, so did you run ceph mds set_max_mds 0 and then repeating the stop command? :) This also fail... [root@ceph1 ~]# ceph-deploy mds destroy [ceph_deploy.conf][DEBUG ] found configuration file at: /root/.cephdeploy.conf [ceph_deploy.cli][INFO ] Invoked (1.5.21): /usr/bin/ceph-deploy mds destroy [ceph_deploy.mds][ERROR ] subcommand destroy not implemented Am I doing the right thing in trying to wipe the original cephfs config before attempting to use an erasure cold tier? Or can I just redefine the cephfs? Yeah, unfortunately you need to recreate it if you want to try and use an EC pool with cache tiering, because CephFS knows what pools it expects data to belong to. Things are unlikely to behave correctly if you try and stick an EC pool under an existing one. :( Sounds like this is all just testing, which is good because the suitability of EC+cache is very dependent on how much hot data you have, etc...good luck! -Greg many thanks, Jake Grimmett ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com Thanks for your help - much appreciated. The set_max_mds 0 command worked, but only after I rebooted the server, and restarted ceph twice. Before this I still got an mds active error, and so was unable to destroy the cephfs. Possibly I was being impatient, and needed to let mds go inactive? there were ~1 million files on the system. [root@ceph1 ~]# ceph mds set_max_mds 0 max_mds = 0 [root@ceph1 ~]# ceph mds stop 0 telling mds.0 10.1.0.86:6811/3249 to deactivate [root@ceph1 ~]# ceph mds stop 0 Error EEXIST: mds.0 not active (up:stopping) [root@ceph1 ~]# ceph fs rm cephfs2 Error EINVAL: all MDS daemons must be inactive before removing filesystem There shouldn't be any other mds servers running.. 
[root@ceph1 ~]# ceph mds stop 1 Error EEXIST: mds.1 not active (down:dne) At this point I rebooted the server, did a service ceph restart twice. Shutdown ceph, then restarted ceph before this command worked: [root@ceph1 ~]# ceph fs rm cephfs2 --yes-i-really-mean-it Anyhow, I've now been able to create an erasure coded pool, with a replicated tier which cephfs is running on :) *Lots* of testing to go! Again, many thanks Jake ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] running Qemu / Hypervisor AND Ceph on the same nodes
Am 26.03.2015 um 16:36 schrieb Mark Nelson: I suspect a config like this where you only have 3 OSDs per node would be more manageable than something denser. IE theoretically a single E5-2697v3 is enough to run 36 OSDs in a 4U super micro chassis for a semi-dense converged solution. You could attempt to restrict the OSDs to one socket and then use a second E5-2697v3 for VMs. Maybe after you've got cgroups setup properly and if you've otherwise balanced things it would work out ok. I question though how much you really benefit by doing this rather than running a 36 drive storage server with lower bin CPUs and a 2nd 1U box for VMs (which you don't need as many of because you can dedicate both sockets to VMs). that's pretty big. I have only around 6-8 ssd drives per node. In case of 36 osds per node i won't mix. It probably depends quite a bit on how memory, network, and disk intensive the VMs are, but my take is that it's better to error on the side of simplicity rather than making things overly complicated. Every second you are screwing around trying to make the setup work right eats into any savings you might gain by going with the converged setup. Mark On 03/26/2015 10:12 AM, Quentin Hartman wrote: I run a converged openstack / ceph cluster with 14 1U nodes. Each has 1 SSD (os / journals), 3 1TB spinners (1 OSD each), 16 HT cores, 10Gb NICs for ceph network, and 72GB of RAM. I configure openstack to leave 3GB of RAM unused on each node for OSD / OS overhead. All the VMs are backed by ceph volumes and things generally work very well. I would prefer a dedicated storage layer simply because it seems more right, but I can't say that any of the common concerns of using this kind of setup have come up for me. Aside from shaving off that 3GB of RAM, my deployment isn't any more complex than a split stack deployment would be. After running like this for the better part of a year, I would have a hard time honestly making a real business case for the extra hardware a split stack cluster would require. QH On Thu, Mar 26, 2015 at 6:57 AM, Mark Nelson mnel...@redhat.com mailto:mnel...@redhat.com wrote: It's kind of a philosophical question. Technically there's nothing that prevents you from putting ceph and the hypervisor on the same boxes. It's a question of whether or not potential cost savings are worth increased risk of failure and contention. You can minimize those things through various means (cgroups, ristricting NUMA nodes, etc). What is more difficult is isolating disk IO contention (say if you want local SSDs for VMs), memory bus and QPI contention, network contention, etc. If the VMs are working really hard you can restrict them to their own socket, and you can even restrict memory usage to the local socket, but what about remote socket network or disk IO? (you will almost certainly want these things on the ceph socket) I wonder as well about increased risk of hardware failure with the increased load, but I don't have any statistics. I'm guessing if you spent enough time at it you could make it work relatively well, but at least personally I question how beneficial it really is after all of that. If you are going for cost savings, I suspect efficient compute and storage node designs will be nearly as good with much less complexity. 
Mark On 03/26/2015 07:11 AM, Wido den Hollander wrote: On 26-03-15 12:04, Stefan Priebe - Profihost AG wrote: Hi Wido, On 26.03.2015 at 11:59, Wido den Hollander wrote: On 26-03-15 11:52, Stefan Priebe - Profihost AG wrote: Hi, in the past I read pretty often that it's not a good idea to run ceph and qemu / the hypervisors on the same nodes. But why is this a bad idea? You save space and can better use the resources you have in the nodes anyway. Memory pressure during recovery *might* become a problem. If you make sure that you don't allocate more than, let's say, 50% for the guests it could work. mhm sure? I've never seen problems like that. Currently I run each ceph node with 64GB of memory and each hypervisor node with around 512GB to 1TB RAM while having 48 cores. Yes, it can happen. You have machines with enough memory, but if you overprovision the machines it can happen. Using cgroups you could also prevent the OSDs from eating up all memory or CPU. Never seen an OSD doing such crazy things. Again, it really depends on the available memory and CPU. If you buy big machines for this purpose it probably won't be a problem. Stefan
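For the cgroups point, a rough sketch of confining all OSD daemons to one socket with a cpuset cgroup might look like this; the CPU and NUMA node numbers are examples, and it assumes the cpuset controller is mounted in the usual place:
mkdir -p /sys/fs/cgroup/cpuset/ceph-osd
echo 0-7 > /sys/fs/cgroup/cpuset/ceph-osd/cpuset.cpus    # cores of socket 0
echo 0 > /sys/fs/cgroup/cpuset/ceph-osd/cpuset.mems      # memory of NUMA node 0
for pid in $(pgrep -x ceph-osd); do
    echo "$pid" > /sys/fs/cgroup/cpuset/ceph-osd/tasks
done
The hypervisor/VM processes would get a matching cgroup on the other socket; memory limits can be added the same way if overprovisioning is a concern.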
[ceph-users] Ceph RBD devices management OpenSVC integration
Hi Team, I’ve just written blog post regarding integration of CEPH RBD devices management in OpenSVC service : http://www.flox-arts.net/article30/ceph-rbd-devices-management-with-opensvc-service http://www.flox-arts.net/article30/ceph-rbd-devices-management-with-opensvc-service Next blog post will be regarding Snapshots clones (integrated too in OpenSVC) Thanks Florent Monthel ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Calamari Deployment
Well, we’re a RedHat shop, so I’ll have to see what’s adaptable from there. (Mint on all my home systems, so I’m not totally lost with Ubuntu g) From: Quentin Hartman [mailto:qhart...@direwolfdigital.com] Sent: Thursday, March 26, 2015 1:15 PM To: Steffen W Sørensen Cc: LaBarre, James (CTR) A6IT; ceph-users@lists.ceph.com Subject: Re: [ceph-users] Calamari Deployment I used this as a guide for building calamari packages w/o using vagrant. Worked great: http://bryanapperson.com/blog/compiling-calamari-ceph-ubuntu-14-04/ On Thu, Mar 26, 2015 at 10:30 AM, Steffen W Sørensen ste...@me.com wrote: On 26/03/2015, at 17.18, LaBarre, James (CTR) A6IT james.laba...@cigna.com wrote: For that matter, is there a way to build Calamari without going the whole vagrant path at all? Some way of just building it through command-line tools? I would be building it on an Openstack instance, no GUI. Seems silly to have to install an entire virtualbox environment inside something that’s already a VM. Agreed... if you wanted to build it on your server farm/cloud stack env. I just built my packages for Debian Wheezy (with CentOS+RHEL rpms as a bonus) on my desktop Mac/OS-X with use of virtualbox and vagrant (vagrant is an easy disposable build env :) From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of JESUS CHAVEZ ARGUELLES Sent: Monday, March 02, 2015 3:00 PM To: ceph-users@lists.ceph.com Subject: [ceph-users] Calamari Deployment Does anybody know how to successfully install Calamari in rhel7? I have tried the vagrant thing without success and it seems like a nightmare; there is a kind of issue when you do vagrant up where it seems not to find the vm path... Regards Jesus Chavez SYSTEMS ENGINEER-C.SALES jesch...@cisco.com Phone: +52 55 5267 3146 Mobile: +51 1 5538883255 CCIE - 44433
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] More than 50% osds down, CPUs still busy; will the cluster recover without help?
That's fair enough Greg, I'll keep upgrading when the opportunity arises, and maybe it'll spring back to life someday :-) -Original Message- From: Gregory Farnum [mailto:g...@gregs42.com] Sent: 20 March 2015 23:05 To: Chris Murray Cc: ceph-users Subject: Re: [ceph-users] More than 50% osds down, CPUs still busy; will the cluster recover without help? On Fri, Mar 20, 2015 at 4:03 PM, Chris Murray chrismurra...@gmail.com wrote: Ah, I was wondering myself if compression could be causing an issue, but I'm reconsidering now. My latest experiment should hopefully help troubleshoot. So, I remembered that ZLIB is slower, but is more 'safe for old kernels'. I try that: find /var/lib/ceph/osd/ceph-1/current -xdev \( -type f -o -type d \) -exec btrfs filesystem defragment -v -czlib -- {} + After much, much waiting, all files have been rewritten, but the OSD still gets stuck at the same point. I've now unset the compress attribute on all files and started the defragment process again, but I'm not too hopeful since the files must be readable/writeable if I didn't get some failure during the defrag process. find /var/lib/ceph/osd/ceph-1/current -xdev \( -type f -o -type d \) -exec chattr -c -- {} + find /var/lib/ceph/osd/ceph-1/current -xdev \( -type f -o -type d \) -exec btrfs filesystem defragment -v -- {} + (latter command still running) Any other ideas at all? In the absence of the problem being spelled out to me with an error of some sort, I'm not sure how to troubleshoot further. Not much, sorry. Is it safe to upgrade a problematic cluster, when the time comes, in case this ultimately is a CEPH bug which is fixed in something later than 0.80.9? In general it should be fine since we're careful about backwards compatibility, but without knowing the actual issue I can't promise anything. -Greg - No virus found in this message. Checked by AVG - www.avg.com Version: 2015.0.5751 / Virus Database: 4306/9314 - Release Date: 03/16/15 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Migrating objects from one pool to another?
Hi, Lately I've been going back to work on one of my first ceph setup and now I see that I have created way too many placement groups for the pools on that setup (about 10 000 too many). I believe this may impact performances negatively, as the performances on this ceph cluster are abysmal. Since it is not possible to reduce the number of PGs in a pool, I was thinking of creating new pools with a smaller number of PGs, moving the data from the old pools to the new pools and then deleting the old pools. I haven't seen any command to copy objects from one pool to another. Would that be possible? I'm using ceph for block storage with openstack, so surely there must be a way to move block devices from a pool to another, right? -- == Jean-Philippe Méthot Administrateur système / System administrator GloboTech Communications Phone: 1-514-907-0050 Toll Free: 1-(888)-GTCOMM1 Fax: 1-(514)-907-0750 jpmet...@gtcomm.net http://www.gtcomm.net ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
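A back-of-the-envelope sketch of sizing and creating a replacement pool, using the usual ~100 PGs per OSD rule of thumb; the pool names and numbers below are examples only:
# target PGs ≈ (number of OSDs * 100) / replica size, rounded to a power of two
# e.g. 30 OSDs at size 3  ->  roughly 1024 PGs
ceph osd pool create volumes-new 1024 1024
ceph osd pool set volumes-new size 3
# ...migrate the data and repoint the clients, then retire the old pool:
ceph osd pool delete volumes-old volumes-old --yes-i-really-really-mean-it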
Re: [ceph-users] Migrating objects from one pool to another?
I thought there was some discussion about this before. Something like creating a new pool and then taking your existing pool as an overlay of the new pool (cache) and then flush the overlay to the new pool. I haven't tried it or know if it is possible. The other option is shut the VM down, create a new snapshot on the new pool, point the VM to that and then flatten the RBD. Robert LeBlanc Sent from a mobile device please excuse any typos. On Mar 26, 2015 5:23 PM, Steffen W Sørensen ste...@me.com wrote: On 26/03/2015, at 23.13, Gregory Farnum g...@gregs42.com wrote: The procedure you've outlined won't copy snapshots, just the head objects. Preserving the proper snapshot metadata and inter-pool relationships on rbd images I think isn't actually possible when trying to change pools. This wasn’t ment for migrating a RBD pool, but pure object/Swift pools… Anyway seems Glance http://docs.openstack.org/developer/glance/architecture.html#basic-architecture supports multiple storages http://docs.openstack.org/developer/glance/configuring.html#configuring-multiple-swift-accounts-stores so assume one could use a glance client to also extract/download images into local file format (raw, qcow2 vmdk…) as well as uploading images to glance. And as glance images ain’t ‘live’ like virtual disk images one could also download glance images from one glance store over local file and upload back into a different glance back end store. Again this is properly better than dealing at a lower abstraction level and having to known its internal storage structures and avoid what you’re pointing put Greg. On Thu, Mar 26, 2015 at 3:05 PM, Steffen W Sørensen ste...@me.com wrote: On 26/03/2015, at 23.01, Gregory Farnum g...@gregs42.com wrote: On Thu, Mar 26, 2015 at 2:53 PM, Steffen W Sørensen ste...@me.com wrote: On 26/03/2015, at 21.07, J-P Methot jpmet...@gtcomm.net wrote: That's a great idea. I know I can setup cinder (the openstack volume manager) as a multi-backend manager and migrate from one backend to the other, each backend linking to different pools of the same ceph cluster. What bugs me though is that I'm pretty sure the image store, glance, wouldn't let me do that. Additionally, since the compute component also has its own ceph pool, I'm pretty sure it won't let me migrate the data through openstack. Hm wouldn’t it be possible to do something similar ala: # list object from src pool rados ls objects loop | filter-obj-id | while read obj; do # export $obj to local disk rados -p pool-wth-too-many-pgs get $obj # import $obj from local disk to new pool rados -p better-sized-pool put $obj done You would also have issues with snapshots if you do this on an RBD pool. That's unfortunately not feasible. What isn’t possible, export-import objects out-and-in of pools or snapshots issues? /Steffen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
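In command form, the cache-overlay idea Robert describes would look something like the sketch below. It is untested for this purpose (as he says), uses invented pool names, and the snapshot caveats discussed elsewhere in this thread would still need checking:
ceph osd tier add volumes-new volumes-old --force-nonempty   # the old, full pool becomes the cache
ceph osd tier cache-mode volumes-old forward                 # pass writes through, don't promote
ceph osd tier set-overlay volumes-new volumes-old
rados -p volumes-old cache-flush-evict-all                   # push everything down to the new pool
ceph osd tier remove-overlay volumes-new
ceph osd tier remove volumes-new volumes-old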
Re: [ceph-users] How to see the content of an EC Pool after recreate the SSD-Cache tier?
You shouldn't rely on rados ls when working with cache pools. It doesn't behave properly and is a silly operation to run against a pool of any size even when it does. :) More specifically, rados ls is invoking the pgls operation. Normal read/write ops will go query the backing store for objects if they're not in the cache tier. pgls is different — it just tells you what objects are present in the PG on that OSD right now. So any objects which aren't in cache won't show up when listing on the cache pool. -Greg On Thu, Mar 26, 2015 at 3:43 AM, Udo Lembke ulem...@polarzone.de wrote: Hi all, due an very silly approach, I removed the cache tier of an filled EC pool. After recreate the pool and connect with the EC pool I don't see any content. How can I see the rbd_data and other files through the new ssd cache tier? I think, that I must recreate the rbd_directory (and fill with setomapval), but I don't see anything yet! $ rados ls -p ecarchiv | more rbd_data.2e47de674b0dc51.00390074 rbd_data.2e47de674b0dc51.0020b64f rbd_data.2fbb1952ae8944a.0016184c rbd_data.2cfc7ce74b0dc51.00363527 rbd_data.2cfc7ce74b0dc51.0004c35f rbd_data.2fbb1952ae8944a.0008db43 rbd_data.2cfc7ce74b0dc51.0015895a rbd_data.31229f0238e1f29.000135eb ... $ rados ls -p ssd-archiv nothing generation of the cache tier: $ rados mkpool ssd-archiv $ ceph osd pool set ssd-archiv crush_ruleset 5 $ ceph osd tier add ecarchiv ssd-archiv $ ceph osd tier cache-mode ssd-archiv writeback $ ceph osd pool set ssd-archiv hit_set_type bloom $ ceph osd pool set ssd-archiv hit_set_count 1 $ ceph osd pool set ssd-archiv hit_set_period 3600 $ ceph osd pool set ssd-archiv target_max_bytes 500 rule ssd { ruleset 5 type replicated min_size 1 max_size 10 step take ssd step choose firstn 0 type osd step emit } Are there any magic (or which command I missed?) to see the excisting data throug the cache tier? regards - and hoping for answers Udo ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
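A quick way to see the difference Greg describes is to issue a normal read op instead of a listing: stat goes through the tier to the backing EC pool, while pgls does not. The object name here is taken from the listing above:
rados -p ssd-archiv stat rbd_data.2e47de674b0dc51.00390074   # asked through the cache tier
rados -p ecarchiv stat rbd_data.2e47de674b0dc51.00390074     # same object, asked of the EC pool directly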
Re: [ceph-users] Migrating objects from one pool to another?
On 26/03/2015, at 22.53, Steffen W Sørensen ste...@me.com wrote: On 26/03/2015, at 21.07, J-P Methot jpmet...@gtcomm.net mailto:jpmet...@gtcomm.net wrote: That's a great idea. I know I can setup cinder (the openstack volume manager) as a multi-backend manager and migrate from one backend to the other, each backend linking to different pools of the same ceph cluster. What bugs me though is that I'm pretty sure the image store, glance, wouldn't let me do that. Additionally, since the compute component also has its own ceph pool, I'm pretty sure it won't let me migrate the data through openstack. Hm wouldn’t it be possible to do something similar ala: # list object from src pool rados ls objects loop | filter-obj-id | while read obj; do # export $obj to local disk rados -p pool-wth-too-many-pgs get $obj # import $obj from local disk to new pool rados -p better-sized-pool put $obj done and of course when done redirect glance to new pool :) Not sure, but this might require you to quenching the object usage from openstack during migration, dunno, maybe ask openstack community if it’s possible to live migration of objects first :/ possible split/partition list of objects into multiple concurrent loops, possible from multiple boxes as seems fit for resources at hand, cpu, memory, network, ceph perf. /Steffen On 3/26/2015 3:54 PM, Steffen W Sørensen wrote: On 26/03/2015, at 20.38, J-P Methot jpmet...@gtcomm.net wrote: Lately I've been going back to work on one of my first ceph setup and now I see that I have created way too many placement groups for the pools on that setup (about 10 000 too many). I believe this may impact performances negatively, as the performances on this ceph cluster are abysmal. Since it is not possible to reduce the number of PGs in a pool, I was thinking of creating new pools with a smaller number of PGs, moving the data from the old pools to the new pools and then deleting the old pools. I haven't seen any command to copy objects from one pool to another. Would that be possible? I'm using ceph for block storage with openstack, so surely there must be a way to move block devices from a pool to another, right? What I did a one point was going one layer higher in my storage abstraction, and created new Ceph pools and used those for new storage resources/pool in my VM env. (ProxMox) on top of Ceph RBD and then did a live migration of virtual disks there, assume you could do the same in OpenStack. My 0.02$ /Steffen -- == Jean-Philippe Méthot Administrateur système / System administrator GloboTech Communications Phone: 1-514-907-0050 Toll Free: 1-(888)-GTCOMM1 Fax: 1-(514)-907-0750 jpmet...@gtcomm.net http://www.gtcomm.net ___ ceph-users mailing list ceph-users@lists.ceph.com mailto:ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
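Written out as a runnable sketch, Steffen's loop becomes something like the following; the pool names are placeholders, and the caveats raised later in the thread still apply (head objects only, no snapshots, quiesce writers first):
#!/bin/sh
SRC=pool-with-too-many-pgs
DST=better-sized-pool
TMP=$(mktemp)
rados -p "$SRC" ls | while read -r obj; do
    rados -p "$SRC" get "$obj" "$TMP"   # export the object to local disk
    rados -p "$DST" put "$obj" "$TMP"   # import it into the new pool
done
rm -f "$TMP"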
Re: [ceph-users] Migrating objects from one pool to another?
On Thu, Mar 26, 2015 at 2:53 PM, Steffen W Sørensen ste...@me.com wrote: On 26/03/2015, at 21.07, J-P Methot jpmet...@gtcomm.net wrote: That's a great idea. I know I can setup cinder (the openstack volume manager) as a multi-backend manager and migrate from one backend to the other, each backend linking to different pools of the same ceph cluster. What bugs me though is that I'm pretty sure the image store, glance, wouldn't let me do that. Additionally, since the compute component also has its own ceph pool, I'm pretty sure it won't let me migrate the data through openstack. Hm wouldn’t it be possible to do something similar ala: # list object from src pool rados ls objects loop | filter-obj-id | while read obj; do # export $obj to local disk rados -p pool-wth-too-many-pgs get $obj # import $obj from local disk to new pool rados -p better-sized-pool put $obj done You would also have issues with snapshots if you do this on an RBD pool. That's unfortunately not feasible. -Greg possible split/partition list of objects into multiple concurrent loops, possible from multiple boxes as seems fit for resources at hand, cpu, memory, network, ceph perf. /Steffen On 3/26/2015 3:54 PM, Steffen W Sørensen wrote: On 26/03/2015, at 20.38, J-P Methot jpmet...@gtcomm.net wrote: Lately I've been going back to work on one of my first ceph setup and now I see that I have created way too many placement groups for the pools on that setup (about 10 000 too many). I believe this may impact performances negatively, as the performances on this ceph cluster are abysmal. Since it is not possible to reduce the number of PGs in a pool, I was thinking of creating new pools with a smaller number of PGs, moving the data from the old pools to the new pools and then deleting the old pools. I haven't seen any command to copy objects from one pool to another. Would that be possible? I'm using ceph for block storage with openstack, so surely there must be a way to move block devices from a pool to another, right? What I did a one point was going one layer higher in my storage abstraction, and created new Ceph pools and used those for new storage resources/pool in my VM env. (ProxMox) on top of Ceph RBD and then did a live migration of virtual disks there, assume you could do the same in OpenStack. My 0.02$ /Steffen -- == Jean-Philippe Méthot Administrateur système / System administrator GloboTech Communications Phone: 1-514-907-0050 Toll Free: 1-(888)-GTCOMM1 Fax: 1-(514)-907-0750 jpmet...@gtcomm.net http://www.gtcomm.net ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Migrating objects from one pool to another?
On 26/03/2015, at 23.01, Gregory Farnum g...@gregs42.com wrote: On Thu, Mar 26, 2015 at 2:53 PM, Steffen W Sørensen ste...@me.com mailto:ste...@me.com wrote: On 26/03/2015, at 21.07, J-P Methot jpmet...@gtcomm.net wrote: That's a great idea. I know I can setup cinder (the openstack volume manager) as a multi-backend manager and migrate from one backend to the other, each backend linking to different pools of the same ceph cluster. What bugs me though is that I'm pretty sure the image store, glance, wouldn't let me do that. Additionally, since the compute component also has its own ceph pool, I'm pretty sure it won't let me migrate the data through openstack. Hm wouldn’t it be possible to do something similar ala: # list object from src pool rados ls objects loop | filter-obj-id | while read obj; do # export $obj to local disk rados -p pool-wth-too-many-pgs get $obj # import $obj from local disk to new pool rados -p better-sized-pool put $obj done You would also have issues with snapshots if you do this on an RBD pool. That's unfortunately not feasible. What isn’t possible, export-import objects out-and-in of pools or snapshots issues? /Steffen___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Migrating objects from one pool to another?
The procedure you've outlined won't copy snapshots, just the head objects. Preserving the proper snapshot metadata and inter-pool relationships on rbd images I think isn't actually possible when trying to change pools. On Thu, Mar 26, 2015 at 3:05 PM, Steffen W Sørensen ste...@me.com wrote: On 26/03/2015, at 23.01, Gregory Farnum g...@gregs42.com wrote: On Thu, Mar 26, 2015 at 2:53 PM, Steffen W Sørensen ste...@me.com wrote: On 26/03/2015, at 21.07, J-P Methot jpmet...@gtcomm.net wrote: That's a great idea. I know I can setup cinder (the openstack volume manager) as a multi-backend manager and migrate from one backend to the other, each backend linking to different pools of the same ceph cluster. What bugs me though is that I'm pretty sure the image store, glance, wouldn't let me do that. Additionally, since the compute component also has its own ceph pool, I'm pretty sure it won't let me migrate the data through openstack. Hm wouldn’t it be possible to do something similar ala: # list object from src pool rados ls objects loop | filter-obj-id | while read obj; do # export $obj to local disk rados -p pool-wth-too-many-pgs get $obj # import $obj from local disk to new pool rados -p better-sized-pool put $obj done You would also have issues with snapshots if you do this on an RBD pool. That's unfortunately not feasible. What isn’t possible, export-import objects out-and-in of pools or snapshots issues? /Steffen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Migrating objects from one pool to another?
On 26/03/2015, at 20.38, J-P Methot jpmet...@gtcomm.net wrote: Lately I've been going back to work on one of my first ceph setups and now I see that I have created way too many placement groups for the pools on that setup (about 10 000 too many). I believe this may impact performance negatively, as the performance on this ceph cluster is abysmal. Since it is not possible to reduce the number of PGs in a pool, I was thinking of creating new pools with a smaller number of PGs, moving the data from the old pools to the new pools and then deleting the old pools. I haven't seen any command to copy objects from one pool to another. Would that be possible? I'm using ceph for block storage with openstack, so surely there must be a way to move block devices from a pool to another, right? What I did at one point was going one layer higher in my storage abstraction, and created new Ceph pools and used those for new storage resources/pools in my VM env. (ProxMox) on top of Ceph RBD, and then did a live migration of virtual disks there; I assume you could do the same in OpenStack. My 0.02$ /Steffen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] All client writes block when 2 of 3 OSDs down
I added the osd pool default min size = 1 to test the behavior when 2 of 3 OSDs are down, but the behavior is exactly the same as without it: when the 2nd OSD is killed, all client writes start to block and these pipe.(stuff).fault messages begin: 2015-03-26 16:08:50.775848 7fce177fe700 0 monclient: hunting for new mon 2015-03-26 16:08:53.781133 7fce1c2f9700 0 -- 192.168.122.111:0/1011003 192.168.122.131:6789/0 pipe(0x7fce0c01d260 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7fce0c01d4f0).fault 2015-03-26 16:09:00.009092 7fce1c3fa700 0 -- 192.168.122.111:0/1011003 192.168.122.141:6789/0 pipe(0x7fce1802dab0 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7fce1802dd40).fault 2015-03-26 16:09:12.013147 7fce1c2f9700 0 -- 192.168.122.111:0/1011003 192.168.122.131:6789/0 pipe(0x7fce1802e740 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7fce1802e9d0).fault 2015-03-26 16:10:06.013113 7fce1c2f9700 0 -- 192.168.122.111:0/1011003 192.168.122.131:6789/0 pipe(0x7fce1802df80 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7fce1801e600).fault 2015-03-26 16:10:36.013166 7fce1c3fa700 0 -- 192.168.122.111:0/1011003 192.168.122.141:6789/0 pipe(0x7fce1802ebc0 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7fce1802ee50).fault Here is my ceph.conf: [global] fsid = db460aa2-5129-4aaa-8b2e-43eac727124e mon_initial_members = ceph-node-1 mon_host = 192.168.122.121 auth_cluster_required = cephx auth_service_required = cephx auth_client_required = cephx filestore_xattr_use_omap = true osd pool default size = 3 osd pool default min size = 1 public network = 192.168.122.0/24 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] All client writes block when 2 of 3 OSDs down
Ah, thanks, got it. I wasn't thinking about the fact that running mons and OSDs on the same nodes isn't a likely real-world setup. You have to admit that pipe/fault log message is a bit cryptic. Thanks, Lee ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Migrating objects from one pool to another?
On 26/03/2015, at 21.07, J-P Methot jpmet...@gtcomm.net wrote: That's a great idea. I know I can setup cinder (the openstack volume manager) as a multi-backend manager and migrate from one backend to the other, each backend linking to different pools of the same ceph cluster. What bugs me though is that I'm pretty sure the image store, glance, wouldn't let me do that. Additionally, since the compute component also has its own ceph pool, I'm pretty sure it won't let me migrate the data through openstack. Hm wouldn’t it be possible to do something similar ala: # list object from src pool rados ls objects loop | filter-obj-id | while read obj; do # export $obj to local disk rados -p pool-wth-too-many-pgs get $obj # import $obj from local disk to new pool rados -p better-sized-pool put $obj done possible split/partition list of objects into multiple concurrent loops, possible from multiple boxes as seems fit for resources at hand, cpu, memory, network, ceph perf. /Steffen On 3/26/2015 3:54 PM, Steffen W Sørensen wrote: On 26/03/2015, at 20.38, J-P Methot jpmet...@gtcomm.net wrote: Lately I've been going back to work on one of my first ceph setup and now I see that I have created way too many placement groups for the pools on that setup (about 10 000 too many). I believe this may impact performances negatively, as the performances on this ceph cluster are abysmal. Since it is not possible to reduce the number of PGs in a pool, I was thinking of creating new pools with a smaller number of PGs, moving the data from the old pools to the new pools and then deleting the old pools. I haven't seen any command to copy objects from one pool to another. Would that be possible? I'm using ceph for block storage with openstack, so surely there must be a way to move block devices from a pool to another, right? What I did a one point was going one layer higher in my storage abstraction, and created new Ceph pools and used those for new storage resources/pool in my VM env. (ProxMox) on top of Ceph RBD and then did a live migration of virtual disks there, assume you could do the same in OpenStack. My 0.02$ /Steffen -- == Jean-Philippe Méthot Administrateur système / System administrator GloboTech Communications Phone: 1-514-907-0050 Toll Free: 1-(888)-GTCOMM1 Fax: 1-(514)-907-0750 jpmet...@gtcomm.net http://www.gtcomm.net ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] All client writes block when 2 of 3 OSDs down
On Thu, Mar 26, 2015 at 4:40 PM, Gregory Farnum g...@gregs42.com wrote: Has the OSD actually been detected as down yet? I believe it has, however I can't directly check because ceph health starts to hang when I down the second node. You'll also need to set that min size on your existing pools (ceph osd pool pool set min_size 1 or similar) to change their behavior; the config option only takes effect for newly-created pools. (Thus the default.) I've done this, however the behavior is the same: $ for f in `ceph osd lspools | sed 's/[0-9]//g' | sed 's/,//g'`; do ceph osd pool set $f min_size 1; done set pool 0 min_size to 1 set pool 1 min_size to 1 set pool 2 min_size to 1 set pool 3 min_size to 1 set pool 4 min_size to 1 set pool 5 min_size to 1 set pool 6 min_size to 1 set pool 7 min_size to 1 $ ceph -w cluster db460aa2-5129-4aaa-8b2e-43eac727124e health HEALTH_WARN 1 mons down, quorum 0,1 ceph-node-1,ceph-node-2 monmap e3: 3 mons at {ceph-node-1= 192.168.122.121:6789/0,ceph-node-2=192.168.122.131:6789/0,ceph-node-3=192.168.122.141:6789/0}, election epoch 194, quorum 0,1 ceph-node-1,ceph-node-2 mdsmap e94: 1/1/1 up {0=ceph-node-1=up:active} osdmap e362: 3 osds: 2 up, 2 in pgmap v5913: 840 pgs, 8 pools, 7441 MB data, 994 objects 25329 MB used, 12649 MB / 40059 MB avail 840 active+clean 2015-03-26 17:23:56.009938 mon.0 [INF] pgmap v5913: 840 pgs: 840 active+clean; 7441 MB data, 25329 MB used, 12649 MB / 40059 MB avail 2015-03-26 17:25:51.042802 mon.0 [INF] pgmap v5914: 840 pgs: 840 active+clean; 7441 MB data, 25329 MB used, 12649 MB / 40059 MB avail; 0 B/s rd, 260 kB/s wr, 13 op/s 2015-03-26 17:25:56.046491 mon.0 [INF] pgmap v5915: 840 pgs: 840 active+clean; 7441 MB data, 25333 MB used, 12645 MB / 40059 MB avail; 0 B/s rd, 943 kB/s wr, 38 op/s 2015-03-26 17:26:01.058167 mon.0 [INF] pgmap v5916: 840 pgs: 840 active+clean; 7441 MB data, 25335 MB used, 12643 MB / 40059 MB avail; 0 B/s rd, 10699 kB/s wr, 621 op/s this is where i kill the second OSD 2015-03-26 17:26:26.778461 7f4ebeffd700 0 monclient: hunting for new mon 2015-03-26 17:26:30.701099 7f4ec45f5700 0 -- 192.168.122.111:0/1007741 192.168.122.141:6789/0 pipe(0x7f4ec0023200 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f4ec0023490).fault 2015-03-26 17:26:42.701154 7f4ec44f4700 0 -- 192.168.122.111:0/1007741 192.168.122.131:6789/0 pipe(0x7f4ec00251b0 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f4ec0025440).fault And all writes block until I bring back an OSD. Lee ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] All client writes block when 2 of 3 OSDs down
On Thu, Mar 26, 2015 at 2:30 PM, Lee Revell rlrev...@gmail.com wrote: On Thu, Mar 26, 2015 at 4:40 PM, Gregory Farnum g...@gregs42.com wrote: Has the OSD actually been detected as down yet? I believe it has, however I can't directly check because ceph health starts to hang when I down the second node. Oh. You need to keep a quorum of your monitors running (just the monitor processes, not of everything in the system) or nothing at all is going to work. That's how we prevent split brain issues. You'll also need to set that min size on your existing pools (ceph osd pool pool set min_size 1 or similar) to change their behavior; the config option only takes effect for newly-created pools. (Thus the default.) I've done this, however the behavior is the same: $ for f in `ceph osd lspools | sed 's/[0-9]//g' | sed 's/,//g'`; do ceph osd pool set $f min_size 1; done set pool 0 min_size to 1 set pool 1 min_size to 1 set pool 2 min_size to 1 set pool 3 min_size to 1 set pool 4 min_size to 1 set pool 5 min_size to 1 set pool 6 min_size to 1 set pool 7 min_size to 1 $ ceph -w cluster db460aa2-5129-4aaa-8b2e-43eac727124e health HEALTH_WARN 1 mons down, quorum 0,1 ceph-node-1,ceph-node-2 monmap e3: 3 mons at {ceph-node-1=192.168.122.121:6789/0,ceph-node-2=192.168.122.131:6789/0,ceph-node-3=192.168.122.141:6789/0}, election epoch 194, quorum 0,1 ceph-node-1,ceph-node-2 mdsmap e94: 1/1/1 up {0=ceph-node-1=up:active} osdmap e362: 3 osds: 2 up, 2 in pgmap v5913: 840 pgs, 8 pools, 7441 MB data, 994 objects 25329 MB used, 12649 MB / 40059 MB avail 840 active+clean 2015-03-26 17:23:56.009938 mon.0 [INF] pgmap v5913: 840 pgs: 840 active+clean; 7441 MB data, 25329 MB used, 12649 MB / 40059 MB avail 2015-03-26 17:25:51.042802 mon.0 [INF] pgmap v5914: 840 pgs: 840 active+clean; 7441 MB data, 25329 MB used, 12649 MB / 40059 MB avail; 0 B/s rd, 260 kB/s wr, 13 op/s 2015-03-26 17:25:56.046491 mon.0 [INF] pgmap v5915: 840 pgs: 840 active+clean; 7441 MB data, 25333 MB used, 12645 MB / 40059 MB avail; 0 B/s rd, 943 kB/s wr, 38 op/s 2015-03-26 17:26:01.058167 mon.0 [INF] pgmap v5916: 840 pgs: 840 active+clean; 7441 MB data, 25335 MB used, 12643 MB / 40059 MB avail; 0 B/s rd, 10699 kB/s wr, 621 op/s this is where i kill the second OSD 2015-03-26 17:26:26.778461 7f4ebeffd700 0 monclient: hunting for new mon 2015-03-26 17:26:30.701099 7f4ec45f5700 0 -- 192.168.122.111:0/1007741 192.168.122.141:6789/0 pipe(0x7f4ec0023200 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f4ec0023490).fault 2015-03-26 17:26:42.701154 7f4ec44f4700 0 -- 192.168.122.111:0/1007741 192.168.122.131:6789/0 pipe(0x7f4ec00251b0 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f4ec0025440).fault And all writes block until I bring back an OSD. Lee ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
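Two things make this easier to see when experimenting (the monitor name below is taken from the ceph.conf earlier in the thread): the surviving monitor can still be queried over its local admin socket even while ceph health hangs, and the quorum arithmetic is simply a strict majority.
ceph --admin-daemon /var/run/ceph/ceph-mon.ceph-node-1.asok mon_status
# with 3 monitors a quorum needs 2 of them, so any test that downs two of the
# three hosts also has to leave two mon processes running somewhere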
Re: [ceph-users] How to see the content of an EC Pool after recreate the SSD-Cache tier?
On 03/26/2015 10:46 AM, Gregory Farnum wrote: I don't know why you're mucking about manually with the rbd directory; the rbd tool and rados handle cache pools correctly as far as I know. That's true, but the rados tool should be able to manipulate binary data more easily. It should probably be able to read from a file or stdin for this. Josh On Thu, Mar 26, 2015 at 8:56 AM, Udo Lembke ulem...@polarzone.de wrote: Hi Greg, ok! It's looks like, that my problem is more setomapval-related... I must o something like rados -p ssd-archiv setomapval rbd_directory name_vm-409-disk-2 \0x0f\0x00\0x00\0x002cfc7ce74b0dc51 but rados setomapval don't use the hexvalues - instead of this I got rados -p ssd-archiv listomapvals rbd_directory name_vm-409-disk-2 value: (35 bytes) : : 5c 30 78 30 66 5c 30 78 30 30 5c 30 78 30 30 5c : \0x0f\0x00\0x00\ 0010 : 30 78 30 30 32 63 66 63 37 63 65 37 34 62 30 64 : 0x002cfc7ce74b0d 0020 : 63 35 31: c51 hmm, strange. With rados -p ssd-archiv getomapval rbd_directory name_vm-409-disk-2 name_vm-409-disk-2 I got the binary inside the file name_vm-409-disk-2, but reverse do an rados -p ssd-archiv setomapval rbd_directory name_vm-409-disk-2 name_vm-409-disk-2 fill the variable with name_vm-409-disk-2 and not with the content of the file... Are there other tools for the rbd_directory? regards Udo Am 26.03.2015 15:03, schrieb Gregory Farnum: You shouldn't rely on rados ls when working with cache pools. It doesn't behave properly and is a silly operation to run against a pool of any size even when it does. :) More specifically, rados ls is invoking the pgls operation. Normal read/write ops will go query the backing store for objects if they're not in the cache tier. pgls is different — it just tells you what objects are present in the PG on that OSD right now. So any objects which aren't in cache won't show up when listing on the cache pool. -Greg On Thu, Mar 26, 2015 at 3:43 AM, Udo Lembke ulem...@polarzone.de wrote: Hi all, due an very silly approach, I removed the cache tier of an filled EC pool. After recreate the pool and connect with the EC pool I don't see any content. How can I see the rbd_data and other files through the new ssd cache tier? I think, that I must recreate the rbd_directory (and fill with setomapval), but I don't see anything yet! $ rados ls -p ecarchiv | more rbd_data.2e47de674b0dc51.00390074 rbd_data.2e47de674b0dc51.0020b64f rbd_data.2fbb1952ae8944a.0016184c rbd_data.2cfc7ce74b0dc51.00363527 rbd_data.2cfc7ce74b0dc51.0004c35f rbd_data.2fbb1952ae8944a.0008db43 rbd_data.2cfc7ce74b0dc51.0015895a rbd_data.31229f0238e1f29.000135eb ... $ rados ls -p ssd-archiv nothing generation of the cache tier: $ rados mkpool ssd-archiv $ ceph osd pool set ssd-archiv crush_ruleset 5 $ ceph osd tier add ecarchiv ssd-archiv $ ceph osd tier cache-mode ssd-archiv writeback $ ceph osd pool set ssd-archiv hit_set_type bloom $ ceph osd pool set ssd-archiv hit_set_count 1 $ ceph osd pool set ssd-archiv hit_set_period 3600 $ ceph osd pool set ssd-archiv target_max_bytes 500 rule ssd { ruleset 5 type replicated min_size 1 max_size 10 step take ssd step choose firstn 0 type osd step emit } Are there any magic (or which command I missed?) to see the excisting data throug the cache tier? 
regards - and hoping for answers Udo ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] All client writes block when 2 of 3 OSDs down
On 26/03/2015, at 23.36, Somnath Roy somnath@sandisk.com wrote: Got most portion of it, thanks ! But, still not able to get when second node is down why with single monitor in the cluster client is not able to connect ? 1 monitor can form a quorum and should be sufficient for a cluster to run. To have quorum you need more than 50% of monitors, which isn’t possible with one out of two, since 1 (0.5*2 + 1) hence at least 3 monitors. Thanks Regards Somnath -Original Message- From: Gregory Farnum [mailto:g...@gregs42.com] Sent: Thursday, March 26, 2015 3:29 PM To: Somnath Roy Cc: Lee Revell; ceph-users@lists.ceph.com Subject: Re: [ceph-users] All client writes block when 2 of 3 OSDs down On Thu, Mar 26, 2015 at 3:22 PM, Somnath Roy somnath@sandisk.com wrote: Greg, Couple of dumb question may be. 1. If you see , the clients are connecting fine with two monitors in the cluster. 2 monitors can never form a quorum, but, 1 can, so, why with 1 monitor (which is I guess happening after making 2 nodes down) it is not able to connect ? A quorum is a strict majority of the total membership. 2 monitors can form a quorum just fine if there are either 2 or 3 total membership. (As long as those two agree on every action, it cannot be lost.) We don't *recommend* configuring systems with an even number of monitors, because it increases the number of total possible failures without increasing the number of failures that can be tolerated. (3 monitors requires 2 in quorum, 4 does too. Same for 5 and 6, 7 and 8, etc etc.) 2. Also, my understanding is while IO is going on *no* monitor interaction will be on that path, so, why the client io will be stopped because the monitor quorum is not there ? If the min_size =1 is properly set it should able to serve IO as long as 1 OSD (node) is up, isn't it ? Well, the remaining OSD won't be able to process IO because it's lost its peers, and it can't reach any monitors to do updates or get new maps. (Monitors which are not in quorum will not allow clients to connect.) The clients will eventually stop serving IO if they know they can't reach a monitor, although I don't remember exactly how that's triggered. In this particular case, though, the client probably just tried to do an op against the dead osd, realized it couldn't, and tried to fetch a map from the monitors. When that failed it went into search mode, which is what the logs are showing you. -Greg Thanks Regards Somnath -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Gregory Farnum Sent: Thursday, March 26, 2015 2:40 PM To: Lee Revell Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] All client writes block when 2 of 3 OSDs down On Thu, Mar 26, 2015 at 2:30 PM, Lee Revell rlrev...@gmail.com wrote: On Thu, Mar 26, 2015 at 4:40 PM, Gregory Farnum g...@gregs42.com wrote: Has the OSD actually been detected as down yet? I believe it has, however I can't directly check because ceph health starts to hang when I down the second node. Oh. You need to keep a quorum of your monitors running (just the monitor processes, not of everything in the system) or nothing at all is going to work. That's how we prevent split brain issues. You'll also need to set that min size on your existing pools (ceph osd pool pool set min_size 1 or similar) to change their behavior; the config option only takes effect for newly-created pools. (Thus the default.) 
I've done this, however the behavior is the same: $ for f in `ceph osd lspools | sed 's/[0-9]//g' | sed 's/,//g'`; do ceph osd pool set $f min_size 1; done set pool 0 min_size to 1 set pool 1 min_size to 1 set pool 2 min_size to 1 set pool 3 min_size to 1 set pool 4 min_size to 1 set pool 5 min_size to 1 set pool 6 min_size to 1 set pool 7 min_size to 1 $ ceph -w cluster db460aa2-5129-4aaa-8b2e-43eac727124e health HEALTH_WARN 1 mons down, quorum 0,1 ceph-node-1,ceph-node-2 monmap e3: 3 mons at {ceph-node-1=192.168.122.121:6789/0,ceph-node-2=192.168.122.131:6789/ 0 ,ceph-node-3=192.168.122.141:6789/0}, election epoch 194, quorum 0,1 ceph-node-1,ceph-node-2 mdsmap e94: 1/1/1 up {0=ceph-node-1=up:active} osdmap e362: 3 osds: 2 up, 2 in pgmap v5913: 840 pgs, 8 pools, 7441 MB data, 994 objects 25329 MB used, 12649 MB / 40059 MB avail 840 active+clean 2015-03-26 17:23:56.009938 mon.0 [INF] pgmap v5913: 840 pgs: 840 active+clean; 7441 MB data, 25329 MB used, 12649 MB / 40059 MB avail 2015-03-26 17:25:51.042802 mon.0 [INF] pgmap v5914: 840 pgs: 840 active+clean; 7441 MB data, 25329 MB used, 12649 MB / 40059 MB avail; active+0 B/s rd, 260 kB/s wr, 13 op/s 2015-03-26 17:25:56.046491 mon.0 [INF] pgmap v5915: 840 pgs: 840 active+clean; 7441 MB data, 25333 MB used, 12645 MB / 40059 MB avail; active+0 B/s rd, 943 kB/s
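Since cluster-wide commands such as ceph health hang once quorum is lost, the surviving monitor can still be asked what it thinks over its local admin socket. A minimal sketch, assuming the default socket path and the monitor names used in this thread (adjust to your hosts):

$ # works even without quorum, because it only queries the local daemon
$ sudo ceph --admin-daemon /var/run/ceph/ceph-mon.ceph-node-1.asok mon_status
$ # with quorum intact, the same information is available cluster-wide
$ ceph quorum_status --format json-pretty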
Re: [ceph-users] All client writes block when 2 of 3 OSDs down
On Thu, Mar 26, 2015 at 3:22 PM, Somnath Roy somnath@sandisk.com wrote: Greg, Couple of dumb question may be. 1. If you see , the clients are connecting fine with two monitors in the cluster. 2 monitors can never form a quorum, but, 1 can, so, why with 1 monitor (which is I guess happening after making 2 nodes down) it is not able to connect ? A quorum is a strict majority of the total membership. 2 monitors can form a quorum just fine if there are either 2 or 3 total membership. (As long as those two agree on every action, it cannot be lost.) We don't *recommend* configuring systems with an even number of monitors, because it increases the number of total possible failures without increasing the number of failures that can be tolerated. (3 monitors requires 2 in quorum, 4 does too. Same for 5 and 6, 7 and 8, etc etc.) 2. Also, my understanding is while IO is going on *no* monitor interaction will be on that path, so, why the client io will be stopped because the monitor quorum is not there ? If the min_size =1 is properly set it should able to serve IO as long as 1 OSD (node) is up, isn't it ? Well, the remaining OSD won't be able to process IO because it's lost its peers, and it can't reach any monitors to do updates or get new maps. (Monitors which are not in quorum will not allow clients to connect.) The clients will eventually stop serving IO if they know they can't reach a monitor, although I don't remember exactly how that's triggered. In this particular case, though, the client probably just tried to do an op against the dead osd, realized it couldn't, and tried to fetch a map from the monitors. When that failed it went into search mode, which is what the logs are showing you. -Greg Thanks Regards Somnath -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Gregory Farnum Sent: Thursday, March 26, 2015 2:40 PM To: Lee Revell Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] All client writes block when 2 of 3 OSDs down On Thu, Mar 26, 2015 at 2:30 PM, Lee Revell rlrev...@gmail.com wrote: On Thu, Mar 26, 2015 at 4:40 PM, Gregory Farnum g...@gregs42.com wrote: Has the OSD actually been detected as down yet? I believe it has, however I can't directly check because ceph health starts to hang when I down the second node. Oh. You need to keep a quorum of your monitors running (just the monitor processes, not of everything in the system) or nothing at all is going to work. That's how we prevent split brain issues. You'll also need to set that min size on your existing pools (ceph osd pool pool set min_size 1 or similar) to change their behavior; the config option only takes effect for newly-created pools. (Thus the default.) 
I've done this, however the behavior is the same: $ for f in `ceph osd lspools | sed 's/[0-9]//g' | sed 's/,//g'`; do ceph osd pool set $f min_size 1; done set pool 0 min_size to 1 set pool 1 min_size to 1 set pool 2 min_size to 1 set pool 3 min_size to 1 set pool 4 min_size to 1 set pool 5 min_size to 1 set pool 6 min_size to 1 set pool 7 min_size to 1 $ ceph -w cluster db460aa2-5129-4aaa-8b2e-43eac727124e health HEALTH_WARN 1 mons down, quorum 0,1 ceph-node-1,ceph-node-2 monmap e3: 3 mons at {ceph-node-1=192.168.122.121:6789/0,ceph-node-2=192.168.122.131:6789/0 ,ceph-node-3=192.168.122.141:6789/0}, election epoch 194, quorum 0,1 ceph-node-1,ceph-node-2 mdsmap e94: 1/1/1 up {0=ceph-node-1=up:active} osdmap e362: 3 osds: 2 up, 2 in pgmap v5913: 840 pgs, 8 pools, 7441 MB data, 994 objects 25329 MB used, 12649 MB / 40059 MB avail 840 active+clean 2015-03-26 17:23:56.009938 mon.0 [INF] pgmap v5913: 840 pgs: 840 active+clean; 7441 MB data, 25329 MB used, 12649 MB / 40059 MB avail 2015-03-26 17:25:51.042802 mon.0 [INF] pgmap v5914: 840 pgs: 840 active+clean; 7441 MB data, 25329 MB used, 12649 MB / 40059 MB avail; active+0 B/s rd, 260 kB/s wr, 13 op/s 2015-03-26 17:25:56.046491 mon.0 [INF] pgmap v5915: 840 pgs: 840 active+clean; 7441 MB data, 25333 MB used, 12645 MB / 40059 MB avail; active+0 B/s rd, 943 kB/s wr, 38 op/s 2015-03-26 17:26:01.058167 mon.0 [INF] pgmap v5916: 840 pgs: 840 active+clean; 7441 MB data, 25335 MB used, 12643 MB / 40059 MB avail; active+0 B/s rd, 10699 kB/s wr, 621 op/s this is where i kill the second OSD 2015-03-26 17:26:26.778461 7f4ebeffd700 0 monclient: hunting for new mon 2015-03-26 17:26:30.701099 7f4ec45f5700 0 -- 192.168.122.111:0/1007741 192.168.122.141:6789/0 pipe(0x7f4ec0023200 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f4ec0023490).fault 2015-03-26 17:26:42.701154 7f4ec44f4700 0 -- 192.168.122.111:0/1007741 192.168.122.131:6789/0 pipe(0x7f4ec00251b0 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f4ec0025440).fault And all writes block until I bring back an OSD. Lee ___ ceph-users mailing list
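As an alternative to the sed pipeline above, the pool names can be taken straight from rados lspools; the pool list and the min_size value are of course specific to your cluster:

$ for pool in $(rados lspools); do ceph osd pool set "$pool" min_size 1; done
$ ceph osd dump | grep '^pool'    # verify that min_size actually changed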
Re: [ceph-users] All client writes block when 2 of 3 OSDs down
On Thu, Mar 26, 2015 at 3:36 PM, Somnath Roy somnath@sandisk.com wrote: Got most portion of it, thanks ! But, still not able to get when second node is down why with single monitor in the cluster client is not able to connect ? 1 monitor can form a quorum and should be sufficient for a cluster to run. The whole point of the monitor cluster is to ensure a globally consistent view of the cluster state that will never be reversed by a different group of up nodes. If one monitor (out of three) could make changes to the maps by itself, then there's nothing to prevent all three monitors from staying up but getting a net split, and then each issuing different versions of the osdmaps to whichever clients or OSDs happen to be connected to them. If you want to get down into the math proofs and things then the Paxos papers do all the proofs. Or you can look at the CAP theorem about the tradeoff between consistency and availability. The monitors are a Paxos cluster and Ceph is a 100% consistent system. -Greg Thanks Regards Somnath -Original Message- From: Gregory Farnum [mailto:g...@gregs42.com] Sent: Thursday, March 26, 2015 3:29 PM To: Somnath Roy Cc: Lee Revell; ceph-users@lists.ceph.com Subject: Re: [ceph-users] All client writes block when 2 of 3 OSDs down On Thu, Mar 26, 2015 at 3:22 PM, Somnath Roy somnath@sandisk.com wrote: Greg, Couple of dumb question may be. 1. If you see , the clients are connecting fine with two monitors in the cluster. 2 monitors can never form a quorum, but, 1 can, so, why with 1 monitor (which is I guess happening after making 2 nodes down) it is not able to connect ? A quorum is a strict majority of the total membership. 2 monitors can form a quorum just fine if there are either 2 or 3 total membership. (As long as those two agree on every action, it cannot be lost.) We don't *recommend* configuring systems with an even number of monitors, because it increases the number of total possible failures without increasing the number of failures that can be tolerated. (3 monitors requires 2 in quorum, 4 does too. Same for 5 and 6, 7 and 8, etc etc.) 2. Also, my understanding is while IO is going on *no* monitor interaction will be on that path, so, why the client io will be stopped because the monitor quorum is not there ? If the min_size =1 is properly set it should able to serve IO as long as 1 OSD (node) is up, isn't it ? Well, the remaining OSD won't be able to process IO because it's lost its peers, and it can't reach any monitors to do updates or get new maps. (Monitors which are not in quorum will not allow clients to connect.) The clients will eventually stop serving IO if they know they can't reach a monitor, although I don't remember exactly how that's triggered. In this particular case, though, the client probably just tried to do an op against the dead osd, realized it couldn't, and tried to fetch a map from the monitors. When that failed it went into search mode, which is what the logs are showing you. -Greg Thanks Regards Somnath -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Gregory Farnum Sent: Thursday, March 26, 2015 2:40 PM To: Lee Revell Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] All client writes block when 2 of 3 OSDs down On Thu, Mar 26, 2015 at 2:30 PM, Lee Revell rlrev...@gmail.com wrote: On Thu, Mar 26, 2015 at 4:40 PM, Gregory Farnum g...@gregs42.com wrote: Has the OSD actually been detected as down yet? 
I believe it has, however I can't directly check because ceph health starts to hang when I down the second node. Oh. You need to keep a quorum of your monitors running (just the monitor processes, not of everything in the system) or nothing at all is going to work. That's how we prevent split brain issues. You'll also need to set that min size on your existing pools (ceph osd pool pool set min_size 1 or similar) to change their behavior; the config option only takes effect for newly-created pools. (Thus the default.) I've done this, however the behavior is the same: $ for f in `ceph osd lspools | sed 's/[0-9]//g' | sed 's/,//g'`; do ceph osd pool set $f min_size 1; done set pool 0 min_size to 1 set pool 1 min_size to 1 set pool 2 min_size to 1 set pool 3 min_size to 1 set pool 4 min_size to 1 set pool 5 min_size to 1 set pool 6 min_size to 1 set pool 7 min_size to 1 $ ceph -w cluster db460aa2-5129-4aaa-8b2e-43eac727124e health HEALTH_WARN 1 mons down, quorum 0,1 ceph-node-1,ceph-node-2 monmap e3: 3 mons at {ceph-node-1=192.168.122.121:6789/0,ceph-node-2=192.168.122.131:6789/ 0 ,ceph-node-3=192.168.122.141:6789/0}, election epoch 194, quorum 0,1 ceph-node-1,ceph-node-2 mdsmap e94: 1/1/1 up {0=ceph-node-1=up:active} osdmap e362: 3 osds: 2 up, 2 in pgmap v5913: 840 pgs, 8 pools, 7441 MB data, 994 objects 25329 MB
Re: [ceph-users] All client writes block when 2 of 3 OSDs down
Got most portion of it, thanks ! But, still not able to get when second node is down why with single monitor in the cluster client is not able to connect ? 1 monitor can form a quorum and should be sufficient for a cluster to run. Thanks Regards Somnath -Original Message- From: Gregory Farnum [mailto:g...@gregs42.com] Sent: Thursday, March 26, 2015 3:29 PM To: Somnath Roy Cc: Lee Revell; ceph-users@lists.ceph.com Subject: Re: [ceph-users] All client writes block when 2 of 3 OSDs down On Thu, Mar 26, 2015 at 3:22 PM, Somnath Roy somnath@sandisk.com wrote: Greg, Couple of dumb question may be. 1. If you see , the clients are connecting fine with two monitors in the cluster. 2 monitors can never form a quorum, but, 1 can, so, why with 1 monitor (which is I guess happening after making 2 nodes down) it is not able to connect ? A quorum is a strict majority of the total membership. 2 monitors can form a quorum just fine if there are either 2 or 3 total membership. (As long as those two agree on every action, it cannot be lost.) We don't *recommend* configuring systems with an even number of monitors, because it increases the number of total possible failures without increasing the number of failures that can be tolerated. (3 monitors requires 2 in quorum, 4 does too. Same for 5 and 6, 7 and 8, etc etc.) 2. Also, my understanding is while IO is going on *no* monitor interaction will be on that path, so, why the client io will be stopped because the monitor quorum is not there ? If the min_size =1 is properly set it should able to serve IO as long as 1 OSD (node) is up, isn't it ? Well, the remaining OSD won't be able to process IO because it's lost its peers, and it can't reach any monitors to do updates or get new maps. (Monitors which are not in quorum will not allow clients to connect.) The clients will eventually stop serving IO if they know they can't reach a monitor, although I don't remember exactly how that's triggered. In this particular case, though, the client probably just tried to do an op against the dead osd, realized it couldn't, and tried to fetch a map from the monitors. When that failed it went into search mode, which is what the logs are showing you. -Greg Thanks Regards Somnath -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Gregory Farnum Sent: Thursday, March 26, 2015 2:40 PM To: Lee Revell Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] All client writes block when 2 of 3 OSDs down On Thu, Mar 26, 2015 at 2:30 PM, Lee Revell rlrev...@gmail.com wrote: On Thu, Mar 26, 2015 at 4:40 PM, Gregory Farnum g...@gregs42.com wrote: Has the OSD actually been detected as down yet? I believe it has, however I can't directly check because ceph health starts to hang when I down the second node. Oh. You need to keep a quorum of your monitors running (just the monitor processes, not of everything in the system) or nothing at all is going to work. That's how we prevent split brain issues. You'll also need to set that min size on your existing pools (ceph osd pool pool set min_size 1 or similar) to change their behavior; the config option only takes effect for newly-created pools. (Thus the default.) 
I've done this, however the behavior is the same: $ for f in `ceph osd lspools | sed 's/[0-9]//g' | sed 's/,//g'`; do ceph osd pool set $f min_size 1; done set pool 0 min_size to 1 set pool 1 min_size to 1 set pool 2 min_size to 1 set pool 3 min_size to 1 set pool 4 min_size to 1 set pool 5 min_size to 1 set pool 6 min_size to 1 set pool 7 min_size to 1 $ ceph -w cluster db460aa2-5129-4aaa-8b2e-43eac727124e health HEALTH_WARN 1 mons down, quorum 0,1 ceph-node-1,ceph-node-2 monmap e3: 3 mons at {ceph-node-1=192.168.122.121:6789/0,ceph-node-2=192.168.122.131:6789/ 0 ,ceph-node-3=192.168.122.141:6789/0}, election epoch 194, quorum 0,1 ceph-node-1,ceph-node-2 mdsmap e94: 1/1/1 up {0=ceph-node-1=up:active} osdmap e362: 3 osds: 2 up, 2 in pgmap v5913: 840 pgs, 8 pools, 7441 MB data, 994 objects 25329 MB used, 12649 MB / 40059 MB avail 840 active+clean 2015-03-26 17:23:56.009938 mon.0 [INF] pgmap v5913: 840 pgs: 840 active+clean; 7441 MB data, 25329 MB used, 12649 MB / 40059 MB avail 2015-03-26 17:25:51.042802 mon.0 [INF] pgmap v5914: 840 pgs: 840 active+clean; 7441 MB data, 25329 MB used, 12649 MB / 40059 MB avail; active+0 B/s rd, 260 kB/s wr, 13 op/s 2015-03-26 17:25:56.046491 mon.0 [INF] pgmap v5915: 840 pgs: 840 active+clean; 7441 MB data, 25333 MB used, 12645 MB / 40059 MB avail; active+0 B/s rd, 943 kB/s wr, 38 op/s 2015-03-26 17:26:01.058167 mon.0 [INF] pgmap v5916: 840 pgs: 840 active+clean; 7441 MB data, 25335 MB used, 12643 MB / 40059 MB avail; active+0 B/s rd, 10699 kB/s wr, 621 op/s this is where i kill the second OSD 2015-03-26 17:26:26.778461
Re: [ceph-users] All client writes block when 2 of 3 OSDs down
Greg, I think you got me wrong. I am not saying each monitor of a group of 3 should be able to change the map. Here is the scenario. 1. Cluster up and running with 3 mons (quorum of 3), all fine. 2. One node (and mon) is down, quorum of 2 , still connecting. 3. 2 nodes (and 2 mons) are down, should be quorum of 1 now and client should still be able to connect. Isn't it ? Cluster with single monitor is able to form a quorum and should be working fine. So, why not in case of point 3 ? If this is the way Paxos works, should we say that in a cluster with say 3 monitors it should be able to tolerate only one mon failure ? Let me know if I am missing a point here. Thanks Regards Somnath -Original Message- From: Gregory Farnum [mailto:g...@gregs42.com] Sent: Thursday, March 26, 2015 3:41 PM To: Somnath Roy Cc: Lee Revell; ceph-users@lists.ceph.com Subject: Re: [ceph-users] All client writes block when 2 of 3 OSDs down On Thu, Mar 26, 2015 at 3:36 PM, Somnath Roy somnath@sandisk.com wrote: Got most portion of it, thanks ! But, still not able to get when second node is down why with single monitor in the cluster client is not able to connect ? 1 monitor can form a quorum and should be sufficient for a cluster to run. The whole point of the monitor cluster is to ensure a globally consistent view of the cluster state that will never be reversed by a different group of up nodes. If one monitor (out of three) could make changes to the maps by itself, then there's nothing to prevent all three monitors from staying up but getting a net split, and then each issuing different versions of the osdmaps to whichever clients or OSDs happen to be connected to them. If you want to get down into the math proofs and things then the Paxos papers do all the proofs. Or you can look at the CAP theorem about the tradeoff between consistency and availability. The monitors are a Paxos cluster and Ceph is a 100% consistent system. -Greg Thanks Regards Somnath -Original Message- From: Gregory Farnum [mailto:g...@gregs42.com] Sent: Thursday, March 26, 2015 3:29 PM To: Somnath Roy Cc: Lee Revell; ceph-users@lists.ceph.com Subject: Re: [ceph-users] All client writes block when 2 of 3 OSDs down On Thu, Mar 26, 2015 at 3:22 PM, Somnath Roy somnath@sandisk.com wrote: Greg, Couple of dumb question may be. 1. If you see , the clients are connecting fine with two monitors in the cluster. 2 monitors can never form a quorum, but, 1 can, so, why with 1 monitor (which is I guess happening after making 2 nodes down) it is not able to connect ? A quorum is a strict majority of the total membership. 2 monitors can form a quorum just fine if there are either 2 or 3 total membership. (As long as those two agree on every action, it cannot be lost.) We don't *recommend* configuring systems with an even number of monitors, because it increases the number of total possible failures without increasing the number of failures that can be tolerated. (3 monitors requires 2 in quorum, 4 does too. Same for 5 and 6, 7 and 8, etc etc.) 2. Also, my understanding is while IO is going on *no* monitor interaction will be on that path, so, why the client io will be stopped because the monitor quorum is not there ? If the min_size =1 is properly set it should able to serve IO as long as 1 OSD (node) is up, isn't it ? Well, the remaining OSD won't be able to process IO because it's lost its peers, and it can't reach any monitors to do updates or get new maps. (Monitors which are not in quorum will not allow clients to connect.) 
The clients will eventually stop serving IO if they know they can't reach a monitor, although I don't remember exactly how that's triggered. In this particular case, though, the client probably just tried to do an op against the dead osd, realized it couldn't, and tried to fetch a map from the monitors. When that failed it went into search mode, which is what the logs are showing you. -Greg Thanks Regards Somnath -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Gregory Farnum Sent: Thursday, March 26, 2015 2:40 PM To: Lee Revell Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] All client writes block when 2 of 3 OSDs down On Thu, Mar 26, 2015 at 2:30 PM, Lee Revell rlrev...@gmail.com wrote: On Thu, Mar 26, 2015 at 4:40 PM, Gregory Farnum g...@gregs42.com wrote: Has the OSD actually been detected as down yet? I believe it has, however I can't directly check because ceph health starts to hang when I down the second node. Oh. You need to keep a quorum of your monitors running (just the monitor processes, not of everything in the system) or nothing at all is going to work. That's how we prevent split brain issues. You'll also need to set that min size on your existing pools (ceph osd pool pool set min_size 1 or similar) to change their
Re: [ceph-users] All client writes block when 2 of 3 OSDs down
On Thu, Mar 26, 2015 at 3:54 PM, Somnath Roy somnath@sandisk.com wrote: Greg, I think you got me wrong. I am not saying each monitor of a group of 3 should be able to change the map. Here is the scenario. 1. Cluster up and running with 3 mons (quorum of 3), all fine. 2. One node (and mon) is down, quorum of 2 , still connecting. 3. 2 nodes (and 2 mons) are down, should be quorum of 1 now and client should still be able to connect. Isn't it ? No. The monitors can't tell the difference between dead monitors, and monitors they can't reach over the network. So they say there are three monitors in my map; therefore it requires two to make any change. That's the case regardless of whether all of them are running, or only one. Cluster with single monitor is able to form a quorum and should be working fine. So, why not in case of point 3 ? If this is the way Paxos works, should we say that in a cluster with say 3 monitors it should be able to tolerate only one mon failure ? Yes, that is the case. Let me know if I am missing a point here. Thanks Regards Somnath -Original Message- From: Gregory Farnum [mailto:g...@gregs42.com] Sent: Thursday, March 26, 2015 3:41 PM To: Somnath Roy Cc: Lee Revell; ceph-users@lists.ceph.com Subject: Re: [ceph-users] All client writes block when 2 of 3 OSDs down On Thu, Mar 26, 2015 at 3:36 PM, Somnath Roy somnath@sandisk.com wrote: Got most portion of it, thanks ! But, still not able to get when second node is down why with single monitor in the cluster client is not able to connect ? 1 monitor can form a quorum and should be sufficient for a cluster to run. The whole point of the monitor cluster is to ensure a globally consistent view of the cluster state that will never be reversed by a different group of up nodes. If one monitor (out of three) could make changes to the maps by itself, then there's nothing to prevent all three monitors from staying up but getting a net split, and then each issuing different versions of the osdmaps to whichever clients or OSDs happen to be connected to them. If you want to get down into the math proofs and things then the Paxos papers do all the proofs. Or you can look at the CAP theorem about the tradeoff between consistency and availability. The monitors are a Paxos cluster and Ceph is a 100% consistent system. -Greg Thanks Regards Somnath -Original Message- From: Gregory Farnum [mailto:g...@gregs42.com] Sent: Thursday, March 26, 2015 3:29 PM To: Somnath Roy Cc: Lee Revell; ceph-users@lists.ceph.com Subject: Re: [ceph-users] All client writes block when 2 of 3 OSDs down On Thu, Mar 26, 2015 at 3:22 PM, Somnath Roy somnath@sandisk.com wrote: Greg, Couple of dumb question may be. 1. If you see , the clients are connecting fine with two monitors in the cluster. 2 monitors can never form a quorum, but, 1 can, so, why with 1 monitor (which is I guess happening after making 2 nodes down) it is not able to connect ? A quorum is a strict majority of the total membership. 2 monitors can form a quorum just fine if there are either 2 or 3 total membership. (As long as those two agree on every action, it cannot be lost.) We don't *recommend* configuring systems with an even number of monitors, because it increases the number of total possible failures without increasing the number of failures that can be tolerated. (3 monitors requires 2 in quorum, 4 does too. Same for 5 and 6, 7 and 8, etc etc.) 2. 
Also, my understanding is while IO is going on *no* monitor interaction will be on that path, so, why the client io will be stopped because the monitor quorum is not there ? If the min_size =1 is properly set it should able to serve IO as long as 1 OSD (node) is up, isn't it ? Well, the remaining OSD won't be able to process IO because it's lost its peers, and it can't reach any monitors to do updates or get new maps. (Monitors which are not in quorum will not allow clients to connect.) The clients will eventually stop serving IO if they know they can't reach a monitor, although I don't remember exactly how that's triggered. In this particular case, though, the client probably just tried to do an op against the dead osd, realized it couldn't, and tried to fetch a map from the monitors. When that failed it went into search mode, which is what the logs are showing you. -Greg Thanks Regards Somnath -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Gregory Farnum Sent: Thursday, March 26, 2015 2:40 PM To: Lee Revell Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] All client writes block when 2 of 3 OSDs down On Thu, Mar 26, 2015 at 2:30 PM, Lee Revell rlrev...@gmail.com wrote: On Thu, Mar 26, 2015 at 4:40 PM, Gregory Farnum g...@gregs42.com wrote: Has the OSD actually been detected as down yet? I believe it has, however I can't directly
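The majority rule Greg describes can be tabulated with a one-liner; this is only an illustration of the arithmetic (quorum = strict majority of the monmap size), not a Ceph command:

$ for n in 1 2 3 4 5 6 7; do echo "$n mons: quorum needs $(( n/2 + 1 )), tolerates $(( (n-1)/2 )) mon failures"; done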
Re: [ceph-users] Migrating objects from one pool to another?
On 26/03/2015, at 23.13, Gregory Farnum g...@gregs42.com wrote: The procedure you've outlined won't copy snapshots, just the head objects. Preserving the proper snapshot metadata and inter-pool relationships on rbd images I think isn't actually possible when trying to change pools. This wasn’t meant for migrating an RBD pool, but pure object/Swift pools… Anyway, it seems Glance http://docs.openstack.org/developer/glance/architecture.html#basic-architecture supports multiple storage back ends http://docs.openstack.org/developer/glance/configuring.html#configuring-multiple-swift-accounts-stores so I assume one could use a glance client to also extract/download images into a local file format (raw, qcow2, vmdk…) as well as uploading images to glance. And as glance images ain’t ‘live’ like virtual disk images, one could also download glance images from one glance store to a local file and upload them back into a different glance back end store. Again, this is probably better than dealing at a lower abstraction level and having to know its internal storage structures, and it avoids what you’re pointing out, Greg. On Thu, Mar 26, 2015 at 3:05 PM, Steffen W Sørensen ste...@me.com wrote: On 26/03/2015, at 23.01, Gregory Farnum g...@gregs42.com wrote: On Thu, Mar 26, 2015 at 2:53 PM, Steffen W Sørensen ste...@me.com wrote: On 26/03/2015, at 21.07, J-P Methot jpmet...@gtcomm.net wrote: That's a great idea. I know I can setup cinder (the openstack volume manager) as a multi-backend manager and migrate from one backend to the other, each backend linking to different pools of the same ceph cluster. What bugs me though is that I'm pretty sure the image store, glance, wouldn't let me do that. Additionally, since the compute component also has its own ceph pool, I'm pretty sure it won't let me migrate the data through openstack. Hm wouldn’t it be possible to do something similar ala: # list object from src pool rados ls objects loop | filter-obj-id | while read obj; do # export $obj to local disk rados -p pool-with-too-many-pgs get $obj # import $obj from local disk to new pool rados -p better-sized-pool put $obj done You would also have issues with snapshots if you do this on an RBD pool. That's unfortunately not feasible. What isn’t possible, export-import objects out-and-in of pools or snapshots issues? /Steffen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
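For a plain object pool with no snapshots and no RBD metadata, the per-object copy discussed above could be sketched roughly as follows; the pool names are placeholders, and note that rados get/put moves object data only (not xattrs, omap entries, or snapshots):

$ src=pool-with-too-many-pgs; dst=better-sized-pool
$ rados -p "$src" ls | while read -r obj; do
      rados -p "$src" get "$obj" /tmp/obj.$$ && rados -p "$dst" put "$obj" /tmp/obj.$$
  done; rm -f /tmp/obj.$$

If your rados build has a cppool subcommand it does much the same thing in one step, with the same snapshot caveats.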
Re: [ceph-users] All client writes block when 2 of 3 OSDs down
Greg, Couple of dumb question may be. 1. If you see , the clients are connecting fine with two monitors in the cluster. 2 monitors can never form a quorum, but, 1 can, so, why with 1 monitor (which is I guess happening after making 2 nodes down) it is not able to connect ? 2. Also, my understanding is while IO is going on *no* monitor interaction will be on that path, so, why the client io will be stopped because the monitor quorum is not there ? If the min_size =1 is properly set it should able to serve IO as long as 1 OSD (node) is up, isn't it ? Thanks Regards Somnath -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Gregory Farnum Sent: Thursday, March 26, 2015 2:40 PM To: Lee Revell Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] All client writes block when 2 of 3 OSDs down On Thu, Mar 26, 2015 at 2:30 PM, Lee Revell rlrev...@gmail.com wrote: On Thu, Mar 26, 2015 at 4:40 PM, Gregory Farnum g...@gregs42.com wrote: Has the OSD actually been detected as down yet? I believe it has, however I can't directly check because ceph health starts to hang when I down the second node. Oh. You need to keep a quorum of your monitors running (just the monitor processes, not of everything in the system) or nothing at all is going to work. That's how we prevent split brain issues. You'll also need to set that min size on your existing pools (ceph osd pool pool set min_size 1 or similar) to change their behavior; the config option only takes effect for newly-created pools. (Thus the default.) I've done this, however the behavior is the same: $ for f in `ceph osd lspools | sed 's/[0-9]//g' | sed 's/,//g'`; do ceph osd pool set $f min_size 1; done set pool 0 min_size to 1 set pool 1 min_size to 1 set pool 2 min_size to 1 set pool 3 min_size to 1 set pool 4 min_size to 1 set pool 5 min_size to 1 set pool 6 min_size to 1 set pool 7 min_size to 1 $ ceph -w cluster db460aa2-5129-4aaa-8b2e-43eac727124e health HEALTH_WARN 1 mons down, quorum 0,1 ceph-node-1,ceph-node-2 monmap e3: 3 mons at {ceph-node-1=192.168.122.121:6789/0,ceph-node-2=192.168.122.131:6789/0 ,ceph-node-3=192.168.122.141:6789/0}, election epoch 194, quorum 0,1 ceph-node-1,ceph-node-2 mdsmap e94: 1/1/1 up {0=ceph-node-1=up:active} osdmap e362: 3 osds: 2 up, 2 in pgmap v5913: 840 pgs, 8 pools, 7441 MB data, 994 objects 25329 MB used, 12649 MB / 40059 MB avail 840 active+clean 2015-03-26 17:23:56.009938 mon.0 [INF] pgmap v5913: 840 pgs: 840 active+clean; 7441 MB data, 25329 MB used, 12649 MB / 40059 MB avail 2015-03-26 17:25:51.042802 mon.0 [INF] pgmap v5914: 840 pgs: 840 active+clean; 7441 MB data, 25329 MB used, 12649 MB / 40059 MB avail; active+0 B/s rd, 260 kB/s wr, 13 op/s 2015-03-26 17:25:56.046491 mon.0 [INF] pgmap v5915: 840 pgs: 840 active+clean; 7441 MB data, 25333 MB used, 12645 MB / 40059 MB avail; active+0 B/s rd, 943 kB/s wr, 38 op/s 2015-03-26 17:26:01.058167 mon.0 [INF] pgmap v5916: 840 pgs: 840 active+clean; 7441 MB data, 25335 MB used, 12643 MB / 40059 MB avail; active+0 B/s rd, 10699 kB/s wr, 621 op/s this is where i kill the second OSD 2015-03-26 17:26:26.778461 7f4ebeffd700 0 monclient: hunting for new mon 2015-03-26 17:26:30.701099 7f4ec45f5700 0 -- 192.168.122.111:0/1007741 192.168.122.141:6789/0 pipe(0x7f4ec0023200 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f4ec0023490).fault 2015-03-26 17:26:42.701154 7f4ec44f4700 0 -- 192.168.122.111:0/1007741 192.168.122.131:6789/0 pipe(0x7f4ec00251b0 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f4ec0025440).fault And all writes block until I bring back an OSD. 
Lee ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] more human readable log to track request or using mapreduce for data statistics
hi, ceph: Currently, the command "ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_historic_ops" may return something like this: { description: osd_op(client.4436.1:11617 rb.0.1153.6b8b4567.0192 [] 2.8eb4757c ondisk+write e92), received_at: 2015-03-25 19:41:47.146145, age: 2.186521, duration: 1.237882, type_data: [ commit sent; apply or cleanup, { client: client.4436, tid: 11617}, [ { time: 2015-03-25 19:41:47.150803, event: event1}, { time: 2015-03-25 19:41:47.150873, event: event2}, { time: 2015-03-25 19:41:47.150895, event: event3}, { time: 2015-03-25 19:41:48.384027, event: event4}]]} I think this format is not well suited to grepping logs or to using MapReduce for statistics. For example, I want to know the average write request latency for each rbd every day. If we could output all the latencies on just one line, it would be very easy to achieve. For example, the output log could look something like this: 2015-03-26 03:30:53.859759 osd=osd.0 pg=2.11 op=(client.4436.1:11617 rb.0.1153.6b8b4567.0192 [] 2.8eb4757c ondisk+write e92) received_at=1427355253 age=2.186521 duration=1.237882 tid=11617 client=client.4436 event1=20ms event2=300ms event3=400ms event4=100ms. In the above: duration means the time between (reply_to_client_stamp - request_received_stamp); event1 means the time between (event1_stamp - request_received_stamp); ... event4 means the time between (event4_stamp - request_received_stamp). Now, if we output every log line as above, it would be very easy to compute the average write request latency for each rbd every day, or to use grep to find out which stage is the bottleneck. -- Regards, xinze ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
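Until such a log format exists, the admin-socket JSON can be flattened into one line per op with a small filter. A sketch assuming jq is installed; the recursive descent avoids depending on the exact wrapper key, which differs between releases, so adjust the field names to whatever your version emits:

$ ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_historic_ops |
    jq -r '.. | objects | select(has("description") and has("duration")) | [.received_at, (.duration|tostring), .description] | @tsv'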
Re: [ceph-users] hadoop namenode not starting due to bindException while deploying hadoop with cephFS
On Wed, Mar 25, 2015 at 8:10 PM, Ridwan Rashid Noel ridwan...@gmail.com wrote: Hi Greg, Thank you for your response. I have understood that I should be starting only the mapred daemons when using cephFS instead of HDFS. I have fixed that and trying to run hadoop wordcount job using this instruction: bin/hadoop jar hadoop*examples*.jar wordcount /tmp/wc-input /tmp/wc-output but I am getting this error 15/03/26 02:54:35 INFO util.NativeCodeLoader: Loaded the native-hadoop library 15/03/26 02:54:35 INFO input.FileInputFormat: Total input paths to process : 1 15/03/26 02:54:35 WARN snappy.LoadSnappy: Snappy native library not loaded 15/03/26 02:54:35 INFO mapred.JobClient: Running job: job_201503260253_0001 15/03/26 02:54:36 INFO mapred.JobClient: map 0% reduce 0% 15/03/26 02:54:36 INFO mapred.JobClient: Task Id : attempt_201503260253_0001_m_21_0, Status : FAILED Error initializing attempt_201503260253_0001_m_21_0: java.io.FileNotFoundException: File file:/tmp/hadoop-ceph/mapred/system/job_201503260253_0001/jobToken does not exist. at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:397) at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251) at org.apache.hadoop.mapred.TaskTracker.localizeJobTokenFile(TaskTracker.java:4445) at org.apache.hadoop.mapred.TaskTracker.initializeJob(TaskTracker.java:1272) at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:1213) at org.apache.hadoop.mapred.TaskTracker$5.run(TaskTracker.java:2568) at java.lang.Thread.run(Thread.java:745) I'm not an expert at setting up Hadoop, but these errors are coming out of the RawLocalFileSystem, which I think means that worker node is trying to use a local FS instead of Ceph. Did you set up each node to access Ceph? Have you set up and used Hadoop previously? -Greg . I have used the core-site.xml configurations as mentioned in http://ceph.com/docs/master/cephfs/hadoop/ Please tell me how can this problem be solved? Regards, Ridwan Rashid Noel Doctoral Student, Department of Computer Science, University of Texas at San Antonio Contact# 210-773-9966 On Fri, Mar 20, 2015 at 4:04 PM, Gregory Farnum g...@gregs42.com wrote: On Fri, Mar 20, 2015 at 1:05 PM, Ridwan Rashid ridwan...@gmail.com wrote: Gregory Farnum greg@... writes: On Thu, Mar 19, 2015 at 5:57 PM, Ridwan Rashid ridwan064@... wrote: Hi, I have a 5 node ceph(v0.87) cluster and am trying to deploy hadoop with cephFS. I have installed hadoop-1.1.1 in the nodes and changed the conf/core-site.xml file according to the ceph documentation http://ceph.com/docs/master/cephfs/hadoop/ but after changing the file the namenode is not starting (namenode can be formatted) but the other services(datanode, jobtracker, tasktracker) are running in hadoop. The default hadoop works fine but when I change the core-site.xml file as above I get the following bindException as can be seen from the namenode log: 2015-03-19 01:37:31,436 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.net.BindException: Problem binding to node1/10.242.144.225:6789 : Cannot assign requested address I have one monitor for the ceph cluster (node1/10.242.144.225) and I included in the core-site.xml file ceph://10.242.144.225:6789 as the value of fs.default.name. The 6789 port is the default port being used by the monitor node of ceph, so that may be the reason for the bindException but the ceph documentation mentions that it should be included like this in the core-site.xml file. 
It would be really helpful to get some pointers to where I am going wrong in the setup. I'm a bit confused. The NameNode is only used by HDFS, and so shouldn't be running at all if you're using CephFS. Nor do I have any idea why you've changed anything in a way that tells the NameNode to bind to the monitor's IP address; none of the instructions that I see can do that, and they certainly shouldn't be. -Greg Hi Greg, I want to run a hadoop job (e.g. terasort) and want to use cephFS instead of HDFS. In the Using Hadoop with cephFS documentation at http://ceph.com/docs/master/cephfs/hadoop/ if you look into the Hadoop configuration section, the first property fs.default.name has to be set to the ceph URI and in the notes it's mentioned as ceph://[monaddr:port]/. My core-site.xml of the hadoop conf looks like this: <configuration> <property> <name>fs.default.name</name> <value>ceph://10.242.144.225:6789</value> </property> Yeah, that all makes sense. But I don't understand why or how you're starting up a NameNode at all, nor what config values it's drawing from to try and bind to that port. The NameNode is the problem because it shouldn't even be invoked. -Greg
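A quick way to confirm that the CephFS bindings (rather than a local or HDFS filesystem) are actually being used, assuming the jar and core-site.xml from the linked documentation are in place and using the monitor URI from this thread, is to list the CephFS root and then start only the MapReduce daemons:

$ bin/hadoop fs -ls ceph://10.242.144.225:6789/
$ bin/hadoop-daemon.sh start jobtracker     # no namenode/datanode should be needed
$ bin/hadoop-daemon.sh start tasktracker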
Re: [ceph-users] ceph falsely reports clock skew?
On Thu, 26 Mar 2015, Gregory Farnum wrote: On Thu, Mar 26, 2015 at 7:44 AM, Lee Revell rlrev...@gmail.com wrote: I have a virtual test environment of an admin node and 3 mon + osd nodes, built by just following the quick start guide. It seems to work OK but ceph is constantly complaining about clock skew much greater than reality. Clocksource on the virtuals is kvm-clock and they also run ntpd. ceph-admin-node 26 Mar 10:35:29 ntpdate[2647]: adjust time server 91.189.94.4 offset 0.000802 sec ceph-node-1 26 Mar 10:35:35 ntpdate[4250]: adjust time server 91.189.94.4 offset 0.002537 sec ceph-node-2 26 Mar 10:35:42 ntpdate[1708]: adjust time server 91.189.94.4 offset -0.000214 sec ceph-node-3 26 Mar 10:35:49 ntpdate[1964]: adjust time server 91.189.94.4 offset 0.001490 sec ceph@ceph-admin-node:~/my-cluster$ ceph -w cluster db460aa2-5129-4aaa-8b2e-43eac727124e health HEALTH_WARN clock skew detected on mon.ceph-node-2 monmap e3: 3 mons at {ceph-node-1=192.168.122.121:6789/0,ceph-node-2=192.168.122.131:6789/0,ceph-node-3=192.168.122.141:6789/0}, election epoch 140, quorum 0,1,2 ceph-node-1,ceph-node-2,ceph-node-3 mdsmap e54: 1/1/1 up {0=ceph-node-1=up:active} osdmap e182: 3 osds: 3 up, 3 in pgmap v3594: 840 pgs, 8 pools, 7163 MB data, 958 objects 29850 MB used, 27118 MB / 60088 MB avail 840 active+clean What clock skews is it reporting? I don't remember the defaults, but if ntp is consistently adjusting your clocks by a couple of milliseconds then I don't think Ceph is going to be very happy about it. IIRC the mons re-check sync every 5 minutes. Does the warning persist? Does it go away if you restart the mons? sage ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] running Qemu / Hypervisor AND Ceph on the same nodes
I run a converged openstack / ceph cluster with 14 1U nodes. Each has 1 SSD (os / journals), 3 1TB spinners (1 OSD each), 16 HT cores, 10Gb NICs for ceph network, and 72GB of RAM. I configure openstack to leave 3GB of RAM unused on each node for OSD / OS overhead. All the VMs are backed by ceph volumes and things generally work very well. I would prefer a dedicated storage layer simply because it seems more right, but I can't say that any of the common concerns of using this kind of setup have come up for me. Aside from shaving off that 3GB of RAM, my deployment isn't any more complex than a split stack deployment would be. After running like this for the better part of a year, I would have a hard time honestly making a real business case for the extra hardware a split stack cluster would require. QH On Thu, Mar 26, 2015 at 6:57 AM, Mark Nelson mnel...@redhat.com wrote: It's kind of a philosophical question. Technically there's nothing that prevents you from putting ceph and the hypervisor on the same boxes. It's a question of whether or not potential cost savings are worth increased risk of failure and contention. You can minimize those things through various means (cgroups, ristricting NUMA nodes, etc). What is more difficult is isolating disk IO contention (say if you want local SSDs for VMs), memory bus and QPI contention, network contention, etc. If the VMs are working really hard you can restrict them to their own socket, and you can even restrict memory usage to the local socket, but what about remote socket network or disk IO? (you will almost certainly want these things on the ceph socket) I wonder as well about increased risk of hardware failure with the increased load, but I don't have any statistics. I'm guessing if you spent enough time at it you could make it work relatively well, but at least personally I question how beneficial it really is after all of that. If you are going for cost savings, I suspect efficient compute and storage node designs will be nearly as good with much less complexity. Mark On 03/26/2015 07:11 AM, Wido den Hollander wrote: On 26-03-15 12:04, Stefan Priebe - Profihost AG wrote: Hi Wido, Am 26.03.2015 um 11:59 schrieb Wido den Hollander: On 26-03-15 11:52, Stefan Priebe - Profihost AG wrote: Hi, in the past i rwad pretty often that it's not a good idea to run ceph and qemu / the hypervisors on the same nodes. But why is this a bad idea? You save space and can better use the ressources you have in the nodes anyway. Memory pressure during recovery *might* become a problem. If you make sure that you don't allocate more then let's say 50% for the guests it could work. mhm sure? I've never seen problems like that. Currently i ran each ceph node with 64GB of memory and each hypervisor node with around 512GB to 1TB RAM while having 48 cores. Yes, it can happen. You have machines with enough memory, but if you overprovision the machines it can happen. Using cgroups you could also prevent that the OSDs eat up all memory or CPU. Never seen an OSD doing so crazy things. Again, it really depends on the available memory and CPU. If you buy big machines for this purpose it probably won't be a problem. Stefan So technically it could work, but memorey and CPU pressure is something which might give you problems. 
Stefan ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph falsely reports clock skew?
On Thu, Mar 26, 2015 at 7:44 AM, Lee Revell rlrev...@gmail.com wrote: I have a virtual test environment of an admin node and 3 mon + osd nodes, built by just following the quick start guide. It seems to work OK but ceph is constantly complaining about clock skew much greater than reality. Clocksource on the virtuals is kvm-clock and they also run ntpd. ceph-admin-node 26 Mar 10:35:29 ntpdate[2647]: adjust time server 91.189.94.4 offset 0.000802 sec ceph-node-1 26 Mar 10:35:35 ntpdate[4250]: adjust time server 91.189.94.4 offset 0.002537 sec ceph-node-2 26 Mar 10:35:42 ntpdate[1708]: adjust time server 91.189.94.4 offset -0.000214 sec ceph-node-3 26 Mar 10:35:49 ntpdate[1964]: adjust time server 91.189.94.4 offset 0.001490 sec ceph@ceph-admin-node:~/my-cluster$ ceph -w cluster db460aa2-5129-4aaa-8b2e-43eac727124e health HEALTH_WARN clock skew detected on mon.ceph-node-2 monmap e3: 3 mons at {ceph-node-1=192.168.122.121:6789/0,ceph-node-2=192.168.122.131:6789/0,ceph-node-3=192.168.122.141:6789/0}, election epoch 140, quorum 0,1,2 ceph-node-1,ceph-node-2,ceph-node-3 mdsmap e54: 1/1/1 up {0=ceph-node-1=up:active} osdmap e182: 3 osds: 3 up, 3 in pgmap v3594: 840 pgs, 8 pools, 7163 MB data, 958 objects 29850 MB used, 27118 MB / 60088 MB avail 840 active+clean What clock skews is it reporting? I don't remember the defaults, but if ntp is consistently adjusting your clocks by a couple of milliseconds then I don't think Ceph is going to be very happy about it. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
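The threshold the monitors warn against, and the measured skew itself, can be read directly; mon_clock_drift_allowed is believed to default to around 0.05 s on releases of this era (check your build), and the socket path below is just the usual default:

$ ceph --admin-daemon /var/run/ceph/ceph-mon.ceph-node-1.asok config get mon_clock_drift_allowed
$ ceph health detail    # shows the measured skew per monitor while the warning is active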
[ceph-users] ceph falsely reports clock skew?
I have a virtual test environment of an admin node and 3 mon + osd nodes, built by just following the quick start guide. It seems to work OK but ceph is constantly complaining about clock skew much greater than reality. Clocksource on the virtuals is kvm-clock and they also run ntpd. ceph-admin-node 26 Mar 10:35:29 ntpdate[2647]: adjust time server 91.189.94.4 offset 0.000802 sec ceph-node-1 26 Mar 10:35:35 ntpdate[4250]: adjust time server 91.189.94.4 offset 0.002537 sec ceph-node-2 26 Mar 10:35:42 ntpdate[1708]: adjust time server 91.189.94.4 offset -0.000214 sec ceph-node-3 26 Mar 10:35:49 ntpdate[1964]: adjust time server 91.189.94.4 offset 0.001490 sec ceph@ceph-admin-node:~/my-cluster$ ceph -w cluster db460aa2-5129-4aaa-8b2e-43eac727124e health HEALTH_WARN clock skew detected on mon.ceph-node-2 monmap e3: 3 mons at {ceph-node-1= 192.168.122.121:6789/0,ceph-node-2=192.168.122.131:6789/0,ceph-node-3=192.168.122.141:6789/0}, election epoch 140, quorum 0,1,2 ceph-node-1,ceph-node-2,ceph-node-3 mdsmap e54: 1/1/1 up {0=ceph-node-1=up:active} osdmap e182: 3 osds: 3 up, 3 in pgmap v3594: 840 pgs, 8 pools, 7163 MB data, 958 objects 29850 MB used, 27118 MB / 60088 MB avail 840 active+clean Lee ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] All pools have size=3 but MB data and MB used ratio is 1 to 5
On Thu, Mar 26, 2015 at 2:56 AM, Saverio Proto ziopr...@gmail.com wrote: Thanks for the answer. Now the meaning of MB data and MB used is clear, and if all the pools have size=3 I expect a ratio 1 to 3 of the two values. I still can't understand why MB used is so big in my setup. All my pools are size =3 but the ratio MB data and MB used is 1 to 5 instead of 1 to 3. My first guess was that I wrote a wrong crushmap that was making more than 3 copies.. (is it really possible to make such a mistake?) So I changed my crushmap and I put the default one, that just spreads data across hosts, but I see no change, the ratio is still 1 to 5. I thought maybe my 3 monitors have different views of the pgmap, so I tried to restart the monitors but this also did not help. What useful information may I share here to troubleshoot this issue further ? ceph version 0.87.1 (283c2e7cfa2457799f534744d7d549f83ea1335e) You just need to go look at one of your OSDs and see what data is stored on it. Did you configure things so that the journals are using a file on the same storage disk? If so, *that* is why the data used is large. I promise that your 5:1 ratio won't persist as you write more than 2GB of data into the cluster. -Greg Thank you Saverio 2015-03-25 14:55 GMT+01:00 Gregory Farnum g...@gregs42.com: On Wed, Mar 25, 2015 at 1:24 AM, Saverio Proto ziopr...@gmail.com wrote: Hello there, I started to push data into my ceph cluster. There is something I cannot understand in the output of ceph -w. When I run ceph -w I get this kinkd of output: 2015-03-25 09:11:36.785909 mon.0 [INF] pgmap v278788: 26056 pgs: 26056 active+clean; 2379 MB data, 19788 MB used, 33497 GB / 33516 GB avail 2379MB is actually the data I pushed into the cluster, I can see it also in the ceph df output, and the numbers are consistent. What I dont understand is 19788MB used. All my pools have size 3, so I expected something like 2379 * 3. Instead this number is very big. I really need to understand how MB used grows because I need to know how many disks to buy. MB used is the summation of (the programmatic equivalent to) df across all your nodes, whereas MB data is calculated by the OSDs based on data they've written down. Depending on your configuration MB used can include thing like the OSD journals, or even totally unrelated data if the disks are shared with other applications. MB used including the space used by the OSD journals is my first guess about what you're seeing here, in which case you'll notice that it won't grow any faster than MB data does once the journal is fully allocated. -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] running Qemu / Hypervisor AND Ceph on the same nodes
I suspect a config like this where you only have 3 OSDs per node would be more manageable than something denser. IE theoretically a single E5-2697v3 is enough to run 36 OSDs in a 4U super micro chassis for a semi-dense converged solution. You could attempt to restrict the OSDs to one socket and then use a second E5-2697v3 for VMs. Maybe after you've got cgroups setup properly and if you've otherwise balanced things it would work out ok. I question though how much you really benefit by doing this rather than running a 36 drive storage server with lower bin CPUs and a 2nd 1U box for VMs (which you don't need as many of because you can dedicate both sockets to VMs). It probably depends quite a bit on how memory, network, and disk intensive the VMs are, but my take is that it's better to error on the side of simplicity rather than making things overly complicated. Every second you are screwing around trying to make the setup work right eats into any savings you might gain by going with the converged setup. Mark On 03/26/2015 10:12 AM, Quentin Hartman wrote: I run a converged openstack / ceph cluster with 14 1U nodes. Each has 1 SSD (os / journals), 3 1TB spinners (1 OSD each), 16 HT cores, 10Gb NICs for ceph network, and 72GB of RAM. I configure openstack to leave 3GB of RAM unused on each node for OSD / OS overhead. All the VMs are backed by ceph volumes and things generally work very well. I would prefer a dedicated storage layer simply because it seems more right, but I can't say that any of the common concerns of using this kind of setup have come up for me. Aside from shaving off that 3GB of RAM, my deployment isn't any more complex than a split stack deployment would be. After running like this for the better part of a year, I would have a hard time honestly making a real business case for the extra hardware a split stack cluster would require. QH On Thu, Mar 26, 2015 at 6:57 AM, Mark Nelson mnel...@redhat.com mailto:mnel...@redhat.com wrote: It's kind of a philosophical question. Technically there's nothing that prevents you from putting ceph and the hypervisor on the same boxes. It's a question of whether or not potential cost savings are worth increased risk of failure and contention. You can minimize those things through various means (cgroups, ristricting NUMA nodes, etc). What is more difficult is isolating disk IO contention (say if you want local SSDs for VMs), memory bus and QPI contention, network contention, etc. If the VMs are working really hard you can restrict them to their own socket, and you can even restrict memory usage to the local socket, but what about remote socket network or disk IO? (you will almost certainly want these things on the ceph socket) I wonder as well about increased risk of hardware failure with the increased load, but I don't have any statistics. I'm guessing if you spent enough time at it you could make it work relatively well, but at least personally I question how beneficial it really is after all of that. If you are going for cost savings, I suspect efficient compute and storage node designs will be nearly as good with much less complexity. Mark On 03/26/2015 07:11 AM, Wido den Hollander wrote: On 26-03-15 12:04, Stefan Priebe - Profihost AG wrote: Hi Wido, Am 26.03.2015 um 11:59 schrieb Wido den Hollander: On 26-03-15 11:52, Stefan Priebe - Profihost AG wrote: Hi, in the past i rwad pretty often that it's not a good idea to run ceph and qemu / the hypervisors on the same nodes. But why is this a bad idea? 
You save space and can better use the ressources you have in the nodes anyway. Memory pressure during recovery *might* become a problem. If you make sure that you don't allocate more then let's say 50% for the guests it could work. mhm sure? I've never seen problems like that. Currently i ran each ceph node with 64GB of memory and each hypervisor node with around 512GB to 1TB RAM while having 48 cores. Yes, it can happen. You have machines with enough memory, but if you overprovision the machines it can happen. Using cgroups you could also prevent that the OSDs eat up all memory or CPU. Never seen an OSD doing so crazy things. Again, it really depends on the available memory and CPU. If you buy big machines for this purpose it probably won't be a problem. Stefan So technically it could work, but memorey and CPU pressure is something which might give you problems.
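One way to implement the cgroup/NUMA restriction Mark mentions is with cgroup-tools; the group name, core range, and memory limit below are only illustrative and would need tuning per box:

$ sudo cgcreate -g cpuset,memory:/ceph-osd
$ echo 0-7 | sudo tee /sys/fs/cgroup/cpuset/ceph-osd/cpuset.cpus    # cores on the ceph socket
$ echo 0   | sudo tee /sys/fs/cgroup/cpuset/ceph-osd/cpuset.mems    # its local NUMA node
$ sudo cgclassify -g cpuset,memory:/ceph-osd $(pgrep -d' ' ceph-osd)
$ echo $((2*1024*1024*1024)) | sudo tee /sys/fs/cgroup/memory/ceph-osd/memory.limit_in_bytes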
Re: [ceph-users] How to see the content of an EC Pool after recreate the SSD-Cache tier?
Hi Greg, ok! It looks like my problem is more setomapval-related... I must do something like rados -p ssd-archiv setomapval rbd_directory name_vm-409-disk-2 \0x0f\0x00\0x00\0x002cfc7ce74b0dc51 but rados setomapval doesn't interpret the hex values - instead of this I got rados -p ssd-archiv listomapvals rbd_directory name_vm-409-disk-2 value: (35 bytes) : : 5c 30 78 30 66 5c 30 78 30 30 5c 30 78 30 30 5c : \0x0f\0x00\0x00\ 0010 : 30 78 30 30 32 63 66 63 37 63 65 37 34 62 30 64 : 0x002cfc7ce74b0d 0020 : 63 35 31: c51 hmm, strange. With rados -p ssd-archiv getomapval rbd_directory name_vm-409-disk-2 name_vm-409-disk-2 I got the binary value inside the file name_vm-409-disk-2, but the reverse, rados -p ssd-archiv setomapval rbd_directory name_vm-409-disk-2 name_vm-409-disk-2, fills the value with the literal string name_vm-409-disk-2 and not with the content of the file... Are there other tools for the rbd_directory? regards Udo Am 26.03.2015 15:03, schrieb Gregory Farnum: You shouldn't rely on rados ls when working with cache pools. It doesn't behave properly and is a silly operation to run against a pool of any size even when it does. :) More specifically, rados ls is invoking the pgls operation. Normal read/write ops will go query the backing store for objects if they're not in the cache tier. pgls is different — it just tells you what objects are present in the PG on that OSD right now. So any objects which aren't in cache won't show up when listing on the cache pool. -Greg On Thu, Mar 26, 2015 at 3:43 AM, Udo Lembke ulem...@polarzone.de wrote: Hi all, due to a very silly approach, I removed the cache tier of a filled EC pool. After recreating the pool and connecting it with the EC pool I don't see any content. How can I see the rbd_data and other files through the new ssd cache tier? I think that I must recreate the rbd_directory (and fill it with setomapval), but I don't see anything yet! $ rados ls -p ecarchiv | more rbd_data.2e47de674b0dc51.00390074 rbd_data.2e47de674b0dc51.0020b64f rbd_data.2fbb1952ae8944a.0016184c rbd_data.2cfc7ce74b0dc51.00363527 rbd_data.2cfc7ce74b0dc51.0004c35f rbd_data.2fbb1952ae8944a.0008db43 rbd_data.2cfc7ce74b0dc51.0015895a rbd_data.31229f0238e1f29.000135eb ... $ rados ls -p ssd-archiv nothing generation of the cache tier: $ rados mkpool ssd-archiv $ ceph osd pool set ssd-archiv crush_ruleset 5 $ ceph osd tier add ecarchiv ssd-archiv $ ceph osd tier cache-mode ssd-archiv writeback $ ceph osd pool set ssd-archiv hit_set_type bloom $ ceph osd pool set ssd-archiv hit_set_count 1 $ ceph osd pool set ssd-archiv hit_set_period 3600 $ ceph osd pool set ssd-archiv target_max_bytes 500 rule ssd { ruleset 5 type replicated min_size 1 max_size 10 step take ssd step choose firstn 0 type osd step emit } Is there any magic (or which command did I miss?) to see the existing data through the cache tier? regards - and hoping for answers Udo ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
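To double-check that the recreated tier actually reaches the EC pool despite the empty listing, a read through the cache pool is a better test than rados ls; pool names as in the post above, and the object chosen is simply whichever one the base pool lists first:

$ obj=$(rados -p ecarchiv ls | head -1)
$ rados -p ssd-archiv stat "$obj"    # served through the tier, proxied/promoted from ecarchiv
$ rados -p ecarchiv ls | wc -l       # listing still has to be done against the base pool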
Re: [ceph-users] All pools have size=3 but MB data and MB used ratio is 1 to 5
You just need to go look at one of your OSDs and see what data is stored on it. Did you configure things so that the journals are using a file on the same storage disk? If so, *that* is why the data used is large.
I followed your suggestion and this is the result of my troubleshooting. Each OSD controls a disk that is mounted in a folder named /var/lib/ceph/osd/ceph-N, where N is the OSD number. The journal is stored on another disk drive: I have three extra SSD drives per server, which I partitioned with 6 partitions each, and those partitions are journal partitions. I checked that the setup is correct because each /var/lib/ceph/osd/ceph-N/journal points correctly to another drive. With df -h I see the folders where my OSDs are mounted. The space occupation looks well distributed among all OSDs, as expected. The data is always in a folder called /var/lib/ceph/osd/ceph-N/current. I checked with the tool ncdu where the data is stored inside the current folders. In each OSD there is a folder with a lot of data called /var/lib/ceph/osd/ceph-N/current/meta. If I sum the MB for each meta folder, that is more or less the extra space that is consumed, leading to the 1 to 5 ratio. The meta folder contains a lot of binary files, unreadable, but looking at the file names it looks like it is where the versions of the osdmap are stored. It is really a lot of metadata, though. I will now start to push a lot of data into the cluster to see whether the metadata grows a lot or stays constant. Is there a way to clean up old metadata?
thanks Saverio
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
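For anyone who wants to repeat this comparison, a quick sketch of the checks described above, assuming the default filestore layout (PG directories end in _head, OSD-internal metadata lives under current/meta); adjust the OSD numbers for your hosts:

$ du -sh /var/lib/ceph/osd/ceph-*/current/meta      # osdmaps and other OSD-internal metadata
$ du -sh /var/lib/ceph/osd/ceph-*/current/*_head    # PG directories holding the actual object data
$ ls /var/lib/ceph/osd/ceph-0/current/meta | head   # file names should include the osdmap epochs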
Re: [ceph-users] ceph falsely reports clock skew?
I think I solved the problem. The clock skew only happens when restarting a node to simulate hardware failure. The virtual machine comes up with a skewed clock and the ceph services start before ntp has time to adjust it; then there's a delay before ceph rechecks the clock skew.
Lee
On Thu, Mar 26, 2015 at 11:21 AM, Sage Weil s...@newdream.net wrote:
On Thu, 26 Mar 2015, Gregory Farnum wrote:
On Thu, Mar 26, 2015 at 7:44 AM, Lee Revell rlrev...@gmail.com wrote:
I have a virtual test environment of an admin node and 3 mon + osd nodes, built by just following the quick start guide. It seems to work OK, but ceph is constantly complaining about clock skew much greater than reality. The clock source on the virtuals is kvm-clock and they also run ntpd.
ceph-admin-node 26 Mar 10:35:29 ntpdate[2647]: adjust time server 91.189.94.4 offset 0.000802 sec
ceph-node-1 26 Mar 10:35:35 ntpdate[4250]: adjust time server 91.189.94.4 offset 0.002537 sec
ceph-node-2 26 Mar 10:35:42 ntpdate[1708]: adjust time server 91.189.94.4 offset -0.000214 sec
ceph-node-3 26 Mar 10:35:49 ntpdate[1964]: adjust time server 91.189.94.4 offset 0.001490 sec
ceph@ceph-admin-node:~/my-cluster$ ceph -w
    cluster db460aa2-5129-4aaa-8b2e-43eac727124e
     health HEALTH_WARN clock skew detected on mon.ceph-node-2
     monmap e3: 3 mons at {ceph-node-1=192.168.122.121:6789/0,ceph-node-2=192.168.122.131:6789/0,ceph-node-3=192.168.122.141:6789/0}, election epoch 140, quorum 0,1,2 ceph-node-1,ceph-node-2,ceph-node-3
     mdsmap e54: 1/1/1 up {0=ceph-node-1=up:active}
     osdmap e182: 3 osds: 3 up, 3 in
      pgmap v3594: 840 pgs, 8 pools, 7163 MB data, 958 objects
            29850 MB used, 27118 MB / 60088 MB avail
                 840 active+clean
What clock skews is it reporting? I don't remember the defaults, but if ntp is consistently adjusting your clocks by a couple of milliseconds then I don't think Ceph is going to be very happy about it.
IIRC the mons re-check sync every 5 minutes. Does the warning persist? Does it go away if you restart the mons?
sage
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
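If the warning keeps reappearing after reboots, one workaround (not a fix for the underlying boot ordering) is to widen the tolerated drift on the monitors in ceph.conf; the values below are only examples, and IIRC the defaults are roughly 0.05 seconds allowed and a warn backoff of 5:

[mon]
    mon clock drift allowed = 0.2
    mon clock drift warn backoff = 30

The cleaner fix is to make sure ntpd (or a one-shot ntpdate -b) has stepped the clock before the ceph-mon service starts on boot.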