[ceph-users] radosgw-admin orphans find -- Hammer

2017-09-07 Thread Daniel Schneller
Hello,

we need to reclaim a lot of wasted space by RGW orphans in our production 
Hammer cluster (0.94.10 on Ubuntu 14.04).

According to http://tracker.ceph.com/issues/18258 there is a bug in the
radosgw-admin orphans find command that causes it to get stuck in an
infinite loop.

From the bug report I cannot tell if there are unusual circumstances that need
to be present to trigger the infinite-loop condition, or if I am more or less
guaranteed to hit the issue.
The bug has been fixed, but not in Hammer.

Any chance of getting it backported into Hammer? 
Is the fix in the radosgw-admin tool itself, or are there more/other components 
that would have to be touched?

As the cluster has about 200 million objects, I would rather not just “try my 
luck” and get stuck in the middle.
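
For reference, the invocation I have in mind is roughly the following (a
sketch only -- pool name, job id and shard count are placeholders, and I have
not dared to run this against production yet):

$ sudo radosgw-admin orphans find --pool=.rgw.buckets --job-id=orphan-scan-1 --num-shards=64
$ sudo radosgw-admin orphans list-jobs
$ sudo radosgw-admin orphans finish --job-id=orphan-scan-1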

Any insight on this would be appreciated.

Thanks a lot,
Daniel

-- 
Daniel Schneller
Principal Cloud Engineer
 
CenterDevice GmbH  | Hochstraße 11
   | 42697 Solingen
tel: +49 1754155711| Deutschland
daniel.schnel...@centerdevice.de   | www.centerdevice.de

Geschäftsführung: Dr. Patrick Peschlow, Dr. Lukas Pustina,
Michael Rosbach, Handelsregister-Nr.: HRB 18655,
HR-Gericht: Bonn, USt-IdNr.: DE-815299431


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OSD omap disk write bursts

2016-09-19 Thread Daniel Schneller

Hello!

We are observing a somewhat strange IO pattern on our OSDs.

The cluster is running Hammer 0.94.1, 48 OSDs, 4 TB spinners, xfs, 
colocated journals.


Over periods of days on end we see groups of 3 OSDs being busy with 
lots and lots of small writes for several minutes at a time.
Once one group calms down, another group begins. Might be easier to 
understand in a graph:


https://public.centerdevice.de/3e62a18d-dd01-477e-b52b-f65d181e2920

(this shows a limited time range to make the individual lines
discernible)


Initial attempts to correlate this to client activity with small writes
turned out to be wrong -- not really surprising, because both VM RBD
activity and RGW object storage should show much more evenly spread
patterns across all OSDs.


Using sysdig I figured it seems to be LevelDB activity:

[16:58:42 B|daniel.schneller@node02] ~ 
➜  sudo sysdig -p "%12user.name %6proc.pid %12proc.name %3fd.num 
%fd.typechar %fd.name" "evt.type=write and proc.pid=8215"
root 8215   ceph-osd 153 f 
/var/lib/ceph/osd/ceph-14/current/omap/763308.log


... (*lots and lots* more writes to 763308.log ) ... 

root 8215   ceph-osd 153 f 
/var/lib/ceph/osd/ceph-14/current/omap/763308.log
root 8215   ceph-osd 153 f 
/var/lib/ceph/osd/ceph-14/current/omap/763308.log
root 8215   ceph-osd 103 f 
/var/lib/ceph/osd/ceph-14/current/omap/763310.log
root 8215   ceph-osd 103 f 
/var/lib/ceph/osd/ceph-14/current/omap/763310.log
root 8215   ceph-osd 15  f 
/var/lib/ceph/osd/ceph-14/current/omap/LOG
root 8215   ceph-osd 15  f 
/var/lib/ceph/osd/ceph-14/current/omap/LOG
root 8215   ceph-osd 103 f 
/var/lib/ceph/osd/ceph-14/current/omap/763310.log
root 8215   ceph-osd 103 f 
/var/lib/ceph/osd/ceph-14/current/omap/763310.log
root 8215   ceph-osd 103 f 
/var/lib/ceph/osd/ceph-14/current/omap/763310.log
root 8215   ceph-osd 103 f 
/var/lib/ceph/osd/ceph-14/current/omap/763310.log
root 8215   ceph-osd 153 f 
/var/lib/ceph/osd/ceph-14/current/omap/763311.ldb
root 8215   ceph-osd 103 f 
/var/lib/ceph/osd/ceph-14/current/omap/763310.log
root 8215   ceph-osd 103 f 
/var/lib/ceph/osd/ceph-14/current/omap/763310.log
root 8215   ceph-osd 153 f 
/var/lib/ceph/osd/ceph-14/current/omap/763311.ldb
root 8215   ceph-osd 153 f 
/var/lib/ceph/osd/ceph-14/current/omap/763311.ldb
root 8215   ceph-osd 153 f 
/var/lib/ceph/osd/ceph-14/current/omap/763311.ldb
root 8215   ceph-osd 153 f 
/var/lib/ceph/osd/ceph-14/current/omap/763311.ldb
root 8215   ceph-osd 153 f 
/var/lib/ceph/osd/ceph-14/current/omap/763311.ldb
root 8215   ceph-osd 153 f 
/var/lib/ceph/osd/ceph-14/current/omap/763311.ldb
root 8215   ceph-osd 153 f 
/var/lib/ceph/osd/ceph-14/current/omap/763311.ldb
... (*lots and lots* more writes to 763311.ldb ) ... 
root 8215   ceph-osd 15  f 
/var/lib/ceph/osd/ceph-14/current/omap/LOG
root 8215   ceph-osd 15  f 
/var/lib/ceph/osd/ceph-14/current/omap/LOG
root 8215   ceph-osd 18  f 
/var/lib/ceph/osd/ceph-14/current/omap/MANIFEST-171304
root 8215   ceph-osd 18  f 
/var/lib/ceph/osd/ceph-14/current/omap/MANIFEST-171304
root 8215   ceph-osd 15  f 
/var/lib/ceph/osd/ceph-14/current/omap/LOG
root 8215   ceph-osd 15  f 
/var/lib/ceph/osd/ceph-14/current/omap/LOG
root 8215   ceph-osd 103 f 
/var/lib/ceph/osd/ceph-14/current/omap/763310.log
root 8215   ceph-osd 103 f 
/var/lib/ceph/osd/ceph-14/current/omap/763310.log
... (*lots and lots* more writes to 763310.log ) ... 
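
For a more aggregated view, sysdig's bundled topfiles chisel can be pointed
at the OSD process, e.g. (a sketch -- I have not double-checked the chisel
options against our sysdig version):

➜  sudo sysdig -c topfiles_bytes "proc.pid=8215 and fd.name contains /var/lib/ceph/osd/ceph-14/current/omap"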



This correlates to the patterns in the graph for the given OSDs. If I
understand this correctly, it looks like LevelDB compaction -- however,
if that is the case, why would it happen in groups of only three OSDs at a
time, and why would it hit a single OSD in short succession? See this
single-OSD graph of the same time range as before:


https://public.centerdevice.de/ab5f417d-43af-435d-aad0-7becff2b9acb

Are there any regular or event-based maintenance tasks that are ensured
to only run on n (=3) OSDs at a time?

Can I do anything to smooth this out or reduce it somehow?
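
In case it helps with suggestions, the LevelDB/omap related settings of a
running OSD are easy to share, e.g. via the admin socket (a sketch):

➜  sudo ceph daemon osd.14 config show | grep -E 'leveldb|filestore_merge|filestore_split'
➜  sudo du -sh /var/lib/ceph/osd/ceph-14/current/omap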

Thanks,
Daniel



--
Daniel Schneller
Principal Cloud Engineer

CenterDevice GmbH
https://www.centerdevice.de

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW container deletion problem

2016-07-27 Thread Daniel Schneller


Bump

On 2016-07-25 14:05:38 +, Daniel Schneller said:


Hi!

I created a bunch of test containers with some objects in them via
RGW/Swift (Ubuntu, RGW via Apache, Ceph Hammer 0.94.1)

Now I try to get rid of the test data.

I manually started with one container:

~/rgwtest ➜  swift -v -V 1.0 -A http://localhost:8405/auth -U <...> -K
<...> --insecure delete test_a6b3e80c-e880-bef9-b1b5-892073e3b153
test_10
test_5
test_100
test_20
test_30

So far so good. Notice that localhost:8405 is bound by haproxy,
distributing requests to 4 RGWs on different servers, in case that is
relevant.

To make sure my script gets error handling right, I tried to delete the
same container again, leading to an error:

~/rgwtest ➜  swift -v --retries=0 -V 1.0 -A http://localhost:8405/auth
-U <...> -K <...> --insecure delete
test_a6b3e80c-e880-bef9-b1b5-892073e3b153
Container DELETE failed:
http://localhost:8405:8405/swift/v1/test_a6b3e80c-e880-bef9-b1b5-892073e3b153
500 Internal Server Error   UnknownError

Stat'ing it still works:

~/rgwtest ➜  swift -v -V 1.0 -A http://localhost:8405/auth -U <...> -K
<...> --insecure stat test_a6b3e80c-e880-bef9-b1b5-892073e3b153
   URL:
http://localhost:8405/swift/v1/test_a6b3e80c-e880-bef9-b1b5-892073e3b153
Auth Token: AUTH_rgwtk...
   Account: v1
 Container: test_a6b3e80c-e880-bef9-b1b5-892073e3b153
   Objects: 0
 Bytes: 0
  Read ACL:
 Write ACL:
   Sync To:
  Sync Key:
Server: Apache/2.4.7 (Ubuntu)
X-Container-Bytes-Used-Actual: 0
X-Storage-Policy: default-placement
  Content-Type: text/plain; charset=utf-8


Checking the RGW Logs I found this:

2016-07-25 15:21:29.751055 7fbcd67f4700  1 == starting new request
req=0x7fbce40a1100 =
2016-07-25 15:21:29.768688 7fbcd67f4700  0 WARNING: set_req_state_err
err_no=125 resorting to 500
2016-07-25 15:21:29.768743 7fbcd67f4700  1 == req done
req=0x7fbce40a1100 http_status=500 ==

Googling a little I found this:

http://tracker.ceph.com/issues/14208

which mentions similar issues and an out-of-sync metadata cache between
different RGWs. I vaguely remember having seen something like this
in the Firefly timeframe before, but I am not sure if it is the same.

Where does this metadata cache live? Can it be flushed somehow without
disturbing other operations?

I found this PDF

https://archive.fosdem.org/2016/schedule/event/virt_iaas_ceph_rados_gateway_overview/attachments/audio/1077/export/events/attachments/virt_iaas_ceph_rados_gateway_overview/audio/1077/Fosdem_RGW.pdf 




but without the "audio track" it doesn't really help me.

Thanks!
Daniel



--
--
Daniel Schneller
Principal Cloud Engineer

CenterDevice GmbH
https://www.centerdevice.de


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RGW container deletion problem

2016-07-25 Thread Daniel Schneller

Hi!

I created a bunch of test containers with some objects in them via
RGW/Swift (Ubuntu, RGW via Apache, Ceph Hammer 0.94.1)

Now I try to get rid of the test data.

I manually started with one container:

~/rgwtest ➜  swift -v -V 1.0 -A http://localhost:8405/auth -U <...> -K 
<...> --insecure delete test_a6b3e80c-e880-bef9-b1b5-892073e3b153

test_10
test_5
test_100
test_20
test_30

So far so good. Notice that localhost:8405 is bound by haproxy,
distributing requests to 4 RGWs on different servers, in case that is
relevant.

To make sure my script gets error handling right, I tried to delete the
same container again, leading to an error:

~/rgwtest ➜  swift -v --retries=0 -V 1.0 -A http://localhost:8405/auth 
-U <...> -K <...> --insecure delete 
test_a6b3e80c-e880-bef9-b1b5-892073e3b153

Container DELETE failed:
http://localhost:8405:8405/swift/v1/test_a6b3e80c-e880-bef9-b1b5-892073e3b153 
500 Internal Server Error   UnknownError


Stat'ing it still works:

~/rgwtest ➜  swift -v -V 1.0 -A http://localhost:8405/auth -U <...> -K 
<...> --insecure stat test_a6b3e80c-e880-bef9-b1b5-892073e3b153
  URL: 
http://localhost:8405/swift/v1/test_a6b3e80c-e880-bef9-b1b5-892073e3b153

   Auth Token: AUTH_rgwtk...
  Account: v1
Container: test_a6b3e80c-e880-bef9-b1b5-892073e3b153
  Objects: 0
Bytes: 0
 Read ACL:
Write ACL:
  Sync To:
 Sync Key:
   Server: Apache/2.4.7 (Ubuntu)
X-Container-Bytes-Used-Actual: 0
X-Storage-Policy: default-placement
 Content-Type: text/plain; charset=utf-8


Checking the RGW Logs I found this:

2016-07-25 15:21:29.751055 7fbcd67f4700  1 == starting new request 
req=0x7fbce40a1100 =
2016-07-25 15:21:29.768688 7fbcd67f4700  0 WARNING: set_req_state_err 
err_no=125 resorting to 500
2016-07-25 15:21:29.768743 7fbcd67f4700  1 == req done 
req=0x7fbce40a1100 http_status=500 ==


Googling a little I found this:

http://tracker.ceph.com/issues/14208

which mentions similar issues and an out-of-sync metadata cache between
different RGWs. I vaguely remember having seen something like this
in the Firefly timeframe before, but I am not sure if it is the same.

Where does this metadata cache live? Can it be flushed somehow without
disturbing other operations?
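
For reference, I assume the relevant knobs are rgw_cache_enabled and
rgw_cache_lru_size, which should be visible via the admin socket (a sketch;
the socket path is a guess based on our setup):

$ sudo ceph --admin-daemon /var/run/ceph-radosgw/client.radosgw.*.asok config show | grep rgw_cache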

I found this PDF

https://archive.fosdem.org/2016/schedule/event/virt_iaas_ceph_rados_gateway_overview/attachments/audio/1077/export/events/attachments/virt_iaas_ceph_rados_gateway_overview/audio/1077/Fosdem_RGW.pdf 



but without the "audio track" it doesn't really help me.

Thanks!
Daniel


--
Daniel Schneller
Principal Cloud Engineer

CenterDevice GmbH
https://www.centerdevice.de


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Pinpointing performance bottleneck / would SSD journals help?

2016-06-27 Thread Daniel Schneller

On 2016-06-27 16:01:07 +, Lionel Bouton said:


On 27/06/2016 17:42, Daniel Schneller wrote:

Hi!

* Network Link saturation.
All links / bonds are well below any relevant load (around 35MB/s or
less)

...
Are you sure? On each server you have 12 OSDs with a theoretical
bandwidth of at least half of 100MB/s (minimum bandwidth of any
reasonable HDD but halved because of the journal on the same device).
Which means your total disk bandwidth per server is 600MB/s.


Correct. However, I fear that because of lots of random IO going on,
we won't be coming anywhere near that number, esp. with 3x replication.


Bonded links are not perfect aggregation (depending on the mode one
client will either always use the same link or have its traffic
imperfectly balanced between the 2), so your theoretical network
bandwidth is probably nearest to 1Gbps (~ 120MB/s).


We use layer3+4 to spread traffic based on sources and destination
IP and port information. Benchmarks have shown that using enough
parallel streams we can saturate the full 250MB/s this ideally
produces. You are right, of course, that any single TCP connection
will never exceed 1Gbps.


What could happen is that the 35MB/s is an average over a large period
(several seconds), it's probably peaking at 120MB/s during short bursts.


That thought crossed my mind early on, too, but these values are based on
/proc/net/dev which has counters for each network device. The statistics
are gathered by checking the difference between the current sample and
the last. So this does not suffer from samples being taken at relatively
long intervals.
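
As a cross-check, the raw interface counters can also be sampled directly,
which should rule out sampling artifacts; a quick sketch (interface name is
just an example):

$ IF=bond0; R1=$(cat /sys/class/net/$IF/statistics/rx_bytes); sleep 15; \
  R2=$(cat /sys/class/net/$IF/statistics/rx_bytes); echo $(( (R2-R1)/15/1024/1024 )) MB/s rx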


I wouldn't use less than 10Gbps for both the cluster and public networks
in your case.


I whole-heartedly agree... Certainly sensible, but for now we have to make
do with the infrastructure we have. Still, based on the data we have so far,
the network at least doesn't jump out at me as a (major) contributor to the
slowness we see in this current scenario.



You didn't say how many VMs are running : the rkB/s and wkB/s seem very
low (note that for write intensive tasks your VM is reading quite a
bit...) but if you have 10 VMs or more battling for read and write
access this way it wouldn't be unexpected. As soon as latency rises for
one reason or another (here it would be network latency) you can expect
the total throughput of random accesses to plummet.


In total there are about 25 VMs, however many of them are less I/O bound
than MongoDB and Elasticsearch.  As for the comparatively high read load,
I agree, but I cannot really explain that in detail at the moment.

In general I would be very much interested in diagnosing the underlying
bare metal layer without making too many assumptions about what clients
are actually doing. In this case we can look into the VMs, but in general
it would be ideal to pinpoint a bottleneck on the "lower" levels. Any
improvements there would be beneficial to all client software.

Cheers,
Daniel


--
Daniel Schneller
Principal Cloud Engineer

CenterDevice GmbH
https://www.centerdevice.de


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Pinpointing performance bottleneck / would SSD journals help?

2016-06-27 Thread Daniel Schneller

Hi!

We are currently trying to pinpoint a bottleneck and are somewhat stuck.

First things first, this is the hardware setup:

4x DELL PowerEdge R510, 12x4TB OSD HDDs, journal colocated on HDD
  96GB RAM, 2x6 Cores + HT
2x1GbE bonded interfaces for Cluster Network
2x1GbE bonded interfaces for Public Network
Ceph Hammer on Ubuntu 14.04

6 OpenStack Compute Nodes with all-RBD VMs (no ephemeral storage).

The VMs run a variety of stuff, most notably MongoDB, Elasticsearch
and our custom software, which uses both the VMs' virtual disks as
well as the Rados Gateway for object storage.

Recently, under certain more write-intensive conditions, we see reads and
overall system performance starting to suffer as well.

Here is an iostat -x 3 sample for one of the VMs hosting MongoDB.
Notice the "await" times (vda is the root, vdb is the data volume).


Linux 3.13.0-35-generic (node02)06/24/2016  _x86_64_(16 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
  1.550.000.440.420.00   97.59

Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s 
avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vda   0.00 0.910.091.01 2.55 9.59
22.12 0.01  266.90 2120.51   98.59   4.76   0.52
vdb   0.00 1.53   18.39   40.79   405.98   483.92
30.07 0.305.685.425.80   3.96  23.43


avg-cpu:  %user   %nice %system %iowait  %steal   %idle
  5.050.002.083.160.00   89.71

Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s 
avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vda   0.00 0.000.000.00 0.00 0.00 
0.00 0.000.000.000.00   0.00   0.00
vdb   0.00 7.00   23.00   29.00   368.00   500.00
33.38 1.91  446.00  422.26  464.83  19.08  99.20


avg-cpu:  %user   %nice %system %iowait  %steal   %idle
  4.430.001.734.940.00   88.90

Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s 
avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vda   0.00 0.000.000.00 0.00 0.00 
0.00 0.000.000.000.00   0.00   0.00
vdb   0.0013.00   45.00   83.00   712.00  1041.00
27.39 2.54 1383.25  272.18 1985.64   7.50  96.00



If we read this right, the average time spent waiting for read or write
requests to be serviced can be multi-second. This would go in line with
MongoDB's slow log, where we see fully indexed queries, returning a
single result, taking over a second, where they would normally be finished
quasi instantly.

So far we have looked at these metrics (using StackExchange's Bosun
from https://bosun.org). Most values are collected every 15 seconds.

* Network Link saturation.
 All links / bonds are well below any relevant load (around 35MB/s or
 less)

* Storage Node RAM
 At least 3GB reported "free", between 50GB and 70GB as cached.

* Storage node CPU.
 Hardly above 30%

* # of ios in progress per OSD (as per /proc/diskstats)
 These reach values of up to 180.



Bosun collects the raw data for these metrics (and lots of others)
every 15 seconds.

We have a suspicion the spinners are the culprit here, but to verify
this and to be able to convince the upper layers of company leadership
to invest in some SSDs for journals, we need better evidence; apart
from the personal desire to understand exactly what's going on here :)

Regardless of the VMs on top (which could be any client, as I see it)
which metrics would I have to collect/look at to verify/reject the
assumption that we are limited by our pure HDD setup?
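
Would something along these lines be the right direction (just a sketch)?

$ ceph osd perf          # per-OSD fs_commit_latency / fs_apply_latency
$ iostat -x 5 sdb sdc    # await / %util of the spinners on a storage node (device names are examples)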


Thanks a lot!

Daniel


--
Daniel Schneller
Principal Cloud Engineer

CenterDevice GmbH
https://www.centerdevice.de


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to run multiple RadosGW instances under the same zone

2016-01-04 Thread Daniel Schneller

On 2016-01-04 10:37:43 +, Srinivasula Maram said:


Hi Joseph,
 
You can try haproxy as proxy for load balancing and failover.
 
Thanks,
Srinivas 


We have 6 hosts running RadosGW with haproxy in front of them without problems.
Depending on your setup you might even consider running haproxy locally 
on your application servers, so that your application always connects 
to localhost. This saves you from having to set up highly available 
load balancers. It's strongly recommended, of course, to use some kind 
of automatic provisioning (Ansible, Puppet etc.) to roll out identical 
haproxy configuration on all these machines. 
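
A minimal local configuration could look roughly like this (a sketch only;
hostnames, ports and timeouts are placeholders, not our actual setup):

  defaults
      mode http
      timeout connect 5s
      timeout client  30s
      timeout server  30s

  frontend rgw_local
      bind 127.0.0.1:8405
      default_backend rgw

  backend rgw
      balance roundrobin
      option httpchk GET /
      server rgw1 rgw1.example.com:80 check
      server rgw2 rgw2.example.com:80 check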





--
Daniel Schneller
Principal Cloud Engineer

CenterDevice GmbH
https://www.centerdevice.de

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Namespaces and authentication

2015-11-30 Thread Daniel Schneller

Hi!

On 
http://docs.ceph.com/docs/master/rados/operations/user-management/#namespace
I read about auth namespaces. According to the most recent
documentation they are still not supported by any of the client libraries,
in particular not by rbd.


I have a client asking to get access to rbd volumes for Kubernetes 
(http://kubernetes.io/v1.1/docs/user-guide/volumes.html#rbd). Due to 
the dynamic nature of the environment, I would like to grant them 
access to a dedicated pool where they could create volumes on their 
own. Different ceph secrets should be used for different volumes, so 
that they can hand out different secrets to different tenants in their 
environment to only give them access to their respective volumes.


Is there any way to do that yet? Are there plans on extending the 
namespace support beyond the current state?


Of course, I would be open to suggestions on how to do it differently, 
too, in case I am overlooking something obvious.


Main requirements are
a) client admin can create new rbd volumes in a dedicated pool, 
b) client admin can limit access to a volume to a specific user/secret.
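
For completeness: requirement a) alone seems doable today with pool-level
caps, along these lines (a sketch; pool and client names are placeholders):

$ ceph osd pool create kube-volumes 128
$ ceph auth get-or-create client.kube-admin mon 'allow r' osd 'allow rwx pool=kube-volumes'

It is requirement b) -- per-volume rather than per-pool granularity -- that I
do not see how to express without namespace support in rbd.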


Thanks!
Daniel


--
Daniel Schneller
Principal Cloud Engineer

CenterDevice GmbH
https://www.centerdevice.de

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Number of buckets per user

2015-11-11 Thread Daniel Schneller

Hi!

Maybe I am missing something obvious, but is there no way to quickly 
tell how many buckets an RGW user has? I can see the max_buckets limit 
in radosgw-admin user info --uid=x, but nothing about how much of that 
limit has been used.


To be clear: I do not care what they are called, or what is in them, 
just the count.


Is that something the RGW maintains for cheap queries?
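
The only approach I can think of is counting the output of the bucket
listing, which seems wasteful for users with many buckets (a sketch; jq is
just an assumption about available tooling):

$ radosgw-admin bucket list --uid=x | jq length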

Thanks,
Daniel


--
Daniel Schneller
Principal Cloud Engineer

CenterDevice GmbH
https://www.centerdevice.de

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Creating RGW Zone System Users Fails with "couldn't init storage provider"

2015-11-05 Thread Daniel Schneller

Bump... :)

On 2015-11-02 15:52:44 +, Daniel Schneller said:


Hi!


I am trying to set up a Rados Gateway, prepared for multiple regions
and zones, according to the documentation on
http://docs.ceph.com/docs/hammer/radosgw/federated-config/.
Ceph version is 0.94.3 (Hammer).

I am stuck at the "Create zone users" step
(http://docs.ceph.com/docs/hammer/radosgw/federated-config/#create-zone-users).


Running the user create command I get this:

$ sudo radosgw-admin user create --uid="eu-zone1"
--display-name="Region-EU Zone-zone1" --client-id
client.radosgw.eu-zone1-1 --system
couldn't init storage provider
$ echo $?
5



I have found this in a Documentation bug ticket, but unfortunately
there is no indication of what was actually going on there:
http://tracker.ceph.com/issues/10848#note-21

I am at a loss, I have even tried to figure out what was going on via
reading the rgw-admin source, but I could not find any strong hints.

Ideas?

Thanks,
Daniel


Find all relevant(?) bits of configuration below:


Ceph.conf has this for the RGW instances:


[client.radosgw.eu-zone1-1]
  host = dec-b1-d7-73-f0-04
  admin socket = /var/run/ceph-radosgw/client.radosgw.dec-b1-d7-73-f0-04.asok
  pid file = /var/run/ceph-radosgw/$name.pid
  rgw region = eu
  rgw region root pool = .eu.rgw.root
  rgw zone = eu-zone1
  rgw zone root pool = .eu-zone1.rgw.root
  rgw_print_continue = false
  keyring = /etc/ceph/ceph.client.radosgw.keyring
  rgw_socket_path = /var/run/ceph-radosgw/client.radosgw.eu-zone1-1.sock
  log_file = /var/log/radosgw/radosgw.log
  rgw_enable_ops_log = false
  rgw_gc_max_objs = 31
  rgw_frontends = fastcgi
  debug_rgw = 20


Keyring:
[client.radosgw.eu-zone1-1]
key = 
caps mon = "allow rwx"
caps osd = "allow rwx"


ceph auth list has the same key and these caps:

client.radosgw.eu-zone1-1
key: 
caps: [mon] allow rwx
caps: [osd] allow rwx



I have followed the instructions on that page and have created Region
and Zone configurations as follows:



{ "name": "eu",
  "api_name": "eu",
  "is_master": "true",
  "endpoints": [
"https:\/\/rgw-eu-zone1.mydomain.net:443\/",
"http:\/\/rgw-eu-zone1.mydomain.net:80\/"],
  "master_zone": "eu-zone1",
  "zones": [
{ "name": "eu-zone1",
  "endpoints": [
"https:\/\/rgw-eu-zone1.mydomain.net:443\/",
"http:\/\/rgw-eu-zone1.mydomain.net:80\/"],
  "log_meta": "true",
  "log_data": "true"}
  ],
  "placement_targets": [
   {
 "name": "default-placement",
 "tags": []
   }
  ],
  "default_placement": "default-placement"}



{ "domain_root": ".eu-zone1.domain.rgw",
  "control_pool": ".eu-zone1.rgw.control",
  "gc_pool": ".eu-zone1.rgw.gc",
  "log_pool": ".eu-zone1.log",
  "intent_log_pool": ".eu-zone1.intent-log",
  "usage_log_pool": ".eu-zone1.usage",
  "user_keys_pool": ".eu-zone1.users",
  "user_email_pool": ".eu-zone1.users.email",
  "user_swift_pool": ".eu-zone1.users.swift",
  "user_uid_pool": ".eu-zone1.users.uid",
  "system_key": { "access_key": "", "secret_key": ""},
  "placement_pools": [
{ "key": "default-placement",
  "val": { "index_pool": ".eu-zone1.rgw.buckets.index",
   "data_pool": ".eu-zone1.rgw.buckets"}
}
  ]
}


These pools are defined:

rbd
images
volumes
.eu-zone1.rgw.root
.eu-zone1.rgw.control
.eu-zone1.rgw.gc
.eu-zone1.rgw.buckets
.eu-zone1.rgw.buckets.index
.eu-zone1.rgw.buckets.extra
.eu-zone1.log
.eu-zone1.intent-log
.eu-zone1.usage
.eu-zone1.users
.eu-zone1.users.email
.eu-zone1.users.swift
.eu-zone1.users.uid
.eu.rgw.root
.eu-zone1.domain.rgw
.rgw
.rgw.root
.rgw.gc
.users.uid
.users
.rgw.control
.log
.intent-log
.usage
.users.email
.users.swift



--
--
Daniel Schneller
Principal Cloud Engineer

CenterDevice GmbH
https://www.centerdevice.de


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Creating RGW Zone System Users Fails with "couldn't init storage provider"

2015-11-05 Thread Daniel Schneller

On 2015-11-05 12:16:35 +, Wido den Hollander said:



This is usually when keys aren't set up properly. Are you sure that the
cephx keys you are using are correct and that you can connect to the
Ceph cluster?

Wido


Yes, I could execute all kinds of commands. However, it turns out I
might have seen the effects of some non-obvious behavior:

What we noticed was that whatever is used as an argument to --client-id
(tried with completely random crap), we could successfully execute
commands!

E. g.

$ sudo radosgw-admin zone list --client-id blablabla

would get results back just fine, which took me very much by surprise.


Turns out, if you read `man ceph` closely, "--client-id" is not even a
valid parameter! Trying it with e. g. "ceph -s" will tell you that
immediately:

$ sudo ceph --client-id blablabla -s
Invalid command:  unused arguments: ['--client-id', 'blablabla']
...

On the other hand, radosgw-admin doesn't:

$ sudo radosgw-admin user info --uid=someuser --client-id blablabla
{ results }


Apparently, radosgw-admin silently swallows unknown arguments. It just
uses the admin key, which I could see by running this as an
unprivileged user without sudo:

$ radosgw-admin user info --uid=someuser --client-id blablabla
2015-11-05 14:47:30.079318 7fc4dd104900 -1 monclient(hunting): ERROR:
missing keyring, cannot use cephx for authentication
couldn't init storage provider
2015-11-05 14:47:30.079323 7fc4dd104900  0 librados: client.admin
initialization error (2) No such file or directory

The unknown --client-id argument gets dropped and it tries to use the
admin keyring, which it is not allowed to access without sudo.


I still do not know exactly why this did not help me originally,
because it should just have created the user using the admin key. So it
is not exactly clear what was going on then. Nevertheless, the user exists
now, so it might remain a mystery...


In any case, making radosgw-admin at least _inform_ about unknown
arguments might be a better idea than just silently ignoring them.
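
For the record, the spellings the tools do accept appear to be --name/-n
(with the "client." prefix) or --id/-i (without it), i.e. presumably what I
should have used:

$ sudo radosgw-admin user info --uid=someuser -n client.radosgw.eu-zone1-1
$ sudo radosgw-admin user info --uid=someuser --id radosgw.eu-zone1-1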


Thanks!
Daniel


--
Daniel Schneller
Principal Cloud Engineer

CenterDevice GmbH
https://www.centerdevice.de


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] One object in .rgw.buckets.index causes systemic instability

2015-11-04 Thread Daniel Schneller
We had a similar issue in Firefly, where we had a very large number
(about 1.5 million) of buckets for a single RGW user. We observed a
number of slow requests in day-to-day use, but did not think much of it
at the time.


At one point the primary OSD managing the list of buckets for that user
crashed and could not restart, because processing the tremendous amount
of buckets on startup -- which also seemed to be single-threaded,
judging by the 100% CPU usage we could see -- took longer than the
suicide timeout. That led to this OSD crashing again and again.
Eventually, it would be marked out and the secondary tried to process
the list with the same result, leading to a cascading failure.


While I am quite certain it is a different code path in your case (you
speak about a handful of buckets), it certainly sounds like a very
similar issue. Do you have lots of objects in those few buckets, or are
they few but large in size, to reach the 30TB? Worst case you might be
in for a similar procedure as we had to take: take load off the
cluster, increase the timeouts to ridiculous levels and copy the data
over into a more evenly distributed set of buckets (users in our case).
Fortunately, as long as we did not try to write to the problematic
buckets, we could still read from them.
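
The timeouts in question should be the OSD and filestore op thread timeouts;
the knobs are along these lines (a sketch -- values are placeholders, not a
recommendation):

[osd]
  osd op thread timeout = 600
  osd op thread suicide timeout = 3000
  filestore op thread timeout = 600
  filestore op thread suicide timeout = 3000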


Please notice that this is only a guess, I could be completely wrong.

Daniel

On 2015-11-03 13:33:19 +, Gerd Jakobovitsch said:


Dear all,

I have a cluster running hammer (0.94.5), with 5 nodes. The main usage 
is for S3-compatible object storage.
I am getting to a very troublesome problem at a ceph cluster. A single 
object in the .rgw.buckets.index is not responding to request and takes 
a very long time while recovering after an osd restart. During this 
time, the OSDs where this object is mapped got heavily loaded, with 
high cpu as well as memory usage. At the same time, the directory 
/var/lib/ceph/osd/ceph-XX/current/omap gets a large number of entries ( 
> 1), that won't decrease.


Very frequently, I get >100 blocked requests for this object, and the 
main OSD that stores it ends up accepting no other requests. Very 
frequently the OSD ends up crashing due to filestore timeout, and 
getting it up again is very troublesome - it usually has to run alone 
in the node for a long time, until the object gets recovered, somehow.


At the OSD logs, there are several entries like these:
 -7051> 2015-11-03 10:46:08.339283 7f776974f700 10 log_client  logged 
2015-11-03 10:46:02.942023 osd.63 10.17.0.9:6857/2002 41 : cluster 
[WRN] slow re
quest 120.003081 seconds old, received at 2015-11-03 10:43:56.472825: 
osd_repop(osd.53.236531:7 34.7 
8a7482ff/.dir.default.198764998.1/head//34 v 2369

84'22) currently commit_sent


2015-11-03 10:28:32.405265 7f0035982700  0 log_channel(cluster) log 
[WRN] : 97 slow requests, 1 included below; oldest blocked for > 
2046.502848 secs
2015-11-03 10:28:32.405269 7f0035982700  0 log_channel(cluster) log 
[WRN] : slow request 1920.676998 seconds old, received at 2015-11-03 
09:56:31.7282
24: osd_op(client.210508702.0:14696798 .dir.default.198764998.1 [call 
rgw.bucket_prepare_op] 15.8a7482ff ondisk+write+known_if_redirected 
e236956) cur

rently waiting for blocked object

Is there any way to go deeper into this problem, or to rebuild the .rgw
index without losing data? I currently have 30 TB of data in the
cluster - most of it concentrated in a handful of buckets - that I
can't lose.


Regards.


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



--
--
Daniel Schneller
Principal Cloud Engineer

CenterDevice GmbH
https://www.centerdevice.de

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Creating RGW Zone System Users Fails with "couldn't init storage provider"

2015-11-02 Thread Daniel Schneller

Hi!


I am trying to set up a Rados Gateway, prepared for multiple regions 
and zones, according to the documentation on
http://docs.ceph.com/docs/hammer/radosgw/federated-config/.

Ceph version is 0.94.3 (Hammer).

I am stuck at the "Create zone users" step 
(http://docs.ceph.com/docs/hammer/radosgw/federated-config/#create-zone-users). 



Running the user create command I get this:

$ sudo radosgw-admin user create --uid="eu-zone1" 
--display-name="Region-EU Zone-zone1" --client-id 
client.radosgw.eu-zone1-1 --system

couldn't init storage provider
$ echo $?
5
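
I assume the first thing people will ask is whether plain RADOS access works
with the same credentials; a check along these lines (names and paths as in
the configuration further down):

$ sudo rados -n client.radosgw.eu-zone1-1 --keyring /etc/ceph/ceph.client.radosgw.keyring lspools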



I have found this in a Documentation bug ticket, but unfortunately 
there is no indication of what was actually going on there: 
http://tracker.ceph.com/issues/10848#note-21


I am at a loss, I have even tried to figure out what was going on via 
reading the rgw-admin source, but I could not find any strong hints.


Ideas?

Thanks,
Daniel


Find all relevant(?) bits of configuration below:


Ceph.conf has this for the RGW instances:


[client.radosgw.eu-zone1-1]
 host = dec-b1-d7-73-f0-04
 admin socket = /var/run/ceph-radosgw/client.radosgw.dec-b1-d7-73-f0-04.asok
 pid file = /var/run/ceph-radosgw/$name.pid
 rgw region = eu
 rgw region root pool = .eu.rgw.root
 rgw zone = eu-zone1
 rgw zone root pool = .eu-zone1.rgw.root
 rgw_print_continue = false
 keyring = /etc/ceph/ceph.client.radosgw.keyring
 rgw_socket_path = /var/run/ceph-radosgw/client.radosgw.eu-zone1-1.sock
 log_file = /var/log/radosgw/radosgw.log
 rgw_enable_ops_log = false
 rgw_gc_max_objs = 31
 rgw_frontends = fastcgi
 debug_rgw = 20


Keyring:
[client.radosgw.eu-zone1-1]
   key = 
   caps mon = "allow rwx"
   caps osd = "allow rwx"


ceph auth list has the same key and these caps:

client.radosgw.eu-zone1-1
key: 
caps: [mon] allow rwx
caps: [osd] allow rwx



I have followed the instructions on that page and have created Region 
and Zone configurations as follows:




{ "name": "eu",
 "api_name": "eu",
 "is_master": "true",
 "endpoints": [
   "https:\/\/rgw-eu-zone1.mydomain.net:443\/",
   "http:\/\/rgw-eu-zone1.mydomain.net:80\/"],
 "master_zone": "eu-zone1",
 "zones": [
   { "name": "eu-zone1",
 "endpoints": [
   "https:\/\/rgw-eu-zone1.mydomain.net:443\/",
   "http:\/\/rgw-eu-zone1.mydomain.net:80\/"],
 "log_meta": "true",
 "log_data": "true"}
 ],
 "placement_targets": [
  {
"name": "default-placement",
"tags": []
  }
 ],
 "default_placement": "default-placement"}



{ "domain_root": ".eu-zone1.domain.rgw",
 "control_pool": ".eu-zone1.rgw.control",
 "gc_pool": ".eu-zone1.rgw.gc",
 "log_pool": ".eu-zone1.log",
 "intent_log_pool": ".eu-zone1.intent-log",
 "usage_log_pool": ".eu-zone1.usage",
 "user_keys_pool": ".eu-zone1.users",
 "user_email_pool": ".eu-zone1.users.email",
 "user_swift_pool": ".eu-zone1.users.swift",
 "user_uid_pool": ".eu-zone1.users.uid",
 "system_key": { "access_key": "", "secret_key": ""},
 "placement_pools": [
   { "key": "default-placement",
 "val": { "index_pool": ".eu-zone1.rgw.buckets.index",
  "data_pool": ".eu-zone1.rgw.buckets"}
   }
 ]
}


These pools are defined:

rbd
images
volumes
.eu-zone1.rgw.root
.eu-zone1.rgw.control
.eu-zone1.rgw.gc
.eu-zone1.rgw.buckets
.eu-zone1.rgw.buckets.index
.eu-zone1.rgw.buckets.extra
.eu-zone1.log
.eu-zone1.intent-log
.eu-zone1.usage
.eu-zone1.users
.eu-zone1.users.email
.eu-zone1.users.swift
.eu-zone1.users.uid
.eu.rgw.root
.eu-zone1.domain.rgw
.rgw
.rgw.root
.rgw.gc
.users.uid
.users
.rgw.control
.log
.intent-log
.usage
.users.email
.users.swift



--
Daniel Schneller
Principal Cloud Engineer

CenterDevice GmbH
https://www.centerdevice.de


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Problems to expect with newer point release rgw vs. older MONs/OSDs

2015-07-08 Thread Daniel Schneller

On 2015-07-08 10:34:14 +, Wido den Hollander said:




On 08-07-15 12:20, Daniel Schneller wrote:

Hi!

Just a quick question regarding mixed versions. So far a cluster is
running on 0.94.1-1trusty without Rados Gateway. Since the packages have
been updated in the meantime, installing radosgw now would entail
bringing a few updated dependencies along. OSDs and MONs on the nodes
that are to become Rados Gateways would not automatically be upgraded, too.

Is that a safe setup, or do I need to upgrade the whole cluster to the
same point release?



That's safe. It's not required that the whole cluster runs the same version.

However, 94.2 fixes some bugs, so I would recommend that you upgrade the
cluster anyway. It can be done in a rolling fashion.

Wido


Understood. We are planning to upgrade to 0.94.3 on everything once
that becomes available. In the meantime we decided to install rgw 0.94.1,
just to have less stuff to track in our heads and because we know 0.94.1
works for our current use case in another cluster.

However, just now I tried this without success:

[C|daniel.schneller@node01]  ~ ➜  apt-get install --dry-run
radosgw=0.94.1-1trusty
...
E: Version '0.94.1-1trusty' for 'radosgw' was not found

[C|daniel.schneller@node01]  ~ ➜  apt-cache policy radosgw
radosgw:
 Installed: (none)
 Candidate: 0.94.2-1trusty
 Version table:
0.94.2-1trusty 0
   999 http://ceph.com/debian-hammer/ trusty/main amd64 Packages
0.80.9-0ubuntu0.14.04.2 0
   500 http://archive.ubuntu.com/ubuntu/ trusty-updates/main amd64
Packages
0.79-0ubuntu1 0
   500 http://archive.ubuntu.com/ubuntu/ trusty/main amd64 Packages

It seems the repo does not offer anything but the most recent version?
Am I missing anything?

Daniel


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Problems to expect with newer point release rgw vs. older MONs/OSDs

2015-07-08 Thread Daniel Schneller

Hi!

Just a quick question regarding mixed versions. So far a cluster is 
running on 0.94.1-1trusty without Rados Gateway. Since the packages have
been updated in the meantime, installing radosgw now would entail 
bringing a few updated dependencies along. OSDs and MONs on the nodes 
that are to become Rados Gateways would not automatically be upgraded, 
too.


Is that a safe setup, or do I need to upgrade the whole cluster to the 
same point release?


Thanks,
Daniel

[C|daniel.schneller@node01]  ~ ➜  apt-get install --dry-run radosgw
NOTE: This is only a simulation!
 apt-get needs root privileges for real execution.
 Keep also in mind that locking is deactivated,
 so don't depend on the relevance to the real current situation!
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following extra packages will be installed:
 ceph-common libfcgi0ldbl librados2 libradosstriper1 librbd1 python-cephfs
 python-rados python-rbd
The following NEW packages will be installed:
 libfcgi0ldbl radosgw
The following packages will be upgraded:
 ceph-common librados2 libradosstriper1 librbd1 python-cephfs python-rados
 python-rbd
7 upgraded, 2 newly installed, 0 to remove and 206 not upgraded.
Inst libradosstriper1 [0.94.1-1trusty] (0.94.2-1trusty stable [amd64]) []
Inst ceph-common [0.94.1-1trusty] (0.94.2-1trusty stable [amd64]) []
Inst librbd1 [0.94.1-1trusty] (0.94.2-1trusty stable [amd64]) []
Inst librados2 [0.94.1-1trusty] (0.94.2-1trusty stable [amd64]) []
Inst python-rados [0.94.1-1trusty] (0.94.2-1trusty stable [amd64]) []
Inst python-cephfs [0.94.1-1trusty] (0.94.2-1trusty stable [amd64]) []
Inst python-rbd [0.94.1-1trusty] (0.94.2-1trusty stable [amd64])
Inst libfcgi0ldbl (2.4.0-8.1ubuntu5 Ubuntu:14.04/trusty [amd64])
Inst radosgw (0.94.2-1trusty stable [amd64])
Conf librados2 (0.94.2-1trusty stable [amd64])
Conf libradosstriper1 (0.94.2-1trusty stable [amd64])
Conf librbd1 (0.94.2-1trusty stable [amd64])
Conf python-rados (0.94.2-1trusty stable [amd64])
Conf python-cephfs (0.94.2-1trusty stable [amd64])
Conf python-rbd (0.94.2-1trusty stable [amd64])
Conf ceph-common (0.94.2-1trusty stable [amd64])
Conf libfcgi0ldbl (2.4.0-8.1ubuntu5 Ubuntu:14.04/trusty [amd64])
Conf radosgw (0.94.2-1trusty stable [amd64])



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Node reboot -- OSDs not logging off from cluster

2015-07-07 Thread Daniel Schneller

On 2015-07-03 01:31:35 +, Johannes Formann said:


Hi,


When rebooting one of the nodes (e. g. for a kernel upgrade) the OSDs
do not seem to shut down correctly. Clients hang and ceph osd tree show
the OSDs of that node still up. Repeated runs of ceph osd tree show
them going down after a while. For instance, here OSD.7 is still up,
even though the machine is in the middle of the reboot cycle.


...


Any ideas as to what is causing this or how to diagnose this?


I see this behavior (only) when I reboot a ceph-node with a monitor and OSDs.
I guess somehow this relates. (OSD-messages getting lost due to the 
„failing“ mon)


Sorry for being silent for a few days, other things kept me busy.
Indeed, this is an interesting thought. We do have MONs running on three of
our storage nodes. I need to verify if the one where I saw the problem
is one of them, but with 5 total, there is a more than 50% chance ;)

Can anyone tell me which log levels on the MONs and/or OSDs I might want
to change to track whether the shutdown notifications are actually received
by the monitors or where they get lost?
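
I.e., unless there is a better way, I would probably bump these temporarily
(a guess on my part, so corrections are welcome):

$ ceph tell mon.node03 injectargs '--debug-mon 10 --debug-ms 1'   # per monitor
$ ceph tell osd.* injectargs '--debug-osd 10 --debug-ms 1'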

Regards,
Daniel


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Node reboot -- OSDs not logging off from cluster

2015-06-30 Thread Daniel Schneller

Hi!

We are seeing a strange - and problematic - behavior in our 0.94.1
cluster on Ubuntu 14.04.1. We have 5 nodes, 4 OSDs each.

When rebooting one of the nodes (e. g. for a kernel upgrade) the OSDs
do not seem to shut down correctly. Clients hang and ceph osd tree show
the OSDs of that node still up. Repeated runs of ceph osd tree show
them going down after a while. For instance, here OSD.7 is still up,
even though the machine is in the middle of the reboot cycle.

[C|root@control01]  ~ ➜  ceph osd tree
# idweight  type name   up/down reweight
-1  36.2root default
-2  7.24host node01
0   1.81osd.0   up  1   
5   1.81osd.5   up  1   
10  1.81osd.10  up  1   
15  1.81osd.15  up  1   
-3  7.24host node02
1   1.81osd.1   up  1   
6   1.81osd.6   up  1   
11  1.81osd.11  up  1   
16  1.81osd.16  up  1   
-4  7.24host node03
2   1.81osd.2   down1   
7   1.81osd.7   up  1   
12  1.81osd.12  down1   
17  1.81osd.17  down1   
-5  7.24host node04
3   1.81osd.3   up  1   
8   1.81osd.8   up  1   
13  1.81osd.13  up  1   
18  1.81osd.18  up  1   
-6  7.24host node05
4   1.81osd.4   up  1   
9   1.81osd.9   up  1   
14  1.81osd.14  up  1   
19  1.81osd.19  up  1

So it seems, the services are either not shut down correctly when the
reboot begins, or they do not get enough time to actually let the
cluster know they are going away.

If I stop the OSDs on that node manually before the reboot, everything
works as expected and clients don't notice any interruptions.

[C|root@node03]  ~ ➜  service ceph-osd stop id=2
ceph-osd stop/waiting
[C|root@node03]  ~ ➜  service ceph-osd stop id=7
ceph-osd stop/waiting
[C|root@node03]  ~ ➜  service ceph-osd stop id=12
ceph-osd stop/waiting
[C|root@node03]  ~ ➜  service ceph-osd stop id=17
ceph-osd stop/waiting
[C|root@node03]  ~ ➜  reboot

The upstart file was not changed from the packaged version.
Interestingly, the same Ceph version on a different cluster does _not_
show this behaviour.

Any ideas as to what is causing this or how to diagnose this?

Cheers,
Daniel


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Unexpected period of iowait, no obvious activity?

2015-06-23 Thread Daniel Schneller

 On 23.06.2015, at 14:13, Gregory Farnum g...@gregs42.com wrote:
 
 ...
 On the other hand, there are lots of administrative tasks that can run
 and do something like this. The CERN guys had a lot of trouble with
 some daemon which wanted to scan the OSD's entire store for tracking
 changes, and was installed by their standard Ubuntu deployment.

Thanks! Good hint. I will look into that.
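
My first suspect along those lines would be mlocate's daily updatedb run;
checking whether its timestamps line up with the iowait window should be
quick (a guess on my part, not verified yet):

$ ls -l /var/lib/mlocate/mlocate.db
$ ls /etc/cron.daily/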

Daniel

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Very chatty MON logs: Is this normal?

2015-06-19 Thread Daniel Schneller

On 2015-06-18 09:53:54 +, Joao Eduardo Luis said:


Setting 'mon debug = 0/5' should be okay.  Unless you see that setting
'/5' impacts your performance and/or memory consumption, you should
leave that be.  '0/5' means 'output only debug 0 or lower to the logs;
keep the last 1000 debug level 5 or lower in memory in case of a crash'.
Your logs will not be as heavily populated but, if for some reason the
daemon crashes, you get quite a few of debug information to help track
down the source of the problem.


Great, will do.

Just for my understanding regarding memory: if this is a ring
buffer for the last 1000 events, shouldn't that be a somewhat fixed amount
of memory? How would it negatively affect the MON's consumption? Assuming
it works that way, once they have been running for a few days or weeks,
these buffers would be full of events anyway, just more aged ones if
the in-memory log level was lower?

Daniel


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Unexpected period of iowait, no obvious activity?

2015-06-19 Thread Daniel Schneller

Hi!

Recently over a few hours our 4 Ceph disk nodes showed unusually high
and somewhat constant iowait times. Cluster runs 0.94.1 on Ubuntu
14.04.1.

It started on one node, then - with maybe 15 minutes delay each - on the
next and the next one. Overall duration of the phenomenon was about 90
minutes on each machine, finishing in the same order they had started.

We could not see any obvious cluster activity during that time,
applications did not do anything out of the ordinary. Scrubbing and deep
scrubbing were turned off long before this happened.

We are using CephFS for shared administrator home directories on the
system, RBD volumes for OpenStack and the Rados Gateway to manage
application data via the Swift interface. Telemetry and logs from inside
the VMs did not offer an explanation either.

The fact that these readings were limited to OSD hosts, but none of the
other (client) nodes in the system, suggests this must be some kind of
Ceph behaviour. Any ideas? We would like to understand what the system
was doing, but haven't found anything obvious in the logs.

Thanks!
Daniel


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Very chatty MON logs: Is this normal?

2015-06-17 Thread Daniel Schneller

On 2015-06-17 18:52:51 +, Somnath Roy said:


This is presently written from log level 1 onwards :-)
So, only log level 0 will not log this..
Try, 'debug_mon = 0/0' in the conf file..


Yeah, once I had sent the mail I realized that the 1 in the log line was
the level. I had overlooked that before.
However, I'd rather not set the level to 0/0, as that would disable all
logging from the MONs.


Now, I don't have enough knowledge on that part to say whether it is 
important enough to log at log level 1 , sorry :-(


That would indeed be interesting to know.
Judging from the sheer amount, at least I have my doubts, because the
cluster seems to be running without any issues. So I figure at least it
isn't indicative of an immediate problem.


Anyone with a little more definitve knowledge around? Should I create a 
bug ticket for this?


Cheers,
Daniel


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OSD crashing over and over, taking cluster down

2015-05-19 Thread Daniel Schneller
 7f7aed260700 20 osd.5 pg_epoch: 34697 pg[81.1f9( v 
34697'23687 (34295'20610,34697'23687] local-les=34670 n=12229 ec=16487 les/c 
34670/34670 34669/34669/34645) [5,41,17] r=0 lpr=34669 crt=34688'23683 lcod 
34688'23684 mlcod 34688'23684 active+clean]  snapset_obc 
obc(ce36d9f9/default.139790885.16459__shadow_.B5eeIJm5n8dpsjn-4q5gXmHr4mIcVS1_5/snapdir//81
 rwstate(write n=1 w=0))
We cannot pinpoint an exact trigger, but there _seems_ to be some
correlation with larger uploads into the RGW. This is not yet completely
validated, but merely a timing-related assumption. Could it be that the RGW
code causes OSDs to fail, either with the big upload alone or in conjunction
with other parallel requests? We are seeing more crashes, though, than large
uploads, so this remains a guess at best.
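
If more detailed logs would help, we can temporarily raise the debug levels
on an affected OSD, presumably along these lines (happy to be corrected on
the exact subsystems):

$ ceph tell osd.5 injectargs '--debug-osd 20 --debug-filestore 10 --debug-ms 1'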

We created bug ticket http://tracker.ceph.com/issues/11677 for this, but
would be extremely thankful for some quicker help.
I am in the #ceph IRC channel as dschneller, too.

Thanks!
Daniel


-- 
Daniel Schneller
Infrastructure Engineer / Developer
 
CenterDevice GmbH  | Merscheider Straße 1
   | 42699 Solingen
tel: +49 1754155711| Deutschland
daniel.schnel...@centerdevice.de   | www.centerdevice.de




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD images -- parent snapshot missing (help!)

2015-05-16 Thread Daniel Schneller

On 2015-05-16 04:13:57 +, Tuomas Juntunen said:


Hey Pavel

Could you share your C program and the process how you were able to fix 
the images.


Thanks

Tuomas


Pavel,

That would indeed be invaluable!

Thank you very much in advance!
Daniel


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RadosGW User Limit?

2015-05-15 Thread Daniel Schneller

Hello!

I am wondering if there is a limit to the number of (Swift) users that 
should be observed when using RadosGW.
For example, if I were to offer storage via S3 or Swift APIs with Ceph 
and RGW as the backing implementation and people could just sign up 
through some kind of public website, need I watch the number of users 
created?
Would a few thousand / ten thousand / hundred thousand users cause
trouble, or is the system designed (and hopefully tested ;)) to handle
this? I would certainly hope so, because otherwise there would be a
natural limit to how much data you could store in any cluster, not
determined by the cluster size itself.


If there are caveats, what would they be and when would I expect them?

If it matters for this: Hammer 0.94.1.

Thanks for any insight!

Daniel


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Deleting RGW Users

2015-05-15 Thread Daniel Schneller

Hello!

In our cluster we had a nasty problem recently due to a very large 
number of buckets for a single RadosGW user.
The bucket limit was disabled earlier, and the number of buckets grew 
to the point where OSDs started to go down due to excessive access 
times, missed heartbeats etc.


We have since rectified that problem by first raising the relevant 
timeouts to near ridiculous levels so we could get the system to 
respond again and by copying all data from that single user to a few 
hundred new users. Of course, the old gigantic user is still around. 
Not sure if this is relevant, but we also have quite a few snapshots on 
the rgw pools.


We are now hesitant to delete the problematic user, because we're not 
sure how this is implemented. Will deleting the user iterate its 
buckets and delete those one by one? If so, we would be in trouble, 
because anything but reading from that users' buckets is a good way to 
get processes to crash / timeout again. If it does it at a lower level, 
do we need to expect the snapshots to cause trouble? Either now, or 
when we finally get around to throw out old ones?


So before we know more about what the implementation does (we're
currently on Hammer 0.94.1) we won't touch that user, but we would like
to get rid of it and the space it is wasting.
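
To be concrete, the command we would eventually run is presumably something
like this (hedging, since we have not dared to try yet):

$ sudo radosgw-admin user rm --uid=<the-big-user> --purge-data

and it is the --purge-data part in particular whose behaviour we are unsure
about.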


Thanks a lot in advance!
Daniel


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rados cppool

2015-05-14 Thread Daniel Schneller

On 2015-05-14 21:04:06 +, Daniel Schneller said:


On 2015-04-23 19:39:33 +, Sage Weil said:


On Thu, 23 Apr 2015, Pavel V. Kaygorodov wrote:

Hi!

I have copied two of my pools recently, because old ones has too many pgs.
Both of them contains RBD images, with 1GB and ~30GB of data.
Both pools was copied without errors, RBD images are mountable and
seems to be fine.
CEPH version is 0.94.1


You will likely have problems if you try to delete snapshots that existed
on the images (snaps are not copied/preserved by cppool).

sage


Could you be more specific on what these problems would look like? Are
you referring to RBD pools in particular, or is this a general issue
with snapshots? Anything that could be done to prevent these issues?

Background of the question is that we take daily snapshots of some
pools to allow reverting data when users make mistakes (via RGW). So it
would be difficult to get rid of all snapshots first.

Thanks
Daniel


Never mind, found more information on this on the list a few posts later.


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph -w output

2015-05-14 Thread Daniel Schneller

Hi!

I am trying to get behind the values in ceph -w, especially those 
regarding throughput(?) at the end:


2015-05-15 00:54:33.333500 mon.0 [INF] pgmap v26048646: 17344 pgs: 
17344 active+clean; 6296 GB data, 19597 GB used, 155 TB / 174 TB avail; 
6023 kB/s rd, 549 kB/s wr, 7564 op/s
2015-05-15 00:54:34.339739 mon.0 [INF] pgmap v26048647: 17344 pgs: 
17344 active+clean; 6296 GB data, 19597 GB used, 155 TB / 174 TB avail; 
1853 kB/s rd, 1014 kB/s wr, 2015 op/s
2015-05-15 00:54:35.353621 mon.0 [INF] pgmap v26048648: 17344 pgs: 
17344 active+clean; 6296 GB data, 19597 GB used, 155 TB / 174 TB avail; 
2101 kB/s rd, 1680 kB/s wr, 1950 op/s
2015-05-15 00:54:36.375887 mon.0 [INF] pgmap v26048649: 17344 pgs: 
17344 active+clean; 6296 GB data, 19597 GB used, 155 TB / 174 TB avail; 
1641 kB/s rd, 1266 kB/s wr, 1710 op/s
2015-05-15 00:54:37.399647 mon.0 [INF] pgmap v26048650: 17344 pgs: 
17344 active+clean; 6296 GB data, 19597 GB used, 155 TB / 174 TB avail; 
4735 kB/s rd, 777 kB/s wr, 7088 op/s
2015-05-15 00:54:38.453922 mon.0 [INF] pgmap v26048651: 17344 pgs: 
17344 active+clean; 6296 GB data, 19597 GB used, 155 TB / 174 TB avail; 
5176 kB/s rd, 942 kB/s wr, 7779 op/s
2015-05-15 00:54:39.462838 mon.0 [INF] pgmap v26048652: 17344 pgs: 
17344 active+clean; 6296 GB data, 19597 GB used, 155 TB / 174 TB avail; 
3407 kB/s rd, 768 kB/s wr, 2131 op/s
2015-05-15 00:54:40.488387 mon.0 [INF] pgmap v26048653: 17344 pgs: 
17344 active+clean; 6296 GB data, 19597 GB used, 155 TB / 174 TB avail; 
3343 kB/s rd, 518 kB/s wr, 1881 op/s
2015-05-15 00:54:41.512540 mon.0 [INF] pgmap v26048654: 17344 pgs: 
17344 active+clean; 6296 GB data, 19597 GB used, 155 TB / 174 TB avail; 
1221 kB/s rd, 2385 kB/s wr, 1686 op/s


Am I right to assume the values for "kB/s rd" and "kB/s wr" mean that
the indicated amount of data has been read/written by clients since the
last line, totalled over all OSDs?


As for the op/s I am a little more uncertain. What kind of operations 
does this count?

Assuming it is also reads and writes aggregated, what counts as an operation?
For example, when I request data via the Rados Gateway, do I see one 
op here for the request from RGW's perspective, or do I see multiple, 
depending on how many low level objects a big RGW upload was striped 
to?
What about non-rgw objects that get striped? Are reads/writes on those 
counted as one or one per stripe?
Is there anything else counting into this but reads/writes to the 
object data? What about key/value level accesses?


Is it possible to somehow come up with a theoretical estimate for a
maximum value achievable with a given set of hardware?

This is a cluster of 4 nodes with 48 OSDs, 4TB each, all spinners.
Are these values good, bad, critical?
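
To illustrate the kind of estimate I am after -- pure back-of-envelope, all
numbers are assumptions:

48 spinners x ~100 IOPS each          ~ 4800 raw IOPS
/ 3 (replication factor)              ~ 1600 client write IOPS
/ 2 (journal colocated on the disks)  ~  800 client write IOPS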

Can I somehow deduce - even if it is just a rather rough estimate - how
loaded my cluster is? I am not talking about precision monitoring,
but some kind of traffic light system (e.g. up to X% of the theoretical
max is fine, up to Y% shows a very busy cluster, and anything above Y%
means we might be in for trouble)?


Any pointers to documentation or other material would be appreciated if
this was discussed in some detail before. The only thing I found was a
post on this list from 2013 which did not say more than "ops are reads,
writes, anything", without going into detail about the "anything".


Thanks a lot!

Daniel


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rados cppool

2015-05-14 Thread Daniel Schneller

On 2015-04-23 19:39:33 +, Sage Weil said:


On Thu, 23 Apr 2015, Pavel V. Kaygorodov wrote:

Hi!

I have copied two of my pools recently, because old ones has too many pgs.
Both of them contains RBD images, with 1GB and ~30GB of data.
Both pools was copied without errors, RBD images are mountable and 
seems to be fine.

CEPH version is 0.94.1


You will likely have problems if you try to delete snapshots that existed
on the images (snaps are not copied/preserved by cppool).

sage


Could you be more specific on what these problems would look like? Are 
you referring to RBD pools in particular, or is this a general issue 
with snapshots? Anything that could be done to prevent these issues?


Background of the question is that we take daily snapshots of some 
pools to allow reverting data when users make mistakes (via RGW). So it 
would be difficult to get rid of all snapshots first.
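
For reference, the nightly snapshot job is essentially nothing more than
something along these lines (sketch; the real pool list is longer):

    DATE=$(date +%Y%m%d)
    for pool in .rgw .rgw.buckets .rgw.buckets.index; do
        rados -p "$pool" mksnap "backup-$DATE"
    done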


Thanks
Daniel


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Firefly to Hammer

2015-05-14 Thread Daniel Schneller
You should be able to do just that. We recently upgraded from Firefly 
to Hammer like that. Follow the order described on the website. 
Monitors, OSDs, MDSs.


Notice that the Debian packages do not restart running daemons, but 
they _do_ start up ones that are not running. So if, say, you shut down 
OSDs for some reason before your upgrade, they would be started as part 
of the upgrade.
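
If you want to control the order yourself on Ubuntu with Upstart, the
restarts are roughly the following (from memory, IDs need to match your
setup; wait for HEALTH_OK in between):

    sudo restart ceph-mon id=$(hostname)    # on each monitor node, one at a time
    sudo restart ceph-osd id=<osd-number>   # per OSD
    sudo restart ceph-mds id=$(hostname)    # on the MDS nodes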


Daniel


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Understand RadosGW logs

2015-03-05 Thread Daniel Schneller

Bump...

On 2015-03-03 10:54:13 +, Daniel Schneller said:


Hi!

After realizing the problem with log rotation (see
http://thread.gmane.org/gmane.comp.file-systems.ceph.user/17708)
and fixing it, I now for the first time have some
meaningful (and recent) logs to look at.

While from an application perspective there seem
to be no issues, I would like to understand some
messages I find with relatively high frequency in
the logs:

Exhibit 1
-
2015-03-03 11:14:53.685361 7fcf4bfef700  0 ERROR: flush_read_list():
d-client_c-handle_data() returned -1
2015-03-03 11:15:57.476059 7fcf39ff3700  0 ERROR: flush_read_list():
d-client_c-handle_data() returned -1
2015-03-03 11:17:43.570986 7fcf25fcb700  0 ERROR: flush_read_list():
d-client_c-handle_data() returned -1
2015-03-03 11:22:00.881640 7fcf39ff3700  0 ERROR: flush_read_list():
d-client_c-handle_data() returned -1
2015-03-03 11:22:48.147011 7fcf35feb700  0 ERROR: flush_read_list():
d-client_c-handle_data() returned -1
2015-03-03 11:27:40.572723 7fcf50ff9700  0 ERROR: flush_read_list():
d-client_c-handle_data() returned -1
2015-03-03 11:29:40.082954 7fcf36fed700  0 ERROR: flush_read_list():
d-client_c-handle_data() returned -1
2015-03-03 11:30:32.204492 7fcf4dff3700  0 ERROR: flush_read_list():
d-client_c-handle_data() returned -1

I cannot find anything relevant by Googling for
that, apart from the actual line of code that
produces this line.
What does that mean? Is it an indication of data
corruption or are there more benign reasons for
this line?


Exhibit 2
--
Several of these blocks

2015-03-03 07:06:17.805772 7fcf36fed700  1 == starting new request
req=0x7fcf5800f3b0 =
2015-03-03 07:06:17.836671 7fcf36fed700  0
RGWObjManifest::operator++(): result: ofs=4718592 stripe_ofs=4718592
part_ofs=0 rule-part_size=0
2015-03-03 07:06:17.836758 7fcf36fed700  0
RGWObjManifest::operator++(): result: ofs=8912896 stripe_ofs=8912896
part_ofs=0 rule-part_size=0
2015-03-03 07:06:17.836918 7fcf36fed700  0
RGWObjManifest::operator++(): result: ofs=13055243 stripe_ofs=13055243
part_ofs=0 rule-part_size=0
2015-03-03 07:06:18.263126 7fcf36fed700  1 == req done
req=0x7fcf5800f3b0 http_status=200 ==
...
2015-03-03 09:27:29.855001 7fcf28fd1700  1 == starting new request
req=0x7fcf580102a0 =
2015-03-03 09:27:29.866718 7fcf28fd1700  0
RGWObjManifest::operator++(): result: ofs=4718592 stripe_ofs=4718592
part_ofs=0 rule-part_size=0
2015-03-03 09:27:29.866778 7fcf28fd1700  0
RGWObjManifest::operator++(): result: ofs=8912896 stripe_ofs=8912896
part_ofs=0 rule-part_size=0
2015-03-03 09:27:29.866852 7fcf28fd1700  0
RGWObjManifest::operator++(): result: ofs=13107200 stripe_ofs=13107200
part_ofs=0 rule-part_size=0
2015-03-03 09:27:29.866917 7fcf28fd1700  0
RGWObjManifest::operator++(): result: ofs=17301504 stripe_ofs=17301504
part_ofs=0 rule-part_size=0
2015-03-03 09:27:29.875466 7fcf28fd1700  0
RGWObjManifest::operator++(): result: ofs=21495808 stripe_ofs=21495808
part_ofs=0 rule-part_size=0
2015-03-03 09:27:29.884434 7fcf28fd1700  0
RGWObjManifest::operator++(): result: ofs=25690112 stripe_ofs=25690112
part_ofs=0 rule-part_size=0
2015-03-03 09:27:29.906155 7fcf28fd1700  0
RGWObjManifest::operator++(): result: ofs=29884416 stripe_ofs=29884416
part_ofs=0 rule-part_size=0
2015-03-03 09:27:29.914364 7fcf28fd1700  0
RGWObjManifest::operator++(): result: ofs=34078720 stripe_ofs=34078720
part_ofs=0 rule-part_size=0
2015-03-03 09:27:29.940653 7fcf28fd1700  0
RGWObjManifest::operator++(): result: ofs=38273024 stripe_ofs=38273024
part_ofs=0 rule-part_size=0
2015-03-03 09:27:30.272816 7fcf28fd1700  0
RGWObjManifest::operator++(): result: ofs=42467328 stripe_ofs=42467328
part_ofs=0 rule-part_size=0
2015-03-03 09:27:31.125773 7fcf28fd1700  0
RGWObjManifest::operator++(): result: ofs=46661632 stripe_ofs=46661632
part_ofs=0 rule-part_size=0
2015-03-03 09:27:31.192661 7fcf28fd1700  0 ERROR: flush_read_list():
d-client_c-handle_data() returned -1
2015-03-03 09:27:31.194481 7fcf28fd1700  1 == req done
req=0x7fcf580102a0 http_status=200 ==
...
2015-03-03 09:28:43.008517 7fcf2a7d4700  1 == starting new request
req=0x7fcf580102a0 =
2015-03-03 09:28:43.016414 7fcf2a7d4700  0
RGWObjManifest::operator++(): result: ofs=887579 stripe_ofs=887579
part_ofs=0 rule-part_size=0
2015-03-03 09:28:43.022387 7fcf2a7d4700  1 == req done
req=0x7fcf580102a0 http_status=200 ==

First, what is the req= line? Is that a thread-id?
I am asking, because the same id is used over and over
in the same file over time.

More importantly, what do the RGWObjManifest::operator++():...
lines mean? In the middle case above the block even ends
with one of the ERROR lines mentioned before, but the HTTP
status is still 200, suggesting a successful operation.

Thanks in advance for shedding some light on this. I would like
to know if I need to take some action or at least keep an
eye on these via monitoring.

Cheers,
Daniel

[ceph-users] Understand RadosGW logs

2015-03-03 Thread Daniel Schneller

Hi!

After realizing the problem with log rotation (see
http://thread.gmane.org/gmane.comp.file-systems.ceph.user/17708)
and fixing it, I now for the first time have some
meaningful (and recent) logs to look at.

While from an application perspective there seem
to be no issues, I would like to understand some
messages I find with relatively high frequency in
the logs:

Exhibit 1
-
2015-03-03 11:14:53.685361 7fcf4bfef700  0 ERROR: flush_read_list(): 
d-client_c-handle_data() returned -1
2015-03-03 11:15:57.476059 7fcf39ff3700  0 ERROR: flush_read_list(): 
d-client_c-handle_data() returned -1
2015-03-03 11:17:43.570986 7fcf25fcb700  0 ERROR: flush_read_list(): 
d-client_c-handle_data() returned -1
2015-03-03 11:22:00.881640 7fcf39ff3700  0 ERROR: flush_read_list(): 
d-client_c-handle_data() returned -1
2015-03-03 11:22:48.147011 7fcf35feb700  0 ERROR: flush_read_list(): 
d-client_c-handle_data() returned -1
2015-03-03 11:27:40.572723 7fcf50ff9700  0 ERROR: flush_read_list(): 
d-client_c-handle_data() returned -1
2015-03-03 11:29:40.082954 7fcf36fed700  0 ERROR: flush_read_list(): 
d-client_c-handle_data() returned -1
2015-03-03 11:30:32.204492 7fcf4dff3700  0 ERROR: flush_read_list(): 
d-client_c-handle_data() returned -1


I cannot find anything relevant by Googling for
that, apart from the actual line of code that
produces this line.
What does that mean? Is it an indication of data
corruption or are there more benign reasons for
this line?


Exhibit 2
--
Several of these blocks

2015-03-03 07:06:17.805772 7fcf36fed700  1 == starting new request 
req=0x7fcf5800f3b0 =
2015-03-03 07:06:17.836671 7fcf36fed700  0 
RGWObjManifest::operator++(): result: ofs=4718592 stripe_ofs=4718592 
part_ofs=0 rule-part_size=0
2015-03-03 07:06:17.836758 7fcf36fed700  0 
RGWObjManifest::operator++(): result: ofs=8912896 stripe_ofs=8912896 
part_ofs=0 rule-part_size=0
2015-03-03 07:06:17.836918 7fcf36fed700  0 
RGWObjManifest::operator++(): result: ofs=13055243 stripe_ofs=13055243 
part_ofs=0 rule-part_size=0
2015-03-03 07:06:18.263126 7fcf36fed700  1 == req done 
req=0x7fcf5800f3b0 http_status=200 ==

...
2015-03-03 09:27:29.855001 7fcf28fd1700  1 == starting new request 
req=0x7fcf580102a0 =
2015-03-03 09:27:29.866718 7fcf28fd1700  0 
RGWObjManifest::operator++(): result: ofs=4718592 stripe_ofs=4718592 
part_ofs=0 rule-part_size=0
2015-03-03 09:27:29.866778 7fcf28fd1700  0 
RGWObjManifest::operator++(): result: ofs=8912896 stripe_ofs=8912896 
part_ofs=0 rule-part_size=0
2015-03-03 09:27:29.866852 7fcf28fd1700  0 
RGWObjManifest::operator++(): result: ofs=13107200 stripe_ofs=13107200 
part_ofs=0 rule-part_size=0
2015-03-03 09:27:29.866917 7fcf28fd1700  0 
RGWObjManifest::operator++(): result: ofs=17301504 stripe_ofs=17301504 
part_ofs=0 rule-part_size=0
2015-03-03 09:27:29.875466 7fcf28fd1700  0 
RGWObjManifest::operator++(): result: ofs=21495808 stripe_ofs=21495808 
part_ofs=0 rule-part_size=0
2015-03-03 09:27:29.884434 7fcf28fd1700  0 
RGWObjManifest::operator++(): result: ofs=25690112 stripe_ofs=25690112 
part_ofs=0 rule-part_size=0
2015-03-03 09:27:29.906155 7fcf28fd1700  0 
RGWObjManifest::operator++(): result: ofs=29884416 stripe_ofs=29884416 
part_ofs=0 rule-part_size=0
2015-03-03 09:27:29.914364 7fcf28fd1700  0 
RGWObjManifest::operator++(): result: ofs=34078720 stripe_ofs=34078720 
part_ofs=0 rule-part_size=0
2015-03-03 09:27:29.940653 7fcf28fd1700  0 
RGWObjManifest::operator++(): result: ofs=38273024 stripe_ofs=38273024 
part_ofs=0 rule-part_size=0
2015-03-03 09:27:30.272816 7fcf28fd1700  0 
RGWObjManifest::operator++(): result: ofs=42467328 stripe_ofs=42467328 
part_ofs=0 rule-part_size=0
2015-03-03 09:27:31.125773 7fcf28fd1700  0 
RGWObjManifest::operator++(): result: ofs=46661632 stripe_ofs=46661632 
part_ofs=0 rule-part_size=0
2015-03-03 09:27:31.192661 7fcf28fd1700  0 ERROR: flush_read_list(): 
d-client_c-handle_data() returned -1
2015-03-03 09:27:31.194481 7fcf28fd1700  1 == req done 
req=0x7fcf580102a0 http_status=200 ==

...
2015-03-03 09:28:43.008517 7fcf2a7d4700  1 == starting new request 
req=0x7fcf580102a0 =
2015-03-03 09:28:43.016414 7fcf2a7d4700  0 
RGWObjManifest::operator++(): result: ofs=887579 stripe_ofs=887579 
part_ofs=0 rule-part_size=0
2015-03-03 09:28:43.022387 7fcf2a7d4700  1 == req done 
req=0x7fcf580102a0 http_status=200 ==


First, what is the req= line? Is that a thread-id?
I am asking, because the same id is used over and over
in the same file over time.

More importantly, what do the RGWObjManifest::operator++():...
lines mean? In the middle case above the block even ends
with one of the ERROR lines mentioned before, but the HTTP
status is still 200, suggesting a successful operation.

Thanks in advance for shedding some light on this. I would like
to know if I need to take some action or at least keep an
eye on these via monitoring.

Cheers,
Daniel


___
ceph-users 

Re: [ceph-users] Shutting down a cluster fully and powering it back up

2015-03-02 Thread Daniel Schneller

On 2015-02-28 20:46:15 +, Gregory Farnum said:


Sounds good!
-Greg
On Sat, Feb 28, 2015 at 10:55 AM David 
da...@visions.se wrote:

Hi!



We did that a few weeks ago and it mostly worked fine.
However, on startup of one of the 4 machines, it got stuck
while starting OSDs (at least that's what the console
output indicated), while the others started up just
fine.

After waiting for more than 20 minutes with the other
3 machines already back up we hit ctrl-alt-del via
the server console. The signal got caught, the OS restarted
and came up without problems the next time.

Unfortunately, as this was in the middle of the night
after a very long day of moving hardware around in the
datacenter we did not manage to save the logs before
they were rotated...

Daniel


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RadosGW Log Rotation (firefly)

2015-03-02 Thread Daniel Schneller

On our Ubuntu 14.04/Firefly 0.80.8 cluster we are seeing
problem with log file rotation for the rados gateway.

The /etc/logrotate.d/radosgw script gets called, but
it does not work correctly. It spits out this message,
coming from the postrotate portion:

   /etc/cron.daily/logrotate:
   reload: Unknown parameter: id
   invoke-rc.d: initscript radosgw, action reload failed.

A new log file actually gets created, but due to the
failure in the post-rotate script, the daemon actually
continues writing into the now deleted previous file:

   [B|root@node01]  /etc/init ➜  ps aux | grep radosgw
   root 13077  0.9  0.1 13710396 203256 ? Ssl  Feb14 212:27 
/usr/bin/radosgw -n client.radosgw.node01


   [B|root@node01]  /etc/init ➜  ls -l /proc/13077/fd/
   total 0
   lr-x------ 1 root root 64 Mar  2 15:53 0 -> /dev/null
   lr-x------ 1 root root 64 Mar  2 15:53 1 -> /dev/null
   lr-x------ 1 root root 64 Mar  2 15:53 2 -> /dev/null
   l-wx------ 1 root root 64 Mar  2 15:53 3 -> /var/log/radosgw/radosgw.log.1 (deleted)

   ...

Trying manually with   service radosgw reload  fails with
the same message. Running the non-upstart
/etc/init.d/radosgw reload   works. It will, kind of crudely,
just send a SIGHUP to any running radosgw process.

To figure out the cause I compared OSDs and RadosGW wrt
to upstart and got this:

   [B|root@node01]  /etc/init ➜  initctl list | grep osd
   ceph-osd-all start/running
   ceph-osd-all-starter stop/waiting
   ceph-osd (ceph/8) start/running, process 12473
   ceph-osd (ceph/9) start/running, process 12503
   ...

   [B|root@node01]  /etc/init ➜  initctl reload radosgw cluster=ceph 
id=radosgw.node01

   initctl: Unknown instance: ceph/radosgw.node01

   [B|root@node01]  /etc/init ➜  initctl list | grep rados
   radosgw-instance stop/waiting
   radosgw stop/waiting
   radosgw-all-starter stop/waiting
   radosgw-all start/running

Apart from me not being totally clear about what the difference
between radosgw-instance and radosgw is, obviously Upstart
has no idea about which PID to send the SIGHUP to when I ask
it to reload.

I can, of course, replace the logrotate config and use the
/etc/init.d/radosgw reload  approach, but I would like to
understand if this is something unique to our system, or if
this is a bug in the scripts.

FWIW here's an excerpt from /etc/ceph.conf:

   [client.radosgw.node01]
   host = node01
   rgw print continue = false
   keyring = /etc/ceph/keyring.radosgw.gateway
   rgw socket path = /tmp/radosgw.sock
   log file = /var/log/radosgw/radosgw.log
   rgw enable ops log = false
   rgw gc max objs = 31


Thanks!
Daniel



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RadosGW Log Rotation (firefly)

2015-03-02 Thread Daniel Schneller

On 2015-03-02 18:17:00 +, Gregory Farnum said:



I'm not very (well, at all, for rgw) familiar with these scripts, but
how are you starting up your RGW daemon? There's some way to have
Apache handle the process instead of Upstart, but Yehuda says you
don't want to do it.
-Greg


Well, we installed the packages via APT. That places the upstart
scripts into /etc/init. Nothing special. That will make Upstart
launch them at boot.

In the meantime I just placed

   /var/log/radosgw/*.log {
   rotate 7
   daily
   compress
   sharedscripts
   postrotate
       start-stop-daemon --stop --signal HUP -x /usr/bin/radosgw --oknodo
   endscript
   missingok
   notifempty
   }

into the logrotate script, replacing the more complicated (and not working :))
logic with the core piece from the regular init.d script.

Because the daemons were already running and writing to an already deleted
log file, logrotate wouldn't see the need to rotate the (visible) ones, because
they had not changed. So I needed to manually execute the above start-stop-daemon
command on all relevant nodes once to force the gateway to start a new,
non-deleted logfile.
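
For completeness, running that across the nodes is trivial, something along
these lines (node names being ours, obviously):

    for h in node01 node02 node03 node04; do
        ssh $h sudo start-stop-daemon --stop --signal HUP -x /usr/bin/radosgw --oknodo
    done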

Daniel


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Update 0.80.7 to 0.80.8 -- Restart Order

2015-02-07 Thread Daniel Schneller

On 2015-02-03 18:48:45 +, Alexandre DERUMIER said:


debian deb packages update are not restarting services.

(So, I think it should be the same for ubuntu).

you need to restart daemons in this order:

-monitor
-osd
-mds
-rados gateway

http://ceph.com/docs/master/install/upgrading-ceph/


Just a small update: We just updated from 0.80.7 to our
own build of 0.80.8 with the fix for http://tracker.ceph.com/issues/10262
added, because that was the main reason for us to
update.

Went as planned. Updated packages via
apt-get --upgrade-only, then restarted MONs one by one,
then OSDs one by one and finally MDSs one by one.

The only slight hiccup was that ceph-fuse did not
unmount voluntarily on two machines, claiming the
filesystem was in use, even though no one was logged
in who could be using it.

Apart from that, no interruptions, no problems :)

Daniel


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] cephfs-fuse: set/getfattr, change pools

2015-02-03 Thread Daniel Schneller

Hi!

We have a CephFS directory /baremetal mounted as /cephfs via FUSE on 
our clients.

There are no specific settings configured for /baremetal.
As a result, trying to get the directory layout via getfattr does not work

getfattr -n 'ceph.dir.layout' /cephfs
/cephfs: ceph.dir.layout: No such attribute

Using a dummy file I can work around this to get at least at the pool name:

➜ touch dummy.txt
➜ getfattr -n 'ceph.file.layout' dummy.txt
# file: dummy.txt
ceph.file.layout=stripe_unit=4194304 stripe_count=1 
object_size=4194304 pool=cephfs


(BTW: Why doesn't getfattr -d -m - dummy.txt show any of the Ceph attributes?)


Now, say I wanted to put /baremetal into a different pool, how would I 
go about this?


Can I setfattr on the /cephfs mountpoint and assign it a different pool 
with e. g. different replication settings?


Or would I need to mount the CephFS / directory somewhere and modify 
the settings for /baremetal from there?


Can I change this after the fact at all, or do I have to mount both 
pools at the same time and move data between them manually?
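
To make the question more concrete, what I imagine (purely hypothetical,
pool name and pg count made up, untested) is something like:

    ceph osd pool create cephfs_baremetal 1024
    ceph mds add_data_pool cephfs_baremetal
    setfattr -n ceph.dir.layout.pool -v cephfs_baremetal /cephfs

Would that be the right general direction?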


Thanks a lot!

Daniel


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs-fuse: set/getfattr, change pools

2015-02-03 Thread Daniel Schneller

On 2015-02-03 18:19:24 +, Gregory Farnum said:


Okay, I've looked at the code a bit, and I think that it's not showing
you one because there isn't an explicit layout set. You should still
be able to set one if you like, though; have you tried that?


Actually, no, not yet. We were setting up CephFS on a 2nd cluster today
and came across these issues. Turns out when we set up the first one we
had used the kernel module and the accompanying tools, so some of our
notes were not applicable anymore.

We will play with this some more and come back if problems turn up.

Thanks!

Daniel


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs-fuse: set/getfattr, change pools

2015-02-03 Thread Daniel Schneller

We have a CephFS directory /baremetal mounted as /cephfs via FUSE on our
clients.
There are no specific settings configured for /baremetal.
As a result, trying to get the directory layout via getfattr does not work

getfattr -n 'ceph.dir.layout' /cephfs
/cephfs: ceph.dir.layout: No such attribute


What version are you running? I thought it was zapped a while ago, but
in some versions of the code you can't access these xattrs on the root
inode (but you can on everything else).


0.80.7



(BTW: Why doesn't getfattr -d -m - dummy.txt show any of the Ceph
attributes?)


They're virtual xattrs controlling layout: you don't want tools like
rsync trying to copy them around.


That actually makes perfect sense :)


You can change the layout settings whenever you want, but there's no
mechanism for CephFS to move the data between different pools; it
simply applies the settings when the file is created.


Understood. So if we did not move the data ourselves, e.g. by mounting
both CephFS paths simultaneously at different mount points and moving
the data over using rsync, mv, cp, ... we would gradually end up with the
existing files staying in the old pool and new files going into the new
one? So every file knows about its containing pool itself, right?
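
So if we ever wanted the existing data to actually end up in the new pool,
I suppose it would boil down to something like this (rough sketch, mount
point, monitor and pool name made up, untested):

    ceph-fuse -m mon01:6789 /mnt/cephfs-root    # mount the filesystem root a second time
    mkdir /mnt/cephfs-root/baremetal-new
    setfattr -n ceph.dir.layout.pool -v newpool /mnt/cephfs-root/baremetal-new
    rsync -a /mnt/cephfs-root/baremetal/ /mnt/cephfs-root/baremetal-new/

i.e. rewriting every file so it gets created under the new layout?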

Cheers,
Daniel


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs-fuse: set/getfattr, change pools

2015-02-03 Thread Daniel Schneller
Understood. Thanks for the details. 

Daniel 




On Tue, Feb 3, 2015 at 1:23 PM -0800, Gregory Farnum g...@gregs42.com wrote:










On Tue, Feb 3, 2015 at 1:17 PM, John Spray  wrote:
 On Tue, Feb 3, 2015 at 2:21 PM, Daniel Schneller
  wrote:
 Now, say I wanted to put /baremetal into a different pool, how would I go
 about this?

 Can I setfattr on the /cephfs mountpoint and assign it a different pool with
 e. g. different replication settings?

 This should make it clearer:
 http://ceph.com/docs/master/cephfs/file-layouts/#inheritance-of-layouts

 When you change the layout of a directory, the new layout will only
 apply to newly created files: it will not trigger any data movement.

 If you explicitly change the layout of a file containing data to point
 to a different pool, then you will see zeros when you try to read it
 back (although new data will be written to the new pool).

That statement sounds really scary. To reassure people: you can't
actually change layout on a file which has already been written to!
Trying to do so will return an error code; actually changing the
layouts and seeing this result would require manually mucking around
with RADOS data underneath the MDS.
-Greg___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Update 0.80.7 to 0.80.8 -- Restart Order

2015-02-02 Thread Daniel Schneller

Hello!

We are planning to upgrade our Ubuntu 14.04.1 based cluster from Ceph 
Firefly 0.80.7 to 0.80.8. We have 4 nodes, 12x4TB spinners each (plus 
OS disks). Apart from the 12 OSDs per node, nodes 1-3 have MONs running.


The instructions on ceph.com say it is best to first restart the MONs, 
then the OSDs. We are wondering if updating the packages from the 
repository will trigger daemon restarts through package scripts. This 
would then certainly not guarantee the recommended restart order.


Do these instructions assume that the monitors are separate machines 
that could be updated first? If so, are there best-practice 
recommendations on how to update a production cluster without service 
interruption?


Thanks a lot for any advice!

Daniel


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Update 0.80.7 to 0.80.8 -- Restart Order

2015-02-02 Thread Daniel Schneller

On 2015-02-02 16:09:27 +, Gregory Farnum said:

That said, for a point release it shouldn't matter what order stuff 
gets restarted in. I wouldn't worry about it. :)


That is good to know. One follow-up then: If the packages trigger 
restarts, they will most probably do so for *all* daemons virtually at 
once, right? So that means that all OSDs on that host will go down at 
the same time. That sounds like a not so good idea, taking 25% of the 
cluster down at the same time (provided I go host by host)?


Daniel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Removing Snapshots Killing Cluster Performance

2014-12-01 Thread Daniel Schneller

Hi!

We take regular (nightly) snapshots of our Rados Gateway Pools for
backup purposes. This allows us - with some manual pokery - to restore
clients' documents should they delete them accidentally.

The cluster is a 4 server setup with 12x4TB spinning disks each,
totaling about 175TB. We are running firefly.

We have now completed our first month of snapshots and want to remove
the oldest ones. Unfortunately doing so practically kills everything
else that is using the cluster, because performance drops to almost zero
while the OSDs work their disks 100% (as per iostat). It seems this is
the same phenomenon I asked about some time ago where we were deleting
whole pools.

I could not find any way to throttle the background deletion activity
(the command returns almost immediately). Here is a graph of the I/O
operations waiting (colored by device) while deleting a few snapshots.
Each of the blocks in the graph shows one snapshot being removed. The
big one in the middle was a snapshot of the .rgw.buckets pool. It took
about 15 minutes during which basically nothing relying on the cluster
was working due to immense slowdowns. This included users getting 
kicked off their SSH sessions due to timeouts.


https://public.centerdevice.de/8c95f1c2-a7c3-457f-83b6-834688e0d048

While this is a big issue in itself for us, we would at least try to
estimate how long the process will take per snapshot / per pool. I
assume the time needed is a function of the number of objects that were
modified between two snapshots. We tried to get an idea of at least how
many objects were added/removed in total by running `rados df` with a
snapshot specified as a parameter, but it seems we still always get the
current values:

$ sudo rados -p .rgw df --snap backup-20141109
selected snap 13 'backup-20141109'
pool name       category                 KB      objects
.rgw            -                     276165      1368545

$ sudo rados -p .rgw df --snap backup-20141124
selected snap 28 'backup-20141124'
pool name       category                 KB      objects
.rgw            -                     276165      1368546

$ sudo rados -p .rgw df
pool name       category                 KB      objects
.rgw            -                     276165      1368547

So there are a few questions:

1) Is there any way to control how much such an operation will
tax the cluster (we would be happy to have it run longer, if that meant
not utilizing all disks fully during that time)?

2) Is there a way to get a decent approximation of how much work
deleting a specific snapshot will entail (in terms of objects, time,
whatever)?

3) Would SSD journals help here? Or any other hardware configuration
change for that matter?

4) Any other recommendations? We definitely need to remove the data,
not because of a lack of space (at least not at the moment), but because
when customers delete stuff / cancel accounts, we are obliged to remove
their data at least after a reasonable amount of time.

Cheers,
Daniel___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Removing Snapshots Killing Cluster Performance

2014-12-01 Thread Daniel Schneller

On 2014-12-01 10:03:35 +, Dan Van Der Ster said:

Which version of Ceph are you using? This could be related: 
http://tracker.ceph.com/issues/9487


Firefly. I had seen this ticket earlier (when deleting a whole pool) and hoped
the backport of the fix would be available some time soon. I must admit, I did
not look this up before posting, because I had forgotten about it.

See ReplicatedPG: don't move on to the next snap immediately; 
basically, the OSD is getting into a tight loop trimming the snapshot 
objects. The fix above breaks out of that loop more frequently, and 
then you can use the osd snap trim sleep option to throttle it further. 
I’m not sure if the fix above will be sufficient if you have many 
objects to remove per snapshot.


Just so I get this right: With the fix alone you are not sure it would
be nice enough, so adjusting the snap trim sleep option in addition might
be needed? I assume the loop that will be broken up with 9487 does not
take the sleep time into account?

That commit is only in giant at the moment. The backport to dumpling is 
in the dumpling branch but not yet in a release, and firefly is still 
pending.


Holding my breath :)

Any thoughts on the other items I had in the original post?


2) Is there a way to get a decent approximation of how much work
deleting a specific snapshot will entail (in terms of objects, time,
whatever)?

3) Would SSD journals help here? Or any other hardware configuration
change for that matter?



Thanks!
Daniel


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Removing Snapshots Killing Cluster Performance

2014-12-01 Thread Daniel Schneller

Thanks for your input. We will see what we can find out
with the logs and how to proceed from there. 



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Configuring swift user for ceph Rados Gateway - 403 Access Denied

2014-11-11 Thread Daniel Schneller

On 2014-11-11 13:12:32 +, ವಿನೋದ್ Vinod H I said:


Hi,
I am having problems accessing rados gateway using swift interface.
I am using ceph firefly version and have configured a us region as 
explained in the docs.

There are two zones us-east and us-west.
us-east gateway is running on host ceph-node-1 and us-west gateway is 
running on host ceph-node-2.


[...]



Auth GET failed: http://ceph-node-1/auth 403 Forbidden
[...]



  swift_keys: [
        { user: useast:swift,
          secret_key: FmQYYbzly4RH+PmNlrWA3ynN+eJrayYXzeISGDSw}],


We have seen problems when the secret_key has special characters. I am 
not sure if + is one of them, but the manual states this somewhere. 
Try setting the key explicitly or re-generating it until you get one 
without any special chars.
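
Setting the key explicitly should look roughly like this (flags from memory,
please double-check against your version):

    radosgw-admin key create --subuser=useast:swift --key-type=swift \
        --secret=averylongkeywithoutspecialchars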


Drove me nuts.

Daniel


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Swift + radosgw: How do I find accounts/containers/objects limitation?

2014-11-01 Thread Daniel Schneller

To remove the max_bucket limit I used

radosgw-admin user modify --uid=username --max-buckets=0

Off the top of my head, I think 


radosgw-admin user info  --uid=username

will show you the current values without changing anything.
See also this thread I started about this topic a few weeks ago. 


https://www.mail-archive.com/ceph-users@lists.ceph.com/msg12840.html

Daniel___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Crash with rados cppool and snapshots

2014-10-30 Thread Daniel Schneller

Ticket created: http://tracker.ceph.com/issues/9941


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Delete pools with low priority?

2014-10-30 Thread Daniel Schneller

On 2014-10-30 10:14:44 +, Dan van der Ster said:

Hi Daniel,
I can't remember if deleting a pool invokes the snap trimmer to do the 
actual work deleting objects. But if it does, then it is most 
definitely broken in everything except latest releases (actual dumpling 
doesn't have the fix yet in a release).
Given a release with those fixes (see tracker #9487) then you should 
enable the snap trim sleep (e.g. 0.05) and set the io priority class to 
3 or idle.

Cheers, Dan

Dan, thank you for the hint.

I have looked at the ticket, but I am not familiar enough with trac (yet) 
to understand the current state. The header part says Status: Pending 
Backport and Backport: dumpling. At the very bottom (as of now ;)), 
however, I see a revision:


Revision 496e561d
Added by Samuel Just 3 days ago
ReplicatedPG: don't move on to the next snap immediately
If we have a bunch of trimmed snaps for which we have no
objects, we'll spin for a long time. Instead, requeue.
Fixes: #9487
Backport: dumpling, firefly, giant
Reviewed-by: Sage Weil s...@redhat.com
Signed-off-by: Samuel Just sam.j...@inktank.com
(cherry picked from commit c17ac03a50da523f250eb6394c89cc7e93cb4659)

Does this mean there will be a backport to firefly, too, and that the 
bug status (the header of the page) hasn't been updated yet?


Thanks!
Daniel


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Crash with rados cppool and snapshots

2014-10-30 Thread Daniel Schneller
Apart from the current "there is a bug" part, is the idea of copying a 
snapshot into a new pool a viable one for a full backup/restore?



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Crash with rados cppool and snapshots

2014-10-29 Thread Daniel Schneller
Hi!

We are exploring options to regularly preserve (i.e. backup) the
contents of the pools backing our rados gateways. For that we create
nightly snapshots of all the relevant pools when there is no activity
on the system to get consistent states.

In order to restore the whole pools back to a specific snapshot state,
we tried to use the rados cppool command (see below) to copy a snapshot
state into a new pool. Unfortunately this causes a segfault. Are we
doing anything wrong?

This command:

rados cppool --snap snap-1 deleteme.lp deleteme.lp2 2> segfault.txt

Produces this output:

*** Caught signal (Segmentation fault) **
 in thread 7f8f49a927c0
 ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3)
 1: rados() [0x43eedf]
 2: (()+0x10340) [0x7f8f48738340]
 3: (librados::IoCtxImpl::snap_lookup(char const*, unsigned long*)+0x17) [0x7f8f48aff127]
 4: (main()+0x1385) [0x411e75]
 5: (__libc_start_main()+0xf5) [0x7f8f4795fec5]
 6: rados() [0x41c6f7]
2014-10-29 12:03:22.761653 7f8f49a927c0 -1 *** Caught signal (Segmentation fault) **
 in thread 7f8f49a927c0

 ceph version 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3)
 1: rados() [0x43eedf]
 2: (()+0x10340) [0x7f8f48738340]
 3: (librados::IoCtxImpl::snap_lookup(char const*, unsigned long*)+0x17) [0x7f8f48aff127]
 4: (main()+0x1385) [0x411e75]
 5: (__libc_start_main()+0xf5) [0x7f8f4795fec5]
 6: rados() [0x41c6f7]
 NOTE: a copy of the executable, or `objdump -rdS executable` is needed to interpret this.

Full segfault file and the objdump output for the rados command can be
found here:

- https://public.centerdevice.de/53bddb80-423e-4213-ac62-59fe8dbb9bea
- https://public.centerdevice.de/50b81566-41fb-439a-b58b-e1e32d75f32a

We updated to the 0.80.7 release (saw the issue with 0.80.5 before and
had hoped that the long list of bugfixes in the release notes would
include a fix for this) but are still seeing it. Rados gateways, OSDs,
MONs etc. have all been restarted after the update. Package versions 
as follows:

daniel.schneller@node01 [~] $  
➜  dpkg -l | grep ceph
ii  ceph 0.80.7-1trusty 
ii  ceph-common 0.80.7-1trusty 
ii  ceph-fs-common  0.80.7-1trusty 
ii  ceph-fuse   0.80.7-1trusty 
ii  ceph-mds0.80.7-1trusty 
ii  libcephfs1  0.80.7-1trusty 
ii  python-ceph 0.80.7-1trusty 

daniel.schneller@node01 [~] $  
➜  uname -a
Linux node01 3.13.0-27-generic #50-Ubuntu SMP Thu May 15 18:06:16 
   UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

Copying without the snapshot works. Should this work at least in 
theory?

Thanks! 

Daniel

-- 
Daniel Schneller
Mobile Development Lead
 
CenterDevice GmbH  | Merscheider Straße 1
   | 42699 Solingen
tel: +49 1754155711| Deutschland
daniel.schnel...@centerdevice.com  | www.centerdevice.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Delete pools with low priority?

2014-10-29 Thread Daniel Schneller

Bump :-)

Any ideas on this? They would be much appreciated.

Also: Sorry for a possible double post, client had forgotten its email config.

On 2014-10-22 21:21:54 +, Daniel Schneller said:


We have been running several rounds of benchmarks through the Rados
Gateway. Each run creates several hundred thousand objects and similarly
many containers.

The cluster consists of 4 machines, 12 OSD disks (spinning, 4TB) — 48
OSDs total.

After running a set of benchmarks we renamed the pools used by the
gateway pools to get a clean baseline. In total we now have several
million objects and containers in 3 pools. Redundancy for all pools is
set to 3.

Today we started deleting the benchmark data. Once the delete of the
first set of renamed RGW pools had been issued, cluster performance
started to go down the
drain. Using iotop we can see that the disks are all working furiously.
As the command to delete the pools came back very quickly, the
assumption is that we are now seeing the effects of the actual objects
being removed, causing lots and lots of IO activity on the disks,
negatively impacting regular operations.

We are running OpenStack on top of Ceph, and we see drastic reduction in
responsiveness of these machines as well as in CephFS.

Fortunately this is still a test setup, so no production systems are
affected. Nevertheless I would like to ask a few questions:

1) Is it possible to have the object deletion run in some low-prio mode?
2) If not, is there another way to delete lots and lots of objects
   without affecting the rest of the cluster so badly?
3) Can we somehow determine the progress of the deletion so far? We would
   like to estimate if this is going to take hours, days or weeks.
4) Even if not possible for the already running deletion, could we get a
   progress indication for the remaining pools we still want to delete?
5) Are there any parameters that we might tune — even if just temporarily -
   to speed this up?

Slide 18 of http://www.slideshare.net/Inktank_Ceph/scaling-ceph-at-cern
describes a very similar situation.

Thanks, Daniel

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Delete pools with low priority?

2014-10-22 Thread Daniel Schneller
We have been running several rounds of benchmarks through the Rados
Gateway. Each run creates several hundred thousand objects and similarly
many containers.

The cluster consists of 4 machines, 12 OSD disks (spinning, 4TB) — 48
OSDs total.

After running a set of benchmarks we renamed the pools used by the
gateway pools to get a clean baseline. In total we now have several
million objects and containers in 3 pools. Redundancy for all pools is
set to 3.

Today we started deleting the benchmark data. Once the delete of the
first set of renamed RGW pools had been issued, cluster performance
started to go down the
drain. Using iotop we can see that the disks are all working furiously.
As the command to delete the pools came back very quickly, the
assumption is that we are now seeing the effects of the actual objects
being removed, causing lots and lots of IO activity on the disks,
negatively impacting regular operations.

We are running OpenStack on top of Ceph, and we see drastic reduction in
responsiveness of these machines as well as in CephFS.

Fortunately this is still a test setup, so no production systems are
affected. Nevertheless I would like to ask a few questions:

1) Is it possible to have the object deletion run in some low-prio mode?
2) If not, is there another way to delete lots and lots of objects
   without affecting the rest of the cluster so badly?
3) Can we somehow determine the progress of the deletion so far? We would
   like to estimate if this is going to take hours, days or weeks
   (a crude idea of what I mean is sketched below).
4) Even if not possible for the already running deletion, could we get a
   progress indication for the remaining pools we still want to delete?
5) Are there any parameters that we might tune — even if just temporarily -
   to speed this up?
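
Regarding 3), the only (crude) idea I have so far is to watch the object
counts shrink and extrapolate from that, e.g.:

    watch -n 60 "rados df | egrep 'objects|rgw'"

which obviously says nothing about pools whose deletion has not even
started yet.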

Slide 18 of http://www.slideshare.net/Inktank_Ceph/scaling-ceph-at-cern
describes a very similar situation.

Thanks, Daniel

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Icehouse Ceph -- live migration fails?

2014-10-17 Thread Daniel Schneller
samuel samu60@... writes:

 Hi all,This issue is also affecting us (centos6.5 based icehouse) and,
 as far as I could read, 
 comes from the fact that the path /var/lib/nova/instances (or whatever
 configuration path you have in nova.conf) is not shared. Nova does not
 see this shared path and therefore does not allow to perform live
 migrate although all the required information is stored in ceph and in
 the qemu local state.
 
 Some people has cheated nova to see this as a shared path but I'm
 not confident about how 
 this will affect stability.
 
  
 Can someone confirm this deduction? What are the possible workarounds
 for this situation in 
 a full ceph based environment (without shared path)?

I got it to work finally. Step 1 was double checking nova.conf on the
compute nodes. It was actually missing the flags pointed out earlier in
this thread.

As for the /var/lib/nova/instances data, this will get transferred to
the destination host as part of the migration. For that to work, you
need to have the transport between the libvirtd's set up correctly.

libvirt_live_migration_flag=VIR_MIGRATE_UNDEFINE_SOURCE,VIR_MIGRATE_PEER2PEER,VIR_MIGRATE_LIVE,VIR_MIGRATE_PERSIST_DEST
live_migration_uri=qemu+ssh://nova@%s/system?keyfile=/var/lib/nova/.ssh/id_rsa

I did not want to open another TCP port on all the nodes, so I went with
the SSH based transport as described in the libvirtd documentation. For
some reason it would only work once I explicitly added the user account
(nova@...) and the location of the key file, even though the
locations and names are the defaults.

As part of our deployment via Ansible we make sure the nova user has an
up to date list of host keys in /var/lib/nova/.ssh/known_hosts.
Otherwise you will get errors regarding failing host key verification in
/var/log/nova/nova-compute.log if you try to live migrate. Of course,
the user needs to be present everywhere, have the same key everywhere
and have that key's public part be in /var/lib/nova/.ssh/authorized_keys
for the login to work without user intervention.
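
What the Ansible part boils down to is roughly this (sketch, short host
names for brevity; in reality it is templated):

    ssh-keyscan node01 node02 node03 node04 > /var/lib/nova/.ssh/known_hosts
    chown nova:nova /var/lib/nova/.ssh/known_hosts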

Setting this up alone brought me almost to my goal; the only thing I had
missed was

vncserver_listen = 0.0.0.0

in nova.conf -- this address will be put into the virtual machine's
libvirt.xml file as the address the machine uses for its VNC console.
While on the baremetal node where it was originally created, this works.
However, when the VM gets migrated to another host (basically copying
over the instance folder from /var/lib/nova/instances) this address
cannot be bound on the new baremetal host and the migration fails. The
log is pretty clear about that. Once I had changed the vncserver_listen,
new machines could be migrated immediately.

For existing ones, I have not tried if editing the libvirt.xml file
while they are running is in any way harmful, so I will wait until I can
shut them down for a short maintenance window, then edit the file to
replace the current listen address with 0.0.0.0 and bring them up again.

One more caveat: If you use the Horizon dashboard, there is a bug in the
Icehouse release that prevents successful live migration on another
level, because it uses the wrong names for the baremetal machines.
Instead of the compute service names (e. g. node01, node02 ... in my
case), it uses the fully qualified hypervisor names. This will not work.
See https://bugs.launchpad.net/horizon/+bug/1335999 for details.

I applied the corresponding patch from
https://git.openstack.org/cgit/openstack/horizon/patch/?id=89dc7de2e87b8d4e35837ad5122117aa2fb2c520
(excluding the tests, those
do not match well enough). Now I can live migration from horizon and the
command line :)

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] max_bucket limit -- safe to disable?

2014-10-07 Thread Daniel Schneller
=0x7f0378fd8250 obj=.rgw:CNT-UUID 
state=0x7f02b007ac18 s-prefetch_data=0
31.886430 ID 10 cache get: name=.rgw+CNT-UUID : type miss (requested=22, 
cached=19)
36.746327 ID 10 cache put: name=.rgw+CNT-UUID
36.746404 ID 10 moving .rgw+CNT-UUID to cache LRU end
36.746426 ID 20 get_obj_state: s-obj_tag was set empty
36.746431 ID 20 Read xattr: user.rgw.idtag
36.746433 ID 20 Read xattr: user.rgw.manifest
36.746452 ID 10 cache get: name=.rgw+CNT-UUID : hit
36.746481 ID 20 rgw_get_bucket_info: bucket instance: 
CNT-UUID(@{i=.rgw.buckets.index}.rgw.buckets[default.78418684.119116])
36.746491 ID 20 reading from 
.rgw:.bucket.meta.CNT-UUID:default.78418684.119116
36.746549 ID 20 get_obj_state: rctx=0x7f0378fd8250 
obj=.rgw:.bucket.meta.CNT-UUID:default.78418684.119116 state=0x7f02b00ce638 
s-prefetch_data=0
36.746585 ID 10 cache get: 
name=.rgw+.bucket.meta.CNT-UUID:default.78418684.119116 : type miss 
(requested=22, cached=19)
36.747938 ID 10 cache put: 
name=.rgw+.bucket.meta.CNT-UUID:default.78418684.119116
36.747955 ID 10 moving .rgw+.bucket.meta.CNT-UUID:default.78418684.119116 
to cache LRU end
36.747963 ID 10 updating xattr: name=user.rgw.acl bl.length()=177
36.747972 ID 20 get_obj_state: s-obj_tag was set empty
36.747975 ID 20 Read xattr: user.rgw.acl
36.747977 ID 20 Read xattr: user.rgw.idtag
36.747978 ID 20 Read xattr: user.rgw.manifest
36.747985 ID 10 cache get: 
name=.rgw+.bucket.meta.CNT-UUID:default.78418684.119116 : hit
36.748025 ID 15 Read AccessControlPolicyAccessControlPolicy 
xmlns=http://s3.amazonaws.com/doc/2006-03-01/;OwnerIDdocumentstore/IDDisplayNameDocument
 Store/DisplayName/OwnerAccessControlListGrantGrantee 
xmlns:xsi=http://www.w3.org/2001/XMLSchema-instance; 
xsi:type=CanonicalUserIDdocumentstore/IDDisplayNameDocument 
Store/DisplayName/GranteePermissionFULL_CONTROL/Permission/Grant/AccessControlList/AccessControlPolicy
36.748037 ID  2 req 983095:4.861888:swift:PUT 
/swift/v1/CNT-UUID/version:put_obj:init op
36.748043 ID  2 req 983095:4.861895:swift:PUT 
/swift/v1/CNT-UUID/version:put_obj:verifying op mask
36.748046 ID 20 required_mask= 2 user.op_mask=7
36.748050 ID  2 req 983095:4.861902:swift:PUT 
/swift/v1/CNT-UUID/version:put_obj:verifying op permissions
36.748054 ID  5 Searching permissions for uid=documentstore mask=50
36.748056 ID  5 Found permission: 15
36.748058 ID  5 Searching permissions for group=1 mask=50
36.748060 ID  5 Permissions for group not found
36.748061 ID  5 Searching permissions for group=2 mask=50
36.748063 ID  5 Permissions for group not found
36.748064 ID  5 Getting permissions id=documentstore owner=documentstore 
perm=2
36.748066 ID 10  uid=documentstore requested perm (type)=2, policy perm=2, 
user_perm_mask=2, acl perm=2
36.748069 ID  2 req 983095:4.861921:swift:PUT 
/swift/v1/CNT-UUID/version:put_obj:verifying op params
36.748072 ID  2 req 983095:4.861924:swift:PUT 
/swift/v1/CNT-UUID/version:put_obj:executing
36.748200 ID 20 get_obj_state: rctx=0x7f0378fd8250 obj=CNT-UUID:version 
state=0x7f02b0042618 s-prefetch_data=0
36.802077 ID 10 setting object write_tag=default.78418684.983095
36.818727 ID  2 req 983095:4.932579:swift:PUT 
/swift/v1/CNT-UUID/version:put_obj:http status=201

==


-- 
Daniel Schneller
Mobile Development Lead
 
CenterDevice GmbH  | Merscheider Straße 1
   | 42699 Solingen
tel: +49 1754155711| Deutschland
daniel.schnel...@centerdevice.com  | www.centerdevice.com




 On 06 Oct 2014, at 19:26, Yehuda Sadeh yeh...@redhat.com wrote:
 
 It'd be interesting to see which rados operation is slowing down the
 requests. Can you provide a log dump of a request (with 'debug rgw =
 20', and 'debug ms = 1'). This might give us a better idea as to
 what's going on.
 
 Thanks,
 Yehuda
 
 On Mon, Oct 6, 2014 at 10:05 AM, Daniel Schneller
 daniel.schnel...@centerdevice.com wrote:
 Hi again!
 
 We have done some tests regarding the limits of storing lots and
 lots of buckets through Rados Gateway into Ceph.
 
 Our test used a single user for which we removed the default max
 buckets limit. It then continuously created containers - both empty
 and such with 10 objects of around 100k random data in them.
 
 With 3 parallel processes we saw relatively consistent time of
about 500-700ms per such container.
 
 This kept steady until we reached approx. 3 million containers
 after which the time per insert sharply went up to currently
 around   1600ms   and rising. Due to some hiccups with network
 equipment the tests were aborted a few times, but then resumed without
 deleting any of the previous runs created containers, so the actual
 number might be 2.8 or 3.2 million, but still in that ballpark.
 We aborted the test here.
 
 Judging by the advice given earlier (see quoted mail below) that
 we might hit a limit on some per-user data structures, we created
 another user account, removed its max-bucket limit as well and
 restarted the benchmark with that one

Re: [ceph-users] max_bucket limit -- safe to disable?

2014-10-07 Thread Daniel Schneller
Hi!

 By looking at these logs it seems that there are only 8 pgs on the
 .rgw pool, if this is correct then you may want to change that
 considering your workload.


Thanks. See our pg_num configuration below. We had already suspected
that the 1600 that we had previously (48 OSDs * 100 / triple redundancy)
were not ideal, so we increased the .rgw.buckets pool to 2048.
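
If the .rgw pool itself turns out to be the culprit, I assume bumping it
would just be the usual (bearing in mind pg_num can only ever be increased):

    ceph osd pool set .rgw pg_num 2048
    ceph osd pool set .rgw pgp_num 2048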

The number of objects and their size was in an earlier email, but for
completeness I will put them up once again. 

Any other ideas where to look?

==
for i in $(rados df | awk '{ print $1 }' | grep '^\.'); do
   echo $i; echo -n " - ";
   ceph osd pool get $i pg_num;
   echo -n " - ";
   ceph osd pool get $i pgp_num;
done

.intent-log
 - pg_num: 1600
 - pgp_num: 1600
.log
 - pg_num: 1600
 - pgp_num: 1600
.rgw
 - pg_num: 1600
 - pgp_num: 1600
.rgw.buckets
 - pg_num: 2048
 - pgp_num: 2048
.rgw.buckets.index
 - pg_num: 1600
 - pgp_num: 1600
.rgw.control
 - pg_num: 1600
 - pgp_num: 1600
.rgw.gc
 - pg_num: 1600
 - pgp_num: 1600
.rgw.root
 - pg_num: 100
 - pgp_num: 100
.usage
 - pg_num: 1600
 - pgp_num: 1600
.users
 - pg_num: 1600
 - pgp_num: 1600
.users.email
 - pg_num: 1600
 - pgp_num: 1600
.users.swift
 - pg_num: 1600
 - pgp_num: 1600
.users.uid
 - pg_num: 1600
 - pgp_num: 1600
===


 .rgw
 =
 KB: 1,966,932
 objects:9,094,552
 rd:   195,747,645
  rd KB:   153,585,472
 wr:30,191,844
  wr KB:10,751,065
 
 .rgw.buckets
 =
 KB: 2,038,313,855
 objects:   22,088,103
 rd: 5,455,123
  rd KB:   408,416,317
 wr:   149,377,728
  wr KB: 1,882,517,472
 
 .rgw.buckets.index
 =
 KB: 0
 objects:5,374,376
 rd:   267,996,778
  rd KB:   262,626,106
 wr:   107,142,891
  wr KB: 0
 
 .rgw.control
 =
 KB: 0
 objects:8
 rd: 0
  rd KB: 0
 wr: 0
  wr KB: 0
 
 .rgw.gc
 =
 KB: 0
 objects:   32
 rd: 5,554,407
  rd KB: 5,713,942
 wr: 8,355,934
  wr KB: 0
 
 .rgw.root
 =
 KB: 1
 objects:3
 rd:   524
  rd KB:   346
 wr: 3
  wr KB: 3


Daniel

 On 08 Oct 2014, at 01:03, Yehuda Sadeh yeh...@redhat.com wrote:
 
 This operation stalled quite a bit, seems that it was waiting for the osd:
 
 2.547155 7f036ffc7700  1 -- 10.102.4.11:0/1009401 --
 10.102.4.14:6809/7428 -- osd_op(client.78418684.0:27514711
 .bucket.meta.CNT-UUID-FINDME:default.78418684.122043 [call
 version.read,getxattrs,stat] 5.3b7d1197 ack+read e16034) v4 -- ?+0
 0x7f026802e2c0 con 0x7f040c055ca0
 ...
 7.619750 7f041ddf4700  1 -- 10.102.4.11:0/1009401 == osd.32
 10.102.4.14:6809/7428 208252  osd_op_reply(27514711
 .bucket.meta.CNT-UUID-FINDME:default.78418684.122043
 [call,getxattrs,stat] v0'0 uv6371 ondisk = 0) v6  338+0+336
 (3685145659 0 4232894755) 0x7f00e430f540 con 0x7f040c055ca0
 
 By looking at these logs it seems that there are only 8 pgs on the
 .rgw pool, if this is correct then you may want to change that
 considering your workload.
 
 Yehuda
 
 
 On Tue, Oct 7, 2014 at 3:46 PM, Daniel Schneller
 daniel.schnel...@centerdevice.com wrote:
 Hi!
 
 Sorry, I must have missed the enabling of that debug module.
 However, the test setup has been the same all the time -
 I only have the one test-application :)
 
 But maybe I phrased it a bit ambiguously when I wrote
 
 It then continuously created containers - both empty
 and such with 10 objects of around 100k random data in them.
 
 100 kilobytes is the size of a single object, of which we create 10
 per container. The container gets created first, without any
 objects, naturally, then 10 objects are added. One of these objects
 is called “version”, the rest have generated names with a fixed
 prefix and appended 1-9. The version object is the one I picked
 for the example logs I sent earlier.
 
 I hope this makes the setup clearer.
 
 Attached you will find the (now more extensive) logs for the outliers
 again. As you did not say that I garbled the logs, I assume the
 pre-processing was OK, so I have prepared the new data in a similar
 fashion, marking the relevant request with CNT-UUID-FINDME.
 
 I have not removed any lines in between the beginning of the
 “interesting” request and its completion to keep all the network
 traffic log intact. Due to the increased verbosity, I will not post
 the logs inline, but only attach them gzipped.
 
 As before, should the full data set be needed, I can provide
 an archived version.
 
 
 
 
 Thanks for your support!
 Daniel
 
 
 
 
 On 07 Oct 2014, at 22:45, Yehuda Sadeh yeh...@redhat.com wrote:
 
 The logs here don't include the messenger (debug ms = 1). It's hard to
 tell what going on from looking at the outliers. Also, in your
 previous mail you

Re: [ceph-users] max_bucket limit -- safe to disable?

2014-10-06 Thread Daniel Schneller
Hi again!

We have done some tests regarding the limits of storing lots and 
lots of buckets through Rados Gateway into Ceph.

Our test used a single user for which we removed the default max
buckets limit. It then continuously created containers - both empty
and such with 10 objects of around 100k random data in them.

With 3 parallel processes we saw relatively consistent time of
about 500-700ms per such container.

This kept steady until we reached approx. 3 million containers
after which the time per insert sharply went up to currently
around   1600ms   and rising. Due to some hiccups with network 
equipment the tests were aborted a few times, but then resumed without
deleting any of the previous runs created containers, so the actual
number might be 2.8 or 3.2 million, but still in that ballpark.
We aborted the test here. 

Judging by the advice given earlier (see quoted mail below) that
we might hit a limit on some per-user data structures, we created 
another user account, removed its max-bucket limit as well and
restarted the benchmark with that one, _expecting_ the times to be
down to the original range of 500-700ms.

However, what we are seeing is that the times stay at the   1600ms
and higher levels even for that fresh account.
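
In case it is relevant: I assume the per-user bucket list lives in the omap
of the <uid>.buckets object in the .users.uid pool (please correct me if
that is wrong), so its size could presumably be gauged with something like:

    rados -p .users.uid listomapkeys <uid>.buckets | wc -l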

Here is the output of `rados df`, reformatted to fit the email.
clones, degraded and unfound were 0 in all cases and have been
left out for clarity:

.rgw
=
   KB: 1,966,932
  objects: 9,094,552
   rd:   195,747,645
rd KB:   153,585,472
   wr:30,191,844
wr KB:10,751,065

.rgw.buckets
=
   KB: 2,038,313,855
  objects:22,088,103
   rd: 5,455,123
rd KB:   408,416,317
   wr:   149,377,728
wr KB: 1,882,517,472

.rgw.buckets.index
=
   KB: 0
  objects: 5,374,376
   rd:   267,996,778
rd KB:   262,626,106
   wr:   107,142,891
wr KB: 0

.rgw.control
=
   KB: 0
  objects: 8
   rd: 0
rd KB: 0
   wr: 0
wr KB: 0

.rgw.gc
=
   KB: 0
  objects:32
   rd: 5,554,407
rd KB: 5,713,942
   wr: 8,355,934
wr KB: 0

.rgw.root
=
   KB: 1
  objects: 3
   rd:   524
rd KB:   346
   wr: 3
wr KB: 3


We would very much like to understand what is going on here 
in order to decide if Rados Gateway is a viable option to base
our production system on (where we expect similar counts
as in the benchmark), or if we need to investigate using librados
directly which we would like to avoid if possible.

Any advice on what configuration parameters to check or
which additional information to provide to analyze this would be
very much welcome.

Cheers,
Daniel


-- 
Daniel Schneller
Mobile Development Lead

CenterDevice GmbH  | Merscheider Straße 1
  | 42699 Solingen
tel: +49 1754155711| Deutschland
daniel.schnel...@centerdevice.com  | www.centerdevice.com




 On 10 Sep 2014, at 19:42, Gregory Farnum g...@inktank.com wrote:
 
 On Wednesday, September 10, 2014, Daniel Schneller
 daniel.schnel...@centerdevice.com wrote:
 On 09 Sep 2014, at 21:43, Gregory Farnum g...@inktank.com  wrote:
 
 
 Yehuda can talk about this with more expertise than I can, but I think
 it should be basically fine. By creating so many buckets you're
 decreasing the effectiveness of RGW's metadata caching, which means
 the initial lookup in a particular bucket might take longer.
 
 Thanks for your thoughts. With “initial lookup in a particular bucket”
 do you mean accessing any of the objects in a bucket? If we directly
 access the object (not enumerating the bucket's contents), would that
 still be an issue?
 Just trying to understand the inner workings a bit better to make
 more educated guesses :)
 
 When doing an object lookup, the gateway combines the bucket ID with a 
 mangled version of the object name to try and do a read out of RADOS. It 
 first needs to get that bucket ID though -- it will cache the bucket
 name-ID mapping, but if you have a ton of buckets there could be enough 
 entries to degrade the cache's effectiveness. (So, you're more likely to pay 
 that extra disk access lookup.)
  
 
 
 The big concern is that we do maintain a per-user list of all their
 buckets — which is stored in a single RADOS object — so if you have an
 extreme number of buckets that RADOS object could get pretty big and
 become a bottleneck when creating/removing/listing the buckets. You
 
 Alright. Listing buckets is no problem, that we don’t do. Can you
 say what “pretty big

[ceph-users] Icehouse Ceph -- live migration fails?

2014-09-25 Thread Daniel Schneller
Hi!

We have an Icehouse system running with librbd based Cinder and Glance
configurations, storing images and volumes in Ceph.

Configuration is (apart from network setup details, of course) by the
book / OpenStack setup guide.

Works very nicely, including regular migration, but live migration of
virtual machines fails. I created a simple machine booting from a volume
based off the Ubuntu 14.04.1 cloud image for testing. 

Using Horizon, I can move this VM from host to host, but when I try to
live migrate it from one bare-metal host to another, I get the error
message “Failed to live migrate instance to host ’node02’”.

The only related log entry I recognize is in the controller’s nova-api.log:


2014-09-25 17:15:47.679 3616 INFO nova.api.openstack.wsgi 
[req-f3dc3c2e-d366-40c5-a1f1-31db71afd87a f833f8e2d1104e66b9abe9923751dcf2 
a908a95a87cc42cd87ff97da4733c414] HTTP exception thrown: Compute service of 
node02.baremetal.clusterb.centerdevice.local is unavailable at this time.
2014-09-25 17:15:47.680 3616 INFO nova.osapi_compute.wsgi.server 
[req-f3dc3c2e-d366-40c5-a1f1-31db71afd87a f833f8e2d1104e66b9abe9923751dcf2 
a908a95a87cc42cd87ff97da4733c414] 10.102.6.8 POST 
/v2/a908a95a87cc42cd87ff97da4733c414/servers/0f762f35-64ee-461f-baa4-30f5de4d5ddf/action
 HTTP/1.1 status: 400 len: 333 time: 0.1479030

I cannot see anything of value on the destination host itself.

New machines get scheduled there, so the compute service cannot really
be down.

In this thread
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-March/019944.html
Travis describes a similar situation; however, that was on Folsom, so I wonder
if it is still applicable.

Would be great to get some outside opinion :)

Thanks!
Daniel

-- 
Daniel Schneller
Mobile Development Lead
 
CenterDevice GmbH  | Merscheider Straße 1
   | 42699 Solingen
tel: +49 1754155711| Deutschland
daniel.schnel...@centerdevice.com  | www.centerdevice.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] max_bucket limit -- safe to disable?

2014-09-10 Thread Daniel Schneller
On 09 Sep 2014, at 21:43, Gregory Farnum g...@inktank.com wrote:


 Yehuda can talk about this with more expertise than I can, but I think
 it should be basically fine. By creating so many buckets you're
 decreasing the effectiveness of RGW's metadata caching, which means
 the initial lookup in a particular bucket might take longer.

Thanks for your thoughts. With “initial lookup in a particular bucket”
do you mean accessing any of the objects in a bucket? If we directly
access the object (not enumerating the bucket's contents), would that
still be an issue?
Just trying to understand the inner workings a bit better to make
more educated guesses :)
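
Just to quantify my own guess about the cache effect: if the gateway's
metadata cache is bounded to something like 10,000 entries (an assumption on
my part -- please correct me if that is off) and accesses are spread roughly
evenly over the ~3 million buckets from our benchmark, the numbers would look
like this:

# Both values are assumptions, not measurements.
cache_entries = 10000          # assumed metadata cache capacity
total_buckets = 3000000        # roughly what our benchmark created
hit_rate = cache_entries / float(total_buckets)
print("expected bucket name->ID cache hit rate: %.2f%%" % (hit_rate * 100))  # ~0.33%

If that is anywhere near correct, practically every lookup would miss the
cache and pay the extra read.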


 The big concern is that we do maintain a per-user list of all their
 buckets — which is stored in a single RADOS object — so if you have an
 extreme number of buckets that RADOS object could get pretty big and
 become a bottleneck when creating/removing/listing the buckets. You

Alright. Listing buckets is not a problem; we don't do that. Can you
say what “pretty big” would be in terms of MB? How much space does a
bucket record consume in there? Based on that I could run a few numbers.
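
To illustrate the kind of napkin math I would like to do once I know the real
per-record size (the 300 bytes below are a pure guess on my part):

bytes_per_entry = 300          # assumed average size of one bucket record -- a guess
buckets = 10 * 1000 * 1000     # tens of millions of containers over time
total_mb = buckets * bytes_per_entry / (1024.0 * 1024.0)
print("~%.0f MB in the single per-user list object" % total_mb)   # roughly 2,900 MB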


 should run your own experiments to figure out what the limits are
 there; perhaps you have an easy way of sharding up documents into
 different users.

Good advice. We can do that per distributor (an org unit in our
software) to at least compartmentalize any potential locking issues
in this area to that single entity. Still, there would be quite
a lot of buckets/objects per distributor, so some more detail on
the above items would be great.

Thanks a lot!


Daniel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] max_bucket limit -- safe to disable?

2014-09-09 Thread Daniel Schneller
Hi list!

Under 
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-September/033670.html 
I found a situation not unlike ours, but unfortunately either 
the list archive fails me or the discussion ended without a 
conclusion, so I dare to ask again :)

We currently have a setup of 4 servers with 12 OSDs each, 
combined journal and data. No SSDs.

We develop a document management application that accepts user
uploads of all kinds of documents and processes them in several
ways. For any given document, we might create anywhere from 10s
to several hundred dependent artifacts.

We are now preparing to move from Gluster to a Ceph based
backend. The application uses the Apache JClouds Library to 
talk to the Rados Gateways that are running on all 4 of these 
machines, load balanced by haproxy. 

We currently intend to create one container for each document 
and put all the dependent and derived artifacts as objects into
that container. 
This gives us a nice compartmentalization per document, also 
making it easy to remove a document and everything that is
connected with it.
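
As a sketch of what removing a document then boils down to (placeholder
names, and python-boto against the S3 API rather than our actual JClouds
code):

import boto
import boto.s3.connection

def delete_document(conn, document_id):
    # One bucket per document: dropping the bucket removes every derived artifact.
    bucket = conn.get_bucket('doc-%s' % document_id)
    for key in bucket.list():
        key.delete()
    conn.delete_bucket(bucket.name)

# conn would be a boto S3 connection to the gateways (behind haproxy), e.g.:
# conn = boto.connect_s3(aws_access_key_id='...', aws_secret_access_key='...',
#                        host='rgw.example.com', is_secure=False,
#                        calling_format=boto.s3.connection.OrdinaryCallingFormat())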

During the first test runs we ran into the default limit of
1000 containers per user. In the thread mentioned above that 
limit was removed (setting the max_buckets value to 0). We did 
that and now can upload more than 1000 documents.

I just would like to understand

a) if this design is recommended, or if there are reasons to go
   about the whole issue in a different way, potentially giving
   up the benefit of having all document artifacts under one
   convenient handle.

b) is there any absolute limit for max_buckets that we will run
   into? Remember we are talking about 10s of millions of 
   containers over time.

c) are any performance issues to be expected with this design
   and can we tune any parameters to alleviate this?

Any feedback would be very much appreciated.

Regards,
Daniel

-- 
Daniel Schneller
Mobile Development Lead
 
CenterDevice GmbH  | Merscheider Straße 1
   | 42699 Solingen
tel: +49 1754155711| Deutschland
daniel.schnel...@centerdevice.com  | www.centerdevice.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com