Re: [ceph-users] mds isn't working anymore after osd's running full

2014-08-20 Thread Jasper Siero
Unfortunately that doesn't help. I restarted both the active and standby mds 
but that doesn't change the state of the mds. Is there a way to force the mds 
to look at the 1832 epoch (or earlier) instead of 1833 (need osdmap epoch 1833, 
have 1832)? 

Thanks,

Jasper

From: Gregory Farnum [g...@inktank.com]
Sent: Tuesday, 19 August 2014 19:49
To: Jasper Siero
CC: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] mds isn't working anymore after osd's running full

On Mon, Aug 18, 2014 at 6:56 AM, Jasper Siero
jasper.si...@target-holding.nl wrote:
 Hi all,

 We have a small ceph cluster running version 0.80.1 with cephfs on five
 nodes.
 Last week some osd's were full and shut themselves down. To help the osd's start
 again I added some extra osd's and moved some placement group directories on
 the full osd's (which have a copy on another osd) to another place on the
 node (as mentioned in
 http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/).
 After clearing some space on the full osd's I started them again. After a
 lot of deep scrubbing and two pg inconsistencies which needed to be repaired,
 everything looked fine except the mds, which is still in the replay state and
 stays that way.
 The log below says that the mds needs osdmap epoch 1833 but only has 1832.

 2014-08-18 12:29:22.268248 7fa786182700  1 mds.-1.0 handle_mds_map standby
 2014-08-18 12:29:22.273995 7fa786182700  1 mds.0.25 handle_mds_map i am now
 mds.0.25
 2014-08-18 12:29:22.273998 7fa786182700  1 mds.0.25 handle_mds_map state
 change up:standby --> up:replay
 2014-08-18 12:29:22.274000 7fa786182700  1 mds.0.25 replay_start
 2014-08-18 12:29:22.274014 7fa786182700  1 mds.0.25  recovery set is
 2014-08-18 12:29:22.274016 7fa786182700  1 mds.0.25  need osdmap epoch 1833,
 have 1832
 2014-08-18 12:29:22.274017 7fa786182700  1 mds.0.25  waiting for osdmap 1833
 (which blacklists prior instance)

  # ceph status
 cluster c78209f5-55ea-4c70-8968-2231d2b05560
  health HEALTH_WARN mds cluster is degraded
  monmap e3: 3 mons at
 {th1-mon001=10.1.2.21:6789/0,th1-mon002=10.1.2.22:6789/0,th1-mon003=10.1.2.23:6789/0},
 election epoch 362, quorum 0,1,2 th1-mon001,th1-mon002,th1-mon003
  mdsmap e154: 1/1/1 up {0=th1-mon001=up:replay}, 1 up:standby
  osdmap e1951: 12 osds: 12 up, 12 in
   pgmap v193685: 492 pgs, 4 pools, 60297 MB data, 470 kobjects
 124 GB used, 175 GB / 299 GB avail
  492 active+clean

 # ceph osd tree
 # id    weight    type name          up/down  reweight
 -1      0.2399    root default
 -2      0.05997       host th1-osd001
 0       0.01999           osd.0      up       1
 1       0.01999           osd.1      up       1
 2       0.01999           osd.2      up       1
 -3      0.05997       host th1-osd002
 3       0.01999           osd.3      up       1
 4       0.01999           osd.4      up       1
 5       0.01999           osd.5      up       1
 -4      0.05997       host th1-mon003
 6       0.01999           osd.6      up       1
 7       0.01999           osd.7      up       1
 8       0.01999           osd.8      up       1
 -5      0.05997       host th1-mon002
 9       0.01999           osd.9      up       1
 10      0.01999           osd.10     up       1
 11      0.01999           osd.11     up       1

 What is the way to get the mds up and running again?

 I still have all the placement group directories which I moved from the full
 osds which were down to create disk space.

Try just restarting the MDS daemon. This sounds a little familiar so I
think it's a known bug which may be fixed in a later dev or point
release on the MDS, but it's a soft-state rather than a disk state
issue.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] some pgs active+remapped, Ceph can not recover itself.

2014-08-20 Thread debian Only
Thanks, Lewis. And I take the suggestion that it is better to use OSDs of
similar size.


2014-08-20 9:24 GMT+07:00 Craig Lewis cle...@centraldesktop.com:

 I believe you need to remove the authorization for osd.4 and osd.6 before
 re-creating them.

 When I re-format disks, I migrate data off of the disk using:
   ceph osd out $OSDID

 Then wait for the remapping to finish.  Once it does:
   stop ceph-osd id=$OSDID
   ceph osd out $OSDID
   ceph auth del osd.$OSDID
   ceph osd crush remove osd.$OSDID
   ceph osd rm $OSDID

 Ceph will migrate the data off of it.  When it's empty, you can delete it
 using the above commands. Since osd.4 and osd.6 are already lost, you can
 just do the part after remapping finishes for them.


 You could be having trouble because the sizes of the OSDs are so different.
  I wouldn't mix OSDs that are 100GB and 1.8TB.  Most of the stuck PGs are
 on osd.5, osd.7, and one of the small OSDs.  You can migrate data off of
 those small disks the same way I said to do osd.10.



 On Tue, Aug 19, 2014 at 6:34 AM, debian Only onlydeb...@gmail.com wrote:

 This happened after some OSDs failed and I recreated them.

 I ran  ceph osd rm osd.4  (and the same for osd.6) to remove osd.4 and osd.6. But when
 I use ceph-deploy to install an OSD with
  ceph-deploy osd --zap-disk --fs-type btrfs create ceph0x-vm:sdb,
 ceph-deploy says the new osd is ready,
  but the OSD cannot start: ceph-disk fails with an auth error on
  /var/lib/ceph/bootstrap-osd/ceph.keyring,
  and I have checked that the ceph.keyring is the same as on the live OSDs.

  When I ran ceph-deploy twice, it first created osd.4, which failed but still
 shows in the osd tree; then the same happened with osd.6.
  The next ceph-deploy osd run created osd.10, and that OSD starts
 successfully, but osd.4 and osd.6 show as down in the osd tree.

  When I ran ceph osd reweight-by-utilization once, even more pgs went
 active+remapped, and Ceph cannot recover by itself.

  The crush map tunables are already set to optimal. I don't know how to solve this.

 root@ceph-admin:~# ceph osd crush dump
 { "devices": [
     { "id": 0, "name": "osd.0"},
     { "id": 1, "name": "osd.1"},
     { "id": 2, "name": "osd.2"},
     { "id": 3, "name": "osd.3"},
     { "id": 4, "name": "device4"},
     { "id": 5, "name": "osd.5"},
     { "id": 6, "name": "device6"},
     { "id": 7, "name": "osd.7"},
     { "id": 8, "name": "osd.8"},
     { "id": 9, "name": "osd.9"},
     { "id": 10, "name": "osd.10"}],
   "types": [
     { "type_id": 0, "name": "osd"},
     { "type_id": 1, "name": "host"},
     { "type_id": 2, "name": "chassis"},
     { "type_id": 3, "name": "rack"},
     { "type_id": 4, "name": "row"},
     { "type_id": 5, "name": "pdu"},
     { "type_id": 6, "name": "pod"},
     { "type_id": 7, "name": "room"},
     { "type_id": 8, "name": "datacenter"},
     { "type_id": 9, "name": "region"},
     { "type_id": 10, "name": "root"}],
   "buckets": [
     { "id": -1, "name": "default", "type_id": 10, "type_name": "root",
       "weight": 302773, "alg": "straw", "hash": "rjenkins1",
       "items": [
         { "id": -2, "weight": 5898, "pos": 0},
         { "id": -3, "weight": 5898, "pos": 1},
         { "id": -4, "weight": 5898, "pos": 2},
         { "id": -5, "weight": 12451, "pos": 3},
         { "id": -6, "weight": 13107, "pos": 4},
         { "id": -7, "weight": 87162, "pos": 5},
         { "id": -8, "weight": 49807, "pos": 6},
         { "id": -9, "weight": 116654, "pos": 7},
         { "id": -10, "weight": 5898, "pos": 8}]},
     { "id": -2, "name": "ceph02-vm", "type_id": 1, "type_name": "host",
       "weight": 5898, "alg": "straw", "hash": "rjenkins1",
       "items": [
         { "id": 0, "weight": 5898, "pos": 0}]},
     { "id": -3, "name": "ceph03-vm", "type_id": 1, "type_name": "host",
       "weight": 5898, "alg": "straw", "hash": "rjenkins1",
       "items": [
         { "id": 1, "weight": 5898, "pos": 0}]},
     { "id": -4, "name": "ceph01-vm", "type_id": 1, "type_name": "host",
       "weight": 5898, "alg": "straw", "hash": "rjenkins1",
       "items": [
         { "id": 2, "weight": 5898, "pos": 0}]},
     { "id": -5, "name": "ceph04-vm", "type_id": 1, "type_name": "host",

Re: [ceph-users] Problem when building & running cuttlefish from source on Ubuntu 14.04 Server

2014-08-20 Thread NotExist
Hello Gregory:
I'm doing a performance comparison between different
combinations of environments, so I have to try such an old version.
Thanks for your kind help! The solution you provided does work! I
think I was relying on ceph-disk too much, which is why I didn't notice
this.
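
For the archives: I suspect the doubled /usr/local/usr/local/sbin path just
comes from the default autotools prefix; a configure invocation along these
lines (my guess, not something verified in this thread) should put everything
in the usual system locations:

    ./autogen.sh
    ./configure --prefix=/usr --sysconfdir=/etc --localstatedir=/var
    make -j8
    sudo make install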

2014-08-20 1:44 GMT+08:00 Gregory Farnum g...@inktank.com:
 On Thu, Aug 14, 2014 at 2:28 AM, NotExist notex...@gmail.com wrote:
 Hello everyone:

 Since there's no cuttlefish package for 14.04 server on ceph
 repository (only ceph-deploy there), I tried to build cuttlefish from
 source on 14.04.

 ...why? Cuttlefish is old and no longer provided updates. You really
 want to be using either Dumpling or Firefly.


 Here's what I did:
 Get source by following http://ceph.com/docs/master/install/clone-source/
 Enter the sourcecode directory
 git checkout cuttlefish
 git submodule update
 rm -rf src/civetweb/ src/erasure-code/ src/rocksdb/
 to get the latest cuttlefish repo.

 Build source by following http://ceph.com/docs/master/install/build-ceph/
 beside the package this url mentioned for Ubuntu:

 sudo apt-get install autotools-dev autoconf automake cdbs gcc g++ git
 libboost-dev libedit-dev libssl-dev libtool libfcgi libfcgi-dev
 libfuse-dev linux-kernel-headers libcrypto++-dev libcrypto++
 libexpat1-dev pkg-config
 sudo apt-get install uuid-dev libkeyutils-dev libgoogle-perftools-dev
 libatomic-ops-dev libaio-dev libgdata-common libgdata13 libsnappy-dev
 libleveldb-dev

 I also found it will need

 sudo apt-get install libboost-filesystem-dev libboost-thread-dev
 libboost-program-options-dev

 (And xfsprogs if you need xfs)
 after all packages are installed, I start to complie according to the doc:

 ./autogen.sh
 ./configure
 make -j8

 And install following
 http://ceph.com/docs/master/install/install-storage-cluster/#installing-a-build

 sudo make install

 everything seems fine, but I found ceph_common.sh had been put into
 '/usr/local/lib/ceph', and some tools were put into
 /usr/local/usr/local/sbin/ (ceph-disk* and ceph-create-keys). I was
 used to using ceph-disk to prepare the disks on other deployments (on
 other machines with Emperor), but I can't do it now (and maybe the
 path is the reason), so I chose to do everything manually.

 I follow the doc
 http://ceph.com/docs/master/install/manual-deployment/ to deploy the
 cluster many times, but it turns out different this time.
 /etc/ceph isn't there, therefore I sudo mkdir /etc/ceph
 Put a ceph.conf into /etc/ceph
 Generate all required keys in /etc/ceph instead of /tmp/ to keep them

 ceph-authtool --create-keyring /etc/ceph/ceph.mon.keyring --gen-key -n
 mon. --cap mon 'allow *'
 ceph-authtool --create-keyring /etc/ceph/ceph.client.admin.keyring
 --gen-key -n client.admin --set-uid=0 --cap mon 'allow *' --cap osd
 'allow *' --cap mds 'allow'
 ceph-authtool /etc/ceph/ceph.mon.keyring --import-keyring
 /etc/ceph/ceph.client.admin.keyring

 Generate monmap with monmaptool

 monmaptool --create --add storage01 192.168.11.1 --fsid
 9f8fffe3-040d-4641-b35a-ffa90241f723 /etc/ceph/monmap

 /var/lib/ceph is not there either

 sudo mkdir -p /var/lib/ceph/mon/ceph-storage01
 sudo ceph-mon --mkfs -i storage01 --monmap /etc/ceph/monmap --keyring
 /etc/ceph/ceph.mon.keyring

 log directory are not there, so I create it manually:

 sudo mkdir /var/log/ceph

 since service doesn't work, I start mon daemon manually:

 sudo /usr/local/bin/ceph-mon -i storage01

 and ceph -s looks like these:
 storage@storage01:~/ceph$ ceph -s
health HEALTH_ERR 192 pgs stuck inactive; 192 pgs stuck unclean; no osds
monmap e1: 1 mons at {storage01=192.168.11.1:6789/0}, election
 epoch 2, quorum 0 storage01
osdmap e1: 0 osds: 0 up, 0 in
 pgmap v2: 192 pgs: 192 creating; 0 bytes data, 0 KB used, 0 KB / 0 KB 
 avail
mdsmap e1: 0/0/1 up

 And I add disks as osd by following manual commands:
 sudo mkfs -t xfs -f /dev/sdb
 sudo mkdir /var/lib/ceph/osd/ceph-1
 sudo mount /dev/sdb /var/lib/ceph/osd/ceph-1/
 sudo ceph-osd -i 1 --mkfs --mkkey
 ceph osd create
 ceph osd crush add osd.1 1.0 host=storage01
 sudo ceph-osd -i 1

 for 10 times, and I got:
 storage@storage01:~/ceph$ ceph osd tree

 # id  weight  type name   up/down reweight
 -2  10  host storage01
 0   1   osd.0   up  1
 1   1   osd.1   up  1
 2   1   osd.2   up  1
 3   1   osd.3   up  1
 4   1   osd.4   up  1
 5   1   osd.5   up  1
 6   1   osd.6   up  1
 7   1   osd.7   up  1
 8   1   osd.8   up  1
 9   1   osd.9   up  1
 -1  0   root default

 and

 storage@storage01:~/ceph$ ceph -s
health HEALTH_WARN 192 pgs stuck inactive; 192 pgs stuck unclean
monmap e1: 1 mons at {storage01=192.168.11.1:6789/0}, election
 epoch 2, quorum 0 storage01
osdmap e32: 10 osds: 10 up, 10 in
 pgmap v56: 

[ceph-users] Starting Ceph OSD

2014-08-20 Thread Pons
 

Hi All,
We noticed that two of our osds are reported as down by the ceph osd tree
command. We tried starting them using the following commands, but ceph osd
tree still reports them as down. Please see below for the commands
used.

command: sudo start ceph-osd id=osd.0
output:  ceph-osd (ceph/osd.0) stop/pre-start, process 3831

ceph osd tree output:
# id    weight   type name          up/down  reweight
-1      5.13     root default
-2      1.71         host ceph-node1
0       0.8              osd.0      down     0
2       0.91             osd.2      down     0
-3      1.71         host ceph-node2

command: sudo start ceph-osd id=0
output:  ceph-osd (ceph/0) start/running, process 3887

ceph osd tree output:
# id    weight   type name          up/down  reweight
-1      5.13     root default
-2      1.71         host ceph-node1
0       0.8              osd.0      down     0
2       0.91             osd.2      down     0
-3      1.71         host ceph-node2

command: sudo start ceph-osd id=0
output:  ceph-osd (ceph/0) start/running, process 4348

ceph osd tree output:
# id    weight   type name          up/down  reweight
-1      5.22     root default
-2      1.8          host ceph-node1
0       0.8              osd.0      down     0
2       0.91             osd.2      down     0

Are there any other ways to start an OSD? I'm out
of ideas. What we normally do is run the ceph-deploy activate command to
bring an OSD up. Is that the right way to do it? We are using ceph
version 0.80.4.

Thanks!

Regards,
Pons

 ___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RadosGW problems

2014-08-20 Thread Marco Garcês
Hello,

Yehuda, I know I was using the correct fastcgi module; it was the one from the
Ceph repositories. I had also disabled all other modules in Apache.

I tried to create a second swift user, using the provided instructions,
only to get the following:

# radosgw-admin user create --uid=marcogarces --display-name=Marco Garces
# radosgw-admin subuser create --uid=marcogarces
--subuser=marcogarces:swift --access=full
# radosgw-admin key create --subuser=marcogarces:swift --key-type=swift
--gen-secret
could not create key: unable to add access key, unable to store user info
2014-08-20 13:19:33.664945 7f925b130880  0 WARNING: can't store user info,
swift id () already mapped to another user (marcogarces)


So I have created another user, some other way:

# radosgw-admin user create --subuser=testuser:swift --display-name=Test
User One --key-type=swift --access=full
{ "user_id": "testuser",
  "display_name": "Test User One",
  "email": "",
  "suspended": 0,
  "max_buckets": 1000,
  "auid": 0,
  "subusers": [],
  "keys": [],
  "swift_keys": [
    { "user": "testuser:swift",
      "secret_key": "MHA4vFaDy5XsJq+F5NuZLcBMCoJcuot44ASDuReY"}],
  "caps": [],
  "op_mask": "read, write, delete",
  "default_placement": "",
  "placement_tags": [],
  "bucket_quota": { "enabled": false,
      "max_size_kb": -1,
      "max_objects": -1},
  "user_quota": { "enabled": false,
      "max_size_kb": -1,
      "max_objects": -1},
  "temp_url_keys": []}


Now, when I do, from the client:

swift -V 1 -A http://gateway.bcitestes.local/auth -U testuser:swift -K
MHA4vFaDy5XsJq+F5NuZLcBMCoJcuot44ASDuReY stat
   Account: v1
Containers: 0
   Objects: 0
 Bytes: 0
Server: Tengine/2.0.3
Connection: keep-alive
X-Account-Bytes-Used-Actual: 0
  Content-Type: text/plain; charset=utf-8


If I try using https, I still have errors:

swift --insecure -V 1 -A https://gateway.bcitestes.local/auth -U
testuser:swift -K MHA4vFaDy5XsJq+F5NuZLcBMCoJcuot44ASDuReY stat
Account HEAD failed: http://gateway.bcitestes.local:443/swift/v1 400 Bad
Request


And I could not validate this account using a Swift client (Cyberduck);
also, there are no S3 credentials!
How can I create a user with both S3 and Swift credentials, valid
for use with http/https and with all clients (command line and GUI)? The
first user works great with the S3 credentials in all scenarios.
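
(For what it's worth, the sequence I understand to be the documented way to end
up with a single user holding both S3 and Swift keys is roughly the one below;
the uid is just a placeholder, and it includes the same key-create step that
failed for me above, so it may well hit the same bug:

    radosgw-admin user create --uid=johndoe --display-name="John Doe"
    radosgw-admin subuser create --uid=johndoe --subuser=johndoe:swift --access=full
    radosgw-admin key create --subuser=johndoe:swift --key-type=swift --gen-secret

The first command already generates the S3 access/secret pair for the user.)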

Thank you,
Marco Garcês

On Tue, Aug 19, 2014 at 7:59 PM, Yehuda Sadeh yeh...@inktank.com wrote:

 On Tue, Aug 19, 2014 at 5:32 AM, Marco Garcês ma...@garces.cc wrote:
 
  UPDATE:
 
  I have installed Tengine (nginx fork) and configured both HTTP and HTTPS
 to use radosgw socket.

 Looking back at this thread, and considering this solution it seems to
 me that you were running the wrong apache fastcgi module.

 
  I can login with S3, create buckets and upload objects.
 
  It's still not possible to use Swift credentials, can you help me on
 this part? What do I use when I login (url, username, password) ?
  Here is the info for the user:
 
  radosgw-admin user info --uid=mgarces
  { "user_id": "mgarces",
    "display_name": "Marco Garces",
    "email": "marco.gar...@bci.co.mz",
    "suspended": 0,
    "max_buckets": 1000,
    "auid": 0,
    "subusers": [
      { "id": "mgarces:swift",
        "permissions": "full-control"}],
    "keys": [
      { "user": "mgarces:swift",
        "access_key": "AJW2BCBXHFJ1DPXT112O",
        "secret_key": ""},
      { "user": "mgarces",
        "access_key": "S88Y6ZJRACZG49JFPY83",
        "secret_key": "PlubMMjfQecJ5Py46e2kZz5VuUgHgsjLmYZDRdFg"}],
    "swift_keys": [
      { "user": "mgarces:swift",
        "secret_key": "TtKWhY67ujhjn36\/nhv44A2BVPw5wDi3Sp13YrMM"}],
    "caps": [],
    "op_mask": "read, write, delete",
    "default_placement": "",
    "placement_tags": [],
    "bucket_quota": { "enabled": false,
        "max_size_kb": -1,
        "max_objects": -1},
    "user_quota": { "enabled": false,
        "max_size_kb": -1,
        "max_objects": -1},
    "temp_url_keys": []}
 

 You might be hitting issue #8587 (aka #9155). Try creating a second
 swift user, see if it still happens.

 Yehuda

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Serious performance problems with small file writes

2014-08-20 Thread Hugo Mills
   We have a ceph system here, and we're seeing performance regularly
descend into unusability for periods of minutes at a time (or longer).
This appears to be triggered by writing large numbers of small files.

   Specifications:

ceph 0.80.5
6 machines running 3 OSDs each (one 4 TB rotational HD per OSD, 2 threads)
2 machines running primary and standby MDS
3 monitors on the same machines as the OSDs
Infiniband to about 8 CephFS clients (headless, in the machine room)
Gigabit ethernet to a further 16 or so CephFS clients (Linux desktop
   machines, in the analysis lab)

   The cluster stores home directories of the users and a larger area
of scientific data (approx 15 TB) which is being processed and
analysed by the users of the cluster.

   We have a relatively small number of concurrent users (typically
4-6 at most), who use GUI tools to examine their data, and then
complex sets of MATLAB scripts to process it, with processing often
being distributed across all the machines using Condor.

   It's not unusual to see the analysis scripts write out large
numbers (thousands, possibly tens or hundreds of thousands) of small
files, often from many client machines at once in parallel. When this
happens, the ceph cluster becomes almost completely unresponsive for
tens of seconds (or even for minutes) at a time, until the writes are
flushed through the system. Given the nature of modern GUI desktop
environments (often reading and writing small state files in the
user's home directory), this means that desktop interactiveness and
responsiveness for all the other users of the cluster suffer.

   1-minute load on the servers typically peaks at about 8 during
these events (on 4-core machines). Load on the clients also peaks
high, because of the number of processes waiting for a response from
the FS. The MDS shows little sign of stress -- it seems to be entirely
down to the OSDs. ceph -w shows requests blocked for more than 10
seconds, and in bad cases, ceph -s shows up to many hundreds of
requests blocked for more than 32s.

   We've had to turn off scrubbing and deep scrubbing completely --
except between 01.00 and 04.00 every night -- because it triggers the
exact same symptoms, even with only 2-3 PGs being scrubbed. If it gets
up to 7 PGs being scrubbed, as it did on Monday, it's completely
unusable.

   Is this problem something that's often seen? If so, what are the
best options for mitigation or elimination of the problem? I've found
a few references to issue #6278 [1], but that seems to be referencing
scrub specifically, not ordinary (if possibly pathological) writes.

   What are the sorts of things I should be looking at to work out
where the bottleneck(s) are? I'm a bit lost about how to drill down
into the ceph system for identifying performance issues. Is there a
useful guide to tools somewhere?

   Is an upgrade to 0.84 likely to be helpful? How development are
the development releases, from a stability / dangerous bugs point of
view?

   Thanks,
   Hugo.

[1] http://tracker.ceph.com/issues/6278

-- 
Hugo Mills :: IT Services, University of Reading
Specialist Engineer, Research Servers
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Serious performance problems with small file writes

2014-08-20 Thread Dan Van Der Ster
Hi,

Do you get slow requests during the slowness incidents? What about monitor 
elections?
Are your MDSs using a lot of CPU? did you try tuning anything in the MDS (I 
think the default config is still conservative, and there are options to cache 
more entries, etc…)
What about iostat on the OSDs — are your OSD disks busy reading or writing 
during these incidents?
What are you using for OSD journals?
Also check the CPU usage for the mons and osds...

Does your hardware provide enough IOPS for what your users need? (e.g. what is 
the op/s from ceph -w)

If disabling deep scrub helps, then it might be that something else is reading 
the disks heavily. One thing to check is updatedb — we had to disable it from 
indexing /var/lib/ceph on our OSDs.
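
Concretely, the kind of commands I would run to answer those questions (a
sketch only; adjust daemon and device names to your setup):

    ceph -w                      # watch op/s and look for slow request warnings
    ceph health detail           # shows which OSDs currently have blocked requests
    top -p $(pidof ceph-mds)     # MDS CPU usage, assuming one ceph-mds per host
    iostat -x 5                  # per-disk utilisation on the OSD hosts during an incident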

Best Regards,
Dan

-- Dan van der Ster || Data & Storage Services || CERN IT Department --


On 20 Aug 2014, at 16:39, Hugo Mills h.r.mi...@reading.ac.uk wrote:

   We have a ceph system here, and we're seeing performance regularly
 descend into unusability for periods of minutes at a time (or longer).
 This appears to be triggered by writing large numbers of small files.
 
   Specifications:
 
 ceph 0.80.5
 6 machines running 3 OSDs each (one 4 TB rotational HD per OSD, 2 threads)
 2 machines running primary and standby MDS
 3 monitors on the same machines as the OSDs
 Infiniband to about 8 CephFS clients (headless, in the machine room)
 Gigabit ethernet to a further 16 or so CephFS clients (Linux desktop
   machines, in the analysis lab)
 
   The cluster stores home directories of the users and a larger area
 of scientific data (approx 15 TB) which is being processed and
 analysed by the users of the cluster.
 
   We have a relatively small number of concurrent users (typically
 4-6 at most), who use GUI tools to examine their data, and then
 complex sets of MATLAB scripts to process it, with processing often
 being distributed across all the machines using Condor.
 
   It's not unusual to see the analysis scripts write out large
 numbers (thousands, possibly tens or hundreds of thousands) of small
 files, often from many client machines at once in parallel. When this
 happens, the ceph cluster becomes almost completely unresponsive for
 tens of seconds (or even for minutes) at a time, until the writes are
 flushed through the system. Given the nature of modern GUI desktop
 environments (often reading and writing small state files in the
 user's home directory), this means that desktop interactiveness and
 responsiveness for all the other users of the cluster suffer.
 
   1-minute load on the servers typically peaks at about 8 during
 these events (on 4-core machines). Load on the clients also peaks
 high, because of the number of processes waiting for a response from
 the FS. The MDS shows little sign of stress -- it seems to be entirely
 down to the OSDs. ceph -w shows requests blocked for more than 10
 seconds, and in bad cases, ceph -s shows up to many hundreds of
 requests blocked for more than 32s.
 
   We've had to turn off scrubbing and deep scrubbing completely --
 except between 01.00 and 04.00 every night -- because it triggers the
 exact same symptoms, even with only 2-3 PGs being scrubbed. If it gets
 up to 7 PGs being scrubbed, as it did on Monday, it's completely
 unusable.
 
   Is this problem something that's often seen? If so, what are the
 best options for mitigation or elimination of the problem? I've found
 a few references to issue #6278 [1], but that seems to be referencing
 scrub specifically, not ordinary (if possibly pathological) writes.
 
   What are the sorts of things I should be looking at to work out
 where the bottleneck(s) are? I'm a bit lost about how to drill down
 into the ceph system for identifying performance issues. Is there a
 useful guide to tools somewhere?
 
   Is an upgrade to 0.84 likely to be helpful? How development are
 the development releases, from a stability / dangerous bugs point of
 view?
 
   Thanks,
   Hugo.
 
 [1] http://tracker.ceph.com/issues/6278
 
 -- 
 Hugo Mills :: IT Services, University of Reading
 Specialist Engineer, Research Servers
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Serious performance problems with small file writes

2014-08-20 Thread Dan Van Der Ster
Hi,

On 20 Aug 2014, at 16:55, German Anders gand...@despegar.com wrote:

Hi Dan,

  How are you? I want to know how you disable the indexing on the 
/var/lib/ceph OSDs?


# grep ceph /etc/updatedb.conf
PRUNEPATHS = /afs /media /net /sfs /tmp /udev /var/cache/ccache 
/var/spool/cups /var/spool/squid /var/tmp /var/lib/ceph



Did you disable deep scrub on you OSDs?


No but this can be an issue. If you get many PGs scrubbing at once, performance 
will suffer.

There is a new feature in 0.67.10 to sleep between scrubbing “chunks”. I set 
that sleep to 0.1 (and the chunk_max to 5, and the scrub size to 1MB). In 
0.67.10+1 there are some new options to set the iopriority of the scrubbing 
threads. Set that to class = 3, priority = 0 to give the scrubbing thread the 
idle priority. You need to use the cfq disk scheduler for io priorities to 
work. (cfq will also help if updatedb is causing any problems, since it runs 
with ionice -c 3).

I’m pretty sure those features will come in 0.80.6 as well.
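
For reference, those knobs look roughly like this; the option names are from
memory, so double-check them against your release before relying on them:

    ceph tell osd.\* injectargs '--osd_scrub_sleep 0.1'
    ceph tell osd.\* injectargs '--osd_scrub_chunk_max 5'
    ceph tell osd.\* injectargs '--osd_deep_scrub_stride 1048576'
    # on releases that have the io-priority options:
    ceph tell osd.\* injectargs '--osd_disk_thread_ioprio_class idle --osd_disk_thread_ioprio_priority 0'

The same settings can go in the [osd] section of ceph.conf to make them permanent.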

Do you have the journals on SSD's or RAMDISK?


Never use RAMDISK.

We currently have the journals on the same spinning disk as the OSD, but the 
iops performance is low for the rbd and fs use-cases. (For object store it 
should be OK). But for rbd or fs, you really need journals on SSDs or your 
cluster will suffer.

We now have SSDs on order to augment our cluster. (The way I justified this is 
that our cluster has X TB of storage capacity and Y iops capacity. With disk 
journals we will run out of iops capacity well before we run out of storage 
capacity. So you can either increase the iops capacity substantially by 
decreasing the volume of the cluster by 20% and replacing those disks with SSD 
journals, or you can just leave 50% of the disk capacity empty since you can’t 
use it anyway).


What's the perf of your cluster? rados bench? fio? I've set up a new cluster 
and I want to know what would be the best configuration to go with.

It’s not really meaningful to compare performance of different clusters with 
different hardware. Some “constants” I can advise on:
  - with few clients, large write throughput is limited by the client's 
bandwidth, as long as you have enough OSDs and the client is striping over many 
objects.
  - with disk journals, small write latency will be ~30-50ms even when the 
cluster is idle. if you have SSD journals, maybe ~10ms.
  - count your iops. Each disk OSD can do ~100, and you need to divide by the 
number of replicas. With SSDs you can do a bit better than this since the 
synchronous writes go to the SSDs not the disks. In my tests with our hardware 
I estimate that going from disk to SSD journal will multiply the iops capacity 
by around 5x.
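
(A back-of-the-envelope example for a setup like Hugo's, assuming the default of
3 replicas and journals co-located on the data disks, which roughly halves the
per-spindle rate: 18 spinning OSDs x ~100 iops / 3 replicas / 2 = roughly 300
sustained small-write iops for the whole cluster, which a burst of tens of
thousands of small file creates will saturate very quickly.)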

I also found that I needed to increase some of the journal max write and journal 
queue max limits, also the filestore limits, to squeeze the best performance 
out of the SSD journals. Try increasing filestore queue max ops/bytes, 
filestore queue committing max ops/bytes, and the filestore wbthrottle xfs * 
options. (I’m not going to publish exact configs here because I haven’t 
finished tuning yet).

Cheers, Dan


Thanks a lot!!

Best regards,

German Anders

On Wednesday 20/08/2014 at 11:51, Dan Van Der Ster wrote:
Hi,

Do you get slow requests during the slowness incidents? What about monitor 
elections?
Are your MDSs using a lot of CPU? did you try tuning anything in the MDS (I 
think the default config is still conservative, and there are options to cache 
more entries, etc…)
What about iostat on the OSDs — are your OSD disks busy reading or writing 
during these incidents?
What are you using for OSD journals?
Also check the CPU usage for the mons and osds...

Does your hardware provide enough IOPS for what your users need? (e.g. what is 
the op/s from ceph -w)

If disabling deep scrub helps, then it might be that something else is reading 
the disks heavily. One thing to check is updatedb — we had to disable it from 
indexing /var/lib/ceph on our OSDs.

Best Regards,
Dan

-- Dan van der Ster || Data & Storage Services || CERN IT Department --


On 20 Aug 2014, at 16:39, Hugo Mills h.r.mi...@reading.ac.uk wrote:

We have a ceph system here, and we're seeing performance regularly
descend into unusability for periods of minutes at a time (or longer).
This appears to be triggered by writing large numbers of small files.

Specifications:

ceph 0.80.5
6 machines running 3 OSDs each (one 4 TB rotational HD per OSD, 2 threads)
2 machines running primary and standby MDS
3 monitors on the same machines as the OSDs
Infiniband to about 8 CephFS clients (headless, in the machine room)
Gigabit ethernet to a further 16 or so CephFS clients (Linux desktop
machines, in the analysis lab)

The cluster stores home directories of the users and a larger area
of scientific data (approx 15 TB) which is being processed and
analysed by the users of the cluster.

We have a 

Re: [ceph-users] Serious performance problems with small file writes

2014-08-20 Thread Hugo Mills
   Hi, Dan,

   Some questions below I can't answer immediately, but I'll spend
tomorrow morning irritating people by triggering these events (I think
I have a reproducer -- unpacking a 1.2 GiB tarball with 25 small
files in it) and giving you more details. For the ones I can answer
right now:

On Wed, Aug 20, 2014 at 02:51:12PM +, Dan Van Der Ster wrote:
 Do you get slow requests during the slowness incidents?

   Slow requests, yes. ceph -w reports them coming in groups, e.g.:

2014-08-20 15:51:23.911711 mon.1 [INF] pgmap v2287926: 704 pgs: 704 
active+clean; 18105 GB data, 37369 GB used, 20169 GB / 57538 GB avail; 8449 
kB/s rd, 3506 kB/s wr, 527 op/s
2014-08-20 15:51:22.381063 osd.5 [WRN] 6 slow requests, 6 included below; 
oldest blocked for  10.133901 secs
2014-08-20 15:51:22.381066 osd.5 [WRN] slow request 10.133901 seconds old, 
received at 2014-08-20 15:51:12.247127: osd_op(mds.0.101:5528578 
10005889b29. [create 0~0,setxattr parent (394)] 0.786a9365 ondisk+write 
e217298) v4 currently waiting for subops from 6
2014-08-20 15:51:22.381068 osd.5 [WRN] slow request 10.116337 seconds old, 
received at 2014-08-20 15:51:12.264691: osd_op(mds.0.101:5529006 
1000599e576. [create 0~0,setxattr parent (392)] 0.5ccbd6a9 ondisk+write 
e217298) v4 currently waiting for subops from 7
2014-08-20 15:51:22.381070 osd.5 [WRN] slow request 10.116277 seconds old, 
received at 2014-08-20 15:51:12.264751: osd_op(mds.0.101:5529009 
1000588932d. [create 0~0,setxattr parent (394)] 0.de5eca4e ondisk+write 
e217298) v4 currently waiting for subops from 6
2014-08-20 15:51:22.381071 osd.5 [WRN] slow request 10.115296 seconds old, 
received at 2014-08-20 15:51:12.265732: osd_op(mds.0.101:5529042 
1000588933e. [create 0~0,setxattr parent (395)] 0.5e4d56be ondisk+write 
e217298) v4 currently waiting for subops from 7
2014-08-20 15:51:22.381073 osd.5 [WRN] slow request 10.115184 seconds old, 
received at 2014-08-20 15:51:12.265844: osd_op(mds.0.101:5529047 
1000599e58a. [create 0~0,setxattr parent (395)] 0.6a487965 ondisk+write 
e217298) v4 currently waiting for subops from 6
2014-08-20 15:51:24.381370 osd.5 [WRN] 2 slow requests, 2 included below; 
oldest blocked for  10.73 secs
2014-08-20 15:51:24.381373 osd.5 [WRN] slow request 10.73 seconds old, 
received at 2014-08-20 15:51:14.381267: osd_op(mds.0.101:5529327 
100058893ca. [create 0~0,setxattr parent (395)] 0.750c7574 ondisk+write 
e217298) v4 currently commit sent
2014-08-20 15:51:24.381375 osd.5 [WRN] slow request 10.28 seconds old, 
received at 2014-08-20 15:51:14.381312: osd_op(mds.0.101:5529329 
100058893cb. [create 0~0,setxattr parent (395)] 0.c75853fa ondisk+write 
e217298) v4 currently commit sent
2014-08-20 15:51:24.913554 mon.1 [INF] pgmap v2287927: 704 pgs: 704 
active+clean; 18105 GB data, 37369 GB used, 20169 GB / 57538 GB avail; 13218 
B/s rd, 3532 kB/s wr, 377 op/s
2014-08-20 15:51:25.381582 osd.5 [WRN] 3 slow requests, 3 included below; 
oldest blocked for  10.709989 secs
2014-08-20 15:51:25.381586 osd.5 [WRN] slow request 10.709989 seconds old, 
received at 2014-08-20 15:51:14.671549: osd_op(mds.0.101:5529457 
10005889403. [create 0~0,setxattr parent (407)] 0.e15ab1fa ondisk+write 
e217298) v4 currently no flag points reached
2014-08-20 15:51:25.381587 osd.5 [WRN] slow request 10.709767 seconds old, 
received at 2014-08-20 15:51:14.671771: osd_op(mds.0.101:5529462 
10005889406. [create 0~0,setxattr parent (406)] 0.70f8a5d3 ondisk+write 
e217298) v4 currently no flag points reached
2014-08-20 15:51:25.381589 osd.5 [WRN] slow request 10.182354 seconds old, 
received at 2014-08-20 15:51:15.199184: osd_op(mds.0.101:5529464 
10005889407. [create 0~0,setxattr parent (391)] 0.30535d02 ondisk+write 
e217298) v4 currently no flag points reached
2014-08-20 15:51:25.920298 mon.1 [INF] pgmap v2287928: 704 pgs: 704 
active+clean; 18105 GB data, 37369 GB used, 20169 GB / 57538 GB avail; 12231 
B/s rd, 5534 kB/s wr, 370 op/s
2014-08-20 15:51:26.925996 mon.1 [INF] pgmap v2287929: 704 pgs: 704 
active+clean; 18105 GB data, 37369 GB used, 20169 GB / 57538 GB avail; 26498 
B/s rd, 8121 kB/s wr, 367 op/s
2014-08-20 15:51:27.933424 mon.1 [INF] pgmap v2287930: 704 pgs: 704 
active+clean; 18105 GB data, 37369 GB used, 20169 GB / 57538 GB avail; 706 kB/s 
rd, 7552 kB/s wr, 444 op/s

 What about monitor elections?

   No, that's been reporting monmap e3 and election epoch 130 for
a week or two. I assume that to mean we've had no elections. We're
actually running without one monitor at the moment, because one
machine is down, but we've had the same problems with the machine
present.

 Are your MDSs using a lot of CPU?

   No, they're showing load averages well under 1 the whole time. Peak
load average is about 0.6.

 did you try tuning anything in the MDS (I think the default config
 is still conservative, and there are options to cache more entries,
 etc…)

   Not much. We have:


Re: [ceph-users] mds isn't working anymore after osd's running full

2014-08-20 Thread Gregory Farnum
After restarting your MDS, it still says it has epoch 1832 and needs
epoch 1833? I think you didn't really restart it.
If the epoch numbers have changed, can you restart it with debug mds
= 20, debug objecter = 20, debug ms = 1 in the ceph.conf and post
the resulting log file somewhere?
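Something along these lines on the MDS host should do it; the section name and
restart command are a sketch and vary a bit between setups:

    [mds]
        debug mds = 20
        debug objecter = 20
        debug ms = 1

followed by e.g. sudo service ceph restart mds, with the log then appearing
under /var/log/ceph/.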
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Wed, Aug 20, 2014 at 12:49 AM, Jasper Siero
jasper.si...@target-holding.nl wrote:
 Unfortunately that doesn't help. I restarted both the active and standby mds 
 but that doesn't change the state of the mds. Is there a way to force the mds 
 to look at the 1832 epoch (or earlier) instead of 1833 (need osdmap epoch 
 1833, have 1832)?

 Thanks,

 Jasper
 
 From: Gregory Farnum [g...@inktank.com]
 Sent: Tuesday, 19 August 2014 19:49
 To: Jasper Siero
 CC: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] mds isn't working anymore after osd's running full

 On Mon, Aug 18, 2014 at 6:56 AM, Jasper Siero
 jasper.si...@target-holding.nl wrote:
 Hi all,

 We have a small ceph cluster running version 0.80.1 with cephfs on five
 nodes.
 Last week some osd's were full and shut themselves down. To help the osd's start
 again I added some extra osd's and moved some placement group directories on
 the full osd's (which have a copy on another osd) to another place on the
 node (as mentioned in
 http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/).
 After clearing some space on the full osd's I started them again. After a
 lot of deep scrubbing and two pg inconsistencies which needed to be repaired,
 everything looked fine except the mds, which is still in the replay state and
 stays that way.
 The log below says that the mds needs osdmap epoch 1833 but only has 1832.

 2014-08-18 12:29:22.268248 7fa786182700  1 mds.-1.0 handle_mds_map standby
 2014-08-18 12:29:22.273995 7fa786182700  1 mds.0.25 handle_mds_map i am now
 mds.0.25
 2014-08-18 12:29:22.273998 7fa786182700  1 mds.0.25 handle_mds_map state
 change up:standby --> up:replay
 2014-08-18 12:29:22.274000 7fa786182700  1 mds.0.25 replay_start
 2014-08-18 12:29:22.274014 7fa786182700  1 mds.0.25  recovery set is
 2014-08-18 12:29:22.274016 7fa786182700  1 mds.0.25  need osdmap epoch 1833,
 have 1832
 2014-08-18 12:29:22.274017 7fa786182700  1 mds.0.25  waiting for osdmap 1833
 (which blacklists prior instance)

  # ceph status
 cluster c78209f5-55ea-4c70-8968-2231d2b05560
  health HEALTH_WARN mds cluster is degraded
  monmap e3: 3 mons at
 {th1-mon001=10.1.2.21:6789/0,th1-mon002=10.1.2.22:6789/0,th1-mon003=10.1.2.23:6789/0},
 election epoch 362, quorum 0,1,2 th1-mon001,th1-mon002,th1-mon003
  mdsmap e154: 1/1/1 up {0=th1-mon001=up:replay}, 1 up:standby
  osdmap e1951: 12 osds: 12 up, 12 in
   pgmap v193685: 492 pgs, 4 pools, 60297 MB data, 470 kobjects
 124 GB used, 175 GB / 299 GB avail
  492 active+clean

 # ceph osd tree
 # id    weight    type name          up/down  reweight
 -1      0.2399    root default
 -2      0.05997       host th1-osd001
 0       0.01999           osd.0      up       1
 1       0.01999           osd.1      up       1
 2       0.01999           osd.2      up       1
 -3      0.05997       host th1-osd002
 3       0.01999           osd.3      up       1
 4       0.01999           osd.4      up       1
 5       0.01999           osd.5      up       1
 -4      0.05997       host th1-mon003
 6       0.01999           osd.6      up       1
 7       0.01999           osd.7      up       1
 8       0.01999           osd.8      up       1
 -5      0.05997       host th1-mon002
 9       0.01999           osd.9      up       1
 10      0.01999           osd.10     up       1
 11      0.01999           osd.11     up       1

 What is the way to get the mds up and running again?

 I still have all the placement group directories which I moved from the full
 osds which were down to create disk space.

 Try just restarting the MDS daemon. This sounds a little familiar so I
 think it's a known bug which may be fixed in a later dev or point
 release on the MDS, but it's a soft-state rather than a disk state
 issue.
 -Greg
 Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Best Practice to Copy/Move Data Across Clusters

2014-08-20 Thread Larry Liu
Hi guys,

Anyone has done copy/move data between clusters? If yes,  what are the best 
practices for you?

Thanks


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Best Practice to Copy/Move Data Across Clusters

2014-08-20 Thread Brian Rak
We do it with rbd volumes.  We're using rbd export/import and netcat to 
transfer it across clusters.  This was the most efficient solution that 
did not require one cluster to have access to the other clusters (though 
it does require some way of starting the process on the different machines).
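
A rough sketch of the pipeline (pool/image names and the port are placeholders,
netcat option syntax differs between variants, and if your rbd version can't
import from stdin you'd have to go via a file instead):

    # on the destination cluster
    nc -l 7777 | rbd import - rbd/myimage

    # on the source cluster
    rbd export rbd/myimage - | nc dest-host 7777

For a consistent copy you'd want to export a snapshot rather than the live image.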




On 8/20/2014 12:49 PM, Larry Liu wrote:

Hi guys,

Anyone has done copy/move data between clusters? If yes,  what are the best 
practices for you?

Thanks


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Translating a RadosGW object name into a filename on disk

2014-08-20 Thread Craig Lewis
Looks like I need to upgrade to Firefly to get ceph-kvstore-tool
before I can proceed.
I am getting some hits just from grepping the LevelDB store, but so
far nothing has panned out.

Thanks for the help!
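
For anyone else digging through this later, the xattr check Greg suggests below
can be done with something like the following, using the object name from his
example; which attribute actually carries the manifest is my assumption, so
start with listxattr and see what is actually there:

    rados -p .rgw.buckets listxattr default.73886.55_vmware-freebsd-tools.tar.gz
    rados -p .rgw.buckets getxattr default.73886.55_vmware-freebsd-tools.tar.gz user.rgw.manifest | strings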

On Tue, Aug 19, 2014 at 10:27 AM, Gregory Farnum g...@inktank.com wrote:
 It's been a while since I worked on this, but let's see what I remember...

 On Thu, Aug 14, 2014 at 11:34 AM, Craig Lewis cle...@centraldesktop.com 
 wrote:
 In my effort to learn more of the details of Ceph, I'm trying to
 figure out how to get from an object name in RadosGW, through the
 layers, down to the files on disk.

 clewis@clewis-mac ~ $ s3cmd ls s3://cpltest/
 2014-08-13 23:02   14M  28dde9db15fdcb5a342493bc81f91151
 s3://cpltest/vmware-freebsd-tools.tar.gz

 Looking at the .rgw pool's contents tells me that the cpltest bucket
 is default.73886.55:
 root@dev-ceph0:/var/lib/ceph/osd/ceph-0/current# rados -p .rgw ls | grep 
 cpltest
 cpltest
 .bucket.meta.cpltest:default.73886.55

 Okay, what you're seeing here are two different types, whose names I'm
 not going to get right:
 1) The bucket link cpltest, which maps from the name cpltest to a
 bucket instance. The contents of cpltest, or one of its xattrs, are
 pointing at .bucket.meta.cpltest:default.73886.55
 2) The bucket instance .bucket.meta.cpltest:default.73886.55. I
 think this contains the bucket index (list of all objects), etc.

 The rados objects that belong to that bucket are:
 root@dev-ceph0:~# rados -p .rgw.buckets ls | grep default.73886.55
 default.73886.55__shadow__RpwwfOt2X-mhwU65Qa1OHDi--4OMGvQ_1
 default.73886.55__shadow__RpwwfOt2X-mhwU65Qa1OHDi--4OMGvQ_3
 default.73886.55_vmware-freebsd-tools.tar.gz
 default.73886.55__shadow__RpwwfOt2X-mhwU65Qa1OHDi--4OMGvQ_2
 default.73886.55__shadow__RpwwfOt2X-mhwU65Qa1OHDi--4OMGvQ_4

 Okay, so when you ask RGW for the object vmware-freebsd-tools.tar.gz
 from the cpltest bucket, it will look up (or, if we're lucky, have
 cached) the cpltest link, and find out that the bucket prefix is
 default.73886.55. It will then try and access the object
 default.73886.55_vmware-freebsd-tools.tar.gz (whose construction I
  hope is obvious — bucket instance ID as a prefix, _ as a separator,
 then the object name). This RADOS object is called the head for the
 RGW object. In addition to (usually) the beginning bit of data, it
 will also contain some xattrs with things like a tag for any extra
 RADOS objects which include data for this RGW object. In this case,
 that tag is RpwwfOt2X-mhwU65Qa1OHDi--4OMGvQ. (This construction is
 how we do atomic overwrites of RGW objects which are larger than a
 single RADOS object, in addition to a few other things.)

 I don't think there's any way of mapping from a shadow (tail) object
 name back to its RGW name. but if you look at the rados object xattrs,
 there might (? or might not) be an attr which contains the parent
 object in one form or another. Check that out.

 (Or, if you want to check out the source, I think all the relevant
 bits for this are somewhere in the
 -Greg
 Software Engineer #42 @ http://inktank.com | http://ceph.com

 I know those shadow__RpwwfOt2X-mhwU65Qa1OHDi--4OMGvQ_ files are the
 rest of vmware-freebsd-tools.tar.gz.  I can infer that because this
 bucket only has a single file (and the sum of the sizes matches).
 With many files, I can't infer the link anymore.

 How do I look up that link?

 I tried reading the src/rgw/rgw_rados.cc, but I'm getting lost.



 My real goal is the reverse.  I recently repaired an inconsistent PG.
 The primary replica had the bad data, so I want to verify that the
 repaired object is correct.  I have a database that stores the SHA256
 of every object.  If I can get from the filename on disk back to an S3
 object, I can verify the file.  If it's bad, I can restore from the
 replicated zone.


 Aside from today's task, I think it's really handy to understand these
 low level details.  I know it's been handy in the past, when I had
 disk corruption under my PostgreSQL database.  Knowing (and
 practicing) ahead of time really saved me a lot of downtime then.


 Thanks for any pointers.
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Translating a RadosGW object name into a filename on disk

2014-08-20 Thread Sage Weil
On Wed, 20 Aug 2014, Craig Lewis wrote:
 Looks like I need to upgrade to Firefly to get ceph-kvstore-tool
 before I can proceed.
 I am getting some hits just from grepping the LevelDB store, but so
 far nothing has panned out.

FWIW if you just need the tool, you can wget the .deb and 'dpkg -x foo.deb 
/tmp/whatever' and grab the binary from there.

sage


 
 Thanks for the help!
 
 On Tue, Aug 19, 2014 at 10:27 AM, Gregory Farnum g...@inktank.com wrote:
  It's been a while since I worked on this, but let's see what I remember...
 
  On Thu, Aug 14, 2014 at 11:34 AM, Craig Lewis cle...@centraldesktop.com 
  wrote:
  In my effort to learn more of the details of Ceph, I'm trying to
  figure out how to get from an object name in RadosGW, through the
  layers, down to the files on disk.
 
  clewis@clewis-mac ~ $ s3cmd ls s3://cpltest/
  2014-08-13 23:02   14M  28dde9db15fdcb5a342493bc81f91151
  s3://cpltest/vmware-freebsd-tools.tar.gz
 
  Looking at the .rgw pool's contents tells me that the cpltest bucket
  is default.73886.55:
  root@dev-ceph0:/var/lib/ceph/osd/ceph-0/current# rados -p .rgw ls | grep 
  cpltest
  cpltest
  .bucket.meta.cpltest:default.73886.55
 
  Okay, what you're seeing here are two different types, whose names I'm
  not going to get right:
  1) The bucket link cpltest, which maps from the name cpltest to a
  bucket instance. The contents of cpltest, or one of its xattrs, are
  pointing at .bucket.meta.cpltest:default.73886.55
  2) The bucket instance .bucket.meta.cpltest:default.73886.55. I
  think this contains the bucket index (list of all objects), etc.
 
  The rados objects that belong to that bucket are:
  root@dev-ceph0:~# rados -p .rgw.buckets ls | grep default.73886.55
  default.73886.55__shadow__RpwwfOt2X-mhwU65Qa1OHDi--4OMGvQ_1
  default.73886.55__shadow__RpwwfOt2X-mhwU65Qa1OHDi--4OMGvQ_3
  default.73886.55_vmware-freebsd-tools.tar.gz
  default.73886.55__shadow__RpwwfOt2X-mhwU65Qa1OHDi--4OMGvQ_2
  default.73886.55__shadow__RpwwfOt2X-mhwU65Qa1OHDi--4OMGvQ_4
 
  Okay, so when you ask RGW for the object vmware-freebsd-tools.tar.gz
  from the cpltest bucket, it will look up (or, if we're lucky, have
  cached) the cpltest link, and find out that the bucket prefix is
  default.73886.55. It will then try and access the object
  default.73886.55_vmware-freebsd-tools.tar.gz (whose construction I
   hope is obvious — bucket instance ID as a prefix, _ as a separator,
  then the object name). This RADOS object is called the head for the
  RGW object. In addition to (usually) the beginning bit of data, it
  will also contain some xattrs with things like a tag for any extra
  RADOS objects which include data for this RGW object. In this case,
  that tag is RpwwfOt2X-mhwU65Qa1OHDi--4OMGvQ. (This construction is
  how we do atomic overwrites of RGW objects which are larger than a
  single RADOS object, in addition to a few other things.)
 
  I don't think there's any way of mapping from a shadow (tail) object
  name back to its RGW name. but if you look at the rados object xattrs,
  there might (? or might not) be an attr which contains the parent
  object in one form or another. Check that out.
 
  (Or, if you want to check out the source, I think all the relevant
  bits for this are somewhere in the
  -Greg
  Software Engineer #42 @ http://inktank.com | http://ceph.com
 
  I know those shadow__RpwwfOt2X-mhwU65Qa1OHDi--4OMGvQ_ files are the
  rest of vmware-freebsd-tools.tar.gz.  I can infer that because this
  bucket only has a single file (and the sum of the sizes matches).
  With many files, I can't infer the link anymore.
 
  How do I look up that link?
 
  I tried reading the src/rgw/rgw_rados.cc, but I'm getting lost.
 
 
 
  My real goal is the reverse.  I recently repaired an inconsistent PG.
  The primary replica had the bad data, so I want to verify that the
  repaired object is correct.  I have a database that stores the SHA256
  of every object.  If I can get from the filename on disk back to an S3
  object, I can verify the file.  If it's bad, I can restore from the
  replicated zone.
 
 
  Aside from today's task, I think it's really handy to understand these
  low level details.  I know it's been handy in the past, when I had
  disk corruption under my PostgreSQL database.  Knowing (and
  practicing) ahead of time really saved me a lot of downtime then.
 
 
  Thanks for any pointers.
  --
  To unsubscribe from this list: send the line unsubscribe ceph-devel in
  the body of a message to majord...@vger.kernel.org
  More majordomo info at  http://vger.kernel.org/majordomo-info.html
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 
 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Serious performance problems with small file writes

2014-08-20 Thread Andrei Mikhailovsky
Hugo,

I would look at setting up a cache pool made of 4-6 ssds to start with. So, if 
you have 6 osd servers, stick at least 1 ssd disk in each server for the cache 
pool. It should greatly reduce the osd's stress of writing a large number of 
small files. Your cluster should become more responsive and the end user's 
experience should also improve.

I am planning on doing so in a near future, but according to my friend's 
experience, introducing a cache pool has greatly increased the overall 
performance of the cluster and has removed the performance issues that he was 
having during scrubbing/deep-scrubbing/recovery activities.

The size of your working data set should determine the size of the cache pool, 
but in general it will create a nice speedy buffer between your clients and 
those terribly slow spindles.
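
For reference, the moving parts are roughly the following; pool names and the
size threshold are placeholders, and you also need a CRUSH rule that maps the
cache pool onto the SSDs only:

    ceph osd pool create cache-pool 512
    ceph osd tier add data-pool cache-pool
    ceph osd tier cache-mode cache-pool writeback
    ceph osd tier set-overlay data-pool cache-pool
    ceph osd pool set cache-pool hit_set_type bloom
    ceph osd pool set cache-pool target_max_bytes 500000000000   # size it to the working set

Treat the exact settings as a starting point rather than a recipe.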

Andrei





- Original Message -
From: Hugo Mills h.r.mi...@reading.ac.uk
To: Dan Van Der Ster daniel.vanders...@cern.ch
Cc: Ceph Users List ceph-users@lists.ceph.com
Sent: Wednesday, 20 August, 2014 4:54:28 PM
Subject: Re: [ceph-users] Serious performance problems with small file writes

   Hi, Dan,

   Some questions below I can't answer immediately, but I'll spend
tomorrow morning irritating people by triggering these events (I think
I have a reproducer -- unpacking a 1.2 GiB tarball with 25 small
files in it) and giving you more details. For the ones I can answer
right now:

On Wed, Aug 20, 2014 at 02:51:12PM +, Dan Van Der Ster wrote:
 Do you get slow requests during the slowness incidents?

   Slow requests, yes. ceph -w reports them coming in groups, e.g.:

2014-08-20 15:51:23.911711 mon.1 [INF] pgmap v2287926: 704 pgs: 704 
active+clean; 18105 GB data, 37369 GB used, 20169 GB / 57538 GB avail; 8449 
kB/s rd, 3506 kB/s wr, 527 op/s
2014-08-20 15:51:22.381063 osd.5 [WRN] 6 slow requests, 6 included below; 
oldest blocked for  10.133901 secs
2014-08-20 15:51:22.381066 osd.5 [WRN] slow request 10.133901 seconds old, 
received at 2014-08-20 15:51:12.247127: osd_op(mds.0.101:5528578 
10005889b29. [create 0~0,setxattr parent (394)] 0.786a9365 ondisk+write 
e217298) v4 currently waiting for subops from 6
2014-08-20 15:51:22.381068 osd.5 [WRN] slow request 10.116337 seconds old, 
received at 2014-08-20 15:51:12.264691: osd_op(mds.0.101:5529006 
1000599e576. [create 0~0,setxattr parent (392)] 0.5ccbd6a9 ondisk+write 
e217298) v4 currently waiting for subops from 7
2014-08-20 15:51:22.381070 osd.5 [WRN] slow request 10.116277 seconds old, 
received at 2014-08-20 15:51:12.264751: osd_op(mds.0.101:5529009 
1000588932d. [create 0~0,setxattr parent (394)] 0.de5eca4e ondisk+write 
e217298) v4 currently waiting for subops from 6
2014-08-20 15:51:22.381071 osd.5 [WRN] slow request 10.115296 seconds old, 
received at 2014-08-20 15:51:12.265732: osd_op(mds.0.101:5529042 
1000588933e. [create 0~0,setxattr parent (395)] 0.5e4d56be ondisk+write 
e217298) v4 currently waiting for subops from 7
2014-08-20 15:51:22.381073 osd.5 [WRN] slow request 10.115184 seconds old, 
received at 2014-08-20 15:51:12.265844: osd_op(mds.0.101:5529047 
1000599e58a. [create 0~0,setxattr parent (395)] 0.6a487965 ondisk+write 
e217298) v4 currently waiting for subops from 6
2014-08-20 15:51:24.381370 osd.5 [WRN] 2 slow requests, 2 included below; 
oldest blocked for  10.73 secs
2014-08-20 15:51:24.381373 osd.5 [WRN] slow request 10.73 seconds old, 
received at 2014-08-20 15:51:14.381267: osd_op(mds.0.101:5529327 
100058893ca. [create 0~0,setxattr parent (395)] 0.750c7574 ondisk+write 
e217298) v4 currently commit sent
2014-08-20 15:51:24.381375 osd.5 [WRN] slow request 10.28 seconds old, 
received at 2014-08-20 15:51:14.381312: osd_op(mds.0.101:5529329 
100058893cb. [create 0~0,setxattr parent (395)] 0.c75853fa ondisk+write 
e217298) v4 currently commit sent
2014-08-20 15:51:24.913554 mon.1 [INF] pgmap v2287927: 704 pgs: 704 
active+clean; 18105 GB data, 37369 GB used, 20169 GB / 57538 GB avail; 13218 
B/s rd, 3532 kB/s wr, 377 op/s
2014-08-20 15:51:25.381582 osd.5 [WRN] 3 slow requests, 3 included below; 
oldest blocked for  10.709989 secs
2014-08-20 15:51:25.381586 osd.5 [WRN] slow request 10.709989 seconds old, 
received at 2014-08-20 15:51:14.671549: osd_op(mds.0.101:5529457 
10005889403. [create 0~0,setxattr parent (407)] 0.e15ab1fa ondisk+write 
e217298) v4 currently no flag points reached
2014-08-20 15:51:25.381587 osd.5 [WRN] slow request 10.709767 seconds old, 
received at 2014-08-20 15:51:14.671771: osd_op(mds.0.101:5529462 
10005889406. [create 0~0,setxattr parent (406)] 0.70f8a5d3 ondisk+write 
e217298) v4 currently no flag points reached
2014-08-20 15:51:25.381589 osd.5 [WRN] slow request 10.182354 seconds old, 
received at 2014-08-20 15:51:15.199184: osd_op(mds.0.101:5529464 
10005889407. [create 0~0,setxattr parent (391)] 0.30535d02 ondisk+write 
e217298) v4 currently no flag points reached
2014-08-20