Re: rbd caching

2012-05-07 Thread Josh Durgin

On 05/05/2012 04:51 PM, Sage Weil wrote:

 The second set of patches restructures the way the cache itself is managed.
 One goal is to be able to control cache behavior on a per-image basis
 (this one write-thru, that one write-back, etc.).  Another goal is to
 share a single pool of memory for several images.  The librbd.h calls to
 do this currently look something like this:

 int rbd_cache_create(rados_t cluster, rbd_cache_t *cache, uint64_t max_size,
                      uint64_t max_dirty, uint64_t target_dirty);
 int rbd_cache_destroy(rbd_cache_t cache);
 int rbd_open_cached(rados_ioctx_t io, const char *name, rbd_image_t image,
                     const char *snap_name, rbd_cache_t cache);

 Setting the cache tunables should probably be broken out into several
 different calls, so that it is possible to add new ones in the future.
 Beyond that, though, the limitation here is that you can set the
 target_dirty or max_dirty for a _cache_, and then have multiple images
 share that cache, but you can't then set a max_dirty limit for an
 individual image.
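
[For concreteness, a caller sharing one cache between two images under the
proposed interface might look roughly like the sketch below.  The rbd_cache_*
calls exist only in the patch series under discussion, not in any released
librbd; error handling is omitted, and the image handle is assumed to be
filled in by rbd_open_cached (the prototype above presumably intends
rbd_image_t *image).]

#include <rados/librados.h>
#include <rbd/librbd.h>

void open_two_images_with_shared_cache(rados_t cluster, rados_ioctx_t io)
{
    rbd_cache_t cache;
    rbd_image_t img1, img2;

    /* one 64 MB cache: start flushing at 16 MB dirty, hard cap at 32 MB dirty */
    rbd_cache_create(cluster, &cache, 64 << 20, 32 << 20, 16 << 20);

    /* both images draw from the same pool of cache memory */
    rbd_open_cached(io, "image1", &img1, NULL, cache);
    rbd_open_cached(io, "image2", &img2, NULL, cache);

    /* ... I/O on img1/img2 goes through the shared cache ... */

    rbd_close(img1);
    rbd_close(img2);
    rbd_cache_destroy(cache);
}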


I'm not sure that these should be separate API calls. We can
already control per-image caches via different rados_conf
settings when the image is opened. We're already opening
a new rados_cluster_t (which can have its own settings)
for each image in qemu.
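
[A sketch of the existing per-image mechanism Josh describes: qemu opens a
separate rados_t per image, so cache settings can simply differ per image.
The rbd_cache_* option names below follow the scheme discussed later in this
thread and may not match the current tree; error handling is omitted.]

#include <rados/librados.h>
#include <rbd/librbd.h>

int open_image_with_cache_opts(const char *pool, const char *name, int writeback,
                               rados_t *cluster, rados_ioctx_t *io, rbd_image_t *image)
{
    rados_create(cluster, NULL);
    rados_conf_read_file(*cluster, NULL);   /* default ceph.conf locations */

    /* per-image cache behavior, set before connecting */
    rados_conf_set(*cluster, "rbd_cache", writeback ? "true" : "false");
    rados_conf_set(*cluster, "rbd_cache_size", "33554432");                  /* 32 MB */
    rados_conf_set(*cluster, "rbd_cache_max_dirty", writeback ? "25165824" : "0");

    rados_connect(*cluster);
    rados_ioctx_create(*cluster, pool, io);
    return rbd_open(*io, name, image, NULL);
}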


 Does it matter?  Ideally, I suppose, you could set:

   - per-cache size
   - per-cache max_dirty
   - per-cache target_dirty
   - per-image max_dirty  (0 for write-thru)
   - per-image target_dirty

 and then share a single cache for many images, and the flushing logic
 could observe both sets of dirty limits.  That just means calls to set
 max_dirty and target_dirty for individual images, too.
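
[Purely illustrative, not from the patches: one way flushing logic could
observe both sets of dirty limits for a shared cache, along the lines
sketched above.  Names and structures are invented for the example.]

#include <stdint.h>

struct dirty_limits { uint64_t max_dirty, target_dirty; };
struct image_state  { struct dirty_limits lim; uint64_t dirty; };
struct shared_cache { struct dirty_limits lim; uint64_t dirty; };

/* start writeback when either the cache-wide or the per-image target is hit;
 * a per-image max_dirty of 0 means the image is effectively write-through */
static int should_flush(const struct shared_cache *c, const struct image_state *img)
{
    return img->lim.max_dirty == 0 ||
           c->dirty >= c->lim.target_dirty ||
           img->dirty >= img->lim.target_dirty;
}

/* writes must block (wait for writeback) while either hard limit is exceeded */
static int must_block(const struct shared_cache *c, const struct image_state *img)
{
    return c->dirty >= c->lim.max_dirty ||
           (img->lim.max_dirty && img->dirty >= img->lim.max_dirty);
}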


I don't think all this flexibility is necessary. If we did want
to add it, it could be done with configuration settings instead
of pushing the complexity to the librbd caller. For example, there
could be a 'rbd_cache_name' option, and the images using the same
cache name could share the same underlying cache. Alternatively,
there could be an option to make all rbd images use the same cache
with their own limits.

What use cases do you see for single-vm cache sharing? I can't think
of any common ones off the top of my head. It seems like ksm
will provide much more benefit (especially with layering).


 Is it worth the complexity?  In the end, this will be wired up to the qemu
 writeback options, so the range of actual usage will fall within
 whatever is doable with those options and generic 'rbd cache size = ..'
 tunables, most likely...


There's no notion of shared caches or cache size, since it's
designed for using the host page cache. I think leaving any
extra cache configuration in rbd-specific options makes sense
for now.
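
[For reference, the qemu side is just the generic per-drive cache mode, e.g.
(illustrative; the exact syntax depends on the qemu version):

  qemu-system-x86_64 ... -drive format=raw,file=rbd:rbd/myimage,cache=writeback

with cache=writethrough or cache=none for the other behaviors, while any
rbd-specific tuning stays in ceph.conf as discussed above.]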

Josh


Re: PGs stuck in creating state

2012-05-07 Thread Sage Weil
On Mon, 7 May 2012, Vladimir Bashkirtsev wrote:
 On 20/04/12 14:41, Sage Weil wrote:
  On Fri, 20 Apr 2012, Vladimir Bashkirtsev wrote:
   Dear devs,
   
   First of all I would like to bow my head at your great effort! Even if
   ceph has not reached prime-time status yet, it is already extremely
   powerful and fairly stable, to the point that we have deployed it in a
   live environment (still backing up, of course).

   I have played with ceph extensively, putting it through different failure
   modes, and it recovers in most cases without major issues. However, when
   I was adding OSDs I got 2 PGs stuck in creating mode, and no matter what
   I have done (changing CRUSH, restarting OSDs to get ceph to move data
   around) they are still there. So instead of using the scientific
   blindfold method I decided to dig deeper. Here we go (please disregard
   the stuck active+remapped PGs, as they are part of another problem):
   
   [root@gamma tmp]# ceph -s
   2012-04-20 12:40:45.294969   pg v625626: 600 pgs: 2 creating, 586 active+clean, 12 active+remapped; 253 GB data, 767 GB used, 1292 GB / 2145 GB avail
   2012-04-20 12:40:45.299426   mds e457: 1/1/1 up {0=2=up:active}, 2 up:standby
   2012-04-20 12:40:45.299549   osd e6022: 6 osds: 6 up, 6 in
   2012-04-20 12:40:45.299856   log 2012-04-20 12:26:43.988716 mds.0 172.16.64.202:6802/21363 1 : [DBG] reconnect by client.13897 172.16.64.10:0/1319861516 after 0.101633
   2012-04-20 12:40:45.300160   mon e1: 3 mons at {0=172.16.64.200:6789/0,1=172.16.64.201:6789/0,2=172.16.64.202:6789/0}
   
   [root@gamma tmp]# ceph pg dump | grep creating
   2.1p3   0   0   0   0   0   0   0   creating   0.00   0'0   0'0   []   []   0'0   0.00
   1.1p3   0   0   0   0   0   0   0   creating   0.00   0'0   0'0   []   []   0'0   0.00
  Sigh.. this is exactly the problem I noticed last week that prompted the
  'localized pgs' email.  I'm working on patches to remove this
  functionality entirely as we speak, since it's a broken design in several
  ways.
  
  Once you restart your osds (be sure to control-c that 'ceph' command
  first), you should be able to wipe out those bad pgs with
  
   ceph osd pool disable_lpgs <poolname> --yes-i-really-mean-it
  
  This is in current master, and will be included in 0.46.

 Have upgraded to 0.46, done disable_lpgs on all pools - still no banana. Other
 ideas?

The stray pg entries will get zapped by v0.47.  The 'disable_lpgs' command
only prevents new ones from getting created.

sage
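
[For anyone following along, the sequence suggested in this thread boils down
to roughly the following; the pool names are the ones from the osd dump
quoted below.]

# control-c any watching 'ceph' command, then restart the osds as usual
# (0.46 or later is required for disable_lpgs)
ceph osd pool disable_lpgs data --yes-i-really-mean-it
ceph osd pool disable_lpgs metadata --yes-i-really-mean-it
ceph osd pool disable_lpgs rbd --yes-i-really-mean-it

# this only stops new localized pgs from being created; the stray
# 'creating' entries themselves are removed on upgrade to v0.47
ceph pg dump | grep creating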


  
  Alternatively, you can just ignore those pgs for the time being.  They
  are completely harmless as long as you can tolerate the ceph -s/ceph
  health noise.  v0.47 should silently zap them all on upgrade.
  
  sage
  
   My understanding is that p3 means that the PG is creating on osd.3. I've
   tried to stop and restart osd.3 - no banana. So I went for a more
   dramatic option: lose osd.3. I completely destroyed osd.3 and rebuilt it
   from scratch. osd.3 came back again with exactly the same PGs in creating
   mode - which makes me think that osd.3 is not responsible for this.
   
   [root@gamma tmp]# ceph osd dump
   dumped osdmap epoch 6022
   epoch 6022
   fsid 7719f573-4c48-4852-a27f-51c7a3fe1c1e
   created 2012-03-31 04:47:12.130128
   modifed 2012-04-20 12:26:56.406193
   flags
   
   pool 0 'data' rep size 3 crush_ruleset 0 object_hash rjenkins pg_num 192 pgp_num 192 lpg_num 2 lpgp_num 2 last_change 1137 owner 0 crash_replay_interval 45
   pool 1 'metadata' rep size 3 crush_ruleset 1 object_hash rjenkins pg_num 192 pgp_num 192 lpg_num 2 lpgp_num 2 last_change 1160 owner 0
   pool 2 'rbd' rep size 3 crush_ruleset 2 object_hash rjenkins pg_num 192 pgp_num 192 lpg_num 2 lpgp_num 2 last_change 1141 owner 0
   
   max_osd 6
   osd.0 up   in  weight 1 up_from 6001 up_thru 6018 down_at 6000
   last_clean_interval [5980,5996) 172.16.64.200:6804/23046
   172.16.64.200:6805/23046 172.16.64.200:6806/23046 exists,up
   osd.1 up   in  weight 1 up_from 5998 up_thru 6017 down_at 5997
   last_clean_interval [5992,5996) 172.16.64.201:6804/27598
   172.16.64.201:6805/27598 172.16.64.201:6806/27598 exists,up
   osd.2 up   in  weight 1 up_from 5998 up_thru 6019 down_at 5997
   last_clean_interval [5978,5996) 172.16.64.202:6804/21457
   172.16.64.202:6805/21457 172.16.64.202:6806/21457 exists,up
   osd.3 up   in  weight 1 up_from 5972 up_thru 6017 down_at 5970
   last_clean_interval [5884,5969) lost_at 1163 172.16.64.203:6800/10614
   172.16.64.203:6801/10614 172.16.64.203:6802/10614 exists,up
   osd.4 up   in  weight 1 up_from 5995 up_thru 6017 down_at 5988
   last_clean_interval [5898,5987) 172.16.64.204:6800/16357
   172.16.64.204:6801/16357 172.16.64.204:6802/16357 exists,up
   osd.5 up   in  weight 1 up_from 5984 up_thru 6021 down_at 5982
   last_clean_interval [5921,5981) 172.16.64.205:6800/11346
   

Re: rbd caching

2012-05-07 Thread Sage Weil
On Mon, 7 May 2012, Josh Durgin wrote:
 On 05/05/2012 04:51 PM, Sage Weil wrote:
  The second set of patches restructures the way the cache itself is managed.
  One goal is to be able to control cache behavior on a per-image basis
  (this one write-thru, that one write-back, etc.).  Another goal is to
  share a single pool of memory for several images.  The librbd.h calls to
  do this currently look something like this:
  
  int rbd_cache_create(rados_t cluster, rbd_cache_t *cache, uint64_t max_size,
   uint64_t max_dirty, uint64_t target_dirty);
  int rbd_cache_destroy(rbd_cache_t cache);
  int rbd_open_cached(rados_ioctx_t io, const char *name, rbd_image_t image,
   const char *snap_name, rbd_cache_t cache);
  
  Setting the cache tunables should probably be broken out into several
  different calls, so that it is possible to add new ones in the future.
  Beyond that, though, the limitation here is that you can set the
  target_dirty or max_dirty for a _cache_, and then have multiple images
  share that cache, but you can't then set a max_dirty limit for an
  individual image.
 
 I'm not sure that these should be separate API calls. We can
 already control per-image caches via different rados_conf
 settings when the image is opened. We're already opening
 a new rados_cluster_t (which can have its own settings)
 for each image in qemu.

If we don't want or care about cache sharing, then yeah.  The advantage is 
that, generally speaking, the caches will be more effective if they are 
pooled and share an LRU.  It may not be worth it, especially if qemu won't 
use it.

If that's the case, I think we should merge through 
385142305a83f58f8aa0a93c98679c4018f98a28, which moves the rbd cache 
settings to

  rbd cache size
  rbd cache max dirty
  rbd cache target dirty

distinct from the fs client settings (client oc *).  
77f9dbbafb7f21fb20892b2ebc38e83a55f828ad might be worth taking as well.
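
[With that branch merged, the cache would be tuned with something like the
ceph.conf fragment below.  This is only an illustration of the option names
listed above; the exact enable flag and the defaults may differ in the tree.]

[client]
    rbd cache = true
    rbd cache size = 33554432          # bytes of cache per image
    rbd cache max dirty = 25165824     # 0 would mean write-through
    rbd cache target dirty = 16777216  # start flushing at this much dirty data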

sage


  Does it matter?  Ideally, I suppose, you could set:
  
- per-cache size
- per-cache max_dirty
- per-cache target_dirty
- per-image max_dirty  (0 for write-thru)
- per-image target_dirty
  
  and then share a single cache for many images, and the flushing logic
  could observe both sets of dirty limits.  That just means calls to set
  max_dirty and target_dirty for individual images, too.
 
 I don't think all this flexibility is necessary. If we did want
 to add it, it could be done with configuration settings instead
 of pushing the complexity to the librbd caller. For example, there
 could be a 'rbd_cache_name' option, and the images using the same
 cache name could share the same underlying cache. Alternatively,
 there could be an option to make all rbd images use the same cache
 with their own limits.
 
 What use cases do you see for single-vm cache sharing? I can't think
 of any common ones off the top of my head. It seems like ksm
 will provide much more benefit (especially with layering).
 
  Is it worth the complexity?  In the end, this will be wired up to the qemu
  writeback options, so the range of actual usage will fall within
  whatever is doable with those options and generic 'rbd cache size = ..'
  tunables, most likely...
 
 There's no notion of shared caches or cache size, since it's
 designed for using the host page cache. I think leaving any
 extra cache configuration in rbd-specific options makes sense
 for now.
 
 Josh


OSD hotplugging Chef cookbook (chef-1)

2012-05-07 Thread Tommi Virtanen
Hi. I've been working on easy deployability and manageability of Ceph.
This work is intended to be a complete replacement for mkcephfs, and to
integrate new product features instead of just automating the previous,
clumsier administration mechanisms. I'm using Chef to
create and expand the cluster, but most of the new functionality is in
making the OSDs more dynamic.

The current work is in a branch called ceph-1, and will be improved
upon, but it is now at a stage where others should start looking at
it.

Here's a quick intro to what's there right now. Apologies for the
formatting; I need to be on a plane fairly soon. Rest assured, any
command that looks clumsy is that way mostly because I haven't had
time to make it prettier. I'll go through this with our QA and tech
writer once the dust settles, to clean up the instructions.


Limitations (all to be removed later):
- supports only 1 monitor
- journal is a file inside osd data directory
- only supports 1 cluster (name hardcoded to ceph); later you will
be able to run multiple clusters on the same hardware
- no rgw, mds, or anything else but a RADOS/rbd cluster tested yet
- no integration with e.g. OpenStack yet


Open questions:
- I removed the sysv-style init script (from the debian packaging).
I'm not sure what to do with that. Older debs will still need it?
- details of what goes where in e.g. the chef environment will change;
input is welcome



How to try it out:

I need to leave to make it to the airport in time, but the latest change
is still compiling :(
Wait till 
http://gitbuilder.ceph.com/ceph-deb-oneiric-x86_64-basic/ref/chef-1/sha1
says 4b75bccd52104d0ecd551e0656a30791b25fe032, hope for the best, and
proceed:
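
[Something along these lines can poll for it; just a convenience sketch using
the gitbuilder URL above.]

until curl -s http://gitbuilder.ceph.com/ceph-deb-oneiric-x86_64-basic/ref/chef-1/sha1 \
    | grep -q 4b75bccd52104d0ecd551e0656a30791b25fe032; do
  sleep 60
done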


# create 3 vms running ubuntu 11.10 server; mine ended up being named
chef02, inst03, inst04
# they need to be able to talk to each other, so do not use KVM's
user networking, but NAT or bridged. (NAT is default for libvirt.)

# make sure your vm has a unique hostname first, or it'll get
confusing later; edit /etc/hostname, /etc/hosts, run sudo hostname
newname, re-login
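
[Concretely, something along these lines; the hostname "chef02" is just an
example.]

echo chef02 | sudo tee /etc/hostname
sudo sed -i 's/^127\.0\.1\.1.*/127.0.1.1 chef02/' /etc/hosts
sudo hostname chef02
# log out and back in so the new name takes effect in your session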

# source for this:
http://wiki.opscode.com/display/chef/Installing+Chef+Server+on+Debian+or+Ubuntu+using+Packages

# figure out the IP address of your chef server vm
gpg --keyserver keys.gnupg.net --recv-keys 83EF826A
gpg --export packa...@opscode.com | sudo apt-key add -
sudo tee /etc/apt/sources.list.d/chef.list <<EOF
deb http://apt.opscode.com/ oneiric-0.10 main
deb-src http://apt.opscode.com/ oneiric-0.10 main
EOF
sudo apt-get update
sudo apt-get install chef
# answer using the IP address of your chef server vm
  [debconf dialog: "Configuring chef"]
  This is the full URI that clients will use to connect to the server.
  This will be used in /etc/chef/client.rb as 'chef_server_url'.
  URL of Chef Server (e.g., http://chef.example.com:4000):
  -> enter the address of your chef server vm, e.g. http://192.168.122.168:4000/, and choose Ok

sudo apt-get install chef-server
# you MUST enter some password here or the installation will fail; no
human will need to type this ever again
  [debconf dialog: "Configuring chef-solr"]
  Set the password for the chef user in the AMQP server queue. Use
  RabbitMQ's rabbitmqctl program to set this password. The default user
  and vhost are assumed (chef and /chef, respectively).
  RabbitMQ does not have the capability to read the password from a file,
  and this will be passed via [...] on the command-line. As such, do not
  use shell meta-characters that could cause errors such as !.
  This will be used in /etc/chef/solr.rb and /etc/chef/server.rb as ...

[no subject]

2012-05-07 Thread Tim Flavin
The new site is great!  I like the Ceph documentation, however I found
a couple of typos.  Is this the best place to address them?  (Some of the
apparent typos may be my not understanding what is going on.)



http://ceph.com/docs/master/config-cluster/ceph-conf/

The 'Hardware Recommendations' link near the bottom of the page gives
a 404.  Did you want to point to
http://ceph.com/docs/master/install/hardware-recommendations/ ?


http://ceph.com/docs/master/config-ref/osd-config

For 'osd client message size cap', the default value is 500 MB but
the description lists it as 200 MB.


http://ceph.com/docs/master/api/librbdpy/

The line of code size = 4 * 1024 * 1024  # 4 GiB appears to be
missing a * 1024, and the next line is rbd_inst.create('myimage', 4)
when it probably should be rbd_inst.create('myimage', size). This is
repeated several times.
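
[For reference, the corrected example would presumably read as below, assuming
the rbd module's usual RBD.create(ioctx, name, size) signature.]

import rbd  # assuming the standard python-rbd bindings

size = 4 * 1024 * 1024 * 1024            # 4 GiB, i.e. with the missing * 1024
rbd_inst = rbd.RBD()
rbd_inst.create(ioctx, 'myimage', size)  # ioctx: an open rados ioctx, as in the docs example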


Re: OSD hotplugging Chef cookbook (chef-1)

2012-05-07 Thread Tommi Virtanen
On Mon, May 7, 2012 at 3:58 PM, Tommi Virtanen t...@inktank.com wrote:
 The current work is in a branch called ceph-1, and will be improved
 upon, but it is now at a stage where others should start looking at
 it.

Make that chef-1.