[ceph-users] Usage pattern and design of Ceph

2013-08-19 Thread Guang Yang
Hi ceph-users,

This is Guang and I am pretty new to ceph, glad to meet you guys in the 
community!

After walking through some documents of Ceph, I have a couple of questions:
  1. Is there any comparison between Ceph and AWS S3, in terms of the ability 
to handle different workloads (from KB to GB), with a corresponding performance 
report?
  2. Looking at some industry solutions for distributed storage, GFS / Haystack 
/ HDFS all use a metadata server that keeps the logical-to-physical mapping in 
memory, avoiding a disk I/O lookup when reading a file. Is that concern valid for 
Ceph (in terms of the latency to read a file)?
  3. Some industry research shows that one issue for file systems is the 
metadata-to-data ratio, in terms of both access and storage, and some techniques 
combine small files into large physical files to reduce that ratio (Haystack, 
for example). If we want to use Ceph to store photos, should this be a concern, 
given that Ceph uses one physical file per object?

Thanks,
Guang


Re: [ceph-users] Ceph VM Backup

2013-08-19 Thread Wido den Hollander

On 08/18/2013 10:58 PM, Wolfgang Hennerbichler wrote:

On Sun, Aug 18, 2013 at 06:57:56PM +1000, Martin Rudat wrote:

Hi,

On 2013-02-25 20:46, Wolfgang Hennerbichler wrote:

maybe some of you are interested in this - I'm using a dedicated VM to
back up important VMs which have their storage in RBD. This is nothing
fancy and not implemented perfectly, but it works. The VMs don't notice
that they're backed up; the only requirement is that the filesystem of
the VM is directly on the RBD, since the script doesn't calculate offsets of
partition tables.

Looking at how you're doing that, if you trust the script to be able
to create new snapshots; couldn't you do that with less machinery
involved by installing the ceph binaries on the backup host,
creating the snapshot and attaching it with rbd, rather than
attaching it to the VM?


this was written at a time where kernels could not map format 2 rbd images.
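With a newer kernel that can map the image format in use, a minimal sketch of
that approach might look like this (pool, image, and mount point names are
placeholders, not taken from the script discussed here):

# on the backup host, illustrative only
SNAP=backup-$(date +%F)
rbd snap create rbd/vm-disk@$SNAP
rbd map rbd/vm-disk@$SNAP            # snapshots map read-only, e.g. as /dev/rbd0
fsck -n /dev/rbd0                    # optional read-only check
mount -o ro /dev/rbd0 /mnt/backup
rsync -a /mnt/backup/ /srv/backups/vm-disk/
umount /mnt/backup
rbd unmap /dev/rbd0
rbd snap rm rbd/vm-disk@$SNAP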


Also; where's the fsck call? You're snapshotting a running system;
it's almost guaranteed that you've done the snapshot in the middle
of a batch of writes; then again, it would be cool to be able to ask
the VM to sync, to capture a consistent filesystem, though.


I use journaling filesystems. The journal is replayed during mount (can be seen 
in kernel logs) and the FS is therefore considered to be clean.
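If a cleaner point-in-time image is wanted, one common trick (not part of the
setup described above) is to freeze the filesystem inside the guest around the
snapshot, assuming util-linux's fsfreeze is available in the VM:

# inside the guest, illustrative only
fsfreeze --freeze /data      # flush and block writes
# ... take the RBD snapshot from outside the VM ...
fsfreeze --unfreeze /data    # resume writes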


I don't know about recent kernels, but older ones could be made to
crash by boldly mounting a filesystem that hadn't been fscked.


This works for production systems. That's what journals are all about, right?


Correct, but older kernels might not respect barriers correctly. But if 
you use a modern kernel (I think 2.6.36 or so) there won't be a problem.


Like you said, on mount the journal will be replayed and the FS will be 
clean.


It's nothing less than an unexpected shutdown.

Wido



wogri


--
Martin Rudat







--
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on


Re: [ceph-users] Significant slowdown of osds since v0.67 Dumpling

2013-08-19 Thread Samuel Just
You're right, PGLog::undirty() looks suspicious.  I just pushed a
branch wip-dumpling-pglog-undirty with a new config
(osd_debug_pg_log_writeout) which if set to false will disable some
strictly debugging checks which occur in PGLog::undirty().  We haven't
actually seen these checks causing excessive cpu usage, so this may be
a red herring.
-Sam
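
For anyone wanting to try that branch, the option would presumably go into
ceph.conf on the OSD nodes along these lines (a sketch; it only exists in the
wip-dumpling-pglog-undirty build mentioned above):

[osd]
    osd debug pg log writeout = false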

On Sat, Aug 17, 2013 at 2:48 PM, Oliver Daudey oli...@xs4all.nl wrote:
 Hey Mark,

 On za, 2013-08-17 at 08:16 -0500, Mark Nelson wrote:
 On 08/17/2013 06:13 AM, Oliver Daudey wrote:
  Hey all,
 
  This is a copy of Bug #6040 (http://tracker.ceph.com/issues/6040) I
  created in the tracker.  Thought I would pass it through the list as
  well, to get an idea if anyone else is running into it.  It may only
  show under higher loads.  More info about my setup is in the bug-report
  above.  Here goes:
 
 
  I'm running a Ceph-cluster with 3 nodes, each of which runs a mon, osd
  and mds. I'm using RBD on this cluster as storage for KVM, CephFS is
  unused at this time. While still on v0.61.7 Cuttlefish, I got 70-100
  +MB/sec on simple linear writes to a file with `dd' inside a VM on this
  cluster under regular load and the osds usually averaged 20-100%
  CPU-utilisation in `top'. After the upgrade to Dumpling, CPU-usage for
  the osds shot up to 100% to 400% in `top' (multi-core system) and the
  speed for my writes with `dd' inside a VM dropped to 20-40MB/sec. Users
  complained that disk-access inside the VMs was significantly slower and
  the backups of the RBD-store I was running, also got behind quickly.
 
  After downgrading only the osds to v0.61.7 Cuttlefish and leaving the
  rest at 0.67 Dumpling, speed and load returned to normal. I have
  repeated this performance-hit upon upgrade on a similar test-cluster
  under no additional load at all. Although CPU-usage for the osds wasn't
  as dramatic during these tests because there was no base-load from other
  VMs, I/O-performance dropped significantly after upgrading during these
  tests as well, and returned to normal after downgrading the osds.
 
  I'm not sure what to make of it. There are no visible errors in the logs
  and everything runs and reports good health, it's just a lot slower,
  with a lot more CPU-usage.

 Hi Oliver,

 If you have access to the perf command on this system, could you try
 running:

 sudo perf top

 And if that doesn't give you much,

 sudo perf record -g

 then:

 sudo perf report | less

 during the period of high CPU usage?  This will give you a call graph.
 There may be symbols missing, but it might help track down what the OSDs
 are doing.
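
A variant that narrows the capture to a single OSD process, if that is easier
to read (the pid is a placeholder):

sudo perf record -g -p <pid-of-ceph-osd> -- sleep 30
sudo perf report | less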

 Thanks for your help!  I did a couple of runs on my test-cluster,
 loading it with writes from 3 VMs concurrently and measuring the results
 at the first node with all 0.67 Dumpling-components and with the osds
 replaced by 0.61.7 Cuttlefish.  I let `perf top' run and settle for a
 while, then copied anything that showed in red and green into this post.
 Here are the results (sorry for the word-wraps):

 First, with 0.61.7 osds:

  19.91%  [kernel][k] intel_idle
  10.18%  [kernel][k] _raw_spin_lock_irqsave
   6.79%  ceph-osd[.] ceph_crc32c_le
   4.93%  [kernel][k]
 default_send_IPI_mask_sequence_phys
   2.71%  [kernel][k] copy_user_generic_string
   1.42%  libc-2.11.3.so  [.] memcpy
   1.23%  [kernel][k] find_busiest_group
   1.13%  librados.so.2.0.0   [.] ceph_crc32c_le_intel
   1.11%  [kernel][k] _raw_spin_lock
   0.99%  kvm [.] 0x1931f8
   0.92%  [igb]   [k] igb_poll
   0.87%  [kernel][k] native_write_cr0
   0.80%  [kernel][k] csum_partial
   0.78%  [kernel][k] __do_softirq
   0.63%  [kernel][k] hpet_legacy_next_event
   0.53%  [ip_tables] [k] ipt_do_table
   0.50%  libc-2.11.3.so  [.] 0x74433

 Second test, with 0.67 osds:

  18.32%  [kernel]  [k] intel_idle
   7.58%  [kernel]  [k] _raw_spin_lock_irqsave
   7.04%  ceph-osd  [.] PGLog::undirty()
   4.39%  ceph-osd  [.] ceph_crc32c_le_intel
   3.92%  [kernel]  [k]
 default_send_IPI_mask_sequence_phys
   2.25%  [kernel]  [k] copy_user_generic_string
   1.76%  libc-2.11.3.so[.] memcpy
   1.56%  librados.so.2.0.0 [.] ceph_crc32c_le_intel
   1.40%  libc-2.11.3.so[.] vfprintf
   1.12%  libc-2.11.3.so[.] 0x7217b
   1.05%  [kernel]  [k] _raw_spin_lock
   1.01%  [kernel]  [k] find_busiest_group
   0.83%  kvm   [.] 0x193ab8
   0.80%  [kernel]  [k] native_write_cr0
   0.76%  [kernel]  [k] __do_softirq
   0.73% 

[ceph-users] Ceph Deployments

2013-08-19 Thread Schmitt, Christian
Hello, I just have some small questions about Ceph Deployment models and if
this would work for us.
Currently the first question would be, is it possible to have a ceph single
node setup, where everything is on one node?
Our Application, Ceph's object storage and a database? We focus on this
deployment model for our very small customers, who only have like 20
members that use our application, so the load wouldn't be very high.
And the next question would be, is it possible to extend the Ceph single
node to 3 nodes later, if they need more availability?

Also we always want to use Shared Nothing Machines, so every service would
be on one machine, is this Okai for Ceph, or does Ceph really need a lot of
CPU/Memory/Disk Speed?
Currently we make an archiving software for small customers and we want to
move things on the file system on a object storage. Currently we only have
customers that needs 1 machine or 3 machines. But everything should work as
fine on more.


Re: [ceph-users] Ceph Deployments

2013-08-19 Thread Wolfgang Hennerbichler
On 08/19/2013 10:36 AM, Schmitt, Christian wrote:
 Hello, I just have some small questions about Ceph Deployment models and
 if this would work for us.
 Currently the first question would be, is it possible to have a ceph
 single node setup, where everything is on one node?

yes. depends on 'everything', but it's possible (though not recommended)
to run mon, mds, and osd's on the same host, and even do virtualisation.

 Our Application, Ceph's object storage and a database? 

what is 'a database'?

 We focus on this
 deployment model for our very small customers, who only have like 20
 members that use our application, so the load wouldn't be very high.
 And the next question would be, is it possible to extend the Ceph single
 node to 3 nodes later, if they need more availability?

yes.

 Also we always want to use Shared Nothing Machines, so every service
 would be on one machine, is this Okai for Ceph, or does Ceph really need
 a lot of CPU/Memory/Disk Speed?

ceph needs cpu / disk speed when disks fail and need to be recovered. it
also uses some cpu when you have a lot of i/o, but generally it is
rather lightweight.
shared nothing is possible with ceph, but in the end this really depends
on your application.

 Currently we make an archiving software for small customers and we want
 to move things on the file system on a object storage. 

you mean from the filesystem to an object storage?

 Currently we only
 have customers that needs 1 machine or 3 machines. But everything should
 work as fine on more.

it would with ceph. probably :)
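
For the single-node case above, the main non-default bit is usually telling
CRUSH to place replicas across OSDs rather than hosts; a sketch of the relevant
ceph.conf fragment (values are only examples, not a recommendation):

[global]
    osd pool default size = 2         # or 3, with at least 3 OSDs on the node
    osd crush chooseleaf type = 0     # replicate across OSDs instead of hosts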


Re: [ceph-users] Usage pattern and design of Ceph

2013-08-19 Thread Mark Kirkwood

On 19/08/13 18:17, Guang Yang wrote:


   3. Some industry research shows that one issue for file systems is the
metadata-to-data ratio, in terms of both access and storage, and some
techniques combine small files into large physical files to reduce that
ratio (Haystack for example). If we want to use Ceph to store photos,
should this be a concern, given that Ceph uses one physical file per
object?


If you use Ceph as a pure object store, and get and put data via the 
basic rados api then sure, one client data object will be stored in one 
Ceph 'object'. However if you use rados gateway (S3 or Swift look-alike 
api) then each client data object will be broken up into chunks at the 
rados level (typically 4M sized chunks).
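
As a concrete illustration of the first case, with a made-up pool and object
name:

rados mkpool photos
rados -p photos put img_0001 ./img_0001.jpg
rados -p photos ls     # shows img_0001 -- one RADOS object per client object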



Regards

Mark



Re: [ceph-users] Usage pattern and design of Ceph

2013-08-19 Thread Wolfgang Hennerbichler


On 08/19/2013 11:18 AM, Mark Kirkwood wrote:
 However if you use rados gateway (S3 or Swift look-alike
 api) then each client data object will be broken up into chunks at the
 rados level (typically 4M sized chunks).

= which is a good thing in terms of replication and OSD usage
distribution.

 
 Regards
 
 Mark
 


-- 
DI (FH) Wolfgang Hennerbichler
Software Development
Unit Advanced Computing Technologies
RISC Software GmbH
A company of the Johannes Kepler University Linz

IT-Center
Softwarepark 35
4232 Hagenberg
Austria

Phone: +43 7236 3343 245
Fax: +43 7236 3343 250
wolfgang.hennerbich...@risc-software.at
http://www.risc-software.at


Re: [ceph-users] Destroyed Ceph Cluster

2013-08-19 Thread Georg Höllrigl

Hello List,

The troubles to fix such a cluster continue... I get output like this now:

# ceph health
HEALTH_WARN 192 pgs degraded; 192 pgs stuck unclean; mds cluster is 
degraded; mds vvx-ceph-m-03 is laggy



When checking for the ceph-mds processes, there are now none left... no 
matter which server I check. And they won't start up again!?


The log starts up with:
2013-08-19 11:23:30.503214 7f7e9dfbd780  0 ceph version 0.67 
(e3b7bc5bce8ab330ec1661381072368af3c218a0), process ceph-mds, pid 27636

2013-08-19 11:23:30.523314 7f7e9904b700  1 mds.-1.0 handle_mds_map standby
2013-08-19 11:23:30.529418 7f7e9904b700  1 mds.0.26 handle_mds_map i am 
now mds.0.26
2013-08-19 11:23:30.529423 7f7e9904b700  1 mds.0.26 handle_mds_map state 
change up:standby --> up:replay

2013-08-19 11:23:30.529426 7f7e9904b700  1 mds.0.26 replay_start
2013-08-19 11:23:30.529434 7f7e9904b700  1 mds.0.26  recovery set is
2013-08-19 11:23:30.529436 7f7e9904b700  1 mds.0.26  need osdmap epoch 
277, have 276
2013-08-19 11:23:30.529438 7f7e9904b700  1 mds.0.26  waiting for osdmap 
277 (which blacklists prior instance)
2013-08-19 11:23:30.534090 7f7e9904b700 -1 mds.0.sessionmap _load_finish 
got (2) No such file or directory
2013-08-19 11:23:30.535483 7f7e9904b700 -1 mds/SessionMap.cc: In 
function 'void SessionMap::_load_finish(int, ceph::bufferlist&)' thread 
7f7e9904b700 time 2013-08-19 11:23:30.534107

mds/SessionMap.cc: 83: FAILED assert(0 == "failed to load sessionmap")


Anyone an idea how to get the cluster back running?





Georg




On 16.08.2013 16:23, Mark Nelson wrote:

Hi Georg,

I'm not an expert on the monitors, but that's probably where I would
start.  Take a look at your monitor logs and see if you can get a sense
for why one of your monitors is down.  Some of the other devs will
probably be around later that might know if there are any known issues
with recreating the OSDs and missing PGs.

Mark

On 08/16/2013 08:21 AM, Georg Höllrigl wrote:

Hello,

I'm still evaluating ceph - now a test cluster with the 0.67 dumpling.
I've created the setup with ceph-deploy from GIT.
I've recreated a bunch of OSDs, to give them another journal.
There already was some test data on these OSDs.
I've already recreated the missing PGs with ceph pg force_create_pg


HEALTH_WARN 192 pgs stuck inactive; 192 pgs stuck unclean; 5 requests
are blocked > 32 sec; mds cluster is degraded; 1 mons down, quorum
0,1,2 vvx-ceph-m-01,vvx-ceph-m-02,vvx-ceph-m-03

Any idea how to fix the cluster, besides completley rebuilding the
cluster from scratch? What if such a thing happens in a production
environment...

The pgs from ceph pg dump looks all like creating for some time now:

2.3d0   0   0   0   0   0   0 creating
  2013-08-16 13:43:08.186537   0'0 0:0 []  [] 0'0
0.000'0 0.00

Is there a way to just dump the data, that was on the discarded OSDs?




Kind Regards,
Georg


Re: [ceph-users] Ceph Deployments

2013-08-19 Thread Martin Rudat

On 2013-08-19 18:36, Schmitt, Christian wrote:
Currently the first question would be, is it possible to have a ceph 
single node setup, where everything is on one node?
Yes, definitely, I've currently got a single-node ceph 'cluster', but, 
to the best of my knowledge, it's not the recommended configuration for 
long-term usage; in the coming weeks (given this is a home server), I'll 
be attempting to bring up another two nodes.


Our Application, Ceph's object storage and a database? We focus on 
this deployment model for our very small customers, who only have like 
20 members that use our application, so the load wouldn't be very high.
And the next question would be, is it possible to extend the Ceph 
single node to 3 nodes later, if they need more availability?
I'm not sure how much ram the monitor and mds take, but each osd (disk) 
seems to nominally use 300M of ram. My 'server' is a micro-ATX board 
with 5 spinning disks and a SSD, plugged into a small UPS; total cost 
about 2000 AUD. It's running a mail-server, backuppc for the other VMs, 
PCs and laptops in the house, a file-server re-exporting the disk from 
ceph, and some other random stuff. The VMs chew up a little more than 8G 
of ram in total, and on the 16G machine, there doesn't seem to be any 
performance problems (with only two users, mind you).


Also we always want to use Shared Nothing Machines, so every service 
would be on one machine, is this Okai for Ceph, or does Ceph really 
need a lot of CPU/Memory/Disk Speed?
Currently we make an archiving software for small customers and we 
want to move things on the file system on a object storage. Currently 
we only have customers that needs 1 machine or 3 machines. But 
everything should work as fine on more.
Depending on your definition of 'machine', a cluster of 3 smaller 
machines may be substitutable for a single larger one; with the hope 
that hardware failure only takes out 1 node, leaving the whole cluster 
still online and able to be restored to full capacity at your (relative) 
leisure, rather than Right Now, as the backups aren't running anymore...


The two 'new' nodes I'm spinning up are my old desktop machine and its 
predecessor, which, arguably could be construed as being 'free'. =)


For firms of your target size, it may be an effective thing to suggest 
upgrading one or more desktops, and use the old machines to run the 
backup system on. Especially if you're charging for the service 
provided, more than for the hardware, you may be able to consolidate 
multiple existing servers into VMs running on a ceph cluster, with 
enough spare capacity to also run your backup suite, with minimal to no 
actual hardware outlay.


--
Martin Rudat




Re: [ceph-users] Assert and monitor-crash when attemting to create pool-snapshots while rbd-snapshots are in use or have been used on a pool

2013-08-19 Thread Joao Eduardo Luis

On 08/18/2013 07:11 PM, Oliver Daudey wrote:

Hey all,

Also created on the tracker, under http://tracker.ceph.com/issues/6047

While playing around on my test-cluster, I ran into a problem that I've
seen before, but have never been able to reproduce until now.  The use
of pool-snapshots and rbd-snapshots seems to be mutually exclusive in
the same pool, even if you have used one type of snapshot before and
have since deleted all snapshots of that type.  Unfortunately, the
condition doesn't appear to be handled gracefully yet, leading, in one
case, to monitors crashing.  I think this one goes back at least as far
as Bobtail and still exists in Dumpling.  My cluster is a
straightforward one with 3 Debian Squeeze-nodes, each running a mon, mds
and osd.  To reproduce:

# ceph osd pool create test 256 256
pool 'test' created
# ceph osd pool mksnap test snapshot
created pool test snap snapshot
# ceph osd pool rmsnap test snapshot
removed pool test snap snapshot

So far, so good.  Now we try to create an rbd-snapshot in the same pool:

# rbd --pool=test create --size=102400 image
# rbd --pool=test snap create image@snapshot
rbd: failed to create snapshot: (22) Invalid argument
2013-08-18 19:27:50.892291 7f983bc10780 -1 librbd: failed to create snap
id: (22) Invalid argument

That failed, but at least the cluster is OK.  Now we start over again
and create the rbd-snapshot first:

# ceph osd pool delete test test --yes-i-really-really-mean-it
pool 'test' deleted
# ceph osd pool create test 256 256
pool 'test' created
# rbd --pool=test create --size=102400 image
# rbd --pool=test snap create image@snapshot
# rbd --pool=test snap ls image
SNAPID NAME  SIZE
  2 snapshot 102400 MB
# rbd --pool=test snap rm image@snapshot
# ceph osd pool mksnap test snapshot
2013-08-18 19:35:59.494551 7f48d75a1700  0 monclient: hunting for new
mon
^CError EINTR:  (I pressed CTRL-C)


Thanks for the steps to reproduce Oliver!  Managed to reproduce this on 
0.67.1 on the first attempt.


This bug appears to be the same as #5959 on the tracker.  I spent some 
time last week looking into it, and although I realized it was far too 
easy to trigger it on cuttlefish, I never managed to trigger it on next 
-- which I attributed to d1501938f5d07c067d908501fc5cfe3c857d7281.


I'll be looking into this.

  -Joao





My leader monitor crashed at that last command, here's the apparent
critical point in the logs:

 -3 2013-08-18 19:35:59.315956 7f9b870b1700  1 -- 194.109.43.18:6789/0 <== 
client.5856 194.109.43.18:0/1030570 8 ==== mon_command({"snap": "snapshot", 
"prefix": "osd pool mksnap", "pool": "test"} v 0) v1 ==== 107+0+0 (9835600 
0 0) 0x23e4200 con 0x2d202c0
 -2 2013-08-18 19:35:59.316020 7f9b870b1700  0 mon.a@0(leader) e1 
handle_command mon_command({"snap": "snapshot", "prefix": "osd pool 
mksnap", "pool": "test"} v 0) v1
 -1 2013-08-18 19:35:59.316033 7f9b870b1700  1 
mon.a@0(leader).paxos(paxos active c 1190049..1190629) is_readable 
now=2013-08-18 19:35:59.316034 lease_expire=2013-08-18 19:36:03.535809 
has v0 lc 1190629
  0 2013-08-18 19:35:59.317612 7f9b870b1700 -1 osd/osd_types.cc: In
function 'void pg_pool_t::add_snap(const char*, utime_t)' thread
7f9b870b1700 time 2013-08-18 19:35:59.316102
osd/osd_types.cc: 682: FAILED assert(!is_unmanaged_snaps_mode())

Apart from fixing this assert and maybe giving a more clear
error-message with the failed creation of the rbd-snapshot, maybe there
should be a way to switch from one snaps_mode to the other without
having to delete the entire pool, if one doesn't already exist.  BTW:
How exactly does one use the pool-snapshots?  There doesn't seem to be a
documented way of listing or using them after creation.

More info available on request.



Regards,

  Oliver





--
Joao Eduardo Luis
Software Engineer | http://inktank.com | http://ceph.com


Re: [ceph-users] Ceph Deployments

2013-08-19 Thread Schmitt, Christian
 Date: Mon, 19 Aug 2013 10:50:25 +0200
 From: Wolfgang Hennerbichler wolfgang.hennerbich...@risc-software.at
 To: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] Ceph Deployments
 Message-ID: 5211dc51.4070...@risc-software.at
 Content-Type: text/plain; charset=ISO-8859-1

 On 08/19/2013 10:36 AM, Schmitt, Christian wrote:
  Hello, I just have some small questions about Ceph Deployment models and
  if this would work for us.
  Currently the first question would be, is it possible to have a ceph
  single node setup, where everything is on one node?

 yes. depends on 'everything', but it's possible (though not recommended)
 to run mon, mds, and osd's on the same host, and even do virtualisation.

Currently we don't want to virtualise on this machine since the
machine is really small, as said we focus on small to midsize
businesses. Most of the time they even need a tower server due to the
lack of a correct rack. ;/

  Our Application, Ceph's object storage and a database?

 what is 'a database'?

We run Postgresql or MariaDB (without/with Galera depending on the cluster size)

  We focus on this
  deployment model for our very small customers, who only have like 20
  members that use our application, so the load wouldn't be very high.
  And the next question would be, is it possible to extend the Ceph single
  node to 3 nodes later, if they need more availability?

 yes.

Thats good!

  Also we always want to use Shared Nothing Machines, so every service
  would be on one machine, is this Okai for Ceph, or does Ceph really need
  a lot of CPU/Memory/Disk Speed?

 ceph needs cpu / disk speed when disks fail and need to be recovered. it
 also uses some cpu when you have a lot of i/o, but generally it is
 rather lightweight.
 shared nothing is possible with ceph, but in the end this really depends
 on your application.

Hm, when a disk fails we are already doing backups on a Dell PowerVault
RD1000, so I don't think that's a problem, and we would also run Ceph on
a Dell PERC RAID controller with RAID1 enabled on the data disk.

  Currently we make an archiving software for small customers and we want
  to move things on the file system on a object storage.

 you mean from the filesystem to an object storage?

yes, currently everything is on the filesystem and this is really
horrible, thousands of pdfs just on the filesystem. we can't scale up
that easily with this setup.
Currently we run on Microsoft Servers, but we plan to rewrite our
whole codebase with scaling in mind, from 1 to X Servers. So 1, 3, 5,
7, 9, ... X²-1 should be possible.

  Currently we only
  have customers that needs 1 machine or 3 machines. But everything should
  work as fine on more.

 it would with ceph. probably :)

That's nice to hear. I was really scared that we don't find a solution
that can run on 1 system and scale up to even more. We first looked at
HDFS but this isn't lightweight. And the overhead of Metadata etc.
just isn't that cool.


Re: [ceph-users] Ceph Deployments

2013-08-19 Thread Wolfgang Hennerbichler
On 08/19/2013 12:01 PM, Schmitt, Christian wrote:
 yes. depends on 'everything', but it's possible (though not recommended)
 to run mon, mds, and osd's on the same host, and even do virtualisation.
 
 Currently we don't want to virtualise on this machine since the
 machine is really small, as said we focus on small to midsize
 businesses. Most of the time they even need a tower server due to the
 lack of a correct rack. ;/

whoa :)

 Our Application, Ceph's object storage and a database?

 what is 'a database'?
 
 We run Postgresql or MariaDB (without/with Galera depending on the cluster 
 size)

You wouldn't want to put the data of postgres or mariadb on cephfs. I
would run the native versions directly on the servers and use
mysql-multi-master circular replication. I don't know about similar
features of postgres.

 shared nothing is possible with ceph, but in the end this really depends
 on your application.
 
 hm, when disk fails we already doing some backup on a dell powervault
 rd1000, so i don't think thats a problem and also we would run ceph on
 a Dell PERC Raid Controller with RAID1 enabled on the data disk.

this is open to discussion, and really depends on your use case.

 Currently we make an archiving software for small customers and we want
 to move things on the file system on a object storage.

 you mean from the filesystem to an object storage?
 
 yes, currently everything is on the filesystem and this is really
 horrible, thousands of pdfs just on the filesystem. we can't scale up
 that easily with this setup.

Got it.

 Currently we run on Microsoft Servers, but we plan to rewrite our
 whole codebase with scaling in mind, from 1 to X Servers. So 1, 3, 5,
 7, 9, ... X²-1 should be possible.

cool.

 Currently we only
 have customers that needs 1 machine or 3 machines. But everything should
 work as fine on more.

 it would with ceph. probably :)
 
 That's nice to hear. I was really scared that we don't find a solution
 that can run on 1 system and scale up to even more. We first looked at
 HDFS but this isn't lightweight. 

not only that, HDFS also has a single point of failure.

 And the overhead of Metadata etc.
 just isn't that cool.

:)


[ceph-users] Deploy Ceph on RHEL6.4

2013-08-19 Thread Guang Yang
Hi ceph-users,
I would like to check if there is any manual / steps which can let me try to 
deploy ceph in RHEL?

Thanks,
Guang


Re: [ceph-users] Ceph Deployments

2013-08-19 Thread Schmitt, Christian
2013/8/19 Wolfgang Hennerbichler wolfgang.hennerbich...@risc-software.at:
 On 08/19/2013 12:01 PM, Schmitt, Christian wrote:
 yes. depends on 'everything', but it's possible (though not recommended)
 to run mon, mds, and osd's on the same host, and even do virtualisation.

 Currently we don't want to virtualise on this machine since the
 machine is really small, as said we focus on small to midsize
 businesses. Most of the time they even need a tower server due to the
 lack of a correct rack. ;/

 whoa :)

Yep that's awful.

 Our Application, Ceph's object storage and a database?

 what is 'a database'?

 We run Postgresql or MariaDB (without/with Galera depending on the cluster 
 size)

 You wouldn't want to put the data of postgres or mariadb on cephfs. I
 would run the native versions directly on the servers and use
 mysql-multi-master circular replication. I don't know about similar
 features of postgres.

No, I don't want to put a MariaDB cluster on CephFS. We want to put PDFs
in CephFS or Ceph's object storage and hold a key or path in the
database; other things like user management will also belong in the
database.

 shared nothing is possible with ceph, but in the end this really depends
 on your application.

 hm, when disk fails we already doing some backup on a dell powervault
 rd1000, so i don't think thats a problem and also we would run ceph on
 a Dell PERC Raid Controller with RAID1 enabled on the data disk.

 this is open to discussion, and really depends on your use case.

Yeah, we definitely know that it isn't good to use Ceph on a single
node, but I think it's easier to design the application so that it always
depends on Ceph. It wouldn't be easy to manage a single node
without Ceph and more than 1 node with Ceph.

 Currently we make an archiving software for small customers and we want
 to move things on the file system on a object storage.

 you mean from the filesystem to an object storage?

 yes, currently everything is on the filesystem and this is really
 horrible, thousands of pdfs just on the filesystem. we can't scale up
 that easily with this setup.

 Got it.

 Currently we run on Microsoft Servers, but we plan to rewrite our
 whole codebase with scaling in mind, from 1 to X Servers. So 1, 3, 5,
 7, 9, ... X²-1 should be possible.

 cool.

 Currently we only
 have customers that needs 1 machine or 3 machines. But everything should
 work as fine on more.

 it would with ceph. probably :)

 That's nice to hear. I was really scared that we don't find a solution
 that can run on 1 system and scale up to even more. We first looked at
 HDFS but this isn't lightweight.

 not only that, HDFS also has a single point of failure.

 And the overhead of Metadata etc.
 just isn't that cool.

 :)

Yeah that's why I came to Ceph. I think that's probably the way we want to go.
Really thank you for your help. It's good to know that I have a
solution for the things that are badly designed on our current
solution.



Re: [ceph-users] Deploy Ceph on RHEL6.4

2013-08-19 Thread xan.peng
On Mon, Aug 19, 2013 at 6:09 PM, Guang Yang yguan...@yahoo.com wrote:
 Hi ceph-users,
 I would like to check if there is any manual / steps which can let me try to
 deploy ceph in RHEL?

Setup with ceph-deploy: http://dachary.org/?p=1971
Official documentation will also be helpful:
http://ceph.com/docs/master/start/quick-ceph-deploy/
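
The quick-start flow from those links boils down to something like the
following (hostnames and devices are placeholders; check the docs above for the
exact syntax on your release):

ceph-deploy new ceph-node1
ceph-deploy install ceph-node1 ceph-node2 ceph-node3
ceph-deploy mon create ceph-node1
ceph-deploy gatherkeys ceph-node1
ceph-deploy osd create ceph-node2:/dev/sdb ceph-node3:/dev/sdb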
-- 
-Thanks.
- xan.peng


[ceph-users] osd/OSD.cc: 4844: FAILED assert(_get_map_bl(epoch, bl)) (ceph 0.61.7)

2013-08-19 Thread Olivier Bonvalet
Hi,

I have an OSD which crashes every time I try to start it (see logs below).
Is it a known problem? And is there a way to fix it?

root! taman:/var/log/ceph# grep -v ' pipe' osd.65.log
2013-08-19 11:07:48.478558 7f6fe367a780  0 ceph version 0.61.7 
(8f010aff684e820ecc837c25ac77c7a05d7191ff), process ceph-osd, pid 19327
2013-08-19 11:07:48.516363 7f6fe367a780  0 filestore(/var/lib/ceph/osd/ceph-65) 
mount FIEMAP ioctl is supported and appears to work
2013-08-19 11:07:48.516380 7f6fe367a780  0 filestore(/var/lib/ceph/osd/ceph-65) 
mount FIEMAP ioctl is disabled via 'filestore fiemap' config option
2013-08-19 11:07:48.516514 7f6fe367a780  0 filestore(/var/lib/ceph/osd/ceph-65) 
mount did NOT detect btrfs
2013-08-19 11:07:48.517087 7f6fe367a780  0 filestore(/var/lib/ceph/osd/ceph-65) 
mount syscall(SYS_syncfs, fd) fully supported
2013-08-19 11:07:48.517389 7f6fe367a780  0 filestore(/var/lib/ceph/osd/ceph-65) 
mount found snaps 
2013-08-19 11:07:49.199483 7f6fe367a780  0 filestore(/var/lib/ceph/osd/ceph-65) 
mount: enabling WRITEAHEAD journal mode: btrfs not detected
2013-08-19 11:07:52.191336 7f6fe367a780  1 journal _open /dev/sdk4 fd 18: 
53687091200 bytes, block size 4096 bytes, directio = 1, aio = 1
2013-08-19 11:07:52.196020 7f6fe367a780  1 journal _open /dev/sdk4 fd 18: 
53687091200 bytes, block size 4096 bytes, directio = 1, aio = 1
2013-08-19 11:07:52.196920 7f6fe367a780  1 journal close /dev/sdk4
2013-08-19 11:07:52.199908 7f6fe367a780  0 filestore(/var/lib/ceph/osd/ceph-65) 
mount FIEMAP ioctl is supported and appears to work
2013-08-19 11:07:52.199916 7f6fe367a780  0 filestore(/var/lib/ceph/osd/ceph-65) 
mount FIEMAP ioctl is disabled via 'filestore fiemap' config option
2013-08-19 11:07:52.200058 7f6fe367a780  0 filestore(/var/lib/ceph/osd/ceph-65) 
mount did NOT detect btrfs
2013-08-19 11:07:52.200886 7f6fe367a780  0 filestore(/var/lib/ceph/osd/ceph-65) 
mount syscall(SYS_syncfs, fd) fully supported
2013-08-19 11:07:52.200919 7f6fe367a780  0 filestore(/var/lib/ceph/osd/ceph-65) 
mount found snaps 
2013-08-19 11:07:52.215850 7f6fe367a780  0 filestore(/var/lib/ceph/osd/ceph-65) 
mount: enabling WRITEAHEAD journal mode: btrfs not detected
2013-08-19 11:07:52.219819 7f6fe367a780  1 journal _open /dev/sdk4 fd 26: 
53687091200 bytes, block size 4096 bytes, directio = 1, aio = 1
2013-08-19 11:07:52.227420 7f6fe367a780  1 journal _open /dev/sdk4 fd 26: 
53687091200 bytes, block size 4096 bytes, directio = 1, aio = 1
2013-08-19 11:07:52.500342 7f6fe367a780  0 osd.65 144201 crush map has features 
262144, adjusting msgr requires for clients
2013-08-19 11:07:52.500353 7f6fe367a780  0 osd.65 144201 crush map has features 
262144, adjusting msgr requires for osds
2013-08-19 11:08:13.581709 7f6fbdcb5700 -1 osd/OSD.cc: In function 'OSDMapRef 
OSDService::get_map(epoch_t)' thread 7f6fbdcb5700 time 2013-08-19 
11:08:13.579519
osd/OSD.cc: 4844: FAILED assert(_get_map_bl(epoch, bl))

 ceph version 0.61.7 (8f010aff684e820ecc837c25ac77c7a05d7191ff)
 1: (OSDService::get_map(unsigned int)+0x44b) [0x6f5b9b]
 2: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, 
PG::RecoveryCtx*, std::set<boost::intrusive_ptr<PG>, 
std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> > 
>*)+0x3c8) [0x6f8f48]
 3: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, 
ThreadPool::TPHandle&)+0x31f) [0x6f975f]
 4: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, 
ThreadPool::TPHandle&)+0x14) [0x7391d4]
 5: (ThreadPool::worker(ThreadPool::WorkThread*)+0x68a) [0x8f8e3a]
 6: (ThreadPool::WorkThread::entry()+0x10) [0x8fa0e0]
 7: (()+0x6b50) [0x7f6fe3070b50]
 8: (clone()+0x6d) [0x7f6fe15cba7d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to 
interpret this.

full logs here : http://pastebin.com/RphNyLU0




[ceph-users] Poor write/random read/random write performance

2013-08-19 Thread Da Chun Ng
I have a 3 nodes, 15 osds ceph cluster setup:
* 15 7200 RPM SATA disks, 5 for each node.
* 10G network
* Intel(R) Xeon(R) CPU E5-2620 (6 cores) 2.00GHz, for each node.
* 64G RAM for each node.

I deployed the cluster with ceph-deploy, and created a new data pool for cephfs.
Both the data and metadata pools are set with replica size 3.
Then I mounted the cephfs on one of the three nodes, and tested the performance with fio.

The sequential read performance looks good:
fio -direct=1 -iodepth 1 -thread -rw=read -ioengine=libaio -bs=16K -size=1G -numjobs=16 -group_reporting -name=mytest -runtime 60
read : io=10630MB, bw=181389KB/s, iops=11336, runt= 60012msec

But the sequential write/random read/random write performance is very poor:
fio -direct=1 -iodepth 1 -thread -rw=write -ioengine=libaio -bs=16K -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60
write: io=397280KB, bw=6618.2KB/s, iops=413, runt= 60029msec
fio -direct=1 -iodepth 1 -thread -rw=randread -ioengine=libaio -bs=16K -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60
read : io=665664KB, bw=11087KB/s, iops=692, runt= 60041msec
fio -direct=1 -iodepth 1 -thread -rw=randwrite -ioengine=libaio -bs=16K -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60
write: io=361056KB, bw=6001.1KB/s, iops=375, runt= 60157msec

I am mostly surprised by the seq write performance compared to the raw SATA 
disk performance (it can get 4127 IOPS when mounted with ext4). My cephfs only 
gets 1/10 of the performance of the raw disk.
How can I tune my cluster to improve the sequential write/random read/random 
write performance?




[ceph-users] dumpling ceph cli tool breaks openstack cinder

2013-08-19 Thread Øystein Lønning Nerhus
Hi,

I just noticed that in dumpling the ceph cli tool no longer utilises the 
CEPH_ARGS environment variable.  This is used by openstack cinder to specify 
the cephx user.   Ref: 
http://ceph.com/docs/next/rbd/rbd-openstack/#configure-openstack-to-use-ceph

I modified this line in /usr/share/pyshared/cinder/volume/driver.py (old line 
first, new line second):

 stdout, _ = self._execute('ceph', 'fsid')
 stdout, _ = self._execute('ceph', '--id', 'volumes', 'fsid')

For my particular setup this seems to be sufficient as a quick workaround.  Is 
there a proper way to do this with the new tool?

Note: This only hit me when I tried to create a volume from an image (I'm using 
copy-on-write cloning).  Creating a fresh volume didn't invoke the ceph fsid 
command in the openstack script, so I guess some openstack users will not be 
affected.

Thanks,

Øystein


Re: [ceph-users] Poor write/random read/random write performance

2013-08-19 Thread Da Chun Ng
Sorry, I forgot to mention the OS and kernel version.
It's CentOS 6.4 with kernel 3.10.6, fio 2.0.13.

From: dachun...@outlook.com
To: ceph-users@lists.ceph.com
Date: Mon, 19 Aug 2013 11:28:24 +
Subject: [ceph-users] Poor write/random read/random write performance




I have a 3 nodes, 15 osds ceph cluster setup:
* 15 7200 RPM SATA disks, 5 for each node.
* 10G network
* Intel(R) Xeon(R) CPU E5-2620 (6 cores) 2.00GHz, for each node.
* 64G RAM for each node.

I deployed the cluster with ceph-deploy, and created a new data pool for cephfs.
Both the data and metadata pools are set with replica size 3.
Then I mounted the cephfs on one of the three nodes, and tested the performance with fio.

The sequential read performance looks good:
fio -direct=1 -iodepth 1 -thread -rw=read -ioengine=libaio -bs=16K -size=1G -numjobs=16 -group_reporting -name=mytest -runtime 60
read : io=10630MB, bw=181389KB/s, iops=11336, runt= 60012msec

But the sequential write/random read/random write performance is very poor:
fio -direct=1 -iodepth 1 -thread -rw=write -ioengine=libaio -bs=16K -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60
write: io=397280KB, bw=6618.2KB/s, iops=413, runt= 60029msec
fio -direct=1 -iodepth 1 -thread -rw=randread -ioengine=libaio -bs=16K -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60
read : io=665664KB, bw=11087KB/s, iops=692, runt= 60041msec
fio -direct=1 -iodepth 1 -thread -rw=randwrite -ioengine=libaio -bs=16K -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60
write: io=361056KB, bw=6001.1KB/s, iops=375, runt= 60157msec

I am mostly surprised by the seq write performance compared to the raw SATA 
disk performance (it can get 4127 IOPS when mounted with ext4). My cephfs only 
gets 1/10 of the performance of the raw disk.
How can I tune my cluster to improve the sequential write/random read/random 
write performance?


  



Re: [ceph-users] Poor write/random read/random write performance

2013-08-19 Thread Mark Nelson

On 08/19/2013 06:28 AM, Da Chun Ng wrote:

I have a 3 nodes, 15 osds ceph cluster setup:
* 15 7200 RPM SATA disks, 5 for each node.
* 10G network
* Intel(R) Xeon(R) CPU E5-2620(6 cores) 2.00GHz, for each node.
* 64G Ram for each node.

I deployed the cluster with ceph-deploy, and created a new data pool 
for cephfs.

Both the data and metadata pools are set with replica size 3.
Then mounted the cephfs on one of the three nodes, and tested the 
performance with fio.


The sequential read  performance looks good:
fio -direct=1 -iodepth 1 -thread -rw=read -ioengine=libaio -bs=16K 
-size=1G -numjobs=16 -group_reporting -name=mytest -runtime 60

read : io=10630MB, bw=181389KB/s, iops=11336 , runt= 60012msec


Sounds like readahead and/or caching is helping out a lot here. Btw, you 
might want to make sure this is actually coming from the disks with 
iostat or collectl or something.




But the sequential write/random read/random write performance is very 
poor:
fio -direct=1 -iodepth 1 -thread -rw=write -ioengine=libaio -bs=16K 
-size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60

write: io=397280KB, bw=6618.2KB/s, iops=413 , runt= 60029msec


One thing to keep in mind is that unless you have SSDs in this system, 
you will be doing 2 writes for every client write to the spinning disks 
(since data and journals will both be on the same disk).


So let's do the math:

6618.2KB/s * 3 replication * 2 (journal + data writes) * 1024 
(KB->bytes) / 16384 (write size in bytes) / 15 drives = ~165 IOPS / drive


If there is no write coalescing going on, this isn't terrible.  If there 
is, this is terrible.  Have you tried buffered writes with the sync 
engine at the same IO size?
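
The estimate above can be sanity-checked from a shell, for what it's worth:

echo "6618.2 * 3 * 2 * 1024 / 16384 / 15" | bc -l    # ~165 write IOPS per drive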


fio -direct=1 -iodepth 1 -thread -rw=randread -ioengine=libaio -bs=16K 
-size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60

read : io=665664KB, bw=11087KB/s, iops=692 , runt= 60041msec


In this case:

11087 * 1024 (KB->bytes) / 16384 / 15 = ~46 IOPS / drive.

Definitely not great!  You might want to try fiddling with read ahead 
both on the CephFS client and on the block devices under the OSDs 
themselves.
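
For the block devices under the OSDs that is typically done with blockdev; the
device name and value below are only examples:

blockdev --getra /dev/sdb         # current readahead, in 512-byte sectors
blockdev --setra 4096 /dev/sdb    # e.g. bump it to 2 MB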


One thing I did notice back during bobtail is that increasing the number 
of osd op threads seemed to help small object read performance.  It 
might be worth looking at too.


http://ceph.com/community/ceph-bobtail-jbod-performance-tuning/#4kbradosread

Other than that, if you really want to dig into this, you can use tools 
like iostat, collectl, blktrace, and seekwatcher to try and get a feel 
for what the IO going to the OSDs looks like.  That can help when 
diagnosing this sort of thing.


fio -direct=1 -iodepth 1 -thread -rw=randwrite -ioengine=libaio 
-bs=16K -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60

write: io=361056KB, bw=6001.1KB/s, iops=375 , runt= 60157msec


6001.1KB/s * 3 replication * 2 (journal + data writes) * 1024 
(KB->bytes) / 16384 (write size in bytes) / 15 drives = ~150 IOPS / drive




I am mostly surprised by the seq write performance comparing to the 
raw sata disk performance(It can get 4127 IOPS when mounted with 
ext4). My cephfs only gets 1/10 performance of the raw disk.


7200 RPM spinning disks typically top out at something like 150 IOPS 
(and some are lower).  With 15 disks, to hit 4127 IOPS you were probably 
seeing some write coalescing effects (or if these were random reads, 
some benefit to read ahead).




How can I tune my cluster to improve the sequential write/random 
read/random write performance?
I don't know what kind of controller you have, but in cases where 
journals are on the same disks as the data, using writeback cache helps 
a lot because the controller can coalesce the direct IO journal writes 
in cache and just do big periodic dumps to the drives.  That really 
reduces seek overhead for the writes.  Using SSDs for the journals 
accomplishes much of the same effect, and lets you get faster large IO 
writes too, but in many chassis there is a density (and cost) trade-off.


Hope this helps!

Mark









[ceph-users] Request preinstalled Virtual Machines Images for cloning.

2013-08-19 Thread Johannes Klarenbeek
Dear Ceph Developers and Users,

I was wondering if there is any download location for preinstalled virtual 
machine images with the latest release of Ceph. Preferably 4 different images 
with Ceph-OSD, Ceph-Mon, Ceph-MDS and, last but not least, a Ceph client with an 
iSCSI target server installed. But since the latter is the client, I guess any 
distro would do.

If this doesn't exist, maybe it's a great idea for distribution from the 
ceph.com website. I could just start up an image like ceph-osd on any hypervisor 
to add its local storage via disk passthrough to my Ceph private cloud, and 
just distribute some monitors and metadata servers over the rest of the 
hypervisors. Packages like this can be kept small (for example using SliTaz, 
since this one performs best on Hyper-V hypervisors).

Any Ideas?

Regards,
Johannes




Re: [ceph-users] Poor write/random read/random write performance

2013-08-19 Thread Da Chun Ng
Thanks very much, Mark! Yes, I put the data and journal on the same disk; no SSD 
in my environment. My controllers are plain SATA II.
Some more questions are inline below.

Date: Mon, 19 Aug 2013 07:48:23 -0500
From: mark.nel...@inktank.com
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Poor write/random read/random write performance


  

  
  
 On 08/19/2013 06:28 AM, Da Chun Ng wrote:

  I have a 3 nodes, 15 osds ceph cluster setup:
  * 15 7200 RPM SATA disks, 5 for each node.
  * 10G network
  * Intel(R) Xeon(R) CPU E5-2620 (6 cores) 2.00GHz, for each node.
  * 64G RAM for each node.

  I deployed the cluster with ceph-deploy, and created a new data pool for cephfs.
  Both the data and metadata pools are set with replica size 3.
  Then I mounted the cephfs on one of the three nodes, and tested the performance with fio.

  The sequential read performance looks good:
  fio -direct=1 -iodepth 1 -thread -rw=read -ioengine=libaio -bs=16K -size=1G -numjobs=16 -group_reporting -name=mytest -runtime 60
  read : io=10630MB, bw=181389KB/s, iops=11336, runt= 60012msec

 Sounds like readahead and/or caching is helping out a lot here.  Btw, you 
 might want to make sure this is actually coming from the disks with iostat 
 or collectl or something.

I ran "sync && echo 3 | tee /proc/sys/vm/drop_caches" on all the nodes before 
every test. I used collectl to watch every disk IO, and the numbers should 
match. I think readahead is helping here.

  But the sequential write/random read/random write performance is very poor:
  fio -direct=1 -iodepth 1 -thread -rw=write -ioengine=libaio -bs=16K -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60
  write: io=397280KB, bw=6618.2KB/s, iops=413, runt= 60029msec

 One thing to keep in mind is that unless you have SSDs in this system, you 
 will be doing 2 writes for every client write to the spinning disks (since 
 data and journals will both be on the same disk).

 So let's do the math:

 6618.2KB/s * 3 replication * 2 (journal + data writes) * 1024 (KB->bytes) / 
 16384 (write size in bytes) / 15 drives = ~165 IOPS / drive

 If there is no write coalescing going on, this isn't terrible.  If there is, 
 this is terrible.

How can I know if there is write coalescing going on?

 Have you tried buffered writes with the sync engine at the same IO size?

Do you mean as below?
fio -direct=0 -iodepth 1 -thread -rw=write -ioengine=sync -bs=16K -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60

  fio -direct=1 -iodepth 1 -thread -rw=randread -ioengine=libaio -bs=16K -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60
  read : io=665664KB, bw=11087KB/s, iops=692, runt= 60041msec

 In this case:

 11087 * 1024 (KB->bytes) / 16384 / 15 = ~46 IOPS / drive.

 Definitely not great!  You might want to try fiddling with read ahead both 
 on the CephFS client and on the block devices under the OSDs themselves.

Could you please tell me how to enable read ahead on the CephFS client?
For the block devices under the OSDs, the current read ahead value is:
[root@ceph0 ~]# blockdev --getra /dev/sdi
256
How big is appropriate for it?

 One thing I did notice back during bobtail is that increasing the number of 
 osd op threads seemed to help small object read performance.  It might be 
 worth looking at too.

 http://ceph.com/community/ceph-bobtail-jbod-performance-tuning/#4kbradosread

 Other than that, if you really want to dig into this, you can use tools like 
 iostat, collectl, blktrace, and seekwatcher to try and get a feel for what 
 the IO going to the OSDs looks like.  That can help when diagnosing this 
 sort of thing.

  fio -direct=1 -iodepth 1 -thread -rw=randwrite -ioengine=libaio -bs=16K -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60
  write: io=361056KB, bw=6001.1KB/s, iops=375, runt= 60157msec

 6001.1KB/s * 3 replication * 2 (journal + data writes) * 1024 (KB->bytes) / 
 16384 (write size in bytes) / 15 drives = ~150 IOPS / drive

  I am mostly surprised by the seq write performance compared to the raw SATA 
  disk performance (it can get 4127 IOPS when mounted with ext4). My cephfs 
  only gets 1/10 of the performance of the raw disk.

Re: [ceph-users] dumpling ceph cli tool breaks openstack cinder

2013-08-19 Thread Sage Weil
On Mon, 19 Aug 2013, Sébastien Han wrote:
 Hi,
 
 The new version of the driver (for Havana) doesn't need the CEPH_ARGS 
 argument, the driver now uses the librbd and librados (not the CLI anymore).
 
 I guess a better patch will result in:
 
 stdout, _ = self._execute('ceph', '--id', 'self.configuration.rbd_user', 
 'fsid')
 I'll report the bug. Thanks!
 
 However I don't know how to fix this with the new CLI.

I opened http://tracker.ceph.com/issues/6052.  This is a simple matter of 
adding a call to rados_conf_parse_env(...).

Thanks!
sage


 
 Cheers.
 
 
 Sébastien Han
 Cloud Engineer
 
 Always give 100%. Unless you're giving blood.
 
 
 
 Phone: +33 (0)1 49 70 99 72 - Mobile: +33 (0)6 52 84 44 70
 Mail: sebastien@enovance.com - Skype : han.sbastien
 Address : 10, rue de la Victoire - 75009 Paris
 Web : www.enovance.com - Twitter : @enovance
 
 On August 19, 2013 at 1:28:57 PM, Øystein Lønning Nerhus (ner...@vx.no) wrote:
 
 Hi,
 
 I just noticed that in dumpling the ceph cli tool no longer utilises the 
 CEPH_ARGS environment variable.  This is used by openstack cinder to 
 specifiy the cephx user.   Ref: 
 http://ceph.com/docs/next/rbd/rbd-openstack/#configure-openstack-to-use-ceph
 
 I modifiied this line in /usr/share/pyshared/cinder/volume/driver.py
 
          stdout, _ = self._execute('ceph', 'fsid')
          stdout, _ = self._execute('ceph', '--id', 'volumes', 'fsid')
 
 For my particular setup this seems to be sufficient as a quick workaround.  
 Is there a proper way to do this with the new tool?
 
 Note: This only hit when i tried to create a volume from an image (i'm using 
 copy on write cloning).  creating a fresh volume didnt invoke the ceph fsid 
 command in the openstack script, so i guess some openstack users will not be 
 affected.
 
 Thanks,
 
 Øystein


Re: [ceph-users] Destroyed Ceph Cluster

2013-08-19 Thread Gregory Farnum
Have you ever used the FS? It's missing an object which we're
intermittently seeing failures to create (on initial setup) when the
cluster is unstable.
If so, clear out the metadata pool and check the docs for newfs.
-Greg
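
For reference, the recovery Greg describes looks roughly like this; pool
names/IDs are placeholders, it throws away all CephFS metadata, and the exact
syntax should be checked against the docs for your release:

ceph osd pool delete metadata metadata --yes-i-really-really-mean-it
ceph osd pool create metadata 192
ceph mds newfs <metadata-pool-id> <data-pool-id> --yes-i-really-mean-it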

On Monday, August 19, 2013, Georg Höllrigl wrote:

 Hello List,

 The troubles to fix such a cluster continue... I get output like this now:

 # ceph health
 HEALTH_WARN 192 pgs degraded; 192 pgs stuck unclean; mds cluster is
 degraded; mds vvx-ceph-m-03 is laggy


 When checking for the ceph-mds processes, there are now none left... no
 matter which server I check. And the won't start up again!?

 The log starts up with:
 2013-08-19 11:23:30.503214 7f7e9dfbd780  0 ceph version 0.67 
 (e3b7bc5bce8ab330ec1661381072368af3c218a0), process ceph-mds, pid 27636
 2013-08-19 11:23:30.523314 7f7e9904b700  1 mds.-1.0 handle_mds_map standby
 2013-08-19 11:23:30.529418 7f7e9904b700  1 mds.0.26 handle_mds_map i am
 now mds.0.26
 2013-08-19 11:23:30.529423 7f7e9904b700  1 mds.0.26 handle_mds_map state
 change up:standby --> up:replay
 2013-08-19 11:23:30.529426 7f7e9904b700  1 mds.0.26 replay_start
 2013-08-19 11:23:30.529434 7f7e9904b700  1 mds.0.26  recovery set is
 2013-08-19 11:23:30.529436 7f7e9904b700  1 mds.0.26  need osdmap epoch
 277, have 276
 2013-08-19 11:23:30.529438 7f7e9904b700  1 mds.0.26  waiting for osdmap
 277 (which blacklists prior instance)
 2013-08-19 11:23:30.534090 7f7e9904b700 -1 mds.0.sessionmap _load_finish
 got (2) No such file or directory
 2013-08-19 11:23:30.535483 7f7e9904b700 -1 mds/SessionMap.cc: In function
 'void SessionMap::_load_finish(int, ceph::bufferlist&)' thread 7f7e9904b700
 time 2013-08-19 11:23:30.534107
 mds/SessionMap.cc: 83: FAILED assert(0 == "failed to load sessionmap")


 Anyone an idea how to get the cluster back running?





 Georg




 On 16.08.2013 16:23, Mark Nelson wrote:

 Hi Georg,

 I'm not an expert on the monitors, but that's probably where I would
 start.  Take a look at your monitor logs and see if you can get a sense
 for why one of your monitors is down.  Some of the other devs will
 probably be around later that might know if there are any known issues
 with recreating the OSDs and missing PGs.

 Mark

 On 08/16/2013 08:21 AM, Georg Höllrigl wrote:

 Hello,

 I'm still evaluating ceph - now a test cluster with the 0.67 dumpling.
 I've created the setup with ceph-deploy from GIT.
 I've recreated a bunch of OSDs, to give them another journal.
 There already was some test data on these OSDs.
 I've already recreated the missing PGs with ceph pg force_create_pg


 HEALTH_WARN 192 pgs stuck inactive; 192 pgs stuck unclean; 5 requests
 are blocked > 32 sec; mds cluster is degraded; 1 mons down, quorum
 0,1,2 vvx-ceph-m-01,vvx-ceph-m-02,vvx-ceph-m-03

 Any idea how to fix the cluster, besides completley rebuilding the
 cluster from scratch? What if such a thing happens in a production
 environment...

 The pgs from ceph pg dump looks all like creating for some time now:

 2.3d0   0   0   0   0   0   0 creating
   2013-08-16 13:43:08.186537   0'0 0:0 []  [] 0'0
 0.000'0 0.00

 Is there a way to just dump the data, that was on the discarded OSDs?




 Kind Regards,
 Georg
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Poor write/random read/random write performance

2013-08-19 Thread Mark Nelson
On 08/19/2013 08:59 AM, Da Chun Ng wrote:
 Thanks very much! Mark.
 Yes, I put the data and journal on the same disk, no SSD in my environment.
 My controllers are general SATA II.

Ok, so in this case the lack of WB cache on the controller and no SSDs
for journals is probably having an effect.

 
 Some more questions below in blue.
 
 
 Date: Mon, 19 Aug 2013 07:48:23 -0500
 From: mark.nel...@inktank.com
 To: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] Poor write/random read/random write performance
 
 On 08/19/2013 06:28 AM, Da Chun Ng wrote:
 
 I have a 3 nodes, 15 osds ceph cluster setup:
 * 15 7200 RPM SATA disks, 5 for each node.
 * 10G network
 * Intel(R) Xeon(R) CPU E5-2620(6 cores) 2.00GHz, for each node.
 * 64G Ram for each node.
 
 I deployed the cluster with ceph-deploy, and created a new data pool
 for cephfs.
 Both the data and metadata pools are set with replica size 3.
 Then mounted the cephfs on one of the three nodes, and tested the
 performance with fio.
 
 The sequential read  performance looks good:
 fio -direct=1 -iodepth 1 -thread -rw=read -ioengine=libaio -bs=16K
 -size=1G -numjobs=16 -group_reporting -name=mytest -runtime 60
 read : io=10630MB, bw=181389KB/s, iops=11336 , runt= 60012msec
 
 
 Sounds like readahead and or caching is helping out a lot here. Btw, you 
 might want to make sure this is actually coming from the disks with 
 iostat or collectl or something.
 
 I ran sync && echo 3 | tee /proc/sys/vm/drop_caches on all the nodes 
 before every test. I used collectl to watch every disk IO, the numbers 
 should match. I think readahead is helping here.

Ok, good!  I suspect that readahead is indeed helping.

 
 
 But the sequential write/random read/random write performance is
 very poor:
 fio -direct=1 -iodepth 1 -thread -rw=write -ioengine=libaio -bs=16K
 -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60
 write: io=397280KB, bw=6618.2KB/s, iops=413 , runt= 60029msec
 
 
 One thing to keep in mind is that unless you have SSDs in this system, 
 you will be doing 2 writes for every client write to the spinning disks 
 (since data and journals will both be on the same disk).
 
 So let's do the math:
 
 6618.2KB/s * 3 replication * 2 (journal + data writes) * 1024 
 (KB-bytes) / 16384 (write size in bytes) / 15 drives = ~165 IOPS / drive
 
 If there is no write coalescing going on, this isn't terrible.  If there 
 is, this is terrible.
 
 How can I know if there is write coalescing going on?

look in collectl at the average IO sizes going to the disks.  I bet they
will be 16KB.  If you were to look further with blktrace and
seekwatcher, I bet you'd see lots of seeking between OSD data writes and
journal writes since there is no controller cache helping smooth things
out (and your journals are on the same drives).

 
 Have you tried buffered writes with the sync engine at the same IO size?
 
 Do you mean as below?
 fio -direct=0 -iodepth 1 -thread -rw=write -ioengine=sync -bs=16K 
 -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60

Yeah, that'd work.

 
 fio -direct=1 -iodepth 1 -thread -rw=randread -ioengine=libaio
 -bs=16K -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60
 read : io=665664KB, bw=11087KB/s, iops=692 , runt= 60041msec
 
 
 In this case:
 
 11087 * 1024 (KB-bytes) / 16384 / 15 = ~46 IOPS / drive.
 
 Definitely not great!  You might want to try fiddling with read ahead 
 both on the CephFS client and on the block devices under the OSDs 
 themselves.
 
 Could you please tell me how to enable read ahead on the CephFS client?

It's one of the mount options:

http://ceph.com/docs/master/man/8/mount.ceph/
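
For example, something along these lines when kernel-mounting (rsize is the max
read size in bytes; newer kernels also have rasize for readahead -- see the man
page above, the values here are only examples):

  mount -t ceph 192.168.0.1:6789:/ /mnt/cephfs \
        -o name=admin,secretfile=/etc/ceph/admin.secret,rsize=4194304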

 
 For the block devices under the OSDs, the read ahead value is:
 [root@ceph0 ~]# blockdev --getra /dev/sdi
 256
 How big is appropriate for it?

To be honest I've seen different results depending on the hardware.  I'd
try anywhere from 32kb to 2048kb.
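
blockdev takes the value in 512-byte sectors, so e.g. 2048KB of readahead is:

  blockdev --setra 4096 /dev/sdi
  blockdev --getra /dev/sdi    # verify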

 
 One thing I did notice back during bobtail is that increasing the number 
 of osd op threads seemed to help small object read performance.  It 
 might be worth looking at too.
 
 http://ceph.com/community/ceph-bobtail-jbod-performance-tuning/#4kbradosread
 
 Other than that, if you really want to dig into this, you can use tools 
 like iostat, collectl, blktrace, and seekwatcher to try and get a feel 
 for what the IO going to the OSDs looks like.  That can help when 
 diagnosing this sort of thing.
 
 fio -direct=1 -iodepth 1 -thread -rw=randwrite -ioengine=libaio
 -bs=16K -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60
 write: io=361056KB, bw=6001.1KB/s, iops=375 , runt= 60157msec
 
 
 6001.1KB/s * 3 replication * 2 (journal + data writes) * 1024 
 (KB-bytes) / 16384 (write size in bytes) / 15 drives = ~150 IOPS / drive
 
 
 I am mostly surprised by the seq write 

Re: [ceph-users] Poor write/random read/random write performance

2013-08-19 Thread Da Chun Ng
Thank you! Testing now.
How about pg num? I'm using the default size 64, as I tried with (100 * 
osd_num)/replica_size, but it decreased the performance surprisingly.

 Date: Mon, 19 Aug 2013 11:33:30 -0500
 From: mark.nel...@inktank.com
 To: dachun...@outlook.com
 CC: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] Poor write/random read/random write performance
 
 On 08/19/2013 08:59 AM, Da Chun Ng wrote:
  Thanks very much! Mark.
  Yes, I put the data and journal on the same disk, no SSD in my environment.
  My controllers are general SATA II.
 
 Ok, so in this case the lack of WB cache on the controller and no SSDs
 for journals is probably having an effect.
 
  
  Some more questions below in blue.
  
  
  Date: Mon, 19 Aug 2013 07:48:23 -0500
  From: mark.nel...@inktank.com
  To: ceph-users@lists.ceph.com
  Subject: Re: [ceph-users] Poor write/random read/random write performance
  
  On 08/19/2013 06:28 AM, Da Chun Ng wrote:
  
  I have a 3 nodes, 15 osds ceph cluster setup:
  * 15 7200 RPM SATA disks, 5 for each node.
  * 10G network
  * Intel(R) Xeon(R) CPU E5-2620(6 cores) 2.00GHz, for each node.
  * 64G Ram for each node.
  
  I deployed the cluster with ceph-deploy, and created a new data pool
  for cephfs.
  Both the data and metadata pools are set with replica size 3.
  Then mounted the cephfs on one of the three nodes, and tested the
  performance with fio.
  
  The sequential read  performance looks good:
  fio -direct=1 -iodepth 1 -thread -rw=read -ioengine=libaio -bs=16K
  -size=1G -numjobs=16 -group_reporting -name=mytest -runtime 60
  read : io=10630MB, bw=181389KB/s, iops=11336 , runt= 60012msec
  
  
  Sounds like readahead and or caching is helping out a lot here. Btw, you 
  might want to make sure this is actually coming from the disks with 
  iostat or collectl or something.
  
  I ran sync && echo 3 | tee /proc/sys/vm/drop_caches on all the nodes 
  before every test. I used collectl to watch every disk IO, the numbers 
  should match. I think readahead is helping here.
 
 Ok, good!  I suspect that readahead is indeed helping.
 
  
  
  But the sequential write/random read/random write performance is
  very poor:
  fio -direct=1 -iodepth 1 -thread -rw=write -ioengine=libaio -bs=16K
  -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60
  write: io=397280KB, bw=6618.2KB/s, iops=413 , runt= 60029msec
  
  
  One thing to keep in mind is that unless you have SSDs in this system, 
  you will be doing 2 writes for every client write to the spinning disks 
  (since data and journals will both be on the same disk).
  
  So let's do the math:
  
  6618.2KB/s * 3 replication * 2 (journal + data writes) * 1024 
  (KB-bytes) / 16384 (write size in bytes) / 15 drives = ~165 IOPS / drive
  
  If there is no write coalescing going on, this isn't terrible.  If there 
  is, this is terrible.
  
  How can I know if there is write coalescing going on?
 
 look in collectl at the average IO sizes going to the disks.  I bet they
 will be 16KB.  If you were to look further with blktrace and
 seekwatcher, I bet you'd see lots of seeking between OSD data writes and
 journal writes since there is no controller cache helping smooth things
 out (and your journals are on the same drives).
 
  
  Have you tried buffered writes with the sync engine at the same IO size?
  
  Do you mean as below?
  fio -direct=0 -iodepth 1 -thread -rw=write -ioengine=sync -bs=16K 
  -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60
 
 Yeah, that'd work.
 
  
  fio -direct=1 -iodepth 1 -thread -rw=randread -ioengine=libaio
  -bs=16K -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60
  read : io=665664KB, bw=11087KB/s, iops=692 , runt= 60041msec
  
  
  In this case:
  
  11087 * 1024 (KB-bytes) / 16384 / 15 = ~46 IOPS / drive.
  
  Definitely not great!  You might want to try fiddling with read ahead 
  both on the CephFS client and on the block devices under the OSDs 
  themselves.
  
  Could you please tell me how to enable read ahead on the CephFS client?
 
 It's one of the mount options:
 
 http://ceph.com/docs/master/man/8/mount.ceph/
 
  
  For the block devices under the OSDs, the read ahead value is:
  [root@ceph0 ~]# blockdev --getra /dev/sdi
  256
  How big is appropriate for it?
 
 To be honest I've seen different results depending on the hardware.  I'd
 try anywhere from 32kb to 2048kb.
 
  
  One thing I did notice back during bobtail is that increasing the number 
  of osd op threads seemed to help small object read performance.  It 
  might be worth looking at too.
  
  http://ceph.com/community/ceph-bobtail-jbod-performance-tuning/#4kbradosread
  
  Other than that, if you really want to dig into this, you can use tools 
  like iostat, collectl, blktrace, and seekwatcher to try and get a feel 
  for what the IO 

Re: [ceph-users] Poor write/random read/random write performance

2013-08-19 Thread Mark Nelson
On 08/19/2013 12:05 PM, Da Chun Ng wrote:
 Thank you! Testing now.
 
 How about pg num? I'm using the default size 64, as I tried with (100 * 
 osd_num)/replica_size, but it decreased the performance surprisingly.

Oh!  That's odd!  Typically you would want more than that.  Most likely
you aren't distributing PGs very evenly across OSDs with 64.  More PGs
shouldn't decrease performance unless the monitors are behaving badly.
We saw some issues back in early cuttlefish but you should be fine with
many more PGs.
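
For reference, with 15 OSDs and 3x replication the rule of thumb quoted above
works out to (100 * 15) / 3 = 500, which you would normally round up to the next
power of two and apply with something like (pool name is a placeholder):

  ceph osd pool set <pool> pg_num 512
  ceph osd pool set <pool> pgp_num 512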

Mark

 
   Date: Mon, 19 Aug 2013 11:33:30 -0500
   From: mark.nel...@inktank.com
   To: dachun...@outlook.com
   CC: ceph-users@lists.ceph.com
   Subject: Re: [ceph-users] Poor write/random read/random write performance
  
   On 08/19/2013 08:59 AM, Da Chun Ng wrote:
Thanks very much! Mark.
Yes, I put the data and journal on the same disk, no SSD in my 
 environment.
My controllers are general SATA II.
  
   Ok, so in this case the lack of WB cache on the controller and no SSDs
   for journals is probably having an effect.
  
   
Some more questions below in blue.
   

 
Date: Mon, 19 Aug 2013 07:48:23 -0500
From: mark.nel...@inktank.com
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Poor write/random read/random write 
 performance
   
On 08/19/2013 06:28 AM, Da Chun Ng wrote:
   
I have a 3 nodes, 15 osds ceph cluster setup:
* 15 7200 RPM SATA disks, 5 for each node.
* 10G network
* Intel(R) Xeon(R) CPU E5-2620(6 cores) 2.00GHz, for each node.
* 64G Ram for each node.
   
I deployed the cluster with ceph-deploy, and created a new data pool
for cephfs.
Both the data and metadata pools are set with replica size 3.
Then mounted the cephfs on one of the three nodes, and tested the
performance with fio.
   
The sequential read performance looks good:
fio -direct=1 -iodepth 1 -thread -rw=read -ioengine=libaio -bs=16K
-size=1G -numjobs=16 -group_reporting -name=mytest -runtime 60
read : io=10630MB, bw=181389KB/s, iops=11336 , runt= 60012msec
   
   
Sounds like readahead and or caching is helping out a lot here. 
 Btw, you
might want to make sure this is actually coming from the disks with
iostat or collectl or something.
   
 I ran sync && echo 3 | tee /proc/sys/vm/drop_caches on all the nodes
before every test. I used collectl to watch every disk IO, the numbers
should match. I think readahead is helping here.
  
   Ok, good! I suspect that readahead is indeed helping.
  
   
   
But the sequential write/random read/random write performance is
very poor:
fio -direct=1 -iodepth 1 -thread -rw=write -ioengine=libaio -bs=16K
-size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60
write: io=397280KB, bw=6618.2KB/s, iops=413 , runt= 60029msec
   
   
One thing to keep in mind is that unless you have SSDs in this system,
you will be doing 2 writes for every client write to the spinning 
 disks
(since data and journals will both be on the same disk).
   
So let's do the math:
   
6618.2KB/s * 3 replication * 2 (journal + data writes) * 1024
(KB-bytes) / 16384 (write size in bytes) / 15 drives = ~165 IOPS / 
 drive
   
If there is no write coalescing going on, this isn't terrible. If 
 there
is, this is terrible.
   
How can I know if there is write coalescing going on?
  
   look in collectl at the average IO sizes going to the disks. I bet they
   will be 16KB. If you were to look further with blktrace and
   seekwatcher, I bet you'd see lots of seeking between OSD data writes and
   journal writes since there is no controller cache helping smooth things
   out (and your journals are on the same drives).
  
   
Have you tried buffered writes with the sync engine at the same IO 
 size?
   
Do you mean as below?
 fio -direct=0 -iodepth 1 -thread -rw=write -ioengine=sync -bs=16K
-size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60
  
   Yeah, that'd work.
  
   
fio -direct=1 -iodepth 1 -thread -rw=randread -ioengine=libaio
-bs=16K -size=256M -numjobs=16 -group_reporting -name=mytest 
 -runtime 60
read : io=665664KB, bw=11087KB/s, iops=692 , runt= 60041msec
   
   
In this case:
   
11087 * 1024 (KB-bytes) / 16384 / 15 = ~46 IOPS / drive.
   
Definitely not great! You might want to try fiddling with read ahead
both on the CephFS client and on the block devices under the OSDs
themselves.
   
Could you please tell me how to enable read ahead on the CephFS client?
  
   It's one of the mount options:
  
   http://ceph.com/docs/master/man/8/mount.ceph/
  
   
For the block devices under the OSDs, the read ahead value is:
[root@ceph0 ~]# blockdev --getra /dev/sdi
256
How big is appropriate for it?
  
   To be honest I've seen different results depending on the hardware. I'd
   try 

Re: [ceph-users] Ceph Deployments

2013-08-19 Thread John Wilkins
Actually, I wrote the Quick Start guides so that you could do exactly
what you are trying to do, but mostly from a kick the tires
perspective so that people can learn to use Ceph without imposing
$100k worth of hardware as a requirement. See
http://ceph.com/docs/master/start/quick-ceph-deploy/

I even added a section so that you could do it on one disk--e.g., on
your laptop.  
http://ceph.com/docs/master/start/quick-ceph-deploy/#multiple-osds-on-the-os-disk-demo-only

It says demo only, because you won't get great performance out of a
single node. Monitors, OSDs, and Journals writing to disk and fsync
issues would make performance sub-optimal.

For better performance, you should consider a separate drive for each
Ceph OSD Daemon if you can, and potentially a separate SSD drive
partitioned for journals. If you can separate the OS and monitor
drives from the OSD drives, that's better too.

I wrote it as a two-node quick start, because you cannot kernel mount
the Ceph Filesystem or Ceph Block Devices on the same host as the Ceph
Storage Cluster. It's a kernel issue, not a Ceph issue. However, you
can get around this too. If your machine has enough RAM and CPU, you
can also install virtual machines and kernel mount cephfs and block
devices in the virtual machines with no kernel issues. You don't need
to use VMs at all for librbd. So you can install QEMU/KVM, libvirt and
OpenStack all on the same host too.  It's just not an ideal situation
from performance or high availability perspective.



On Mon, Aug 19, 2013 at 3:12 AM, Schmitt, Christian
c.schm...@briefdomain.de wrote:
 2013/8/19 Wolfgang Hennerbichler wolfgang.hennerbich...@risc-software.at:
 On 08/19/2013 12:01 PM, Schmitt, Christian wrote:
 yes. depends on 'everything', but it's possible (though not recommended)
 to run mon, mds, and osd's on the same host, and even do virtualisation.

 Currently we don't want to virtualise on this machine since the
 machine is really small, as said we focus on small to midsize
 businesses. Most of the time they even need a tower server due to the
 lack of a correct rack. ;/

 whoa :)

 Yep that's awful.

 Our Application, Ceph's object storage and a database?

 what is 'a database'?

 We run Postgresql or MariaDB (without/with Galera depending on the cluster 
 size)

 You wouldn't want to put the data of postgres or mariadb on cephfs. I
 would run the native versions directly on the servers and use
 mysql-multi-master circular replication. I don't know about similar
 features of postgres.

 No i don't want to put a MariaDB Cluster on CephFS we want to put PDFs
 in CephFS or Ceph's Object Storage and hold a key or path in the
 database, also other things like user management will belong to the
 database

 shared nothing is possible with ceph, but in the end this really depends
 on your application.

 hm, when disk fails we already doing some backup on a dell powervault
 rd1000, so i don't think thats a problem and also we would run ceph on
 a Dell PERC Raid Controller with RAID1 enabled on the data disk.

 this is open to discussion, and really depends on your use case.

 Yeah we definitely know that it isn't good to use Ceph on a single
 node, but i think it's easier to design the application that it will
 depends on ceph. it wouldn't be easy to manage to have a single node
 without ceph and more than 1 node with ceph.

 Currently we make an archiving software for small customers and we want
 to move things on the file system on a object storage.

 you mean from the filesystem to an object storage?

 yes, currently everything is on the filesystem and this is really
 horrible, thousands of pdfs just on the filesystem. we can't scale up
 that easily with this setup.

 Got it.

 Currently we run on Microsoft Servers, but we plan to rewrite our
 whole codebase with scaling in mind, from 1 to X Servers. So 1, 3, 5,
 7, 9, ... X²-1 should be possible.

 cool.

 Currently we only
 have customers that needs 1 machine or 3 machines. But everything should
 work as fine on more.

 it would with ceph. probably :)

 That's nice to hear. I was really scared that we don't find a solution
 that can run on 1 system and scale up to even more. We first looked at
 HDFS but this isn't lightweight.

 not only that, HDFS also has a single point of failure.

 And the overhead of Metadata etc.
 just isn't that cool.

 :)

 Yeah that's why I came to Ceph. I think that's probably the way we want to go.
 Really thank you for your help. It's good to know that I have a
 solution for the things that are badly designed on our current
 solution.

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
John Wilkins
Senior Technical Writer
Inktank
john.wilk...@inktank.com
(415) 425-9599

Re: [ceph-users] RBD and balanced reads

2013-08-19 Thread Gregory Farnum
On Mon, Aug 19, 2013 at 9:07 AM, Sage Weil s...@inktank.com wrote:
 On Mon, 19 Aug 2013, Sébastien Han wrote:
 Hi guys,

 While reading a developer doc, I came across the following options:

 * osd balance reads = true
 * osd shed reads = true
 * osd shed reads min latency
 * osd shed reads min latency diff

 The problem is that I can't find any of these options in config_opts.h.

 These are left over from an old unimplemented experiment and were removed
 a while back.

 Loic Dachary also gave me a flag that he found from the code.

 m->get_flags() & CEPH_OSD_FLAG_LOCALIZE_READS)

 So my questions are:

 * Which from the above flags are correct?
 * Do balanced reads really exist in RBD?

 For localized reads you want

 OPTION(rbd_balance_snap_reads, OPT_BOOL, false)
 OPTION(rbd_localize_snap_reads, OPT_BOOL, false)

 Note that the 'localize' logic is still very primitive (it matches by IP
 address).  There is a blueprint to improve this:

 
 http://wiki.ceph.com/01Planning/02Blueprints/Emperor/librados%2F%2Fobjecter%3A_smarter_localized_reads

Also, there are some issues with read/write consistency when using
localized reads because the replicas do not provide the ordering
guarantees that primaries will. See
http://tracker.ceph.com/issues/5388
At present localized reads are really only suitable for spreading the
load on write-once, read-many workloads.
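
If you do want to experiment with them anyway, they are plain client-side
settings, e.g. in ceph.conf (a sketch only, with the caveats above):

  [client]
      rbd balance snap reads = true
      # or, to prefer an OSD matching the client's IP:
      rbd localize snap reads = true
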
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Deployments

2013-08-19 Thread Wolfgang Hennerbichler
What you are trying to do will work, because you will not need any kernel 
related code for object storage, so a one node setup will work for you. 

-- 
Sent from my mobile device

On 19.08.2013, at 20:29, Schmitt, Christian c.schm...@briefdomain.de wrote:

 That sounds bad for me.
 As said one of the things we consider is a one node setup, for production.
 Not every Customer will afford hardware worth more than ~4000 Euro.
 Small business users don't need the biggest hardware, but i don't
 think it's a good way to have a version who uses the filesystem and
 one version who use ceph.
 
 We prefer a Object Storage for our Files. It should work like the
 Object Storage of the App Engine.
 That scales from 1 to X Servers.
 
 
 2013/8/19 John Wilkins john.wilk...@inktank.com:
 Actually, I wrote the Quick Start guides so that you could do exactly
 what you are trying to do, but mostly from a kick the tires
 perspective so that people can learn to use Ceph without imposing
 $100k worth of hardware as a requirement. See
 http://ceph.com/docs/master/start/quick-ceph-deploy/
 
 I even added a section so that you could do it on one disk--e.g., on
 your laptop.  
 http://ceph.com/docs/master/start/quick-ceph-deploy/#multiple-osds-on-the-os-disk-demo-only
 
 It says demo only, because you won't get great performance out of a
 single node. Monitors, OSDs, and Journals writing to disk and fsync
 issues would make performance sub-optimal.
 
 For better performance, you should consider a separate drive for each
 Ceph OSD Daemon if you can, and potentially a separate SSD drive
 partitioned for journals. If you can separate the OS and monitor
 drives from the OSD drives, that's better too.
 
 I wrote it as a two-node quick start, because you cannot kernel mount
 the Ceph Filesystem or Ceph Block Devices on the same host as the Ceph
 Storage Cluster. It's a kernel issue, not a Ceph issue. However, you
 can get around this too. If your machine has enough RAM and CPU, you
 can also install virtual machines and kernel mount cephfs and block
 devices in the virtual machines with no kernel issues. You don't need
 to use VMs at all for librbd. So you can install QEMU/KVM, libvirt and
 OpenStack all on the same host too.  It's just not an ideal situation
 from performance or high availability perspective.
 
 
 
 On Mon, Aug 19, 2013 at 3:12 AM, Schmitt, Christian
 c.schm...@briefdomain.de wrote:
 2013/8/19 Wolfgang Hennerbichler wolfgang.hennerbich...@risc-software.at:
 On 08/19/2013 12:01 PM, Schmitt, Christian wrote:
 yes. depends on 'everything', but it's possible (though not recommended)
 to run mon, mds, and osd's on the same host, and even do virtualisation.
 
 Currently we don't want to virtualise on this machine since the
 machine is really small, as said we focus on small to midsize
 businesses. Most of the time they even need a tower server due to the
 lack of a correct rack. ;/
 
 whoa :)
 
 Yep that's awful.
 
 Our Application, Ceph's object storage and a database?
 
 what is 'a database'?
 
 We run Postgresql or MariaDB (without/with Galera depending on the 
 cluster size)
 
 You wouldn't want to put the data of postgres or mariadb on cephfs. I
 would run the native versions directly on the servers and use
 mysql-multi-master circular replication. I don't know about similar
 features of postgres.
 
 No i don't want to put a MariaDB Cluster on CephFS we want to put PDFs
 in CephFS or Ceph's Object Storage and hold a key or path in the
 database, also other things like user management will belong to the
 database
 
 shared nothing is possible with ceph, but in the end this really depends
 on your application.
 
 hm, when disk fails we already doing some backup on a dell powervault
 rd1000, so i don't think thats a problem and also we would run ceph on
 a Dell PERC Raid Controller with RAID1 enabled on the data disk.
 
 this is open to discussion, and really depends on your use case.
 
 Yeah we definitely know that it isn't good to use Ceph on a single
 node, but i think it's easier to design the application that it will
 depends on ceph. it wouldn't be easy to manage to have a single node
 without ceph and more than 1 node with ceph.
 
 Currently we make an archiving software for small customers and we want
 to move things on the file system on a object storage.
 
 you mean from the filesystem to an object storage?
 
 yes, currently everything is on the filesystem and this is really
 horrible, thousands of pdfs just on the filesystem. we can't scale up
 that easily with this setup.
 
 Got it.
 
 Currently we run on Microsoft Servers, but we plan to rewrite our
 whole codebase with scaling in mind, from 1 to X Servers. So 1, 3, 5,
 7, 9, ... X²-1 should be possible.
 
 cool.
 
 Currently we only
 have customers that needs 1 machine or 3 machines. But everything should
 work as fine on more.
 
 it would with ceph. probably :)
 
 That's nice to hear. I was really scared that we don't find a solution
 that can run on 1 

Re: [ceph-users] Ceph Deployments

2013-08-19 Thread John Wilkins
Wolfgang is correct. You do not need VMs at all if you are setting up
Ceph Object Storage. It's just Apache, FastCGI, and the radosgw daemon
interacting with the Ceph Storage Cluster. You can do that on one box
no problem. It's still better to have more drives for performance
though.
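
As a very rough sketch, the gateway part of such a box boils down to a client
section like this in ceph.conf plus the Apache/FastCGI vhost from the radosgw
install docs (the gateway name and paths below are only examples):

  [client.radosgw.gateway]
      host = gateway-host
      keyring = /etc/ceph/keyring.radosgw.gateway
      rgw socket path = /var/run/ceph/ceph.radosgw.gateway.fastcgi.sock
      log file = /var/log/ceph/radosgw.log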

On Mon, Aug 19, 2013 at 12:08 PM, Wolfgang Hennerbichler
wolfgang.hennerbich...@risc-software.at wrote:
 What you are trying to do will work, because you will not need any kernel 
 related code for object storage, so a one node setup will work for you.

 --
 Sent from my mobile device

 On 19.08.2013, at 20:29, Schmitt, Christian c.schm...@briefdomain.de 
 wrote:

 That sounds bad for me.
 As said one of the things we consider is a one node setup, for production.
 Not every Customer will afford hardware worth more than ~4000 Euro.
 Small business users don't need the biggest hardware, but i don't
 think it's a good way to have a version who uses the filesystem and
 one version who use ceph.

 We prefer a Object Storage for our Files. It should work like the
 Object Storage of the App Engine.
 That scales from 1 to X Servers.


 2013/8/19 John Wilkins john.wilk...@inktank.com:
 Actually, I wrote the Quick Start guides so that you could do exactly
 what you are trying to do, but mostly from a kick the tires
 perspective so that people can learn to use Ceph without imposing
 $100k worth of hardware as a requirement. See
 http://ceph.com/docs/master/start/quick-ceph-deploy/

 I even added a section so that you could do it on one disk--e.g., on
 your laptop.  
 http://ceph.com/docs/master/start/quick-ceph-deploy/#multiple-osds-on-the-os-disk-demo-only

 It says demo only, because you won't get great performance out of a
 single node. Monitors, OSDs, and Journals writing to disk and fsync
 issues would make performance sub-optimal.

 For better performance, you should consider a separate drive for each
 Ceph OSD Daemon if you can, and potentially a separate SSD drive
 partitioned for journals. If you can separate the OS and monitor
 drives from the OSD drives, that's better too.

 I wrote it as a two-node quick start, because you cannot kernel mount
 the Ceph Filesystem or Ceph Block Devices on the same host as the Ceph
 Storage Cluster. It's a kernel issue, not a Ceph issue. However, you
 can get around this too. If your machine has enough RAM and CPU, you
 can also install virtual machines and kernel mount cephfs and block
 devices in the virtual machines with no kernel issues. You don't need
 to use VMs at all for librbd. So you can install QEMU/KVM, libvirt and
 OpenStack all on the same host too.  It's just not an ideal situation
 from performance or high availability perspective.



 On Mon, Aug 19, 2013 at 3:12 AM, Schmitt, Christian
 c.schm...@briefdomain.de wrote:
 2013/8/19 Wolfgang Hennerbichler wolfgang.hennerbich...@risc-software.at:
 On 08/19/2013 12:01 PM, Schmitt, Christian wrote:
 yes. depends on 'everything', but it's possible (though not recommended)
 to run mon, mds, and osd's on the same host, and even do virtualisation.

 Currently we don't want to virtualise on this machine since the
 machine is really small, as said we focus on small to midsize
 businesses. Most of the time they even need a tower server due to the
 lack of a correct rack. ;/

 whoa :)

 Yep that's awful.

 Our Application, Ceph's object storage and a database?

 what is 'a database'?

 We run Postgresql or MariaDB (without/with Galera depending on the 
 cluster size)

 You wouldn't want to put the data of postgres or mariadb on cephfs. I
 would run the native versions directly on the servers and use
 mysql-multi-master circular replication. I don't know about similar
 features of postgres.

 No i don't want to put a MariaDB Cluster on CephFS we want to put PDFs
 in CephFS or Ceph's Object Storage and hold a key or path in the
 database, also other things like user management will belong to the
 database

 shared nothing is possible with ceph, but in the end this really depends
 on your application.

 hm, when disk fails we already doing some backup on a dell powervault
 rd1000, so i don't think thats a problem and also we would run ceph on
 a Dell PERC Raid Controller with RAID1 enabled on the data disk.

 this is open to discussion, and really depends on your use case.

 Yeah we definitely know that it isn't good to use Ceph on a single
 node, but i think it's easier to design the application that it will
 depends on ceph. it wouldn't be easy to manage to have a single node
 without ceph and more than 1 node with ceph.

 Currently we make an archiving software for small customers and we want
 to move things on the file system on a object storage.

 you mean from the filesystem to an object storage?

 yes, currently everything is on the filesystem and this is really
 horrible, thousands of pdfs just on the filesystem. we can't scale up
 that easily with this setup.

 Got it.

 Currently we run on Microsoft Servers, but we plan to 

Re: [ceph-users] Significant slowdown of osds since v0.67 Dumpling

2013-08-19 Thread Oliver Daudey
Hey Samuel,

Thanks!  I installed your version, repeated the same tests on my
test-cluster and the extra CPU-loading seems to have disappeared.  Then
I replaced one osd of my production-cluster with your modified version
and its config option and it seems to be a lot less CPU-hungry now.
Although the Cuttlefish-osds still seem to be even more CPU-efficient,
your changes have definitely helped a lot.  We seem to be looking in the
right direction, at least for this part of the problem.
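
For anyone following along, the config option in question went into the [osd]
section of ceph.conf on the node running the modified osd, roughly:

  [osd]
      osd debug pg log writeout = false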

BTW, I ran `perf top' on the production-node with your modified osd and
didn't see anything osd-related stand out on top.  PGLog::undirty()
was in there, but with much lower usage, right at the bottom of the
green part of the output.

Many thanks for your help so far!


   Regards,

 Oliver

On ma, 2013-08-19 at 00:29 -0700, Samuel Just wrote:
 You're right, PGLog::undirty() looks suspicious.  I just pushed a
 branch wip-dumpling-pglog-undirty with a new config
 (osd_debug_pg_log_writeout) which if set to false will disable some
 strictly debugging checks which occur in PGLog::undirty().  We haven't
 actually seen these checks causing excessive cpu usage, so this may be
 a red herring.
 -Sam
 
 On Sat, Aug 17, 2013 at 2:48 PM, Oliver Daudey oli...@xs4all.nl wrote:
  Hey Mark,
 
  On za, 2013-08-17 at 08:16 -0500, Mark Nelson wrote:
  On 08/17/2013 06:13 AM, Oliver Daudey wrote:
   Hey all,
  
   This is a copy of Bug #6040 (http://tracker.ceph.com/issues/6040) I
   created in the tracker.  Thought I would pass it through the list as
   well, to get an idea if anyone else is running into it.  It may only
   show under higher loads.  More info about my setup is in the bug-report
   above.  Here goes:
  
  
   I'm running a Ceph-cluster with 3 nodes, each of which runs a mon, osd
   and mds. I'm using RBD on this cluster as storage for KVM, CephFS is
   unused at this time. While still on v0.61.7 Cuttlefish, I got 70-100
   +MB/sec on simple linear writes to a file with `dd' inside a VM on this
   cluster under regular load and the osds usually averaged 20-100%
   CPU-utilisation in `top'. After the upgrade to Dumpling, CPU-usage for
   the osds shot up to 100% to 400% in `top' (multi-core system) and the
   speed for my writes with `dd' inside a VM dropped to 20-40MB/sec. Users
   complained that disk-access inside the VMs was significantly slower and
   the backups of the RBD-store I was running, also got behind quickly.
  
   After downgrading only the osds to v0.61.7 Cuttlefish and leaving the
   rest at 0.67 Dumpling, speed and load returned to normal. I have
   repeated this performance-hit upon upgrade on a similar test-cluster
   under no additional load at all. Although CPU-usage for the osds wasn't
   as dramatic during these tests because there was no base-load from other
   VMs, I/O-performance dropped significantly after upgrading during these
   tests as well, and returned to normal after downgrading the osds.
  
   I'm not sure what to make of it. There are no visible errors in the logs
   and everything runs and reports good health, it's just a lot slower,
   with a lot more CPU-usage.
 
  Hi Oliver,
 
  If you have access to the perf command on this system, could you try
  running:
 
  sudo perf top
 
  And if that doesn't give you much,
 
  sudo perf record -g
 
  then:
 
  sudo perf report | less
 
  during the period of high CPU usage?  This will give you a call graph.
  There may be symbols missing, but it might help track down what the OSDs
  are doing.
 
  Thanks for your help!  I did a couple of runs on my test-cluster,
  loading it with writes from 3 VMs concurrently and measuring the results
  at the first node with all 0.67 Dumpling-components and with the osds
  replaced by 0.61.7 Cuttlefish.  I let `perf top' run and settle for a
  while, then copied anything that showed in red and green into this post.
  Here are the results (sorry for the word-wraps):
 
  First, with 0.61.7 osds:
 
   19.91%  [kernel][k] intel_idle
   10.18%  [kernel][k] _raw_spin_lock_irqsave
6.79%  ceph-osd[.] ceph_crc32c_le
4.93%  [kernel][k]
  default_send_IPI_mask_sequence_phys
2.71%  [kernel][k] copy_user_generic_string
1.42%  libc-2.11.3.so  [.] memcpy
1.23%  [kernel][k] find_busiest_group
1.13%  librados.so.2.0.0   [.] ceph_crc32c_le_intel
1.11%  [kernel][k] _raw_spin_lock
0.99%  kvm [.] 0x1931f8
0.92%  [igb]   [k] igb_poll
0.87%  [kernel][k] native_write_cr0
0.80%  [kernel][k] csum_partial
0.78%  [kernel][k] __do_softirq
0.63%  [kernel][k] hpet_legacy_next_event
0.53%  [ip_tables] [k] ipt_do_table
0.50%  libc-2.11.3.so  [.] 0x74433
 
  Second test, 

Re: [ceph-users] Significant slowdown of osds since v0.67 Dumpling

2013-08-19 Thread Mark Nelson

Hi Oliver,

Glad that helped!  How much more efficient do the cuttlefish OSDs seem 
at this point (with wip-dumpling-pglog-undirty)?  On modern Intel 
platforms we were actually hoping to see CPU usage go down in many cases 
due to the use of hardware CRC32 instructions.
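
As an aside, a quick way to check whether a node's CPU advertises the SSE4.2
CRC32 instruction that the accelerated crc32c path relies on:

  grep -m1 -o sse4_2 /proc/cpuinfo || echo "no hardware CRC32 support"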


Mark

On 08/19/2013 03:06 PM, Oliver Daudey wrote:

Hey Samuel,

Thanks!  I installed your version, repeated the same tests on my
test-cluster and the extra CPU-loading seems to have disappeared.  Then
I replaced one osd of my production-cluster with your modified version
and its config option and it seems to be a lot less CPU-hungry now.
Although the Cuttlefish-osds still seem to be even more CPU-efficient,
your changes have definitely helped a lot.  We seem to be looking in the
right direction, at least for this part of the problem.

BTW, I ran `perf top' on the production-node with your modified osd and
didn't see anything osd-related stand out on top.  PGLog::undirty()
was in there, but with much lower usage, right at the bottom of the
green part of the output.

Many thanks for your help so far!


Regards,

  Oliver

On ma, 2013-08-19 at 00:29 -0700, Samuel Just wrote:

You're right, PGLog::undirty() looks suspicious.  I just pushed a
branch wip-dumpling-pglog-undirty with a new config
(osd_debug_pg_log_writeout) which if set to false will disable some
strictly debugging checks which occur in PGLog::undirty().  We haven't
actually seen these checks causing excessive cpu usage, so this may be
a red herring.
-Sam

On Sat, Aug 17, 2013 at 2:48 PM, Oliver Daudey oli...@xs4all.nl wrote:

Hey Mark,

On za, 2013-08-17 at 08:16 -0500, Mark Nelson wrote:

On 08/17/2013 06:13 AM, Oliver Daudey wrote:

Hey all,

This is a copy of Bug #6040 (http://tracker.ceph.com/issues/6040) I
created in the tracker.  Thought I would pass it through the list as
well, to get an idea if anyone else is running into it.  It may only
show under higher loads.  More info about my setup is in the bug-report
above.  Here goes:


I'm running a Ceph-cluster with 3 nodes, each of which runs a mon, osd
and mds. I'm using RBD on this cluster as storage for KVM, CephFS is
unused at this time. While still on v0.61.7 Cuttlefish, I got 70-100
+MB/sec on simple linear writes to a file with `dd' inside a VM on this
cluster under regular load and the osds usually averaged 20-100%
CPU-utilisation in `top'. After the upgrade to Dumpling, CPU-usage for
the osds shot up to 100% to 400% in `top' (multi-core system) and the
speed for my writes with `dd' inside a VM dropped to 20-40MB/sec. Users
complained that disk-access inside the VMs was significantly slower and
the backups of the RBD-store I was running, also got behind quickly.

After downgrading only the osds to v0.61.7 Cuttlefish and leaving the
rest at 0.67 Dumpling, speed and load returned to normal. I have
repeated this performance-hit upon upgrade on a similar test-cluster
under no additional load at all. Although CPU-usage for the osds wasn't
as dramatic during these tests because there was no base-load from other
VMs, I/O-performance dropped significantly after upgrading during these
tests as well, and returned to normal after downgrading the osds.

I'm not sure what to make of it. There are no visible errors in the logs
and everything runs and reports good health, it's just a lot slower,
with a lot more CPU-usage.


Hi Oliver,

If you have access to the perf command on this system, could you try
running:

sudo perf top

And if that doesn't give you much,

sudo perf record -g

then:

sudo perf report | less

during the period of high CPU usage?  This will give you a call graph.
There may be symbols missing, but it might help track down what the OSDs
are doing.


Thanks for your help!  I did a couple of runs on my test-cluster,
loading it with writes from 3 VMs concurrently and measuring the results
at the first node with all 0.67 Dumpling-components and with the osds
replaced by 0.61.7 Cuttlefish.  I let `perf top' run and settle for a
while, then copied anything that showed in red and green into this post.
Here are the results (sorry for the word-wraps):

First, with 0.61.7 osds:

  19.91%  [kernel][k] intel_idle
  10.18%  [kernel][k] _raw_spin_lock_irqsave
   6.79%  ceph-osd[.] ceph_crc32c_le
   4.93%  [kernel][k]
default_send_IPI_mask_sequence_phys
   2.71%  [kernel][k] copy_user_generic_string
   1.42%  libc-2.11.3.so  [.] memcpy
   1.23%  [kernel][k] find_busiest_group
   1.13%  librados.so.2.0.0   [.] ceph_crc32c_le_intel
   1.11%  [kernel][k] _raw_spin_lock
   0.99%  kvm [.] 0x1931f8
   0.92%  [igb]   [k] igb_poll
   0.87%  [kernel][k] native_write_cr0
   0.80%  [kernel][k] csum_partial
   0.78%  [kernel][k] __do_softirq
   0.63%  [kernel]

Re: [ceph-users] osd/OSD.cc: 4844: FAILED assert(_get_map_bl(epoch, bl)) (ceph 0.61.7)

2013-08-19 Thread Olivier Bonvalet
On Monday, 19 August 2013 at 12:27 +0200, Olivier Bonvalet wrote:
 Hi,
 
 I have an OSD which crash every time I try to start it (see logs below).
 Is it a known problem ? And is there a way to fix it ?
 
 root! taman:/var/log/ceph# grep -v ' pipe' osd.65.log
 2013-08-19 11:07:48.478558 7f6fe367a780  0 ceph version 0.61.7 
 (8f010aff684e820ecc837c25ac77c7a05d7191ff), process ceph-osd, pid 19327
 2013-08-19 11:07:48.516363 7f6fe367a780  0 
 filestore(/var/lib/ceph/osd/ceph-65) mount FIEMAP ioctl is supported and 
 appears to work
 2013-08-19 11:07:48.516380 7f6fe367a780  0 
 filestore(/var/lib/ceph/osd/ceph-65) mount FIEMAP ioctl is disabled via 
 'filestore fiemap' config option
 2013-08-19 11:07:48.516514 7f6fe367a780  0 
 filestore(/var/lib/ceph/osd/ceph-65) mount did NOT detect btrfs
 2013-08-19 11:07:48.517087 7f6fe367a780  0 
 filestore(/var/lib/ceph/osd/ceph-65) mount syscall(SYS_syncfs, fd) fully 
 supported
 2013-08-19 11:07:48.517389 7f6fe367a780  0 
 filestore(/var/lib/ceph/osd/ceph-65) mount found snaps 
 2013-08-19 11:07:49.199483 7f6fe367a780  0 
 filestore(/var/lib/ceph/osd/ceph-65) mount: enabling WRITEAHEAD journal mode: 
 btrfs not detected
 2013-08-19 11:07:52.191336 7f6fe367a780  1 journal _open /dev/sdk4 fd 18: 
 53687091200 bytes, block size 4096 bytes, directio = 1, aio = 1
 2013-08-19 11:07:52.196020 7f6fe367a780  1 journal _open /dev/sdk4 fd 18: 
 53687091200 bytes, block size 4096 bytes, directio = 1, aio = 1
 2013-08-19 11:07:52.196920 7f6fe367a780  1 journal close /dev/sdk4
 2013-08-19 11:07:52.199908 7f6fe367a780  0 
 filestore(/var/lib/ceph/osd/ceph-65) mount FIEMAP ioctl is supported and 
 appears to work
 2013-08-19 11:07:52.199916 7f6fe367a780  0 
 filestore(/var/lib/ceph/osd/ceph-65) mount FIEMAP ioctl is disabled via 
 'filestore fiemap' config option
 2013-08-19 11:07:52.200058 7f6fe367a780  0 
 filestore(/var/lib/ceph/osd/ceph-65) mount did NOT detect btrfs
 2013-08-19 11:07:52.200886 7f6fe367a780  0 
 filestore(/var/lib/ceph/osd/ceph-65) mount syscall(SYS_syncfs, fd) fully 
 supported
 2013-08-19 11:07:52.200919 7f6fe367a780  0 
 filestore(/var/lib/ceph/osd/ceph-65) mount found snaps 
 2013-08-19 11:07:52.215850 7f6fe367a780  0 
 filestore(/var/lib/ceph/osd/ceph-65) mount: enabling WRITEAHEAD journal mode: 
 btrfs not detected
 2013-08-19 11:07:52.219819 7f6fe367a780  1 journal _open /dev/sdk4 fd 26: 
 53687091200 bytes, block size 4096 bytes, directio = 1, aio = 1
 2013-08-19 11:07:52.227420 7f6fe367a780  1 journal _open /dev/sdk4 fd 26: 
 53687091200 bytes, block size 4096 bytes, directio = 1, aio = 1
 2013-08-19 11:07:52.500342 7f6fe367a780  0 osd.65 144201 crush map has 
 features 262144, adjusting msgr requires for clients
 2013-08-19 11:07:52.500353 7f6fe367a780  0 osd.65 144201 crush map has 
 features 262144, adjusting msgr requires for osds
 2013-08-19 11:08:13.581709 7f6fbdcb5700 -1 osd/OSD.cc: In function 'OSDMapRef 
 OSDService::get_map(epoch_t)' thread 7f6fbdcb5700 time 2013-08-19 
 11:08:13.579519
 osd/OSD.cc: 4844: FAILED assert(_get_map_bl(epoch, bl))
 
  ceph version 0.61.7 (8f010aff684e820ecc837c25ac77c7a05d7191ff)
  1: (OSDService::get_map(unsigned int)+0x44b) [0x6f5b9b]
  2: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&,
 PG::RecoveryCtx*, std::set<boost::intrusive_ptr<PG>,
 std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> >
  >*)+0x3c8) [0x6f8f48]
  3: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&,
 ThreadPool::TPHandle&)+0x31f) [0x6f975f]
  4: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&,
 ThreadPool::TPHandle&)+0x14) [0x7391d4]
  5: (ThreadPool::worker(ThreadPool::WorkThread*)+0x68a) [0x8f8e3a]
  6: (ThreadPool::WorkThread::entry()+0x10) [0x8fa0e0]
  7: (()+0x6b50) [0x7f6fe3070b50]
  8: (clone()+0x6d) [0x7f6fe15cba7d]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
 interpret this.
 
 full logs here : http://pastebin.com/RphNyLU0
 
 

Hi,

still same problem with Ceph 0.61.8 :

2013-08-19 23:01:54.369609 7fdd667a4780  0 osd.65 144279 crush map has features 
262144, adjusting msgr requires for osds
2013-08-19 23:01:58.315115 7fdd405de700 -1 osd/OSD.cc: In function 'OSDMapRef 
OSDService::get_map(epoch_t)' thread 7fdd405de700 time 2013-08-19 
23:01:58.313955
osd/OSD.cc: 4847: FAILED assert(_get_map_bl(epoch, bl))

 ceph version 0.61.8 (a6fdcca3bddbc9f177e4e2bf0d9cdd85006b028b)
 1: (OSDService::get_map(unsigned int)+0x44b) [0x6f736b]
 2: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&,
PG::RecoveryCtx*, std::set<boost::intrusive_ptr<PG>,
std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> >
>*)+0x3c8) [0x6fa708]
 3: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&,
ThreadPool::TPHandle&)+0x31f) [0x6faf1f]
 4: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&,
ThreadPool::TPHandle&)+0x14) [0x73a9b4]
 5: (ThreadPool::worker(ThreadPool::WorkThread*)+0x68a) [0x8fb69a]
 6: 

Re: [ceph-users] Significant slowdown of osds since v0.67 Dumpling

2013-08-19 Thread Oliver Daudey
Hey Mark,

If I look at the wip-dumpling-pglog-undirty-version with regular top,
I see a slightly higher base-load on the osd, with significantly more
and higher spikes in it than the Dumpling-osds.  Looking with `perf
top', PGLog::undirty() is still there, although pulling significantly
less CPU.  With the Cuttlefish-osds, I don't see it at all, even under
load.  That may account for the extra load I'm still seeing, but I don't
know what is still going on in it and if that too can be safely disabled
to save some more CPU.

All in all, it's quite close and seems a bit difficult to measure.  I'd
say the CPU-usage with wip-dumpling-pglog-undirty is still a good 30%
higher than Cuttlefish on my production-cluster.  I have yet to upgrade
all osds and compare performance of the cluster as a whole.  Is the
wip-dumpling-pglog-undirty-version considered safe enough to do so?
If you have any tips for other safe benchmarks, I'll try those as well.
Thanks!


   Regards,

  Oliver

On ma, 2013-08-19 at 15:21 -0500, Mark Nelson wrote:
 Hi Oliver,
 
 Glad that helped!  How much more efficient do the cuttlefish OSDs seem 
 at this point (with wip-dumpling-pglog-undirty)?  On modern Intel 
 platforms we were actually hoping to see CPU usage go down in many cases 
 due to the use of hardware CRC32 instructions.
 
 Mark
 
 On 08/19/2013 03:06 PM, Oliver Daudey wrote:
  Hey Samuel,
 
  Thanks!  I installed your version, repeated the same tests on my
  test-cluster and the extra CPU-loading seems to have disappeared.  Then
  I replaced one osd of my production-cluster with your modified version
  and its config option and it seems to be a lot less CPU-hungry now.
  Although the Cuttlefish-osds still seem to be even more CPU-efficient,
  your changes have definitely helped a lot.  We seem to be looking in the
  right direction, at least for this part of the problem.
 
  BTW, I ran `perf top' on the production-node with your modified osd and
  didn't see anything osd-related stand out on top.  PGLog::undirty()
  was in there, but with much lower usage, right at the bottom of the
  green part of the output.
 
  Many thanks for your help so far!
 
 
  Regards,
 
Oliver
 
  On ma, 2013-08-19 at 00:29 -0700, Samuel Just wrote:
  You're right, PGLog::undirty() looks suspicious.  I just pushed a
  branch wip-dumpling-pglog-undirty with a new config
  (osd_debug_pg_log_writeout) which if set to false will disable some
  strictly debugging checks which occur in PGLog::undirty().  We haven't
  actually seen these checks causing excessive cpu usage, so this may be
  a red herring.
  -Sam
 
  On Sat, Aug 17, 2013 at 2:48 PM, Oliver Daudey oli...@xs4all.nl wrote:
  Hey Mark,
 
  On za, 2013-08-17 at 08:16 -0500, Mark Nelson wrote:
  On 08/17/2013 06:13 AM, Oliver Daudey wrote:
  Hey all,
 
  This is a copy of Bug #6040 (http://tracker.ceph.com/issues/6040) I
  created in the tracker.  Thought I would pass it through the list as
  well, to get an idea if anyone else is running into it.  It may only
  show under higher loads.  More info about my setup is in the bug-report
  above.  Here goes:
 
 
  I'm running a Ceph-cluster with 3 nodes, each of which runs a mon, osd
  and mds. I'm using RBD on this cluster as storage for KVM, CephFS is
  unused at this time. While still on v0.61.7 Cuttlefish, I got 70-100
  +MB/sec on simple linear writes to a file with `dd' inside a VM on this
  cluster under regular load and the osds usually averaged 20-100%
  CPU-utilisation in `top'. After the upgrade to Dumpling, CPU-usage for
  the osds shot up to 100% to 400% in `top' (multi-core system) and the
  speed for my writes with `dd' inside a VM dropped to 20-40MB/sec. Users
  complained that disk-access inside the VMs was significantly slower and
  the backups of the RBD-store I was running, also got behind quickly.
 
  After downgrading only the osds to v0.61.7 Cuttlefish and leaving the
  rest at 0.67 Dumpling, speed and load returned to normal. I have
  repeated this performance-hit upon upgrade on a similar test-cluster
  under no additional load at all. Although CPU-usage for the osds wasn't
  as dramatic during these tests because there was no base-load from other
  VMs, I/O-performance dropped significantly after upgrading during these
  tests as well, and returned to normal after downgrading the osds.
 
  I'm not sure what to make of it. There are no visible errors in the logs
  and everything runs and reports good health, it's just a lot slower,
  with a lot more CPU-usage.
 
  Hi Oliver,
 
  If you have access to the perf command on this system, could you try
  running:
 
  sudo perf top
 
  And if that doesn't give you much,
 
  sudo perf record -g
 
  then:
 
  sudo perf report | less
 
  during the period of high CPU usage?  This will give you a call graph.
  There may be symbols missing, but it might help track down what the OSDs
  are doing.
 
  Thanks for your help!  I did a couple of runs on my test-cluster,
  loading it 

Re: [ceph-users] large memory leak on scrubbing

2013-08-19 Thread Mostowiec Dominik
Hi,
 Is that the only slow request message you see?
No.
Full log: https://www.dropbox.com/s/i3ep5dcimndwvj1/slow_requests.txt.tar.gz 
It start from:
2013-08-16 09:43:39.662878 mon.0 10.174.81.132:6788/0 4276384 : [DBG] osd.4 
10.174.81.131:6805/31460 reported failed by osd.50 10.174.81.135:6842/26019
2013-08-16 09:43:40.711911 mon.0 10.174.81.132:6788/0 4276386 : [DBG] osd.4 
10.174.81.131:6805/31460 reported failed by osd.14 10.174.81.132:6836/2958
2013-08-16 09:43:41.043016 mon.0 10.174.81.132:6788/0 4276388 : [DBG] osd.4 
10.174.81.131:6805/31460 reported failed by osd.13 10.174.81.132:6830/2482
2013-08-16 09:43:41.043047 mon.0 10.174.81.132:6788/0 4276389 : [INF] osd.4 
10.174.81.131:6805/31460 failed (3 reports from 3 peers after 2013-08-16 
09:43:56.042983 >= grace 20.00)
2013-08-16 09:43:41.122326 mon.0 10.174.81.132:6788/0 4276390 : [INF] osdmap 
e10294: 144 osds: 143 up, 143 in
2013-08-16 09:43:38.798833 osd.4 10.174.81.131:6805/31460 913 : [WRN] 6 slow 
requests, 6 included below; oldest blocked for > 30.190146 secs
2013-08-16 09:43:38.798843 osd.4 10.174.81.131:6805/31460 914 : [WRN] slow 
request 30.190146 seconds old, received at 2013-08-16 09:43:08.585504: 
osd_op(client.22301645.0:48987 .dir.1585245.1 [call rgw.bucket_complete_op] 
16.33d5ea80) v4 currently waiting for subops from [25,133]
2013-08-16 09:43:38.798854 osd.4 10.174.81.131:6805/31460 915 : [WRN] slow 
request 30.189643 seconds old, received at 2013-08-16 09:43:08.586007: 
osd_op(client.22301855.0:49374 .dir.1585245.1 [call rgw.bucket_complete_op] 
16.33d5ea80) v4 currently waiting for subops from [25,133]
2013-08-16 09:43:38.798859 osd.4 10.174.81.131:6805/31460 916 : [WRN] slow 
request 30.188236 seconds old, received at 2013-08-16 09:43:08.587414: 
osd_op(client.22307596.0:47674 .dir.1585245.1 [call rgw.bucket_complete_op] 
16.33d5ea80) v4 currently waiting for subops from [25,133]
2013-08-16 09:43:38.798862 osd.4 10.174.81.131:6805/31460 917 : [WRN] slow 
request 30.187853 seconds old, received at 2013-08-16 09:43:08.587797: 
osd_op(client.22303894.0:51846 .dir.1585245.1 [call rgw.bucket_complete_op] 
16.33d5ea80) v4 currently waiting for subops from [25,133]
...
2013-08-16 09:44:18.126318 mon.0 10.174.81.132:6788/0 4276427 : [INF] osd.4 
10.174.81.131:6805/31460 boot
...
2013-08-16 09:44:23.215918 mon.0 10.174.81.132:6788/0 4276437 : [DBG] osd.25 
10.174.81.133:6810/2961 reported failed by osd.83 10.174.81.137:6837/27963
2013-08-16 09:44:23.704769 mon.0 10.174.81.132:6788/0 4276438 : [INF] pgmap 
v17035051: 32424 pgs: 1 stale+active+clean+scrubbing+deep, 2 active, 31965 
active+clean, 7 stale+active+clean, 29 peering, 415 active+degraded, 5 
active+clean+scrubbing; 6630 GB data, 21420 GB used, 371 TB / 392 TB avail; 
246065/61089697 degraded (0.403%)
2013-08-16 09:44:23.711244 mon.0 10.174.81.132:6788/0 4276439 : [DBG] osd.133 
10.174.81.142:6803/21366 reported failed by osd.26 10.174.81.133:6814/3674
2013-08-16 09:44:23.713597 mon.0 10.174.81.132:6788/0 4276440 : [DBG] osd.133 
10.174.81.142:6803/21366 reported failed by osd.17 10.174.81.132:6806/9188
2013-08-16 09:44:23.753952 mon.0 10.174.81.132:6788/0 4276441 : [DBG] osd.133 
10.174.81.142:6803/21366 reported failed by osd.27 10.174.81.133:6822/5389
2013-08-16 09:44:23.753982 mon.0 10.174.81.132:6788/0 4276442 : [INF] osd.133 
10.174.81.142:6803/21366 failed (3 reports from 3 peers after 2013-08-16 
09:44:38.753913 >= grace 20.00)


2013-08-16 09:47:10.229099 mon.0 10.174.81.132:6788/0 4276646 : [INF] pgmap 
v17035216: 32424 pgs: 32424 active+clean; 6630 GB data, 21420 GB used, 371 TB / 
392 TB avail; 0B/s rd, 622KB/s wr, 85op/s

Why are the osds 'reported failed' during scrubbing?

--
Regards 
Dominik 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Flapping osd / continuously reported as failed

2013-08-19 Thread Mostowiec Dominik
Hi,
 Yes, it definitely can as scrubbing takes locks on the PG, which will prevent 
 reads or writes while the message is being processed (which will involve the 
 rgw index being scanned).
Is it possible to tune the scrubbing config to eliminate the slow requests and 
osds being marked down while a large rgw bucket index is scrubbing?

--
Regards
Dominik

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Flapping osd / continuously reported as failed

2013-08-19 Thread Gregory Farnum
On Mon, Aug 19, 2013 at 3:09 PM, Mostowiec Dominik
dominik.mostow...@grupaonet.pl wrote:
 Hi,
 Yes, it definitely can as scrubbing takes locks on the PG, which will 
 prevent reads or writes while the message is being processed (which will 
 involve the rgw index being scanned).
 Is it possible to tune the scrubbing config to eliminate the slow requests and 
 osds being marked down while a large rgw bucket index is scrubbing?

Unfortunately not, or we would have mentioned it before. :/ There are
some proposals for sharding bucket indexes that would ameliorate this
problem, and on Cuttlefish or Dumpling the OSD won't get marked down,
but it will still block incoming requests on that object (ie, requests
to access the bucket) while the scrubbing is in place.
That said, that improvement might be sufficient since you haven't
actually shown us how long the object scrub takes.
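
One rough way to put a number on it (a sketch -- the index pool name below is a
placeholder for whichever pool actually holds your bucket index, and the exact
log wording varies a bit between versions):

    # find the PG that holds the bucket index object
    ceph osd map <index-pool> .dir.1585245.1
    # trigger a (deep) scrub of that PG and watch the cluster log
    ceph pg deep-scrub <pgid>
    ceph -w | grep scrub    # time between 'scrub starts' and 'scrub ok'

The gap between those two log messages is roughly how long requests to that
bucket can be blocked.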
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.61.8 Cuttlefish released

2013-08-19 Thread Sage Weil
On Mon, 19 Aug 2013, James Harper wrote:
  
  We've made another point release for Cuttlefish. This release contains a
  number of fixes that are generally not individually critical, but do trip
  up users from time to time, are non-intrusive, and have held up under
  testing.
  
  Notable changes include:
  
   * librados: fix async aio completion wakeup
   * librados: fix aio completion locking
   * librados: fix rare deadlock during shutdown
 
 Could any of these be causing the segfaults I'm seeing in tapdisk rbd? 
 Are these fixes in dumpling?

They are also in the dumpling branch and 0.67.1.  They might explain it... 
not a slam dunk though.

sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Usage pattern and design of Ceph

2013-08-19 Thread Guang Yang
Thanks Mark.

What are the design considerations behind breaking large files into 4 MB chunks 
rather than storing the large file directly?

Thanks,
Guang



 From: Mark Kirkwood mark.kirkw...@catalyst.net.nz
To: Guang Yang yguan...@yahoo.com 
Cc: ceph-users@lists.ceph.com ceph-users@lists.ceph.com 
Sent: Monday, August 19, 2013 5:18 PM
Subject: Re: [ceph-users] Usage pattern and design of Ceph
 

On 19/08/13 18:17, Guang Yang wrote:

    3. Some industry research shows that one issue of file system is the
 metadata-to-data ratio, in terms of both access and storage, and some
 technic uses the mechanism to combine small files to large physical
 files to reduce the ratio (Haystack for example), if we want to use ceph
 to store photos, should this be a concern as Ceph use one physical file
 per object?

If you use Ceph as a pure object store, and get and put data via the 
basic rados api then sure, one client data object will be stored in one 
Ceph 'object'. However if you use rados gateway (S3 or Swift look-alike 
api) then each client data object will be broken up into chunks at the 
rados level (typically 4M sized chunks).
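
If you want to see the difference for yourself, something like this works (a
sketch -- I'm assuming a default setup where the gateway's data lives in the
'.rgw.buckets' pool and the stock 4 MB stripe size; your pool names may differ):

    # after uploading a ~100 MB object through the S3 api, peek underneath:
    rados -p .rgw.buckets ls            # many ~4 MB rados objects show up
    # a put through the basic rados api really is a single object:
    rados -p data put bigfile ./bigfile.bin
    rados -p data stat bigfile          # one object of the full size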


Regards

Mark
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Usage pattern and design of Ceph

2013-08-19 Thread Guang Yang
Thanks Greg.

Some comments inline...

On Sunday, August 18, 2013, Guang Yang  wrote:

Hi ceph-users,
This is Guang and I am pretty new to ceph, glad to meet you guys in the 
community!


After walking through some documents of Ceph, I have a couple of questions:
  1. Is there any comparison between Ceph and AWS S3, in terms of the ability 
to handle different work-loads (from KB to GB), with corresponding performance 
report?

Not really; any comparison would be highly biased depending on your Amazon ping 
and your Ceph cluster. We've got some internal benchmarks where Ceph looks 
good, but they're not anything we'd feel comfortable publishing.
 [Guang] Yeah, I mean solely the server-side time, regardless of the RTT impact 
on the comparison.
  2. Looking at some industry solutions for distributed storage, GFS / Haystack 
/ HDFS all use meta-server to store the logical-to-physical mapping within 
memory and avoid disk I/O lookup for file reading, is the concern valid for 
Ceph (in terms of latency to read file)?

These are very different systems. Thanks to CRUSH, RADOS doesn't need to do any 
IO to find object locations; CephFS only does IO if the inode you request has 
fallen out of the MDS cache (not terribly likely in general). This shouldn't be 
an issue...
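(An easy way to see that no lookup I/O is involved: an object's placement can be
computed from the cluster map alone, e.g.

    ceph osd map <pool> <objectname>
    # -> osdmap eN pool '<pool>' object '<objectname>' -> pg X.YYYY -> up [a,b,c] acting [a,b,c]

-- the names above are placeholders and the output format is approximate.)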
[Guang] Regarding "CephFS only does IO if the inode you request has fallen out 
of the MDS cache": my understanding is that if we use CephFS we will need to 
interact with RADOS twice, first to retrieve the metadata (file attributes, 
owner, etc.) and then to load the data, and both steps may need disk I/O (for 
the inode and for the data respectively). Is my understanding correct? The 
approach some other storage systems take is to cache the file handles in memory, 
so that the I/O to read the inode can be avoided.
 
  3. Some industry research shows that one issue of file system is the 
metadata-to-data ratio, in terms of both access and storage, and some technic 
uses the mechanism to combine small files to large physical files to reduce the 
ratio (Haystack for example), if we want to use ceph to store photos, should 
this be a concern as Ceph use one physical file per object?

...although this might be. The issue basically comes down to how many disk 
seeks are required to retrieve an item, and one way to reduce that number is to 
hack the filesystem by keeping a small number of very large files and 
calculating (or caching) where different objects are inside that file. Since 
Ceph is designed for MB-sized objects it doesn't go to these lengths to 
optimize that path like Haystack might (I'm not familiar with Haystack in 
particular).
That said, you need some pretty extreme latency requirements before this 
becomes an issue and if you're also looking at HDFS or S3 I can't imagine 
you're in that ballpark. You should be fine. :)
[Guang] Yep, that makes a lot of sense.
-Greg

-- 
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Usage pattern and design of Ceph

2013-08-19 Thread Mark Kirkwood

On 20/08/13 13:27, Guang Yang wrote:

Thanks Mark.

What are the design considerations behind breaking large files into 4 MB chunks
rather than storing the large file directly?




Quoting Wolfgang from previous reply:

= which is a good thing in terms of replication and OSD usage
distribution


...which covers what I would have said quite well :-)
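
To put rough numbers on it: a 40 GB file stored as a single object would live in
one PG (so on one set of, typically, 3 OSDs) and would have to be replicated and
recovered as a single 40 GB unit. Striped into 4 MB chunks it becomes
40960 / 4 = 10240 objects, which CRUSH spreads across essentially all the PGs in
the pool, so capacity usage, client I/O and recovery all parallelise across the
cluster.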

Cheers

Mark


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd map issues: no such file or directory (ENOENT) AND map wrong image

2013-08-19 Thread David Zafman

Transferring this back to ceph-users.  Sorry, I can't help with rbd issues.  
One thing I will say is that if you are mounting an rbd device with a 
filesystem on a machine in order to export it via ftp, you can't also export 
the same device via iSCSI.

David Zafman
Senior Developer
http://www.inktank.com

On Aug 19, 2013, at 8:39 PM, PJ linalin1...@gmail.com wrote:

 2013/8/14 David Zafman david.zaf...@inktank.com
 
 On Aug 12, 2013, at 7:41 PM, Josh Durgin josh.dur...@inktank.com wrote:
 
  On 08/12/2013 07:18 PM, PJ wrote:
 
  If the target rbd device is only mapped on one virtual machine, we format it
  as ext4 and mount it in two places:
mount /dev/rbd0 /nfs -- for nfs server usage
mount /dev/rbd0 /ftp  -- for ftp server usage
  The nfs and ftp servers run on the same virtual machine. Will the file system
  (ext4) help to handle the simultaneous access from nfs and ftp?
  
  I doubt that'll work perfectly on a normal disk, although rbd should
  behave the same in this case. There are bound to be some issues when
  the same files are modified at once by the ftp and nfs servers.
  You could run ftp on an nfs client on a different machine safely.
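  
  (That layout would look roughly like this -- just a sketch, host names and
  paths are placeholders:
      mount /dev/rbd0 /export                # on the VM that maps the image
      # add to /etc/exports:  /export 10.0.0.0/24(rw,sync)   then: exportfs -ra
      mount -t nfs storage-vm:/export /ftp   # on the ftp VM, as a plain nfs client
  )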
 
 
 
 Modern Linux kernels will do a bind mount when a block device is mounted on 2 
 different directories.   Think directory hard links.  Simultaneous access 
 will NOT corrupt ext4, but as Josh said, modifying the same file at once by 
 ftp and nfs isn't going to produce good results.  With file locking, 2 nfs 
 clients could coordinate using advisory locking.
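 
 (For example, both writers could serialise on an advisory lock -- a sketch only,
 and whether the lock is actually honoured across NFS clients depends on the NFS
 version and on the lock manager being configured:
     flock -x /mnt/shared/.upload.lock -c 'cp upload.tmp /mnt/shared/upload.dat'
 )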
 
 David Zafman
 Senior Developer
 http://www.inktank.com
 
 
 The first issue is reproduced, but there are changes to the system configuration. 
 Due to a hardware shortage we only have one physical machine; it has one OSD 
 installed and runs 6 virtual machines. Only one is a monitor (wistor-003) and 
 one is an FTP server (wistor-004); the other virtual machines are iSCSI servers.
 
 The log size is big because, when enabling the FTP service for an rbd device, 
 we have an rbd map retry loop in case the map fails (it retries rbd map every 
 10 sec for up to 3 minutes). Please download the monitor log from the link below:
 https://www.dropbox.com/s/88cb9q91cjszuug/ceph-mon.wistor-003.log.zip
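 
 (The retry loop is roughly the following -- a simplified sketch; the real script
 also logs every attempt, and the pool/image names are placeholders:
     for i in $(seq 1 18); do          # 18 x 10 sec = 3 minutes
         rbd map rex/image01 && break
         sleep 10
     done
 )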
 
 Here are the operation steps:
 1. The pool rex is created
Around 2013-08-20 09:16:38~09:16:39
 2. The first attempt to map the rbd device on wistor-004 fails (all retries 
 failed)
Around 2013-08-20 09:17:43~09:20:46 (180 sec)
 3. The second attempt works, but there are still 9 failures in the retry loop
Around 2013-08-20 09:20:48~09:22:10 (82 sec)
 
 
 
 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com