Re: Ceph performance improvement

2012-08-22 Thread Mark Kirkwood

On 22/08/12 22:24, David McBride wrote:

On 22/08/12 09:54, Denis Fondras wrote:


* Test with "dd" from the client using CephFS :
# dd if=/dev/zero of=testdd bs=4k count=4M
17179869184 bytes (17 GB) written, 338,29 s, 50,8 MB/s


Again, the synchronous nature of 'dd' is probably severely affecting 
apparent performance.  I'd suggest looking at some other tools, like 
fio, bonnie++, or iozone, which might generate more representative load.


(Or, if you have a specific use-case in mind, something that generates 
an IO pattern like what you'll be using in production would be ideal!)





Appending conv=fsync to the dd will make the comparison fair enough. 
Looking at the ceph code, it does



sync_file_range(fd, offset, blocksz, SYNC_FILE_RANGE_WRITE);

which is very fast - way faster than fdatasync() and friends (I have 
tested this ... see prev posting on random write performance with file 
writetest.c attached).
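For example, the original test above becomes:

dd if=/dev/zero of=testdd bs=4k count=4M conv=fsync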


I am not convinced that these sorts of tests are in any way 'unfair' - for 
instance I would like to use rbd for postgres or mysql data volumes... 
and many database actions involve a stream of block writes similar 
enough to doing dd (e.g. bulk row loads, appends to transaction log 
journals).
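
And for something even closer to that kind of load, a quick fio run along 
these lines is a reasonable sketch (the parameters are purely illustrative, 
not a tuned job):

fio --name=txlog --rw=write --bs=8k --size=1g --ioengine=sync --fdatasync=1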


Cheers

Mark


Re: OSD crash

2012-08-22 Thread Gregory Farnum
The tcmalloc backtrace on the OSD suggests this may be unrelated, but
what's the fd limit on your monitor process? You may be approaching
that limit if you've got 500 OSDs and a similar number of clients.
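
For reference, something like this shows the limit on a running ceph-mon 
(the pid lookup is only illustrative):

grep 'open files' /proc/$(pidof ceph-mon)/limits

and, if I remember the option name right, it can be raised by putting 
something like 'max open files = 131072' in the [global] section of 
ceph.conf so the init script bumps the ulimit when it starts the daemons.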

On Wed, Aug 22, 2012 at 6:55 PM, Andrey Korolyov  wrote:
> On Thu, Aug 23, 2012 at 2:33 AM, Sage Weil  wrote:
>> On Thu, 23 Aug 2012, Andrey Korolyov wrote:
>>> Hi,
>>>
>>> today, during a heavy test, a pair of osds and one mon died, resulting in
>>> a hard lockup of some kvm processes - they became unresponsive and were
>>> killed, leaving zombie processes ([kvm] ). The entire cluster
>>> contains sixteen osds on eight nodes and three mons, on the first and last
>>> node and on a vm outside the cluster.
>>>
>>> osd bt:
>>> #0  0x7fc37d490be3 in
>>> tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*,
>>> unsigned long, int) () from /usr/lib/libtcmalloc.so.4
>>> (gdb) bt
>>> #0  0x7fc37d490be3 in
>>> tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*,
>>> unsigned long, int) () from /usr/lib/libtcmalloc.so.4
>>> #1  0x7fc37d490eb4 in tcmalloc::ThreadCache::Scavenge() () from
>>> /usr/lib/libtcmalloc.so.4
>>> #2  0x7fc37d4a2287 in tc_delete () from /usr/lib/libtcmalloc.so.4
>>> #3  0x008b1224 in _M_dispose (__a=..., this=0x6266d80) at
>>> /usr/include/c++/4.7/bits/basic_string.h:246
>>> #4  ~basic_string (this=0x7fc3736639d0, __in_chrg=) at
>>> /usr/include/c++/4.7/bits/basic_string.h:536
>>> #5  ~basic_stringbuf (this=0x7fc373663988, __in_chrg=)
>>> at /usr/include/c++/4.7/sstream:60
>>> #6  ~basic_ostringstream (this=0x7fc373663980, __in_chrg=<optimized out>, __vtt_parm=<optimized out>) at /usr/include/c++/4.7/sstream:439
>>> #7  pretty_version_to_str () at common/version.cc:40
>>> #8  0x00791630 in ceph::BackTrace::print (this=0x7fc373663d10,
>>> out=...) at common/BackTrace.cc:19
>>> #9  0x0078f450 in handle_fatal_signal (signum=11) at
>>> global/signal_handler.cc:91
>>> #10 
>>> #11 0x7fc37d490be3 in
>>> tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*,
>>> unsigned long, int) () from /usr/lib/libtcmalloc.so.4
>>> #12 0x7fc37d490eb4 in tcmalloc::ThreadCache::Scavenge() () from
>>> /usr/lib/libtcmalloc.so.4
>>> #13 0x7fc37d49eb97 in tc_free () from /usr/lib/libtcmalloc.so.4
>>> #14 0x7fc37d1c6670 in __gnu_cxx::__verbose_terminate_handler() ()
>>> from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
>>> #15 0x7fc37d1c4796 in ?? () from 
>>> /usr/lib/x86_64-linux-gnu/libstdc++.so.6
>>> #16 0x7fc37d1c47c3 in std::terminate() () from
>>> /usr/lib/x86_64-linux-gnu/libstdc++.so.6
>>> #17 0x7fc37d1c49ee in __cxa_throw () from
>>> /usr/lib/x86_64-linux-gnu/libstdc++.so.6
>>> #18 0x00844e11 in ceph::__ceph_assert_fail (assertion=0x90c01c
>>> "0 == \"unexpected error\"", file=, line=3007,
>>> func=0x90ef80 "unsigned int
>>> FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int)")
>>> at common/assert.cc:77
>>
>> This means it got an unexpected error when talking to the file system.  If
>> you look in the osd log, it may tell you what that was.  (It may
>> not--there isn't usually the other tcmalloc stuff triggered from the
>> assert handler.)
>>
>> What happens if you restart that ceph-osd daemon?
>>
>> sage
>>
>>
>
> Unfortunately I had completely disabled logs during the test, so there
> is no hint of what caused the assert_fail. The main problem was revealed -
> the created VMs were pointed at one monitor instead of the set of three,
> so there may be some unusual effects (btw, the crashed mon isn't the one
> from above, but a neighbor of the crashed osds on the first node). After
> an IPMI reset the node came back fine and the cluster behavior seems to
> be okay - the stuck kvm I/O somehow prevented even module load/unload on
> this node, so I finally decided to do a hard reset. Although I'm using an
> almost generic wheezy, glibc was updated to 2.15; maybe that is why this
> trace appeared for the first time. I'm almost sure the fs did not trigger
> this crash and mainly suspect the stuck kvm processes. I'll rerun the test
> under the same conditions tomorrow (~500 VMs pointed at one mon and very
> high I/O, but with osd logging).
>
>>> #19 0x0073148f in FileStore::_do_transaction
>>> (this=this@entry=0x2cde000, t=..., op_seq=op_seq@entry=429545,
>>> trans_num=trans_num@entry=0) at os/FileStore.cc:3007
>>> #20 0x0073484e in FileStore::do_transactions (this=0x2cde000,
>>> tls=..., op_seq=429545) at os/FileStore.cc:2436
>>> #21 0x0070c680 in FileStore::_do_op (this=0x2cde000,
>>> osr=) at os/FileStore.cc:2259
>>> #22 0x0083ce01 in ThreadPool::worker (this=0x2cde828) at
>>> common/WorkQueue.cc:54
>>> #23 0x006823ed in ThreadPool::WorkThread::entry
>>> (this=) at ./common/WorkQueue.h:126
>>> #24 0x7fc37e3eee9a in start_thread () from
>>> /lib/x86_64-linux-gnu/libpthread.so.0
>>> #25 0x7fc37c9864cd in clone () from /lib/x86_64-linux-gnu/libc.so.6
>>> #26 0x in ?? ()
>>>
>>> mon bt was exactly the same as in http://tracker.newdream.net/issues/2762

Re: Ceph performance improvement / journal on block-dev

2012-08-22 Thread Tommi Virtanen
On Wed, Aug 22, 2012 at 12:12 PM, Dieter Kasper (KD)
 wrote:
>> Your journal is a file on a btrfs partition. That is probably a bad
>> idea for performance. I'd recommend partitioning the drive and using
>> partitions as journals directly.
> can you please teach me how to use the right parameter(s) to realize 'journal 
> on block-dev' ?

Replacing the example paths, use "sudo parted /dev/sdg" or "gksu
gparted /dev/sdg", create partitions, set osd journal to point to a
block device for a partition.
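
A minimal sketch of that partitioning step (device and sizes are only 
illustrative):

sudo parted /dev/sdg -- mklabel gpt
sudo parted /dev/sdg -- mkpart osd42-journal 1MiB 10GiB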

[osd.42]
osd journal = /dev/sdg4

> It looks like something is not OK during 'mkcephfs -a -c /etc/ceph/ceph.conf 
> --mkbtrfs'
> (see below)

Try running it with -x for any chance of extracting debuggable
information from the monster.

> Scanning for Btrfs filesystems
>  HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
> 2012-08-22 21:04:01.519073 7fb475e8b780 -1 journal check: ondisk fsid 
> 8b18c558-8b40-4b07-aa66-61fecb4dd89d doesn't match expected 
> ee0b8bf1-dd4a-459e-a218-3f590f9a8c16, invalid (someone else's?) journal

Based on that, my best guess would be that you're seeing a journal
from an old run -- perhaps you need to explicitly clear out the block
device contents.
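
For example, something along these lines should do it (an untested sketch; 
the journal device for osd.0 is just an example -- double-check the device 
name before wiping anything):

dd if=/dev/zero of=/dev/ram3 bs=1M count=10   # zero the stale journal header
# then re-run mkcephfs, or write a fresh journal with: ceph-osd -i 0 --mkjournal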

Frankly, you should not use btrfs devs. Any convenience you may gain
is more than doubly offset by pains exactly like these.


Re: OSD crash

2012-08-22 Thread Andrey Korolyov
On Thu, Aug 23, 2012 at 2:33 AM, Sage Weil  wrote:
> On Thu, 23 Aug 2012, Andrey Korolyov wrote:
>> Hi,
>>
>> today, during a heavy test, a pair of osds and one mon died, resulting in
>> a hard lockup of some kvm processes - they became unresponsive and were
>> killed, leaving zombie processes ([kvm] ). The entire cluster
>> contains sixteen osds on eight nodes and three mons, on the first and last
>> node and on a vm outside the cluster.
>>
>> osd bt:
>> #0  0x7fc37d490be3 in
>> tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*,
>> unsigned long, int) () from /usr/lib/libtcmalloc.so.4
>> (gdb) bt
>> #0  0x7fc37d490be3 in
>> tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*,
>> unsigned long, int) () from /usr/lib/libtcmalloc.so.4
>> #1  0x7fc37d490eb4 in tcmalloc::ThreadCache::Scavenge() () from
>> /usr/lib/libtcmalloc.so.4
>> #2  0x7fc37d4a2287 in tc_delete () from /usr/lib/libtcmalloc.so.4
>> #3  0x008b1224 in _M_dispose (__a=..., this=0x6266d80) at
>> /usr/include/c++/4.7/bits/basic_string.h:246
>> #4  ~basic_string (this=0x7fc3736639d0, __in_chrg=) at
>> /usr/include/c++/4.7/bits/basic_string.h:536
>> #5  ~basic_stringbuf (this=0x7fc373663988, __in_chrg=)
>> at /usr/include/c++/4.7/sstream:60
>> #6  ~basic_ostringstream (this=0x7fc373663980, __in_chrg=<optimized out>, __vtt_parm=<optimized out>) at /usr/include/c++/4.7/sstream:439
>> #7  pretty_version_to_str () at common/version.cc:40
>> #8  0x00791630 in ceph::BackTrace::print (this=0x7fc373663d10,
>> out=...) at common/BackTrace.cc:19
>> #9  0x0078f450 in handle_fatal_signal (signum=11) at
>> global/signal_handler.cc:91
>> #10 
>> #11 0x7fc37d490be3 in
>> tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*,
>> unsigned long, int) () from /usr/lib/libtcmalloc.so.4
>> #12 0x7fc37d490eb4 in tcmalloc::ThreadCache::Scavenge() () from
>> /usr/lib/libtcmalloc.so.4
>> #13 0x7fc37d49eb97 in tc_free () from /usr/lib/libtcmalloc.so.4
>> #14 0x7fc37d1c6670 in __gnu_cxx::__verbose_terminate_handler() ()
>> from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
>> #15 0x7fc37d1c4796 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
>> #16 0x7fc37d1c47c3 in std::terminate() () from
>> /usr/lib/x86_64-linux-gnu/libstdc++.so.6
>> #17 0x7fc37d1c49ee in __cxa_throw () from
>> /usr/lib/x86_64-linux-gnu/libstdc++.so.6
>> #18 0x00844e11 in ceph::__ceph_assert_fail (assertion=0x90c01c
>> "0 == \"unexpected error\"", file=, line=3007,
>> func=0x90ef80 "unsigned int
>> FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int)")
>> at common/assert.cc:77
>
> This means it got an unexpected error when talking to the file system.  If
> you look in the osd log, it may tell you what that was.  (It may
> not--there isn't usually the other tcmalloc stuff triggered from the
> assert handler.)
>
> What happens if you restart that ceph-osd daemon?
>
> sage
>
>

Unfortunately I had completely disabled logs during the test, so there
is no hint of what caused the assert_fail. The main problem was revealed -
the created VMs were pointed at one monitor instead of the set of three,
so there may be some unusual effects (btw, the crashed mon isn't the one
from above, but a neighbor of the crashed osds on the first node). After
an IPMI reset the node came back fine and the cluster behavior seems to
be okay - the stuck kvm I/O somehow prevented even module load/unload on
this node, so I finally decided to do a hard reset. Although I'm using an
almost generic wheezy, glibc was updated to 2.15; maybe that is why this
trace appeared for the first time. I'm almost sure the fs did not trigger
this crash and mainly suspect the stuck kvm processes. I'll rerun the test
under the same conditions tomorrow (~500 VMs pointed at one mon and very
high I/O, but with osd logging).
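
(For reference, the standard knobs for that are the debug options in 
ceph.conf -- the values below are just a commonly used verbose setting:

[osd]
debug osd = 20
debug filestore = 20
debug journal = 20

or they can be injected into a running daemon, e.g.
ceph osd tell 0 injectargs '--debug-osd 20 --debug-filestore 20'.)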

>> #19 0x0073148f in FileStore::_do_transaction
>> (this=this@entry=0x2cde000, t=..., op_seq=op_seq@entry=429545,
>> trans_num=trans_num@entry=0) at os/FileStore.cc:3007
>> #20 0x0073484e in FileStore::do_transactions (this=0x2cde000,
>> tls=..., op_seq=429545) at os/FileStore.cc:2436
>> #21 0x0070c680 in FileStore::_do_op (this=0x2cde000,
>> osr=) at os/FileStore.cc:2259
>> #22 0x0083ce01 in ThreadPool::worker (this=0x2cde828) at
>> common/WorkQueue.cc:54
>> #23 0x006823ed in ThreadPool::WorkThread::entry
>> (this=) at ./common/WorkQueue.h:126
>> #24 0x7fc37e3eee9a in start_thread () from
>> /lib/x86_64-linux-gnu/libpthread.so.0
>> #25 0x7fc37c9864cd in clone () from /lib/x86_64-linux-gnu/libc.so.6
>> #26 0x in ?? ()
>>
>> mon bt was exactly the same as in http://tracker.newdream.net/issues/2762

Re: OSD crash

2012-08-22 Thread Sage Weil
On Thu, 23 Aug 2012, Andrey Korolyov wrote:
> Hi,
> 
> today, during a heavy test, a pair of osds and one mon died, resulting in
> a hard lockup of some kvm processes - they became unresponsive and were
> killed, leaving zombie processes ([kvm] ). The entire cluster
> contains sixteen osds on eight nodes and three mons, on the first and last
> node and on a vm outside the cluster.
> 
> osd bt:
> #0  0x7fc37d490be3 in
> tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*,
> unsigned long, int) () from /usr/lib/libtcmalloc.so.4
> (gdb) bt
> #0  0x7fc37d490be3 in
> tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*,
> unsigned long, int) () from /usr/lib/libtcmalloc.so.4
> #1  0x7fc37d490eb4 in tcmalloc::ThreadCache::Scavenge() () from
> /usr/lib/libtcmalloc.so.4
> #2  0x7fc37d4a2287 in tc_delete () from /usr/lib/libtcmalloc.so.4
> #3  0x008b1224 in _M_dispose (__a=..., this=0x6266d80) at
> /usr/include/c++/4.7/bits/basic_string.h:246
> #4  ~basic_string (this=0x7fc3736639d0, __in_chrg=) at
> /usr/include/c++/4.7/bits/basic_string.h:536
> #5  ~basic_stringbuf (this=0x7fc373663988, __in_chrg=)
> at /usr/include/c++/4.7/sstream:60
> #6  ~basic_ostringstream (this=0x7fc373663980, __in_chrg=<optimized out>, __vtt_parm=<optimized out>) at /usr/include/c++/4.7/sstream:439
> #7  pretty_version_to_str () at common/version.cc:40
> #8  0x00791630 in ceph::BackTrace::print (this=0x7fc373663d10,
> out=...) at common/BackTrace.cc:19
> #9  0x0078f450 in handle_fatal_signal (signum=11) at
> global/signal_handler.cc:91
> #10 
> #11 0x7fc37d490be3 in
> tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*,
> unsigned long, int) () from /usr/lib/libtcmalloc.so.4
> #12 0x7fc37d490eb4 in tcmalloc::ThreadCache::Scavenge() () from
> /usr/lib/libtcmalloc.so.4
> #13 0x7fc37d49eb97 in tc_free () from /usr/lib/libtcmalloc.so.4
> #14 0x7fc37d1c6670 in __gnu_cxx::__verbose_terminate_handler() ()
> from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
> #15 0x7fc37d1c4796 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
> #16 0x7fc37d1c47c3 in std::terminate() () from
> /usr/lib/x86_64-linux-gnu/libstdc++.so.6
> #17 0x7fc37d1c49ee in __cxa_throw () from
> /usr/lib/x86_64-linux-gnu/libstdc++.so.6
> #18 0x00844e11 in ceph::__ceph_assert_fail (assertion=0x90c01c
> "0 == \"unexpected error\"", file=, line=3007,
> func=0x90ef80 "unsigned int
> FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int)")
> at common/assert.cc:77

This means it got an unexpected error when talking to the file system.  If 
you look in the osd log, it may tell you what that was.  (It may 
not--there isn't usually the other tcmalloc stuff triggered from the 
assert handler.)

What happens if you restart that ceph-osd daemon?

sage


> #19 0x0073148f in FileStore::_do_transaction
> (this=this@entry=0x2cde000, t=..., op_seq=op_seq@entry=429545,
> trans_num=trans_num@entry=0) at os/FileStore.cc:3007
> #20 0x0073484e in FileStore::do_transactions (this=0x2cde000,
> tls=..., op_seq=429545) at os/FileStore.cc:2436
> #21 0x0070c680 in FileStore::_do_op (this=0x2cde000,
> osr=) at os/FileStore.cc:2259
> #22 0x0083ce01 in ThreadPool::worker (this=0x2cde828) at
> common/WorkQueue.cc:54
> #23 0x006823ed in ThreadPool::WorkThread::entry
> (this=) at ./common/WorkQueue.h:126
> #24 0x7fc37e3eee9a in start_thread () from
> /lib/x86_64-linux-gnu/libpthread.so.0
> #25 0x7fc37c9864cd in clone () from /lib/x86_64-linux-gnu/libc.so.6
> #26 0x in ?? ()
> 
> mon bt was exactly the same as in http://tracker.newdream.net/issues/2762


Re: Ceph performance improvement / journal on block-dev

2012-08-22 Thread Dieter Kasper (KD)
On Wed, Aug 22, 2012 at 06:29:12PM +0200, Tommi Virtanen wrote:
(...)
> 
> Your journal is a file on a btrfs partition. That is probably a bad
> idea for performance. I'd recommend partitioning the drive and using
> partitions as journals directly.

Hi Tommi,

can you please teach me how to use the right parameter(s) to realize 'journal 
on block-dev' ?

It looks like something is not OK during 'mkcephfs -a -c /etc/ceph/ceph.conf 
--mkbtrfs'
(see below)

Regards,
-Dieter


e.g.
---snip---
modprobe -v brd rd_nr=6 rd_size=1000   # 6x 10G RAM DISK

/etc/ceph/ceph.conf
--
[global]
auth supported = none

# set log file
log file = /ceph/log/$name.log
log_to_syslog = true   # uncomment this line to log to syslog

# set up pid files
pid file = /var/run/ceph/$name.pid

[mon]  
mon data = /ceph/$name
debug optracker = 0

[mon.alpha]
host = 127.0.0.1
mon addr = 127.0.0.1:6789

[mds]
debug optracker = 0

[mds.0]
host = 127.0.0.1

[osd]
osd data = /data/$name

[osd.0]
host = 127.0.0.1
btrfs devs  = /dev/ram0
osd journal = /dev/ram3

[osd.1]
host = 127.0.0.1
btrfs devs  = /dev/ram1
osd journal = /dev/ram4

[osd.2]
host = 127.0.0.1
btrfs devs  = /dev/ram2
osd journal = /dev/ram5
--

root # mkcephfs -a -c /etc/ceph/ceph.conf --mkbtrfs
temp dir is /tmp/mkcephfs.wzARGSpFB6
preparing monmap in /tmp/mkcephfs.wzARGSpFB6/monmap
/usr/bin/monmaptool --create --clobber --add alpha 127.0.0.1:6789 --print 
/tmp/mkcephfs.wzARGSpFB6/monmap
/usr/bin/monmaptool: monmap file /tmp/mkcephfs.wzARGSpFB6/monmap
/usr/bin/monmaptool: generated fsid 40b997ea-387a-4deb-9a30-805cd076a0de
epoch 0
fsid 40b997ea-387a-4deb-9a30-805cd076a0de
last_changed 2012-08-22 21:04:00.553972
created 2012-08-22 21:04:00.553972
0: 127.0.0.1:6789/0 mon.alpha
/usr/bin/monmaptool: writing epoch 0 to /tmp/mkcephfs.wzARGSpFB6/monmap (1 
monitors)
=== osd.0 === 
pushing conf and monmap to 127.0.0.1:/tmp/mkfs.ceph.11005
umount: /data/osd.0: not mounted
umount: /dev/ram0: not mounted

Btrfs v0.19.1+

ATTENTION:

mkfs.btrfs is not intended to be used directly. Please use the
YaST partitioner to create and manage btrfs filesystems to be
in a supported state on SUSE Linux Enterprise systems.

fs created label (null) on /dev/ram0
nodesize 4096 leafsize 4096 sectorsize 4096 size 9.54GiB
Scanning for Btrfs filesystems
 HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
2012-08-22 21:04:01.519073 7fb475e8b780 -1 journal check: ondisk fsid 
8b18c558-8b40-4b07-aa66-61fecb4dd89d doesn't match expected 
ee0b8bf1-dd4a-459e-a218-3f590f9a8c16, invalid (someone else's?) journal
 HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
 HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
 HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
2012-08-22 21:04:01.923505 7fb475e8b780 -1 filestore(/data/osd.0) could not 
find 23c2fcde/osd_superblock/0//-1 in index: (2) No such file or directory
2012-08-22 21:04:01.937429 7fb475e8b780 -1 created object store /data/osd.0 
journal /dev/ram3 for osd.0 fsid 40b997ea-387a-4deb-9a30-805cd076a0de
creating private key for osd.0 keyring /data/osd.0/keyring
creating /data/osd.0/keyring
collecting osd.0 key
=== osd.1 === 
pushing conf and monmap to 127.0.0.1:/tmp/mkfs.ceph.11005
umount: /data/osd.1: not mounted
(...)




Re: wip-crush

2012-08-22 Thread Sage Weil
On Wed, 22 Aug 2012, Gregory Farnum wrote:
> On Wed, Aug 22, 2012 at 9:33 AM, Sage Weil  wrote:
> > On Wed, 22 Aug 2012, Atchley, Scott wrote:
> >> On Aug 22, 2012, at 10:46 AM, Florian Haas wrote:
> >>
> >> > On 08/22/2012 03:10 AM, Sage Weil wrote:
> >> >> I pushed a branch that changes some of the crush terminology.  Instead 
> >> >> of
> >> >> having a crush type called "pool" that requires you to say things like
> >> >> "pool=default" in the "ceph osd crush set ..." command, it uses "root"
> >> >> instead.  That hopefully reinforces that it is a tree/hierarchy.
> >> >>
> >> >> There is also a patch that changes "bucket" to "node" throughout, since
> >> >> bucket is a term also used by radosgw.
> >> >>
> >> >> Thoughts?  I think the main pain in making this transition is that old
> >> >> clusters have maps that have a type 'pool' and new ones won't, and the
> >> >> docs will need to walk people through both...
> >> >
> >> > "pool" in a crushmap being completely unrelated to a RADOS pool is
> >> > something that I've heard customers/users report as confusing, as well.
> >> > So changing that is probably a good thing. Naming it "root" is probably
> >> > a good choice as well, as it happens to match
> >> > http://ceph.com/wiki/Custom_data_placement_with_CRUSH.
> >> >
> >> > As for changing "bucket" to node... a "node" is normally simply a
> >> > physical server (at least in HA terminology, which many potential Ceph
> >> > users will be familiar with), and CRUSH uses "host" for that. So that's
> >> > another recipe for confusion. How about using something super-generic,
> >> > like "element" or "item"?
> >> >
> >> > Cheers,
> >> > Florian
> >>
> >> My guess is that he is trying to use data structure tree nomenclature
> >> (root, node, leaf). I agree that node is an overloaded term (as is
> >> pool).
> >
> > Yeah...
> >
> >> As for an alternative to bucket which indicates the item is a
> >> collection, what about subtree or branch?
> >
> > I think fixing the overloading of 'pool' in the default crush map is the
> > biggest pain point.  I can live with crush 'buckets' staying the same (esp
> > since that's what the papers and code use pervasively) if we can't come up
> > with a better option.
> 
> I'm definitely most interested in replacing "pool", and "root" works
> for that in my mind. RGW buckets live at a sufficiently different
> level that I think people are unlikely to be confused — and "bucket"
> is actually a good name for what they are (I'm open to better ones,
> but I don't think that "node" qualifies).

Yeah, sounds good to me.

> > On the pool part, though, the challenge is how to transition.  Existing
> > clusters have maps that use 'pool', and new clusters will use 'root' (or
> > whatever).  Some options:
> >
> >  - document both.  this kills much of the benefit of switching, but is
> >probably inevitable since people will be running different versions.
> >  - make the upgrade process transparently rename the type.  this lets
> >all the tools use the new names.
> >  - make the tools silently translate old names to new names.  this is
> >kludgey in that it makes the code make assumptions about the names of
> >the data it is working with, but would cover everyone except those who
> >created their own crush maps from scratch.
> >  - ?
>
> I would go with option two, and only document the new options — I
> wouldn't be surprised if the number of people who had changed those
> was zero. Anybody who has done so can certainly be counted on to pay
> enough attention that a line note "changed CRUSH names (see here if
> you customized your map)" would be sufficient, right?

Yeah.  The one wrinkle is that people running old code (e.g., argonaut) 
reading the latest docs will see commands that don't quite work.

At some point we need to fork the docs for each stable release... maybe 
now is the time to do that.

sage

Re: SimpleMessenger dispatching: cause of performance problems?

2012-08-22 Thread Samuel Just
What rbd block size were you using?
-Sam
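
(For reference, the object size is fixed when an image is created, via 
--order: 2^order bytes per object, order 22, i.e. 4 MB, being the default. 
An illustrative example:

rbd create --size 10240 --order 22 testimg
)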

On Tue, Aug 21, 2012 at 10:29 PM, Andreas Bluemle
 wrote:
> Hi,
>
>
> Samuel Just wrote:
>>
>> Was the cluster complete healthy at the time that those traces were taken?
>> If there were osds going in/out/up/down, it would trigger osdmap updates
>> which
>> would tend to hold the osd_lock for an extended period of time.
>>
>>
>
> The cluster was completely healthy.
>
>> v0.50 included some changes that drastically reduce the purview of
>> osd_lock.
>> In particular, pg op handling no longer grabs the osd_lock and
>> handle_osd_map
>> can proceed independently of the pg worker threads.  Trying that might be
>> interesting.
>>
>>
>
> I'll grab v0.50 and take a look.
>
>
>> -Sam
>>
>> On Tue, Aug 21, 2012 at 12:20 PM, Sage Weil  wrote:
>>
>>>
>>> On Tue, 21 Aug 2012, Sage Weil wrote:
>>>

 On Tue, 21 Aug 2012, Andreas Bluemle wrote:

>
> Hi Sage,
>
> as mentioned, the workload is a single sequential write on
> the client. The write is not O_DIRECT; and consequently
> the messages arrive at the OSD with 124 KByte per write request.
>
> The attached pdf shows a timing diagram of two concurrent
> write operations (one primary and one replication / secondary).
>
> The time spent by the primary write to get the OSD::osd_lock
> correlates nicely with the time when this lock is released by the
> secondary write.
>
>>>
>>> Looking again at this diagram, I'm a bit confused.  Is the Y axis the
>>> thread id or something?  And the X axis is time in seconds?
>>>
>>>
>
> X-Axis is time, Y Axis is absolute offset of the write request on the rados
> block device.
>
>>> The big question for me is what on earth the secondary write (or primary,
>>> for that matter) is doing with osd_lock for a full 3 ms...  If my reading
>>> of the units is correct, *that* is the real problem.  It shouldn't be
>>> doing anything that takes that long.  The exception is osdmap handling,
>>> which can do more work, but request processing should be very fast.
>>>
>>> Thanks-
>>> sage
>>>
>>>
>>>

 Ah, I see.

 There isn't a trivial way to pull osd_lock out of the picture; there are
 several data structures it's protecting (pg_map, osdmaps, peer epoch
 map,
 etc.).  Before we try going down that road, I suspect it might be more
 fruitful to see where cpu time is being spent while osd_lock is held.

 How much of an issue does it look like this specific contention is for
 you?  Does it go away with larger writes?

 sage



>
> Hope this helps
>
> Andreas
>
>
>
> Sage Weil wrote:
>
>>
>> On Mon, 20 Aug 2012, Andreas Bluemle wrote:
>>
>>
>>>
>>> Hi Sage,
>>>
>>> Sage Weil wrote:
>>>
>>>

 Hi Andreas,

 On Thu, 16 Aug 2012, Andreas Bluemle wrote:


>
> Hi,
>
> I have been trying to migrate a ceph cluster (ceph-0.48argonaut)
> to a high speed cluster network and encounter scalability problems:
> the overall performance of the ceph cluster does not scale well
> with an increase in the underlying networking speed.
>
> In short:
>
> I believe that the dispatching from SimpleMessenger to
> OSD worker queues causes that scalability issue.
>
> Question: is it possible that this dispatching is causing
> performance
> problems?
>
>

 There is a single 'dispatch' thread that's processing this queue,
 and
 conveniently perf lets you break down its profiling data on a
 per-thread
 basis.  Once you've ruled out the throttler as the culprit, you
 might
 try
 running the daemon with 'perf record -g -- ceph-osd ...' and then
 look
 specifically at where that thread is spending its time.  We
 shouldn't be
 burning that much CPU just doing the sanity checks and then handing
 requests
 off to PGs...

 sage




>>>
>>>   The effect, which I am seeing, may be related to some locking
>>> issue.
>>> As I read the code, there are multiple dispatchers running: one per
>>> SimpleMessenger.
>>>
>>> On a typical OSD node, there is
>>>
>>> - the instance of the SimpleMessenger processing input from the
>>> client
>>> (primary writes)
>>> - other instances of SimpleMessenger, which process input from
>>> neighbor
>>> OSD
>>> nodes
>>>
>>> the latter generate replication writes to the OSD I am looking at.
>>>
>>> On the other hand, there is a single instance of the OSD object
>>> within the
>>> ceph-osd daemon.
>>> When dispatching messages to the OSD, then the OSD::osd_lock is held
>>> for

Re: wip-crush

2012-08-22 Thread Gregory Farnum
On Wed, Aug 22, 2012 at 9:33 AM, Sage Weil  wrote:
> On Wed, 22 Aug 2012, Atchley, Scott wrote:
>> On Aug 22, 2012, at 10:46 AM, Florian Haas wrote:
>>
>> > On 08/22/2012 03:10 AM, Sage Weil wrote:
>> >> I pushed a branch that changes some of the crush terminology.  Instead of
>> >> having a crush type called "pool" that requires you to say things like
>> >> "pool=default" in the "ceph osd crush set ..." command, it uses "root"
>> >> instead.  That hopefully reinforces that it is a tree/hierarchy.
>> >>
>> >> There is also a patch that changes "bucket" to "node" throughout, since
>> >> bucket is a term also used by radosgw.
>> >>
>> >> Thoughts?  I think the main pain in making this transition is that old
>> >> clusters have maps that have a type 'pool' and new ones won't, and the
>> >> docs will need to walk people through both...
>> >
>> > "pool" in a crushmap being completely unrelated to a RADOS pool is
>> > something that I've heard customers/users report as confusing, as well.
>> > So changing that is probably a good thing. Naming it "root" is probably
>> > a good choice as well, as it happens to match
>> > http://ceph.com/wiki/Custom_data_placement_with_CRUSH.
>> >
>> > As for changing "bucket" to node... a "node" is normally simply a
>> > physical server (at least in HA terminology, which many potential Ceph
>> > users will be familiar with), and CRUSH uses "host" for that. So that's
>> > another recipe for confusion. How about using something super-generic,
>> > like "element" or "item"?
>> >
>> > Cheers,
>> > Florian
>>
>> My guess is that he is trying to use data structure tree nomenclature
>> (root, node, leaf). I agree that node is an overloaded term (as is
>> pool).
>
> Yeah...
>
>> As for an alternative to bucket which indicates the item is a
>> collection, what about subtree or branch?
>
> I think fixing the overloading of 'pool' in the default crush map is the
> biggest pain point.  I can live with crush 'buckets' staying the same (esp
> since that's what the papers and code use pervasively) if we can't come up
> with a better option.

I'm definitely most interested in replacing "pool", and "root" works
for that in my mind. RGW buckets live at a sufficiently different
level that I think people are unlikely to be confused — and "bucket"
is actually a good name for what they are (I'm open to better ones,
but I don't think that "node" qualifies).


> On the pool part, though, the challenge is how to transition.  Existing
> clusters have maps that use 'pool', and new clusters will use 'root' (or
> whatever).  Some options:
>
>  - document both.  this kills much of the benefit of switching, but is
>probably inevitable since people will be running different versions.
>  - make the upgrade process transparently rename the type.  this lets
>all the tools use the new names.
>  - make the tools silently translate old names to new names.  this is
>kludgey in that it makes the code make assumptions about the names of
>the data it is working with, but would cover everyone except those who
>created their own crush maps from scratch.
>  - ?
I would go with option two, and only document the new options — I
wouldn't be surprised if the number of people who had changed those
was zero. Anybody who has done so can certainly be counted on to pay
enough attention that a line note "changed CRUSH names (see here if
you customized your map)" would be sufficient, right?
-Greg


[GIT PULL] Ceph fixes for 3.6-rc3

2012-08-22 Thread Sage Weil
Hi Linus,

Please pull the following Ceph fixes for -rc3 from

  git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus

Jim's fix closes a narrow race introduced with the msgr changes.  One fix 
resolves problems with debugfs initialization that Yan found when multiple 
client instances are created (e.g., two clusters mounted, or rbd + 
cephfs), another one fixes problems with mounting a nonexistent server 
subdirectory, and the last one fixes a divide by zero error from 
unsanitized ioctl input that Dan Carpenter found.

Thanks!
sage



Jim Schutt (1):
  libceph: avoid truncation due to racing banners

Sage Weil (3):
  libceph: delay debugfs initialization until we learn global_id
  ceph: tolerate (and warn on) extraneous dentry from mds
  ceph: avoid divide by zero in __validate_layout()

 fs/ceph/debugfs.c  |1 +
 fs/ceph/inode.c|   15 +
 fs/ceph/ioctl.c|3 +-
 net/ceph/ceph_common.c |1 -
 net/ceph/debugfs.c |4 +++
 net/ceph/messenger.c   |   11 -
 net/ceph/mon_client.c  |   51 +++
 7 files changed, 72 insertions(+), 14 deletions(-)


Re: wip-crush

2012-08-22 Thread Sage Weil
On Wed, 22 Aug 2012, Atchley, Scott wrote:
> On Aug 22, 2012, at 10:46 AM, Florian Haas wrote:
> 
> > On 08/22/2012 03:10 AM, Sage Weil wrote:
> >> I pushed a branch that changes some of the crush terminology.  Instead of 
> >> having a crush type called "pool" that requires you to say things like 
> >> "pool=default" in the "ceph osd crush set ..." command, it uses "root" 
> >> instead.  That hopefully reinforces that it is a tree/hierarchy.
> >> 
> >> There is also a patch that changes "bucket" to "node" throughout, since 
> >> bucket is a term also used by radosgw.
> >> 
> >> Thoughts?  I think the main pain in making this transition is that old 
> >> clusters have maps that have a type 'pool' and new ones won't, and the 
> >> docs will need to walk people through both...
> > 
> > "pool" in a crushmap being completely unrelated to a RADOS pool is
> > something that I've heard customers/users report as confusing, as well.
> > So changing that is probably a good thing. Naming it "root" is probably
> > a good choice as well, as it happens to match
> > http://ceph.com/wiki/Custom_data_placement_with_CRUSH.
> > 
> > As for changing "bucket" to node... a "node" is normally simply a
> > physical server (at least in HA terminology, which many potential Ceph
> > users will be familiar with), and CRUSH uses "host" for that. So that's
> > another recipe for confusion. How about using something super-generic,
> > like "element" or "item"?
> > 
> > Cheers,
> > Florian
> 
> My guess is that he is trying to use data structure tree nomenclature 
> (root, node, leaf). I agree that node is an overloaded term (as is 
> pool).

Yeah...

> As for an alternative to bucket which indicates the item is a 
> collection, what about subtree or branch?

I think fixing the overloading of 'pool' in the default crush map is the 
biggest pain point.  I can live with crush 'buckets' staying the same (esp 
since that's what the papers and code use pervasively) if we can't come up 
with a better option.

On the pool part, though, the challenge is how to transition.  Existing 
clusters have maps that use 'pool', and new clusters will use 'root' (or 
whatever).  Some options:

 - document both.  this kills much of the benefit of switching, but is 
   probably inevitable since people will be running different versions. 
 - make the upgrade process transparently rename the type.  this lets 
   all the tools use the new names.
 - make the tools silently translate old names to new names.  this is 
   kludgey in that it makes the code make assumptions about the names of 
   the data it is working with, but would cover everyone except those who 
   created their own crush maps from scratch.
 - ?
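
For anyone who wants to do the rename by hand in the meantime, an untested 
sketch of the usual get/decompile/edit/compile/set cycle:

ceph osd getcrushmap -o /tmp/crush.bin
crushtool -d /tmp/crush.bin -o /tmp/crush.txt
# edit /tmp/crush.txt: change the "type N pool" declaration to "type N root"
# and the bucket definitions that use it, e.g. "pool default {" -> "root default {"
crushtool -c /tmp/crush.txt -o /tmp/crush.new
ceph osd setcrushmap -i /tmp/crush.new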

sage


Re: Ceph performance improvement

2012-08-22 Thread Tommi Virtanen
On Wed, Aug 22, 2012 at 9:23 AM, Denis Fondras  wrote:
>> Are you sure your osd data and journal are on the disks you think? The
>> /home paths look suspicious -- especially for journal, which often
>> should be a block device.
> I am :)
...
> -rw-r--r-- 1 root root 1048576000 août  22 17:22 /home/osd.0.journal

Your journal is a file on a btrfs partition. That is probably a bad
idea for performance. I'd recommend partitioning the drive and using
partitions as journals directly.


Re: Ceph performance improvement

2012-08-22 Thread Denis Fondras


Are you sure your osd data and journal are on the disks you think? The
/home paths look suspicious -- especially for journal, which often
should be a block device.



I am :)


Can you share output of "mount" and "ls -ld /home/osd.*"


Here are some details :

root@ceph-osd-0:~# ls -al /dev/disk/by-id/
lrwxrwxrwx 1 root root   9 août  21 21:19 
scsi-SATA_C300-CTFDDAC064104903008FE4 -> ../../sda
lrwxrwxrwx 1 root root   9 août  22 10:57 
scsi-SATA_WDC_WD30EZRX-00_WD-WMAWZ0124762 -> ../../sdh
lrwxrwxrwx 1 root root   9 août  21 16:03 
scsi-SATA_WDC_WD30EZRX-00_WD-WMAWZ0137898 -> ../../sdg
lrwxrwxrwx 1 root root   9 août  21 21:19 
scsi-SATA_WDC_WD30EZRX-00_WD-WMAWZ0152201 -> ../../sdf
lrwxrwxrwx 1 root root   9 août  21 16:03 
scsi-SATA_WDC_WD30EZRX-00_WD-WMAWZ0152562 -> ../../sdc


root@ceph-osd-0:~# mount
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
udev on /dev type devtmpfs 
(rw,relatime,size=10240k,nr_inodes=1020030,mode=755)
devpts on /dev/pts type devpts 
(rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)

tmpfs on /run type tmpfs (rw,nosuid,noexec,relatime,size=817216k,mode=755)
/dev/disk/by-uuid/7d95d243-1788-4c3f-9f89-166c15f880f0 on / type ext3 
(rw,relatime,errors=remount-ro,barrier=1,data=ordered)

tmpfs on /run/lock type tmpfs (rw,nosuid,nodev,noexec,relatime,size=5120k)
tmpfs on /tmp type tmpfs (rw,nosuid,nodev,relatime,size=1634432k)
tmpfs on /run/shm type tmpfs (rw,nosuid,nodev,relatime,size=1634432k)
/dev/sda on /home type btrfs (rw,relatime,ssd,space_cache)
/dev/sdf on /home/osd.0 type btrfs (rw,noatime,space_cache)

root@ceph-osd-0:~# ls -ld /home/osd.*
drwxr-xr-x 1 root root        236 août  22 17:22 /home/osd.0
-rw-r--r-- 1 root root 1048576000 août  22 17:22 /home/osd.0.journal

Regards,
Denis


Re: Ceph performance improvement

2012-08-22 Thread Tommi Virtanen
On Wed, Aug 22, 2012 at 1:54 AM, Denis Fondras  wrote:
> First of all, here is my setup :
> for the system, 1x 64GB SSD (Crucial C300 - /dev/sda) for the journal and 4x
> 3TB drive (Western Digital WD30EZRX). Everything but the boot partition is
> BTRFS-formated and 4K-aligned.
...
> [osd]
> osd data = /home/osd.$id
> osd journal = /home/osd.$id.journal
> osd journal size = 1000
> keyring = /etc/ceph/keyring.$name
>
> [osd.0]
> host = ceph-osd-0
> btrfs devs =
> /dev/disk/by-id/scsi-SATA_WDC_WD30EZRX-00_WD-WMAWZ0152201
> btrfs options = rw,noatime

Are you sure your osd data and journal are on the disks you think? The
/home paths look suspicious -- especially for journal, which often
should be a block device.

Can you share output of "mount" and "ls -ld /home/osd.*"


Re: wip-crush

2012-08-22 Thread Atchley, Scott
On Aug 22, 2012, at 10:46 AM, Florian Haas wrote:

> On 08/22/2012 03:10 AM, Sage Weil wrote:
>> I pushed a branch that changes some of the crush terminology.  Instead of 
>> having a crush type called "pool" that requires you to say things like 
>> "pool=default" in the "ceph osd crush set ..." command, it uses "root" 
>> instead.  That hopefully reinforces that it is a tree/hierarchy.
>> 
>> There is also a patch that changes "bucket" to "node" throughout, since 
>> bucket is a term also used by radosgw.
>> 
>> Thoughts?  I think the main pain in making this transition is that old 
>> clusters have maps that have a type 'pool' and new ones won't, and the 
>> docs will need to walk people through both...
> 
> "pool" in a crushmap being completely unrelated to a RADOS pool is
> something that I've heard customers/users report as confusing, as well.
> So changing that is probably a good thing. Naming it "root" is probably
> a good choice as well, as it happens to match
> http://ceph.com/wiki/Custom_data_placement_with_CRUSH.
> 
> As for changing "bucket" to node... a "node" is normally simply a
> physical server (at least in HA terminology, which many potential Ceph
> users will be familiar with), and CRUSH uses "host" for that. So that's
> another recipe for confusion. How about using something super-generic,
> like "element" or "item"?
> 
> Cheers,
> Florian

My guess is that he is trying to use data structure tree nomenclature (root, 
node, leaf). I agree that node is an overloaded term (as is pool).

As for an alternative to bucket which indicates the item is a collection, what 
about subtree or branch?

Scott


Re: [PATCH] libceph: Fix sparse warning

2012-08-22 Thread Sage Weil
On Wed, 22 Aug 2012, Daniel Baluta wrote:
> On Tue, Aug 14, 2012 at 4:27 PM, Iulius Curt  wrote:
> > From: Iulius Curt 
> >
> > Make ceph_monc_do_poolop() static to remove the following sparse warning:
> >  * net/ceph/mon_client.c:616:5: warning: symbol 'ceph_monc_do_poolop' was 
> > not
> >declared. Should it be static?
> >
> > Signed-off-by: Iulius Curt 
> > ---
> >  net/ceph/mon_client.c |2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/net/ceph/mon_client.c b/net/ceph/mon_client.c
> > index 105d533..3875c60 100644
> > --- a/net/ceph/mon_client.c
> > +++ b/net/ceph/mon_client.c
> > @@ -613,7 +613,7 @@ bad:
> >  /*
> >   * Do a synchronous pool op.
> >   */
> > -int ceph_monc_do_poolop(struct ceph_mon_client *monc, u32 op,
> > +static int ceph_monc_do_poolop(struct ceph_mon_client *monc, u32 op,
> > u32 pool, u64 snapid,
> > char *buf, int len)
> >  {
> > --
> > 1.7.9.5
> >
> > --
> 
> Hi Sage,
> 
> Can you have a look on this? :)

Sorry, this one fell through the cracks.  Yes, we can switch it to static, 
but while we're doing that let's drop the ceph_monc_ prefix too (since 
it's private).

Thanks!
sage


Re: Ideal hardware spec?

2012-08-22 Thread Jonathan Proulx
On Wed, Aug 22, 2012 at 04:17:23PM +0200, Wido den Hollander wrote:

:On 08/22/2012 03:55 PM, Jonathan Proulx wrote:

:You can also use the USB sticks[0] from Stec, they have servergrade
:onboard USB sticks for these kind of applications.

Those look quite interesting.

:A couple of questions still need to be answered though:
:* Which OS are you planning on using? Ubuntu 12.04 is recommended

Ubuntu 12.04 is our current preferred OS

:* Which filesystem do you want to use underneath the OSDs?

Whatever I can get to work best in testing :)

Since this is for a research platform, not a product, I'd likely start with
BTRFS and see if it is "stable enough" and "performant enough", falling
back to XFS if needed.

-Jon

:Wido
:
:[0]: http://www.stec-inc.com/product/ufm.php


Re: ceph osd create

2012-08-22 Thread Sage Weil
On Tue, 21 Aug 2012, Mandell Degerness wrote:
> Found it (digging through the source code to find a guess, since it is
> in no way obvious):  --osd-uuid 

Whoops, sorry, yeah.  It appeared in 0.47.

sage

> 
> On Tue, Aug 21, 2012 at 4:38 PM, Mandell Degerness
>  wrote:
> > Thanks, Sage.  This is what I was looking for, but what version of
> > ceph do I need for this to work (it isn't there in Argonaut)?  See
> > below:
> >
> > # ceph-osd -c /etc/ceph/ceph.conf --fsid
> > 8296cc23-9c11-44d7-84c1-16866ef9c4f7 -i 50 --mkfs --osd-fsid
> > e1097bd8-c931-4e2e-8ccb-332a954adace
> >   --conf/-c        Read configuration from the given configuration file
> >   -d   Run in foreground, log to stderr.
> >   -f   Run in foreground, log to usual location.
> >   --id/-i  set ID portion of my name
> >   --name/-n        set name (TYPE.ID)
> >   --versionshow version and quit
> >
> >   --debug_ms N
> > set message debug level (e.g. 1)
> > 2012-08-21 23:26:50.774858 7f1be9ac1780 -1 unrecognized arg --osd-fsid
> > 2012-08-21 23:26:50.774864 7f1be9ac1780 -1 usage: ceph-osd -i osdid
> > [--osd-data=path] [--osd-journal=path] [--mkfs] [--mkjournal]
> > [--convert-filestore]
> > 2012-08-21 23:26:50.774915 7f1be9ac1780 -1--debug_osd N   set
> > debug level (e.g. 10)
> >
> > # ceph --version
> > ceph version 0.48.1argonaut 
> > (commit:a7ad701b9bd479f20429f19e6fea7373ca6bba7c)
> >
> > Tommi - Thank you for the suggestion of ceph-disk-prepare and
> > ceph-disk-activate, but they work at too high of a level for our
> > usage.  We need finer control of the block devices.
> >
> > Regards,
> > Mandell Degerness
> >
> > On Tue, Aug 21, 2012 at 11:15 AM, Sage Weil  wrote:
> >> On Tue, 21 Aug 2012, Mandell Degerness wrote:
> >>> OK.  I think I'm getting there.
> >>>
> >>> I want to be able to generate the fsid to be used in the OSD (from the
> >>> file system fsid, if that matters).  Is there a way to inject the fsid
> >>> when initializing the OSD directory?  It doesn't seem to be
> >>> documented.  The alternative would require that we mount the OSD in a
> >>> temp dir to read the fsid file, determine the OSD number, and then
> >>> re-mount it where it belongs, which seems the wrong way to go.
> >>
> >> You can feed in the fsid to ceph-osd --mkfs with --osd-fsid .
> >>
> >> sage
> >>
> >>>
> >>> Regards,
> >>> Mandell Degerness
> >>>
> >>> On Mon, Aug 20, 2012 at 4:26 PM, Tommi Virtanen  wrote:
> >>> > On Mon, Aug 20, 2012 at 3:53 PM, Mandell Degerness
> >>> >  wrote:
> >>> >> We're running Argonaut and it only has the OSD id in the whoami file
> >>> >> and nothing else.
> >>> >
> >>> > My bad, I meant the file "fsid" (note, not "ceph_fsid").


Re: wip-crush

2012-08-22 Thread Florian Haas
On 08/22/2012 03:10 AM, Sage Weil wrote:
> I pushed a branch that changes some of the crush terminology.  Instead of 
> having a crush type called "pool" that requires you to say things like 
> "pool=default" in the "ceph osd crush set ..." command, it uses "root" 
> instead.  That hopefully reinforces that it is a tree/hierarchy.
> 
> There is also a patch that changes "bucket" to "node" throughout, since 
> bucket is a term also used by radosgw.
> 
> Thoughts?  I think the main pain in making this transition is that old 
> clusters have maps that have a type 'pool' and new ones won't, and the 
> docs will need to walk people through both...

"pool" in a crushmap being completely unrelated to a RADOS pool is
something that I've heard customers/users report as confusing, as well.
So changing that is probably a good thing. Naming it "root" is probably
a good choice as well, as it happens to match
http://ceph.com/wiki/Custom_data_placement_with_CRUSH.

As for changing "bucket" to node... a "node" is normally simply a
physical server (at least in HA terminology, which many potential Ceph
users will be familiar with), and CRUSH uses "host" for that. So that's
another recipe for confusion. How about using something super-generic,
like "element" or "item"?

Cheers,
Florian



Re: Ideal hardware spec?

2012-08-22 Thread Mark Nelson

On 08/22/2012 08:55 AM, Jonathan Proulx wrote:

Hi All,


Hi Jonathan!



Yes I'm asking the impossible question, what is the "best" hardware
config.


That is the impossible question. :)



I'm looking at (possibly) using ceph as backing store for images and
volumes on OpenStack as well as exposing at least the object store for
direct use.

The openstack cluster exists and is currently in the early stages of
use by researchers here, approx 1500 vCPU (counts hyperthreads
actually 768 physical cores) and 3T of RAM across 64 physical nodes.

On the object store side it would be a new resource for us and hard to
say what people would do with it except that it would be many
different things and the use profile would be constantly changing
(which is true of all our existing storage).

In this sense, even though it's a "private cloud" the somewhat
unpredictable usage profile gives it some characteristics of a small
public cloud.

Size wise I'm hoping to start out with 3 monitors  and  5(+) OSD nodes
to end up with a 20-30T 3x replicated storage (call me paranoid).

So the monitor specs seem relatively easy to come up with.  For the
OSDs it looks like
http://ceph.com/docs/master/install/hardware-recommendations suggests
1 drive, 1 core and  2G RAM per OSD (with multiple OSDs per storage
node).  On list discussions seem to frequently include an SSD for
journaling (which is similar to what we do for our current ZFS back
NFS storage).

I'm hoping to wrap the hardware in a grant and willing to experiment a
bit with different software configurations to tune it up when/if I get
the hardware in.  So my imediate concern is a hardware spec that will
ahve a reasonable processor:memory:disk ratio and opinions (or better
data) on the utility of SSD.


Before I joined up with Inktank, I was prototyping a private openstack 
cloud for HPC applications at a supercomputing site.  We similarly were 
pursuing grant funding.  I know how it goes!




First is the documented core to disk ratio still current best
practice?  Given a platform with more drive slots could 8 cores handle
more disks? Would that need/like more memory?


The big thing is the CPU and memory needed during recovery.  During 
standard operation you shouldn't be pushing the CPU too hard unless you 
are really pushing data through fast and have many drives per node, or 
have severely underspecced the CPU.


Given that you are only shooting for around 90TB of space across 5+ osd 
nodes, you should be able to get away with 12 2TB+ drive 2U boxes. 
That's probably the closest thing we have right now to a "standard" 
configuration.  We use a single 6-core 2.8GHz AMD Opteron chip in each 
node with 16GB of memory.  It might be worth bumping that up to 24-32GB 
of memory for very large deployments with lots of OSDs.


In terms of controller we are using Dell H700 cards which are similar to 
LSI 9260s, but I think there is a good chance that it may actually be 
better to use H200s (ie LSI 9211-8i or similar) with the IT/JBOD mode 
firmware.  That's one of the commonly used cards in ZFS builds too and 
has a pretty good reputation.


I've actually got a supermicro SC847a chassis and a whole bunch of 
various SATA/SAS/RAID controllers I'm testing now in different 
configurations.  Hopefully I should have some data soon.  For now, our 
best tested configuration is with 12 drive nodes.  Smaller 1U nodes may 
be an option as well, but not very dense.




Have SSD been shown to speed performance with this architecture?


Yes, but in different ways depending on how you use them.  SSDs for data 
storage tend to help mitigate some of the seek behavior issues we've 
seen on the filestore.  This isn't really a reasonable solution for a 
lot of people though.


In terms of the journal, the biggest benefit that SSDs provide is high 
throughput, so you can load multiple journals onto 1 SSD and cram more 
OSDs into one box.  Depending on how much you trust your SSDs, you could 
try either a 10 disk + 2 SSD or a 9 disk + SSD configuration.  Keep in 
mind that this will be writing a lot of data to the SSDs, so you should 
try to undersubscribe them to lengthen the lifespan.  For testing I'm 
doing 3 journals per 180GB Intel 520 SSD.
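
In ceph.conf terms that kind of layout is just several OSDs whose journals 
point at partitions of the same SSD (host and device names here are made up):

[osd.0]
host = storage-a
osd journal = /dev/sdm1

[osd.1]
host = storage-a
osd journal = /dev/sdm2

[osd.2]
host = storage-a
osd journal = /dev/sdm3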




If so, given the 8 drive slot example with seven OSDs presented in the
docs, what is the likelihood that using a high performance SSD for the
OS image, and also cutting journal/log partitions out of it for the
remaining 7 2-3T nearline SAS drives, would work well?


Just keep in mind that in this case your total throughput will likely 
be limited by the SSD unless you get a very fast one (or are using 1GbE 
or have some other bottleneck).
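
As a rough back-of-envelope check (numbers are only illustrative): three 
7200rpm drives can each stream on the order of 100-130 MB/s, so their 
journals together may ask the SSD for 300-400 MB/s of sequential writes, 
which is at or beyond what many SATA SSDs sustain.  On 1GbE, though, the 
network already caps incoming writes at roughly 110-120 MB/s, so there the 
SSD is unlikely to be the limit.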




Thanks,
-Jon

RE: Ideal hardware spec?

2012-08-22 Thread Stephen Perkins
Hi all,

Is there a place we can set up a group of hardware recipes that people can
query and modify over time?  It would be good if people could submit and
"group modify" the recipes.   I would envision "hypothetical" configurations
and "deployed/tested" configurations.  

Trekking back through email exchanges like this becomes hard for people who
join later.

I'd like to see a "best" hardware config as well... however, I'm interested
in a SAS switching fabric where the nodes do not have any storage (except
possibly onboard boot drive/USB as listed below).  Each node would have a
SAS HBA that allows it to access a LARGE JBOD provided by an HA set of SAS
Switches (http://www.lsi.com/solutions/Pages/SwitchedSAS.aspx). The drives
are lun masked for each host.

The thought here is that you can add compute nodes, storage shelves, and
disks all independently.  With proper masking, you could provide redundancy
to cover drive, node, and shelf failures. You could also add disks
"horizontally" if you have spare slots in a shelf, and you could add shelves
"vertically" and increase the disk count available to existing nodes.

My goal is to be able to scale without having to draw the enormous power of
lots of 1U devices or buy lots of disks and shelves each time I want to
add a little capacity.

Anybody looked at Atom processors?

- Steve

-Original Message-
From: ceph-devel-ow...@vger.kernel.org
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Wido den Hollander
Sent: Wednesday, August 22, 2012 9:17 AM
To: Jonathan Proulx
Cc: ceph-devel@vger.kernel.org
Subject: Re: Ideal hardware spec?

Hi,

On 08/22/2012 03:55 PM, Jonathan Proulx wrote:
> Hi All,
>
> Yes I'm asking the impossible question, what is the "best" hardware 
> config.
>
> I'm looking at (possibly) using ceph as backing store for images and 
> volumes on OpenStack as well as exposing at least the object store for 
> direct use.
>
> The openstack cluster exists and is currently in the early stages of 
> use by researchers here, approx 1500 vCPU (counting hyperthreads; 
> actually 768 physical cores) and 3T of RAM across 64 physical nodes.
>
> On the object store side it would be a new resource for us and hard to 
> say what people would do with it except that it would be many 
> different things and the use profile would be constantly changing 
> (which is true of all our existing storage).
>
> In this sense, even though it's a "private cloud" the somewhat 
> unpredictable usage profile gives it some characteristics of a small 
> public cloud.
>
> Size wise I'm hoping to start out with 3 monitors  and  5(+) OSD nodes 
> to end up with a 20-30T 3x replicated storage (call me paranoid).
>

I prefer 3x replication as well. I've seen the "wrong" OSDs die on me too
often.

> So the monitor specs seem relatively easy to come up with.  For the 
> OSDs it looks like 
> http://ceph.com/docs/master/install/hardware-recommendations suggests
> 1 drive, 1 core and  2G RAM per OSD (with multiple OSDs per storage 
> node).  On-list discussions seem to frequently include an SSD for 
> journaling (which is similar to what we do for our current ZFS-backed 
> NFS storage).
>
> I'm hoping to wrap the hardware in a grant and am willing to experiment a 
> bit with different software configurations to tune it up when/if I get 
> the hardware in.  So my immediate concern is a hardware spec that will 
> have a reasonable processor:memory:disk ratio and opinions (or better
> data) on the utility of SSD.
>
> First, is the documented core-to-disk ratio still current best 
> practice?  Given a platform with more drive slots, could 8 cores handle 
> more disks? Would that need/like more memory?
>

I'd still suggest about 2GB of RAM per OSD. The more RAM you have in the OSD
machines, the more the kernel can buffer, which will always be a performance
gain.

You should, however, ask yourself whether you want a lot of OSDs per
server or would rather go for smaller machines with fewer disks.

For example

- 1U
- 4 cores
- 8GB RAM
- 4 disks
- 1 SSD

Or

- 2U
- 8 cores
- 16GB RAM
- 8 disks
- 1|2 SSDs

Both will give you the same amount of storage, but the impact of losing one
physical machine will be larger with the 2U machine.

If you take 1TB disks you'd lose 8TB of storage; that is a lot of recovery
to be done.

Since btrfs (assuming you are going to use that) is still in development,
you can't rule out a machine going down due to a kernel panic or other
problems.

My personal preference is having multiple small(er) machines rather than a
couple of large machines.

> Have SSDs been shown to speed performance with this architecture?
>

I've seen an improvement in performance indeed. Make sure, however, that you
have a recent version of glibc with syncfs support.

> If so given the 8 drive slot example with seven OSDs presented in the 
> docs what is the likelihood that using a high-performance SSD for the 
> OS image and also cutting journal/log partitions out of it for the 
> remaining 7 2-3T near-line SAS drives?

Re: Ideal hardware spec?

2012-08-22 Thread Wido den Hollander

Hi,

On 08/22/2012 03:55 PM, Jonathan Proulx wrote:

Hi All,

Yes I'm asking the impossible question, what is the "best" hardware
config.

I'm looking at (possibly) using ceph as backing store for images and
volumes on OpenStack as well as exposing at least the object store for
direct use.

The openstack cluster exists and is currently in the early stages of
use by researchers here, approx 1500 vCPU (counting hyperthreads;
actually 768 physical cores) and 3T of RAM across 64 physical nodes.

On the object store side it would be a new resource for us and hard to
say what people would do with it except that it would be many
different things and the use profile would be constantly changing
(which is true of all our existing storage).

In this sense, even though it's a "private cloud" the somewhat
unpredictable usage profile gives it some characteristics of a small
public cloud.

Size wise I'm hoping to start out with 3 monitors  and  5(+) OSD nodes
to end up with a 20-30T 3x replicated storage (call me paranoid).



I prefer 3x replication as well. I've seen the "wrong" OSDs die on me 
too often.
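
If you go that route, replication is just a per-pool setting, e.g. 
(sketch; "rbd" is only an example pool name):

ceph osd pool set rbd size 3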



So the monitor specs seem relatively easy to come up with.  For the
OSDs it looks like
http://ceph.com/docs/master/install/hardware-recommendations suggests
1 drive, 1 core and  2G RAM per OSD (with multiple OSDs per storage
node).  On-list discussions seem to frequently include an SSD for
journaling (which is similar to what we do for our current ZFS-backed
NFS storage).

I'm hoping to wrap the hardware in a grant and am willing to experiment a
bit with different software configurations to tune it up when/if I get
the hardware in.  So my immediate concern is a hardware spec that will
have a reasonable processor:memory:disk ratio and opinions (or better
data) on the utility of SSD.

First, is the documented core-to-disk ratio still current best
practice?  Given a platform with more drive slots, could 8 cores handle
more disks? Would that need/like more memory?



I'd still suggest about 2GB of RAM per OSD. The more RAM you have in the 
OSD machines, the more the kernel can buffer, which will always be a 
performance gain.


You should, however, ask yourself whether you want a lot of OSDs 
per server or would rather go for smaller machines with fewer disks.


For example

- 1U
- 4 cores
- 8GB RAM
- 4 disks
- 1 SSD

Or

- 2U
- 8 cores
- 16GB RAM
- 8 disks
- 1|2 SSDs

Both will give you the same amount of storage, but the impact of losing 
one physical machine will be larger with the 2U machine.


If you take 1TB disks you'd lose 8TB of storage; that is a lot of 
recovery to be done.


Since btrfs (assuming you are going to use that) is still in development, 
you can't rule out a machine going down due to a kernel panic or 
other problems.


My personal preference is having multiple small(er) machines rather than a 
couple of large machines.



Have SSDs been shown to speed performance with this architecture?



I've seen an improvement in performance indeed. Make sure, however, that you 
have a recent version of glibc with syncfs support.
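
If I remember correctly syncfs needs a reasonably new kernel (2.6.39+) and 
glibc 2.14 or newer, so a quick sanity check on each OSD box would be 
something like:

uname -r
ldd --version | head -n1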



If so given the 8 drive slot example with seven OSDs presented in the
docs what is the likelihood that using a high-performance SSD for the
OS image and also cutting journal/log partitions out of it for the
remaining 7 2-3T near-line SAS drives?



You should make sure your SSD is capable of matching the line speed of your 
network.


If you are connecting the machines with 4G trunks, make sure the SSD is 
capable of doing around 400MB/sec of sustained writes.


I'd recommend the Intel 520 SSDs and changing their available capacity 
with hdparm to about 20% of their original capacity. This way the SSD 
always has a lot of free cells available for writing. Reprogramming 
cells is expensive on an SSD.
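
For example (sketch only; the sector count is made up, and -N hides 
everything beyond the new limit, so do this before partitioning the drive):

hdparm -N p70000000 /dev/sdb   # permanently shrink the visible capacity (HPA)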


You can run the OS on the same SSD since that won't do that much I/O. 
I'd recommend not logging locally though, since that will also write to 
the same SSD. Try using remote syslog.
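
With rsyslog that can be as simple as a one-liner (sketch; the log host is 
hypothetical):

*.* @loghost.example.com:514   # e.g. in /etc/rsyslog.d/remote.conf, forwards via UDP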


You can also use the USB sticks[0] from Stec; they have server-grade 
onboard USB sticks for this kind of application.


A couple of questions still need to be answered though:
* Which OS are you planning on using? Ubuntu 12.04 is recommended
* Which filesystem do you want to use underneath the OSDs?

Wido

[0]: http://www.stec-inc.com/product/ufm.php


Thanks,
-Jon
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html





Ideal hardware spec?

2012-08-22 Thread Jonathan Proulx
Hi All,

Yes I'm asking the impossible question, what is the "best" hardware
config.

I'm looking at (possibly) using ceph as backing store for images and
volumes on OpenStack as well as exposing at least the object store for
direct use.  

The openstack cluster exists and is currently in the early stages of
use by researchers here, approx 1500 vCPU (counting hyperthreads;
actually 768 physical cores) and 3T of RAM across 64 physical nodes.

On the object store side it would be a new resource for us and hard to
say what people would do with it except that it would be many
different things and the use profile would be constantly changing
(which is true of all our existing storage).

In this sense, even though it's a "private cloud" the somewhat
unpredictable usage profile gives it some characteristics of a small
public cloud.

Size wise I'm hoping to start out with 3 monitors  and  5(+) OSD nodes
to end up with a 20-30T 3x replicated storage (call me paranoid).

So the monitor specs seem relatively easy to come up with.  For the
OSDs it looks like
http://ceph.com/docs/master/install/hardware-recommendations suggests
1 drive, 1 core and  2G RAM per OSD (with multiple OSDs per storage
node).  On-list discussions seem to frequently include an SSD for
journaling (which is similar to what we do for our current ZFS-backed
NFS storage).

I'm hoping to wrap the hardware in a grant and am willing to experiment a
bit with different software configurations to tune it up when/if I get
the hardware in.  So my immediate concern is a hardware spec that will
have a reasonable processor:memory:disk ratio and opinions (or better
data) on the utility of SSD.

First, is the documented core-to-disk ratio still current best
practice?  Given a platform with more drive slots, could 8 cores handle
more disks? Would that need/like more memory?

Have SSDs been shown to speed performance with this architecture?

If so given the 8 drive slot example with seven OSDs presented in the
docs what is the likelihood that using a high-performance SSD for the
OS image and also cutting journal/log partitions out of it for the
remaining 7 2-3T near-line SAS drives?

Thanks,
-Jon
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Ceph performance improvement

2012-08-22 Thread Alexandre DERUMIER
>>Not sure what version of glibc Wheezy has, but try to make sure you have 
>>one that supports syncfs (you'll also need a semi-new kernel, 3.0+ 
>>should be fine). 

Hi, glibc from Wheezy doesn't have syncfs support.

- Original Message - 

From: "Mark Nelson" 
To: "Denis Fondras" 
Cc: ceph-devel@vger.kernel.org 
Sent: Wednesday, 22 August 2012 14:35:28 
Subject: Re: Ceph performance improvement 

On 08/22/2012 03:54 AM, Denis Fondras wrote: 
> Hello all, 

Hello! 

David had some good comments in his reply, so I'll just add in a couple 
of extra thoughts... 

> 
> I'm currently testing Ceph. So far it seems that HA and recovery are 
> very good. 
> The only point that prevents me from using it at datacenter-scale is 
> performance. 
> 
> First of all, here is my setup : 
> - 1 OSD/MDS/MON on a Supermicro X9DR3-F/X9DR3-F (1x Intel Xeon E5-2603 - 
> 4 cores and 8GB RAM) running Debian Sid/Wheezy and Ceph version 0.49 

Not sure what version of glibc Wheezy has, but try to make sure you have 
one that supports syncfs (you'll also need a semi-new kernel, 3.0+ 
should be fine). 

> (commit:ca6265d0f4d68a5eb82b5bfafb450e8e696633ac). It has 1x 320GB drive 
> for the system, 1x 64GB SSD (Crucial C300 - /dev/sda) for the journal 
> and 4x 3TB drive (Western Digital WD30EZRX). Everything but the boot 
> partition is BTRFS-formated and 4K-aligned. 
> - 1 client (P4 3.00GHz dual-core, 1GB RAM) running Debian Sid/Wheezy and 
> Ceph version 0.49 (commit:ca6265d0f4d68a5eb82b5bfafb450e8e696633ac). 
> Both servers are linked over a 1Gb Ethernet switch (iperf shows about 
> 960Mb/s). 
> 
> Here is my ceph.conf : 
> --cut-here-- 
> [global] 
> auth supported = cephx 
> keyring = /etc/ceph/keyring 
> journal dio = true 
> osd op threads = 24 
> osd disk threads = 24 
> filestore op threads = 6 
> filestore queue max ops = 24 
> osd client message size cap = 1400 
> ms dispatch throttle bytes = 1750 
> 

Default values are quite a bit lower for most of these. You may want to 
play with them and see if that has an effect. 

> [mon] 
> mon data = /home/mon.$id 
> keyring = /etc/ceph/keyring.$name 
> 
> [mon.a] 
> host = ceph-osd-0 
> mon addr = 192.168.0.132:6789 
> 
> [mds] 
> keyring = /etc/ceph/keyring.$name 
> 
> [mds.a] 
> host = ceph-osd-0 
> 
> [osd] 
> osd data = /home/osd.$id 
> osd journal = /home/osd.$id.journal 
> osd journal size = 1000 
> keyring = /etc/ceph/keyring.$name 
> 
> [osd.0] 
> host = ceph-osd-0 
> btrfs devs = /dev/disk/by-id/scsi-SATA_WDC_WD30EZRX-00_WD-WMAWZ0152201 
> btrfs options = rw,noatime 

Just fyi, we are trying to get away from btrfs devs. 

> --cut-here-- 
> 
> Here are some figures : 
> * Test with "dd" on the OSD server (on drive 
> /dev/disk/by-id/scsi-SATA_WDC_WD30EZRX-00_WD-WMAWZ0152201) : 
> # dd if=/dev/zero of=testdd bs=4k count=4M 
> 17179869184 bytes (17 GB) written, 123,746 s, 139 MB/s 

Good job using a data file that is much bigger than main memory! That 
looks pretty accurate for a 7200rpm spinning disk. For dd benchmarks, 
you should probably throw in conv=fdatasync at the end though. 

> 
> => iostat (on the OSD server) : 
> avg-cpu: %user %nice %system %iowait %steal %idle 
> 0,00 0,00 0,52 41,99 0,00 57,48 
> 
> Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn 
> sdf 247,00 0,00 125520,00 0 125520 
> 
> * Test with unpacking and deleting OpenBSD/5.1 src.tar.gz to the OSD 
> server (on drive 
> /dev/disk/by-id/scsi-SATA_WDC_WD30EZRX-00_WD-WMAWZ0152201) : 
> # time tar xzf src.tar.gz 
> real 0m9.669s 
> user 0m8.405s 
> sys 0m4.736s 
> 
> # time rm -rf * 
> real 0m3.647s 
> user 0m0.036s 
> sys 0m3.552s 
> 
> => iostat (on the OSD server) : 
> avg-cpu: %user %nice %system %iowait %steal %idle 
> 10,83 0,00 28,72 16,62 0,00 43,83 
> 
> Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn 
> sdf 1369,00 0,00 9300,00 0 9300 
> 
> * Test with "dd" from the client using RBD : 
> # dd if=/dev/zero of=testdd bs=4k count=4M 
> 17179869184 bytes (17 GB) written, 406,941 s, 42,2 MB/s 

RBD caching should definitely be enabled for a test like this. I'd be 
surprised if you got 42MB/s without it though... 

> 
> => iostat (on the OSD server) : 
> avg-cpu: %user %nice %system %iowait %steal %idle 
> 4,57 0,00 30,46 27,66 0,00 37,31 
> 
> Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn 
> sda 317,00 0,00 57400,00 0 57400 
> sdf 237,00 0,00 88336,00 0 88336 
> 
> * Test with unpacking and deleting OpenBSD/5.1 src.tar.gz from the 
> client using RBD : 
> # time tar xzf src.tar.gz 
> real 0m26.955s 
> user 0m9.233s 
> sys 0m11.425s 
> 
> # time rm -rf * 
> real 0m8.545s 
> user 0m0.128s 
> sys 0m8.297s 
> 
> => iostat (on the OSD server) : 
> avg-cpu: %user %nice %system %iowait %steal %idle 
> 4,59 0,00 24,74 30,61 0,00 40,05 
> 
> Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn 
> sda 239,00 0,00 54772,00 0 54772 
> sdf 441,00 0,00 50836,00 0 50836 
> 
> * Test with "dd" from the client using CephFS : 
> # dd if=/dev/zero of=testdd bs=4k count=4M 
> 17179869184 bytes (17 G

Re: Ceph performance improvement

2012-08-22 Thread Mark Nelson

On 08/22/2012 03:54 AM, Denis Fondras wrote:

Hello all,


Hello!

David had some good comments in his reply, so I'll just add in a couple 
of extra thoughts...




I'm currently testing Ceph. So far it seems that HA and recovery are
very good.
The only point that prevents me from using it at datacenter-scale is
performance.

First of all, here is my setup :
- 1 OSD/MDS/MON on a Supermicro X9DR3-F/X9DR3-F (1x Intel Xeon E5-2603 -
4 cores and 8GB RAM) running Debian Sid/Wheezy and Ceph version 0.49


Not sure what version of glibc Wheezy has, but try to make sure you have 
one that supports syncfs (you'll also need a semi-new kernel, 3.0+ 
should be fine).



(commit:ca6265d0f4d68a5eb82b5bfafb450e8e696633ac). It has 1x 320GB drive
for the system, 1x 64GB SSD (Crucial C300 - /dev/sda) for the journal
and 4x 3TB drive (Western Digital WD30EZRX). Everything but the boot
partition is BTRFS-formated and 4K-aligned.
- 1 client (P4 3.00GHz dual-core, 1GB RAM) running Debian Sid/Wheezy and
Ceph version 0.49 (commit:ca6265d0f4d68a5eb82b5bfafb450e8e696633ac).
Both servers are linked over a 1Gb Ethernet switch (iperf shows about
960Mb/s).

Here is my ceph.conf :
--cut-here--
[global]
auth supported = cephx
keyring = /etc/ceph/keyring
journal dio = true
osd op threads = 24
osd disk threads = 24
filestore op threads = 6
filestore queue max ops = 24
osd client message size cap = 1400
ms dispatch throttle bytes = 1750



Default values are quite a bit lower for most of these.  You may want to 
play with them and see if that has an effect.



[mon]
mon data = /home/mon.$id
keyring = /etc/ceph/keyring.$name

[mon.a]
host = ceph-osd-0
mon addr = 192.168.0.132:6789

[mds]
keyring = /etc/ceph/keyring.$name

[mds.a]
host = ceph-osd-0

[osd]
osd data = /home/osd.$id
osd journal = /home/osd.$id.journal
osd journal size = 1000
keyring = /etc/ceph/keyring.$name

[osd.0]
host = ceph-osd-0
btrfs devs = /dev/disk/by-id/scsi-SATA_WDC_WD30EZRX-00_WD-WMAWZ0152201
btrfs options = rw,noatime


Just fyi, we are trying to get away from btrfs devs.


--cut-here--

Here are some figures :
* Test with "dd" on the OSD server (on drive
/dev/disk/by-id/scsi-SATA_WDC_WD30EZRX-00_WD-WMAWZ0152201) :
# dd if=/dev/zero of=testdd bs=4k count=4M
17179869184 bytes (17 GB) written, 123,746 s, 139 MB/s


Good job using a data file that is much bigger than main memory! That 
looks pretty accurate for a 7200rpm spinning disk.  For dd benchmarks, 
you should probably throw in conv=fdatasync at the end though.
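
i.e. something like (sketch):

dd if=/dev/zero of=testdd bs=4k count=4M conv=fdatasync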




=> iostat (on the OSD server) :
avg-cpu: %user %nice %system %iowait %steal %idle
0,00 0,00 0,52 41,99 0,00 57,48

Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sdf 247,00 0,00 125520,00 0 125520

* Test with unpacking and deleting OpenBSD/5.1 src.tar.gz to the OSD
server (on drive
/dev/disk/by-id/scsi-SATA_WDC_WD30EZRX-00_WD-WMAWZ0152201) :
# time tar xzf src.tar.gz
real 0m9.669s
user 0m8.405s
sys 0m4.736s

# time rm -rf *
real 0m3.647s
user 0m0.036s
sys 0m3.552s

=> iostat (on the OSD server) :
avg-cpu: %user %nice %system %iowait %steal %idle
10,83 0,00 28,72 16,62 0,00 43,83

Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sdf 1369,00 0,00 9300,00 0 9300

* Test with "dd" from the client using RBD :
# dd if=/dev/zero of=testdd bs=4k count=4M
17179869184 bytes (17 GB) written, 406,941 s, 42,2 MB/s


RBD caching should definitely be enabled for a test like this.  I'd be 
surprised if you got 42MB/s without it though...
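
If memory serves, enabling it is a client-side option in ceph.conf, e.g. 
(sketch):

[client]
rbd cache = true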




=> iostat (on the OSD server) :
avg-cpu: %user %nice %system %iowait %steal %idle
4,57 0,00 30,46 27,66 0,00 37,31

Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 317,00 0,00 57400,00 0 57400
sdf 237,00 0,00 88336,00 0 88336

* Test with unpacking and deleting OpenBSD/5.1 src.tar.gz from the
client using RBD :
# time tar xzf src.tar.gz
real 0m26.955s
user 0m9.233s
sys 0m11.425s

# time rm -rf *
real 0m8.545s
user 0m0.128s
sys 0m8.297s

=> iostat (on the OSD server) :
avg-cpu: %user %nice %system %iowait %steal %idle
4,59 0,00 24,74 30,61 0,00 40,05

Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 239,00 0,00 54772,00 0 54772
sdf 441,00 0,00 50836,00 0 50836

* Test with "dd" from the client using CephFS :
# dd if=/dev/zero of=testdd bs=4k count=4M
17179869184 bytes (17 GB) written, 338,29 s, 50,8 MB/s

=> iostat (on the OSD server) :
avg-cpu: %user %nice %system %iowait %steal %idle
2,26 0,00 20,30 27,07 0,00 50,38

Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 710,00 0,00 58836,00 0 58836
sdf 722,00 0,00 32768,00 0 32768


* Test with unpacking and deleting OpenBSD/5.1 src.tar.gz from the
client using CephFS :
# time tar xzf src.tar.gz
real 3m55.260s
user 0m8.721s
sys 0m11.461s



Ouch, that's taking a while!  In addition to the comments that David 
made, be aware that you are also testing the metadata server with 
CephFS.  Right now that's not getting a lot of attention as we are 
primarily focusing on RADOS performance at the moment.  For this kind of 
test though, distributed filesyste

Re: Ceph performance improvement

2012-08-22 Thread Denis Fondras

Thank you for the answer David.



That looks like you're writing to a filesystem on that disk, rather than
the block device itself -- but let's say you've got 139MB/sec
(1112Mbit/sec) of straight-line performance.

Note: this is already faster than your network link can go -- you can,
at best, only achieve 120MB/sec over your gigabit link.



Yes, I am aware of that; I can't get more than the GbE link. However, I 
mentioned this to show that the disk should not be a bottleneck.




Is this a dd to the RBD device directly, or is this a write to a file in
a filesystem created on top of it?



The RBD device is mounted and formatted with BTRFS.


dd will write blocks synchronously -- that is, it will write one block,
wait for the write to complete, then write the next block, and so on.
Because of the durability guarantees provided by ceph, this will result
in dd doing a lot of waiting around while writes are being sent over the
network and written out on your OSD.



Thank you for that information.


(If you're using the default replication count of 2, probably twice? I'm
not exactly sure what Ceph does when it only has one OSD to work on..?)



I don't know exactly how it behaves, but "ceph -s" says the cluster is 
degraded at 50%. Adding a second OSD allows Ceph to replicate.




Just ignoring networking and storage for a moment, this also isn't a
fair test: you're comparing the decompress-and-unpack time of a 139MB
tarball on a 3GHz Pentium 4 with 1GB of RAM and a quad-core Xeon E5 that
has 8GB.



That's a very good point! Comparing figures on the same host tells a 
different story (/mnt is the Ceph RBD device) :)


root@ceph-osd-1:/home# time tar xzf ../src.tar.gz && sync

real0m43.668s
user0m9.649s
sys 0m20.897s

root@ceph-osd-1:/mnt# time tar xzf ../src.tar.gz && sync

real0m38.022s
user0m9.101s
sys 0m11.265s

Thank you again,
Denis
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Ceph performance improvement

2012-08-22 Thread David McBride

On 22/08/12 09:54, Denis Fondras wrote:


The only point that prevents me from using it at datacenter-scale is
performance.



Here are some figures :
* Test with "dd" on the OSD server (on drive
/dev/disk/by-id/scsi-SATA_WDC_WD30EZRX-00_WD-WMAWZ0152201) :
# dd if=/dev/zero of=testdd bs=4k count=4M
17179869184 bytes (17 GB) written, 123,746 s, 139 MB/s


That looks like you're writing to a filesystem on that disk, rather than 
the block device itself -- but let's say you've got 139MB/sec 
(1112Mbit/sec) of straight-line performance.


Note: this is already faster than your network link can go -- you can, 
at best, only achieve 120MB/sec over your gigabit link.



* Test with "dd" from the client using RBD :
# dd if=/dev/zero of=testdd bs=4k count=4M
17179869184 bytes (17 GB) written, 406,941 s, 42,2 MB/s


Is this a dd to the RBD device directly, or is this a write to a file in 
a filesystem created on top of it?


dd will write blocks synchronously -- that is, it will write one block, 
wait for the write to complete, then write the next block, and so on. 
Because of the durability guarantees provided by ceph, this will result 
in dd doing a lot of waiting around while writes are being sent over the 
network and written out on your OSD.


(If you're using the default replication count of 2, probably twice? 
I'm not exactly sure what Ceph does when it only has one OSD to work on..?)



* Test with unpacking and deleting OpenBSD/5.1 src.tar.gz from the
client using RBD :
# time tar xzf src.tar.gz
real0m26.955s
user0m9.233s
sys 0m11.425s


Just ignoring networking and storage for a moment, this also isn't a 
fair test: you're comparing the decompress-and-unpack time of a 139MB 
tarball on a 3GHz Pentium 4 with 1GB of RAM and a quad-core Xeon E5 that 
has 8GB.


Even ignoring the relative CPU difference, unless you're doing 
something clever that you haven't described, there's no guarantee that 
the files in the latter case have actually been written to disk -- you 
have enough memory on your server for it to buffer all of those writes 
in RAM.  You'd need to add a sync() call or similar at the end of your 
timing run to ensure that all of those writes have actually been 
committed to disk.
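
i.e. something like (sketch) so the sync is included in the timed run:

time sh -c 'tar xzf src.tar.gz && sync'

(a plain "time tar xzf src.tar.gz && sync" only times the tar, not the sync).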



* Test with "dd" from the client using CephFS :
# dd if=/dev/zero of=testdd bs=4k count=4M
17179869184 bytes (17 GB) written, 338,29 s, 50,8 MB/s


Again, the synchronous nature of 'dd' is probably severely affecting 
apparent performance.  I'd suggest looking at some other tools, like 
fio, bonnie++, or iozone, which might generate more representative load.


(Or, if you have a specific use-case in mind, something that generates 
an IO pattern like what you'll be using in production would be ideal!)
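
For example, a quick fio run roughly equivalent to the dd case might look 
like this (sketch; size and path are illustrative):

fio --name=seqwrite --rw=write --bs=4k --size=16g --direct=1 --filename=/mnt/rbd/testfile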


Cheers,
David
--
David McBride 
Unix Specialist, University Computing Service
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] libceph: Fix sparse warning

2012-08-22 Thread Daniel Baluta
On Tue, Aug 14, 2012 at 4:27 PM, Iulius Curt  wrote:
> From: Iulius Curt 
>
> Make ceph_monc_do_poolop() static to remove the following sparse warning:
>  * net/ceph/mon_client.c:616:5: warning: symbol 'ceph_monc_do_poolop' was not
>declared. Should it be static?
>
> Signed-off-by: Iulius Curt 
> ---
>  net/ceph/mon_client.c |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/ceph/mon_client.c b/net/ceph/mon_client.c
> index 105d533..3875c60 100644
> --- a/net/ceph/mon_client.c
> +++ b/net/ceph/mon_client.c
> @@ -613,7 +613,7 @@ bad:
>  /*
>   * Do a synchronous pool op.
>   */
> -int ceph_monc_do_poolop(struct ceph_mon_client *monc, u32 op,
> +static int ceph_monc_do_poolop(struct ceph_mon_client *monc, u32 op,
> u32 pool, u64 snapid,
> char *buf, int len)
>  {
> --
> 1.7.9.5
>
> --

Hi Sage,

Can you have a look on this? :)

thanks,
Daniel.
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Ceph performance improvement

2012-08-22 Thread Denis Fondras

Hello all,

I'm currently testing Ceph. So far it seems that HA and recovery are 
very good.
The only point that prevents me from using it at datacenter-scale is 
performance.


First of all, here is my setup :
- 1 OSD/MDS/MON on a Supermicro X9DR3-F/X9DR3-F (1x Intel Xeon E5-2603 - 
4 cores and 8GB RAM) running Debian Sid/Wheezy and Ceph version 0.49 
(commit:ca6265d0f4d68a5eb82b5bfafb450e8e696633ac).  It  has 1x 320GB 
drive for the system, 1x 64GB SSD (Crucial C300 - /dev/sda) for the 
journal and 4x 3TB drive (Western Digital WD30EZRX). Everything but the 
boot partition is BTRFS-formated and 4K-aligned.
- 1 client (P4 3.00GHz dual-core, 1GB RAM) running Debian Sid/Wheezy and 
Ceph version 0.49 (commit:ca6265d0f4d68a5eb82b5bfafb450e8e696633ac).
Both servers are linked over a 1Gb Ethernet switch (iperf shows about 
960Mb/s).


Here is my ceph.conf :
--cut-here--
[global]
auth supported = cephx
keyring = /etc/ceph/keyring
journal dio = true
osd op threads = 24
osd disk threads = 24
filestore op threads = 6
filestore queue max ops = 24
osd client message size cap = 1400
ms dispatch throttle bytes =  1750

[mon]
mon data = /home/mon.$id
keyring = /etc/ceph/keyring.$name

[mon.a]
host = ceph-osd-0
mon addr = 192.168.0.132:6789

[mds]
keyring = /etc/ceph/keyring.$name

[mds.a]
host = ceph-osd-0

[osd]
osd data = /home/osd.$id
osd journal = /home/osd.$id.journal
osd journal size = 1000
keyring = /etc/ceph/keyring.$name

[osd.0]
host = ceph-osd-0
btrfs devs = 
/dev/disk/by-id/scsi-SATA_WDC_WD30EZRX-00_WD-WMAWZ0152201

btrfs options = rw,noatime
--cut-here--

Here are some figures :
* Test with "dd" on the OSD server (on drive 
/dev/disk/by-id/scsi-SATA_WDC_WD30EZRX-00_WD-WMAWZ0152201) :

# dd if=/dev/zero of=testdd bs=4k count=4M
17179869184 bytes (17 GB) written, 123,746 s, 139 MB/s

=> iostat (on the OSD server) :
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
   0,000,000,52   41,990,00   57,48

Device:tpskB_read/skB_wrtn/skB_readkB_wrtn
sdf 247,00 0,00125520,00  0 125520

* Test with unpacking and deleting OpenBSD/5.1 src.tar.gz to the OSD 
server (on drive 
/dev/disk/by-id/scsi-SATA_WDC_WD30EZRX-00_WD-WMAWZ0152201) :

# time tar xzf src.tar.gz
real0m9.669s
user0m8.405s
sys 0m4.736s

# time rm -rf *
real0m3.647s
user0m0.036s
sys 0m3.552s

=> iostat (on the OSD server) :
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
  10,830,00   28,72   16,620,00   43,83

Device:tpskB_read/skB_wrtn/skB_readkB_wrtn
sdf1369,00 0,00  9300,00  0   9300

* Test with "dd" from the client using RBD :
# dd if=/dev/zero of=testdd bs=4k count=4M
17179869184 bytes (17 GB) written, 406,941 s, 42,2 MB/s

=> iostat (on the OSD server) :
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
   4,570,00   30,46   27,660,00   37,31

Device:tpskB_read/skB_wrtn/skB_readkB_wrtn
sda 317,00 0,00 57400,00  0  57400
sdf 237,00 0,00 88336,00  0  88336

* Test with unpacking and deleting OpenBSD/5.1 src.tar.gz from the 
client using RBD :

# time tar xzf src.tar.gz
real0m26.955s
user0m9.233s
sys 0m11.425s

# time rm -rf *
real0m8.545s
user0m0.128s
sys 0m8.297s

=> iostat (on the OSD server) :
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
   4,590,00   24,74   30,610,00   40,05

Device:tpskB_read/skB_wrtn/skB_readkB_wrtn
sda 239,00 0,00 54772,00  0  54772
sdf 441,00 0,00 50836,00  0  50836

* Test with "dd" from the client using CephFS :
# dd if=/dev/zero of=testdd bs=4k count=4M
17179869184 bytes (17 GB) written, 338,29 s, 50,8 MB/s

=> iostat (on the OSD server) :
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
   2,260,00   20,30   27,070,00   50,38

Device:tpskB_read/skB_wrtn/skB_readkB_wrtn
sda 710,00 0,00 58836,00  0  58836
sdf 722,00 0,00 32768,00  0  32768


* Test with unpacking and deleting OpenBSD/5.1 src.tar.gz from the 
client using CephFS :

# time tar xzf src.tar.gz
real3m55.260s
user0m8.721s
sys 0m11.461s

# time rm -rf *
real9m2.319s
user0m0.320s
sys 0m4.572s

=> iostat (on the OSD server) :
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
  14,400,00   15,942,310,00   67,35

Device:tpskB_read/skB_wrtn/skB_readkB_wrtn
sda 174,00 0,00 10772,00