Re: Ceph performance improvement
On 22/08/12 22:24, David McBride wrote:
> On 22/08/12 09:54, Denis Fondras wrote:
>> * Test with "dd" from the client using CephFS:
>> # dd if=/dev/zero of=testdd bs=4k count=4M
>> 17179869184 bytes (17 GB) written, 338,29 s, 50,8 MB/s
>
> Again, the synchronous nature of 'dd' is probably severely affecting
> apparent performance. I'd suggest looking at some other tools, like fio,
> bonnie++, or iozone, which might generate more representative load. (Or,
> if you have a specific use-case in mind, something that generates an I/O
> pattern like what you'll be using in production would be ideal!)

Appending conv=fsync to the dd will make the comparison fair enough. Looking at the Ceph code, it does

    sync_file_range(fd, offset, blocksz, SYNC_FILE_RANGE_WRITE);

which is very fast - way faster than fdatasync() and friends (I have tested this; see my previous posting on random write performance, with writetest.c attached).

I am not convinced that these sorts of tests are in any way 'unfair' - for instance, I would like to use rbd for Postgres or MySQL data volumes, and many database actions involve a stream of block writes similar enough to doing dd (e.g. bulk row loads, appends to transaction log journals).

Cheers
Mark
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
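A small, scaled-down sketch of the comparison being discussed (file names and sizes are illustrative, chosen so the commands finish quickly): without conv=fsync, dd mostly measures how fast the page cache absorbs writes; with conv=fsync, dd flushes to stable storage before reporting its timing, which is the closer analogue of Ceph's journal writes.

```shell
# Plain dd: data may still be sitting in the page cache when dd reports its rate.
dd if=/dev/zero of=testdd bs=4k count=1024 2>/dev/null

# With conv=fsync: dd calls fsync() before exiting, so the flush is included
# in the reported timing, making the number comparable to synced storage.
dd if=/dev/zero of=testdd-fsync bs=4k count=1024 conv=fsync 2>/dev/null

# Both files end up the same size; only the timing semantics differ.
wc -c testdd testdd-fsync
```

Timing the two variants (e.g. with time, or dd's own summary line at realistic sizes) exposes exactly the gap the thread is debating.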
Re: OSD crash
The tcmalloc backtrace on the OSD suggests this may be unrelated, but what's the fd limit on your monitor process? You may be approaching that limit if you've got 500 OSDs and a similar number of clients.

On Wed, Aug 22, 2012 at 6:55 PM, Andrey Korolyov wrote:
> On Thu, Aug 23, 2012 at 2:33 AM, Sage Weil wrote:
>> On Thu, 23 Aug 2012, Andrey Korolyov wrote:
>>> Hi,
>>>
>>> today during a heavy test a pair of osds and one mon died, resulting in
>>> a hard lockup of some kvm processes - they went unresponsive and were
>>> killed, leaving zombie processes ([kvm] <defunct>). The entire cluster
>>> contains sixteen osds on eight nodes and three mons, on the first and
>>> last node and on a vm outside the cluster.
>>>
>>> osd bt:
>>> #0  0x7fc37d490be3 in tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int) ()
>>>     from /usr/lib/libtcmalloc.so.4
>>> (gdb) bt
>>> #0  0x7fc37d490be3 in tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int) ()
>>>     from /usr/lib/libtcmalloc.so.4
>>> #1  0x7fc37d490eb4 in tcmalloc::ThreadCache::Scavenge() () from /usr/lib/libtcmalloc.so.4
>>> #2  0x7fc37d4a2287 in tc_delete () from /usr/lib/libtcmalloc.so.4
>>> #3  0x008b1224 in _M_dispose (__a=..., this=0x6266d80) at /usr/include/c++/4.7/bits/basic_string.h:246
>>> #4  ~basic_string (this=0x7fc3736639d0, __in_chrg=<optimized out>) at /usr/include/c++/4.7/bits/basic_string.h:536
>>> #5  ~basic_stringbuf (this=0x7fc373663988, __in_chrg=<optimized out>) at /usr/include/c++/4.7/sstream:60
>>> #6  ~basic_ostringstream (this=0x7fc373663980, __in_chrg=<optimized out>, __vtt_parm=<optimized out>) at /usr/include/c++/4.7/sstream:439
>>> #7  pretty_version_to_str () at common/version.cc:40
>>> #8  0x00791630 in ceph::BackTrace::print (this=0x7fc373663d10, out=...) at common/BackTrace.cc:19
>>> #9  0x0078f450 in handle_fatal_signal (signum=11) at global/signal_handler.cc:91
>>> #10 <signal handler called>
>>> #11 0x7fc37d490be3 in tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int) ()
>>>     from /usr/lib/libtcmalloc.so.4
>>> #12 0x7fc37d490eb4 in tcmalloc::ThreadCache::Scavenge() () from /usr/lib/libtcmalloc.so.4
>>> #13 0x7fc37d49eb97 in tc_free () from /usr/lib/libtcmalloc.so.4
>>> #14 0x7fc37d1c6670 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
>>> #15 0x7fc37d1c4796 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
>>> #16 0x7fc37d1c47c3 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
>>> #17 0x7fc37d1c49ee in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
>>> #18 0x00844e11 in ceph::__ceph_assert_fail (assertion=0x90c01c "0 == \"unexpected error\"", file=<optimized out>, line=3007,
>>>     func=0x90ef80 "unsigned int FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int)") at common/assert.cc:77
>>
>> This means it got an unexpected error when talking to the file system.
>> If you look in the osd log, it may tell you what that was. (It may
>> not--there isn't usually the other tcmalloc stuff triggered from the
>> assert handler.)
>>
>> What happens if you restart that ceph-osd daemon?
>>
>> sage
>
> Unfortunately I had completely disabled logs during the test, so there is
> no hint about the assert_fail. The main problem was revealed: the created
> VMs were pointed at one monitor instead of the set of three, so there may
> be some unusual effects (btw, the crashed mon isn't the one from above,
> but a neighbor of the crashed osds on the first node). After an IPMI
> reset the node came back fine and cluster behavior seems to be okay - the
> stuck kvm I/O somehow prevented even module load/unload on this node, so
> I finally decided to do a hard reset. Although I'm using an almost
> generic wheezy, glibc was updated to 2.15; maybe that is why this trace
> appeared for the first time. I'm almost sure the fs did not trigger this
> crash, and I mainly suspect the stuck kvm processes. I'll rerun the test
> with the same conditions tomorrow (~500 VMs pointed at one mon and very
> high I/O, but with osd logging).
>
>>> #19 0x0073148f in FileStore::_do_transaction (this=this@entry=0x2cde000, t=..., op_seq=op_seq@entry=429545,
>>>     trans_num=trans_num@entry=0) at os/FileStore.cc:3007
>>> #20 0x0073484e in FileStore::do_transactions (this=0x2cde000, tls=..., op_seq=429545) at os/FileStore.cc:2436
>>> #21 0x0070c680 in FileStore::_do_op (this=0x2cde000, osr=<optimized out>) at os/FileStore.cc:2259
>>> #22 0x0083ce01 in ThreadPool::worker (this=0x2cde828) at common/WorkQueue.cc:54
>>> #23 0x006823ed in ThreadPool::WorkThread::entry (this=<optimized out>) at ./common/WorkQueue.h:126
>>> #24 0x7fc37e3eee9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
>>> #25 0x7fc37c9864cd in clone () from /lib/x86_64-linux-gnu/libc.so.6
>>> #26 0x in ?? ()
>>>
>>> mon bt was exactly the same as in http://tracker.newdream.net/issues/2762
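A hedged sketch of checking the fd-limit question raised above: compare the monitor's open descriptors against its per-process limit. The pidof lookup is illustrative; the commands are shown here against the current shell ($$) so they are safe to run anywhere on a Linux box.

```shell
# On a real node you would use the monitor's pid, e.g.: pid=$(pidof ceph-mon)
pid=$$

# The soft/hard "Max open files" limits for that process.
grep -i 'open files' /proc/$pid/limits

# Number of file descriptors currently open; if this is close to the soft
# limit above with ~500 OSDs plus clients connected, raise the limit.
echo "fds in use: $(ls /proc/$pid/fd | wc -l)"
```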
Re: Ceph performance improvement / journal on block-dev
On Wed, Aug 22, 2012 at 12:12 PM, Dieter Kasper (KD) wrote:
>> Your journal is a file on a btrfs partition. That is probably a bad
>> idea for performance. I'd recommend partitioning the drive and using
>> partitions as journals directly.
>
> can you please teach me how to use the right parameter(s) to realize
> 'journal on block-dev'?

Replacing the example paths, use "sudo parted /dev/sdg" or "gksu gparted /dev/sdg", create partitions, and set osd journal to point to a block device for a partition:

[osd.42]
    osd journal = /dev/sdg4

> It looks like something is not OK during 'mkcephfs -a -c
> /etc/ceph/ceph.conf --mkbtrfs' (see below)

Try running it with -x for any chance of extracting debuggable information from the monster.

> Scanning for Btrfs filesystems
> HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
> 2012-08-22 21:04:01.519073 7fb475e8b780 -1 journal check: ondisk fsid
> 8b18c558-8b40-4b07-aa66-61fecb4dd89d doesn't match expected
> ee0b8bf1-dd4a-459e-a218-3f590f9a8c16, invalid (someone else's?) journal

Based on that, my best guess would be that you're seeing a journal from an old run -- perhaps you need to explicitly clear out the block device contents.

Frankly, you should not use btrfs devs. Any convenience you may gain is more than doubly offset by pains exactly like these.
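A sketch of both suggestions above, exercised against a scratch image file so it is safe to copy-paste; substitute your real journal partition (e.g. /dev/sdg4, created with parted as described) for journal.img. Zeroing the start of the device discards any stale journal header, so the "someone else's fsid" check no longer trips on a leftover from an old run.

```shell
# Stand-in for a real journal partition such as /dev/sdg4 (assumption: on a
# real host you would point dd at the partition itself, carefully).
truncate -s 16M journal.img

# Wipe the first megabyte, which is where the old journal header (and its
# fsid) lives, before re-running mkcephfs.
dd if=/dev/zero of=journal.img bs=1M count=1 conv=notrunc 2>/dev/null

# The OSD section then points straight at the block device:
# [osd.42]
#     osd journal = /dev/sdg4
wc -c journal.img
```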
Re: OSD crash
On Thu, Aug 23, 2012 at 2:33 AM, Sage Weil wrote:
> On Thu, 23 Aug 2012, Andrey Korolyov wrote:
>> [original report and osd backtrace snipped; they are quoted in full in
>> the previous message of this thread above]
>
> This means it got an unexpected error when talking to the file system.
> If you look in the osd log, it may tell you what that was. (It may
> not--there isn't usually the other tcmalloc stuff triggered from the
> assert handler.)
>
> What happens if you restart that ceph-osd daemon?
>
> sage

Unfortunately I had completely disabled logs during the test, so there is no hint about the assert_fail. The main problem was revealed: the created VMs were pointed at one monitor instead of the set of three, so there may be some unusual effects (btw, the crashed mon isn't the one from above, but a neighbor of the crashed osds on the first node). After an IPMI reset the node came back fine and cluster behavior seems to be okay - the stuck kvm I/O somehow prevented even module load/unload on this node, so I finally decided to do a hard reset.

Although I'm using an almost generic wheezy, glibc was updated to 2.15; maybe that is why this trace appeared for the first time. I'm almost sure the fs did not trigger this crash, and I mainly suspect the stuck kvm processes. I'll rerun the test with the same conditions tomorrow (~500 VMs pointed at one mon and very high I/O, but with osd logging).

>> [backtrace frames #19-#26 snipped; quoted in full in the previous
>> message above]
>>
>> mon bt was exactly the same as in http://tracker.newdream.net/issues/2762
Re: OSD crash
On Thu, 23 Aug 2012, Andrey Korolyov wrote:
> Hi,
>
> today during a heavy test a pair of osds and one mon died, resulting in a
> hard lockup of some kvm processes - they went unresponsive and were
> killed, leaving zombie processes ([kvm] <defunct>). The entire cluster
> contains sixteen osds on eight nodes and three mons, on the first and
> last node and on a vm outside the cluster.
>
> osd bt:
> [frames #0-#17 snipped; the full backtrace is quoted earlier in this
> digest]
> #18 0x00844e11 in ceph::__ceph_assert_fail (assertion=0x90c01c
>     "0 == \"unexpected error\"", file=<optimized out>, line=3007,
>     func=0x90ef80 "unsigned int
>     FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int)")
>     at common/assert.cc:77

This means it got an unexpected error when talking to the file system. If you look in the osd log, it may tell you what that was. (It may not--there isn't usually the other tcmalloc stuff triggered from the assert handler.)

What happens if you restart that ceph-osd daemon?

sage

> [frames #19-#26 snipped; quoted in full earlier in this digest]
>
> mon bt was exactly the same as in http://tracker.newdream.net/issues/2762
Re: Ceph performance improvement / journal on block-dev
On Wed, Aug 22, 2012 at 06:29:12PM +0200, Tommi Virtanen wrote:
(...)
> Your journal is a file on a btrfs partition. That is probably a bad
> idea for performance. I'd recommend partitioning the drive and using
> partitions as journals directly.

Hi Tommi,

can you please teach me how to use the right parameter(s) to realize 'journal on block-dev'?

It looks like something is not OK during 'mkcephfs -a -c /etc/ceph/ceph.conf --mkbtrfs' (see below).

Regards,
-Dieter

e.g.
---snip---
modprobe -v brd rd_nr=6 rd_size=1000    # 6x 10G RAM DISK

/etc/ceph/ceph.conf
--
[global]
    auth supported = none
    # set log file
    log file = /ceph/log/$name.log
    log_to_syslog = true    # uncomment this line to log to syslog
    # set up pid files
    pid file = /var/run/ceph/$name.pid

[mon]
    mon data = /ceph/$name
    debug optracker = 0

[mon.alpha]
    host = 127.0.0.1
    mon addr = 127.0.0.1:6789

[mds]
    debug optracker = 0

[mds.0]
    host = 127.0.0.1

[osd]
    osd data = /data/$name

[osd.0]
    host = 127.0.0.1
    btrfs devs = /dev/ram0
    osd journal = /dev/ram3

[osd.1]
    host = 127.0.0.1
    btrfs devs = /dev/ram1
    osd journal = /dev/ram4

[osd.2]
    host = 127.0.0.1
    btrfs devs = /dev/ram2
    osd journal = /dev/ram5
--

root # mkcephfs -a -c /etc/ceph/ceph.conf --mkbtrfs
temp dir is /tmp/mkcephfs.wzARGSpFB6
preparing monmap in /tmp/mkcephfs.wzARGSpFB6/monmap
/usr/bin/monmaptool --create --clobber --add alpha 127.0.0.1:6789 --print /tmp/mkcephfs.wzARGSpFB6/monmap
/usr/bin/monmaptool: monmap file /tmp/mkcephfs.wzARGSpFB6/monmap
/usr/bin/monmaptool: generated fsid 40b997ea-387a-4deb-9a30-805cd076a0de
epoch 0
fsid 40b997ea-387a-4deb-9a30-805cd076a0de
last_changed 2012-08-22 21:04:00.553972
created 2012-08-22 21:04:00.553972
0: 127.0.0.1:6789/0 mon.alpha
/usr/bin/monmaptool: writing epoch 0 to /tmp/mkcephfs.wzARGSpFB6/monmap (1 monitors)
=== osd.0 ===
pushing conf and monmap to 127.0.0.1:/tmp/mkfs.ceph.11005
umount: /data/osd.0: not mounted
umount: /dev/ram0: not mounted
Btrfs v0.19.1+
ATTENTION: mkfs.btrfs is not intended to be used directly.
Please use the YaST partitioner to create and manage btrfs filesystems to be in a supported state on SUSE Linux Enterprise systems.
fs created label (null) on /dev/ram0
    nodesize 4096 leafsize 4096 sectorsize 4096 size 9.54GiB
Scanning for Btrfs filesystems
HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
2012-08-22 21:04:01.519073 7fb475e8b780 -1 journal check: ondisk fsid 8b18c558-8b40-4b07-aa66-61fecb4dd89d doesn't match expected ee0b8bf1-dd4a-459e-a218-3f590f9a8c16, invalid (someone else's?) journal
HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
2012-08-22 21:04:01.923505 7fb475e8b780 -1 filestore(/data/osd.0) could not find 23c2fcde/osd_superblock/0//-1 in index: (2) No such file or directory
2012-08-22 21:04:01.937429 7fb475e8b780 -1 created object store /data/osd.0 journal /dev/ram3 for osd.0 fsid 40b997ea-387a-4deb-9a30-805cd076a0de
creating private key for osd.0 keyring /data/osd.0/keyring
creating /data/osd.0/keyring
collecting osd.0 key
=== osd.1 ===
pushing conf and monmap to 127.0.0.1:/tmp/mkfs.ceph.11005
umount: /data/osd.1: not mounted
(...)
Re: wip-crush
On Wed, 22 Aug 2012, Gregory Farnum wrote:
> On Wed, Aug 22, 2012 at 9:33 AM, Sage Weil wrote:
> > On Wed, 22 Aug 2012, Atchley, Scott wrote:
> > [earlier discussion snipped; see the full quotes in Sage's original
> > message below]
> >
> > I think fixing the overloading of 'pool' in the default crush map is the
> > biggest pain point. I can live with crush 'buckets' staying the same
> > (esp since that's what the papers and code use pervasively) if we can't
> > come up with a better option.
>
> I'm definitely most interested in replacing "pool", and "root" works
> for that in my mind. RGW buckets live at a sufficiently different
> level that I think people are unlikely to be confused -- and "bucket"
> is actually a good name for what they are (I'm open to better ones,
> but I don't think that "node" qualifies).

Yeah, sounds good to me.

> > On the pool part, though, the challenge is how to transition. Existing
> > clusters have maps that use 'pool', and new clusters will use 'root' (or
> > whatever). Some options:
> >
> >  - document both. this kills much of the benefit of switching, but is
> >    probably inevitable since people will be running different versions.
> >  - make the upgrade process transparently rename the type. this lets
> >    all the tools use the new names.
> >  - make the tools silently translate old names to new names. this is
> >    kludgey in that it makes the code make assumptions about the names of
> >    the data it is working with, but would cover everyone except those who
> >    created their own crush maps from scratch.
> >  - ?
>
> I would go with option two, and only document the new options -- I
> wouldn't be surprised if the number of people who had changed those
> was zero. Anybody who has done so can certainly be counted on to pay
> enough attention that a line note "changed CRUSH names (see here if
> you customized your map)" would be sufficient, right?

Yeah. The one wrinkle is that people running old code (e.g., argonaut) reading the latest docs will see commands that don't quite work. At some point we need to fork the docs for each stable release... maybe now is the time to do that.

sage
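The rename transition being weighed here could also be done by an operator by hand; a hedged sketch of what that looks like. The crushtool and ceph invocations are the standard ones and are commented out; the rename step runs against a tiny sample map so it can be verified standalone (the type numbers are illustrative).

```shell
# On a real cluster, first extract and decompile the map:
# ceph osd getcrushmap -o /tmp/crushmap
# crushtool -d /tmp/crushmap -o /tmp/crushmap.txt

# Sample of the relevant section of a decompiled pre-rename map:
cat > /tmp/crushmap.txt <<'EOF'
type 0 osd
type 1 host
type 2 rack
type 3 pool
EOF

# Rename the crush type 'pool' to 'root'.
sed -i 's/^type 3 pool$/type 3 root/' /tmp/crushmap.txt
grep '^type 3' /tmp/crushmap.txt

# Then recompile and inject:
# crushtool -c /tmp/crushmap.txt -o /tmp/crushmap.new
# ceph osd setcrushmap -i /tmp/crushmap.new
```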
Re: SimpleMessenger dispatching: cause of performance problems?
What rbd block size were you using?
-Sam

On Tue, Aug 21, 2012 at 10:29 PM, Andreas Bluemle wrote:
> Hi,
>
> Samuel Just wrote:
>> Was the cluster completely healthy at the time that those traces were
>> taken? If there were osds going in/out/up/down, it would trigger osdmap
>> updates which would tend to hold the osd_lock for an extended period of
>> time.
>
> The cluster was completely healthy.
>
>> v0.50 included some changes that drastically reduce the purview of
>> osd_lock. In particular, pg op handling no longer grabs the osd_lock
>> and handle_osd_map can proceed independently of the pg worker threads.
>> Trying that might be interesting.
>
> I'll grab v0.50 and take a look.
>
>> -Sam
>>
>> On Tue, Aug 21, 2012 at 12:20 PM, Sage Weil wrote:
>>> On Tue, 21 Aug 2012, Sage Weil wrote:
>>>> On Tue, 21 Aug 2012, Andreas Bluemle wrote:
>>>>> Hi Sage,
>>>>>
>>>>> as mentioned, the workload is a single sequential write on
>>>>> the client. The write is not O_DIRECT; and consequently
>>>>> the messages arrive at the OSD with 124 KByte per write request.
>>>>>
>>>>> The attached pdf shows a timing diagram of two concurrent
>>>>> write operations (one primary and one replication / secondary).
>>>>>
>>>>> The time spent on the primary write to get the OSD::osd_lock
>>>>> relates nicely with the time when this lock is released by the
>>>>> secondary write.
>>>
>>> Looking again at this diagram, I'm a bit confused. Is the Y axis the
>>> thread id or something? And the X axis is time in seconds?
>
> X axis is time, Y axis is absolute offset of the write request on the
> rados block device.
>
>>> The big question for me is what on earth the secondary write (or
>>> primary, for that matter) is doing with osd_lock for a full 3 ms... If
>>> my reading of the units is correct, *that* is the real problem. It
>>> shouldn't be doing anything that takes that long. The exception is
>>> osdmap handling, which can do more work, but request processing should
>>> be very fast.
>>>
>>> Thanks-
>>> sage
>>>
>>> Ah, I see. There isn't a trivial way to pull osd_lock out of the
>>> picture; there are several data structures it's protecting (pg_map,
>>> osdmaps, peer epoch map, etc.). Before we try going down that road, I
>>> suspect it might be more fruitful to see where cpu time is being spent
>>> while osd_lock is held. How much of an issue does it look like this
>>> specific contention is for you? Does it go away with larger writes?
>>>
>>> sage
>
> Hope this helps
>
> Andreas
>
> Sage Weil wrote:
>> On Mon, 20 Aug 2012, Andreas Bluemle wrote:
>>> Hi Sage,
>>>
>>> Sage Weil wrote:
>>>> Hi Andreas,
>>>>
>>>> On Thu, 16 Aug 2012, Andreas Bluemle wrote:
>>>>> Hi,
>>>>>
>>>>> I have been trying to migrate a ceph cluster (ceph-0.48argonaut)
>>>>> to a high speed cluster network and encounter scalability problems:
>>>>> the overall performance of the ceph cluster does not scale well
>>>>> with an increase in the underlying networking speed.
>>>>>
>>>>> In short:
>>>>>
>>>>> I believe that the dispatching from SimpleMessenger to
>>>>> OSD worker queues causes that scalability issue.
>>>>>
>>>>> Question: is it possible that this dispatching is causing
>>>>> performance problems?
>>>>
>>>> There is a single 'dispatch' thread that's processing this queue, and
>>>> conveniently perf lets you break down its profiling data on a
>>>> per-thread basis. Once you've ruled out the throttler as the culprit,
>>>> you might try running the daemon with 'perf record -g -- ceph-osd ...'
>>>> and then look specifically at where that thread is spending its time.
>>>> We shouldn't be burning that much CPU just doing the sanity checks
>>>> and then handing requests off to PGs...
>>>>
>>>> sage
>>>
>>> The effect, which I am seeing, may be related to some locking issue.
>>> As I read the code, there are multiple dispatchers running: one per
>>> SimpleMessenger.
>>>
>>> On a typical OSD node, there is
>>>
>>> - the instance of the SimpleMessenger processing input from the client
>>>   (primary writes)
>>> - other instances of SimpleMessenger, which process input from
>>>   neighbor OSD nodes
>>>
>>> the latter generate replication writes to the OSD I am looking at.
>>>
>>> On the other hand, there is a single instance of the OSD object within
>>> the ceph-osd daemon. When dispatching messages to the OSD, then the
>>> OSD::osd_lock is held for
Re: wip-crush
On Wed, Aug 22, 2012 at 9:33 AM, Sage Weil wrote:
> On Wed, 22 Aug 2012, Atchley, Scott wrote:
> [earlier discussion snipped; it is quoted in full in Sage's message
> below]
>
> I think fixing the overloading of 'pool' in the default crush map is the
> biggest pain point. I can live with crush 'buckets' staying the same
> (esp since that's what the papers and code use pervasively) if we can't
> come up with a better option.

I'm definitely most interested in replacing "pool", and "root" works for that in my mind. RGW buckets live at a sufficiently different level that I think people are unlikely to be confused -- and "bucket" is actually a good name for what they are (I'm open to better ones, but I don't think that "node" qualifies).

> On the pool part, though, the challenge is how to transition. Existing
> clusters have maps that use 'pool', and new clusters will use 'root' (or
> whatever). Some options:
>
>  - document both. this kills much of the benefit of switching, but is
>    probably inevitable since people will be running different versions.
>  - make the upgrade process transparently rename the type. this lets
>    all the tools use the new names.
>  - make the tools silently translate old names to new names. this is
>    kludgey in that it makes the code make assumptions about the names of
>    the data it is working with, but would cover everyone except those who
>    created their own crush maps from scratch.
>  - ?

I would go with option two, and only document the new options -- I wouldn't be surprised if the number of people who had changed those was zero. Anybody who has done so can certainly be counted on to pay enough attention that a line note "changed CRUSH names (see here if you customized your map)" would be sufficient, right?
-Greg
[GIT PULL] Ceph fixes for 3.6-rc3
Hi Linus,

Please pull the following Ceph fixes for -rc3 from

  git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus

Jim's fix closes a narrow race introduced with the msgr changes. One fix resolves problems with debugfs initialization that Yan found when multiple client instances are created (e.g., two clusters mounted, or rbd + cephfs), another one fixes problems with mounting a nonexistent server subdirectory, and the last one fixes a divide by zero error from unsanitized ioctl input that Dan Carpenter found.

Thanks!
sage

Jim Schutt (1):
      libceph: avoid truncation due to racing banners

Sage Weil (3):
      libceph: delay debugfs initialization until we learn global_id
      ceph: tolerate (and warn on) extraneous dentry from mds
      ceph: avoid divide by zero in __validate_layout()

 fs/ceph/debugfs.c      |  1 +
 fs/ceph/inode.c        | 15 +
 fs/ceph/ioctl.c        |  3 +-
 net/ceph/ceph_common.c |  1 -
 net/ceph/debugfs.c     |  4 +++
 net/ceph/messenger.c   | 11 -
 net/ceph/mon_client.c  | 51 +++
 7 files changed, 72 insertions(+), 14 deletions(-)
Re: wip-crush
On Wed, 22 Aug 2012, Atchley, Scott wrote: > On Aug 22, 2012, at 10:46 AM, Florian Haas wrote: > > > On 08/22/2012 03:10 AM, Sage Weil wrote: > >> I pushed a branch that changes some of the crush terminology. Instead of > >> having a crush type called "pool" that requires you to say things like > >> "pool=default" in the "ceph osd crush set ..." command, it uses "root" > >> instead. That hopefully reinforces that it is a tree/hierarchy. > >> > >> There is also a patch that changes "bucket" to "node" throughout, since > >> bucket is a term also used by radosgw. > >> > >> Thoughts? I think the main pain in making this transition is that old > >> clusters have maps that have a type 'pool' and new ones won't, and the > >> docs will need to walk people through both... > > > > "pool" in a crushmap being completely unrelated to a RADOS pool is > > something that I've heard customers/users report as confusing, as well. > > So changing that is probably a good thing. Naming it "root" is probably > > a good choice as well, as it happens to match > > http://ceph.com/wiki/Custom_data_placement_with_CRUSH. > > > > As for changing "bucket" to node... a "node" is normally simply a > > physical server (at least in HA terminology, which many potential Ceph > > users will be familiar with), and CRUSH uses "host" for that. So that's > > another recipe for confusion. How about using something super-generic, > > like "element" or "item"? > > > > Cheers, > > Florian > > My guess is that he is trying to use data structure tree nomenclature > (root, node, leaf). I agree that node is an overloaded term (as is > pool). Yeah... > As for an alternative to bucket which indicates the item is a > collection, what about subtree or branch? I think fixing the overloading of 'pool' in the default crush map is the biggest pain point. I can live with crush 'buckets' staying the same (esp since that's what the papers and code use pervasively) if we can't come up with a better option. 
On the pool part, though, the challenge is how to transition. Existing clusters have maps that use 'pool', and new clusters will use 'root' (or whatever). Some options:

 - document both.  this kills much of the benefit of switching, but is
   probably inevitable since people will be running different versions.
 - make the upgrade process transparently rename the type.  this lets
   all the tools use the new names.
 - make the tools silently translate old names to new names.  this is
   kludgey in that it makes the code make assumptions about the names of
   the data it is working with, but would cover everyone except those who
   created their own crush maps from scratch.
 - ?

sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Re: Ceph performance improvement
On Wed, Aug 22, 2012 at 9:23 AM, Denis Fondras wrote: >> Are you sure your osd data and journal are on the disks you think? The >> /home paths look suspicious -- especially for journal, which often >> should be a block device. > I am :) ... > -rw-r--r-- 1 root root 1048576000 août 22 17:22 /home/osd.0.journal Your journal is a file on a btrfs partition. That is probably a bad idea for performance. I'd recommend partitioning the drive and using partitions as journals directly. -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
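Following that suggestion, the [osd.0] section would point the journal at a raw partition instead of a file. A minimal sketch against Denis's setup — the /dev/sda2 partition name is illustrative, not from the thread:

```
[osd.0]
        host = ceph-osd-0
        osd data = /home/osd.0
        # Journal on a raw SSD partition instead of a file on btrfs.
        # (Partition name is illustrative; create it with your favorite
        # partitioning tool first.)
        osd journal = /dev/sda2
```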
Re: Ceph performance improvement
> Are you sure your osd data and journal are on the disks you think? The
> /home paths look suspicious -- especially for journal, which often
> should be a block device.

I am :)

> Can you share output of "mount" and "ls -ld /home/osd.*"

Here are some details :

root@ceph-osd-0:~# ls -al /dev/disk/by-id/
lrwxrwxrwx 1 root root 9 août 21 21:19 scsi-SATA_C300-CTFDDAC064103008FE4 -> ../../sda
lrwxrwxrwx 1 root root 9 août 22 10:57 scsi-SATA_WDC_WD30EZRX-00_WD-WMAWZ0124762 -> ../../sdh
lrwxrwxrwx 1 root root 9 août 21 16:03 scsi-SATA_WDC_WD30EZRX-00_WD-WMAWZ0137898 -> ../../sdg
lrwxrwxrwx 1 root root 9 août 21 21:19 scsi-SATA_WDC_WD30EZRX-00_WD-WMAWZ0152201 -> ../../sdf
lrwxrwxrwx 1 root root 9 août 21 16:03 scsi-SATA_WDC_WD30EZRX-00_WD-WMAWZ0152562 -> ../../sdc

root@ceph-osd-0:~# mount
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
udev on /dev type devtmpfs (rw,relatime,size=10240k,nr_inodes=1020030,mode=755)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)
tmpfs on /run type tmpfs (rw,nosuid,noexec,relatime,size=817216k,mode=755)
/dev/disk/by-uuid/7d95d243-1788-4c3f-9f89-166c15f880f0 on / type ext3 (rw,relatime,errors=remount-ro,barrier=1,data=ordered)
tmpfs on /run/lock type tmpfs (rw,nosuid,nodev,noexec,relatime,size=5120k)
tmpfs on /tmp type tmpfs (rw,nosuid,nodev,relatime,size=1634432k)
tmpfs on /run/shm type tmpfs (rw,nosuid,nodev,relatime,size=1634432k)
/dev/sda on /home type btrfs (rw,relatime,ssd,space_cache)
/dev/sdf on /home/osd.0 type btrfs (rw,noatime,space_cache)

root@ceph-osd-0:~# ls -ld /home/osd.*
drwxr-xr-x 1 root root        236 août 22 17:22 /home/osd.0
-rw-r--r-- 1 root root 1048576000 août 22 17:22 /home/osd.0.journal

Regards,
Denis
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Re: Ceph performance improvement
On Wed, Aug 22, 2012 at 1:54 AM, Denis Fondras wrote:
> First of all, here is my setup :
> for the system, 1x 64GB SSD (Crucial C300 - /dev/sda) for the journal and 4x
> 3TB drive (Western Digital WD30EZRX). Everything but the boot partition is
> BTRFS-formated and 4K-aligned.
...
> [osd]
>         osd data = /home/osd.$id
>         osd journal = /home/osd.$id.journal
>         osd journal size = 1000
>         keyring = /etc/ceph/keyring.$name
>
> [osd.0]
>         host = ceph-osd-0
>         btrfs devs = /dev/disk/by-id/scsi-SATA_WDC_WD30EZRX-00_WD-WMAWZ0152201
>         btrfs options = rw,noatime

Are you sure your osd data and journal are on the disks you think? The /home paths look suspicious -- especially for journal, which often should be a block device.

Can you share output of "mount" and "ls -ld /home/osd.*"
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
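One quick way to answer the "are they on the disks you think" question is to resolve each configured path to its backing mount. A sketch using util-linux findmnt, with the paths taken from the ceph.conf above:

```
# Resolve which mounted filesystem backs each configured path.
findmnt -T /home/osd.0
findmnt -T /home/osd.0.journal
# If the journal line reports a parent mount such as /home, the journal
# is a plain file on that filesystem rather than a raw block device.
```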
Re: wip-crush
On Aug 22, 2012, at 10:46 AM, Florian Haas wrote: > On 08/22/2012 03:10 AM, Sage Weil wrote: >> I pushed a branch that changes some of the crush terminology. Instead of >> having a crush type called "pool" that requires you to say things like >> "pool=default" in the "ceph osd crush set ..." command, it uses "root" >> instead. That hopefully reinforces that it is a tree/hierarchy. >> >> There is also a patch that changes "bucket" to "node" throughout, since >> bucket is a term also used by radosgw. >> >> Thoughts? I think the main pain in making this transition is that old >> clusters have maps that have a type 'pool' and new ones won't, and the >> docs will need to walk people through both... > > "pool" in a crushmap being completely unrelated to a RADOS pool is > something that I've heard customers/users report as confusing, as well. > So changing that is probably a good thing. Naming it "root" is probably > a good choice as well, as it happens to match > http://ceph.com/wiki/Custom_data_placement_with_CRUSH. > > As for changing "bucket" to node... a "node" is normally simply a > physical server (at least in HA terminology, which many potential Ceph > users will be familiar with), and CRUSH uses "host" for that. So that's > another recipe for confusion. How about using something super-generic, > like "element" or "item"? > > Cheers, > Florian My guess is that he is trying to use data structure tree nomenclature (root, node, leaf). I agree that node is an overloaded term (as is pool). As for an alternative to bucket which indicates the item is a collection, what about subtree or branch? Scott-- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] libceph: Fix sparse warning
On Wed, 22 Aug 2012, Daniel Baluta wrote:
> On Tue, Aug 14, 2012 at 4:27 PM, Iulius Curt wrote:
> > From: Iulius Curt
> >
> > Make ceph_monc_do_poolop() static to remove the following sparse warning:
> > * net/ceph/mon_client.c:616:5: warning: symbol 'ceph_monc_do_poolop' was not
> >   declared. Should it be static?
> >
> > Signed-off-by: Iulius Curt
> > ---
> >  net/ceph/mon_client.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/net/ceph/mon_client.c b/net/ceph/mon_client.c
> > index 105d533..3875c60 100644
> > --- a/net/ceph/mon_client.c
> > +++ b/net/ceph/mon_client.c
> > @@ -613,7 +613,7 @@ bad:
> >  /*
> >   * Do a synchronous pool op.
> >   */
> > -int ceph_monc_do_poolop(struct ceph_mon_client *monc, u32 op,
> > +static int ceph_monc_do_poolop(struct ceph_mon_client *monc, u32 op,
> >                         u32 pool, u64 snapid,
> >                         char *buf, int len)
> >  {
> > --
> > 1.7.9.5
> >
>
> Hi Sage,
>
> Can you have a look on this? :)

Sorry, this one fell through the cracks. Yes, we can switch it to static, but while we're doing that let's drop the ceph_monc_ prefix too (since it's private).

Thanks!
sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Re: Ideal hardware spec?
On Wed, Aug 22, 2012 at 04:17:23PM +0200, Wido den Hollander wrote:
:On 08/22/2012 03:55 PM, Jonathan Proulx wrote:
:You can also use the USB sticks[0] from Stec, they have servergrade
:onboard USB sticks for these kind of applications.

Those look quite interesting.

:A couple of questions still need to be answered though:
:* Which OS are you planning on using? Ubuntu 12.04 is recommended

Ubuntu 12.04 is our current preferred OS

:* Which filesystem do you want to use underneath the OSDs?

Whatever I can get to work best in testing :)  Since this is for a research platform not a product I'd likely start with BTRFS and see if it is "stable enough" and "performant enough" with fall back to XFS if needed

-Jon

:Wido
:
:[0]: http://www.stec-inc.com/product/ufm.php
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Re: ceph osd create
On Tue, 21 Aug 2012, Mandell Degerness wrote:
> Found it (digging through the source code to find a guess, since it is
> in no way obvious): --osd-uuid

Whoops, sorry, yeah.  It appeared in 0.47.

sage

> On Tue, Aug 21, 2012 at 4:38 PM, Mandell Degerness wrote:
> > Thanks, Sage.  This is what I was looking for, but what version of
> > ceph do I need for this to work (it isn't there in Argonaut)?  See
> > below:
> >
> > # ceph-osd -c /etc/ceph/ceph.conf --fsid 8296cc23-9c11-44d7-84c1-16866ef9c4f7 -i 50 --mkfs --osd-fsid e1097bd8-c931-4e2e-8ccb-332a954adace
> >   --conf/-c      Read configuration from the given configuration file
> >   -d             Run in foreground, log to stderr.
> >   -f             Run in foreground, log to usual location.
> >   --id/-i        set ID portion of my name
> >   --name/-n      set name (TYPE.ID)
> >   --version      show version and quit
> >
> >   --debug_ms N   set message debug level (e.g. 1)
> > 2012-08-21 23:26:50.774858 7f1be9ac1780 -1 unrecognized arg --osd-fsid
> > 2012-08-21 23:26:50.774864 7f1be9ac1780 -1 usage: ceph-osd -i osdid [--osd-data=path] [--osd-journal=path] [--mkfs] [--mkjournal] [--convert-filestore]
> > 2012-08-21 23:26:50.774915 7f1be9ac1780 -1   --debug_osd N   set debug level (e.g. 10)
> >
> > # ceph --version
> > ceph version 0.48.1argonaut (commit:a7ad701b9bd479f20429f19e6fea7373ca6bba7c)
> >
> > Tommi - Thank you for the suggestion of ceph-disk-prepare and
> > ceph-disk-activate, but they work at too high of a level for our
> > usage.  We need finer control of the block devices.
> >
> > Regards,
> > Mandell Degerness
> >
> > On Tue, Aug 21, 2012 at 11:15 AM, Sage Weil wrote:
> >> On Tue, 21 Aug 2012, Mandell Degerness wrote:
> >>> OK.  I think I'm getting there.
> >>>
> >>> I want to be able to generate the fsid to be used in the OSD (from the
> >>> file system fsid, if that matters).  Is there a way to inject the fsid
> >>> when initializing the OSD directory?  It doesn't seem to be
> >>> documented.
The alternative would require that we mount the OSD in a > >>> temp dir to read the fsid file, determine the OSD number, and then > >>> re-mount it where it belongs, which seems the wrong way to go. > >> > >> You can feed in the fsid to ceph-osd --mkfs with --osd-fsid . > >> > >> sage > >> > >>> > >>> Regards, > >>> Mandell Degerness > >>> > >>> On Mon, Aug 20, 2012 at 4:26 PM, Tommi Virtanen wrote: > >>> > On Mon, Aug 20, 2012 at 3:53 PM, Mandell Degerness > >>> > wrote: > >>> >> We're running Argonaut and it only has the OSD id in the whoami file > >>> >> and nothing else. > >>> > > >>> > My bad, I meant the file "fsid" (note, not "ceph_fsid"). > >>> -- > >>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > >>> the body of a message to majord...@vger.kernel.org > >>> More majordomo info at http://vger.kernel.org/majordomo-info.html > >>> > >>> > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
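Putting the thread's conclusion together, a hedged sketch of the working invocation on ceph >= 0.47 — the id, UUID, and data path below are illustrative, and the data path in particular depends on your ceph.conf:

```
# ceph >= 0.47: inject the fsid at mkfs time with --osd-uuid
# (the --osd-fsid spelling is not recognized by 0.48 argonaut).
ceph-osd -c /etc/ceph/ceph.conf -i 50 --mkfs \
    --osd-uuid e1097bd8-c931-4e2e-8ccb-332a954adace

# The value can then be read back from the 'fsid' file in the osd data
# directory without any temporary re-mounting (path is illustrative):
cat /var/lib/ceph/osd/ceph-50/fsid
```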
Re: wip-crush
On 08/22/2012 03:10 AM, Sage Weil wrote: > I pushed a branch that changes some of the crush terminology. Instead of > having a crush type called "pool" that requires you to say things like > "pool=default" in the "ceph osd crush set ..." command, it uses "root" > instead. That hopefully reinforces that it is a tree/hierarchy. > > There is also a patch that changes "bucket" to "node" throughout, since > bucket is a term also used by radosgw. > > Thoughts? I think the main pain in making this transition is that old > clusters have maps that have a type 'pool' and new ones won't, and the > docs will need to walk people through both... "pool" in a crushmap being completely unrelated to a RADOS pool is something that I've heard customers/users report as confusing, as well. So changing that is probably a good thing. Naming it "root" is probably a good choice as well, as it happens to match http://ceph.com/wiki/Custom_data_placement_with_CRUSH. As for changing "bucket" to node... a "node" is normally simply a physical server (at least in HA terminology, which many potential Ceph users will be familiar with), and CRUSH uses "host" for that. So that's another recipe for confusion. How about using something super-generic, like "element" or "item"? Cheers, Florian -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Ideal hardware spec?
On 08/22/2012 08:55 AM, Jonathan Proulx wrote:

Hi All,

Hi Jonathan!

Yes I'm asking the impossible question, what is the "best" hardware config.

That is the impossible question. :)

I'm looking at (possibly) using ceph as backing store for images and volumes on OpenStack as well as exposing at least the object store for direct use. The openstack cluster exists and is currently in the early stages of use by researchers here, approx 1500 vCPU (counts hyperthreads, actually 768 physical cores) and 3T of RAM across 64 physical nodes. On the object store side it would be a new resource for us and hard to say what people would do with it except that it would be many different things and the use profile would be constantly changing (which is true of all our existing storage). In this sense, even though it's a "private cloud" the somewhat unpredictable usage profile gives it some characteristics of a small public cloud. Size wise I'm hoping to start out with 3 monitors and 5(+) OSD nodes to end up with 20-30T of 3x replicated storage (call me paranoid).

So the monitor specs seem relatively easy to come up with. For the OSDs it looks like http://ceph.com/docs/master/install/hardware-recommendations suggests 1 drive, 1 core and 2G RAM per OSD (with multiple OSDs per storage node). On-list discussions seem to frequently include an SSD for journaling (which is similar to what we do for our current ZFS backed NFS storage). I'm hoping to wrap the hardware in a grant and willing to experiment a bit with different software configurations to tune it up when/if I get the hardware in. So my immediate concern is a hardware spec that will have a reasonable processor:memory:disk ratio and opinions (or better data) on the utility of SSD.

Before I joined up with Inktank, I was prototyping a private openstack cloud for HPC applications at a supercomputing site. We similarly were pursuing grant funding. I know how it goes!

First is the documented core to disk ratio still current best practice?
Given a platform with more drive slots could 8 cores handle more disk? Would that need/like more memory?

The big thing is the CPU and memory needed during recovery. During standard operation you shouldn't be pushing the CPU too hard unless you are really pushing data through fast and have many drives per node, or have severely underspecced the CPU.

Given that you are only shooting for around 90TB of space across 5+ osd nodes, you should be able to get away with 12 2TB+ drive 2U boxes. That's probably the closest thing we have right now to a "standard" configuration. We use a single 6-core 2.8GHz AMD Opteron chip in each node with 16GB of memory. It might be worth bumping that up to 24-32GB of memory for very large deployments with lots of OSDs.

In terms of controller we are using Dell H700 cards which are similar to LSI 9260s, but I think there is a good chance that it may actually be better to use H200s (ie LSI 9211-8i or similar) with the IT/JBOD mode firmware. That's one of the commonly used cards in ZFS builds too and has a pretty good reputation. I've actually got a Supermicro SC847a chassis and a whole bunch of various SATA/SAS/RAID controllers I'm testing now in different configurations. Hopefully I should have some data soon. For now, our best tested configuration is with 12 drive nodes. Smaller 1U nodes may be an option as well, but not very dense.

Have SSD been shown to speed performance with this architecture?

Yes, but in different ways depending on how you use them. SSDs for data storage tend to help mitigate some of the seek behavior issues we've seen on the filestore. This isn't really a reasonable solution for a lot of people though. In terms of the journal, the biggest benefit that SSDs provide is high throughput, so you can load multiple journals onto 1 SSD and cram more OSDs into one box. Depending on how much you trust your SSDs, you could try either a 10 disk + 2 SSD or a 9 disk + SSD configuration.
Keep in mind that this will be writing a lot of data to the SSDs, so you should try to undersubscribe them to lengthen the lifespan. For testing I'm doing 3 journals per 180GB Intel 520 SSD.

If so given the 8 drive slot example with seven OSDs presented in the docs what is the likelihood that using a high performance SSD for the OS image and also cutting journal/log partitions out of it for the remaining 7 2-3T near line SAS drives?

Just keep in mind that in this case your total throughput will likely be limited by the SSD unless you get a very fast one (or are using 1GbE or have some other bottleneck).

Thanks,
-Jon
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
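The 3-journals-per-SSD figure can be sanity-checked with a back-of-envelope calculation: every client write passes through the journal, so one SSD must absorb the combined write rate of the spinners behind it. The throughput numbers below are illustrative assumptions, not measurements from this thread:

```python
def journals_per_ssd(ssd_write_mbs, disk_write_mbs):
    """How many OSD journals one SSD can host before it becomes the
    bottleneck: the SSD's sustained write rate divided by the sustained
    write rate of a single data disk (integer, rounding down)."""
    return ssd_write_mbs // disk_write_mbs

# Illustrative numbers: a 180GB Intel 520 sustaining very roughly
# 330 MB/s of writes, a 7200rpm SATA spinner streaming ~110 MB/s.
print(journals_per_ssd(330, 110))
```

With those (assumed) numbers the function returns 3, which lines up with the 3-journals-per-SSD test configuration; a faster SSD or slower disks shifts the ratio accordingly.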
RE: Ideal hardware spec?
Hi all,

Is there a place we can set up a group of hardware recipes that people can query and modify over time? It would be good if people could submit and "group modify" the recipes. I would envision "hypothetical" configurations and "deployed/tested" configurations. Trekking back through email exchanges like this becomes hard for people who join later.

I'd like to see a "best" hardware config as well... however, I'm interested in a SAS switching fabric where the nodes do not have any storage (except possibly onboard boot drive/USB as listed below). Each node would have a SAS HBA that allows it to access a LARGE jbod provided by a HA set of SAS Switches (http://www.lsi.com/solutions/Pages/SwitchedSAS.aspx). The drives are lun masked for each host. The thought here is that you can add compute nodes, storage shelves, and disks all independently. With proper masking, you could provide redundancy to cover drive, node, and shelf failures. You could also add disks "horizontally" if you have spare slots in a shelf, and you could add shelves "vertically" and increase the disk count available to existing nodes.

My goal is to be able to scale without having to draw the enormous power of lots of 1U devices or buy lots of disks and shelves each time I want to add a little capacity.

Anybody looked at atom processors?

- Steve

-----Original Message-----
From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Wido den Hollander
Sent: Wednesday, August 22, 2012 9:17 AM
To: Jonathan Proulx
Cc: ceph-devel@vger.kernel.org
Subject: Re: Ideal hardware spec?

Hi,

On 08/22/2012 03:55 PM, Jonathan Proulx wrote:
> Hi All,
>
> Yes I'm asking the impossible question, what is the "best" hardware
> config.
>
> I'm looking at (possibly) using ceph as backing store for images and
> volumes on OpenStack as well as exposing at least the object store for
> direct use.
> > The openstack cluster exists and is currently in the early stages of > use by researchers here, approx 1500 vCPU (counts hyperthreads > actually 768 physical cores) and 3T or RAM across 64 physical nodes. > > On the object store side it would be a new resource for usand hard to > say what people would do with it except that it would be many > different things and the use profile would be constantly changing > (which is true of all our existing storage). > > In this sense, even though it's a "private cloud" the somewhat > unpredictable useage profile gives it some charateristics of a small > public cloud. > > Size wise I'm hoping to start out with 3 monitors and 5(+) OSD nodes > to end up with a 20-30T 3x replicated storage (call me paranoid). > I prefer 3x replication as well. I've seen the "wrong" OSDs die on me too often. > So the monitor specs seem relatively easy to come up with. For the > OSDs it looks like > http://ceph.com/docs/master/install/hardware-recommendations suggests > 1 drive, 1 core and 2G RAM per OSD (with multiple OSDs per storage > node). On list discussions seem to frequently include an SSD for > journaling (which is similar to what we do for our current ZFS back > NFS storage). > > I'm hoping to wrap the hardware in a grant and willing to experiment a > bit with different software configurations to tune it up when/if I get > the hardware in. So my imediate concern is a hardware spec that will > ahve a reasonable processor:memory:disk ratio and opinions (or better > data) on the utility of SSD. > > First is the documented core to disk ratio still current best > practice? Given a platform with more drive slots could 8 cores handle > more disk? would that need/like more memory? > I'd still suggest about 2GB of RAM per OSD. The more RAM you have in the OSD machines, the more the kernel can buffer, which will always be a performance gain. 
You should however ask yourself the question if you want a lot of OSDs per server and not go for smaller machines with less disks.

For example:

- 1U
- 4 cores
- 8GB RAM
- 4 disks
- 1 SSD

Or:

- 2U
- 8 cores
- 16GB RAM
- 8 disks
- 1|2 SSDs

Both will give you the same amount of storage, but the impact of losing one physical machine will be larger with the 2U machine. If you take 1TB disks you'd lose 8TB of storage, that is a lot of recovery to be done. Since btrfs (assuming you are going to use that) is still in development it's not excluded that your machine goes down due to a kernel panic or other problems. My personal favor is having multiple small(er) machines rather than a couple of large machines.

> Have SSD been shown to speed performance with this architecture?

I've seen an improvement in performance indeed. Make sure however you have a recent version of glibc with syncfs support.

> If so given the 8 drive slot example with seven OSDs presented in the
> docs what is the likelihood that using a high performance SSD for the
> OS image and also cutting journal/log partitions out of it for the
> remaining 7 2-3T n
Re: Ideal hardware spec?
Hi, On 08/22/2012 03:55 PM, Jonathan Proulx wrote: Hi All, Yes I'm asking the impossible question, what is the "best" hardware confing. I'm looking at (possibly) using ceph as backing store for images and volumes on OpenStack as well as exposing at least the object store for direct use. The openstack cluster exists and is currently in the early stages of use by researchers here, approx 1500 vCPU (counts hyperthreads actually 768 physical cores) and 3T or RAM across 64 physical nodes. On the object store side it would be a new resource for usand hard to say what people would do with it except that it would be many different things and the use profile would be constantly changing (which is true of all our existing storage). In this sense, even though it's a "private cloud" the somewhat unpredictable useage profile gives it some charateristics of a small public cloud. Size wise I'm hoping to start out with 3 monitors and 5(+) OSD nodes to end up with a 20-30T 3x replicated storage (call me paranoid). I prefer 3x replication as well. I've seen the "wrong" OSDs die on me too often. So the monitor specs seem relatively easy to come up with. For the OSDs it looks like http://ceph.com/docs/master/install/hardware-recommendations suggests 1 drive, 1 core and 2G RAM per OSD (with multiple OSDs per storage node). On list discussions seem to frequently include an SSD for journaling (which is similar to what we do for our current ZFS back NFS storage). I'm hoping to wrap the hardware in a grant and willing to experiment a bit with different software configurations to tune it up when/if I get the hardware in. So my imediate concern is a hardware spec that will ahve a reasonable processor:memory:disk ratio and opinions (or better data) on the utility of SSD. First is the documented core to disk ratio still current best practice? Given a platform with more drive slots could 8 cores handle more disk? would that need/like more memory? I'd still suggest about 2GB of RAM per OSD. 
The more RAM you have in the OSD machines, the more the kernel can buffer, which will always be a performance gain.

You should however ask yourself the question if you want a lot of OSDs per server and not go for smaller machines with less disks.

For example:

- 1U
- 4 cores
- 8GB RAM
- 4 disks
- 1 SSD

Or:

- 2U
- 8 cores
- 16GB RAM
- 8 disks
- 1|2 SSDs

Both will give you the same amount of storage, but the impact of losing one physical machine will be larger with the 2U machine. If you take 1TB disks you'd lose 8TB of storage, that is a lot of recovery to be done. Since btrfs (assuming you are going to use that) is still in development it's not excluded that your machine goes down due to a kernel panic or other problems. My personal favor is having multiple small(er) machines rather than a couple of large machines.

Have SSD been shown to speed performance with this architecture?

I've seen an improvement in performance indeed. Make sure however you have a recent version of glibc with syncfs support.

If so given the 8 drive slot example with seven OSDs presented in the docs what is the likelihood that using a high performance SSD for the OS image and also cutting journal/log partitions out of it for the remaining 7 2-3T near line SAS drives?

You should make sure your SSD is capable of doing line-speed of your network. If you are connecting the machines with 4G trunks, make sure the SSD is capable of doing around 400MB/sec of sustained writes. I'd recommend the Intel 520 SSDs and change their available capacity with hdparm to about 20% of their original capacity. This way the SSD always has a lot of free cells available for writing. Reprogramming cells is expensive on an SSD. You can run the OS on the same SSD since that won't do that much I/O. I'd recommend not logging locally though, since that will also write to the same SSD. Try using remote syslog.
You can also use the USB sticks[0] from Stec, they have servergrade onboard USB sticks for these kind of applications.

A couple of questions still need to be answered though:
* Which OS are you planning on using? Ubuntu 12.04 is recommended
* Which filesystem do you want to use underneath the OSDs?

Wido

[0]: http://www.stec-inc.com/product/ufm.php

Thanks,
-Jon
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
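The capacity trick mentioned above can be done with hdparm's host-protected-area option. A hedged sketch — this is DESTRUCTIVE, the device name and sector math are illustrative, and you should double-check hdparm(8) before trying it on real hardware:

```
# Show the current and native max sector counts first.
hdparm -N /dev/sdX

# Keep ~20% of the native capacity visible, e.g. with a native max of
# 234441648 sectors: 234441648 / 5 = 46888329 visible sectors.
# The safety flag is required for this permanently-applied change.
hdparm -Np46888329 --yes-i-know-what-i-am-doing /dev/sdX
```

The cells above the visible limit are never written by the OS, so the drive's firmware can use them as a permanently free pool for wear leveling and garbage collection.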
Ideal hardware spec?
Hi All, Yes I'm asking the impossible question, what is the "best" hardware confing. I'm looking at (possibly) using ceph as backing store for images and volumes on OpenStack as well as exposing at least the object store for direct use. The openstack cluster exists and is currently in the early stages of use by researchers here, approx 1500 vCPU (counts hyperthreads actually 768 physical cores) and 3T or RAM across 64 physical nodes. On the object store side it would be a new resource for usand hard to say what people would do with it except that it would be many different things and the use profile would be constantly changing (which is true of all our existing storage). In this sense, even though it's a "private cloud" the somewhat unpredictable useage profile gives it some charateristics of a small public cloud. Size wise I'm hoping to start out with 3 monitors and 5(+) OSD nodes to end up with a 20-30T 3x replicated storage (call me paranoid). So the monitor specs seem relatively easy to come up with. For the OSDs it looks like http://ceph.com/docs/master/install/hardware-recommendations suggests 1 drive, 1 core and 2G RAM per OSD (with multiple OSDs per storage node). On list discussions seem to frequently include an SSD for journaling (which is similar to what we do for our current ZFS back NFS storage). I'm hoping to wrap the hardware in a grant and willing to experiment a bit with different software configurations to tune it up when/if I get the hardware in. So my imediate concern is a hardware spec that will ahve a reasonable processor:memory:disk ratio and opinions (or better data) on the utility of SSD. First is the documented core to disk ratio still current best practice? Given a platform with more drive slots could 8 cores handle more disk? would that need/like more memory? Have SSD been shown to speed performance with this architecture? 
If so, given the 8 drive slot example with seven OSDs presented in the docs, what is the likelihood that using a high performance SSD for the OS image and also cutting journal/log partitions out of it for the remaining 7 2-3T near-line SAS drives would work well?

Thanks,
-Jon
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Re: Ceph performance improvement
>> Not sure what version of glibc Wheezy has, but try to make sure you have
>> one that supports syncfs (you'll also need a semi-new kernel, 3.0+
>> should be fine).

Hi, glibc from Wheezy doesn't have syncfs support.

----- Original Message -----
From: "Mark Nelson"
To: "Denis Fondras"
Cc: ceph-devel@vger.kernel.org
Sent: Wednesday, 22 August 2012 14:35:28
Subject: Re: Ceph performance improvement

[quoted message snipped]
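A quick way to check whether an installed glibc exposes the syncfs() wrapper is a compile probe. This is my own sketch, not from the thread; for reference, glibc gained the wrapper in 2.14 and the underlying syscall needs kernel 2.6.39+:

```shell
# Probe for the glibc syncfs() wrapper by compiling a one-liner.
cat > /tmp/syncfs_probe.c <<'EOF'
#define _GNU_SOURCE
#include <unistd.h>
int main(void) { return syncfs(0); }   /* sync the fs containing stdin */
EOF
if cc /tmp/syncfs_probe.c -o /tmp/syncfs_probe 2>/dev/null; then
    echo "glibc provides syncfs()"
else
    echo "no syncfs() wrapper in this glibc"
fi
rm -f /tmp/syncfs_probe /tmp/syncfs_probe.c
```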
Re: Ceph performance improvement
On 08/22/2012 03:54 AM, Denis Fondras wrote:
> Hello all,

Hello! David had some good comments in his reply, so I'll just add in a
couple of extra thoughts...

> I'm currently testing Ceph. So far it seems that HA and recovery are
> very good. The only point that prevents me from using it at
> datacenter-scale is performance.
>
> First of all, here is my setup :
> - 1 OSD/MDS/MON on a Supermicro X9DR3-F/X9DR3-F (1x Intel Xeon E5-2603
>   - 4 cores and 8GB RAM) running Debian Sid/Wheezy and Ceph version 0.49

Not sure what version of glibc Wheezy has, but try to make sure you have
one that supports syncfs (you'll also need a semi-new kernel, 3.0+
should be fine).

>   (commit:ca6265d0f4d68a5eb82b5bfafb450e8e696633ac). It has 1x 320GB
>   drive for the system, 1x 64GB SSD (Crucial C300 - /dev/sda) for the
>   journal and 4x 3TB drives (Western Digital WD30EZRX). Everything but
>   the boot partition is BTRFS-formatted and 4K-aligned.
> - 1 client (P4 3.00GHz dual-core, 1GB RAM) running Debian Sid/Wheezy
>   and Ceph version 0.49 (commit:ca6265d0f4d68a5eb82b5bfafb450e8e696633ac).
>   Both servers are linked over a 1Gb Ethernet switch (iperf shows about
>   960Mb/s).
>
> Here is my ceph.conf :
> --cut-here--
> [global]
> auth supported = cephx
> keyring = /etc/ceph/keyring
> journal dio = true
> osd op threads = 24
> osd disk threads = 24
> filestore op threads = 6
> filestore queue max ops = 24
> osd client message size cap = 1400
> ms dispatch throttle bytes = 1750

Default values are quite a bit lower for most of these. You may want to
play with them and see if it has an effect.
> [mon]
> mon data = /home/mon.$id
> keyring = /etc/ceph/keyring.$name
>
> [mon.a]
> host = ceph-osd-0
> mon addr = 192.168.0.132:6789
>
> [mds]
> keyring = /etc/ceph/keyring.$name
>
> [mds.a]
> host = ceph-osd-0
>
> [osd]
> osd data = /home/osd.$id
> osd journal = /home/osd.$id.journal
> osd journal size = 1000
> keyring = /etc/ceph/keyring.$name
>
> [osd.0]
> host = ceph-osd-0
> btrfs devs = /dev/disk/by-id/scsi-SATA_WDC_WD30EZRX-00_WD-WMAWZ0152201
> btrfs options = rw,noatime

Just FYI, we are trying to get away from btrfs devs.

> --cut-here--
>
> Here are some figures :
> * Test with "dd" on the OSD server (on drive
>   /dev/disk/by-id/scsi-SATA_WDC_WD30EZRX-00_WD-WMAWZ0152201) :
> # dd if=/dev/zero of=testdd bs=4k count=4M
> 17179869184 bytes (17 GB) written, 123,746 s, 139 MB/s

Good job using a data file that is much bigger than main memory! That
looks pretty accurate for a 7200rpm spinning disk. For dd benchmarks,
you should probably throw in conv=fdatasync at the end though.

> => iostat (on the OSD server) :
> avg-cpu: %user %nice %system %iowait %steal %idle
>           0,00  0,00    0,52   41,99   0,00 57,48
>
> Device:  tps     kB_read/s kB_wrtn/s  kB_read kB_wrtn
> sdf      247,00       0,00 125520,00        0  125520
>
> * Test with unpacking and deleting OpenBSD/5.1 src.tar.gz to the OSD
>   server (on drive /dev/disk/by-id/scsi-SATA_WDC_WD30EZRX-00_WD-WMAWZ0152201) :
> # time tar xzf src.tar.gz
> real 0m9.669s
> user 0m8.405s
> sys  0m4.736s
>
> # time rm -rf *
> real 0m3.647s
> user 0m0.036s
> sys  0m3.552s
>
> => iostat (on the OSD server) :
> avg-cpu: %user %nice %system %iowait %steal %idle
>          10,83  0,00   28,72   16,62   0,00 43,83
>
> Device:  tps     kB_read/s kB_wrtn/s  kB_read kB_wrtn
> sdf      1369,00      0,00   9300,00        0    9300
>
> * Test with "dd" from the client using RBD :
> # dd if=/dev/zero of=testdd bs=4k count=4M
> 17179869184 bytes (17 GB) written, 406,941 s, 42,2 MB/s

RBD caching should definitely be enabled for a test like this. I'd be
surprised if you got 42MB/s without it though...
> => iostat (on the OSD server) :
> avg-cpu: %user %nice %system %iowait %steal %idle
>           4,57  0,00   30,46   27,66   0,00 37,31
>
> Device:  tps     kB_read/s kB_wrtn/s  kB_read kB_wrtn
> sda      317,00       0,00  57400,00        0   57400
> sdf      237,00       0,00  88336,00        0   88336
>
> * Test with unpacking and deleting OpenBSD/5.1 src.tar.gz from the
>   client using RBD :
> # time tar xzf src.tar.gz
> real 0m26.955s
> user 0m9.233s
> sys  0m11.425s
>
> # time rm -rf *
> real 0m8.545s
> user 0m0.128s
> sys  0m8.297s
>
> => iostat (on the OSD server) :
> avg-cpu: %user %nice %system %iowait %steal %idle
>           4,59  0,00   24,74   30,61   0,00 40,05
>
> Device:  tps     kB_read/s kB_wrtn/s  kB_read kB_wrtn
> sda      239,00       0,00  54772,00        0   54772
> sdf      441,00       0,00  50836,00        0   50836
>
> * Test with "dd" from the client using CephFS :
> # dd if=/dev/zero of=testdd bs=4k count=4M
> 17179869184 bytes (17 GB) written, 338,29 s, 50,8 MB/s
>
> => iostat (on the OSD server) :
> avg-cpu: %user %nice %system %iowait %steal %idle
>           2,26  0,00   20,30   27,07   0,00 50,38
>
> Device:  tps     kB_read/s kB_wrtn/s  kB_read kB_wrtn
> sda      710,00       0,00  58836,00        0   58836
> sdf      722,00       0,00  32768,00        0   32768
>
> * Test with unpacking and deleting OpenBSD/5.1 src.tar.gz from the
>   client using CephFS :
> # time tar xzf src.tar.gz
> real 3m55.260s
> user 0m8.721s
> sys  0m11.461s

Ouch, that's taking a while! In addition to the comments that David
made, be aware that you are also testing the metadata server with
CephFS. Right now that's not getting a lot of attention as we are
primarily focusing on RADOS performance at the moment. For this kind of
test though, distributed filesyste
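Mark's conv=fdatasync suggestion makes dd report a rate that includes flushing data to disk, rather than just into the page cache. A small sketch of the difference (file size shrunk to 256MB so it runs anywhere; the path is illustrative):

```shell
# Buffered write: dd can report an inflated rate because data may still
# sit in the page cache when it prints its summary line.
dd if=/dev/zero of=/tmp/testdd bs=4k count=65536 2>&1 | tail -n 1

# Fair(er) version: fdatasync the file before dd exits, so the reported
# rate includes the cost of getting data to stable storage.
dd if=/dev/zero of=/tmp/testdd bs=4k count=65536 conv=fdatasync 2>&1 | tail -n 1

rm -f /tmp/testdd
```

On a fast disk with a small file the two rates differ wildly; the bigger the file relative to RAM, the closer they get, which is why the 17GB test file above was a good choice to begin with.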
Re: Ceph performance improvement
Thank you for the answer, David.

> That looks like you're writing to a filesystem on that disk, rather
> than the block device itself -- but let's say you've got 139MB/sec
> (1112Mbit/sec) of straight-line performance. Note: this is already
> faster than your network link can go -- you can, at best, only achieve
> 120MB/sec over your gigabit link.

Yes, I am aware of that; I can't get more than the gigabit link allows.
However, I mentioned this to show that the disk should not be a
bottleneck.

> Is this a dd to the RBD device directly, or is this a write to a file
> in a filesystem created on top of it?

The RBD device is mounted and formatted with BTRFS.

> dd will write blocks synchronously -- that is, it will write one
> block, wait for the write to complete, then write the next block, and
> so on. Because of the durability guarantees provided by ceph, this
> will result in dd doing a lot of waiting around while writes are being
> sent over the network and written out on your OSD.

Thank you for that information.

> (If you're using the default replication count of 2, probably twice?
> I'm not exactly sure what Ceph does when it only has one OSD to work
> on..?)

I don't know exactly how it behaves, but "ceph -s" tells me the cluster
is degraded at 50%. Adding a second OSD allows Ceph to replicate.

> Just ignoring networking and storage for a moment, this also isn't a
> fair test: you're comparing the decompress-and-unpack time of a 139MB
> tarball on a 3GHz Pentium 4 with 1GB of RAM and a quad-core Xeon E5
> that has 8GB.

That's a very good point!
Comparing figures on the same host tells a different story (/mnt is the
Ceph RBD device) :)

root@ceph-osd-1:/home# time tar xzf ../src.tar.gz && sync
real 0m43.668s
user 0m9.649s
sys  0m20.897s

root@ceph-osd-1:/mnt# time tar xzf ../src.tar.gz && sync
real 0m38.022s
user 0m9.101s
sys  0m11.265s

Thank you again,
Denis
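For the RBD caching Mark mentioned earlier in the thread: librbd's write-back cache is switched on via ceph.conf on the client side. The section and sizes below are an illustrative sketch, not tuned recommendations, and note that these options affect librbd consumers (e.g. qemu), not the kernel RBD driver:

```
[client]
rbd cache = true
# illustrative sizes -- tune for your workload:
rbd cache size = 33554432          ; 32MB of cache
rbd cache max dirty = 25165824     ; flush when 24MB is dirty
```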
Re: Ceph performance improvement
On 22/08/12 09:54, Denis Fondras wrote:
> The only point that prevents me from using it at datacenter-scale is
> performance.
>
> Here are some figures :
> * Test with "dd" on the OSD server (on drive
>   /dev/disk/by-id/scsi-SATA_WDC_WD30EZRX-00_WD-WMAWZ0152201) :
> # dd if=/dev/zero of=testdd bs=4k count=4M
> 17179869184 bytes (17 GB) written, 123,746 s, 139 MB/s

That looks like you're writing to a filesystem on that disk, rather than
the block device itself -- but let's say you've got 139MB/sec
(1112Mbit/sec) of straight-line performance. Note: this is already
faster than your network link can go -- you can, at best, only achieve
120MB/sec over your gigabit link.

> * Test with "dd" from the client using RBD :
> # dd if=/dev/zero of=testdd bs=4k count=4M
> 17179869184 bytes (17 GB) written, 406,941 s, 42,2 MB/s

Is this a dd to the RBD device directly, or is this a write to a file in
a filesystem created on top of it?

dd will write blocks synchronously -- that is, it will write one block,
wait for the write to complete, then write the next block, and so on.
Because of the durability guarantees provided by ceph, this will result
in dd doing a lot of waiting around while writes are being sent over the
network and written out on your OSD. (If you're using the default
replication count of 2, probably twice? I'm not exactly sure what Ceph
does when it only has one OSD to work on..?)

> * Test with unpacking and deleting OpenBSD/5.1 src.tar.gz from the
>   client using RBD :
> # time tar xzf src.tar.gz
> real 0m26.955s
> user 0m9.233s
> sys  0m11.425s

Just ignoring networking and storage for a moment, this also isn't a
fair test: you're comparing the decompress-and-unpack time of a 139MB
tarball on a 3GHz Pentium 4 with 1GB of RAM and a quad-core Xeon E5 that
has 8GB.
Even ignoring the relative CPU difference: unless you're doing something
clever that you haven't described, there's no guarantee that the files
in the latter case have actually been written to disk -- you have enough
memory on your server for it to buffer all of those writes in RAM. You'd
need to add a sync() call or similar at the end of your timing run to
ensure that all of those writes have actually been committed to disk.

> * Test with "dd" from the client using CephFS :
> # dd if=/dev/zero of=testdd bs=4k count=4M
> 17179869184 bytes (17 GB) written, 338,29 s, 50,8 MB/s

Again, the synchronous nature of 'dd' is probably severely affecting
apparent performance. I'd suggest looking at some other tools, like fio,
bonnie++, or iozone, which might generate more representative load. (Or,
if you have a specific use-case in mind, something that generates an IO
pattern like what you'll be using in production would be ideal!)

Cheers,
David

--
David McBride
Unix Specialist, University Computing Service
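As a concrete starting point for the fio suggestion, something like the following exercises the same streaming 4k-write pattern as the dd tests in this thread, but with an explicit flush at the end of the run. The job name, size, and target directory are placeholders to adapt:

```shell
# Sequential 4k write test with a final fsync, roughly mirroring the dd
# runs above (shrink --size for a quick smoke test).
fio --name=seqwrite \
    --rw=write --bs=4k --size=1G \
    --directory=/mnt \
    --end_fsync=1
```

Beyond matching dd, fio can also generate random and mixed read/write patterns (--rw=randwrite, --rw=randrw), which is where it starts to resemble a real production load rather than a straight-line copy.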
Re: [PATCH] libceph: Fix sparse warning
On Tue, Aug 14, 2012 at 4:27 PM, Iulius Curt wrote:
> From: Iulius Curt
>
> Make ceph_monc_do_poolop() static to remove the following sparse warning:
> * net/ceph/mon_client.c:616:5: warning: symbol 'ceph_monc_do_poolop' was not
>   declared. Should it be static?
>
> Signed-off-by: Iulius Curt
> ---
>  net/ceph/mon_client.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/ceph/mon_client.c b/net/ceph/mon_client.c
> index 105d533..3875c60 100644
> --- a/net/ceph/mon_client.c
> +++ b/net/ceph/mon_client.c
> @@ -613,7 +613,7 @@ bad:
>  /*
>   * Do a synchronous pool op.
>   */
> -int ceph_monc_do_poolop(struct ceph_mon_client *monc, u32 op,
> +static int ceph_monc_do_poolop(struct ceph_mon_client *monc, u32 op,
>                         u32 pool, u64 snapid,
>                         char *buf, int len)
>  {
> --
> 1.7.9.5

Hi Sage,

Can you have a look at this? :)

thanks,
Daniel.
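For anyone unfamiliar with where warnings like this come from: the kernel build integrates sparse through the C= make variable. The commands below assume a configured kernel source tree with sparse installed:

```shell
# Run sparse on files that are about to be (re)compiled:
make C=1 net/ceph/mon_client.o
# Force a sparse re-check even if the object file is already up to date:
make C=2 net/ceph/mon_client.o
```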
Ceph performance improvement
Hello all,

I'm currently testing Ceph. So far it seems that HA and recovery are
very good. The only point that prevents me from using it at
datacenter-scale is performance.

First of all, here is my setup :
- 1 OSD/MDS/MON on a Supermicro X9DR3-F/X9DR3-F (1x Intel Xeon E5-2603 -
  4 cores and 8GB RAM) running Debian Sid/Wheezy and Ceph version 0.49
  (commit:ca6265d0f4d68a5eb82b5bfafb450e8e696633ac). It has 1x 320GB
  drive for the system, 1x 64GB SSD (Crucial C300 - /dev/sda) for the
  journal and 4x 3TB drives (Western Digital WD30EZRX). Everything but
  the boot partition is BTRFS-formatted and 4K-aligned.
- 1 client (P4 3.00GHz dual-core, 1GB RAM) running Debian Sid/Wheezy and
  Ceph version 0.49 (commit:ca6265d0f4d68a5eb82b5bfafb450e8e696633ac).
  Both servers are linked over a 1Gb Ethernet switch (iperf shows about
  960Mb/s).

Here is my ceph.conf :
--cut-here--
[global]
auth supported = cephx
keyring = /etc/ceph/keyring
journal dio = true
osd op threads = 24
osd disk threads = 24
filestore op threads = 6
filestore queue max ops = 24
osd client message size cap = 1400
ms dispatch throttle bytes = 1750

[mon]
mon data = /home/mon.$id
keyring = /etc/ceph/keyring.$name

[mon.a]
host = ceph-osd-0
mon addr = 192.168.0.132:6789

[mds]
keyring = /etc/ceph/keyring.$name

[mds.a]
host = ceph-osd-0

[osd]
osd data = /home/osd.$id
osd journal = /home/osd.$id.journal
osd journal size = 1000
keyring = /etc/ceph/keyring.$name

[osd.0]
host = ceph-osd-0
btrfs devs = /dev/disk/by-id/scsi-SATA_WDC_WD30EZRX-00_WD-WMAWZ0152201
btrfs options = rw,noatime
--cut-here--

Here are some figures :
* Test with "dd" on the OSD server (on drive
  /dev/disk/by-id/scsi-SATA_WDC_WD30EZRX-00_WD-WMAWZ0152201) :
# dd if=/dev/zero of=testdd bs=4k count=4M
17179869184 bytes (17 GB) written, 123,746 s, 139 MB/s

=> iostat (on the OSD server) :
avg-cpu: %user %nice %system %iowait %steal %idle
          0,00  0,00    0,52   41,99   0,00 57,48

Device:  tps     kB_read/s kB_wrtn/s  kB_read kB_wrtn
sdf      247,00       0,00 125520,00        0  125520

* Test with unpacking and deleting OpenBSD/5.1 src.tar.gz to the OSD
  server (on drive /dev/disk/by-id/scsi-SATA_WDC_WD30EZRX-00_WD-WMAWZ0152201) :
# time tar xzf src.tar.gz
real 0m9.669s
user 0m8.405s
sys  0m4.736s

# time rm -rf *
real 0m3.647s
user 0m0.036s
sys  0m3.552s

=> iostat (on the OSD server) :
avg-cpu: %user %nice %system %iowait %steal %idle
         10,83  0,00   28,72   16,62   0,00 43,83

Device:  tps     kB_read/s kB_wrtn/s  kB_read kB_wrtn
sdf      1369,00      0,00   9300,00        0    9300

* Test with "dd" from the client using RBD :
# dd if=/dev/zero of=testdd bs=4k count=4M
17179869184 bytes (17 GB) written, 406,941 s, 42,2 MB/s

=> iostat (on the OSD server) :
avg-cpu: %user %nice %system %iowait %steal %idle
          4,57  0,00   30,46   27,66   0,00 37,31

Device:  tps     kB_read/s kB_wrtn/s  kB_read kB_wrtn
sda      317,00       0,00  57400,00        0   57400
sdf      237,00       0,00  88336,00        0   88336

* Test with unpacking and deleting OpenBSD/5.1 src.tar.gz from the
  client using RBD :
# time tar xzf src.tar.gz
real 0m26.955s
user 0m9.233s
sys  0m11.425s

# time rm -rf *
real 0m8.545s
user 0m0.128s
sys  0m8.297s

=> iostat (on the OSD server) :
avg-cpu: %user %nice %system %iowait %steal %idle
          4,59  0,00   24,74   30,61   0,00 40,05

Device:  tps     kB_read/s kB_wrtn/s  kB_read kB_wrtn
sda      239,00       0,00  54772,00        0   54772
sdf      441,00       0,00  50836,00        0   50836

* Test with "dd" from the client using CephFS :
# dd if=/dev/zero of=testdd bs=4k count=4M
17179869184 bytes (17 GB) written, 338,29 s, 50,8 MB/s

=> iostat (on the OSD server) :
avg-cpu: %user %nice %system %iowait %steal %idle
          2,26  0,00   20,30   27,07   0,00 50,38

Device:  tps     kB_read/s kB_wrtn/s  kB_read kB_wrtn
sda      710,00       0,00  58836,00        0   58836
sdf      722,00       0,00  32768,00        0   32768

* Test with unpacking and deleting OpenBSD/5.1 src.tar.gz from the
  client using CephFS :
# time tar xzf src.tar.gz
real 3m55.260s
user 0m8.721s
sys  0m11.461s

# time rm -rf *
real 9m2.319s
user 0m0.320s
sys  0m4.572s

=> iostat (on the OSD server) :
avg-cpu: %user %nice %system %iowait %steal %idle
         14,40  0,00   15,94    2,31   0,00 67,35

Device:  tps     kB_read/s kB_wrtn/s  kB_read kB_wrtn
sda      174,00       0,00  10772,00