On Wed, Mar 26, 2014 at 2:04 AM, Gregory Farnum <[email protected]> wrote:
> On Thu, Mar 20, 2014 at 3:49 AM, Andreas Joachim Peters
> <[email protected]> wrote:
>> Hi,
>>
>> I did some Firefly ceph-0.77-900.gce9bfb8 testing of EC/tiering, deploying
>> 64 OSDs on in-memory filesystems (RapidDisk with ext4) on a single 256 GB
>> box. The raw write performance of this box is ~3 GB/s aggregate and
>> ~450 MB/s per OSD. It provides 250k IOPS per OSD.
>>
>> I compared several algorithms and configurations ...
>>
>> Here are the results (there is no significant difference between 64 and 10
>> OSDs performance-wise; I tried both, but not for 24+8) with 4M objects and
>> 32 client threads:
>>
>> 1 rep:          1.1 GB/s
>> 2 rep:          886 MB/s
>> 3 rep:          750 MB/s
>> cauchy 4+2:     880 MB/s
>> liber8tion 4+2: 875 MB/s
>> cauchy 6+3:     780 MB/s
>> cauchy 16+8:    520 MB/s
>> cauchy 24+8:    450 MB/s
>>
>> Then I added a single-replica cache pool in front of cauchy 4+2.
>>
>> The write performance is now 1.1 GB/s, as expected, while the cache is not
>> full. If I shrink the cache pool in front, forcing continuous eviction
>> during the benchmark, it degrades to a stable 140 MB/s.
>>
>> The single-threaded client drops from 260 MB/s to 165 MB/s.
>>
>> What is strange to me is that after a "rados bench" there are objects left
>> in the cache and in the back-end tier. They only disappear if I set the
>> cache mode to "forward" and force the eviction. Is it the desired
>> behaviour by design to not apply the deletion?
>
> That's not too surprising -- you probably put enough data into the
> cluster that some of the bench objects got evicted into the cold
> storage pool, and then they were deleted by rados bench. The cache
> pool needs to keep the object around with "deleted" and "dirty" flags
> to make sure it eventually gets cleaned up from the backing cold pool
> -- as happened when you set it to forward and forced an eviction.
>
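
(For reference, the flush/eviction described above can also be forced by
hand. A minimal sketch -- the cache-pool name "hot-pool" is only a
placeholder:

  # stop caching new writes, then push everything to the cold tier
  ceph osd tier cache-mode hot-pool forward
  rados -p hot-pool cache-flush-evict-all

Once the dirty/deleted entries have been flushed and evicted this way, the
deletions are applied in the backing EC pool as well.)
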
>>
>> Some observations:
>>
>> - I think it is important to document the alignment requirements for
>> appends (e.g. if you do a rados put it needs aligned appends, and the 4M
>> blocks are not aligned for every combination of (k,m)).
>>
>> - Another observation is that it seems difficult to run 64 OSDs on one
>> box. I have no obvious memory limitation, but it requires ~30k threads,
>> and it was difficult to create several pools with many PGs without OSDs
>> core dumping because resources were not available.
>>
>> - When OSDs get 100% full they core dump most of the time. In my case all
>> OSDs became full at the same time, and when this happened there was no way
>> to get the cluster up again without manually deleting objects in the OSD
>> directories to make some space.
>>
>> - I get a syntax error from the Ceph CentOS (RHEL6) startup script:
>>
>> awk: { d=$2/1073741824 ; r = sprintf(\"%.2f\", d); print r }
>> awk: ^ backslash not last character on line
>>
>> - I have run several times into a situation where the only way out was to
>> delete the whole cluster and set it up from scratch.
>>
>> - I got this reproducible stack trace with an EC pool and a front-end
>> tier:
>>
>> osd/ReplicatedPG.cc: 5554: FAILED assert(cop->data.length() +
>> cop->temp_cursor.data_offset == cop->cursor.data_offset)
>>
>> ceph version 0.77-900-gce9bfb8 (ce9bfb879c32690d030db6b2a349b7b6f6e6a468)
>> 1: (ReplicatedPG::_write_copy_chunk(boost::shared_ptr<ReplicatedPG::CopyOp>,
>> PGBackend::PGTransaction*)+0x7dd) [0x8a376d]
>> 2: (ReplicatedPG::_build_finish_copy_transaction(boost::shared_ptr<ReplicatedPG::CopyOp>,
>> PGBackend::PGTransaction*)+0x114) [0x8a3954]
>> 3: (ReplicatedPG::process_copy_chunk(hobject_t, unsigned long, int)+0x507)
>> [0x8f1097]
>> 4: (C_Copyfrom::finish(int)+0xb7) [0x93fa67]
>> 5: (Context::complete(int)+0x9) [0x65d4b9]
>> 6: (Finisher::finisher_thread_entry()+0x1d8) [0xa9a528]
>> 7: /lib64/libpthread.so.0() [0x3386a079d1]
>> 8: (clone()+0x6d) [0x33866e8b6d]
>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
>> to interpret this.
>
> Hmm, we've had a lot of bug fixes going in lately (and I know some
> were around that copy infrastructure), so I bet that's fixed now.
>
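
(Regarding the append-alignment observation above: appends to an EC pool
need to be aligned to the pool's stripe width, which depends on k, so a
fixed 4M write size only lines up for some (k,m) combinations. A rough way
to look the value up from the CLI -- "myprofile" and "mypool" are only
placeholders:

  ceph osd erasure-code-profile get myprofile   # shows k and m for the profile
  ceph osd dump | grep mypool                   # the pool line should list its stripe_width

Whatever stripe_width is reported there is the unit that append sizes have
to be a multiple of.)
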
>>
>> Moreover I did some trivial testing of the metadata part of CephFS and
>> ceph-fuse:
>>
>> - I created a directory hierarchy of 10/1000/100 = ~1 million directories.
>> After creation the MDS uses 5.5 GB of memory and ceph-fuse 1.8 GB. It
>> takes 33 minutes to do "find /ceph" on this hierarchy. If I restart the
>> MDS and do the same it takes 18 minutes. After this operation the MDS uses
>> ~10 GB of memory (10k per directory for one entry).
>
> Hmm. That's more than I would expect, but not impossibly so if the MDS
> was having trouble keeping the relevant directories in memory. We have
> not done any optimization around that sort of scenario right now, and
> it's a pretty hard workload for a distributed storage system. :/
>
>>
>> If I do "ls -laRt /ceph" I get "no such file or directory" after some
>> time. When this happens one can pick one of the directories and do a
>> single "ls -la <dir>". The first time one gets "no such file or directory"
>> again; the second time it eventually works and shows the contents.

It's a symptom of the dir-complete bug (it exists in kernels < 3.12).

Yan, Zheng

> Can you expand on that a bit? What is "after some time"?
>
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
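
(For anyone who wants to reproduce the metadata test above: the exact layout
is not spelled out, but a 10/1000/100 tree of roughly a million directories
can be recreated along these lines, assuming a CephFS mount at /ceph; the
directory names are placeholders and only the 10 x 1000 x 100 shape matters.)

  for a in $(seq 1 10); do
    for b in $(seq 1 1000); do
      mkdir -p /ceph/d$a/d$b
      for c in $(seq 1 100); do
        mkdir /ceph/d$a/d$b/d$c
      done
    done
  done
  find /ceph          # the tree walk that took 33 / 18 minutes above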
